Connectors for External Text-to-Speech Services
Using a simple connector system, SPE can be easily connected with external Text-to-Speech (TTS) services. This article describes the principles and how-tos. By following these instructions, you can create your own connector, allowing you to use a custom third-party TTS service via SPE.
The TTS connector should be a command-line (CLI) application or script that communicates with the external TTS service via the service's native API and with SPE via standard input (stdin) and output (stdout).
The connector behavior should be as follows:
- If the connector is started with the --info parameter, it outputs TTS service capabilities information in JSON format to stdout.
- If the connector is started without a parameter, it:
  - Reads input JSON data from stdin.
  - Outputs raw PCM signed 16-bit little-endian mono audio data to stdout:
    - SPE 3.46+: With a sampling frequency according to the naturalSampleRateHertz value returned in capabilities.
    - SPE up to 3.45: With a fixed sampling frequency of 8000 Hz.
- (Optional) If started with the --help or -h parameter, the connector outputs basic usage information to stdout.
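The behavior listed above can be sketched as a small Python script. This is a minimal sketch rather than a reference implementation: get_capabilities and synthesize are hypothetical placeholders (a real connector would call the TTS service's native API there), and the vendor, voice, and sample-rate values are made up for illustration. The optional --help/-h handling comes from argparse itself.

```python
import argparse
import json


def get_capabilities():
    """Hypothetical placeholder: a real connector would query the TTS service."""
    return {
        "apiVersion": 2,
        "vendor": "example-vendor",  # illustrative value, not a real provider
        "author": "example-author",
        "version": "0.1",
        "voices": [
            {
                "name": "demo",
                "languageCodes": ["en-US"],
                "naturalSampleRateHertz": 8000,
            }
        ],
    }


def synthesize(request):
    """Hypothetical placeholder: returns 100 ms of silence at 8000 Hz.

    A real connector would send request["text"] and request["voice"] to the
    TTS service and return the synthesized 16-bit little-endian mono PCM.
    """
    sample_rate = 8000
    return b"\x00\x00" * (sample_rate // 10)


def main(argv, stdin, stdout):
    """Dispatch on the command line; stdin and stdout are binary streams."""
    parser = argparse.ArgumentParser(
        prog="connector",
        description="Example TTS connector (sketch).",
    )
    parser.add_argument(
        "--info",
        action="store_true",
        help="print TTS service capabilities as JSON and exit",
    )
    args = parser.parse_args(argv)  # --help / -h is handled by argparse

    if args.info:
        stdout.write(json.dumps(get_capabilities()).encode("utf-8"))
        stdout.flush()
        return 0

    # No parameter: read the JSON request from stdin, write raw PCM to stdout.
    request = json.loads(stdin.read())
    stdout.write(synthesize(request))
    stdout.flush()
    return 0
```

Saved as an executable script, this would end with a call such as sys.exit(main(sys.argv[1:], sys.stdin.buffer, sys.stdout.buffer)). The binary buffers are used so that the PCM data is not altered by text encoding or newline translation.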
Details of the connector behavior are listed below.
TTS service capabilities information
Launching the connector with the --info parameter is expected to provide information about the actual TTS service capabilities: a list of voice names, supported languages, and audio quality (sampling frequencies). This information is used:
- During the SPE startup sequence: TTS connectors enabled in the SPE configuration file are started with the --info parameter, and SPE reads the connector output. Connectors that fail to provide the information won't be available for use with SPE.
- When the /external/technologies/tts/info endpoint is called: All successfully initialized TTS connectors (see above) are asked to provide the capabilities information. This is intended to refresh the information from the TTS service.
The capabilities information (voice names, language codes, sampling frequencies) should be obtained from the actual TTS service. Returning hardcoded information instead of propagating the real capabilities of the TTS service is not a good idea, as it could become outdated over time, leading to obscure issues in the application that relies on this information.
Required capabilities information JSON structure:
Version 2 (SPE 3.46 and newer):
{
  "apiVersion": 2,
  "vendor": string,
  "author": string,
  "version": string,
  "voices": [
    {
      "name": string,
      "languageCodes": [string, string, ...],
      "naturalSampleRateHertz": number
    },
    ...
  ]
}
Legacy (up to SPE 3.45):
{
  "vendor": string,
  "author": string,
  "version": string,
  "voices": [
    {
      "name": string,
      "languageCodes": [string, string, ...]
    },
    ...
  ]
}
Where:
- apiVersion denotes the version of the capabilities structure/API:
  - 2: SPE 3.46 and newer.
  - The apiVersion property is not present at all for SPE 3.45 and older.
- vendor is the name of the TTS provider. This name is then used in the POST /external/technologies/tts parameter.
- author and version are intended for internal connector author description and versioning.
- voices array should list available TTS voices. Each entry contains:
  - Voice name.
  - A list of languageCodes supported by that voice.
  - SPE 3.46 and newer only: naturalSampleRateHertz, which provides the default natural sampling rate of the audio.
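As an illustration, both capabilities structures could be produced from a single voice listing. The sketch below assumes the TTS service returns its voices as dictionaries with the keys id, langs, and rate. These key names and the build_capabilities helper are hypothetical and are not part of SPE or of any particular TTS API.

```python
import json


def build_capabilities(vendor, author, version, service_voices, api_version=2):
    """Map a TTS service voice listing to the capabilities structure.

    service_voices: iterable of dicts with the hypothetical keys
        'id' (voice name), 'langs' (language codes), 'rate' (sample rate, Hz).
    api_version: 2 for SPE 3.46 and newer, None for the legacy structure.
    """
    voices = []
    for service_voice in service_voices:
        voice = {
            "name": service_voice["id"],
            "languageCodes": list(service_voice["langs"]),
        }
        if api_version == 2:
            # Only version 2 carries the natural sampling rate.
            voice["naturalSampleRateHertz"] = int(service_voice["rate"])
        voices.append(voice)

    info = {
        "vendor": vendor,
        "author": author,
        "version": version,
        "voices": voices,
    }
    if api_version == 2:
        info = {"apiVersion": 2, **info}
    return info


if __name__ == "__main__":
    # Made-up voice listing standing in for a real service response.
    listing = [{"id": "alice", "langs": ["en-US", "en-GB"], "rate": 22050}]
    print(json.dumps(build_capabilities("acme", "me", "1.0", listing), indent=2))
```

Building the structure from the live voice listing on every --info call keeps the reported capabilities in step with the TTS service, which is the reason for not hardcoding them.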
Connector input
The input JSON that should be accepted by the TTS connector from stdin is as follows:
{
  "text": string,
  "voice": {
    "name": string,
    "languageCode": string
  }
}
Where:
- text is the text to be synthesized.
- name is the voice name to be used for synthesis (see the voice names provided in the connector "info" data).
- languageCode is a language code defining the language to be used for synthesis (see the connector "info" data).
The connector is responsible for passing the input data to the actual TTS service as needed using the service's native API, retrieving the synthesized audio data from the TTS service, and writing the audio to stdout (see the Connector output section below).
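Reading and checking the input could look like the following sketch. The parse_request function and the RequestError class are illustrative names. How a connector should report invalid input back to SPE is not covered here, so the sketch only raises an exception.

```python
import json


class RequestError(ValueError):
    """Raised when the input does not match the expected JSON structure."""


def parse_request(raw):
    """Parse the connector input (bytes or str) read from stdin.

    Returns a (text, voice_name, language_code) tuple.
    """
    try:
        data = json.loads(raw)
    except ValueError as exc:  # covers invalid JSON and invalid encoding
        raise RequestError(f"input is not valid JSON: {exc}") from exc

    if not isinstance(data, dict):
        raise RequestError("input must be a JSON object")

    text = data.get("text")
    if not isinstance(text, str):
        raise RequestError("'text' must be a string")

    voice = data.get("voice")
    if not isinstance(voice, dict):
        raise RequestError("'voice' must be an object")

    name = voice.get("name")
    language_code = voice.get("languageCode")
    if not isinstance(name, str):
        raise RequestError("'voice.name' must be a string")
    if not isinstance(language_code, str):
        raise RequestError("'voice.languageCode' must be a string")

    return text, name, language_code
```

In a connector, this would typically be called as parse_request(sys.stdin.buffer.read()), and the returned values would then be passed on to the TTS service.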
The connector can also be used for playing 'static' messages from audio files. For example, the text property can be used to pass the name of the file to be played, and the audio files can be organized in directories whose names are passed to the connector using the voice name property.
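A static-message connector along these lines might resolve the file as sketched below. The sketch makes several assumptions that go beyond the text: the messages are stored as 16-bit mono WAV files, the layout is base_dir/voice name/file name, and load_static_message is a hypothetical helper. The path check is there because the file name arrives in the request and should not be able to point outside the message directory.

```python
import wave
from pathlib import Path


def load_static_message(base_dir, voice_name, file_name):
    """Return (sample_rate, pcm_bytes) for base_dir/voice_name/file_name.

    Assumes 16-bit mono WAV files, whose sample data is already
    signed 16-bit little-endian PCM.
    """
    base = Path(base_dir).resolve()
    path = (base / voice_name / file_name).resolve()

    # Reject names such as "../secret.wav" that escape the message directory.
    if base not in path.parents:
        raise ValueError("requested file is outside the message directory")

    with wave.open(str(path), "rb") as wav:
        if wav.getnchannels() != 1 or wav.getsampwidth() != 2:
            raise ValueError("expected a 16-bit mono WAV file")
        return wav.getframerate(), wav.readframes(wav.getnframes())
```

The returned sample rate should match what the connector reports for that voice (or 8000 Hz for SPE 3.45 and older). Otherwise the audio would need resampling before being written to stdout.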
Connector output
The output obtained from the TTS service should be written by the connector to stdout as raw PCM signed 16-bit little-endian mono audio data.
In SPE 3.46 and newer, the audio sampling frequency must be set to the naturalSampleRateHertz value provided in the TTS service capabilities information. In SPE 3.45 and older, the audio sampling frequency must be fixed to 8000 Hz.
SPE then reads the audio and writes it either to a file or to a real-time output stream, according to the original request. See the Text to Speech section of the REST API documentation for more details.
SPE reads the connector output continuously, meaning the connector can stream the audio data to stdout as soon as it's received from the TTS service (if the service supports streaming of the synthesized audio). This can reduce unwanted delays, especially in the case of longer texts that take more time to synthesize.
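Streaming the output can be as simple as forwarding each chunk as it arrives. The sketch below assumes the TTS service client yields audio as an iterable of byte chunks that are already 16-bit little-endian mono PCM. For services that deliver floating-point samples instead, a conversion helper is included. Both function names are illustrative.

```python
import struct


def floats_to_pcm16le(samples):
    """Convert floats in [-1.0, 1.0] to signed 16-bit little-endian PCM."""
    ints = [max(-32768, min(32767, int(round(s * 32767)))) for s in samples]
    return struct.pack("<%dh" % len(ints), *ints)


def stream_pcm(chunks, out):
    """Write each PCM chunk to the binary stream 'out' as soon as it arrives.

    Flushing after every chunk avoids holding audio back in the output
    buffer. Returns the total number of bytes written.
    """
    total = 0
    for chunk in chunks:
        if chunk:
            out.write(chunk)
            out.flush()
            total += len(chunk)
    return total
```

In a connector, out would be sys.stdout.buffer, and chunks would come from the streaming call of the TTS service's API.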
Connector naming, location, configuration
TTS connectors should be placed in the {SPE_installation_directory}/external/technologies/tts directory, with each connector in a separate subdirectory.
To enable a connector, include its subdirectory name in the external.technologies.tts_connectors setting in the SPE configuration file.
The connector executable file must be named connector (i.e., without a file extension). The connector configuration, such as the TTS service address, access credentials, API token, etc., should ideally be kept in a separate configuration file, preferably named connector.properties, using a .properties-like format (to be consistent with the SPE configuration file format).
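Reading such a file takes only a few lines. The sketch below is a simplified parser that assumes one key = value pair per line and # for comments. The exact syntax of the SPE configuration file is not described here, and the key names in the usage example (server.address, server.port, api.token) are hypothetical.

```python
def parse_properties(text):
    """Parse simple 'key = value' lines into a dict of strings.

    Blank lines and lines starting with '#' are ignored. Only the first
    '=' separates key from value, so values may contain '=' themselves.
    """
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, separator, value = line.partition("=")
        if not separator:
            continue  # not a key=value line; skipped in this sketch
        props[key.strip()] = value.strip()
    return props


if __name__ == "__main__":
    # Hypothetical content of a connector.properties file.
    sample = "# TTS service\nserver.address = tts.example.com\nserver.port = 8080\n"
    print(parse_properties(sample))
```

A connector would usually load the file from its own directory, for example with Path(__file__).with_name("connector.properties").read_text(), so that it works regardless of the working directory it is started from.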
If everything is set and configured properly, SPE should log a successful TTS connector initialization:
TTSSubsystem: Retrieving external connector info from ......./external/technologies/tts/acapela
TTSSubsystem: External connector 'acapela' from ......./external/technologies/tts/acapela has been registered.
If an error occurs, SPE logs the problem:
TTSSubsystem: Retrieving external connector info from ......./external/technologies/tts/acapela
TTSSubsystem: Cannot retrieve external connector info! ERROR: Loading configuration from "......./external/technologies/tts/acapela/connector.properties";Error: acapela server is not running or address and ports are misconfigured;
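Problems of this kind can often be caught before SPE is started by running the connector with --info by hand and checking its output. The following developer-side sketch does that from Python. It is not how SPE itself performs the check, and check_connector_info is a hypothetical helper that only verifies the fields described in the capabilities structure.

```python
import json
import subprocess


def check_connector_info(command, timeout=10):
    """Run 'command --info' and return a list of problems (empty if none).

    command: list of program arguments, e.g. ["./connector"].
    """
    try:
        proc = subprocess.run(
            command + ["--info"], capture_output=True, timeout=timeout
        )
    except (OSError, subprocess.TimeoutExpired) as exc:
        return [f"connector could not be run: {exc}"]

    try:
        info = json.loads(proc.stdout)
    except ValueError as exc:
        return [f"output is not valid JSON: {exc}"]
    if not isinstance(info, dict):
        return ["output must be a JSON object"]

    problems = []
    for key in ("vendor", "author", "version"):
        if not isinstance(info.get(key), str):
            problems.append(f"'{key}' must be a string")

    voices = info.get("voices")
    if not isinstance(voices, list):
        problems.append("'voices' must be an array")
        return problems

    needs_rate = info.get("apiVersion") == 2
    for index, voice in enumerate(voices):
        if not isinstance(voice, dict):
            problems.append(f"voices[{index}] must be an object")
            continue
        if not isinstance(voice.get("name"), str):
            problems.append(f"voices[{index}].name must be a string")
        if not isinstance(voice.get("languageCodes"), list):
            problems.append(f"voices[{index}].languageCodes must be an array")
        if needs_rate and not isinstance(
            voice.get("naturalSampleRateHertz"), (int, float)
        ):
            problems.append(
                f"voices[{index}].naturalSampleRateHertz must be a number"
            )
    return problems
```

For example, check_connector_info(["./connector"]) run from the connector's subdirectory returns an empty list when the output matches the expected structure.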