Connectors for External Text-to-Speech Services

Using a simple connector system, SPE can be easily connected with external Text-to-Speech (TTS) services. This article describes the principles and how-tos. By following these instructions, you can create your own connector, allowing you to use a custom third-party TTS service via SPE.

The TTS connector should be a command-line (CLI) application or script that communicates with the external TTS service via the service's native API and with SPE via standard input (stdin) and output (stdout).

The connector behavior should be as follows:

- If the connector is started with the --info parameter, it outputs TTS service capabilities information in JSON format to stdout.
- If the connector is started without a parameter, it:
  - Reads input JSON data from stdin.
  - Outputs raw PCM signed 16-bit little-endian mono audio data to stdout:
    - SPE 3.46+: With a sampling frequency according to the naturalSampleRateHertz value returned in capabilities.
    - SPE up to 3.45: With a fixed sampling frequency of 8000 Hz.
- (Optional) If started with the --help or -h parameter, the connector outputs basic usage information to stdout.
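
The mode selection described above can be sketched in Python as follows. This is a minimal sketch, not a reference implementation: the function name dispatch, the three handler callables, and the exit code 2 for an unrecognized parameter are assumptions of this example, since the behavior for unknown parameters is not specified in this article.

```python
import sys


def dispatch(argv, on_info, on_synthesize, on_help):
    """Select the connector mode from the arguments (argv excludes the program name).

    The three handlers are hypothetical callables supplied by the connector author.
    """
    if not argv:
        # No parameter: read the JSON request from stdin, write audio to stdout.
        return on_synthesize()
    if argv[0] == "--info":
        return on_info()
    if argv[0] in ("--help", "-h"):
        return on_help()
    # Assumption: unknown parameters are reported on stderr so that stdout
    # carries only JSON or audio data.
    print(f"unknown parameter: {argv[0]}", file=sys.stderr)
    return 2
```

A connector script built this way would typically end with a call such as sys.exit(dispatch(sys.argv[1:], ...)), passing its own handlers.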

Details of the connector behavior are listed below.

TTS service capabilities information

Launching the connector with the --info parameter is expected to provide information about the actual TTS service capabilities: a list of voice names, supported languages, and audio quality (sampling frequencies). This information is used:

- During the SPE startup sequence: TTS connectors enabled in the SPE configuration file are started with the --info parameter, and SPE reads the connector output. Connectors that fail to provide the information won't be available for use with SPE.
- When the /external/technologies/tts/info endpoint is called: All successfully initialized TTS connectors (see above) are asked to provide the capabilities information. This is intended to refresh the information from the TTS service.

note

The capabilities information (voice names, language codes, sampling frequencies) should be obtained from the actual TTS service. Returning hardcoded information instead of propagating the real capabilities of the TTS service is discouraged: it can become outdated over time and lead to obscure issues in applications that rely on it.

Required capabilities information JSON structure:

Version 2 (SPE 3.46 and newer)

{
    "apiVersion": 2,
    "vendor": string,
    "author": string,
    "version": string,
    "voices": [
        {
            "name": string,
            "languageCodes": [string, string, ...],
            "naturalSampleRateHertz": number
        },
        ...
    ]
}

Legacy (SPE 3.45 and older)

{
    "vendor": string,
    "author": string,
    "version": string,
    "voices": [
        {
            "name": string,
            "languageCodes": [string, string, ...]
        },
        ...
    ]
}

Where:

apiVersion denotes the version of the capabilities structure/API:
- 2: SPE 3.46 and newer.
- The apiVersion property is not present at all for SPE 3.45 and older.
vendor is the name of the TTS provider. This name is then passed as a parameter to the POST /external/technologies/tts endpoint.
author and version are intended for internal connector author description and versioning.
voices is an array listing the available TTS voices. Each entry contains:
- name: the voice name.
- languageCodes: a list of language codes supported by that voice.
- naturalSampleRateHertz (SPE 3.46 and newer only): the default natural sampling rate of the audio produced by that voice.
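
As an illustration of the Version 2 structure, the following Python sketch assembles the capabilities JSON from a list of voices. The helper names build_capabilities and capabilities_json are hypothetical, as are all sample values shown in the usage note. In a real connector, the voice list would come from the TTS service itself rather than from literals.

```python
import json


def build_capabilities(vendor, author, version, voices):
    """Build the Version 2 capabilities structure.

    voices is an iterable of (name, language_codes, natural_sample_rate_hz)
    tuples, ideally obtained from the TTS service at run time.
    """
    return {
        "apiVersion": 2,
        "vendor": vendor,
        "author": author,
        "version": version,
        "voices": [
            {
                "name": name,
                "languageCodes": list(codes),
                "naturalSampleRateHertz": int(rate),
            }
            for name, codes, rate in voices
        ],
    }


def capabilities_json(vendor, author, version, voices):
    """Serialize the capabilities structure to a JSON string for stdout."""
    return json.dumps(build_capabilities(vendor, author, version, voices))
```

For example, capabilities_json("example-tts", "Jane Doe", "1.0", [("Alice", ["en-US"], 22050)]) yields a document with one voice. All of these values are made up for illustration.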

Connector input

The input JSON that should be accepted by the TTS connector from stdin is as follows:

{
    "text": string,
    "voice": {
        "name": string,
        "languageCode": string
    }
}

Where:

text is the text to be synthesized.
name is the voice name to be used for synthesis (one of the voice names provided in the connector's --info data).
languageCode is the language code defining the language to be used for synthesis (one of the language codes provided in the connector's --info data).

The connector is responsible for passing the input data to the actual TTS service through the service's native API, retrieving the synthesized audio data from the service, and writing the audio to stdout (see the Connector output section below).
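
Reading the request could look like the following Python sketch. The function name parse_request and the decision to raise ValueError on malformed input are assumptions of this example. This article does not state which text encoding SPE uses on stdin, so the UTF-8 decoding shown in the usage note is also an assumption.

```python
import json


def parse_request(raw):
    """Parse the connector input JSON; return (text, voice_name, language_code).

    Raises ValueError for invalid JSON or for a document that lacks the
    expected properties (json.JSONDecodeError is a subclass of ValueError).
    """
    data = json.loads(raw)
    try:
        voice = data["voice"]
        return data["text"], voice["name"], voice["languageCode"]
    except (KeyError, TypeError) as exc:
        raise ValueError(f"malformed connector input: {exc!r}") from exc
```

In a connector, the raw input would typically be obtained with sys.stdin.buffer.read().decode("utf-8") before being passed to parse_request.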

tip

The connector can also be used for playing 'static' messages from audio files. For example, the text property can be used to pass the file name to be played, and the audio files can be organized in directories whose names are passed to the connector using the voice name property.
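
One way to map such a request to a file is sketched below in Python. The function name resolve_message_path and the directory layout (base directory, then voice name, then file name) are assumptions of this example. The check against path traversal is a precaution added here, not something required by SPE.

```python
from pathlib import Path


def resolve_message_path(base_dir, voice_name, file_name):
    """Map (voice name, text) to a file below base_dir.

    Raises ValueError if the resulting path would leave base_dir, for
    example when the request contains ".." components or an absolute path.
    """
    base = Path(base_dir).resolve()
    candidate = (base / voice_name / file_name).resolve()
    if base not in candidate.parents:
        raise ValueError("requested file is outside the message directory")
    return candidate
```

The connector would then read the resolved file and write its audio data to stdout in the raw PCM format expected by SPE.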

Connector output

The output obtained from the TTS service should be written by the connector to stdout as raw PCM signed 16-bit little-endian mono audio data.

In SPE 3.46 and newer, the audio sampling frequency must be set to the naturalSampleRateHertz value provided in the TTS service capabilities information. In SPE 3.45 and older, the audio sampling frequency must be fixed to 8000 Hz.

SPE then reads the audio and writes it either to a file or to a real-time output stream, according to the original request. See the Text to Speech section of the REST API documentation for more details.

SPE reads the connector output continuously, meaning the connector can stream the audio data to stdout as soon as it's received from the TTS service (if the service supports streaming of the synthesized audio). This can reduce unwanted delays, especially in the case of longer texts that take more time to synthesize.
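
The output format and the chunk-by-chunk streaming can be sketched in Python as follows. The generated sine tone is only a stand-in for audio received from a TTS service, and the names sine_pcm_chunks and stream_to are hypothetical. The sketch packs samples as signed 16-bit little-endian mono and flushes after every chunk so the consumer is not kept waiting on a buffer.

```python
import math
import struct


def sine_pcm_chunks(sample_rate, seconds=0.1, freq=440.0, chunk_samples=1024):
    """Stand-in for a TTS service: yield a test tone as s16le mono chunks."""
    total = int(sample_rate * seconds)
    for start in range(0, total, chunk_samples):
        count = min(chunk_samples, total - start)
        samples = [
            int(32767 * 0.5 * math.sin(2 * math.pi * freq * (start + i) / sample_rate))
            for i in range(count)
        ]
        # "<h" = little-endian signed 16-bit integer, one per mono sample.
        yield struct.pack(f"<{count}h", *samples)


def stream_to(out, chunks):
    """Write each chunk to a binary stream as soon as it is available."""
    written = 0
    for chunk in chunks:
        out.write(chunk)
        out.flush()
        written += len(chunk)
    return written
```

In a connector, out would be sys.stdout.buffer (the binary stream behind stdout), and the sample rate would match the naturalSampleRateHertz value reported for the selected voice, or 8000 for SPE 3.45 and older. Each second of audio occupies sample_rate × 2 bytes.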

Connector naming, location, configuration

TTS connectors should be placed in the {SPE_installation_directory}/external/technologies/tts directory, with each connector in a separate subdirectory.

To enable a connector, include its subdirectory name in the external.technologies.tts_connectors setting in the SPE configuration file.

The connector executable file must be named connector (i.e., without a file extension). Connector configuration, such as the TTS service address, access credentials, or API token, should ideally be kept in a separate configuration file, preferably named connector.properties, in a .properties-like format (consistent with the SPE configuration file format).
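
A connector written in Python could read such a file with a small parser like the one sketched below. It handles only a simplified subset of the .properties format (key = value pairs and comment lines starting with # or !); escapes, line continuations, and ":" separators are not supported. The function name parse_properties and the key names in the usage note are hypothetical.

```python
def parse_properties(text):
    """Parse simple key = value lines into a dict of strings."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and comments.
        if not line or line.startswith(("#", "!")):
            continue
        key, sep, value = line.partition("=")
        if sep:
            props[key.strip()] = value.strip()
    return props
```

For instance, a file containing the lines server.address = localhost and server.port = 8080 would be returned as a dictionary with those two keys and string values.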

If everything is set and configured properly, SPE should log a successful TTS connector initialization:

TTSSubsystem: Retrieving external connector info from ......./external/technologies/tts/acapela
TTSSubsystem: External connector 'acapela' from ......./external/technologies/tts/acapela has been registered.

If an error occurs, SPE logs the problem:

TTSSubsystem: Retrieving external connector info from ......./external/technologies/tts/acapela
TTSSubsystem: Cannot retrieve external connector info! ERROR: Loading configuration from "......./external/technologies/tts/acapela/connector.properties";Error: acapela server is not running or address and ports are misconfigured;
