Dynamically Adding Words to the Language Model

Adding words to the Speech to Text (STT) language model on-the-fly is possible in SPE 3.45 or newer, as part of the preferred phrases feature.

In the POST /technologies/stt and POST /technologies/stt/input_stream API calls, the preferred_phrases part of the request body serves two purposes:

  1. Specifying the preferred phrases (in the phrases section).
  2. Adding words to the STT language model (in the dictionary section).

Each part can be used independently: you can specify only preferred phrases, only add words to the dictionary, or use both features in the same request.

Here is an example of input for starting transcription, specifying two preferred phrases and two words to be added (one with an explicitly specified pronunciation):

{
  "preferred_phrases": {
    "phrases": [
      {
        "phrase": "this is a preferred phrase"
      },
      {
        "phrase": "some other phrase"
      }
    ],
    "dictionary": [
      {
        "word": "preferred"
      },
      {
        "word": "phrase",
        "pronunciations": [
          {
            "phonemes": "f r ey z"
          }
        ]
      }
    ]
  }
}
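
The same request body can also be assembled programmatically. The following is a minimal Python sketch, not an official client: the helper name build_preferred_phrases is hypothetical, and actually sending the body (server address, authentication, query parameters) is left out because those details depend on your SPE deployment.

```python
import json


def build_preferred_phrases(phrases, words):
    """Build a request body with the structure shown in the example above.

    phrases: iterable of phrase strings.
    words:   mapping of word -> list of phoneme strings. An empty list means
             no explicit pronunciation is sent for that word.
    (Hypothetical helper, used here for illustration only.)
    """
    dictionary = []
    for word, phoneme_strings in words.items():
        entry = {"word": word}
        if phoneme_strings:
            entry["pronunciations"] = [{"phonemes": p} for p in phoneme_strings]
        dictionary.append(entry)
    return {
        "preferred_phrases": {
            "phrases": [{"phrase": p} for p in phrases],
            "dictionary": dictionary,
        }
    }


body = build_preferred_phrases(
    ["this is a preferred phrase", "some other phrase"],
    {"preferred": [], "phrase": ["f r ey z"]},
)
# Serialize the body; this JSON is what would be sent with the POST call.
print(json.dumps(body, indent=2))
```

Omitting the pronunciations key for words without an explicit pronunciation mirrors the "preferred" entry in the example above.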

Words and pronunciations

Words to be added to the language model can be specified without a pronunciation. In such cases, the system automatically generates a default pronunciation from the word's letters, following internal linguistic rules for the given STT language. However, the generated pronunciation may not match how the word is actually spoken, especially for foreign words, where the pronunciation rules of the word's native language differ from those of the STT language. It is therefore recommended to define pronunciations explicitly to prevent mis-transcriptions caused by incorrect default pronunciations.

It is also possible to define multiple pronunciations, which can be particularly useful for uncommon or foreign words, slang terms, etc., that people might mispronounce.
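
As an illustration, a dictionary entry with several pronunciation variants could be built as in the Python sketch below. The helper add_pronunciation is hypothetical, and the second phoneme string is only an illustrative variant; any phonemes you use should be checked against the list supported by your STT language.

```python
def add_pronunciation(entry, phonemes):
    """Append a pronunciation variant to a dictionary entry, skipping duplicates.

    Hypothetical helper; `entry` is a dict shaped like the items of the
    "dictionary" list in the request body.
    """
    variants = entry.setdefault("pronunciations", [])
    if all(v["phonemes"] != phonemes for v in variants):
        variants.append({"phonemes": phonemes})
    return entry


entry = {"word": "phrase"}
add_pronunciation(entry, "f r ey z")  # pronunciation from the example above
add_pronunciation(entry, "f r ey s")  # illustrative variant, an assumption
add_pronunciation(entry, "f r ey z")  # duplicate, not added again
print(entry)
```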

Allowed characters

Generally, words should use only the letters (graphemes) allowed in the given STT language. You can use GET /technologies/stt/graphemes to retrieve the list of allowed graphemes. However, it is also permissible to use letters from different alphabets (e.g., a German word like “grüßen” in a Czech transcription) or different writing scripts (such as Cyrillic or Japanese Kana). In such cases, the word's pronunciation MUST be explicitly specified.

The pronunciation must use only phonemes supported by the STT language (use GET /technologies/stt/phonemes to retrieve the list of allowed phonemes). If a word is specified using disallowed characters without an accompanying pronunciation, that word will be ignored during transcription (see the warning_message parameter below).
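
The rules above can be checked on the client side before starting a transcription. The Python sketch below assumes the allowed graphemes and phonemes have already been retrieved and are available as plain sets of strings; the response format of the two GET calls is not described here, so fetching and parsing them is left out. The function name check_entry and the sample sets are hypothetical placeholders, not the real lists for any STT language.

```python
def check_entry(word, pronunciations, allowed_graphemes, allowed_phonemes):
    """Return a list of human-readable problems for one dictionary entry.

    word:           the word to be added.
    pronunciations: list of phoneme strings such as "f r ey z" (may be empty).
    The allowed_* arguments are sets assumed to come from the graphemes and
    phonemes API calls. Characters are compared as-is (no case folding).
    """
    problems = []
    disallowed = sorted({ch for ch in word if ch not in allowed_graphemes})
    if disallowed and not pronunciations:
        problems.append(
            f"{word!r} uses disallowed graphemes {disallowed} "
            "and has no explicit pronunciation"
        )
    for phonemes in pronunciations:
        unknown = [p for p in phonemes.split() if p not in allowed_phonemes]
        if unknown:
            problems.append(f"{word!r}: unsupported phonemes {unknown}")
    return problems


# Placeholder sets for illustration only.
graphemes = set("abcdefghijklmnopqrstuvwxyz")
phonemes = {"f", "r", "ey", "z"}

print(check_entry("phrase", ["f r ey z"], graphemes, phonemes))
print(check_entry("grüßen", [], graphemes, phonemes))
```

A word that fails the first check would be ignored during transcription, so it is cheaper to catch it before the request is sent.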

Transcription results

If preferred phrases and/or words were specified when starting the transcription, the result will contain the same phrases and dictionary structures used as input for the transcription task.

The dictionary structure is enriched with the following:

  • The pronunciations part is automatically generated for words that did not have pronunciations specified in the input.
  • The out_of_vocabulary parameter indicates whether the word exists in the internal vocabulary.
  • The class parameter contains the name of the word class to which the word belongs, if applicable.
  • The warning_message parameter contains any warning messages (if a warning message is present, the corresponding word/pronunciation was ignored and not used during transcription).

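The example below shows the transcription result for a transcription started with the input example provided earlier. Compared with the input, the result adds the generated pronunciation for the word "preferred" and the out_of_vocabulary, class, and warning_message parameters: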

{
  "result": {
    "version": 5,
    "name": "SpeechRecognitionResult",
    "file": "/test.wav",
    "model": "EN_US_6",
    "phrases": [
      {
        "phrase": "this is a preferred phrase"
      },
      {
        "phrase": "some other phrase"
      }
    ],
    "dictionary": [
      {
        "word": "preferred",
        "pronunciations": [
          {
            "phonemes": "p r ih f er d",
            "out_of_vocabulary": false,
            "class": "",
            "warning_message": ""
          }
        ]
      },
      {
        "word": "phrase",
        "pronunciations": [
          {
            "phonemes": "f r ey z",
            "out_of_vocabulary": false,
            "class": "",
            "warning_message": ""
          }
        ]
      }
    ]
  }
}
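
Because a non-empty warning_message means the word or pronunciation was ignored, it can be useful to scan the result for warnings. The Python sketch below is one way to do this, assuming the result has the structure shown in the example above. The function name collect_warnings is hypothetical, the entry-level check is an assumption (the example only shows warning_message inside pronunciations), and the warning text in the sample data is made up for illustration.

```python
def collect_warnings(response):
    """Return (word, phonemes, warning) tuples for ignored dictionary items.

    `response` is the parsed JSON result. `phonemes` is None when the warning
    is attached to the word itself rather than to one pronunciation.
    """
    warnings = []
    for entry in response.get("result", {}).get("dictionary", []):
        word = entry.get("word", "")
        # Assumption: a warning may also appear directly on the word entry.
        if entry.get("warning_message"):
            warnings.append((word, None, entry["warning_message"]))
        for pron in entry.get("pronunciations", []):
            if pron.get("warning_message"):
                warnings.append((word, pron.get("phonemes"), pron["warning_message"]))
    return warnings


# Sample data: the first entry follows the example above, the second entry
# and its warning text are invented for this sketch.
sample = {
    "result": {
        "dictionary": [
            {"word": "phrase", "pronunciations": [
                {"phonemes": "f r ey z", "out_of_vocabulary": False,
                 "class": "", "warning_message": ""}]},
            {"word": "example", "pronunciations": [
                {"phonemes": "x y", "warning_message": "illustrative warning"}]},
        ]
    }
}

for word, phonemes, warning in collect_warnings(sample):
    print(word, phonemes, warning)
```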