Configuring word detection parameters for stream transcription

One of the key improvements introduced in Speech Engine 3.24 is the neural network-based voice activity detection (VAD), used for word and segment detection. This article describes the segmenter configuration parameters and how they affect real-time stream speech-to-text (STT) results.

The default segmenter parameters are as follows:

[vad.online_segmenter:SOnlineVoiceActivitySegmenterI]
backward_extensions_length_ms=150
forward_extensions_length_ms=750
speech_threshold=0.5

[Figure: backward and forward extensions]

Backward and forward extensions are intervals, in milliseconds, that extend the portion of the signal sent to the decoder. The decoder determines what a given segment of the signal contains (e.g., speech or silence) and, based on this analysis, decides whether the segment has ended or is still ongoing.

Unlike file processing (where it is possible to analyze any part of the file), in real-time processing, future signal data is not available. The backward extension value dictates how long processing must be delayed (i.e., processing waits until the specified amount of input signal is received). Increasing this value results in delayed speech activity detection, such as delayed barge-in detection in voicebot implementations.

The forward extension value specifies how much of the following signal is included in the processing to determine whether the utterance continues. Lowering this value causes even short pauses between words to be detected as the end of a segment. Conversely, increasing this value allows longer pauses between words without treating them as the end of a sentence.
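The effect of the two extension values can be illustrated with a simplified model. The sketch below is not the Speech Engine algorithm (the engine uses a neural VAD together with the decoder). It only assumes that each detected speech interval is padded by the backward and forward extensions and that padded intervals which touch or overlap end up in the same segment. The function name `extend_and_merge` and the sample word timings are hypothetical.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # (start_ms, end_ms)


def extend_and_merge(
    speech: List[Interval],
    backward_ms: int = 150,   # default backward_extensions_length_ms
    forward_ms: int = 750,    # default forward_extensions_length_ms
) -> List[Interval]:
    """Pad each speech interval and merge the ones that touch or overlap.

    Simplified illustration only; the real segmenter is not reproduced here.
    """
    segments: List[Interval] = []
    for start, end in sorted(speech):
        padded_start = max(0, start - backward_ms)
        padded_end = end + forward_ms
        if segments and padded_start <= segments[-1][1]:
            # The pause was short enough: continue the previous segment.
            prev_start, prev_end = segments[-1]
            segments[-1] = (prev_start, max(prev_end, padded_end))
        else:
            # The pause was too long: start a new segment.
            segments.append((padded_start, padded_end))
    return segments


# Hypothetical word timings: a 500 ms pause, then a 1300 ms pause.
words = [(1000, 1400), (1900, 2300), (3600, 4000)]

default_segments = extend_and_merge(words)
short_forward = extend_and_merge(words, forward_ms=200)

print(default_segments)  # [(850, 3050), (3450, 4750)]
print(short_forward)     # [(850, 1600), (1750, 2500), (3450, 4200)]
```

With the default values, the 500 ms pause is bridged and the first two words share one segment. With a forward extension of 200 ms, the same pause already ends the segment.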

Speech threshold is a unitless value that defines the boundary between speech and silence. The default middle value is -0.5 (note that the value in the configuration file is set higher than the default). Lower values mean "even if there is not complete silence in the signal, consider it silence," leading to more frequent segment terminations. Higher values mean "even if there is silence in the signal, consider it speech," resulting in less frequent segment terminations.

These values can be modified using a user configuration file; refer to the article "User configuration file" for further details.

In summary, create an appropriately named user configuration file containing the modified parameters, place it alongside the standard configuration file in the <SPE directory>/bsapi/stt/settings directory, and restart SPE. On startup, SPE automatically reads the values from the user configuration file and uses them to override the standard configuration.
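As an illustration of preparing the override content, the following minimal sketch renders the text of such a file with Python's standard `configparser` module. It assumes that the user configuration file repeats the section header from the standard configuration and lists only the changed keys. The helper name `render_user_config` and the example value of 1000 ms are hypothetical, and the required file name is deliberately left out because it is covered by the "User configuration file" article.

```python
import configparser
import io

# Section name copied from the default segmenter configuration.
SECTION = "vad.online_segmenter:SOnlineVoiceActivitySegmenterI"


def render_user_config(overrides: dict) -> str:
    """Return INI text containing only the overridden segmenter parameters."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.optionxform = str  # keep parameter names exactly as given
    parser[SECTION] = {key: str(value) for key, value in overrides.items()}
    buffer = io.StringIO()
    # Write "key=value" without spaces, matching the style of the default file.
    parser.write(buffer, space_around_delimiters=False)
    return buffer.getvalue()


# Example (assumed value): tolerate longer pauses between words.
user_config_text = render_user_config({"forward_extensions_length_ms": 1000})
print(user_config_text)
```

The resulting text can then be saved under the appropriate name in the settings directory before restarting SPE.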