Version: 4.0.0

Phonexia 6th Gen Speech to Text

Phonexia Speech to Text (STT) converts speech in audio signals into plain text.

The technology relies on both acoustic and language models, so its results depend on the chosen language and dictionary. It takes an audio file and a selected language model as input, and returns the transcription in the specified format. Internally, STT extracts features from the speech signal and uses the acoustic model, the language model, and pronunciation information to build hypotheses of the spoken words and decode the most likely transcription. The transcribed text is returned with timing information.

Application areas

Maintain fast response times by routing calls with specific content or topics to human operators
Search for specific information in large call archives
Data-mine audio content and index it for search
Get advanced topic or content analysis

Technology overview

Trained with an emphasis on spontaneous telephony conversation.
Based on state-of-the-art techniques for acoustic modeling, including discriminative training and neural network-based features
Output: One-best transcription – a file with a time-aligned speech transcript (indicating the start and end time of each word)
Processing speed: Depending on the version, it can process from 1800 to 3700 hours of audio (with 50% of the audio being speech) in one day on a single server CPU with 8 cores.
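
The processing-speed figures above can be restated as a speed-up over real time. The calculation below is only a sketch: it assumes a 24-hour day and, for the per-core figure, that throughput scales linearly across the 8 cores, which the text does not state. The function name is illustrative.

```python
HOURS_PER_DAY = 24
CORES = 8  # core count quoted in the text


def speedup_over_real_time(audio_hours_per_day: float) -> float:
    """Hours of audio processed per wall-clock hour."""
    return audio_hours_per_day / HOURS_PER_DAY


for audio_hours in (1800, 3700):
    total = speedup_over_real_time(audio_hours)
    # Assumption: linear scaling across cores (not confirmed by the source).
    per_core = total / CORES
    print(f"{audio_hours} h/day: {total:.1f}x real time, {per_core:.1f}x per core")
```

Under these assumptions, the quoted range corresponds to roughly 75x to 154x faster than real time for the whole server. The figures apply to audio that is 50% speech, so recordings with a different speech ratio may process at a different rate.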

Supported languages

To see all the languages supported by Speech-to-text technology, visit this documentation page.

Acoustic models

An acoustic model is created by training on audio data. It captures the voice characteristics of the speakers in the training set.

Acoustic models can be created for different languages, such as Czech, English, or French, as well as for separate dialects, such as Gulf Arabic or Levant Arabic. From a technological perspective, the difference between languages is the same as the difference between dialects: each model is best suited to people who speak in a manner similar to the speakers it was trained on.

For example, the following acoustic models can be trained for English:

US English – to be used with US speakers
British English – to be used with UK speakers

Language models

A language model consists of a list of words. This list limits the technology, because only words from the list can be transcribed.

In addition to the list of words, the model also includes n-grams (sequences of words). N-grams are used during decoding: the technology relies on how often these sequences occurred in the training data to "decide" which of the possible transcriptions is most likely.

Language models can vary even for the same acoustic models. This means they can include different words and different weights for n-grams. By adjusting the language model, users can focus on a specific domain to achieve better results.
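
The role of the word list and the n-grams can be illustrated with a toy bigram model. This is a minimal sketch, not the engine's actual language model: the training text, the add-one smoothing, and the function name are all assumptions made for illustration, and a real decoder would also combine these scores with acoustic scores.

```python
import math
from collections import Counter

# Toy training text; a real language model is trained on far more data.
corpus = (
    "i want to book a flight . "
    "i want two tickets . "
    "i want to buy a book ."
).split()

unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))
vocabulary = set(unigrams)  # only these words can ever be "transcribed"


def bigram_log_prob(words: list[str]) -> float:
    """Add-one smoothed bigram log-probability of a word sequence."""
    if any(w not in vocabulary for w in words):
        return float("-inf")  # out-of-vocabulary word: hypothesis is impossible
    score = 0.0
    for prev, word in zip(words, words[1:]):
        score += math.log(
            (bigrams[(prev, word)] + 1) / (unigrams[prev] + len(vocabulary))
        )
    return score


# Two acoustically similar hypotheses ("to" vs. "two").
hypotheses = [
    "i want to book a flight".split(),
    "i want two book a flight".split(),
]
best = max(hypotheses, key=bigram_log_prob)
print(" ".join(best))
```

Because "want to" and "to book" appear in the training text while "two book" does not, the first hypothesis scores higher. Retraining the same sketch on text from another domain would change the counts, which mirrors how adjusting the language model shifts results toward that domain.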

Result

During the transcription process, there are always several alternatives for any given speech segment. The user can choose which of the result formats below will be returned by listing the desired result type values in the request. The 1-best result is returned by default.

The 1-best result type provides only the result with the highest score. Speech is returned in segments, each containing one word. Each segment includes the start and end times, the transcribed word, and a confidence score.
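
A 1-best result of this shape could be post-processed as sketched below. The key names (`start`, `end`, `word`, `confidence`) and the sample values are hypothetical and chosen only for illustration; they are not the product's actual response schema.

```python
# Hypothetical 1-best segments; key names and values are illustrative only.
segments = [
    {"start": 0.00, "end": 0.32, "word": "please", "confidence": 0.97},
    {"start": 0.32, "end": 0.61, "word": "hold", "confidence": 0.91},
    {"start": 0.61, "end": 0.80, "word": "the", "confidence": 0.42},
    {"start": 0.80, "end": 1.25, "word": "line", "confidence": 0.88},
]


def to_text(segments, min_confidence=0.0):
    """Join words in time order, skipping those below min_confidence."""
    ordered = sorted(segments, key=lambda s: s["start"])
    return " ".join(s["word"] for s in ordered if s["confidence"] >= min_confidence)


def low_confidence_spans(segments, threshold):
    """Return (start, end, word) for words a reviewer may want to check."""
    return [
        (s["start"], s["end"], s["word"])
        for s in segments
        if s["confidence"] < threshold
    ]


print(to_text(segments))
print(low_confidence_spans(segments, 0.5))
```

Keeping the timestamps alongside each word makes it possible to jump back to the matching position in the audio when a low-confidence word needs review.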

The N-best result type provides multiple transcription alternatives for each segment. For every segment, the engine returns a list of alternatives, each containing the segment text, the start and end timestamps, and a confidence score.

The confusion network result type returns a compact word-level lattice. The output is a series of time-aligned word slots, each containing one or more alternatives (words, silence markers, segment boundaries, or null), along with their start and end times, confidence scores, and an item type. This representation allows downstream components to traverse the network, select the most likely hypothesis, or apply custom post-processing.
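
Selecting the most likely hypothesis from a confusion network amounts to taking the top-scoring alternative in every slot. The sketch below assumes a hypothetical structure: the key names (`alternatives`, `type`, `text`, `confidence`) and the item type strings are illustrative and may differ from the real output.

```python
# Hypothetical confusion network: time-aligned slots with alternatives.
# Key names and item types are illustrative assumptions.
network = [
    {"start": 0.0, "end": 0.4, "alternatives": [
        {"type": "word", "text": "call", "confidence": 0.7},
        {"type": "word", "text": "tall", "confidence": 0.3},
    ]},
    {"start": 0.4, "end": 0.5, "alternatives": [
        {"type": "null", "text": None, "confidence": 0.6},
        {"type": "word", "text": "a", "confidence": 0.4},
    ]},
    {"start": 0.5, "end": 0.9, "alternatives": [
        {"type": "word", "text": "back", "confidence": 0.8},
        {"type": "word", "text": "pack", "confidence": 0.2},
    ]},
]


def best_path(network):
    """Pick the top-scoring alternative per slot; keep only real words."""
    words = []
    for slot in network:
        top = max(slot["alternatives"], key=lambda a: a["confidence"])
        if top["type"] == "word":  # skip null, silence, and boundary items
            words.append((slot["start"], slot["end"], top["text"]))
    return words


print(best_path(network))
```

Custom post-processing could replace the `max` step, for example by preferring an alternative that matches a domain keyword list even when its confidence is slightly lower.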

Accuracy

To measure the accuracy of Speech to Text, consider the following points:

Reason for the accuracy measurement
What is the business value of measuring accuracy? What type of output will be used in the use case? For instance, only nouns, verbs, and adjectives might be important for machine understanding of speech context, whereas all words are essential when the output text is for human processing.
Data quality
Accuracy metrics require comparing the automatic transcription to a baseline transcription, usually annotated data. The quality of this annotated data is crucial because it affects the measurement result. Annotation practices vary between companies, for example:
1. Half-automated annotation – auto-transcription checked by annotators
2. Annotation by two individual people
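
Comparing an automatic transcription to a baseline can be sketched with word error rate (WER), a widely used metric that the text above does not name explicitly. The implementation below is a minimal example based on word-level edit distance; the function name and the sample sentences are illustrative, and it assumes both transcripts have already been normalized (casing, punctuation) in the same way.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference words seen
    # so far (previous row) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion (reference word missing)
                curr[j - 1] + 1,     # insertion (extra hypothesis word)
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1] / len(ref)


baseline = "the cat sat on the mat"    # annotated reference
automatic = "the cat sit on mat"       # STT output
print(f"WER: {word_error_rate(baseline, automatic):.2f}")
```

If only some word classes matter for the use case, both transcripts can be filtered to those words before the comparison. Errors in the annotated baseline are counted as recognition errors, which is why annotation quality directly affects the result.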
