Phonexia 6th Gen Speech to Text
Phonexia Speech to Text (STT) converts speech in audio signals into plain text.
The technology relies on both acoustic and language models, so its results depend on the chosen language and dictionary. An audio file is required as input, along with the selection of a language model for transcription; the output is the transcription in a specified format. STT extracts features from the voice and uses the acoustic model, the language model, and pronunciation information to form hypotheses of transcribed words and decode the most likely transcription. The transcribed text is returned with time frames.
Application areas
- Maintain high response times by routing calls with specific content or topic to human operators
- Search for specific information in large call archives
- Data-mine audio content and index it for search
- Get advanced topic or content analysis
Technology overview
- Trained with an emphasis on spontaneous telephony conversation.
- Based on state-of-the-art techniques for acoustic modeling, including discriminative training and neural network-based features
- Output: One-best transcription – a file with a time-aligned speech transcript (indicating the start and end time of each word)
- Processing speed: Depending on the version, it can process from 1800 to 3700 hours of audio (with 50% of the audio being speech) in one day on a single server CPU with 8 cores.
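For intuition, the processing-speed figures above can be expressed as a real-time factor. This is a rough back-of-the-envelope calculation, not a benchmark:

```python
# Rough throughput arithmetic for the processing-speed figures above
# (illustrative only; actual speed depends on version and hardware).

def realtime_factor(audio_hours_per_day: float) -> float:
    """How many hours of audio are processed per hour of wall-clock time."""
    return audio_hours_per_day / 24.0

low, high = 1800, 3700  # hours of audio per day, from the figures above
print(f"{realtime_factor(low):.0f}x to {realtime_factor(high):.0f}x real time")
# prints "75x to 154x real time"
```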
To see all the languages supported by the Speech to Text technology, visit this documentation page.
Acoustic models
An acoustic model is created by training on audio data. It includes characteristics of the voices of a set of speakers provided in the training set.
Acoustic models can be created for different languages, such as Czech, English, or French, as well as for separate dialects such as Gulf Arabic or Levantine Arabic. From a technological perspective, there is no difference between handling separate languages and separate dialects: each model will be better suited for people who speak in a similar manner.
For example, the following acoustic models can be trained for English:
- US English – to be used with US speakers
- British English – to be used with UK speakers
Language models
A language model consists of a list of words; this list is a limitation of the technology, as only words from it can be transcribed.
In addition to the list of words, the model also includes n-grams (sequences of words). N-grams are useful during decoding and making decisions, as the technology uses these sequences from training data to "decide" which possible transcriptions are most accurate.
Language models can vary even for the same acoustic models. This means they can include different words and different weights for n-grams. By adjusting the language model, users can focus on a specific domain to achieve better results.
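The role of n-grams in decoding can be sketched with a toy bigram model. All words, counts, and hypotheses below are invented for illustration; the real decoder is far more sophisticated:

```python
import math
from collections import defaultdict

# Toy bigram counts standing in for n-gram statistics learned from
# training data (all values here are invented for illustration).
bigram_counts = {
    ("recognize", "speech"): 40,
    ("wreck", "a"): 5,
    ("a", "nice"): 30,
    ("nice", "beach"): 8,
}
unigram_counts = defaultdict(int, {"recognize": 50, "wreck": 10, "a": 100, "nice": 40})

def score(words, alpha=1.0, vocab=1000):
    """Log-probability of a word sequence under the bigram model,
    with add-alpha smoothing for unseen word pairs."""
    s = 0.0
    for w1, w2 in zip(words, words[1:]):
        num = bigram_counts.get((w1, w2), 0) + alpha
        den = unigram_counts[w1] + alpha * vocab
        s += math.log(num / den)
    return s

# Two competing transcription hypotheses for the same speech segment;
# the decoder keeps the one the language model scores higher.
hyp_a = ["recognize", "speech"]
hyp_b = ["wreck", "a", "nice", "beach"]
best = max([hyp_a, hyp_b], key=score)
print(best)  # prints ['recognize', 'speech']
```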
Result
During the transcription process, there are always several alternatives for any given speech segment.
The 1-best result type provides only the result with the highest score. Speech is returned in segments, each containing one word. Each segment includes information about the start and end times, the transcribed word, and a confidence score.
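A minimal sketch of consuming such a 1-best result follows. The segment field names (start, end, word, confidence) are illustrative assumptions and may differ from the actual response format:

```python
# Hypothetical 1-best segments: one word per segment, with timing
# and confidence (field names assumed for illustration).
segments = [
    {"start": 0.00, "end": 0.42, "word": "hello", "confidence": 0.97},
    {"start": 0.42, "end": 0.80, "word": "world", "confidence": 0.88},
]

def to_text(segments, min_confidence=0.5):
    """Join words above a confidence threshold into plain text."""
    return " ".join(s["word"] for s in segments if s["confidence"] >= min_confidence)

print(to_text(segments))  # prints "hello world"
```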
Accuracy
To measure the accuracy of Speech to Text, consider the following points:
- Reason for the accuracy measurement

  What is the business value of measuring accuracy? What type of output will be used in the use case? For instance, only nouns, verbs, and adjectives might be important for machine understanding of speech context, whereas all words are essential when the output text is for human processing.

- Data quality

  Accuracy metrics require comparing the automatic transcription to a baseline transcription, usually annotated data. The quality of these annotated data is crucial, as it impacts the measurement result. Annotation practices may vary between companies, for example:
  - Half-automated annotation – auto-transcription checked by annotators
  - Annotation by two individual people
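One widely used metric for comparing the automatic transcription to the baseline is word error rate (WER), which aligns the two word sequences and counts substitutions, deletions, and insertions. A minimal sketch:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: (substitutions + deletions + insertions)
    divided by reference length, via word-level Levenshtein distance."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,          # deletion
                           dp[i][j - 1] + 1,          # insertion
                           dp[i - 1][j - 1] + cost)   # substitution / match
    return dp[-1][-1] / len(ref)

# One deleted word out of six reference words -> WER of 1/6
print(wer("the cat sat on the mat", "the cat sat on mat"))
```

Production scoring tools additionally normalize text (casing, punctuation, number formats) before alignment, which this sketch omits.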