Speech to Text – Phonexia
Phonexia Speech to Text (STT) converts speech in audio signals into plain text.
The technology works with both acoustics and a dictionary of words, using an acoustic model and pronunciations. This makes it dependent on the language and the dictionary: only words contained in the dictionary can be transcribed. The input is an audio file together with the selection of the language model to use for transcription. The output is the transcription in one of the supported formats. The technology extracts features from the voice and uses the acoustic and language models, together with pronunciations, in a recognition network to create hypotheses of the transcribed words and to "decode" the most probable transcription. The transcribed text is returned with timing information.
Application areas
- Maintain fast reaction times by routing calls with specific content/topic to human operators
- Search for specific information in large call archives
- Data-mine audio content and index for search
- Advanced topic/content analysis provides additional value
Technology overview
- Trained with emphasis on spontaneous telephony conversation
- Based on state-of-the-art techniques for acoustic modeling, including discriminative training and neural network-based features
- Output – one-best transcription, i.e. a file with a time-aligned speech transcript (start and end time of each word)
- Processing speed – depending on the version, from 1800 to 3700 hours of audio (assuming 50% of the audio is speech) per day on one server CPU with 8 cores
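To put the processing speed in perspective, the quoted daily throughput can be converted into a "faster than real time" factor. The sketch below is plain arithmetic on the figures above (1800 to 3700 hours per day, 8 cores); the function name is hypothetical, and the even split across cores is an assumption rather than a documented property.

```python
HOURS_PER_DAY = 24
CORES = 8  # from the text: one server CPU with 8 cores


def speedup_factors(audio_hours_per_day, cores=CORES):
    """Return (per-server, per-core) hours of audio processed per wall-clock hour.

    The per-core value assumes the load is spread evenly over all cores,
    which is an assumption made for illustration only.
    """
    per_server = audio_hours_per_day / HOURS_PER_DAY
    return per_server, per_server / cores


for hours in (1800, 3700):
    server, core = speedup_factors(hours)
    print(f"{hours} h/day: {server:.1f}x real time per server, {core:.1f}x per core")
```

Note that the quoted figures assume recordings in which half of the audio is speech, so recordings with denser speech would likely be processed more slowly.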
Acoustic models
An acoustic model is created by training on training data. It captures the voice characteristics of the set of speakers in the training set.
Acoustic models can be created for different languages, such as Czech, English or French, as well as for separate dialects, such as Gulf Arabic or Levant Arabic. From the technology's point of view, the difference between languages is the same as the difference between dialects: each model is best suited to people who speak the same way as the speakers it was trained on.
For example, the following acoustic models can be trained for English:
- US English – to be used with US speakers
- British English – to be used with UK speakers
Language models
A language model consists of a list of words. This is a limitation of the technology, as only words from this list can appear in the transcription.
Together with the list of words, the model also contains n-grams of words. N-grams are used during decoding: the technology takes into account the word sequences learned from training to "decide" which of the possible transcriptions is the most probable.
Different language models can be used with the same acoustic model. They can include different words and different n-gram weights. This lets the user adapt a language model to a specific domain to get better results.
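The role of n-grams in choosing between alternative transcriptions can be illustrated with a toy bigram model. This is a minimal sketch, not Phonexia's implementation: the training sentences, the add-one smoothing and all names are assumptions chosen for illustration, and a real decoder would also combine these scores with acoustic scores.

```python
import math
from collections import Counter

# Toy "domain" training text (hypothetical data).
corpus = [
    "please recognize speech today",
    "we recognize speech in calls",
    "a nice beach",
]

vocab = set()
context_counts, bigram_counts = Counter(), Counter()
for sentence in corpus:
    words = ["<s>"] + sentence.split()  # <s> marks the sentence start
    vocab.update(words[1:])
    context_counts.update(words[:-1])
    bigram_counts.update(zip(words, words[1:]))


def log_score(sentence):
    """Sum of add-one smoothed bigram log-probabilities of the sentence."""
    words = ["<s>"] + sentence.split()
    total = 0.0
    for prev, word in zip(words, words[1:]):
        prob = (bigram_counts[(prev, word)] + 1) / (context_counts[prev] + len(vocab))
        total += math.log(prob)
    return total


# Two candidate transcriptions built only from words in the word list.
hypotheses = ["we recognize speech", "we recognize beach"]
best = max(hypotheses, key=log_score)
print(best)
```

Training the counts on text from a different domain would change the weights and therefore which hypothesis wins, which is the effect that domain-specific language models rely on.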
Result
During transcription there are always several alternatives for a given speech segment.
The 1-best result type provides only the alternative with the highest score. Speech is returned in segments, each containing exactly one word. Every segment provides its start and end time, the transcribed word and a score.
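A 1-best result as described above can be modelled as a list of one-word segments. The sketch below is hypothetical: the field names, the use of seconds, the score range and the sample values are assumptions and do not reflect the actual output format.

```python
from dataclasses import dataclass


@dataclass
class WordSegment:
    start: float  # segment start, assumed to be in seconds
    end: float    # segment end, assumed to be in seconds
    word: str     # the transcribed word
    score: float  # higher means more confident (scale is an assumption)


# Hypothetical 1-best output for a short utterance.
segments = [
    WordSegment(0.40, 0.85, "hello", 0.97),
    WordSegment(0.90, 1.10, "this", 0.91),
    WordSegment(1.10, 1.25, "is", 0.88),
    WordSegment(1.30, 1.95, "support", 0.42),
]


def to_text(segments):
    """Join the words in time order into plain text."""
    return " ".join(s.word for s in sorted(segments, key=lambda s: s.start))


def low_confidence(segments, threshold):
    """Return segments whose score falls below the given threshold."""
    return [s for s in segments if s.score < threshold]


print(to_text(segments))
print([s.word for s in low_confidence(segments, 0.5)])
```

Keeping the per-word timing and score, rather than only the joined text, makes it possible to jump to a word in the recording or to flag uncertain words for review.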
Accuracy
To measure the accuracy of Speech to Text the following points should be taken into account:
- Reason for the accuracy measurement
What is the business value of measuring the accuracy? What type of output will be used in the use case for which accuracy is measured? It may be that only nouns, verbs and adjectives are important for machine understanding of speech context, or that all words are important when the output text is intended for human processing.
- Data quality
Every metric requires comparing the automatic transcription to some baseline transcription, usually in the form of annotated data. The quality of these annotated data is crucial, as it can affect the result of the measurement. Annotation might be done differently by different companies, for example:
  - Half-automated annotation – auto-transcription checked by human annotators
  - Annotation by two individual people
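As an illustration of comparing an automatic transcription to a baseline, one commonly used metric is word error rate (WER): the number of word substitutions, deletions and insertions divided by the number of words in the reference. The text above does not name a specific metric, so the choice of WER, the function name and the sample sentences are assumptions.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # previous[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1       # reference word missing in hypothesis
            insertion = current[j - 1] + 1   # extra word in hypothesis
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1] / len(ref)


reference = "the call was routed to an operator"     # annotated baseline (hypothetical)
hypothesis = "the call was rooted to operator"       # automatic transcription (hypothetical)
print(f"WER: {word_error_rate(reference, hypothesis):.3f}")
```

Because the result depends directly on the reference text, differences in annotation practice (spelling, casing, handling of hesitations) change the measured value even when the automatic transcription stays the same.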