Overview
Our new Speech-to-Text solution uses Whisper, an open-source speech recognition model, to deliver accurate and efficient audio transcriptions. This integration expands our language support, offering a wider range of options for your speech-to-text needs.
We've further optimized Whisper's open-source code, improving performance, transcription speed, and overall stability. However, the quality of the transcribed text ultimately depends on the quality of your audio files.
Whisper, the underlying model, behaves non-deterministically on recordings where transcription confidence is low: the same audio file can yield slightly different transcriptions across runs. This most often happens with low-quality recordings.
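The mechanism behind this variance can be pictured with a toy temperature sampler: at temperature 0 decoding is a greedy argmax and therefore repeatable, while at higher temperatures near-tied candidates can be picked differently on each run. This is an illustrative sketch only; the function names are ours, not Whisper's API.

```python
import math
import random

def sample_token(logits, temperature, rng):
    """Pick a token index from logits; temperature 0 means greedy argmax."""
    if temperature == 0:
        return max(range(len(logits)), key=lambda i: logits[i])
    # Softmax with temperature: higher temperature flattens the distribution.
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    return rng.choices(range(len(logits)), weights=probs)[0]

# Two near-tied candidates model a "low-confidence" decoding step.
logits = [2.0, 1.9, 0.5]

greedy = {sample_token(logits, 0.0, random.Random(i)) for i in range(10)}
sampled = {sample_token(logits, 1.0, random.Random(i)) for i in range(10)}

print(greedy)   # always index 0: greedy decoding is repeatable
print(sampled)  # may contain several indices: sampling can vary run to run
```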
To see all the languages supported by the Speech-to-Text service, visit this documentation page.
Features
Voice Activity Detection (VAD) filter
To enhance transcription accuracy, we apply a Voice Activity Detection (VAD) filter that removes non-speech segments from the audio. This matters because Whisper sometimes produces inaccurate results on silent or noisy portions of a recording.
Additionally, the VAD filter significantly improves transcription speed by excluding segments that don’t require processing by the Whisper model.
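The idea of a VAD filter can be sketched with a naive energy threshold: frames whose average amplitude stays below a cutoff are treated as non-speech and dropped before transcription. Real deployments use trained VAD models; the frame length and threshold below are illustrative values of our choosing.

```python
def speech_frames(samples, frame_len=160, threshold=0.02):
    """Return (start, end) sample ranges whose mean absolute amplitude
    exceeds the threshold, i.e. the frames likely to contain speech."""
    segments = []
    start = None
    for i in range(0, len(samples), frame_len):
        frame = samples[i:i + frame_len]
        energy = sum(abs(s) for s in frame) / len(frame)
        if energy >= threshold:
            if start is None:
                start = i                # speech begins here
        elif start is not None:
            segments.append((start, i))  # silence: close the segment
            start = None
    if start is not None:
        segments.append((start, len(samples)))
    return segments

# Silence, then a loud burst, then silence again.
audio = [0.0] * 320 + [0.5, -0.5] * 160 + [0.0] * 320
print(speech_frames(audio))  # -> [(320, 640)]
```

Only the returned ranges would be passed on to the model, which is where the speed gain comes from.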
Language Detection
When using auto-detect mode, the entire recording is transcribed in the language detected in the first 30-second segment.
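Auto-detect mode can be sketched as: detect the language on the first 30-second block only, then transcribe every block in that one language. The detector and transcriber below are toy placeholders, not the real API; the "samples" are language tags, one per second of audio.

```python
BLOCK_SECONDS = 30

def split_blocks(samples, sample_rate):
    """Split audio into consecutive 30-second blocks of samples."""
    step = BLOCK_SECONDS * sample_rate
    return [samples[i:i + step] for i in range(0, len(samples), step)]

def transcribe_auto_detect(samples, sample_rate, detect, transcribe):
    """Detect the language on the first block only, then transcribe
    all blocks in that single language."""
    blocks = split_blocks(samples, sample_rate)
    language = detect(blocks[0])
    return [(language, transcribe(block, language)) for block in blocks]

# Toy stand-ins for the real components.
detect = lambda block: max(set(block), key=block.count)
transcribe = lambda block, lang: f"[{len(block)}s as {lang}]"

audio = ["en"] * 45 + ["es"] * 15   # English first, Spanish later
print(transcribe_auto_detect(audio, 1, detect, transcribe))
# -> [('en', '[30s as en]'), ('en', '[30s as en]')]
```

Note that the later Spanish speech is still labeled English here, because only the first block drives the decision.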
Language Switching
In contrast to the default language detection, language switching detects different languages within the same audio recording. The model transcribes the recording sequentially in 30-second blocks, applying language detection to each block, not just the first one.
Take a look at the transcription of an example recording where the language changes after 29 seconds.
Example:
en, 00:00:01 – 00:00:08
Hi Sylvia, today I was reading the meeting notes and there are several things I'd like to clarify.
en, 00:00:08 – 00:00:15
The budget as it was presented will not be sufficient for everything planned for the first quarter.
en, 00:00:15 – 00:00:24
There are some extra activities in Boston that we need to take into account, I mean money-wise,
en, 00:00:24 – 00:00:29
and there's also the California project, but let me ask Mercedes as she is the one in charge.
es, 00:00:29 – 00:00:37
Hola Mercedes, mira, estoy explicándole a Silvia algo acerca del dinero y necesito tu ayuda.
es, 00:00:37 – 00:00:44
Me gustaría que me calcularas aproximadamente el presupuesto para el proyecto IRIS en Baja California
es, 00:00:44 – 00:00:49
para que aclaremos el presupuesto entero para este año.
es, 00:00:49 – 00:00:57
Por favor, no olvides incluir el tema de servicios externos y entrenamiento de todos los empleados involucrados, ¿vale?
es, 00:00:57 – 00:00:58
Gracias.
Language switching must be explicitly enabled; otherwise, auto-detect mode will be used.
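Language switching replaces the single upfront detection with a per-block one. As a sketch, using the same kind of toy placeholder detector and transcriber as above (not the real API; "samples" are language tags, one per second):

```python
BLOCK_SECONDS = 30

def transcribe_with_switching(samples, sample_rate, detect, transcribe):
    """Process the audio sequentially in 30-second blocks and run
    language detection on every block, not just the first."""
    step = BLOCK_SECONDS * sample_rate
    results = []
    for i in range(0, len(samples), step):
        block = samples[i:i + step]
        language = detect(block)      # per-block detection
        results.append((language, transcribe(block, language)))
    return results

# Toy stand-ins for the real components.
detect = lambda block: max(set(block), key=block.count)
transcribe = lambda block, lang: f"[{len(block)}s as {lang}]"

audio = ["en"] * 30 + ["es"] * 30   # language changes at the 30s mark
print(transcribe_with_switching(audio, 1, detect, transcribe))
# -> [('en', '[30s as en]'), ('es', '[30s as es]')]
```

Because each block is detected independently, the Spanish half of the recording keeps its own language label, as in the example transcript above.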
Limitations
Note that if most of the speech in a 30-second segment is in a single language, e.g. English, the entire segment will most likely be transcribed into English, even if a short sentence or word in another language occurs within it. The foreign-language parts transcribed into English may even preserve their actual meaning. As a consequence, the user may be unaware that another language was used in that part of the recording.
Example:
Speech:
Good morning. Yesterday we were in a hurry. Mein Gott, it was a messy day.
Transcription:
en, 00:00:00 – 00:00:05
Good morning. Yesterday we were in a hurry. My God, it was a messy day.
Performance
Whisper models range from tens of millions to over a billion parameters, making them very resource-intensive. While it is possible to run them on CPUs, such a setup is slow and hard to scale, making it suitable primarily for testing purposes. For production deployments, GPUs are necessary. For further details, see the Performance page.
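A common pattern for honoring this in a PyTorch-based deployment (Whisper's reference implementation runs on PyTorch) is to select the device at load time; the helper name below is ours, and the fallback path simply reflects a CPU-only test setup:

```python
def pick_device():
    """Prefer a CUDA GPU when PyTorch reports one; fall back to CPU,
    which is acceptable for testing but too slow for production."""
    try:
        import torch
        if torch.cuda.is_available():
            return "cuda"
    except ImportError:
        pass  # PyTorch not installed: assume a CPU-only testing setup
    return "cpu"

print(pick_device())  # "cuda" on a GPU host, otherwise "cpu"
```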