Version: 3.0.0

Enhanced Speech to Text Built on Whisper

Phonexia Enhanced Speech to Text Built on Whisper converts speech in audio signals into plain text.

We have integrated the open-source transcription technology Whisper to give our partners a broader portfolio of languages for Speech to Text conversion. We have made several adjustments to the open-source code to improve its performance, speed, and stability, and to reduce troublesome behaviour.

Keep in mind that the quality and precision of the transcription depend on the quality of the input media files.

Whisper, the underlying technology, behaves non-deterministically on recordings for which the transcription has low confidence. This means that the same audio file may produce slightly different transcriptions when processed multiple times. This ambiguity most often arises with low-quality recordings.
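One way to gauge how much this affects a given recording is to transcribe it several times and compare the results. The following is a minimal sketch using only the Python standard library. The function name `min_pairwise_similarity` and the sample transcripts are hypothetical and are not part of the product. The transcripts stand in for the text returned by repeated runs on the same file.

```python
from difflib import SequenceMatcher
from itertools import combinations


def min_pairwise_similarity(transcripts: list[str]) -> float:
    """Return the lowest word-level similarity ratio between any two transcripts."""
    if len(transcripts) < 2:
        return 1.0  # nothing to compare against
    # Compare on lower-cased words so casing differences are ignored.
    tokenised = [t.lower().split() for t in transcripts]
    return min(
        SequenceMatcher(None, a, b).ratio()
        for a, b in combinations(tokenised, 2)
    )


# Hypothetical outputs of three runs on the same low-quality recording.
runs = [
    "good morning yesterday we were in a hurry",
    "good morning yesterday we were in a hurry",
    "good morning yesterday we are in a hurry",
]
score = min_pairwise_similarity(runs)
print(f"lowest pairwise similarity: {score:.2f}")
```

A score well below 1.0 suggests that the recording falls into the low-confidence category described above and that its transcription should be treated with more caution.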

Features

  • Automatic detection of the language spoken in the audio
  • Language switching: the language is detected for each 30-second block of audio, and the block is transcribed in that language

Language switching

The language switching feature is a part of the Speech to Text technology that evaluates the audio in 30-second segments. For each segment, it detects the language spoken in the majority of the segment and transcribes the segment accordingly.

This feature has to be explicitly enabled.
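The majority rule can be illustrated with a short sketch. This is not the product's implementation. It only assumes that per-utterance language labels with timestamps are available, given here in a hypothetical `(start_s, end_s, language)` tuple format, and it picks the language with the most speech time in each 30-second block.

```python
from collections import defaultdict

BLOCK_SECONDS = 30.0  # block length stated in the text


def dominant_language_per_block(segments, block_seconds=BLOCK_SECONDS):
    """Map block index -> language with the most speech time in that block.

    `segments` is a list of (start_s, end_s, language) tuples; this input
    format is hypothetical and chosen only for illustration.
    """
    totals = defaultdict(lambda: defaultdict(float))
    for start, end, lang in segments:
        block = int(start // block_seconds)
        # Split an utterance that crosses a block boundary between the blocks.
        while start < end:
            block_end = (block + 1) * block_seconds
            piece_end = min(end, block_end)
            totals[block][lang] += piece_end - start
            start = piece_end
            block += 1
    return {b: max(langs, key=langs.get) for b, langs in sorted(totals.items())}


# Hypothetical labels: mostly English in block 0, German in block 1.
segments = [
    (0.0, 12.0, "en"),
    (12.0, 14.0, "de"),
    (14.0, 28.0, "en"),
    (31.0, 55.0, "de"),
]
print(dominant_language_per_block(segments))
```

In this sketch, the two seconds of German in the first block do not change its label, because English accounts for most of the speech time there.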

Limitations

Note that if most of the speech in a 30-second segment is in a single language, e.g. English, the entire segment will most likely be transcribed in English, even if it contains a short sentence or word in another language. The foreign-language parts may even be rendered in English with their meaning preserved. As a result, the user may be unaware that another language was used in that part of the recording.

Example:

  • Speech: Good morning. Yesterday we were in a hurry. Mein Gott, it was a messy day.
  • Transcription: Good morning. Yesterday we were in a hurry. My God, it was a messy day.

Performance

For proper performance, it is important to run Enhanced Speech to Text Built on Whisper on compute-grade graphics cards (GPUs).