Version: 3.2.0

Comparison with Vanilla Whisper

Enhanced Speech to Text Built on Whisper is superior to open-source Whisper in the following respects:

Provides more robust performance on challenging recordings, such as those containing silence or noise. This improvement is achieved by using Phonexia's Voice Activity Detection (VAD), which improves transcription accuracy for telephone recordings in particular.
Implements mechanisms to prevent nonsensical transcriptions (hallucinations) and looping (repeated transcriptions of the same word) that sometimes occur in open-source Whisper.
Provides more accurate timestamps, which is another advantage of using Phonexia's Voice Activity Detection.
Fine-tuned to improve the overall processing speed. This is achieved through Phonexia's VAD and the use of a faster library. See the Performance page for more information and an example.
Ability to switch the language during transcription if the original recording contains multiple languages. This feature is called language switching. See the Overview page for more information and an example.
We can fine-tune medium- or low-resource languages (less common languages) from Whisper to achieve even higher accuracy than the original Whisper model provides. (We are currently working on this option; it will be available soon.)
We also provide Distil Whisper, which is 6 times faster than the original Whisper. It is currently available for English, but we can distil Whisper for other languages as well.
Possibility to use a GUI (graphical user interface) or an easy-to-use REST API for Enhanced Speech to Text Built on Whisper.
Part of a larger platform offering a range of complementary speech technologies (not just speech transcription).
We provide maintenance and support that is not guaranteed with open-source software.
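
To make the looping problem mentioned in the list above more concrete, here is a minimal Python sketch of one naive way to cap runs of repeated words in a transcript. It is purely illustrative and does not describe how Enhanced Speech to Text Built on Whisper prevents looping; the function name `collapse_loops` and the `max_repeats` parameter are hypothetical and chosen only for this example.

```python
from itertools import groupby


def collapse_loops(words, max_repeats=2):
    """Cap every run of consecutive identical words at max_repeats.

    Hypothetical helper for illustration only; comparison is
    case-insensitive, and the original casing of kept words is preserved.
    """
    cleaned = []
    # groupby yields one group per run of adjacent words with the same key
    for _, run in groupby(words, key=str.lower):
        cleaned.extend(list(run)[:max_repeats])
    return cleaned


# A transcript fragment showing the looping symptom
transcript = "thank you you you you you for calling".split()
cleaned = collapse_loops(transcript)
print(" ".join(cleaned))
```

A post-processing filter like this only hides the symptom and can also remove legitimate repetitions, which is why a sketch of this kind should not be read as a substitute for prevention at the transcription stage.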