Version: 4.0.0-rc1

Enhanced Speech Translation Built on Whisper

Speech translation leverages Enhanced Speech to Text Built on Whisper to provide machine translation from various languages into English. It converts spoken language into English text while preserving meaning and context.

Speech translation supports multi-channel audio files. The resulting translations include details about the processed channels, the detected source languages, and the timestamps of the utterances. Source languages can also be specified manually. Only licensed languages can be used for translation. A GPU is required to achieve reasonable translation speeds.
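As an illustration of how such a result might be consumed, the following minimal sketch models a translated utterance with the three kinds of detail mentioned above (channel, detected source language, timestamps) and groups utterances by channel. The `TranslatedUtterance` class, its field names, and `group_by_channel` are hypothetical and chosen for this example only; they are not the actual output schema of the technology.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List


@dataclass
class TranslatedUtterance:
    """Hypothetical record for one translated utterance (not the real schema)."""
    channel: int            # index of the audio channel the utterance came from
    source_language: str    # detected or manually specified source language code
    start_s: float          # utterance start time in seconds
    end_s: float            # utterance end time in seconds
    text: str               # English translation of the utterance


def group_by_channel(
    utterances: Iterable[TranslatedUtterance],
) -> Dict[int, List[TranslatedUtterance]]:
    """Group utterances per channel, ordered by start time within each channel."""
    grouped: Dict[int, List[TranslatedUtterance]] = {}
    for utterance in sorted(utterances, key=lambda u: u.start_s):
        grouped.setdefault(utterance.channel, []).append(utterance)
    return grouped


if __name__ == "__main__":
    # Made-up sample data for a two-channel recording.
    sample = [
        TranslatedUtterance(1, "de", 2.0, 4.5, "Good morning."),
        TranslatedUtterance(0, "cs", 0.0, 1.8, "Hello."),
        TranslatedUtterance(0, "cs", 5.0, 7.2, "How can I help you?"),
    ]
    for channel, items in sorted(group_by_channel(sample).items()):
        for u in items:
            print(f"ch{channel} [{u.start_s:.1f}-{u.end_s:.1f}] ({u.source_language}) {u.text}")
```

Grouping by channel keeps each speaker side of a multi-channel recording readable as its own timeline.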

Language switching

In the default auto-detect mode, the first 30 seconds of the recording are used to detect the source language, and that language is then used to translate the whole recording into English. This can degrade the resulting translation if parts of the recording after the 30-second mark are in a different language. With the optional language switching parameter, the source language is instead detected separately for each 30-second segment. More details about language switching, including its limitations, can be found in the Enhanced Speech to Text Built on Whisper article.
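The difference between the two modes can be sketched as follows. This is a simplified illustration under stated assumptions, not the actual implementation: `split_into_windows`, `detect_languages`, and the `detect` callback are hypothetical names, and the real language detector is replaced by a function supplied by the caller.

```python
from typing import Callable, List, Tuple

WINDOW_SECONDS = 30.0  # segment length described in the text above


def split_into_windows(
    duration: float, window: float = WINDOW_SECONDS
) -> List[Tuple[float, float]]:
    """Return (start, end) pairs that cover [0, duration] in fixed-size windows."""
    windows: List[Tuple[float, float]] = []
    start = 0.0
    while start < duration:
        end = min(start + window, duration)
        windows.append((start, end))
        start = end
    return windows


def detect_languages(
    duration: float,
    detect: Callable[[float, float], str],  # hypothetical detector: (start, end) -> language code
    language_switching: bool = False,
) -> List[Tuple[float, float, str]]:
    """Assign a source language to every window of the recording."""
    windows = split_into_windows(duration)
    if not windows:
        return []
    if language_switching:
        # Language switching: run detection separately on each 30-second segment.
        return [(s, e, detect(s, e)) for s, e in windows]
    # Default auto-detect: the first window decides the language for the whole recording.
    first_language = detect(*windows[0])
    return [(s, e, first_language) for s, e in windows]


if __name__ == "__main__":
    # Toy detector: pretend the speaker switches from Czech to German at 30 s.
    def toy_detect(start: float, end: float) -> str:
        return "cs" if start < 30.0 else "de"

    print(detect_languages(75.0, toy_detect))
    print(detect_languages(75.0, toy_detect, language_switching=True))
```

In the toy example, the default mode labels the whole 75-second recording as Czech, while language switching picks up German from the second segment onward.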

FAQ

Why is the translation inaccurate?

  1. The original language of the recording is not part of Phonexia’s portfolio, so the audio has been translated from a different language than the one actually spoken.
  2. The audio quality is very low, or the speech in the recording is not intelligible.
  3. There is background noise or music that degrades the recording quality.

How can I improve the processing speed?

Make sure you’re running Speech Translation on a GPU to speed up the processing.