Performance
The Enhanced Speech to Text built on Whisper employs a large language model with millions of parameters, making it very resource-intensive. While we've implemented optimizations to alleviate this, it still requires specialized hardware for optimal performance.
While running on CPUs is feasible, it is notably slower and harder to scale. For best results in production, we strongly recommend using GPUs.
Performance optimizations
We've enhanced the performance of our Speech to Text solution by making use of the following optimizations:
Silence removal
We've implemented pre-filtering using Voice Activity Detection (VAD) to eliminate silent portions from recordings. This ensures that the model processes only segments containing speech. While the effectiveness of this optimization depends heavily on the amount of speech in recordings, it typically accelerates processing by 20% to 30%.
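As a minimal illustration of the idea (not our internal pipeline), the sketch below drops silent regions before transcription. It assumes a hypothetical `detect_speech_segments` VAD helper that returns (start, end) times in seconds; any VAD can play that role.

```python
import numpy as np

SAMPLE_RATE = 16_000  # Hz, the rate Whisper-style models expect

def remove_silence(audio: np.ndarray, speech_segments: list[tuple[float, float]]) -> np.ndarray:
    """Keep only the regions a VAD marked as speech.

    `audio` is a mono float32 waveform; `speech_segments` is a list of
    (start_seconds, end_seconds) pairs produced by any VAD of your choice.
    """
    voiced = [
        audio[int(start * SAMPLE_RATE):int(end * SAMPLE_RATE)]
        for start, end in speech_segments
    ]
    return np.concatenate(voiced) if voiced else np.empty(0, dtype=audio.dtype)

# Usage with a hypothetical VAD: only the voiced portion of a recording
# reaches the transcription model, which is where the 20-30% speed-up comes from.
# segments = detect_speech_segments(audio)   # e.g. [(0.4, 7.9), (9.2, 15.1)]
# model.transcribe(remove_silence(audio, segments))
```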
Inference optimization techniques
We employ various techniques, including weight quantization, layer fusion, and batch reordering, to accelerate inference of the Whisper model. These optimizations lead to faster processing speeds and reduced memory usage. As a result, our Speech to Text solution requires less than half the memory compared to OpenAI's implementation and achieves transcription speeds up to four times faster. See Table 1 for a detailed comparison.
| Implementation | Precision | Processing time [s] | Max Video RAM [MB] | Max RAM [MB] |
| --- | --- | --- | --- | --- |
| OpenAI | fp16 | 132 | 11047 | 4364 |
| Phonexia | fp16 | 37 | 4395 | 1401 |
Table 1: Comparison of OpenAI and Phonexia performance on a 596-second audio file. Measurements were conducted on an Amazon g5.xlarge instance with an NVIDIA A10G GPU and 4x AMD EPYC 7R32 CPUs (each with 2 CPU cores).
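The numbers above come from our internal optimizations; the snippet below is only a rough sketch of why weight quantization cuts memory, using PyTorch to compare the footprint of fp32, fp16, and int8 copies of one stand-in weight matrix.

```python
import torch

def tensor_megabytes(t: torch.Tensor) -> float:
    """Size of a tensor's storage in MB."""
    return t.numel() * t.element_size() / 1024**2

# A stand-in for a single Whisper weight matrix (real models contain hundreds).
weights_fp32 = torch.randn(1280, 5120)          # fp32 baseline
weights_fp16 = weights_fp32.to(torch.float16)   # half precision: 2x smaller

# Simple symmetric int8 quantization: store int8 values plus one fp32 scale.
scale = weights_fp32.abs().max() / 127.0
weights_int8 = torch.clamp((weights_fp32 / scale).round(), -127, 127).to(torch.int8)

for name, t in [("fp32", weights_fp32), ("fp16", weights_fp16), ("int8", weights_int8)]:
    print(f"{name}: {tensor_megabytes(t):.2f} MB")
# Roughly: fp32 25 MB, fp16 12.5 MB, int8 6.25 MB (plus one scale value);
# dequantize at inference time with weights_int8 * scale.
```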
Vertical scaling
To enhance overall throughput, we enable better utilization of a single hardware node's resources through vertical scaling. This involves processing multiple transcription tasks in parallel, as sketched below. However, this technique may introduce higher latencies and is not enabled by default; it must be configured for your specific use case. Refer to the article on scaling for more details.
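The example below is only a sketch of the concept (the actual knobs are described in the scaling article): several transcriptions run concurrently on one node via a small worker pool, with a hypothetical `transcribe_file` function standing in for a call to the Speech to Text service.

```python
from concurrent.futures import ThreadPoolExecutor

def transcribe_file(path: str) -> str:
    """Placeholder for a call to the Speech to Text service for one recording."""
    raise NotImplementedError("replace with a real client call")

def transcribe_batch(paths: list[str], parallel_tasks: int = 4) -> dict[str, str]:
    """Run several transcriptions at once on a single node.

    Throughput increases because the hardware stays busy, but each individual
    file may take longer to finish (higher latency), which is why this mode
    is not enabled by default.
    """
    with ThreadPoolExecutor(max_workers=parallel_tasks) as pool:
        results = dict(zip(paths, pool.map(transcribe_file, paths)))
    return results
```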
Language dependency
Large language models like Whisper operate by predicting sequences of tokens (sub-word units) from audio data. As research has shown, the number of tokens needed to represent a word or sentence can vary significantly depending on the language, as the short example below illustrates. This variation impacts processing time, as each token prediction requires substantial computational resources.
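You can observe this effect yourself with the multilingual tokenizer shipped in the openai-whisper package. This is only a rough sketch: the sentences are our own loosely equivalent examples, and exact counts depend on the tokenizer version.

```python
import whisper.tokenizer

# Whisper's multilingual BPE tokenizer; the token count drives decoding cost.
tok = whisper.tokenizer.get_tokenizer(multilingual=True)

samples = {
    "English": "The weather is nice today.",
    "Czech": "Dnes je pěkné počasí.",
    "Japanese": "今日はいい天気です。",
}

for language, sentence in samples.items():
    tokens = tok.encode(sentence)
    # More tokens per sentence means more autoregressive decoding steps.
    print(f"{language}: {len(tokens)} tokens")
```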
To understand the impact of language on processing time, we evaluated the performance across various languages. We focused on the actual speech duration within recordings, not the total audio length, thanks to the Voice Activity Detection (VAD) filtering incorporated in our Speech to Text solution. The speed of transcription is expressed as a Real-Time Factor (RTF), calculated here by dividing speech duration by processing time, so higher values mean faster transcription. It's important to note that other factors might also influence processing time, potentially affecting the results' precision.
Table 2 illustrates the differences in RTF across various languages. As expected, languages requiring fewer tokens per word tend to have faster processing speeds.
| Language | Total speech duration [s] | Processing time [s] | RTF [-] |
| --- | --- | --- | --- |
| English | 10276 | 612 | 16.79 |
| Portuguese | 9963 | 723 | 13.78 |
| Spanish | 9790 | 781 | 12.54 |
| Korean | 9592 | 812 | 11.81 |
| French | 8976 | 814 | 11.03 |
| Russian | 10554 | 970 | 10.88 |
| Japanese | 9113 | 844 | 10.8 |
| Slovak | 9029 | 873 | 10.34 |
| Polish | 8560 | 839 | 10.2 |
| Arabic | 9110 | 916 | 9.95 |
| Czech | 8820 | 991 | 8.9 |
Table 2: STT processing speed for various languages. Measurements were made on audio files from the Common Voice and Fleurs datasets. To simulate real-world scenarios with longer audio, we concatenated multiple short recordings from each dataset into recordings averaging 100 seconds in length. The measurements were conducted on an AMD Ryzen 9 7950X3D 16-Core Processor and an NVIDIA RTX 4000 SFF Ada Generation graphics card.
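As a quick sanity check of how the RTF column is computed, the snippet below reproduces the English row of Table 2.

```python
def real_time_factor(speech_duration_s: float, processing_time_s: float) -> float:
    """RTF as used in Table 2: speech duration divided by processing time.

    Higher is faster; an RTF of 16.79 means the speech was transcribed
    almost 17x faster than real time.
    """
    return speech_duration_s / processing_time_s

# English row from Table 2: 10276 s of speech processed in 612 s.
print(round(real_time_factor(10276, 612), 2))  # 16.79
```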