Version: 3.4.0

Performance

The Enhanced Speech to Text built on Whisper employs a large language model with millions of parameters, making it very resource-intensive. While we've implemented optimizations to alleviate this, it still requires specialized hardware for optimal performance.

While running on CPUs is feasible, it's notably slower and challenging to scale. For best results in production, we strongly advise utilizing GPUs.

Performance optimizations

We've enhanced the performance of our Speech to Text solution by making use of the following optimizations:

Silence removal

We've implemented pre-filtering using Voice Activity Detection (VAD) to eliminate silent portions from recordings. This ensures that the model processes only segments containing speech. While the effectiveness of this optimization depends heavily on the amount of speech in recordings, it typically accelerates processing by 20% to 30%.
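For illustration only, the sketch below mimics this idea with a simple energy threshold standing in for a real VAD; the frame length and threshold are arbitrary assumptions, not values used by the product.

```python
import numpy as np

def drop_silence(samples: np.ndarray, sample_rate: int,
                 frame_ms: int = 30, energy_threshold: float = 1e-4) -> np.ndarray:
    """Keep only frames whose mean energy exceeds a threshold.

    A real VAD is far more robust; this energy gate only illustrates the
    principle of pre-filtering silence so the model processes less audio.
    """
    frame_len = int(sample_rate * frame_ms / 1000)
    voiced_frames = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        if np.mean(frame.astype(np.float64) ** 2) > energy_threshold:
            voiced_frames.append(frame)
    return (np.concatenate(voiced_frames)
            if voiced_frames else np.empty(0, dtype=samples.dtype))

# Example: 10 s of silence followed by 1 s of "speech" (a noise burst)
rate = 16000
audio = np.concatenate([np.zeros(10 * rate), 0.1 * np.random.randn(rate)]).astype(np.float32)
speech_only = drop_silence(audio, rate)
print(f"{len(audio) / rate:.1f} s in, {len(speech_only) / rate:.1f} s after silence removal")
```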

Inference optimization techniques

We employ various techniques, including weight quantization, layer fusion, and batch reordering, to optimize inference of the Whisper model. These optimizations lead to faster processing and reduced memory usage. As a result, our Speech to Text solution requires less than half the memory of OpenAI's implementation and transcribes up to four times faster. See Table 1 for a detailed comparison.

| Implementation | Precision | Processing time [s] | Max Video RAM [MB] | Max RAM [MB] |
| --- | --- | --- | --- | --- |
| OpenAI | fp16 | 132 | 11047 | 4364 |
| Phonexia | fp16 | 37 | 4395 | 1401 |

Table 1: Comparison of OpenAI and Phonexia performance on a 596-second audio file. Measurements were conducted on an Amazon g5.xlarge instance with an NVIDIA A10G GPU and 4× AMD EPYC 7R32 CPUs (each with 2 CPU cores).
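The exact optimizations are internal to the product, but the core idea of weight quantization, one of the techniques listed above, can be sketched as follows; this is an illustrative 8-bit symmetric scheme, not necessarily the scheme actually used.

```python
import numpy as np

def quantize_int8(weights: np.ndarray):
    """Symmetric per-tensor int8 quantization: store weights as int8 plus one
    float scale, cutting memory roughly 4x versus float32 (2x versus fp16)."""
    scale = np.abs(weights).max() / 127.0
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

w = np.random.randn(1024, 1024).astype(np.float32)  # a dummy weight matrix
q, scale = quantize_int8(w)
error = np.abs(w - dequantize(q, scale)).max()
print(f"fp32: {w.nbytes // 1024} KiB, int8: {q.nbytes // 1024} KiB, max error: {error:.4f}")
```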

Vertical Scaling

To improve overall throughput, we enable better utilization of a single hardware node through vertical scaling, i.e. processing multiple transcription tasks in parallel. This technique may introduce higher latencies and is therefore not enabled by default; it has to be configured for your specific use case. Refer to the article on scaling for more details.
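For a rough idea of what running multiple transcription tasks in parallel can look like from the client side, consider the sketch below; the endpoint URL, payload format, and concurrency limit are placeholder assumptions and do not describe the actual microservice API.

```python
import concurrent.futures
import pathlib
import requests

# Placeholder values - adjust to your actual deployment and API.
STT_ENDPOINT = "http://localhost:8000/transcribe"  # hypothetical endpoint
MAX_PARALLEL_TASKS = 4                             # tune to GPU memory and latency needs

def transcribe(path: pathlib.Path) -> str:
    """Send one audio file to the (hypothetical) STT endpoint and return the text."""
    with path.open("rb") as audio:
        response = requests.post(STT_ENDPOINT, files={"file": audio}, timeout=600)
    response.raise_for_status()
    return response.json()["transcription"]

audio_files = sorted(pathlib.Path("recordings").glob("*.wav"))

# Several tasks in flight improve GPU utilization and throughput,
# at the cost of higher latency for each individual task.
with concurrent.futures.ThreadPoolExecutor(max_workers=MAX_PARALLEL_TASKS) as pool:
    for path, text in zip(audio_files, pool.map(transcribe, audio_files)):
        print(f"{path.name}: {text[:60]}...")
```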

Beam Reduction

Below are sample accuracy and speed measurements for various English datasets and Whisper models.

As the results show, the WER¹ stays similar across all tested models, while the FTRT² value of the model with beam reduction is much higher than that of the regular Whisper models. In other words, the model with beam reduction is several times faster than regular Whisper models on the same hardware.

The measurement was performed with one instance of Enhanced Speech to Text Built on Whisper microservice running on an NVIDIA RTX 4000 SFF Ada Generation GPU with 15GB of RAM.

Accuracy measurement (Word Error Rate):

| Test set | distil-beam1-large-v3 | distil-large-v3 | large-v2 | large-v3 |
| --- | --- | --- | --- | --- |
| Test_set_1 | 10.3 | 10.54 | 7.78 | 10.37 |
| Test_set_2 | 13.72 | 14 | 13.67 | 15.19 |
| Test_set_3 | 19.99 | 19.93 | 20.7 | 20.97 |
| Test_set_4 | 3.92 | 3.88 | 4.08 | 3.51 |

Speed measurement (Faster Than Real-Time):

| Test set | distil-beam1-large-v3 | distil-large-v3 | large-v2 | large-v3 |
| --- | --- | --- | --- | --- |
| Test_set_1 | 117.98 | 92.62 | 23.57 | 21.62 |
| Test_set_2 | 135.61 | 106.08 | 29.63 | 22.27 |
| Test_set_3 | 116.89 | 86.66 | 24.57 | 19.3 |
| Test_set_4 | 39.63 | 36.93 | 12.54 | 15.75 |

Beam-size parameter impact (large-v2 only):

Accuracy measurement (Word Error Rate):

| Test set | beam1 | beam2 | beam3 | beam4 | beam5 (default) |
| --- | --- | --- | --- | --- | --- |
| Test_set_5 | 27.68 | 26.13 | 24.90 | 24.31 | 24.89 |
| Test_set_1 | 8.64 | 7.70 | 7.66 | 7.96 | 8.16 |

Speed measurement (Faster Than Real-Time):

| Test set | beam1 | beam2 | beam3 | beam4 | beam5 (default) |
| --- | --- | --- | --- | --- | --- |
| Test_set_5 | 22.08 | 20.05 | 18.80 | 17.90 | 16.80 |
| Test_set_1 | 31.71 | 29.34 | 28.037 | 25.88 | 22.79 |
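Beam reduction essentially means decoding with a smaller beam (beam size 1, i.e. greedy decoding, in the distil-beam1-large-v3 model above). For comparison, the open-source faster-whisper library exposes the same knob as a beam_size parameter; the snippet below illustrates the concept only and is not the Phonexia API.

```python
from faster_whisper import WhisperModel

# Open-source illustration: faster-whisper exposes the same beam_size concept.
model = WhisperModel("large-v2", device="cuda", compute_type="float16")

# beam_size=1 corresponds to greedy decoding ("beam reduction");
# beam_size=5 is the usual Whisper default - slower, sometimes slightly more accurate.
segments, info = model.transcribe("audio.wav", beam_size=1)

print("Detected language:", info.language)
print("".join(segment.text for segment in segments))
```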

Language dependency

Large language models like Whisper operate by predicting sequences of tokens (sub-word units) from audio data. As research has shown, the number of tokens needed to represent a word or sentence can vary significantly depending on the language. This variation impacts processing time, as each token prediction requires substantial computational resources.

To understand the impact of language on processing time, we evaluated performance across various languages. Thanks to the Voice Activity Detection (VAD) filtering incorporated in our Speech to Text solution, we measured against the actual speech duration within recordings rather than the total audio length. The speed of transcription is expressed as a Real-Time Factor (RTF), calculated by dividing the speech duration by the processing time, so higher values mean faster transcription. Note that other factors may also influence processing time, which can affect the precision of the results.

Table 2 illustrates the differences in RTF across various languages. As expected, languages requiring fewer tokens per word tend to have faster processing speeds.

| Language | Total speech duration [s] | Processing time [s] | RTF [-] |
| --- | --- | --- | --- |
| English | 10276 | 612 | 16.79 |
| Portuguese | 9963 | 723 | 13.78 |
| Spanish | 9790 | 781 | 12.54 |
| Korean | 9592 | 812 | 11.81 |
| French | 8976 | 814 | 11.03 |
| Russian | 10554 | 970 | 10.88 |
| Japanese | 9113 | 844 | 10.8 |
| Slovak | 9029 | 873 | 10.34 |
| Polish | 8560 | 839 | 10.2 |
| Arabic | 9110 | 916 | 9.95 |
| Czech | 8820 | 991 | 8.9 |

Table 2: STT processing speed for various languages. Measurements were made on audio files from the Common Voice and Fleurs datasets. To simulate real-world scenarios with longer audio, we concatenated multiple short recordings from each dataset into recordings averaging 100 seconds in length. The measurements were conducted on an AMD Ryzen 9 7950X3D 16-Core Processor and an NVIDIA RTX 4000 SFF Ada Generation graphics card.
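The RTF values in Table 2 follow directly from the definition above (speech duration divided by processing time), which can be verified quickly:

```python
# RTF = total speech duration / processing time (values taken from Table 2)
measurements = {
    "English": (10276, 612),
    "Czech": (8820, 991),
}
for language, (speech_s, processing_s) in measurements.items():
    print(f"{language}: RTF = {speech_s / processing_s:.2f}")
# English: RTF = 16.79, Czech: RTF = 8.90
```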

Footnotes

  1. Word Error Rate - a measure of how accurate the model is. It is a percentage value expressing how many word errors (substitutions, insertions, and deletions) the transcription contains with respect to a reference transcription; lower is better. A minimal computation sketch is shown after these footnotes.

  2. Faster Than Real-Time - indicates the speed-up factor of processing relative to the audio duration; for example, an FTRT of 10 means that 10 seconds of input audio are processed in 1 second of processing time.
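For illustration, WER is typically computed as a word-level edit distance between the hypothesis and the reference; the sketch below is a minimal example, not the evaluation tooling used for the tables above.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + insertions + deletions) / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # Dynamic-programming edit distance over words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution / match
    return d[len(ref)][len(hyp)] / len(ref)

# One substitution and one deletion over six reference words: WER ~ 0.33
print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))
```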