Speech Engine Benchmark

The SPE benchmark feature lets you measure processing speed directly on your own hardware, using either the audio files supplied with SPE or your own. Call the .../benchmark endpoint corresponding to the technology you want to benchmark and wait for the result. The result summarizes the length of the processed speech, the processing time, and the resulting Faster-than-Realtime (FtRT) processing speed.

You can run this benchmark on machines with different CPUs to compare the performance of various Phonexia technologies on them. For example, you can observe the difference between Intel processors (for which our technologies are optimized) and AMD processors, or check whether a planned hardware upgrade will deliver the expected performance gain. You can also use the benchmark on a single hardware configuration to compare a new SPE version, a new technology generation, or different technology models.

Running the benchmark

The benchmark can be run in two ways:

  • by calling the .../benchmark endpoint as documented
  • by calling the .../benchmark endpoint as documented, with an additional path parameter

The first option uses the set of audio files supplied with SPE in the {SPE}/data/benchmark directory.
The second option uses a single audio file of your choice uploaded to SPE storage, specified by the path parameter.

The set of audio files supplied with SPE includes recordings of various lengths (ranging from 30 seconds to 5 minutes) and different speech-to-non-speech ratios. This variety accounts for the fact that both the length of the audio and the amount of actual speech in the audio affect the processing speed. Non-speech parts are stripped from the audio before processing, so the processing speed is calculated as follows:
FtRT = sum_of_speech_lengths_in_all_recordings ÷ sum_of_processing_times_of_all_recordings
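
The formula can be illustrated with a minimal Python sketch. The function name ftrt and the measurement values are made up for illustration and are not part of SPE; the point is only that the speech lengths and processing times are summed across all recordings before dividing.

```python
def ftrt(recordings):
    """Return the FtRT speed for (speech_seconds, processing_seconds) pairs.

    Speech lengths and processing times are summed over all recordings
    first, then divided, as in the formula above.
    """
    recordings = list(recordings)  # allow generators; we iterate twice
    total_speech = sum(speech for speech, _ in recordings)
    total_processing = sum(processing for _, processing in recordings)
    if total_processing <= 0:
        raise ValueError("total processing time must be positive")
    return total_speech / total_processing


# Made-up measurements: (net speech length, processing time), in seconds.
sample = [(20.0, 0.5), (45.0, 1.0), (100.0, 2.5)]
print(ftrt(sample))  # 165.0 s of speech / 4.0 s of processing = 41.25
```

An FtRT of 41.25 would mean that, on average, one second of processing covers about 41 seconds of speech.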

When you run the benchmark with your own file, only that single recording is used. To account for various audio lengths and speech-to-non-speech ratios, it is recommended to run the benchmark with several different audio files and calculate the average FtRT processing speed yourself.
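
As a rough sketch of that averaging step, the Python snippet below takes FtRT values noted down from several single-file runs and computes their arithmetic mean. The file names and numbers are hypothetical, and collecting the values from the benchmark results is left out.

```python
from statistics import mean

# Hypothetical FtRT values, one per single-file benchmark run.
ftrt_by_file = {
    "short_call.wav": 38.0,
    "long_meeting.wav": 46.0,
    "noisy_line.wav": 42.0,
}

average_ftrt = mean(list(ftrt_by_file.values()))
print(f"average FtRT over {len(ftrt_by_file)} files: {average_ftrt:.1f}")
```

Note that a plain mean weights every file equally, whereas the built-in set divides the summed speech length by the summed processing time, so the two figures are not directly comparable when file lengths differ a lot.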

Alternatively, you can tune (or hack) SPE by preparing your own set of benchmarking recordings or replacing the default set.

Benchmark Recordings Set

The default sets of audio files supplied with SPE are as follows (the version number 1.0 is present only for historical reasons and is ignored):


benchmark
└── 1.0
    ├── default
    │   ├── 030.wav
    │   ├── 060.wav
    │   ├── 090.wav
    │   ├── 120.wav
    │   ├── 150.wav
    │   ├── 180.wav
    │   ├── 210.wav
    │   ├── 240.wav
    │   ├── 270.wav
    │   └── 300.wav
    └── czech
        ├── 030.wav
        ├── 060.wav
        ├── 090.wav
        ├── 120.wav
        ├── 150.wav
        ├── 180.wav
        ├── 210.wav
        ├── 240.wav
        ├── 270.wav
        └── 300.wav

For the majority of technologies, the content of the default directory is used for benchmarking.

When benchmarking language-specific technologies, such as STT (Speech To Text) and PHNREC (Phoneme Recognizer), the system first attempts to find a directory with a name that matches the beginning of the benchmarked model name. If such a directory is found, audio files from that directory are used (with the expectation that the audio contains speech in the corresponding language). If no matching directory is found, the system falls back to the default directory.
The reason for using language-specific data is that processing audio in a different language than the one for which the model was trained negatively impacts processing speed. Essentially, the processing 'slides' through the file quickly because it cannot recognize any familiar patterns.

The czech directory is included as an example (though the STT/PHNREC models were renamed some time ago to cs_cz*, so the name czech no longer matches).

Tuning the recordings sets

You can tune the sets provided with SPE by:

  • Replacing the content of the default directory with your own audio files.
  • Creating a directory whose name follows the name-matching rule (see above) and placing audio files in the corresponding language in it. For example:
    • A directory named es would be matched for the es_6 and es_es_5 models, but not for the old spanish_american model.
    • A directory named cs_cz_fin would be matched only for the old cs_cz_fin model, but not for the new cs_cz_5 or cs_cz_6 models.
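
The name-matching rule can be sketched in Python as a simple prefix check with a fallback. This is an illustration only, not SPE's actual implementation: the function name pick_benchmark_dir is hypothetical, and choosing the longest match when several directories qualify is an assumption, since the rule above does not say how such ties are resolved.

```python
def pick_benchmark_dir(model_name, available_dirs, fallback="default"):
    """Pick the directory whose name matches the start of the model name.

    Falls back to `fallback` when nothing matches. ASSUMPTION: if several
    directory names match, the longest one is chosen.
    """
    candidates = [
        name
        for name in available_dirs
        if name != fallback and model_name.startswith(name)
    ]
    if not candidates:
        return fallback
    return max(candidates, key=len)


dirs = ["default", "czech", "es", "cs_cz_fin"]
for model in ["es_6", "es_es_5", "spanish_american", "cs_cz_fin", "cs_cz_6"]:
    print(model, "->", pick_benchmark_dir(model, dirs))
```

With these directories, es_6 and es_es_5 resolve to es, cs_cz_fin resolves to cs_cz_fin, and spanish_american and cs_cz_6 fall back to default, which also shows why the supplied czech directory is never picked for the cs_cz* models.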

By carefully preparing the directory structure and audio files, you can quickly gain a basic understanding of how the speech technologies perform.