Speech Engine Benchmark
The SPE benchmark feature lets you evaluate processing speed directly on your
hardware using your audio files. Call the .../benchmark endpoint of the
technology you want to benchmark and wait for the result. The benchmark result
summarizes the length of the processed speech, the processing time, and the
resulting Faster-than-Realtime (FtRT) processing speed.
You can run this benchmark on machines with different CPUs to compare the performance of various Phonexia technologies. For example, you can observe the difference between Intel processors (for which our technologies are optimized) and AMD processors. You can use the benchmark to check whether a planned hardware upgrade will deliver the expected performance gain. You can also use it to compare the performance of a new SPE version, a new technology generation, or different technology models on the same hardware configuration.
Running the benchmark
The benchmark can be run in two ways:
- by calling the .../benchmark endpoint as documented
- by calling the .../benchmark endpoint as documented, with an additional path parameter
The first option uses the set of audio files supplied with SPE in the
{SPE}/data/benchmark directory.
The second option uses a single audio file of your choice uploaded to SPE
storage, specified by the path parameter.
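The two call styles can be sketched as a small URL helper. This is only an illustration: build_benchmark_url is a hypothetical name, the endpoint URL must be copied from the SPE API documentation for your technology, and passing path as a URL query parameter is an assumption that this page does not confirm.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def build_benchmark_url(endpoint_url, path=None):
    """Return the URL to call for a benchmark run.

    endpoint_url: full URL of the .../benchmark endpoint of your technology,
                  taken from the SPE API documentation (not guessed here).
    path:         storage path of an uploaded audio file, or None to use the
                  audio set supplied with SPE.
    """
    if path is None:
        # First option: benchmark with the supplied audio set.
        return endpoint_url
    # Second option: add the path parameter.
    # ASSUMPTION: the parameter is sent in the query string.
    parts = urlsplit(endpoint_url)
    query = parse_qsl(parts.query, keep_blank_values=True)
    query.append(("path", path))
    return urlunsplit(parts._replace(query=urlencode(query)))
```

The helper only builds the URL. Sending the request and reading the result depend on the API details of your SPE installation.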
The set of audio files supplied with SPE includes recordings of various lengths
(ranging from 30 seconds to 5 minutes) and different speech-to-non-speech
ratios. This variety accounts for the fact that both the length of the audio and
the amount of actual speech in the audio affect the processing speed. Non-speech
parts are stripped from the audio before processing, so the processing speed is
calculated as follows:
FtRT = sum_of_speech_lengths_in_all_recordings ÷
sum_of_processing_times_of_all_recordings
When you benchmark with a file you specify, only that single recording is used. To account for various audio lengths and speech-to-non-speech ratios, it is recommended to run the benchmark with multiple different audio files and calculate the average FtRT processing speed yourself.
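As a worked illustration of the formula, the sketch below combines several single-file runs into one FtRT figure. The function names and the numbers are made up for the example. It also computes the plain mean of the per-file values, which generally differs from the sum-based figure because it weights short and long recordings equally.

```python
def ftrt(speech_seconds, processing_seconds):
    """FtRT factor: seconds of speech processed per second of processing."""
    if processing_seconds <= 0:
        raise ValueError("processing time must be positive")
    return speech_seconds / processing_seconds


def combined_ftrt(runs):
    """FtRT over several runs, each given as (speech_seconds, processing_seconds).

    Follows the formula above: sum of speech lengths divided by
    sum of processing times.
    """
    total_speech = sum(speech for speech, _ in runs)
    total_processing = sum(processing for _, processing in runs)
    return ftrt(total_speech, total_processing)


# Hypothetical results of three single-file benchmark runs.
runs = [(20.0, 1.0), (100.0, 4.0), (240.0, 15.0)]

per_file = [ftrt(speech, processing) for speech, processing in runs]
mean_of_ratios = sum(per_file) / len(per_file)

print(per_file)             # [20.0, 25.0, 16.0]
print(combined_ftrt(runs))  # 18.0 (360 s of speech / 20 s of processing)
```

Whichever way of averaging you choose, use the same one for every machine or version you compare.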
Alternatively, you can tune (or hack) SPE by preparing your own set of benchmarking recordings or replacing the default set.
Benchmark Recordings Set
The default sets of audio files supplied with SPE are as follows (the version number 1.0 is present only for historical reasons and is ignored):
benchmark
└── 1.0
    ├── default
    │   ├── 030.wav
    │   ├── 060.wav
    │   ├── 090.wav
    │   ├── 120.wav
    │   ├── 150.wav
    │   ├── 180.wav
    │   ├── 210.wav
    │   ├── 240.wav
    │   ├── 270.wav
    │   └── 300.wav
    └── czech
        ├── 030.wav
        ├── 060.wav
        ├── 090.wav
        ├── 120.wav
        ├── 150.wav
        ├── 180.wav
        ├── 210.wav
        ├── 240.wav
        ├── 270.wav
        └── 300.wav
For the majority of technologies, the content of the default directory is used
for benchmarking.
When benchmarking language-specific technologies, such as STT (Speech To Text)
and PHNREC (Phoneme Recognizer), the system first attempts to find a directory
with a name that matches the beginning of the benchmarked model name. If such a
directory is found, audio files from that directory are used (with the
expectation that the audio contains speech in the corresponding language). If no
matching directory is found, the system falls back to the default directory.
The reason for using language-specific data is that processing audio in a
different language than the one for which the model was trained negatively
impacts processing speed. Essentially, the processing 'slides' through the file
quickly because it cannot recognize any familiar patterns.
The czech directory is included as an example (though the STT/PHNREC models
were renamed some time ago to cs_cz*, so the name czech no longer matches).
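The name-matching rule described above can be sketched as follows. This is an illustration of the rule as stated on this page, not SPE's actual implementation: select_benchmark_dir is a hypothetical name, and both the case-sensitive comparison and the preference for the longest matching name when several directories match are assumptions.

```python
def select_benchmark_dir(model_name, available_dirs, fallback="default"):
    """Pick the benchmark directory for a language-specific model.

    A directory matches when its name equals the beginning of the model name.
    ASSUMPTIONS: matching is case-sensitive, and the longest matching name
    wins if more than one directory matches.
    """
    candidates = [
        name for name in available_dirs
        if name != fallback and model_name.startswith(name)
    ]
    if not candidates:
        # No language-specific directory matches: use the default set.
        return fallback
    return max(candidates, key=len)


# The supplied czech directory no longer matches the renamed cs_cz* models.
print(select_benchmark_dir("cs_cz_6", ["default", "czech"]))  # default
# A directory named cs_cz would match them.
print(select_benchmark_dir("cs_cz_6", ["default", "cs_cz"]))  # cs_cz
```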
Tuning the recordings sets
You can tune the sets provided with SPE by:
- Replacing the content of the default directory with your own audio files.
- Creating a directory with a name according to the name-matching rule (see above) and putting audio files in the corresponding language in the directory. For example:
  - A directory named es would be matched for the es_6 and es_es_5 models, but not the old spanish_american model.
  - A directory named cs_cz_fin would be matched only for the old cs_cz_fin model, but not the new cs_cz_5 or cs_cz_6 models.
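Preparing a custom set amounts to copying audio files into a suitably named directory. The sketch below assumes that custom directories sit next to default under the 1.0 directory, as in the tree shown earlier, and that SPE picks up the files regardless of their names; neither is confirmed on this page. install_benchmark_set is a hypothetical helper name.

```python
import shutil
from pathlib import Path


def install_benchmark_set(benchmark_root, set_name, audio_files):
    """Copy audio files into <benchmark_root>/1.0/<set_name>.

    benchmark_root: the {SPE}/data/benchmark directory.
    set_name:       directory name chosen by the name-matching rule, e.g. "es".
    audio_files:    paths of the recordings to copy.
    Returns the target directory as a Path.
    """
    # ASSUMPTION: custom sets live beside "default" under "1.0".
    target = Path(benchmark_root) / "1.0" / set_name
    target.mkdir(parents=True, exist_ok=True)
    for source in audio_files:
        shutil.copy2(source, target / Path(source).name)
    return target
```

Back up the original default directory before replacing its content, so that you can still reproduce results obtained with the supplied set.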
By carefully preparing the directory and audio file structure, you can quickly gain a basic understanding of the performance of the speech technologies.