Version: 2026.03.0

Audio Quality Estimation

The Audio Quality Estimation technology evaluates the acoustic quality of recordings, allowing users to quickly determine whether the quality of a recording is suitable for further processing with other speech technologies such as Speaker Identification, Language Identification, Speech to Text, Emotion Recognition, etc.

Use cases and application areas

Initial quality assessment of recordings in mass processing.
Filtering out recordings that are too short, too long, too noisy, etc.
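
As an illustration of such filtering, the following minimal Python sketch checks the statistics of one channel against a few thresholds. The field names are the ones listed in the Output section; the dictionary layout, the threshold values, and the function name rejection_reasons are assumptions made for this example and are not part of the product.

```python
from typing import Dict, List

# Hypothetical thresholds, chosen only for illustration; tune them for your data.
MIN_LENGTH_S = 3.0
MAX_LENGTH_S = 3600.0
MIN_SNR_DB = 15.0
MIN_PESQ = 2.0


def rejection_reasons(stats: Dict[str, float]) -> List[str]:
    """Return the reasons a channel fails the checks (empty list = accepted)."""
    reasons = []
    if stats["audio_length"] < MIN_LENGTH_S:
        reasons.append("too short")
    if stats["audio_length"] > MAX_LENGTH_S:
        reasons.append("too long")
    if stats["signal_noise_ratio"] < MIN_SNR_DB:
        reasons.append("too noisy")
    if stats["pesq_estimation"] < MIN_PESQ:
        reasons.append("low estimated PESQ")
    return reasons


# Made-up statistics for one channel (not real output of the technology).
example = {"audio_length": 1.2, "signal_noise_ratio": 22.5, "pesq_estimation": 3.1}
print(rejection_reasons(example))  # ['too short']
```

Returning the list of reasons rather than a single boolean makes it easy to log why a recording was excluded from further processing.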

How does it work?

The technology analyzes the provided recording and collects statistics of the audio signal.

One part of that analysis is an estimation of Perceptual Evaluation of Speech Quality (PESQ).
The estimate is produced by a small Time Delay Neural Network (TDNN) trained as a regression model on Mel-Frequency Cepstral Coefficients (MFCC) extracted from the audio. The ground-truth labels for training were obtained from a PESQ implementation.

Output

The output contains the collected statistics data for each input audio channel:

pesq_estimation
Estimation of PESQ (Perceptual Evaluation of Speech Quality).
The range is from -0.5 to 4.5, with higher values indicating better recording quality.
signal_noise_ratio
An estimate of the signal-to-noise ratio (SNR), i.e., the ratio of the useful signal to the noise. The estimate is based on the waveform distribution (silence has a Gaussian distribution; voice has a Gamma distribution) and is obtained by analyzing the frequency distribution within individual frames of the signal.
Measured in decibels (dB), calculated as SNR = 20 * log10(Signal/Noise).
- Higher SNR indicates better quality
- SNR > 15 dB usually means a signal of good quality
- SNR of 0 dB means an equal amount of speech and noise
audio_length
Total length of the signal in seconds.
max_amplitude
Maximum value measured among the samples. The range is from -1 to 1.
min_amplitude
Minimum value measured among the samples. The range is from -1 to 1.
peak_amplitude
Maximum absolute value of the signal, i.e., the higher of the absolute values of max_amplitude and min_amplitude.
mean_amplitude
Mean value of the samples. The range is from -1 to 1.
A value other than 0 indicates an undesirable DC offset, meaning that the audio signal is superimposed on a DC (direct current) component. This can be caused, for example, by a bad microphone and can indicate possible distortion of the signal, such as clipping.
sampling_rate
Sampling frequency in Hertz (Hz).
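
The relationship between the length and amplitude statistics above can be sketched in a few lines of Python. This is only an illustration of the definitions, assuming the samples of one channel are already available as floats normalized to the range -1 to 1; the function name amplitude_stats is hypothetical, and the code does not reproduce the technology's actual implementation.

```python
from typing import Dict, Sequence


def amplitude_stats(samples: Sequence[float], sampling_rate: int) -> Dict[str, float]:
    """Compute basic statistics for one channel of samples normalized to [-1, 1]."""
    if not samples:
        raise ValueError("samples must not be empty")
    max_amplitude = max(samples)
    min_amplitude = min(samples)
    return {
        # Length in seconds = number of samples / samples per second.
        "audio_length": len(samples) / sampling_rate,
        "max_amplitude": max_amplitude,
        "min_amplitude": min_amplitude,
        # Peak is the larger of the two extremes in absolute value.
        "peak_amplitude": max(abs(max_amplitude), abs(min_amplitude)),
        # A mean far from 0 points to a DC offset.
        "mean_amplitude": sum(samples) / len(samples),
        "sampling_rate": sampling_rate,
    }


# Four made-up samples at a sampling rate of 4 Hz.
stats = amplitude_stats([0.5, -0.75, 0.25, 0.5], sampling_rate=4)
print(stats["peak_amplitude"])  # 0.75
print(stats["mean_amplitude"])  # 0.125
```

In this toy input the mean is 0.125 rather than 0, which is the kind of value that would hint at a DC offset in a real recording.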

FAQ

How should I understand the PESQ score?

Perceptual Evaluation of Speech Quality (PESQ) is a standardized method for assessing audio quality by comparing the tested audio to a reference. It evaluates factors like sharpness, background noise, and clipping, providing a score from -0.5 to 4.5, where higher scores indicate better quality. In contrast, the Audio Quality Estimation technology uses a machine learning model to estimate the PESQ score without requiring reference audio, so it can also be applied to recordings for which no reference exists.

Why is my PESQ score low?

The PESQ score is affected by the signal-to-noise ratio (SNR). If the SNR is too low (the background noise is too loud compared to the speech, or the speech is too quiet relative to the background), Audio Quality Estimation assigns a low score due to poor audibility.
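
To get a feel for the decibel scale used by signal_noise_ratio, the formula SNR = 20 * log10(Signal/Noise) from the Output section can be evaluated for a few ratios. The sketch below assumes that Signal and Noise are amplitude values (the 20 * log10 form is the conventional one for amplitude ratios); the function name snr_db is hypothetical, and the code only evaluates the formula, it does not estimate SNR from audio.

```python
import math


def snr_db(signal_amplitude: float, noise_amplitude: float) -> float:
    """SNR in decibels from two amplitude values: 20 * log10(signal / noise)."""
    if signal_amplitude <= 0 or noise_amplitude <= 0:
        raise ValueError("amplitudes must be positive")
    return 20.0 * math.log10(signal_amplitude / noise_amplitude)


# Equal levels, signal 10x stronger than noise, signal 10x weaker than noise.
for ratio in (1.0, 10.0, 0.1):
    print(f"signal/noise = {ratio}: {snr_db(ratio, 1.0):.1f} dB")
```

Under this assumption, an SNR of 15 dB corresponds to a signal amplitude roughly 5.6 times the noise amplitude, and negative values mean the noise is stronger than the speech.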
