Audio Quality Estimation
The Audio Quality Estimation technology evaluates the acoustic quality of recordings, so users can quickly determine whether a recording is suitable for further processing with other speech technologies such as Speaker Identification, Language Identification, Speech to Text, or Emotion Recognition.
Use cases and application areas
- Primary assessment of recordings in mass processing.
- Filtering out recordings which are too short, too long, too noisy, etc.
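The filtering use case can be sketched as a simple threshold check. This is a minimal illustration, not part of the product: the function name `is_usable` and all threshold defaults are hypothetical, and the dictionary keys are assumed to match the field names listed in the Output section.

```python
def is_usable(stats, min_length=3.0, max_length=3600.0, min_snr=15.0, min_pesq=2.0):
    """Return True if one channel's statistics pass all (hypothetical) thresholds."""
    if not (min_length <= stats["audio_length"] <= max_length):
        return False  # too short or too long
    if stats["signal_noise_ratio"] < min_snr:
        return False  # too noisy
    if stats["pesq_estimation"] < min_pesq:
        return False  # estimated perceptual quality too low
    return True


# Example statistics for three recordings (made-up values for illustration).
good = {"audio_length": 12.0, "signal_noise_ratio": 22.5, "pesq_estimation": 3.1}
noisy = {"audio_length": 12.0, "signal_noise_ratio": 4.0, "pesq_estimation": 3.1}
short = {"audio_length": 0.8, "signal_noise_ratio": 22.5, "pesq_estimation": 3.1}

for name, stats in [("good", good), ("noisy", noisy), ("short", short)]:
    print(name, is_usable(stats))
```

Suitable thresholds depend on the downstream technology and should be tuned on representative data.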
How does it work?
The technology analyzes the provided recording and collects statistics of the audio signal.
One part of that analysis is an estimation of Perceptual Evaluation of Speech Quality (PESQ).
This is done by a small Time Delay Neural Network (TDNN) trained as a regression model on Mel-Frequency Cepstral Coefficients (MFCC) extracted from the audio. The ground-truth labels for training were obtained from a PESQ implementation.
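To illustrate the general idea of a TDNN used for regression, here is a toy sketch in plain Python. It is not the actual model: the context offsets, layer sizes, mean pooling, random weights, and the clamping of the output to the PESQ range are all assumptions made for illustration, and the MFCC frames are random numbers rather than features extracted from audio.

```python
import random

CONTEXT = (-2, -1, 0, 1, 2)  # assumed frame offsets seen by each output frame


def tdnn_layer(frames, weights, bias, context=CONTEXT):
    """One time-delay layer: each output frame combines input frames at t + offset."""
    left, right = -min(context), max(context)
    outputs = []
    for t in range(left, len(frames) - right):
        activations = []
        for unit_weights, unit_bias in zip(weights, bias):
            total = unit_bias
            for offset, w_row in zip(context, unit_weights):
                total += sum(w * x for w, x in zip(w_row, frames[t + offset]))
            activations.append(max(0.0, total))  # ReLU non-linearity
        outputs.append(activations)
    return outputs


def estimate_pesq(frames, weights, bias, head_weights, head_bias):
    """Pool hidden activations over time and map them to a single score."""
    hidden = tdnn_layer(frames, weights, bias)
    pooled = [sum(column) / len(hidden) for column in zip(*hidden)]
    score = head_bias + sum(w * h for w, h in zip(head_weights, pooled))
    return min(4.5, max(-0.5, score))  # clamp to the PESQ range (an assumption)


# Random stand-ins for MFCC frames and for trained parameters.
rng = random.Random(0)
N_MFCC, N_UNITS, N_FRAMES = 13, 8, 50
frames = [[rng.gauss(0, 1) for _ in range(N_MFCC)] for _ in range(N_FRAMES)]
weights = [[[rng.gauss(0, 0.1) for _ in range(N_MFCC)] for _ in CONTEXT]
           for _ in range(N_UNITS)]
bias = [0.0] * N_UNITS
head_weights = [rng.gauss(0, 0.1) for _ in range(N_UNITS)]

score = estimate_pesq(frames, weights, bias, head_weights, head_bias=2.0)
print(score)
```

In a trained model, the weights would be fitted so that the output approximates the PESQ labels; here the printed score carries no meaning.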
Output
The output contains the statistics collected for each input audio channel:
- pesq_estimation
Estimation of PESQ (Perceptual Evaluation of Speech Quality).
The range is from -0.5 to 4.5, with higher values indicating better recording quality.
- signal_noise_ratio
A signal-to-noise ratio (SNR) estimation, which describes the ratio of the useful signal to the noise signal. Calculated from the waveform distribution (silence has a Gaussian distribution; voice has a Gamma distribution). SNR is estimated by analyzing the frequency distribution within individual frames of the signal.
Measured in decibels (dB), calculated as SNR = 20 * log10(Signal/Noise).
- Higher SNR indicates better quality
- SNR > 15 usually means a signal of good quality
- SNR of 0 means an equal amount of speech and noise
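The decibel formula can be checked with a few lines of Python. This sketch only illustrates the conversion SNR = 20 * log10(Signal/Noise) for known amplitudes; it does not reproduce how the technology estimates signal and noise levels from the waveform distribution. The function name `snr_db` is hypothetical.

```python
import math


def snr_db(signal_amplitude, noise_amplitude):
    """Convert an amplitude ratio to decibels: 20 * log10(signal / noise)."""
    return 20.0 * math.log10(signal_amplitude / noise_amplitude)


print(snr_db(1.0, 1.0))   # equal signal and noise amplitudes give 0 dB
print(snr_db(10.0, 1.0))  # a tenfold amplitude ratio gives 20 dB

# Amplitude ratio that corresponds to the 15 dB "good quality" guideline.
print(10 ** (15 / 20))    # roughly 5.6
```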
- audio_length
Total length of the signal in seconds.
- max_amplitude
Maximum value measured among the samples. The range is from -1 to 1.
- min_amplitude
Minimum value measured among the samples. The range is from -1 to 1.
- peak_amplitude
Maximum absolute value of the signal, i.e., the higher of the absolute values of max_amplitude and min_amplitude.
- mean_amplitude
Mean value of the samples. The range is from -1 to 1.
A value other than 0 indicates an undesirable DC offset, meaning that the audio signal is superimposed on a DC (direct current) component. This can be caused, for example, by a bad microphone and can indicate possible distortion of the signal, such as clipping.
- sampling_rate
Sampling frequency in Hertz (Hz).
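The amplitude and length statistics follow directly from their definitions. The sketch below computes them for a list of samples normalized to the range -1 to 1; the function name `amplitude_stats` is hypothetical, and the code is an illustration of the definitions rather than the product's implementation.

```python
def amplitude_stats(samples, sampling_rate):
    """Compute basic statistics for one channel of samples in the range -1 to 1."""
    max_amp = max(samples)
    min_amp = min(samples)
    return {
        "audio_length": len(samples) / sampling_rate,       # seconds
        "max_amplitude": max_amp,
        "min_amplitude": min_amp,
        "peak_amplitude": max(abs(max_amp), abs(min_amp)),  # largest absolute value
        "mean_amplitude": sum(samples) / len(samples),      # non-zero suggests DC offset
        "sampling_rate": sampling_rate,                     # Hz
    }


# Four samples at 4 Hz, i.e. one second of (toy) audio.
stats = amplitude_stats([0.5, -0.75, 0.25, 0.0], sampling_rate=4)
for key, value in stats.items():
    print(key, value)
```

Shifting every sample by a constant, for example adding 0.1, leaves the shape of the waveform unchanged but moves mean_amplitude away from 0, which is the DC offset described above.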