Version: 3.4.0

Audio Quality Estimation

The Audio Quality Estimation technology evaluates the acoustic quality of recordings, allowing users to quickly determine whether the quality of a recording is suitable for further processing with other speech technologies such as Speaker Identification, Language Identification, Speech to Text, Emotion Recognition, etc.

Use cases and application areas

  • Initial assessment of recordings in mass processing.
  • Filtering out recordings that are too short, too long, too noisy, etc.
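The filtering use case can be sketched as below. This is a minimal illustration, not part of the product API: it assumes the per-channel output has already been loaded into a Python dict keyed by the field names listed in the Output section. The function name `is_suitable` and the length thresholds are hypothetical. Only the 15 dB SNR default follows the guidance given for signal_noise_ratio.

```python
def is_suitable(stats, min_length=1.0, max_length=3600.0, min_snr=15.0):
    """Return True if a channel passes simple length and noise checks.

    `stats` is assumed to be a dict with the keys `audio_length` (seconds)
    and `signal_noise_ratio` (dB). The length thresholds are hypothetical
    defaults and should be tuned for the downstream technology.
    """
    if stats["audio_length"] < min_length:
        return False  # too short
    if stats["audio_length"] > max_length:
        return False  # too long
    if stats["signal_noise_ratio"] < min_snr:
        return False  # too noisy
    return True


# Hypothetical example values for three recordings
recordings = {
    "a.wav": {"audio_length": 42.0, "signal_noise_ratio": 22.5},
    "b.wav": {"audio_length": 0.4, "signal_noise_ratio": 30.0},
    "c.wav": {"audio_length": 60.0, "signal_noise_ratio": 3.1},
}
accepted = [name for name, stats in recordings.items() if is_suitable(stats)]
```

With these example values, only `a.wav` passes: `b.wav` is rejected as too short and `c.wav` as too noisy.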

How does it work?

The technology analyzes the provided recording and collects statistics of the audio signal.

One part of that analysis estimates the Perceptual Evaluation of Speech Quality (PESQ) score.
The estimate is produced by a small Time-Delay Neural Network (TDNN) trained as a regression model on Mel-Frequency Cepstral Coefficients (MFCC) extracted from the audio. The ground-truth labels for training were obtained from a PESQ implementation.

Output

The output contains the collected statistics data for each input audio channel:

  • pesq_estimation
    Estimation of PESQ (Perceptual Evaluation of Speech Quality).
    The range is from -0.5 to 4.5, with higher values indicating better recording quality.
  • signal_noise_ratio
    An estimate of the signal-to-noise ratio (SNR), which describes the ratio of the useful signal to the noise. It is calculated from the waveform distribution (silence has a Gaussian distribution; voice has a Gamma distribution). The SNR is estimated by analyzing the frequency distribution within individual frames of the signal.
    Measured in decibels (dB), calculated as SNR = 20 * log10(Signal/Noise).
    • A higher SNR indicates better quality
    • SNR > 15 dB usually means a signal of good quality
    • SNR of 0 dB means equal levels of speech and noise
  • audio_length
    Total length of the signal in seconds.
  • max_amplitude
    Maximum value measured among the samples. The range is from -1 to 1.
  • min_amplitude
    Minimum value measured among the samples. The range is from -1 to 1.
  • peak_amplitude
    Maximum absolute value of the signal, i.e., the higher of the absolute values of max_amplitude and min_amplitude.
  • mean_amplitude
    Mean value of the samples. The range is from -1 to 1.
    A value other than 0 indicates an undesirable DC offset, meaning that the audio signal is superimposed on a DC (direct current) component. This can be caused, for example, by a bad microphone, and can indicate possible distortion of the signal, such as clipping.
  • sampling_rate
    Sampling frequency in Hertz (Hz).
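
The simpler statistics above follow directly from their definitions. The sketch below shows how they relate to each other for a single channel, assuming the samples are already available as floats normalized to the range -1 to 1. It illustrates the definitions only and is not the technology's actual implementation: the helper names `amplitude_stats` and `snr_db` are hypothetical, and `snr_db` applies the decibel formula to two given levels rather than reproducing the distribution-based SNR estimation.

```python
import math


def amplitude_stats(samples, sampling_rate):
    """Compute the definition-based statistics for one channel.

    `samples` is assumed to be a non-empty sequence of floats in [-1, 1].
    """
    if not samples:
        raise ValueError("samples must not be empty")
    max_amplitude = max(samples)
    min_amplitude = min(samples)
    return {
        "audio_length": len(samples) / sampling_rate,  # seconds
        "max_amplitude": max_amplitude,
        "min_amplitude": min_amplitude,
        # The higher of the absolute values of max_amplitude and min_amplitude
        "peak_amplitude": max(abs(max_amplitude), abs(min_amplitude)),
        # A mean other than 0 points to a DC offset
        "mean_amplitude": sum(samples) / len(samples),
        "sampling_rate": sampling_rate,
    }


def snr_db(signal_level, noise_level):
    """Apply SNR = 20 * log10(Signal/Noise) to two positive levels."""
    return 20 * math.log10(signal_level / noise_level)


# Synthetic example: one second of a 440 Hz tone at 8 kHz with a DC offset of 0.1
rate = 8000
samples = [0.1 + 0.5 * math.sin(2 * math.pi * 440 * n / rate) for n in range(rate)]
stats = amplitude_stats(samples, rate)
```

For this synthetic tone, mean_amplitude comes out at approximately 0.1 (the injected DC offset), max_amplitude at approximately 0.6, and min_amplitude at approximately -0.4, so peak_amplitude is approximately 0.6. Equal signal and noise levels give `snr_db(1.0, 1.0) == 0.0`, matching the 0 dB case described for signal_noise_ratio.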