Speech Quality Estimation

Phonexia’s Speech Quality Estimation (SQE) quantifies the acoustic quality of recordings. This allows users to quickly determine whether the acoustic quality of a recording is suitable for processing with other speech technologies.

As output, the SQE returns a JSON or XML file that includes general information about the technology and statistics for all channels (one or two). The statistics cover various aspects of recording quality, including the overall global score.

Technology

  • The technology is language-, accent-, text-, and channel-independent.
  • Compatible with a wide range of audio sources (applies channel compensation techniques): GSM/CDMA, 3G, VoIP, landlines, etc.

Input

  • Input format for processing: WAV or RAW (8- or 16-bit linear coding), A-law or Mu-law, PCM, with a sampling rate of 8 kHz or higher.
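
A recording can be pre-checked against these input requirements before it is submitted. The following is a minimal sketch, not part of the product: it uses Python's standard `wave` module, which reads linear PCM WAV files only (A-law, Mu-law, and RAW inputs are outside its scope). The function name `check_wav_input` and the channel-count check (taken from the "one or two" channels mentioned above) are illustrative assumptions.

```python
import wave

MIN_SAMPLE_RATE_HZ = 8000        # "8 kHz or higher" from the Input section
ALLOWED_SAMPLE_WIDTHS = (1, 2)   # 8- or 16-bit linear coding, in bytes
ALLOWED_CHANNEL_COUNTS = (1, 2)  # assumption: SQE reports one or two channels


def check_wav_input(path):
    """Return a list of problems; an empty list means these checks passed.

    Hypothetical helper: it only inspects the WAV header of a linear PCM file.
    """
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        width = wav.getsampwidth()
        channels = wav.getnchannels()

    problems = []
    if rate < MIN_SAMPLE_RATE_HZ:
        problems.append(f"sampling rate {rate} Hz is below {MIN_SAMPLE_RATE_HZ} Hz")
    if width not in ALLOWED_SAMPLE_WIDTHS:
        problems.append(f"sample width of {width * 8} bits is not 8 or 16 bits")
    if channels not in ALLOWED_CHANNEL_COUNTS:
        problems.append(f"{channels} channels found; expected one or two")
    return problems
```

Calling `check_wav_input("call.wav")` on an 8 kHz, 16-bit mono file returns an empty list; any returned message names the requirement that was not met.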

Output

  • global score – Audio quality expressed as a percentage (range: 0 to 100). By default, the global score is calculated from the waveform_n_bits and waveform_snr statistics.
  • pesq – A value inspired by PESQ (Perceptual Evaluation of Speech Quality). The range is from -0.5 to 4.5, with higher values indicating better recording quality.

Other important output statistics

  • name – The name of the statistic.
  • value – The measured value of the statistic.
  • min_limit and max_limit – The possible limits of the statistic in the recording, based on encoding, frequency, and bitrate.
  • string – Not currently used; reserved for future use.
  • is_valid – Indicates whether the calculations of the statistic are correct (true) or not (false). For example, in the case of an empty recording, SNR would result in a division by zero, and is_valid would be false.
  • waveform_snr – The signal-to-noise ratio (SNR), which describes the ratio of the useful signal to the noise signal.
    • Measured in dB.
    • Calculated from the waveform distribution (silence has a Gaussian distribution; voice has a Gamma distribution).
    • SNR is calculated as: SNR = 20 * log10(S/N).
    • A higher SNR indicates better quality.
    • An SNR above 15 dB usually indicates a good-quality signal.
    • An SNR of 0 dB means that the speech and noise levels are equal.
  • waveform_max_abs_value – The maximum amplitude of the signal.
    • Without a unit of measure.
    • With typical 16-bit encoding, sample values range from -32,768 to +32,767; ideally, the signal uses the entire range.
    • Values less than 5,000 indicate that the signal does not use enough of the available range.
    • Quiet signals may result in significant numerical errors.
  • waveform_min_abs_value – The minimum amplitude of the signal.
    • Without a unit of measure.
    • Values below 1,000 suggest that the recording is too quiet.
    • Quiet signals may result in significant numerical errors.
  • waveform_clipped_length – The total duration of the parts of the recording that contain speech clipping (converted from 25 ms frames).
    • Length is given in seconds.
    • Clipping refers to the removal of signal portions that exceed a pre-set threshold.
    • Clipping may occur if the original speech was too loud.
  • waveform_n_bits – The number of bits used by the waveform.
    • Absolute value.
    • Values below 8 indicate insufficient signal quality.
  • wfilter_technical_signal_length – The length of technical signals (tones, wide-band noise, etc.), measured in seconds.
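
The SNR formula and the thresholds quoted in the list above can be illustrated with a short sketch. It is not the SQE implementation: the function names and the flat `stats` mapping (statistic name to value) are hypothetical, and the real JSON or XML layout may differ. Statistics whose is_valid flag is false are assumed to be left out of the mapping.

```python
import math

# Thresholds quoted in the statistic descriptions above.
GOOD_SNR_DB = 15
MIN_MAX_ABS_VALUE = 5000
MIN_MIN_ABS_VALUE = 1000
MIN_N_BITS = 8


def snr_db(signal_level, noise_level):
    """SNR = 20 * log10(S / N) in dB, or None when the ratio is undefined.

    Returning None mirrors the is_valid = false case (for example, an
    empty recording would otherwise cause a division by zero).
    """
    if signal_level <= 0 or noise_level <= 0:
        return None
    return 20.0 * math.log10(signal_level / noise_level)


def quality_warnings(stats):
    """Compare a hypothetical {name: value} mapping with the thresholds above."""
    checks = [
        ("waveform_snr", lambda v: v <= GOOD_SNR_DB, "SNR is not above 15 dB"),
        ("waveform_max_abs_value", lambda v: v < MIN_MAX_ABS_VALUE,
         "signal uses too little of the available range"),
        ("waveform_min_abs_value", lambda v: v < MIN_MIN_ABS_VALUE,
         "recording may be too quiet"),
        ("waveform_n_bits", lambda v: v < MIN_N_BITS,
         "fewer than 8 bits are used"),
    ]
    warnings = []
    for name, is_bad, message in checks:
        value = stats.get(name)
        if value is not None and is_bad(value):
            warnings.append(f"{name}: {message}")
    return warnings
```

For example, `snr_db(10, 1)` gives 20 dB and `snr_db(1, 1)` gives 0 dB, matching the statement that an SNR of 0 dB means equal speech and noise levels.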

Processing speed

Approximately 2,000 times faster than real time (FTRT) on a single CPU core. For example, a standard server with 8 CPU cores can process 384,000 hours of audio in one day of computing time (2,000 × 8 cores × 24 hours).
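
The capacity figure follows from simple arithmetic, sketched below for other server sizes. The function names are illustrative, and the sketch assumes that throughput scales linearly with the number of cores, which the example above implies but real deployments may not fully achieve.

```python
FTRT_PER_CORE = 2000  # approximate speed on a single CPU core, from the text above


def audio_hours_per_day(cores, ftrt_per_core=FTRT_PER_CORE, compute_hours=24):
    """Hours of audio processed in `compute_hours`, assuming linear scaling."""
    return cores * ftrt_per_core * compute_hours


def processing_seconds(audio_seconds, cores=1, ftrt_per_core=FTRT_PER_CORE):
    """Estimated compute time for a recording of the given length."""
    return audio_seconds / (ftrt_per_core * cores)


print(audio_hours_per_day(8))    # 384000, the figure quoted above
print(processing_seconds(3600))  # about 1.8 seconds for one hour of audio on one core
```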