Version: 3.2.0

Input Audio Quality

The quality of the audio plays a crucial role in achieving satisfactory results with any speech processing technology, whether simple voice activity detection, speech transcription, voice biometry, or another application.

There are two main aspects of audio quality:

  • technical quality of the audio data (format, codec, bitrate, SNR, etc.)
  • sound quality of the actual content (background noise, reverberations, etc.)

Technical quality

Using an inappropriate audio codec, heavy compression, or too low a bitrate can damage or even completely destroy parts of the audio signal that speech technologies rely on.

Commonly used audio compression schemes exploit the perceptual limitations of human hearing and may remove frequencies that are masked by other frequencies. A speech engine, however, does not hear like a human, so information discarded as inaudible can still matter to it. To get satisfactory results from speech technologies, it is therefore crucial to use an appropriate audio format.

TIP

Tools like MediaInfo can easily give you technical information about your audio files.
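Besides MediaInfo, the key parameters of a WAV file can also be read programmatically. A minimal sketch using only Python's standard-library `wave` module (the file here is synthetic, generated just so the example is self-contained):

```python
import wave, struct, math

# Create a short test file: 16-bit mono WAV at 8 kHz (one second of a 440 Hz tone)
with wave.open("sample.wav", "wb") as w:
    w.setnchannels(1)
    w.setsampwidth(2)       # 2 bytes = 16-bit
    w.setframerate(8000)    # 8 kHz
    w.writeframes(b"".join(
        struct.pack("<h", int(20000 * math.sin(2 * math.pi * 440 * t / 8000)))
        for t in range(8000)
    ))

def wav_info(path):
    """Return the key technical parameters of a WAV file."""
    with wave.open(path, "rb") as w:
        return {
            "channels": w.getnchannels(),
            "sample_rate_hz": w.getframerate(),
            "bit_depth": 8 * w.getsampwidth(),
            "duration_s": w.getnframes() / w.getframerate(),
        }

print(wav_info("sample.wav"))
# → {'channels': 1, 'sample_rate_hz': 8000, 'bit_depth': 16, 'duration_s': 1.0}
```

For compressed containers (MP3, Opus, etc.), `wave` does not apply; a dedicated tool such as MediaInfo or ffprobe is the practical choice there.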

πŸ‘ DO'SπŸ‘Ž DONT'S

Set your PBX, media server, or recording device to one of these formats (in order of preference):

  • Uncompressed WAV (16-bit, 8 kHz or more)
  • A-law or μ-law (8-bit, 8 kHz) in WAV
  • Lossless formats like FLAC
  • Opus format (lossy, but developed with speech in mind)
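If existing recordings arrive in some other format, converting them to the first option can be scripted. The sketch below only builds the command line for the real `ffmpeg` CLI (assumed to be installed separately; the file names are placeholders):

```python
def ffmpeg_wav_cmd(src, dst, rate=16000):
    """Build an ffmpeg command converting `src` to uncompressed 16-bit PCM WAV."""
    return ["ffmpeg", "-y", "-i", src,
            "-acodec", "pcm_s16le",  # 16-bit signed little-endian PCM
            "-ar", str(rate),        # target sample rate in Hz
            dst]

cmd = ffmpeg_wav_cmd("call.mp3", "call.wav")
print(" ".join(cmd))
# → ffmpeg -y -i call.mp3 -acodec pcm_s16le -ar 16000 call.wav
# To actually run it (requires ffmpeg on PATH):
#   import subprocess; subprocess.run(cmd, check=True)
```

Keep in mind that such a conversion only changes the container and codec; it cannot bring back information already discarded by earlier lossy compression.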

Avoid prioritizing the smallest possible file sizes in an attempt to squeeze the maximum number of recordings into minimal storage space.

Severe compression such as MPEG 2.5 Layer 3 (MP3) at bitrates of only 16 or even 12 kbit/s per channel significantly degrades the audio quality.

If you really have to use MP3, use bitrates of at least 32 kbit/s per channel, and refrain from using joint-stereo encoding[^1] for 2-channel audio. Use full stereo instead.

[^1] Joint-stereo encoding – commonly used by default in MP3 encoders – is tailored for music, where both channels usually carry almost the same signal. Using it for telephony stereo, where each channel contains a completely different signal (when one side speaks, the other side is silent), cripples the audio even further.

NOTE

If the audio has already been heavily compressed, converting it to one of the recommended formats does not restore the information lost during the original compression.

Sound quality

The quality of the actual audio content is just as important as the technical quality.

Unwanted sounds such as room reverberations, background noise (e.g., cars on the street, dogs barking nearby), ambient voices (e.g., people talking in the office, TV playing in the room), or compression artifacts can significantly impact the effectiveness of speech technologies (e.g., speaker identification precision, transcription accuracy).

Therefore, it is essential to ensure the audio is as clean as possible.

πŸ‘ DO'SπŸ‘Ž DONT'S

Capture the sound as close to the source as possible, i.e.,

  • as close to the speaker’s mouth as possible
  • as close to the original recording source as possible

to minimize ambient sounds and noise, reverberation, and artifacts introduced by repeated re-recording during transfer.

Store the audio in an appropriate format (see above) to avoid distorting the sound with compression artifacts.

In general, the following recording methods or sources negatively affect sound quality:

  • Surveillance camera microphones
  • Built-in notebook microphone
  • Smartphone lying on a desk, or hidden under it, etc.
  • Hidden bug microphone

These devices are typically designed to capture all sounds, including those undesirable for speech processing, such as office ambient noise, reverberations, other people talking, and background TV noise.

Also, do not store the recorded audio in compressed formats; surveillance cameras, smartphones, and bug microphones typically default to heavily compressed formats.