Version: 2.0.0

Input Audio Quality

The quality of the audio plays a crucial role in achieving satisfactory results with any speech processing technology, whether it's simple voice activity detection, speech transcription, voice biometry, or other applications.

There are two main aspects of audio quality:

  • technical quality of the audio data (format, codec, bitrate, SNR, …)
  • sound quality of the actual content (background noise, reverberations, …)

Technical quality

Using an inappropriate audio codec, heavy compression, too low a bitrate, etc. can damage or even completely destroy essential parts of the audio signal required by speech technologies.

Common audio compression methods exploit the perceptual limitations of human hearing and can, for example, remove frequencies that are masked by other frequencies. Therefore, to get satisfactory results from speech technologies, use an appropriate audio format.

TIP

Tools like MediaInfo can easily give you technical information about your audio files.
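For uncompressed PCM WAV files, the same basic parameters can also be read programmatically. The following is a minimal sketch using Python's standard `wave` module, which reads uncompressed PCM WAV only; the helper name `describe_wav` and the in-memory demo file are illustrative assumptions, not part of any product API.

```python
import io
import wave


def describe_wav(source):
    """Return basic technical parameters of a PCM WAV file (path or file object)."""
    with wave.open(source, "rb") as wav:
        channels = wav.getnchannels()
        sample_width_bytes = wav.getsampwidth()
        sample_rate_hz = wav.getframerate()
        frames = wav.getnframes()
    return {
        "channels": channels,
        "bits_per_sample": sample_width_bytes * 8,
        "sample_rate_hz": sample_rate_hz,
        "duration_s": frames / sample_rate_hz,
        # Uncompressed PCM bitrate per channel: sample rate x bits per sample.
        "bitrate_kbps_per_channel": sample_rate_hz * sample_width_bytes * 8 / 1000,
    }


# Build a one-second silent 16-bit, 8 kHz mono WAV in memory for demonstration.
buffer = io.BytesIO()
with wave.open(buffer, "wb") as out:
    out.setnchannels(1)
    out.setsampwidth(2)      # 2 bytes = 16 bits per sample
    out.setframerate(8000)   # 8 kHz
    out.writeframes(b"\x00\x00" * 8000)
buffer.seek(0)

info = describe_wav(buffer)
print(info)
```

In practice, `describe_wav` would be given a file path instead of the in-memory buffer. Compressed files (MP3, OPUS, A-law WAV) need a tool such as MediaInfo instead.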

👍 DO'S

Set your PBX, media server or recording device to one of these formats (in order of preference):

  • uncompressed WAV (16-bit, 8 kHz or more)
  • A-law or μ-law (8-bit, 8 kHz) in WAV
  • lossless formats like FLAC
  • OPUS format (lossy, but developed with speech in mind)

The lossy MP3 format is not preferred. If MP3 really has to be used, the bitrate must be at least 32 kbit/s per channel. Stereo audio must use full stereo encoding, not joint-stereo[^1].
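The MP3 guidance above can be expressed as a simple check. This is only a sketch: the function name `mp3_settings_acceptable` and its parameters are hypothetical, and the only values taken from the text are the 32 kbit/s per channel minimum and the full-stereo requirement.

```python
MIN_MP3_KBPS_PER_CHANNEL = 32  # minimum per-channel bitrate stated in the guidance


def mp3_settings_acceptable(total_bitrate_kbps, channels, joint_stereo=False):
    """Check MP3 encoder settings against the guidance above.

    total_bitrate_kbps: bitrate of the whole stream (all channels together).
    channels: number of audio channels in the stream.
    joint_stereo: True if the encoder is set to joint-stereo mode.
    """
    if channels < 1:
        raise ValueError("channels must be at least 1")
    if channels > 1 and joint_stereo:
        return False  # multi-channel audio should use full stereo
    return total_bitrate_kbps / channels >= MIN_MP3_KBPS_PER_CHANNEL


print(mp3_settings_acceptable(64, channels=2))                     # full stereo, 32 kbit/s per channel
print(mp3_settings_acceptable(32, channels=2))                     # only 16 kbit/s per channel
print(mp3_settings_acceptable(64, channels=2, joint_stereo=True))  # joint-stereo
```

Note that encoders usually report the bitrate of the whole stream, so a 2-channel recording needs at least 64 kbit/s in total to meet the per-channel minimum.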

👎 DON'TS

Do not push for the smallest possible audio file sizes in an attempt to squeeze the maximum number of recordings into minimal storage space.

Aggressive compression such as MPEG 2.5 Layer 3 (MP3) at bitrates of only 16 or even 12 kbit/s per channel cripples the audio far too much.

If you really have to use MP3, refrain from using joint-stereo encoding[^1] for 2-channel audio; use full stereo instead.
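To put the storage trade-off in numbers, the sketch below estimates the size of one hour of 2-channel audio at several constant bitrates. The helper name `megabytes_per_hour` is an illustrative assumption, container overhead is ignored, and variable-bitrate formats such as FLAC or OPUS are left out because their size depends on the content.

```python
def megabytes_per_hour(kbps_per_channel, channels=2):
    """Approximate audio payload of one hour in megabytes (10**6 bytes)."""
    bits = kbps_per_channel * 1000 * channels * 3600
    return bits / 8 / 1_000_000


# Constant per-channel bitrates in kbit/s.
bitrates = {
    "PCM WAV, 16-bit, 8 kHz": 16 * 8000 / 1000,      # 128 kbit/s
    "A-law / mu-law, 8-bit, 8 kHz": 8 * 8000 / 1000,  # 64 kbit/s
    "MP3 at 32 kbit/s per channel": 32,
    "MP3 at 16 kbit/s per channel": 16,
}

for name, kbps in bitrates.items():
    print(f"{name}: {megabytes_per_hour(kbps):.1f} MB per hour (2 channels)")
```

Under these assumptions, an hour of 2-channel uncompressed 16-bit, 8 kHz WAV takes about 115 MB, while 16 kbit/s MP3 takes about 14 MB. The saving comes at the cost of the signal detail that speech technologies rely on.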

NOTE: If the audio was already heavily compressed, converting it to one of the “okay formats” really does NOT magically restore the information already lost during the original compression. No point trying that.
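A toy illustration of why the conversion does not help: the sketch below uses coarse requantization as a stand-in for a lossy codec (real codecs such as MP3 work very differently, so this is an analogy, not a model of them). Converting the degraded signal to a finer resolution afterwards leaves the error unchanged.

```python
import math


def quantize(samples, bits):
    """Round samples in [-1.0, 1.0] to a grid with 2**(bits - 1) steps per unit."""
    levels = 2 ** (bits - 1)
    return [round(s * levels) / levels for s in samples]


def max_error(a, b):
    """Largest absolute difference between two equally long sample lists."""
    return max(abs(x - y) for x, y in zip(a, b))


# 10 ms of a 440 Hz tone sampled at 8 kHz.
original = [0.5 * math.sin(2 * math.pi * 440 * n / 8000) for n in range(80)]

degraded = quantize(original, 4)    # lossy step: only a coarse grid survives
converted = quantize(degraded, 16)  # "upgrade" to a finer format afterwards

print(max_error(original, degraded))
print(max_error(original, converted))  # same error: nothing was restored
```

The converted signal is identical to the degraded one; the finer format only stores the already damaged samples more precisely.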

[^1] Joint-stereo encoding, which MP3 encoders commonly use by default, is tailored for music, where both channels usually contain almost the same signal. Using joint-stereo encoding for telephony stereo, where each channel contains a completely different signal (when one side speaks, the other side is silent), actually cripples the audio further.
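The difference between music-like and telephony-like stereo can be sketched with mid/side coding, one common form of joint stereo. This is a simplified illustration with made-up sample values; real MP3 encoders choose the stereo mode frame by frame and may use other joint-stereo techniques as well.

```python
def mid_side(left, right):
    """Convert left/right channels to mid (average) and side (half difference)."""
    mid = [(l + r) / 2 for l, r in zip(left, right)]
    side = [(l - r) / 2 for l, r in zip(left, right)]
    return mid, side


def energy(samples):
    """Sum of squared samples, a rough measure of how much signal there is."""
    return sum(s * s for s in samples)


voice = [0.5, -0.25, 0.75, -0.5]
silence = [0.0, 0.0, 0.0, 0.0]

# Music-like stereo: both channels carry (almost) the same signal.
music_mid, music_side = mid_side(voice, voice)
print(energy(music_mid), energy(music_side))  # side channel is empty

# Telephony stereo: one party speaks, the other channel is silent.
tel_mid, tel_side = mid_side(voice, silence)
print(energy(tel_mid), energy(tel_side))      # side carries as much as mid
```

In the music-like case the side channel is empty and costs almost nothing to encode. In the telephony case the side channel is as large as the mid channel, so the assumption that joint stereo relies on does not hold.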

Sound quality

Quality of the actual audio content is just as important as the technical quality.

Parasitic sounds like room reverberation, background noise (cars on the street, a dog barking nearby), ambient voices (people talking in the office, a TV playing in the room), or compression artifacts reduce the effectiveness of speech technologies (precision of speaker identification, transcription accuracy, etc.).

Therefore, it is essential to have audio that is as clean as possible.
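One rough way to quantify how clean a recording is: compare the level of a segment containing speech with the level of a noise-only segment. The sketch below assumes the two segments have already been selected (for example by hand) and that samples are floats; the function names are illustrative. Strictly speaking it measures speech plus noise against noise, so treat the result as an approximation of the SNR.

```python
import math


def rms(samples):
    """Root-mean-square level of a list of samples."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def snr_db(speech_segment, noise_segment):
    """Rough signal-to-noise ratio in dB from a speech and a noise-only segment."""
    signal = rms(speech_segment)
    noise = rms(noise_segment)
    if noise == 0:
        return math.inf   # no measurable noise
    if signal == 0:
        return -math.inf  # no measurable signal
    return 20 * math.log10(signal / noise)


# Made-up samples: speech level is ten times the noise level.
speech = [0.5, -0.5] * 100
noise = [0.05, -0.05] * 100
print(round(snr_db(speech, noise), 1))  # about 20 dB
```

A higher value means the speech stands out more clearly from the background.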

👍 DO'S

Capture the sound as close to the source as possible, i.e.

  • as close to the speaker’s mouth as possible
  • as close to the recording source as possible

to minimize the amount of ambient sound and noise, reverberation, or artifacts caused by possible multiple re-encodings during transfer.

Store the audio in an appropriate format (see above) to avoid distorting the sound with compression artifacts.

👎 DON'TS

In general, the following recording methods or sources negatively affect the sound quality:

  • surveillance camera microphone
  • notebook built-in microphone
  • smartphone lying on a desk, or even hidden under the desk, etc.
  • hidden bug microphone

These are usually made to capture every possible sound, including those undesired for speech processing – office ambient noise and reverberations, other people talking, TV playing in the background, etc.

Also, do not store the recorded audio in heavily compressed formats. Surveillance cameras, smartphones, and bugs typically use such formats by default.