Input Audio Quality
The quality of the audio plays a crucial role in achieving satisfactory results with any speech processing technology, whether it's simple voice activity detection, speech transcription, voice biometry, or other applications.
There are two main aspects of audio quality:
- technical quality of the audio data (format, codec, bitrate, SNR, …)
- sound quality of the actual content (background noise, reverberations, …)
Technical quality
Using an inappropriate audio codec, heavy compression, or too low a bitrate can damage or even completely destroy parts of the audio signal that speech technologies depend on.
Commonly used lossy audio compression exploits the perceptual limitations of human hearing and can, for example, remove frequencies that are masked by other frequencies. Therefore, to get satisfactory results from speech technologies, use an appropriate audio format.
Tools like MediaInfo can easily give you technical information about your audio files.
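For uncompressed PCM WAV files, the same kind of basic technical information can also be read programmatically. The following is a minimal sketch using Python's standard `wave` module. The helper name `describe_wav` and the demo file are illustrative assumptions, and the module only handles PCM WAV, so other containers and codecs still need a tool such as MediaInfo.

```python
import os
import tempfile
import wave


def describe_wav(path):
    """Return basic technical parameters of an uncompressed PCM WAV file."""
    with wave.open(path, "rb") as wav:
        channels = wav.getnchannels()
        sample_width_bytes = wav.getsampwidth()
        sample_rate = wav.getframerate()
        frames = wav.getnframes()
    bits_per_sample = 8 * sample_width_bytes
    return {
        "channels": channels,
        "sample_rate_hz": sample_rate,
        "bits_per_sample": bits_per_sample,
        "duration_s": frames / sample_rate if sample_rate else 0.0,
        # Uncompressed PCM bitrate = channels * sample rate * bits per sample
        "bitrate_kbps": channels * sample_rate * bits_per_sample / 1000,
    }


# Demo (hypothetical data): write one second of 8 kHz, 16-bit mono silence
# to a temporary file and inspect it.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.wav")
    with wave.open(demo_path, "wb") as out:
        out.setnchannels(1)
        out.setsampwidth(2)       # 2 bytes = 16 bits per sample
        out.setframerate(8000)
        out.writeframes(b"\x00\x00" * 8000)
    info = describe_wav(demo_path)

print(info)
```

With your own recordings, you would call `describe_wav` on the file path directly instead of generating a demo file.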
👍 DO'S | 👎 DON'TS |
---|---|
Set your PBX, media server or recording device to one of these formats (in order of preference):
| Do not push for the smallest possible audio files in an attempt to squeeze the maximum number of recordings into minimal storage space. Aggressive compression such as MPEG 2.5 Layer 3 (MP3) at bitrates of only 16 or even 12 kbit/s per channel cripples the audio far too much. If you really have to use MP3, refrain from using joint-stereo encoding[^1] for 2-channel audio; use full stereo instead. NOTE: If the audio was already heavily compressed, converting it to one of the “okay formats” does NOT restore the information lost during the original compression, so there is no point in trying. |
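To put the bitrates mentioned above in perspective, the sketch below compares them with uncompressed linear PCM. The reference format (8 kHz, 16-bit, mono) is an assumption chosen as a typical telephony example, not a recommendation taken from this page, and the helper names are hypothetical.

```python
def pcm_bitrate_kbps(sample_rate_hz, bits_per_sample, channels=1):
    """Bitrate of uncompressed linear PCM in kbit/s."""
    return sample_rate_hz * bits_per_sample * channels / 1000


def storage_mb_per_hour(bitrate_kbps):
    """Storage needed for one hour of audio, in megabytes (10^6 bytes)."""
    bytes_per_second = bitrate_kbps * 1000 / 8
    return bytes_per_second * 3600 / 1_000_000


# Assumed reference: 8 kHz, 16-bit, mono PCM
pcm = pcm_bitrate_kbps(8000, 16)

for label, kbps in [
    ("PCM 8 kHz / 16 bit", pcm),
    ("MP3 at 16 kbit/s", 16),
    ("MP3 at 12 kbit/s", 12),
]:
    print(
        f"{label}: {kbps:g} kbit/s, "
        f"{storage_mb_per_hour(kbps):.1f} MB per hour, "
        f"{pcm / kbps:.1f}x reduction relative to PCM"
    )
```

Under these assumptions, the 16 kbit/s stream keeps one eighth of the PCM bit budget, which gives a rough sense of how much information the encoder has to discard.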
[^1] Joint-stereo encoding, which MP3 encoders commonly use by default, is tailored to music, where both channels usually contain almost the same signal. Using it for telephony stereo, where each channel contains a completely different signal (when one side speaks, the other side is silent), actually degrades the audio further.
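The effect described in the footnote can be illustrated with a toy model of mid/side coding, one of the techniques used by joint-stereo modes. This is a simplified sketch, not an MP3 encoder: a pure tone stands in for speech, a uniform quantizer stands in for lossy coding, and the step sizes (with the side channel quantized more coarsely than the mid channel) are illustrative assumptions.

```python
import math


def quantize(samples, step):
    """Uniform quantizer: round each sample to the nearest multiple of step."""
    return [round(x / step) * step for x in samples]


def rms(samples):
    """Root-mean-square level of a sample sequence."""
    return math.sqrt(sum(x * x for x in samples) / len(samples))


# Telephony-like stereo: one party speaks (a 440 Hz tone stands in for
# speech), the other party is silent.
rate = 8000
left = [0.5 * math.sin(2 * math.pi * 440 * n / rate) for n in range(rate)]
right = [0.0] * rate

# Mid/side transform: mid carries the sum, side carries the difference.
mid = [(l + r) / 2 for l, r in zip(left, right)]
side = [(l - r) / 2 for l, r in zip(left, right)]

# Assumption of this toy model: the side channel gets a coarser quantizer.
mid_q = quantize(mid, step=0.001)
side_q = quantize(side, step=0.05)

# Decode back to left/right.
left_dec = [m + s for m, s in zip(mid_q, side_q)]
right_dec = [m - s for m, s in zip(mid_q, side_q)]

# For comparison: quantizing each channel independently ("full stereo").
right_independent = quantize(right, step=0.05)

print(f"Silent channel, original RMS:             {rms(right):.4f}")
print(f"Silent channel, independent coding RMS:   {rms(right_independent):.4f}")
print(f"Silent channel, mid/side round trip RMS:  {rms(right_dec):.4f}")
```

In this model the silent channel stays silent when coded independently, while the mid/side round trip leaks quantization error from the talking party into it.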
Sound quality
Quality of the actual audio content is just as important as the technical quality.
Parasitic sounds such as room reverberation, background noise (cars on the street, a dog barking nearby), ambient voices (people talking in the office, a TV playing in the room), or compression artifacts reduce the effectiveness of speech technologies (precision of speaker identification, transcription accuracy, etc.).
Therefore, it is essential to have audio that is as clean as possible.
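One common way to quantify how clean a recording is, is the signal-to-noise ratio (SNR), expressed in decibels as 10·log10(P_signal / P_noise). The sketch below computes it for synthetic data: a 440 Hz tone stands in for speech and a 50 Hz hum stands in for background noise. It assumes the signal and the noise are available separately, which is rarely true for real recordings, where the noise power would typically have to be estimated from non-speech segments.

```python
import math


def power(samples):
    """Mean power (mean of squared samples)."""
    return sum(x * x for x in samples) / len(samples)


def snr_db(signal, noise):
    """Signal-to-noise ratio in dB from separate signal and noise samples."""
    return 10 * math.log10(power(signal) / power(noise))


rate = 8000
# Stand-in for speech: 440 Hz tone with amplitude 1.0
speech = [math.sin(2 * math.pi * 440 * n / rate) for n in range(rate)]
# Stand-in for background noise: 50 Hz hum with amplitude 0.1
hum = [0.1 * math.sin(2 * math.pi * 50 * n / rate) for n in range(rate)]

# Amplitude ratio 10:1 means power ratio 100:1, i.e. about 20 dB
print(f"SNR: {snr_db(speech, hum):.1f} dB")
```

A higher value means the speech dominates the noise more strongly; the amplitudes used here are arbitrary illustration values.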
👍 DO'S | 👎 DON'TS |
---|---|
Capture the sound as close to the source as possible, i.e.
to minimize the amount of ambient sound and noise, reverberation, or artifacts caused by possible repeated re-encoding during transfer. Store the audio in an appropriate format (see above) to avoid distorting the sound with compression artifacts. | In general, the following recording methods or sources negatively affect the sound quality:
These are usually designed to capture every possible sound, including sounds that are undesirable for speech processing: office ambient noise and reverberation, other people talking, a TV playing in the background, etc. Also, do not store the recorded audio in compressed formats. Surveillance cameras, smartphones, and bugs typically use heavily compressed formats by default. |