Supporting Technologies

Automatic Speaker Identification (SID) is the most important Phonexia technology implemented in Voice Inspector (VIN), but not the only one. Besides SID, forensic experts using VIN can benefit from automatic signal-to-noise ratio calculation, Voice Activity Detection, Phoneme search, and a Wave editor that incorporates the waveform, spectrum, and power panels. Let's look at how to use each of these technologies.

Signal-to-noise ratio

Recording quality can strongly influence the reliability of SID results and, consequently, the outcome of a forensic case. Therefore, VIN uses a module of Phonexia Speech Quality Estimation (SQE) to calculate the signal-to-noise ratio (SNR) of individual recordings. The method for calculating SNR in SQE is described in detail in the answer to one of the Frequently Asked Questions: How do you calculate SNR in Speech Quality Estimation?

By default, recordings with SNR values below 10 dB are considered unfit for further processing by the SID technology. An SNR above 10 dB does not guarantee reasonable results, but a recording below the 10 dB threshold should most likely be discarded. VIN gives users the flexibility to experiment and lower the threshold in Settings > General > Recordings checking > Don't use recordings with SNR lower than. However, low-quality recordings should not be used without a valid reason, and any such use should be reported. That's why SNR values are included in the report template, and VIN automatically disables recordings with SNR values below the threshold:

[Image: snr_too_low]
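The thresholding described above can be sketched in a few lines of Python. This is only an illustration: `snr_db` uses the textbook definition of SNR in decibels, which is not necessarily the method used by SQE, and the function names, the file names, and the `split_by_snr` helper are hypothetical rather than part of VIN.

```python
import math


def snr_db(signal_power: float, noise_power: float) -> float:
    """Textbook SNR in decibels: 10 * log10(P_signal / P_noise).

    Assumption: this is the generic definition, not the SQE algorithm.
    """
    if signal_power <= 0 or noise_power <= 0:
        raise ValueError("powers must be positive")
    return 10.0 * math.log10(signal_power / noise_power)


def split_by_snr(recordings, threshold_db=10.0):
    """Split a {name: snr_in_db} mapping into (usable, disabled) name lists.

    Recordings strictly below the threshold are disabled, mirroring the
    "Don't use recordings with SNR lower than" setting described above.
    """
    usable = [name for name, snr in recordings.items() if snr >= threshold_db]
    disabled = [name for name, snr in recordings.items() if snr < threshold_db]
    return usable, disabled


# Hypothetical SNR values for three recordings.
usable, disabled = split_by_snr({"a.wav": 23.5, "b.wav": 7.2, "c.wav": 10.0})
```

Lowering `threshold_db` corresponds to relaxing the setting in VIN; the SNR values themselves should still be reported.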

Voice Activity Detection

Voice Activity Detection (VAD) technology identifies which parts of an audio recording contain speech and which do not. VAD is primarily designed to meet the needs of other technologies, so only segments that can be successfully processed by the SID technology are labeled as "voice," and the rest as "silence." The "silence" label, apart from actual silence, also covers noisy segments, technical signals, music, etc.

The primary purpose of VAD is to determine whether a recording is worth further processing based on the amount of speech present. Similar to SNR, VIN allows users to adjust the threshold in Settings > General > Recordings checking > Don't use recordings and voice-prints with speech length shorter than. The technological lower limit is 3 seconds; results from 3 to 5 seconds can serve as an indication. Reliable results can be obtained with 7 or more seconds of speech, provided the content is phonologically rich. For example, a recording with 20 seconds of simple "Yes" and "No" responses is insufficient. The speech length requirement ensures that SID technology "sees" the variability of the speaker's voice.

Our experiments have shown that precision does not increase with recordings longer than 120 seconds of speech. Longer recordings can still be used, but they offer little to no accuracy benefit and increase processing time.
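As a rough illustration of these limits, the following Python sketch sums the "voice" segments of a VAD result and maps the total to the ranges mentioned above. The segment format (start, end, label) and the function names are hypothetical. The text does not say how 5 to 7 seconds of speech should be treated, so grouping that range with "indication only" is an assumption.

```python
def net_speech_seconds(segments):
    """Sum the durations of all segments labeled "voice".

    segments: iterable of (start_seconds, end_seconds, label) tuples.
    """
    return sum(end - start for start, end, label in segments if label == "voice")


def speech_length_category(seconds):
    """Map net speech length to the ranges described in the text."""
    if seconds < 3.0:
        return "below technological limit"
    if seconds < 7.0:
        # 3-5 s is stated as "indication"; 5-7 s is not specified in the
        # text and is treated the same way here (assumption).
        return "indication only"
    # Content must still be phonologically rich; beyond about 120 s the
    # text reports no further gain in precision.
    return "usable if phonologically rich"


vad = [(0.0, 2.5, "voice"), (2.5, 4.0, "silence"), (4.0, 9.0, "voice")]
total = net_speech_seconds(vad)           # 7.5 seconds of speech
category = speech_length_category(total)  # "usable if phonologically rich"
```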

The VAD module can also be used to manually clean out non-speech parts from a recording and to select parts for automatic SID processing. The VAD panel can be activated in the Wave editor menu bar, allowing easy selection of all parts of the recording labeled as "voice" or "silence":

[Image: vad_panel]

Selected parts of the recording can be easily re-labeled as "voice" or "silence" if they, for example, contain another person's voice. The advantage of "removing" a part of the VAD labeling, rather than cutting a piece of the audio itself, is that you do not modify the evidence material; you only instruct the system on which parts should be used by the SID technology. It is good practice to save the VAD "transcription" of a recording (a simple text file with timestamps and labels) if it has been modified, and include this information in the report with an explanation of the changes made.
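The idea of changing labels instead of audio can be sketched as follows. The (start, end, label) tuples and the tab-separated text layout are assumptions made for illustration; the actual file format VIN writes may differ.

```python
def relabel(segments, start, end, new_label):
    """Return a new segment list with the span [start, end) set to new_label.

    segments: sorted, non-overlapping (start, end, label) tuples in seconds.
    Only labels change; the audio samples are never modified.
    """
    result = []
    for seg_start, seg_end, label in segments:
        if seg_end <= start or seg_start >= end:
            result.append((seg_start, seg_end, label))  # outside the selection
            continue
        if seg_start < start:
            result.append((seg_start, start, label))    # part before the selection
        result.append((max(seg_start, start), min(seg_end, end), new_label))
        if seg_end > end:
            result.append((end, seg_end, label))        # part after the selection

    # Merge adjacent segments that now share the same label.
    merged = []
    for seg in result:
        if merged and merged[-1][2] == seg[2] and merged[-1][1] == seg[0]:
            merged[-1] = (merged[-1][0], seg[1], seg[2])
        else:
            merged.append(seg)
    return merged


def to_text(segments):
    """Serialize segments as 'start<TAB>end<TAB>label' lines (hypothetical layout)."""
    return "\n".join(f"{s:.2f}\t{e:.2f}\t{label}" for s, e, label in segments)


original = [(0.0, 5.0, "voice"), (5.0, 6.0, "silence"), (6.0, 10.0, "voice")]
# Exclude 3.0-5.0 s, e.g. because another person is speaking there.
edited = relabel(original, 3.0, 5.0, "silence")
```

Keeping both the original and the edited label lists makes it straightforward to document in the report exactly which spans were excluded and why.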

Phoneme search

The Phonemes panel can be used in combination with the Tool panel to find and analyze characteristic pronunciations and words in individual recordings. It can search for exact phoneme sequences, phoneme classes, or let VIN4 suggest similar phoneme sequences of a certain length. The system can be set to search for sequences in different files to help compare similar sounds in the Questioned and Suspected speaker recordings.

The phoneme set is based on the Czech language and uses a machine-readable character set. Below are the equivalents in the International Phonetic Alphabet:

VIN  @   C   D   N   R   S   T   Z   a   a:  b   c   d   e   e:  f   g   h   i
IPA  ə       ɟ   ɲ       ʃ   c   ʒ   a   a:  b   ts  d   e   e:  f   ɡ   ɦ   ɪ

VIN  i:  j   k   l   m   n   o   o:  p   r   s   sil sp  t   u   u:  v   x   z
IPA  i:  j   k   l   m   n   o   o:  p   r   s   -   -   t   u   u:  v   x   z

For languages other than Czech, which may have different phoneme sets, using phoneme classes can be practical. Even if the exact phoneme and phoneme symbol are not "linguistically" correct for the target language, the phoneme class will likely be accurate. For example, {plosive} {vowel} n would find "gun," "ban," "dean," and others. To explore all possibilities, hover the mouse over the blue "i" button in the Tool panel.
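To show how a class-based query such as {plosive} {vowel} n can match several words, here is a minimal Python sketch. The class membership lists are assumptions based on the VIN symbols in the table above, and the query parser and matcher are hypothetical; they do not reproduce VIN's actual search engine or its full list of classes.

```python
# Hypothetical class membership, using VIN phoneme symbols.
PHONEME_CLASSES = {
    "plosive": {"p", "b", "t", "d", "T", "D", "k", "g"},
    "vowel": {"a", "a:", "e", "e:", "i", "i:", "o", "o:", "u", "u:", "@"},
}


def parse_query(query):
    """Turn '{plosive} {vowel} n' into a list of sets of allowed symbols."""
    slots = []
    for token in query.split():
        if token.startswith("{") and token.endswith("}"):
            slots.append(PHONEME_CLASSES[token[1:-1]])  # class name in braces
        else:
            slots.append({token})                       # exact phoneme
    return slots


def find_matches(phonemes, query):
    """Return the start indices where the query matches the phoneme sequence."""
    slots = parse_query(query)
    hits = []
    for i in range(len(phonemes) - len(slots) + 1):
        window = phonemes[i:i + len(slots)]
        if all(p in allowed for p, allowed in zip(window, slots)):
            hits.append(i)
    return hits


# Roughly "gun" and "dean", separated by a short pause.
sequence = ["sil", "g", "a", "n", "sp", "d", "i:", "n", "sil"]
positions = find_matches(sequence, "{plosive} {vowel} n")  # [1, 5]
```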

Wave editor

The Wave editor can be used for basic operations with recordings. By default, it shows only the waveform, parts of which can be selected, played, copied, cut, pasted, and amplified. Besides the waveform, the editor can also display the spectrogram (which can include the fundamental frequency) and the power panel:

[Image: wave_editor]

In the Spectrogram settings (right-click in the Spectrum panel > Spectrum > Spectrum settings), you can choose the window length, overlap type, pre-emphasis, spectrogram type, and LPC order. Additionally, the spectrum detail can show the frequency composition of the selected sample. The spectrum detail can easily be saved as a picture, inserted into the Report, and compared with spectrum details from other parts of the same recording or even other recordings:

[Image: spectrum_detail]
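For readers unfamiliar with these settings, the sketch below illustrates what pre-emphasis and window length/overlap generally mean in spectrogram computation. It is a generic textbook illustration, not VIN's implementation: the coefficient 0.97, the representation of overlap as a fraction, and both function names are assumptions.

```python
def pre_emphasis(samples, coeff=0.97):
    """First-order high-pass filter: y[n] = x[n] - coeff * x[n-1].

    Boosts higher frequencies before spectral analysis.
    The default coefficient is a common textbook value (assumption).
    """
    if not samples:
        return []
    return [samples[0]] + [
        samples[n] - coeff * samples[n - 1] for n in range(1, len(samples))
    ]


def frame_starts(num_samples, window_length, overlap):
    """Start indices of the analysis windows.

    overlap is the fraction of each window shared with the next one,
    in the range [0, 1). Longer windows give finer frequency resolution
    but coarser time resolution.
    """
    hop = max(1, int(window_length * (1.0 - overlap)))
    return list(range(0, num_samples - window_length + 1, hop))


emphasized = pre_emphasis([1.0, 1.0, 1.0], coeff=0.5)  # [1.0, 0.5, 0.5]
starts = frame_starts(1000, 400, 0.5)                  # [0, 200, 400, 600]
```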

Finally, the Wave editor includes the "Generic panel," which can be used to add notes and manual transcriptions to recordings. By selecting several sections of the recording and pressing F7, you can listen to the selections (shortcut Ctrl-Space), annotate them, and proceed to the next selected section by simply pressing Enter, without interrupting the workflow with unnecessary mouse clicks. The labels on the Generic panel can be filtered and processed in batches (double-click one of them and all identical labels will be selected too):

[Image: label_panel]