Supporting Technologies
Automatic Speaker Identification (SID) is the most important, but not the only, Phonexia technology implemented in Voice Inspector (VIN). Besides SID, forensic experts using VIN can benefit from automatic signal-to-noise ratio calculation, Voice Activity Detection, Phoneme search, and the Wave editor, which incorporates the waveform, spectrum, and power panels. The following sections describe how to use each of these technologies.
Signal-to-noise ratio
Recording quality can strongly influence the reliability of SID results and, consequently, the outcome of a forensic case. Therefore, VIN uses a module of Phonexia Speech Quality Estimation (SQE) to calculate the signal-to-noise ratio (SNR) of individual recordings. The method SQE uses to calculate SNR is described in detail in one of the Frequently Asked Questions: How do you calculate SNR in Speech Quality Estimation?
By default, recordings with SNR values below 10 dB are considered unfit for further processing by the SID technology. While an SNR above 10 dB does not guarantee reliable results, a recording below the 10 dB threshold should most likely be discarded. VIN gives users the flexibility to experiment and lower the threshold in Settings > General > Recordings checking > Don't use recordings with SNR lower than.
However, low-quality recordings should not be used without a valid reason, and any such use should be reported. That is why SNR values are included in the report template and VIN automatically disables recordings whose SNR falls below the threshold.
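For orientation, SNR in decibels is conventionally defined as 10 * log10(P_signal / P_noise), where P is the mean power of the samples. The sketch below uses this textbook definition and is not the SQE method referenced in the FAQ. It also mimics the default threshold behavior described above; the Recording structure and the helper names are hypothetical.

```python
import math
from dataclasses import dataclass

DEFAULT_MIN_SNR_DB = 10.0  # mirrors VIN's default threshold described above


def mean_power(samples):
    """Average power (mean of squared amplitudes) of a sequence of samples."""
    if not samples:
        raise ValueError("empty sample sequence")
    return sum(s * s for s in samples) / len(samples)


def snr_db(signal, noise):
    """Textbook SNR in dB: 10 * log10(P_signal / P_noise).

    Generic definition only; this is NOT the SQE algorithm.
    """
    p_signal = mean_power(signal)
    p_noise = mean_power(noise)
    if p_noise == 0:
        return math.inf
    if p_signal == 0:
        return -math.inf
    return 10.0 * math.log10(p_signal / p_noise)


@dataclass
class Recording:
    """Hypothetical container for a recording and its precomputed SNR."""
    name: str
    snr_db: float
    enabled: bool = True


def apply_snr_threshold(recordings, min_snr_db=DEFAULT_MIN_SNR_DB):
    """Disable recordings whose SNR is below the threshold; return their names."""
    disabled = []
    for rec in recordings:
        if rec.snr_db < min_snr_db:
            rec.enabled = False
            disabled.append(rec.name)
    return disabled


if __name__ == "__main__":
    recs = [Recording("questioned.wav", 14.2), Recording("suspect_call.wav", 7.5)]
    print(apply_snr_threshold(recs))  # ['suspect_call.wav']
```

Lowering min_snr_db corresponds to relaxing the setting in Recordings checking; as noted above, any recording kept this way should be justified in the report.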
Voice Activity Detection
Voice Activity Detection (VAD) technology identifies which parts of an audio recording contain speech and which do not. VAD is primarily designed to serve other technologies, so only segments that the SID technology can successfully process are labeled "voice," and the rest are labeled "silence." Apart from actual silence, the "silence" label also covers noisy segments, technical signals, music, etc.
The primary purpose of VAD is to determine whether a recording is worth further processing based on the amount of speech present. Similar to SNR, VIN allows users to adjust the threshold in Settings > General > Recordings checking > Don't use recordings and voice-prints with speech length shorter than.
The technological lower limit is 3 seconds of speech; results obtained from 3 to 5 seconds can serve only as an indication. Reliable results can be obtained with 7 or more seconds of speech, provided the content is phonologically rich. For example, a recording with 20 seconds of simple "Yes" and "No" responses is insufficient. The speech length requirement ensures that the SID technology "sees" enough of the variability of the speaker's voice.
Our experiments have shown that precision does not increase for recordings with more than 120 seconds of speech. Longer recordings can still be used, but they offer little to no accuracy benefit and increase processing time.
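The speech length guidance above can be expressed as a small check. This is a minimal sketch, not VIN's internal logic: it assumes VAD output is available as hypothetical (start, end, label) tuples in seconds, sums the "voice" segments, and maps the total onto the ranges described in this section. Treating 5 to 7 seconds as "indicative only" is an assumption, since the text gives no guidance for that range, and no code can judge whether the content is phonologically rich.

```python
VOICE = "voice"


def speech_length(segments):
    """Total duration in seconds of segments labeled 'voice'.

    segments: iterable of (start, end, label) tuples (hypothetical VAD output).
    """
    total = 0.0
    for start, end, label in segments:
        if end < start:
            raise ValueError(f"segment ends before it starts: {start}-{end}")
        if label == VOICE:
            total += end - start
    return total


def assess_speech_length(seconds):
    """Map net speech length onto the guidance given in this section."""
    if seconds < 3.0:
        return "below the 3 s technological limit"
    if seconds < 7.0:
        # 3-5 s: indication only (from the text); 5-7 s treated the same (assumption)
        return "indicative only"
    if seconds <= 120.0:
        return "sufficient length (content must still be phonologically rich)"
    return "sufficient length; speech beyond 120 s adds little accuracy"


if __name__ == "__main__":
    segs = [(0.0, 2.5, "voice"), (2.5, 4.0, "silence"), (4.0, 9.0, "voice")]
    total = speech_length(segs)
    print(total)                        # 7.5
    print(assess_speech_length(total))  # sufficient length (content must ...)
```

Note that the check works on net speech (the "voice" segments only), not on the total duration of the file, which matches how the VAD threshold is described.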
The VAD module can also be used to manually clean non-speech parts out of a recording and to select parts for automatic SID processing. The VAD panel, which can be activated in the Wave editor menu bar, allows easy selection of all parts of the recording labeled "voice" or "silence."
Selected parts of the recording can easily be re-labeled as "voice" or "silence," for example, if they contain another person's voice. The advantage of "removing" part of the VAD labeling, rather than cutting out a piece of the audio itself, is that you do not modify the evidence material; you only instruct the system which parts the SID technology should use. If the VAD "transcription" of a recording (a simple text file with timestamps and labels) has been modified, it is good practice to save it and include it in the report together with an explanation of the changes made.
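The re-labeling workflow can be sketched in a few lines. The exact file format of VIN's VAD transcription is not described here, so this sketch assumes a hypothetical plain-text format with one "start end label" line per segment (times in seconds). The helper names parse_vad, relabel, merge_adjacent, and format_vad are illustrative only. Only the labels change; the audio samples are never touched.

```python
def parse_vad(text):
    """Parse 'start end label' lines (hypothetical format) into tuples."""
    segments = []
    for line_no, line in enumerate(text.splitlines(), 1):
        line = line.strip()
        if not line:
            continue
        parts = line.split()
        if len(parts) != 3:
            raise ValueError(f"line {line_no}: expected 'start end label'")
        start, end, label = float(parts[0]), float(parts[1]), parts[2]
        if end < start:
            raise ValueError(f"line {line_no}: end before start")
        segments.append((start, end, label))
    return segments


def merge_adjacent(segments):
    """Join touching segments that carry the same label."""
    merged = []
    for seg in sorted(segments):
        if merged and merged[-1][2] == seg[2] and merged[-1][1] == seg[0]:
            merged[-1] = (merged[-1][0], seg[1], seg[2])
        else:
            merged.append(seg)
    return merged


def relabel(segments, start, end, new_label):
    """Return a new list in which [start, end) carries new_label."""
    if not start < end:
        raise ValueError("start must be smaller than end")
    result = []
    for s, e, label in segments:
        if e <= start or s >= end:  # segment lies outside the edited range
            result.append((s, e, label))
            continue
        if s < start:  # keep the part before the edited range
            result.append((s, start, label))
        result.append((max(s, start), min(e, end), new_label))
        if e > end:  # keep the part after the edited range
            result.append((end, e, label))
    return merge_adjacent(result)


def format_vad(segments):
    """Serialize segments back to the same 'start end label' text format."""
    return "\n".join(f"{s:.2f} {e:.2f} {label}" for s, e, label in segments)


if __name__ == "__main__":
    original = "0.00 4.00 voice\n4.00 5.00 silence\n5.00 12.00 voice"
    # Mark 8-10 s as "silence", e.g. because another person speaks there.
    edited = relabel(parse_vad(original), 8.0, 10.0, "silence")
    print(format_vad(edited))
```

Keeping the original and the edited transcription side by side makes it straightforward to document the changes in the report.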
Phoneme search
The Phonemes panel can be used in combination with the Tool panel to find and analyze characteristic pronunciations and words in individual recordings. It can search for exact phoneme sequences or phoneme classes, or let VIN4 suggest similar phoneme sequences of a given length. The system can also search for sequences across different files to help compare similar sounds in the Questioned and Suspected speaker recordings.
The phoneme set is based on the Czech language and uses a machine-readable character set. Below are the equivalents in the International Phonetic Alphabet:
VIN | @ | C | D | N | R | S | T | Z | a | a: | b | c | d | e | e: | f | g | h | i |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
IPA | ə | tʃ | ɟ | ɲ | r̝ | ʃ | c | ʒ | a | a: | b | ts | d | e | e: | f | ɡ | ɦ | ɪ |
VIN | i: | j | k | l | m | n | o | o: | p | r | s | sil | sp | t | u | u: | v | x | z |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
IPA | i: | j | k | l | m | n | o | o: | p | r | s | - | - | t | u | u: | v | x | z |
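The table can be used directly as a lookup when converting VIN phoneme strings to IPA, for instance for a report. In this sketch the symbols and IPA values are copied from the table, while the input format (space-separated VIN symbols such as "p a: n") is an assumption rather than a documented VIN export format. The non-speech symbols "sil" and "sp", shown as "-" in the table, are dropped.

```python
# VIN symbol -> IPA, copied from the table above.
VIN_TO_IPA = {
    "@": "ə", "C": "tʃ", "D": "ɟ", "N": "ɲ", "R": "r̝", "S": "ʃ", "T": "c",
    "Z": "ʒ", "a": "a", "a:": "a:", "b": "b", "c": "ts", "d": "d", "e": "e",
    "e:": "e:", "f": "f", "g": "ɡ", "h": "ɦ", "i": "ɪ", "i:": "i:", "j": "j",
    "k": "k", "l": "l", "m": "m", "n": "n", "o": "o", "o:": "o:", "p": "p",
    "r": "r", "s": "s", "t": "t", "u": "u", "u:": "u:", "v": "v", "x": "x",
    "z": "z",
}
NON_SPEECH = {"sil", "sp"}  # shown as "-" in the table


def vin_to_ipa(sequence):
    """Convert space-separated VIN phonemes to an IPA string.

    Non-speech symbols are skipped; unknown symbols raise ValueError.
    """
    out = []
    for symbol in sequence.split():
        if symbol in NON_SPEECH:
            continue
        try:
            out.append(VIN_TO_IPA[symbol])
        except KeyError:
            raise ValueError(f"unknown VIN phoneme: {symbol!r}") from None
    return "".join(out)


if __name__ == "__main__":
    print(vin_to_ipa("p a: n sp"))  # pa:n
```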
For languages other than Czech, which may have different phoneme sets, using phoneme classes can be practical. Even if a particular phoneme symbol is not "linguistically" correct for the target language, the phoneme class it belongs to will likely be accurate. For example, the query {plosive} {vowel} n would find "gun," "ban," "dean," and others. To explore all possibilities, hover the mouse over the blue "i" button in the Tool panel.
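How a class-based query matches phoneme sequences can be illustrated with a small sliding-window search. The class memberships below are an assumption derived from the table (VIN "T" and "D" are the palatal plosives c and ɟ, while VIN "c" is the affricate ts and is therefore not listed as a plosive); VIN's own class definitions, visible via the "i" button, may differ. The input is assumed to be a list of VIN symbols, and the function names are hypothetical.

```python
# Hypothetical class definitions over VIN symbols (not VIN's own tables).
PHONEME_CLASSES = {
    "plosive": {"p", "b", "t", "d", "T", "D", "k", "g"},
    "vowel": {"a", "a:", "e", "e:", "i", "i:", "o", "o:", "u", "u:", "@"},
    "nasal": {"m", "n", "N"},
}


def parse_pattern(pattern):
    """Turn '{plosive} {vowel} n' into a list of sets of allowed phonemes."""
    slots = []
    for token in pattern.split():
        if token.startswith("{") and token.endswith("}"):
            name = token[1:-1]
            if name not in PHONEME_CLASSES:
                raise ValueError(f"unknown phoneme class: {name}")
            slots.append(PHONEME_CLASSES[name])
        else:
            slots.append({token})  # a literal phoneme matches only itself
    if not slots:
        raise ValueError("empty pattern")
    return slots


def find_matches(phonemes, pattern):
    """Return start indices where the pattern matches consecutive phonemes."""
    slots = parse_pattern(pattern)
    n = len(slots)
    return [
        i
        for i in range(len(phonemes) - n + 1)
        if all(phonemes[i + k] in slots[k] for k in range(n))
    ]


if __name__ == "__main__":
    seq = "b a n sil d i: n".split()
    print(find_matches(seq, "{plosive} {vowel} n"))  # [0, 4]
```

Because each slot is a set, an exact phoneme and a whole class are handled the same way, which is why class queries degrade gracefully when the target language's sounds only approximately match the Czech symbols.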
Wave editor
The Wave editor can be used for basic operations with recordings. By default, it shows only the waveform, parts of which can be selected, played, copied, cut, pasted, and amplified. Besides the waveform, the editor can also display the spectrogram (which can include the fundamental frequency) and the power panel.
In the Spectrogram settings (right-click in the Spectrum panel > Spectrum > Spectrum settings), you can choose the window length, overlap type, pre-emphasis, spectrogram type, and LPC order.
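To make these settings more concrete, here is a generic short-time spectrum sketch in plain Python. It is a textbook illustration, not VIN's implementation: pre-emphasis is the usual first-order filter y[n] = x[n] - alpha * x[n-1] (alpha = 0.97 is a common choice and is assumed here), window length and overlap are given in samples, a Hamming window is assumed, and LPC analysis is not covered. The naive DFT is slow and only meant to show the computation.

```python
import cmath
import math


def pre_emphasis(x, alpha=0.97):
    """First-order high-pass filter y[n] = x[n] - alpha * x[n-1]."""
    if not x:
        return []
    return [x[0]] + [x[n] - alpha * x[n - 1] for n in range(1, len(x))]


def hamming(length):
    """Symmetric Hamming window of the given length."""
    if length == 1:
        return [1.0]
    return [0.54 - 0.46 * math.cos(2 * math.pi * n / (length - 1))
            for n in range(length)]


def frames(x, win_len, overlap):
    """Split x into frames of win_len samples overlapping by `overlap` samples."""
    if not 0 <= overlap < win_len:
        raise ValueError("overlap must satisfy 0 <= overlap < win_len")
    hop = win_len - overlap
    return [x[i:i + win_len] for i in range(0, len(x) - win_len + 1, hop)]


def magnitude_spectrum(frame):
    """Naive DFT magnitudes for bins 0..N/2 (O(N^2), illustration only)."""
    n_samples = len(frame)
    return [
        abs(sum(frame[n] * cmath.exp(-2j * math.pi * k * n / n_samples)
                for n in range(n_samples)))
        for k in range(n_samples // 2 + 1)
    ]


def spectrogram(x, win_len=256, overlap=128, alpha=0.97):
    """List of magnitude spectra, one per windowed frame."""
    emphasized = pre_emphasis(x, alpha)
    window = hamming(win_len)
    return [magnitude_spectrum([s * w for s, w in zip(frame, window)])
            for frame in frames(emphasized, win_len, overlap)]


if __name__ == "__main__":
    # A sine that completes exactly 4 cycles per 32-sample window.
    x = [math.sin(2 * math.pi * 4 * n / 32) for n in range(64)]
    spec = spectrogram(x, win_len=32, overlap=16, alpha=0.0)
    print(len(spec), [max(range(len(f)), key=f.__getitem__) for f in spec])
```

With a sampling rate of 8 kHz, for instance, a 256-sample window spans 32 ms and gives a bin spacing of 31.25 Hz. Longer windows give finer frequency resolution at the cost of time resolution, which is the main trade-off behind the window length setting.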
Additionally, the spectrum detail can show the frequency composition of a selected sample. The spectrum detail can easily be saved as a picture, inserted into the Report, and compared with spectrum details from other parts of the same recording or even from other recordings.
Finally, the Wave editor includes the "Generic panel," which can be used to add notes and manual transcriptions to recordings. After selecting several sections of the recording and pressing F7, you can listen to each selection (shortcut Ctrl-Space), annotate it, and proceed to the next selected section simply by pressing Enter, without interrupting the workflow with unnecessary mouse clicks. The labels on the Generic panel can be filtered and processed in batches: double-clicking one of them selects all identical labels as well.