Supporting Technologies

While Automatic Speaker Identification (SID) is the core technology of Voice Inspector (VIN), it is not the only Phonexia technology built in. Voice Inspector also integrates automatic Speaker Diarization, Signal-to-Noise ratio (SNR) calculation, Voice Activity Detection, Phoneme Search, and a Wave Editor featuring waveform, spectrum, and power panels. The sections below describe how each of them can be used.

Speaker Diarization

Speaker Diarization labels segments of a mono recording based on the voices of individual speakers, regardless of language, domain, or channel. It outputs a list of time segments with speaker labels and can also detect technical signals and silence. Users can manually adjust the segmentation in the Diarization panel by holding Shift and dragging the segment boundaries as needed. Voice Inspector can also split a recording based on the diarization results, creating new files for more focused investigation.

The technology first filters out silence and technical signals, then divides the remaining speech into smaller chunks. A voiceprint is created for each chunk, and the distances between voiceprints are calculated. Based on these distances, a clustering algorithm groups similar chunks together. Finally, a variational Bayes algorithm refines the speaker assignment to improve the accuracy of the segmentation.
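The chunk, voiceprint, distance, and clustering steps can be illustrated with a minimal sketch. It is not Phonexia's implementation: the 2-D "voiceprints" are toy values, cosine distance and average-linkage clustering with a fixed threshold are assumptions chosen for illustration, and the variational Bayes refinement is omitted.

```python
import math

def cosine_distance(a, b):
    """1 - cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm

def cluster_chunks(voiceprints, threshold=0.3):
    """Greedy average-linkage agglomerative clustering (assumed method).

    Repeatedly merges the two closest clusters until the smallest
    average distance between any two clusters exceeds `threshold`.
    Returns one integer speaker label per chunk.
    """
    clusters = [[i] for i in range(len(voiceprints))]

    def avg_dist(c1, c2):
        pairs = [(i, j) for i in c1 for j in c2]
        total = sum(cosine_distance(voiceprints[i], voiceprints[j]) for i, j in pairs)
        return total / len(pairs)

    while len(clusters) > 1:
        dist, a, b = min(
            ((avg_dist(clusters[a], clusters[b]), a, b)
             for a in range(len(clusters))
             for b in range(a + 1, len(clusters))),
            key=lambda t: t[0],
        )
        if dist > threshold:
            break
        clusters[a].extend(clusters[b])
        del clusters[b]

    labels = [0] * len(voiceprints)
    for label, members in enumerate(clusters):
        for i in members:
            labels[i] = label
    return labels

# Toy 2-D "voiceprints" for four speech chunks (hypothetical values).
chunks = [(1.0, 0.1), (0.9, 0.2), (0.1, 1.0), (0.2, 0.9)]
labels = cluster_chunks(chunks)
print(labels)
```

Here the first two chunks end up with one speaker label and the last two with another, because their voiceprints point in nearly the same direction.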

Typical applications include preprocessing audio for automatic speech recognition, labeling different parts of an utterance by speaker, and splitting mono-channel phone call recordings into separate tracks.

[Image: diarization]

Signal-to-Noise ratio

Recording quality significantly impacts the reliability of SID results and, consequently, the outcome of forensic cases. Therefore, VIN uses Phonexia's Speech Quality Estimation (SQE) technology to calculate the Signal-to-Noise ratio (SNR) of recordings. Detailed information about the SNR calculation method is available in the FAQ entry "How do you calculate SNR in Speech Quality Estimation?"

By default, recordings with SNR values below 10 dB are considered unsuitable for further processing by the SID technology. Although an SNR score above 10 dB does not guarantee reliable results, it is likely that a recording below the 10 dB threshold should be discarded. VIN offers flexibility to adjust the threshold in Settings > General > Recordings checking > Don't use recordings with SNR lower than. Nonetheless, using low-quality recordings should be avoided unless necessary and should be documented. That's why SNR values are included in the report template, and VIN automatically disables recordings with SNR values below the threshold:

[Image: snr]
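The threshold check itself is simple to sketch. The example below uses the generic textbook definition of SNR in decibels, which is an assumption and not necessarily how SQE computes it (see the FAQ for that); the file names and power values are hypothetical.

```python
import math

DEFAULT_MIN_SNR_DB = 10.0  # default threshold stated in the text

def snr_db(signal_power, noise_power):
    """Textbook SNR in decibels: 10 * log10(P_signal / P_noise)."""
    return 10.0 * math.log10(signal_power / noise_power)

def usable_for_sid(snr, min_snr_db=DEFAULT_MIN_SNR_DB):
    """True if the SNR is not below the threshold."""
    return snr >= min_snr_db

# Hypothetical recordings with made-up signal and noise powers.
recordings = {
    "call_a.wav": snr_db(100.0, 1.0),  # 20 dB
    "call_b.wav": snr_db(5.0, 1.0),    # roughly 7 dB
}
enabled = [name for name, snr in recordings.items() if usable_for_sid(snr)]
print(enabled)
```

Only the first recording passes the default 10 dB check; lowering `min_snr_db` mirrors adjusting the threshold in the settings.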

Voice Activity Detection

Voice Activity Detection (VAD) technology identifies speech and non-speech segments in recordings. It ensures that only segments processable by SID are labeled as "voice," while the rest are labeled as "silence." The "silence" label, apart from actual silence, also covers noisy segments, technical signals, music, etc.

The primary purpose of VAD is to determine whether a recording is worth further processing based on the amount of speech present. As with the SNR check, VIN allows users to adjust the threshold in Settings > General > Recordings checking > Don't use recordings and voice-prints with speech length shorter than.

Minimum: 3 seconds
Recommended: 7 seconds or more (assuming phonologically rich content; a recording with 20 seconds of simple "Yes" and "No" responses is insufficient)

Our experiments have shown that precision does not increase with recordings longer than 120 seconds of speech. Longer recordings can still be used, but they offer little to no accuracy benefit and increase processing time.
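The speech-length figures above can be turned into a small check over VAD output. This is a sketch under assumptions: the `(start_s, end_s, label)` tuple layout and the wording of the returned categories are hypothetical, and only the 3, 7, and 120 second values come from the text.

```python
MIN_SPEECH_S = 3.0          # minimum stated in the text
RECOMMENDED_SPEECH_S = 7.0  # recommended lower bound
SATURATION_S = 120.0        # beyond this, no reported precision gain

def speech_seconds(segments):
    """Sum the durations of segments labeled "voice".

    `segments` is a list of (start_s, end_s, label) tuples.
    """
    return sum(end - start for start, end, label in segments if label == "voice")

def assess(segments):
    """Classify a recording by its total amount of speech."""
    total = speech_seconds(segments)
    if total < MIN_SPEECH_S:
        return "too short"
    if total < RECOMMENDED_SPEECH_S:
        return "below recommended"
    if total > SATURATION_S:
        return "sufficient (extra speech adds little)"
    return "sufficient"

vad = [
    (0.0, 1.5, "silence"),
    (1.5, 6.0, "voice"),
    (6.0, 7.0, "silence"),
    (7.0, 10.0, "voice"),
]
print(speech_seconds(vad), assess(vad))
```

Note that a length check alone cannot tell whether the speech is phonologically rich; that still needs a human judgment.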

The VAD module can also be used to manually clean out non-speech parts from a recording and to select parts for automatic SID processing. The VAD panel can be activated in the Wave editor menu bar, allowing easy selection of all parts of the recording labeled as "voice" or "silence":

[Image: vad]

Selected parts of the recording can be easily re-labeled as "voice" or "silence" if they, for example, contain another person's voice. The advantage of "removing" a part of the VAD labeling, rather than cutting a piece of the audio itself, is that you do not modify the evidence material; you only instruct the system on which parts should be used by the SID technology. It is good practice to save the VAD "transcription" of a recording (a simple text file with timestamps and labels) if it has been modified, and include this information in the report with an explanation of the changes made.

The output of VAD can also be modified in the VAD panel by holding Shift and dragging the segment boundaries as needed.
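The idea of changing labels instead of cutting audio can be sketched as an operation on a segment list. The tuple layout and the one-line-per-segment text format below are hypothetical and are not VIN's actual file format.

```python
def relabel(segments, start, end, new_label):
    """Return a new segment list with [start, end) set to `new_label`.

    Segments are (start_s, end_s, label) tuples, sorted and non-overlapping.
    The audio itself is never touched; only the labels change.
    """
    out = []
    for s, e, label in segments:
        if e <= start or s >= end:      # no overlap: keep as is
            out.append((s, e, label))
            continue
        if s < start:                   # part before the relabeled range
            out.append((s, start, label))
        out.append((max(s, start), min(e, end), new_label))
        if e > end:                     # part after the relabeled range
            out.append((end, e, label))
    # Merge adjacent segments that ended up with the same label.
    merged = []
    for seg in out:
        if merged and merged[-1][2] == seg[2] and merged[-1][1] == seg[0]:
            merged[-1] = (merged[-1][0], seg[1], seg[2])
        else:
            merged.append(seg)
    return merged

def to_text(segments):
    """Serialize as 'start end label' lines (hypothetical format)."""
    return "\n".join(f"{s:.2f} {e:.2f} {label}" for s, e, label in segments)

vad = [(0.0, 2.0, "silence"), (2.0, 10.0, "voice")]
# Exclude 4 s to 6 s, e.g. because another person is speaking there.
edited = relabel(vad, 4.0, 6.0, "silence")
print(to_text(edited))
```

Because `relabel` returns a new list, the original labeling stays available for comparison, which makes it easy to document what was changed in the report.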

Phoneme search

The Phonemes panel, together with the Tool panel, can be used to find and analyze characteristic pronunciations and words in individual recordings. It can search for exact phoneme sequences or phoneme classes, or let VIN suggest similar phoneme sequences of a certain length. The system can be set to search for sequences in different files to help compare similar sounds in the Questioned and Suspected speaker recordings.

The phoneme set is based on the Czech language and uses a machine-readable character set. Below are the equivalents in the International Phonetic Alphabet:

VIN	@	C	D	N	R	S	T	Z	a	a:	b	c	d	e	e:	f	g	h	i
IPA	ə	tʃ	ɟ	ɲ	r̝	ʃ	c	ʒ	a	a:	b	ts	d	e	e:	f	ɡ	ɦ	ɪ

VIN	i:	j	k	l	m	n	o	o:	p	r	s	sil	sp	t	u	u:	v	x	z
IPA	i:	j	k	l	m	n	o	o:	p	r	s	-	-	t	u	u:	v	x	z

For languages other than Czech, which may have different phoneme sets, using phoneme classes can be practical. Even if the exact phoneme and phoneme symbol are not "linguistically" correct for the target language, the phoneme class will likely be accurate. For example, {plosive} {vowel} n would find "gun," "ban," "dean," and others. To explore all possibilities, hover the mouse over the blue "i" button in the Tool panel.
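A class-based search such as {plosive} {vowel} n can be sketched as simple pattern matching over a phoneme sequence. The class membership below is an assumption built from the VIN symbols in the table, the transcription is only a rough illustration, and the function names are hypothetical; VIN's own matching may work differently.

```python
# Hypothetical class membership over the VIN phoneme symbols.
PHONEME_CLASSES = {
    "plosive": {"p", "b", "t", "d", "T", "D", "k", "g"},
    "vowel": {"a", "a:", "e", "e:", "i", "i:", "o", "o:", "u", "u:", "@"},
}

def matches(token, phoneme):
    """A token is either a literal phoneme or a class name in braces."""
    if token.startswith("{") and token.endswith("}"):
        return phoneme in PHONEME_CLASSES[token[1:-1]]
    return token == phoneme

def search(pattern, phonemes):
    """Return start indices where the space-separated pattern matches."""
    tokens = pattern.split()
    hits = []
    for i in range(len(phonemes) - len(tokens) + 1):
        window = phonemes[i:i + len(tokens)]
        if all(matches(t, p) for t, p in zip(tokens, window)):
            hits.append(i)
    return hits

# Rough transcription of "a gun ban" in VIN symbols (illustrative only).
sequence = ["@", "g", "a", "n", "sp", "b", "e", "n"]
hits = search("{plosive} {vowel} n", sequence)
print(hits)
```

The pattern matches at the positions where "gun" and "ban" start, even though the exact vowel symbols differ, which is the practical benefit of searching by class.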

Wave editor

The Wave Editor supports essential operations like selection, playback, copy, cut, paste, and amplification. It can display:

Waveform
Spectrogram (including fundamental frequency)
Power panel

[Image: graphs]

Spectrogram settings (right-click in the Spectrum panel > Spectrum > Spectrum settings) allow adjusting window length, overlap type, pre-emphasis, spectrogram type, and LPC order. Additionally, the spectrum detail can show the frequency composition of the selected sample. The spectrum detail can easily be saved as a picture, inserted into the Report, and compared with spectrum details from other parts of the same recording or even other recordings:

[Image: spectrum_detail]
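To give a feel for what some of the spectrogram settings control, here is a minimal sketch of pre-emphasis and of how window length and overlap determine the analysis frames. The filter coefficient, the interpretation of overlap as a fraction of the window, and the function names are assumptions for illustration; they do not describe VIN's internal processing.

```python
def pre_emphasis(samples, coeff=0.97):
    """First-order high-pass filter: y[n] = x[n] - coeff * x[n-1]."""
    if not samples:
        return []
    rest = [samples[n] - coeff * samples[n - 1] for n in range(1, len(samples))]
    return [samples[0]] + rest

def frame_starts(num_samples, window_len, overlap):
    """Start indices of analysis frames.

    `overlap` is the fraction of each window shared with the next
    one (0 <= overlap < 1), so the hop size is window_len * (1 - overlap).
    """
    hop = max(1, int(window_len * (1.0 - overlap)))
    return list(range(0, num_samples - window_len + 1, hop))

# 1000 samples, 400-sample windows, 50 % overlap.
starts = frame_starts(1000, 400, 0.5)
print(starts)

# A constant signal is strongly attenuated by pre-emphasis.
print(pre_emphasis([1.0, 1.0, 1.0], coeff=0.5))
```

Longer windows give finer frequency resolution but coarser time resolution, while more overlap produces more frames for the same audio.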

The Wave Editor also includes the "Generic panel," which can be used for adding notes and manual transcriptions. By selecting several sections of the recording and pressing F7, you can listen to the selections (shortcut Ctrl-Space), annotate them, and proceed to the next selected section by simply pressing Enter, without interrupting the workflow with unnecessary mouse clicks. The labels on the Generic panel can be filtered and processed in batches (double-click one of them and all identical labels will be selected too):

[Image: generic_panel]
