Speaker Identification
Phonexia Speaker Identification (SID) uses voice biometrics to recognize speakers by their voice, determining whether the voices in two recordings belong to the same person or to two different people. As a regular participant in the NIST Speaker Recognition Evaluations (SRE) series, Phonexia aims to contribute to research and to the calibration of technical capabilities in text-independent speaker recognition. Participation also helps improve the technology and identify the most promising algorithmic approaches for future production-grade solutions.
Usage in Voice Inspector
While the technology itself supports various speaker recognition tasks, Voice Inspector uses the one-to-many (1:N) approach, where "1" represents the known (reference) speaker and "N" represents one or more unknown (questioned) recordings being evaluated.
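The 1:N arrangement can be pictured as one reference voiceprint scored against each questioned voiceprint in turn. The following is a minimal sketch only: the names `compare_one_to_many` and `dot_score` are hypothetical, voiceprints are stood in for by plain lists of numbers, and the dot product is a placeholder, not the scoring that Voice Inspector actually uses.

```python
from typing import Callable, Dict, Sequence

# Assumption: a voiceprint is represented here as a plain sequence of floats.
Voiceprint = Sequence[float]


def dot_score(a: Voiceprint, b: Voiceprint) -> float:
    """Placeholder comparison (dot product); not the product's real scoring."""
    return sum(x * y for x, y in zip(a, b))


def compare_one_to_many(
    reference: Voiceprint,
    questioned: Dict[str, Voiceprint],
    compare: Callable[[Voiceprint, Voiceprint], float],
) -> Dict[str, float]:
    """Score the single reference voiceprint against every questioned one."""
    return {name: compare(reference, vp) for name, vp in questioned.items()}


scores = compare_one_to_many(
    reference=[1.0, 0.0],
    questioned={"q1": [1.0, 0.0], "q2": [0.0, 1.0]},
    compare=dot_score,
)
```

Each questioned recording gets its own score, so the results can be reviewed independently of one another.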
How does it work?
The technology is based on the fact that the speech organs and speaking habits of every person are more or less unique. As a result, the characteristics (or features) of the speech signal captured in a recording are also more or less unique. Consequently, the technology can be language-, accent-, text-, and channel-independent.

Automatic speaker recognition systems are based on the extraction of unique features from voices and their comparison. The systems thus usually comprise two distinct steps: Voiceprint Extraction and Voiceprint Comparison.

During Voiceprint Extraction, acoustic features are extracted from a recording and used to generate a speaker model, which is then transformed into a small but highly representative numerical representation called a voiceprint. Throughout this process, SID technology applies state-of-the-art channel compensation techniques.

Any extracted voiceprint can be compared with existing voiceprints. The system returns a score for each comparison, which is then used to compute the likelihood ratio (LR) and the log-likelihood ratio (LLR).
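LR and LLR carry the same information on different scales: the LLR is the logarithm of the LR. The sketch below shows only that relationship. It does not show how the raw comparison score is mapped to an LR, which is not described here, and the logarithm base is an assumption (natural logarithm by default, with base 10 as an option). The function names are hypothetical.

```python
import math


def llr_from_lr(lr: float, base: float = math.e) -> float:
    """Convert a likelihood ratio to a log-likelihood ratio.

    Assumption: natural logarithm unless another base is given.
    """
    if lr <= 0:
        raise ValueError("likelihood ratio must be positive")
    return math.log(lr, base)


def lr_from_llr(llr: float, base: float = math.e) -> float:
    """Convert a log-likelihood ratio back to a likelihood ratio."""
    return base ** llr


def favors_same_speaker(llr: float) -> bool:
    """LLR > 0 (LR > 1) means the evidence favors the same-speaker hypothesis."""
    return llr > 0
```

An LR of 1 corresponds to an LLR of 0, meaning the comparison favors neither the same-speaker nor the different-speaker hypothesis.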
Recommended length of recordings
| Recording | Net speech | Description |
|---|---|---|
| Questioned recording | 7 seconds | Recommended length of net speech for creating the voiceprint used in the comparison |
| Reference recording | 20 seconds | Recommended length of net speech for creating an accurate speaker profile of the known speaker |
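As an illustration, the recommendations in the table can be turned into a simple pre-check before extraction. This is a sketch under assumptions: the names `RECOMMENDED_NET_SPEECH_S` and `meets_recommendation` are hypothetical, and the net speech duration is assumed to have been measured already (for example, by a voice activity detector).

```python
# Recommended net speech in seconds, taken from the table above.
RECOMMENDED_NET_SPEECH_S = {
    "questioned": 7.0,
    "reference": 20.0,
}


def meets_recommendation(role: str, net_speech_s: float) -> bool:
    """Return True if a recording has at least the recommended net speech.

    `role` is "questioned" or "reference"; any other value raises KeyError.
    """
    return net_speech_s >= RECOMMENDED_NET_SPEECH_S[role]


ok_questioned = meets_recommendation("questioned", 9.3)
ok_reference = meets_recommendation("reference", 12.5)
```

Here the questioned recording passes the check, while the reference recording falls short of the recommended 20 seconds.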
Voiceprint
A voiceprint (also known as an x-vector) is a fixed-length vector that captures the most distinctive characteristics of a speaker’s voice. Voiceprints are unique to each individual, similar to fingerprints or retinal scans.
What can be learned from a voiceprint?
- Similarity to another voiceprint
- Speaker’s gender
- Total length of original recordings
- Amount of speech used for voiceprint extraction
What cannot be learned from a voiceprint?
The most important feature of Phonexia Speaker Identification technology is that the audio source, call content, and original voice sound cannot be retrieved from a voiceprint. The voiceprint file contains only statistical information about the unique characteristics of the voice. It does not include information that could be used to reconstruct the audio contents (i.e., who said what) or to drive text-to-speech (TTS) voice synthesis systems.