
Speaker Identification

Phonexia Speaker Identification (SID) uses voice biometrics to recognize speakers by their voice, determining whether the voices in two recordings belong to the same person or to two different people. As a regular participant in the NIST Speaker Recognition Evaluation (SRE) series, we aim to contribute to research efforts and to the calibration of technical capabilities in text-independent speaker recognition. Participating also helps us advance the technology and identify the most promising algorithmic approaches for our future production-grade solutions.

Usage in Voice Inspector

While the technology itself supports various speaker recognition tasks, Voice Inspector uses the one-to-many (1:N) approach, where "1" represents the known (reference) speaker and "N" represents one or more unknown (questioned) recordings being evaluated.
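The 1:N layout can be sketched in a few lines of Python. This is a minimal illustration and not Voice Inspector's implementation: the function names, the representation of a voiceprint as a plain list of floats, and the use of cosine similarity as the scoring function are all assumptions made for the example.

```python
import math
from typing import Callable, Dict, Sequence

# Hypothetical stand-in: a voiceprint is treated as a plain sequence of floats.
Voiceprint = Sequence[float]


def cosine_score(a: Voiceprint, b: Voiceprint) -> float:
    """Cosine similarity, used here only as a placeholder scoring function."""
    if len(a) != len(b):
        raise ValueError("voiceprints must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("voiceprints must be non-zero")
    return dot / (norm_a * norm_b)


def compare_one_to_many(
    reference: Voiceprint,
    questioned: Dict[str, Voiceprint],
    score_fn: Callable[[Voiceprint, Voiceprint], float] = cosine_score,
) -> Dict[str, float]:
    """Score the single reference ("1") against each questioned item ("N")."""
    return {name: score_fn(reference, vp) for name, vp in questioned.items()}
```

Each questioned recording is scored against the same reference independently, so adding or removing a questioned recording does not change the scores of the others.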

How does it work?

The technology is based on the fact that the speech organs and speaking habits of every person are more or less unique. As a result, the characteristics (or features) of the speech signal captured in a recording are also more or less unique. Consequently, the technology can be language-, accent-, text-, and channel-independent.

[Figure: Speech organs]

Automatic speaker recognition systems are based on the extraction of unique features from voices and their comparison. The systems thus usually comprise two distinct steps: Voiceprint Extraction and Voiceprint Comparison.
During Voiceprint Extraction, acoustic features are extracted from a recording and used to generate a speaker model, which is then transformed into a small but highly representative numerical representation called a voiceprint. Throughout this process, SID technology applies state-of-the-art channel compensation techniques.
Any extracted voiceprint can be compared with existing voiceprints. The system returns a score for each comparison, which is then used to compute the likelihood ratio (LR) and the log-likelihood ratio (LLR).
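The relationship between LR and LLR can be shown with a short Python sketch. It covers only the conversion between the two quantities. How a raw comparison score is mapped to an LR is not described here, so that step is left out. The function names are hypothetical, and the natural logarithm is an assumed default (some tools report base-10 values instead, which is why the base is a parameter).

```python
import math


def llr_from_lr(lr: float, base: float = math.e) -> float:
    """Convert a likelihood ratio to a log-likelihood ratio.

    The logarithm base is an assumption; pass base=10 for base-10 values.
    """
    if lr <= 0:
        raise ValueError("a likelihood ratio must be positive")
    return math.log(lr, base)


def lr_from_llr(llr: float, base: float = math.e) -> float:
    """Inverse conversion: recover the likelihood ratio from the LLR."""
    return base ** llr


def supported_hypothesis(llr: float) -> str:
    """LLR > 0 (LR > 1) favors the same-speaker hypothesis, LLR < 0 the opposite."""
    if llr > 0:
        return "same speaker"
    if llr < 0:
        return "different speakers"
    return "neither"
```

An LR of 1 (LLR of 0) means the evidence supports neither hypothesis. The further the LLR lies from zero, the stronger the support for one of them.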

Recording | Net speech | Description
Questioned recording | 7 seconds | Recommended length of net speech for creating the voiceprint used in the comparison
Reference recording | 20 seconds | Recommended length of net speech for creating an accurate speaker profile of the known speaker
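These recommendations could be checked before extraction with a small helper such as the one below. It is a hypothetical sketch: the names are invented for illustration, and treating the recommended lengths as minimum thresholds is an assumption.

```python
# Recommended net speech in seconds, taken from the table above.
RECOMMENDED_NET_SPEECH_S = {
    "questioned": 7.0,
    "reference": 20.0,
}


def meets_recommendation(kind: str, net_speech_seconds: float) -> bool:
    """Return True if the recording has at least the recommended net speech.

    Assumption: the recommended length is treated as a lower bound.
    """
    if kind not in RECOMMENDED_NET_SPEECH_S:
        raise ValueError(f"unknown recording kind: {kind!r}")
    return net_speech_seconds >= RECOMMENDED_NET_SPEECH_S[kind]
```

For example, a reference recording with 12 seconds of net speech falls short of the recommendation, while a questioned recording of the same length satisfies it.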

Voiceprint

A voiceprint (also known as an x-vector) is a fixed-length vector that captures the most distinctive characteristics of a speaker’s voice. Like fingerprints or retinal scans, voiceprints are specific to each individual.

What can be learned from a voiceprint?

  • Similarity to another voiceprint
  • Speaker’s gender
  • Total length of original recordings
  • Amount of speech used for voiceprint extraction

What cannot be learned from a voiceprint?

The most important feature of Phonexia Speaker Identification technology is that the audio source, call content, and original voice sound cannot be retrieved from a voiceprint. The voiceprint file contains only statistical information about the unique characteristics of the voice. It does not include information that could be used to reconstruct the audio content (i.e., who said what) or to drive voice synthesis (text-to-speech, TTS) systems.