Version: 2.1.0

Speaker Identification

Phonexia Speaker Identification (SID) uses voice biometry to recognize speakers by their voice, i.e. to decide whether the voice in two recordings belongs to the same person or to two different people. As a regular participant in the NIST Speaker Recognition Evaluations (SRE) series, our goal is to contribute to the direction of research efforts and the calibration of technical capabilities of text-independent speaker recognition. The objective is to drive the technology forward and, through competition, to find the most promising algorithmic approaches for our future production-grade technology.

Basic use cases and application areas

The technology can be used for various speaker recognition tasks. One basic distinction is based on the kind of question we want to answer.

Speaker Identification is the case when we are asking “Whose voice is this?”, such as in fake emergency calls.
Usually this entails one-to-many (1:n) or many-to-many (n:n) comparisons.

Speaker identification

Speaker Search is the case when we are asking “Where is this voice speaking?”, i.e. when looking for a speaker inside a large archive.

Speaker Spotting is the case when we are monitoring a large number of audio recordings or streams and looking for occurrences of one or more specific speakers.
Speaker spotting can be deployed for the purpose of Fraud Alert.

Speaker Verification is the case when we are asking “Is this Peter Smith’s voice?”, such as when a person calls the bank and says, “Hello, this is Peter Smith!”.
This approach of one-to-one (1:1) verification is also employed in Voice-As-a-Password systems, which can add further security to multi-factor authentication over the telephone.

Speaker verification

Large-scale automatic speaker identification is also successfully used by law enforcement agencies during investigations for database searches and the ranking of suspects. In later stages of a case, Forensic Voice Analysis uses smaller amounts of data and 1:1 comparisons to evaluate evidence and to establish the probability of a speaker’s identity for use in court.

Usual suspects

How does it work?

The technology is based on the fact that the speech organs and the speaking habits of every person are more or less unique. As a result, the characteristics (or features) of the speech signal captured in a recording are also more or less unique, so the technology can be language-, accent-, text-, and channel-independent.

Speech organs

Automatic speaker recognition systems are based on the extraction of the unique features from voices and their comparison. The systems thus usually comprise two distinct steps: Voiceprint Extraction (Speaker enrollment) and Voiceprint comparison.

Voiceprint extraction is the most time-consuming part of the process. Voiceprint comparison, on the other hand, is extremely fast: millions of voiceprint comparisons can be done in 1 second.

Voiceprint extraction (Speaker enrollment)

Speaker enrollment starts with the extraction of acoustic features from a recording of a known speaker. The process continues with the creation of a speaker model, which is then transformed into a small but highly representative numerical representation called a voiceprint. During this process, the SID technology applies state-of-the-art channel compensation techniques. The voiceprint is a fixed-length matrix that captures the most distinctive characteristics of a speaker’s voice. It cannot be used to recreate the original audio file, which is useful when the content has to stay anonymous.

Speaker enrollment

Voiceprint comparison

Any voiceprint of an unknown speaker can then be compared with existing enrollment voiceprints and the system returns a score for each comparison.
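The one-to-many comparison described above can be sketched roughly as follows. This is only an illustration under loud assumptions: the voiceprints are made-up flat lists of numbers, the names `cosine_score` and `rank_speakers` are hypothetical, and cosine similarity is used purely as a stand-in, since the actual scoring used by the product is not described in this section.

```python
import math

def cosine_score(a, b):
    """Stand-in similarity score (an assumption, NOT the product's scoring)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def rank_speakers(unknown, enrolled):
    """Compare one unknown voiceprint with every enrolled one (1:n), best match first."""
    scores = [(name, cosine_score(unknown, vp)) for name, vp in enrolled.items()]
    return sorted(scores, key=lambda item: item[1], reverse=True)

# Toy, made-up 3-dimensional "voiceprints" (real voiceprints are much larger).
enrolled = {
    "speaker_a": [0.9, 0.1, 0.0],
    "speaker_b": [0.1, 0.8, 0.3],
}
unknown = [0.8, 0.2, 0.1]

ranking = rank_speakers(unknown, enrolled)
for name, score in ranking:
    print(name, round(score, 3))
```

Each enrolled speaker receives one score, and sorting the scores gives the ranked candidate list typical for identification tasks.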

Scoring and conversion to percentage

The score produced by comparing two voiceprints is based on two estimated probabilities (P): the probability of getting the given evidence (the compared voiceprints) if the speakers in the two voiceprints are the same person, and the probability of getting the same evidence if they are two different people. The ratio of these two probabilities is called the Likelihood Ratio (LR), which is often expressed as a logarithm, the Log Likelihood Ratio (LLR).

$$score=\log_e\frac{P(evidence \mid person)}{P(evidence \mid someone\ else)}$$

Transformation to confidence (or percentage) is usually done using a sigmoid function:

$$confidence=\frac{1}{1+e^{-sharpness\times(score+shift)}}$$

$$percentage=confidence \times 100$$

where:

  • shift moves the score so that it is 0 at the ideal decision point (default is 0)
  • sharpness specifies how the dynamic range of score is used (default is 1)

The shift value can be obtained by performing a proper SID evaluation – see the chapter below for details.
The sharpness value can be chosen according to the desired steepness of the sigmoid function:

  • higher sharpness means a steeper function, i.e. a sharper transition between lower and higher percentages, and only small differences among the low and among the high percentages
  • lower sharpness means a less steep function, i.e. a more linear transition
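As a minimal sketch, the conversion from score to confidence and percentage can be written in a few lines of Python. The function names are hypothetical and the scores are made up. The two-branch form is only a numerical safeguard against overflow for very negative scores and computes the same sigmoid as the formula above.

```python
import math

def score_to_confidence(score, shift=0.0, sharpness=1.0):
    """Map a score to a confidence in (0, 1) using the sigmoid from the text."""
    z = sharpness * (score + shift)
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    # Algebraically identical form that avoids overflow for very negative z.
    ez = math.exp(z)
    return ez / (1.0 + ez)

def score_to_percentage(score, shift=0.0, sharpness=1.0):
    """Confidence expressed as a percentage."""
    return score_to_confidence(score, shift, sharpness) * 100.0

# Compare the default sharpness with a flatter curve on a few made-up scores.
for s in (-5.0, 0.0, 5.0):
    default = score_to_percentage(s)
    flatter = score_to_percentage(s, sharpness=0.2)
    print(f"score={s:+.1f}  sharpness=1: {default:.1f}%  sharpness=0.2: {flatter:.1f}%")
```

With the default shift of 0, a score of 0 always maps to 50%. With both defaults, the confidence equals LR / (1 + LR), because the score is the natural logarithm of the likelihood ratio.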

The interactive graph below should help you to understand the relationship between score and confidence, and how the steepness of the sigmoid function is controlled by the sharpness value.

Interactive graph: confidence (0 to 1.0) as a function of score (−15 to 15), with adjustable shift (default 0) and sharpness (default 1)

SID evaluation

Before implementing Speaker Identification, it’s important to evaluate its accuracy using real data from the production environment. To evaluate the SID system, you’ll need enough labeled data, i.e. recordings with speaker labels.

The principle of SID system evaluation is to compare (the voiceprints of) all the individual recordings against each other and check the results of all the comparisons. Because the labels tell us which comparisons involve the same speaker (called target trials) and which involve different speakers (called non-target trials), it’s also known which comparisons should give a high score and which should give a low score.
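The pairing of labeled recordings into trials can be sketched as follows. The recording IDs, speaker labels, and the `build_trials` name are made up for illustration. In a real evaluation, each pair would then be scored by comparing the two voiceprints.

```python
from itertools import combinations

def build_trials(recordings):
    """Split all recording pairs into target and non-target trials.

    recordings: list of (recording_id, speaker_label) tuples.
    """
    target, non_target = [], []
    for (rec_a, spk_a), (rec_b, spk_b) in combinations(recordings, 2):
        pair = (rec_a, rec_b)
        if spk_a == spk_b:
            target.append(pair)      # same speaker: should score high
        else:
            non_target.append(pair)  # different speakers: should score low
    return target, non_target

# Made-up labeled data set.
recordings = [
    ("r1", "alice"), ("r2", "alice"),
    ("r3", "bob"), ("r4", "bob"),
    ("r5", "carol"),
]
target_trials, non_target_trials = build_trials(recordings)
print(len(target_trials), "target trials,", len(non_target_trials), "non-target trials")
```

Note that the number of non-target trials grows much faster than the number of target trials, so a data set needs several recordings per speaker to yield enough target trials.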

In the process of voiceprint comparison, two types of error can occur:

  • False Rejection occurs when the system incorrectly rejects a target trial, i.e. the system says that the voices are different even though they belong to the same person.
  • False Acceptance occurs when the system incorrectly accepts a non-target trial, i.e. the system says that the voices are the same even though they belong to different persons.

One way to measure the performance of a Speaker Identification system is to calculate the trade-off between these two errors, which can be shown in a Detection Error Tradeoff (DET) graph. By decreasing the threshold for acceptance we decrease the probability of a false rejection, but at the same time we increase the probability of a false acceptance.
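The trade-off can be illustrated with a small sketch that counts both error types at several thresholds. The scores are made up, the `error_rates` name is hypothetical, and a trial is assumed to be accepted when its score is greater than or equal to the threshold.

```python
def error_rates(target_scores, non_target_scores, threshold):
    """Return (false_rejection_rate, false_acceptance_rate) at a threshold.

    Assumes a trial is accepted when score >= threshold.
    """
    false_rejections = sum(1 for s in target_scores if s < threshold)
    false_acceptances = sum(1 for s in non_target_scores if s >= threshold)
    return (false_rejections / len(target_scores),
            false_acceptances / len(non_target_scores))

# Made-up scores for illustration only.
target_scores = [4.2, 3.1, 0.5, 6.0, -0.8]
non_target_scores = [-5.0, -2.3, 0.9, -7.1, -1.0]

for threshold in (-2.0, 0.0, 2.0):
    frr, far = error_rates(target_scores, non_target_scores, threshold)
    print(f"threshold={threshold:+.1f}  FRR={frr:.2f}  FAR={far:.2f}")
```

Lowering the threshold reduces the false rejection rate and raises the false acceptance rate. Plotting the two rates against each other over all thresholds gives the points of a DET graph.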

DET with description

In an ideal system, we want both errors to be as small as possible. Better performance is indicated in a DET graph by the red line being closer to the origin (0 at both the x and y axes).
By properly setting the acceptance threshold the system can be adjusted for a particular use case.
For example, in the case of voice-as-a-password for the authentication of bank transfers when high security is desirable, the threshold should also be high.
For law enforcement agencies looking for any leads in a case, a higher false acceptance rate is an acceptable price to pay for not missing a suspect’s call.

DET with use cases
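One possible way to pick a threshold for a given use case is to fix the highest false acceptance rate that can be tolerated and derive the threshold from the non-target scores of an evaluation. The sketch below assumes made-up scores and made-up tolerance values, uses hypothetical function names, and requires Python 3.9 or later for `math.nextafter`.

```python
import math

def threshold_for_max_far(non_target_scores, max_far):
    """Lowest threshold (accept if score >= threshold) with FAR <= max_far."""
    ordered = sorted(non_target_scores, reverse=True)
    # Number of non-target trials that may be accepted (rounded down).
    allowed = int(max_far * len(ordered))
    if allowed >= len(ordered):
        return float("-inf")  # every trial may be accepted
    # Place the threshold just above the first score that must be rejected.
    return math.nextafter(ordered[allowed], math.inf)

def false_acceptance_rate(non_target_scores, threshold):
    """Fraction of non-target trials accepted at the given threshold."""
    return sum(s >= threshold for s in non_target_scores) / len(non_target_scores)

non_target_scores = [-5.0, -2.3, 0.9, -7.1, -1.0]  # made-up scores

# Made-up tolerances: strict for authentication, lenient for investigative leads.
strict = threshold_for_max_far(non_target_scores, max_far=0.0)
lenient = threshold_for_max_far(non_target_scores, max_far=0.4)

print("strict threshold:", strict, "FAR:", false_acceptance_rate(non_target_scores, strict))
print("lenient threshold:", lenient, "FAR:", false_acceptance_rate(non_target_scores, lenient))
```

The stricter tolerance yields the higher threshold, matching the high-security case above, while the lenient tolerance lowers the threshold so that fewer genuine matches are missed.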

The operating point at which the false acceptance rate equals the false rejection rate is called the Equal Error Rate (EER). It is a common measure of the system’s overall performance.
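A simple way to approximate the EER from evaluation scores is to sweep the threshold over all observed scores and keep the point where the two error rates are closest. This is a rough sketch with made-up scores and a hypothetical function name. On small data sets the two rates may never be exactly equal, so the sketch reports their average at the closest point.

```python
def equal_error_rate(target_scores, non_target_scores):
    """Approximate the EER by sweeping thresholds over all observed scores.

    Returns (eer_estimate, threshold). Assumes accept if score >= threshold.
    """
    best = None
    for threshold in sorted(set(target_scores) | set(non_target_scores)):
        frr = sum(s < threshold for s in target_scores) / len(target_scores)
        far = sum(s >= threshold for s in non_target_scores) / len(non_target_scores)
        gap = abs(far - frr)
        if best is None or gap < best[0]:
            best = (gap, (far + frr) / 2, threshold)
    return best[1], best[2]

# Made-up scores for illustration only.
target_scores = [4.2, 3.1, 0.5, 6.0, -0.8]
non_target_scores = [-5.0, -2.3, 0.9, -7.1, -1.0]

eer, eer_threshold = equal_error_rate(target_scores, non_target_scores)
print(f"EER is about {eer:.0%} at threshold {eer_threshold}")
```

The threshold found this way is also a natural candidate for the shift parameter of the score-to-confidence conversion (with the opposite sign), since shift is meant to move the ideal decision point to a score of 0.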