Version: 5.0.0

Deepfake Detection

Phonexia has developed the Deepfake Detection technology to enable identifying artificial voices within audio recordings, thereby enhancing the security and reliability of speaker verification systems.

Standard Deepfake Detection

The Standard Deepfake Detection technology uses a single "generic" model that analyzes individual audio recordings. This approach leverages a transformer-based architecture and is primarily trained on datasets which encompass a wide range of synthesized and converted speech examples.

The model is trained in a self-supervised manner, and its performance is improved through carefully designed data augmentation techniques. It has been trained on a large corpus drawn from various data sources, including telephone data, which results in fewer false alarms on such recordings. The model requires a minimum of 3 seconds of continuous speech for inference; however, the best performance is achieved with 5 seconds of speech or more.
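The length requirements above can be checked before a recording is submitted. The following is a minimal sketch, not part of the Phonexia API: the function name `check_speech_length` and its return labels are hypothetical, and it assumes the amount of speech in seconds has already been measured elsewhere (for example by a voice activity detector).

```python
# Thresholds taken from the text above.
MIN_SPEECH_SECONDS = 3.0          # minimum required for inference
RECOMMENDED_SPEECH_SECONDS = 5.0  # best performance at this length and above


def check_speech_length(speech_seconds: float) -> str:
    """Classify a recording by its amount of speech (hypothetical helper).

    `speech_seconds` is assumed to be the duration of speech in the
    recording, measured by some other component.
    """
    if speech_seconds < 0:
        raise ValueError("speech_seconds must be non-negative")
    if speech_seconds < MIN_SPEECH_SECONDS:
        return "too_short"      # below the stated minimum
    if speech_seconds < RECOMMENDED_SPEECH_SECONDS:
        return "acceptable"     # usable, but not the best-performing range
    return "recommended"        # 5 seconds or more
```

A caller might reject "too_short" recordings outright and flag "acceptable" ones so that their scores are interpreted with more caution.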

Referential Deepfake Detection

The Referential Deepfake Detection technology introduces a comparative approach that analyzes two recordings: a reference recording (a known bona fide, authentic recording of a speaker) and a questioned recording (which needs to be evaluated for authenticity). This technology provides enhanced accuracy by leveraging the reference audio for comparison.

XL5 Model

Referential Deepfake Detection uses the XL5 model, which:

  • Is based on Phonexia Speaker Identification technology
  • Uses audio recordings as both reference and questioned input data

Possible use cases

Standard Deepfake Detection

  1. Banks and Call Centers: Enhances the security of customer interactions by ensuring that communications are with legitimate individuals, thereby preventing fraudulent activities and unauthorized access.
  2. Forensic Analysis: Assists law enforcement agencies in authenticating audio evidence, ensuring its credibility in investigations and legal proceedings.

Referential Deepfake Detection

  1. Speaker Verification Enhancement: When you have an authentic recording of a speaker, you can verify if new recordings are genuine by comparing them against the reference.
  2. Voice Cloning Detection: Detect sophisticated voice cloning attempts where an attacker has created synthetic speech mimicking a specific individual's voice.
  3. Media Authentication: Verify the authenticity of audio content when you have access to genuine recordings from the same speaker.
  4. Legal and Forensic Evidence: Authenticate audio evidence more robustly when reference recordings are available from the same individual.
  5. Corporate Security: Verify authenticity of audio communications from executives or key personnel when reference voice samples are available.

Scoring

Both Standard and Referential Deepfake Detection technologies use LLR (log-likelihood ratio) scoring:

  • A positive score (> 0) indicates the audio is more likely a deepfake.
  • A negative score (< 0) suggests the audio is more likely genuine.

Both systems are calibrated so that a score of 0 corresponds to the point of equal likelihood between the two classes on evaluation datasets. At this point the model is maximally uncertain: it considers both outcomes equally probable.
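To make the meaning of the score more concrete, an LLR can be converted into a probability. The sketch below is an illustration under stated assumptions, not a description of how the product computes anything: it assumes the score is a natural-logarithm LLR, that the calibration holds on your data, and that you supply a prior probability of a deepfake. The function name `deepfake_posterior` is hypothetical.

```python
import math


def deepfake_posterior(llr: float, prior_deepfake: float = 0.5) -> float:
    """Convert an LLR score into a posterior probability of 'deepfake'.

    Assumptions (not confirmed by the documentation): the LLR uses the
    natural logarithm and is well calibrated for the data at hand.
    """
    if not 0.0 < prior_deepfake < 1.0:
        raise ValueError("prior_deepfake must be strictly between 0 and 1")
    # Posterior log-odds = LLR + prior log-odds (Bayes' rule in log form).
    log_odds = llr + math.log(prior_deepfake / (1.0 - prior_deepfake))
    # Numerically stable logistic function.
    if log_odds >= 0:
        return 1.0 / (1.0 + math.exp(-log_odds))
    e = math.exp(log_odds)
    return e / (1.0 + e)
```

With equal priors, a score of 0 maps to a probability of 0.5, which matches the "equally probable" reading above. If deepfakes are rare in your traffic, a lower prior shifts the same score toward the genuine class.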

The optimal decision threshold may differ from 0 depending on your application. To achieve the desired trade-off between false positives and false negatives, you may need to adjust the threshold based on your specific dataset and requirements.
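One way to choose such a threshold is to score a labeled set of your own recordings and measure the error rates at candidate thresholds. The following sketch assumes you already have two lists of scores, one for recordings known to be genuine and one for known deepfakes. The function names `error_rates` and `pick_threshold`, and the rule that a score strictly above the threshold counts as a deepfake, are illustrative assumptions.

```python
def error_rates(genuine_scores, deepfake_scores, threshold):
    """Return (false_positive_rate, false_negative_rate) at a threshold.

    A recording is flagged as a deepfake when its score is strictly
    greater than the threshold (an assumed convention).
    """
    if not genuine_scores or not deepfake_scores:
        raise ValueError("both score lists must be non-empty")
    # Genuine recordings wrongly flagged as deepfakes.
    false_pos = sum(s > threshold for s in genuine_scores)
    # Deepfakes that were not flagged.
    false_neg = sum(s <= threshold for s in deepfake_scores)
    return false_pos / len(genuine_scores), false_neg / len(deepfake_scores)


def pick_threshold(genuine_scores, deepfake_scores, max_fpr):
    """Return the lowest candidate threshold with FPR <= max_fpr.

    The false positive rate never increases as the threshold rises, so
    the lowest qualifying threshold also gives the lowest false negative
    rate among the candidates that meet the constraint.
    """
    if max_fpr < 0:
        raise ValueError("max_fpr must be non-negative")
    candidates = sorted(set(genuine_scores) | set(deepfake_scores) | {0.0})
    for t in candidates:
        fpr, _ = error_rates(genuine_scores, deepfake_scores, t)
        if fpr <= max_fpr:
            return t
    return candidates[-1]  # unreachable for max_fpr >= 0; kept as a safeguard
```

For instance, calling `pick_threshold(genuine, deepfake, max_fpr=0.01)` returns the lowest observed score at which at most 1% of the genuine recordings in your set would be flagged. The result is only as representative as the labeled set it was computed on.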

Output Range

The score is returned as an unbounded LLR, theoretically ranging from minus infinity to plus infinity. However, in practice, values typically fall within the range of -10 to 10.

FAQ

How can I improve processing speed?

To speed up processing, ensure that Deepfake Detection is running on a GPU.

What's the difference between Standard and Referential Deepfake Detection?

Standard Deepfake Detection analyzes a single audio recording to determine whether it is a deepfake or genuine audio. Referential Deepfake Detection compares two recordings, a reference (authentic) recording and a questioned recording, to determine whether the questioned recording is authentic relative to the reference.

When should I use Referential Deepfake Detection instead of Standard Deepfake Detection?

Use Referential Deepfake Detection when:

  • You have access to known authentic recordings from the same speaker as in the suspected recording
  • You need higher accuracy for critical applications
  • You want to detect sophisticated voice cloning attempts targeting specific individuals