Version: 3.4.0

Speaker Verification

This guide demonstrates speaker verification: confirming that the speaker in an audio recording is the person they claim to be. Suppose we already have an audio recording with the voice of John Doe and want to verify that John Doe is also the speaker in another recording. This article describes how you can achieve that using our software.

The process of verification consists of voiceprint extraction and voiceprint comparison. Both steps are explained in the following sections.

Attached, you will find a ZIP file named recordings.zip containing two mono-channel audio files – john_doe.wav and unknown.wav, which will be used as examples throughout the guide.

At the end of this guide, you will find a full Python code example that combines all the steps discussed. You can use it as a starting point for implementing speaker verification in your own projects.

Environment setup

This example uses Python 3.9 and version 2.27 of the requests library. You can install requests with pip as follows:

pip install requests~=2.27

Then, you can import the following libraries (time is built-in):

import time
import requests

Voiceprint extraction

In order to trigger voiceprint extraction for a single audio file, you should start by sending a POST request to /api/technology/speaker-identification-voiceprint-extraction as follows:

with open("john_doe.wav", mode="rb") as file:
    files = {"file": file}
    # Replace <speech-platform-server> with the actual server address
    response = requests.post(
        "https://<speech-platform-server>/api/technology/speaker-identification-voiceprint-extraction",
        files=files,
    )
response.raise_for_status()

If the task is successfully accepted, the server returns a 202 status code together with a unique task ID in the response body. The task isn't processed immediately; it is only scheduled for processing. You can check the current task status while polling for the result.

The URL for polling the result is returned in the X-Location header. Alternatively, you can assemble the polling URL on your own by appending a slash (/) and task ID to the initial voiceprint extraction URL.
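As a rough sketch of the second option, the helper below joins the extraction URL and the task ID. The function name build_polling_url is hypothetical, and taking the ID from the task.task_id field assumes that the 202 response body uses the same layout as the task results shown in this guide.

```python
def build_polling_url(technology_url: str, task_id: str) -> str:
    """Append a slash and the task ID to the URL the task was posted to."""
    # rstrip avoids a double slash if the base URL already ends with one.
    return f"{technology_url.rstrip('/')}/{task_id}"


# Illustrative values only. In practice the task ID would come from the 202
# response body, assumed here to be response.json()["task"]["task_id"].
extraction_url = (
    "https://<speech-platform-server>"
    "/api/technology/speaker-identification-voiceprint-extraction"
)
print(build_polling_url(extraction_url, "fb9de4e5-a768-4069-aff3-c74c826f3ddf"))
```

The examples in this guide read the polling URL from the response header instead: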

polling_url = response.headers["x-location"]

counter = 0
while counter < 100:
    response = requests.get(polling_url)
    response.raise_for_status()
    data = response.json()
    task_status = data["task"]["state"]
    if task_status in ["done", "failed", "rejected"]:
        break
    counter += 1
    time.sleep(5)
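Note that the loop above stops silently after 100 attempts, even if the task is still unfinished. One possible way to make that case explicit is sketched below. The helper poll_until_finished and its parameters are hypothetical names, not part of the platform API. It accepts any callable that returns the parsed polling response, so it makes no assumptions about the HTTP client.

```python
import time
from typing import Callable

# Final task states, as used in the polling loop of this guide.
FINAL_STATES = {"done", "failed", "rejected"}


def poll_until_finished(
    fetch: Callable[[], dict], max_attempts: int = 100, delay: float = 5.0
) -> dict:
    """Call fetch() until the task reaches a final state or attempts run out."""
    for attempt in range(max_attempts):
        data = fetch()
        if data["task"]["state"] in FINAL_STATES:
            return data
        # Skip the pointless sleep after the last attempt.
        if attempt < max_attempts - 1:
            time.sleep(delay)
    raise TimeoutError(f"Task did not finish after {max_attempts} polls")
```

With requests, the callable could be a small function that performs requests.get(polling_url), calls raise_for_status() and returns response.json().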

Once the polling finishes, data will contain the latest response from the server: either a response with the extracted voiceprint(s), or an error message with details if processing could not finish. The extracted voiceprints are grouped by channel. Example result of a successful voiceprint extraction from a stereo file (voiceprints were shortened for readability):

{
    "task": {"task_id": "fb9de4e5-a768-4069-aff3-c74c826f3ddf", "state": "done"},
    "result": {
        "channels": [
            {
                "channel_number": 0,
                "voiceprint": "e2kDY3JjbDAWiyhpCWVtYmVkZGluZ1tkO/QWvmS8JkuGZDyv+F5kvJQzJ...",
                "speech_length": 49.08,
                "model": "sid-xl5",
            },
            {
                "channel_number": 1,
                "voiceprint": "e2kDY3JjbFL6NSxpCWVtYmVkZGluZ1tkO2ygcGS8CduAZDxa6ZBkPHrYf...",
                "speech_length": 116.35,
                "model": "sid-xl5",
            },
        ]
    },
}
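For multi-channel files, it can be handy to index the voiceprints by channel number. The following is a minimal sketch that assumes the result layout shown above; voiceprints_by_channel is a hypothetical helper name and the voiceprint strings are placeholders.

```python
def voiceprints_by_channel(data: dict) -> dict:
    """Map each channel number to the voiceprint extracted from that channel."""
    return {
        channel["channel_number"]: channel["voiceprint"]
        for channel in data["result"]["channels"]
    }


# Placeholder data mirroring the structure of the example result.
example = {
    "task": {"task_id": "<task-id>", "state": "done"},
    "result": {
        "channels": [
            {"channel_number": 0, "voiceprint": "<voiceprint-0>"},
            {"channel_number": 1, "voiceprint": "<voiceprint-1>"},
        ]
    },
}
print(voiceprints_by_channel(example)[1])
```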

Let's get back to our example with the mono-channel audio files. In our case, the target voiceprint can be accessed as follows:

known_audio_voiceprint = data["result"]["channels"][0]["voiceprint"]

Great, you've extracted your first voiceprint and stored it in a variable, congratulations! Now, repeat the same process for the second file: change john_doe.wav to unknown.wav and store the resulting voiceprint in another variable:

unknown_audio_voiceprint = data["result"]["channels"][0]["voiceprint"]
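Polling can also end in the failed or rejected state, in which case the result is not expected to contain a voiceprint. A defensive accessor might look like the sketch below. The name get_voiceprint is hypothetical, and the helper relies only on the task.state and result.channels fields shown earlier; the layout of error responses is not covered here, so the sketch reports just the final state.

```python
def get_voiceprint(data: dict, channel: int = 0) -> str:
    """Return the voiceprint for one channel, or raise if the task did not succeed."""
    state = data["task"]["state"]
    if state != "done":
        raise RuntimeError(f"Voiceprint extraction ended in state '{state}'")
    for item in data["result"]["channels"]:
        if item["channel_number"] == channel:
            return item["voiceprint"]
    raise KeyError(f"No voiceprint found for channel {channel}")
```

For the mono files used in this guide, get_voiceprint(data) would return the same value as data["result"]["channels"][0]["voiceprint"].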

Voiceprint comparison

The voiceprints are extracted, so we can move on to the actual comparison. By default, two non-empty sets of voiceprints are expected as input. Each voiceprint is expected to be a base64-encoded string, but you don't have to worry about that, because the voiceprint extraction already returns voiceprints in this format.

The voiceprint comparison is analogous to the voiceprint extraction: we start by requesting the voiceprint comparison task to be scheduled for our two voiceprint sets:

body = {
    "voiceprints_a": [known_audio_voiceprint],
    "voiceprints_b": [unknown_audio_voiceprint],
}

# Replace <speech-platform-server> with the actual server address
response = requests.post(
    "https://<speech-platform-server>/api/technology/speaker-identification-voiceprint-comparison",
    json=body,
)
response.raise_for_status()

Now, you can poll for the task result in the same way as for the voiceprint extraction above. The result is a comparison matrix whose shape is determined by the sizes of the two input voiceprint sets. In the case of speaker verification, the resulting matrix has a shape of 1x1, since we're comparing only two voiceprints with each other. After you finish polling, the result of the comparison can be found in data and should look like this:

{
    "task": {"task_id": "e44557e1-94ba-4272-929a-8a5ec32f6e96", "state": "done"},
    "result": {
        "scores": {"rows_count": 1, "columns_count": 1, "values": [2.1514739990234375]}
    },
}

In this case, the resulting score is very high, so it is very likely that John Doe is also speaking in the unknown.wav file. For details on how to interpret scores, see the scoring documentation.
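When the input sets contain more than one voiceprint each, the flat values list has to be mapped back to a matrix before individual scores can be read. The sketch below shows one way to do that and then turn a score into a yes/no decision. Both helper names are hypothetical. The sketch assumes the values are stored row by row (which makes no difference for the 1x1 case), and the threshold is a placeholder that should be chosen according to the scoring documentation, not a recommended value.

```python
def scores_to_matrix(scores: dict) -> list:
    """Reshape the flat list of scores into a list of rows."""
    rows = scores["rows_count"]
    cols = scores["columns_count"]
    values = scores["values"]
    if len(values) != rows * cols:
        raise ValueError("values length does not match rows_count * columns_count")
    # Assumption: values are stored row by row (row-major order).
    return [values[r * cols:(r + 1) * cols] for r in range(rows)]


def is_match(score: float, threshold: float) -> bool:
    """Accept the claimed identity when the score reaches the threshold."""
    return score >= threshold


EXAMPLE_THRESHOLD = 0.0  # placeholder only, not a recommended value

scores = {"rows_count": 1, "columns_count": 1, "values": [2.1514739990234375]}
matrix = scores_to_matrix(scores)
print(is_match(matrix[0][0], EXAMPLE_THRESHOLD))
```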

Full Python Code

Here is the full code for this example, slightly adjusted and wrapped into functions for better readability:

import time

import requests

SPEECH_PLATFORM_SERVER = "<speech-platform-server>"  # Replace with the actual server address

def poll_result(polling_url: str, sleep: int = 5):
    while True:
        response = requests.get(polling_url)
        response.raise_for_status()
        data = response.json()
        task_status = data["task"]["state"]
        if task_status in ["done", "failed", "rejected"]:
            break
        time.sleep(sleep)
    return response


def do_voiceprint_extraction(audio_path: str):
    with open(audio_path, mode="rb") as file:
        files = {"file": file}
        response = requests.post(
            f"https://{SPEECH_PLATFORM_SERVER}/api/technology/speaker-identification-voiceprint-extraction",
            files=files,
        )
    response.raise_for_status()
    polling_url = response.headers["x-location"]
    voiceprint_extraction_response = poll_result(polling_url)
    return voiceprint_extraction_response.json()


def do_voiceprint_comparison(voiceprints_a: list, voiceprints_b: list):
    response = requests.post(
        f"https://{SPEECH_PLATFORM_SERVER}/api/technology/speaker-identification-voiceprint-comparison",
        json={"voiceprints_a": voiceprints_a, "voiceprints_b": voiceprints_b},
    )
    response.raise_for_status()
    polling_url = response.headers["x-location"]
    voiceprint_comparison_response = poll_result(polling_url)
    return voiceprint_comparison_response.json()


known_audio = "john_doe.wav"
unknown_audio = "unknown.wav"

known_audio_response = do_voiceprint_extraction(known_audio)
known_audio_voiceprint = known_audio_response["result"]["channels"][0]["voiceprint"]

unknown_audio_response = do_voiceprint_extraction(unknown_audio)
unknown_audio_voiceprint = unknown_audio_response["result"]["channels"][0]["voiceprint"]

voiceprint_comparison_response = do_voiceprint_comparison(
    [known_audio_voiceprint], [unknown_audio_voiceprint]
)
print(voiceprint_comparison_response)