Speaker Verification
This guide covers speaker verification: confirming that the speaker in an audio recording is the person they claim to be. We already have an audio recording with the voice of John Doe. Now, we want to verify that it's John Doe speaking in another audio recording. This article describes how you can achieve that using our software.
The process of verification consists of voiceprint extraction and voiceprint comparison. Both steps are explained in the following sections.
Attached, you will find a ZIP file named recordings.zip containing two mono-channel audio files, john_doe.wav and unknown.wav, which will be used as examples throughout the guide.
Please note that at the end of this guide, we provide a full Python code example that encapsulates all the steps discussed. This should offer a comprehensive understanding and an actionable guide to implementing speaker verification in your own projects.
Environment setup
We are using Python 3.9 and the Python library requests 2.27 in this example. You can install the requests library with pip as follows:

pip install requests~=2.27
Then, you can import the following libraries (time is built-in):
import time
import requests
Voiceprint extraction
To trigger voiceprint extraction for a single audio file, start by sending a POST request to /api/technology/speaker-identification-voiceprint-extraction as follows:
with open("john_doe.wav", mode="rb") as file:
    files = {"file": file}
    # Replace <speech-platform-server> with the actual server address
    response = requests.post(
        "https://<speech-platform-server>/api/technology/speaker-identification-voiceprint-extraction",
        files=files,
    )
response.raise_for_status()
If the task was successfully accepted, a 202 status code will be returned together with a unique task ID in the response body. The task isn't processed immediately, but only scheduled for processing. You can check the current task status while polling for the result. The URL for polling the result is returned in the X-Location header. Alternatively, you can assemble the polling URL on your own by appending a slash (/) and the task ID to the initial voiceprint extraction URL.
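For instance, assembling the polling URL manually might look like this (a sketch; the server address and task ID below are placeholders, not values returned by a real request):

```python
def build_polling_url(base_url: str, task_id: str) -> str:
    # Append a slash (/) and the task ID to the initial extraction URL.
    return f"{base_url.rstrip('/')}/{task_id}"

url = build_polling_url(
    "https://example.com/api/technology/speaker-identification-voiceprint-extraction",
    "fb9de4e5-a768-4069-aff3-c74c826f3ddf",
)
```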
polling_url = response.headers["x-location"]

counter = 0
while counter < 100:
    response = requests.get(polling_url)
    response.raise_for_status()
    data = response.json()
    task_status = data["task"]["state"]
    if task_status in ["done", "failed", "rejected"]:
        break
    counter += 1
    time.sleep(5)
Once the polling finishes, data will contain the latest response from the server: either a response with the extracted voiceprint(s), or an error message with details in case processing was not able to finish. The extracted voiceprints are grouped by channel. Example result of a successful voiceprint extraction from a stereo-channel file (voiceprints were shortened for readability):
{
    "task": {"task_id": "fb9de4e5-a768-4069-aff3-c74c826f3ddf", "state": "done"},
    "result": {
        "channels": [
            {
                "channel_number": 0,
                "voiceprint": "e2kDY3JjbDAWiyhpCWVtYmVkZGluZ1tkO/QWvmS8JkuGZDyv+F5kvJQzJ...",
                "speech_length": 49.08,
                "model": "sid-xl5"
            },
            {
                "channel_number": 1,
                "voiceprint": "e2kDY3JjbFL6NSxpCWVtYmVkZGluZ1tkO2ygcGS8CduAZDxa6ZBkPHrYf...",
                "speech_length": 116.35,
                "model": "sid-xl5"
            }
        ]
    }
}
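For a stereo file like the one above, you get one voiceprint per channel. If your application needs a single voiceprint, one possible strategy (purely illustrative, not prescribed by the API) is to pick the channel with the most detected speech, using the speech_length field:

```python
def pick_best_channel(result: dict) -> dict:
    # Choose the channel entry with the longest detected speech.
    return max(result["channels"], key=lambda channel: channel["speech_length"])

# Shortened version of the stereo extraction result shown above:
result = {
    "channels": [
        {"channel_number": 0, "voiceprint": "e2kDY3JjbDAWiyhp...", "speech_length": 49.08},
        {"channel_number": 1, "voiceprint": "e2kDY3JjbFL6NSxp...", "speech_length": 116.35},
    ]
}
best = pick_best_channel(result)  # the channel with speech_length 116.35
```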
Let's get back to our example with the mono-channel audio files. In our case, the target voiceprint can be accessed as follows:
known_audio_voiceprint = data["result"]["channels"][0]["voiceprint"]
Great, you've extracted your first voiceprint and stored it in a variable, congratulations! Now, you can repeat the same process for the second file: just change john_doe.wav to unknown.wav and store the resulting voiceprint in another variable:
unknown_audio_voiceprint = data["result"]["channels"][0]["voiceprint"]
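Note that polling can also end in a failed or rejected state. Before reading the result, you may want to turn an unsuccessful final state into an exception with a small helper like the one below (a sketch; the exact error payload shape is an assumption, so check your server's actual responses):

```python
def ensure_task_done(data: dict) -> dict:
    """Return the task result, or raise if the task did not finish successfully."""
    state = data["task"]["state"]
    if state != "done":
        # The "error" field name is an assumption; adjust to your server's payload.
        detail = data.get("error", data)
        raise RuntimeError(f"Task ended in state {state!r}: {detail}")
    return data["result"]

# Example with a successful extraction response:
result = ensure_task_done(
    {"task": {"task_id": "abc", "state": "done"}, "result": {"channels": []}}
)
```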
Voiceprint comparison
The voiceprints are extracted, so we can move on to the actual comparison. By default, two non-empty sets of voiceprints are expected as input. Each voiceprint is expected to be a base64-encoded string, but you don't have to worry about that: the voiceprints are already returned in this format by the voiceprint extraction.
Voiceprint comparison is analogous to voiceprint extraction: we start by requesting that a voiceprint comparison task be scheduled for our two voiceprint sets:
body = {
    "voiceprints_a": [known_audio_voiceprint],
    "voiceprints_b": [unknown_audio_voiceprint],
}

# Replace <speech-platform-server> with the actual server address
response = requests.post(
    "https://<speech-platform-server>/api/technology/speaker-identification-voiceprint-comparison",
    json=body,
)
response.raise_for_status()
Now, you can poll for the task result as was done above for voiceprint extraction. The result is a comparison matrix whose shape is based on the sizes of both input voiceprint sets. In the case of speaker verification, the resulting matrix has a shape of 1x1, since we're comparing only two files with each other. After you finish polling, the result of the comparison can be found in data and should look like this:
{
    "task": {"task_id": "e44557e1-94ba-4272-929a-8a5ec32f6e96", "state": "done"},
    "result": {
        "scores": {"rows_count": 1, "columns_count": 1, "values": [2.1514739990234375]}
    }
}
In this case, the resulting score is very high, so we can assume it is very likely that John Doe is also speaking in the unknown.wav file. See scoring explained here.
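The raw score can be read from the scores matrix and turned into an accept/reject decision. The threshold below is purely illustrative, not a value prescribed by the API; in practice you would calibrate it on your own data:

```python
# Comparison result from the example above:
data = {
    "task": {"task_id": "e44557e1-94ba-4272-929a-8a5ec32f6e96", "state": "done"},
    "result": {
        "scores": {"rows_count": 1, "columns_count": 1, "values": [2.1514739990234375]}
    },
}
score = data["result"]["scores"]["values"][0]

SCORE_THRESHOLD = 0.0  # Illustrative only; calibrate on your own data.

def is_same_speaker(score: float, threshold: float = SCORE_THRESHOLD) -> bool:
    return score >= threshold

decision = is_same_speaker(score)  # True for the example score above
```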
Full Python Code
Here is the full code for this example, slightly adjusted and wrapped into functions for better readability:
import time
import requests
SPEECH_PLATFORM_SERVER = "<speech-platform-server>" # Replace with your actual server URL
def poll_result(polling_url: str, sleep: int = 5):
    while True:
        response = requests.get(polling_url)
        response.raise_for_status()
        data = response.json()
        task_status = data["task"]["state"]
        if task_status in ["done", "failed", "rejected"]:
            break
        time.sleep(sleep)
    return response
def do_voiceprint_extraction(audio_path: str):
    with open(audio_path, mode="rb") as file:
        files = {"file": file}
        response = requests.post(
            f"https://{SPEECH_PLATFORM_SERVER}/api/technology/speaker-identification-voiceprint-extraction",
            files=files,
        )
    response.raise_for_status()
    polling_url = response.headers["x-location"]
    voiceprint_extraction_response = poll_result(polling_url)
    return voiceprint_extraction_response.json()
def do_voiceprint_comparison(voiceprints_a: list, voiceprints_b: list):
    response = requests.post(
        f"https://{SPEECH_PLATFORM_SERVER}/api/technology/speaker-identification-voiceprint-comparison",
        json={"voiceprints_a": voiceprints_a, "voiceprints_b": voiceprints_b},
    )
    response.raise_for_status()
    polling_url = response.headers["x-location"]
    voiceprint_comparison_response = poll_result(polling_url)
    return voiceprint_comparison_response.json()
known_audio = "john_doe.wav"
unknown_audio = "unknown.wav"
known_audio_response = do_voiceprint_extraction(known_audio)
known_audio_voiceprint = known_audio_response["result"]["channels"][0]["voiceprint"]
unknown_audio_response = do_voiceprint_extraction(unknown_audio)
unknown_audio_voiceprint = unknown_audio_response["result"]["channels"][0]["voiceprint"]
voiceprint_comparison_response = do_voiceprint_comparison(
[known_audio_voiceprint], [unknown_audio_voiceprint]
)
print(voiceprint_comparison_response)