Version: 3.5.0

Voice Activity Detection

This guide demonstrates how to perform Voice Activity Detection with Phonexia Speech Platform 4. You can find a high-level description in the Voice Activity Detection article.

For testing, we'll be using two audio files. You can download them together in the audio_files.zip archive.

At the end of this guide, you'll find a full Python code example that combines all the steps, which are first discussed separately. This guide should give you a comprehensive understanding of how to integrate Voice Activity Detection into your own projects.

Prerequisites

In this guide, we assume that the Virtual Appliance is running on port 8000 of http://localhost. For more information on how to install and start the Virtual Appliance, please refer to the Virtual Appliance Installation guide. The technology requires a proper model and license in order to process any files. For more details on models and licenses, see the Licensing section.

Environment Setup

We are using Python 3.9 and Python library requests 2.27 in this example. You can install the requests library with pip as follows:

pip install requests~=2.27

Basic Voice Activity Detection

To run Voice Activity Detection for a single media file, you should start by sending a POST request to the /api/technology/voice-activity-detection endpoint. In Python, you can do this as follows:

import requests

SPEECH_PLATFORM_SERVER = "http://localhost:8000" # Replace with your actual server URL
ENDPOINT_URL = f"{SPEECH_PLATFORM_SERVER}/api/technology/voice-activity-detection"

audio_path = "Kathryn_Paula.wav"

with open(audio_path, mode="rb") as file:
    files = {"file": file}
    response = requests.post(
        url=ENDPOINT_URL,
        files=files,
    )

print(response.status_code) # Should print '202'

If the task has been successfully accepted, the 202 code will be returned together with a unique task ID in the response body. The task isn't processed immediately, but only scheduled for processing. You can check the current task status by polling for the result.

The URL for polling the result is returned in the X-Location header. Alternatively, you can assemble the polling URL on your own by appending a slash (/) and the task ID to the endpoint URL.

import json
import requests
import time

# Use the `response` from the previous step
polling_url = response.headers["x-location"]
# Alternatively:
# polling_url = ENDPOINT_URL + "/" + response.json()["task"]["task_id"]

while True:
    response = requests.get(polling_url)
    data = response.json()
    task_status = data["task"]["state"]
    if task_status in {"done", "failed", "rejected"}:
        break
    time.sleep(5)

print(json.dumps(data, indent=2))

Once the polling finishes, data will contain the latest response from the server: either the result of Voice Activity Detection or, if processing could not finish properly, an error message with details. The technology result can be accessed as data["result"]. For our sample audio, data should look as follows (the result is shortened for readability):

{
  "task": {
    "task_id": "5d31c8b4-f4d3-41c2-a5a5-6af6aee4812a",
    "state": "done"
  },
  "result": {
    "channels": [
      {
        "channel_number": 0,
        "speech_length": 114.5,
        "segments": [
          {
            "segment_type": "voice",
            "start_time": 1.53,
            "end_time": 14.1
          },
          {
            "segment_type": "voice",
            "start_time": 14.64,
            "end_time": 25.49
          },
          {
            "segment_type": "voice",
            "start_time": 25.71,
            "end_time": 38.47
          },
          {
            "segment_type": "voice",
            "start_time": 38.52,
            "end_time": 41.64
          },
          ...
        ]
      }
    ]
  }
}
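
Because the final state can also be failed or rejected, it may help to check the state before reading data["result"]. The following is a minimal sketch, not part of the official example: the helper name extract_result is hypothetical, and since this guide does not show the layout of the error details, the whole response is simply included in the exception message.

```python
import json


def extract_result(data: dict) -> dict:
    """Return data["result"] for a finished task, or raise with the full payload."""
    state = data["task"]["state"]
    if state != "done":
        # The structure of the error details is not shown in this guide,
        # so the entire response is attached to the exception (an assumption).
        raise RuntimeError(f"Task ended in state '{state}': {json.dumps(data)}")
    return data["result"]
```

With the data from the polling step, result = extract_result(data) would then either return the channels structure shown above or raise an error that carries the server's response.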

When processing multichannel media files, you obtain an independent Voice Activity Detection result for each channel in the channels list.
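
To illustrate how the per-channel structure can be consumed, here is a small sketch that walks the channels list and adds up the duration of the voice segments in each channel. The function name summarize_channels is hypothetical, and the code assumes that start_time and end_time are expressed in seconds, which this guide does not state explicitly.

```python
def summarize_channels(result: dict) -> list:
    """Build one summary entry per channel from a Voice Activity Detection result."""
    summary = []
    for channel in result["channels"]:
        # Keep only segments marked as voice, as in the sample response.
        voice = [s for s in channel["segments"] if s["segment_type"] == "voice"]
        # Assumption: start_time and end_time are in seconds.
        voice_total = sum(s["end_time"] - s["start_time"] for s in voice)
        summary.append(
            {
                "channel_number": channel["channel_number"],
                "voice_segments": len(voice),
                "voice_total": round(voice_total, 2),
                "speech_length": channel["speech_length"],
            }
        )
    return summary
```

Calling summarize_channels(data["result"]) returns a list with one dictionary per channel, so a stereo recording would yield two entries that can be compared or reported separately.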

Full Python code

Here is the full example of how to run the Voice Activity Detection technology. The code is slightly adjusted and wrapped into functions for better readability.

import json
import requests
import time

SPEECH_PLATFORM_SERVER = "http://localhost:8000" # Replace with your actual server URL
ENDPOINT_URL = f"{SPEECH_PLATFORM_SERVER}/api/technology/voice-activity-detection"


def poll_result(polling_url: str, sleep: int = 5):
    while True:
        response = requests.get(polling_url)
        response.raise_for_status()
        data = response.json()
        task_status = data["task"]["state"]
        if task_status in {"done", "failed", "rejected"}:
            break
        time.sleep(sleep)
    return response


def run_voice_activity_detection(audio_path: str):
    with open(audio_path, mode="rb") as file:
        files = {"file": file}
        response = requests.post(
            url=ENDPOINT_URL,
            files=files,
        )
    response.raise_for_status()
    polling_url = response.headers["x-location"]
    response_result = poll_result(polling_url)
    return response_result.json()


filenames = ["Laura_Harry_Veronika.wav", "Kathryn_Paula.wav"]

for filename in filenames:
    print(f"Running Voice Activity Detection for file {filename}.")
    data = run_voice_activity_detection(filename)
    result = data["result"]
    print(json.dumps(result, indent=2))
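
The polling loop in the example runs until the task reaches one of the states done, failed, or rejected. In an integration, you may want an upper bound on the waiting time. The sketch below is one possible variant and not part of the official example: the name poll_with_timeout, the fetch callable, and the default timeout of 300 seconds are all assumptions made for illustration.

```python
import time
from typing import Callable


def poll_with_timeout(
    fetch: Callable[[], dict],
    interval: float = 5.0,
    timeout: float = 300.0,
) -> dict:
    """Call fetch() until the task reaches a terminal state or the timeout elapses."""
    deadline = time.monotonic() + timeout
    while True:
        data = fetch()
        # Terminal states as used in the polling loop of this guide.
        if data["task"]["state"] in {"done", "failed", "rejected"}:
            return data
        # Give up if the next attempt would start after the deadline.
        if time.monotonic() + interval > deadline:
            raise TimeoutError(f"Task did not finish within {timeout} seconds")
        time.sleep(interval)
```

With the requests library from this guide, the function could be called as poll_with_timeout(lambda: requests.get(polling_url).json()). Passing the request as a callable keeps the waiting logic independent of the HTTP client, which also makes it easy to exercise without a running server.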