Version: 4.0.0-rc1

Keyword Spotting

This guide demonstrates how to perform Keyword Spotting with Phonexia Speech Platform 4. You can find a high-level description in the Keyword Spotting article.

The attached audio file Paula.wav is a mono recording in Czech. It is used as the example audio throughout this guide.

At the end of this guide, you'll find a full Python code example that combines all the steps, which are first discussed separately. This guide should give you a comprehensive understanding of how to integrate Keyword Spotting into your own projects.

Prerequisites

This guide assumes that the Virtual Appliance is running at http://localhost on port 8000 and contains a proper model and license for the technology. For more information on how to install and start the Virtual Appliance, please refer to the Virtual Appliance Installation chapter.

Environment Setup

We are using Python 3.9 and Python library requests 2.27 in this example. You can install the requests library with pip as follows:

pip install requests~=2.27

Keyword Spotting

To run Keyword Spotting for a single media file, you should start by sending a POST request to the /api/technology/keyword-spotting endpoint.

The list of keywords to be matched must be provided in the config parameter of the request body. See the endpoint documentation for more details. You also need to pass the language of the recording as the language query parameter. The example recording is in Czech, so you should pass cs, as follows:

import json
import requests

SPEECH_PLATFORM_SERVER = "http://localhost:8000" # Replace with your actual server URL
ENDPOINT_URL = f"{SPEECH_PLATFORM_SERVER}/api/technology/keyword-spotting"

audio_path = "Paula.wav"

config_dict = {
    "keywords": [
        {"spelling": "termín"},
        {"spelling": "děkuji", "pronunciations": ["J\\ e k u j i"]},
    ]
}
payload = {"config": json.dumps(config_dict)}

with open(audio_path, mode="rb") as file:
    files = {"file": file}
    response = requests.post(
        url=ENDPOINT_URL,
        files=files,
        data=payload,
        params={"language": "cs"},
    )

print(response.status_code)  # Should print '202'

Note that to get meaningful results, the language parameter has to match the language spoken in the audio file. See the documentation for the list of supported languages.

Warning

The config parameter expects a JSON formatted string. In other words, config's content is going to be interpreted as JSON by Speech Platform 4. Therefore, all rules of the JSON syntax apply.

Some phoneme symbols (e.g. the Czech phoneme J\) include a backslash, which has a special meaning in JSON and must be escaped with another backslash (\\) to suppress that meaning. For example, the correct way to write the pronunciation of the Czech word děkuji in config.keywords[*].pronunciations is "J\\ e k u j i".
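
The interplay between Python string escaping and JSON escaping can be confusing, so here is a minimal sketch of what happens in the request code above. It only uses the standard json module; the variable names are illustrative. In Python source, "\\" denotes a single backslash, and json.dumps then adds the JSON-level escaping for you:

```python
import json

# Python literal: "\\" is ONE backslash in the resulting string.
# The raw string r"J\ e k u j i" would produce the same value.
pronunciation = "J\\ e k u j i"
print(pronunciation)  # J\ e k u j i

config_dict = {
    "keywords": [{"spelling": "děkuji", "pronunciations": [pronunciation]}]
}

# json.dumps escapes the backslash again, so the JSON text sent to the
# server contains the doubled form required by the JSON syntax.
config_json = json.dumps(config_dict)
print(config_json)

# Decoding the JSON text yields the original single-backslash string.
assert json.loads(config_json) == config_dict
```

If you write the config JSON by hand instead of using json.dumps, you have to type the doubled backslash yourself.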

If the task has been successfully accepted, the 202 code will be returned together with a unique task ID in the response body. The task isn't processed immediately, but only scheduled for processing. You can check the current task status by polling for the result.

The URL for polling the result is returned in the Location header. Alternatively, you can assemble the polling URL on your own by appending a slash (/) and the task ID to the endpoint URL.

import json
import requests
import time

# Use the `response` from the previous step
polling_url = response.headers["Location"]
# Alternatively:
# polling_url = ENDPOINT_URL + "/" + response.json()["task"]["task_id"]

while True:
    response = requests.get(polling_url)
    data = response.json()
    task_status = data["task"]["state"]
    if task_status in {"done", "failed", "rejected"}:
        break
    time.sleep(5)

print(json.dumps(data, indent=2))

Once the polling finishes, data will contain the latest response from the server: either the result of Keyword Spotting, or an error message with details if processing could not finish properly. The technology result can be accessed as data["result"], and for our sample audio, data should look as follows:

{
  "task": {
    "task_id": "78c8ceca-225d-4fd8-a4f5-07e40f9de4fa",
    "state": "done"
  },
  "result": {
    "matches": [
      {
        "channel_number": 0,
        "start_time": 8.51,
        "end_time": 9.2,
        "confidence": 0.20445,
        "keyword": {
          "spelling": "termín",
          "pronunciation": "t e r m i: n",
          "pronunciation_source": "dictionary"
        }
      },
      {
        "channel_number": 0,
        "start_time": 37.32,
        "end_time": 37.83,
        "confidence": 0.58694,
        "keyword": {
          "spelling": "děkuji",
          "pronunciation": "J\\ e k u j i",
          "pronunciation_source": "user"
        }
      }
    ]
  }
}
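
Because polling also stops on the failed and rejected states, it can be useful to check the final state before reading the result. The following is a minimal sketch; extract_matches is a hypothetical helper name, and the layout of the error response is not assumed beyond the task.state field shown above:

```python
def extract_matches(data: dict) -> list:
    """Return the keyword matches, or raise if the task did not finish."""
    state = data["task"]["state"]
    if state != "done":
        # Include the whole response so that any error details are visible.
        raise RuntimeError(f"Keyword Spotting task ended in state {state!r}: {data}")
    return data["result"]["matches"]


# A response shaped like the sample above, reduced to a few fields.
sample = {
    "task": {"task_id": "78c8ceca-225d-4fd8-a4f5-07e40f9de4fa", "state": "done"},
    "result": {
        "matches": [
            {"start_time": 8.51, "confidence": 0.20445, "keyword": {"spelling": "termín"}},
        ]
    },
}

for match in extract_matches(sample):
    print(match["keyword"]["spelling"], match["start_time"], match["confidence"])
```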

Note that matches found by Keyword Spotting technology are ordered by the time they appear in the recording, irrespective of their channel of origin.
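
If you process a multi-channel recording and want the matches per channel, you can regroup them on the client side. This is a minimal sketch; group_by_channel is a hypothetical helper, and the channel 1 entry in the input is made up purely for illustration (the sample recording is mono):

```python
from collections import defaultdict


def group_by_channel(matches: list) -> dict:
    """Group matches by channel_number, keeping their time order per channel."""
    grouped = defaultdict(list)
    for match in matches:
        grouped[match["channel_number"]].append(match)
    return dict(grouped)


matches = [
    {"channel_number": 0, "start_time": 8.51},
    {"channel_number": 1, "start_time": 12.0},  # made-up entry for illustration
    {"channel_number": 0, "start_time": 37.32},
]

by_channel = group_by_channel(matches)
print({ch: [m["start_time"] for m in ms] for ch, ms in by_channel.items()})
# {0: [8.51, 37.32], 1: [12.0]}
```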

Tip

The Keyword Spotting technology errs on the side of caution and might return keyword matches with very low confidence. If that is the case for your data set, consider filtering data["result"]["matches"] by the confidence value.
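
A minimal sketch of such filtering is shown below. filter_matches is a hypothetical helper, and the threshold of 0.5 is an arbitrary example value, not a recommendation; a suitable threshold depends on your data:

```python
def filter_matches(matches: list, min_confidence: float) -> list:
    """Keep only matches whose confidence is at least min_confidence."""
    return [m for m in matches if m["confidence"] >= min_confidence]


# The two matches from the sample result above, reduced to the relevant fields.
matches = [
    {"confidence": 0.20445, "keyword": {"spelling": "termín"}},
    {"confidence": 0.58694, "keyword": {"spelling": "děkuji"}},
]

confident = filter_matches(matches, min_confidence=0.5)  # example threshold only
print([m["keyword"]["spelling"] for m in confident])  # ['děkuji']
```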

Full Python code

Here is the full example of how to run the Keyword Spotting technology. The code is slightly adjusted and wrapped into functions for better readability.

import json
import requests
import time

SPEECH_PLATFORM_SERVER = "http://localhost:8000" # Replace with your actual server URL
ENDPOINT_URL = f"{SPEECH_PLATFORM_SERVER}/api/technology/keyword-spotting"


def poll_result(polling_url: str, sleep: int = 5):
    while True:
        response = requests.get(polling_url)
        response.raise_for_status()
        data = response.json()
        task_status = data["task"]["state"]
        if task_status in {"done", "failed", "rejected"}:
            break
        time.sleep(sleep)
    return response


def run_keyword_spotting(audio_path: str, language: str, config: dict):
    with open(audio_path, mode="rb") as file:
        files = {"file": file}
        response = requests.post(
            url=ENDPOINT_URL,
            files=files,
            data={"config": json.dumps(config)},
            params={"language": language},
        )
    response.raise_for_status()
    polling_url = response.headers["Location"]
    response_result = poll_result(polling_url)
    return response_result.json()


filename = "Paula.wav"
config_dict = {
    "keywords": [
        {"spelling": "termín"},
        {"spelling": "děkuji", "pronunciations": ["J\\ e k u j i"]},
    ]
}

print(f"Running Keyword Spotting for file {filename}.")
data = run_keyword_spotting(audio_path=filename, language="cs", config=config_dict)
result = data["result"]
print(json.dumps(result, indent=2))
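
The polling loops in this guide wait indefinitely. In your own integration, you may prefer to give up after a certain time. The following is a minimal sketch of one way to do that; poll_until_finished, its max_wait parameter, and the injectable fetch function are assumptions made for illustration and are not part of the Speech Platform API:

```python
import time

TERMINAL_STATES = {"done", "failed", "rejected"}


def _fetch_json(url: str) -> dict:
    """Default fetcher: GET the polling URL and decode the JSON body."""
    import requests  # imported lazily so the helper can be exercised offline

    response = requests.get(url, timeout=30)
    response.raise_for_status()
    return response.json()


def poll_until_finished(polling_url: str, sleep: float = 5.0,
                        max_wait: float = 600.0, fetch=_fetch_json) -> dict:
    """Poll until the task reaches a terminal state or max_wait seconds pass."""
    deadline = time.monotonic() + max_wait
    while True:
        data = fetch(polling_url)
        if data["task"]["state"] in TERMINAL_STATES:
            return data
        # Stop if the next attempt would start after the deadline.
        if time.monotonic() + sleep >= deadline:
            raise TimeoutError(f"Task did not finish within {max_wait} seconds")
        time.sleep(sleep)
```

With the functions from the full example, you could call data = poll_until_finished(response.headers["Location"]) in place of poll_result. The fetch parameter exists only so that the loop can be tested without a running server.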