Keyword Spotting
This guide demonstrates how to perform Keyword Spotting with Phonexia Speech Platform 4. You can find a high-level description in the Keyword Spotting article.
Throughout this guide, we'll be using a mono-channel audio in Czech as an example: Paula.wav.
At the end of this guide, you'll find the full Python code example that combines all the steps discussed separately below. This guide should give you a comprehensive understanding of how to integrate Keyword Spotting into your own projects.
Prerequisites
Follow the prerequisites for setup of Virtual Appliance and Python environment as described in the Task lifecycle code examples.
Run Keyword Spotting
To run Keyword Spotting for a single media file, start by sending a POST request to the /api/technology/keyword-spotting endpoint. The file, the language, and the list of keywords to detect are mandatory parameters.
The list of keywords must be provided in the config field of the request body. See the endpoint documentation for more details. The example file is in Czech, so you should pass cs as the language. In Python, you can do this as follows:
import json
import requests

VIRTUAL_APPLIANCE_ADDRESS = "http://<virtual-appliance-address>:8000"  # Replace with your address
MEDIA_FILE_BASED_ENDPOINT_URL = f"{VIRTUAL_APPLIANCE_ADDRESS}/api/technology/keyword-spotting"

media_file = "Paula.wav"
params = {"language": "cs"}
config = {
    "config": json.dumps(
        {
            "keywords": [
                {"spelling": "termín"},
                {"spelling": "děkuji", "pronunciations": ["J\\ e k u j i"]},
            ]
        }
    )
}

with open(media_file, mode="rb") as file:
    files = {"file": file}
    start_task_response = requests.post(
        url=MEDIA_FILE_BASED_ENDPOINT_URL,
        files=files,
        data=config,
        params=params,
    )
print(start_task_response.status_code)  # Should print '202'
Note that to get meaningful results, the language parameter has to match the
language spoken in the media file. See the documentation for the list of
supported languages.
The config parameter expects a JSON formatted string. In other words,
config's content is going to be interpreted as JSON by Speech Platform 4.
Therefore, all rules of the JSON syntax apply.
Some phoneme symbols (e.g. the Czech phoneme J\) include a backslash, which
has a special meaning in JSON and must be escaped with another backslash (\\)
to suppress that special meaning. For example, the correct way to write the
pronunciation of the Czech word děkuji in config.keywords[*].pronunciations
is "J\\ e k u j i".
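The escaping rule above involves two layers when the config is built in Python: the Python string literal and the JSON text. The following minimal sketch, which uses only the standard json module, shows how a single-backslash phoneme ends up doubled in the serialized config. The variable names are chosen here for illustration.

```python
import json

# In Python source, "J\\ e k u j i" is a string containing ONE backslash.
pronunciation = "J\\ e k u j i"
assert pronunciation == r"J\ e k u j i"

config = {"keywords": [{"spelling": "děkuji", "pronunciations": [pronunciation]}]}

# json.dumps applies the JSON escaping, so the backslash is doubled in the output text.
config_json = json.dumps(config, ensure_ascii=False)
print(config_json)

# Parsing the JSON text restores the single-backslash phoneme.
round_trip = json.loads(config_json)
print(round_trip["keywords"][0]["pronunciations"][0])
```

If you write the JSON text by hand instead of generating it with json.dumps, you have to type the doubled backslash yourself.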
If the task has been successfully accepted, the 202 code will be returned
together with a unique task ID in the response body. The task isn't processed
immediately, but only scheduled for processing. You can check the current task
status by polling for the result.
Polling
To obtain the final result, periodically query the task status until the task
state changes to done, failed or rejected. The general polling procedure
is described in detail in the
Task lifecycle code examples.
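The polling loop described above can be sketched with an upper bound on the waiting time. This is only an illustration: wait_for_task, fetch_task, and the timeout parameter are hypothetical names introduced here, and the response shape {"task": {"state": ...}} is taken from the sample result in this guide. The demo feeds canned responses instead of making HTTP calls, so it runs without a Virtual Appliance.

```python
import time

FINAL_STATES = {"done", "failed", "rejected"}


def wait_for_task(fetch_task, polling_interval=5.0, timeout=600.0):
    """Call fetch_task() until the task reaches a final state or the timeout expires.

    fetch_task is a hypothetical callable that returns the parsed JSON body of
    the polling response, e.g. lambda: requests.get(polling_url).json().
    """
    deadline = time.monotonic() + timeout
    while True:
        body = fetch_task()
        if body["task"]["state"] in FINAL_STATES:
            return body
        # Give up if the next sleep would overrun the deadline.
        if time.monotonic() + polling_interval > deadline:
            raise TimeoutError("Task did not finish within the timeout.")
        time.sleep(polling_interval)


# Offline demonstration with canned responses.
# The "running" state name is a placeholder made up for this demo.
canned = iter([
    {"task": {"state": "running"}},
    {"task": {"state": "done"}, "result": {"matches": []}},
])
final_body = wait_for_task(lambda: next(canned), polling_interval=0.01, timeout=1.0)
print(final_body["task"]["state"])
```

A bounded wait keeps a client from hanging indefinitely if a task never reaches a final state. Choose the interval and timeout to suit the length of your media files.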
Result for Keyword Spotting
The result field of the task contains the list of matches. Each match
contains the channel_number, start_time, end_time, and confidence of the
match, together with the keyword that was matched.
For our sample file, the task result should look as follows:
{
  "task": {
    "task_id": "78c8ceca-225d-4fd8-a4f5-07e40f9de4fa",
    "state": "done"
  },
  "result": {
    "matches": [
      {
        "channel_number": 0,
        "start_time": 8.51,
        "end_time": 9.2,
        "confidence": 0.20445,
        "keyword": {
          "spelling": "termín",
          "pronunciation": "t e r m i: n",
          "pronunciation_source": "dictionary"
        }
      },
      {
        "channel_number": 0,
        "start_time": 37.32,
        "end_time": 37.83,
        "confidence": 0.58694,
        "keyword": {
          "spelling": "děkuji",
          "pronunciation": "J\\ e k u j i",
          "pronunciation_source": "user"
        }
      }
    ]
  }
}
Note that matches found by Keyword Spotting technology are ordered by the time they appear in the file, irrespective of their channel of origin.
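Because matches from all channels are interleaved by time, multi-channel recordings may be easier to inspect once the matches are regrouped per channel. The sketch below shows one way to do that. The helper group_matches_by_channel is a hypothetical name, and the two-channel matches are made up for illustration; they are not output of the sample file, which is mono.

```python
from collections import defaultdict


def group_matches_by_channel(matches):
    """Return a dict mapping channel_number to its matches, keeping time order."""
    grouped = defaultdict(list)
    for match in matches:
        grouped[match["channel_number"]].append(match)
    return dict(grouped)


# Made-up two-channel matches for illustration only (not real output).
example_matches = [
    {"channel_number": 1, "start_time": 1.2, "keyword": {"spelling": "termín"}},
    {"channel_number": 0, "start_time": 3.4, "keyword": {"spelling": "děkuji"}},
    {"channel_number": 1, "start_time": 5.6, "keyword": {"spelling": "děkuji"}},
]

grouped = group_matches_by_channel(example_matches)
for channel, channel_matches in sorted(grouped.items()):
    spellings = [m["keyword"]["spelling"] for m in channel_matches]
    print(channel, spellings)
```

Since the input is already ordered by time, each per-channel list stays ordered by time as well.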
The Keyword Spotting technology tends to err on the side of false positives and
may return matches with very low confidence. If that is the case for your data,
consider filtering the resulting matches by their confidence value.
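Filtering by confidence can be sketched as follows, using the two matches from the sample result above. The helper filter_matches and the threshold of 0.5 are assumptions made for this illustration; a suitable threshold depends on your data and on how you weigh false positives against missed keywords.

```python
def filter_matches(matches, min_confidence):
    """Keep only the matches whose confidence is at least min_confidence."""
    return [match for match in matches if match["confidence"] >= min_confidence]


# The "matches" list from the sample result in this guide.
matches = [
    {
        "channel_number": 0,
        "start_time": 8.51,
        "end_time": 9.2,
        "confidence": 0.20445,
        "keyword": {"spelling": "termín"},
    },
    {
        "channel_number": 0,
        "start_time": 37.32,
        "end_time": 37.83,
        "confidence": 0.58694,
        "keyword": {"spelling": "děkuji"},
    },
]

# 0.5 is an arbitrary example threshold, not a recommended value.
confident_matches = filter_matches(matches, min_confidence=0.5)
for match in confident_matches:
    print(match["keyword"]["spelling"], match["start_time"], match["confidence"])
```

With this threshold, only the děkuji match remains and the low-confidence termín match is dropped.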
Full Python code
Here is the full example of how to run the Keyword Spotting technology. The code is slightly adjusted and wrapped into functions for better readability. Refer to the Task lifecycle code examples for a generic code template applicable to all technologies.
import json
import requests
import time

VIRTUAL_APPLIANCE_ADDRESS = "http://<virtual-appliance-address>:8000"  # Replace with your address
MEDIA_FILE_BASED_ENDPOINT_URL = f"{VIRTUAL_APPLIANCE_ADDRESS}/api/technology/keyword-spotting"


def poll_result(polling_url, polling_interval=5):
    """Poll the task endpoint until processing completes."""
    while True:
        polling_task_response = requests.get(polling_url)
        polling_task_response.raise_for_status()
        polling_task_response_json = polling_task_response.json()
        task_state = polling_task_response_json["task"]["state"]
        if task_state in {"done", "failed", "rejected"}:
            break
        time.sleep(polling_interval)
    return polling_task_response


def run_media_based_task(media_file, params=None, config=None):
    """Create a media-based task and wait for results."""
    if params is None:
        params = {}
    if config is None:
        config = {}

    with open(media_file, mode="rb") as file:
        files = {"file": file}
        start_task_response = requests.post(
            url=MEDIA_FILE_BASED_ENDPOINT_URL,
            files=files,
            params=params,
            data={"config": json.dumps(config)},
        )
    start_task_response.raise_for_status()
    polling_url = start_task_response.headers["Location"]
    task_result = poll_result(polling_url)
    return task_result.json()


# Run Keyword Spotting
media_file = "Paula.wav"
params = {"language": "cs"}
config = {
    "keywords": [
        {"spelling": "termín"},
        {"spelling": "děkuji", "pronunciations": ["J\\ e k u j i"]},
    ]
}

print(f"Running Keyword Spotting for file {media_file}.")
media_file_based_task = run_media_based_task(media_file=media_file, params=params, config=config)
media_file_based_task_result = media_file_based_task["result"]
print(json.dumps(media_file_based_task_result, indent=2))