Keyword Spotting
This guide demonstrates how to perform Keyword Spotting with Phonexia Speech Platform 4. You can find a high-level description in the Keyword Spotting article.
Attached, you will find the audio file Paula.wav, a mono recording in Czech. It is used as the example audio throughout this guide.
At the end of this guide, you'll find a full Python code example that combines all the steps, which are first discussed separately. This guide should give you a comprehensive understanding of how to integrate Keyword Spotting into your own projects.
Prerequisites
In this guide, we assume that the Virtual Appliance is running on port 8000 of http://localhost and contains a proper model and license for the technology. For more information on how to install and start the Virtual Appliance, please refer to the Virtual Appliance Installation chapter.
Environment Setup
We are using Python 3.9 and the Python library requests 2.27 in this example. You can install the requests library with pip as follows:
pip install requests~=2.27
Keyword Spotting
To run Keyword Spotting for a single media file, start by sending a POST request to the /api/technology/keyword-spotting endpoint. The list of keywords to match must be provided in the config parameter of the request body; see the endpoint documentation for more details. You also need to pass the language of the recording in the language parameter. The example recording is in Czech, so you should pass cs, as follows:
import json
import requests

SPEECH_PLATFORM_SERVER = "http://localhost:8000"  # Replace with your actual server URL
ENDPOINT_URL = f"{SPEECH_PLATFORM_SERVER}/api/technology/keyword-spotting"

audio_path = "Paula.wav"
config_dict = {
    "keywords": [
        {"spelling": "termín"},
        {"spelling": "děkuji", "pronunciations": ["J\\ e k u j i"]},
    ]
}
payload = {"config": json.dumps(config_dict)}

with open(audio_path, mode="rb") as file:
    files = {"file": file}
    response = requests.post(
        url=ENDPOINT_URL,
        files=files,
        data=payload,
        params={"language": "cs"},
    )

print(response.status_code)  # Should print '202'
Note that to get meaningful results, the language parameter has to match the language spoken in the audio file. See the documentation for the list of supported languages.
The config parameter expects a JSON-formatted string. In other words, the content of config is interpreted as JSON by Speech Platform 4, so all rules of JSON syntax apply.
Some phoneme symbols (e.g. the Czech phoneme J\) include a backslash, which has a special meaning in JSON and must be escaped with another backslash (\\) to suppress that meaning. For example, the correct way to capture the pronunciation of the Czech word děkuji in config.keywords[*].pronunciations is "J\\ e k u j i".
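The following minimal sketch illustrates how the escaping works when the config is built in Python, as it is in this guide. It only uses the standard json module. The point it tries to show is that the Python literal "J\\ e k u j i" holds a single backslash, and that json.dumps adds the JSON-level escaping on its own, so no manual doubling beyond the Python literal should be needed.

```python
import json

# Python string literal: "\\" is an escaped backslash, so the string itself
# holds a single backslash, i.e. the phoneme sequence J\ e k u j i.
pronunciation = "J\\ e k u j i"

# A raw string is an equivalent way to write the same value in Python.
assert pronunciation == r"J\ e k u j i"

config_dict = {
    "keywords": [
        {"spelling": "děkuji", "pronunciations": [pronunciation]},
    ]
}

# json.dumps escapes the backslash again, so the JSON text sent to the server
# contains the two-character sequence \\ required by JSON syntax.
config_json = json.dumps(config_dict)

# Show the JSON encoding of the pronunciation on its own.
print(json.dumps(pronunciation))

# Decoding the JSON text gives back the original single-backslash string.
decoded = json.loads(config_json)
assert decoded["keywords"][0]["pronunciations"][0] == pronunciation
```

If you write the config as JSON text by hand instead of going through json.dumps, the doubled backslash has to be typed explicitly.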
If the task has been successfully accepted, the 202 code will be returned together with a unique task ID in the response body. The task isn't processed immediately, but only scheduled for processing. You can check the current task status by polling for the result.
The URL for polling the result is returned in the Location header. Alternatively, you can assemble the polling URL on your own by appending a slash (/) and the task ID to the endpoint URL.
import json
import requests
import time

# Use the `response` from the previous step
polling_url = response.headers["Location"]
# Alternatively:
# polling_url = ENDPOINT_URL + "/" + response.json()["task"]["task_id"]

while True:
    response = requests.get(polling_url)
    data = response.json()
    task_status = data["task"]["state"]
    if task_status in {"done", "failed", "rejected"}:
        break
    time.sleep(5)

print(json.dumps(data, indent=2))
Once the polling finishes, data will contain the latest response from the server: either the result of Keyword Spotting or, if processing could not finish properly, an error message with details. The technology result can be accessed as data["result"], and for our sample audio, data should look as follows:
{
  "task": {
    "task_id": "78c8ceca-225d-4fd8-a4f5-07e40f9de4fa",
    "state": "done"
  },
  "result": {
    "matches": [
      {
        "channel_number": 0,
        "start_time": 8.51,
        "end_time": 9.2,
        "confidence": 0.20445,
        "keyword": {
          "spelling": "termín",
          "pronunciation": "t e r m i: n",
          "pronunciation_source": "dictionary"
        }
      },
      {
        "channel_number": 0,
        "start_time": 37.32,
        "end_time": 37.83,
        "confidence": 0.58694,
        "keyword": {
          "spelling": "děkuji",
          "pronunciation": "J\\ e k u j i",
          "pronunciation_source": "user"
        }
      }
    ]
  }
}
Note that matches found by Keyword Spotting technology are ordered by the time they appear in the recording, irrespective of their channel of origin.
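Because matches from all channels arrive in one time-ordered list, you may want to regroup them per channel when working with multi-channel recordings. The sketch below shows one possible way to do that. The helper name matches_by_channel is hypothetical and not part of Speech Platform 4, and the input is an abbreviated copy of the sample result shown earlier (a mono recording, so everything ends up under channel 0).

```python
from collections import defaultdict


def matches_by_channel(result: dict) -> dict:
    """Group matches by their channel_number (hypothetical helper).

    The input order is preserved within each channel, so every per-channel
    list stays ordered by time, just like the original list.
    """
    grouped = defaultdict(list)
    for match in result["matches"]:
        grouped[match["channel_number"]].append(match)
    return dict(grouped)


# Abbreviated copy of the sample result from this guide.
sample_result = {
    "matches": [
        {"channel_number": 0, "start_time": 8.51, "end_time": 9.2, "confidence": 0.20445},
        {"channel_number": 0, "start_time": 37.32, "end_time": 37.83, "confidence": 0.58694},
    ]
}

grouped = matches_by_channel(sample_result)
for channel, matches in grouped.items():
    print(f"channel {channel}: {len(matches)} match(es)")
```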
The Keyword Spotting technology errs on the side of caution and might return keyword matches with very low confidence. If that is the case for your data set, consider filtering data["result"]["matches"] by the confidence value.
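Such filtering could look like the sketch below. The helper name filter_matches and the threshold of 0.5 are assumptions chosen only for illustration; a suitable threshold depends on your data and should be tuned on it. The input is an abbreviated copy of the sample result shown earlier.

```python
def filter_matches(result: dict, min_confidence: float) -> list:
    """Return only the matches whose confidence is at least min_confidence.

    Hypothetical helper; `result` is expected to have the shape of
    data["result"] from the polling response.
    """
    return [m for m in result["matches"] if m["confidence"] >= min_confidence]


# Abbreviated copy of the sample result from this guide.
result = {
    "matches": [
        {"start_time": 8.51, "end_time": 9.2, "confidence": 0.20445,
         "keyword": {"spelling": "termín"}},
        {"start_time": 37.32, "end_time": 37.83, "confidence": 0.58694,
         "keyword": {"spelling": "děkuji"}},
    ]
}

# 0.5 is an arbitrary illustrative threshold, not a recommended value.
confident_matches = filter_matches(result, min_confidence=0.5)
for match in confident_matches:
    print(match["start_time"], match["end_time"], match["confidence"])
```

With this threshold, only the match for děkuji is kept, because the match for termín has a confidence of 0.20445.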
Full Python code
Here is the full example of how to run the Keyword Spotting technology. The code is slightly adjusted and wrapped into functions for better readability.
import json
import requests
import time

SPEECH_PLATFORM_SERVER = "http://localhost:8000"  # Replace with your actual server URL
ENDPOINT_URL = f"{SPEECH_PLATFORM_SERVER}/api/technology/keyword-spotting"


def poll_result(polling_url: str, sleep: int = 5):
    while True:
        response = requests.get(polling_url)
        response.raise_for_status()
        data = response.json()
        task_status = data["task"]["state"]
        if task_status in {"done", "failed", "rejected"}:
            break
        time.sleep(sleep)
    return response


def run_keyword_spotting(audio_path: str, language: str, config: dict):
    with open(audio_path, mode="rb") as file:
        files = {"file": file}
        response = requests.post(
            url=ENDPOINT_URL,
            files=files,
            data={"config": json.dumps(config)},
            params={"language": language},
        )
    response.raise_for_status()
    polling_url = response.headers["Location"]
    response_result = poll_result(polling_url)
    return response_result.json()


filename = "Paula.wav"
config_dict = {
    "keywords": [
        {"spelling": "termín"},
        {"spelling": "děkuji", "pronunciations": ["J\\ e k u j i"]},
    ]
}

print(f"Running Keyword Spotting for file {filename}.")
data = run_keyword_spotting(audio_path=filename, language="cs", config=config_dict)
result = data["result"]
print(json.dumps(result, indent=2))
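The polling loops in this guide run until the task reaches one of the states done, failed, or rejected. In your own integration you may want an upper bound on the waiting time. The following is a minimal sketch of that idea, not part of the original example: the function name poll_until_finished, its parameters, and the default timeout are all assumptions. To keep the sketch independent of the network, it takes a callable that returns the decoded polling response.

```python
import time
from typing import Callable

# Terminal task states, as used by the polling loops in this guide.
TERMINAL_STATES = {"done", "failed", "rejected"}


def poll_until_finished(
    fetch: Callable[[], dict],
    timeout_s: float = 300.0,  # assumed default, adjust to your workload
    interval_s: float = 5.0,
) -> dict:
    """Call `fetch` until the task is in a terminal state or the timeout expires.

    `fetch` is expected to return the decoded JSON body of the polling
    response, i.e. a dict with a "task" -> "state" entry.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        data = fetch()
        state = data["task"]["state"]
        if state in TERMINAL_STATES:
            return data
        # Give up if the next poll would happen after the deadline.
        if time.monotonic() + interval_s > deadline:
            raise TimeoutError(f"Task still in state {state!r} after {timeout_s} s")
        time.sleep(interval_s)
```

With the polling_url from this guide, it could be called as poll_until_finished(lambda: requests.get(polling_url).json()).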