Speech to Text
This tutorial demonstrates how to use the Phonexia Speech Platform 4 to obtain a transcript of the speech in a media file, in other words, how to run Speech to Text. The Speech Platform supports two different Speech to Text engines — Phonexia 6th generation Speech to Text and Enhanced Speech to Text Built on Whisper. In this guide we will process audio with both of them.
Attached, you will find the audio file `Harry.wav`, which is a mono recording in English. It will be used as the example audio throughout this guide.
At the end of this guide, you will find a full Python code example that encapsulates all the steps discussed. This should offer a comprehensive understanding and an actionable guide on implementing Speech to Text in your own projects.
Environment setup
We are using Python 3.9 and the Python library `requests` 2.27 in this example. You can install the `requests` library with `pip` as follows:

```shell
pip install requests~=2.27
```

Then, you can import the following libraries (`json` and `time` are built-in; `json` will be needed later for the `config` request body parameter):

```python
import json
import time

import requests
```
Phonexia 6th generation
In order to run Phonexia Speech to Text processing for a single media file, you should start by sending a POST request to `/api/technology/speech-to-text`. You also need to pass the language of the recording as a query parameter. The example recording is in English, so you should pass `en`, as follows:
```python
# Replace <speech-platform-server> with the actual server address
SPEECH_PLATFORM_SERVER = "<speech-platform-server>"

with open("Harry.wav", mode="rb") as file:
    files = {"file": file}
    response = requests.post(
        f"https://{SPEECH_PLATFORM_SERVER}/api/technology/speech-to-text",
        files=files,
        params={"language": "en"},  # Set the language matching the language model.
    )
response.raise_for_status()
```
Note that to get meaningful results, the `language` parameter has to match the language spoken in the audio file. See the documentation for the list of supported languages.
If the task was successfully accepted, a 202 code will be returned together with a unique task ID in the response body. The task isn't processed immediately, but only scheduled for processing. You can check the current task status by polling for the result.

The URL for polling for the result is returned in the `X-Location` header. Alternatively, you can assemble the polling URL on your own by appending a slash (`/`) and the task ID to the initial URL.
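As a sketch, an assembled polling URL is just the initial request URL plus a slash and the task ID. The helper function name and the example values below are ours, purely for illustration:

```python
def polling_url_for(request_url: str, task_id: str) -> str:
    # Append a slash and the task ID to the initial request URL.
    return f"{request_url}/{task_id}"


# Hypothetical values for illustration:
print(polling_url_for(
    "https://example.com/api/technology/speech-to-text",
    "f866570e-9d62-4e31-b85b-1fa385e90c71",
))
# → https://example.com/api/technology/speech-to-text/f866570e-9d62-4e31-b85b-1fa385e90c71
```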
```python
polling_url = response.headers["x-location"]

counter = 0
while counter < 100:
    response = requests.get(polling_url)
    response.raise_for_status()
    data = response.json()
    task_status = data["task"]["state"]
    if task_status in ["done", "failed", "rejected"]:
        break
    counter += 1
    time.sleep(5)
```
Once the polling finishes, `data` will contain the latest response from the server: either a response with the transcript, or an error message with details in case processing was not able to finish. An example result of a successful Phonexia Speech to Text transcription looks like this (output shortened for readability):
```json
{
  "task": {"task_id": "f866570e-9d62-4e31-b85b-1fa385e90c71", "state": "done"},
  "result": {
    "one_best": {
      "segments": [
        {
          "channel_number": 0,
          "start_time": 1.182,
          "end_time": 3.259,
          "text": "oh <silence/> yeah sure <silence/> hold on a second",
          "words": [
            {"start_time": 1.182, "end_time": 1.4, "text": "oh"},
            {"start_time": 1.4, "end_time": 1.444, "text": "<silence/>"},
            {"start_time": 1.444, "end_time": 1.67, "text": "yeah"},
            ...
            {"start_time": 2.81, "end_time": 3.259, "text": "second"}
          ]
        },
        ...
        {
          "channel_number": 0,
          "start_time": 51.79,
          "end_time": 52.545,
          "text": "cheers you too",
          "words": [...]
        }
      ]
    },
    "additional_words": []
  }
}
```
Phonexia 6th generation Speech to Text provides detailed timestamps for individual words. If you are only interested in a single transcript of the entire media file, you can access the whole text of each segment like this:
```python
transcript = ""
for segment in data["result"]["one_best"]["segments"]:
    transcript += segment["text"] + "\n"
```
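If you also need the word-level timing, you can flatten the `words` arrays of the segments, skipping the `<silence/>` tokens. A small sketch over the result structure shown above (the `sample` value is a shortened stand-in for `data["result"]`):

```python
def word_timestamps(result: dict) -> list:
    # Collect (start_time, end_time, word) triples, skipping <silence/> tokens.
    words = []
    for segment in result["one_best"]["segments"]:
        for word in segment["words"]:
            if word["text"] != "<silence/>":
                words.append((word["start_time"], word["end_time"], word["text"]))
    return words


# Shortened stand-in for data["result"]:
sample = {
    "one_best": {
        "segments": [
            {
                "words": [
                    {"start_time": 1.182, "end_time": 1.4, "text": "oh"},
                    {"start_time": 1.4, "end_time": 1.444, "text": "<silence/>"},
                    {"start_time": 1.444, "end_time": 1.67, "text": "yeah"},
                ]
            }
        ]
    }
}
print(word_timestamps(sample))  # → [(1.182, 1.4, 'oh'), (1.444, 1.67, 'yeah')]
```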
Congratulations, you have successfully run Speech to Text using the Phonexia 6th generation model.
Phonexia 6th generation with parameters
Phonexia 6th generation Speech to Text provides an option to fine-tune the transcription. The underlying language model may struggle with jargon, local names, region-specific pronunciation or neologisms. Furthermore, speakers in your recordings might utter phrases whose transcription is ambiguous, for example because of a strong accent. You can use the `config` request body parameter to deal with all these situations.

You can specify phrases to be preferred over other options in its `preferred_phrases` attribute. In the `additional_words` attribute, you can list words that you expect not to be part of the language model's built-in vocabulary. See the endpoint documentation for more details. To use the request body parameters, you can encode them as a JSON string with `json.dumps` and pass them as the value of the `data` parameter of `requests.post`. In the following example, we define a specific phrase we expect to appear in the recording. To make the transcription even more precise, we provide the phonetic form of the word 'superspeed'.
```python
payload = {
    "config": json.dumps(
        {
            "preferred_phrases": ["superspeed tariff from your offer"],
            "additional_words": [
                {
                    "spelling": "superspeed",
                    "pronunciations": ["s u p @r s p i d"],
                },
            ],
        }
    )
}
```
```python
with open("Harry.wav", mode="rb") as file:
    files = {"file": file}
    response = requests.post(
        f"https://{SPEECH_PLATFORM_SERVER}/api/technology/speech-to-text",
        files=files,
        data=payload,  # Notice that we now send config in the request body
        params={"language": "en"},
    )
response.raise_for_status()
```
As already mentioned, the `config` parameter expects a JSON formatted string. In other words, `config`'s content is going to be interpreted as JSON by Speech Platform 4. Therefore, all rules of the JSON syntax apply.

Some phoneme symbols (e.g. the Czech phoneme `P\`) include the backslash, which has a special meaning in JSON and must be escaped with another backslash (`\\`) to suppress the special meaning. E.g., the correct way to capture the pronunciation of the Czech word Řek in `config.additional_words[*].pronunciations` is `"P\\ e k"`.
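If you build the `config` string with `json.dumps`, as in the example above, this escaping is handled for you: a single backslash in the Python string (written `\\` in Python source) comes out doubled in the JSON text. A quick sketch, using the Czech word Řek from the note above:

```python
import json

# The Python string "P\\ e k" contains a single backslash: P\ e k
config = json.dumps(
    {"additional_words": [{"spelling": "Řek", "pronunciations": ["P\\ e k"]}]}
)
print(config)  # the backslash appears doubled in the JSON text: "P\\ e k"
```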
After polling for the result as in the previous example, we'll get the following output (shortened for readability):
```json
{
  "task": {"task_id": "f866570e-9d62-4e31-b85b-1fa385e90c71", "state": "done"},
  "result": {
    "one_best": {...},
    "additional_words": [
      ...
      {
        "spelling": "superspeed",
        "pronunciations": [
          {
            "pronunciation": "s u p @r s p i d",
            "out_of_vocabulary": false
          }
        ]
      },
      {
        "spelling": "tariff",
        "pronunciations": [
          {
            "pronunciation": "t E r @ f",
            "out_of_vocabulary": false
          }
        ]
      }
    ]
  }
}
```
Notice that the technology returns pronunciation for all the additional words and words included in preferred phrases, even though we didn't specify pronunciations for all of them. If a word's pronunciation isn't specified by the user, it's either found in the model's vocabulary or auto-generated.
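For instance, to see which of the returned words fell outside the model's vocabulary (and therefore had their pronunciation auto-generated), you can scan the `additional_words` part of the result. A sketch over the structure shown above, with `sample` standing in for `data["result"]`:

```python
def out_of_vocabulary_words(result: dict) -> list:
    # Spellings whose every returned pronunciation is flagged as out of vocabulary.
    return [
        word["spelling"]
        for word in result["additional_words"]
        if all(p["out_of_vocabulary"] for p in word["pronunciations"])
    ]


# Shortened stand-in for data["result"]:
sample = {
    "additional_words": [
        {
            "spelling": "superspeed",
            "pronunciations": [
                {"pronunciation": "s u p @r s p i d", "out_of_vocabulary": False}
            ],
        },
        {
            "spelling": "tariff",
            "pronunciations": [
                {"pronunciation": "t E r @ f", "out_of_vocabulary": False}
            ],
        },
    ]
}
print(out_of_vocabulary_words(sample))  # → []
```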
Enhanced Speech to Text Built on Whisper
In order to run Enhanced Speech to Text Built on Whisper processing for a single media file, you should start by sending a POST request to `/api/technology/speech-to-text-whisper-enhanced` as follows:
```python
# Replace <speech-platform-server> with the actual server address
SPEECH_PLATFORM_SERVER = "<speech-platform-server>"

with open("Harry.wav", mode="rb") as file:
    files = {"file": file}
    response = requests.post(
        f"https://{SPEECH_PLATFORM_SERVER}/api/technology/speech-to-text-whisper-enhanced",
        files=files,
    )
response.raise_for_status()
```
If the task was successfully accepted, a 202 code will be returned together with a unique task ID in the response body. The task isn't processed immediately, but only scheduled for processing. You can check the current task status by polling for the result.

The URL for polling for the result is returned in the `X-Location` header. Alternatively, you can assemble the polling URL on your own by appending a slash (`/`) and the task ID to the initial URL.
```python
polling_url = response.headers["x-location"]

counter = 0
while counter < 100:
    response = requests.get(polling_url)
    response.raise_for_status()
    data = response.json()
    task_status = data["task"]["state"]
    if task_status in ["done", "failed", "rejected"]:
        break
    counter += 1
    time.sleep(5)
```
Once the polling finishes, `data` will contain the latest response from the server: either a response with the transcript, or an error message with details in case processing was not able to finish. An example result of a successful Enhanced Speech to Text Built on Whisper transcription looks like this (output shortened for readability):
```json
{
  "task": {"task_id": "b1850fed-3e4b-4f2a-94c8-9a57e73abb9f", "state": "done"},
  "result": {
    "one_best": {
      "segments": [
        {
          "channel_number": 0,
          "start_time": 1.06,
          "end_time": 6.8,
          "language": "en",
          "text": "Yeah, sure. Hold on a second. Where did I put it? Ah, here we are."
        },
        {
          "channel_number": 0,
          "start_time": 6.8,
          "end_time": 19.61,
          "language": "en",
          "text": "So the agreement number is 7895478. Right, the third digit. Where do I..."
        },
        {
          "channel_number": 0,
          "start_time": 19.61,
          "end_time": 26.01,
          "language": "en",
          "text": "Oh, yes, it's nine. Oh, right, the security code. Sorry, not the agreement number."
        },
        ...
        {
          "channel_number": 0,
          "start_time": 49.46,
          "end_time": 52.49,
          "language": "en",
          "text": "Cheers. You too."
        }
      ]
    }
  }
}
```
You can aggregate the transcript of the entire media file like this:
```python
transcript = ""
for segment in data["result"]["one_best"]["segments"]:
    transcript += segment["text"] + "\n"
```
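If you prefer a subtitle-like output, the Whisper segments also carry timestamps and the detected language, so you can format one line per segment. A small sketch over the structure shown above, with `sample` as a shortened stand-in for `data["result"]`:

```python
def format_segments(result: dict) -> str:
    # One line per segment: [start-end] (language) text
    lines = []
    for segment in result["one_best"]["segments"]:
        lines.append(
            f"[{segment['start_time']:.2f}-{segment['end_time']:.2f}] "
            f"({segment['language']}) {segment['text']}"
        )
    return "\n".join(lines)


# Shortened stand-in for data["result"]:
sample = {
    "one_best": {
        "segments": [
            {
                "start_time": 1.06,
                "end_time": 6.8,
                "language": "en",
                "text": "Yeah, sure. Hold on a second.",
            },
        ]
    }
}
print(format_segments(sample))  # → [1.06-6.80] (en) Yeah, sure. Hold on a second.
```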
Congratulations, you have successfully run Speech to Text using Enhanced Speech to Text Built on Whisper.
Full Python Code
Here is the full code for this example, slightly adjusted and wrapped into functions for better readability:
```python
import json
import time

import requests

SPEECH_PLATFORM_SERVER = "your-actual-Speech-Platform-server-URL"


def poll_result(polling_url: str, sleep: int = 5):
    while True:
        response = requests.get(polling_url)
        response.raise_for_status()
        data = response.json()
        task_status = data["task"]["state"]
        if task_status in ["done", "failed", "rejected"]:
            break
        time.sleep(sleep)
    return response


def do_speech_to_text_phonexia(audio_path: str, language: str, config: dict):
    print("Running Phonexia 6th generation Speech to Text.")
    with open(audio_path, mode="rb") as file:
        files = {"file": file}
        response = requests.post(
            f"https://{SPEECH_PLATFORM_SERVER}/api/technology/speech-to-text",
            files=files,
            params={"language": language},
            data=config,
        )
    response.raise_for_status()
    polling_url = response.headers["x-location"]
    speech_to_text_response = poll_result(polling_url)
    return speech_to_text_response.json()


def do_speech_to_text_whisper(audio_path: str):
    print("Running Enhanced Speech to Text Built on Whisper.")
    with open(audio_path, mode="rb") as file:
        files = {"file": file}
        response = requests.post(
            f"https://{SPEECH_PLATFORM_SERVER}/api/technology/speech-to-text-whisper-enhanced",
            files=files,
        )
    response.raise_for_status()
    polling_url = response.headers["x-location"]
    speech_to_text_whisper_enhanced_response = poll_result(polling_url)
    return speech_to_text_whisper_enhanced_response.json()


file_name = "Harry.wav"

payload = {
    "config": json.dumps(
        {
            "preferred_phrases": ["superspeed tariff from your offer"],
            "additional_words": [
                {
                    "spelling": "superspeed",
                    "pronunciations": ["s u p @r s p i d"],
                },
            ],
        }
    )
}

result = do_speech_to_text_phonexia(file_name, "en", payload)
print(f"Phonexia 6th generation:\n{json.dumps(result, indent=2)}\n")

result = do_speech_to_text_whisper(file_name)
print(f"Enhanced Speech to Text Built on Whisper:\n{json.dumps(result, indent=2)}")
```