Version: 2.0.0

Speech to Text

This tutorial demonstrates how to use the Phonexia Speech Platform to obtain a transcript of the speech in a media file, in other words, how to run Speech to Text. The Speech Platform supports two different Speech to Text engines: Phonexia 6th generation Speech to Text and Whisper enhanced. In this guide, we will process audio with both of them.

Attached, you will find the audio file Harry.wav, which is a mono recording in English. It will be used as the example audio throughout this guide.

At the end of this guide, you will find a full Python code example that encapsulates all the steps discussed. This should offer a comprehensive understanding and an actionable guide on implementing Speech to Text in your own projects.

Environment Setup

We are using Python 3.9 and Python library requests 2.27 in this example. You can install the requests library with pip as follows:

pip install requests~=2.27

Then, you can import the following libraries (time is built-in):

import time
import requests

Phonexia 6th generation

In order to run Phonexia Speech to Text processing for a single media file, start by sending a POST request to /api/technology/speech-to-text.

You also need to pass the language of the recording as an argument. The example recording is in English, so you should pass "en", as follows:

# Replace <speech-platform-server> with the actual server address
SPEECH_PLATFORM_SERVER = "<speech-platform-server>"
with open("Harry.wav", mode="rb") as file:
    files = {"file": file}
    response = requests.post(
        f"https://{SPEECH_PLATFORM_SERVER}/api/technology/speech-to-text",
        files=files,
        params={"language": "en"},  # Set the language matching the language spoken in the recording.
    )
response.raise_for_status()

Note that in order to get meaningful results, the language parameter has to match the language spoken in the audio file. See the documentation for the list of supported languages.

If the task is successfully accepted, a 202 status code is returned together with a unique task ID in the response body. The task isn't processed immediately, but only scheduled for processing. You can check the current task status by polling for the result.

The URL for polling for the result is returned in the X-Location header. Alternatively, you can assemble the polling URL on your own by appending a slash (/) and task ID to the initial URL.
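To illustrate the manual alternative, here is a minimal sketch of assembling the polling URL yourself. The task ID value below is taken from the example output later in this guide; in practice you would read it from the 202 response body:

```python
# Sketch: build the polling URL by appending a slash and the task ID
# to the initial request URL.
base_url = "https://<speech-platform-server>/api/technology/speech-to-text"
task_id = "904d22b6-8e1d-45de-8579-471592c86f3d"  # taken from the 202 response body

polling_url = f"{base_url}/{task_id}"
```

Using the X-Location header is usually simpler, since the server assembles this URL for you.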

polling_url = response.headers["x-location"]

counter = 0
while counter < 100:
    response = requests.get(polling_url)
    response.raise_for_status()
    data = response.json()
    task_status = data["task"]["state"]
    if task_status in ["done", "failed", "rejected"]:
        break
    counter += 1
    time.sleep(5)

Once the polling finishes, data will contain the latest response from the server -- either a response with the transcript, or an error message with details if processing could not finish. An example result of a successful Phonexia Speech to Text transcription looks like this (output shortened for readability):

{
  "result": {
    "one_best": {
      "segments": [
        {
          "channel_number": 0,
          "end_time": 3.259,
          "start_time": 1.182,
          "text": "oh <silence/> yeah sure <silence/> hold on a second",
          "words": [
            {
              "end_time": 1.4,
              "start_time": 1.182,
              "text": "oh"
            },
            {
              "end_time": 1.444,
              "start_time": 1.4,
              "text": "<silence/>"
            },
            {
              "end_time": 1.67,
              "start_time": 1.444,
              "text": "yeah"
            },
            ...
            {
              "end_time": 3.259,
              "start_time": 2.81,
              "text": "second"
            }
          ]
        },
        ...
        {
          "channel_number": 0,
          "end_time": 52.545,
          "start_time": 51.79,
          "text": "cheers you too",
          "words": [ ... ]
        }
      ]
    }
  },
  "task": {
    "state": "done",
    "task_id": "904d22b6-8e1d-45de-8579-471592c86f3d"
  }
}
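Before reading the transcript, it is good practice to check the final task state. A minimal sketch, where `data` is a hypothetical stand-in for the JSON body returned by the last poll:

```python
# Minimal sketch: only read the transcript when the task finished successfully.
# `data` stands in for the JSON body returned by the final poll.
data = {
    "result": {"one_best": {"segments": []}},
    "task": {"state": "done", "task_id": "904d22b6-8e1d-45de-8579-471592c86f3d"},
}

if data["task"]["state"] == "done":
    segments = data["result"]["one_best"]["segments"]
else:
    # "failed" and "rejected" responses carry error details instead of a result.
    raise RuntimeError(f"Task ended in state {data['task']['state']}")
```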

Phonexia 6th generation Speech to Text provides detailed timestamps for individual words. If you are only interested in a single transcript of the entire media file, you can access the whole text of each segment like this:

transcript = ""
for segment in data["result"]["one_best"]["segments"]:
    transcript += segment["text"] + "\n"
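Because each segment also carries a words array, you can just as easily build a word-level timeline. The sketch below skips the <silence/> markers; the segment dict mirrors the shape of the example output above:

```python
# Build (word, start_time, end_time) tuples from one segment,
# skipping the <silence/> markers.
segment = {
    "words": [
        {"text": "oh", "start_time": 1.182, "end_time": 1.4},
        {"text": "<silence/>", "start_time": 1.4, "end_time": 1.444},
        {"text": "yeah", "start_time": 1.444, "end_time": 1.67},
    ]
}

timeline = [
    (w["text"], w["start_time"], w["end_time"])
    for w in segment["words"]
    if w["text"] != "<silence/>"
]
```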

Congratulations, you have successfully run Speech to Text using the Phonexia 6th generation model.

Whisper enhanced

In order to run Whisper enhanced Speech to Text processing for a single media file, you should start by sending a POST request to /api/technology/speech-to-text-whisper-enhanced as follows:

# Replace <speech-platform-server> with the actual server address
SPEECH_PLATFORM_SERVER = "<speech-platform-server>"
with open("Harry.wav", mode="rb") as file:
    files = {"file": file}
    response = requests.post(
        f"https://{SPEECH_PLATFORM_SERVER}/api/technology/speech-to-text-whisper-enhanced",
        files=files,
    )
response.raise_for_status()

If the task is successfully accepted, a 202 status code is returned together with a unique task ID in the response body. The task isn't processed immediately, but only scheduled for processing. You can check the current task status by polling for the result.

The URL for polling for the result is returned in the X-Location header. Alternatively, you can assemble the polling URL on your own by appending a slash (/) and task ID to the initial URL.

polling_url = response.headers["x-location"]

counter = 0
while counter < 100:
    response = requests.get(polling_url)
    response.raise_for_status()
    data = response.json()
    task_status = data["task"]["state"]
    if task_status in ["done", "failed", "rejected"]:
        break
    counter += 1
    time.sleep(5)

Once the polling finishes, data will contain the latest response from the server -- either a response with the transcript, or an error message with details if processing could not finish. An example result of a successful Whisper enhanced transcription looks like this (output shortened for readability):

{
  "result": {
    "one_best": {
      "segments": [
        {
          "channel_number": 0,
          "end_time": 16.41,
          "language": "en",
          "start_time": 1.06,
          "text": "Yeah, sure. Hold on a second. Where did I put it? Ah, here we are. So the agreement number is 7 8 9 5 4 7 8."
        },
        {
          "channel_number": 0,
          "end_time": 26.17,
          "language": "en",
          "start_time": 16.41,
          "text": "Right, the third digit. Where do I? Oh yes, it's 9. Oh right, the security code. Sorry, not the agreement number."
        },
        {
          "channel_number": 0,
          "end_time": 36.5,
          "language": "en",
          "start_time": 26.17,
          "text": "Yeah, so the fourth and seventh digit you said, right? Fourth and seventh digit. Okay, it's 3 and 4."
        },
        {
          "channel_number": 0,
          "end_time": 41.58,
          "language": "en",
          "start_time": 36.5,
          "text": "Well, I'm interested in the super speed tariff from your offer."
        },
        {
          "channel_number": 0,
          "end_time": 52.49,
          "language": "en",
          "start_time": 41.58,
          "text": "No, I think that's everything. Thank you. Yeah, sounds good. Yeah, that's all. Cheers, you too."
        }
      ]
    }
  },
  "task": {
    "state": "done",
    "task_id": "15612b3b-212e-4e17-9045-4cba53ae9fd3"
  }
}

You can aggregate the transcript of the entire media file like this:

transcript = ""
for segment in data["result"]["one_best"]["segments"]:
    transcript += segment["text"] + "\n"
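Note that each Whisper enhanced segment also includes a language field. As a small illustration, you could collect the set of languages detected across segments; the segments list below mirrors the shape of the example output above:

```python
# Collect the set of languages detected across segments.
segments = [
    {"language": "en", "text": "Yeah, sure. Hold on a second."},
    {"language": "en", "text": "Cheers, you too."},
]

detected_languages = {segment["language"] for segment in segments}
```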

Congratulations, you have successfully run Speech to Text using the Whisper enhanced model.

Full Python Code

Here is the full code for this example, slightly adjusted and wrapped into functions for better readability:

import time
import requests


SPEECH_PLATFORM_SERVER = "<speech-platform-server>"  # Replace with your actual server URL


def poll_result(polling_url: str, sleep: int = 5):
    while True:
        response = requests.get(polling_url)
        response.raise_for_status()
        data = response.json()
        task_status = data["task"]["state"]
        if task_status in ["done", "failed", "rejected"]:
            break
        time.sleep(sleep)
    return response


def do_speech_to_text_phonexia(audio_path: str, language: str):
    with open(audio_path, mode="rb") as file:
        files = {"file": file}
        response = requests.post(
            f"https://{SPEECH_PLATFORM_SERVER}/api/technology/speech-to-text",
            files=files,
            params={"language": language},
        )
    response.raise_for_status()
    polling_url = response.headers["x-location"]
    speech_to_text_response = poll_result(polling_url)
    return speech_to_text_response.json()


def do_speech_to_text_whisper(audio_path: str):
    with open(audio_path, mode="rb") as file:
        files = {"file": file}
        response = requests.post(
            f"https://{SPEECH_PLATFORM_SERVER}/api/technology/speech-to-text-whisper-enhanced",
            files=files,
        )
    response.raise_for_status()
    polling_url = response.headers["x-location"]
    speech_to_text_whisper_enhanced_response = poll_result(polling_url)
    return speech_to_text_whisper_enhanced_response.json()


def print_transcription(response: dict):
    transcription = ""
    for segment in response["result"]["one_best"]["segments"]:
        transcription += segment["text"] + "\n"
    print(transcription)


file_name = "Harry.wav"

print("Phonexia 6th generation transcript:")
print_transcription(do_speech_to_text_phonexia(file_name, "en"))

print("Whisper enhanced transcript:")
print_transcription(do_speech_to_text_whisper(file_name))