Speech to Text: start task

POST /api/technology/speech-to-text

Start a Speech to Text task for a media file.

Speech to Text features

  • Multi-channel audio files are supported.
  • Channel id is included in individual transcription segments.
  • The built-in vocabulary can be extended using the config field of the multipart/form-data body. The value of config is a JSON-formatted string.

Request

Query Parameters

    language Language required

    A string specifying the language for Phonexia Speech to Text. The value should follow RFC 5646 and can consist of the "language", "region", and "privateuse" subtags. Refer to supported languages for a complete list of supported language tags.

    channel_mode Channel Mode

    Possible values: [split, mix]

    Default value: split

    A string enumeration value that indicates how the audio channels are processed during conversion. If the channels parameter is specified, only the channels with the listed indices are processed and the others are ignored.

    channels Channels

    A comma-separated string of integers (without spaces) listing the channels to keep during conversion. If specified, only the channels with the listed indices are processed and the others are ignored. If empty, all channels in the audio data are processed. Channel indices are 0-based.
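The query parameters above can be assembled as in the following minimal sketch. The helper name build_start_url and the language tag "en" are illustrative assumptions, not part of the API; check the supported languages list for real tags.

```python
from urllib.parse import urlencode

ENDPOINT = "/api/technology/speech-to-text"  # path documented on this page
CHANNEL_MODES = ("split", "mix")             # possible values listed above


def build_start_url(language, channel_mode=None, channels=None):
    """Return the endpoint path with its query string (hypothetical helper).

    channels is an optional iterable of 0-based channel indices.
    """
    if channel_mode is not None and channel_mode not in CHANNEL_MODES:
        raise ValueError("channel_mode must be 'split' or 'mix'")

    params = {"language": language}
    if channel_mode is not None:
        params["channel_mode"] = channel_mode

    indices = list(channels) if channels is not None else []
    if any(i < 0 for i in indices):
        raise ValueError("channel indices are 0-based and cannot be negative")
    if indices:
        # Comma-separated, no spaces, as the channels parameter requires.
        params["channels"] = ",".join(str(i) for i in indices)

    # safe="," keeps the commas readable instead of percent-encoding them.
    return ENDPOINT + "?" + urlencode(params, safe=",")


print(build_start_url("en", channel_mode="mix", channels=[0, 1]))
```

Omitting channel_mode leaves the server-side default (split) in effect, and omitting channels processes every channel.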

Header Parameters

    x-correlation-id X-Correlation-Id

    A correlation ID is a special type of request ID that stays the same across a series of requests and responses, identifying one transaction in a distributed system. A correlation ID is generated if not provided.

    x-request-id X-Request-Id

    A request ID uniquely identifies a single request and response pair across all components of a distributed system (for example, a microservices architecture). A request ID is generated if not provided.
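A client that wants to trace its own calls could set both headers itself, as in this sketch. It assumes UUID strings are acceptable ID values; the page does not state a required format, and tracing_headers is a hypothetical helper.

```python
import uuid


def tracing_headers(correlation_id=None):
    """Build the two optional tracing headers (hypothetical helper).

    Assumption: any unique string is accepted; UUIDs are used here.
    """
    return {
        # Shared by every request that belongs to the same transaction.
        "x-correlation-id": correlation_id or str(uuid.uuid4()),
        # Fresh for each individual request.
        "x-request-id": str(uuid.uuid4()),
    }


start = tracing_headers()
poll = tracing_headers(correlation_id=start["x-correlation-id"])
print(start["x-correlation-id"] == poll["x-correlation-id"])
```

Reusing the correlation ID from the start request when polling keeps the whole task under one transaction in the logs.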

Body

required

    file binary required

    Input media file.

    config

    object

    Optional configuration for Speech to Text.

    preferred_phrases string[]

    Array of multi-word phrases that are expected to appear in the media file. Phrases provided here are preferred over other variants in the transcription.

    additional_words

    object[]

    Array of words, optionally with explicitly defined pronunciations. Words provided here extend the underlying language model for the given request. This is useful for adding names, slang, foreign or jargon terms or local pronunciations of common words. These are typically not included in the model's built-in vocabulary.

  • Array [

  • spelling Spelling (string) required

    The grapheme representation of the word. If you only use graphemes of the given language, you don't have to specify a pronunciation as it will be generated by the technology. However, generated pronunciations may be incorrect for abbreviations, foreign words and other words with unusual spelling. You can also use characters outside of the language's valid grapheme set. In that case, you have to specify at least one pronunciation.

    pronunciations string[]

    Array of the word's pronunciations. Each pronunciation must be specified using only phonemes valid for the given language. The technology accepts only pronunciations with at least three phonemes. Individual phonemes must be separated by a space.

  • ]
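Because config is sent as a JSON string inside the multipart/form-data body, it can be built and checked before the request is made. The sketch below assumes the field layout described above; make_word and build_config are hypothetical helpers, and the phonemes in the example are placeholders rather than a valid phoneme set for any language.

```python
import json

MIN_PHONEMES = 3  # minimum pronunciation length stated on this page


def make_word(spelling, pronunciations=None):
    """Build one additional_words entry (hypothetical helper)."""
    entry = {"spelling": spelling}
    if pronunciations:
        for pron in pronunciations:
            # Phonemes are separated by single spaces.
            if len(pron.split()) < MIN_PHONEMES:
                raise ValueError(
                    f"pronunciation {pron!r} has fewer than {MIN_PHONEMES} phonemes"
                )
        entry["pronunciations"] = list(pronunciations)
    return entry


def build_config(preferred_phrases=(), additional_words=()):
    """Serialize the optional config object to a JSON string."""
    config = {}
    if preferred_phrases:
        config["preferred_phrases"] = list(preferred_phrases)
    if additional_words:
        config["additional_words"] = list(additional_words)
    return json.dumps(config, ensure_ascii=False)


config_json = build_config(
    preferred_phrases=["speech to text"],
    # Placeholder phonemes; use the phoneme set of the chosen language.
    additional_words=[make_word("Phonexia", ["f o n e k s i a"])],
)
print(config_json)
```

The resulting string would be sent as the config form field alongside the file field.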

Responses

Speech to Text task was accepted. Follow the X-Location header to poll for the task state.

Response Headers

  • X-Location

    string

    Example: /api/technology/speech-to-text/123e4567-e89b-12d3-a456-426614174000

    A URL the client should poll for task state and result.
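Since the example X-Location value is a path rather than an absolute URL, a client may need to resolve it against the server's base URL. A minimal sketch follows; the host speech.example.com and the helper poll_url are assumptions.

```python
from urllib.parse import urljoin


def poll_url(base_url, response_headers):
    """Resolve the X-Location header against the server base URL."""
    # HTTP header names are case-insensitive, so normalize before lookup.
    lowered = {name.lower(): value for name, value in response_headers.items()}
    location = lowered.get("x-location")
    if location is None:
        raise KeyError("response has no X-Location header")
    return urljoin(base_url, location)


headers = {
    "X-Location": "/api/technology/speech-to-text/123e4567-e89b-12d3-a456-426614174000"
}
print(poll_url("https://speech.example.com", headers))
```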

Schema

    task

    object

    required

    task_id uuid required
    state TaskInfoState (string) required

    Possible values: [pending, running, rejected, failed, done]
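A polling loop over the task state might look like the sketch below. It makes two assumptions that this page does not confirm: that rejected, failed, and done are the terminal states, and that the polled response carries the same task object as the schema above. The fetch_state callable stands in for an HTTP GET of the X-Location URL.

```python
import time

# Assumption: these states are final; pending and running mean "keep polling".
TERMINAL_STATES = {"rejected", "failed", "done"}


def wait_for_task(fetch_state, interval=1.0, max_polls=60, sleep=time.sleep):
    """Call fetch_state() until the task reaches a terminal state.

    fetch_state is a caller-supplied function returning the parsed JSON body.
    """
    for attempt in range(max_polls):
        body = fetch_state()
        if body["task"]["state"] in TERMINAL_STATES:
            return body
        if attempt < max_polls - 1:
            sleep(interval)
    raise TimeoutError(f"task still not finished after {max_polls} polls")


# Usage with canned responses in place of real HTTP calls.
task_id = "123e4567-e89b-12d3-a456-426614174000"
responses = iter(
    {"task": {"task_id": task_id, "state": state}}
    for state in ("pending", "running", "done")
)
result = wait_for_task(lambda: next(responses), sleep=lambda _: None)
print(result["task"]["state"])
```

Passing the fetch and sleep functions in keeps the loop independent of any particular HTTP client and easy to test.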