Speech to Text: start task

POST /api/technology/speech-to-text-whisper-enhanced

Start an Enhanced Speech to Text Built on Whisper task for a media file.

Enhanced Speech to Text Built on Whisper features

  • Multi-channel audio files are supported.
  • Channel id is included in individual transcription segments.
  • Language for transcription can be specified as a query parameter.
  • Language switching can be enabled via a query parameter. When enabled, the predominant language is identified for each interval of roughly thirty seconds of audio, and that language is used to transcribe the interval.
  • To use a specific language as the source, it must be licensed. Otherwise, an error is raised.
  • If you use auto-detect or language switching, only licensed languages are considered as the source language for the transcription. If the detected language is not licensed, the closest licensed language is used instead.

Request

Query Parameters

    channel_mode Channel Mode

    Possible values: [split, mix]

    Default value: split

    Specifies how the audio channels are processed during conversion. Only the channels selected by the channels parameter are processed; all others are ignored.

    channels Channels

    A comma-separated string of integers (without spaces) listing the channels to keep during conversion. If specified, only those channels are processed and all others are ignored. If empty, all channels in the audio data are processed. Channel indices are 0-based.

    language Language

    Default value: auto

    A string specifying the language for Enhanced Speech to Text Built on Whisper. The value should follow RFC 5646. It can consist of the "language", "region", and "privateuse" subtags. Use auto for automatic detection of the language. Refer to supported languages for a complete list of supported language tags.

    language_switching Language Switching

    By default, the language of the audio is detected once at the beginning of processing. Setting language_switching to true allows for dynamic language switching in the audio, with the language being detected approximately every 30 seconds.
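    As an illustration of how the query parameters above combine, here is a minimal sketch that builds the request URL. The base URL (https://example.invalid) and the helper name build_task_url are hypothetical; only the path, parameter names, and allowed values come from this page.

```python
from urllib.parse import urlencode

# Path as documented on this page.
ENDPOINT = "/api/technology/speech-to-text-whisper-enhanced"


def build_task_url(base_url, channel_mode="split", channels=None,
                   language="auto", language_switching=False):
    """Build the POST URL with query parameters (hypothetical helper)."""
    if channel_mode not in ("split", "mix"):
        raise ValueError("channel_mode must be 'split' or 'mix'")
    params = {"channel_mode": channel_mode, "language": language}
    if channels:
        if any(c < 0 for c in channels):
            raise ValueError("channel indices are 0-based and non-negative")
        # Comma-separated, without spaces, as the channels parameter requires.
        params["channels"] = ",".join(str(c) for c in channels)
    if language_switching:
        params["language_switching"] = "true"
    # safe="," keeps the comma readable; percent-encoding it is also valid.
    return f"{base_url.rstrip('/')}{ENDPOINT}?{urlencode(params, safe=',')}"


if __name__ == "__main__":
    print(build_task_url("https://example.invalid", channels=[0, 2], language="en"))
```

    Leaving channels empty omits the parameter, which matches the documented behavior of processing all channels.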

Header Parameters

    x-correlation-id X-Correlation-Id

    A correlation ID is a special type of request ID that is shared across a series of requests and responses and identifies a transaction in a distributed system. A correlation ID is generated if not provided.

    x-request-id X-Request-Id

    A request ID uniquely identifies a single request and response pair across all components of a distributed system (for example, a microservices architecture). A request ID is generated if not provided.
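    Both headers are optional, but a client that wants to trace one transaction across several calls may prefer to set them itself. The sketch below assumes UUID strings are acceptable values; this page does not specify an ID format, and tracing_headers is a hypothetical helper.

```python
import uuid


def tracing_headers(correlation_id=None):
    """Return the two tracing headers (UUID format is an assumption)."""
    return {
        # Reused for every request belonging to the same transaction.
        "x-correlation-id": correlation_id or str(uuid.uuid4()),
        # Fresh for each individual request.
        "x-request-id": str(uuid.uuid4()),
    }


if __name__ == "__main__":
    first = tracing_headers()
    # A follow-up request keeps the correlation ID and gets a new request ID.
    second = tracing_headers(first["x-correlation-id"])
    print(first, second)
```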

Body

required

    file binary required

    Input media file.
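    A minimal sketch of encoding the request body follows. It assumes the file is sent as multipart/form-data under the field name file, which this page implies but does not state outright; encode_multipart_file and the generic application/octet-stream part type are illustrative choices.

```python
import uuid


def encode_multipart_file(filename, data, field="file"):
    """Encode one binary file as a multipart/form-data body.

    Returns (body_bytes, content_type_header_value).
    """
    boundary = uuid.uuid4().hex
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="{field}"; filename="{filename}"\r\n'
        "Content-Type: application/octet-stream\r\n"
        "\r\n"
    ).encode("utf-8")
    tail = f"\r\n--{boundary}--\r\n".encode("ascii")
    return head + data + tail, f"multipart/form-data; boundary={boundary}"


if __name__ == "__main__":
    body, content_type = encode_multipart_file("call.wav", b"RIFF....WAVE")
    print(content_type, len(body))
```

    The returned body and content type can then be passed to any HTTP client, for example as the data argument and Content-Type header of urllib.request.Request with method="POST".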

Responses

Speech to Text task was accepted. Follow the X-Location header to poll for the task state.

Response Headers

  • X-Location

    string

    Example: /api/technology/speech-to-text-whisper-enhanced/123e4567-e89b-12d3-a456-426614174000

    A URL the client should poll for task state and result.
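    Because the example value of X-Location is a path rather than an absolute URL, a client will likely need to resolve it against the server's base URL before polling. A minimal sketch, with a hypothetical base URL and helper name:

```python
from urllib.parse import urljoin


def poll_url(base_url, headers):
    """Resolve the X-Location response header against the base URL."""
    # HTTP header names are case-insensitive, so compare in lowercase.
    location = next(
        (v for k, v in headers.items() if k.lower() == "x-location"), None
    )
    if location is None:
        raise KeyError("X-Location header missing from response")
    return urljoin(base_url, location)


if __name__ == "__main__":
    headers = {
        "X-Location": "/api/technology/speech-to-text-whisper-enhanced/"
                      "123e4567-e89b-12d3-a456-426614174000"
    }
    print(poll_url("https://example.invalid", headers))
```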

Schema

    task

    object

    required

    task_id uuid required
    state TaskInfoState (string) required

    Possible values: [pending, running, rejected, failed, done]
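    The schema above can be read with a few lines of code. This sketch assumes the response body is JSON with a top-level task object, and it treats rejected, failed, and done as terminal states; that split is an assumption based on the state names, not something this page states.

```python
import json

# States listed in the schema on this page.
KNOWN_STATES = {"pending", "running", "rejected", "failed", "done"}
# Assumption: these states mean polling can stop.
TERMINAL_STATES = {"rejected", "failed", "done"}


def parse_task(body):
    """Return (task_id, state, is_terminal) from a JSON response body."""
    task = json.loads(body)["task"]
    state = task["state"]
    if state not in KNOWN_STATES:
        raise ValueError(f"unexpected task state: {state!r}")
    return task["task_id"], state, state in TERMINAL_STATES


if __name__ == "__main__":
    sample = (
        '{"task": {"task_id": "123e4567-e89b-12d3-a456-426614174000",'
        ' "state": "pending"}}'
    )
    print(parse_task(sample))
```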