Version: 4.0.0-rc1

Adaptation

Enhanced Speech to Text Built on Whisper supports automatic adaptation of the model on the customer's side. The goal of Whisper adaptation is to enhance the performance of the Whisper speech-to-text (STT) model by fine-tuning it with additional domain-specific or customer-provided data.

While the original Whisper model supports a wide range of languages, its accuracy may not always meet expectations for specific use cases or environments.

By adapting the model with high-quality, annotated audio, we aim to deliver significantly improved transcription accuracy, making Whisper a more attractive solution for customers who require tailored STT performance in their language or domain.

note

Due to the multilingual architecture of the model, optimizing it for a specific language through retraining may lead to a reduction in accuracy for other supported languages. This trade-off is an inherent characteristic of the system’s design.

This documentation article outlines the process for preparing data, building the adaptation environment using Docker, and running the fine-tuning workflow.

Customized approach for Whisper Adaptation

Phonexia offers several ways in which customized Whisper adaptation can be achieved.

  1. Adapted model (.zip) sent to Phonexia - After the customer completes all the steps successfully, the automation creates a .zip file that needs to be sent to the Phonexia team. Once the Phonexia team receives the file, we will create a customized, licensed Enhanced Speech to Text Built on Whisper model and provide it back to the client.
  2. Annotated data sent to Phonexia - The customer sends annotated data from ELAN to the Phonexia consulting team. The Phonexia team runs the adaptation based on the provided data, and the client obtains a customized, licensed Whisper model with the adaptation already applied.

For more information regarding pricing and licensing, please contact your Phonexia business representative.

Prerequisites and hardware requirements

Before you begin, ensure your system meets the following requirements:

Hardware requirements

  • GPU: An NVIDIA GPU with sufficient Video RAM (VRAM) is essential. The required VRAM depends heavily on:
    • PHX_MODEL_SIZE: Larger models (medium, large-v3) require significantly more VRAM than smaller ones (tiny, base).
    • PHX_BATCH_SIZE: Larger batch sizes increase VRAM usage.
  • CPU: Multiple CPU cores (PHX_CPU_COUNT) speed up data preprocessing. 8+ cores are recommended.
  • RAM: Sufficient system RAM is needed for data loading and general operations. 32GB+ is a safe starting point, but requirements may increase with very large datasets.
  • Storage: Enough disk space for the training data, Docker image, and the final adapted models.

The system has been successfully tested on the following configuration:
  • OS: Debian GNU/Linux 12
  • CPU: AMD EPYC 9454 48-Core Processor
  • GPU: NVIDIA H100 (80GB)
    • Driver Version: 550.120
    • CUDA Version: 12.4
  • System RAM: 377 GB

Adaptation process

1. Using the Docker image

Download the Docker image from the official Phonexia Docker repository, or pull it directly:

docker pull phonexia/speech-to-text-adaptation:latest
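
To confirm that the image is available locally, you can list it with a standard Docker command:

docker images phonexia/speech-to-text-adaptation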

2. Prepare input data

It is mandatory to use the ELAN tool for annotating the audio data. ELAN is free, open-source software that makes it possible to annotate, document, and analyze communication.

The Phonexia team prepared an annotation manual to help you easily and correctly annotate your recordings.

You can download the annotation manual here: Annotation_manual_EN_Elan.pdf

  • Audio Format: Audio files should ideally be in the format used in your target production environment.
    • Recommended: Mono, 16 kHz sampling rate, 16-bit PCM WAV format (.wav extension).
  • Directory Structure: Organize your data into a single input directory. This directory can contain subdirectories. For each audio file (filename.wav), there must be a corresponding ELAN annotation file with the exact same base name (filename.eaf) in the same directory.

Example Directory Structure:

input_directory/
├── data_set_01/
│   ├── recording_001.wav
│   ├── recording_001.eaf
│   ├── recording_002.wav
│   └── recording_002.eaf
├── data_set_02/
│   ├── interview_part_a.wav
│   ├── interview_part_a.eaf
│   ├── interview_part_b.wav
│   └── interview_part_b.eaf
└── miscellaneous/
    ├── short_clip.wav
    └── short_clip.eaf
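
Before starting the adaptation, it can be useful to verify that every audio file has a matching annotation file. A minimal shell check over the structure above (a sketch using standard find and POSIX shell; adjust the directory name to yours):

# Report any .wav file that is missing its .eaf counterpart
find input_directory -name '*.wav' | while read -r wav; do
  [ -f "${wav%.wav}.eaf" ] || echo "Missing annotation: ${wav%.wav}.eaf"
done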

3. Running the adaptation

Once your environment is set up and your data is prepared, you can start the fine-tuning process using either docker run or docker compose.

Configuration Parameters

The adaptation process is configured using the following parameters, typically passed as environment variables to Docker:

  • PHX_LANGUAGE: (Required) The target language for model adaptation (e.g., "english", "spanish", "czech"). Should match the language name expected by Whisper.
  • PHX_MODEL_SIZE: (Required) The size of the base Whisper model to fine-tune. Options: tiny, base, small, medium, large-v3, large-v3-turbo.
  • PHX_INPUT_DIR: (Required) Path to the directory containing the prepared audio (.wav) and annotation (.eaf) files.
  • PHX_OUTPUT_DIR: (Required) Path to the directory where the adapted model and other outputs will be saved.
  • PHX_BATCH_SIZE: (Optional) The number of samples processed in each training step. Higher values can speed up training but require more GPU memory. Adjust based on your GPU capacity and model size. (Default: 32).
  • PHX_CPU_COUNT: (Optional) The number of CPU cores used for data loading and preprocessing in parallel. Increase this if your GPU utilization is low and data loading seems to be the bottleneck. (Default: 8, or the total number of available cores if less than 8).
  • SHM_SIZE: (Optional) Specifies the size of the shared memory (/dev/shm) available to the container (e.g., "10gb"). Increase this value if you encounter errors related to shared memory exhaustion (such as shm_open: No space left on device).
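
For example, the shell variables used in the docker run command below could be set like this (illustrative values taken from the options above; adjust them to your language, GPU, and data):

export PHX_LANGUAGE="czech"
export PHX_MODEL_SIZE="large-v3"
export PHX_BATCH_SIZE=32
export PHX_CPU_COUNT=8
export SHM_SIZE="10gb"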

Running the adaptation using Docker Run

This method runs the adaptation process inside a Docker container using the image pulled in step 1.

  1. Ensure prerequisites are met: Docker and the NVIDIA Container Toolkit are installed and configured, the input data is prepared, and the Docker image (phonexia/speech-to-text-adaptation:latest or similar) is available locally.

  2. Run Container:

docker run \
  --runtime nvidia \
  --gpus all \
  --shm-size="$SHM_SIZE" \
  -e PHX_LANGUAGE="$PHX_LANGUAGE" \
  -e PHX_MODEL_SIZE="$PHX_MODEL_SIZE" \
  -e PHX_BATCH_SIZE="$PHX_BATCH_SIZE" \
  -e PHX_CPU_COUNT="$PHX_CPU_COUNT" \
  -v /path/to/your/input_data:/input:ro \
  -v /path/to/your/output_data:/output \
  phonexia/speech-to-text-adaptation:latest
  • --runtime nvidia --gpus all: Enables GPU access within the container.
  • --shm-size=$SHM_SIZE: Sets shared memory size.
  • -e VARIABLE=value: Sets environment variables inside the container.
  • -v /host/path:/container/path: Mounts directories from your host machine into the container.
    • Mount your input data directory to /input inside the container.
    • Mount a host directory where you want outputs saved to /output inside the container.
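
As a concrete illustration, a fully expanded command might look like the following (hypothetical host paths and example values; substitute your own):

docker run \
  --runtime nvidia \
  --gpus all \
  --shm-size="10gb" \
  -e PHX_LANGUAGE="czech" \
  -e PHX_MODEL_SIZE="large-v3" \
  -e PHX_BATCH_SIZE=32 \
  -e PHX_CPU_COUNT=8 \
  -v /data/whisper/input:/input:ro \
  -v /data/whisper/output:/output \
  phonexia/speech-to-text-adaptation:latest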

Running the adaptation using Docker Compose

Docker Compose allows you to easily configure the environment of Docker applications using a docker-compose.yml file.

  1. Create and configure the docker-compose.yml:
services:
  stt-whisper-adaptation:
    environment:
      # The target language for model adaptation (e.g., "english", "spanish", etc.).
      # To see the full list of supported languages, run the container without this variable
      # and check the output.
      - PHX_LANGUAGE= # TODO: specify the target language
      # The size of the Whisper model to use. Options are: tiny, base, small, medium, large-v3, large-v3-turbo.
      - PHX_MODEL_SIZE= # TODO: specify the model size
      # The batch size for training. Higher values may speed up training but increase GPU memory usage.
      # Reduce this value if you encounter GPU out-of-memory errors (torch.cuda.OutOfMemoryError).
      - PHX_BATCH_SIZE=32
      # The number of CPU cores to allocate for data loading and preprocessing.
      # Default is 8 or the total number of available CPU cores if less than 8.
      # Increase this value if GPU utilization is low.
      # - PHX_CPU_COUNT=8
    image: phonexia/speech-to-text-adaptation:latest
    # Shared memory size available to the container (used during parallel data loading).
    # Please increase this value if you encounter shared memory errors (e.g., "shm_open: No space left on device").
    shm_size: '10gb'
    volumes:
      # Mount the input directory containing audio files and annotations.
      - INPUT_DIR:/input:ro # TODO: replace `INPUT_DIR` with the path to your input directory
      # Mount the output directory where the adapted model will be saved.
      - OUTPUT_DIR:/output # TODO: replace `OUTPUT_DIR` with the path to your output directory
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              # The `device_ids` value, specified as a list of strings, represents GPU device IDs
              # from the host. You can find the device ID in the output of `nvidia-smi` on the host.
              # If no `device_ids` are set, all GPUs available on the host are used by default.
              # device_ids: ["0", "1"]
              capabilities: [gpu]
  2. Run with Docker Compose:
docker compose up # Starts the adaptation service in the foreground
# or
docker compose up -d # Runs the service in detached mode as a background process

4. Monitoring progress

The adaptation script executes in several distinct stages. You can monitor the console output, or the Docker logs if you are running in detached mode (docker logs CONTAINER or docker compose logs), as shown below.
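
For example, to follow the logs in real time (standard Docker commands; replace CONTAINER with your container name or ID):

docker logs -f CONTAINER # container started with docker run
docker compose logs -f stt-whisper-adaptation # service started with docker compose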

  1. Data Preparation (dataprep): Scans the input directory for valid audio/annotation pairs. Splits the data into training and testing sets. Converts the data into an optimized binary format suitable for efficient loading during training.

  2. Fine-tuning (finetune): Loads the specified pre-trained Whisper model (PHX_MODEL_SIZE). Trains the model on the prepared training dataset. This is typically the longest stage. Progress indicators should be visible in the logs.

  3. Output Creation (create_outputs): Evaluates the fine-tuned model on the test set to measure performance improvement (e.g., Word Error Rate - WER). Includes evaluation of the original base model for comparison. Packages the fine-tuned model artifacts (weights, configuration files) into a .zip archive. Generates a summary protocol file.

5. Understanding the output

After the process completes successfully, the specified output directory (PHX_OUTPUT_DIR) will contain the following:

  • Fine-tuned Model Archive: A .zip file containing the complete fine-tuned model directory, ready for deployment.

  • Protocol File: A text file (protocol.txt or similar) summarizing the adaptation process, including:

    • Base model used
    • Target language
    • Training parameters (batch size, etc.)
    • Evaluation results (e.g., WER before and after fine-tuning).
  • Log Directory: Contains detailed logs from the different stages (dataprep, finetune, create_outputs), which are useful for debugging if issues arise. If you encounter problems, please provide these logs when seeking support.

6. Troubleshooting

Here are some common issues and how to address them:

1. GPU not visible in Docker container:

  • Symptom: Training fails immediately, logs mention "No CUDA GPUs detected" or similar.
  • Check: Verify Docker can access the GPU using the NVIDIA runtime: docker run --rm --runtime=nvidia --gpus all ubuntu nvidia-smi
  • Solution: Ensure the NVIDIA driver, NVIDIA Container Toolkit, and Docker configuration are correct. Refer to the Prerequisites section. Restart the Docker daemon (sudo systemctl restart docker) after making changes.

2. CUDA Out of Memory (OOM) error:

  • Symptom: Training fails with error messages related to memory allocation on the GPU.
  • Solution:
    • Reduce PHX_BATCH_SIZE.
    • Try a smaller PHX_MODEL_SIZE if feasible.
    • Ensure no other processes are consuming significant GPU memory. Use nvidia-smi to check.
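
To watch GPU memory usage in real time while the container runs, a simple approach (assuming nvidia-smi is available on the host) is:

watch -n 1 nvidia-smi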

3. Shared Memory Errors (e.g., shm_open: No space left on device):

  • Symptom: Errors related to /dev/shm during parallel data loading or processing, especially within Docker.
  • Solution (Docker): Increase the shared memory available to the container using the --shm-size flag in docker run (e.g., --shm-size="4g") or the shm_size option in docker-compose.yml.
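
As a quick diagnostic, you can check the shared memory actually available inside a running container (assuming the df utility is present in the image; replace CONTAINER with your container name or ID):

docker exec CONTAINER df -h /dev/shm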

4. Permission Errors (Docker Volume Mounts):

  • Symptom: Errors reading from the input directory or writing to the output directory inside the container.
  • Solution: Ensure the input directory exists and that the user running the Docker command has read permissions for the input directory and write permissions for the output directory on the host machine (see the example below).
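
On the host, permissions can typically be corrected along these lines (a sketch with hypothetical paths; adjust users and paths to your environment):

chmod -R a+rX /path/to/your/input_data # read (and traverse) access for the container
mkdir -p /path/to/your/output_data
chmod a+rwX /path/to/your/output_data # write access for the container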

FAQ

What is Whisper adaptation?

Whisper adaptation is the process of adjusting the original Enhanced Speech to Text Built on Whisper model for a specific language to improve transcription accuracy. The adaptation uses data that is representative of the Partner’s or Customer’s production environment in that language, resulting in a model that is more accurate for their specific use case.

What are the inputs needed for the adaptation?

The primary input required for adaptation is annotated data in the target language—i.e., the language for which transcription accuracy should be improved. Annotated data consists of:

  • Audio recordings
  • Corresponding textual transcripts
  • Precise time alignment between transcript segments and audio intervals

There are two supported workflows for adaptation:

Option 1: On-premise adaptation

  • The annotated data is used as input to the adaptation Docker container, which runs on the client’s infrastructure.
  • Once the adaptation process is complete, the resulting adapted model is sent to Phonexia.
  • Phonexia issues a license for the adapted model.

Option 2: Adaptation performed by Phonexia

  • The client provides both the audio data and corresponding annotations to Phonexia.
  • Phonexia performs the adaptation process internally.
  • The adapted model license is generated and returned to the client.

What is the output?

The output of the adaptation process is a customized Enhanced Speech to Text Built on Whisper model, tailored to the target language and provided data. This adapted model is unique and must be licensed by Phonexia before it can be used in production.

Why can't we guarantee the results in advance?

Transcription accuracy cannot be guaranteed in advance because each production dataset is different. The effectiveness of the adapted model depends on several factors, including:

  • Audio quality
  • Annotation quality
  • Language of the audio
  • Percentage of clean speech in the audio
  • Presence of non-speech elements, such as background noise, music, or technical sounds

Which languages can be adapted?

Any language supported by Whisper can be adapted.

Do I need any specific knowledge or expertise to perform the adaptation?

Yes. To successfully perform the adaptation process, you will need the following:

  • Native or native-like proficiency in the target language
    Required for producing high-quality annotations.

  • Familiarity with the ELAN annotation tool
    Used to annotate the audio data. A user manual is available upon request.

  • Experience with Docker
    Necessary to run the adaptation container.

  • Ability to deploy Docker on a system with an NVIDIA GPU
    The adaptation process requires GPU acceleration.

  • NVIDIA Container Toolkit installed
    This is required to enable GPU access within Docker.
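
If Docker cannot see the GPU even though the toolkit is installed, registering the NVIDIA runtime with Docker and restarting the daemon usually resolves it (commands follow NVIDIA's standard setup; consult the official NVIDIA Container Toolkit documentation for your distribution):

sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker
docker run --rm --runtime=nvidia --gpus all ubuntu nvidia-smi # should print the GPU table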