Speech Platform Release Notes

Version 4.1.0

September 11, 2025

This release focuses on security fixes, further improvements to the Speech Platform startup process, and several useful GUI features and improvements.
It also adds Denoiser technology, which removes noise and background sounds from recordings to make speech easier for a human listener to understand. Using the denoised audio for further processing by speech technologies is NOT recommended, as it usually makes the results worse.

Added Denoiser technology (high-level description) in REST API and GUI

Configuration and administration changes

- Updated the internal ingress-nginx controller to resolve security issue CVE-2025-1974
- Improved startup checks to detect some common network issues and show resolution hints on the console welcome screen
- All technologies now support starting on-demand, allowing users more flexible configuration with regard to RAM consumption
- Frontend limits configuration is now common for all technologies
- Documentation of the internal limits settings was significantly expanded for clarity and now includes a schema and a description of the media processing flow

REST API changes

REST API endpoint /api/system/status now exposes capacities and current usage of system storages
Minor improvements in OpenAPI schema
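
As a rough illustration of how the storage figures from /api/system/status might be consumed, the Python sketch below summarizes usage per storage. These notes do not describe the response schema, so the field names (storages, name, capacity, used) and the sample payload are assumptions made for illustration only; the OpenAPI schema is the authority on the real structure.

```python
import json
import urllib.request


def fetch_status(base_url):
    """Fetch /api/system/status from a running instance and decode the JSON body."""
    with urllib.request.urlopen(f"{base_url}/api/system/status") as response:
        return json.load(response)


def storage_usage(status):
    """Return {storage name: used fraction of capacity}.

    ASSUMPTION: the payload has a "storages" list whose items carry
    "name", "capacity" and "used" in the same unit. These field names
    are hypothetical and not taken from the product documentation.
    """
    usage = {}
    for storage in status.get("storages", []):
        capacity = storage["capacity"]
        # Guard against division by zero for empty or unreported storages.
        usage[storage["name"]] = storage["used"] / capacity if capacity else 0.0
    return usage


# Hypothetical payload, not captured from a real system.
sample = {"storages": [{"name": "data", "capacity": 200, "used": 50}]}
print(storage_usage(sample))
```

Against a live system, the dictionary returned by fetch_status would be passed to storage_usage instead of the sample, after adjusting the field names to the documented schema.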

GUI (web application) changes

Tiles on the home page can now be reordered
Added "Send to" feature, which sends the result of processing (e.g. by Voice Activity Detection) to another technology (only Speech To Text is supported as a target for now; other technologies can be added if there is demand)
Minor Keyword Spotting improvements:
- Added a column with the number of found keywords
- When a keyword is played, the corresponding table row is highlighted (note that since keywords tend to be short, the row only flashes briefly)
Adjusted Deepfake Detection score scale range to reflect the latest model results
Adjusted terminology for "score", "confidence" and "probability" in GUI and export headers (CSV, Excel) in Deepfake Detection, Emotion Recognition, Gender Identification, Language Identification, Keyword Spotting, Speech To Text and Speech Translation technologies

Included Components

Age Estimation 1.1.0
Audio Manipulation Detection 1.0.0
Audio Quality Estimation 3.62.0
Deepfake Detection 2.2.0
Denoiser 1.1.0
Emotion Recognition 1.2.0
Enhanced Speech to Text Built on Whisper 1.10.0
Gender Identification 1.4.0
Keyword Spotting 1.1.0
Language Identification 1.7.0
Replay Attack Detection 1.0.0
Speaker Diarization 1.6.0
Speech to Text Phonexia 6th Generation 3.62.0
Time Analysis of Speech 3.62.0
Voice Activity Detection 1.2.0
Voiceprint Comparison 1.5.0
Voiceprint Extraction 1.6.0

Version 4.0.2

July 23, 2025

This maintenance release contains further internal changes to accommodate networks without a DNS server or default gateway.

Additionally, the internal logic of the system readiness check was optimized, and status messages are now displayed on the screen during system boot.

Included Components

Age Estimation 1.1.0
Audio Manipulation Detection 1.0.0
Audio Quality Estimation 3.62.0
Deepfake Detection 2.2.0
Emotion Recognition 1.2.1
Enhanced Speech to Text Built on Whisper 1.10.0
Gender Identification 1.4.0
Keyword Spotting 1.1.0
Language Identification 1.7.0
Replay Attack Detection 1.0.0
Speaker Diarization 1.6.0
Speech to Text Phonexia 6th Generation 3.62.0
Time Analysis of Speech 3.62.0
Voice Activity Detection 1.2.0
Voiceprint Comparison 1.4.0
Voiceprint Extraction 1.6.0

Version 4.0.1

July 10, 2025

This release brings several internal improvements, aimed mainly at deployments in isolated, air-gapped environments.

Further minor improvements and fixes in configuration and administration:

New, more readable and more visually appealing console welcome screen
- Welcome screen is now also displayed when connecting remotely via SSH
Further improved detection of the "system is ready" state during the startup process
Fixed an issue where the (auto)configuration script did not properly enable some Speech To Text languages, which could make the Virtual Appliance nonfunctional
The (auto)configuration script now also updates the number of time-slicing replicas in the NVIDIA GPU plugin configuration file

Included Components

Age Estimation 1.1.0
Audio Manipulation Detection 1.0.0
Audio Quality Estimation 3.62.0
Deepfake Detection 2.2.0
Emotion Recognition 1.2.1
Enhanced Speech to Text Built on Whisper 1.10.0
Gender Identification 1.4.0
Keyword Spotting 1.1.0
Language Identification 1.7.0
Replay Attack Detection 1.0.0
Speaker Diarization 1.6.0
Speech to Text Phonexia 6th Generation 3.62.0
Time Analysis of Speech 3.62.0
Voice Activity Detection 1.2.0
Voiceprint Comparison 1.4.0
Voiceprint Extraction 1.6.0

Version 4.0.0

June 30, 2025

Added Keyword Spotting technology (high-level description) in REST API and GUI
Added Age Estimation technology (high-level description) in REST API and GUI
The Authenticity Verification technology has been improved with:
- further improved Deepfake Detection model 2.2.0 with a significantly reduced number of false positive detections
- Deepfake Detection scores now output as Log Likelihood Ratio (LLR)
- added Audio Manipulation Detection subtechnology (high-level description)
- added Replay Attack Detection subtechnology (high-level description)
Further improvements of Virtual Appliance documentation:
- more details about HW and SW requirements
- more detailed deployment instructions for commonly used hypervisors
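
For readers unfamiliar with Log Likelihood Ratio scores such as the new Deepfake Detection output, the Python sketch below shows the general relationship between an LLR and a probability. It is generic statistics rather than the platform's implementation: it assumes a natural-log, well-calibrated LLR and a caller-supplied prior, none of which is stated in these notes, and it makes no claim about which hypothesis a positive score favors.

```python
import math


def llr_to_posterior(llr, prior=0.5):
    """Convert a log likelihood ratio to a posterior probability.

    Uses Bayes' rule in log-odds form: posterior log-odds = LLR + prior log-odds.
    ASSUMPTIONS: llr is a natural-log ratio and is well calibrated;
    prior is the prior probability of the numerator hypothesis (0 < prior < 1).
    """
    prior_log_odds = math.log(prior / (1.0 - prior))
    return 1.0 / (1.0 + math.exp(-(llr + prior_log_odds)))


# An LLR of 0 leaves the prior unchanged; larger values shift the
# probability towards the numerator hypothesis.
print(llr_to_posterior(0.0))
print(llr_to_posterior(math.log(9.0)))
```

The practical point is that an LLR is not itself a probability: the same score maps to different probabilities depending on the prior chosen.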

This release includes internal changes in licensing.
These changes are backwards compatible and should not affect existing users, with one exception: existing users of the older Deepfake Detection model need to update to the new model and license to continue using Deepfake Detection in the 4.0.0 release.

REST API changes

Increased API default file size limit to 500 MB (limit can be changed manually in configuration file, see documentation)
Added validation of voiceprint size in the REST API to prevent potential abuse via unusually large payloads masquerading as valid voiceprints
Introduced a new Location header; the previous X-Location header is now deprecated
Fixed missing X-Location header in some REST API responses (bug introduced in VA 3.7.0)
Fixed an issue where the GET /api/task/:task_id endpoint sometimes returned HTTP 404 even though the requested task existed
Various API documentation fixes and improvements, e.g. the minimum values of some query parameters are now documented correctly
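
Clients that must work across releases may want to prefer the new Location header and fall back to the deprecated X-Location header. The Python sketch below shows one way to do that on a plain dictionary of response headers; the function name and the example task path are hypothetical.

```python
def result_location(headers):
    """Return the location from response headers, preferring Location.

    Falls back to the deprecated X-Location header so the same client
    also works against older releases. Header names are matched
    case-insensitively, as HTTP header names are not case-sensitive.
    Returns None when neither header is present.
    """
    normalized = {name.lower(): value for name, value in headers.items()}
    return normalized.get("location") or normalized.get("x-location")


# Hypothetical header values for illustration only.
print(result_location({"Location": "/api/task/123"}))
print(result_location({"X-Location": "/api/task/123"}))
```

Since X-Location is deprecated, new client code is better off depending on Location alone and keeping the fallback only for as long as older releases must be supported.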

GUI (web application) changes

Increased GUI default file size limit to 100 MB (limit can be changed manually in configuration file, see documentation)
Fixed score format in Gender Identification exports: score is now exported correctly as percentage, not as decimal number
Minor changes in homepage tile order and style
Minor visual changes in the file table component (all technologies except Speaker Identification)

Included Components

Age Estimation 1.1.0
Audio Manipulation Detection 1.0.0
Audio Quality Estimation 3.62.0
Deepfake Detection 2.2.0
Emotion Recognition 1.2.1
Enhanced Speech to Text Built on Whisper 1.10.0
Gender Identification 1.4.0
Keyword Spotting 1.1.0
Language Identification 1.7.0
Replay Attack Detection 1.0.0
Speaker Diarization 1.6.0
Speech to Text Phonexia 6th Generation 3.62.0
Time Analysis of Speech 3.62.0
Voice Activity Detection 1.2.0
Voiceprint Comparison 1.4.0
Voiceprint Extraction 1.6.0

Version 3.7.0

April 7, 2025

Gender Identification technology (high-level description) is now fully working in the GUI.
Emotion Recognition technology (high-level description) is now fully working in the GUI.
New improved model 2.0.0 for Deepfake Detection subtechnology (high-level description).
Virtual Appliance documentation improvements for easier navigation and better comprehension
Configuration and administration changes:
- The automation after uploading a package via Filebrowser GUI has been extended: after extracting the uploaded ZIP, it also runs the (auto)configuration script.
- The automatic extraction now recognizes any ZIP file starting with "licensed-models" in the filename (i.e. also filenames like licensed-models_foobar.zip).
- Fixed an issue where the automatic extraction could take a long time because it was performed multiple times.
- When Virtual Appliance is deployed to Amazon EC2, the root and SSH password login is automatically disabled
- Added more information to the output of the run-diag-report.sh diagnostic script
- Removed the obsolete configuration script
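
The filename rule for automatic extraction described above can be sketched in Python as follows. This is only an illustration of the stated rule (a ZIP file whose name starts with "licensed-models"); the function name is hypothetical, and case-sensitive matching is an assumption, since the notes do not say how the appliance treats letter case.

```python
from pathlib import PurePosixPath


def is_licensed_models_package(path):
    """Check whether an uploaded file matches the auto-extraction rule.

    ASSUMPTION: matching is case-sensitive and applies to the file name
    only, not to the directory part of the path.
    """
    name = PurePosixPath(path).name
    return name.startswith("licensed-models") and name.endswith(".zip")


print(is_licensed_models_package("licensed-models.zip"))
print(is_licensed_models_package("licensed-models_foobar.zip"))
print(is_licensed_models_package("models.zip"))
```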

Known issues:

The root and SSH password login is disabled when custom cloud-init is used. This will be fixed in the next release.

Included Components

Audio Quality Estimation 3.62.0
Deepfake Detection 2.0.0
Emotion Recognition 1.1.0
Enhanced Speech to Text Built on Whisper 1.8.1
Gender Identification 1.3.1
Language Identification 1.6.1
Speaker Diarization 1.5.1
Speech to Text Phonexia 6th Generation 3.62.0
Time Analysis of Speech 3.62.0
Voice Activity Detection 1.0.2
Voiceprint Comparison 1.3.0
Voiceprint Extraction 1.5.3

Version 3.6.0

February 17, 2025

Added Authenticity Verification technology with Deepfake Detection subtechnology (high-level description) in REST API and GUI.
Added Gender Identification technology (high-level description) in REST API (still only preview in GUI). The endpoints are differentiated based on the input type, which can be either a media file or a list of voiceprints from Voiceprint Extraction.
Added Denoiser technology (high-level description) preview in GUI.
Added Emotion Recognition technology (high-level description) preview in GUI.
Updated Enhanced Speech to Text Built on Whisper model (high-level description) with a query parameter for word-level segmentation that also improves the overall accuracy of the timestamps. Because this behaviour is resource-heavy, it is turned off by default.
Added settings for parallel threads and multiple instances for GPU support for Language Identification and Voice Activity Detection.
Configuration and administration changes:
- The Virtual Appliance startup process now displays system messages again for more clarity.
- Improved detection of "system is ready" state during startup process
- When licensed-models.zip package is uploaded via the Filebrowser GUI, it's automatically unpacked after upload.
- New script configure-speech-platform.sh for the Speech Platform configuration, with more functionality. Use configure-speech-platform.sh --auto-configure to automatically configure the system according to models and licenses uploaded to /data/ folder. The enable-technologies.sh script is now obsolete and will be removed in next release.
- The configuration YAML file is now much shorter, simpler and more comprehensible.
- Turning on GPU support in the configuration file is now easier: all GPU images are now included, so there is no need to download or configure them separately.

Included Components

Audio Quality Estimation 3.62.0
Deepfake Detection 1.1.0
Enhanced Speech to Text Built on Whisper 1.8.1
Language Identification 1.6.1
Speaker Diarization 1.5.1
Speech to Text Phonexia 6th Generation 3.62.0
Time Analysis of Speech 3.62.0
Voice Activity Detection 1.0.2
Voiceprint Comparison 1.3.0
Voiceprint Extraction 1.5.3

Version 3.5.0

December 19, 2024

Added consumption counting and GUI capacities indicator.
Configuration and administration changes:
- The Virtual Appliance system console now displays the prompt only after the internal system has started; until then, the console stays blank. It may take some additional time for the GUI to become fully ready.
- All technologies are now disabled by default after first start.
- Added enable-technologies.sh script for enabling technologies according to uploaded models and licenses.
- It's now possible to change UI limits for all technologies in the configuration file.

New available models with improved Voice Activity Detection configuration:

Enhanced Speech to Text Built on Whisper: model 1.1.0
Speaker Identification: model 5.2.0
Language Identification: model 5.3.0
Speaker Diarization: model 5.1.0
Voice Activity Detection: model 5.3.0

Included Components

Audio Quality Estimation 3.62.0
Enhanced Speech to Text Built on Whisper 1.7.1
Language Identification 1.6.1
Speaker Diarization 1.5.1
Speech to Text Phonexia 6th Generation 3.62.0
Time Analysis of Speech 3.62.0
Voice Activity Detection 1.0.2
Voiceprint Comparison 1.3.0
Voiceprint Extraction 1.5.3

Version 3.4.0

November 21, 2024

Added Audio Quality Estimation technology (high-level description) in REST API.
Voice Activity Detection technology (high-level description) is available in both REST API and GUI.
Speech Translation technology is now fully working in GUI.
Speaker Diarization technology (high-level description) is now fully working in GUI.
Updated Speech to Text Phonexia with the ability to use Preferred phrases.
Updated the Speaker Identification model to xl-5.1.0, which is capable of carrying out automatic adaptation to various input audio sources (YouTube, Skype, WhatsApp, VoLTE, AMBE).

Included Components

Audio Quality Estimation 3.62.0
Enhanced Speech to Text Built on Whisper 1.7.0
Language Identification 1.5.0
Speaker Diarization 1.4.1
Speech to Text Phonexia 6th Generation 3.62.0
Time Analysis of Speech 3.62.0
Voice Activity Detection 1.0.1
Voiceprint Comparison 1.3.0
Voiceprint Extraction 1.5.2

Version 3.3.0

September 16, 2024

Added Speech Translation technology preview in GUI.
Added Speaker Diarization technology (high-level description) in REST API (still only preview in GUI).
Added an option to speed up Enhanced Speech to Text Built on Whisper via the beamSize parameter in the Virtual Appliance configuration file: a smaller beamSize means faster processing (up to ~30% faster with the large_v2 model and beamSize=1) at the expense of slightly lower accuracy.
Speech to Text Phonexia and Time Analysis of Speech technologies updated to version 3.62.0.
Configuration and administration changes:
- Added support for importing Virtual Appliance to Microsoft Hyper-V.
- Both the system disk and the data disk now resize automatically to the size set in the virtualization software.
- Customers can now use cloud-init with Virtual Appliance.
- Added diagnostic script for collecting logs for troubleshooting.
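
To give some intuition for the beamSize trade-off mentioned above, the Python sketch below runs a generic beam search over a tiny hand-made probability table. It is a toy and not the platform's or Whisper's decoder: the table, the function name and the "model call" counter are all invented for illustration, and how beamSize is applied inside the Virtual Appliance is not described in these notes.

```python
import math

# Toy next-token probabilities keyed by the tokens chosen so far.
# Stands in for a real model and is purely illustrative.
TABLE = {
    (): {"a": 0.6, "b": 0.4},
    ("a",): {"x": 0.5, "y": 0.5},
    ("b",): {"x": 0.9, "y": 0.1},
}


def beam_search(beam_size, steps=2):
    """Return (best sequence, its probability, number of model calls)."""
    beams = [((), 0.0)]  # (prefix, cumulative log-probability)
    calls = 0
    for _ in range(steps):
        candidates = []
        for prefix, score in beams:
            calls += 1  # one "model call" per surviving hypothesis
            for token, prob in TABLE[prefix].items():
                candidates.append((prefix + (token,), score + math.log(prob)))
        # Keep only the beam_size highest-scoring hypotheses.
        candidates.sort(key=lambda item: item[1], reverse=True)
        beams = candidates[:beam_size]
    best_prefix, best_score = beams[0]
    return best_prefix, math.exp(best_score), calls


for size in (1, 2):
    print(size, beam_search(size))
```

With a beam of 1 the search commits to the locally best first token and does less work, while a beam of 2 keeps an alternative alive and finds a more probable overall sequence at the cost of more model calls. The same trade-off, at a much larger scale, is what a smaller beam exchanges for speed.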

Included Components

Enhanced Speech to Text Built on Whisper 1.5.0
Language Identification 1.3.1
Speaker Diarization 1.3.0
Speech to Text Phonexia 6th Generation 3.62.0
Time Analysis of Speech 3.62.0
Voiceprint Comparison 1.1.0
Voiceprint Extraction 1.4.0

Version 3.2.0

August 21, 2024

Language Identification technology is now fully working in GUI.
Added Gender Identification technology (high-level description) preview in GUI.
Added the possibility to share a GPU among multiple technologies, to better utilize the hardware resources.

Included Components

Enhanced Speech to Text Built on Whisper 1.4.0
Language Identification 1.2.0
Speech to Text Phonexia 6th Generation 3.61.0
Time Analysis of Speech 3.61.0
Voiceprint Comparison 1.1.0
Voiceprint Extraction 1.4.0