# Emotion Recognition
Emotion Recognition is a technology for detecting and interpreting human emotions from voice and speech patterns. By analyzing acoustic features of the audio signal, such as tone, pitch, volume, and rhythm, the system aims to determine the emotional state of the speaker.
## Typical areas of use
- Customer Service: Call centers use the technology to monitor customer satisfaction by detecting frustration, stress, or dissatisfaction from vocal cues.
- Mental Health Monitoring: Mental health applications integrate the technology to detect early signs of depression, anxiety, or emotional distress.
- Virtual Assistants: Digital assistants use the technology to provide more emotionally intelligent responses and engage users more effectively.
- Market Research: Companies leverage the technology in market studies to gauge emotional responses during product tests or advertisements.
## How does it work?
Emotion Recognition technology consists of the following steps (a minimal sketch of the pipeline follows this list):
- Filtering out segments of silence and technical signals (Voice Activity Detection).
- Recognizing the emotion with a neural network.
- Returning scores for each of the four supported emotions (angry, sad, happy, neutral).
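The sketch below illustrates this pipeline in Python. It is a rough, self-contained approximation: the energy-based `simple_vad` and the `recognize_emotion` placeholder are hypothetical stand-ins for the real Voice Activity Detection and neural network, and the scores it prints are not meaningful.

```python
import numpy as np

EMOTIONS = ["angry", "sad", "happy", "neutral"]

def simple_vad(samples: np.ndarray, rate: int,
               frame_ms: int = 30, threshold: float = 0.01) -> np.ndarray:
    """Crude energy-based voice activity detection: keep only the
    frames whose RMS energy exceeds a fixed threshold."""
    frame_len = int(rate * frame_ms / 1000)
    n_frames = len(samples) // frame_len
    frames = samples[: n_frames * frame_len].reshape(n_frames, frame_len)
    rms = np.sqrt((frames ** 2).mean(axis=1))
    return frames[rms > threshold].reshape(-1)

def recognize_emotion(speech: np.ndarray) -> dict:
    """Placeholder for the neural network: a real system would extract
    acoustic features (tone, pitch, volume, rhythm) from `speech` and
    run a trained classifier. Here we just return a uniform distribution."""
    score = 1.0 / len(EMOTIONS)
    return {emotion: score for emotion in EMOTIONS}

# Example run on one second of synthetic 16 kHz audio.
rate = 16_000
audio = np.random.uniform(-0.05, 0.05, rate)
speech = simple_vad(audio, rate)
print(recognize_emotion(speech))
# {'angry': 0.25, 'sad': 0.25, 'happy': 0.25, 'neutral': 0.25}
```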
## Scoring
### Accuracy
The most common metric for evaluating the performance of Emotion Recognition is accuracy: the ratio of correct predictions to the total number of predictions, across all classes.
While useful for general classification problems, accuracy can be misleading in emotion recognition, especially when the dataset is imbalanced (e.g., some emotions like neutrality may appear more frequently).
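A toy example (all counts invented for illustration) shows how imbalance can inflate accuracy:

```python
# Hypothetical test set: 90 "neutral" clips and 10 "angry" clips.
labels = ["neutral"] * 90 + ["angry"] * 10

# A degenerate model that always predicts "neutral" ...
predictions = ["neutral"] * len(labels)

# ... still reaches 90% accuracy, despite never detecting anger.
accuracy = sum(p == t for p, t in zip(predictions, labels)) / len(labels)
print(accuracy)  # 0.9
```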
### Precision, Recall, and F1-Score
These metrics assess how well the model performs on individual emotion classes. Precision and recall are especially useful on imbalanced datasets, as they reveal performance on minority classes (e.g., rare emotions), and the F1-Score offers a balanced metric when both precision and recall are important. See the Precision_and_recall wiki page for more information.
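Continuing the toy example above, per-class metrics expose what accuracy hides; here scikit-learn's `classification_report` is used for convenience:

```python
from sklearn.metrics import classification_report

labels = ["neutral"] * 90 + ["angry"] * 10
predictions = ["neutral"] * 100

# The always-"neutral" model scores 0.0 precision/recall/F1 on the
# minority "angry" class, even though overall accuracy is 0.90.
print(classification_report(labels, predictions, zero_division=0))
```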
### Example measurements
Below is an example of measurements performed on our Czech dataset.
| Class   | Precision | Recall | F1     |
|---------|-----------|--------|--------|
| Neutral | 0.8016    | 0.8211 | 0.8112 |
| Happy   | 0.8947    | 0.8718 | 0.8831 |
| Sad     | 0.6944    | 0.5435 | 0.6098 |
| Angry   | 0.4815    | 0.6842 | 0.5652 |
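Since F1 is the harmonic mean of precision and recall, the F1 column can be re-derived from the other two; a quick sanity check:

```python
# Precision/recall pairs copied from the table above.
measurements = {
    "Neutral": (0.8016, 0.8211),
    "Happy": (0.8947, 0.8718),
    "Sad": (0.6944, 0.5435),
    "Angry": (0.4815, 0.6842),
}

for emotion, (p, r) in measurements.items():
    f1 = 2 * p * r / (p + r)  # harmonic mean of precision and recall
    print(f"{emotion}: F1 = {f1:.4f}")
# Matches the F1 column: 0.8112, 0.8831, 0.6098, 0.5652
```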
## FAQ
### What languages does the technology support?
The technology is multilingual, but it delivers the best results in English and Czech.
### Why do most recordings appear neutral?
Human speech is generally neutral, especially in formal conversations. For deeper insights, consider checking the second most probable emotion to better understand the underlying tone or sentiment of the conversation.
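For instance, given the four emotion scores for a recording (the `scores` values below are invented for illustration), the runner-up emotion is easy to extract:

```python
# Hypothetical scores for one recording.
scores = {"neutral": 0.55, "sad": 0.25, "happy": 0.12, "angry": 0.08}

# Rank emotions by score and look past the dominant "neutral" label.
ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
dominant, runner_up = ranked[0], ranked[1]
print(f"dominant: {dominant[0]}, underlying tone: {runner_up[0]}")
# dominant: neutral, underlying tone: sad
```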
### How does the technology handle changing emotions or multiple speakers?
If the emotion shifts during a recording, the technology provides an average emotion for the entire recording. For recordings with multiple speakers expressing different emotions, the result will also reflect an average emotion. However, using diarization technology beforehand can help separate speakers for more precise analysis.
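As a sketch of that workflow (the segment data below is invented; in practice it would come from a diarization step followed by per-segment emotion recognition):

```python
from statistics import mean

# Hypothetical per-segment results: (speaker label, emotion scores).
segments = [
    ("agent", {"angry": 0.05, "sad": 0.10, "happy": 0.15, "neutral": 0.70}),
    ("caller", {"angry": 0.55, "sad": 0.20, "happy": 0.05, "neutral": 0.20}),
    ("agent", {"angry": 0.05, "sad": 0.05, "happy": 0.25, "neutral": 0.65}),
]

# Average scores per speaker instead of over the whole recording,
# so one speaker's anger is not diluted by the other's neutrality.
per_speaker = {}
for speaker, scores in segments:
    per_speaker.setdefault(speaker, []).append(scores)

for speaker, runs in per_speaker.items():
    averaged = {emotion: mean(r[emotion] for r in runs) for emotion in runs[0]}
    print(speaker, averaged)
# agent -> predominantly neutral; caller -> predominantly angry
```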
### How accurate is the system?
Accuracy depends on the dataset. For reference values, see the Example measurements section above.
### How fast does the system process recordings?
Processing is approximately 5× faster than real time per core on a modern CPU and 300× faster than real time on a GPU; a one-hour recording therefore takes roughly 12 minutes on a single CPU core, or about 12 seconds on a GPU. Actual performance varies with the hardware and the amount of speech in the audio.