1. Introduction
Speech emotion recognition (SER) aims to detect and comprehend emotions expressed as part of verbal communication, and the field is currently undergoing significant development. Humans express their feelings using body language, facial expressions, and speech, the last of which is generally understood to be the most efficient interaction mode [
1]. SER can be used to accurately pinpoint speakers’ emotional states by processing the paralinguistic features within their speech signals. This presents transformative prospects, especially in human–machine interfaces [
2], and including emotional understanding in these systems can promote more natural and seamless interactions.
Education is among the many fields in which SER can be applied, with emotions profoundly affecting student engagement and motivation, thereby impacting learning outcomes [
3]. Several studies have shown that SER can be used to obtain important information about how students feel, especially in a distance learning context, and these results can inform adaptive teaching strategies. For example, if a student’s speech signals indicate confusion or frustration, SER can prompt the educator to provide supplementary explanations or personalized aid [
Meanwhile, other studies have investigated how instructors’ emotional stability affects how effectively they deliver information, as well as the resulting effects on student engagement and information retention; such findings can be used to improve teaching quality and student outcomes [
5].
The field of deep learning (DL) has made significant progress in revolutionizing the ability of machines to comprehend and interpret human emotions. DL models can currently analyze the intricate characteristics of unprocessed data, including voice recordings, to detect emotional cues. Emotional recognition technology has the potential to enhance human–machine interactions in a seamless and instinctive manner across a wide range of applications, and swift advancements in this domain underscore DL’s profound capacity to facilitate machines’ recognition of and reactions to human emotions. Accordingly, the model proposed herein can significantly advance the field of SER by equipping machines with a human-like ability to discern emotions such as happiness, sadness, and anger in human speech [
6]. By enabling machines to analyze subtle variations in tone, cadence, and timbre, DL allows the deduction of emotional states from voice signals. The proposed model allows for this in a manner that exceeds the capabilities of conventional methodologies. This technique is paramount for evaluating lecturers’ emotional states because it enables the identification of potential stressors, anxiety, or emotional distress that might inhibit the efficacy of their teaching. Upon detecting these emotional indicators, SER technology can instigate suitable interventions, ranging from offering counseling to providing lecturers with appropriate support services, thereby ensuring that they reach their optimal teaching capacities [
7]. Furthermore, the integration of SER into intelligent tutoring systems paves the way for a more nuanced personalization of instructional content, tone, and feedback by fostering responsive and effective teaching–learning environments that are underpinned by a deeper understanding of emotional cues.
In this study, Mel-frequency cepstral coefficients (MFCCs); chroma; the Mel spectrogram; the zero-crossing rate (ZCR); the spectral contrast, centroid, bandwidth, and roll-off; and the root-mean square (RMS) were obtained from raw audio data, after which noise was added and the audio data were shifted and time-stretched. Then, an ensemble model comprising a combination of transformer, convolutional neural network (CNN), and long short-term memory (LSTM) architectures was used to classify emotions from the static features in the data.
This study makes several significant contributions to the field:
The creation and use of the Saudi Higher-Education Instructor Emotions (SHEIE) dataset: This dataset is a distinct resource for SER research because it includes meticulous annotations of emotions and specifically focuses on Saudi Arabian instructors, making it unique and useful in terms of SER in education.
A comprehensive series of experiments: This study explored the effectiveness of various data augmentation (DA) and feature extraction techniques for speech emotion classification. The evaluation steps were conducted using five benchmark datasets, namely the Ryerson Audio–Visual Database of Emotional Speech and Song (RAVDESS) dataset, the Berlin Database of Emotional Speech (EMO-DB) dataset, the Surrey Audio–Visual Expressed Emotion (SAVEE) dataset, the Interactive Emotional Dyadic Motion Capture (IEMOCAP) dataset, and SHEIE.
The proposal of a new model for classifying emotions from speech: This model uses a transformer architecture that incorporates multi-head self-attention. Furthermore, the favorable attributes of CNN and LSTM networks were combined to extract the spectral features and determine the temporal dynamics.
The rest of the paper is organized as follows:
Section 2 discusses the pertinent literature;
Section 3 outlines the methodology adopted for this study;
Section 4 details the experimental outcomes;
Section 5 discusses the research findings;
Section 6 outlines the study’s limitations and future research avenues; and
Section 7 concludes the paper.
3. Materials and Methods
This study proposes an automated system that can accurately recognize instructors’ emotional states during remote instructional sessions, thereby improving the results of evaluations conducted by their institutions. This system was developed sequentially: data collected during normal lectures were subjected to preprocessing and feature extraction, enabling the model’s development using advanced DL technologies combining transformer, CNN, and LSTM architectures. The usefulness of the model was tested after training on standard datasets, as shown in
Figure 1.
3.1. Datasets
To ensure an exhaustive appraisal of the proposed model, it was tested on five datasets spanning three languages, namely English, Arabic, and German. Training DL models comprehensively requires a large sample size, which the available data alone could not provide; therefore, DA was applied to all of the datasets. The following paragraphs summarize each dataset:
The RAVDESS is a validated multimodal database that contains emotional speech and song recordings. This database comprises 7356 files recorded by 24 professional actors divided equally between men and women. The actors were recorded speaking two linguistically neutral statements while adopting a neutral or standardized accent to minimize the influence of regional variations on the emotional content of the speech. Having professional actors record these controlled statements yielded high-quality emotional portrayals free of confounding regional accents. The RAVDESS provides a substantial corpus of emotional speech and songs for research and development in various fields, such as affective computing, and the standardized validation protocols embedded in the dataset aid comparative benchmarking and reproducibility across studies. The spectrum of emotions conveyed by speech in this dataset encompasses calmness, happiness, sadness, anger, fear, surprise, disgust, and neutrality [25]. Each expression was produced at two levels of emotional intensity, normal and strong, alongside a neutral expression.
The EMO-DB was created by the Institute of Communication Science at the Technical University of Berlin [26]. It is a freely available German emotional speech database that includes 535 context-variable sentences used in everyday communication, delivered by ten expert actors (five men and five women) who simulated happiness, anger, fear (anxiety), boredom, disgust, sadness, or neutrality. The 48 kHz recordings were downsampled to 16 kHz.
The SAVEE dataset comprises 480 utterances in British English recorded by four male postgraduate students and researchers at the University of Surrey, all of whom were native English speakers aged between 27 and 31 [
27]. Seven different emotions were expressed through these utterances (happiness, sadness, surprise, fear, disgust, neutrality, and anger), with sentences from the phonetically balanced Texas Instruments/Massachusetts Institute of Technology corpus chosen for each emotion (see
Table 1).
The IEMOCAP dataset was developed by the Signal Analysis and Interpretation Laboratory (SAIL) at the University of Southern California. This database represents a multimodal, multi-speaker collection of data. The dataset spans 12 h and includes videos, audio, face tracking, and text transcriptions of paired performances in which actors tried to elicit a particular feeling from the viewer using a combination of improvisation and prepared scenes. Multiple annotators contributed labels to this IEMOCAP database, classifying the data according to dimensional labels, including valence, activation, and dominance, as well as category labels such as anger, happiness, sadness, and neutrality [
28].
The SHEIE dataset is a unique emotional speech dataset developed to test the model proposed in this study. It features real interactions from the realm of higher education and focuses on instructors. Recognizing the scarcity of datasets comprising genuine interactions, this dataset was designed to address that gap in existing speech emotion studies. It includes six universal emotions, namely anger, happiness, sadness, excitement, boredom, and neutrality. The data were collected from various synchronous online lectures held by the Computer Science Department at King Abdulaziz University and the Islamic Studies Department at Al Jouf University, which were delivered in both Arabic and English. The dataset contains a total of 20 h and 50 min of speech data from 19 lecture sessions, collected from two instructors (one male and one female) and carefully segmented and labeled.

The development of the SHEIE dataset involved a four-step process. First, it was determined that real interactions, rather than acted emotions, would be recorded during live lectures via the Blackboard e-learning system. Second, the emotions represented in the dataset were selected based on their relevance to the instructor’s experience during the lecture. Third, volunteer instructors from the two universities were selected to participate in the study, and they provided speech data via their lecture recordings. Finally, the emotions in the dataset were labeled based on self-reporting: the instructors selected an emotion from a prompt every ten minutes during their lectures, a time interval chosen in response to research indicating that student attention begins to wane around the ten-minute mark. The SHEIE dataset underwent extensive preprocessing, which included splitting the data into ten-minute segments, manually labeling the emotions, cleaning the data to exclude noise and unwanted sounds, and normalizing the data into equal three-second intervals. This produced 7515 audio files, each labeled with a specific emotion and formatted for use in SER research: 490 files for anger, 1888 for happiness, 1439 for sadness, 516 for boredom, 1654 for excitement, and 1528 for neutrality.
Figure 2 shows the distribution of the different emotional classes in the five datasets. Some emotional classes occur more often in most datasets, whereas others demonstrate the inverse. Thus, there are imbalances in the emotional class distributions across the datasets, indicating the need for DA before model training.
Table 1 presents a description of all five datasets as well as the distribution of the classes in these datasets.
3.2. Data Augmentation
DA is essential for assessing SER model performance because SER systems often lack sufficient and diverse training data. This is observed in the imbalance between the emotional classes in this study’s datasets, as shown in
Figure 2. Therefore, noise addition, pitch shifting, time shifting, and time stretching were used to generate variant data, providing the model with more context. DA can significantly improve the ability of SER systems to generalize and recognize emotions from speech under uncontrolled and varied conditions [
17]. Our DA reduced overfitting, making the SER model more stable during the training process.
Figure 3 illustrates the influence of this DA on SER tasks for the five datasets used in this study. DA techniques, such as adding white Gaussian noise (AWGN) to the samples, were employed to balance the class distributions across these datasets, effectively addressing their imbalances. A custom noise function was used to add AWGN to the samples, as shown in
Figure 4. This function achieves DA by adding random noise scaled by a factor of 0.01 to the data. In addition, time stretching was used to stretch the data at a given rate, and the pitch shift function was applied with factors of 0.5 and 0.6. Time shifting was employed using a custom shift function that takes the data, the sampling rate, the maximum shift value, and the shift direction as inputs. A random shift value of up to the maximum shift value multiplied by the sampling rate is generated; the shift value is negated for the right shift direction, and a direction labeled “both” is resolved randomly. After shifting, the vacated samples are zero-padded: the leading samples are set to 0 for a positive shift, and the trailing samples are set to 0 for a negative shift.
Figure 4 indicates the changes observed in the waveforms after applying these DA techniques.
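As an illustration, the four augmentation operations described above can be implemented with NumPy and Librosa roughly as follows; parameter values not stated in the text (e.g., the maximum shift of 0.2 s and the stretch rate) are placeholder assumptions, not the study’s exact settings.

```python
import numpy as np
import librosa

def add_awgn(data, noise_factor=0.01):
    """Add white Gaussian noise scaled by 0.01, as described above."""
    return data + noise_factor * np.random.normal(0.0, 1.0, size=data.shape)

def stretch(data, rate=0.8):
    """Time-stretch the audio at a given rate without changing pitch (rate is an assumption)."""
    return librosa.effects.time_stretch(y=data, rate=rate)

def shift_pitch(data, sr, n_steps=0.5):
    """Shift the pitch by a fractional number of semitones (factors 0.5 and 0.6 in this study)."""
    return librosa.effects.pitch_shift(y=data, sr=sr, n_steps=n_steps)

def shift_time(data, sr, max_shift_s=0.2, direction="both"):
    """Shift the waveform in time, zero-padding the vacated samples."""
    s = np.random.randint(1, int(max_shift_s * sr))  # random shift within max_shift * sr
    if direction == "both":
        direction = np.random.choice(["left", "right"])
    if direction == "right":
        s = -s                 # the shift value is negated for the right direction
    out = np.roll(data, s)
    if s > 0:
        out[:s] = 0            # leading samples zeroed for a positive shift
    else:
        out[s:] = 0            # trailing samples zeroed for a negative shift
    return out
```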
3.3. Feature Extraction
For our experiments, the audio data were processed at a 16 kHz sampling rate across all datasets. The EMO-DB dataset was downsampled from 48 kHz to 16 kHz, while the RAVDESS and IEMOCAP datasets were downsampled to 16 kHz from their original 44.1 kHz sampling rate; the SAVEE dataset was already at 16 kHz. For the public datasets (EMO-DB, RAVDESS, SAVEE, and IEMOCAP), the audio lengths ranged from approximately 1 s to 15 s, while the SHEIE dataset was specifically segmented into 3–5 s intervals. SER determines a speaker’s emotional state from features extracted from the speech waveform, since transforming speech into features or parameters exposes its emotional characteristics. In this study, a multi-feature approach was employed to capture various aspects of the speech signals. The audio data were segmented into frames of 32 ms with an overlap of 50%, and a Hamming window was applied to extract the Mel-frequency cepstral coefficient (MFCC) features. The extracted features were represented in a one-dimensional format, capturing the temporal dynamics of the speech signal. Features that can be extracted from the time domain include the ZCR, energy, and amplitude, which can be used to identify anger and excitement by revealing speech rate and volume [
29]. Pitch, formants, and other features can be extracted from the frequency domain, with chroma and spectral roll-off representing some of these features, as shown in
Figure 5. Formants (the resonant frequencies of the vocal tract) reveal the shape of the vocal tract, which determines the characteristics of articulated speech sounds. Pitch (the perceived fundamental frequency of a sound) can indicate emotions, distinguishing states such as fear from neutrality [30]. This is further supported by Postma [31], who found that spectral listeners, who focus on higher harmonics, perform better in emotion judgment tasks. Kienast [32] found that vocal expressions of different emotions are characterized by specific acoustical changes, such as spectral and segmental changes and variations in pitch levels. The MFCC, spectral contrast, and chroma characterize the spectral domain [33]. The MFCCs, of which 40 coefficients were used in this study, closely approximate the human ear’s nonlinear frequency perception and represent a sound’s short-term power spectrum, allowing systems to interpret emotions in a manner closer to human hearing. Combining features from different domains thus yields a more complete representation of the speech signal, allowing SER systems to identify emotions accurately and precisely.
The ZCR measures how often a signal crosses the zero axis by determining the signal sign changes per frame [
33]. As such, it counts the number of times the waveform flips between positive and negative, normalized by the frame length. Mathematically, the ZCR can be determined using the following equation:

$$\mathrm{ZCR} = \frac{1}{2(T-1)} \sum_{t=1}^{T-1} \left| \operatorname{sgn}(s_t) - \operatorname{sgn}(s_{t-1}) \right|$$

where $s$ is a signal of length $T$ and $\operatorname{sgn}(\cdot)$ is the sign function.
A chromagram visualizes audio by mapping frequencies onto 12 bins that match the 12 semitones of an octave using chroma features [
34]. This compresses pitch content into time windows, enabling music analysis applications to recognize chords and harmonic similarities despite timbre and instrumentation changes.
A Mel spectrogram visualizes audio signals by mapping frequencies onto the Mel scale, which is aligned with human hearing. This technique captures important sound characteristics, thereby facilitating its widespread use in speech and audio processing.
The spectral contrast, centroid, bandwidth, and roll-off are features extracted from sound signals. The differences in the levels between spectrum peaks and valleys can reveal a sound’s timbre, whereas the spectrum’s “center of mass” (also called the spectral centroid) can indicate a sound’s brightness. Additionally, the spectral bandwidth indicates a sound’s spectral shape by measuring the spectrum’s spread around its centroid, with spectral roll-off being used to determine a sound’s high-frequency content.
The RMS is a feature that can be extracted from speech waves and used in SER tasks [
35]. It is used to measure the energy of a signal and provide information regarding the overall loudness of the signal.
The MFCC is used in SER applications to parameterize the speech signals generated by the Mel spectrogram. This scale matches the human auditory system more closely than linear frequency bands. To obtain the MFCC, the Mel spectrogram is transformed using a discrete cosine transform (DCT) [
36], which captures speech signal characteristics and is resistant to timbre and instrumentation changes. This process entails determining the energy bands of the speech wave, mapping the power spectrum onto the Mel scale using overlapping triangular windows, taking the logarithm of the resulting Mel spectrogram, and applying the DCT to this logarithm to obtain the MFCC. The formula for mapping a frequency $f$ in Hz to the Mel frequency $m$ is as follows:

$$m = 2595 \log_{10}\left(1 + \frac{f}{700}\right)$$

The inverse formula for mapping the Mel frequency $m$ back to a frequency $f$ in Hertz is

$$f = 700\left(10^{m/2595} - 1\right)$$
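As a quick check of these two mappings, they can be implemented directly (a minimal sketch using the constants from the formulas above):

```python
import numpy as np

def hz_to_mel(f):
    """Map a frequency in Hz to the Mel scale."""
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    """Map a Mel frequency back to Hz (inverse of hz_to_mel)."""
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

print(hz_to_mel(1000.0))              # ~1000 Mel: the scale is anchored near 1 kHz
print(mel_to_hz(hz_to_mel(4000.0)))   # ~4000.0 Hz: the round trip recovers the input
```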
Emotion classification studies typically use 40 MFCCs for feature extraction; however, a more nuanced representation of speech data using more coefficients can improve the detection of emotional states. MFCCs, especially when combined with the RMS and ZCR, have performed well in the complex task of SER [
28,
29].
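The 192-dimensional feature set used later in this paper is not itemized dimension by dimension in the text; one composition consistent with the features listed in this section is 40 MFCCs + 12 chroma bins + 128 Mel bands + 7 spectral-contrast values + spectral centroid + bandwidth + roll-off + ZCR + RMS = 192. The following Librosa sketch extracts and time-averages these features under that assumption, with frame settings matching the 32 ms window and 50% overlap described above:

```python
import numpy as np
import librosa

def extract_features(path, sr=16000):
    """Extract one 192-dimensional static feature vector from an audio file.

    Assumed composition: 40 MFCC + 12 chroma + 128 Mel + 7 contrast
    + centroid + bandwidth + roll-off + ZCR + RMS = 192 features.
    """
    y, _ = librosa.load(path, sr=sr)
    win, hop = 512, 256  # 32 ms window with 50% overlap at 16 kHz

    mfcc = np.mean(librosa.feature.mfcc(y=y, sr=sr, n_mfcc=40, n_fft=win,
                                        hop_length=hop, window="hamming"), axis=1)
    chroma = np.mean(librosa.feature.chroma_stft(y=y, sr=sr, n_fft=win,
                                                 hop_length=hop), axis=1)
    mel = np.mean(librosa.feature.melspectrogram(y=y, sr=sr, n_mels=128,
                                                 n_fft=win, hop_length=hop), axis=1)
    contrast = np.mean(librosa.feature.spectral_contrast(y=y, sr=sr), axis=1)
    centroid = np.mean(librosa.feature.spectral_centroid(y=y, sr=sr))
    bandwidth = np.mean(librosa.feature.spectral_bandwidth(y=y, sr=sr))
    rolloff = np.mean(librosa.feature.spectral_rolloff(y=y, sr=sr))
    zcr = np.mean(librosa.feature.zero_crossing_rate(y))
    rms = np.mean(librosa.feature.rms(y=y))

    return np.hstack([mfcc, chroma, mel, contrast,
                      centroid, bandwidth, rolloff, zcr, rms])  # shape: (192,)
```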
3.4. Proposed Model
This study’s methodology centers on the construction of an ensemble model that combines transformer, CNN, and LSTM architectures. Transformer models use self-attention mechanisms to extract contextual features from input sequences; CNNs use filters to extract local temporal features; and LSTM models use recurrent connections to infer long-term dependencies. Therefore, in the context of this study, the transformer’s self-attention captures the interconnectedness of the input elements regardless of their position in the sequence, the CNN layers identify local audio feature patterns, and the LSTM layers capture and learn the long-term dependencies and temporal relationships within the audio sequences. LSTM layers use a series of gates (input, forget, and output gates) and a memory cell to selectively retain, update, or forget information from previous time steps, enabling them to effectively model long-term dependencies in the input sequences. In our study, the outputs of these three models were then merged and input into a dense final layer for classification, as shown in
Figure 6. Thus, in our model, these architectures are combined to produce a representation of the input sequence that accounts for contextual, local temporal, and long-term dependencies. The Softmax activation function in the dense layer classifies the combined feature representation to produce the final output. The transformer model features three transformer block layers, each containing a multi-head self-attention layer and a feed-forward neural network (FFNN). The multi-head self-attention layers use an embedding dimension of 64 units and 8 heads, and the FFNNs have 64-unit hidden layers with rectified linear unit (ReLU) activation functions. The CNN model comprises 4 Conv1D layers, each with 64 filters, a kernel size of 3, and ReLU activation; a flatten layer then flattens the output of the last Conv1D layer. The LSTM model comprises three 64-unit layers: the first two return their full output sequences, and the last returns only the final hidden state. The final output for emotion classification is generated by concatenating the outputs of the transformer, CNN, and LSTM architectures and passing the result through a dense final layer with six to eight units (matching the number of emotion classes in each dataset) and a Softmax activation function.
(1) Transformer Block: Transformer DL models use self-attention mechanisms to focus on different parts of the input sequence when producing an output, with the transformer block using multi-head self-attention to attend to different positions and understand the data. Dropout layers reduce overfitting, and layer normalization stabilizes the learning process [
37].
Matrices $W_Q$, $W_K$, and $W_V$ are trainable and are updated during the learning process; they project the input into the query ($Q$), key ($K$), and value ($V$) matrices. The attention scores ($S$) are calculated by dividing the product of the query and key matrices by the square root of their dimension, using the following equation:

$$S = \frac{QK^{\top}}{\sqrt{d_k}}$$

Applying SoftMax to $S$ yields the attention weights, $W = \mathrm{SoftMax}(S)$, and the attention layer output is a weighted sum of the value matrices: $Z = WV$.
The FFNN in our model comprises a nonlinear activation function and two linear transformations parameterized by ($W_1$, $b_1$) and ($W_2$, $b_2$), the adjustable weights and biases of the FFNN. The FFNN output is connected to its input through a residual connection, using the following equation:

$$\mathrm{FFN}(x) = \max(0,\, xW_1 + b_1)\,W_2 + b_2, \qquad x' = x + \mathrm{FFN}(x)$$
Layer normalization is the process by which the input is normalized along the feature dimension; it is applied to the outputs of the attention layer and the FFNN using the following equation:

$$\mathrm{LayerNorm}(x) = \gamma \odot \frac{x - \mu}{\sqrt{\sigma^{2} + \epsilon}} + \beta$$

where $\mu$ and $\sigma^{2}$ are the mean and variance of $x$ along the feature dimension, $\epsilon$ is a small constant for numerical stability, and $\gamma$ and $\beta$ are learnable scale and shift parameters.
The transformer block in this model comprises an architectural component that accepts a feature vector of size (192, 1) as the input (see
Table 2). It consists of a multi-head self-attention mechanism and an FFNN, each of which employs residual connections and layer normalization. In addition, the multi-head self-attention’s embed_dim and num_heads parameters represent the size of the input embeddings and number of attention heads. Each attention head individually processes the input, thereby allowing the model to concurrently learn different types of information from a singular input sequence. In this mechanism, query_dense, key_dense, and value_dense are the dense layers that transform the inputs into their corresponding query, key, and value vectors. These vectors were further separated into different heads using the separate_heads method, ensuring parallel and independent computations for each head. The call method subsequently orchestrates the computation flow of the self-attention mechanism by calling the previous components sequentially. After these computations, the dense combined head layer merges the outputs from all attention heads back into the original embedding dimension. The transformer block also features an FFNN, characterized by ff_dim, which denotes the size of its hidden layer. To prevent overfitting, two dropout layers with a rate defined by the hyperparameter “rate” were employed. The feed-forward network (FFN) in the transformer block consists of two dense layers: the first one applies a “relu” activation function, and the second one, which has the same size as embed_dim, applies no activation function. The FFN is applied to the output of the multi-head attention mechanism, which computes the attention weights between all pairs of positions in the input sequence using multiple attention heads. The purpose of the FFN is to process the concatenated output from different attention heads, allowing the model to capture more complex dependencies and transformations. After the FFN, the output traverses another dropout layer and undergoes a residual connection, followed by normalization using a second normalization layer. In summary, the transformer block transforms the input embedding, outputting a transformed embedding of the same size with tunable hyperparameters such as the number of attention heads and the size of the hidden layer in the FFN.
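A hedged Keras sketch of this block (embed_dim = 64, num_heads = 8, ff_dim = 64) is shown below; it substitutes the built-in MultiHeadAttention layer for the hand-rolled query_dense/key_dense/value_dense layers and separate_heads logic described above, so it is a simplification rather than a drop-in for the original implementation:

```python
import tensorflow as tf
from tensorflow.keras import layers

class TransformerBlock(layers.Layer):
    """Multi-head self-attention + FFN, each with a residual connection and layer norm."""

    def __init__(self, embed_dim=64, num_heads=8, ff_dim=64, rate=0.1):
        super().__init__()
        self.att = layers.MultiHeadAttention(num_heads=num_heads,
                                             key_dim=embed_dim // num_heads)
        self.ffn = tf.keras.Sequential([
            layers.Dense(ff_dim, activation="relu"),  # hidden layer of size ff_dim
            layers.Dense(embed_dim),                  # project back to embed_dim
        ])
        self.norm1 = layers.LayerNormalization(epsilon=1e-6)
        self.norm2 = layers.LayerNormalization(epsilon=1e-6)
        self.drop1 = layers.Dropout(rate)
        self.drop2 = layers.Dropout(rate)

    def call(self, x, training=False):
        attn = self.att(x, x)  # self-attention: queries, keys, and values all from x
        x = self.norm1(x + self.drop1(attn, training=training))  # residual + norm
        ffn_out = self.ffn(x)
        return self.norm2(x + self.drop2(ffn_out, training=training))
```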
(2) CNN: In parallel with the transformer block, a series of Conv1D layers are employed in our model to extract local temporal features directly from the input acoustic features. The Conv1D layers consist of 64 filters with a kernel size of 3. The “same” padding technique is used to ensure that the output has the same width as the input. The ReLU activation function is applied element-wise to introduce nonlinearity to the model. Each filter (f) in the Conv1D layers is a vector of weights of size (k), where k denotes the kernel size. The output of the convolutional layer at position i is determined using the following equation:

$$y_i = \sum_{j=1}^{k} f_j \, x_{i+j-1} + b$$

where $x$ is the input sequence, $b$ is a bias term, and $y_i$ is the output at position $i$ before activation.
If padding is used, the input sequence is expanded with zeroes before the filters are applied. Additionally, when a nonlinear activation function, such as ReLU, is employed, it is implemented on every individual element of the output of the convolutional layer.
In summary, the “same” padding technique pads the sides of the input so that the output width matches the input width; the ReLU activation function introduces nonlinearity by outputting the input value if it is positive and zero otherwise; and the CNN layers use filters to extract local features from the input vectors through convolution.
(3) LSTM: The model’s LSTM branch processes the sequence data, with the LSTM accepting the original (192, 1) input feature vector. The LSTM architecture comprises two 64-unit layers. The first outputs its hidden state at every time step because return_sequences is set to “True”, and the next LSTM layer receives each of these outputs; this setting is required when stacking LSTM layers. The final LSTM layer omits this parameter and therefore returns only its last output, which is then fed into dense layers to obtain the final predictions. Notably, LSTM networks can model sequence data, retain information over long spans, and avoid the vanishing-gradient problem associated with traditional recurrent neural networks.
For each time step $t$, the LSTM component receives $x_t$ from the input sequence together with the previous cell state $c_{t-1}$ and hidden state $h_{t-1}$. The LSTM component then calculates the input gate ($i_t$), forget gate ($f_t$), output gate ($o_t$), and cell candidate ($g_t$) using combinations of the present input, the preceding hidden state, and trainable weights according to the following equations:

$$i_t = \sigma(W_i x_t + U_i h_{t-1} + b_i)$$
$$f_t = \sigma(W_f x_t + U_f h_{t-1} + b_f)$$
$$o_t = \sigma(W_o x_t + U_o h_{t-1} + b_o)$$
$$g_t = \tanh(W_g x_t + U_g h_{t-1} + b_g)$$
$$c_t = f_t \odot c_{t-1} + i_t \odot g_t$$
$$h_t = o_t \odot \tanh(c_t)$$

where $\sigma$ is the sigmoid function and $\odot$ denotes element-wise multiplication.
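Assembling the three branches, a minimal Keras sketch of the overall architecture might look as follows. The Dense projection into the embedding dimension and the pooling of the transformer branch are assumptions not specified in the text, and the LSTM depth (three 64-unit layers) follows the overview at the start of this section:

```python
from tensorflow.keras import layers, models

def build_ensemble(n_classes=6, embed_dim=64):
    """Sketch of the transformer + CNN + LSTM ensemble over a (192, 1) input."""
    inp = layers.Input(shape=(192, 1))

    # Transformer branch: project to embed_dim, then three transformer blocks.
    tr_branch = layers.Dense(embed_dim)(inp)
    for _ in range(3):
        tr_branch = TransformerBlock(embed_dim=embed_dim, num_heads=8,
                                     ff_dim=64)(tr_branch)
    tr_branch = layers.GlobalAveragePooling1D()(tr_branch)

    # CNN branch: four Conv1D layers (64 filters, kernel size 3, "same" padding).
    cnn_branch = inp
    for _ in range(4):
        cnn_branch = layers.Conv1D(64, 3, padding="same",
                                   activation="relu")(cnn_branch)
    cnn_branch = layers.Flatten()(cnn_branch)

    # LSTM branch: stacked 64-unit layers; only the last returns a single state.
    lstm_branch = layers.LSTM(64, return_sequences=True)(inp)
    lstm_branch = layers.LSTM(64, return_sequences=True)(lstm_branch)
    lstm_branch = layers.LSTM(64)(lstm_branch)

    merged = layers.Concatenate()([tr_branch, cnn_branch, lstm_branch])
    out = layers.Dense(n_classes, activation="softmax")(merged)

    model = models.Model(inp, out)
    model.compile(optimizer="adamax",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```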
4. Experimental Results
The results from this study’s model were compared to those of models employed by previous researchers, enabling the evaluation of different models based on established metrics. The SER system used in this study was scrutinized via speaker-independent experiments conducted on five datasets. The data were partitioned using a percentage-based stratified approach, with 75% of the data used to train the SER models and the remaining 25% reserved for testing. Figure 7, Figure 8, Figure 9, Figure 10, Figure 11, Figure 12, Figure 13, Figure 14 and Figure 15 show the performance of the proposed model, which was trained using a feature set comprising 192 features from the five datasets. We evaluated the performance of our ensemble model across the five datasets; the results are summarized in
Table 3,
Table 4,
Table 5,
Table 6 and
Table 7, which provide insights into the ensemble model’s performance on each dataset.
Table 8 presents the accuracy of the ensemble model across the five datasets.
4.1. Experimental Design
Several packages and software programs were used, including TensorFlow and Librosa v0.10.0 for audio preprocessing and WavePad v12.52 for audio segmentation. The training and testing procedures were conducted using Google Colaboratory, a platform equipped with a 2.20 GHz Intel Xeon CPU, 25 GB of RAM, and Tesla graphics processing units (GPUs). The DL libraries Keras v2.9.0 and TensorFlow v2.8.0 were used to develop and train the neural network models, and the GPUs allowed for efficient processing of the large matrix operations required for training. Keras provides a simple interface for building neural networks, whereas TensorFlow offers more flexibility for customizing and fine-tuning models. In this research, Python v3.10 was used as the implementation language for the proposed method.
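To make the pipeline concrete, the sketch below ties these pieces together: the 75/25 stratified split described above and the training configuration reported in Section 4.3 (AdaMAX optimizer, sparse categorical cross-entropy, 200 epochs, batch size 20). The feature/label file names and the build_ensemble helper from the Section 3.4 sketch are illustrative assumptions, not released artifacts:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical precomputed arrays: 192-feature vectors and integer emotion labels.
X = np.load("features.npy")   # shape: (n_samples, 192)
y = np.load("labels.npy")     # shape: (n_samples,)

# 75/25 percentage-based stratified split, as described in Section 4.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42)

model = build_ensemble(n_classes=len(np.unique(y)))
history = model.fit(X_train[..., np.newaxis], y_train,
                    validation_data=(X_test[..., np.newaxis], y_test),
                    epochs=200, batch_size=20)
```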
4.2. Measuring Tools Used for the Evaluation
Several metrics were used to measure the performance of the SER model on the test set across the five datasets in the evaluation, including accuracy, loss, precision, F1 score, and recall values; a confusion matrix; and receiver operating characteristic (ROC) curves [
38].
Each class i was evaluated by quantifying its true positives (TP_i), true negatives (TN_i), false positives (FP_i), and false negatives (FN_i). These counts were used to determine the following performance metrics.
Accuracy measures the model’s overall prediction correctness as the ratio of accurate predictions to the total number of predictions, using the following equation:

$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$
Precision denotes the proportion of true positive predictions among all positive predictions, i.e., the ratio of true positives to the combined total of true and false positives, using the following equation:

$$\mathrm{Precision} = \frac{TP}{TP + FP}$$
Recall denotes the proportion of true positive predictions among all actual positives, defined as the ratio of true positives to the sum of true positives and false negatives. This is calculated as follows:

$$\mathrm{Recall} = \frac{TP}{TP + FN}$$
The F1 score provides a means of balancing precision and recall; it is their harmonic mean, determined using the following equation:

$$F_1 = 2 \cdot \frac{\mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$$
A confusion matrix indicates the class–prediction distribution of the classification model. It tabulates the TP, TN, FP, and FN counts from which precision, recall, sensitivity, and specificity are calculated, helping to identify model performance issues.
The ROC curve plots the true positive rate (TPR) against the false positive rate (FPR) at different classification thresholds, where $\mathrm{TPR} = TP/(TP + FN)$ and $\mathrm{FPR} = FP/(FP + TN)$.
The Matthews correlation coefficient (MCC) is a measure of the quality of binary classifications. It takes into account true and false positives and negatives and is generally regarded as a balanced measure that can be used even if the classes are of very different sizes; it is defined as follows:

$$\mathrm{MCC} = \frac{TP \cdot TN - FP \cdot FN}{\sqrt{(TP+FP)(TP+FN)(TN+FP)(TN+FN)}}$$
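All of these metrics can be computed with scikit-learn; a minimal sketch, assuming the model and the test split from the sketches above:

```python
import numpy as np
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, matthews_corrcoef, confusion_matrix)

y_pred = model.predict(X_test[..., np.newaxis]).argmax(axis=1)

print("Accuracy :", accuracy_score(y_test, y_pred))
print("Precision:", precision_score(y_test, y_pred, average="macro"))
print("Recall   :", recall_score(y_test, y_pred, average="macro"))
print("F1 score :", f1_score(y_test, y_pred, average="macro"))
print("MCC      :", matthews_corrcoef(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))  # rows: true classes; columns: predictions
```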
4.3. Ensemble Model Evaluation
The evaluation process demonstrated that the ensemble model performed well across all of the datasets, thereby demonstrating its effectiveness in making accurate predictions. The evaluation metrics for each dataset are as follows:
4.3.1. The EMO-DB Dataset
The ensemble model underwent 200 epochs of training with a batch size of 20, incorporating the AdaMAX optimizer and sparse categorical cross-entropy loss. The data were partitioned into training and testing sets according to the prescribed ratio. The model had, on average, a testing accuracy of 99.86%, a precision of 99.71%, a recall of 99.71%, and an F1 score of 99.71% for all emotions (see
Table 3).
Figure 7 details the loss and accuracy curves for this dataset.
Figure 8 shows the confusion matrix, which summarizes the model’s performance in terms of classification by indicating the number of correct and incorrect predictions for the seven classes. In this metric, the diagonal lines of the matrix indicate the true positives for each class, whereas the off-diagonal elements represent false positives and negatives.
Table 3.
Ensemble model performance on the EMO-DB dataset.
Table 3.
Ensemble model performance on the EMO-DB dataset.
Emotion | Accuracy | Precision | Recall | F1 Score | MCC |
---|
Anger | 100% | 100% | 100% | 100% | 100% |
Boredom | 100% | 100% | 100% | 100% | 100% |
Disgust | 100% | 100% | 100% | 100% | 100% |
Fear | 99% | 99% | 99% | 99% | 100% |
Happiness | 98% | 100% | 99% | 99% | 100% |
Neutrality | 100% | 100% | 100% | 100% | 99% |
Sadness | 100% | 99% | 99% | 99% | 100% |
Average | 99.86% | 99.71% | 99.71% | 99.71% | 100% |
Figure 7.
Ensemble model (a) loss curves and (b) accuracy curves for the EMO-DB dataset.
Figure 7.
Ensemble model (a) loss curves and (b) accuracy curves for the EMO-DB dataset.
Figure 8.
Ensemble model confusion matrix for the EMO-DB dataset.
Figure 8.
Ensemble model confusion matrix for the EMO-DB dataset.
4.3.2. The RAVDESS Dataset
The ensemble model was trained for 200 epochs using a batch size of 20. The AdaMAX optimizer was employed in conjunction with sparse categorical cross-entropy loss during the training process. The dataset was partitioned into distinct subsets for training and testing. The results indicated, on average, a testing accuracy of 96.3%, a precision of 95.7%, a recall of 96.3%, and an F1 score of 95.9% for all emotions.
Table 4 presents the results.
Figure 9 details the loss and accuracy curves for this dataset, and
Figure 10 demonstrates how many predictions were correct and false for the eight classes in the classification process.
Table 4.
Ensemble model performance on the RAVDESS dataset.
Table 4.
Ensemble model performance on the RAVDESS dataset.
Emotion | Accuracy | Precision | Recall | F1 Score | MCC |
---|
Happiness | 91% | 97% | 94% | 94% | 98% |
Sadness | 89% | 94% | 92% | 92% | 93% |
Anger | 93% | 95% | 94% | 94% | 93% |
Neutrality | 94% | 90% | 93% | 92% | 96% |
Surprise | 98% | 99% | 98% | 98% | 96% |
Calm | 96% | 96% | 96% | 96% | 91% |
Fear | 99% | 92% | 96% | 94% | 90% |
Disgust | 95% | 97% | 96% | 96% | 95% |
Average | 96.3% | 95.7% | 96.3% | 95.9% | 94.4% |
Figure 9.
Ensemble model (a) loss curves and (b) accuracy curves for the RAVDESS dataset.
Figure 9.
Ensemble model (a) loss curves and (b) accuracy curves for the RAVDESS dataset.
Figure 10.
Ensemble model confusion matrix for the RAVDESS dataset.
Figure 10.
Ensemble model confusion matrix for the RAVDESS dataset.
4.3.3. The SAVEE Dataset
The ensemble model was trained for 200 epochs with a batch size of 20, combined with the AdaMAX optimizer and sparse categorical cross-entropy loss. The dataset was split according to the 75/25 ratio discussed above. The model’s average testing accuracy, precision, recall, and F1 score for all emotions in this dataset were determined to be 96.5%, 95.3%, 95.6%, and 95.2%, respectively (see
Table 5).
Figure 11 visualizes the loss and accuracy curves for this dataset, and the confusion matrix is included in
Figure 12.
Table 5.
Ensemble model performance on the SAVEE dataset.
Table 5.
Ensemble model performance on the SAVEE dataset.
Emotion | Accuracy | Precision | Recall | F1 Score | MCC |
---|
Anger | 96.2% | 95.2% | 97.3% | 96.3% | 92.9% |
Disgust | 100% | 100% | 99.4% | 100% | 90.5% |
Fear | 82.8% | 81.0% | 99.9% | 89.3% | 92.2% |
Happiness | 87.9% | 86.4% | 72.2% | 77.6% | 89.6% |
Neutrality | 100% | 100% | 100% | 100% | 89.4% |
Sadness | 100% | 100% | 100% | 100% | 97.5% |
Surprise | 99.2% | 96.5% | 93.0% | 94.8% | 92.3% |
Average | 96.5% | 95.3% | 95.6% | 95.2% | 93.7% |
Figure 11.
Ensemble model (a) loss curves and (b) accuracy curves for the SAVEE dataset.
Figure 11.
Ensemble model (a) loss curves and (b) accuracy curves for the SAVEE dataset.
Figure 12.
Ensemble model confusion matrix for SAVEE.
Figure 12.
Ensemble model confusion matrix for SAVEE.
4.3.4. The IEMOCAP Dataset
After adjusting the classification thresholds to achieve the desired accuracy, the model demonstrated an average accuracy of 85.3%, an average precision of 94.1%, an average recall of 61.4%, and an F1 score of 73.6% for all emotions in the IEMOCAP dataset.
Table 6 presents the results. While the classification thresholds were adjusted to achieve the desired accuracy, the overall F1 score reflects the resulting trade-off between precision and recall.
Figure 13 shows the loss and accuracy curves for the IEMOCAP dataset, and the confusion matrix for the emotion classification model appears in
Figure 14. The neutrality and surprise classes recorded the highest accuracies of 95.0%, whereas excitement recorded the highest precision at 99.0%. Neutrality and surprise produced the highest recall rates (92.0%), and surprise produced the highest F1 score (94.0%).
Table 6.
Ensemble model performance on the IEMOCAP dataset.
Table 6.
Ensemble model performance on the IEMOCAP dataset.
Emotion | Accuracy | Precision | Recall | F1 Score | MCC |
---|
Anger | 80.0% | 95% | 50.0% | 65.0% | 76% |
Excitement | 90.0% | 99% | 70.0% | 81.8% | 80% |
Fear | 85.0% | 92% | 60.0% | 72.7% | 81% |
Frustration | 80.0% | 87% | 40.0% | 54.8% | 66% |
Happy | 85.0% | 97% | 58.0% | 72.8% | 76% |
Neutrality | 95.0% | 90% | 92.0% | 91.0% | 91% |
Sadness | 85.0% | 94% | 60.0% | 72.7% | 73% |
Surprise | 95.0% | 96% | 92.0% | 94.0% | 87% |
Average | 85.3% | 94.1% | 61.4% | 73.6% | 78% |
Figure 13.
Ensemble model (a) loss curves and (b) accuracy curves for the IEMOCAP dataset.
Figure 13.
Ensemble model (a) loss curves and (b) accuracy curves for the IEMOCAP dataset.
Figure 14.
Ensemble model confusion matrix for the IEMOCAP dataset.
Figure 14.
Ensemble model confusion matrix for the IEMOCAP dataset.
4.3.5. The SHEIE Dataset
The ensemble model was trained for 200 epochs with a batch size of 20, using the AdaMAX optimizer and sparse categorical cross-entropy loss. The model demonstrated an average accuracy of 83%, an average precision of 83%, an average recall of 81.17%, and an average F1 score of 81.71% across all emotions (see
Table 7).
Figure 15 displays the loss and accuracy curves for this dataset, and
Figure 16 represents the confusion matrix, which indicates the predicted and actual outcomes for each emotional class. These results demonstrate the effectiveness of the proposed ensemble model in terms of accurately classifying emotions in the SHEIE dataset.
Table 7.
Ensemble model performance on the SHEIE dataset.
Table 7.
Ensemble model performance on the SHEIE dataset.
Emotion | Accuracy | Precision | Recall | F1 Score | MCC |
---|
Anger | 84.13% | 84.13% | 86.07% | 85.10% | 91.9% |
Boredom | 81.07% | 81.07% | 73.33% | 76.92% | 85.9% |
Excitement | 84.13% | 84.13% | 85.71% | 84.91% | 85.6% |
Happiness | 82.10% | 82.10% | 85.54% | 83.75% | 83.7% |
Neutrality | 80.2% | 80% | 77.8% | 78.3% | 86.7% |
Sadness | 79.87% | 79.87% | 79.87% | 79.87% | 81.7% |
Average | 83% | 83% | 81.17% | 81.71% | 85.1% |
Figure 15.
Ensemble model (a) loss curves and (b) accuracy curves for the SHEIE dataset.
Figure 15.
Ensemble model (a) loss curves and (b) accuracy curves for the SHEIE dataset.
Figure 16.
Ensemble model confusion matrix for the SHEIE dataset.
Figure 16.
Ensemble model confusion matrix for the SHEIE dataset.
Table 8.
The ensemble model’s overall accuracy.
Table 8.
The ensemble model’s overall accuracy.
Dataset | Model | Testing Accuracy | F1 Score |
---|
EMO-DB | Ensemble model | 99.86% | 99.71% |
RAVDESS | Ensemble model | 96.3% | 95.9% |
SAVEE | Ensemble model | 96.5% | 95.2% |
IEMOCAP | Ensemble model | 85.3% | 73.6% |
SHEIE | Ensemble model | 83% | 81.71% |
5. Analysis and Discussion
The evaluation of the efficacy of the proposed model involved conducting experiments using not only the diverse EMO-DB, RAVDESS, SAVEE, and IEMOCAP datasets, but also the purpose-built SHEIE dataset.
Table 8 presents the performance of the model across these datasets. Its performance also improved with the DA strategies of adding noise to, time stretching, and shifting the audio data, which diversified the training datasets, provided the model with more robust features, and improved its generalizability to new data.
Table 9 compares the precision of the model developed in this study to that of other models used in previous studies.
The developed model captures the relationships within the input sequence using several transformer blocks and a multi-head self-attention mechanism. DA, preprocessing, and the sophisticated model design further increase the reliability of this model. The ensemble model outperformed those developed in previous studies [
11,
12,
13,
14,
15,
16,
17,
18,
19,
20,
21,
22,
23] and demonstrably increased the accuracy to 95.25%, compared with the 56.41% achieved by a multiclass SVM model using the MFMC, MFCC, LFPC, and LPCC features. The ensemble model recorded average accuracies of 99.86% for the EMO-DB dataset, 96.3% for the RAVDESS dataset, 96.5% for the SAVEE dataset, 85.3% for the IEMOCAP dataset, and 83% for the SHEIE dataset, thereby enhancing SER performance. On the SHEIE dataset, the model’s average accuracies for anger, boredom, excitement, happiness, and sadness were 84.13%, 81.07%, 84.13%, 82.10%, and 79.87%, respectively, showing that the model can accurately classify numerous emotions. Voice-assisted emotion detection and online teaching tools can benefit from these capabilities.
Figure 17 summarizes the model’s accuracy and F1 scores across the datasets, demonstrating its strong performance. The ROC curves in
Figure 18 visualize the ensemble model’s performance in terms of classifying emotions from the five datasets, with the
x-axis representing the FPR and the
y-axis representing the TPR. The curves show that the model achieves a high TPR (90%) at a low FPR (10%), demonstrating that it can effectively identify emotions.
5.1. Ablation Study
To gain a deeper understanding of the individual contributions of each component in the proposed ensemble model, we conducted an ablation study. The purpose of this study was to assess the individual impacts of the transformer, CNN, and LSTM architectures on the overall performance of the SER system. By systematically removing each component and evaluating the model’s performance, we aimed to identify the relative importance of these architectures in capturing emotional information from speech data.
For the ablation study, we created four variant models by removing each of the key components:
Ensemble model without transformer (EM-T): This variant excluded the transformer block, which is responsible for capturing the contextual information and long-range dependencies in the speech data.
Ensemble model without CNN (EM-C): In this variant, we removed the CNN layers that extract local temporal features from the input sequences.
Ensemble model without LSTM (EM-L): This variant eliminated the LSTM layers, which are designed to capture and learn the long-term dependencies and temporal relationships within the audio sequences.
Ensemble model with four common emotion classes (EM-4E): In this variant, we focused on the four common emotion classes (happy, angry, sad, and neutral) across the RAVDESS, EMO-DB, SAVEE, and IEMOCAP datasets. These four datasets were used for training, while the SHEIE dataset was used solely for testing.
All other components and hyperparameters of the ensemble model remained unchanged during the ablation study. We evaluated each variant model on the five datasets (RAVDESS, EMO-DB, SAVEE, IEMOCAP, and SHEIE) using the same evaluation metrics as the complete ensemble model.
Table 10 presents the performance comparison of the complete ensemble model and its variants on the five datasets. The results demonstrate the individual contributions of the transformer, CNN, and LSTM architectures to the overall performance of the SER system.
5.2. Real-Time Deployment and Field Testing
To assess the practical effectiveness and user acceptance of the proposed speech emotion recognition (SER) system in real educational settings, it is crucial to conduct field tests in actual distance learning environments. While the current study has demonstrated the model’s performance on benchmark datasets, evaluating its performance in real-world scenarios is essential for understanding its potential impact and identifying areas for further improvement.
5.2.1. Real-Time Deployment
To facilitate the deployment of the SER system in real educational settings, we have developed a user-friendly web interface using the Flask framework. This web interface allows users to upload audio files and obtain real-time emotion classification results. The backend of the system is powered by the proposed ensemble model, which processes the uploaded audio files and returns the predicted emotion labels.
The deployment process involves the following steps:
Model Serialization: The trained ensemble model is serialized and saved to disk, enabling efficient loading and inference in the web application.
Flask Application Development: A Flask application is created to handle the user interactions and manage the audio file uploads. The application provides a simple and intuitive user interface for uploading audio files and displaying the emotion classification results, as shown in
Figure 19.
Audio Preprocessing: Upon receiving an uploaded audio file, the Flask application applies the necessary preprocessing steps, such as resampling, normalization, and feature extraction, to prepare the audio data for input to the ensemble model.
Emotion Classification: The preprocessed audio data are fed into the loaded ensemble model, which performs the emotion classification task and returns the predicted emotion label.
Result Visualization: The predicted emotion label is then presented to the user through the web interface, providing real-time feedback on the emotional content of the uploaded audio file.
This real-time deployment of the SER system using Flask and a web interface allows for its easy integration into existing distance learning platforms and enables educators and learners to benefit from the system’s emotion recognition capabilities.
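A minimal sketch of such a service is shown below; the route name, label ordering, and file names are illustrative assumptions rather than the deployed implementation, and extract_features refers to the sketch in Section 3.3:

```python
import numpy as np
from flask import Flask, request, jsonify
from tensorflow.keras.models import load_model

app = Flask(__name__)
model = load_model("ensemble_ser.h5")   # step 1: the serialized ensemble model
# Assumed label ordering for the six SHEIE emotions.
LABELS = ["anger", "boredom", "excitement", "happiness", "neutrality", "sadness"]

@app.route("/predict", methods=["POST"])
def predict():
    # Steps 3-4: preprocess the uploaded file and classify it.
    request.files["audio"].save("upload.wav")
    features = extract_features("upload.wav")        # 192-dim vector (Section 3.3)
    probs = model.predict(features.reshape(1, 192, 1))[0]
    # Step 5: return the predicted emotion label to the interface.
    return jsonify({"emotion": LABELS[int(np.argmax(probs))]})

if __name__ == "__main__":
    app.run(debug=True)
```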
5.2.2. Field Testing
To validate the effectiveness and user acceptance of the SER system in real educational settings, we propose conducting field tests in collaboration with educational institutions offering distance learning programs. The field tests will involve the following steps:
Participant Recruitment: Educators and learners from diverse backgrounds and cultural contexts will be invited to participate in the field tests. Informed consent will be obtained, and participants will be briefed on the purpose and procedures of the study.
Integration with Distance Learning Platforms: The SER system will be integrated into the existing distance learning platforms used by the participating institutions. This integration will ensure a seamless user experience and allow for the collection of real-time emotional data during online learning sessions.
Data Collection: During the field tests, the SER system will be used to analyze the emotional content of speech data from both educators and learners. The system will record the predicted emotion labels along with relevant metadata, such as timestamps and user identifiers (anonymized to protect privacy).
User Feedback and Evaluation: Participants will be asked to provide feedback on their experience with the SER system through surveys and interviews. The feedback will cover aspects such as the system’s usability, the perceived accuracy of the system, and the impact of emotion recognition on the learning experience.
Data Analysis and Refinement: The collected data and user feedback will be analyzed to assess the system’s performance in real educational settings. This analysis will focus on metrics such as accuracy, user satisfaction, and the system’s influence on teaching and learning outcomes. Based on the findings, the SER system will be refined and optimized to better meet the needs of educators and learners.
These field tests in real educational settings will provide valuable insights into the practical effectiveness of the SER system and its acceptance by users. The feedback and data collected during these tests will inform future iterations of the system, ensuring its continued improvement and alignment with the requirements of distance learning environments.
5.3. Privacy and Ethical Considerations
Both the development and deployment of speech emotion recognition (SER) systems, particularly in educational contexts, raise important privacy and ethical concerns that must be addressed. As SER technology involves the processing and analysis of potentially sensitive personal data, such as emotional states, it is crucial to prioritize data protection and ensure transparency regarding how this information is collected, used, and stored.
In diverse cultural contexts, the perception and expression of emotions may vary significantly. Therefore, it is essential to consider cultural nuances and develop SER systems that are culturally sensitive and respectful of individual privacy. Future research and development efforts should involve collaboration with experts in ethics, privacy, and cultural studies to establish guidelines and the best practices for the responsible deployment of SER technology.
To address privacy concerns, several measures can be implemented:
Data Anonymization: Ensuring that personal identifiers are removed from the speech data and that individuals cannot be directly linked to their emotional data.
Secure Data Storage: Implementing robust security measures to protect the collected emotional data from unauthorized access or breaches.
Informed Consent: Obtaining explicit consent from individuals before collecting and processing their speech data for emotion recognition purposes, after clearly communicating how the data will be used and stored.
Transparency and Control: Providing individuals with transparency about the SER system’s functionalities, the types of data being collected, and how these data are being used, as well as offering individuals control over their data, including the ability to access, modify, or delete their emotional data.
Ethical Guidelines: Developing and adhering to ethical guidelines that govern the use of SER technology, ensuring that it is not misused or employed in ways that could harm or discriminate against individuals.
By proactively addressing privacy and ethical concerns, researchers and developers can foster trust in SER systems and promote their responsible deployment in educational settings. Future work should prioritize the development of privacy-preserving techniques, such as federated learning [
39] or differential privacy, to enable the training of SER models without compromising individual privacy.
Moreover, ongoing dialogue and collaboration among researchers, educators, policymakers, and the public are necessary to navigate the ethical implications of SER technology and develop frameworks that balance the benefits of emotional intelligence with the protection of individual rights and cultural values.
5.4. Explainability and Model Interpretation
As the proposed speech emotion recognition (SER) system is intended for use in educational settings, it is crucial to provide insights into how the model makes decisions. Explainability and model interpretation are essential for building trust and acceptance among users, particularly educators and learners who rely on the system’s emotion recognition capabilities to inform their teaching and learning strategies.
In the current study, we have focused on developing an accurate and robust SER model using an ensemble approach combining transformer, CNN, and LSTM architectures. However, the model’s decision-making process remains largely opaque, limiting users’ understanding of why certain emotions are recognized in specific instances.
To address this challenge, future work should focus on enhancing the explainability and interpretability of the SER model. Several techniques can be employed to provide deeper insights into the model’s inner workings:
Feature Importance Analysis: Conducting a thorough analysis of the importance of different audio features in the emotion recognition process can help identify the key characteristics that contribute to the model’s decisions. Techniques such as permutation feature importance or Shapley Additive Explanation (SHAP) can be used to quantify the contribution of each feature to the model’s predictions (a permutation-importance sketch follows this list) [
40].
Attention Visualization: Visualizing the attention weights learned by the transformer block can provide insights into which parts of the speech signal the model focuses on when making emotion predictions. Heatmaps or other visual representations can be used to highlight the regions of the audio that are most relevant for recognizing specific emotions.
Layer-Wise Relevance Propagation (LRP): LRP is a technique that allows for the visualization of the relevance of each input feature to the model’s output. By applying LRP to the SER model [
41], we can trace back the contribution of different audio features and identify the most important ones for each emotion class.
Explainable AI Frameworks: Integrating explainable AI frameworks, such as LIME (Local Interpretable Model-Agnostic Explanation) or SHAP, can provide local explanations for individual predictions [
42]. These frameworks generate interpretable explanations by perturbing the input features and observing their impact on the model’s output, enabling users to understand why a particular emotion was recognized in a specific instance.
User Studies and Feedback: Conducting user studies with educators and learners to gather feedback on the interpretability and explainability of the SER model can provide valuable insights into how well users understand and trust the model’s decisions. This feedback can guide further improvements to the model’s explainability and inform the development of user-friendly interfaces that effectively communicate the basis of the model’s emotion recognition.
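As a concrete starting point for the feature importance analysis mentioned in the first item above, the following model-agnostic permutation-importance sketch measures the drop in test accuracy when each of the 192 features is shuffled in turn; it is an illustration under the assumptions of the earlier sketches, not part of the evaluated system:

```python
import numpy as np

def permutation_importance(model, X, y, n_repeats=3, seed=0):
    """Permutation feature importance: shuffle one feature column at a time
    and record the average drop in accuracy; larger drops mean the model
    relies more heavily on that feature."""
    rng = np.random.default_rng(seed)
    base = np.mean(model.predict(X[..., np.newaxis]).argmax(axis=1) == y)
    drops = np.zeros(X.shape[1])
    for j in range(X.shape[1]):
        for _ in range(n_repeats):
            Xp = X.copy()
            rng.shuffle(Xp[:, j])        # permute only feature j across samples
            acc = np.mean(model.predict(Xp[..., np.newaxis]).argmax(axis=1) == y)
            drops[j] += (base - acc) / n_repeats
    return drops  # e.g., drops[:40] covers the 40 MFCC dimensions
```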
By deepening the explainability and model interpretation of the SER system, we can foster trust and acceptance among users in educational settings. Educators and learners will be empowered to understand the underlying reasons behind the model’s emotion recognition, enabling them to make informed decisions and adapt their teaching and learning strategies accordingly.
6. Limitations and Future Work
The current study has some limitations that should be acknowledged. Firstly, the diversity of the datasets used in terms of their language, cultural background, and educational context could be further expanded. Secondly, the performance of the SER system in real-world distance learning environments may differ from its performance on benchmark datasets. Thirdly, this study focuses on recognizing a limited set of discrete emotions, which may not fully capture the complexity of human emotions. Fourthly, individual differences in emotion expression and perception are not extensively explored. Lastly, the SER system relies primarily on acoustic features without incorporating additional modalities such as facial expressions and body language.
Future research should focus on addressing these limitations by incorporating more diverse datasets, conducting field studies in real distance learning environments, exploring fine-grained emotion recognition, personalizing the SER system to individual users, and integrating multimodal information. Additionally, future work should investigate methods to enhance the explainability and interpretability of the SER system, explore ethical considerations, and study its potential impact on student–teacher relationships. Deploying the SER system on cloud platforms like Amazon Web Services (AWS) can facilitate scalability, high availability, and efficient resource management for real-world applications. Furthermore, exploring federated learning techniques can enable privacy-preserving training and personalization of the SER model across multiple institutions without directly sharing sensitive speech data.
Despite these limitations, the current study provides a foundation for further research and development of SER systems in distance education. By addressing the identified challenges and opportunities, we can work towards creating more robust, reliable, and user-centered SER systems that effectively support and enhance the distance learning experience.
7. Conclusions
This study proposes a comprehensive system for measuring the emotional stability of remote educators, aiming to improve the quality of distance education. The proposed ensemble model combines transformer, CNN, and LSTM architectures to enhance the identification of emotions in speech. MFCCs; chroma; the Mel spectrogram; the ZCR; the spectral contrast, centroid, bandwidth, and roll-off; and the RMS were extracted from audio files to enhance the effectiveness of the model, and noise addition, time stretching, and audio data shifting were used as DA methods to improve the performance of the model. This system was demonstrated to recognize emotions with accuracies of 96.3% for the RAVDESS dataset, 99.86% for the EMO-DB dataset, 85.3% for the IEMOCAP dataset, 96.5% for the SAVEE dataset, and 83% for the SHEIE dataset. The SHEIE dataset developed for this study, which comprises recordings of instructor emotions during online teaching sessions, has advanced SER research. However, the various limitations of the proposed system should be addressed by improving preprocessing, adding features, and incorporating additional DA steps. Furthermore, we intend to broaden the number of languages included in the dataset to improve the model’s ability to recognize emotions independently of language. Refining this method should increase its practical applicability, particularly in terms of analyzing the emotional states of distant educators, as it will allow the model to recognize emotional states across languages and datasets. Future research should test this model in real-world educational settings to determine its effect on teaching and learning.