Combining Transformer, Convolutional Neural Network, and Long Short-Term Memory Architectures: A Novel Ensemble Learning Technique That Leverages Multi-Acoustic Features for Speech Emotion Recognition in Distance Education Classrooms

Alkhamali, Eman Abdulrahman; Allinjawi, Arwa; Ashari, Rehab Bahaaddin

doi:10.3390/app14125050

Open AccessArticle

Combining Transformer, Convolutional Neural Network, and Long Short-Term Memory Architectures: A Novel Ensemble Learning Technique That Leverages Multi-Acoustic Features for Speech Emotion Recognition in Distance Education Classrooms

by

Eman Abdulrahman Alkhamali

^1,*,

Arwa Allinjawi

¹

and

Rehab Bahaaddin Ashari

²

¹

Department of Computer Science King Abdul-Aziz University, Jeddah 21589, Saudi Arabia

²

Information Systems Department, King Abdul-Aziz University, Jeddah 21589, Saudi Arabia

^*

Author to whom correspondence should be addressed.

Appl. Sci. 2024, 14(12), 5050; https://doi.org/10.3390/app14125050

Submission received: 14 April 2024 / Revised: 25 May 2024 / Accepted: 29 May 2024 / Published: 10 June 2024

(This article belongs to the Special Issue Computer Vision and AI for Interactive Robotics)

Download

Browse Figures

Versions Notes

Abstract

:

Speech emotion recognition (SER) is a technology that can be applied to distance education to analyze speech patterns and evaluate speakers’ emotional states in real time. It provides valuable insights and can be used to enhance students’ learning experiences by enabling the assessment of their instructors’ emotional stability, a factor that significantly impacts the effectiveness of information delivery. Students demonstrate different engagement levels during learning activities, and assessing this engagement is important for controlling the learning process and improving e-learning systems. An important aspect that may influence student engagement is their instructors’ emotional state. Accordingly, this study used deep learning techniques to create an automated system for recognizing instructors’ emotions in their speech when delivering distance learning. This methodology entailed integrating transformer, convolutional neural network, and long short-term memory architectures into an ensemble to enhance the SER. Feature extraction from audio data used Mel-frequency cepstral coefficients; chroma; a Mel spectrogram; the zero-crossing rate; spectral contrast, centroid, bandwidth, and roll-off; and the root-mean square, with subsequent optimization processes such as adding noise, conducting time stretching, and shifting the audio data. Several transformer blocks were incorporated, and a multi-head self-attention mechanism was employed to identify the relationships between the input sequence segments. The preprocessing and data augmentation methodologies significantly enhanced the precision of the results, with accuracy rates of 96.3%, 99.86%, 96.5%, and 85.3% for the Ryerson Audio–Visual Database of Emotional Speech and Song, Berlin Database of Emotional Speech, Surrey Audio–Visual Expressed Emotion, and Interactive Emotional Dyadic Motion Capture datasets, respectively. Furthermore, it achieved 83% accuracy on another dataset created for this study, the Saudi Higher-Education Instructor Emotions dataset. The results demonstrate the considerable accuracy of this model in detecting emotions in speech data across different languages and datasets.

Keywords:

transformer; convolutional neural network; long short-term memory; speech emotion recognition; distance education; real-time; emotional stability; instructors

1. Introduction

Speech emotion recognition (SER) aims to detect and comprehend emotions expressed as part of verbal communication. This field is currently undergoing significant developments. Although humans express their feelings using body language, facial expressions, and speech, the latter is generally understood to be the most efficient interaction mode [1]. SER can be used to accurately pinpoint speakers’ emotional states by processing the paralinguistic features within their speech signals. This presents transformative prospects, especially in human–machine interfaces [2], and including emotional understanding in these systems can promote more natural and seamless interactions.

Education is among the many fields in which SER can be applied, with emotions profoundly affecting student engagement and motivation, thereby impacting learning outcomes [3]. Several studies have shown that SER can be used to obtain important information about how students feel, especially in a distance learning context, and these results can inform adaptive teaching strategies. For example, if a student’s speech signals indicate confusion or frustration, SER can prompt the educator to provide supplementary explanations or personalized aid [4]. Meanwhile, other studies have investigated how instructors’ emotional stability affects how effectively they deliver information and the effect that this has on student engagement and information, which can be used to improve teaching quality and student outcomes [5].

The field of deep learning (DL) has made significant progress in revolutionizing the ability of machines to comprehend and interpret human emotions. DL models can currently analyze the intricate characteristics of unprocessed data, including voice recordings, to detect emotional cues. Emotional recognition technology has the potential to enhance human–machine interactions in a seamless and instinctive manner across a wide range of applications, and swift advancements in this domain underscore DL’s profound capacity to facilitate machines’ recognition of and reactions to human emotions. Accordingly, the model proposed herein can significantly advance the field of SER by equipping machines with a human-like ability to discern emotions such as happiness, sadness, and anger in human speech [6]. By enabling machines to analyze subtle variations in tone, cadence, and timbre, DL allows the deduction of emotional states from voice signals. The proposed model allows for this in a manner that exceeds the capabilities of conventional methodologies. This technique is paramount for evaluating lecturers’ emotional states because it enables the identification of potential stressors, anxiety, or emotional distress that might inhibit the efficacy of their teaching. Upon detecting these emotional indicators, SER technology can instigate suitable interventions, ranging from offering counseling to providing lecturers with appropriate support services, thereby ensuring that they reach their optimal teaching capacities [7]. Furthermore, the integration of SER into intelligent tutoring systems paves the way for a more nuanced personalization of instructional content, tone, and feedback by fostering responsive and effective teaching–learning environments that are underpinned by a deeper understanding of emotional cues.

In this study, Mel-frequency cepstral coefficients (MFCCs); chroma; the Mel spectrogram; the zero-crossing rate (ZCR); the spectral contrast, centroid, bandwidth, and roll-off; and the root-mean square (RMS) were obtained from raw audio data, after which noise was added and the audio data were shifted and time-stretched. Then, an ensemble model comprising a combination of transformer, convolutional neural network (CNN), and long short-term memory (LSTM) architectures was used to classify emotions from the static features in the data.

This study makes several significant contributions to the field:

The creation and use of the Saudi Higher-Education Instructor Emotions (SHEIE) dataset: This dataset is a distinct resource for SER research because it includes meticulous annotations of emotions and specifically focuses on Saudi Arabian instructors, making it unique and useful in terms of SER in education.
A comprehensive series of experiments: This study explored the effectiveness of various data augmentation (DA) and feature extraction techniques for speech emotion classification. The evaluation steps were conducted using five benchmark datasets, namely the Ryerson Audio–Visual Database of Emotional Speech and Song (RAVDESS) dataset, the Berlin Database of Emotional Speech (EMO-DB) dataset, the Surrey Audio–Visual Expressed Emotion (SAVEE) dataset, the Interactive Emotional Dyadic Motion Capture (IEMOCAP) dataset, and SHEIE.
The proposal of a new model for classifying emotions from speech: This model uses a transformer architecture that incorporates multi-head self-attention. Furthermore, the favorable attributes of CNN and LSTM networks were combined to extract the spectral features and determine the temporal dynamics.

The rest of the paper is organized as follows: Section 2 discusses the pertinent literature; Section 3 outlines the methodology adopted for this study; Section 4 details the experimental outcomes; Section 5 discusses the research findings; and Section 6 concludes the paper and outlines future research avenues.

2. Literature Review

Various studies have used machine learning (ML) and DL methodologies to classify emotions in speech. Recognizing emotions from speech signals represents a significant yet complex aspect of human–computer interactions [8,9]. Furthermore, numerous approaches have been used in the field of SER to identify emotions from speech signals, including well-known methods for speech analysis and classification. For example, feature extraction involves transforming speech waveforms into parametric representations that may include the use of statistical features or spectrograms. Recognizing emotions can also encompass retrieving diverse characteristics from speech signals, such as paralinguistic qualities like pitch, intensity, and the MFCC. Several scholars have proposed integrating prosodic and spectral data for a more complete SER analysis. Although substantial research has been conducted on this topic, it remains difficult for machines to discern emotions.

2.1. Applications of SER in Education

Recent studies have explored the application of SER in remote educational settings. For instance, Zhu and Luo [10] detailed a new approach to recognizing emotions in speech in an effort to address emotion deficiencies in e-learning systems. Their study focused on developing an SER system that uses neural networks to extract prosodic features from emotional utterances. Similarly, Tanko et al. [11] proposed an automated model for speech emotion polarization capable of evaluating lecturers on distance education platforms with orbital local binary patterns and multi-level wavelet transforms employed for feature extraction and classification. Chen et al. [12] analyzed the emotional deficiency in current e-learning systems and proposed a model that incorporates SER to enhance teacher–student connections by tracking changes in learners’ emotional states through their speech cues, allowing for teaching strategies to be adjusted accordingly. Using a similar approach, Huang et al. [13] discussed various techniques for transferring speech emotion classifiers trained on acted emotional data to naturalistic elicited data by using online learning. This involved adopting a boosting algorithm that incrementally retrained the models on unlabeled real-world data. Bahreini et al. [4] developed and evaluated a real-time SER system that used microphones to provide feedback in e-learning environments. This system’s classification of basic emotional states significantly resembled that of human expert raters. Li et al. [14] constructed an e-learning model that incorporates SER and neural networks to track learners’ emotional states, with the resulting system providing encouragement and advice tailored to the emotions it recognizes. In addition, the method developed by Tanko et al. [5] automatically detects presenters’ emotions from course materials, resulting in the creation of a lecture transcript database to study speaker personalities. The data were halved every 5 s, yielding 9541 observations. This study also produced a Shoelace Pattern, which is a shoelace-inspired local feature generator for feature extraction. Shoelace Pattern sub-bands were created using a tunable-Q wavelet transform to ensure enhanced performance. To create the final feature vector, the feature extraction model combined the top four feature vectors. After selecting the 512 most useful attributes using neighborhood component analysis, a support vector machine (SVM) with a 10-fold cross-validation scheme was used to categorize the data. The classifiers were determined to have accuracy rates of 94.97% and 96.41%. Finally, the SER method for educational settings introduced by Zhang and Srivastava [15] features kernel canonical correlation analysis and SVMs with an accuracy of over 90%. Therefore, the literature indicates that SER techniques can significantly enhance remote education by detecting emotional cues in speech.

2.2. Feature Extraction for Classification in SER

Ancilin and Milton [16] introduced a new feature extraction technique called the Mel-frequency magnitude coefficient (MFMC) technique for SER, which they comprehensively evaluated against conventional features such as MFCCs, log frequency power coefficients (LFPCs), and linear prediction cepstral coefficients (LPCCs) using six diverse emotion databases: the EMO-DB, the RAVDESS, SAVEE, EMOVO, eNTERFACE, and Urdu. They used the MFMC with an SVM classifier, which resulted in a competitive accuracy of 95.25% for the Urdu Language Speech dataset, 81.50% for the EMO-DB, 75.63% for SAVEE, 73.30% for EMOVO, 64.31% for the RAVDESS, and 56.41% for eNTERFACE. This extensive benchmarking highlights the potential of this MFMC technique in terms of SER across languages and various uses. Guan et al. [17] focused on utilizing local dynamic features for SER. This resulted in the extraction of novel time-based segmentation and pitch probability distribution features from the EMO-DB, including examples of seven selected emotions. Experiments were also conducted on global features, including the MFCC, the ZCR, energy, and pitch. Their results indicated that the local dynamic pitch features outperformed the global features by achieving 70.8% accuracy, compared to the 66.4% achieved by using the global features with an SVM classifier. This suggests that the local dynamic pitch contains discriminative information that can be used to recognize emotions in speech. Similarly, Alsabhan [18] proposed an end-to-end CNN with LSTM attention mechanisms for multilingual SER, evaluating it using the multiple-language-spanning SAVEE dataset, the Arabic Natural Audio dataset (ANAD), the Basic Arabic Vocal Emotion dataset (BAVED), and the EMO-DB. The model achieved accuracies of 97.13%, 96.72%, 88.39%, and 96.72% for these databases, outperforming even custom two-dimensional CNNs. These results not only highlight the efficacy of one-dimensional (1D) CNNs in combination with LSTM models, but also attend to the learning of discriminatory representations in SER across languages. Furthermore, in response to the limited real-world emotional speech data available, Atmaja and Sasou [19] analyzed DA techniques for SER by experimenting with pitch shifting and time stretching on the Japanese Twitter-Based and IEMOCAP datasets. With wav2vec 2.0 speech embeddings and an SVM classifier, their DA approach improved the accuracy by 77.25% for these datasets. Therefore, the use of DA proved valuable in the context of a lack of naturalistic emotional speech data.

Zehra et al. [20] used ensemble learning for multilingual cross-corpus SER by combining spectral and prosodic features using an ensemble of classifiers. They achieved an accuracy above 96.75% for the Urdu dataset, outperforming even the most state-of-the-art approaches. This demonstrates the ability of ensemble techniques to learn complementary information from different features and classifiers, thereby improving SER performance across domains. Parthasarathy and Busso employed an unsupervised auxiliary task in ladder networks that were used for emotional recognition [21], with emotion prediction through regression being the primary assignment. As a side task, denoising autoencoders were employed to recreate intermediate feature representations. Thus, the framework was trained in a semi-supervised manner using a large amount of unlabeled data from the domain of interest. This approach proved more effective than fully supervised single-task learning and multi-task learning (STL and MTL) baselines because within-corpus evaluations determined an increase from 3.0% to 3.5% in the concordance correlation coefficient (CCC) as a result of this architecture, while cross-corpus evaluations indicated that the CCC increased from 16.1% to 74.1%.

Mustaqeem et al. [22] introduced a clustering-based framework for SER that leverages learned features and a deep BiLSTM network to enhance the recognition accuracy while reducing the computational complexity. Their proposed framework employs a novel sequence selection strategy using a radial basis function network (RBFN) for measuring the similarity within clusters, selecting key segments that represent the emotional content of speech. Spectrograms generated via the STFT algorithm were used to extract discriminative features with a pre-trained Resnet101 CNN model, which were then normalized and processed using a deep BiLSTM model to capture temporal information. This approach demonstrated robustness and effectiveness, outperforming state-of-the-art methods with accuracies of up to 72.25%, 85.57%, and 77.02% on the IEMOCAP, EMO-DB, and RAVDESS datasets, respectively. Yu and Xizhong [23] proposed an attention-augmented convolutional block-gated recurrent unit network to improve SER, which they tested using the IEMOCAP dataset. This model was employed to capture the spectrogram and first- and second-order derivative aspects of audio signals, and the residual blocks allowed the CNN to extract the spatial characteristics from the inputs. The block-gated recurrent unit featured an attention layer and was used to mine the long-term data. Compared with existing methods, this network improved the weighted accuracy (WA) and unweighted accuracy (UA) by 82%, demonstrating the effectiveness of using attention-augmented convolutional–recurrent networks for SER.

In another study, Ahmed et al. [24] used five English and German benchmark datasets to present an ensemble DL framework for SER. Their ensemble used a 1D CNN-FCN (fully convolutional network) model to extract local features, a 1D CNN-LSTM-FCN model to capture long-term dependencies, and a 1D CNN–gated recurrent unit (GRU)–FCN model. Noise injection, pitch shifting, and temporal stretching were employed to increase the sample sizes. MFCC, chromogram, least-mean-square (LMS), RMS, and ZCR features were input into these models, and the ensemble model outperformed the previous techniques with significant weighted average accuracies of 99.46% for the Toronto Emotional Speech Set dataset, 95.42% for the EMO-DB dataset, 95.62% for the RAVDESS dataset, 93.22% for the SAVEE dataset, and 90.47% for the Crowd-Sourced Emotional Multimodal Actors dataset. This ensemble framework uses multiple DL architectures and DA, which enables robust SER across various languages.

Thus, this research indicates that SER techniques have significant potential to improve instructor–student connections, student well-being, and personalized learning in distance education environments. For example, detecting emotional cues, such as prosody, pitch, and intensity in speech, enables the adaptation of educational materials and experiences to improve student engagement. Importantly, SER is an unobtrusive and passive tool that can be used to analyze voice data during regular online learning activities. Moreover, advances in ML and multimodal affective computing will continue to improve the accuracy of SER in remote education. Potential future directions include the detection of more complex emotions, the analysis of conversational dynamics, and the combination of speech analysis with expressions. These findings also imply that combining feature extraction and modeling procedures can increase accuracy across datasets. However, the generalizability of SER models across languages, cultures, and application sectors needs further research; despite these considerable advances, models that can properly distinguish emotions in speech across various cultures and circumstances remain in demand.

3. Materials and Methods

This study proposes an automated system that can accurately recognize instructors’ emotional states during remote instructional sessions, thereby improving the results of evaluations conducted by their institutions. This system was developed sequentially: data collected during normal lectures were subjected to preprocessing and feature extraction, enabling the model’s development using advanced technologies such as DL, combining transformers, CNN, and LSTM architectures. The usefulness of the model was tested after training on a standard dataset, as shown in Figure 1.

3.1. Datasets

To ensure an exhaustive appraisal of the proposed model, it was tested on five datasets spanning three languages, namely English, Arabic, and German. To comprehensively train models using DL, a certain sample size is required, which this study was not able to meet. Therefore, DA was applied to all of the datasets. The following paragraphs summarize each dataset:

The RAVDESS is a validated multimodal database that contains emotional speech and song recordings. This database comprises 7356 audio files recorded by 24 professional actors divided equally between men and women. The actors were recorded speaking two linguistically neutral statements while adopting a neutral or standardized accent to minimize the influence of regional variations on the emotional content of the speech. Through having the professional actors record these controlled statements, high-quality emotional portrayals devoid of confounding regional accents were obtained. The RAVDESS provides a substantial corpus of emotional speech and songs for research and development in various fields, such as affective computing. The standardized validation protocols embedded in the dataset aid in its comparative benchmarking and reproducibility across studies. Additionally, the spectrum of emotions conveyed by speech in this dataset encompasses an array of affective states, including, but not limited to, tranquility, satisfaction, elation, melancholy, indignation, trepidation, astonishment, and revulsion [25]. Each expression was associated with a distinct level of emotional intensity, ranging from neutral to normal to strong.

The EMO-DB was created by the Technical University of the Berlin Institute of Communication Science [26]. It is a free German emotional speech database that includes 535 context-variable sentences used in everyday communication delivered by ten expert actors (five men and five women) who simulated happiness, anger, anxiety, fear, boredom, disgust, or neutrality. The 48 kHz data were downsampled to 16 kHz.

The SAVEE dataset comprises 480 utterances in British English recorded by four male postgraduate students and researchers at the University of Surrey, all of whom were native English speakers aged between 27 and 31 [27]. Seven different emotions were expressed through these utterances (happiness, sadness, surprise, fear, disgust, neutrality, and anger), with sentences from the phonetically balanced Texas Instruments/Massachusetts Institute of Technology corpus chosen for each emotion (see Table 1).

The IEMOCAP dataset was developed by the Signal Analysis and Interpretation Laboratory (SAIL) at the University of Southern California. This database represents a multimodal, multi-speaker collection of data. The dataset spans 12 h and includes videos, audio, face tracking, and text transcriptions of paired performances in which actors tried to elicit a particular feeling from the viewer using a combination of improvisation and prepared scenes. Multiple annotators contributed labels to this IEMOCAP database, classifying the data according to dimensional labels, including valence, activation, and dominance, as well as category labels such as anger, happiness, sadness, and neutrality [28].

The SHEIE dataset is a unique emotional speech dataset developed to test the model proposed in this study. It features real interactions from the realm of higher education and focuses on the instructors. Recognizing the scarcity of datasets comprising genuine interactions, this dataset was designed to address the gap in the existing speech emotion studies. It includes six universal emotions, namely anger, happiness, sadness, excitement, boredom, and neutrality. The data were collected from various synchronous online lectures held by the Computer Science Department at King Abdulaziz University and the Islamic Studies Department at Al Jouf University, which were delivered in both Arabic and English. The dataset contains a total of 20 h and 50 min of speech data from 19 lecture sessions. The audio data were collected by two instructors (one male and one female) and carefully segmented and labeled. The development of the SHEIE dataset involved a four-step process. First, it was determined that real interactions, rather than acted emotions, would be recorded during live lectures via the Blackboard e-learning system. Second, the emotions represented in the dataset were selected based on their relevance to the instructor’s experience during the lecture. Third, volunteer instructors from the two universities were selected to participate in the study, and they provided speech data via their lecture recordings. Finally, the emotions in the dataset were labeled based on self-reporting by instructors, which meant selecting an emotion from a prompt every ten minutes during their lectures. The time interval was chosen in response to research indicating that student attention begins to wane around the ten-minute mark. The SHEIE dataset underwent extensive preprocessing, which included splitting the data into ten-minute segments, manually labeling the emotions, cleaning the data to exclude noise and unwanted sounds, and normalizing the data into equal three-second intervals. This produced 7515 audio files, each labeled with a specific emotion and formatted for use in SER research: 490 files for anger, 1888 files for happiness, 1439 files for sadness, 516 files for boredom, 1654 files for excitement, and 1528 files for neutrality.

Figure 2 shows the distribution of the different emotional classes in the five datasets. Some emotional classes occur more often in most datasets, whereas others demonstrate the inverse. Thus, there are imbalances in the emotional class distributions across the datasets, indicating the need for DA before model training. Table 1 presents a description of all five datasets as well as the distribution of the classes in these datasets.

3.2. Data Augmentation

DA is essential for assessing SER model performance because SER systems often lack sufficient and diverse training data. This is observed in the imbalance between the emotional classes in this study’s datasets, as shown in Figure 2. Therefore, speed, pitch, noise, and the time-stretching technique were used to generate variant data to provide a model with more context. DA can significantly improve the ability of SER systems to generalize and recognize emotions from speech under uncontrolled and varied conditions [17]. Our DA reduced overfitting, making the SER model more stable during the training process. Figure 3 illustrates the influence of this DA on SER tasks for the five datasets used in this study. DA techniques, such as adding white Gaussian noise (AWGN) to the samples, were employed to balance the class distributions across these datasets, effectively addressing the imbalances in the class distributions. Furthermore, a custom noise function was used to add AWGN to the samples, as shown in Figure 4. This achieves DA by adding random noise that is multiplied by 0.01 to the data. In addition, time stretching was used to stretch the data at a given rate, and the pitch shift function was applied with factors of 0.5 and 0.6. Time shifting was also employed using a custom shift function that incorporates data, the sampling rate, the maximum shift value, and the shift direction. In this method, a random shift value within the maximum shift value multiplied by the sampling rate is also generated. The shift value is negated for the right shift direction, and if the direction is labeled as “both”, the direction is deemed to be random. The positive shift values set the first shift values of DA to 0, and the last shift values are 0 if the shift values are negative, resulting in DA. Figure 4 indicates the changes observed in the waveforms after applying these DA techniques.

3.3. Feature Extraction

For our experiments, the audio data was processed at a 16 kHz sampling rate across all datasets. The EMO-DB dataset was downsampled from 48 kHz to 16 kHz, while the RAVDESS and IEMOCAP datasets were downsampled to 16 kHz from their original 44.1 kHz sampling rate. The SAVEE dataset was already at 16 kHz. For the public datasets (EMO-DB, RAVDESS, SAVEE, IEMOCAP), the audio lengths ranged from approximately 1 s to 15 s. The SHEIE dataset was specifically segmented into 3–5 s intervals. SER extracts features from speech waves to determine a person’s emotional state because transforming speech into features or parameters allows the determination of emotional characteristics. In this study, a multi-feature approach was employed to capture various aspects of the speech signals. The audio data were segmented into frames of 32 ms with an overlap of 50%, and a Hamming window of 64 ms was applied to extract the Mel-frequency cepstral coefficient (MFCC) features. The extracted features were represented in a one-dimensional format, capturing the temporal dynamics of the speech signal. Features that can be extracted from the time domain include ZCR, energy, and amplitude, which can be used to identify anger and excitement by revealing speech rate and volume [29]. Pitch, formant, and other features can be extracted from the frequency domain, with the ZCR, chroma, and roll-off representing some of these features, as shown in Figure 5. Formants (resonant frequencies of the vocal tract) reveal the shape of the vocal tract, which determines the characteristics of articulated speech sounds. Pitch (the perceived fundamental frequency of a sound) can indicate emotions, such as fear, neutral [30]. This is further supported by Postma [31], who found that spectral listeners, who focus on higher harmonics, per-form better in emotion judgment tasks. Kienast [32] found that vocal expressions of different emotions are characterized by specific acoustical changes, such as spectral and segmental changes, and variations in pitch levels. The MFCC, spectral contrast, and chroma also affect the spectral domain [33]. The MFCC, with 40 coefficients being used in this study, offers the best approximation of the human ear’s nonlinear frequency perception and represents a sound’s short-term power spectrum. Thus, they allow systems to more effectively understand emotions owing to the systems acting like humans. This implies that combining features from different domains guarantees a complete speech signal representation, allowing SER systems to identify emotions accurately and precisely.

The ZCR measures how often a signal crosses the zero axis by determining the signal sign changes per frame [33]. As such, it denotes the total number of times the wave flips from positive to negative, and it is distributed contrariwise by frame. Mathematically, the ZCR can be determined using the following equation:

z c r = \frac{1}{T - 1} \sum_{t = 1}^{T - 1} 1_{R_{< 0}} (s_{t} s_{t - 1}),

(1)

where s is a signal of length (T) and

1_{R_{< 0}}

is a sign function.

A chromatogram visualizes audio by mapping frequencies onto 12 bins that match the 12 semitones of an octave using chroma features [34]. This compresses pitch content into time windows, enabling music analysis applications to recognize chords and harmonic similarities despite timbre and instrumentation changes.

A Mel spectrogram visualizes audio signals by mapping frequencies onto the Mel scale, which is aligned with human hearing. This technique captures important sound characteristics, thereby facilitating its widespread use in speech and audio processing.

The spectral contrast, centroid, bandwidth, and roll-off are features extracted from sound signals. The differences in the levels between spectrum peaks and valleys can reveal a sound’s timbre, whereas the spectrum’s “center of mass” (also called the spectral centroid) can indicate a sound’s brightness. Additionally, the spectral bandwidth indicates a sound’s spectral shape by measuring the spectrum’s spread around its centroid, with spectral roll-off being used to determine a sound’s high-frequency content.

The RMS is a feature that can be extracted from speech waves and used in SER tasks [35]. It is used to measure the energy of a signal and provide information regarding the overall loudness of the signal.

The MFCC is used in SER applications to parameterize the speech signals generated by the Mel spectrogram. This scale matches the human auditory system more closely than linear frequency bands. To obtain the MFCC, the Mel spectrogram is transformed using a discrete cosine transform (DCT) [36], which captures speech signal characteristics and is resistant to timbre and instrumentation changes. This process entails determining the energy band of the speech wave, mapping the power spectrum onto the Mel scale using the corresponding deltoid windows to obtain the Mel spectrogram logarithm, and applying the DCT to this logarithm to obtain the MFCC. The formula for mapping the frequency (f) in Hz to the Mel frequency (m) is as follows:

m = 2595 {l o g}_{10} (1 + \frac{f}{700}) .

(2)

The inverse formula for mapping the Mel frequency (m) to a frequency (f) in Hertz is

f = 700 (1 0^{\frac{m}{2595}} - 1) .

(3)

Emotion classification studies typically use 40 MFCCs for feature extraction; however, a more nuanced representation of speech data using more coefficients can improve the detection of emotional states. MFCCs, especially when combined with the RMS and ZCR, have performed well in the complex task of SER [28,29].

3.4. Proposed Model

This study’s methodology centers on the construction of an ensemble model that combines transformer, CNN, and LSTM architectures. Transformer models use self-attention mechanisms to extract contextual features from input sequences, CNN filters to extract local temporal features from input sequences, and LSTM models to infer long-term dependencies from input sequences using recurrent connections. Therefore, in the context of this study, the transformer self-focuses on the interconnectedness of the input elements regardless of the sequence, the CNN layers identify the local audio feature patterns, and the LSTM layers capture and learn the long-term dependencies and temporal relationships within the audio sequences. LSTM layers use a series of gates (input, forget, and output gates) and a memory cell to selectively retain, update, or forget information from previous time steps, enabling them to effectively model and learn from long-term dependencies in the input sequences. For our study, the outputs of these three models were then merged and input into a dense final layer for classification, as shown in Figure 6. Thus, in our model, these architectures are combined to determine a feature of the input sequence that accounts for the contextual, local temporal, and long-term dependencies. The Softmax activation function in the dense layer classifies the combined feature representation to produce the final output. The transformer model features three transformer block layers containing multi-head self-attention layers and feed-forward neural networks (FFNNs). The multi-head self-attention layers embed 64 units and 8 heads, and the FFNNs have 64-unit hidden layers with rectified linear unit (ReLU) activation functions. The CNN model comprises 4 Conv1D layers with 64 filters and 3 kernel sizes, with the ReLU activating each Conv1D layer. The flattened layer then flattens the last Conv1D layer output, which comprises three 64-unit layers. The first two LSTM layers then return the sequences, and the last layer returns to the final hidden state. Then, the final output for emotion classification is generated by concatenating the outputs of the transformer, CNN, and LSTM architectures and running them through a dense final layer comprising six to eight units and a Softmax activation function.

(1) Transformer Block: Transformer DL models use self-attention mechanisms to focus on different words in the input sequence when producing an output, with the transformer block using multi-head self-attention to focus on different positions and understand the data. FFNNs and dropout layers reduce overfitting with layer normalization, stabilizing the learning process [37].

Matrices W_Q, W_K, and W_V are subject to training and can be modified by the learning process. Furthermore, the attention scores (S) can be calculated by multiplying the query and key matrices by the square root of their dimensions using the following equation:

S = Q K^T / s q r t (d_k) .

(4)

where W denotes SoftMax (S), which yields the attention weights, and the attention layer output is a weighted sum of the value matrices: Z = WV.

The FFNN in our model comprises a nonlinear activation function and two linear transformations, W_1, b_1, W_2, and b_2. These parameters denote the adjustable weights and biases of the attention layer. The FFNN outputs were connected by adding the input to the output using the following equation:

Y = X + F (X) .

(5)

Layer normalization is the process by which the feature dimension input is normalized according to the attention layer and FFNN output, using the following equation:

Y = (X − mean(X))/std(X), * Gamma + beta(trainable scale and shift parameters).

The transformer block in this model comprises an architectural component that accepts a feature vector of size (192, 1) as the input (see Table 2). It consists of a multi-head self-attention mechanism and an FFNN, each of which employs residual connections and layer normalization. In addition, the multi-head self-attention’s embed_dim and num_heads parameters represent the size of the input embeddings and number of attention heads. Each attention head individually processes the input, thereby allowing the model to concurrently learn different types of information from a singular input sequence. In this mechanism, query_dense, key_dense, and value_dense are the dense layers that transform the inputs into their corresponding query, key, and value vectors. These vectors were further separated into different heads using the separate_heads method, ensuring parallel and independent computations for each head. The call method subsequently orchestrates the computation flow of the self-attention mechanism by calling the previous components sequentially. After these computations, the dense combined head layer merges the outputs from all attention heads back into the original embedding dimension. The transformer block also features an FFNN, characterized by ff_dim, which denotes the size of its hidden layer. To prevent overfitting, two dropout layers with a rate defined by the hyperparameter “rate” were employed. The feed-forward network (FFN) in the transformer block consists of two dense layers: the first one applies a “relu” activation function, and the second one, which has the same size as embed_dim, applies no activation function. The FFN is applied to the output of the multi-head attention mechanism, which computes the attention weights between all pairs of positions in the input sequence using multiple attention heads. The purpose of the FFN is to process the concatenated output from different attention heads, allowing the model to capture more complex dependencies and transformations. After the FFN, the output traverses another dropout layer and undergoes a residual connection, followed by normalization using a second normalization layer. In summary, the transformer block transforms the input embedding, outputting a transformed embedding of the same size with tunable hyperparameters such as the number of attention heads and the size of the hidden layer in the FFN.

(2) CNN: In parallel with the transformer block, a series of Conv1D layers are employed in our model to extract local temporal features directly from the input acoustic features. The Conv1D layers consist of 64 filters with a kernel size of 3. The “same” padding technique is used to ensure that the output has the same width as the input. The ReLU activation function is applied element-wise to introduce nonlinearity to the model. Each filter (f) in the Conv1D layers is a vector of weights of size (k), where k denotes the kernel size. The output of the convolutional layer at position i is determined using the following equation:

(x * f) (i) = s u m (x [i + j] * f [j]) f o r j = 0, \dots, k - 1 .

(6)

If padding is used, the input sequence is expanded with zeroes before the filters are applied. Additionally, when a nonlinear activation function, such as ReLU, is employed, it is implemented on every individual element of the output of the convolutional layer.

To reiterate, the padding technique known as “same” is used to pad the sides of the input to achieve a width that matches that of the output; the ReLU activation function then introduces nonlinearity to the model by producing an input value as an output if it is positive and as zero if it is negative; and the CNN layers then use filters to obtain local features from the input vectors through a convolution.

(3) LSTM: The model’s long-term LSTM layers process the sequence data with the LSTM accepting the original (192, 1) input feature vector. The LSTM architecture comprises two 64-unit layers. The first outputs its hidden state at every time step because the return sequences are labeled “True”, and the next LSTM layer receives each output. This implies that this setting is required when stacking the LSTM layers. The second LSTM layer does not include this parameter; therefore, it only returns the last output, which is then fed into dense layers to obtain the final predictions. Notably, LSTM networks can predict sequence data, remember information, and avoid the vanishing-gradient problem associated with traditional recurrent neural networks.

For each time step (t), the LSTM component simultaneously receives x_t from the input sequence as well as c_t_1 and h_t_1. The LSTM component then calculates the input gate (i_t), output gate (o_t), and cell candidate (g_t) using various combinations of the present input, preceding hidden state, and trainable weights according to the following equations:

i_t = s i g m o i d (W_{i} * [h_{\{t - 1\}}, x_{t}] + b_{i}) .

(7)

f_t = s i g m o i d (W_{f} * [h_{\{t - 1\}}, x_{t}] + b_{f}) .

(8)

o_t = s i g m o i d (W_{o} * [h_{\{t - 1\}}, x_{t}] + b_{o}) .

(9)

g_t = t a n h (W_{g} * [h_{\{t - 1\}}, x_{t}] + b_{g}) .

(10)

4. Experimental Results

The results from this study’s model have been compared to models employed by previous researchers, enabling the evaluation of different models based on established metrics. The SER system used in this study was scrutinized via speaker-independent experiments conducted on five datasets. The data were partitioned using a percentage-based stratification approach, with 75% of the data being used to train the SER models and the remaining 25% being reserved for testing. As illustrated in Figure 7, Figure 8, Figure 9, Figure 10, Figure 11, Figure 12, Figure 13, Figure 14 and Figure 15 that show the performance of the proposed transformer model. The model was trained using a feature set comprising 192 features from the five datasets. In this study, we evaluate the performance of our ensemble model across five different datasets. The results are summarized in Table 3, Table 4, Table 5, Table 6 and Table 7, which provide insights into the ensemble model’s performance on each dataset. Table 8 presents the accuracy of the all model across the five datasets.

4.1. Experimental Design

Several packages and software programs were used, including TensorFlow and Librosa v0-10-0 for audio preprocessing and WavePad v12.52 for audio segmentation. The training and testing procedures were conducted using Google Collaboratory, a platform equipped with a v2.20 GHz Intel Xeon CPU, 25 GB of RAM, and Tesla graphics processing units (GPUs). The DL libraries of Keras v2.9.0 and TensorFlow v2.8.0 were used to develop and train the neural network models, and the GPUs allowed for efficient processing of the large matrix operations required for training. Keras provides a simple interface for building neural networks, whereas TensorFlow offers more flexibility for customizing and fine-tuning models. In this research, Python v3.10 was used as the implementation language for the proposed method.

4.2. Measuring Tools Used for the Evaluation

Several metrics were used to measure the performance of the SER model on the test set across the five datasets in the evaluation, including accuracy, loss, precision, F1 score, and recall values; a confusion matrix; and receiver operating characteristic (ROC) curves [38].

Each class [i] was evaluated by quantifying the true positives (TP_[i]), true negatives (TN_[i]), false positives (FP_[i]), and false negatives (FN_[i]). This process resulted in the determination of performance metrics.

Accuracy is a metric used to determine the model’s prediction accuracy by determining the ratio between accurate predictions and the sum of all predictions, using the following equation:

A c c u r a c y = \frac{T P + T N}{F P + F N + T P + T N} \times 100 .

(11)

Precision is a metric that denotes the proportion of true positive predictions relative to all positive predictions. Determining this value involves calculating the proportion of true positives in relation to the combined total of true and false positives, using the following equation:

P r e c i s i o n = \frac{T P_i}{T P_i + F P_i} \times 100 .

(12)

The recall metric denotes the percentage of true positive predictions among all the positives. This is defined as the relationship between accurate positives and the sum of true positives and false negatives. This is calculated as follows:

R e c a l l = \frac{T P}{T P + F N} .

(13)

The F1 score provides a means of balancing the precision and recall. It represents the weighted average of the precision and recall, which is determined using the following equation:

F 1 s c o r e = 2 * \frac{p r e c i s i o n_{i} \times S e n s i t i v i t y_{i}}{p r e c i s i o n_{i} + S e n s i t i v i t y_{i}} \times 100 .

(14)

A confusion matrix indicates the class–prediction distribution of the classification model. To calculate the precision, recall, sensitivity, and specificity, a confusion matrix evaluates the TP, TN, FP, and FN metrics used to identify the model performance issues.

The ROC curve plots the true positive rate (TPR) against the false positive rate (FPR) at different classification thresholds. The TPR and FPR are formally defined using mathematical expressions, with the TPR formula determining the ratio of true positives to the sum of true positives and false negatives by identifying the ratio of false positives to the sum of false positives and true negatives.

T P R = \frac{T r u e P o s i t i v e s}{T r u e P o s i t i v e s + F a l s e N e g a t i v e s} .

(15)

F P R = \frac{F a l s e P o s i t i v e s}{F a l s e P o s i t i v e s + T r u e N e g a t i v e s} .

(16)

The Matthews Correlation Coefficient (MCC) is a measure of the quality of binary classifications. It takes into account true and false positives and negatives and is generally regarded as a balanced measure that can be used even if the classes are of very different sizes; it is defined as follows:

M C C = \frac{(T P \times T N) - (F P \times F N)}{\sqrt{(T P + F P) (T P + F N) (T N + F P) (T N + F N)}}

(17)

4.3. Ensemble Model Evaluation

The evaluation process demonstrated that the ensemble model performed well across all of the datasets, thereby demonstrating its effectiveness in making accurate predictions. The evaluation metrics for each dataset are as follows:

4.3.1. The EMO-DB Dataset

The ensemble model underwent 200 epochs of training with a batch size of 20, incorporating the AdaMAX optimizer and sparse categorical cross-entropy loss. The data were partitioned into training and testing sets, according to the prescribed ratio. The model had, on average, a testing accuracy of 99.86%, a precision of 99.71%, a recall of 99.71%, and an F1 score of 99.71% for all emotions (see Table 3). Figure 7 details the loss and accuracy curves for this dataset. Figure 8 shows the confusion matrix, which summarizes the model’s performance in terms of classification by indicating the number of correct and incorrect predictions for the seven classes. In this metric, the diagonal lines of the matrix indicate the true positives for each class, whereas the off-diagonal elements represent false positives and negatives.

Table 3. Ensemble model performance on the EMO-DB dataset.

Emotion	Accuracy	Precision	Recall	F1 Score	MCC
Anger	100%	100%	100%	100%	100%
Boredom	100%	100%	100%	100%	100%
Disgust	100%	100%	100%	100%	100%
Fear	99%	99%	99%	99%	100%
Happiness	98%	100%	99%	99%	100%
Neutrality	100%	100%	100%	100%	99%
Sadness	100%	99%	99%	99%	100%
Average	99.86%	99.71%	99.71%	99.71%	100%

Figure 7. Ensemble model (a) loss curves and (b) accuracy curves for the EMO-DB dataset.

Figure 8. Ensemble model confusion matrix for the EMO-DB dataset.

4.3.2. The RAVDESS Dataset

The ensemble model was trained for over 200 epochs using a batch size of 20. The AdaMAX optimizer was employed in conjunction with sparse categorical cross-entropy loss during the training process. The dataset was partitioned into distinct subsets for training and testing. The results indicated, on average, a testing accuracy of 96.3%, a precision of 95.7%, a recall of 96.3%, and an F1 score of 95.9% for all emotions. Table 4 presents the results. Figure 9 details the loss and accuracy curves for this dataset, and Figure 10 demonstrates how many predictions were correct and false for the eight classes in the classification process.

Table 4. Ensemble model performance on the RAVDESS dataset.

Emotion	Accuracy	Precision	Recall	F1 Score	MCC
Happiness	91%	97%	94%	94%	98%
Sadness	89%	94%	92%	92%	93%
Anger	93%	95%	94%	94%	93%
Neutrality	94%	90%	93%	92%	96%
Surprise	98%	99%	98%	98%	96%
Calm	96%	96%	96%	96%	91%
Fear	99%	92%	96%	94%	90%
Disgust	95%	97%	96%	96%	95%
Average	96.3%	95.7%	96.3%	95.9%	94.4%

Figure 9. Ensemble model (a) loss curves and (b) accuracy curves for the RAVDESS dataset.

Figure 10. Ensemble model confusion matrix for the RAVDESS dataset.

4.3.3. The SAVEE Dataset

The ensemble model was trained over 200 epochs using 20 batches combined with the AdaMAX optimizer and sparse categorical cross-entropy loss. The dataset was split according to the 75/25 ratio, as discussed above. The model’s average testing accuracy, precision, recall, and F1 score for all emotions in this dataset were determined to be 96.5%, 95.3%, 95.6%, and 95.2%, respectively (see Table 5). Figure 11 visualizes the loss and accuracy curves for this dataset, and the confusion matrix is included in Figure 12.

Table 5. Ensemble model performance on the SAVEE dataset.

Emotion	Accuracy	Precision	Recall	F1 Score	MCC
Anger	96.2%	95.2%	97.3%	96.3%	92.9%
Disgust	100%	100%	99.4%	100%	90.5%
Fear	82.8%	81.0%	99.9%	89.3%	92.2%
Happiness	87.9%	86.4%	72.2%	77.6%	89.6%
Neutrality	106.2%	100%	100%	107.4%	89.4%
Sadness	100%	100%	100%	100%	97.5%
Surprise	99.2%	96.5%	93.0%	94.8%	92.3%
Average	96.5%	95.3%	95.6%	95.2%	93.7%

Figure 11. Ensemble model (a) loss curves and (b) accuracy curves for the SAVEE dataset.

Figure 12. Ensemble model confusion matrix for SAVEE.

4.3.4. The IEMOCAP Dataset

After adjusting the classification thresholds to achieve the desired accuracy, the model demonstrated an average accuracy of 85.3%, an average precision of 94.1%, an average recall of 61.4%, and an F1 score of 73.6% for all emotions in the IEMOCAP dataset. Table 6 presents the results. While the precision and recall values were adjusted to achieve the desired accuracy, the overall F1 score reflects the trade-off between precision and recall. Figure 13 shows the loss and accuracy curves for the IEMOCAP dataset, and the confusion matrix for the emotion classification model appears in Figure 14. The neutrality and surprise classes recorded the highest accuracies of 95.0%, whereas excitement recorded the highest precision at 99.0%. Neutrality and surprise produced the highest recall rates (92.0%), and surprise produced the highest F1 score (94.0%).

Table 6. Ensemble model performance on the IEMOCAP dataset.

Emotion	Accuracy	Precision	Recall	F1 Score	MCC
Anger	80.0%	95%	50.0%	65.0%	76%
Excitement	90.0%	99%	70.0%	81.8%	80%
Fear	85.0%	92%	60.0%	72.7%	81%
Frustration	80.0%	87%	40.0%	54.8%	66%
Happy	85.0%	97%	58.0%	72.8%	76%
Neutrality	95.0%	90%	92.0%	91.0%	91%
Sadness	85.0%	94%	60.0%	72.7%	73%
Surprise	95.0%	96%	92.0%	94.0%	87%
Average	85.3%	94.1%	61.4%	73.6%	78%

Figure 13. Ensemble model (a) loss curves and (b) accuracy curves for the IEMOCAP dataset.

Figure 14. Ensemble model confusion matrix for the IEMOCAP dataset.

4.3.5. The SHEIE Dataset

The ensemble model was trained using 200 epochs, 20 batches, the AdaMAX optimizer, and sparse categorical cross-entropy loss. The model demonstrated an average accuracy of 83%, an average precision of 83%, an average recall of 81.17%, and an average F1 score of 81.71% across all emotions (see Table 7). Figure 15 displays the loss and accuracy curves for this dataset, and Figure 16 represents the confusion matrix, which indicates the predicted and actual outcomes for each emotional class. These results demonstrate the effectiveness of the proposed ensemble model in terms of accurately classifying emotions in the SHEIE dataset.

Table 7. Ensemble model performance on the SHEIE dataset.

Emotion	Accuracy	Precision	Recall	F1 Score	MCC
Anger	84.13%	84.13%	86.07%	85.10%	91.9%
Boredom	81.07%	81.07%	73.33%	76.92%	85.9%
Excitement	84.13%	84.13%	85.71%	84.91%	85.6%
Happiness	82.10%	82.10%	85.54%	83.75%	83.7%
Neutrality	80.2%	80%	77.8%	78.3%	86.7%
Sadness	79.87%	79.87%	79.87%	79.87%	81.7%
Average	83%	83%	81.17%	81.71%	85.1%

Figure 15. Ensemble model (a) loss curves and (b) accuracy curves for the SHEIE dataset.

Figure 16. Ensemble model confusion matrix for the SHEIE dataset.

Table 8. The ensemble model’s overall accuracy.

Dataset	Model	Testing Accuracy	F1 Score
EMO-DB	Ensemble model	99.86%	99.71%
RAVDESS	Ensemble model	96.3%	95.9%
SAVEE	Ensemble model	96.5%	95.2%
IEMOCAP	Ensemble model	85.3%	73.6%
SHEIE	Ensemble model	83%	81.71%

5. Analysis and Discussion

The evaluation of the efficacy of the proposed model involved conducting experiments using not only the diverse EMO-DB, RAVDESS, SAVEE, and IEMOCAP datasets, but also the purpose-built SHEIE dataset. Table 8 presents the performance of the model across these datasets. Its performance also improved with the DA strategies of adding noise to, time stretching, and shifting the audio data, which diversified the training datasets, providing the model with more robust features and enabling the generalizability of new data. Table 9 compares the precision of the model developed in this study to that of other models used in previous studies.

The developed model recognizes the input sequence link using several transformer blocks and a multi-head self-attention mechanism. DA, preprocessing, and the complicated model design further increase the reliability of this model. The ensemble model outperformed those developed in previous studies [11,12,13,14,15,16,17,18,19,20,21,22,23] and demonstrably increased the accuracy from 56.41% to 95.25% compared to a multiclass SVM model that included the MFMC, MFCC, LFPC, and LPCC features. For the first EMO-DB dataset, the ensemble model recorded an average accuracy of 99.8%, as well as average accuracies of 96.3% for the RAVDESS dataset, 96.5% for the SAVEE dataset, 85.3% for the IEMOCAP dataset, and 83% for the SHEIE dataset. Thus, this model enhanced the SER. The model’s average accuracies for anger, boredom, excitement, happiness, and sadness were 84.13%, 81.07%, 84.13%, 82.10%, and 79.87%, respectively. Thus, the model can accurately classify numerous emotions. Voice-assisted emotion detection and online teaching tools can benefit from these features. Figure 17 summarizes the model’s accuracy and F1 scores across the datasets, demonstrating its strong performance. The ROC curves in Figure 18 visualize the ensemble model’s performance in terms of classifying emotions from the five datasets, with the x-axis representing the FPR and the y-axis representing the TPR. This indicates that the model has a high TPR (90%) and a low FPR (10%), indicating that it can effectively identify emotions.

5.1. Ablation Study

To gain a deeper understanding of the individual contributions of each component in the proposed ensemble model, we conducted an ablation study. The purpose of this study was to assess the individual impacts of the transformer, CNN, and LSTM architectures on the overall performance of the SER system. By systematically removing each component and evaluating the model’s performance, we aimed to identify the relative importance of these architectures in capturing emotional information from speech data.

For the ablation study, we created four variant models by removing each of the key components:

Ensemble model without transformer (EM-T): This variant excluded the transformer block, which is responsible for capturing the contextual information and long-range dependencies in the speech data.

Ensemble model without CNN (EM-C): In this variant, we removed the CNN layers that extract local temporal features from the input sequences.

Ensemble model without LSTM (EM-L): This variant eliminated the LSTM layers, which are designed to capture and learn the long-term dependencies and temporal relationships within the audio sequences.

Ensemble model with four common emotion classes (EM-4E): In this variant, we focused on the four common emotion classes (happy, angry, sad, and neutral) across the RAVDESS, EMO-DB, SAVEE, and IEMOCAP datasets. These four datasets were used for training, while the SHEIE dataset was used solely for testing.

All other components and hyperparameters of the ensemble model remained unchanged during the ablation study. We evaluated each variant model on the five datasets (RAVDESS, EMO-DB, SAVEE, IEMOCAP, and SHEIE) using the same evaluation metrics as the complete ensemble model. Table 10 presents the performance comparison of the complete ensemble model and its variants on the five datasets. The results demonstrate the individual contributions of the transformer, CNN, and LSTM architectures to the overall performance of the SER system.

5.2. Real-Time Deployment and Field Testing

To assess the practical effectiveness and user acceptance of the proposed speech emotion recognition (SER) system in real educational settings, it is crucial to conduct field tests in actual distance learning environments. While the current study has demonstrated the model’s performance on benchmark datasets, evaluating its performance in real-world scenarios is essential for understanding its potential impact and identifying areas for further improvement.

5.2.1. Real-Time Deployment

To facilitate the deployment of the SER system in real educational settings, we have developed a user-friendly web interface using the Flask framework. This web interface allows users to upload audio files and obtain real-time emotion classification results. The backend of the system is powered by the proposed ensemble model, which processes the uploaded audio files and returns the predicted emotion labels.

The deployment process involves the following steps:

Model Serialization: The trained ensemble model is serialized and saved to a disk, enabling efficient loading and inference in the web application.

Flask Application Development: A Flask application is created to handle the user interactions and manage the audio file uploads. The application provides a simple and intuitive user interface for uploading audio files and displaying the emotion classification results, as shown in Figure 19.

Audio Preprocessing: Upon receiving an uploaded audio file, the Flask application applies the necessary preprocessing steps, such as resampling, normalization, and feature extraction, to prepare the audio data for input to the ensemble model.

Emotion Classification: The preprocessed audio data are fed into the loaded ensemble model, which performs the emotion classification task and returns the predicted emotion label.

Result Visualization: The predicted emotion label is then presented to the user through the web interface, providing real-time feedback on the emotional content of the uploaded audio file.

This real-time deployment of the SER system using Flask and a web interface allows for its easy integration into existing distance learning platforms and enables educators and learners to benefit from the system’s emotion recognition capabilities.

5.2.2. Field Testing

To validate the effectiveness and user acceptance of the SER system in real educational settings, we propose conducting field tests in collaboration with educational institutions offering distance learning programs. The field tests will involve the following steps:

Participant Recruitment: Educators and learners from diverse backgrounds and cultural contexts will be invited to participate in the field tests. Informed consent will be obtained, and participants will be briefed on the purpose and procedures of the study.

Integration with Distance Learning Platforms: The SER system will be integrated into the existing distance learning platforms used by the participating institutions. This integration will ensure a seamless user experience and allow for the collection of real-time emotional data during online learning sessions.

Data Collection: During the field tests, the SER system will be used to analyze the emotional content of speech data from both educators and learners. The system will record the predicted emotion labels along with relevant metadata, such as timestamps and user identifiers (anonymized to protect privacy).

User Feedback and Evaluation: Participants will be asked to provide feedback on their experience with the SER system through surveys and interviews. The feedback will cover aspects such as the system’s usability, the perceived accuracy of the system, and the impact of emotion recognition on the learning experience.

Data Analysis and Refinement: The collected data and user feedback will be analyzed to assess the system’s performance in real educational settings. This analysis will focus on metrics such as accuracy, user satisfaction, and the system’s influence on teaching and learning outcomes. Based on the findings, the SER system will be refined and optimized to better meet the needs of educators and learners.

These field tests in real educational settings will provide valuable insights into the practical effectiveness of the SER system and its acceptance by users. The feedback and data collected during these tests will inform future iterations of the system, ensuring its continued improvement and alignment with the requirements of distance learning environments.

5.3. Privacy and Ethical Considerations

Both the development and deployment of speech emotion recognition (SER) systems, particularly in educational contexts, raise important privacy and ethical concerns that must be addressed. As SER technology involves the processing and analysis of potentially sensitive personal data, such as emotional states, it is crucial to prioritize data protection and ensure transparency regarding how this information is collected, used, and stored.

In diverse cultural contexts, the perception and expression of emotions may vary significantly. Therefore, it is essential to consider cultural nuances and develop SER systems that are culturally sensitive and respectful of individual privacy. Future research and development efforts should involve collaboration with experts in ethics, privacy, and cultural studies to establish guidelines and the best practices for the responsible deployment of SER technology.

To address privacy concerns, several measures can be implemented:

Data Anonymization: Ensuring that personal identifiers are removed from the speech data and that individuals cannot be directly linked to their emotional data.

Secure Data Storage: Implementing robust security measures to protect the collected emotional data from unauthorized access or breaches.

Informed Consent: Obtaining explicit consent from individuals before collecting and processing their speech data for emotion recognition purposes, after clearly communicating how the data will be used and stored.

Transparency and Control: Providing individuals with transparency about the SER system’s functionalities, the types of data being collected, and how these data are being used, as well as offering individuals control over their data, including the ability to access, modify, or delete their emotional data.

Ethical Guidelines: Developing and adhering to ethical guidelines that govern the use of SER technology, ensuring that it is not misused or employed in ways that could harm or discriminate against individuals.

By proactively addressing privacy and ethical concerns, researchers and developers can foster trust in SER systems and promote their responsible deployment in educational settings. Future work should prioritize the development of privacy-preserving techniques, such as federated learning [39] or differential privacy, to enable the training of SER models without compromising individual privacy.

Moreover, ongoing dialogue and collaboration among researchers, educators, policymakers, and the public are necessary to navigate the ethical implications of SER technology and develop frameworks that balance the benefits of emotional intelligence with the protection of individual rights and cultural values.

5.4. Explainability and Model Interpretation

As the proposed speech emotion recognition (SER) system is intended for use in educational settings, it is crucial to provide insights into how the model makes decisions. Explainability and model interpretation are essential for building trust and acceptance among users, particularly educators and learners who rely on the system’s emotion recognition capabilities to inform their teaching and learning strategies.

In the current study, we have focused on developing an accurate and robust SER model using an ensemble approach combining transformer, CNN, and LSTM architectures. However, the model’s decision-making process remains largely opaque, limiting users’ understanding of why certain emotions are recognized in specific instances.

To address this challenge, future work should focus on enhancing the explainability and interpretability of the SER model. Several techniques can be employed to provide deeper insights into the model’s inner workings:

Feature Importance Analysis: Conducting a thorough analysis of the importance of different audio features in the emotion recognition process can help identify the key characteristics that contribute to the model’s decisions. Techniques such as permutation feature importance or Shapley Additive Explanation (SHAP) can be used to quantify the contribution of each feature to the model’s predictions [40].

Attention Visualization: Visualizing the attention weights learned by the transformer block can provide insights into which parts of the speech signal the model focuses on when making emotion predictions. Heatmaps or other visual representations can be used to highlight the regions of the audio that are most relevant for recognizing specific emotions.

Layer-Wise Relevance Propagation (LRP): LRP is a technique that allows for the visualization of the relevance of each input feature to the model’s output. By applying LRP to the SER model [41], we can trace back the contribution of different audio features and identify the most important ones for each emotion class.

Explainable AI Frameworks: Integrating explainable AI frameworks, such as LIME (Local Interpretable Model-Agnostic Explanation) or SHAP, can provide local explanations for individual predictions [42]. These frameworks generate interpretable explanations by perturbing the input features and observing their impact on the model’s output, enabling users to understand why a particular emotion was recognized in a specific instance.

User Studies and Feedback: Conducting user studies with educators and learners to gather feedback on the interpretability and explainability of the SER model can provide valuable insights into how well users understand and trust the model’s decisions. This feedback can guide further improvements to the model’s explainability and inform the development of user-friendly interfaces that effectively communicate the basis of the model’s emotion recognition.

By deepening the explainability and model interpretation of the SER system, we can foster trust and acceptance among users in educational settings. Educators and learners will be empowered to understand the underlying reasons behind the model’s emotion recognition, enabling them to make informed decisions and adapt their teaching and learning strategies accordingly.

6. Limitations and Future Work

The current study has some limitations that should be acknowledged. Firstly, the diversity of the datasets used in terms of their language, cultural background, and educational context could be further expanded. Secondly, the performance of the SER system in real-world distance learning environments may differ from its performance on benchmark datasets. Thirdly, this study focuses on recognizing a limited set of discrete emotions, which may not fully capture the complexity of human emotions. Fourthly, individual differences in emotion expression and perception are not extensively explored. Lastly, the SER system relies primarily on acoustic features without incorporating additional modalities such as facial expressions and body language.

Future research should focus on addressing these limitations by incorporating more diverse datasets, conducting field studies in real distance learning environments, exploring fine-grained emotion recognition, personalizing the SER system to individual users, and integrating multimodal information. Additionally, future work should investigate methods to enhance the explainability and interpretability of the SER system, explore ethical considerations, and study its potential impact on student–teacher relationships. Deploying the SER system on cloud platforms like Amazon Web Services (AWS) can facilitate scalability, high availability, and efficient resource management for real-world applications. Furthermore, exploring federated learning techniques can enable privacy-preserving training and personalization of the SER model across multiple institutions without directly sharing sensitive speech data.

Despite these limitations, the current study provides a foundation for further research and development of SER systems in distance education. By addressing the identified challenges and opportunities, we can work towards creating more robust, reliable, and user-centered SER systems that effectively support and enhance the distance learning experience.

7. Conclusions

This study proposes a comprehensive system for measuring the emotional stability of remote educators, aiming to improve the quality of distance education. The proposed ensemble model combines transformer, CNN, and LSTM architectures to enhance the identification of emotions in speech. MFCCs; chroma; the Mel spectrogram; the ZCR; the spectral contrast, centroid, bandwidth, and roll-off; and the RMS were extracted from audio files to enhance the effectiveness of the model, and noise addition, time stretching, and audio data shifting were used as DA methods to improve the performance of the model. This system was demonstrated to recognize emotions with accuracies of 96.3% for the RAVDESS dataset, 99.86% for the EMO-DB dataset, 85.3% for the IEMOCAP dataset, 96.5% for the SAVEE dataset, and 83% for the SHEIE dataset. The SHEIE dataset developed for this study, which comprises recordings of instructor emotions during online teaching sessions, has advanced SER research. However, the various limitations of the proposed system should be addressed by improving preprocessing, adding features, and incorporating additional DA steps. Furthermore, we intend to broaden the number of languages included in the dataset to improve the model’s ability to recognize emotions independently of language. Refining this method should increase its practical applicability, particularly in terms of analyzing the emotional states of distant educators, as it will allow the model to recognize emotional states across languages and datasets. Future research should test this model in real-world educational settings to determine its effect on teaching and learning.

Author Contributions

Conceptualization, E.A.A.; methodology, E.A.A.; software, E.A.A.; validation, A.A. and R.B.A.; formal analysis, R.B.A.; investigation, A.A. and R.B.A.; resources, A.A.; data curation, E.A.A., A.A. and R.B.A.; writing—original draft preparation, E.A.A.; writing—review and editing, A.A. and R.B.A.; visualization, E.A.A. and R.B.A.; supervision, A.A.; funding acquisition, A.A. and R.B.A. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

By participating in this study, you acknowledge that you have read and understood the information provided, and you voluntarily agree to take part.

Data Availability Statement

Data from this study can be shared upon request.

Conflicts of Interest

The authors declare that there are no conflicts of interest.

References

Kerkeni, L.; Serrestou, Y.; Mbarki, M.; Raoof, K.; Mahjoub, M.A.; Cléder, C. Automatic Speech Emotion Recognition Using Machine Learning. In Social Media and Machine Learning; IntechOpen: London, UK, 2019. [Google Scholar] [CrossRef]
Ramakrishnan, S.; El Emary, I.M.M. Speech emotion recognition approaches in human computer interaction. Telecommun. Syst. 2013, 52, 1467–1478. [Google Scholar] [CrossRef]
De Visser, E.J.; Pak, R.; Shaw, T.H. From ‘automation’ to ‘autonomy’: The importance of trust repair in human–machine interaction. Ergonomics 2018, 61, 1409–1427. [Google Scholar] [CrossRef] [PubMed]
Bahreini, K.; Nadolski, R.; Westera, W. Towards real-time speech emotion recognition for affective e-learning. Educ. Inf. Technol. 2016, 21, 1367–1386. [Google Scholar] [CrossRef]
Tanko, D.; Dogan, S.; Demir, F.B.; Baygin, M.; Sahin, S.E.; Tuncer, T. Shoelace pattern-based speech emotion recognition of the lecturers in distance education: ShoePat23. Appl. Acoust. 2022, 190, 108637. [Google Scholar] [CrossRef]
Zhang, J.; Yin, Z.; Chen, P.; Nichele, S. Emotion recognition using multi-modal data and machine learning techniques: A tutorial and review. Inf. Fusion 2020, 59, 103–126. [Google Scholar] [CrossRef]
Taha, Z.; Musa, R.M.; Majeed, A.P.P.A.; Abdullah, M.R.; Alim, M.M.; Ab Nasir, A.F. The application of k-Nearest Neighbour in the identification of high potential archers based on relative psychological coping skills variables. IOP Conf. Ser. Mater. Sci. Eng. 2018, 342, 012019. [Google Scholar] [CrossRef]
Wang, B.; Liakata, M.; Ni, H.; Lyons, T.; Nevado-Holgado, A.J.; Saunders, K. A Path Signature Approach for Speech Emotion Recognition. In Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH, Graz, Austria, 15–19 September 2019; pp. 1661–1665. [Google Scholar] [CrossRef]
Cheng, X.; Duan, Q. Speech Emotion Recognition Using Gaussian Mixture Model. In Proceedings of the 2012 International Conference on Computer Application and System Modeling, ICCASM 2012, Shanxi, China, 27–29 July 2012; pp. 1222–1225. [Google Scholar] [CrossRef]
Zhu, A.; Luo, Q. Study on Speech Emotion Recognition System in E-Learning; Lecture Notes in Computer Science (Including Subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics); Springer: Berlin/Heidelberg, Germany, 2007; Volume 4552, pp. 544–552. [Google Scholar] [CrossRef]
Tanko, D.; Demir, F.B.; Dogan, S.; Sahin, S.E.; Tuncer, T. Automated speech emotion polarization for a distance education system based on orbital local binary pattern and an appropriate sub-band selection technique. Multimed. Tools Appl. 2023, 82, 40839–40856. [Google Scholar] [CrossRef] [PubMed]
Chen, K.; Yue, G.; Yu, F.; Shen, Y.; Zhu, A. Research on Speech Emotion Recognition System in E-Learning; Lecture Notes in Computer Science (Including Subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics); Springer: Berlin/Heidelberg, Germany, 2007; Volume 4489, pp. 555–558. [Google Scholar] [CrossRef]
Huang, C.; Liang, R.; Wang, Q.; Xi, J.; Zha, C.; Zhao, L. Practical speech emotion recognition based on online learning: From acted data to elicited data. Math. Probl. Eng. 2013, 2013, 265819. [Google Scholar] [CrossRef]
Li, W.; Zhang, Y.; Fu, Y. Speech emotion recognition in E-learning system based on affective computing. In Proceedings of the Third International Conference on Natural Computation, ICNC 2007, Haikou, China, 24–27 August 2007; Volume 5, pp. 809–813. [Google Scholar] [CrossRef]
Zhang, Y.; Srivastava, G. Speech emotion recognition method in educational scene based on machine learning. EAI Endorsed Trans. Scalable Inf. Syst. 2022, 9, e9. [Google Scholar] [CrossRef]
Ancilin, J.; Milton, A. Improved speech emotion recognition with Mel frequency magnitude coefficient. Appl. Acoust. 2021, 179, 108046. [Google Scholar] [CrossRef]
Guan, H.; Liu, Z.; Wang, L.; Dang, J.; Yu, R. Speech Emotion Recognition Considering Local Dynamic Features. In Studies on Speech Production; Lecture Notes in Computer Science (Including Subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics); Springer: Cham, Switzerland, 2017; Volume 10733, pp. 14–23. [Google Scholar] [CrossRef]
Alsabhan, W. Human–Computer Interaction with a Real-Time Speech Emotion Recognition with Ensembling Techniques 1D Convolution Neural Network and Attention. Sensors 2023, 23, 1386. [Google Scholar] [CrossRef] [PubMed]
Atmaja, B.T.; Sasou, A. Effects of Data Augmentations on Speech Emotion Recognition. Sensors 2022, 22, 5941. [Google Scholar] [CrossRef] [PubMed]
Zehra, W.; Javed, A.R.; Jalil, Z.; Khan, H.U.; Gadekallu, T.R. Cross corpus multi-lingual speech emotion recognition using ensemble learning. Complex Intell. Syst. 2021, 7, 1845–1854. [Google Scholar] [CrossRef]
Parthasarathy, S.; Busso, C. Semi-Supervised Speech Emotion Recognition with Ladder Networks. IEEE/ACM Trans Audio Speech Lang. Process 2020, 28, 2697–2709. [Google Scholar] [CrossRef]
Mustaqeem; Sajjad, M.; Kwon, S. Clustering-Based Speech Emotion Recognition by Incorporating Learned Features and Deep BiLSTM. IEEE Access 2020, 8, 79861–79875. [Google Scholar] [CrossRef]
Yan, Y.; Shen, X. Research on Speech Emotion Recognition Based on AA-CBGRU Network. Electronics 2022, 11, 1409. [Google Scholar] [CrossRef]
Ahmed, M.R.; Islam, S.; Islam, A.K.M.M.; Shatabda, S. An ensemble 1D-CNN-LSTM-GRU model with data augmentation for speech emotion recognition. Expert Syst. Appl. 2023, 218, 119633. [Google Scholar] [CrossRef]
Livingstone, S.R.; Russo, F.A. The Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS): A dynamic, multimodal set of facial and vocal expressions in North American English. PLoS ONE 2018, 13, e0196391. [Google Scholar] [CrossRef] [PubMed]
Burkhardt, F.; Paeschke, A.; Rolfes, M.; Sendlmeier, W.; Weiss, B. A Database of German Emotional Speech. Available online: http://www.expressive-speech.net/emodb/ (accessed on 12 June 2023).
Surrey Audio-Visual Expressed Emotion (SAVEE) Database. Available online: http://kahlan.eps.surrey.ac.uk/savee/ (accessed on 16 June 2023).
IEMOCAP-Home. Available online: https://sail.usc.edu/iemocap/ (accessed on 12 June 2023).
Ramdinmawii, E.; Mohanta, A.; Mittal, V.K. Emotion recognition from speech signal. In Proceedings of the IEEE Region 10 Annual International Conference, Penang, Malaysia, 5–8 November 2017; pp. 1562–1567. [Google Scholar] [CrossRef]
Breitenstein, C.; Van Lancker, D.; Daum, I. The contribution of speech rate and pitch variation to the perception of vocal emotions in a German and an American sample. Cogn. Emot. 2001, 15, 57–79. [Google Scholar] [CrossRef]
Nilsenová, M.; Goudbeek, M.; Kempen, L. The relation between pitch perception preference and emotion identification. In Proceedings of the 11th Annual Conference of the International Speech Communication Association, INTERSPEECH 2010, Chiba, Japan, 26–30 September 2010; pp. 110–113. [Google Scholar] [CrossRef]
Kienast, M.; Sendlmeier, W.F. Acoustical analysis of spectral and temporal changes in emotional speech. In Proceedings of the ITRW on Speech and Emotion, Newcastle, UK, 5–7 September 2000; pp. 92–97. [Google Scholar]
Torres-García, A.A.; Mendoza-Montoya, O.; Molinas, M.; Antelis, J.M.; Moctezuma, L.A.; Hernández-Del-Toro, T. Pre-processing and feature extraction. In Biosignal Processing and Classification Using Computational Learning and Intelligence: Principles, Algorithms, and Applications; Academic Press: Cambridge, MA, USA, 2021; pp. 59–91. [Google Scholar] [CrossRef]
Tawfik, M.; Nimbhore, S.; Al-Zidi, N.M.; Ahmed, Z.A.T.; Almadani, A.M. Multi-features Extraction for Automating COVID-19 Detection from Cough Sound using Deep Neural Networks. In Proceedings of the 4th International Conference on Smart Systems and Inventive Technology, ICSSIT 2022, Tirunelveli, India, 20–22 January 2022; pp. 944–950. [Google Scholar] [CrossRef]
Jothimani, S.; Premalatha, K. MFF-SAug: Multi feature fusion with spectrogram augmentation of speech emotion recognition using convolution neural network. Chaos Solitons Fractals 2022, 162, 112512. [Google Scholar] [CrossRef]
Davis, S.B.; Mermelstein, P. Comparison of Parametric Representations for Monosyllabic Word Recognition in Continuously Spoken Sentences. IEEE Trans. Acoust. 1980, 28, 357–366. [Google Scholar] [CrossRef]
Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is All you Need. Adv. Neural Inf. Process Syst. 2017, 30. [Google Scholar]
Yacouby, R.; Axman, D. Probabilistic Extension of Precision, Recall, and F1 Score for More Thorough Evaluation of Classification Models. In Proceedings of the First Workshop on Evaluation and Comparison of NLP Systems, Online, 20 November 2020; pp. 79–91. [Google Scholar] [CrossRef]
McMahan, H.B.; Moore, E.; Ramage, D.; Hampson, S.; Arcas, B.A.Y. Communication-Efficient Learning of Deep Networks from Decentralized Data. In Proceedings of the 20th International Conference on Artificial Intelligence and Statistics, AISTATS 2017, Fort Lauderdale, FL, USA, 20–22 April 2017. [Google Scholar]
Lundberg, S.M.; Lee, S.I. A unified approach to interpreting model predictions. arXiv 2017, arXiv:1705.07874. [Google Scholar]
Montavon, G.; Binder, A.; Lapuschkin, S.; Samek, W.; Müller, K.R. Layer-Wise Relevance Propagation: An Overview. In Explainable AI: Interpreting, Explaining and Visualizing Deep Learning; Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics); Springer: Cham, Switzerland, 2019; Volume 11700, LNCS; pp. 193–209. [Google Scholar] [CrossRef]
Ribeiro, M.T.; Singh, S.; Guestrin, C. Why should I trust you?Explaining the predictions of any classifier. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, CA, USA, 13–17 August 2016; pp. 1135–1144. [Google Scholar]

Figure 1. The proposed method.

Figure 2. Distribution of the emotional classes in the five datasets: (a) the RAVDESS; (b) the EMO-DB; (c) SAVEE; (d) SHEIE; (e) IEMOCAP.

Figure 3. Distribution of emotional classes in the five datasets after data augmentation: (a) the RAVDESS, (b) the EMO-DB, (c) SAVEE, (d) IEMOCAP, and (e) SHEIE.

Figure 4. Effects of data augmentation on SER: (a) original audio with additive white Gaussian noise; (b) original audio with time stretching; and (c) original sound with pitch shifting.

Figure 5. Zero-crossing rate (ZCR), chroma, and roll-off values for the anger sample.

Figure 6. Ensemble model architecture.

Figure 17. Summary of model accuracy and F1 scores across the datasets.

Figure 18. Receiver operating characteristic curves for the datasets: (a) the EMO-DB, (b) the RAVDESS, (c) SAVEE, (d) IEMOCAP, and (e) SHEIE.

Figure 19. SER deployment using the Flask web interface.

Table 1. Description of the datasets.

Dataset Name	Number and Types of Emotions	Sample Size	Type of Dataset	Status
Ryerson Audio–Visual Database of Emotional Speech and Song (RAVDESS)	Eight: disgust, fear, sadness, angriness, happiness, surprise, calmness, neutrality	7356	Multimodal: speech and song	Acted
Berlin Database of Emotional Speech (EMO-DB)	Seven: anger, disgust, fear, happiness, neutrality, sadness, boredom	535	Speech	Acted
Surrey Audio–Visual Expressed Emotion (SAVEE)	Seven: anger, disgust, fear, happiness, neutrality, sadness, surprise	480	Speech	Acted
Interactive Emotional Dyadic Motion Capture (IEMOCAP)	Eight: happiness, anger, sadness, frustration, surprise, fear, excitement, neutrality	7513	Multimodal: speech, video, and motion capture	Acted
Saudi Higher-Education Instructor Emotions (SHEIE)	Six: anger, happiness, sadness, excitement, boredom, neutrality	7515	Speech	Natural

Table 2. Ensemble model summary.

Layer	Output Shape	Param #	Connected to	Parameter	Value
Input_1 (InputLayer)	(None, 19,264)	0
conv1d (Conv1D)	(None, 19,264)	256	input_1[0][0]	filters	64
transformer_block_1	(None, 19,264)	0	input_1[0][0]	embed_dim	64
conv1d_1 (Conv1D)	(None, 19,264)	12,352	conv1d [0][0]	kernel_size	3
				stride
conv1d_2 (Conv1D)	(None, 19,264)	12,352	conv1d_1[0][0]	padding	“same”
lstm (LSTM)	(None, 19,264)	16,896	input_1[0][0]	activation	“relu”
transformer_block_2	(None, 19,264)	25,216	transformer_block_1[0][0]	ff_dim	64
conv1d_3 (Conv1D)	(None, 19,264)		conv1d_2[0][0]	rate	0.1
lstm_1 (LSTM)	(None, 19,264)	33,024	lstm [0][0]	nodes	64
flatten (Flatten)	(None, 12,288)	0	conv1d_3[0][0]	return_sequences	True
lstm_2 (LSTM)	(None, 64)	53,824	lstm_1[0][0]
flatten_1 (Flatten)	(None, 12,288)	0
concatenate (Concatenate)	(None, 24,640)	0	flatten_1[0][0], lstm_2[0][0]	concatenate_axis	1
dense_13 (Dense)	(None, 8)	192,128	concatenate [0][0]

Table 9. The ensemble model’s performance compared to models from previous studies.

Authors	Dataset	Feature Extraction and Techniques	Model	Results
Ancilin and Milton [16]	EMO-DB, RAVDESS, SAVEE, EMOVO, eNTERFACE, Urdu	Mel-frequency magnitude coefficient (MFMC), Mel-frequency cepstral coefficient (MFCC), LFPC, linear prediction cepstral coefficient (LPCC)	Multiclass support vector machine (SVM)	Urdu: 95.25%, EMO-DB: 81.5%, SAVEE: 75.63%, EMOVO: 73.3%, RAVDESS: 64.31%, eNTERFACE: 56.41%
H. Guan et al. [17]	EMO-DB	Zero-crossing rate (ZCR), root-mean-square energy (RMSE), pitch, harmonics-to-noise ratio, MFCCs 1–12 (first 12 Mel-frequency cepstral coefficients)	SVM	66.4%, 70.8%
Alsabhan et al. [18]	ANAD, BAVED, SAVEE, EMO-DB	ZCR, RMSE, MFCC	One-dimensional (1D) convolutional neural network (CNN) with long short-term memory (LSTM) and self-attention mechanisms; custom two-dimensional (2D) CNN architecture	EMO-DB: 96.72%, SAVEE: 97%
Atmaja and Sasou [19]	IEMOCAP, Japanese Twitter-Based Emotional Speech dataset	Wav2vec 2.0; pitch shifting, time stretching, and silence removal used for data augmentation (DA)	SVM	77.25%
Tanko et al. [11]	Lectures dataset	Multi-level discrete wavelet transform; 1D orbital local binary pattern	SVM	Proposed method achieved 93.40% accuracy on the lecture-based dataset containing three emotions; when evaluated using the EMO-DB, Toronto Emotional Speech Set, and Sharif Emotional Speech Database datasets, it achieved 86.14%, 99.82%, and 73.60% accuracy, respectively
Mustaqeem et al. [22]	IEMOCAP, EMO-DB, RAVDESS	Short-time Fourier transform, CNN, bidirectional LTSM	Object similarity measured using radial basis function networks	IEMOCAP: 72.25%, EMO-DB: 85.57%, RAVDESS: 77.02%
Yu and Xizhong [23]	IEMOCAP	Spectrogram	Bidirectional gated-recurrent-unit neural network	82% increase in accuracy
Ahmed et al.[24]	TESS, EMO-DB, RAVDESS, SAVEE, Crowd-Sourced Emotional Multimodal Actors dataset (CREMA-D)	13 MFCC coefficients, log Mel-scaled spectrogram, ZCR, chromogram, RMS, DA	Ensemble model (1D CNN, LSTM, gated recurrent unit) with local feature-acquiring blocks and global feature-acquiring blocks	TESS: 99.46%, EMO-DB: 95.42%, RAVDESS: 95.62%, SAVEE: 93.22%, CREMA-D: 90.47%
Bahreini [4]	Collected dataset	Pitch, energy	Framework for Improving Learning Through Webcams And Microphones	Accuracy of 74.3%
Tanko [5]	Lecturer speeches	Shoelace Pattern; tunable-Q wavelet transform sub-bands; neighborhood component analysis	SVM	Accuracies of 94.97%, 96.41%
Zhang and Srivastava [15]	EMO-DB, Chinese Academy of Sciences’ Chinese Affective Corpus	Speech essential frequency, quality intensity	Canonical correlation analysis, SVM	Average accuracy of 90–95% for both models
This study	RAVDESS, EMO-DB, SAVEE, IEMOCAP, SHEIE	40 MFCC coefficients; chroma; Mel spectrogram; ZCR; spectral contrast, centroid, bandwidth, and roll-off; RMS	Ensemble model (transformer, CNN, LSTM) with multi-head self-attention mechanism	RAVDESS: 96.3%, EMO-DB: 99.86%, SAVEE: 96.5%, IEMOCAP: 85.3%, SHEIE: 83%

Table 10. Ablation study results showing the accuracy of the ensemble model and its variants on the five datasets.

Model/ACC	RAVDESS	EMO-DB	SAVEE	IEMOCAP	SHEIE
EM-T	91.2%	96.5%	92%	82.7%	80.4%
EM-C	92.3%	98.2%	94.3%	83.9%	81.6%
EM-L	91.8%	98.6%	94.9%	83.7%	82.3%
EM-4E		71%
Ensemble	96.3%	99.86%	96.5%	85.3%	83%

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

Alkhamali, E.A.; Allinjawi, A.; Ashari, R.B. Combining Transformer, Convolutional Neural Network, and Long Short-Term Memory Architectures: A Novel Ensemble Learning Technique That Leverages Multi-Acoustic Features for Speech Emotion Recognition in Distance Education Classrooms. Appl. Sci. 2024, 14, 5050. https://doi.org/10.3390/app14125050

AMA Style

Alkhamali EA, Allinjawi A, Ashari RB. Combining Transformer, Convolutional Neural Network, and Long Short-Term Memory Architectures: A Novel Ensemble Learning Technique That Leverages Multi-Acoustic Features for Speech Emotion Recognition in Distance Education Classrooms. Applied Sciences. 2024; 14(12):5050. https://doi.org/10.3390/app14125050

Chicago/Turabian Style

Alkhamali, Eman Abdulrahman, Arwa Allinjawi, and Rehab Bahaaddin Ashari. 2024. "Combining Transformer, Convolutional Neural Network, and Long Short-Term Memory Architectures: A Novel Ensemble Learning Technique That Leverages Multi-Acoustic Features for Speech Emotion Recognition in Distance Education Classrooms" Applied Sciences 14, no. 12: 5050. https://doi.org/10.3390/app14125050

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers. See further details here.

Article Menu

Combining Transformer, Convolutional Neural Network, and Long Short-Term Memory Architectures: A Novel Ensemble Learning Technique That Leverages Multi-Acoustic Features for Speech Emotion Recognition in Distance Education Classrooms

Abstract

1. Introduction

2. Literature Review

2.1. Applications of SER in Education

2.2. Feature Extraction for Classification in SER

3. Materials and Methods

3.1. Datasets

3.2. Data Augmentation

3.3. Feature Extraction

3.4. Proposed Model

4. Experimental Results

4.1. Experimental Design

4.2. Measuring Tools Used for the Evaluation

4.3. Ensemble Model Evaluation

4.3.1. The EMO-DB Dataset

4.3.2. The RAVDESS Dataset

4.3.3. The SAVEE Dataset

4.3.4. The IEMOCAP Dataset

4.3.5. The SHEIE Dataset

5. Analysis and Discussion

5.1. Ablation Study

5.2. Real-Time Deployment and Field Testing

5.2.1. Real-Time Deployment

5.2.2. Field Testing

5.3. Privacy and Ethical Considerations

5.4. Explainability and Model Interpretation

6. Limitations and Future Work

7. Conclusions

Author Contributions

Funding

Institutional Review Board Statement

Informed Consent Statement

Data Availability Statement

Conflicts of Interest

References

Share and Cite

Article Metrics

Article Access Statistics

Further Information

Guidelines

MDPI Initiatives

Follow MDPI