Article

A Methodical Framework Utilizing Transforms and Biomimetic Intelligence-Based Optimization with Machine Learning for Speech Emotion Recognition

Department of Artificial Intelligence Convergence, Chuncheon 24252, Republic of Korea
*
Author to whom correspondence should be addressed.
Biomimetics 2024, 9(9), 513; https://doi.org/10.3390/biomimetics9090513
Submission received: 22 July 2024 / Revised: 19 August 2024 / Accepted: 23 August 2024 / Published: 26 August 2024

Abstract:
Speech emotion recognition (SER) tasks are conducted to extract emotional features from speech signals. The characteristic parameters are analyzed, and the speech emotional states are judged. At present, SER is an important aspect of artificial psychology and artificial intelligence, as it is widely implemented in many applications in the human–computer interface, medical, and entertainment fields. In this work, six transforms, namely, the synchrosqueezing transform, fractional Stockwell transform (FST), K-sine transform-dependent integrated system (KSTDIS), flexible analytic wavelet transform (FAWT), chirplet transform, and superlet transform, are initially applied to speech emotion signals. Once the transforms are applied and the features are extracted, the essential features are selected using three techniques: the Overlapping Information Feature Selection (OIFS) technique followed by two biomimetic intelligence-based optimization techniques, namely, Harris Hawks Optimization (HHO) and the Chameleon Swarm Algorithm (CSA). The selected features are then classified with the help of ten basic machine learning classifiers, with special emphasis given to the extreme learning machine (ELM) and twin extreme learning machine (TELM) classifiers. An experiment is conducted on four publicly available datasets, namely, EMOVO, RAVDESS, SAVEE, and Berlin Emo-DB. The best results are obtained as follows: the Chirplet + CSA + TELM combination obtains a classification accuracy of 80.63% on the EMOVO dataset, the FAWT + HHO + TELM combination obtains a classification accuracy of 85.76% on the RAVDESS dataset, the Chirplet + OIFS + TELM combination obtains a classification accuracy of 83.94% on the SAVEE dataset, and, finally, the KSTDIS + CSA + TELM combination obtains a classification accuracy of 89.77% on the Berlin Emo-DB dataset.

1. Introduction

Emotions have been found to have a significant influence on the psychological and physical well-being of human beings [1]. The emotions shared by patients, and how well they are received by therapists, aid in the assessment of good treatment options. Therapists process a huge amount of data over long periods of time, which is a laborious task [2]. Thus, if a speech-based emotion training process were made feasible, it would be highly beneficial to therapists, as it would save a great deal of time and energy. For instance, voice samples portraying different emotions, such as sadness, surprise, joy, anger, and a neutral state, could be used to train a neural network to recognize them [3]. Without the need for intrusive technology, spoken audio signals can be examined so that emotional information can be obtained. Across various applications, emotions have been grouped into text and audio categories based on how they evolve with time [4]. For human communication, emotion is a vitally important skill that allows interpersonal connections to be managed well. With the help of emotional inputs, multiple cognitive computing tasks are achieved, and they often greatly aid perception and rational thinking. Some versatile examples of emotion classification and recognition include banking, computer games, video and audio monitoring, and psychiatric diagnosis [5]. Emotional speech recognition can be utilized for entertainment purposes, clinical investigations, online learning, and corporate applications. Combining voice signals with emotional analysis creates a holistic method called emotion recognition from speech. This method was chosen in this research as it has a negligible cost and is more appealing and useful than processing other biomedical signals [6,7]. In short, SER is widely used to trace and identify emotions in human speech signals. This technique has received a lot of attention in the last decade, as it bridges the gap and serves as an important link in human–computer interfaces [8]. SER has been applied for drowsiness detection, emotional state assessment, sleep stage scoring, medical diagnosis, etc. In the past few years, various SER studies have been conducted, primarily based on acoustic features [9]. They include assessments of voice quality, spectral features, prosodic features, etc., implemented for emotion recognition. Earlier studies used either one type of acoustic feature or multiple features to identify emotions, supported by classification with traditional machine learning or deep learning classifiers [10]. Some of the common works carried out in the past few years are discussed below. In this work, experiments were conducted on four publicly available datasets, namely, EMOVO [11], RAVDESS [12], SAVEE [13], and Berlin Emo-DB [14], and results previously reported on these datasets are discussed specifically for performance comparison analyses.
Regarding the EMOVO dataset, the following techniques and results have been reported: speaker awareness for SER was analyzed, obtaining a classification accuracy of 68.50% [15]; automatic feature selection techniques were applied for emotion recognition in low-resource settings, obtaining a classification accuracy of 41% [16]; novel feature selection techniques were applied for emotion recognition, obtaining a classification accuracy of 60.40% [17]; and the transfer learning concept, where the knowledge gained from one task or dataset is utilized to improve model performance on another related task or dataset, was applied to enhance SER, obtaining a classification accuracy of 76.22% [18].
Regarding the RAVDESS database, the following techniques and results have been reported: machine learning techniques were implemented, obtaining a classification accuracy of 80.21% [19]; convolutional neural networks (CNNs) were implemented, obtaining a classification accuracy of 79.50% [20]; multimodal SER using CNN was implemented, obtaining a classification accuracy of 78.20% [21]; spiking neural networks were implemented, obtaining a classification accuracy of 83.60% [22]; the capsule routing technique was implemented, obtaining a classification accuracy of 77.02% [23]; bagged SVM was implemented, obtaining a classification accuracy of 75.69% [24]; spectrogram-based multi-task audio classification was carried out, obtaining a classification accuracy of 64.48% [25]; frequency cepstral coefficients with neural networks were applied, obtaining a classification accuracy of 79.80% [26]; and continuous wavelet transform (CWT) was applied, obtaining a classification accuracy of 60.10% [27].
Regarding the SAVEE dataset, the following techniques and results have been reported: a hierarchical classifier was used for SER, obtaining a classification accuracy of 83.78% [28]; joint deep cross-domain transfer learning was used, obtaining a classification accuracy of 69.00% [29]; the concept of negative emotion recognition was used, obtaining a classification accuracy of 65.83% [30]; 3D-CNN-based SER was carried out using K-means clustering and spectrograms, obtaining a classification accuracy of 81.05% [31]; recurrence dynamics were integrated for SER, obtaining a classification accuracy of 80.20% [32]; a comparison of cepstral features for SER was carried out, obtaining a classification accuracy of 78.60% [33]; and hybrid Particle Swarm Optimization (PSO)-based biogeography optimization was carried out for emotion and stress recognition, obtaining a classification accuracy of 78.44% [34].
Regarding the Berlin Emo-DB database, the following techniques and results have been reported: a two-layer fuzzy multiple random forest concept obtained a classification accuracy of 87.85% [35], a modified quantum-behaved PSO obtained a classification accuracy of 82.82% [36], a wavelet packet analysis obtained a classification accuracy of 79.50% [37], multi-time-scale convolution obtained a classification accuracy of 70.97% [38], voting mechanisms obtained a classification accuracy of 64.52% [39], the stacked generalization method obtained a classification accuracy of 82.45% [40], and the divide-and-conquer-based ensemble technique obtained a classification accuracy of 82.00% [41].
The main contributions of this work are as follows:
(a)
After the basic pre-processing of the signals is carried out using an Independent Component Analysis (ICA), the pre-processed signals are subjected to six transforms, and then features are extracted.
(b)
The extracted features are selected using three efficient techniques, namely, OIFS, HHO, and CSA, and they are finally fed into machine learning classifiers.
(c)
While the techniques may already be established, the combinations with which the workflow proceeds are completely novel and interesting, as no previous works have reported the combination of transforms with optimization techniques, followed by classification with machine learning classifiers. A simplified version of a block diagram for a holistic understanding is presented in Figure 1.
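To make the workflow concrete, the following minimal Python sketch mirrors the block diagram of Figure 1. Every component in it is a deliberately trivial placeholder (an identity stub instead of ICA, summary statistics instead of one of the six transforms, a keep-all selector, and a nearest-mean classifier) standing in for the actual methods described in the remainder of this paper.

```python
import numpy as np

# Placeholder stages: each trivial function below stands in for a component of the
# proposed pipeline (ICA pre-processing, one of the six transforms plus feature
# extraction, OIFS/HHO/CSA feature selection, and a machine learning classifier).
def preprocess_ica(signal):                 # placeholder: identity instead of real ICA
    return signal

def transform_and_extract(signal):          # placeholder: summary statistics instead of a transform
    return np.array([signal.mean(), signal.std(), np.abs(signal).max()])

def select_features(X, y):                  # placeholder: keep every feature column
    return np.ones(X.shape[1], dtype=bool)

class NearestMeanClassifier:                # placeholder classifier
    def fit(self, X, y):
        self.classes_ = np.unique(y)
        self.means_ = np.array([X[y == c].mean(axis=0) for c in self.classes_])
        return self

    def predict(self, X):
        d = ((X[:, None, :] - self.means_[None, :, :]) ** 2).sum(axis=-1)
        return self.classes_[d.argmin(axis=1)]

def ser_pipeline(signals, labels):
    """Transform -> feature selection -> classification, mirroring Figure 1."""
    X = np.vstack([transform_and_extract(preprocess_ica(s)) for s in signals])
    mask = select_features(X, labels)
    clf = NearestMeanClassifier().fit(X[:, mask], labels)
    return clf, mask

# Toy usage with synthetic segments standing in for speech-emotion recordings
rng = np.random.default_rng(0)
signals = [rng.standard_normal(16000) * (1 + k % 2) for k in range(20)]
labels = np.array([k % 2 for k in range(20)])
clf, mask = ser_pipeline(signals, labels)
print(clf.predict(np.vstack([transform_and_extract(s) for s in signals])[:, mask]))
```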

2. Implementation of Transforms

The following six transforms were applied to the pre-processed speech emotion signals: the synchrosqueezing transform, FST, KSTDIS, FAWT, chirplet transform, and superlet transform. These transforms were chosen because they are computationally fast, they can clearly expose the fine and innate details of a signal, and several of them provide good, simultaneous localization in both the time and frequency domains. They are also straightforward to implement, which greatly eases the burden on researchers. The implementation of these transforms is described below.

2.1. Synchrosqueezing Transform

The synchrosqueezing transform is highly dependent on the continuous wavelet transform (CWT) [42]. This transform is highly useful for obtaining the localized time–frequency specifications of non-stationary signals and the oscillation components of a signal. The basic form of the signal q ( t ) is indicated as follows:
$$q(t) = \sum_{n=1}^{N} q_n(t) + e(t)$$
where every component $q_n(t) = M_n(t)\cos(\phi_n(t))$ represents an oscillation function with time-varying frequency and amplitude, and the noise is specified by $e(t)$. For every component $n = 1, \ldots, N$, the intention is to obtain the amplitude $M_n(t)$ and the instantaneous phase $\phi_n(t)$. The Hilbert transform is applied to the original component $q_n(t)$, and the analytic signal $a_n(t)$ is formed as follows:
$$a_n(t) = M_n(t)\, e^{j\phi_n(t)} = q_n(t) + jH\{q_n(t)\}$$
For the multicomponent signal, $a_n(t)$ is extended by means of a vector construction over the analytic signal of every channel; therefore, a multivariate analytic signal is obtained as follows:
$$\mathbf{a}(t) = \begin{bmatrix} M_1(t)\, e^{j\phi_1(t)} \\ M_2(t)\, e^{j\phi_2(t)} \\ M_3(t)\, e^{j\phi_3(t)} \\ M_4(t)\, e^{j\phi_4(t)} \\ \vdots \\ M_N(t)\, e^{j\phi_N(t)} \end{bmatrix}$$
Every speech emotion is represented as a monocomponent signal, as shown in Equation (3). For the multicomponent signal $q(t)$, the highly localized time–frequency representation is provided by the SST algorithm utilizing the instantaneous amplitude. By utilizing the CWT, the frequency information is extracted from the analytic signal. The oscillatory components of a particular signal are captured by the CWT algorithm via the use of time–frequency filters termed wavelets. The finite oscillation function is represented by the mother wavelet $\psi(t)$, and it is convolved with the signal $q(t)$. The computation of the CWT of $q_n(t)$ is carried out as follows:
$$W(c,d) = c^{-1/2}\int_{-\infty}^{\infty} \psi^{*}\!\left(\frac{t-d}{c}\right) q_n(t)\, dt$$
Here, the wavelet coefficients for every scale–time pair $(c,d)$ are indicated by $W(c,d)$. The bandpass filter output consists of only the wavelet coefficients. In the frequency domain, the bandpass filter is scrolled by the scale factor $c$ so that the bandwidth of the filter changes. The scale parameter $c$ strongly influences the frequency properties, while the time of interest is indicated by the shift parameter $d$. Therefore, at a particular frequency $f_s$, the energy of a sinusoidal wavelet transform spreads around the scale factor $c_s = f_\psi / f_s$. Here, $f_\psi$ indicates the wavelet's center frequency, and the energy of the original frequency $f_s$ is widely spread across $c_s$. The actual frequency $f_r$ and the estimated frequency in the relevant scales are similar. For every scale–time pair $(c,d)$, the evaluation of the instantaneous frequency $f_q(c,d)$ is carried out as follows:
$$f_q(c,d) = -i\, F(c,d)^{-1}\, \frac{\partial F(c,d)}{\partial d}$$
The wavelet coefficients are transformed from the scale–time domain to the time–frequency domain using the SST technique. Every point $(c,d)$ is transformed into $(d, f_q(c,d))$. A scaling step is required, as $c$ and $d$ are discrete values. The computation of $\Delta c_k = c_{k-1} - c_k$ is carried out for every $c_k$ at which $f_q(c,d)$ is defined. When the domain transformation from the scale–time domain to the time–frequency domain, $(d,c) \to (d, f_{inst}(c,d))$, is carried out, the SST coefficient $T(f_l, d)$ is computed at the center $f_l$ of the frequency range $[f_l - \Delta f/2,\ f_l + \Delta f/2]$, with $\Delta f = f_l - f_{l-1}$. At a resolution of $\Delta f$, the frequency bins are indexed by $f_l$. The linear frequency scale is specified by $\Delta f$, which acts as a constant value. The oscillations in the univariate mode are reconstructed using the SST so that $q_n(t) = M_n(t)\cos(\phi_n(t))$. The synchrosqueezing coefficient and the reconstructed signal $q(d)$ are expressed as follows:
$$T(f_l, d) = \sum_{c_k :\, |f_q(c_k,d) - f_l| \le \Delta f/2} F(c_k, d)\, c_k^{-3/2}\, \Delta c_k$$
$$q(d) = \Re\!\left[ R_\psi^{-1} \sum_{l} T(f_l, d)\, \Delta f \right]$$
The time–frequency representation of the signal given by Equation (4) is synchrosqueezed along the frequency scale. In Equation (7), $R_\psi = \frac{1}{2}\int_0^\infty \hat{\psi}^{*}(\xi)\, \frac{d\xi}{\xi}$ represents the normalization constant, where $\hat{\psi}(\xi)$ is the Fourier transform of the mother wavelet $\psi(t)$. To maintain a consistent representation in the time–frequency domain, the coefficients of the CWT are reallocated using the SST algorithm so that instantaneous frequencies are automatically generated. The multivariate synchrosqueezing transform is explained as follows: the expressive representation in the time–frequency domain is provided using the modulated oscillation model. A multivariate extension of the synchrosqueezing transform is presented so that the common oscillations across multiple data channels are recognized. For multivariate data such as speech emotion data, the procedure starts by applying the SST to every channel. The synchrosqueezing coefficients are initially obtained, and then the multiple channels are combined according to assignment rules. From a collection of multivariate signals, a group of monocomponent signals is identified, and the time–frequency domain components are separated into $K$ frequency band components $\{f_k\}_{k=1,\ldots,K}$. Assessments of the instantaneous frequencies and amplitudes are conducted. The multivariate instantaneous frequency and amplitude are computed across the channels, and, ultimately, the multivariate synchrosqueezing coefficient is assessed.
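The steps above can be summarized in a short numerical sketch: compute the CWT, estimate the instantaneous frequency from the phase derivative, and reassign the coefficients to frequency bins. The Morlet parameterization, the scale grid, and the simple nearest-bin reassignment below are assumptions of this sketch, not the exact multivariate implementation used in this work.

```python
import numpy as np

def morlet_cwt(q, scales, fs, w0=6.0):
    """CWT of q using an analytic Morlet wavelet, evaluated in the frequency domain.
    Returns the coefficients W(c, d) and their time derivative, which is needed
    for the instantaneous-frequency estimate."""
    n = len(q)
    omega = 2 * np.pi * np.fft.fftfreq(n, d=1.0 / fs)      # angular frequency grid
    Q = np.fft.fft(q)
    W = np.empty((len(scales), n), dtype=complex)
    dW = np.empty_like(W)
    for i, c in enumerate(scales):
        psi_hat = np.where(omega > 0, np.exp(-0.5 * (c * omega - w0) ** 2), 0.0)
        W[i] = np.fft.ifft(Q * np.conj(psi_hat)) * np.sqrt(c)
        dW[i] = np.fft.ifft(Q * np.conj(psi_hat) * 1j * omega) * np.sqrt(c)
    return W, dW

def synchrosqueeze(W, dW, scales, freq_bins):
    """Reassign CWT energy to instantaneous-frequency bins (the synchrosqueezing step)."""
    inst_freq = np.imag(dW / (W + 1e-12)) / (2 * np.pi)     # f_q(c, d) in Hz
    T = np.zeros((len(freq_bins), W.shape[1]), dtype=complex)
    dc = np.abs(np.gradient(scales))
    for i, c in enumerate(scales):
        idx = np.argmin(np.abs(freq_bins[:, None] - inst_freq[i][None, :]), axis=0)
        np.add.at(T, (idx, np.arange(W.shape[1])), W[i] * c ** (-1.5) * dc[i])
    return T

# Toy usage: a two-tone signal standing in for one speech-emotion frame
fs = 8000
t = np.arange(0, 0.5, 1 / fs)
q = np.cos(2 * np.pi * 200 * t) + 0.5 * np.cos(2 * np.pi * 900 * t)
scales = 6.0 / (2 * np.pi * np.linspace(50, 2000, 128))     # c_s = w0 / (2*pi*f_s)
W, dW = morlet_cwt(q, scales, fs)
T = synchrosqueeze(W, dW, scales, freq_bins=np.linspace(50, 2000, 128))
print(T.shape)
```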

2.2. Fractional Stockwell Transform (FST)

Among the prevalent joint time–frequency analysis tools, one of the most famous techniques is the FST, as it helps to improve the time–frequency concentration to a great extent [43]. Here, two widely used transforms are combined: the fractional Fourier transform (FrFT) and the Stockwell transform (ST). For the signal $q(t)$, the $z$th-order continuous and discrete FSTs are represented mathematically as follows:
$$FST_q^z(\tau, v_\varphi) = \int_{-\infty}^{\infty} q(t)\, h(\tau - t, v_\varphi)\, K_z(t, v_\varphi)\, dt$$
and
$$FST_q^z(k, p) = \sum_{n=-N}^{N} q(n)\, h(k - n, p)\, K_z(n, p)$$
where the kernel function of the FrFT, $K_z(t, v_\varphi)$, is expressed as follows:
$$K_z(t, v_\varphi) = \begin{cases} \sqrt{\dfrac{1 - j\cot\varphi}{2\pi}}\, \exp\!\left[ j\left(\dfrac{t^2 + v_\varphi^2}{2}\right)\cot\varphi - j\, v_\varphi\, t\, \csc\varphi \right], & \varphi \neq n\pi \\ \delta(t - v_\varphi), & \varphi = 2n\pi \\ \delta(t + v_\varphi), & \varphi + \pi = 2n\pi \end{cases}$$
The scalable Gaussian window function is expressed as $h(\tau - t, v_\varphi)$; it depends on the time $t$ and the fractional Fourier frequency $v_\varphi$, and it is represented as follows:
$$h(t, v_\varphi) = \frac{|v_\varphi \csc\varphi|}{\sqrt{2\pi}\, s}\, \exp\!\left( -\frac{t^2\, (v_\varphi \csc\varphi)^2}{2\, r^2 s^2} \right)$$
where $\varphi = z\pi/2$ and $0 < |z| < 2$. Alongside the fractional order parameter $z$, there are two window adjustment parameters, $r$ and $s$, which help to modify the window shape. The energy of the signal is localized using these parameters so that the time–frequency resolution is improved. The FST is calculated as follows: in the context of chirp basis functions, the Gaussian-windowed signals are decomposed with respect to phase and time, and they are strongly influenced by different $v_\varphi$ values.
Non-stationary signals are best analyzed in the time–frequency plane, as the representation is more interpretable and robust there. As it provides good frequency and time resolution, the ST remains a very popular window-based technique among the joint time–frequency analysis tools. At lower frequencies, the frequency resolution is improved, and, at higher frequencies, a good time resolution is provided. Despite having such versatile features, it also has certain shortcomings. The time resolution is sometimes degraded, as the Gaussian window function is quite long with tapering ends; therefore, pertinent data present in the signal are suppressed. The window shape parameters of the ST cannot be adjusted, as they depend entirely on the frequency. To improve performance, these limitations must be overcome, and, to this end, a fractional variant of the ST, called the FST, is used in this work. The fractional Fourier domain is quite effective at eliminating noise when dealing with non-stationary signals. The noisy components can be separated in the intermediate time–frequency domain using the fractional order of the transform, and a better resolution can be obtained using the ST. Extending the ST into the fractional domain improves the signal localization, thereby enhancing the flexibility of the entire spectrum of the signal. Thus, the FST is utilized to process the noisy content of the signals. This modified variant of the ST uses a scalable fractional Gaussian window in the fractional domain with the help of the fractional parameter $z$ and the two window adjustment parameters $r$ and $s$. A good improvement in resolution can be seen when using these two window adjustment parameters and the fractional parameter. The rotation angle can also be changed so that multi-resolution spectrograms, which are sometimes large, can be achieved, thereby improving flexibility.
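As a point of reference, the following sketch implements the ordinary discrete Stockwell transform via the standard frequency-domain Gaussian-window method; the FST used in this work reduces to this special case when the fractional order is $z = 1$ (i.e., $\varphi = \pi/2$) and the window adjustment parameters are $r = s = 1$. Extending it to the full FST would replace the Fourier kernel with the fractional kernel $K_z$ above.

```python
import numpy as np

def stockwell_transform(x):
    """Discrete Stockwell transform computed in the frequency domain.
    This is the ordinary ST, i.e., the z = 1 special case of the FST with r = s = 1."""
    N = len(x)
    X = np.fft.fft(x)
    m = np.fft.fftfreq(N) * N                      # frequency offsets 0..N/2-1, -N/2..-1
    S = np.zeros((N // 2 + 1, N), dtype=complex)
    S[0] = np.mean(x)                              # zero-frequency row: signal mean
    for k in range(1, N // 2 + 1):
        gauss = np.exp(-2 * np.pi ** 2 * m ** 2 / k ** 2)   # Gaussian voice window
        S[k] = np.fft.ifft(np.roll(X, -k) * gauss)          # shifted spectrum * window
    return S

# Toy usage: a chirp-like test signal
fs = 1000
t = np.arange(0, 1, 1 / fs)
x = np.cos(2 * np.pi * (50 + 100 * t) * t)
S = stockwell_transform(x)
print(S.shape)        # (frequency voices, time samples)
```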

2.3. K-Sine Transform-Dependent Integrated System (KSTDIS)

The K-sine transform-dependent integrated system described below is investigated for its effectiveness [44]. The sinusoidal transformation exhibits very powerful nonlinear dynamic behavior, and it can be applied to chaotic systems quite easily. With the help of the sine map, the chaos of a chaotic system can be enhanced as follows:
$$v_1 = y\sin(\pi Q), \quad y \in (0, 1)$$
The K-sine map depends on the sine map, and it is expressed as follows:
$$v_2 = |y\sin(k\pi q)|, \quad y \in (0, 1)$$
The chaos of the function is enhanced by $v_2$ through the mitigation of the periodicity of the trigonometric function. If $k = 1$, then $v_2$ becomes equivalent to $v_1$. The chaotic property of $v_2$ is higher than that of $v_1$, and this is quantified with the help of the Lyapunov exponent [45] and sample entropy [46]. The numerical features of adjacent trajectories are indicated by the Lyapunov exponent, and this aids in tracing the chaotic motion. To identify the chaos of a system, the LE is highly useful. For a one-dimensional chaotic system $Q_{i+1} = F(Q_i)$, the Lyapunov exponent is expressed as follows:
$$\lambda_F = \lim_{n\to\infty} \frac{1}{n} \sum_{i=1}^{n-1} \ln\left| F'(Q_i) \right|$$
The system progresses towards a chaotic state if $\lambda_F > 0$, and such a mapping is termed a chaotic mapping. If the $\lambda_F$ value is large, then the chaotic behavior is quite complex in nature. To assess time series complexity, sample entropy is highly useful when compared with approximate entropy. The time series complexity is assessed using sample entropy by measuring the probability of new patterns appearing in the time sequences. The complexity is higher if the probability of generating new patterns is high. The complexity is lower if the sample entropy value is small and the time series self-similarity is high. If the sample entropy value is large, then the time series is more complex. Based on $v_2$, this study proposes the use of the KSTDIS, and its mathematical model is represented as follows:
$$Q_{i+1} = \left| \sin\!\big(k\pi\, (F_1(q_i, y) + F_2(q_i, 1-y))\big) \right|$$
where $F_1(q_i, y)$ and $F_2(q_i, 1-y)$ are two known 1D chaotic seed maps. The control parameters are $k$ and $y$, with $k \in [1, \infty)$ and $y \in [0, 1]$, and $|\cdot|$ denotes the absolute value. The coupling operation is performed on $F_1(q_i, y)$ and $F_2(q_i, 1-y)$, and then the sine transformation is performed on the coupling result with the help of $v_2$. The chaotic dynamics can be easily absorbed by the coupled operation. The merits of the nonlinear dynamic behavior of the sine transform are exploited; thus, the seed maps can be selected arbitrarily so that a larger number of novel chaotic maps can be generated. The effectiveness of the KSTDIS must be proven. Here, $k$ is assigned a value of 3, and the chosen seed chaotic maps are a logistic map, a sine map, and a tent map.
Sine map:
$$q_{i+1} = y\sin(\pi q_i)$$
Logistic map:
$$q_{i+1} = 4y\, q_i(1 - q_i)$$
Tent map:
$$q_{i+1} = \begin{cases} 2y\, q_i, & q_i < 0.5 \\ 2y\,(1 - q_i), & q_i \ge 0.5 \end{cases}$$
The control parameter for all of these maps is $y$, with $y \in [0, 1]$. The mathematical model of the 3-KSTDIS is expressed as follows:
$$Q_{i+1} = \left| \sin\!\big(3\pi\,(F_1(q_i, y) + F_2(q_i, 1-y))\big) \right|, \quad k \in [1, +\infty),\ y \in [0, 1]$$
Here, the three seed mappings are not combined, as the chosen value of k is only 3.
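The coupling scheme above is straightforward to iterate numerically; the sketch below couples two of the listed seed maps through the 3-KSTDIS rule. The chosen initial value and control parameter are arbitrary illustrative values.

```python
import numpy as np

# Seed maps listed in the text (control parameter y)
def sine_map(q, y):      return y * np.sin(np.pi * q)
def logistic_map(q, y):  return 4 * y * q * (1 - q)
def tent_map(q, y):      return 2 * y * q if q < 0.5 else 2 * y * (1 - q)

def kstdis(f1, f2, q0, y, k=3, n=1000):
    """Iterate Q_{i+1} = |sin(k*pi*(F1(q_i, y) + F2(q_i, 1-y)))| for two seed maps."""
    q = q0
    seq = np.empty(n)
    for i in range(n):
        q = abs(np.sin(k * np.pi * (f1(q, y) + f2(q, 1 - y))))
        seq[i] = q
    return seq

# Toy usage: couple the logistic and tent seed maps (3-KSTDIS)
seq = kstdis(logistic_map, tent_map, q0=0.37, y=0.7, k=3, n=2000)
print(seq[:5], seq.mean())
```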

2.4. FAWT

To analyze speech emotion signals, the analytic wavelet transform with flexible time–frequency covering is utilized [47]. Hilbert transform pairs of atoms are utilized by this transform so that higher flexibility is provided, thereby controlling the dilation factor, redundancy, and Quality Factor (QF). The signals can be easily analyzed by adjusting the following parameters: the down-sampling factors for the low-pass channel ($l$) and high-pass channel ($h$), and the up-sampling factors for the low-pass channel ($u$) and high-pass channel ($p$). The parameter $\beta$ that helps to ascertain and control the QF is expressed as follows:
$$QF = \frac{2 - \beta}{\beta}$$
By implementing the iterative filter bank structure, the $k$th level of the FAWT is obtained. At every level, the decomposition provides two high-pass channels and a single low-pass channel.
The frequency response of the low-pass filter is expressed as follows:
$$H(\omega) = \begin{cases} (ul)^{1/2}, & |\omega| < \omega_r \\ (ul)^{1/2}\, \theta\!\left(\dfrac{\omega - \omega_r}{\omega_u - \omega_r}\right), & \omega_r \le \omega \le \omega_u \\ (ul)^{1/2}\, \theta\!\left(\dfrac{\pi - (\omega - \omega_r)}{\omega_u - \omega_r}\right), & -\omega_u \le \omega \le -\omega_r \\ 0, & |\omega| \ge \omega_u \end{cases}$$
The frequency response of the high-pass filter is expressed as follows:
$$G(\omega) = \begin{cases} (2ph)^{1/2}\, \theta\!\left(\dfrac{\pi - (\omega + \omega_0)}{\omega_1 - \omega_0}\right), & \omega_0 \le \omega < \omega_1 \\ (2ph)^{1/2}, & \omega_1 \le \omega \le \omega_2 \\ (2ph)^{1/2}\, \theta\!\left(\dfrac{\omega - \omega_2}{\omega_3 - \omega_2}\right), & \omega_2 \le \omega \le \omega_3 \\ 0, & \omega \in [0, \omega_0) \cup (\omega_3, 2\pi) \end{cases}$$
where
$$\omega_r = \frac{(1-\beta)\pi + \epsilon}{u}, \qquad \omega_u = \frac{\pi}{l}, \qquad \omega_0 = \frac{(1-\beta)\pi + \epsilon}{p}, \qquad \omega_1 = \frac{u\pi}{lp},$$
$$\omega_2 = \frac{\pi - \epsilon}{p}, \qquad \omega_3 = \frac{\pi + \epsilon}{p}, \qquad \epsilon \le \frac{u - l + \beta l}{u + l}\,\pi$$
The function $\theta(\omega)$ is expressed as follows:
$$\theta(\omega) = \frac{[1 + \cos(\omega)]\,\sqrt{2 - \cos(\omega)}}{2} \quad \text{for } \omega \in [0, \pi]$$
The following conditions should be fulfilled so that a proper reconstruction is allowed:
$$|\theta(\pi - \omega)|^2 + |\theta(\omega)|^2 = 1$$
$$1 - \frac{u}{l} \le \beta \le \frac{p}{h}$$
Without using the chirp parameters, the parameter of the FAWT is deduced. The value of β is set to 0.5, and the QF is set to 2 in our experiment.
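A quick numerical sanity check of the transition function $\theta(\omega)$ and of the $\beta$–QF relation is shown below; it is only a sketch of these two building blocks and does not reproduce the full FAWT filter bank.

```python
import numpy as np

# theta(w) = [1 + cos(w)] * sqrt(2 - cos(w)) / 2 on [0, pi]
def theta(w):
    return (1 + np.cos(w)) * np.sqrt(2 - np.cos(w)) / 2

w = np.linspace(0, np.pi, 1001)
# Perfect-reconstruction condition: |theta(pi - w)|^2 + |theta(w)|^2 = 1
assert np.allclose(theta(np.pi - w) ** 2 + theta(w) ** 2, 1.0)

# Quality factor as a function of beta: QF = (2 - beta) / beta
for beta in (0.5, 2 / 3, 0.8):
    print(f"beta = {beta:.3f} -> QF = {(2 - beta) / beta:.2f}")
```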

2.5. Chirplet Transform

For non-stationary signals, the time–frequency domain specifications are expressed well with the chirplet transform [48]. Information on the amplitude variation with time is expressed by the time domain signal. Information on the amplitude variation with frequency is expressed by the frequency domain signal. With respect to both frequency and time, information on the variation in energy or amplitude is expressed by the CT. The chirplet transform is a generalized specification of the STFT and CWT, and it has numerous applications in the fields of biosignal processing, image processing, etc. Let $q(s)$ be a speech emotion signal with $s = 1, 2, \ldots, S$, where $S$ is the total number of samples present in the speech signal $q(s)$. Then, the chirplet transform of $q(s)$ is expressed as follows:
$$Q_{\gamma,\sigma}(m, \tau_0) = \sum_{s=1}^{S} \tilde{q}(s)\, e^{-j2\pi m s / S}\, \chi^{*}_{\tau_0,\gamma,\sigma}(s)$$
where q ˜ ( s ) represents the analytic signal of q ( s ) and is evaluated using the Hilbert transform. The analytic signal of q ( s ) is expressed as q ˜ ( s ) = q ( s ) + j H [ q ( s ) ] , where the Hilbert transform of the speech emotion q ( s ) is represented by H [ q ( s ) ] . The complex conjugate of χ is represented by the factor χ * . In the chirplet transform, the window function is χ τ , γ , σ * ( s ) , and it is expressed as follows:
$$\chi^{*}_{\tau_0,\gamma,\sigma}(s) = w_\sigma(s - \tau_0)\, e^{-j\frac{\gamma}{2}(s - \tau_0)^2}$$
where the Gaussian window function is represented by the factor w σ ( s τ 0 ) and is expressed as follows:
$$w_\sigma(s) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{s^2}{2\sigma^2}}$$
By utilizing the chirplet transform of the speech emotion signal, the time–frequency matrix is obtained, and it is mathematically expressed as follows:
$$Q_{\gamma,\sigma}(m, \tau_0) = Q^{R}_{\gamma,\sigma}(m, \tau_0) + j\, Q^{I}_{\gamma,\sigma}(m, \tau_0)$$
where the real part of the time–frequency matrix Q γ , σ ( m , τ 0 ) is expressed as Q γ , σ R ( m , τ 0 ) , and the imaginary part of the time–frequency matrix Q γ , σ ( m , τ 0 ) is expressed as Q γ , σ I ( m , τ 0 ) . For the time–frequency matrix, the magnitude is expressed as follows:
$$|Q_{\gamma,\sigma}(m, \tau_0)| = \sqrt{\left[Q^{R}_{\gamma,\sigma}(m, \tau_0)\right]^2 + \left[Q^{I}_{\gamma,\sigma}(m, \tau_0)\right]^2}$$
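The chirplet coefficients for a single window centre can be computed directly from the definitions above, as in the following sketch; the sign convention of the chirp phase and the parameter values are assumptions of this illustration.

```python
import numpy as np
from scipy.signal import hilbert

def chirplet_transform(q, tau0, gamma, sigma):
    """Chirplet coefficients Q_{gamma,sigma}(m, tau0) for one window centre tau0:
    analytic signal x Gaussian window x chirp x DFT kernel, as defined above."""
    S = len(q)
    s = np.arange(S)
    q_tilde = hilbert(q)                                     # analytic signal via Hilbert transform
    gauss = np.exp(-(s - tau0) ** 2 / (2 * sigma ** 2)) / (sigma * np.sqrt(2 * np.pi))
    chirp = np.exp(-1j * 0.5 * gamma * (s - tau0) ** 2)      # assumed chirp-phase convention
    window = gauss * chirp                                   # chi_{tau0, gamma, sigma}(s)
    kernel = np.exp(-2j * np.pi * np.outer(np.arange(S), s) / S)
    return kernel @ (q_tilde * np.conj(window))              # one column of the TF matrix

# Toy usage on a short chirping tone
fs = 1000
t = np.arange(0, 0.256, 1 / fs)
q = np.cos(2 * np.pi * (40 * t + 60 * t ** 2))
Q = chirplet_transform(q, tau0=len(q) // 2, gamma=1e-3, sigma=20)
print(np.abs(Q).argmax(), np.abs(Q).max())
```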

2.6. Superlet Transform

A collection of wavelets with a high bandwidth is used by the superlet transform [49]. The excellent temporal resolution of the wavelets is combined geometrically so that a high frequency resolution is also attained. A time–frequency analysis of the signal can be performed using the STFT, and, hence, both time and frequency representations of the signal are specified. However, a sharp localization in both time and frequency is not attained by the features generated with the STFT because of the Heisenberg uncertainty principle. The drawbacks of the STFT are mitigated by the CWT, and, thus, at higher frequencies, a good temporal resolution is achieved [50]. The DWT can also be utilized, where the frequencies are specified as powers of two. However, both the STFT and CWT have certain limitations. A good frequency resolution is provided by the STFT, and a good temporal resolution is provided by the CWT [51]. At higher frequencies, the STFT provides a good frequency resolution with poor temporal resolution, whereas the CWT maintains a good relative temporal resolution throughout the spectrum but its frequency resolution degrades and becomes redundant with increasing frequency [52]. To obtain a high time–frequency resolution, different wavelets are utilized by the superlet transform. Wavelets with a smaller number of cycles provide a good temporal but low frequency resolution, and wavelets with a greater number of cycles provide a high frequency resolution but a degraded temporal resolution. To obtain a super-resolution, both high temporal resolution and high frequency resolution are combined by the superlet [53]. Superlets have been widely used in various applications such as biosignal processing, image processing, wireless communication, and electrical signal processing. The superlet transform uses the Morlet as its mother wavelet, which provides a multi-resolution spectro-temporal specification. The Morlet is one of the most widely used complex wavelets, and it is expressed as follows:
$$\phi(t) = \left( e^{j w_0 t} - e^{-\frac{w_0^2}{2}} \right) e^{-\frac{t^2}{2}}$$
where the central frequency of the mother wavelet is expressed as w 0 . The Morlet wavelet is almost like the Gabor transform in terms of its process. The scaling parameter helps to scale the window function in the Morlet transform. In the Gabor transform, the window size is already prefixed. The modified Morlet implemented in the superlet is expressed as follows:
$$\psi_{w_0,n}(t) = \frac{1}{D_n\sqrt{2\pi}}\, e^{-\frac{t^2}{2D_n^2}}\, e^{j2\pi w_0 t}$$
The displacement parameter is expressed as D n , where the time variance of the wavelet is regulated as follows:
$$D_n = \frac{n}{S.D \times w_0}$$
It is generally inversely proportional to the frequency. A broad frequency response is obtained if the value of D n is small. A narrow frequency response is obtained if the value of D n is large. D n is adjusted so that the wave covers the full cycles within the standard deviation S . D of the Gaussian envelope. Different wavelets are used by the superlet with w 0 so that a better time–frequency representation is obtained. The mathematical specification of the superlet is as follows:
$$SL_{w_0,p} = \{\psi_{w_0,n}\ |\ n = n_1, n_2, \ldots, n_p\}$$
The order of the superlet that manages the wavelet is expressed by p . For every wavelet, the number of cycles is represented by n 1 , n 2 , , n p in the superlet. In the wavelets of the superlet transforms, the number of cycles is selected additively or multiplicatively. The number of cycles is utilized and selected by making use of the multiplicity concept as follows:
$$n_i = n \times i$$
Here, i = 1 , 2 , 3 , , p . As far as the individual wavelet response is concerned, the response of the superlet to the speech emotion signal q ( t ) is specified by the geometric mean, and it is expressed as follows:
$$G[SL_{w_0,p}] = \sqrt[p]{\prod_{i=1}^{p} G[\Psi_{w_0,n_i}]}$$
Here, the response of the i t h wavelet of q ( t ) is represented as Ψ w 0 , n i .
For the Morlet, it is expressed as follows:
$$G[\Psi_{w_0,n_i}] = \sqrt{2}\, \left| q \circledast \Psi_{w_0,n_i} \right|$$
The speech emotion signal is represented by $q$, and the complex convolution is represented by $\circledast$. Superlet transforms are almost like the CWT, except that they utilize superlets instead of the single wavelets utilized in the CWT. A CWT can be viewed as a superlet transform of order 1. Superlets with a higher order provide a better indication of the signal. For a wide frequency range, adaptive superlets are preferred. If an adaptive superlet with an order higher than 1 is used, then a good improvement in the time–frequency representation of the signal is obtained.
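A minimal sketch of the multiplicative superlet response is given below: Morlet wavelets with $n_i = n \times i$ cycles are applied and their responses are combined by the geometric mean. The number of standard deviations covered by the Gaussian envelope and the base cycle count are assumed values.

```python
import numpy as np

def morlet(w0, n_cycles, fs, k_sd=5):
    """Complex Morlet with n_cycles cycles at frequency w0 (Hz), sampled at fs.
    The Gaussian time spread is D_n = n_cycles / (k_sd * w0)."""
    D = n_cycles / (k_sd * w0)
    t = np.arange(-3 * D, 3 * D, 1 / fs)
    return np.exp(-t ** 2 / (2 * D ** 2)) * np.exp(2j * np.pi * w0 * t) / (D * np.sqrt(2 * np.pi))

def superlet_response(q, w0, fs, base_cycles=3, order=4):
    """Multiplicative superlet: geometric mean of sqrt(2)*|q conv psi_{w0, n*i}|, i = 1..order."""
    acc = np.ones(len(q))
    for i in range(1, order + 1):
        psi = morlet(w0, base_cycles * i, fs)
        resp = np.sqrt(2) * np.abs(np.convolve(q, psi, mode="same")) / fs
        acc *= resp
    return acc ** (1.0 / order)

# Toy usage: energy of a 300 Hz tone buried in noise
fs = 8000
t = np.arange(0, 0.5, 1 / fs)
rng = np.random.default_rng(1)
q = np.cos(2 * np.pi * 300 * t) + 0.5 * rng.standard_normal(len(t))
print(superlet_response(q, w0=300, fs=fs).mean())
```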

3. Feature Selection Approaches

An important step in the machine learning pipeline is automatic feature selection. It involves selecting the most important features while eliminating the least important ones. To enhance the performance of any machine learning model and to reduce the computational overhead, automatic feature selection techniques are utilized. Novel feature selection techniques utilize new ideas to extract these relevant features and can employ a plethora of techniques to do so successfully. Feature selection can be carried out using three types of approaches: filter, wrapper, and hybrid approaches [54]. Filter algorithms employ the inherent properties of the dataset to estimate the priority of features. Wrapper algorithms assess the performance of a particular classifier with respect to the chosen features. Hybrid methods use the techniques of both the wrapper and filter algorithms; therefore, the number of candidate features is reduced. The general wrapper algorithm is implemented by using an optimization algorithm with a classifier. Depending on the training results of the classifier, the feature combination elements are fine-tuned. The training phase and the optimization algorithm are time consuming, so wrapper algorithms sometimes seem to be less suited to high-dimensional datasets. Hybrid techniques that combine the filter and wrapper algorithms are also time consuming and can be computationally expensive. Compared with wrapper and hybrid techniques, filter algorithms have several advantages, such as easy implementation and high efficiency; thus, in this work, Overlapping Information Feature Selection (OIFS) is initially used to choose features depending on the sample-overlap concept to improve the feature selection efficiency [55]. Geometrical separation is ensured for every pair of categories in at least one feature dimension. Using an approximate equation, the overlapping matrices of every feature are computed. Global features are chosen as the features that have the lowest overlapping rates. Then, the pairs of categories that cannot be split by the global features are identified, and the feature with the lowest overlapping rate is considered for each such pair. Following OIFS, the HHO algorithm and the CSA algorithm are also utilized for the efficient selection of features in this work.

3.1. OIFS Method

For every class, the boundaries must be identified, and this serves as the main objective of a classification problem. When there is a geometrical distribution among the various boundaries of the classes, the problem becomes a straightforward one. Here, the overlapping rate of the sample is considered an important factor for the feature selection method [55]. If one feature is used to geometrically separate two classes, then a low overlap between these two classes is implied, thereby facilitating the easy discrimination of the features. For each pair of classes, if an appropriate feature satisfies the above condition, then the complexity is reduced. The approximation of the overlapping rate between two classes is expressed as follows:
$$F(h_1, h_2) = \frac{\sigma_{h_1}^2 + \sigma_{h_2}^2}{(\mu_{h_1} - \mu_{h_2})^2}$$
where $\mu_{h_1}$ and $\mu_{h_2}$ represent the sample means, and $\sigma_{h_1}$ and $\sigma_{h_2}$ the standard deviations, of class $h_1$ and class $h_2$, respectively. For an $N$-class classification problem, the overlapping matrix of feature $g$ is expressed as follows:
$$F_g = \begin{bmatrix} f_g(1,1) & f_g(1,2) & \cdots & f_g(1,N) \\ f_g(2,1) & f_g(2,2) & \cdots & f_g(2,N) \\ \vdots & \vdots & \ddots & \vdots \\ f_g(N,1) & f_g(N,2) & \cdots & f_g(N,N) \end{bmatrix}$$
The overlapping matrix is symmetric in nature, and, using Equation (45), the values of all of its elements can be derived. It is not necessary to compute the diagonal elements, as they specify the overlapping rate of each class with itself. A feature set comprises $N_f$ features. Depending on these matrices, the features are selected. To distinguish every pair of classes, at least one feature is required. The overlapping matrices generally have an $(N \times N)$ structure. Examining the overlapping matrices is the most direct method for choosing a feature for every pair of classes; the corresponding feature of the overlapping matrix is then selected. Ultimately, $N(N-1)/2$ features are chosen for the final feature combination. Some redundant features may exist in the chosen combination. There are two important phases of OIFS, and they aim to mitigate the redundancy of the feature combination. The first phase consists of choosing the global features so that multiple pairs of classes can be clearly discriminated and the computational efficiency is easily improved. This process can also aid in easy classification. Due to their inherent properties, some classes may show blurred boundaries, so local features must also be used. The next phase of OIFS concentrates on choosing the local features so that the classification of two ambiguous classes is facilitated. To choose the global features, the procedure is as follows: the average overlapping rate is computed for every overlapping matrix, and, for the $g$th feature, it is represented as follows:
$$\bar{F}_g = \frac{\sum_{h_1=1}^{N} \sum_{h_2=1,\, h_2 \ne h_1}^{N} f_g(h_1, h_2)}{N(N-1)}$$
The average value of the overlapping matrix is computed using Equation (47). Depending on the average overlapping rates, the features are sorted in ascending order. The global features consist of only the top features selected from this procedure, and the number of candidates chosen as global features is denoted by $N_c$. The input is converted to an integer using the function $\mathrm{int}(\cdot)$, and $N_z(g)$ specifies the $g$th element of the vector $N_z$. Using the $g$th feature, the corresponding pair of classes can be easily discriminated. For the global features, the overlapping limit is indicated by $\varepsilon_1$. Here, the parameters $q$ and $\varepsilon_1$ are considered tuning parameters, and their values are assigned as $q = 10$ and $\varepsilon_1 = 0.5$. The values of $q$ and $\varepsilon_1$ are increased if the number of features is reduced; thus, the global features are increased, and the local features are decreased. The pairs of classes that are not discriminated by the global features are then traced. Local features are assigned to these pairs of classes, and the minimum overlapping rate is computed. Between the two related classes, it is determined whether a feature yields a high or low overlapping rate. The minimum overlapping rate between the two classes in the feature set is recorded so that the local features can be selected. The overlapping rate limit is checked to ensure that it always satisfies the criterion. Several iterations are required to choose only the efficient features, thereby eliminating the redundant ones.
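The global-feature stage of OIFS can be sketched directly from Equations (45) and (47), as below; the local-feature stage and the tuning of $q$ and $\varepsilon_1$ are omitted, and the small constant added to the denominator is an implementation safeguard.

```python
import numpy as np

def overlap_matrix(X, y, g):
    """Overlapping matrix of feature g (Equation (45)):
    F(h1, h2) = (var_h1 + var_h2) / (mean_h1 - mean_h2)^2 for every pair of classes."""
    classes = np.unique(y)
    F = np.zeros((len(classes), len(classes)))
    for i, h1 in enumerate(classes):
        for j, h2 in enumerate(classes):
            if i == j:
                continue
            a, b = X[y == h1, g], X[y == h2, g]
            F[i, j] = (a.var() + b.var()) / ((a.mean() - b.mean()) ** 2 + 1e-12)
    return F

def select_global_features(X, y, n_keep=10):
    """Rank features by average overlapping rate (Equation (47)) and keep the lowest ones."""
    N = len(np.unique(y))
    avg = []
    for g in range(X.shape[1]):
        avg.append(overlap_matrix(X, y, g).sum() / (N * (N - 1)))
    return np.argsort(avg)[:n_keep]

# Toy usage with random features and labels
rng = np.random.default_rng(2)
X = rng.standard_normal((200, 40))
y = rng.integers(0, 4, 200)
X[:, 5] += 3 * y                    # make feature 5 clearly discriminative
print(select_global_features(X, y, n_keep=5))
```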

3.2. Harris Hawks Optimization

The HHO algorithm is inspired by the cooperative and relative hunting behavior of Harris hawks in nature [56]. Due to its versatility, it has proven to be a promising search technique, and it is used to address a plethora of optimization issues. The HHO algorithm has two exploration stages and four exploitation stages. To improve the quality of the results, different intelligent schemes that utilize a greedy scheme are employed.

3.2.1. Stages of Initialization

In this stage, both the search spaces and fitness are presented. The basic chaotic opposition-dependent initialization technique is used, and all the parameters are assigned values.

3.2.2. Stages of Exploration

In the HHO algorithm, each Harris hawk is treated as a candidate solution. Using two strategies, the fitness is computed depending on the intended prey, and this stage is expressed as follows:
$$a_{t+1} = \begin{cases} a_{rand}(t) - r_1\, \left| a_{rand}(t) - 2 r_2\, a(t) \right|, & p \ge 0.5 \\ \left( a_{rabbit}(t) - a_m(t) \right) - r_3\, \left( LB + r_4\,(UB - LB) \right), & p < 0.5 \end{cases}$$
where the new position in the next iteration is expressed as $a_{t+1}$, and the current position is expressed as $a(t)$. The random position of a hawk is indicated by $a_{rand}(t)$, and the optimal position of the intended rabbit is expressed as $a_{rabbit}(t)$. $r_1$, $r_2$, $r_3$, $r_4$, and $p$ are random numbers within [0, 1], and they are updated in every iteration. The lower bound is represented by LB, and the upper bound is represented by UB. The average position of the $N$ solutions is represented by $a_m(t)$, and it is expressed as follows:
$$a_m(t) = \frac{1}{N} \sum_{i=1}^{N} a_i(t)$$
where the total number of hawks is denoted as N , and the position in iteration t of every hawk is represented as a i ( t ) .

3.2.3. Exploration Stage to Exploitation Stage

In this stage, when the escaping action occurs, the energy of the prey is mitigated; therefore, the exploration stage leads to the exploitation stage, and this is expressed as follows:
$$E = 2 E_0 \left( 1 - \frac{t}{T} \right)$$
where the escape energy of the prey is indicated by $E$, the initial energy is represented by $E_0$, $t$ is the current iteration, and the maximum number of iterations is represented by $T$. The position of the prey is observed by the hawk when $|E| \ge 1$, and this indicates the exploration phase of the HHO. When $|E| < 1$, the hawk is in the exploitation phase.

3.2.4. Stages of Exploitation

This stage comprises four important strategies: soft surrounding, hard surrounding, soft surrounding with rapid dives, and hard surrounding with rapid dives. A summary of these strategies is as follows. A soft surrounding strategy occurs when $r \ge 0.5$ and $|E| \ge 0.5$. The position of the hawk is updated as follows:
$$a(t+1) = \Delta a(t) - E\, \left| J\, a_{rabbit}(t) - a(t) \right|$$
$$\Delta a(t) = a_{rabbit}(t) - a(t)$$
where the difference between the position vector of the rabbit and the present position in iteration $t$ is expressed as $\Delta a(t)$. The random jumping strength of the rabbit is expressed by $J = 2(1 - r_5)$, where $r_5$ is a random variable. The hard surrounding strategy is considered if $r \ge 0.5$ and $|E| < 0.5$. In this case, the position of the hawk is updated as follows:
$$a(t+1) = a_{rabbit}(t) - E\, |\Delta a(t)|$$
A soft surrounding strategy with rapid dives occurs when $|E| \ge 0.5$ and $r < 0.5$. In this stage, the prey can still successfully escape, and the hawks must make a collective decision; this is expressed as follows:
$$B(t) = a_{rabbit}(t) - E\, \left| J\, a_{rabbit}(t) - a(t) \right|$$
To model this strategy, the best Levy flight motion-based patterns are utilized by HHO, and they are defined as follows:
$$C = B + V \times LF(D)$$
where the dimension of the solution is represented by D , a random number vector of size 1 × D is represented by V , and the Levy flight motion is denoted by LF. LF is expressed as follows:
$$LF(a) = 0.01 \times \frac{\mu \times \alpha}{|w|^{1/\beta}},$$
$$\alpha = \left( \frac{\Gamma(1+\beta) \times \sin\!\left(\frac{\pi\beta}{2}\right)}{\Gamma\!\left(\frac{1+\beta}{2}\right) \times \beta \times 2^{\left(\frac{\beta-1}{2}\right)}} \right)^{\frac{1}{\beta}}$$
where w is a random value in the range of [0, 1].
The constant is defined by β and is assigned a value of 2 in our work. The hawk’s position is updated as follows:
$$a(t+1) = \begin{cases} B & \text{if } F(B) < F(a(t)) \\ C & \text{if } F(C) < F(a(t)) \end{cases}$$
where B and C are utilized by using (54) and (55), and both equations refer to the locations of the next new iteration. A hard surrounding strategy with rapid dives is implemented if r < 0.5 and | E | < 0.5 , and this is expressed as follows:
$$a(t+1) = \begin{cases} B & \text{if } F(B) < F(a(t)) \\ C & \text{if } F(C) < F(a(t)) \end{cases}$$
where
$$B(t) = a_{rabbit}(t) - E\, \left| J\, a_{rabbit}(t) - a(t) \right|$$
$$C = B + V \times LF(D)$$
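A compact, self-contained sketch of the HHO update rules above for a generic continuous objective is given below. The chaotic initialization, simulated annealing, mutation/crossover, and tournament extensions described in the following subsections are omitted, the two rapid-dive strategies are collapsed into one greedy branch, and the Levy exponent value is an assumption of this sketch.

```python
import math
import numpy as np

def hho(obj, dim, lb, ub, n_hawks=10, iters=100, seed=0):
    """Compact Harris Hawks Optimization sketch for a continuous objective."""
    rng = np.random.default_rng(seed)
    X = rng.uniform(lb, ub, (n_hawks, dim))
    fit = np.array([obj(x) for x in X])
    best, best_f = X[fit.argmin()].copy(), fit.min()
    beta = 1.5                                               # Levy-flight exponent (assumed)
    alpha = (math.gamma(1 + beta) * math.sin(math.pi * beta / 2) /
             (math.gamma((1 + beta) / 2) * beta * 2 ** ((beta - 1) / 2))) ** (1 / beta)
    levy = lambda d: 0.01 * rng.standard_normal(d) * alpha / np.abs(rng.standard_normal(d)) ** (1 / beta)
    for t in range(iters):
        for i in range(n_hawks):
            E = 2 * rng.uniform(-1, 1) * (1 - t / iters)     # escape energy of the prey
            r, J = rng.random(), 2 * (1 - rng.random())      # random factor and jump strength
            if abs(E) >= 1:                                  # exploration phase
                if rng.random() >= 0.5:
                    k = rng.integers(n_hawks)
                    X[i] = X[k] - rng.random() * np.abs(X[k] - 2 * rng.random() * X[i])
                else:
                    X[i] = (best - X.mean(axis=0)) - rng.random() * (lb + rng.random() * (ub - lb))
            elif r >= 0.5 and abs(E) >= 0.5:                 # soft surrounding
                X[i] = (best - X[i]) - E * np.abs(J * best - X[i])
            elif r >= 0.5:                                   # hard surrounding
                X[i] = best - E * np.abs(best - X[i])
            else:                                            # surrounding with rapid dives
                B = best - E * np.abs(J * best - X[i])
                C = B + rng.random(dim) * levy(dim)
                X[i] = min((B, C, X[i]), key=obj)            # greedy choice among dives
            X[i] = np.clip(X[i], lb, ub)
        fit = np.array([obj(x) for x in X])
        if fit.min() < best_f:
            best, best_f = X[fit.argmin()].copy(), fit.min()
    return best, best_f

# Toy usage: minimize the sphere function over [-10, 10]^5
best, best_f = hho(lambda x: float(np.sum(x ** 2)), dim=5, lb=-10.0, ub=10.0)
print(round(best_f, 6))
```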

3.2.5. Initialization Based on Chaotic Opposition

The candidate solutions are randomly initialized, and the search process is started with the traditional optimization technique. The solution space of the population is initialized by the implementation of the chaotic map strategy. The convergence speed must be increased, so the chaotic opposition-dependent learning population technique replaces the random initialization solution. An initial state is generated by the chaotic opposition to improve the solution diversity, and this is achieved by utilizing a chaotic map to introduce randomness. Thus, the algorithm is prevented from prematurely converging to a local optimum, and the global search ability is greatly improved.

3.2.6. Simulated Annealing

Simulated annealing (SA) is a very famous local search technique and can be analyzed as a single-solution heuristic based on solid annealing [57]. The issue of stagnation in local optima can be surmounted by applying this approach. With a specific probability, even a worse solution can be accepted by SA. The Boltzmann probability $e^{-\theta / T}$ controls the acceptance of the worse solution/worse neighbor, where $\theta$ is the difference between the fitness of the best solution and that of the generated neighbor. During the search process, the temperature is indicated by $T$, and it is periodically decreased. The starting temperature is set to $2 \times |N|$, where $|N|$ is the number of attributes of every data point, and the cooling schedule is computed appropriately.
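The acceptance rule and cooling step described above reduce to a few lines of code; the halving schedule used in this sketch is an assumed stand-in for the paper's cooling plan.

```python
import numpy as np

def sa_accept(candidate_fit, best_fit, temperature, rng):
    """Boltzmann acceptance rule: a worse neighbour is accepted with probability
    exp(-theta / T), where theta is the fitness gap."""
    theta = candidate_fit - best_fit
    return theta <= 0 or rng.random() < np.exp(-theta / temperature)

# Toy usage: temperature starting at 2*|N| and halved each step (assumed schedule)
rng = np.random.default_rng(3)
T = 2 * 40                                   # e.g., 40 attributes per data point
for step in range(5):
    print(step, sa_accept(candidate_fit=1.2, best_fit=1.0, temperature=T, rng=rng))
    T *= 0.5
```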

3.2.7. Operators of Mutation and Crossover

Every Harris hawk is placed at a random location during the exploration phase of the HHO so that the location of the rabbit can be determined. The exploration of the feature space is improved, and the updated location is refined, so that the crossover and mutation can be presented as described below. The mutation rate is expressed as follows:
$$b(t) = 0.9 - 0.9 \times \frac{t-1}{T-1}$$
where the maximum number of iterations is denoted by $T$, and the current iteration is expressed as $t$. As the number of iterations increases, the mutation rate $b$ decreases linearly from 0.9 to 0. During the iteration procedure, a crossover is applied between the current solution $a(t)$ and the result of the mutation, and this operation is expressed as follows:
$$a(t+1)_i = \begin{cases} a(t)_i, & p < 0.5 \\ a_i^{Mut}, & p \ge 0.5 \end{cases}$$
where the value resulting from the mutation is expressed as $a^{Mut}$, a random number in the range [0, 1] is denoted by $p$, and the freshly generated solution is denoted by $a(t+1)$. The $i$th dimension of $a(t+1)$ is denoted as $a(t+1)_i$.

3.2.8. Tournament Choosing

One useful selection strategy is tournament selection, in which a tournament is run among randomly chosen individuals from the population. There are four main phases in this strategy. The population size and tournament size are used as input values; then, a random number $r$ is generated in the range [0, 1]. This random number is compared with the selection probability to adjust the selection pressure. A large tournament size favors the solutions with the highest fitness values; otherwise, weaker solutions can also be selected. Ultimately, diversity is preserved when the tournament selection strategy is utilized.

3.2.9. KNN Classifier

KNN is a versatile classification technique [58]. The main concept of this technique is that a particular sample belongs to a specific group and exhibits the characteristics common to the samples present in that group. A sample is classified into a category if the majority of its K nearest samples belong to that category. This technique is used in signal processing, image processing, text classification, financial risk level assessment, etc., as it is quite straightforward, easy to understand, and simple to implement. The distance chosen in this study is the Euclidean distance, and the mathematical equation is as follows:
$$\mathrm{distance}(G, H) = \sqrt{\sum_{K=1}^{N} (G_K - H_K)^2}$$
where the dimension of the sample is represented by N . The sample in the training set is indicated by G , and the sample in the test set is indicated by H . The feature selection method utilizing the optimization algorithm and KNN is shown in a flowchart in Figure 2.
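In the wrapper setting of Figure 2, each candidate feature subset is scored by a KNN classifier; the sketch below shows one such fitness function combining the KNN error rate with a feature-count penalty. The specific weighting and the cross-validation scheme are assumptions for illustration.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

def wrapper_fitness(mask, X, y, alpha=0.99, beta=0.01, k=8):
    """Fitness of a binary feature mask: weighted sum of the KNN error rate
    (Euclidean distance, K = 8) and the fraction of selected features.
    The alpha/beta weighting here is an assumed illustrative choice."""
    if mask.sum() == 0:
        return 1.0                                           # empty subsets are worthless
    knn = KNeighborsClassifier(n_neighbors=k, metric="euclidean")
    acc = cross_val_score(knn, X[:, mask.astype(bool)], y, cv=5).mean()
    return alpha * (1 - acc) + beta * mask.sum() / len(mask)

# Toy usage with random data and a random mask
rng = np.random.default_rng(4)
X, y = rng.standard_normal((150, 30)), rng.integers(0, 4, 150)
mask = rng.integers(0, 2, 30)
print(wrapper_fitness(mask, X, y))
```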

3.2.10. Parameter Settings of HHO

Experiments were carried out on a Windows 10 computer with an Intel Core i5 processor and 8 GB of memory. For the KNN classifier, the Euclidean distance metric was used, and the K value was assigned a value of 8 in our experiment. Each optimization algorithm was run 50 times. The size of the population was set to 25, and the maximum number of iterations was set to 100. The value of r was set to 4, and the number of hawks was set to 10. The α value was fixed at 0.001, and the β value was fixed at 0.95. For the SA algorithm, the particle number was set to 20. The final temperature was set to 0.2, and the cooling plan was set to 0.5.

3.3. Standard Chameleon Swarm Algorithm (CSA)

The foraging behavior of the chameleons is characterized by the mathematical model [59] described below.

3.3.1. Population Initialization

With a collection of n potential solutions representing the number of chameleons, the position of chameleons z in the search space is randomly initialized, where a possible solution is specified by every chameleon. At a specific iteration t , the position of the j t h chameleon in the search space is expressed as follows:
$$z_t^j = \left[ z_{t,1}^j, z_{t,2}^j, \ldots, z_{t,d}^j \right]$$
where j = 1 , 2 , , n , the current iteration is represented by t , and the dimension is represented by d . The position of chameleon j in a particular dimension d is represented by z t , d j . In the search space, the initial population of the Chameleon Swarm Algorithm is randomly generated depending on the total number of chameleons and the given problem dimension as expressed below:
$$z^j = l_k + r \times (u_k - l_k)$$
where the initial vector of chameleon $j$ is represented by $z^j$. In dimension $k$, the lower and upper limits of the search space are specified by $l_k$ and $u_k$, respectively. The random value $r$ is uniformly generated in the range [0, 1]. Predetermined fitness criteria are used to evaluate the new position of every chameleon. For every chameleon, the present position is updated if a solution with a better quality than the current position is identified. When the CSA algorithm is simulated, if the quality of the current solution is higher and better than that of the new position, then the chameleon stays in its original position.

3.3.2. Position Update

When searching for the prey, the positions of the chameleons are updated with the help of the position update strategy. The foraging conduct of the chameleon is represented as follows:
$$z_{t+1}^{j,k} = \begin{cases} z_t^{j,k} + a_1\left( A_t^{j,k} - B_t^k \right) r_2 + a_2\left( B_t^k - z_t^{j,k} \right) r_1, & r_i \ge A_a \\ z_t^{j,k} + \mu\left( (u^k - l^k)\, r_3 + l^k \right) \mathrm{sgn}(rand - 0.5), & r_i < A_a \end{cases}$$
where the new position of chameleon $j$ in dimension $k$ at iteration $t+1$ is represented by $z_{t+1}^{j,k}$. The present position of chameleon $j$ in dimension $k$ at iteration $t$ is represented by $z_t^{j,k}$. $A_t^{j,k}$ traces the best position achieved so far by chameleon $j$ in dimension $k$, and $B_t^k$ represents the global best position achieved by the chameleons in dimension $k$. $a_1 = 0.5$ and $a_2 = 0.75$ are two positive values that help to control the exploration behavior. $r_1$, $r_2$, and $r_3$ are random values in the interval [0, 1]. The random value at index $i$ is denoted by $r_i$ and lies in the range [0, 1]. $A_a$ denotes the probability of the chameleon perceiving the prey, and $\mathrm{sgn}(rand - 0.5)$ is either $-1$ or $+1$ and strongly influences the exploitation/exploration phases. The function of the iterations is denoted by $\mu$ and is expressed as follows:
$$\mu = c_1\, e^{\left( -c_2\, \frac{t}{T} \right)^{c_3}}$$
where $t$ represents the current number of iterations, and $T$ indicates the maximum number of iterations. $c_1$, $c_2$, and $c_3$ are constant values of 1, 1.5, and 2, respectively, and they help to control the exploitation behavior. The parameter $B_t^k$ indicates that the present candidate solution is highly similar to the optimal solution. The probability of a chameleon detecting the prey in the environment is represented by the parameter $A_a$, with $A_a = 0.1$. This condition is fixed so that the present position can be altered based on the prey observation in the search space. When searching for prey, the chameleons alter their locations in the search space in a random manner. To identify prey, the search space is randomly explored in various areas and directions so that a high potential is achieved; thus, the optimal goals can be easily represented.

3.3.3. Update Model Based on the Eye Rotation of the Chameleon

The chameleon possesses an innate ability to perceive the position of prey, as they are able to rotate their eyes to detect prey within a 360-degree range; this gives them special access and power. A chameleon can easily spin and move towards the prey’s position, and the novel position is represented as follows:
$$z_{t+1}^j = r \times \left( z_t^j - \bar{z}_t^j \right) + \bar{z}_t^j$$
where the updated position of the chameleon is represented by z t + 1 j , and the rotation matrix of the chameleon is represented by r . The current position of the chameleon is represented by z t j , and the center of the chameleon’s present position is represented as z ¯ t j , which is determined as follows:
$$r = R(\theta, o_1, o_2)$$
where the orthonormal vectors are represented as o 1 and o 2 , and they have a size of d × 1 . The rotation matrices are indicated by R , and the cycle of rotation of the eyes is denoted by θ , which can be expressed as follows:
$$\theta = r_a\, \mathrm{sgn}(rand - 0.5) \times 180^{\circ}$$
where $r_a$ is a random value in the interval [0, 1] that scales the rotation angle between 0 and 180 degrees. $\mathrm{sgn}(rand - 0.5)$ is either $-1$ or $+1$, and it denotes the rotation direction.

3.3.4. Velocity Update Model

The chameleon terminates the stalking process once the prey is assaulted and it is not far from the chameleon’s position. This chameleon is considered the best among all the chameleons. The position of this chameleon is updated, as it can extend its tongue to twice its length. This helps the chameleons catch the prey efficiently, as it allows them to exploit the search area quickly. In this algorithm, the chameleon’s tongue moves towards the prey rapidly, and the velocity of this can be modeled as follows:
$$v_{t+1}^{j,k} = w\, v_t^{j,k} + c_1\left( B_t^k - z_t^{j,k} \right) r_1 + c_2\left( A_t^{j,k} - z_t^{j,k} \right) r_2$$
The new and current velocities of chameleon $j$ in dimension $k$ at iterations $t+1$ and $t$ are represented by $v_{t+1}^{j,k}$ and $v_t^{j,k}$, respectively. The present position of chameleon $j$ is indicated by $z_t^{j,k}$. The two random values $r_1$ and $r_2$ lie in the interval [0, 1]. To manage the effects of $A_t^{j,k}$ and $B_t^k$ on the chameleon's tongue, the positive constants $c_1$ and $c_2$ are considered important. The inertia weight is indicated by $w$ and is determined as follows:
$$w = \left( 1 - \frac{t}{T} \right)^{\left( \rho \sqrt{t/T} \right)}$$
where $\rho$ is a positive value utilized to control the exploitation behavior.
When moving towards the prey, the position of the chameleon’s tongue is computed as follows:
$$z_{t+1}^{j,k} = z_t^{j,k} + \frac{\left( v_t^{j,k} \right)^2 - \left( v_{t-1}^{j,k} \right)^2}{2a}$$
where v t 1 j , k denotes the former velocity of chameleon j and dimension k . The rate of acceleration is denoted by a , and it reaches an approximate value of 2600 m/s, as shown below:
$$a = 2600 \times \left( 1 - e^{-\log(t)} \right)$$
By randomly generating the chameleon positions, the CSA performs optimization and feature selection. Within every iteration, the positions of all the chameleons are consecutively updated. The chameleons are returned to the boundary if they exit the search space. With the help of fitness functions, solutions are evaluated so that the fittest chameleon is identified. In every loop, the steps are reiterated until the iteration condition is met. The chameleons incessantly explore and exploit the search space so that they can move towards the prey and catch it with their tongues. The mathematical model provided above helps to address the optimization issue, even with very large search spaces. The simplified implementation of the CSA algorithm is explained in Algorithm 1.
Algorithm 1: Implementation of CSA algorithm
Assign the key parameters of the CSA.
Initialize the positions of all chameleons using the bounds $u$ and $l$.
Initialize the velocity of each chameleon's tongue.
Evaluate the starting position of all chameleons.
While (t < T) do
    Update the positions of the chameleons based on Equation (67)
    Update the positions of the chameleons based on the eye-rotation model using Equation (69)
    Update the tongue velocity of each chameleon using Equation (72)
    Update the tongue position of each chameleon using Equation (74)
    Constrain the positions of the chameleons to the bounds $u$ and $l$ of the problem variables.
    Update the position of every chameleon
    t = t + 1
end while
Return the global best position of the chameleons.
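A compact sketch of the CSA loop summarized in Algorithm 1 is given below for a generic continuous objective; the eye-rotation update of Equation (69) and the boundary-handling details are simplified, and the constants follow the values quoted in the text.

```python
import numpy as np

def csa(obj, dim, lb, ub, n_cham=25, iters=100, seed=0):
    """Compact Chameleon Swarm Algorithm sketch following Equations (67), (72), and (74)."""
    rng = np.random.default_rng(seed)
    Z = lb + rng.random((n_cham, dim)) * (ub - lb)           # random initial positions
    V = np.zeros_like(Z)                                     # tongue velocities
    fit = np.array([obj(z) for z in Z])
    P, p_fit = Z.copy(), fit.copy()                          # best position per chameleon (A_t)
    g, g_fit = Z[fit.argmin()].copy(), fit.min()             # global best position (B_t)
    a1, a2, c1, c2, c3, A_a, rho = 0.5, 0.75, 1.0, 1.5, 2.0, 0.1, 1.0
    for t in range(1, iters + 1):
        mu = c1 * np.exp(-(c2 * t / iters) ** c3)
        w = (1 - t / iters) ** (rho * np.sqrt(t / iters))    # inertia weight
        accel = 2600 * (1 - np.exp(-np.log(t))) if t > 1 else 2600.0  # guard the t = 1 case
        for j in range(n_cham):
            r1, r2, r3, ri = rng.random(4)
            if ri >= A_a:                                    # foraging step (Equation (67))
                Z[j] = Z[j] + a1 * (P[j] - g) * r2 + a2 * (g - Z[j]) * r1
            else:
                Z[j] = Z[j] + mu * ((ub - lb) * r3 + lb) * np.sign(rng.random(dim) - 0.5)
            V_old = V[j].copy()                              # tongue dynamics (Equations (72), (74))
            V[j] = w * V[j] + c1 * (g - Z[j]) * r1 + c2 * (P[j] - Z[j]) * r2
            Z[j] = np.clip(Z[j] + (V[j] ** 2 - V_old ** 2) / (2 * accel), lb, ub)
            f = obj(Z[j])
            if f < p_fit[j]:
                P[j], p_fit[j] = Z[j].copy(), f
            if f < g_fit:
                g, g_fit = Z[j].copy(), f
    return g, g_fit

# Toy usage: minimize the sphere function over [-10, 10]^5
print(round(csa(lambda x: float(np.sum(x ** 2)), dim=5, lb=-10.0, ub=10.0)[1], 6))
```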

3.3.5. Parameter Settings of CSA

We conducted experiments on a Windows 10 computer with an Intel Core i5 processor and 8 GB of memory. The fitness was evaluated using a KNN classifier with the Euclidean distance metric, and K was assigned a value of 8. Each optimization algorithm was run 50 times. The size of the population was set to 25, and the maximum number of iterations was set to 100. The α value was fixed at 0.99, and the β value was fixed at 0.05.

4. Classification Using Machine Learning Algorithms

In this study, ten machine learning algorithms were employed, as described below.

4.1. Random Forest (RF)

Multiple decision trees are constructed randomly by means of sampling with replacement. Random forest adopts the bagging technique, in which many weak learners are combined to form a strong learner. The weak learners are generated independently, and the final prediction is obtained from the majority vote of their predictions.

4.2. Support Vector Machine (SVM)

SVM is a well-known machine learning algorithm utilized for classification and regression problems. In a classification problem, the margin around the separation boundary of the two classes is maximized using a line, plane, or hyperplane to split the data.

4.3. Extreme Gradient Boosting (XGB)

XGB is an efficient, portable, and flexible parallel tree boosting system. XGB addresses the generalization issue, and the model complexity is greatly reduced by the addition of a regularization parameter to the objective function.

4.4. Naïve Bayesian Classifier (NBC)

NBC is a learning technique that depends on Bayes' theorem. The conditional probability of every point is computed for each class, and the test data are assigned to the class with the highest conditional probability. In the NBC, every feature is assumed to be strongly independent, with no correlation between features.
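For clarity, the maximum a posteriori decision rule underlying this description can be written as follows (a standard formulation, not quoted from this paper):

$\hat{y} = \arg\max_{c} \; P(c) \prod_{i=1}^{d} P(x_i \mid c)$

where the conditional independence assumption allows the joint likelihood to factorize over the $d$ features $x_1, \ldots, x_d$.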

4.5. Gradient Boosting

Gradient boosting is a well-known boosting technique based on gradient descent, where gradient descent is used to minimize the errors of the individual learners. Base learners are constructed iteratively by reweighting the misclassified instances. For every training observation, gradient boosting uses the negative partial derivatives of the loss function to determine the weights. To obtain a high-performance model, the weak learners are combined sequentially.

4.6. K Nearest Neighbor (KNN)

The KNN algorithm is a well-known non-parametric technique. The K nearest training points are used at test time so that the class of a test point can be easily predicted. KNN is suitable for large amounts of training data and is comparatively robust to noise in the training dataset.

4.7. Adaptive Boosting (AB)

Adaptive boosting (AB) is a well-known boosting classifier in which multiple base decision tree classifiers are combined to create a robust classifier with a good classification accuracy. A sequence of weak classifiers is trained iteratively on weighted data, and the ensemble is constructed by focusing on previously misclassified cases. The final boosted classifier is determined from the weighted sum of all weak classifiers.

4.8. Decision Trees (DTs)

DT comprises a non-parametric tree-like structure with a root, leaf, and internal nodes. The training dataset is divided into many non-overlapping data subsets through the Gini Index. Depending on the attribute value, a partition is made by the root node, features are indicated by the internal nodes, and the outcome is specified by the leaf node.
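To illustrate how the eight baseline classifiers of Sections 4.1–4.8 can be evaluated on a selected feature subset with 10-fold cross-validation, a brief sketch using scikit-learn and the xgboost package is given below; all hyperparameters are library defaults rather than tuned values from this study, and emotion labels are assumed to be integer-encoded.

```python
from sklearn.ensemble import (AdaBoostClassifier, GradientBoostingClassifier,
                              RandomForestClassifier)
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from xgboost import XGBClassifier

def benchmark_classifiers(X_selected, y, cv=10):
    """Run 10-fold cross-validation for the eight baseline classifiers.

    y is assumed to contain integer-encoded emotion labels (required by XGBClassifier).
    """
    models = {
        "RF": RandomForestClassifier(),
        "SVM": SVC(),
        "XGB": XGBClassifier(),
        "NBC": GaussianNB(),
        "GB": GradientBoostingClassifier(),
        "KNN": KNeighborsClassifier(),
        "AB": AdaBoostClassifier(),
        "DT": DecisionTreeClassifier(),
    }
    # Mean cross-validated accuracy per classifier
    return {name: cross_val_score(model, X_selected, y, cv=cv).mean()
            for name, model in models.items()}
```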

4.9. Extreme Learning Machine (ELM)

For the training set $T = \{(a_i, b_i) : a_i \in \mathbb{R}^n,\ b_i \in \{-1, +1\},\ i = 1, 2, \ldots, N\}$, let $L$ denote the number of hidden nodes. The standard ELM solves the following optimization problem:
$\min_{\beta, \xi} \ \frac{1}{2}\|\beta\|^2 + \frac{c}{2}\sum_{i=1}^{N} \xi_i^2$
subject to $h(a_i)\beta = b_i - \xi_i$, $i = 1, \ldots, N$, where
$h(a_i) = [J(w_1, s_1, a_i), J(w_2, s_2, a_i), \ldots, J(w_L, s_L, a_i)]$
denotes the output vector of the hidden layer for input $a_i$, and the hidden weights $w_i$ and biases $s_i$ are randomly selected. The user-defined parameter $c > 0$ provides a tradeoff between the regularization term and the empirical error. Substituting the constraints, the optimization problem of the ELM is expressed as follows:
$\min_{\beta} \ \frac{1}{2}\|\beta\|^2 + \frac{c}{2}\sum_{i=1}^{N} \left(h(a_i)\beta - b_i\right)^2$
For the output weight vector β , the optimal value is expressed as follows:
$\hat{\beta} = H^{\dagger} B = \begin{cases} H^{T}\left(\frac{I}{c} + H H^{T}\right)^{-1} B, & \text{if } N < L \\ \left(\frac{I}{c} + H^{T} H\right)^{-1} H^{T} B, & \text{if } N \geq L \end{cases}$
where $H^{\dagger}$ denotes the Moore–Penrose generalized inverse of the hidden-layer output matrix $H$, $B = [b_1, \ldots, b_N]^{T}$ is the target vector, and the least-norm least-squares solution is obtained [60]. For the binary classification problem, the output function of the ELM with the optimal weight $\beta$ is analyzed as follows:
$f(a) = \operatorname{sign}\left(\sum_{i=1}^{L} \beta_i J(w_i, s_i, a)\right) = \operatorname{sign}\left(h(a)\beta\right)$
The ELM optimization problem uses the least squares loss function, which makes it sensitive to noise in the data, and the standard ELM performs poorly on imbalanced datasets. To manage class-imbalanced datasets efficiently, many variants of the ELM have been developed; these variants use various schemes to weigh the training points according to their class representation. In the past decade, deep learning has also been coupled with the ELM, since iterative tuning of the weights is not required. Some well-known examples include the residual ELM, the stacked ELM autoencoder, and the local receptive field ELM.
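A minimal sketch of the closed-form ELM training described above is given below. The sigmoid activation, the one-vs-rest encoding used to handle multi-class emotion labels, and the class name `SimpleELM` are illustrative assumptions, and only the $N \geq L$ branch of the solution is implemented.

```python
import numpy as np

class SimpleELM:
    """Minimal regularized ELM sketch with a sigmoid hidden layer.

    Hidden weights and biases are drawn randomly and never tuned; only the
    output weights beta are solved in closed form via
    beta = (I/c + H^T H)^{-1} H^T B (the N >= L case).
    """
    def __init__(self, n_hidden=500, c=1.0, seed=0):
        self.L, self.c, self.rng = n_hidden, c, np.random.default_rng(seed)

    def _hidden(self, X):
        # Sigmoid activation J(w, s, a) applied to all hidden nodes
        return 1.0 / (1.0 + np.exp(-(X @ self.W + self.s)))

    def fit(self, X, y):
        y = np.asarray(y)
        self.classes_ = np.unique(y)
        # One-vs-rest targets in {-1, +1}, one column per emotion class
        B = (y[:, None] == self.classes_[None, :]).astype(float) * 2 - 1
        self.W = self.rng.standard_normal((X.shape[1], self.L))
        self.s = self.rng.standard_normal(self.L)
        H = self._hidden(X)
        self.beta = np.linalg.solve(np.eye(self.L) / self.c + H.T @ H, H.T @ B)
        return self

    def predict(self, X):
        return self.classes_[np.argmax(self._hidden(X) @ self.beta, axis=1)]
```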

4.10. Twin ELM

A twin ELM (TELM) is used to obtain a better classification accuracy. The TELM solves a pair of quadratic programming problems (QPPs) and obtains two non-parallel hyperplanes in a random feature space [61]:
$f_1(a) := h(a)\beta_1 = 0 \quad \text{and} \quad f_2(a) := h(a)\beta_2 = 0$
The following pair of QPPs is solved by the TELM model:
$\min_{\beta_1, \xi} \ \frac{1}{2}\|Y\beta_1\|^2 + c_1 e_2^{T}\xi$
subject to
$-Z\beta_1 + \xi \geq e_2, \quad \xi \geq 0,$
and
$\min_{\beta_2, \eta} \ \frac{1}{2}\|Z\beta_2\|^2 + c_2 e_1^{T}\eta$
subject to
$Y\beta_2 + \eta \geq e_1, \quad \eta \geq 0,$
where $Y$ and $Z$ denote the hidden-layer output matrices of the two classes, $e_1$ and $e_2$ are vectors of ones of appropriate dimensions, and $c_1 > 0$ and $c_2 > 0$ are user-defined parameters.
The corresponding Wolfe dual problems are obtained to derive efficient solutions to the primal problems, and they are expressed as follows:
$\max_{\alpha} \ e_2^{T}\alpha - \frac{1}{2}\alpha^{T} Z \left(Y^{T}Y + I\right)^{-1} Z^{T}\alpha$
subject to $0 \leq \alpha \leq c_1 e$, and
$\max_{\gamma} \ e_1^{T}\gamma - \frac{1}{2}\gamma^{T} Y \left(Z^{T}Z + I\right)^{-1} Y^{T}\gamma$
subject to $0 \leq \gamma \leq c_2 e$, where $\alpha$ and $\gamma$ denote the vectors of Lagrange multipliers.
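As an illustration of how these dual problems can be handled in practice, the sketch below solves the two box-constrained QPs with a generic bounded solver and recovers the hyperplane weights via the twin-SVM-style formulas $\beta_1 = -(Y^{T}Y + I)^{-1}Z^{T}\alpha$ and $\beta_2 = (Z^{T}Z + I)^{-1}Y^{T}\gamma$. These recovery formulas, the distance-based decision rule, and the function names `train_telm_binary` and `predict_telm` are assumptions for illustration rather than equations quoted from this paper.

```python
import numpy as np
from scipy.optimize import minimize

def train_telm_binary(H_pos, H_neg, c1=1.0, c2=1.0):
    """Sketch of binary TELM training from hidden-layer outputs.

    H_pos (Y) and H_neg (Z) hold h(a_i) for the two classes; the box-constrained
    dual QPs are solved with L-BFGS-B as a stand-in for a dedicated QP solver.
    """
    def solve_dual(Q, c):
        e = np.ones(Q.shape[0])
        obj = lambda a: 0.5 * a @ Q @ a - e @ a        # negated dual objective (minimized)
        grad = lambda a: Q @ a - e
        res = minimize(obj, np.zeros_like(e), jac=grad, method="L-BFGS-B",
                       bounds=[(0.0, c)] * len(e))
        return res.x

    I = np.eye(H_pos.shape[1])
    G1 = np.linalg.inv(H_pos.T @ H_pos + I)            # (Y^T Y + I)^{-1}
    G2 = np.linalg.inv(H_neg.T @ H_neg + I)            # (Z^T Z + I)^{-1}
    alpha = solve_dual(H_neg @ G1 @ H_neg.T, c1)
    gamma = solve_dual(H_pos @ G2 @ H_pos.T, c2)
    beta1 = -G1 @ H_neg.T @ alpha                      # first hyperplane weights
    beta2 = G2 @ H_pos.T @ gamma                       # second hyperplane weights
    return beta1, beta2

def predict_telm(H, beta1, beta2):
    """Assign each sample to the class whose hyperplane lies closer."""
    d1 = np.abs(H @ beta1) / np.linalg.norm(beta1)
    d2 = np.abs(H @ beta2) / np.linalg.norm(beta2)
    return np.where(d1 <= d2, +1, -1)
```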

5. Results and Discussion

The proposed work is tested on four robust datasets: the EMOVO [11], RAVDESS [12], SAVEE [13], and Berlin Emo-DB [14] datasets. EMOVO is an Italian speech database, Berlin Emo-DB contains speech recorded by German actors, and the SAVEE and RAVDESS datasets are in English. The EMOVO, SAVEE, and Emo-DB datasets each contain seven emotions, while the RAVDESS speech dataset contains eight emotions. The EMOVO dataset consists of six actors (three males and three females) and about 588 observations, with a sampling rate of approximately 48 kHz. The RAVDESS dataset consists of 24 actors (12 males and 12 females) and about 1440 observations, with a sampling rate of approximately 48 kHz. The SAVEE dataset consists of four male actors and 480 observations, with a sampling rate of approximately 44.1 kHz. The Emo-DB dataset consists of 10 actors (5 males and 5 females) and 535 observations, with a sampling rate of approximately 16 kHz. A 10-fold cross-validation method is used in our experiments. The proposed techniques are implemented, and the results are obtained as described below.
Table 1 shows a performance analysis of the classifiers with the transforms for the OIFS technique on the EMOVO dataset; a high classification accuracy of 78.65% is obtained for the FAWT with the twin ELM classifier. Table 2 shows a performance analysis of the classifiers with the transforms for the HHO feature selection technique on the EMOVO dataset; a high classification accuracy of 79.98% is obtained for the superlet transform with the twin ELM classifier. Table 3 shows a performance analysis of the classifiers with the transforms for the CSA feature selection technique on the EMOVO dataset; a high classification accuracy of 80.63% is obtained for the chirplet transform with the twin ELM classifier.
Table 4 shows a performance analysis of the classifiers with the transforms for the OIFS technique on the RAVDESS dataset; a high classification accuracy of 84.19% is obtained for the FAWT with the twin ELM classifier. Table 5 shows a performance analysis of the classifiers with the transforms for the HHO feature selection technique on the RAVDESS dataset; a high classification accuracy of 85.76% is obtained for the FAWT with the twin ELM classifier. Table 6 shows a performance analysis of the classifiers with the transforms for the CSA feature selection technique on the RAVDESS dataset; a high classification accuracy of 85.54% is obtained for the chirplet transform with the twin ELM classifier.
Table 7 shows a performance analysis of the classifiers with the transforms for the OIFS technique on the SAVEE dataset; a high classification accuracy of 83.94% is obtained for the chirplet transform with the twin ELM classifier. Table 8 shows a performance analysis of the classifiers with the transforms for the HHO feature selection technique on the SAVEE dataset; a high classification accuracy of 83.21% is obtained for the chirplet transform with the twin ELM classifier. Table 9 shows a performance analysis of the classifiers with the transforms for the CSA feature selection technique on the SAVEE dataset; a high classification accuracy of 83.11% is obtained for the KSTDIS transform with the twin ELM classifier.
Table 10 shows a performance analysis of the classifiers with the transforms for the OIFS technique on the Berlin Emo-DB dataset; a high classification accuracy of 87.46% is obtained for the superlet transform with the twin ELM classifier. Table 11 shows a performance analysis of the classifiers with the transforms for the HHO feature selection technique on the Berlin Emo-DB dataset; a high classification accuracy of 87.57% is obtained for the KSTDIS transform with the twin ELM classifier. Table 12 shows a performance analysis of the classifiers with the transforms for the CSA feature selection technique on the Berlin Emo-DB dataset; a high classification accuracy of 89.77% is obtained for the KSTDIS transform with the twin ELM classifier.
Figure 3 displays a performance analysis of the classifiers with the transforms for the OIFS technique on the EMOVO dataset. Figure 3 shows that a low classification accuracy of 60.23% is obtained when the FST is implemented with the GB classifier. Figure 4 displays a performance analysis of the classifiers with the transforms for the OIFS technique on the RAVDESS dataset. Figure 4 shows that a low classification accuracy of 69.76% is obtained when the synchrosqueezing transform is implemented with the GB classifier. Figure 5 displays a performance analysis of the classifiers with the transforms for the HHO feature selection technique on the SAVEE dataset. Figure 5 shows that a low classification accuracy of 70.84% is obtained when the synchrosqueezing transform is implemented with the GB classifier. Figure 6 displays a performance analysis of the classifiers with the transforms for the CSA feature selection technique on the Berlin Emo-DB dataset. Figure 6 shows that a low classification accuracy of 83.12% is obtained when the FST is implemented with the DT classifier.

Comparison with Previous Works

The results obtained here are compared with previously obtained results for all four databases, and the results are tabulated in Table 13, Table 14, Table 15 and Table 16.
An analysis of Table 13 shows that the proposed methods produce very good results compared with previous works; the best classification accuracy of 80.63% was obtained for the Chirplet + CSA + TELM combination on the EMOVO dataset. An analysis of Table 14 shows that the proposed methods once again compare favorably with previous works, with a high classification accuracy of 85.76% obtained for the FAWT + HHO + TELM combination on the RAVDESS dataset. An analysis of Table 15 shows that a high classification accuracy of 83.94% was obtained for the Chirplet + OIFS + TELM combination on the SAVEE dataset, which compares well with previous works. An analysis of Table 16 shows that a high classification accuracy of 89.77% was obtained for the KSTDIS + CSA + TELM combination on the Berlin Emo-DB dataset, surpassing the results of previous works.

6. Conclusions and Future Works

SER is considered an important aspect of automatic speech recognition, where feature extraction, feature selection, and classification using machine learning and deep learning constitute the general pipeline. SER is an important area of research for the advancement of human–computer and human–robot interactions. For automated emotion recognition, the most common data sources are physiological data, obtained from EEG and wearable sensors, and speech-based data; single-modal or multimodal data fusion can also be carried out. In this work, transform-dependent optimization techniques with machine learning classifiers are utilized for the classification of speech emotions. The best results are as follows: a classification accuracy of 89.77% with the KSTDIS + CSA + TELM combination on the Berlin Emo-DB dataset, 80.63% with the Chirplet + CSA + TELM combination on the EMOVO dataset, 85.76% with the FAWT + HHO + TELM combination on the RAVDESS dataset, and 83.94% with the Chirplet + OIFS + TELM combination on the SAVEE dataset. The proposed combinations can be applied for emotion detection in various domains. Future work includes extending the methods to different types of speech-related pathologies, as well as implementing more efficient transforms coupled with optimization and machine learning/deep learning algorithms such as Graph Neural Networks (GNNs), Variational Autoencoders (VAEs), Deep Belief Networks (DBNs), Generative Adversarial Networks (GANs), and Transformer networks, so that a higher classification accuracy can be obtained. We also plan to deploy this work in telemedicine applications and cloud-based domains so that remote telehealth services can be substantially improved.

Author Contributions

S.K.P.—Conceptualization, Methodology, Software, Validation, Formal Analysis, Investigation, Resources, Data Curation, Writing—Original draft preparation; D.-O.W.—Validation, Formal Analysis, Writing—review and editing, visualization, supervision, project administration, funding acquisition. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the National Research Foundation of Korea (NRF) grant funded by Korea government (MIST) (No. 2022R1A5A8019303) and partly supported by the Bio&Medical Technology Development Program of the NRF funded by the Korean government (MSIT) (No. RS-2023-00223501).

Institutional Review Board Statement

Not Applicable.

Data Availability Statement

The original data presented in this study are openly available in the EMOVO dataset [11], RAVDESS dataset [12], SAVEE dataset [13], and Berlin EMO-DB dataset [14].

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Zhu, J.; Wu, X.; Lv, Z. Speech emotion recognition algorithm based on SVM. Comput. Syst. Appl. 2011, 20, 87–91. [Google Scholar]
  2. Kim, Y.; Lee, H.; Provost, E.M. Deep learning for robust feature generation in audio-visual emotion recognition. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP ‘13), Vancouver, BC, Canada, 26–31 May 2013. [Google Scholar]
  3. Shimmura, T. Analyzing prosodic components of normal speech and emotive speech. Prepr. Acoust. Soc. Jpn. 1995, 18, 3–18. [Google Scholar]
  4. Bhaskar, J.; Sruthi, K.; Nedungadi, P. Hybrid Approach for Emotion Classification of Audio Conversation based on text and speech mining. Procedia Comput. Sci. 2015, 46, 635–643. [Google Scholar] [CrossRef]
  5. Pengjuan, G.; Dongmei, J. Research on emotional speech recognition based on pitch. Appl. Res. Comput. 2007, 24, 101–103. [Google Scholar]
  6. Rani, P.; Kotwal, S.; Manhas, J.; Sharma, V.; Sharma, S. Machine Learning and Deep Learning Based Computational Approaches in Automatic Microorganisms Image Recognition: Methodologies, Challenges, and Developments. Arch. Comput. Methods Eng. 2022, 29, 1801–1837. [Google Scholar] [CrossRef]
  7. Sim, J.H.; Yoo, J.; Lee, M.L.; Han, S.H.; Han, S.K.; Lee, J.Y.; Yi, S.W.; Nam, J.; Kim, D.S.; Yang, Y.S. Deep Learning Model for Cosmetic Gel Classification Based on a Short-Time Fourier Transform and Spectrogram. ACS Appl. Mater. Interfaces 2024, 16, 25825. [Google Scholar] [CrossRef]
  8. Nwe, T.L.; Foo, S.W.; de Silva, L.C. Speech emotion recognition using hidden Markov models. Speech Commun. 2003, 41, 603–623. [Google Scholar] [CrossRef]
  9. Zhao, L.; Jiang, C.; Zou, C.; Wu, Z. Study on emotional feature analysis and recognition in speech. Acta Electron. Sin. 2004, 32, 606–609. [Google Scholar]
  10. Yongzhao, Z.; Peng, C. Research and implementation of emotional feature extraction and recognition in speech signal. J. Jiangsu Univ. 2005, 26, 72–75. [Google Scholar]
  11. Costantini, G.; Iaderola, I.; Paoloni, A.; Todisco, M. EMOVO Corpus: An Italian emotional speech database. In Proceedings of the International Conference on Language Resources and Evaluation (LREC 2014), European Language Resources Association (ELRA), Reykjavik, Iceland, 26–31 May 2014; pp. 3501–3504. [Google Scholar]
  12. Livingstone, S.R.; Russo, F.A. The ryerson audio-visual database of emotional speech and song (RAVDESS): A dynamic, multimodal set of facial and vocal expressions in North American English. PLoS ONE 2018, 13, e0196391. [Google Scholar] [CrossRef]
  13. Haq, S.; Jackson, P.J. Machine Audition: Principles, Algorithms and Systems Multimodal Emotion Recognition; IGI Global: Hershey, PA, USA, 2010; pp. 398–423. [Google Scholar]
  14. Burkhardt, F.; Paeschke, A.; Rolfes, M.; Sendlmeier, W.F.; Weiss, B. A database of German emotional speech. In Proceedings of the Ninth European Conference on Speech Communication and Technology, Lissabon, Portugal, 4–8 September 2005; pp. 1517–1520. [Google Scholar]
  15. Assunção, G.; Menezes, P.; Perdigão, F. Speaker awareness for speech emotion recognition. Int. J. Online Biomed. Eng. 2020, 16, 15–22. [Google Scholar] [CrossRef]
  16. Haider, F.; Pollak, S.; Albert, P.; Luz, S. Emotion recognition in low-resource settings: An evaluation of automatic feature selection methods. Comput. Speech Lang. 2020, 65, 101119. [Google Scholar] [CrossRef]
  17. Özseven, T. A novel features selection method for speech emotion recognition. Appl. Acoust. 2019, 146, 320–326. [Google Scholar] [CrossRef]
  18. Latif, S.; Rana, R.; Younis, S.; Qadir, J.; Epps, J. Transfer learning for improving speech emotion classification accuracy. In Proceedings of the Interspeech 2018, Hyderabad, India, 2–6 September 2018; pp. 257–261. [Google Scholar]
  19. Jason, C.A.; Kumar, S. An appraisal on speech and emotion recognition technologies based on machine learning. Int. J. Recent Technol. Eng. 2020, 8, 2266–2276. [Google Scholar]
  20. Kwon, S. A CNN-assisted enhanced audio signal processing for speech emotion recognition. Sensors 2020, 20, 183. [Google Scholar]
  21. Christy, A.; Vaithyasubramanian, S.; Jesudoss, A.; Praveena, M.A. Multimodal speech emotion recognition and classification using convolutional neural network techniques. Int. J. Speech Technol. 2020, 23, 381–388. [Google Scholar] [CrossRef]
  22. Mansouri-Benssassi, E.; Ye, J. Speech emotion recognition with early visual cross-modal enhancement using spiking neural networks. In Proceedings of the 2019 International Joint Conference on Neural Networks (IJCNN), Budapest, Hungary, 14–19 July 2019; pp. 1–8. [Google Scholar]
  23. Jalal, M.A.; Loweimi, E.; Moore, R.K.; Hain, T. Learning temporal clusters using capsule routing for speech emotion recognition. In Proceedings of the Interspeech, Graz, Austria, 15–19 September 2019; pp. 1701–1705. [Google Scholar]
  24. Bhavan, A.; Chauhan, P.; Shah, R.R. Bagged support vector machines for emotion recognition from speech. Knowl.-Based Syst. 2019, 184, 104886. [Google Scholar] [CrossRef]
  25. Zeng, Y.; Mao, H.; Peng, D.; Yi, Z. Spectrogram based multi-task audio classification. Multimed. Tools Appl. 2019, 78, 3705–3722. [Google Scholar] [CrossRef]
  26. Liu, G.K. Evaluating gammatone frequency cepstral coefficients with neural networks for emotion recognition from speech. arXiv 2018, arXiv:1806.09010. [Google Scholar]
  27. Shegokar, P.; Sircar, P. Continuous wavelet transform based speech emotion recognition. In Proceedings of the 2016 10th International Conference on Signal Processing and Communication Systems (ICSPCS), Surfers Paradise, Gold Coast, Australia, 19–21 December 2016; pp. 1–8. [Google Scholar]
  28. Vasuki, P.; Aravindan, C. Hierarchical classifier design for speech emotion recognition in the mixed-cultural environment. J. Exp. Theor. Artif. Intell. 2020, 33, 451–466. [Google Scholar] [CrossRef]
  29. Nguyen, D.; Sridharan, S.; Nguyen, D.T.; Denman, S.; Tran, S.N.; Zeng, R.; Fookes, C. Joint deep cross-domain transfer learning for emotion recognition. arXiv 2020, arXiv:2003.11136. [Google Scholar] [CrossRef]
  30. Mekruksavanich, S.; Jitpattanakul, A.; Hnoohom, N. Negative emotion recognition using deep learning for Thai language. In Proceedings of the 2020 Joint International Conference on Digital Arts, Media and Technology with ECTI Northern Section Conference on Electrical, Electronics, Computer and Telecommunications Engineering (ECTI DAMT & NCON), Pattaya, Thailand, 11–14 March 2020; pp. 71–74. [Google Scholar]
  31. Hajarolasvadi, N.; Demirel, H. 3D CNN-Based speech emotion recognition using K-means clustering and spectrograms. Entropy 2019, 21, 479. [Google Scholar] [CrossRef] [PubMed]
  32. Tzinis, E.; Paraskevopoulos, G.; Baziotis, C.; Potamianos, A. Integrating recurrence dynamics for speech emotion recognition. In Proceedings of the Interspeech 2018, Hyderabad, India, 2–6 September 2018; pp. 927–931. [Google Scholar]
  33. Sugan, N.; Srinivas, N.S.; Kar, N.; Kumar, L.; Nath, M.; Kanhe, A. Performance comparison of different cepstral features for speech emotion recognition. In Proceedings of the International CET Conference on Control, Communication, and Computing (IC4), Trivandrum, Thiruvananthapuram, India, 5–7 July 2018; pp. 266–271. [Google Scholar]
  34. Yogesh, C.; Hariharan, M.; Ngadiran, R.; Adom, A.H.; Yaacob, S.; Berkai, C.; Polat, K. A new hybrid PSO assisted biogeography-based optimization for emotion and stress recognition from speech signal. Expert Syst. Appl. 2017, 69, 149–158. [Google Scholar]
  35. Chen, L.; Su, W.; Feng, Y.; Wu, M.; She, J.; Hirota, K. Two-layer fuzzy multiple random forest for speech emotion recognition in human–robot interaction. Inform. Sci. 2020, 509, 150–163. [Google Scholar] [CrossRef]
  36. Daneshfar, F.; Kabudian, S.J. Speech emotion recognition using discriminative dimension reduction by employing a modified quantum-behaved particle swarm optimization algorithm. Multimed. Tools Appl. 2020, 79, 1261–1289. [Google Scholar] [CrossRef]
  37. Wang, K.; Su, G.; Liu, L.; Wang, S. Wavelet packet analysis for speaker-independent emotion recognition. Neurocomputing 2020, 398, 257–264. [Google Scholar] [CrossRef]
  38. Guizzo, E.; Weyde, T.; Leveson, J.B. Multi-time-scale convolution for emotion recognition from speech audio signals. In Proceedings of the ICASSP 2020—2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Barcelona, Spain, 4–8 May 2020; pp. 6489–6493. [Google Scholar]
  39. Zamil, A.A.A.; Hasan, S.; Baki, S.M.J.; Adam, J.M.; Zaman, I. Emotion detection from speech signals using voting mechanism on classified frames. In Proceedings of the 2019 International Conference on Robotics, Electrical and Signal Processing Techniques (ICREST), Dhaka, Bangladesh, 10–12 January 2019; pp. 281–285. [Google Scholar]
  40. Álvarez, A.; Sierra, B.; Arruti, A.; López-Gil, J.-M.; Garay-Vitoria, N. Classifier subset selection for the stacked generalization method applied to emotion recognition in speech. Sensors 2016, 16, 21. [Google Scholar] [CrossRef]
  41. Badshah, A.M.; Ahmad, J.; Lee, M.Y.; Baik, S.W. Divide-and-conquer based ensemble to spot emotions in speech using MFCC and random forest. In Proceedings of the 2nd International Integrated Conference & Concert on Convergence, Saint Petersburg, Russia, 7–14 August 2016; pp. 1–8. [Google Scholar]
  42. Yu, G.; Wang, Z.; Zhao, P. Multisynchrosqueezing transform. IEEE Trans. Ind. Electron. 2019, 66, 5441–5455. [Google Scholar] [CrossRef]
  43. Khoa, N.M.; Van Dai, L. Detection and classification of power quality disturbances in power system using modified-combination between the Stockwell transform and decision tree methods. Energies 2020, 13, 14. [Google Scholar] [CrossRef]
  44. Barone, A.; Esposito, F.; Magee, C.J.; Scott, A.C. Theory and applications of the Sine-Gordon equation. Riv. Nuovo C. 1971, 1, 227–267. [Google Scholar] [CrossRef]
  45. Pesin, Y.; Zelerowicz, A.; Zhao, Y. Time Rescaling of Lyapunov Exponents. In Advances in Dynamics, Patterns, Cognition; Nonlinear Systems and Complexity; Aranson, I., Pikovsky, A., Rulkov, N., Tsimring, L., Eds.; Springer: Cham, Switzerland, 2017; Volume 20. [Google Scholar] [CrossRef]
  46. Ramírez-Parietti, I.; Contreras-Reyes, J.E.; Idrovo-Aguirre, B.J. Cross-sample entropy estimation for time series analysis: A nonparametric approach. Nonlinear Dyn 2021, 105, 2485–2508. [Google Scholar] [CrossRef]
  47. Sharma, M.; Pachori, R.B.; Acharya, U.R. A new approach to characterize epileptic seizures using analytic time-frequency flexible wavelet transform and fractal dimension. Pattern Recognit. Lett. 2017, 94, 172–179. [Google Scholar] [CrossRef]
  48. Mann, S.; Haykin, S. The chirplet transform: Physical considerations. IEEE Trans. Signal Process. 1995, 43, 2745–2761. [Google Scholar] [CrossRef]
  49. Srikanth, P.; Koley, C. An intelligent algorithm for autorecognition of power system faults using superlets. Sustain. Energy Grids Netw. 2021, 26, 100450. [Google Scholar] [CrossRef]
  50. Alpar, O.; Krejcar, O. Frequency and Time Localization in Biometrics: STFT vs. CWT. In Recent Trends and Future Technology in Applied Intelligence. IEA/AIE 2018; Lecture Notes in Computer Science; Mouhoub, M., Sadaoui, S., Ait Mohamed, O., Ali, M., Eds.; Springer: Cham, Switzerland, 2018; Volume 10868. [Google Scholar] [CrossRef]
  51. Ortiz-Echeverri, C.J.; Rodríguez-Reséndiz, J.; Garduño-Aparicio, M. An approach to STFT and CWT learning through music hands-on labs. Comput. Appl. Eng. Educ. 2018, 26, 2026–2035. [Google Scholar] [CrossRef]
  52. Moca, V.V.; Bârzan, H.; Nagy-Dăbâcan, A.; Mureșan, R.C. Time-frequency super-resolution with superlets. Nat. Commun. 2021, 12, 337. [Google Scholar] [CrossRef]
  53. Guo, Y.; Wan, L.; Sheng, X.; Wang, G.; Kang, S.; Zhou, H.; Zhang, X. The Application of Superlet Transform in EEG-Based Motor Imagery Classification of Unilateral Knee Movement. In ICAUS 2023, Proceedings of the 3rd 2023 International Conference on Autonomous Unmanned Systems (3rd ICAUS 2023), Nanjing, China, 9–11 September 2023; Qu, Y., Gu, M., Niu, Y., Fu, W., Eds.; Lecture Notes in Electrical Engineering; Springer: Singapore, 2024; Volume 1173, p. 1173. [Google Scholar] [CrossRef]
  54. Lee, J.; Lim, H.; Kim, D.-W. Approximating mutual information for multi-label feature selection. Electron. Lett. 2012, 48, 929–930. [Google Scholar] [CrossRef]
  55. Yan, K.; Ma, L.; Dai, Y.; Shen, W.; Ji, Z.; Xie, D. Cost-sensitive and sequential feature selection for chiller fault detection and diagnosis. Int. J. Refrig. 2018, 86, 401–409. [Google Scholar] [CrossRef]
  56. Abdel-Basset, M.; Ding, W.; El-Shahat, D. A hybrid Harris Hawks optimization algorithm with simulated annealing for feature selection. Artif. Intell. Rev. 2020, 54, 593–637. [Google Scholar] [CrossRef]
  57. Nourani, Y.; Andresen, B. A comparison of simulated annealing cooling strategies. J. Phys. A Math. Gen. 1998, 31, 8373–8385. [Google Scholar] [CrossRef]
  58. Ji, S.; Li, R.; Shen, S.; Li, B.; Wang, Z. Heartbeat classification based on multifeature combination and stacking-dwknn algorithm. J. Healthc. Eng. 2021, 2021, 8811837. [Google Scholar] [CrossRef] [PubMed]
  59. Said, M.; El-Rifaie, A.M.; Tolba, M.A.; Houssein, E.H.; Deb, S. An Efficient Chameleon Swarm Algorithm for Economic Load Dispatch Problem. Mathematics 2021, 9, 2770. [Google Scholar] [CrossRef]
  60. Castaño, A.; Fernández-Navarro, F.; Hervás-Martínez, C. PCA-ELM: A robust and pruned extreme learning machine approach based on principal component analysis. Neural Process. Lett. 2013, 37, 377–392. [Google Scholar] [CrossRef]
  61. van Heeswijk, M.; Miche, Y.; Oja, E.; Lendasse, A. GPU-accelerated and parallelized ELM ensembles for large-scale regression. Neurocomputing 2011, 74, 2430–2437. [Google Scholar] [CrossRef]
Figure 1. Simplified illustration of this work.
Figure 2. Feature selection method utilizing the optimization algorithm and KNN.
Figure 3. Performance analysis of classifiers with transforms for OIFS technique on EMOVO dataset.
Figure 4. Performance analysis of classifiers with transforms for OIFS technique on RAVDESS dataset.
Figure 5. Performance analysis of classifiers with transforms for HHO feature selection technique on SAVEE dataset.
Figure 6. Performance analysis of classifiers with transforms for CSA feature selection technique on Berlin EMO-DB dataset.
Table 1. Performance analysis of classifiers’ accuracy with transforms for OIFS technique on EMOVO dataset.
Classifier | Synchrosqueezing Transform | FST | KSTDIS | FAWT | Chirplet | Superlet
RF | 65.56 | 66.45 | 67.45 | 65.54 | 66.78 | 69.65
SVM | 73.23 | 71.23 | 70.23 | 74.21 | 73.45 | 76.78
XGB | 69.34 | 69.45 | 68.56 | 66.41 | 63.45 | 67.89
NBC | 65.56 | 65.78 | 65.78 | 66.78 | 68.89 | 63.23
GB | 61.23 | 60.23 | 66.78 | 64.58 | 66.79 | 67.45
KNN | 63.45 | 66.55 | 66.78 | 67.78 | 66.23 | 68.89
AB | 70.23 | 71.23 | 69.91 | 66.67 | 66.69 | 62.34
DT | 67.45 | 69.12 | 69.90 | 68.41 | 68.02 | 66.58
ELM | 72.34 | 70.23 | 72.34 | 73.34 | 73.21 | 73.02
TELM | 75.56 | 76.56 | 77.43 | 78.65 | 73.35 | 78.01
Table 2. Performance analysis of classifiers’ accuracy with transforms for HHO feature selection technique on EMOVO dataset.
Classifier | Synchrosqueezing Transform | FST | KSTDIS | FAWT | Chirplet | Superlet
RF | 66.23 | 65.56 | 68.33 | 66.09 | 62.09 | 68.09
SVM | 75.56 | 72.78 | 72.45 | 75.87 | 70.89 | 77.68
XGB | 70.87 | 70.98 | 69.67 | 68.77 | 61.78 | 68.32
NBC | 67.96 | 68.44 | 69.87 | 69.75 | 62.55 | 68.45
GB | 63.34 | 63.34 | 69.76 | 66.43 | 63.45 | 69.78
KNN | 64.57 | 64.56 | 68.21 | 69.24 | 68.11 | 69.98
AB | 72.89 | 74.89 | 66.34 | 69.68 | 62.27 | 65.23
DT | 68.33 | 67.22 | 67.56 | 69.98 | 65.89 | 68.44
ELM | 75.78 | 73.12 | 70.78 | 71.23 | 77.67 | 77.57
TELM | 74.90 | 72.57 | 72.90 | 70.19 | 74.89 | 79.98
Table 3. Performance analysis of classifiers’ accuracy with transforms for CSA feature selection technique on EMOVO dataset.
Classifier | Synchrosqueezing Transform | FST | KSTDIS | FAWT | Chirplet | Superlet
RF | 69.23 | 69.90 | 69.09 | 69.89 | 66.66 | 68.09
SVM | 78.45 | 75.98 | 75.89 | 79.98 | 74.78 | 77.98
XGB | 73.67 | 78.77 | 73.88 | 72.78 | 66.98 | 68.78
NBC | 69.87 | 69.65 | 74.76 | 73.67 | 67.54 | 68.67
GB | 69.55 | 69.34 | 73.45 | 72.44 | 67.22 | 69.46
KNN | 68.43 | 69.56 | 75.34 | 74.32 | 69.34 | 69.33
AB | 77.12 | 79.43 | 69.23 | 73.13 | 67.56 | 65.21
DT | 69.34 | 68.21 | 72.11 | 74.45 | 69.78 | 68.45
ELM | 79.34 | 78.11 | 74.23 | 75.67 | 79.98 | 77.67
TELM | 78.78 | 80.23 | 75.67 | 74.78 | 80.63 | 79.89
Table 4. Performance analysis of classifiers’ accuracy with transforms for OIFS technique on RAVDESS dataset.
Classifier | Synchrosqueezing Transform | FST | KSTDIS | FAWT | Chirplet | Superlet
RF | 75.34 | 76.09 | 77.33 | 75.78 | 82.09 | 79.09
SVM | 72.67 | 75.89 | 79.44 | 78.76 | 84.34 | 79.77
XGB | 79.88 | 79.88 | 78.78 | 78.55 | 83.55 | 81.63
NBC | 69.98 | 75.43 | 75.34 | 79.43 | 82.78 | 75.24
GB | 69.76 | 70.11 | 76.67 | 74.21 | 80.94 | 79.56
KNN | 71.54 | 76.12 | 78.87 | 77.68 | 82.22 | 79.77
AB | 75.34 | 77.38 | 81.63 | 79.97 | 84.12 | 81.87
DT | 73.58 | 79.98 | 81.21 | 79.46 | 78.45 | 74.56
ELM | 74.92 | 77.67 | 80.68 | 79.97 | 79.76 | 78.22
TELM | 76.11 | 79.45 | 82.95 | 84.19 | 83.33 | 82.13
Table 5. Performance analysis of classifiers’ accuracy with transforms for HHO feature selection technique on RAVDESS dataset.
Classifier | Synchrosqueezing Transform | FST | KSTDIS | FAWT | Chirplet | Superlet
RF | 78.67 | 79.88 | 79.09 | 78.98 | 83.09 | 82.11
SVM | 75.89 | 77.98 | 78.87 | 79.77 | 82.22 | 82.23
XGB | 78.88 | 78.67 | 79.66 | 79.67 | 82.34 | 84.45
NBC | 71.76 | 78.44 | 74.78 | 83.45 | 84.67 | 82.87
GB | 72.44 | 73.32 | 79.98 | 79.67 | 81.87 | 84.64
KNN | 73.34 | 77.13 | 79.22 | 82.89 | 80.54 | 83.22
AB | 77.56 | 79.67 | 80.34 | 83.22 | 81.33 | 84.34
DT | 78.75 | 77.87 | 80.56 | 82.23 | 79.23 | 80.78
ELM | 76.11 | 78.69 | 82.76 | 83.45 | 80.45 | 80.96
TELM | 75.23 | 78.87 | 84.11 | 85.76 | 84.67 | 84.11
Table 6. Performance analysis of classifiers’ accuracy with transforms for CSA feature selection technique on RAVDESS dataset.
Classifier | Synchrosqueezing Transform | FST | KSTDIS | FAWT | Chirplet | Superlet
RF | 79.03 | 80.98 | 82.11 | 81.22 | 82.09 | 81.11
SVM | 78.44 | 79.89 | 84.23 | 82.34 | 81.99 | 84.26
XGB | 79.56 | 79.76 | 83.78 | 83.56 | 83.87 | 85.08
NBC | 74.78 | 80.48 | 84.73 | 85.07 | 82.56 | 84.93
GB | 75.98 | 77.71 | 84.24 | 83.77 | 80.73 | 85.22
KNN | 78.22 | 79.23 | 83.57 | 85.15 | 83.48 | 85.01
AB | 79.12 | 82.48 | 81.65 | 84.41 | 85.05 | 81.68
DT | 75.34 | 79.90 | 82.77 | 84.99 | 82.67 | 82.98
ELM | 79.58 | 82.03 | 84.89 | 85.08 | 84.98 | 83.32
TELM | 79.87 | 84.23 | 85.02 | 85.24 | 85.54 | 85.35
Table 7. Performance analysis of classifiers’ accuracy with transforms for OIFS technique on SAVEE dataset.
Classifier | Synchrosqueezing Transform | FST | KSTDIS | FAWT | Chirplet | Superlet
RF | 74.03 | 78.76 | 79.09 | 76.55 | 81.09 | 81.11
SVM | 70.44 | 78.56 | 81.88 | 79.67 | 82.67 | 82.14
XGB | 78.56 | 77.77 | 82.76 | 79.87 | 82.83 | 82.56
NBC | 68.78 | 79.89 | 82.23 | 80.22 | 80.45 | 83.87
GB | 72.97 | 80.93 | 79.45 | 77.12 | 81.67 | 83.23
KNN | 72.34 | 79.21 | 79.67 | 79.34 | 80.88 | 80.56
AB | 77.24 | 79.22 | 82.87 | 82.68 | 83.91 | 80.78
DT | 76.67 | 81.45 | 80.22 | 81.96 | 79.35 | 78.44
ELM | 77.87 | 79.67 | 81.11 | 82.31 | 80.78 | 79.31
TELM | 78.23 | 82.87 | 83.25 | 83.74 | 83.94 | 82.03
Table 8. Performance analysis of classifiers’ accuracy with transforms for HHO feature selection technique on SAVEE dataset.
Classifier | Synchrosqueezing Transform | FST | KSTDIS | FAWT | Chirplet | Superlet
RF | 72.03 | 75.32 | 78.09 | 77.55 | 80.09 | 82.56
SVM | 72.44 | 77.44 | 80.88 | 78.67 | 80.88 | 81.77
XGB | 79.57 | 75.56 | 81.87 | 78.87 | 81.75 | 83.87
NBC | 70.89 | 78.78 | 80.35 | 81.43 | 81.34 | 82.22
GB | 70.84 | 81.98 | 78.67 | 79.22 | 80.57 | 84.12
KNN | 74.23 | 78.65 | 78.66 | 78.15 | 79.86 | 82.40
AB | 75.47 | 78.32 | 80.54 | 83.67 | 82.12 | 83.57
DT | 77.89 | 80.11 | 81.23 | 82.87 | 80.34 | 81.87
ELM | 75.12 | 78.45 | 80.21 | 83.44 | 82.78 | 82.65
TELM | 76.32 | 81.78 | 82.34 | 81.31 | 83.21 | 83.12
Table 9. Performance analysis of classifiers’ accuracy with transforms for CSA feature selection technique on SAVEE dataset.
Classifier | Synchrosqueezing Transform | FST | KSTDIS | FAWT | Chirplet | Superlet
RF | 73.78 | 74.08 | 76.87 | 79.11 | 79.08 | 82.02
SVM | 73.66 | 79.77 | 82.67 | 81.23 | 82.67 | 80.34
XGB | 78.54 | 77.89 | 83.06 | 79.45 | 80.56 | 82.56
NBC | 72.34 | 79.65 | 79.78 | 82.76 | 82.78 | 81.87
GB | 72.67 | 80.34 | 76.98 | 82.87 | 81.76 | 82.65
KNN | 75.87 | 75.56 | 79.23 | 80.43 | 82.43 | 80.32
AB | 76.23 | 82.78 | 77.45 | 82.21 | 80.21 | 82.13
DT | 78.12 | 82.62 | 80.67 | 81.23 | 79.65 | 80.67
ELM | 74.34 | 81.13 | 78.87 | 80.56 | 81.78 | 81.66
TELM | 75.67 | 79.46 | 83.11 | 82.76 | 82.87 | 82.99
Table 10. Performance analysis of classifiers’ accuracy with transforms for OIFS technique on Berlin Emo-DB dataset.
Classifier | Synchrosqueezing Transform | FST | KSTDIS | FAWT | Chirplet | Superlet
RF | 84.02 | 79.87 | 84.02 | 79.01 | 84.02 | 84.21
SVM | 80.34 | 79.32 | 83.24 | 81.22 | 85.33 | 86.32
XGB | 81.77 | 80.34 | 85.67 | 81.36 | 86.56 | 86.56
NBC | 82.89 | 79.57 | 84.87 | 82.87 | 84.87 | 84.87
GB | 80.78 | 81.86 | 83.21 | 79.43 | 85.54 | 85.32
KNN | 79.22 | 82.43 | 84.23 | 83.21 | 82.22 | 86.44
AB | 78.34 | 83.21 | 83.55 | 84.55 | 85.13 | 85.56
DT | 80.67 | 84.02 | 85.67 | 83.78 | 86.54 | 85.87
ELM | 82.97 | 85.34 | 83.89 | 85.98 | 83.76 | 86.21
TELM | 83.12 | 84.47 | 86.21 | 86.22 | 86.89 | 87.46
Table 11. Performance analysis of classifiers’ accuracy with transforms for HHO feature selection technique on Berlin Emo-DB dataset.
Classifier | Synchrosqueezing Transform | FST | KSTDIS | FAWT | Chirplet | Superlet
RF | 83.02 | 81.45 | 83.09 | 81.05 | 83.11 | 85.02
SVM | 82.34 | 80.56 | 85.54 | 83.60 | 84.23 | 85.31
XGB | 80.56 | 81.76 | 84.21 | 84.72 | 87.45 | 87.44
NBC | 83.87 | 82.23 | 85.34 | 83.86 | 85.76 | 85.67
GB | 82.54 | 82.23 | 82.66 | 82.22 | 86.54 | 86.89
KNN | 79.23 | 83.46 | 86.78 | 84.11 | 84.21 | 85.87
AB | 80.14 | 84.78 | 82.98 | 85.41 | 86.34 | 86.64
DT | 82.56 | 82.98 | 87.22 | 83.56 | 85.78 | 84.13
ELM | 83.77 | 86.21 | 86.12 | 86.78 | 85.65 | 85.67
TELM | 84.89 | 85.22 | 87.57 | 85.97 | 87.19 | 86.87
Table 12. Performance analysis of classifiers’ accuracy with transforms for CSA feature selection technique on Berlin Emo-DB dataset.
Classifier | Synchrosqueezing Transform | FST | KSTDIS | FAWT | Chirplet | Superlet
RF | 85.02 | 84.21 | 85.02 | 84.02 | 85.02 | 84.21
SVM | 85.34 | 85.23 | 86.32 | 85.33 | 87.31 | 86.33
XGB | 83.56 | 86.45 | 87.41 | 86.45 | 86.44 | 86.46
NBC | 86.77 | 84.67 | 87.56 | 87.68 | 86.56 | 86.87
GB | 85.87 | 86.89 | 85.78 | 85.76 | 87.87 | 86.64
KNN | 86.53 | 87.65 | 88.98 | 87.41 | 86.33 | 87.21
AB | 83.23 | 85.32 | 85.27 | 87.23 | 87.45 | 87.34
DT | 85.45 | 83.12 | 89.34 | 88.78 | 87.78 | 86.70
ELM | 83.78 | 87.56 | 88.56 | 87.98 | 88.54 | 86.87
TELM | 87.91 | 88.78 | 89.77 | 88.31 | 88.67 | 88.21
Table 13. Performance comparison of classification accuracy—EMOVO dataset.
Authors | Concept Used | Number of Classes | Classification Accuracy (%)
Assunção et al. [15] | Speaker awareness for SER techniques | 7 | 68.50
Haider et al. [16] | Automated feature selection for emotion recognition in low-resource settings | 7 | 41
Özseven [17] | Novel feature selection methods for SER | 7 | 60.40
Latif et al. [18] | Transfer learning concept for enhancing SER | 7 | 76.22
Proposed works | FAWT + OIFS + TELM | 7 | 78.65
Proposed works | Superlet + HHO + TELM | 7 | 79.98
Proposed works | Chirplet + CSA + TELM | 7 | 80.63
Table 14. Performance comparison of classification accuracy—RAVDESS dataset.
Authors | Concept Used | Number of Classes | Classification Accuracy (%)
Jason and Kumar [19] | Machine learning techniques | 8 | 80.21
Kwon [20] | Convolutional neural networks (CNNs) | 8 | 79.50
Christy et al. [21] | Multimodal SER using CNN | 8 | 78.20
Mansouri-Benssassi and Ye [22] | Spiking neural networks | 8 | 83.60
Jalal et al. [23] | Capsule routing technique | 8 | 77.02
Bhavan et al. [24] | Bagged SVM | 8 | 75.69
Zeng et al. [25] | Spectrogram-based multi-task audio classification | 8 | 64.48
Liu [26] | Gammatone frequency cepstral coefficients with neural networks | 8 | 79.80
Shegokar and Sircar [27] | Continuous wavelet transform (CWT) | 8 | 60.10
Proposed works | FAWT + OIFS + TELM | 8 | 84.19
Proposed works | FAWT + HHO + TELM | 8 | 85.76
Proposed works | Chirplet + CSA + TELM | 8 | 85.54
Table 15. Performance comparison of classification accuracy—SAVEE dataset.
Authors | Concept Used | Number of Classes | Classification Accuracy (%)
Vasuki and Aravindan [28] | A hierarchical classifier | 7 | 83.78
Nguyen et al. [29] | Joint deep cross-domain transfer learning | 7 | 69
Mekruksavanich et al. [30] | Negative emotion recognition | 7 | 65.83
Hajarolasvadi and Demirel [31] | 3D-CNN-based SER using K-means clustering and spectrograms | 7 | 81.05
Tzinis et al. [32] | Integrated recurrence dynamics for SER | 7 | 80.20
Sugan et al. [33] | Cepstral feature comparison for SER | 7 | 78.60
Yogesh et al. [34] | Hybrid Particle Swarm Optimization (PSO)-based biogeography optimization | 7 | 78.44
Proposed works | Chirplet + OIFS + TELM | 7 | 83.94
Proposed works | Chirplet + HHO + TELM | 7 | 83.21
Proposed works | KSTDIS + CSA + TELM | 7 | 83.11
Table 16. Performance comparison of classification accuracy—Berlin Emo-DB dataset.
Authors | Concept Used | Number of Classes | Classification Accuracy (%)
Chen et al. [35] | A two-layer fuzzy multiple random forest concept | 7 | 87.85
Daneshfar et al. [36] | Modified quantum-behaved PSO | 7 | 82.82
Wang et al. [37] | Wavelet packet analysis | 7 | 79.50
Guizzo et al. [38] | Multi-time-scale convolution | 7 | 70.97
Zamil et al. [39] | Voting mechanisms | 7 | 64.52
Álvarez et al. [40] | Stacked generalization method | 7 | 82.45
Badshah et al. [41] | Divide-and-conquer-based ensemble technique | 7 | 82.00
Proposed works | Superlet + OIFS + TELM | 7 | 87.46
Proposed works | KSTDIS + HHO + TELM | 7 | 87.57
Proposed works | KSTDIS + CSA + TELM | 7 | 89.77
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
