Article

Machine Learning Framework for Classifying and Predicting Depressive Behavior Based on PPG and ECG Feature Extraction

by
Mateo Alzate
1,*,
Robinson Torres
1,*,
José De la Roca
2,
Andres Quintero-Zea
1 and
Martha Hernandez
3
1
Escuela de Ciencias de la Vida y Medicina, Universidad EIA, Envigado 055420, Colombia
2
Department of Psychology, Division of Health Sciences, University of Guanajuato, Campus León, León 37670, Mexico
3
Unidad Médica de Alta Especialidad (UMAE), Hospital de Especialidades No. 1. Centro Médico Nacional del Bajio, 37328 IMSS, Blvd. Adolfo López Mateos Esquina Paseo de los Insurgentes S/N, Col. Los Paraisos, León 37320, Mexico
*
Authors to whom correspondence should be addressed.
Appl. Sci. 2024, 14(18), 8312; https://doi.org/10.3390/app14188312
Submission received: 1 August 2024 / Revised: 30 August 2024 / Accepted: 10 September 2024 / Published: 15 September 2024

Abstract:
Depression is a significant risk factor for other serious health conditions, such as heart failure, dementia, and diabetes. In this study, a quantitative method was developed to detect depressive states in individuals using electrocardiogram (ECG) and photoplethysmogram (PPG) signals. Data were obtained from 59 people affiliated with the high-specialized medical center of Bajio T1, comprising medical professionals, administrative personnel, and service workers. Participants were screened with the Beck Depression Inventory (BDI-II) to discern potential false positives. The statistical analyses performed elucidated distinctive features that responded differently to diverse stimuli, and these features were processed through a machine learning classification framework. The method achieved an accuracy rate of up to 92% in the identification of depressive states, substantiating the potential of biophysical data to enhance the diagnosis of depression. The results suggest that this method is innovative and has significant potential. With additional refinement, this approach could serve as a screening tool in psychiatry, be incorporated into everyday devices for preventive diagnostics, and potentially support alarm systems for individuals with suicidal ideation.

1. Introduction

Understanding depression has proven to be a long and meticulous endeavor, prompting the emergence of methodologies for its quantification. Such efforts are indispensable, as depression constitutes a significant risk factor for a multitude of comorbid conditions and is a predominant catalyst for suicide worldwide. Mental disorders, including major depressive disorder, are implicated in 60% of all suicides [1]. Specifically, people who experience general depression have a 4% probability of committing suicide, with chronic depression resulting in suicide rates of 7% among men and 1% among women [2]. According to the World Health Organization (WHO), more than 280 million people worldwide are affected by depression, making it a leading cause of years lost to disability [3]. Furthermore, depression affects 1 in 5 women and 1 in 8 men [4]. In Mexico, 5.3% of the population, approximately 3.6 million people, is affected by depression, with 73.9% remaining untreated [5]. In Guanajuato, 17.6% of the population is estimated to manifest depressive symptoms or related conditions [6]. Currently, the diagnostic process for depression is predominantly subjective, leading to frequent misdiagnoses or inconsistent evaluations [7]. Approximately half of those affected by depression are misdiagnosed or never receive a diagnosis [8], and a correct diagnosis is further impeded by the overlap of symptoms among various mental disorders [9]. A timely and precise diagnosis is critical for effective prevention and treatment, given that depression is a significant risk factor for other serious health conditions such as heart failure [10], dementia [11], and diabetes [12]. Individuals with depression exhibit a 60% increased likelihood of developing additional illnesses and a 4% increased risk of suicide [2]. Furthermore, the economic burden imposed by depression is substantial, estimated at 1 trillion USD annually on a global scale [13].
Thus, enhancing diagnostic accuracy is imperative to mitigate these impacts.
Innovative methodologies for depression detection leverage physiological signals, notably biopotentials, which are imperative to elucidate the dynamics of tissues, nerves, and organs. These signals can furnish a quantitative index that conveys pertinent insights into a patient’s condition. Integrating these signals with machine learning algorithms for precise classification emerges as a promising strategy to detect not only emotions and feelings, but also depression and anxiety. Within this framework, efforts to utilize electrocardiogram (ECG) and electroencephalogram (EEG) signals have produced various results. Metrics such as entropy [14,15,16], zero cross frequency, and frequency centroid [17] have been identified as significant in the detection of depressive patterns and emotionally affective signals associated with the parasympathetic system. In terms of emotion induction, methodologies include the use of autobiographical memories and visual stimulation [18], as well as script-driven imagery, which employs written narratives and imagination to evoke a robust emotional response [19,20].
Numerous investigations have concentrated on the detection of depression using ECG data coupled with machine learning methodologies. The prevailing paradigm predominantly employs elementary machine learning classification algorithms, such as Support Vector Machines (SVM), in conjunction with the extraction of heart rate variability (HRV) attributes [21]. Scholars typically integrate more than 20 disparate attributes within a classification framework along with HRV-related characteristics [15]. A particular study exclusively examined HRV behavior alongside specific variables derived from a singular signal to evaluate depression, achieving metrics that surpass 80% with Bayesian classifiers [22]. A recent publication delineated the machine learning metrics that oscillate between 60% and 70% for the binary classification of subjects with and without depression using ECG-derived variables [23]. A prevalent obstacle in these examinations is the limited number of participants, often fewer than 70 [24]. To mitigate this limitation, validation methodologies such as stratified k-folds or the leave-one-out (LOO) approach are frequently used [25,26]. Notwithstanding these constraints, feature extraction classification models have shown considerable promise in the context of depression detection. In addition, certain studies have taken on deep learning strategies to enhance the precision of machine learning techniques. For example, a study reported a 93% accuracy rate in detecting depression by analyzing a 1-min series of ECG signals using a convolutional neural network (CNN) [27]. Another study achieved up to 98% accuracy by applying CNN to EEG signals [28]. 
It is imperative to recognize that accuracy is contingent not only on the artificial intelligence technique or model utilized, but also on the extracted features, the quantity and quality of data, and whether the participants are clinically diagnosed with depression or are healthy individuals in whom depressive states have been induced.
In this research, a comprehensive evaluation of various machine learning models was performed utilizing the aforementioned features. A cohort consisting of 29 non-depressed and 30 depressed participants was employed. An emotional induction procedure was administered, followed by the acquisition of ECG and PPG signals. Using a Random Forest model, we achieved accuracy, precision, F1 score, and recall metrics equal to or exceeding 88% under cross-validation. Subsequently, an exhaustive statistical and feature analysis was undertaken.

2. Materials and Methods

2.1. Participants

For participant selection, individuals diagnosed with depressive disorders, including major depressive disorder or dysthymia, as well as those who present anxiety indicators, were included. The control cohort consisted of individuals without a diagnosis of depression or associated depressive symptoms. The exclusion criteria included individuals diagnosed with mental disorders other than depression, along with those suffering from cardiovascular complications such as arrhythmias, angina, cardiomyopathy, or heart failure. Furthermore, participants were excluded if they were under the influence of alcohol or subjected to significant emotional events within 24 h before the experimental protocol and data collection. Data were obtained from 59 people affiliated with the high-specialized medical center of Bajio T1, which consists of medical professionals, administrative personnel, and service workers. The sample consisted of 28 men and 31 women, with a mean age of 36.0 ± 8.8 years. Based on symptomatology and subsequent analysis, two distinct groups were delineated: 29 individuals without depressive symptoms and 30 individuals who had depressive symptoms.

2.2. Screening

From the total sample, several significant indicators were identified using a two-step methodology to assess depressive states. First, each of the 59 participants gave a verbal affirmation, in the presence of an attending psychiatrist, regarding any previous diagnosis of depression or the occurrence of depressive symptoms within the previous two weeks. This established a preliminary control marker. Subsequently, the Beck Depression Inventory II (BDI-II) was administered to discern potential false positives. The BDI-II categorizes depression severity into four intervals: No Depression (0–13), Mild Depression (14–19), Moderate Depression (20–28), and Severe Depression (29–63). In addition, an evaluative assessment of participants’ behavior and emotional state was performed for further verification. According to the BDI-II results, 54.2% of the individuals were classified as having no depression, 13.6% with mild depression, 16.9% with moderate depression, and 15.3% with severe depression. Participants who met the criteria on either the verbal affirmation or the BDI-II test were classified as “DEPRESSIVE”, while those who met neither criterion were designated as “NORMAL”. In the final step, participants were asked whether they had a current diagnosis of depression from a psychiatrist. Of the 59 participants, 79.7% reported having no current diagnosis of depression or being under treatment, while the remaining 20.3% confirmed having a diagnosis of depression at the time of the study. Furthermore, only 8.5% (five participants) reported being under medication at the time. A confirmed diagnosis, even at the mildest level, was sufficient on its own to place an individual in the depressive group.
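The BDI-II severity intervals above can be captured in a small helper function (an illustrative sketch; the function name is ours):

```python
def bdi_category(score: int) -> str:
    """Map a BDI-II total score (0-63) to its severity interval."""
    if not 0 <= score <= 63:
        raise ValueError("BDI-II scores range from 0 to 63")
    if score <= 13:
        return "No Depression"
    if score <= 19:
        return "Mild Depression"
    if score <= 28:
        return "Moderate Depression"
    return "Severe Depression"
```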

2.3. Signal Acquisition System

Electrocardiography (ECG) and photoplethysmography (PPG) signals constituted the primary data sources for the experimental analysis. Signal acquisition was performed using the MAXREFDES104# device (Mouser Electronics, Mansfield, TX, USA), a specialized wearable bracelet designed to capture ECG and PPG signals. In addition, this device integrates a Bluetooth Low Energy (BLE) module, ensuring efficient and precise wireless data transmission. Consequently, all processes related to data acquisition, visualization, and parameter modification for PPG and ECG signals were performed using the MAXREFDES104 Health Sensor Platform 3.0 software, which enabled signal recording at a sampling frequency of 128 Hz. Ultimately, the data were archived in CSV files, encompassing both raw and software-filtered samples in a detailed format.

2.4. Description of Variables and Signal Processing

Filtering and artifact removal were performed using Butterworth bandpass filters for both ECG and PPG signals, with passbands of 0.1–50 Hz and 0.1–5 Hz, respectively. After filtering, the signals were baseline-corrected to eliminate the DC component using the ModPoly [29] method from the BaselineRemoval library. Finally, for both raw signals and extracted variables, a threshold-based random peak removal technique was employed; this method removes atypical clusters of points within a defined interval on the magnitude axis, based on a set threshold. Subsequently, three subsignals and a principal value were derived from the ECG and PPG signals. These three time series, illustrated in Figure 1, are heart rate variability (HRV), the respiration signal (RES), and pulse transit time (PTT), with heart rate (HR) being the principal measurement. For HRV, the time of each R peak of the ECG signal was determined, generating a time vector; the HRV signal was then obtained by subtracting this vector from a copy of itself shifted one position, yielding the intervals between consecutive R peaks. The respiration signal was estimated with the ECG-derived respiration (EDR) method: the ECG is filtered within a frequency range encompassing the respiratory components (0.13–0.35 Hz), and the R peaks are then identified and interpolated with a cubic spline to approximate the respiration signal. Lastly, the PTT signal, which reflects the propagation time of a heartbeat from the heart to the most distal part of the hand, was obtained by aligning the time instants of the maximal peaks in the PPG signal with the time vector derived from the R peaks of the ECG signal and subtracting the latter. This produced a distinct time vector representing the PTT signal. The heart rate was ultimately determined by counting the number of R peaks in the ECG over a one-minute window, which directly yields beats per minute.
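The subsignal derivations described above can be sketched as follows (illustrative code; the function names and the simple index alignment used for PTT are our assumptions — the actual pipeline matches each PPG peak to its corresponding R peak):

```python
import numpy as np

def hrv_from_r_peaks(r_peak_times: np.ndarray) -> np.ndarray:
    """RR-interval (HRV) series: differences between consecutive R-peak times."""
    return np.diff(r_peak_times)

def ptt_from_peaks(ppg_peak_times: np.ndarray, r_peak_times: np.ndarray) -> np.ndarray:
    """Pulse transit time: PPG pulse arrival time minus the matching R-peak time."""
    n = min(len(ppg_peak_times), len(r_peak_times))
    return ppg_peak_times[:n] - r_peak_times[:n]

def heart_rate_bpm(r_peak_times: np.ndarray, duration_s: float) -> float:
    """Heart rate in beats per minute over the recording window."""
    return 60.0 * len(r_peak_times) / duration_s
```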
The variables used in this analysis are delineated as follows. From the outset, the concept of physiological coherence, as elucidated by the HeartMath Institute in numerous research studies, was considered crucial. This notion describes the degree of synchronization or equilibrium among disparate oscillatory systems, proposing that when one system achieves a state of coherence and subsequently interacts with another, it can produce a comparable state of equilibrium in the latter, thus augmenting its functional efficiency [30]. Consequently, this metric has been shown to reflect relative variations in various emotional states. Initially, physiological coherence was intended to be assessed by heart rate variability (HRV), as stated in (1):
$$\mathrm{PhyCoh} = \left( \frac{P_{\mathrm{Peak}}}{P_{\mathrm{Total}} - P_{\mathrm{Peak}}} \right)^{2} \qquad (1)$$
where $P_{\mathrm{Peak}}$, known as the peak power, is the power of the highest peak within a lower frequency band of the power spectrum of an HRV signal (designated between 0.04 and 0.5 Hz), specifically within 0.04 and 0.26 Hz, and $P_{\mathrm{Total}}$ is the total power of the specified lower frequency band [30].
Physiological coherence is a number between 0 and 1. To guarantee this, (1) is modified, and the resulting equation is shown in (2). This modification was necessary because of an error in the original (1) that could cause the coherence value to exceed 1 [31]. The corrected (2) is then used as the physiological coherence formula:
$$\mathrm{PhyCoh} = \frac{P_{\mathrm{Peak}}}{P_{\mathrm{Total}}} \qquad (2)$$
In our case, the coherence value was calculated for all three continuous signals, that is, HRV, PTT, and RES, using the same methodology mentioned above.
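A minimal sketch of the corrected coherence ratio (2), using Welch's power spectral density estimate, is shown below. The ±0.015 Hz integration window around the tallest peak is our assumption, since the text does not specify the peak window width:

```python
import numpy as np
from scipy.signal import welch

def physiological_coherence(x, fs, peak_band=(0.04, 0.26),
                            total_band=(0.04, 0.5), win=0.015):
    """PhyCoh = P_peak / P_total over the low-frequency band (bounded by 1)."""
    f, pxx = welch(x, fs=fs, nperseg=min(len(x), 512))
    tot = (f >= total_band[0]) & (f <= total_band[1])
    pk = (f >= peak_band[0]) & (f <= peak_band[1])
    # Locate the tallest spectral peak inside the peak band.
    f_peak = f[pk][np.argmax(pxx[pk])]
    around = (f >= f_peak - win) & (f <= f_peak + win) & tot
    p_peak = np.sum(pxx[around])   # uniform bin spacing: sums are proportional to power
    p_total = np.sum(pxx[tot])
    return float(p_peak / p_total)
```

A highly rhythmic signal (e.g., a pure 0.1 Hz oscillation) concentrates its power at the peak, giving a coherence value near 1.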
Another variable used, derived from physiological coherence, is the Global Coherence Index (GCI), calculated using (5). This equation combines three main values: the combined physiological coherences of the three signals, the phase synchronization index (PSI), obtained by weighting the phase values of the signals, and the cross-correlations of the three signals. As shown in (3), the PSI term $\varphi_k$ represents the phase of signal $k$, computed via the Hilbert transform of the signal to obtain its analytic version and, from it, the phase. For the cross-correlation $r_{xy}(k)$ in (4), the relationship between each pair of the three signals was found, with $x$ and $y$ representing the different signals (PTT, RES, or HRV). Finally, the Global Coherence Index (GCI) in (5) is a weighted summation of these terms [31].
$$\mathrm{PSI}_k = \overline{\cos(\varphi_k)}^{\,2} + \overline{\sin(\varphi_k)}^{\,2} \qquad (3)$$
$$r_{xy}(k) = \frac{1}{N} \sum_{n=1}^{N} y(n)\, x(n+k) \qquad (4)$$
$$\mathrm{GCI} = \frac{3}{5}\,\overline{\mathrm{PhyCoh}} + \frac{1}{4}\,\overline{r_{xy}} + \frac{3}{20}\,\overline{\mathrm{PSI}} \qquad (5)$$
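The three ingredients of the GCI can be sketched as follows (illustrative code; the cross-correlation here is normalized and evaluated at lag 0 for simplicity, which is our assumption):

```python
import numpy as np
from scipy.signal import hilbert

def phase_sync_index(x):
    """PSI (cf. Eq. 3): squared circular means of the instantaneous phase,
    obtained from the analytic signal via the Hilbert transform."""
    phi = np.angle(hilbert(x))
    return float(np.mean(np.cos(phi)) ** 2 + np.mean(np.sin(phi)) ** 2)

def norm_xcorr_lag0(x, y):
    """Normalized cross-correlation between two signals at lag 0 (cf. Eq. 4)."""
    x = (x - x.mean()) / x.std()
    y = (y - y.mean()) / y.std()
    return float(np.mean(x * y))

def global_coherence_index(phycoh_vals, xcorr_vals, psi_vals):
    """Weighted combination from Eq. 5."""
    return (3 / 5 * np.mean(phycoh_vals)
            + 1 / 4 * np.mean(xcorr_vals)
            + 3 / 20 * np.mean(psi_vals))
```

Note that the weights 3/5, 1/4, and 3/20 sum to 1, so the GCI of three perfectly coherent, correlated, and synchronized signals equals 1.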
For the remaining variables, the averages of each vector of values from the continuous subsignals were calculated, yielding the general averages of PTT, RES, and HRV. The standard deviations of PTT and HRV were also calculated. Equation (6) shows how the standard deviation $\sigma$ was found, where $\bar{x}$ is the mean value of a given time series, $x_i$ is each value of the signal vector, and $N$ is the total length of the time series vector:
$$\sigma = \sqrt{\frac{\sum_{i}(x_i - \bar{x})^{2}}{N}} \qquad (6)$$
Another variable calculated was the coefficient of variation for both the main PPG and ECG signals, measured as illustrated in (7), where $\sigma$ is the standard deviation of the signal and $\bar{x}$ is its mean:
$$\mathrm{CV} = \frac{\sigma}{\bar{x}} \qquad (7)$$
Other variables calculated from the subsignals included Shannon entropy, the spectral (frequency) centroid, and zero crossings. The zero crossings were determined by programming a threshold detector with conditionals that incremented a counter each time the signal crossed the zero line. For the spectral centroid, as stated in (8), $f_i$ represents a frequency component of the spectrum and $|X(f_i)|$ corresponds to the magnitude of the signal at frequency $f_i$; $N$ is the number of frequency components evaluated. The spectral centroid is thus the magnitude-weighted average of the frequencies [32]:
$$C = \frac{\sum_{i=1}^{N} f_i\, |X(f_i)|}{\sum_{i=1}^{N} |X(f_i)|} \qquad (8)$$
Finally, the Shannon entropy, shown in (9), is calculated using the probability $p(x_i)$, which corresponds to the relative frequency of each unique value in the signal vector [30,33]:
$$S(x) = -\sum_{i=1}^{N} p(x_i)\, \log_2\big(p(x_i)\big) \qquad (9)$$
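These three descriptors can be computed in a few lines (a sketch; the zero-crossing count here uses a plain sign-change test rather than the thresholded counter described above):

```python
import numpy as np

def shannon_entropy(x):
    """Shannon entropy (Eq. 9) over the empirical probabilities of unique values."""
    _, counts = np.unique(x, return_counts=True)
    p = counts / counts.sum()
    return float(-np.sum(p * np.log2(p)))

def spectral_centroid(x, fs):
    """Magnitude-weighted mean frequency of the one-sided spectrum (Eq. 8)."""
    mag = np.abs(np.fft.rfft(x))
    freqs = np.fft.rfftfreq(len(x), d=1.0 / fs)
    return float(np.sum(freqs * mag) / np.sum(mag))

def zero_crossings(x):
    """Number of sign changes across the zero line."""
    return int(np.sum(np.signbit(x[:-1]) != np.signbit(x[1:])))
```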
In summary, the final variables were all applied to the three subsignals. All of the calculations mentioned above are the features that were extracted for the classification exercise. Table 1 explains each variable and subsignal extracted from the ECG and PPG signals.
The preponderance of the selected features is intricately linked to the physiological coherence metrics. The principal signals acquired were used to compute the Global Coherence Index, using PPG and ECG to evaluate coherence among three additional physiological parameters: the respiratory signal, pulse transit time, and heart rate variability. Furthermore, as previously delineated, mean values over a predefined temporal span were calculated, along with the standard deviations, the physiological coherences, and the Global Coherence Index.
Coherent states are associated with the stability of the system, the correlation measures, and efficient energy utilization within a body signal [34]. These concepts are intrinsically linked to frequency domain analyses; hence, entropy and spectral centroid measures were also implemented for each of the aforementioned variables due to their relevance to information quality, order, and energy distribution within the power spectrum. Lastly, zero-crossing analysis, widely used in biosignal processing, was used because of its effectiveness in identifying key frequency components in random signals and revealing change factors [35]. As observed, the feature selection process was guided by a central concept, yet it was ultimately directed towards the objective of classification using all of these features. The goal was to explore which coherence-related variables could be effective in classifying depression, given their association with the chaotic and stabilizing properties stated in the main concept.

2.5. Experimental Protocol

Data gathering was carried out using the Script-Driven Imagery methodology, where emotional states were induced through suggestive script reading, followed by vivid imagination of the described scenarios. For this experiment, four scripts were adapted for data collection. The protocol consisted of four blocks, each divided into four stages. The process began with baseline stabilization for ECG and PPG measurements for 30 s. The participants then read the script, closed their eyes to vividly imagine the scenario, and finally relaxed and stopped thinking about it. Each block lasted approximately 3 min, resulting in a total data collection period of 12 min. The replicates included two emotional stimuli and two neutral stimuli, alternating between them. This approach allowed for comprehensive data capture, repeated four times to record all variables for each stimulus. Consequently, four data sheets, each containing 59 samples, were obtained to construct the final databases.

2.6. Obtained Databases

Data acquisition from a total of 59 participants adhered to the established experimental protocol. The data collection process was segmented into four distinct phases, each involving two neutral stimuli and two depressive stimuli per participant. Upon meticulous review of the samples, a total of 209 data entries were obtained, each comprising 20 scalar values and a categorical label. This label, derived from the initial screening process, classified each sample as DEPRESSIVE or NORMAL. This dataset, termed the Combined Stimuli Dataset, amalgamated data from both neutral and depressive stimuli. Subsequently, this aggregated dataset was partitioned into two subsets: the Neutral Stimulus Dataset, with 103 samples, and the Depressive Stimulus Dataset, with 106 samples, each sample maintaining the 20 features.

2.7. Classification Models and Data Splitting

For the development of the classification system, feature extraction was performed as previously described, followed by the preparation of the resulting databases. Data validation was carried out using a stratified train–test split, distributing 20% of the data for testing and 80% for training. This stratification ensured that both groups contained proportional amounts of samples labeled DEPRESSIVE or NORMAL, maintaining equal proportions within each subset. Four different machine learning models were tested, but for the scope of this article, we will only delve into the three best performing models in each of the databases. The models of interest are Logistic Regression, Random Forest, Multilayer Perceptron, and AdaBoost Classifier. These four models were constructed using the Scikit-learn library in Python (https://scikit-learn.org/stable/).
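The stratified 80/20 split can be reproduced with Scikit-learn as follows (a sketch on synthetic stand-in data, since the study's dataset is not public):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the 209-sample, 20-feature combined dataset
# (label 1 = DEPRESSIVE, 0 = NORMAL).
X, y = make_classification(n_samples=209, n_features=20, random_state=0)

# stratify=y keeps the DEPRESSIVE/NORMAL proportions equal in both subsets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)
```

With 209 samples, this yields 167 training and 42 test samples.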
In addition to the partitioning of data into training and testing subsets, a cross-validation methodology was employed to address potential overfitting in the outcomes. Specifically, a stratified k-fold technique was executed on the models, and the optimal results were systematically chosen for presentation. Considering the magnitude of the datasets, a five-fold cross-validation was deemed appropriate for all analyzed datasets. This procedure was conducted iteratively to ascertain the percentages of the metrics to be reported, leveraging the Scikit-learn library in Python.
Each model has a number of hyperparameters that can be modified to enhance performance. In this case, a grid search algorithm was used to tune the best hyperparameters for each model. The models used were logistic regression, random forest, multilayer perceptron, and AdaBoost classifier. Logistic regression, which is specifically designed for binary classification problems, utilizes the logistic function as its cost function to predict the probability that an instance belongs to one of two classes. For logistic regression, the default program values were used without modifications. In contrast, random forest, an ensemble learning method, constructs multiple decision trees during training and outputs the mode of the classes (classification) or the mean prediction (regression) of the individual trees. Parameters such as “random state” were varied from 1 to 10, the “criterion” for evaluating the quality of node splitting was adjusted using “Gini” and “Entropy”, and the “n_estimators” parameter, defining the number of trees, was varied from 2 to 30.
The multilayer perceptron (MLP), a type of artificial neural network consisting of at least three layers of nodes (input layer, hidden layer, and output layer), is used for complex pattern recognition and learning non-linear relationships. The “random state” parameter was adjusted with values from 1 to 10 to control the weight optimization system, the “solver” parameter was modified using “lbfgs”, “sgd”, and “Adam”, and the number of neurons was chosen using “hidden_layer_sizes” with variations from 5 to 30. Lastly, AdaBoost (Adaptive Boosting) is a meta-estimator that starts by fitting a classifier on the original dataset and then fits additional copies of the classifier on the same dataset. However, it adjusts the weights of incorrectly classified instances so that subsequent classifiers focus more on these difficult cases. An initial decision tree estimator was used, the “random state” parameter was modified with values from 1 to 10, and the “n_estimators” values, determining the number of estimators to use, ranged from 2 to 20.
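The grid search over the Random Forest hyperparameters described above can be sketched with GridSearchCV (synthetic stand-in data; the grids are reduced from the full ranges so the example runs quickly):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=209, n_features=20, random_state=0)

# Subset of the ranges described above: random_state 1-10,
# criterion gini/entropy, n_estimators 2-30.
param_grid = {
    "random_state": [1, 5, 10],
    "criterion": ["gini", "entropy"],
    "n_estimators": [2, 10, 30],
}
search = GridSearchCV(RandomForestClassifier(), param_grid,
                      cv=5, scoring="accuracy")
search.fit(X, y)
```

`search.best_params_` then holds the hyperparameter combination with the highest cross-validated accuracy.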

2.8. Metrics and Best Features

Four main metrics were used to assess the performance of each model on each database: overall classification accuracy, sensitivity (recall), precision, and the F1 score. These results were also compared with the confusion matrices obtained for each exercise. Here, True Positives (TP) are depressive samples correctly classified as depressive, False Positives (FP) are normal samples incorrectly classified as depressive, False Negatives (FN) are depressive samples incorrectly classified as normal, and True Negatives (TN) are normal samples correctly classified as normal. To characterize the signals, a statistical analysis was performed to identify the features most useful for accurate classification within an artificial intelligence algorithm. This was achieved through hypothesis testing and the SHAP Python library [36], which provides insight into the importance of features in the classification process. The scikit-learn classification report was used to obtain the reported values; the equations normally used to calculate these metrics [37] are as follows:
$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$
$$\mathrm{Precision} = \frac{TP}{TP + FP}$$
$$\mathrm{Recall} = \frac{TP}{TP + FN}$$
$$F_1\ \mathrm{Score} = \frac{2 \cdot \mathrm{Recall} \cdot \mathrm{Precision}}{\mathrm{Recall} + \mathrm{Precision}}$$
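Computed directly from confusion-matrix counts, these metrics look as follows (an illustrative helper; Scikit-learn's `classification_report` produces the same values):

```python
def classification_metrics(tp: int, tn: int, fp: int, fn: int):
    """Accuracy, precision, recall, and F1 score from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * recall * precision / (recall + precision)
    return accuracy, precision, recall, f1
```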

2.9. SHAP Feature Performance Analysis

Two distinct graphical approaches were used to evaluate feature importance. This analysis relied primarily on SHAP (SHapley Additive exPlanations), a game-theory-inspired method applied to machine learning models to determine the importance of individual features within a classification, prediction, or regression task [36]. SHAP values represent the difference between the expected model output and the partial dependence of the model at a given feature value. From SHAP values, different metrics can be derived to evaluate the features of a classification model. Specifically, two native SHAP library graphics were used: a summary plot and a beeswarm plot. The summary plot provided a concise overview of the hierarchical importance of each feature within the model, while the beeswarm plot illustrated the impact of individual feature values on the classification process. The summary plot thus served as a generalized version of the beeswarm plot, highlighting the relative importance of features in a comprehensive manner.

2.10. Statistical Analysis

From the Combined Stimuli Dataset, two new databases were created: one containing all samples classified as depressive and the other containing samples classified as normal, based on the screening result in each capture sheet. Normality was assessed for each variable using the Kolmogorov–Smirnov test, which indicated that the data were not normally distributed and therefore warranted non-parametric methods. Subsequently, a hypothesis test, specifically the Mann–Whitney U test for unrelated samples, was conducted between these two datasets under the null hypothesis that the Depressive and Normal groups were equal, providing statistical grounding for the comparison between the groups.
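The two-step statistical procedure can be sketched with SciPy (synthetic stand-in samples; the real analysis runs these tests per feature):

```python
import numpy as np
from scipy.stats import kstest, mannwhitneyu

rng = np.random.default_rng(0)
# Hypothetical feature values for each group (exponential, hence non-normal).
depressive = rng.exponential(scale=2.0, size=106)
normal = rng.exponential(scale=1.0, size=103)

# Normality check: Kolmogorov-Smirnov test of the standardized sample
# against a standard normal distribution.
z = (depressive - depressive.mean()) / depressive.std()
ks_stat, ks_p = kstest(z, "norm")

# Mann-Whitney U test under the null hypothesis that both groups are equal.
u_stat, u_p = mannwhitneyu(depressive, normal)
```

A small KS p-value rejects normality and justifies the non-parametric comparison; a small Mann–Whitney p-value indicates the groups differ on that feature.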

3. Results

This investigation involved a rigorous evaluation of the principal metrics of the top three machine learning models in each of the three datasets (Emotional, Neutral, and Combined). Significantly, one of the four models initially delineated was precluded from the results due to suboptimal performance, with the underperforming model differing based on the dataset in question. As a result of this thorough analysis, the model that exhibited the most substantial generalization and classification prowess was chosen alongside the dataset providing the most pertinent insights. The models were trained using a simple train–test validation method and stratified k-fold cross-validation for a more robust approach.

3.1. Train–Test Split Models Results

3.1.1. Emotional Stimuli Dataset

For the Emotional Stimulus Dataset, a well-balanced dataset comprising 106 samples, each characterized by 20 distinct features, was used. The dataset was partitioned into a training group of 84 samples and a validation group of 22 samples. The Random Forest, Logistic Regression, and AdaBoost models exhibited the most favorable metric values. Note that the metric values range from 0 to 1, with 1 signifying the optimal achievable result. Within the training group, all models demonstrated accuracy and F1 scores exceeding 80%, in some instances reaching as high as 99%. However, in the validation group, as shown in Table 2, the values achieved were significantly lower, with precision and F1 scores consistently below 70%, indicating overfitting due to poor generalization. Despite this, Random Forest emerged as the best performing algorithm in the validation group, attaining an accuracy of 0.64 (64%) and an F1 score of 0.63 (63%).

3.1.2. Neutral Stimuli Dataset

In the context of the Neutral Stimulus Dataset, which is a well-balanced dataset comprising 103 samples, the dataset was stratified into a training cohort of 82 samples and a validation cohort of 21 samples. Analogous to the analysis performed on the Emotional Stimulus Dataset, machine learning models such as Random Forest, Multi-layer Perceptron (MLP), and AdaBoost exhibited performance metrics exceeding the 0.8 threshold in the training cohort. Notably, as delineated in Table 3, the Random Forest model surpassed the performance of other models, achieving precision and F1 score values of 0.81. This indicates that, contrary to observations made with the previous dataset, the issue of overfitting was markedly less pronounced in this instance.

3.1.3. Combined Dataset

The combined Emotional and Neutral Stimulus Dataset, which integrated 209 samples, each described by approximately 20 features, was divided into a training group of 167 samples and a validation group of 42 samples. In this case, the Random Forest, Logistic Regression, and AdaBoost models produced commendable results, with metric values within the training group equal to or exceeding 80%. For the validation group shown in Table 4, the precision and F1 scores were as follows for these three models: 0.76 for both metrics with Random Forest, 0.64 for both with Logistic Regression, and 0.60 and 0.59, respectively, with AdaBoost. The Random Forest model thus achieved the most favorable performance on this combined dataset. Although numerically below the 81% obtained on the Neutral Stimulus Dataset, this result can be considered more meaningful given the larger data volume and the improved generalizability it affords.

3.2. Stratified k-Fold Model Results

For cross-validation, a stratified five-fold randomized shuffle was implemented. The metrics were calculated for each fold and then averaged to obtain an overall mean value across all folds. This procedure was performed using the StratifiedKFold method from the Scikit-learn library in Python.
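The fold construction can be illustrated with a minimal pure-Python splitter. It imitates the behavior of Scikit-learn's `StratifiedKFold` with shuffling, which the study used, without reproducing its exact fold assignments; per-fold metrics would then be computed on each (train, test) pair and averaged.

```python
import random

def stratified_kfold_indices(y, n_splits=5, seed=0):
    """Yield (train, test) index lists; each test fold keeps the
    class proportions of y, as StratifiedKFold does."""
    rng = random.Random(seed)
    by_class = {}
    for i, label in enumerate(y):
        by_class.setdefault(label, []).append(i)
    folds = [[] for _ in range(n_splits)]
    for idx in by_class.values():
        rng.shuffle(idx)
        for j, i in enumerate(idx):
            folds[j % n_splits].append(i)  # deal samples round-robin
    for k in range(n_splits):
        test = folds[k]
        train = [i for j, fold in enumerate(folds) if j != k for i in fold]
        yield train, test

# A balanced toy label vector: every fold holds 10 samples of each class
y = [0] * 50 + [1] * 50
splits = list(stratified_kfold_indices(y, n_splits=5))
```

Because every sample appears in exactly one test fold, averaging the per-fold metrics uses all available data for both training and validation, which is why this scheme is more robust than a single train–test split on small datasets.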
Table 5 presents a summary of all metrics for the top three models on the evaluated datasets. A notable finding is the consistent performance of the Random Forest classifier, which achieved an accuracy of more than 88% on all three datasets. Compared to the train–test split approach, these results showed an improvement not only in accuracy but in all metrics, with recall, precision, and F1 scores exceeding 87% for all datasets. The highest classification performance was observed on the emotional dataset, where all metrics exceeded 91% and no evidence of overfitting was detected, as shown in Figure 2. Furthermore, the 88% precision obtained on the combined dataset is not far from the 92% achieved on the other datasets. Taking into account the significantly larger sample size and the diverse stimuli in the combined dataset, this result can be considered more generalizable, representing the overall success of the models evaluated. The set of metrics obtained was superior to that from the train–test split and consistent across comparable datasets.
The confusion matrices shown in Figure 2 provide visual evidence for comparing possible overfitting between trials. The upper row displays the results of the stratified k-fold method for Random Forest on each dataset. In these matrices, the true-negative and true-positive classes (the principal diagonal) are prominently distinguished by a clear majority of values and by color intensity. In contrast, the lower row shows the results of the Random Forest train–test split method on each dataset. Here, noticeable issues such as a high number of false positives are evident, as seen in Figure 2d,e. This is characterized by a lack of majority classification in this segment of the row, which can also be discerned from the color temperature map to the right of each panel.

3.3. Analysis of Feature Importance via SHAP

In this particular context, feature analysis was meticulously executed on the Random Forest model using the combined stimulus dataset, which was identified as the most suitable option. Two primary visualizations, specifically a beeswarm plot and a bar plot, were analyzed, as depicted in Figure 3 and Figure 4.

3.3.1. SHAP Bar Plot

The bar plot delineated the hierarchical structure of the most significant features influencing the predictive process. Both validation methods exhibited nearly congruent behavior in feature importance when classifying depression. The only noticeable variation between the validation methods was the classification of SCRES and TTP features within the importance hierarchy, with TTP assuming marginally higher importance when using the cross-validation technique. It became evident that the coefficient of variation of the ECG exerted the most profound influence on the overall classification, followed, in order, by the physiological coherence of the TTP, zero HRV crossings, the global coherence index, and the physiological coherence in respiration. These five characteristics represented the most influential variables in the decision-making process of the Random Forest algorithm, applicable to both the combined dataset and for both the train–test split and the stratified k-fold strategies. Moreover, the differences in influence among these characteristics were relatively minor, maintaining a consistent and progressive hierarchy. Although individual variables demonstrated varying degrees of impact, no single feature unequivocally overshadowed the contributions of others. Collectively, the top ten characteristics contributed approximately 73% of the performance of the model, while the remaining ten less significant features contributed approximately 27%.
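The ranking behind the bar plot reduces to ordering features by their mean absolute SHAP value and summing attribution shares. The sketch below shows that aggregation with made-up importances for a handful of feature names loosely inspired by the paper (the actual values, and the roughly 73% top-ten share the paper reports, come from the fitted Random Forest).

```python
def top_feature_share(mean_abs_shap, top_n=10):
    """Rank features by mean |SHAP| value and return the ranking plus
    the fraction of total attribution carried by the top_n features."""
    ranked = sorted(mean_abs_shap.items(), key=lambda kv: kv[1], reverse=True)
    total = sum(v for _, v in ranked)
    share = sum(v for _, v in ranked[:top_n]) / total
    return [name for name, _ in ranked], share

# Hypothetical mean |SHAP| values, chosen only for illustration
importances = {"CV_ECG": 0.30, "PC_TTP": 0.25, "HRV_zero_cross": 0.20,
               "IDCG": 0.15, "PC_RES": 0.10}
order, share = top_feature_share(importances, top_n=3)
```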

3.3.2. SHAP Beeswarm Plot

In turn, the beeswarm plot facilitated visualization of the SHAP values for each sample, categorizing values as high (red) or low (blue). The color code indicated that low feature values were associated with favorable outcomes, while high values had a detrimental effect. For example, under both validation methods, lower values of the ECG coefficient of variation were associated with better predictions, while higher or intermediate values of zero crossings in HRV were more conducive to positive predictions. The values of the IDCG had a relatively neutral impact, while the Shannon entropy of the TTP was most influential when its values were higher; this applies to both Figure 3 and Figure 4.

3.3.3. Best Features Model Test

Ultimately, an experiment was conducted using the same models employed initially on the combined dataset, but using only the 10 most influential features according to SHAP, for both the train–test split and the stratified k-fold validation methods. The results indicated that the Random Forest model continued to perform well, as illustrated in Table 6, with F1 and accuracy scores of approximately 0.71. These scores closely resembled those achieved when using all features. This was, however, surpassed by the Random Forest result under cross-validation, which achieved an accuracy of 0.81, shown in Table 7, remaining close to the original all-feature result. In contrast, the other two models showed a considerable reduction in performance when the number of features was decreased.
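Restricting the models to the SHAP-selected subset amounts to a column projection before retraining. A minimal helper, with hypothetical feature names rather than the study's actual variables:

```python
def select_features(rows, feature_names, keep):
    """Project each sample onto the columns named in `keep`,
    e.g. the ten features SHAP ranked highest."""
    cols = [feature_names.index(name) for name in keep]
    return [[row[c] for c in cols] for row in rows]

# Illustrative 3-feature samples reduced to a 2-feature subset
names = ["cv_ecg", "pc_ttp", "hrv_zc"]
rows = [[0.12, 0.98, 7], [0.31, 0.42, 3]]
reduced = select_features(rows, names, keep=["hrv_zc", "cv_ecg"])
```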

3.4. Statistical Analysis

For these two datasets, we performed hypothesis testing using the Mann–Whitney U test, a non-parametric test whose null hypothesis is that no significant difference exists between two independent populations, assessed through a rank-based comparison of their distributions [38]. As Table 8 shows, most variables do not reject the null hypothesis; there is therefore no statistical evidence that those variables behave differently between the two groups (individuals with depression and individuals without depression). In other words, for most characteristics, individuals with depression behave similarly to individuals without depression. However, this is not definitive, because three features did reject the null hypothesis, with p-values below 0.05: physiological coherence in TTP (PCTTP), spectral centroid of heart rate variability (SCHRV), and spectral centroid of TTP (SCTTP). For these three characteristics alone it can be stated, with 95% confidence, that they differentiate one group from the other.
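For reference, the U statistic itself follows from pooled ranks. The function below is a plain-Python equivalent of the statistic computed by `scipy.stats.mannwhitneyu` (the p-value step, which the study's significance threshold relies on, is omitted here).

```python
def mann_whitney_u(a, b):
    """U statistic for two independent samples, using average ranks
    for ties; returns min(U1, U2)."""
    values = list(a) + list(b)
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average 1-based rank across the tie group
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    r1 = sum(ranks[: len(a)])            # rank sum of the first sample
    u1 = r1 - len(a) * (len(a) + 1) / 2  # U for the first sample
    u2 = len(a) * len(b) - u1            # U for the second sample
    return min(u1, u2)
```

Fully separated samples give U = 0, while perfectly interleaved samples give a U near half of n1·n2, which is what the null hypothesis of no group difference predicts.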

4. Discussion

It has been shown that a depressive state in an individual can be identified with a quantifiable degree of precision, thus addressing the need to objectively combat depression for the benefit of both patients and healthcare professionals. This was accomplished through statistical analyses that elucidated distinct characteristics exhibiting differential behavior between groups in response to disparate physical stimuli. In particular, when the data were processed through a machine learning classification system with two types of validation techniques, depressive states were correctly identified in up to 92% of cases using stratified cross-validation. In this study, the best results were obtained with the stratified k-fold validation method rather than the train–test split, with metrics exceeding 0.9; the discussion therefore focuses on these superior outcomes. Consequently, it can be substantiated that ECG and PPG signals, particularly the derived characteristics with the highest impact on classification, can contribute to the diagnostic process or serve as corroborative evidence in assessing whether an individual is experiencing depression or a depressive episode. This methodology offers an ancillary layer of support predicated on the individual's biophysical data. There is a pronounced improvement in the performance of the machine learning algorithm when the sample size is increased, as evidenced by the markedly superior performance of the selected model on the more extensive dataset compared to the other two, which contained only half the amount of data. Although we achieved an accuracy rate of up to 92% with the emotional stimuli dataset, the 88% accuracy yielded by the combined dataset is likely to be more reliable due to the increased volume of information it provides.
In most models, algorithms trained on limited data frequently exhibit deficient generalization capabilities during validation, likely due to the insufficiency of available information.
Compared with other authors and state-of-the-art methods, the proposed method, with the reported values, shows results equivalent to most similar methodologies. For example, Kuang et al. [22] reported an accuracy of 86% with a Bayesian classifier, analyzing only non-linear features in the time and frequency domains of HRV. Our results are comparable, achieving higher accuracy even with a similar number of features and fewer test subjects. Furthermore, our results can be compared with those of Byun et al. [15], who achieved results greater than 72% for binary classification of depression using cardiac variability characteristics and an SVM model, with a larger set of features and volunteers than ours. In another study by Byun et al. [21], metrics exceeding 70% were achieved using only entropy characteristics of HRV. In our study, in addition to HRV, we incorporate other related characteristics, which results in accuracy, precision, and sensitivity values exceeding 87%. On the other hand, Zhang et al. [23] achieved an accuracy of 95% and Noor et al. [39] reported an accuracy of 97%, both using ECG time-series data with advanced deep learning models such as CNNs. From the above, it can be observed that classification performance values are often superior with deep learning models. However, despite their excellent results, these models require large amounts of data, which many research groups cannot afford. Moreover, they often do not attempt to understand the intrinsic phenomena associated with feature extraction. Our approach is an alternative that could work even better with a full understanding of the signals and the cause–effect relationships behind them. The implementation of other signal features has also been explored, as demonstrated in the work of Zitouni et al. [40], where accuracy values exceeding 75% were obtained for a three-category depression classification using non-linear ECG and PPG features.
Similarly, several attempts using EEG, such as those of Cai et al. [41] and Pange and Pawar [42], achieved performance with accuracy and sensitivity exceeding 70%, using a limited set of features, including frequency, time, entropy, and both linear and non-linear attributes.

4.1. Clinical Interpretation of Classification Results

A high accuracy percentage immediately suggests that the model is capable of safely classifying an individual as either depressed or non-depressed. However, after this initial assessment, it is important to evaluate two additional metrics: sensitivity (also known as recall) and specificity. These metrics provide deeper insight into the performance of the model, specifically its ability to correctly identify true positives (depressive cases) and true negatives (non-depressive cases) [43]. Furthermore, it is important to note that this method currently functions effectively only when complemented by the expert opinion and diagnosis of a healthcare professional. This is due to the need for extensive validation in a large number of patients, as well as for model personalization and enhancement.
High values in both sensitivity and specificity indicate that the model is highly effective at accurately classifying both depressive and non-depressive individuals, making it a valuable tool to assist psychiatrists in diagnosing depression. If the model demonstrates high sensitivity (recall) but lower specificity, it would be particularly useful in confirming a depression diagnosis, although a secondary evaluation might be necessary if the model suggests a non-depression outcome. Conversely, if the model exhibits high specificity but lower sensitivity, it would reliably confirm a non-depression diagnosis, but a follow-up assessment could be necessary when the model indicates the presence of depression; all these results can be observed in Table 9. The stratified k-fold validation method generally produced strong sensitivity and specificity results across all datasets, with values exceeding 0.80; the emotional dataset achieved the highest performance, with a consistent success rate above 0.88. A comparison of validation methods reveals that the train–test split often underperforms for certain classes, suggesting potential overfitting in these models. Nevertheless, any of the models and parameters used with cross-validation would be effective in providing a predominantly reliable confirmation of whether an individual has depression.
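Both quantities follow directly from confusion-matrix counts; a small helper with illustrative numbers (not the study's actual counts):

```python
def sensitivity_specificity(tp, fn, tn, fp):
    """Sensitivity (recall on the depressive class) and specificity
    (recall on the non-depressive class) from confusion-matrix counts."""
    return tp / (tp + fn), tn / (tn + fp)

# Illustrative counts: 45 of 50 depressive and 40 of 50
# non-depressive cases correctly identified
sens, spec = sensitivity_specificity(tp=45, fn=5, tn=40, fp=10)
```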

4.2. Real-Life Applications

Beyond the classification outcomes, comparable results were achieved within the validation cohort even with the feature set reduced by half. Expanding the sample size offers the potential to significantly enhance these results through a more comprehensive database. Frequency-domain variables have shown superior efficacy in elucidating relationships with depression and potentially other mental and behavioral disorders. This methodological framework can be extrapolated to investigate anxiety and manic episodes and to stratify individuals into various gradations of depression severity. The findings indicate that this approach is not only innovative but possesses considerable promise. With the current level of precision, this method could function as a diagnostic algorithm in psychiatry, allowing healthcare practitioners to confirm and classify individuals with up to 92% certainty regarding the presence or absence of depression, providing an auxiliary verification mechanism to corroborate or rule out a diagnosis. Further refinements in accuracy could facilitate integration into ubiquitous devices such as smartwatches, smartphones, or earphones capable of capturing the requisite signals, offering real-time preventive diagnostics of an individual's mental state. Ultimately, this could precipitate the development of an alarm system based on physiological recognition of mental states, targeted toward individuals with suicidal tendencies instigated by depression. This analytical method also has the potential to be extended to a variety of mental illnesses and different severity levels.
Providing quantification of phenomena traditionally subjected to human interaction may offer substantial advantages in the prevention, diagnosis, and management of socioaffective disorders, which present significant challenges for psychologists, psychiatrists, and other healthcare providers.

4.3. Limitations and Future Research

Several limitations of this study warrant careful consideration. Primarily, an enhancement of the data acquisition apparatus is suggested, as it sporadically introduces substantial noise and extraneous data into the measurements, potentially resulting in inaccurate values. Nevertheless, as a preliminary step towards wireless signal acquisition, the current device’s performance is satisfactory. Furthermore, there are ample opportunities to enhance the code’s efficiency, speed, and accuracy. The emotion induction protocol could be refined by integrating alternative methods, such as auditory stimulation or personalized replication, and by providing a more immersive experience within the script-driven imagery protocol. The machine learning model utilized in this research could also benefit from an expanded sample size. Augmenting group heterogeneity through various validation and data partitioning techniques, including k-fold cross-validation, bootstrap, and leave-one-out (LOO), would likely enhance the model’s robustness and generalizability. Furthermore, the superior performance of the combined database relative to individual datasets remains ambiguous, potentially due to factors such as sample size, false positives in measurements, participants under the influence of medication, or those experiencing specific life conditions or undisclosed medical conditions. Moreover, the content of the script-driven imagery protocol itself might have influenced the measurements, as individual responses to certain scenarios may vary depending on their emotional states. These considerations underscore the need for further research to elucidate these findings and address the identified limitations. However, for the clinical and investigation topic, the real challenge lies in the reach of this kind of research. Sample sizes are often not large enough to conclusively generalize the findings, and the integration of such systems into medical centers presents additional complexities. 
Although this represents a step forward in generating proposals for clinical practice, further complementary studies are necessary to strengthen its robustness. In addition, widespread dissemination among healthcare professionals is essential for its integration into routine clinical practice. Moreover, guidelines established by ethics committees and restrictions on the types of patients that can be included in studies can influence the quality of the data and delay the integration of such approaches into clinical programs. Consequently, future investigations should focus on refining data acquisition methods, optimizing the code, enhancing emotion induction protocols, and increasing the diversity and size of the sample population, thereby contributing to a more profound understanding of the phenomena under study and enhancing the accuracy and reliability of the results.

5. Conclusions

In summary, this research has shown that depressive states in individuals can be detected with remarkable precision using ECG and PPG signals. The statistical analyses performed elucidated distinctive characteristics with variable behavior in response to diverse physical stimuli, which were adeptly processed through a machine learning classification framework. The method achieved an accuracy rate of up to 92% in the identification of depressive states using a cross-validation approach, which supports the potential of biophysical data in enhancing the diagnostic process of depression. The findings indicate that an enlarged sample size markedly improves the performance of the machine learning algorithm. Although a lower accuracy rate of 88% was observed with the combined dataset, the increased volume of information appears to improve its reliability.
Compared to other studies, our methodology yields comparable results, demonstrating its efficacy even with a reduced number of test subjects and characteristics. The findings of this study further emphasize the importance of integrating multiple signal features and understanding the inherent phenomena underlying feature extraction. Although advanced deep learning models have exhibited superior performance, they require extensive datasets and often disregard the underpinning mechanisms of the features. Our approach, which encompasses a thorough analysis of signal characteristics, presents a viable alternative with substantial potential for enhancement with larger datasets.
Furthermore, the variables in the frequency domain have proven effective in exploring associations with depression, implying that this methodology could be extended to other mental and behavioral disorders. With additional refinements, this approach could be utilized as a screening tool in psychiatry, incorporated into everyday devices for preventive diagnostics, and potentially lead to alarm systems for individuals with suicidal thoughts. This quantitative method offers a significant advantage in the prevention, diagnosis, and treatment of socioaffective disorders, addressing the challenges healthcare professionals face in managing these conditions.

Author Contributions

Conceptualization, R.T., J.D.l.R., and M.H.; methodology, M.A., R.T., and A.Q.-Z.; software, M.A.; validation, M.A., A.Q.-Z., and M.H.; formal analysis, M.A. and J.D.l.R.; data curation, M.A. and R.T.; writing—original draft preparation, M.A. and A.Q.-Z.; writing—review and editing, J.D.l.R., R.T., A.Q.-Z., and M.H.; visualization, M.A.; supervision, R.T., J.D.l.R., and M.H.; project administration, M.H. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

The study was conducted in accordance with the Declaration of Helsinki, and approved by the Ethics Committee of “Comité local de investigación en salud 1001” (protocol code R-2023-1001-064 with date of approval 26 May 2023).

Informed Consent Statement

Informed consent was obtained from all subjects involved in the study.

Data Availability Statement

The databases acquired in this research are available for access upon request. Please contact Mateo Alzate via email for further information.

Acknowledgments

Special thanks to the High Specialty Medical Center of the Bajio T1, the staff of the SPPSTIMS section, the research department, and finally, Martha for allowing the use of the facility, providing the necessary materials, and supporting the research.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
ECG	Electrocardiography
PPG	Photoplethysmography
EEG	Electroencephalography
HRV	Heart rate variability
EDR	Electrocardiography-derived respiration
RES	Respiration
PSI	Phase synchronization index
MLP	Multi-layer perceptron
SVM	Support vector machine
SHAP	SHapley Additive exPlanations
PTT	Pulse time transit

References

1. Perry, S.W.; Rainey, J.C.; Allison, S.; Bastiampillai, T.; Wong, M.L.; Licinio, J.; Sharfstein, S.S.; Wilcox, H.C. Achieving health equity in US suicides: A narrative review and commentary. BMC Public Health 2022, 22, 1360.
2. WHO. Preventing Suicide: A Global Imperative; World Health Organization: Geneva, Switzerland, 2014. Available online: https://www.who.int/publications/i/item/9789241564779 (accessed on 9 September 2024).
3. OMS. Depresión; OMS: Geneva, Switzerland, 2021.
4. Vázquez, F.L.; Muñoz, R.F.; Becoña, E. Depresión: Diagnóstico, modelos teóricos y tratamiento a finales del siglo XX. Psicol. Conduct. 2000, 8, 417–449.
5. Varela, J.A.; Ramírez, H.L.G.; Baeza, J.A.N.; Preciado, J.I.S.; Aguilar, J.F.; Torres, M.E.L.; Sirot, G.Z.; Orozco, D.T.; Gaytán, J.M.Q. 2º Diagnóstico Operativo de Salud Mental y Adicciones. 2022. Available online: https://www.gob.mx/cms/uploads/attachment/file/730678/SAP-DxSMA-Informe-2022-rev07jun2022.pdf (accessed on 9 September 2024).
6. INEGI. Encuesta Nacional de Bienestar Autorreportado ENBIARE 2021. 2021. Available online: https://www.inegi.org.mx/programas/enbiare/2021/ (accessed on 9 September 2024).
7. Richter, T.; Fishbain, B.; Richter-Levin, G.; Okon-Singer, H. Machine learning-based behavioral diagnostic tools for depression: Advances, challenges, and future directions. J. Pers. Med. 2021, 11, 957.
8. Gilbody, S.; House, A.; Sheldon, T. Screening and case finding instruments for depression. Cochrane Database Syst. Rev. 2005, 4.
9. Ayano, G.; Demelash, S.; Yohannes, Z.; Haile, K.; Tulu, M.; Assefa, D.; Tesfaye, A.; Haile, K.; Solomon, M.; Chaka, A.; et al. Misdiagnosis, detection rate, and associated factors of severe psychiatric disorders in specialized psychiatry centers in Ethiopia. Ann. Gen. Psychiatry 2021, 20, 10.
10. Norra, C.; Skobel, E.C.; Arndt, M.; Schauerte, P. High impact of depression in heart failure: Early diagnosis and treatment options. Int. J. Cardiol. 2008, 125, 220–231.
11. Byers, A.L.; Yaffe, K. Depression and risk of developing dementia. Nat. Rev. Neurol. 2011, 7, 323–331.
12. Mezuk, B.; Eaton, W.W.; Albrecht, S.; Golden, S.H. Depression and type 2 diabetes over the lifespan: A meta-analysis. Diabetes Care 2008, 31, 2383–2390.
13. OMS. La Inversión en el Tratamiento de la Depresión y la Ansiedad Tiene un Rendimiento del 400%; OMS: Geneva, Switzerland, 2016.
14. Akar, S.A.; Kara, S.; Agambayev, S.; Bilgiç, V. Nonlinear analysis of EEGs of patients with major depression during different emotional states. Comput. Biol. Med. 2015, 67, 49–60.
15. Byun, S.; Kim, A.Y.; Jang, E.H.; Kim, S.; Choi, K.W.; Yu, H.Y.; Jeon, H.J. Entropy analysis of heart rate variability and its application to recognize major depressive disorder: A pilot study. Technol. Health Care 2019, 27, 407–424.
16. Cai, H.; Han, J.; Chen, Y.; Sha, X.; Wang, Z.; Hu, B.; Yang, J.; Feng, L.; Ding, Z.; Chen, Y.; et al. A Pervasive Approach to EEG-Based Depression Detection. Complexity 2018, 2018, 5238028.
17. Heylen, J.; Mechelen, I.V.; Fried, E.I.; Ceulemans, E. Two-mode K-spectral centroid analysis for studying multivariate longitudinal profiles. Chemom. Intell. Lab. Syst. 2016, 154, 194–206.
18. Siedlecka, E.; Denson, T.F. Experimental Methods for Inducing Basic Emotions: A Qualitative Review. Emot. Rev. 2019, 11, 87–97.
19. Bichescu-Burian, D.M.; Grieb, B.; Steinert, T.; Uhlmann, C.; Steyer, J. Use of a psychophysiological script-driven imagery experiment to study trauma-related dissociation in borderline personality disorder. J. Vis. Exp. 2018, 2018, e56111.
20. Kearns, M.; Engelhard, I.M. Psychophysiological responsivity to script-driven imagery: An exploratory study of the effects of eye movements on public speaking flashforwards. Front. Psychiatry 2015, 6, 115.
21. Byun, S.; Kim, A.Y.; Jang, E.H.; Kim, S.; Choi, K.W.; Yu, H.Y.; Jeon, H.J. Detection of major depressive disorder from linear and nonlinear heart rate variability features during mental task protocol. Comput. Biol. Med. 2019, 112, 103381.
22. Kuang, D.; Yang, R.; Chen, X.; Lao, G.; Wu, F.; Huang, X.; Lv, R.; Zhang, L.; Song, C.; Ou, S. Depression recognition according to heart rate variability using Bayesian Networks. J. Psychiatr. Res. 2017, 95, 282–287.
23. Zhang, F.; Wang, M.; Qin, J.; Zhao, Y.; Sun, X.; Wen, W. Depression Recognition Based on Electrocardiogram. In Proceedings of the 2023 8th International Conference on Computer and Communication Systems (ICCCS), Guangzhou, China, 21–23 April 2023; pp. 1–5.
24. Khosla, A.; Khandnor, P.; Chand, T. Automated diagnosis of depression from EEG signals using traditional and deep learning approaches: A comparative analysis. Biocybern. Biomed. Eng. 2022, 42, 108–142.
25. Xu, Y.; Goodacre, R. On Splitting Training and Validation Set: A Comparative Study of Cross-Validation, Bootstrap and Systematic Sampling for Estimating the Generalization Performance of Supervised Learning. J. Anal. Test. 2018, 2, 249–262.
26. Bachmann, M.; Päeske, L.; Kalev, K.; Aarma, K.; Lehtmets, A.; Ööpik, P.; Lass, J.; Hinrikus, H. Methods for classifying depression in single channel EEG using linear and nonlinear signal analysis. Comput. Methods Programs Biomed. 2018, 155, 11–17.
27. Zang, X.; Li, B.; Zhao, L.; Yan, D.; Yang, L. End-to-End Depression Recognition Based on a One-Dimensional Convolution Neural Network Model Using Two-Lead ECG Signal. J. Med. Biol. Eng. 2022, 42, 225–233.
28. Liu, W.; Jia, K.; Wang, Z.; Ma, Z. A Depression Prediction Algorithm Based on Spatiotemporal Feature of EEG Signal. Brain Sci. 2022, 12, 630.
29. Lieber, C.A.; Mahadevan-Jansen, A. Automated Method for Subtraction of Fluorescence from Biological Raman Spectra. Appl. Spectrosc. 2003, 57, 1363–1367.
30. McCraty, R.; Shaffer, F. Heart rate variability: New perspectives on physiological mechanisms, assessment of self-regulatory capacity, and health risk. Glob. Adv. Health Med. 2015, 4, 46–61.
31. Mejia-Mejia, E.; Torres, R.; Restrepo, D. Assessment of high coherent states using heart rate variability, pulse transit time and respiratory signals. Biomed. Phys. Eng. Express 2019, 5, 045008.
32. Dwivedi, D.; Ganguly, A.; Haragopal, V. Contrast between simple and complex classification algorithms. In Statistical Modeling in Machine Learning; Elsevier: Amsterdam, The Netherlands, 2023; pp. 93–110.
33. Sugavaneswaran, L. Mathematical Modeling of Gene Networks. In Encyclopedia of Biomedical Engineering; Elsevier: Amsterdam, The Netherlands, 2019; pp. 33–55.
34. McCraty, R. Coherence: Bridging personal, social and global health. Altern. Ther. Health Med. 2011, 16, 10–24.
35. Kedem, B. Spectral analysis and discrimination by zero-crossings. Proc. IEEE 1986, 74, 1477–1493.
36. Lundberg, S.M.; Erion, G.; Chen, H.; DeGrave, A.; Prutkin, J.M.; Nair, B.; Katz, R.; Himmelfarb, J.; Bansal, N.; Lee, S.I. From local explanations to global understanding with explainable AI for trees. Nat. Mach. Intell. 2020, 2, 56–67.
37. Vujović, Ž. Classification Model Evaluation Metrics. Int. J. Adv. Comput. Sci. Appl. 2021, 12, 599–606.
38. MacFarland, T.; Yates, J. Mann–Whitney U Test. In Introduction to Nonparametric Statistics for the Biological Sciences Using R; Springer: Cham, Switzerland, 2016; pp. 103–132.
39. Noor, S.T.; Asad, S.T.; Khan, M.M.; Gaba, G.S.; Al-Amri, J.F.; Masud, M. Predicting the Risk of Depression Based on ECG Using RNN. Comput. Intell. Neurosci. 2021, 2021, 1299870.
40. Zitouni, M.S.; Oh, S.L.; Vicnesh, J.; Khandoker, A.; Acharya, U.R. Automated recognition of major depressive disorder from cardiovascular and respiratory physiological signals. Front. Psychiatry 2022, 13, 970993.
41. Cai, H.; Chen, Y.; Han, J.; Zhang, X.; Hu, B. Study on Feature Selection Methods for Depression Detection Using Three-Electrode EEG Data. Interdiscip. Sci.-Comput. Life Sci. 2018, 10, 558–565.
42. Pange, S.; Pawar, V. Depression Analysis Based on EEG and ECG Signals. In Proceedings of the 2023 4th International Conference for Emerging Technology (INCET), Belgaum, India, 26–28 May 2023; pp. 1–6.
43. Mor, Y. Diagnostic Test Evaluation; Elsevier: Amsterdam, The Netherlands, 2023; pp. 221–224.
Figure 1. Visual overview of how the subsignals are obtained: (1) heart rate variability, (2) ECG-derived respiration, obtained by interpolation, and (3) pulse transit time, from the interaction between the PPG and ECG signals.
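The subsignal derivations in Figure 1 can be sketched in a few lines of NumPy. This is a minimal illustration with hypothetical beat times, not the study's own pipeline: the HRV tachogram is the series of RR intervals between successive R-peaks, PTT is the delay from each R-peak to the next PPG pulse foot, and interpolation resamples the unevenly spaced beat series onto an even grid for spectral analysis.

```python
import numpy as np

def hrv_tachogram(r_peak_times):
    """RR intervals (s) between successive ECG R-peaks: the raw HRV series."""
    return np.diff(r_peak_times)

def pulse_transit_time(r_peak_times, ppg_foot_times):
    """PTT: delay from each R-peak to the next PPG pulse foot."""
    ptt = []
    for r in r_peak_times:
        later = ppg_foot_times[ppg_foot_times > r]
        if later.size:
            ptt.append(later[0] - r)
    return np.array(ptt)

# Hypothetical beat times (s), for illustration only.
r_peaks = np.array([0.00, 0.82, 1.63, 2.47, 3.30])
ppg_feet = r_peaks + 0.21          # assume a ~210 ms pulse-arrival delay

rr = hrv_tachogram(r_peaks)
ptt = pulse_transit_time(r_peaks, ppg_feet)

# Resample the tachogram onto an even 4 Hz grid via linear interpolation,
# as needed before computing spectral features.
t_beats = r_peaks[1:]
t_even = np.arange(t_beats[0], t_beats[-1], 0.25)
rr_even = np.interp(t_even, t_beats, rr)
```

The same interpolation step stands in for the ECG-derived respiration curve, which the figure describes as obtained by interpolating over the beat series.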
Figure 2. Confusion matrices for the Random Forest (RF) classifiers, comparing the classification performance of the stratified k-fold (SKf) method on each dataset (Row 1) with the classification results of the train–test split (TTs) test group (Row 2). (a) Combined Stimuli dataset, SKf RF; (b) Emotional Stimuli dataset, SKf RF; (c) Neutral Stimuli dataset, SKf RF; (d) Combined Stimuli dataset, TTs RF; (e) Emotional Stimuli dataset, TTs RF; (f) Neutral Stimuli dataset, TTs RF.
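The two rows of Figure 2 correspond to two evaluation protocols. A hedged scikit-learn sketch of how such matrices are typically produced (synthetic data stands in for the study's feature sets): pooled out-of-fold predictions for the stratified k-fold row, and a single held-out split for the train–test row.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import (StratifiedKFold, cross_val_predict,
                                     train_test_split)

# Stand-in data; the study's own feature tables would replace this.
X, y = make_classification(n_samples=120, n_features=10, random_state=0)
clf = RandomForestClassifier(random_state=0)

# Row 1 of the figure: pooled out-of-fold predictions from stratified k-fold.
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
y_oof = cross_val_predict(clf, X, y, cv=skf)
cm_skf = confusion_matrix(y, y_oof)

# Row 2: a single stratified held-out test split.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)
cm_tts = confusion_matrix(y_te, clf.fit(X_tr, y_tr).predict(X_te))
```

With `cross_val_predict`, every sample appears exactly once in the k-fold matrix, whereas the train–test matrix covers only the held-out 30%.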
Figure 3. (1) SHAP bar plot ranking the features from greatest to least impact on the Random Forest classification of the combined dataset under the train–test split method; the bars are the mean absolute values of the calculated SHAP scores. (2) Beeswarm plot for the same method, with one colored point per data point of each feature; red indicates a high feature value and blue a low one. Points on the positive side of the axis pushed the classification toward the positive class, and points on the negative side pushed it away.
Figure 4. SHAP bar plot and beeswarm plot of the Random Forest for the combined dataset, in this case for the stratified k-fold trial. (1) Bar plot; (2) beeswarm plot.
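The bar plots in Figures 3 and 4 summarize each feature by its mean absolute SHAP value. The aggregation itself is simple: assuming a per-sample SHAP matrix has already been computed (e.g., by a tree explainer from the shap library), it reduces to the NumPy sketch below. The numbers are illustrative, not the paper's values.

```python
import numpy as np

# Hypothetical per-sample SHAP values: rows = samples, columns = features.
feature_names = np.array(["SCTTP", "PCTTP", "SCHRV", "DETTP"])
shap_values = np.array([
    [ 0.30, -0.10,  0.05,  0.02],
    [-0.25,  0.20, -0.04,  0.01],
    [ 0.28, -0.15,  0.06, -0.03],
])

# Bar-plot statistic: mean absolute SHAP value per feature, sorted descending.
mean_abs = np.abs(shap_values).mean(axis=0)
order = np.argsort(mean_abs)[::-1]
ranking = list(zip(feature_names[order], mean_abs[order].round(3)))
```

The beeswarm plot shows the same matrix without taking absolute values, so each point keeps its sign (direction of the push) and is colored by the underlying feature value.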
Table 1. Description of all features implemented in the classification exercise.

Type of Feature | Feature Names | Description
Main signals | HRV, FC, TTP, RES | Sampled at 128 Hz with a 3 min recording duration
Physical coherence | PCRES, PCHRV, PCTTP | Coherence values between 0 and 1
Global coherence index (GCI) | IDCG | Coherence values between 0 and 1
Average mean values | HRVM, TTPM, FCM | Mean heart rate between 50 and 120 bpm; variability values > 600 ms; transit time values between 500 and 620 ms
Standard deviations | DEHRV, DETTP, CVECG, CVPPG | Average standard deviation of HRV and TTP, and mean-related deviation of ECG and PPG
Zero crossings | P0HRV, P0TTP, P0RES | Number of zero crossings in each signal
Entropy | SHAHRV, SHATTP, SHARES | Shannon entropy of each signal, examined in a 0–0.4 Hz bandwidth
Spectral centroid | SCHRV, SCTTP, SCRES | Spectral centroid of each signal, examined in a 0–0.4 Hz bandwidth
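Three of the scalar features in Table 1 — zero crossings, Shannon entropy, and the spectral centroid over 0–0.4 Hz — admit compact NumPy implementations. The definitions below are common textbook forms and an assumption about the paper's exact formulas (the histogram bin count for the entropy, for instance, is a hypothetical choice).

```python
import numpy as np

def zero_crossings(x):
    """Count sign changes in a zero-mean version of the signal."""
    x = x - np.mean(x)
    return int(np.sum(np.signbit(x[:-1]) != np.signbit(x[1:])))

def shannon_entropy(x, bins=16):
    """Shannon entropy (bits) of the signal's amplitude histogram."""
    counts, _ = np.histogram(x, bins=bins)
    p = counts[counts > 0] / len(x)
    return float(-np.sum(p * np.log2(p)))

def spectral_centroid(x, fs, band=(0.0, 0.4)):
    """Power-weighted mean frequency (Hz) within the stated band."""
    freqs = np.fft.rfftfreq(len(x), d=1 / fs)
    power = np.abs(np.fft.rfft(x)) ** 2
    m = (freqs >= band[0]) & (freqs <= band[1])
    return float(np.sum(freqs[m] * power[m]) / np.sum(power[m]))

# Sanity check: a 0.1 Hz tone sampled at 4 Hz should centroid near 0.1 Hz.
fs = 4.0
t = np.arange(0, 300, 1 / fs)
sig = np.sin(2 * np.pi * 0.1 * t)
```

The 0–0.4 Hz band matches the bandwidth stated in the table for the entropy and centroid features.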
Table 2. Performance results of the emotional stimulus database in the validation group for the best classification models.

AI Model Name | Validation Accuracy | Validation Precision | Validation Recall | Validation F1-Score
Random Forest | 0.64 | 0.64 | 0.64 | 0.63
Logistic Regression | 0.50 | 0.50 | 0.50 | 0.50
AdaBoost Classifier | 0.50 | 0.50 | 0.50 | 0.50
Table 3. Performance results of the neutral stimulus database in the validation group for the best classification models.

AI Model Name | Validation Accuracy | Validation Precision | Validation Recall | Validation F1-Score
Random Forest | 0.81 | 0.81 | 0.81 | 0.81
Multi-Layer Perceptron | 0.67 | 0.67 | 0.66 | 0.67
AdaBoost Classifier | 0.67 | 0.68 | 0.66 | 0.66
Table 4. Performance results of the Combined Stimuli database in the validation group for the best classification models.

AI Model Name | Validation Accuracy | Validation Precision | Validation Recall | Validation F1-Score
Random Forest | 0.76 | 0.77 | 0.76 | 0.76
Logistic Regression | 0.62 | 0.62 | 0.62 | 0.62
AdaBoost Classifier | 0.60 | 0.60 | 0.59 | 0.60
Table 5. Performance results of the three datasets using stratified five-fold cross-validation, comparing each model (first column) against the metrics for all datasets (first row).

Model Name | Metric | Combined Dataset | Emotional Dataset | Neutral Dataset
Random Forest | Accuracy | 0.88 | 0.92 | 0.89
Random Forest | Precision | 0.87 | 0.92 | 0.89
Random Forest | Recall | 0.87 | 0.91 | 0.89
Random Forest | F1-Score | 0.87 | 0.91 | 0.89
Logistic Regression | Accuracy | 0.64 | 0.58 | 0.61
Logistic Regression | Precision | 0.64 | 0.57 | 0.61
Logistic Regression | Recall | 0.64 | 0.57 | 0.61
Logistic Regression | F1-Score | 0.64 | 0.57 | 0.61
AdaBoost Classifier | Accuracy | 0.62 | 0.76 | 0.71
AdaBoost Classifier | Precision | 0.65 | 0.76 | 0.76
AdaBoost Classifier | Recall | 0.62 | 0.76 | 0.71
AdaBoost Classifier | F1-Score | 0.60 | 0.76 | 0.69
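Table 5's protocol — stratified five-fold cross-validation with four metrics per model — maps directly onto scikit-learn's `cross_validate`. A sketch on stand-in data; macro averaging is an assumption here, since the paper does not state how precision, recall, and F1 were averaged.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_validate

# Synthetic stand-in for the study's feature matrix and labels.
X, y = make_classification(n_samples=118, n_features=20, random_state=1)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)
scores = cross_validate(
    RandomForestClassifier(random_state=1), X, y, cv=cv,
    scoring=("accuracy", "precision_macro", "recall_macro", "f1_macro"),
)

# Average each metric over the five folds, as in the table's cells.
summary = {k.removeprefix("test_"): round(float(v.mean()), 2)
           for k, v in scores.items() if k.startswith("test_")}
```

Stratification keeps the depressed/non-depressed class ratio constant across folds, which matters for a screening-style dataset with imbalanced labels.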
Table 6. Performance results of the combined stimuli database in the validation group for the best classification models of the train–test split method, including only the 10 best features found with the SHAP tool.

AI Model Name | Validation Accuracy | Validation Precision | Validation Recall | Validation F1-Score
Random Forest | 0.71 | 0.73 | 0.71 | 0.71
Support Vector Machine | 0.67 | 0.67 | 0.67 | 0.67
AdaBoost Classifier | 0.64 | 0.64 | 0.64 | 0.64
Table 7. Performance results of the combined stimuli database with the stratified k-fold validation method for the best classification models, including only the 10 best features found with the SHAP tool.

AI Model Name | Validation Accuracy | Validation Precision | Validation Recall | Validation F1-Score
Random Forest | 0.81 | 0.81 | 0.81 | 0.81
Logistic Regression | 0.59 | 0.59 | 0.59 | 0.59
AdaBoost Classifier | 0.60 | 0.67 | 0.61 | 0.56
Table 8. Hypothesis test results of the Mann–Whitney U test for each variable, showing the mean, standard deviation, and p-value. Values that rejected the null hypothesis are marked with an asterisk (*).

Feature Name | Mean | Deviation | p-Value
PCRES | 0.134 | 0.051 | 0.168
PCHRV | 0.188 | 0.083 | 0.174
PCTTP | 0.193 | 0.104 | <0.001 *
IDCG | 0.182 | 0.038 | 0.335
HRVM | 0.803 | 0.131 | 0.660
TTPM | 0.606 | 0.031 | 0.931
FCM | 74.664 | 12.131 | 0.721
DEHRV | 0.086 | 0.078 | 0.916
DETTP | 0.013 | 0.009 | 0.101
CVECG | 56.093 | 14.502 | 0.156
CVPPG | 56.917 | 9.452 | 0.319
P0HRV | 32.592 | 15.795 | 0.143
P0TTP | 39.583 | 17.809 | 0.258
P0RES | 50.330 | 13.373 | 0.364
SHAHRV | 13.413 | 0.350 | 0.176
SHATTP | 13.408 | 0.349 | 0.081
SHARES | 13.426 | 0.351 | 0.169
SCHRV | 0.156 | 0.025 | 0.021 *
SCTTP | 0.170 | 0.029 | 0.013 *
SCRES | 0.196 | 0.019 | 0.751
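The per-feature comparison behind Table 8 can be reproduced with SciPy's two-sided Mann–Whitney U test. The two groups below are synthetic stand-ins for the depressed and non-depressed subjects' values of a single feature; only the call pattern, not the data, reflects the study.

```python
import numpy as np
from scipy.stats import mannwhitneyu

rng = np.random.default_rng(0)
# Hypothetical per-group feature values (e.g., SCTTP for each group).
group_a = rng.normal(0.17, 0.03, size=30)
group_b = rng.normal(0.15, 0.03, size=29)

stat, p = mannwhitneyu(group_a, group_b, alternative="two-sided")
significant = p < 0.05   # reject H0 of equal distributions at the 5% level
```

Being rank-based, the test needs no normality assumption, which suits the mix of bounded coherence indices and heavy-tailed timing features in the table.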
Table 9. Sensitivity and specificity (class-wise recall) results of the three datasets, comparing stratified five-fold cross-validation with the train–test split method.

Dataset | Clinical Metric | Stratified k-Fold | Train–Test Split
Combined Dataset | Sensitivity | 0.80 | 0.43
Combined Dataset | Specificity | 0.83 | 0.52
Emotional Dataset | Sensitivity | 0.88 | 0.55
Emotional Dataset | Specificity | 0.95 | 0.73
Neutral Dataset | Sensitivity | 0.87 | 0.81
Neutral Dataset | Specificity | 0.92 | 0.80
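Sensitivity and specificity in Table 9 are the class-wise recalls read directly off a 2 × 2 confusion matrix. A small sketch; the matrix counts below are illustrative, not the study's fold-level counts.

```python
import numpy as np

def sensitivity_specificity(cm):
    """cm: 2x2 confusion matrix [[TN, FP], [FN, TP]] (sklearn layout)."""
    tn, fp, fn, tp = np.asarray(cm).ravel()
    sensitivity = tp / (tp + fn)   # recall on the positive (depressive) class
    specificity = tn / (tn + fp)   # recall on the negative class
    return sensitivity, specificity

# Illustrative counts only.
sens, spec = sensitivity_specificity([[46, 4], [6, 44]])
```

For a screening application, sensitivity is the clinically critical number: a false negative here is a depressive subject the tool fails to flag.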
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
