Introduction
Cardiovascular diseases (CVDs) are the leading cause of death worldwide.1,2 Until the 1940s, the risk factors associated with CVD were unknown, even though these diseases had already caused substantial mortality among Americans.3 During that period of epidemiological transition, the treatment and prevention of CVD therefore lacked clear direction.4 The need for a study to investigate the causes of the increasing burden of CVD was recognized.5
The Framingham study was initiated following 5,209 adults free from overt CVD, representing about 19% of Framingham's population.6 The main objective was to follow the included participants for the development of coronary heart disease over 20 years.3 The first results of this cohort emerged four years after the beginning of the study and revealed that high blood pressure, high cholesterol levels, and overweight were associated with new-onset coronary heart disease.6 Although overweight had stood out as a factor related to the development of CVD, obesity was not very prevalent at that time. It was only at the end of the 1970s that its prevalence reached epidemic levels worldwide,7 and today it represents one of the most serious public health challenges.8
Obesity is a complex disease of multiple aetiologies, with its own pathophysiologies, comorbidities, and disabilities. It influences CVD directly and indirectly, affecting endothelial and myocyte function and enhancing major cardiovascular risk factors such as diabetes, hypertension, and hyperlipidaemia.9 Obesity therefore evokes or exacerbates CVD. In addition, parameters indicative of autonomic nervous system (ANS) imbalance are commonly observed in these individuals, with increased sympathetic and reduced parasympathetic nervous activity.10
ANS imbalance, or dysautonomia, often progresses to ANS dysfunction and is one of the most overlooked and misdiagnosed conditions.11 The ANS plays a major role in the integrated regulation of food intake, involving satiety signals and energy expenditure; thus, ANS dysregulation might favor body weight gain. Conversely, obesity might trigger alterations in the sympathetic regulation of cardiovascular function, favoring the development of cardiovascular complications and events.12 Neural mechanisms have been implicated in the pathogenesis of obesity, particularly sympathovagal imbalance, and the relative predominance of sympathetic activity has been suggested to play a pivotal role in this complex bi-directional relationship.13
Heart rate variability (HRV) is a non-invasive method to evaluate the modulation exerted by the ANS on the sinoatrial node. By describing the oscillations between consecutive electrocardiogram (ECG) R-R intervals, HRV measures the physiological link between the ANS and the heart.14 Thus, HRV is a relevant indicator of cardiovascular autonomic dysregulation in individuals with obesity.10,15
From the Framingham cohort data, cardiovascular risk prediction models for different CVDs have emerged.16 Variables such as age, serum cholesterol, systolic blood pressure, cigarette smoking, left ventricular hypertrophy on ECG, and glucose intolerance were included in the score.16 The 10-year risk estimates used in the 1998 score provided a convenient way to classify individuals as having low, intermediate, or high risk for future coronary heart disease.3,17
Measuring the risk of a future adverse event in low-income countries from a simple ECG may be a good ally for therapeutic adjustment and directing behaviors for individuals at risk of developing CVD, especially when considering long-term clinical follow-up. Simplifying the process that defines cardiovascular risk based on isolated information (i.e., ECG data) can facilitate the identification of a larger number of individuals at risk. Recent advances in artificial intelligence strategies incorporating machine learning (ML) are gaining new applications in the clinical context, including disease prognosis,18–20 and may be tools to be incorporated into the definition of cardiovascular risk.
Reviewing 11 state-of-the-art ML studies that use Framingham-style clinical variables for cardiovascular risk detection, we found that most ML models rely on classic risk factors (age, sex, systolic/diastolic blood pressure,21–31 total and high-density lipoprotein cholesterol, smoking, diabetes), sometimes extending to body mass index (BMI), medications, HbA1c, family history, estimated glomerular filtration rate, and heart rate (HR). Only one study (Yang et al.29) used a broader feature set (49 electronic medical record variables). Classifiers are varied but traditional: Support Vector Machine appears in 4/11 studies, Random Forest in 3/11, and single instances of Logistic Regression, XGBoost, Neural Networks, and gradient boosted trees/proportional hazards regression. Validation is predominantly simple: 7/11 use hold-out splits and 5/11 use cross-validation; no study in the table reports external validation. Cohort sizes and balance vary widely, from very small and imbalanced (e.g., Dogan et al.22: 504 vs 20; Navarini et al.26: 18 vs 115) to large and balanced (e.g., Sajeev et al.28: 23,152 vs 23,152; Quesada et al.23: 5,837 vs 5,837). Performance is heterogeneous. Reported accuracies (n = 9) range from 65.41% to 93.01%, with a mean of 80.21% and a median of 82.35%. Reported areas under the curve (AUCs) (n = 9) span 0.6333 to 0.9220, with a mean of 0.764 and a median of 0.751. The highest AUC is from Yang et al.29 (stroke, XGBoost, 0.9220) on a moderately sized, hold-out cohort; the highest accuracy (93.01%) is from Dogan et al.22 but on a tiny, highly imbalanced dataset (504 low-risk vs 20 high-risk), which likely inflates accuracy. Among larger or balanced cohorts, AUCs cluster in the mid-0.7s to mid-0.8s (Alaa et al.24: 0.774; Sajeev et al.28: 0.852; Cho et al.30: 0.751; Chun et al.31: 0.836), while Quesada et al.23 is a lower outlier at 0.6333 despite balance. Methodologically, 10/11 studies reduce risk to binary endpoints (Low vs. High or event vs. no event), limiting calibration and clinical interpretability relative to full, continuous risk scoring; several explicitly note unbalanced data (at least 5/11) and small sample sizes (3/11).
This study aimed to develop and validate a multimodal ML model that integrates ECG non-linear parameters with medical features (HR, anthropometry, blood glucose lipid profile, and HRV) to classify cardiovascular risk severity in alignment with the Framingham risk score. Specifically, we sought to quantify the incremental value of multimodal integration by comparing ECG-only and medical features-only models with the combined model, to assess the relative contribution of each feature source (including ECG lead positions) to discrimination, and to make our curated dataset publicly available to support transparency, reproducibility, and further research.
Materials and methods
In this section, we describe each step of the experimental study and introduce the database. Figure 1 illustrates the proposed methodology, structured into five main phases: (1) data collection and curation of a database; (2) signal normalization and filtering to ensure quality and consistency; (3) ECG multi-band decomposition via Discrete Wavelet Transform (DWT) and feature extraction: extraction of 27 non-linear features per 1-second segment across five decomposition levels, followed by dimensionality reduction using six statistical functions; (4) statistical analysis: refinement of the ECG-based feature set; and (5) classification and evaluation: integration of the optimized ECG features with 42 additional medical parameters, including HR, anthropometric measures, blood glucose lipid profile (BGLP), and HRV, followed by normalization and processing through an ML pipeline comprising 19 classifiers, resulting in a comprehensive classification report. More details and specific explanations for each phase are provided in the following subsections.
Experimental setup
This work was conducted on a MacBook Pro 14 with an M1 Pro chip (8-core CPU, 14-core GPU) and 16 GB of RAM, using the MATLAB® and Python languages. MATLAB®, version R2023b, was used to extract the non-linear characteristics of the ECG signals, organize them with the different medical features, and compress and organize the data for ML tasks. Python (version 3.9.12) was used to design, train/test, and obtain discrimination reports from the ML models.
Database collection and curation
The database is a cross-sectional study with a quantitative approach and was conducted at the Hospital Universitário Walter Cantídio in Fortaleza, the capital of the State of Ceará, Brazil. The research sample was selected by convenience, and the research was conducted from November 2023 to May 2024 after approval by the institutional Research Ethics Committees (CAAE: 74256823.4.0000.5054 and 74256823.4.3001.5045). The ethical principles recommended by the Declaration of Helsinki and Resolution 466/12 of the Brazilian National Health Council were followed.
Individuals of both sexes aged 30 years or older with a previous nosological diagnosis of obesity, asymptomatic for heart disease, and who underwent laboratory tests (lipid profile and fasting blood glucose) within a maximum period of six months from the interview and data collection were included. Participants in the acute phase of any disease and those with an inability to communicate verbally or cognitive deficits were not included.
A total of 60 patients were initially screened based on the inclusion criteria mentioned above. Of these, five were excluded due to the presence of cardiac disease, discrepancies in diagnostic or comorbidity information, or refusal to provide complete information during the interview, leaving 55 patients eligible for further evaluation. Following anthropometric measurements, vital signs assessment, cardiovascular risk stratification using the Framingham score, and ECG acquisition, two additional patients were excluded due to high interference in HRV measurements. Consequently, the final structured collected database includes data from a total of 53 participants with no missing data (Fig. 2).
ECG recording was performed noninvasively using the PowerLab data-capture hardware system at a sampling frequency of 1,000 Hz. HRV parameters were consolidated beat by beat from lead II of the ECG in the CM5 position (LabChart Pro version 7.3.4, Brazil) and analyzed in software (MATLAB® 6.1.1.450, Release 12.1, 2001).
ECG acquisition was performed at rest in a climate-controlled room in the morning to minimize circadian HR variations.32 For this purpose, all volunteers were previously instructed to abstain from stimulant drugs, caffeine, tobacco, alcohol, ingestion of high-fat foods, and physical activity for at least 24 h beforehand. Recordings were taken from 07:00 to 11:00 a.m. to avoid any hemodynamic effect on HRV. Participants were instructed not to talk during the assessment, thus avoiding interference that could affect the capture of the HR signal.33,34 Initially, the participant remained at rest on a stretcher in a supine position (ECG_D) for 5 minutes. They were then asked to perform the active postural maneuver (APM), during which they were instructed to stand up abruptly and remain in an orthostatic position (ECG_UP) without movement for 5 minutes until the end of the measurement.35,36
The use of APM was considered because it is a technique with potential sensitivity for assessing vagal and cardiac sympathetic responses.35 The use of APM causes reflex stimulation of the baroreceptors and contraction of the muscles of the lower limbs, thus changing the individual’s position from supine to bipedal, favoring the acquisition of higher delta HRV values.36
In the HRV analyses performed at rest and standing, only the stable portions of the tracings were considered, excluding the phase related to movement during the posture change. Such movement generates intense signal instability due to the high cardiocirculatory stress in progress and was therefore not considered valid for interpreting the variables.34,37 Thus, for interpretation purposes, the indexes used in the analysis were: standard deviation of RR intervals (hereinafter referred to as SDRR), square root of the mean of the squares of successive differences (hereinafter referred to as RMSSD), percentage of successive RR intervals with differences greater than 50 ms (hereinafter referred to as pRR50), low-frequency power (LF), high-frequency power (HF), low/high-frequency ratio (LF/HF), short-term variability (SD1), long-term variability (SD2), ratio between standard deviations 1 and 2 (SD1/SD2), and ratio between standard deviations 2 and 1 (SD2/SD1).
The volunteers were then assessed and stratified for cardiovascular risk using the Framingham score with information on sex, age, total and fractional cholesterol, blood pressure, smoking, and diabetes.38 Based on the percentage estimate, cardiovascular risk was classified into three categories: Low (<5%), Moderate (5% to 20% for men and 5% to 10% for women), and High (>20% for men and >10% for women).38–40
The Low cardiovascular risk class included 22 participants, of whom 95.45% (21 out of 22) were women, with an average age of 35.77 years and an average BMI of 42.96 kg/m2. The Moderate cardiovascular risk class included 14 participants, of whom 71.43% (10 out of 14) were women, with an average age of 48.43 years and a mean BMI of 39.26 kg/m2.
The High cardiovascular risk class comprised 17 participants, of whom 88.24% (15 out of 17) were women, with an average age of 54.24 years and an average BMI of 45.03 kg/m2. The number of participants results in a minor class imbalance, with a maximum ratio of 60:40, which is generally not considered critical for ML performance, as suggested by Thabtah et al.41 Table S1 in the Supplement provides additional details regarding the participants' characteristics.
The database is registered at Mendeley Data - DOI: 10.17632/z8mrvy259n.1, and it includes:
An Excel file (Framingham Patients Information.xlsx), which contains each patient’s gender, age, Framingham risk score, and a set of medical features grouped into four categories:
HR: resting HR, HR during APM, average HR, systolic and diastolic blood pressure, mean arterial pressure, resting double product, estimated double product during APM, and average double product.
Anthropometry: BMI, abdominal circumference, waist circumference, and neck circumference.
BGLP: LDL cholesterol, HDL cholesterol, and glycemia.
HRV: average RR interval, SDRR, RMSSD, pRR50, LF Power, HF Power, LF/HF Power ratio, SD1, SD2, SD1/SD2, and SD2/SD1 ratios.
A folder named “ECG Signals”, containing raw ECG recordings for each patient in .mat format.
Note that in this study, we additionally combined non-linear features extracted from the ECG signals with the clinical data provided in the Excel file to enhance the analysis of cardiovascular risk.
Signal normalization
The ECG signals, x(n), were loaded into MATLAB® and normalized according to the Root Mean Square normalization formula.42
$$x(n) = \frac{x(n)}{\sqrt{\frac{1}{N}\sum_{n=0}^{N-1} x^{2}(n)}},$$
where N represents the signal's length. Then, the average value was removed from the entire signal.
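For reference, a minimal NumPy sketch of this normalization step is given below; the function name and structure are ours, not the original implementation:

```python
import numpy as np

def rms_normalize(x):
    """Divide the signal by its root-mean-square value, then remove the mean."""
    x = np.asarray(x, dtype=float)
    x = x / np.sqrt(np.mean(x ** 2))  # RMS normalization
    return x - x.mean()               # remove the average value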
Signal filtering
The signals were sampled at a frequency rate of 1,000 Hz, and an elliptic band-pass filter of order 16 with frequency cut-offs at 1 Hz and 40 Hz,43 a steepness of 0.85, and a stop-band attenuation of 60 dB was applied to the signal. Figure 3 presents 2-second excerpts per class after signal filtering.
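A possible SciPy re-implementation of this filtering stage is sketched below; the original filter was designed in MATLAB, and the passband ripple value here is an assumption, since only the steepness and stop-band attenuation are reported:

```python
from scipy.signal import ellip, sosfiltfilt

FS = 1000  # sampling rate (Hz)

# An elliptic band-pass prototype of order 8 yields a 16th-order filter,
# matching the order-16 design described in the text. rs = 60 dB is the
# reported stop-band attenuation; rp (passband ripple, dB) is our assumption.
SOS = ellip(N=8, rp=0.5, rs=60, Wn=[1, 40], btype="bandpass", fs=FS, output="sos")

def filter_ecg(x):
    """Zero-phase 1-40 Hz band-pass filtering of a 1 kHz ECG signal."""
    return sosfiltfilt(SOS, x)
```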
ECG Multi-band decomposition via wavelet transform and feature extraction
The DWT is a robust tool for analyzing finite-energy discrete-time signals. It decomposes a signal into a family of base functions constructed from a small set of prototype sequences and their time shifts, enabling compact, time-frequency-localized representations that are well-suited to nonstationary signals. Decomposition and perfect reconstruction are performed with an octave-band, critically decimated filter bank, following the framework introduced by Malvar et al.44 and extended by Vetterli et al.45 When focusing on positive frequencies, each sub-band in the transform is confined to a specific range:
$$W_m = \begin{cases} \left[0, \dfrac{\pi}{2^{S}}\right], & m = 0,\\[4pt] \left[\dfrac{\pi}{2^{S-m+1}}, \dfrac{\pi}{2^{S-m}}\right], & m = 1, 2, \ldots, S, \end{cases}$$
where S is the number of levels, S + 1 is the number of sub-bands, and π is the normalized angular frequency equivalent to half the sampling rate. The DWT utilizes an analysis scale function, $\tilde{\phi}_1(n)$, and an analysis wavelet function, $\tilde{\psi}_1(n)$, defined as:
$$\tilde{\phi}_1(n) = h_{LP}(n) \quad \text{and} \quad \tilde{\psi}_1(n) = h_{HP}(n),$$
where $h_{LP}(n)$ and $h_{HP}(n)$ are the impulse responses of the low-pass and high-pass analysis filters, respectively. The following recursion formulas are defined:
$$\tilde{\phi}_{i+1}(n) = \tilde{\phi}_i(n/2) * \tilde{\phi}_1(n), \qquad \tilde{\psi}_{i+1}(n) = \tilde{\phi}_i(n) * \tilde{\psi}_1(n/2^{i}),$$
where the symbol “*” denotes the convolution operation. The analysis filter for the mth sub-band is expressed as:
$$h_m(n) = \begin{cases} \tilde{\phi}_S(n), & m = 0,\\ \tilde{\psi}_{S+1-m}(n), & m = 1, 2, \ldots, S. \end{cases}$$
The mth sub-band signal is computed as:
$$x_m(n) = \begin{cases} \displaystyle\sum_{k=-\infty}^{\infty} x(k)\, h_m(2^{S} n - k), & m = 0,\\ \displaystyle\sum_{k=-\infty}^{\infty} x(k)\, h_m(2^{S-m+1} n - k), & m = 1, 2, \ldots, S. \end{cases}$$
In this study, the DWT was used to decompose each ECG signal into sub-bands (xm(n)) up to level five (S = 5). The Symlet7 wavelet was selected for its strong performance in ECG analysis up to five decomposition levels.19,25 To maintain consistency with the original sampling rate, the sub-band signals xm(n) were resampled to the native rate using a wavelet interpolation method. Figure 4 provides a visual representation of the multiscale analysis of a 1-second ECG segment performed using the DWT.
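One way to reproduce this decomposition in Python uses PyWavelets; reconstructing each coefficient set in isolation returns every sub-band at the native sampling rate, which plays the role of the wavelet interpolation step described above (a sketch under those assumptions, not the authors' MATLAB code):

```python
import numpy as np
import pywt

def dwt_subbands(x, wavelet="sym7", level=5):
    """Decompose a signal into level+1 full-rate sub-band signals
    (approximation A5 plus details D5..D1) using the Symlet7 wavelet."""
    coeffs = pywt.wavedec(x, wavelet, level=level)
    bands = []
    for i in range(len(coeffs)):
        # Zero out all coefficient sets except the i-th, then reconstruct,
        # so each band comes back at the original sampling rate.
        kept = [c if j == i else np.zeros_like(c) for j, c in enumerate(coeffs)]
        bands.append(pywt.waverec(kept, wavelet)[: len(x)])
    return bands
```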
Feature extraction
After that, 27 non-linear features (see Table 1 for more information46–68) were extracted from the signal's full-band and sub-bands every 1 s using a sliding, non-overlapping rectangular window. The resulting time series of each feature per band were then compressed over time by six distinct statistical functions: average (Avg), standard deviation (Std), 95th percentile (P95), variance (Var), median (Med), and kurtosis (Kur).19
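The compression step can be sketched as follows (note that SciPy's kurtosis defaults to the excess definition, whereas MATLAB's does not subtract 3; the choice below is an assumption):

```python
import numpy as np
from scipy.stats import kurtosis

def compress_feature_series(series):
    """Collapse a per-second feature time series into the six summary
    statistics used in this study: Avg, Std, P95, Var, Med, Kur."""
    s = np.asarray(series, dtype=float)
    return {
        "Avg": s.mean(),
        "Std": s.std(ddof=1),
        "P95": np.percentile(s, 95),
        "Var": s.var(ddof=1),
        "Med": np.median(s),
        "Kur": kurtosis(s, fisher=False),  # MATLAB-style (non-excess) kurtosis
    }
```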
Table 1Extracted features with corresponding equations and definitions
| Feature | Definition | Equation |
|---|---|---|
| Approximate Entropy (AE) | AE assesses the likelihood that similar patterns within the data remain consistent when additional data points are included. A lower AE suggests more regular or predictable data, while a higher AE indicates greater complexity or irregularity | AE(m,r)=limN→∞Θm(r)−Θm+1(r), θ is the Heaviside step function and can be defined as Θm(r)=∑iln[Cim(r)]N−m+1, where r denotes a predetermined tolerance, N is the number of data points in the time series, and m is the dimension46 |
| Correlation Dimension (CD) | CD measures self-similarity, with higher values indicating a high degree of complexity and less similarity | CD=limM→∞2∑i=1M−k∑j=i+kMΘ(l|Xi−Xj|)M2, where θ(x) is the Heaviside step function, Xi and Xj are position vectors on the attractor, l is the distance under consideration, k is the summation offset, and M is the number of reconstructed vectors from x(n)46 |
| Detrended Fluctuation Analysis (DFA) | DFA measures power scaling observed through R/S analysis | DFA(n)=∑k=1N[y(k)−yn(k)]2N, where N is the length, yn(k) is the local trend, and y(k) is defined as y(k)=∑i=1k[x(i)−x¯], with x(i) as the inter-beat interval and x as its average47 |
| Energy (En) | En represents the system capacity for performing work48 | En=∑n=0N−1|x(n)|2 |
| Higuchi Fractal Dimension (H) | H estimates the fractal dimension of a time series49 | H=ln(L(k))ln(1k), where k is the number of composed sub-series, and L(k) is the average curve size |
| Hurst Exponent (EH) | EH quantifies how chaotic or unpredictable a time series is | Kq(τ)~(τν)qEH(q), with Kq(τ)=|X(t+τ)−X(t)|q|X(t)|q, where q is the order moments of the distribution increments, ν is the time resolution, τ is the incorporation time delay of the attractor, and t is the period of a given time series signal X(t)50 |
| Katz Fractal Dimension (K) | K estimates the fractal dimensions through a waveform analysis of a time series50 | K=log(n)log(n)+log(maxn((n−1)2+(x(n)−x(1))2)∑n=2N1+(x(n−1)−x(n))2), |
| Lyapunov Exponent (ELy) | ELy evaluates the system’s predictability and sensitivity to change | ELy(x0)=limn→∞∑k=1nln|f′(xk−1)|n, where f′ is the derivative of f51 |
| Logarithmic Entropy (LE) | LE quantifies the average amount of information (in bits) needed to represent each event in the probability distribution. Higher logarithmic entropy values indicate greater unpredictability or randomness, while lower values suggest more certainty or order48 | LE=∑n=1Nlog2|x(n)|2 |
| Shannon Entropy (SE) | SE are measured in bits when the base-2 logarithm (log2) is used, quantifying the average bits required to represent each outcome in a probability distribution. Higher entropy values indicate greater uncertainty, unpredictability, or randomness, while lower values suggest more order or certainty48 | SE=−∑n=1N|x(n)|2log2|x(n)|2 |
| Sample Entropy (SampEn) | where m is the length of the vector, r is the tolerance, N is the total size of the vector, and n is a vector portion that is being analysed52 | SampEn(m,r,N)=−log(nm+1nm) |
| Fuzzy Entropy (FuzzEn) | where r is the resolution and m is the dimension53 | FuzzEn(x,m,r,ne,d)=−ln(ψm+1,d(ne,r)ψm,d(ne,r)) where ψm,d (ne,r) is defined as ψm,d(ne,r)=∑Λ=1N−md∑λ=1,λ≠ΛN−mdexp(ΔΛ,λne)(N−md)(N−md−1) |
| Kolmogorov Entropy (K2En) | where Cd(ε) is the correlation integral54 | K2En=limε→0limd→∞lnCd(ε)Cd+1(ε)τ |
| Permutation Entropy (PermEn) | where P is the probability distribution for each symbol55 | PermEn(m)=−∑v=1kPvlnPv |
| Conditional Entropy (CondEn) | where E is the Shannon Entropy, and L is the dimension56 | CondEn(LL−1)=E(L)−E(L−1) |
| Distribution Entropy (DistEn) | where p is the probability of each bin57 | DistEn=−∑t=1Mptlog2(pt)log2(M) |
| Dispersion Entropy (DispEn) | where πv0v1…vm−1 is the dispersion pattern and cm is the potential dispersion patterns58 | DispEn=−∑n=1cmp(πv0v1…vm−1)ln(p(πv0v1…vm−1)) where p(πv0v1…vm−1) can be defined as p(πv0v1…vm−1)=#{i|i≤N−(m−1)d,Zim,c}N−(m−1) |
| Spectral Entropy (SpecEn) | where S(f) is the power spectrum and fn is the upper limit Frequency59 | SpecEn=−∑f=0fnS(f)logS(f) |
| Symbolic Dynamic Entropy (SyDyEn) | where p(l) is the probability of occurrence and c(l) is the frequency of occurrence60 | SyDyEn=−∑l=14mp(l)logp(l) and p(l) is defined as p(l)=c(l)N−m+1 |
| Increment Entropy (IncrEn) | where p(wn) is the frequency of each unique word (wn), q is the quantifying precision, and m is the dimension61 | IncrEn=−∑n=1(2q+1)mp(wn)logp(wn)m−1 and p(wn) is defined as p(wn)=q(wn)N−m |
| Cosine Similarity Entropy (CoSiEn) | where B(ε)(m)(rCSE) is the global probability of occurrences of similar patterns and rCSE is the tolerance62 | CoSiEn=−[B(ε)(m)(rCSE)log2B(ε)(m)(rCSE) +(1−B(ε)(m)(rCSE)) log2(1−B(ε)(m)(rCSE))] |
| Phase Entropy (PhasEn) | where the p(i) is the probability distribution and k is the maximum sector of i63 | PhasEn=−∑i=1kp(i)logp(i)logk |
| Slope Entropy (SlopEn) | SlopEn is defined as the computation of a sub-sequence of symbols where d is the difference between two consecutive samples, and γ and δ are thresholds64 | {d>γ→2d≤γ&d>δ→1|d|≤δ→0d<−δ&d>−γ→−1d<−γ→−2 |
| Bubble Entropy (BubbEn) | where Hswapsm is the conditional Rényi entropy65 | BubbEn=Hswapsm+1−Hswapsmlogm+1m−1 |
| Gridded Distribution Entropy (GridEn) | where j is the jth grid, the pj is the probability of each grid, b is the number of points in the jth grid, and N is the length of the RR intervals66 | GridEn=−∑j=1n2pjlogpj where pj is defined as pj=bN−m |
| Entropy of Entropy (EnofEn) | where l is the level index, yj(τ) is the sequence, and N/τ is the number of representative states for each original time series67 | EnofEn=−∑l=1s2pllnpl and pl is defined as pl=(number of yj(τ) in level l)/(N/τ) |
| Attention Entropy (AttnEn) | where ai,j are the attention weight probabilities between the positions i and j and a′i,j is the averaged weights68 | AttnEn=−∑j=0d2ai,jlogai,j and ai,j is defined as ai,j=ea′i,j∑jea′i,j |
At the end of the ECG feature extraction, we obtained a total of 1,944 non-linear features: 27 non-linear features × 6 bands (ECG full-band + sub-bands) × 6 data compressors × 2 positions.
Statistical analysis
To optimize the feature selection process, we performed 1,944 statistical analyses with the Kruskal-Wallis test using the MATLAB® Statistics and Machine Learning Toolbox, comparing the distribution of each feature across the risk groups. Only features showing significant differences (p < 0.05) between class distributions were selected, and the others were discarded. For the multiclass comparisons, we applied the Bonferroni correction. Approximately 3.1% of the statistical analyses showed significant differences (60 out of 1,944); a sketch of this screening step is given after Table 2. Additionally, we combined the selected ECG features with the medical features (defined in Section 2.2), initially provided by health professionals, for use in the ML discrimination stage. Table 2 shows the number of features per origin after filtering.
Table 2Number of features per origin
| Origin | # of features |
|---|---|
| Non-linear features | |
| ECG_D | 39 |
| ECG_UP | 21 |
| Medical features | |
| Heart rate | 9 |
| Anthropometry | 4 |
| Blood glucose lipid profile | 3 |
| Heart rate variability | 26 |
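A minimal Python sketch of the screening step described above follows; we assume here that the Bonferroni correction is applied across the family of feature-wise tests, since the exact correction scope is not detailed in the text:

```python
import numpy as np
from scipy.stats import kruskal

def screen_features(X, y, alpha=0.05):
    """Keep feature columns whose distributions differ significantly
    across the risk classes (Kruskal-Wallis test)."""
    classes = np.unique(y)
    threshold = alpha / X.shape[1]  # assumed Bonferroni correction scope
    selected = []
    for j in range(X.shape[1]):
        groups = [X[y == c, j] for c in classes]
        _, p_value = kruskal(*groups)
        if p_value < threshold:
            selected.append(j)
    return selected
```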
Classification evaluation metrics
To train and test the ML models using a nested leave-one-out cross-validation procedure, we utilized data derived from three sources: ECG-only, medical features-only, and a combination of both as a multimodal model. Cross-validation techniques allow the entire dataset to be used in classification while preventing data leakage between training and testing sets. This method is beneficial for deriving classification conclusions from small datasets.69
For all models, data were pre-filtered using the analysis of variance F-value feature selector, which ranks features by their F-statistic. This approach maximizes model performance by iteratively feeding the model with different feature combinations to identify the optimal set for each discriminative task (Low vs. Moderate, Low vs. High, Moderate vs. High, and All vs. All); a sketch of this selection loop follows the list below. Specifically:
For ECG-only models, the best combination of features ranged from 1 to 60.
For medical features-only models, the best combination of features ranged from 1 to 42.
For multimodal models, the best combination of features ranged from 1 to 102.
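A sketch of this selection loop with Scikit-learn is given below; the stand-in classifier and the grid bounds are illustrative, not the exact configuration used:

```python
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

def make_candidate(k):
    """Pipeline that keeps the k features with the highest ANOVA F-values,
    then fits a classifier (LogisticRegression as an illustrative stand-in)."""
    return Pipeline([
        ("select", SelectKBest(score_func=f_classif, k=k)),
        ("clf", LogisticRegression(max_iter=1000)),
    ])

# Sweeping k over the available feature budget (e.g., 1..102 for the
# multimodal model) produces the candidate feature subsets to evaluate.
candidates = {k: make_candidate(k) for k in range(1, 103)}
```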
This approach resulted in three optimized models, each responsible for one of the four discrimination tasks previously identified. These models were selected from a pool of 19 pre-designed Scikit-learn ML classifiers,70 as shown in Table 3, with hyperparameter tuning performed within nested leave-one-out cross-validation (LOOCV) due to the constraints of our dataset size. Given the sample size, we therefore relied on rigorous internal validation, using nested LOOCV (a strategy widely used in the literature for small-sample studies) with all model selection and tuning confined to the training folds,71–75 and we report performance aggregated across folds along with calibration and sensitivity analyses. While this does not replace external validation, it provides an unbiased estimate of out-of-sample performance.76 The selection of the best results was based on their AUC performance for distinguishing between comparison groups; a sketch of the nested procedure is given after Table 3.
Table 319 Scikit-learn machine learning classifiers configuration
| Classifier – Scikit-learn class (Abbreviation) | Hyperparameters |
|---|---|
| AdaBoostClassifier (AdaBoost) | n_estimators = 50, learning_rate = 1.0, algorithm = ‘SAMME.R’ + Default |
| BaggingClassifier (BaggC) | n_estimators = 10, max_samples = 1.0, bootstrap = true + Default |
| DecisionTreeClassifier (DeTreeC) | max_depth = 5, criterion = ‘gini’, splitter = ‘best’, min_samples_split = 2 + Default |
| ExtraTreesClassifier (ExTreeC) | n_estimators = 300, criterion = ‘gini’, max_features = ‘auto’, bootstrap=false + Default |
| GaussianNB (GauNB) | priors = none, var_smoothing = 1e-9, store_covariance = false + Default |
| GaussianProcess Classifier (GauPro) | 1.0 * RBF (1.0), optimizer = ‘fmin_l_bfgs_b’, max_iter_predict = 100, copy_X_train = true + Default |
| GradientBoosting Classifier (GradBoost) | loss = ‘log_loss’, learning_rate = 0.1, n_estimators = 100 + Default |
| KNearestNeighbors Classifier (KNN) | n_neighbors = 5, weights = ‘uniform’, algorithm = ‘auto’ + Default |
| LinearDiscriminant Analysis (LinDis) | solver = ‘svd’, shrinkage = none, priors = none + Default |
| LinearSVC (LinSVC) | penalty = ‘l2’, loss = ‘squared_hinge’, dual = True + Default |
| LogisticRegression (LogReg) | solver = “lbfgs”, penalty = ‘l2’, C = 1.0, max_iter = 100 + Default |
| LogisticRegressionCV (LogRegCV) | cv = 3, penalty = ‘l2’, solver = ‘lbfgs’, max_iter = 100 + Default |
| MLPClassifier (MLP) | alpha = 1, max_iter = 1,000, hidden_layer_sizes = 100, activation = ‘relu’, solver = ‘adam’ + Default |
| OneVsRestClassifier (OvsR) | estimator = LinearSVC(random_state = 0), n_jobs = none, verbose = false + Default |
| QuadraticDiscriminantAnalysis (QuadDis) | reg_param = 0.0, priors = none, store_covariance = false + Default |
| RandomForest Classifier (RF) | max_depth = 5, n_estimators = 300, criterion = ‘gini’, bootstrap = true, min_samples_split = 2 + Default |
| SGDClassifier (SGD) | max_iter = 100, tol = 0.001, penalty = ‘l2’, loss = ‘hinge’, alpha = 0.0001 + Default |
| SGDClassifierMod (SGDCMod) | penalty = ‘l2’, loss = ‘hinge’, alpha = 0.0001 + Default |
| Support-vector Machines (SVC) | γ = “auto”, kernel = ‘Radial Basis Function’, C = 1.0, probability = false + Default |
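The nested LOOCV procedure described above can be sketched as follows; the grid contents and inner scoring metric are assumptions, since AUC cannot be computed on a single held-out sample in the inner loop:

```python
from sklearn.model_selection import GridSearchCV, LeaveOneOut, cross_val_predict

def nested_loocv_predictions(pipeline, param_grid, X, y):
    """Outer LOOCV yields one unbiased prediction per participant; the inner
    GridSearchCV tunes hyperparameters (e.g., k features) on training folds only."""
    inner = GridSearchCV(pipeline, param_grid, cv=LeaveOneOut(), scoring="accuracy")
    return cross_val_predict(inner, X, y, cv=LeaveOneOut())

# Example usage with the candidate pipeline sketched earlier:
# y_pred = nested_loocv_predictions(make_candidate(10),
#                                   {"select__k": range(1, 103)}, X, y)
```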
AUC quantifies a binary classifier’s ability to distinguish between positive and negative instances across all decision thresholds by comparing the true positive rate (TPR) to the false positive rate (FPR). AUC ranges from 0 to 1, where 1 indicates a perfect classifier and 0.5 indicates performance equivalent to random guessing. Because it summarizes performance across thresholds, AUC provides a single, threshold-independent measure that is especially useful for model comparison under class imbalance.77 In addition to AUC, we report the following metrics: Accuracy (Acc), Precision (Prec; positive predictive value), Recall (Rec; sensitivity), Specificity (Spec), F1 − Score, and negative predictive value (NPV). Let TP, TN, FP, and FN denote true positives, true negatives, false positives, and false negatives, respectively.
Acc is the proportion of correctly classified instances among all instances.78
$$Acc = \frac{TP + TN}{TP + TN + FP + FN} \times 100\%,$$
Prec, the proportion of predicted positives that are truly positive,79 can be defined as
$$Prec = \frac{TP}{TP + FP} \times 100\%,$$
NPV shows the proportion of predicted negatives that are truly negative and can be defined as80
$$NPV = \frac{TN}{TN + FN} \times 100\%,$$
Rec represents the proportion of actual positives that are correctly identified and is defined as79
$$Rec = \frac{TP}{TP + FN} \times 100\%,$$
Spec is the proportion of actual negatives that are correctly identified.81
$$Spec = \frac{TN}{TN + FP} \times 100\%.$$
F1 − Score is the harmonic mean of Prec and Rec, balancing both types of error,82 and it is defined as
$$F1\text{-}Score = \frac{2 \times Prec \times Rec}{Prec + Rec} \times 100\%.$$
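These quantities follow directly from the binary confusion matrix, as in the sketch below (function name and structure are ours):

```python
from sklearn.metrics import confusion_matrix

def binary_report(y_true, y_pred):
    """Compute the reported metrics (in %) from a binary confusion matrix."""
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    prec = tp / (tp + fp)
    rec = tp / (tp + fn)
    return {
        "Acc": 100 * (tp + tn) / (tp + tn + fp + fn),
        "Prec": 100 * prec,
        "NPV": 100 * tn / (tn + fn),
        "Rec": 100 * rec,
        "Spec": 100 * tn / (tn + fp),
        "F1-Score": 100 * 2 * prec * rec / (prec + rec),
    }
```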
Results
We divided these results into four subsections: (1) the performance of the ECG-only models, trained and tested exclusively with ECG features; (2) the performance of the medical features-only models, trained and tested on medical features; (3) the performance of the multimodal models, trained with both ECG and medical features; and (4) a comparison of the performance of the three model types.
ECG-only models’ performance
Table 4 shows the classification report for the ECG-only models. Models trained exclusively on ECG-derived features achieved strong discrimination across all comparison groups. For Low vs. Moderate, the best model using 100% of features from the ECG_D position reached an Acc of 80.56%, Rec of 86.36%, Prec of 82.61%, Spec of 71.43%, F1 − Score of 84.44%, NPV of 76.92%, and an AUC of 0.7890. For Low vs. High, the best model, trained with 83.3% of features from ECG_D and the remainder from ECG_UP, attained an Acc of 97.44%, Rec of 100%, Prec of 95.65%, Spec of 94.12%, F1 − Score of 97.78%, NPV of 100%, and an AUC of 0.9706. For Moderate vs. High, performance included an Acc of 93.55%, Rec of 85.71%, Prec of 100%, Spec of 100%, F1 − Score of 92.31%, NPV of 89.47%, and an AUC of 0.9286, with 65.2% of discriminative information coming from ECG_D. In the All vs. All comparison, the model achieved an Acc of 86.79%, Rec of 86.79%, Prec of 86.92%, Spec of 93.06%, F1 − Score of 86.86%, NPV of 93.47%, and an AUC of 0.9346. Collectively, these results indicate that the ECG_D position contributed the most informative features for maximizing discrimination, with ECG_UP playing a secondary role, particularly in the Moderate vs. High comparison.
Table 4Best classification results for the electrocardiogram (ECG)-only models per comparison group
| Comparison group | # of features | Feature combination (category percentage) | Classifier | Acc (%) | Rec (%) | Prec (%) | Spec (%) | F1 − Score (%) | Negative predictive value (%) | Area under the curve |
|---|---|---|---|---|---|---|---|---|---|---|
| Low vs. Moderate | 28 | Electrocardiogram_D (100%) & Electrocardiogram_UP (0%) | DeTreeC | 80.56 | 86.36 | 82.61 | 71.43 | 84.44 | 76.92 | 0.7890 |
| Low vs. High | 18 | Electrocardiogram_D (≈83.3%) & Electrocardiogram_UP (≈16.7%) | SGD | 97.44 | 100 | 95.65 | 94.12 | 97.78 | 100 | 0.9706 |
| Moderate vs. High | 23 | Electrocardiogram_D (≈65.2%) & Electrocardiogram_UP (≈34.8%) | MLP | 93.55 | 85.71 | 100 | 100 | 92.31 | 89.47 | 0.9286 |
| All vs. All | 47 | Electrocardiogram_D (≈68.1%) & Electrocardiogram_UP (≈31.9%) | LinSVC | 86.79 | 86.79 | 86.92 | 93.06 | 86.86 | 93.47 | 0.9346 |
Medical features models’ performance
Table 5 illustrates the classification results for the medical features-based models. When trained exclusively on medical parameters (HR, anthropometry, BGLP, and HRV), models showed competitive but generally lower performance than ECG-based models, except in one case. For Low vs. Moderate, Acc was 83.33%, Rec 86.36%, Prec 86.36%, Spec 78.57%, F1 − Score 86.36%, NPV 78.57%, and AUC 0.8247. For Low vs. High, Acc reached 87.18%, with Rec 86.36%, Prec 90.48%, Spec 88.24%, F1 − Score 88.37%, NPV 83.33%, and AUC 0.8730. For Moderate vs. High, Acc was 80.65%, Rec 92.86%, Prec 72.22%, Spec 70.59%, F1 − Score 81.25%, NPV 92.31%, and AUC 0.8172. In the All vs. All comparison, Acc was 60.38%, Rec 60.38%, Prec 60.03%, Spec 77.66%, F1 − Score 60.20%, NPV 80.17%, and AUC 0.6626. HRV features were the most influential among the medical parameters, contributing 57.1%, 61.5%, 79.3%, and 80% of the discriminative information for Low vs. Moderate, Low vs. High, Moderate vs. High, and All vs. All, respectively. Comparing ECG-based and medical features-only models, the medical features-only model was 3.45% more accurate for Low vs. Moderate but was less accurate by 10.78% to 26.42% for Low vs. High, Moderate vs. High, and All vs. All, indicating stronger ECG-driven discrimination, particularly in binary tasks involving the High class and in multiclass settings.
Table 5Best classification results for the medical features-only models per comparison group
| Comparison group | # of features | Feature combination (category percentage) | Classifier | Acc (%) | Rec (%) | Prec (%) | Spec (%) | F1 − Score (%) | Negative predictive value (%) | Area under the curve |
|---|---|---|---|---|---|---|---|---|---|---|
| Low vs. Moderate | 7 | Heart rate (≈28.6%) & Anthropometry (0%) & Blood glucose lipid profile (≈14.3%) & Heart rate variability (≈57.1%) | QuadDis | 83.33 | 86.36 | 86.36 | 78.57 | 86.36 | 78.57 | 0.8247 |
| Low vs. High | 26 | Heart rate (≈27%) & Anthropometry (≈11.5%) & Blood glucose lipid profile (0%) & Heart rate variability (≈61.5%) | QuadDis | 87.18 | 86.36 | 90.48 | 88.24 | 88.37 | 83.33 | 0.8730 |
| Moderate vs. High | 29 | Heart rate (0%) & Anthropometry (≈13.8%) & Blood glucose lipid profile (≈6.9%) & Heart rate variability (≈79.3%) | LinDis | 80.65 | 92.86 | 72.22 | 70.59 | 81.25 | 92.31 | 0.8172 |
| All vs. All | 5 | Heart rate (0%) & Anthropometry (0%) & Blood glucose lipid profile (20%) & Heart rate variability (80%) | SGDCMod | 60.38 | 60.38 | 60.03 | 77.66 | 60.20 | 80.17 | 0.6626 |
Multimodal models’ performance
Table 6 corresponds to the best classification results for the multimodal models (ECG features + medical features). Combining ECG and medical features yielded the best overall results in selected comparisons. For Low vs. Moderate, the multimodal model achieved an Acc of 88.89%, Rec 90.91%, Prec 90.91%, Spec 85.71%, F1 − Score 90.91%, NPV 85.71%, and AUC 0.8831, improving accuracy by 10.34% over the ECG-only model. For Low vs. High, the combined model matched the ECG-only performance with an Acc of 97.44%, Rec 100%, Prec 95.65%, Spec 94.12%, F1 − Score 97.78%, NPV 100%, and AUC 0.9706. For Moderate vs. High, Acc was 93.55%, Rec 100%, Prec 87.50%, Spec 88.24%, F1 − Score 93.33%, NPV 100%, and AUC 0.9412. For All vs. All, performance was comparable to ECG-only: Acc 86.79%, Rec 86.79%, Prec 86.92%, Spec 93.06%, F1 − Score 86.86%, NPV 93.47%, and AUC 0.9346. These patterns reflect that feature selection often discarded medical variables in the Low vs. High and All vs. All settings, effectively aligning the combined model’s behavior with ECG-only models in those comparisons.
Table 6Best classification results for the multimodal models per comparison group
| Comparison group | # of features | Feature combination (category percentage) | Classifier | Acc (%) | Rec (%) | Prec (%) | Spec (%) | F1 − Score (%) | Negative predictive value (%) | Area under the curve |
|---|---|---|---|---|---|---|---|---|---|---|
| Low vs. Moderate | 33 | Electrocardiogram_D (0%) & Electrocardiogram_UP (≈51.5%) & Heart rate (0%) & Anthropometry (≈3%) & Blood glucose lipid profile (≈3%) & Heart rate variability (≈42.5%) | LinDis | 88.89 | 90.91 | 90.91 | 85.71 | 90.91 | 85.71 | 0.8831 |
| Low vs. High | 18 | Electrocardiogram_D (≈83.3%) & Electrocardiogram_UP (≈16.7%) & Heart rate (0%) & Anthropometry (0%) & Blood glucose lipid profile (0%) & Heart rate variability (0%) | SGD | 97.44 | 100 | 95.65 | 94.12 | 97.78 | 100 | 0.9706 |
| Moderate vs. High | 31 | Electrocardiogram_D (≈83.3%) & Electrocardiogram_UP (≈16.7%) & Heart rate (0%) & Anthropometry (0%) & Blood glucose lipid profile (0%) & Heart rate variability (0%) | LinDis | 93.55 | 100 | 87.50 | 88.24 | 93.33 | 100 | 0.9412 |
| All vs. All | 47 | Electrocardiogram_D (≈68.1%) & Electrocardiogram_UP (≈31.9%) & Heart rate (0%) & Anthropometry (0%) & Blood glucose lipid profile (0%) & Heart rate variability (0%) | LinSVC | 86.79 | 86.79 | 86.92 | 93.06 | 86.86 | 93.47 | 0.9346 |
Comparison between different types of models’ performance
Across models, Figure 5 summarizes AUC behavior between the approaches presented in Tables 4–6 for each comparison group: adding ECG to medical features improved AUC in every group (by 7.1% to 41.1%), while adding medical features to ECG improved AUC for Low vs. Moderate and Moderate vs. High but not for Low vs. High or All vs. All, owing to the feature selection process. Confusion matrices in Figure 6 show a small number of misclassifications: in Low vs. Moderate, two participants per class were misclassified; in Low vs. High, one High participant was predicted as Low; in Moderate vs. High, two High participants were misclassified; and in All vs. All, most errors occurred between Low and Moderate. Receiver-operating characteristic curves (Fig. 7) showed strong to near-perfect discrimination, with AUCs of 0.8831, 0.9706, 0.9412, and 0.9346 for Low vs. Moderate, Low vs. High, Moderate vs. High, and All vs. All, respectively.
Discussion
The results demonstrate that ECG-derived information, particularly from the ECG_D position, is the dominant discriminative signal. Compared with medical features-only models, the ECG-only model achieved higher Acc in Low vs. High (97.44% vs. 87.18%; AUC +11.2% relative), Moderate vs. High (93.55% vs. 80.65%; AUC +13.6%), and All vs. All (86.79% vs. 60.38%; AUC +41.1%), with the sole exception of Low vs. Moderate, where the medical features-only model slightly surpassed the ECG-only model in Acc (83.33% vs. 80.56%; AUC +4.5% relative).

Multimodal integration delivered its largest gains for adjacent strata, consistent with the idea that adjacent categories benefit from complementary features. For Low vs. Moderate, it improved Acc by 10.3% relative over ECG-only (88.89% vs. 80.56%) and raised AUC by 11.9% vs. ECG-only and 7.1% vs. medical features-only. For Moderate vs. High, AUC increased by 1.4% vs. ECG-only and 15.2% vs. medical features-only, with Acc unchanged. In Low vs. High and All vs. All, multimodal performance matched ECG-only (0% change in Acc and AUC) because feature selection discarded the medical variables, although AUC still improved substantially over the medical features-only baselines (+11.2% and +41.1%, respectively). This indicates either that ECG features alone captured the dominant cardiovascular risk signal or that the selection algorithm favored parsimony over potential complementarity, and it highlights an important methodological consideration: aggressive feature selection can optimize headline performance while inadvertently suppressing clinically relevant complementarity. Alternative selection strategies or model families designed to leverage multimodal complementarity may further improve performance in settings where ECG already dominates.

Error patterns align with clinical plausibility. Misclassifications most frequently occurred between Low and Moderate, reflecting the expected continuum in risk phenotypes. High Spec and Prec in the High class suggest a favorable profile for ruling in higher risk, while strong Rec and NPV in several settings support safe rule-out. Nonetheless, even a small number of misclassifications in the High class warrants attention; model calibration and threshold selection should consider clinical consequences and decision costs.
Table 6 indicates that LinDis, LinSVC, and SGD classifiers achieved the best performance. All three are linear classifiers: LinDis (linear discriminant, LDA-style) fits class-conditional densities and applies Bayes’ rule to produce a linear decision boundary; LinSVC is a support vector machine with a linear kernel; and SGDClassifier with loss = ‘hinge’ optimizes the linear SVM (hinge) objective using stochastic gradient descent, yielding a linear large-margin classifier trained via SGD. The superiority of these linear models likely reflects a favorable bias-variance trade-off. On small or noisy datasets, non-linear kernels (e.g., polynomial or radial basis function) can capture spurious patterns unless heavily regularized and carefully tuned, which degrades generalization. Linear models impose a simpler hypothesis class that emphasizes dominant trends over noise and are straightforward to regularize (e.g., via C/alpha or early stopping), resulting in more stable out-of-sample performance.83 The similar results from LinSVC and SGD further suggest that the margin-based linear decision boundary, rather than solver-specific nuances, drives the gains.83
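To make this concrete, the three best performers can be instantiated as follows, using the Table 3 hyperparameters; all yield linear decision boundaries:

```python
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.linear_model import SGDClassifier
from sklearn.svm import LinearSVC

lda = LinearDiscriminantAnalysis(solver="svd")            # Bayes rule on Gaussian fits
lin_svc = LinearSVC(penalty="l2", loss="squared_hinge")   # batch large-margin solver
sgd_svm = SGDClassifier(loss="hinge", alpha=1e-4)         # SVM objective trained by SGD
```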
It should be noted that, although we report in Tables 4–6 the best-performing model for each classification method and comparison based on AUC, we have included Figure S1 to provide a comprehensive overview of the AUC distributions across all combinations of feature sets and comparisons per classifier. These boxplots clearly illustrate that (1) performance differences between classifiers are relatively minor, (2) the results are consistent across models, indicating no evidence of cherry-picking, and (3) the choice of classifier does not significantly affect overall performance, thereby reinforcing the robustness and reliability of our findings.
In comparison with the state-of-the-art (Table 7),21–31 prior works predominantly reported binary tasks, with 66.67% using unbalanced datasets and 41.67% using cross-validation. Among the 11 studies, five used an equivalent Low vs. High risk formulation.22,23,25,27,28 Relative to these, our best overall results (Table 6) showed Acc gains of 4.43% to 32.03%, though direct comparison must be interpreted cautiously due to differences in databases and validation protocols. External comparison with published studies is thus encouraging but must be interpreted with caution. The reported Acc improvements over prior work in Low vs. High tasks suggest that our approach, particularly with ECG_D features and, where beneficial, multimodal integration, is competitive with the state-of-the-art. However, differences in datasets, class balance, and validation protocols can substantially affect headline metrics. Robust external validation, harmonized evaluation protocols, and transparent reporting will be essential to confirm generalizability.
Table 7State-of-the-art literature report on cardiovascular risk detection systems based on the Framingham risk scale, with information about the database, comparison groups, features extracted, classifiers used, limitations, validation, accuracy, and area under the curve.
| Ref | Year | Source | Features extracted | Comparison group (number of participants) | Classifier | Limitations | Validation | Accuracy | Area under the curve |
|---|---|---|---|---|---|---|---|---|---|
| Unnikrishnan et al.21 | 2016 | Medical Features | Age, body mass index, current smoker, gender, total cholesterol, systolic blood pressure, high density lipoprotein cholesterol, diabetes, medication for hypertension, retinopathy, diastolic blood pressure | No-Cardiovascular Diseases (382) vs. Cardiovascular Diseases (128) | Support Vector Machine | Limited assessment of risk score, just Cardiovascular Diseases vs. No-Cardiovascular Diseases. Unbalanced dataset. Small dataset | Hold-out | 82.35% | 0.71 |
| Dogan et al.22 | 2018 | Medical Features | Age, gender, total cholesterol, high density lipoprotein cholesterol, systolic blood pressure, diastolic blood pressure, hemoglobin A1c, and smoking status | Low-Risk (504) vs. High-Risk (20) | Ensemble of Random Forest | Limited assessment of risk score, just Low vs. High. Unbalanced dataset | Hold-out | 93.01% | – |
| Quesada et al.23 | 2019 | Medical Features | Age, sex, total cholesterol, systolic blood pressure, tobacco use, diastolic blood pressure, high density lipoprotein cholesterol, and the presence of diabetes | Low-Risk (5,837) vs. High-Risk (5,837) | Random Forest | Limited assessment of risk score, just Low vs. High | Hold-out | 80.9% | 0.6333 |
| Alaa et al.24 | 2019 | Medical Features | Gender, age, systolic blood pressure, treatment for hypertension, smoking status, history of diabetes, and body mass index | 5-year risk of Cardiovascular Diseases (4,801) vs. No-5-year risk of Cardiovascular Diseases (4,801) | AutoPrognosis | Limited assessment of risk score, just Cardiovascular Diseases vs. No Cardiovascular Diseases | Cross-validation | – | 0.774 |
| Chen et al.25 | 2020 | Medical Features | Age, sex, systolic blood pressure, diastolic blood pressure, total cholesterol, high density lipoprotein cholesterol, diabetes status, smoking status | Low-Risk (1,036) vs. High-Risk (983) | Support Vector Machine | Limited assessment of risk score, just Low vs. High | Cross-validation | 85.11% | – |
| Navarini et al.26 | 2020 | Medical Features | Age, sex, systolic blood pressure, total cholesterol, smoking status, and hypertension treatment | Cardiovascular Diseases (18) vs. No- Cardiovascular Diseases (115) | Random Forest | Limited assessment of risk score, just Cardiovascular Diseases vs. No Cardiovascular Diseases. Unbalanced dataset. Small dataset | Cross-validation | 65.41% | 0.7297 |
| Jamthikar et al.27 | 2020 | Medical Features | Age, Sex, glycated hemoglobin, low density lipoprotein cholesterol, high density lipoprotein cholesterol, total cholesterol, triglyceride, systolic blood pressure, diastolic blood, pressure, hypertension, estimated glomerular filtration rate, and family history | Low-Risk (22) vs. High-Risk (180) | Support Vector Machine | Limited assessment of risk score, just Low vs. High. Unbalanced dataset. Small dataset | Cross-validation | 65.41% | 0.67 |
| Sajeev et al.28 | 2021 | Medical Features | Age, sex, total cholesterol, High density lipoprotein cholesterol, systolic blood pressure, hypertension medication, diabetes, and smoking status | Low-Risk (23,152) vs. High-Risk (23,152) | Logistic Regression | Limited assessment of risk score, just Low vs. High | Cross-validation | – | 0.852 |
| Yang et al.29 | 2021 | Medical Features | 49 Medical records’ features | Stroke (2,648) vs. No-Stroke (3,337) | XGBoost | Limited assessment of risk score, just Stroke vs. No-Stroke | Hold-out | 84.78% | 0.9220 |
| Cho et al.30 | 2021 | Medical Features | Age, sex, systolic blood pressure, total cholesterol, high density lipoprotein cholesterol, smoking status, history of diabetes, and antihypertensive medication use | 5-year risk of Cardiovascular Diseases (1,862) vs. No-5-year risk of Cardiovascular Diseases (50,327) | Neural Network | Limited assessment of risk score, just Cardiovascular Diseases vs. No-Cardiovascular Diseases. Unbalanced dataset | Hold-out | – | 0.751 |
| Chun et al.31 | 2021 | Medical Features | Age, smoking status, coronary heart disease, diabetes, blood pressure-lowering treatment, systolic blood pressure-untreated, and systolic blood pressure-treated | Stroke (532) vs. No-Stroke (6,185) | Gradient Boosted Trees and proportional hazards regression | Limited assessment of risk score, just Stroke vs. No-Stroke. Unbalanced dataset | Hold-out | 80% | 0.836 |
| Present study | 2025 | Electrocardiogram + Medical features | 102 features | Low (22) vs. Moderate (14) | LinDis | Small dataset. | Cross-validation | 88.89% | 0.8831 |
| | | | | Low (22) vs. High (17) | SGD | | | 97.44% | 0.9706 |
| | | | | Moderate (14) vs. High (17) | LinDis | | | 93.55% | 0.9412 |
| | | | | Low (22) vs. Moderate (14) vs. High (17) | LinSVC | | | 86.79% | 0.9346 |
Overall, the evidence supports ECG as the primary source of discriminative information for cardiovascular risk detection, with ECG_D emerging as the most informative position and ECG_UP contributing to specific pairwise comparisons. Medical parameters (especially HRV) add value when discriminating adjacent risk levels, and their utility can be amplified by careful feature selection and model design choices that preserve multimodal complementarity.
Study limitations
While the findings are promising, several limitations warrant caution. First, the dataset is relatively small, meaning that even a few misclassifications can significantly affect the reported metrics. Without confidence intervals, the true uncertainty may be greater than the point estimates suggest. Second, generalizability may be constrained by the specific acquisition protocol, electrode positions (ECG_D and ECG_UP), device characteristics, and preprocessing choices. These factors, along with the operational definitions of risk groups, may not transfer to other populations, settings, or hardware. Moreover, medical parameters (particularly HRV) are sensitive to transient influences such as autonomic state, medications, and recording conditions, introducing potential confounding if not fully controlled.
Methodological choices may also introduce optimism. The observation that the feature selection step frequently discarded medical variables for feeding the multimodal model suggests that complementary multimodal information may have been underutilized; alternative integration strategies might yield different results. Finally, we did not evaluate model calibration or clinical utility (e.g., decision-curve analysis), and no external or prospective validation was conducted.
Future directions
Future efforts should prioritize expanding the database to ensure the generalization of the results. A larger database would also enable a hold-out validation process for classification, which can provide a more straightforward evaluation of model performance on unseen data than cross-validation.
Future work should also prioritize large and more balanced multi-center cohorts, harmonized acquisition protocols, rigorously nested model selection with calibration assessment, alternative feature selection and integration strategies, and external validation to substantiate robustness and clinical applicability.
Conclusions
In a cohort of 53 patients, we extracted 27 non-linear ECG features from two positions and 42 physician-curated clinical features and, using nested LOOCV with analysis of variance F-value selection, trained models to discriminate Framingham risk strata. Across the Low vs. Moderate, Low vs. High, Moderate vs. High, and All vs. All tasks, the multimodal model matched or outperformed the ECG-only and medical features-only models, achieving 86–97% Acc with AUCs up to 0.97. ECG-derived non-linear features, especially from the ECG_D position, were the principal drivers of discrimination, while medical features provided complementary gains in adjacent-strata comparisons, indicating the proposed multimodal approach is a promising tool to support clinical triage.
Supporting information
Supplementary material for this article is available at https://doi.org/10.14218/ERHM.2025.00037 .
Table S1
Summary table with the subjects’ characteristics.
(DOCX)
Fig. S1
Area under the curve (AUC) Boxplots of every classifier performance per category per comparison group. (a) AUC Boxplots for the comparison Low vs. Moderate; (b) AUC Boxplots for the comparison Low vs. High; (c) AUC Boxplots for the comparison Moderate vs. High; (d) AUC Boxplots for the comparison All vs. All.
(TIF)
Declarations
Acknowledgement
This work was supported by National Funds from FCT - Fundação para a Ciência e a Tecnologia through project UIDB/50016/2020.
Ethical statement
The research was approved by the Institutional Research Ethics Committees (CAAE: 74256823.4.0000.5054 and 74256823.4.3001.5045). The ethical principles recommended by the Declaration of Helsinki (as revised in 2024) and Resolution 466/12 of the Brazilian National Health Council were followed. All patients provided consent beforehand.
Data sharing statement
The dataset used to support the findings of this study has been deposited in Mendeley Data (doi:10.17632/z8mrvy259n.1). The dataset is currently under embargo and will be publicly available on February 11, 2026.
Funding
No funding was received.
Conflict of interest
The authors have no conflicts of interest related to this publication.
Authors’ contributions
Conceptualization, drafting of manuscript, investigation, manuscript editing (PR, JALM, PMR), validation (CL, MB, ON, JALM, PMR), critical revision of the manuscript for important intellectual content (PR, CL, MB, ON, JALM, PMR), study supervision, and funding acquisition (JALM, PMR). All authors have approved the final version and publication of the manuscript.