Introduction
Metabolic dysfunction-associated fatty liver disease (MAFLD) has emerged as the predominant hepatic manifestation of systemic metabolic dysregulation, intricately linked to obesity and metabolic syndrome (MetS).1 It represents a critical and growing global public health challenge. The prevalence of MAFLD is rising at an alarming rate worldwide, a trajectory that closely parallels the concurrent epidemics of obesity, MetS, and type 2 diabetes mellitus (T2DM).2 Current projections estimate that by 2040, over 55% of the global adult population could be affected.3 This trend is particularly pronounced in China, where recent epidemiological studies report an adult prevalence as high as 44.39%,4 underscoring the urgency of addressing this condition within national and global health agendas.
The clinical significance of MAFLD extends far beyond the spectrum of progressive liver disease, including steatohepatitis, fibrosis, cirrhosis, and hepatocellular carcinoma. It is now recognized as a multisystem disorder that engages in a complex, bidirectional interplay with various metabolic aberrations. This synergy significantly amplifies the risk of severe extrahepatic complications.5 Compelling evidence indicates that the confluence of MAFLD with MetS and T2DM can escalate the risk of hepatocellular carcinoma by up to five-fold.6 While the recent advent of effective pharmacotherapy underscores the need for early detection, the current diagnostic paradigm remains suboptimal. Therefore, the timely diagnosis and effective monitoring of MAFLD are paramount not only for mitigating hepatic outcomes but also for the primary and secondary prevention of life-threatening cardiovascular and malignant diseases.
In response to this clinical imperative, the diagnostic paradigm for MAFLD is shifting from reliance on invasive liver biopsy toward the use of non-invasive tests (NITs). A range of NITs, which integrate imaging modalities, clinical parameters, and serum biomarkers into predictive models, are increasingly employed to assess disease severity and stratify patients according to their risk of liver-related events.7–9 However, while tools like the Enhanced Liver Fibrosis test have gained regulatory approval for prognostic staging,10 a substantial unmet clinical need remains. There is a critical shortage of validated, accessible, and accurate NITs for other essential clinical applications in MAFLD, including early detection in primary care, precise phenotypic differentiation, and guidance for personalized management strategies. This gap underscores the necessity for developing novel, cost-effective, and clinically integrative diagnostic approaches.
Concurrently, there is a resurgence of interest in tongue diagnosis, a cornerstone of traditional Chinese medicine (TCM), now augmented by modern technology.11 Recent studies have indicated that characteristics such as the microbial composition of tongue coating, tongue color, and tongue morphology are associated with systemic metabolism and circulatory disorders.12–14 Advances in digital imaging and computational analysis have modernized this practice, transforming it from a subjective art into an objective, quantifiable discipline. Standardized image acquisition and automated feature extraction techniques have minimized environmental bias and inter-observer variability. Consequently, objective tongue diagnostic platforms have shown promising utility in screening various hepatic conditions, including viral hepatitis, cirrhosis, and MAFLD, demonstrating enhanced diagnostic reproducibility.
Recent studies leveraging machine learning to analyze quantitative features from tongue images have reported encouraging accuracy in identifying MAFLD.15–17 Nonetheless, a significant translational chasm remains. Most existing computational models operate as “black boxes”, relying on high-dimensional optical data that are decoupled from the clinically interpretable visual features used by physicians. This lack of interpretability severely limits the practical integration of tongue diagnosis into routine metabolic assessment workflows and hinders clinician trust and adoption.
To bridge this gap, we hypothesize that a clinically intuitive framework, which directly incorporates visual tongue characteristics understood by practitioners, can enhance diagnostic utility. This study therefore proposes to develop and validate a clinician-oriented, multimodal fusion model. This model will uniquely integrate intuitively discernible visual tongue features with a panel of readily accessible clinical metabolic indicators. We aimed to create a tool that not only improves the accuracy and interpretability of MAFLD risk assessment but also serves to elucidate the connections between external clinical signs and the internal multisystem metabolic dysregulation that defines MAFLD. By doing so, this work sought to translate a traditional diagnostic method into a validated, modern tool for stratified hepatology care.
Methods
Study design
This prospective, observational, single-center cohort study was conducted in China and utilized data from the cohort “A prospective cohort study of real-world clinical diagnosis and treatment of MAFLD”, which is registered with the Chinese Clinical Trial Registry (https://www.chictr.org.cn; No. ChiCTR2200063127).
Ethical statement
The study protocol adhered strictly to the ethical principles of the Declaration of Helsinki and the relevant regulations of China’s “Ethical Review Measures for Biomedical Research Involving Humans.” It was approved by the Ethics Committee of the Hubei Provincial Hospital of Traditional Chinese Medicine, the lead institution (Approval No. HBZY2022-C08-01). All participants were thoroughly informed about the study’s purpose, procedures, potential risks, and benefits. Ample time was provided for consideration, and written informed consent was obtained from each subject prior to enrollment.
Study population
Inclusion and exclusion criteria
Participants were required to meet all the following criteria: (1) Aged between 18 and 75 years, regardless of gender; (2) Diagnosed with MAFLD based on FibroTouch® transient elastography, defined by evidence of hepatic steatosis: Controlled Attenuation Parameter ≥ 245 dB/m, and meeting at least one of the following three criteria, as per the International Expert Consensus on the New Definition of MAFLD (2020)18; (3) For the healthy control group: absence of MAFLD diagnostic components, such as overweight/obesity, T2DM, abnormal blood pressure, dyslipidemia (including triglycerides and total cholesterol), and insulin resistance; (4) Ability to understand and willingness to comply with the study protocol, and provision of voluntary written informed consent; (5) Ability to cooperate with tongue image acquisition, anthropometric measurements, questionnaire survey, and blood sample collection. Participants were excluded if they met any of the following conditions: (1) Liver-related conditions: confirmed diagnosis of viral hepatitis (Hepatitis B surface antigen positive or Hepatitis C virus antibody positive), autoimmune liver disease, drug-induced liver injury, genetic metabolic liver diseases (e.g., Wilson’s disease, alpha-1-antitrypsin deficiency), or other specific etiologies of chronic liver disease; (2) Excessive alcohol consumption or diagnosed decompensated hepatocellular carcinoma; (3) Severe comorbidities: presence of life-threatening or study compliance-affecting cardiovascular diseases (e.g., heart failure NYHA class III-IV, uncontrolled hypertension), respiratory diseases, renal failure [estimated glomerular filtration rate < 30 mL/min/1.73 m2], hematological diseases, or active malignancy (within 5 years); (4) Factors affecting tongue imaging: major oral diseases, such as oral ulcers, oral cancer, history of tongue surgery, or severe tongue markings distorting tongue morphology, as well as congenital tongue anomalies, such as geographic tongue or fissured 
tongue, which may interfere with image analysis; (5) Medication use: long-term use of medications known to affect hepatic fat metabolism or fibrosis (e.g., glucocorticoids, amiodarone, tamoxifen, specific antiviral drugs) within 3 months prior to enrollment; (6) Special populations: pregnant or lactating women; (7) Concurrent participation in another interventional clinical trial; or any other condition deemed by the investigators as unsuitable for study participation.
Participant grouping
Healthy control group (non-MAFLD): defined by Controlled Attenuation Parameter < 245 dB/m, normal liver function tests, and absence of MAFLD-related metabolic risk factors and relevant medical history.
MAFLD group and fibrosis stratification definition: all patients meeting MAFLD diagnostic criteria and hepatic fibrosis progression were stratified by Liver Stiffness Measurement (LSM) into: MAFLD-No Fibrosis (F0-F1): LSM < 7.3 kPa; MAFLD-Mild to Moderate Fibrosis (F2-F3): 7.3 kPa ≤ LSM < 12.4 kPa; MAFLD-Severe Fibrosis (F4): LSM ≥ 12.4 kPa. A flowchart detailing the participant inclusion process is depicted in Figure 1.
Data collection and statistical analysis
All data were collected by uniformly trained and certified research personnel following standard operating procedures to ensure consistency and accuracy. The methodology regarding the collection of tongue manifestation and clinical feature data was described in our earlier work.19
Clinical data analysis was conducted using Python (executed in PyCharm 2025.2; JetBrains, Prague, Czech Republic) and R (executed in RStudio). Key Python libraries included pandas (v2.1.4), numpy (v1.26.2), scipy (v1.11.4), and matplotlib (v3.8.2). Key R packages included gtsummary and dplyr. All indicators were converted to numerical variables, and missing and invalid values were uniformly set to zero without additional standardization or normalization to preserve the original numerical characteristics and clinical significance of the indicators.
The Kruskal-Wallis H test was used to verify the overall distribution differences of each metabolic indicator among different fibrosis grades of MAFLD, with a significance level of α = 0.05. Spearman rank correlation analysis was further used to quantify the association strength and trend direction between the indicators and fibrosis grades, defining a strong association as |r| ≥ 0.2 and P < 0.05, and a weak association as 0 < |r| < 0.2 and P < 0.05.
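As a minimal sketch of this analysis pipeline, using synthetic data in place of the study's indicators, the Kruskal-Wallis and Spearman steps can be run with scipy (the grade coding and effect sizes here are illustrative):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Hypothetical example: one metabolic indicator across four groups,
# coded 0 = healthy, 1 = F0-F1, 2 = F2-F3, 3 = F4.
grades = rng.integers(0, 4, size=200)
bmi = 24 + 3 * grades + rng.normal(0, 2, size=200)  # value rises with grade

# Kruskal-Wallis H test: do the distributions differ across grades?
groups = [bmi[grades == g] for g in range(4)]
h_stat, p_kw = stats.kruskal(*groups)

# Spearman rank correlation: strength and direction of the trend.
rho, p_sp = stats.spearmanr(grades, bmi)

label = "strong" if abs(rho) >= 0.2 else "weak"  # threshold used in the text
print(f"H={h_stat:.2f} (P={p_kw:.3g}); Spearman r={rho:.3f} ({label})")
```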
Multimodal medical AI model
Multimodal learning has demonstrated significant advantages in medical diagnostics through the integration of complementary data sources. However, in the specific field of computer-aided tongue diagnosis, research on multimodal fusion remains in its early stages. Current approaches attempting to combine tongue image features with clinical indicators often rely on elementary fusion strategies such as feature concatenation or late fusion. These methods fail to capture the intricate and interactive relationships between visual features of the tongue and systemic physiological or metabolic parameters. Moreover, while many existing models treat disease staging as a nominal classification task, the present study focuses specifically on the binary discrimination between healthy individuals and patients with MAFLD. This streamlined objective prioritized accurate screening and early detection, aligning with the practical need for accessible, non-invasive diagnostic tools in population health. The architecture of the proposed “Multimodal-aided Diagnostic System” in this study is illustrated in Figure 2.
The proposed multimodal auxiliary diagnostic system is designed to emulate the comprehensive diagnostic logic employed by clinical experts, akin to the integrative approach of “four diagnostic methods combined” in TCM. By leveraging deep learning techniques, the framework achieves a systematic integration of macro-level visual representations derived from tongue images with micro-level metabolic data obtained from clinical biochemical indicators. This section details the core algorithms of the system across five key components: data preprocessing, overall network architecture, feature extraction mechanisms, multimodal fusion strategy, and classifier design.
Image data preprocessing
High-quality, standardized input data are fundamental to the performance and generalizability of any computational model, particularly in multimodal medical artificial intelligence, where heterogeneous data sources must be integrated. To address pervasive challenges in real-world clinical datasets—such as inconsistent image acquisition protocols, confounding background noise, and missing values in electronic health records—we established a rigorous preprocessing pipeline for both visual and tabular modalities.
Image preprocessing: For tongue images, preserving morphologically and diagnostically significant features was paramount. To avoid geometric distortion of the tongue body, which could obscure critical signs such as swelling or tooth marks, we rejected conventional fixed-size resizing. Instead, we implemented an aspect ratio-preserving Letterbox strategy. The longer side of each image was rescaled to 224 pixels, maintaining the original proportions, while zero-padding (RGB = [0,0,0]) was applied to the shorter side to generate a standardized 224 × 224 pixel input. During resizing, bicubic interpolation was employed to conserve high-frequency texture details relevant to coating analysis. Subsequently, pixel intensities were normalized using Z-score standardization based on ImageNet statistics (mean [0.485, 0.456, 0.406]; standard deviation [0.229, 0.224, 0.225]), ensuring compatibility with pre-trained vision models. To mitigate overfitting and enhance robustness given limited sample sizes, moderate data augmentation was applied stochastically during training only, including horizontal flipping (P = 0.5), minor affine transformations (rotation ± 15°, translation ± 10%), and controlled color jitter (brightness/contrast variation ± 20%).
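The Letterbox strategy described above can be sketched with Pillow; the demo image and its dimensions are stand-ins, and only the aspect-ratio-preserving resize, zero-padding, and ImageNet normalization follow the text:

```python
from PIL import Image
import numpy as np

def letterbox(img: Image.Image, size: int = 224) -> Image.Image:
    """Rescale the longer side to `size` (bicubic), pad the shorter side
    with black (RGB [0, 0, 0]) to a square canvas, preserving aspect ratio."""
    w, h = img.size
    scale = size / max(w, h)
    new_w, new_h = round(w * scale), round(h * scale)
    resized = img.resize((new_w, new_h), Image.BICUBIC)
    canvas = Image.new("RGB", (size, size), (0, 0, 0))
    # Center the resized image on the padded canvas.
    canvas.paste(resized, ((size - new_w) // 2, (size - new_h) // 2))
    return canvas

# ImageNet statistics used for Z-score standardization.
MEAN = np.array([0.485, 0.456, 0.406])
STD = np.array([0.229, 0.224, 0.225])

demo = Image.new("RGB", (640, 480), (180, 90, 90))  # stand-in for a tongue photo
out = letterbox(demo)
arr = (np.asarray(out) / 255.0 - MEAN) / STD  # normalized H x W x 3 array
print(out.size, arr.shape)
```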
Tabular data processing: Clinical and laboratory variables (e.g., Age, BMI, biochemical markers) underwent systematic cleaning and encoding. Missing values, an inevitable feature of retrospective clinical data, were imputed with zero. Crucially, to prevent the loss of information regarding data completeness, a binary mask vector indicating the presence or absence of each feature was concatenated with the imputed dataset, allowing the model to explicitly learn from missingness patterns. Label normalization was performed to ensure consistency; numerical annotations (e.g., fibrosis stages) were extracted from unstructured text using regular expressions, and outliers were rigorously filtered to constrain all labels to a validated, predefined range.
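A minimal sketch of the zero-imputation-plus-mask scheme; the function name and the three-indicator example are illustrative:

```python
import numpy as np

def impute_with_mask(x: np.ndarray) -> np.ndarray:
    """Zero-impute missing entries (NaN) and append a binary mask so the
    model can distinguish observed values from imputed placeholders.
    Input: (n_samples, n_features); output: (n_samples, 2 * n_features)."""
    mask = (~np.isnan(x)).astype(np.float32)  # 1 = observed, 0 = missing
    filled = np.nan_to_num(x, nan=0.0).astype(np.float32)
    return np.concatenate([filled, mask], axis=1)

# Hypothetical mini-batch: 2 subjects x 3 indicators (e.g., age, BMI, ALT),
# with one missing BMI value.
batch = np.array([[31.0, np.nan, 15.0],
                  [28.0, 36.9, 80.0]])
print(impute_with_mask(batch))
```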
This comprehensive preprocessing framework ensures that both imaging and clinical data streams are transformed into a coherent, analysis-ready format, forming a reliable foundation for the subsequent multimodal fusion and modeling stages.
Visual perception stream
This branch handles the tongue image input (Xv ∊ RH×W×3) and distills macro-level visual information, including tongue and coating color, texture, and morphology, into a compact visual feature vector. These representations capture the diagnostically salient surface signs (e.g., coating granularity, tooth marks) that the fusion stage later combines with the metabolic profile.
Backbone network: Given that our dataset size represents a typical small-sample scenario in medical research, we selected ConvNeXt-Tiny as the visual backbone. Unlike Vision Transformers (ViTs), which rely heavily on large-scale pre-training to establish global dependencies, ConvNeXt retains the inherent inductive biases of convolutional neural networks, such as translation equivariance and locality. This characteristic enables it to effectively capture subtle textures (e.g., coating granularity) and local morphological features in tongue images even without extensive pre-training on massive datasets, thereby significantly mitigating the risk of overfitting (Fig. 3).
Metabolic feature stream
Serving as the “logical core” of the system, this branch handles structured clinical biochemical features (Xt ∊ RN). A multilayer perceptron maps discrete and continuous physiological parameters into a high-dimensional latent space, constructing a panoramic metabolic profile of the patient. The resulting feature vector provides a compact representation of systemic inflammation and fibrosis risk, extending beyond a mere numerical transformation.
A multilayer perceptron-based encoder is designed to process continuous clinical features for effective embedding. This component maps heterogeneous physiological and biochemical parameters into a latent space that is semantically aligned with the visual feature representations.
The network architecture comprises a linear layer (20 → 128), followed by one-dimensional batch normalization (BatchNorm1d), a rectified linear unit activation function, and a Dropout layer (P = 0.2). This encoder transforms the raw 20-dimensional clinical data into a 128-dimensional tabular feature embedding Ft, facilitating cross-modal alignment at a semantic level.
The inclusion of BatchNorm1d standardizes the activation distributions across layers, accelerating training convergence. The Dropout layer introduces stochasticity by randomly deactivating neurons, thereby enhancing the model’s robustness to potential missing indicators or measurement errors commonly encountered in clinical data. The resultant metabolic feature vector Ft serves as the conditioning input for the subsequent Dynamic Affine Feature Transformation (DAFT) module, enabling it to actively modulate the visual feature stream (Fig. 4).
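Under the stated architecture (Linear 20 → 128, BatchNorm1d, ReLU, Dropout P = 0.2), the tabular encoder can be sketched in PyTorch; the class and variable names are ours:

```python
import torch
import torch.nn as nn

class TabularEncoder(nn.Module):
    """Maps the 20 raw clinical indicators to a 128-d embedding F_t:
    Linear(20 -> 128) -> BatchNorm1d -> ReLU -> Dropout(p=0.2)."""
    def __init__(self, in_dim: int = 20, out_dim: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, out_dim),
            nn.BatchNorm1d(out_dim),
            nn.ReLU(inplace=True),
            nn.Dropout(p=0.2),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)

enc = TabularEncoder()
f_t = enc(torch.randn(8, 20))  # a batch of 8 subjects
print(f_t.shape)
```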
Dynamic interaction and feature modulation
This is the core innovation of this architecture. Unlike traditional “post-fusion” strategies that simply concatenate the outputs of two streams, we designed a deep interaction mechanism based on DAFT.
In this mechanism, the metabolic feature stream is no longer a passive input but acts as an active “conditional regulator.” It dynamically recalibrates the feature channels in the visual stream by learning the generated affine transformation parameters (scaling factor α and translation factor β).
Mathematically, this mechanism simulates a Bayesian inference process: clinical indicators provide prior probabilities that guide the model to focus on a more discriminative posterior distribution within the visual feature space, thereby significantly suppressing non-pathological visual noise (e.g., illumination variation or physiological tongue enlargement). The modulated multimodal features are subsequently fed into a classifier to output a probability distribution indicating whether the person is healthy or has MAFLD.
Conventional feature fusion methods, such as simple vector concatenation or element-wise summation, often fail to account for the substantial disparities in data distribution and semantic meaning between heterogeneous modalities. To enable clinically informative, low-dimensional metabolic indicators to effectively guide the interpretation of high-dimensional, semantically sparse visual tongue features, we introduce a DAFT module. This component establishes a deep fusion mechanism based on conditional feature recalibration.
The central premise of the DAFT module is to treat the tabular (metabolic) modality as a dynamic “controller” that generates affine transformation parameters for the visual modality. This mechanism allows the model to adaptively enhance or suppress specific channels within the visual feature maps according to the patient’s metabolic profile. For instance, a clinical indicator such as “low platelet count” can guide the model to amplify features corresponding to a “purplish-dark tongue body.” This facilitates deep, interactive integration of multimodal information at the feature-extraction stage.
First, the metabolic features are projected to the dimensionality of the visual feature space:
Let Fv ∊ RDv and Ft ∊ RDt denote the visual and metabolic feature vectors, respectively. The DAFT module employs two lightweight auxiliary networks: a Scale Generator (Gs) and a Shift Generator (Gb). The fusion process is defined as follows:
S = σ(Gs(Ft)) ∊ (0, 1)Dv
B = tanh(Gb(Ft)) ∊ (−1, 1)Dv
where σ is the Sigmoid activation function, ensuring that the scaling factor S acts as a gating signal, and tanh is used to generate the bidirectional feature offset B. Subsequently, a channel-wise affine transformation is performed to generate the fused features Frefined:
Frefined=S⊙Fv+B
Here, ⊙ represents the Hadamard product, i.e., element-wise multiplication. Geometrically, this transformation is equivalent to a dynamic distortion and correction of the visual feature space based on metabolic states. Finally, to preserve the original metabolic information, the recalibrated visual features are combined with the original tabular features via residual connection or concatenation:
Ffusion=Concat(Frefined,Ft)
This design offers inherent clinical interpretability. The scaling factor S mimics a clinician’s “attentional focus,” amplifying the weights of visual feature channels relevant to specific pathological states. Conversely, the shifting factor B acts as a “baseline calibrator,” adjusting the decision threshold for disease severity based on contextual factors such as age or sex, even when visual presentations are similar. This proactive, context-aware fusion strategy demonstrably outperforms passive data concatenation, yielding diagnostic synergy where the integrated whole is greater than the sum of its individual parts (Fig. 5).
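A minimal PyTorch sketch of the DAFT recalibration, assuming the generators Gs and Gb are single linear layers and taking the 768/128 feature dimensions from ConvNeXt-Tiny and the tabular encoder:

```python
import torch
import torch.nn as nn

class DAFT(nn.Module):
    """Dynamic Affine Feature Transformation (sketch following the text):
    the tabular embedding F_t generates per-channel scale S = sigmoid(G_s(F_t))
    and shift B = tanh(G_b(F_t)); the visual vector F_v is recalibrated as
    F_refined = S * F_v + B, then concatenated with F_t."""
    def __init__(self, d_v: int, d_t: int):
        super().__init__()
        self.scale_gen = nn.Linear(d_t, d_v)  # G_s
        self.shift_gen = nn.Linear(d_t, d_v)  # G_b

    def forward(self, f_v: torch.Tensor, f_t: torch.Tensor) -> torch.Tensor:
        s = torch.sigmoid(self.scale_gen(f_t))  # gating signal in (0, 1)
        b = torch.tanh(self.shift_gen(f_t))     # bidirectional offset in (-1, 1)
        refined = s * f_v + b                   # channel-wise affine transform
        return torch.cat([refined, f_t], dim=1)  # preserve metabolic features

daft = DAFT(d_v=768, d_t=128)  # 768 = ConvNeXt-Tiny output dim (assumed)
fused = daft(torch.randn(4, 768), torch.randn(4, 128))
print(fused.shape)
```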
Training strategy and evaluation metrics
Dataset partitioning
To ensure a rigorous, unbiased, and reproducible evaluation of our multimodal framework, we implemented a strict data partitioning protocol. From the final curated cohort of 477 eligible participants, the dataset was randomly split into training, validation, and a held-out independent test set using an 8:1:1 ratio. This resulted in 381 subjects allocated for model training, 48 for validation (used for hyperparameter tuning and early stopping to prevent overfitting), and a final, completely independent set of 48 subjects for testing. The random partitioning was performed at the patient level using stratified sampling to preserve the approximate distribution of key classes (healthy vs. MAFLD) across all splits, thereby mitigating potential evaluation bias.
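Under the stated 8:1:1 patient-level stratified protocol, the partition can be reproduced with scikit-learn; the labels below are stand-ins and the seed is illustrative:

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
ids = np.arange(477)                   # patient-level indices
labels = rng.integers(0, 2, size=477)  # stand-in labels: 0 = healthy, 1 = MAFLD

# First split off the ~10% held-out test set, stratified by class...
trainval_ids, test_ids = train_test_split(
    ids, test_size=0.1, stratify=labels, random_state=42)
# ...then carve the validation set (1/9 of the remainder, ~10% overall).
train_ids, val_ids = train_test_split(
    trainval_ids, test_size=1/9, stratify=labels[trainval_ids], random_state=42)

print(len(train_ids), len(val_ids), len(test_ids))
```

With n = 477, scikit-learn's ceiling rule for the test fraction yields exactly the 381/48/48 counts reported in the text.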
All model development, training, and evaluation were conducted within the PyTorch 2.8 deep learning ecosystem. Computations were performed on dedicated NVIDIA RTX 5090 GPUs (32 GB memory), with software environments containerized to ensure consistency.
Loss function
In clinical practice, the task of early screening, accurately identifying the presence or absence of significant liver fibrosis, holds greater immediate relevance. Therefore, this study formulates the problem as a binary classification task, aiming to determine whether a patient has fibrotic lesions.
Weighted cross-entropy loss: For the binary classification objective, we employ cross-entropy as the base loss function. To address the potential class imbalance commonly encountered in medical datasets, where healthy (negative) samples may outnumber diseased (positive) ones, we introduce a weighting mechanism. This enhances the model’s sensitivity to the minority class (typically the diseased cases), which is critical for reducing false negatives in a screening context.
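As a sketch, this weighting mechanism can be implemented directly; the value of α and the example probabilities are illustrative:

```python
import torch

def weighted_bce(p: torch.Tensor, y: torch.Tensor, alpha: float = 2.0) -> torch.Tensor:
    """Weighted cross-entropy: positive (diseased) terms are scaled by
    alpha > 1 to penalize missed diagnoses more heavily.
    p: predicted positive probabilities; y: true labels in {0, 1}."""
    eps = 1e-7
    p = p.clamp(eps, 1 - eps)  # numerical safety for log()
    loss = -(alpha * y * torch.log(p) + (1 - y) * torch.log(1 - p))
    return loss.mean()

p = torch.tensor([0.9, 0.2, 0.7, 0.1])
y = torch.tensor([1.0, 1.0, 0.0, 0.0])
# The false negative (p = 0.2 for a true positive) dominates the loss when alpha > 1.
print(weighted_bce(p, y).item())
```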
The loss function is defined as follows:
L = −(1/N) ∑i=1N [α yi log(pi) + (1 − yi) log(1 − pi)]
where yi ∊ {0, 1} is the true label of the i-th sample (1 indicates the presence of fibrosis, 0 indicates health), pi is the probability that the model predicts it as positive, and α is the weighting coefficient for positive samples. When α > 1, the model pays more attention to the classification error of positive samples, effectively avoiding the clinical risks caused by “missed diagnoses.” This design ensures that while pursuing high accuracy, the model prioritizes the high sensitivity required for the screening task.
Parameter optimization strategy based on transfer learning
To address the pervasive challenges of data scarcity and costly annotation in medical imaging, this study adopts a backbone-freezing transfer learning strategy.
Rationale: The shallow filters of deep convolutional neural networks typically learn general visual features—such as edges and textures—that exhibit high transferability between natural image domains (e.g., ImageNet) and medical images. For small-scale datasets like ours (≈500 samples), fine-tuning all parameters is prone to overfitting dataset-specific noise and may trigger catastrophic forgetting, thereby degrading the robust feature-extraction capability already acquired during pre-training.
Implementation: Source domain: The visual encoder was initialized with ConvNeXt-Tiny weights pre-trained on the large-scale ImageNet 1K dataset (1.2 million images).
Target domain: On our tongue image dataset, we strictly froze all parameters of the backbone network. Gradient updates were allowed only for the newly added tabular encoder, the DAFT fusion module, and the classification head.
Advantages: This “frozen-backbone” strategy reduces the number of trainable parameters by over 90%, substantially lowering computational overhead. More importantly, it forces the model to learn how to use clinical priors to modulate general visual features, rather than relearning visual representations from scratch. Experiments confirm that this approach achieves better generalization and more stable convergence under limited data compared to full fine-tuning.
Evaluation metrics
Model performance was comprehensively assessed using the following metrics:
Accuracy: Measures the overall proportion of correctly classified samples.
Accuracy = (∑i=1C TPi) / N
where TPi is the number of samples correctly predicted for class i, N is the total number of samples, and C is the total number of classes.
Quadratic weighted kappa (QWK), the primary metric for this study: QWK quantifies the agreement between predicted and true ordinal labels by applying a quadratic weight to the distance of disagreement. It is more sensitive than accuracy in evaluating a model’s ability to correctly assess disease severity progression.
κ = 1 − (∑i,j ωi,j Oi,j) / (∑i,j ωi,j Ei,j), with ωi,j = (i − j)² / (C − 1)²
where Oi,j is the observed confusion matrix, Ei,j is the expected agreement matrix under random chance, and ωi,j is the quadratic weight matrix.
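The κ computation can be sketched directly from these definitions; the labels below are illustrative:

```python
import numpy as np

def quadratic_weighted_kappa(y_true, y_pred, n_classes: int) -> float:
    """QWK: 1 - sum(w * O) / sum(w * E), with quadratic weights
    w_ij = (i - j)^2 / (C - 1)^2."""
    O = np.zeros((n_classes, n_classes))
    for t, p in zip(y_true, y_pred):
        O[t, p] += 1
    i, j = np.indices((n_classes, n_classes))
    w = (i - j) ** 2 / (n_classes - 1) ** 2
    # Expected matrix from the outer product of marginals, scaled to sum(O).
    E = np.outer(O.sum(axis=1), O.sum(axis=0)) / O.sum()
    return 1.0 - (w * O).sum() / (w * E).sum()

y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 0, 1, 2, 2, 2]  # one adjacent-grade error
print(round(quadratic_weighted_kappa(y_true, y_pred, 3), 4))
```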
Sensitivity (Recall): Measures the model’s ability to correctly identify all positive cases (e.g., patients with significant fibrosis ≥ F2).
Sensitivityk=TPkTPk+FNk
where TPk represents the positive examples correctly predicted as belonging to this category, and FNk represents the false negative examples incorrectly predicted as belonging to other categories.
Specificity: Measures the model’s ability to correctly identify all negative cases (e.g., healthy controls or F0 patients).
Specificityk=TNkTNk+FPk
where TNk represents the negative examples correctly excluded from the category, and FPk represents the negative examples incorrectly predicted as belonging to the category.
Results
Trend-based analysis of metabolic indicators
This study enrolled 477 participants. MAFLD patients exhibited significantly higher levels across all measured parameters, including liver function, lipid profiles, uric acid, anthropometric measures, and hepatic fat deposition, compared to healthy subjects, as detailed in Table 1. The distributions of all 20 metabolic indicators differed significantly across the healthy group and MAFLD groups with different fibrosis grades. Spearman rank correlation analysis further demonstrated that the trends of the aforementioned indicators showed a strong positive correlation with fibrosis grade (r > 0.2, P < 0.001). Arranged in descending order of correlation coefficient, the core indicators were BMI (r = 0.7904), VFA (r = 0.7279), WHR (r = 0.6889), GGT (r = 0.5121), ALT (r = 0.5106), and AST (r = 0.498), indicating that the values of these indicators significantly increased with the severity of fibrosis, with no weak positive trends or other trends observed (Fig. 6). Among the lipid profiles, TG (r = 0.4587), CHOL (r = 0.3879), and LDL-C (r = 0.4198) were all significantly and strongly positively correlated with fibrosis grade.
Table 1Comparison of clinical characteristics between healthy group and MAFLD group
| Characteristic | Healthy Group (N = 157) | MAFLD Group (N = 320) | Pa |
|---|---|---|---|
| Gender, n (%) | | | <0.001 |
| Female | 144 (92) | 118 (37) | |
| Male | 13 (8.3) | 202 (63) | |
| Age (y) | 31.00 (27.00, 39.00) | 28.00 (23.00, 34.00) | <0.001 |
| CAP (dB/m) | 230.02 (219.71, 239.43) | 345.21 (331.27, 357.59) | <0.001 |
| LSM (kPa) | 5.84 (5.28, 6.53) | 11.53 (8.97, 15.23) | <0.001 |
| BMI (kg/m2) | 25.30 (24.00, 26.30) | 36.90 (34.50, 40.10) | <0.001 |
| ALT (U/L) | 15.00 (12.00, 18.00) | 80.00 (52.00, 112.00) | <0.001 |
| AST (U/L) | 18.00 (14.50, 20.50) | 40.50 (29.00, 60.00) | <0.001 |
| GGT (U/L) | 14.00 (10.00, 20.50) | 39.00 (27.00, 57.00) | <0.001 |
| CHOL (mmol/L) | 4.74 ± 0.75 | 4.95 ± 0.91 | 0.14 |
| TG (mmol/L) | 1.14 (0.85, 1.65) | 1.81 (1.43, 2.48) | <0.001 |
| HDL-C (mmol/L) | 1.24 (1.04, 1.37) | 0.98 (0.88, 1.09) | <0.001 |
| LDL-C (mmol/L) | 2.67 ± 0.60 | 3.15 ± 0.69 | <0.001 |
| UA (µmol/L) | 322.50 (269.00, 372.00) | 484.00 (426.00, 593.00) | <0.001 |
Tongue image feature analysis
We performed colorimetric analysis by measuring the Lab values of the tongue and coating across predefined regions (including the tip, root, left side, right side, and entire area). Trend analysis was subsequently conducted on the resulting 36 Lab color features (Fig. 7). The results showed that the Lab-b* value (yellowness) of the tongue image was the most significant color indicator changing with the progression of MAFLD fibrosis, exhibiting a clear negative trend in both the tongue and tongue coating areas.
Regarding trend strength, strong negative trends were displayed by ten Lab-b* value features covering multiple regions of both the tongue and coating (entire, center, tip, right side, left side). Only the Lab-b* value at the coating root showed a weak negative trend. No positive trends were observed in any Lab-b* features, confirming a consistent decrease in yellowness across all regions. Linear fitting quantified the rate of yellowness (Lab-b*) decrease. The coating exhibited a significantly faster decline (slope of Lab-b* at the tip: −1.1675) compared to the tongue (slope of Lab-b*: −0.4715). This consistent negative trend was further supported by a stable negative Spearman correlation with fibrosis grade and significant Kruskal-Wallis test results (P < 0.05). In contrast, trends for Lab-a* and Lab-L* values were minimal and non-specific.
A dual-ranking analysis identified the most altered tongue regions. Based on the mean absolute Spearman correlation coefficient (|r|), the top regions were the left side (0.3179), tip (0.3129), and whole area (0.3014). A composite index (F+d) combining ANOVA F-value and Cohen’s d yielded a congruent ranking: left side (1.8694), tip (1.867), and whole area (1.8321) (Fig. 8). Both methods indicated that the coating was generally more sensitive than the tongue, and the lateral edges (particularly the left side) showed the most pronounced inter-group differences.
The observed pattern, in which the most significant changes localize to the lateral tongue edges, aligns with the TCM theory that this region corresponds to the liver and gallbladder. This correlation provides supportive evidence linking the topographic findings of tongue diagnosis to the modern pathological understanding of MAFLD. The heightened sensitivity of the tongue coating also suggests its potential utility as a superficial marker for early screening.
Quantitative analysis of the predictive model
The multimodal fusion network was implemented using the PyTorch framework. Input images were standardized to 224 × 224 pixels. To prevent overfitting given the limited sample size, the ConvNeXt-Tiny backbone was frozen, and only the DAFT fusion module and classification head were optimized. The trainable parameters were updated using the AdamW optimizer with an initial learning rate of 1 × 10−3 and a weight decay of 0.05. A Cosine Annealing scheduler was employed to dynamically adjust the learning rate, decaying it to a minimum of 1 × 10−6 over the course of 100 training epochs. The model was trained with a batch size of 32. For the fibrosis diagnosis task, we utilized the Binary Cross-Entropy Loss function. To ensure reproducibility, the random seed was fixed at 42. We adopted a best-model checkpointing strategy, retaining the model state that achieved the highest QWK on the validation set.
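Assuming PyTorch, the optimization setup described above can be sketched as follows. The `head` network is a small stand-in for the trainable DAFT module and classification head (the ConvNeXt-Tiny backbone being frozen), and the data are synthetic; the checkpointing step is indicated in a comment because no real validation set exists here.

```python
# Sketch of the training configuration: AdamW (lr 1e-3, weight decay
# 0.05), cosine annealing to 1e-6 over 100 epochs, BCE loss on logits,
# fixed seed 42. The model and data are toy stand-ins.
import torch
import torch.nn as nn

torch.manual_seed(42)  # fixed random seed, as in the study

head = nn.Sequential(nn.Linear(128, 32), nn.ReLU(), nn.Linear(32, 1))
optimizer = torch.optim.AdamW(head.parameters(), lr=1e-3, weight_decay=0.05)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(
    optimizer, T_max=100, eta_min=1e-6)
criterion = nn.BCEWithLogitsLoss()   # binary cross-entropy on raw logits

best_qwk, best_state = -1.0, None
for epoch in range(100):
    x = torch.randn(32, 128)                     # one synthetic batch of 32
    y = torch.randint(0, 2, (32, 1)).float()
    optimizer.zero_grad()
    loss = criterion(head(x), y)
    loss.backward()
    optimizer.step()
    scheduler.step()
    # Best-model checkpointing would compare validation QWK here, e.g.:
    # if val_qwk > best_qwk: best_qwk, best_state = val_qwk, head.state_dict()

print(f"final lr = {scheduler.get_last_lr()[0]:.2e}")
```

After 100 scheduler steps the learning rate has annealed to the configured minimum of 1e-6.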
Given the retrospective nature of the study, incomplete clinical records were unavoidable. To address this, we implemented a zero-imputation strategy combined with feature masking for all tabular covariates, including demographic (e.g., age, BMI) and biochemical indices. Specifically, missing entries were replaced with zeros to maintain dimensional consistency. To prevent the model from interpreting these placeholders as clinically meaningful low values, we simultaneously generated a binary mask vector corresponding to the input features. In this masking scheme, a value of 1 denotes an observed measurement, while 0 indicates a missing value. This mask vector is concatenated with the feature vector and fed into the tabular encoder, thereby enabling the neural network to explicitly distinguish between observed data and imputed values, and to adaptively attenuate the noise introduced by missing data during the feature fusion process.
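The zero-imputation-plus-mask encoding can be sketched in a few lines. Here NaN marks a missing entry, and the feature names in the example row are illustrative; the function name is ours, not from the study's code.

```python
# Zero-imputation with an observed-value mask: NaNs become 0, and a
# parallel binary mask (1 = observed, 0 = missing) is appended so the
# tabular encoder can tell a true zero from a placeholder.
import numpy as np

def impute_with_mask(features: np.ndarray) -> np.ndarray:
    """Replace NaNs with 0 and concatenate a binary observed-mask."""
    mask = (~np.isnan(features)).astype(features.dtype)  # 1 where observed
    imputed = np.nan_to_num(features, nan=0.0)
    return np.concatenate([imputed, mask], axis=-1)

# Example row: [age, BMI, ALT] with BMI missing.
row = np.array([54.0, np.nan, 37.0])
encoded = impute_with_mask(row)
print(encoded)   # [54.  0. 37.  1.  0.  1.]
```

The encoder thus receives twice the original feature dimension, with the second half carrying the missingness pattern explicitly.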
To comprehensively evaluate the effectiveness and robustness of our proposed multimodal binary classification model for diagnosing MAFLD, we conducted a quantitative assessment on an independent test set (n = 48). The model demonstrated strong diagnostic performance, achieving an overall accuracy of 97.92% and a QWK of 0.9538. Further analysis of the confusion matrix (Fig. 9) revealed its discriminative capability across classes. The model correctly identified all 16 healthy control samples, yielding a specificity of 100%. Among the 32 MAFLD-positive samples, it successfully recognized 31 cases, resulting in a sensitivity of 96.88%. This balance between high specificity and sensitivity indicated robust generalization on unseen data.
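The reported test-set metrics follow directly from the confusion-matrix counts given above (16/16 controls correct, 31/32 MAFLD cases correct):

```python
# Recomputing the headline metrics from the confusion-matrix cells.
tn, fp = 16, 0     # healthy controls: all 16 correctly identified
tp, fn = 31, 1     # MAFLD-positive: 31 of 32 correctly identified

accuracy    = (tp + tn) / (tp + tn + fp + fn)   # 47/48
sensitivity = tp / (tp + fn)                    # 31/32
specificity = tn / (tn + fp)                    # 16/16

print(f"accuracy={accuracy:.4f}, sensitivity={sensitivity:.4f}, "
      f"specificity={specificity:.4f}")
# accuracy=0.9792, sensitivity=0.9688, specificity=1.0000
```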
Comparative evaluation of unimodal versus multimodal diagnostic approaches
To rigorously evaluate the diagnostic contribution of tongue imaging and to benchmark the added value of multimodal integration, we conducted a systematic comparative analysis. This involved training and validating state-of-the-art deep learning models exclusively on tongue image data, using the identical dataset partition (training, validation, and independent test sets) as employed for our primary multimodal framework.
For this unimodal assessment, we implemented two distinct architectural paradigms: a YOLO-based model optimized for efficient localization and feature extraction from tongue regions, and a ViT model to capture long-range dependencies and global contextual information within the images. Both models were tasked with the binary classification of healthy controls versus MAFLD patients, mirroring the primary objective of our multimodal system.
The performance of these image-only models was quantitatively inferior to that of our multimodal architecture (Table 2). While achieving non-trivial accuracy, both the YOLO and ViT baselines exhibited consistently lower sensitivity, specificity, and overall balanced accuracy on the held-out test set. Specifically, the unimodal models demonstrated a notable decrease in sensitivity, indicating a higher rate of missed MAFLD cases compared to our integrated approach.
Table 2. Comparison of performance metrics for the MAFLD multimodal diagnosis model
| Models | QWK | Accuracy | Sensitivity | Specificity |
|---|---|---|---|---|
| ViT | 0.3662 | 0.6875 | 0.6562 | 0.7500 |
| YOLOv8 | 0.3836 | 0.6875 | 0.6250 | 0.8125 |
| Our proposed model | 0.9538 | 0.9792 | 0.9688 | 1.0000 |
This performance gap underscores a key finding: although tongue imagery contains discriminative signals related to MAFLD, these features are insufficient in isolation for robust screening. The superior diagnostic accuracy of our multimodal model arises from the synergistic integration of visual tongue features with core metabolic and clinical parameters. The comparative results provide empirical validation that the complex pathophysiology of MAFLD is more completely captured by the confluence of phenotypic (tongue appearance) and physiological (clinical biomarkers) data domains than by either modality alone. This evidence firmly establishes the necessity and advantage of the proposed multimodal fusion framework for improving non-invasive MAFLD screening.
Interpretability analysis
To elucidate the decision logic of our multimodal model and establish its clinical relevance, we employed complementary interpretability techniques across both visual and feature domains. Attention visualization using EigenCAM revealed that the model consistently attended to anatomically and diagnostically meaningful regions of the tongue. As shown in the generated heatmaps (Fig. 10), high-response areas were primarily concentrated in the central tongue body and the lateral edges. This focus aligns precisely with core tenets of TCM diagnosis: the central region corresponds to the spleen and stomach, often affected by dampness accumulation in MAFLD, while the lateral edges correspond to the liver and gallbladder, where signs of qi stagnation manifest.
This visual interpretability was further quantified and extended through SHapley Additive exPlanations (SHAP), which attributed each prediction to the contributing features and identified the top drivers of the model's decisions (Fig. 11).
(a) Feature importance ranking: bars represent the mean absolute SHAP value for each clinical variable, indicating its average magnitude of contribution to the model's decision. Body composition and liver function markers, notably BMI and ALT, emerge as the most influential predictors, consistent with the central role of adiposity and hepatocellular injury in MAFLD pathogenesis.
(b) Directional impact of individual features: each point represents the SHAP value for a feature in a single sample; the horizontal position indicates whether the feature value increased (positive SHAP) or decreased (negative SHAP) the predicted likelihood of MAFLD. Features are ordered vertically as in (a). This visualization reveals consistent directional effects: for example, higher BMI and ALT values systematically increase the predicted risk, whereas higher albumin levels exhibit a protective effect. The analysis confirms that the model's decisions are driven by clinically meaningful and biologically interpretable feature interactions.
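How the two SHAP summaries are derived from a per-sample attribution matrix can be sketched as follows. The SHAP values here are synthetic stand-ins (the real ones come from explaining the trained model), and the feature set is truncated for illustration.

```python
# From a per-sample SHAP matrix (rows = samples, cols = features) to the
# two summary views: the mean-|SHAP| importance ranking and the signed
# per-sample values whose mean reveals directional effect.
import numpy as np

rng = np.random.default_rng(7)
features = ["BMI", "ALT", "Albumin", "Age"]

# Synthetic SHAP values: BMI and ALT push predictions up (risk-increasing);
# albumin pushes them down (protective); age contributes little.
shap_values = np.column_stack([
    rng.normal(0.8, 0.2, 100),    # BMI
    rng.normal(0.6, 0.2, 100),    # ALT
    rng.normal(-0.4, 0.2, 100),   # Albumin
    rng.normal(0.05, 0.2, 100),   # Age
])

mean_abs = np.abs(shap_values).mean(axis=0)        # importance, panel (a)
order = np.argsort(mean_abs)[::-1]
ranking = [features[i] for i in order]

print("importance ranking:", ranking)
print("mean signed SHAP (Albumin):", shap_values[:, 2].mean().round(3))
```

The ranking reproduces the qualitative pattern described above: adiposity and liver-injury markers dominate, while albumin's negative mean signed SHAP marks its protective direction.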
Collectively, these interpretability analyses demonstrate that our data-driven model does not merely learn statistical correlations but discovers and leverages biomedical features with established diagnostic significance. The convergence of its attention mechanisms with TCM theory and its reliance on key metabolic and morphological indicators provide a transparent, clinically grounded rationale for its predictions, effectively bridging computational analysis with traditional diagnostic wisdom.
Discussion
Against the backdrop of global lifestyle shifts, metabolic diseases, particularly MAFLD, have emerged as a major public health challenge.20 The MAFLD criteria enable better identification of individuals with hepatic steatosis and significant fibrosis, as assessed by NITs.21 Given that MAFLD with clinically significant fibrosis markedly elevates the risk of liver-related complications and mortality,22 early screening is of critical importance. In TCM theory, tongue diagnosis serves as a key indicator of the functional state of zang-fu organs, qi, and blood. However, conventional TCM tongue assessment has long relied on subjective clinical evaluation, lacking objective standardization.23 Meanwhile, current non-invasive tools for MAFLD, such as the Fatty Liver Index, NAFLD Fibrosis Score, and Fibrosis-4 Index, have advanced fibrosis risk stratification,24 yet more accessible early screening models remain needed. TCM tongue diagnosis offers a convenient and promising avenue for this purpose.
A key methodological strength lies in the model’s interactive fusion design. We built a dual-stream architecture where a ConvNeXt-Tiny network processes tongue images and a multilayer perceptron encodes clinical variables. The DAFT module enables context-aware fusion—metabolic features dynamically recalibrate visual feature maps, simulating clinician reasoning. Using a weighted cross-entropy loss and a frozen-backbone transfer-learning strategy, the model prioritized sensitivity (96.88%) and robustness, which are critical for a screening tool.
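The context-aware recalibration performed by the DAFT module can be sketched as a FiLM-like affine transform: the tabular branch predicts a per-channel scale and shift that modulate the image feature maps. Layer sizes and the hidden width below are illustrative assumptions, not the study's exact configuration.

```python
# Minimal DAFT-style fusion block: clinical covariates dynamically
# recalibrate visual feature maps via a learned per-channel affine
# transform (scale * feat + shift), broadcast over spatial positions.
import torch
import torch.nn as nn

class DAFTBlock(nn.Module):
    def __init__(self, channels: int, tab_dim: int, hidden: int = 16):
        super().__init__()
        # Tabular branch predicts 2*channels values: scale and shift.
        self.film = nn.Sequential(
            nn.Linear(tab_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 2 * channels))

    def forward(self, feat: torch.Tensor, tab: torch.Tensor) -> torch.Tensor:
        scale, shift = self.film(tab).chunk(2, dim=1)
        # Broadcast the per-channel affine over the spatial dimensions.
        return feat * scale[:, :, None, None] + shift[:, :, None, None]

# Toy pass: batch of 4, 96-channel 7x7 feature maps, 12 clinical covariates.
block = DAFTBlock(channels=96, tab_dim=12)
out = block(torch.randn(4, 96, 7, 7), torch.randn(4, 12))
print(out.shape)   # torch.Size([4, 96, 7, 7])
```

Because the scale and shift depend on the patient's metabolic profile, the same visual feature map is weighted differently for different clinical contexts, which is the "clinician reasoning" analogy drawn above.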
On an independent test set, the model achieved an accuracy of 97.92% and a QWK of 0.9538, with 96.88% sensitivity and 100% specificity, outperforming single-modality and conventional serological models. Interpretability analyses confirmed that the model focused on tongue regions aligned with TCM theory and was driven by key metabolic features such as visceral fat area. Notably, a progressive decline in tongue yellowness (Lab-b* value) was observed with fibrosis progression, most pronounced at the lateral edges—consistent with the TCM principle linking the tongue’s left side to the liver and gallbladder. This finding provides objective, modern evidence supporting the physiologic relevance of TCM tongue inspection.
Clinically, the model is designed as a practical binary (Healthy vs. MAFLD) screening tool for primary-care or community settings. Its purpose is efficient triage—identifying individuals needing further specialist assessment—rather than replicating detailed fibrosis staging. This focus enhances sensitivity, reduces missed cases, and improves generalizability, making it suitable for resource-limited environments.
Several limitations of the current study should be acknowledged. First, the model in its present form performs binary screening for MAFLD and does not provide a stratification of fibrosis severity. Second, the single-center, cross-sectional design necessitates external validation in multi-center, prospective cohorts. Although the 8:1:1 data partitioning ensured internal validity, the independent test size remains modest. Future studies with expanded cohorts are needed to enhance statistical power and to evaluate model generalizability across diverse populations and imaging conditions.
Methodologically, while LSM by vibration-controlled transient elastography is a widely validated, non-invasive surrogate for liver biopsy in routine practice,25,26 an ideal diagnostic tool should also demonstrate the capacity for specific fibrosis staging, prognosis prediction, progression monitoring, and treatment response assessment.27 To advance towards non-invasive fibrosis grading, particularly in cohorts without biopsy confirmation, future work must develop more precise image recognition techniques, robust multimodal data fusion frameworks, and rigorous ordinal regression methods. Key technical challenges include mitigating potential domain shifts in tongue image acquisition and clinical covariates,28 as well as enhancing the robustness of core fusion modules to variations in input data quality.
In summary, this study presents a novel, interpretable multimodal framework that synergizes TCM tongue diagnostics with metabolic indicators for non-invasive MAFLD screening. By combining methodological innovation with clinically meaningful interpretation, the model offers a low-cost, practical tool for large-scale risk stratification, particularly in settings where specialist resources are limited.
Conclusions
This study successfully developed and validated a clinically oriented, multimodal auxiliary screening and diagnostic model for MAFLD that integrates objective TCM tongue appearance features with conventional metabolic indicators. Through interpretable fusion analysis, it provides a practical application for using TCM tongue appearance as a “window” for the non-invasive assessment of MAFLD. This model holds promise as a new non-invasive, low-cost, and efficient tool for MAFLD screening, particularly in regions with limited healthcare resources.
Declarations
Ethical statement
The study protocol adhered strictly to the ethical principles of the Declaration of Helsinki (as revised in 2024) and the relevant regulations of China’s “Ethical Review Measures for Biomedical Research Involving Humans.” It was approved by the Ethics Committee of the Hubei Provincial Hospital of Traditional Chinese Medicine, the lead institution (Approval No. HBZY2022-C08-01). All participants were thoroughly informed about the study’s purpose, procedures, potential risks, and benefits. Ample time was provided for consideration, and written informed consent was obtained from each subject prior to enrollment.
Data sharing statement
The data that support the findings of this study are available on request from the corresponding authors. The data are not publicly available due to privacy and ethical restrictions, as they contain sensitive participant information, including clinical health records and identifiable tongue images.
Funding
This work was supported by the National Natural Science Foundation of China (No. 82274352 to XDL), the Hubei Provincial Key Research and Development Program (No. 2024BCB038 to XDL), City University of Hong Kong (7006082, 7020073, 9609332, 9609333, 9678292, 7020002), and the Research Grants Council (9048206, 8799020 to BLK).
Conflict of interest
The authors have no conflicts of interest related to this publication.
Authors’ contributions
Study concept and design, methodology, funding acquisition, project administration, and review & editing of the manuscript (CXL, MZX, CHL, BLK, XDL); data curation, formal analysis, visualization, and writing – original draft (CXL, CXT, QYH); investigation (tongue image acquisition and data processing) (ZXS, QH); investigation (clinical data collection and validation) (YBJ, LW, LHZ, HYY, WBZ); AI algorithm design, model architecture development, model training, and illustration of AI model architecture and related figures (QYH, YBJ); AI model validation and optimization (LW); bioinformatics and statistical analysis (QYH, HYY); resources and patient recruitment (HZ, JZ). All authors approved the final manuscript.