Introduction
Metabolic dysfunction-associated steatotic liver disease (MASLD) has attracted increasing attention, with a global prevalence of around 30–35%.1 A recent epidemiological model prediction suggests that by 2030, the global prevalence of MASLD will rise to 36.8%, corresponding to approximately 101.2 million people. By 2050, the prevalence is expected to increase further to 41.4%, affecting approximately 122 million people.2 With this rising prevalence, the disease burden of MASLD-related liver cancer has also become increasingly significant.3 Globally, incident cases, deaths, and disability-adjusted life years attributable to MASLD-related liver cancer in 2019 increased by 205%, 195%, and 166%, respectively, compared with 1990.4 In the United States, MASLD has become the leading cause of liver cancer among liver transplant candidates.5
Despite growing awareness of MASLD, the causal biomarkers underlying this condition remain unclear.6 Identifying such biomarkers has become an urgent research priority. As an effective tool for causal inference, Mendelian randomization (MR) can reduce confounding and reverse causation biases inherent in observational studies by using genetic variations as instrumental variables, thereby providing more robust evidence for causal relationships between diseases and related factors.7,8 Although some studies have applied MR to investigate genetic susceptibility and potential biomarkers for MASLD, most published research has focused on a limited number of biomarkers, lacking a systematic and comprehensive approach.9,10
Biomarkers related to MASLD include not only traditional clinical biomarkers but also novel molecular biomarkers identified through multi-omics technologies.11 Clinical biomarkers, as standard medical indicators, are widely used for disease diagnosis, prognostic assessment, and therapeutic monitoring due to their accessibility and established clinical utility.11 In contrast, molecular biomarkers reflect the underlying mechanisms of disease development and progression and are typically derived from multi-omics platforms such as transcriptomics and proteomics.11,12 Although many biomarkers associated with MASLD have been identified, their clinical significance remains insufficiently characterized.13 When used collectively, these biomarkers have the potential to improve the accuracy of noninvasive MASLD diagnosis, thereby reducing the need for liver biopsies.14,15 Moreover, they may serve as prognostic indicators; for example, the novel acMASH index has been shown to correlate with all-cause mortality risk and assist in risk stratification.16 Therefore, understanding the clinical and translational relevance of biomarkers with a confirmed causal relationship to MASLD is critically important.
In this study, we aimed to identify causal molecular biomarkers (based on proteomics data) and clinical biomarkers associated with MASLD, and to evaluate their diagnostic and prognostic significance. First, we conducted MR analysis to assess the causal effects of 2,925 molecular biomarkers and 35 clinical biomarkers on MASLD. Mediation analysis was then performed to determine whether key clinical biomarkers mediated the effects of molecular biomarkers (exposures) on MASLD (the outcome). The association between key clinical biomarkers and MASLD was also externally validated in a hospital-based cohort. Second, we applied six machine-learning algorithms to develop and validate a novel noninvasive diagnostic model for MASLD based on the identified molecular biomarkers. Third, we explored the prognostic value of molecular biomarkers for the development of hepatocellular carcinoma (HCC) and for survival in patients with HCC, using data from The Cancer Genome Atlas. We also assessed the prognostic significance of key clinical biomarkers for all-cause mortality and cause-specific mortality by analyzing prospective data from the National Health and Nutrition Examination Survey (NHANES).
Methods
Data collection of molecular and clinical biomarkers and MASLD for MR analysis
In this study, the molecular biomarkers were derived from proteomics data obtained from the FinnGen database.17 This database included 619 samples encompassing 2,925 molecules, such as apolipoprotein E (APOE), canopy FGF signaling regulator 4 (CNPY4), ectonucleoside triphosphate diphosphohydrolase 6 (ENTPD6), major histocompatibility complex, class I, A (HLA-A), secretogranin III (SCG3), and torsin 1A interacting protein 1 (TOR1AIP1), which were publicly released in April 2024.17 Our clinical biomarkers comprised 35 blood and urine biomarkers from the UK Biobank dataset, which included 363,228 individuals.18 These 35 biomarkers, which include serum albumin, apolipoprotein A (ApoA), gamma-glutamyl transferase (GGT), high-density lipoprotein cholesterol (HDL-C), insulin-like growth factor 1 (IGF-1), total protein, triglycerides, and urinary sodium, are widely used in clinical diagnostics to assess various physiological functions. Data on MASLD were derived from a meta-analysis of genome-wide association studies (GWAS) involving four cohorts with electronic health record–documented MASLD among participants of European ancestry (8,434 cases and 770,180 controls).19
Instrumental variable screening and MR analysis
We referred to prior literature for the screening criteria of instrumental variables (IVs) (Supplementary Method 1).20 For example, in selecting IVs for the 35 clinical biomarkers, single-nucleotide polymorphisms (SNPs) were selected based on genome-wide significance (P < 5 × 10⁻⁸) and were subsequently clumped to ensure independence using a linkage disequilibrium threshold of r² < 0.001 within a 10,000-kb distance. A two-step, two-sample MR approach was employed to assess the causal relationships between molecular biomarkers and MASLD, as well as between clinical biomarkers and MASLD. This approach is a genetics-based method for causal inference, conducted through two independent sample datasets in two stages. In the first step, GWAS was used to identify genetic variants (such as SNPs) significantly associated with the exposure factor in the initial sample, thereby establishing a strong link between these variants and the exposure.20 In the second step, these instrumental variables were tested for association with disease outcomes in a second, independent sample using statistical models, such as the inverse-variance weighted (IVW) method, while conducting heterogeneity and pleiotropy tests to validate the plausibility of the causal hypothesis.20 To ensure the reliability of the MR results, we conducted heterogeneity, pleiotropy, and sensitivity analyses (Supplementary Method 2). All MR analyses employed the IVW method as the primary test for causal associations, and results were verified using the MR-Egger, weighted median, weighted mode, and simple mode methods. Since the IVW method served as the main causal association test, we corrected IVW results using the false discovery rate method; Pfdr < 0.05 was considered statistically significant. For other analyses without false discovery rate adjustment, a P-value < 0.05 was considered indicative of statistical significance.
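As an illustration of the primary test described above, the following minimal sketch computes a fixed-effect IVW estimate from per-SNP summary statistics and then adjusts a set of P-values for the false discovery rate. It is not the authors' pipeline: the function names and the SNP values are hypothetical, and the Benjamini-Hochberg procedure is assumed because the text does not name the specific false discovery rate method.

```python
import math


def ivw_estimate(beta_exposure, beta_outcome, se_outcome):
    """Fixed-effect inverse-variance weighted (IVW) causal estimate.

    Each SNP contributes a Wald ratio (beta_outcome / beta_exposure),
    weighted by beta_exposure**2 / se_outcome**2.
    Returns the pooled log-odds estimate and its standard error.
    """
    weights = [bx ** 2 / sy ** 2 for bx, sy in zip(beta_exposure, se_outcome)]
    ratios = [by / bx for bx, by in zip(beta_exposure, beta_outcome)]
    beta = sum(w * r for w, r in zip(weights, ratios)) / sum(weights)
    se = math.sqrt(1.0 / sum(weights))
    return beta, se


def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted P-values (assumed FDR method)."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical summary statistics for three independent SNPs.
bx = [0.10, 0.20, 0.15]    # SNP-exposure effects
by = [0.05, 0.10, 0.075]   # SNP-outcome effects (log odds)
sy = [0.02, 0.03, 0.025]   # standard errors of SNP-outcome effects

beta, se = ivw_estimate(bx, by, sy)
print(f"IVW OR = {math.exp(beta):.3f}")
print(benjamini_hochberg([0.01, 0.02, 0.03, 0.04]))
```

In practice, dedicated MR software would also supply the MR-Egger, weighted median, and mode-based estimators; the sketch covers only the IVW step and the multiple-testing correction.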
Mediation effect analysis to identify key clinical biomarkers with a mediating role
Mediation effect analysis was used to determine whether key clinical biomarkers mediated the causal relationship between molecular biomarkers and MASLD. The effect sizes were calculated as follows: total effect = C × 100%; indirect effect = A × B × 100%; direct effect (C′) = [C – (A × B)] × 100%; and proportion of the mediation effect = (A × B) / C × 100%. Here, A is the β value (β1) from the MR analysis of the key protein on the clinical marker; B is the β value (β2) from the MR analysis of the clinical marker on MASLD; and C is the β value from the MR analysis of the key protein on MASLD.
Validation of the association between key clinical biomarkers and MASLD
Through mediation analyses, we identified key clinical biomarkers that mediate the effects of molecular biomarkers on MASLD. To further validate externally the association between these biomarkers and MASLD, we utilized a follow-up cohort of MASLD patients from the First Affiliated Hospital of Xi’an Medical University who had undergone vibration-controlled transient elastography examinations.20 Patients with a controlled attenuation parameter value ≥ 248 dB/m, as assessed by FibroScan, were considered to have hepatic steatosis (Supplementary Method 3).21
Machine learning (ML) algorithms for MASLD based on the identified molecular biomarkers
We employed six widely used ML algorithms to develop a noninvasive diagnostic model for MASLD: Extreme Gradient Boosting (XGBoost), Random Forest (RF), K-Nearest Neighbors (KNN), Support Vector Machine (SVM), Multilayer Perceptron (MLP), and Light Gradient Boosting Machine (LightGBM). Expression profiles of the identified molecular biomarkers were obtained from the Gene Expression Omnibus database, with GSE89632 (n = 63) serving as the training set and GSE48452 (n = 73) as the external validation cohort.
Evaluation of the prognostic significance of the identified molecular and clinical biomarkers
We further evaluated the prognostic value of the identified molecular biomarkers, particularly their impact on HCC development and overall survival (Supplementary Method 4). To investigate associations between clinical biomarkers and the risk of all-cause and cause-specific mortality, we analyzed prospective data from NHANES collected between 1999 and 2006 (Supplementary Method 5). The study design is illustrated in Figure 1.
Results
Molecular biomarkers causally related to MASLD
The IVW algorithm indicated that among the 2,925 molecular biomarkers, only six exhibited a significant causal relationship with MASLD (Fig. 2A). The mean F-statistics for the selected IVs were as follows: APOE (F = 36.919), CNPY4 (F = 26.539), ENTPD6 (F = 43.279), HLA-A (F = 38.635), SCG3 (F = 30.938), and TOR1AIP1 (F = 48.678). These values were well above the threshold of 10, indicating that all the IVs were strong instruments. The IVW algorithm showed that the odds ratio (OR) for APOE was 1.057 (95% confidence interval (CI): 1.031–1.083, Pfdr = 4.62E-03), for CNPY4 was 1.054 (95% CI: 1.029–1.081, Pfdr = 8.38E-03), for ENTPD6 was 1.031 (95% CI: 1.016–1.045, Pfdr = 5.75E-03), for HLA-A was 0.969 (95% CI: 0.960–0.979, Pfdr = 1.81E-06), for SCG3 was 0.956 (95% CI: 0.937–0.975, Pfdr = 4.62E-03), and for TOR1AIP1 was 0.964 (95% CI: 0.950–0.979, Pfdr = 8.13E-04). Results from the other four algorithms are shown in Supplementary Table 1. Cochran’s Q test revealed no heterogeneity in the results for any of the six proteins (Supplementary Table 2). The MR-Egger intercept test, conducted to evaluate pleiotropy among the IVs, indicated no horizontal pleiotropy between these six biomarkers and MASLD (Supplementary Table 2). The leave-one-out sensitivity analysis confirmed that the relationship between TOR1AIP1 and MASLD remained stable (Fig. 2B). In the scatter plots, the relationship between TOR1AIP1 and MASLD showed a consistent trend across the five MR algorithms (with OR values all less than 1, in line with the IVW estimate of 0.964), further supporting the robustness of our findings (Fig. 2C). Scatter plots and leave-one-out sensitivity analyses for the remaining five proteins are detailed in Supplementary Figures 1–5. The reverse MR analysis found no significant association between MASLD and these six proteins.
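The instrument-strength check and the conversion from a log-odds estimate to an OR with a 95% CI can be sketched as follows. The per-SNP approximation F ≈ (β/SE)² is a common convention and is assumed here, since the exact formula is given only in Supplementary Method 1; the function names and input values are hypothetical.

```python
import math


def mean_f_statistic(betas, ses):
    """Mean per-SNP F-statistic, approximated as (beta / se)**2.

    betas and ses are the SNP-exposure effects and their standard errors.
    """
    f_values = [(b / s) ** 2 for b, s in zip(betas, ses)]
    return sum(f_values) / len(f_values)


def odds_ratio_ci(beta, se, z=1.96):
    """Convert a log-odds estimate to an OR with a Wald-type 95% CI."""
    return math.exp(beta), math.exp(beta - z * se), math.exp(beta + z * se)


# Hypothetical SNP-exposure effects for two instruments.
f_mean = mean_f_statistic([0.10, 0.20], [0.02, 0.05])
print(f"mean F = {f_mean:.1f}, strong instruments: {f_mean > 10}")

# Hypothetical IVW log-odds estimate and standard error.
or_, lower, upper = odds_ratio_ci(-0.03, 0.005)
print(f"OR = {or_:.3f} (95% CI: {lower:.3f}-{upper:.3f})")
```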
Clinical biomarkers causally related to MASLD
The IVW algorithm indicated that among the 35 clinical biomarkers, only eight exhibited a significant relationship with MASLD (Fig. 3A). The mean F-statistics for the selected IVs were as follows: albumin (F = 73.095), ApoA (F = 135.563), GGT (F = 129.355), HDL-C (F = 145.250), IGF-1 (F = 98.572), urinary sodium (F = 40.886), total protein (F = 63.070), and triglycerides (F = 147.741). All these values exceeded the threshold of 10, indicating that the IVs were strong instruments. The IVW algorithm revealed that the OR for albumin was 1.373 (95% CI: 1.140–1.654, Pfdr = 4.11E-03), for ApoA was 0.811 (95% CI: 0.695–0.946, Pfdr = 0.026), for GGT was 1.281 (95% CI: 1.167–1.406, Pfdr = 1.25E-06), for HDL-C was 0.792 (Pfdr = 7.53E-05), for IGF-1 was 0.870 (95% CI: 0.784–0.964, Pfdr = 0.027), for urinary sodium was 2.583 (95% CI: 1.407–4.743, Pfdr = 8.548E-03), for total protein was 1.248 (95% CI: 1.083–1.438, Pfdr = 8.488E-03), and for triglycerides was 1.392 (95% CI: 1.239–1.563, Pfdr = 2.15E-07). The MR-Egger intercept test demonstrated no horizontal pleiotropy between any of these biomarkers and MASLD. The scatter plots indicated that the relationship between total protein and MASLD exhibited a consistent trend (with OR values all greater than 1) across the five MR algorithms, further supporting the robustness of the findings (Fig. 3B). Additionally, a Manhattan plot was used to display the distribution characteristics of SNPs for total protein after removing linkage disequilibrium (Fig. 3C). Scatter plots for the other seven clinical biomarkers are shown in Supplementary Figures 6–9.
Total protein demonstrated a significant mediating effect in the relationship between HLA-A and MASLD
Before conducting the mediation analysis, we used an MR approach to explore the causal relationships between the six molecular biomarkers (exposures) and the eight clinical biomarkers (outcomes), which served as a prerequisite for the mediation analysis. This analysis revealed that the OR between HLA-A and total protein was 0.967 (95% CI: 0.948–0.987, Pfdr = 1.15E-02). The results confirmed no horizontal pleiotropy between HLA-A and MASLD (MR-Egger intercept = –0.011, Pfdr = 0.556). The scatter plots and leave-one-out sensitivity analyses for the relationship between HLA-A and total protein are presented in Supplementary Figure 10. Further mediation analysis showed that total protein exhibited a significant mediating effect in the association between HLA-A and MASLD (Fig. 4A, B). Specifically, the total effect of HLA-A on MASLD corresponded to an OR of 0.969 (95% CI: 0.960–0.979, Pfdr = 1.81E-06), and total protein mediated 23.61% of this relationship (indirect effect: OR = 0.993, Pfdr < 0.05).
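The mediated proportion can be reproduced from the odds ratios reported in this article using the product-of-coefficients formulas given in the Methods. The sketch below assumes that each β equals the natural logarithm of the corresponding reported OR; the function name is hypothetical, and the authors' own computation may have used unrounded estimates.

```python
import math


def mediation_summary(or_exposure_mediator, or_mediator_outcome, or_total):
    """Product-of-coefficients mediation on the log-odds scale.

    A = ln(OR) of exposure on mediator, B = ln(OR) of mediator on outcome,
    C = ln(OR) of exposure on outcome (total effect).
    """
    a = math.log(or_exposure_mediator)
    b = math.log(or_mediator_outcome)
    c = math.log(or_total)
    indirect = a * b        # A x B
    direct = c - indirect   # C' = C - (A x B)
    return {
        "indirect_or": math.exp(indirect),
        "direct_or": math.exp(direct),
        "proportion_mediated": indirect / c,
    }


# Reported IVW odds ratios: HLA-A -> total protein (0.967),
# total protein -> MASLD (1.248), HLA-A -> MASLD (0.969).
result = mediation_summary(0.967, 1.248, 0.969)
print(f"indirect OR = {result['indirect_or']:.3f}")
print(f"proportion mediated = {result['proportion_mediated']:.2%}")
```

With these inputs, the indirect effect rounds to an OR of 0.993 and the mediated proportion to about 23.6%, consistent with the values reported above.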
Validation of the relationship between serum total protein levels and MASLD
The independent validation cohort included 330 patients with MASLD confirmed by vibration-controlled transient elastography and 85 controls. Baseline characteristics of the participants are shown in Supplementary Table 3. In the subsequent multivariable logistic regression analysis (Supplementary Table 4), serum total protein levels showed a positive association with MASLD risk that remained statistically significant across all models. Specifically, in the age- and sex-adjusted model, the OR for MASLD was 1.103 (95% CI: 1.064–1.143). In multivariable model 1, which adjusted for age, sex, body mass index, diabetes, and hypertension, the OR was 1.092 (95% CI: 1.052–1.134, P < 0.001). Even after additional adjustment for serum liver enzymes, lipids, creatinine, HbA1c, and platelet count (adjusted model 2), the association between total protein and MASLD remained significant, with an adjusted OR of 1.080 (P = 0.023).
Performance of multiple machine learning algorithms for MASLD based on six molecular biomarkers
We developed a noninvasive model for diagnosing MASLD using the six identified molecular biomarkers and six commonly used supervised machine-learning algorithms. As shown in Figure 5A, the RF model demonstrated the best performance in the training set, achieving an area under the receiver operating characteristic curve (AUC) of 0.941 (95% CI: 0.829–1.000), followed by KNN (AUC = 0.885, 95% CI: 0.732–1.000) and XGBoost (AUC = 0.834, 95% CI: 0.662–0.975). Conversely, the Multilayer Perceptron (MLP) model performed poorly, with an AUC of 0.536 (95% CI: 0.244–0.827). Additionally, we applied the SHapley Additive exPlanations (SHAP) method to analyze the RF model. As illustrated in Figure 5B, this analysis revealed that CNPY4, SCG3, TOR1AIP1, and ENTPD6 were the top four molecular features influencing the model output. Notably, CNPY4 exhibited the highest mean SHAP value, indicating its pivotal role in distinguishing MASLD from non-MASLD individuals. In the validation dataset (GSE48452) (Fig. 5C), the RF model maintained superior discriminative performance with an AUC of 0.875, indicating robust generalization. KNN and XGBoost also showed consistent performance, while MLP continued to perform poorly. To assess the robustness and potential overfitting of the RF model, we performed 500 bootstrap resampling iterations separately in the training and validation sets. The resulting receiver operating characteristic (ROC) curves with 95% confidence intervals are shown in Supplementary Figure 11.
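The bootstrap assessment can be illustrated with a minimal sketch that computes the AUC and a percentile 95% CI from 500 resamples. The percentile interval, the function names, and the toy labels and scores are assumptions for illustration only; the article does not specify how its intervals were derived, and real inputs would be the predicted probabilities from the fitted RF model.

```python
import random


def auc(labels, scores):
    """AUC as the probability that a case scores higher than a control."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes are required to compute the AUC")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def bootstrap_auc_ci(labels, scores, n_boot=500, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the AUC (resampling with replacement)."""
    rng = random.Random(seed)
    n = len(labels)
    stats = []
    while len(stats) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        ys = [labels[i] for i in idx]
        if len(set(ys)) < 2:
            continue  # skip resamples that contain a single class
        stats.append(auc(ys, [scores[i] for i in idx]))
    stats.sort()
    lower = stats[int((alpha / 2) * n_boot)]
    upper = stats[min(n_boot - 1, int((1 - alpha / 2) * n_boot))]
    return lower, upper


# Hypothetical labels (1 = MASLD) and model scores.
y = [0, 0, 0, 0, 1, 1, 1, 1, 1, 0]
p = [0.1, 0.3, 0.2, 0.6, 0.8, 0.7, 0.4, 0.9, 0.65, 0.35]
print(f"AUC = {auc(y, p):.3f}")
print("95% CI:", bootstrap_auc_ci(y, p))
```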
Prognostic molecular and clinical biomarkers associated with MASLD
We further analyzed the prognostic significance of key molecular biomarkers closely associated with MASLD, particularly their impact on the development of HCC and overall survival. We found that CNPY4 was significantly upregulated in HCC (Supplementary Fig. 12). Across multiple cancer types, elevated CNPY4 expression demonstrated varying degrees of association with overall survival, with a particularly strong link in the liver hepatocellular carcinoma (LIHC) cohort, where high CNPY4 expression was significantly associated with poor prognosis (HR = 1.753, Supplementary Fig. 13). Notably, in LIHC, CNPY4 expression was positively correlated with several immune cell types, especially macrophages, dendritic cells, and T helper cells (Supplementary Fig. 14), suggesting that CNPY4 may contribute to tumor progression by modulating the tumor immune microenvironment. Similarly, we observed that ENTPD6 was significantly overexpressed in LIHC (P < 0.001, Supplementary Fig. 15). Survival analysis indicated that high ENTPD6 expression was significantly associated with poorer overall survival (HR = 1.483, P < 0.05, Supplementary Fig. 16), highlighting ENTPD6 as another potential adverse prognostic biomarker. In LIHC, ENTPD6 expression also exhibited significant positive associations with various immune cell populations, including dendritic cells, macrophages, CD8+ T cells, T helper cells, and regulatory T cells (P < 0.05, Supplementary Fig. 17).
Given that serum total protein level mediates the effect of the molecular biomarker HLA-A on MASLD, we regarded it as an important clinical biomarker. Therefore, we further investigated its prognostic significance using data from the NHANES study. A total of 41,474 individuals were initially identified from the NHANES database. After excluding participants with missing data for serum total protein levels and mortality, 3,540 individuals were included in the final analysis. Kaplan–Meier survival curves are shown in Figure 6. Individuals with total protein levels < 60 g/L had a significantly higher risk of all-cause mortality than those with levels ≥ 60 g/L (HR = 2.180; 95% CI: 1.188–4.002) in the age- and sex-adjusted model. The HR was 2.769 (95% CI: 1.441–5.319) in regression model 1, which was adjusted for age, sex, marital status, hypertension, diabetes, and body mass index, and 2.495 (95% CI: 1.224–5.087) in regression model 2, which was additionally adjusted for serum GGT, ALT, total cholesterol, triglycerides, and platelet count (Supplementary Table 5).
Discussion
In this study, we identified several molecular and clinical biomarkers causally associated with MASLD and explored their diagnostic significance for MASLD as well as their prognostic relevance for mortality outcomes. Notably, total serum protein levels were found to partially mediate the effect of HLA-A on the risk of MASLD, revealing a novel immuno-metabolic causal pathway. To translate these findings into a practical clinical tool, we developed a non-invasive diagnostic model based on the six MR-identified proteins using multiple machine learning algorithms. Among these, the RF model demonstrated excellent performance (AUC = 0.941 in the training set and 0.875 in the validation set), underscoring its potential utility in early MASLD screening and diagnosis. Additionally, we demonstrated that higher expression levels of CNPY4 and ENTPD6 were associated with poorer overall survival in HCC, while lower serum total protein levels were linked to increased all-cause mortality in the general population. These findings suggest that certain MASLD-related biomarkers may have prognostic relevance in broader clinical settings.
In our study, six proteins (molecular biomarkers) were identified as having a significant causal relationship with MASLD. APOE is an essential component of chylomicrons, and studies have shown that polymorphisms in the APOE gene are closely related to MASLD development.22 Amzolini and colleagues reported that the frequency of HLA-A25 in patients with MASLD was significantly lower than in healthy controls.23 Shin et al. found that deletion of the TOR1AIP1 gene induced MASLD development.24 Currently, few reports exist on the associations of CNPY4, ENTPD6, and SCG3 with MASLD, and these proteins may represent key targets for future research. CNPY4 is localized in the endoplasmic reticulum and assists in protein maturation and lipid synthesis. Hotta et al. found that the rs3764220 variant in the SCG3 gene was associated with metabolic syndrome.25 ENTPD6 belongs to the ectonucleoside triphosphate diphosphohydrolase family; its encoded proteins hydrolyze nucleoside triphosphates and diphosphates, playing a crucial role in cell signaling and energy metabolism.26 A recent MR study identified potential targets for abdominal obesity and found that ENTPD6 may be a novel biomarker for interventions targeting visceral adipose tissue.27
Among the 35 blood and urine biomarkers, eight were identified as causally associated with MASLD. A meta-analysis of 12 studies revealed that circulating IGF-1 levels were significantly lower in individuals with MASLD than in healthy controls.28 GGT in human serum primarily originates from the liver and biliary system. Serum GGT level serves not only as a conventional liver function marker but also plays a broader role in metabolic health.29 Additionally, serum GGT is included in the fatty liver index equation used to identify hepatic steatosis.30 Furthermore, a previous study reported that urinary sodium concentration was closely associated with MASLD (OR = 2.48, 95% CI: 1.52–4.06).31
HLA-A is a member of the major histocompatibility complex class I family and plays a central role in antigen presentation and immune surveillance. Genetic variations in HLA-A have previously been associated with metabolic and inflammatory diseases.32–34 This study revealed that total protein functions as a partial mediator in the causal pathway from HLA-A to MASLD, accounting for 23.61% of the total effect. Our mediation analysis indicates that downregulation of HLA-A may lead to increased total protein levels, which in turn could elevate MASLD risk. The liver is enriched with CD4+ T helper cells, CD8+ cytotoxic T cells, and B lymphocytes, all contributing to persistent inflammation and tissue remodeling.35,36 Mechanistically, HLA-A, a core component of MHC class I molecules, is responsible for presenting endogenous antigens to CD8+ T cells, thereby maintaining immune surveillance and homeostasis. When HLA-A expression is reduced, impaired antigen presentation leads to dysfunctional CD8+ T-cell activation and inefficient clearance of endogenous antigens derived from apoptotic cells or lipotoxic stress.37,38 The persistence of these antigens may induce chronic antigenic stimulation, activate B cells, and promote polyclonal immunoglobulin production, resulting in elevated serum globulin and thus increased total protein.39 This rise in total protein is indicative of immune activation, which plays a crucial role in MASLD pathogenesis and progression.40 These findings suggest that HLA-A downregulation may indirectly contribute to MASLD by promoting immune activation.
In this study, six ML algorithms were used to construct a noninvasive MASLD diagnostic model. The application of ML in MASLD has made remarkable progress in recent years, positioning it as a pivotal tool for precise diagnosis and treatment.41 ML broadly includes supervised, unsupervised, semi-supervised, and reinforcement learning based on training methods.42 In our analysis, we used supervised ML and found that RF had the best discriminative power, outperforming other algorithms. RF is an ensemble learning algorithm based on decision trees, which significantly improves model accuracy by integrating predictions from multiple decision trees.43 Compared with other algorithms, RF has better tolerance for noise and outliers because its prediction results are based on synthesizing multiple decision trees.44 Furthermore, RF demonstrates resistance to overfitting and strong generalization abilities through its randomized construction approach.43
Previous studies have shown that patients with MASLD are at increased risk of developing HCC.45 In our study, we found that CNPY4 and ENTPD6 not only contribute to the pathogenesis of MASLD but may also play a role in HCC development, further supporting their significance and utility as molecular biomarkers related to liver disease. Moreover, both CNPY4 and ENTPD6 were associated with poor prognosis in HCC, suggesting their potential as prognostic biomarkers in liver diseases. CNPY4 is closely associated with immune regulation and exhibits carcinogenic effects across various tumors.46 ENTPD6 is involved in extracellular purine metabolism and may contribute to tumor metabolic reprogramming by modulating mitochondrial function. Previous studies have consistently demonstrated that the rs738409 variant in the PNPLA3 gene, which encodes patatin-like phospholipase domain-containing protein 3, significantly increases the risk of both MASH and MASLD-related HCC.47,48 This variant impairs triglyceride hydrolysis in hepatocytes, promoting lipid accumulation and resulting in more than a twofold increase in MASH risk (compared to healthy controls) and a 2.2-fold increased risk of MASLD-related HCC (compared to MASLD patients without the variant).47,48 However, as a static genetic susceptibility variant, PNPLA3 is primarily useful for long-term risk prediction and does not reflect dynamic biological changes during disease progression. In contrast, CNPY4 and ENTPD6 are expression-based protein biomarkers that can be quantitatively monitored and may offer greater potential for clinical translation, particularly in dynamic risk stratification, progression surveillance, and therapeutic response assessment in MASLD-HCC.
Additionally, based on prospective data from the NHANES study, we found that individuals with low levels (below 60 g/L) of serum total protein, the marker identified as a potential mediator of the effect of HLA-A on MASLD, exhibited an increased risk of all-cause mortality. This may be because hypoproteinemia often reflects inadequate protein intake or synthesis, leading to malnutrition, impaired organ function, delayed tissue repair, and reduced resilience to illness. Hypoproteinemia can also result in edema, ascites, hypovolemia, and tissue hypoperfusion, potentially triggering complications such as renal dysfunction and hypotension, all of which may contribute to increased all-cause mortality.49
Although this study employed the MR method, it has several limitations. First, the GWAS data lack extensive validation across different ethnic groups. Therefore, future studies must validate these findings in diverse populations to confirm the causal relationships. Second, while we identified causal proteins and biomarkers associated with MASLD, the specific biological mechanisms underlying these findings remain unexplored. Third, MASLD is a metabolically driven liver disease with complex and multifactorial etiologies. Other potential mediators, particularly those related to systemic inflammation or immune-metabolic interactions, may exist but were not included in the current analysis due to limitations in the available datasets. Future studies incorporating more comprehensive inflammatory markers, multi-omics data, and immune phenotyping are warranted to refine and expand the mediation pathway models. Fourth, the hospital-based cohort used in this study was derived from a tertiary medical center and had a relatively limited sample size, which may introduce selection bias. Compared with the general MASLD population, patients treated at tertiary centers are more likely to have complex disease presentations and a higher burden of metabolic comorbidities. Such differences in population characteristics may affect the generalizability of our findings.
Supporting information
Supplementary File 1
Supplementary Method 1.
(DOCX)
Supplementary File 2
Supplementary Method 2.
(DOCX)
Supplementary File 3
Supplementary Method 3.
(DOCX)
Supplementary File 4
Supplementary Method 4.
(DOCX)
Supplementary File 5
Supplementary Method 5.
(DOCX)
Supplementary Table 1
Causal relationships between the six proteins (exposure) and MASLD (outcome).
(DOCX)
Supplementary Table 2
Heterogeneity and pleiotropy analysis between the six proteins (exposure) and MASLD (outcome).
(DOCX)
Supplementary Table 3
Baseline characteristics of the validation cohort used for serum total protein–related analysis.
(DOCX)
Supplementary Table 4
Adjusted odds ratios of the serum total protein levels for the risk of MASLD in the validation cohort.
(DOCX)
Supplementary Table 5
Adjusted hazard ratios of the serum total protein levels for the risk of all-cause and cause-specific mortality among adult individuals in the United States from the NHANES 1999-2006 database.
(DOCX)
Supplementary Fig. 1
Causal relationship between APOE (exposure) and MASLD (outcome).
(DOCX)
Supplementary Fig. 2
Causal relationship between CNPY4 (exposure) and MASLD (outcome).
(DOCX)
Supplementary Fig. 3
Causal relationship between ENTPD6 (exposure) and MASLD (outcome).
(DOCX)
Supplementary Fig. 4
Causal relationship between HLA-A (exposure) and MASLD (outcome).
(DOCX)
Supplementary Fig. 5
Causal relationship between SCG3 (exposure) and MASLD (outcome).
(DOCX)
Supplementary Fig. 6
Scatterplots of the MR analysis between Apolipoprotein A (exposure), GGT (exposure), and MASLD (outcome).
(DOCX)
Supplementary Fig. 7
Scatterplots of the MR analysis between albumin (exposure), HDL-C (exposure), and MASLD (outcome).
(DOCX)
Supplementary Fig. 8
Scatterplots of the MR analysis between IGF-1 (exposure), urinary sodium (exposure) and MASLD (outcome).
(DOCX)
Supplementary Fig. 9
Scatterplots of the MR analysis between triglycerides (exposure) and MASLD (outcome).
(DOCX)
Supplementary Fig. 10
Causal relationship between HLA-A (exposure) and serum total protein levels (outcome).
(DOCX)
Supplementary Fig. 11
Bootstrap ROC curves for the random forest model in the training and validation sets.
(DOCX)
Supplementary Fig. 12
Differential expression analysis of CNPY4 in hepatocellular carcinoma and other cancer types.
(DOCX)
Supplementary Fig. 13
Association between CNPY4 expression and overall survival in hepatocellular carcinoma and other cancer types.
(DOCX)
Supplementary Fig. 14
Correlation between CNPY4 expression and immune cell infiltration in hepatocellular carcinoma and other cancers.
(DOCX)
Supplementary Fig. 15
Differential expression analysis of ENTPD6 in hepatocellular carcinoma and other cancer types.
(DOCX)
Supplementary Fig. 16
Association between ENTPD6 expression and overall survival in hepatocellular carcinoma and other cancer types.
(DOCX)
Supplementary Fig. 17
Correlation between ENTPD6 expression and immune cell infiltration in hepatocellular carcinoma and other cancers.
(DOCX)