$$\rightleftharpoonup{xx}$$
$$\longleftharp{xx}$$,
$$\longrightharp{xx}$$,
Identification of candidate shared genes between major depressive disorder and dermatomyositis
Following data merging, normalization, and batch correction of dermatomyositis-related GEO datasets (Figure 2A,B), a total of 570 differentially expressed genes were identified (Figure 2C,D), comprising 517 up-regulated and 53 down-regulated genes. This dermatomyositis differential expression analysis was used to characterize disease-related transcriptional changes.
In parallel, WGCNA was applied separately to the major depressive disorder and dermatomyositis datasets to identify disease-associated co-expression modules. The optimal soft thresholding power was 4 for the major depressive disorder dataset and 9 for the dermatomyositis dataset (Figure 3A,B). Five modules were detected in each dataset. Module–trait correlation analysis showed that the grey module was most strongly associated with major depressive disorder (Cor = −0.28, p = 1 × 10-4), while the brown (Cor = −0.73, p = 5 × 10-18) and red (Cor = −0.89, p = 1 × 10-35) modules were most strongly associated with dermatomyositis (Figure 3C,D). The overlap between the selected key module genes of the two datasets was defined as the candidate shared gene set, yielding 33 shared genes (Figure 3E). Notably, the module–trait association in major depressive disorder was markedly weaker than that observed in dermatomyositis. Specifically, the major depressive disorder-associated grey module showed only a modest correlation with disease status, whereas the dermatomyositis-associated brown and red modules exhibited substantially stronger correlations. Therefore, the overlapping genes identified from these modules should be interpreted as candidate shared transcriptomic signals.

Figure 2: Data processing and cleaning for dermatomyositis. (A) Box plot before batch effect removal and data merging. (B) Box plot after batch effect removal and data merging. (C) Clustered heatmap after data cleaning. (D) Volcano plot after data cleaning. Please click here to view a larger version of this figure.

Figure 3: Results of Weighted Gene Co-expression Network Analysis (WGCNA). (A) Determination of Soft Threshold power for major depressive disorder. (B) Determination of Soft Threshold power for dermatomyositis. (C) Module correlation heatmap of major depressive disorder. (D) Module correlation heatmap of dermatomyositis. (E) Venn diagram showing the overlap between genes from the key WGCNA module(s) of dermatomyositis and the key WGCNA module of major depressive disorder, yielding 33 shared genes. Please click here to view a larger version of this figure.
Functional enrichment and geneMANIA-Based network analysis of shared genes
GO and KEGG enrichment analyses were performed on the 33 shared genes to characterize their biological functions and pathway associations. A total of 297 significantly enriched GO terms and 7 KEGG pathways were identified. GO analysis indicated that these genes were mainly enriched in immune defense- and cytotoxicity-related biological processes. They were localized predominantly to granules and vesicles that store immune effector molecules, and were enriched for molecular functions such as serine hydrolase activity and calcium ion transport (Figure 4A,B). KEGG pathway analysis highlighted significant enrichment in the PPAR signaling pathway, antigen processing and presentation, and the IL-17 signaling pathway (Figure 4C). To further explore the functional interaction context of these shared genes, the 33 observed shared genes were used as seed/query genes in GeneMANIA. GeneMANIA then automatically added functionally related partner genes to generate an expanded network containing 49 total nodes. Therefore, the 49-node network shown in Figure 4D does not represent 49 directly observed cross-disease shared genes, but rather an expanded functional association network composed of the 33 shared genes plus GeneMANIA-added related partners. Within this expanded network, ELANE, PPBP, and CTSG emerged as highly connected candidate nodes (Figure 4D).

Figure 4: Functional enrichment and GeneMANIA-based functional association network analysis of the shared genes. (A) Bar plot of Gene Ontology (GO) enrichment results. (B) Bubble plot of GO enrichment results. (C) Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway enrichment results. (D) Functional association network generated using GeneMANIA from the 33 shared genes and exported to a network visualization platform for visualization and topological analysis. Please click here to view a larger version of this figure.
Machine-learning based feature prioritization and model interpretation
Machine-learning analysis was performed on the 33 shared genes. Among the 113 candidate model combinations evaluated in the present benchmarking framework, the RF+Enet model showed the highest relative performance and was therefore selected for subsequent feature prioritization and model interpretation (Figure 5A). This model retained 8 machine-learning-selected genes (including KIF4A and OLR1, among others). ROC curves and confusion matrices were used to summarize classification performance in the training and validation cohorts (Figure 5B,C). Detailed information on all evaluated model genes is provided in Supplementary File S1.

Figure 5: Machine-learning-based classification performance of candidate model combinations and the selected model. (A) Heatmap summarizing the area under the curve (AUC) values of 113 candidate machine-learning model combinations across the training cohort and independent validation cohorts, together with the mean AUC used for model comparison. (B) Confusion matrices of the selected Random Forest plus Elastic Net [RF+Enet(alpha = 0.7)] model in the training cohort and independent validation cohorts. (C) Receiver operating characteristic (ROC) curves of the selected RF+Enet(alpha = 0.7) model in the training cohort and independent validation cohorts. Please click here to view a larger version of this figure.
These 8 genes were subsequently carried forward as candidate model-selected genes for downstream interpretation (Figure 6A). To improve transparency of the selected classifier, SHAP (SHapley Additive exPlanations) analysis was used as a post hoc interpretability approach to quantify the relative contribution of retained features to model predictions. Three supervised machine-learning algorithms, including Random Forest (RF), XGBoost (XGB), and Gradient Boosting Machine (GBM), were compared for interpretability analysis (Figure 6B). Among them, RF showed relatively higher AUC, consistent with the benchmarking results, and was therefore used for SHAP visualization. The SHAP bar plot (Figure 6C) indicated that KIR2DL4 had the largest mean absolute SHAP value within the RF model, whereas AZU1 and SCG5 also showed relatively higher contributions, and KIF4A showed a comparatively smaller contribution. The beeswarm and scatter plots (Figure 6D,E) further illustrated how feature values were associated with the direction and magnitude of the model output at the sample level. These analyses were used to explain model behavior and support feature prioritization within the classifier.

Figure 6: Receiver operating characteristic (ROC) and SHapley Additive exPlanations (SHAP)-based model interpretability analysis. (A) ROC curves of the candidate model-selected genes. (B) Comparison of three supervised machine-learning algorithms, including Random Forest (RF), XGBoost (XGB), and gradient boosting machine (GBM), used for model interpretability analysis. (C) SHAP bar plot and summary plot showing the relative contribution of each feature to the RF model. (D) SHAP beeswarm plot showing the distribution of feature contributions across samples. (E) Scatter plots showing the relationship between gene expression levels and SHAP values. Please click here to view a larger version of this figure.
Taken together, these analyses prioritized several candidate model-selected genes within the present dermatomyositis-related classification framework. Because the machine-learning model was trained and evaluated only in dermatomyositis-versus-control cohorts, these findings should not be interpreted as direct validation of biomarkers for comorbidity between major depressive disorder and dermatomyositis, but rather as supportive evidence for feature prioritization in a dermatomyositis-centered classification setting.
GSEA enrichment and immune infiltration
Among the positively enriched pathways (enriched in the high-expression group), immune- and inflammation-related pathways were consistently prominent across multiple genes. The cytokine–cytokine receptor interaction pathway was significantly enriched for KIR2DL4 (NES = 1.41, adjusted P = 1.93 × 10⁻9), KIF4A (NES = 1.29, adjusted P = 3.12 × 10⁻5), OLR1 (NES = 1.26, adjusted P = 1.34 × 10⁻5), and SCG5 (NES = 1.16, adjusted P = 5.09 × 10⁻3). The natural killer cell-mediated cytotoxicity pathway was enriched for KIR2DL4 (NES = 1.42, adjusted P = 1.04 × 10⁻5), LRRC37BP1 (NES = 1.22, adjusted P = 7.37 × 10⁻4), and KIF4A (NES = 1.24, adjusted P = 1.55 × 10⁻2). The antigen processing and presentation pathway was enriched for KIR2DL4 (NES = 1.48, adjusted P = 5.42 × 10⁻5) and KIF4A (NES = 1.30, adjusted P = 9.99 × 10⁻3). The JAK-STAT signaling pathway was enriched for KIR2DL4 (NES = 1.34, adjusted P = 2.39 × 10⁻4), OLR1 (NES = 1.22, adjusted P = 3.83 × 10⁻3), and KIF4A (NES = 1.25, adjusted P = 8.59 × 10⁻3). The Toll-like receptor signaling pathway was enriched for KIR2DL4 (NES = 1.35, adjusted P = 1.45 × 10⁻3), OLR1 (NES = 1.22, adjusted P = 1.50 × 10⁻2), and KIF4A (NES = 1.24, adjusted P = 3.00 × 10⁻2). The chemokine signaling pathway was enriched for KIR2DL4 (NES = 1.35, adjusted P = 3.17 × 10⁻5) and KIF4A (NES = 1.19, adjusted P = 1.50 × 10⁻2). The complement and coagulation cascades pathway was enriched for KIR2DL4 (NES = 1.40, adjusted P = 1.39 × 10⁻3) and OLR1 (NES = 1.24, adjusted P = 3.10 × 10⁻2). The NOD-like receptor signaling pathway was enriched for KIR2DL4 (NES = 1.40, adjusted P = 4.23 × 10⁻3) and KIF4A (NES = 1.35, adjusted P = 6.27 × 10⁻3).
In contrast, KRT23 showed predominantly negative enrichment for immune-related pathways, including the Toll-like receptor signaling pathway (NES = −2.12, adjusted P = 7.31 × 10⁻8), RIG-I-like receptor signaling pathway (NES = -2.03, adjusted P = 2.81 × 10⁻6), and antigen processing and presentation (NES = -2.02, adjusted P = 2.13 × 10⁻6), indicating an inverse association with immune activation.
Among the negatively enriched metabolic pathways, valine, leucine, and isoleucine degradation was enriched for KIR2DL4 (NES = -2.79, adjusted P = 6.33 × 10⁻9) and SCG5 (NES = -2.35, adjusted P = 3.37 × 10⁻6), and fatty acid metabolism was enriched for KIR2DL4 (NES = -2.42, adjusted P = 6.89 × 10⁻6) and SCG5 (NES = -2.12, adjusted P = 4.17 × 10⁻5).
Overall, these GSEA results indicated that the majority of the candidate model-selected genes were associated with coordinated upregulation of immune and inflammatory signaling pathways and concurrent downregulation of metabolic pathways in dermatomyositis, consistent with the functional enrichment and immune infiltration findings described above (Figure 7).

Figure 7: Results of GSEA for key genes. Please click here to view a larger version of this figure.
To further examine the immune context of the identified candidate shared genes, a CIBERSORT-based immune deconvolution analysis was performed. The inferred immune-cell composition of each sample is shown in Figure 8A. Compared with the control group, significant differences were observed in the relative proportions of regulatory T cells (Tregs), resting mast cells, resting dendritic cells, M1 macrophages, and M2 macrophages (Figure 8B). Correlation analysis among immune-cell subsets showed a negative correlation between memory B cells and naive B cells (r = -0.64), whereas memory B cells and plasma cells showed a positive correlation (r = 0.37) (Figure 8C). In addition, LRRC37E, SCG5, AZU1, and KRT23 were positively correlated with neutrophils, eosinophils, and resting mast cells, whereas KIF4A and OLR1 were negatively correlated with naive B cells, memory B cells, and plasma cells (Figure 8D). These findings indicate associations between selected genes and inferred immune-cell distribution patterns in the current deconvolution analysis.

Figure 8: Analysis of immune cell infiltration for key genes. (A) Overlapping histogram showing the proportion of immune cells in each sample. (B) Bar chart comparing 22 immune cells between the experimental group and control group. (C) Heatmap showing the correlation between candidate shared genes of the model and infiltration of 22 immune cells. (D) Positive and negative regulatory relationships between genes and immune cells. *p < 0.05; **p < 0.01; ****p < 0.0001; ns, not significant. Please click here to view a larger version of this figure.
Results of Single-Cell Analysis
To further localize the candidate shared genes at the cellular level, a dermatomyositis-related single-cell RNA-seq dataset (GSE190510) was analyzed as a cellular contextualization step rather than a direct validation dataset for comorbidity. UMAP clustering identified seven distinct cell clusters (Figure 9A), which were subsequently annotated into major immune-cell subsets (Figure 9B). A heatmap of cluster-associated marker genes further illustrated cell-type-related expression patterns and hierarchical relationships across clusters (Figure 9C). The expression distribution of those candidate genes detectable in the single-cell dataset (KIF4A, KIR2DL4, KRT23, AZU1, and SCG5) across annotated cell populations is shown in Figure 9D; the remaining candidates (OLR1, KIR3DS1, and LRRC37E) did not show detectable expression in this dataset. The AUCell, ssGSEA, AddModuleScore, and integrated composite scoring results across major cell subsets are presented in Figure 9E. Because the bulk-level analyses highlighted immune defense- and cytotoxicity-related programs, and because KIR2DL4 and AZU1 showed detectable expression within the CD8⁺ T-cell compartment, this subset was selected for downstream exploratory analyses (Figure 9F). Pseudotime analysis showed the distribution of high-score and low-score CD8⁺ T-cell states along the inferred differentiation trajectory, together with the overall pseudotime progression pattern (Figure 9G). Cell-cell communication analysis revealed that both groups maintained extensive interactions with monocytes/macrophages, platelets, and other immune-cell subsets (Figure 9H). Outgoing and incoming signaling-pattern analysis showed distinct communication preferences between the two composite-score-defined groups (Figure 9I). Pathway-level communication analysis further suggested differential signaling activities involving pathways such as ANNEXIN, IL16, MIF, and ncWNT among the indicated sender-receiver cell groups (Figure 9J). At the ligand-receptor level, interactions such as LGALS9-CD44, MIF-(CD74+CXCR4), and MIF-(CD74+CD44) were identified among CD8⁺ T-cell score groups and other immune-cell subsets (Figure 9K).
These findings should be interpreted as hypothesis-generating cellular context within dermatomyositis. They do not constitute direct evidence for mechanisms of comorbidity between major depressive disorder and dermatomyositis, but rather indicate potential immune-cell states and signaling interactions in which candidate shared genes may participate in a dermatomyositis-related immune background.

Figure 9: Single-cell RNA-seq analysis and downstream functional contextualization of candidate shared genes. (A) Uniform manifold approximation and projection (UMAP) plot showing unsupervised clustering of cells. (B) UMAP plot showing annotated immune-cell subsets. (C) Heatmap of cluster marker genes across annotated cell subsets. (D) UMAP feature plots showing the expression distribution of representative composite-score across clusters. (E) Dot plot showing AUCell, ssGSEA, AddModuleScore, and integrated composite scoring across annotated cell subsets. (F) Heatmap showing the expression patterns of KIR2DL4 and AZU1 in the CD8⁺ T-cell subset. (G) Pseudotime trajectory plots showing candidate-gene score grouping and pseudotime progression in the CD8⁺ T-cell subset. (H) Circle plot showing inferred intercellular communication among candidate-gene score groups and immune-cell subsets. (I) Heatmaps showing outgoing and incoming signaling patterns. (J) Bubble plot showing selected signaling pathways among the indicated sender–receiver cell groups. (K) Bubble plot showing selected ligand–receptor interactions among the indicated cell groups. Please click here to view a larger version of this figure.
DATA AVAILABILITY :
All datasets analyzed in this study were obtained from the Gene Expression Omnibus (GEO) database under accession numbers GSE5370, GSE1551, GSE46239, GSE11971, GSE39454, GSE128470, and GSE190510. The bulk transcriptomic datasets used for major depressive disorder and dermatomyositis analyses, as well as the dermatomyositis-related single-cell dataset, are publicly available through GEO. The analysis code, including the machine-learning scripts used in this study, has been made available through a public code repository [https://github.com/tengfeitcm/machine-learning.git]. Selected processed results generated during the present study are provided as Supplementary File 1.
Supplementary File 1: Details of all 113 evaluated machine-learning model combinations, including first- and second-stage algorithms, selected genes, AUCs in the training and independent validation cohorts, and mean AUCs for model comparison and ranking.Please click here to download this file.