We developed a machine learning–derived six-gene tamoxifen resistance signature that stratifies breast cancer patients by survival risk and may support personalized prognostic assessment and therapeutic decision-making in clinical practice.
Research Article
We developed a machine learning–derived six-gene tamoxifen resistance signature that stratifies breast cancer patients by survival risk and may support personalized prognostic assessment and therapeutic decision-making in clinical practice.
Tamoxifen is a key endocrine therapy for estrogen receptor-positive (ER+) breast cancer, but acquired resistance limits long-term efficacy. The molecular mechanisms remain complex, and predictive biomarkers are lacking. Gene expression data related to tamoxifen resistance were obtained from GEO (GSE67916), and differentially expressed genes (DEGs) were identified using the limma algorithm. Functional enrichment analyses (GO and KEGG) revealed involvement in immune processes, antiviral responses, endocytosis, lysosome pathways, and estrogen signaling. Three machine learning algorithms (LASSO, SVM-RFE, and RF) identified six hub genes (CAMK1D, CHAC1, KIAA0513, MED13, NDRG1, STXBP5). A prognostic risk model based on these genes was constructed using TCGA-BRCA data, effectively stratifying patients into high- and low-risk groups with significantly different overall survival. The model demonstrated good predictive accuracy (AUC = 0.70) and stable performance in time-dependent ROC analyses, validated in an independent cohort. This study provides a robust tamoxifen resistance–related gene signature and a multigene prognostic model, offering novel insights into resistance mechanisms and potential guidance for individualized prognosis and therapy in ER+ breast cancer.
Breast cancer is the most commonly diagnosed cancer among women globally and remains a major contributor to cancer-associated deaths1. Based on molecular profiling, approximately 60-70% of patients present with the estrogen receptor-positive (ER+) subtype, which generally responds well to initial endocrine therapy. Among various endocrine agents, tamoxifen—a selective estrogen receptor modulator (SERM)—has long been the standard of care for adjuvant and metastatic treatment of ER+ breast cancer, significantly improving survival outcomes in large-scale randomized trials2.
Despite its established clinical efficacy, a substantial proportion of patients eventually develop acquired resistance, leading to disease recurrence and progression3. Consequently, tamoxifen resistance represents a critical bottleneck in the long-term management of breast cancer. Previous studies indicate that the mechanisms driving resistance are highly heterogeneous, involving aberrant ER signaling, activation of alternative growth factor pathways (e.g., PI3K/AKT, MAPK), dysregulation of apoptosis, and metabolic reprogramming4.
Recently, the tumor microenvironment (TME), particularly the immune microenvironment, has garnered increasing attention for its role in therapeutic response and resistance. Immune cell infiltration and inflammatory signaling are closely linked to breast cancer progression and sensitivity to treatment. Emerging evidence suggests that immune-related processes may modulate cellular stress responses and promote immune evasion, thereby contributing to endocrine resistance5. However, the specific immune landscape associated with tamoxifen resistance remains to be fully elucidated.
With the advent of high-throughput sequencing and bioinformatics, analyzing transcriptomic data offers a powerful approach to decoding the molecular basis of drug resistance. Compared to single-gene biomarkers, multi-gene signatures capture tumor heterogeneity more effectively, offering superior stability and accuracy in prognosis. Machine learning algorithms, such as LASSO regression, Support Vector Machines (SVM), and Random Forest (RF), have become essential tools for identifying robust biomarkers and constructing predictive models in precision medicine6. Although several gene signatures have been proposed7,8,9, robust validation is often lacking.
While previous studies have explored tamoxifen resistance genes, few have systematically integrated multiple machine learning strategies to build a robust prognostic model validated in independent cohorts. In this study, we integrated transcriptomic data from GEO and TCGA to screen for tamoxifen resistance-associated genes. By applying an ensemble of machine learning algorithms, we identified core resistance genes and constructed a prognostic risk model. We further evaluated the model's association with the immune microenvironment and validated its predictive value in an independent cohort, aiming to provide new theoretical evidence for personalized treatment strategies.
All data used in this study were obtained from publicly accessible databases (TCGA, GEO, and METABRIC). No human participants or animals were involved; therefore, institutional review board approval and informed consent were not required.
Data acquisition and preprocessing
Gene expression data related to tamoxifen resistance were retrieved from the Gene Expression Omnibus (GEO) database10. The GSE67916 dataset (Affymetrix Human Genome U133 Plus 2.0 Array) includes 18 breast cancer cell samples: 8 untreated control samples and 10 tamoxifen-resistant samples generated through long-term drug exposure. RNA-sequencing data and corresponding clinical follow-up information for the Breast Invasive Carcinoma cohort (TCGA-BRCA) were downloaded from The Cancer Genome Atlas (TCGA)11.
Raw microarray CEL files were processed in R (version 4.4.2) using the affy package. Background correction and normalization were performed with the Robust Multi-array Average (RMA) algorithm, including log2 transformation and quantile normalization. Probe IDs were mapped to gene symbols using platform annotation files; for genes with multiple probes, the average expression value was used. For TCGA RNA-seq data, transcripts per million (TPM) values were log2-transformed [log2(TPM + 1)]. Samples with incomplete survival information or missing clinical variables were excluded.
Identification of differentially expressed genes
Differential expression between tamoxifen-resistant and control samples was assessed using the limma R package12 with empirical Bayes moderation. Genes with |log2 fold change| > 1 and an adjusted P value < 0.05 (Benjamini–Hochberg FDR) were defined as differentially expressed genes (DEGs).
Functional enrichment analysis
Gene Ontology (GO)13and Kyoto Encyclopedia of Genes and Genomes (KEGG)14 analyses were performed using the clusterProfiler15 R package. GO categories included biological process (BP), cellular component (CC), and molecular function (MF). Adjusted P values < 0.05 were considered significant.
Machine learning–based feature selection
Three machine learning algorithms were applied to identify hub genes: (1) LASSO regression16 (glmnet package) with 10-fold cross-validation to select the optimal penalty parameter (lambda.min); (2) Support vector machine–recursive feature elimination (SVM-RFE)17 (e1071 package) with fivefold cross-validation to identify the minimal gene subset with lowest classification error; (3) Random forest (RF)18 (randomForest package) with 500 trees (ntree = 500); genes were ranked by MeanDecreaseGini. Genes identified by all three methods were defined as hub genes.
Construction of the prognostic risk model
A multigene prognostic risk model was constructed using TCGA-BRCA gene expression and survival data. Survival-associated genes were screened using univariate Cox regression, followed by multivariate Cox regression to develop the final signature. The risk score formula was calculated as: Risk score = (0.01297 × CAMK1D) + (0.03021 × CHAC1) + (0.02018 × KIAA0513) + (0.00647 × MED13) + (0.00108 × NDRG1) + (0.04551 × STXBP5). Patients were stratified into high- and low-risk groups based on the median risk score.
Evaluation and validation of the prognostic model
Overall survival differences between groups were assessed with Kaplan–Meier analysis and the log-rank test. Predictive performance was evaluated using ROC curves (pROC package) and time-dependent ROC analysis (timeROC package). A nomogram integrating risk scores and clinical variables was constructed using the rms package. Calibration curves assessed agreement between predicted and observed survival probabilities. External validation was conducted in the independent Molecular Taxonomy of Breast Cancer International Consortium (METABRIC) cohort19 using the same formula and cutoff.
Immune infiltration analysis
Immune cell infiltration was estimated using CIBERSORT20 with 1,000 permutations based on TCGA data. Samples with P < 0.05 were included. Differences in immune cell composition between high- and low-risk groups were assessed with the Wilcoxon rank-sum test, and correlations between hub gene expression and immune cell abundance were evaluated using Spearman's rank correlation.
Statistical analysis
All analyses were performed in R. Continuous variables were compared using the Wilcoxon rank-sum test, and categorical variables using the chi-square test. Two-sided P < 0.05 was considered statistically significant.
Identification of differentially expressed genes associated with tamoxifen resistance
A total of 854 differentially expressed genes (DEGs) were identified between tamoxifen-resistant and control samples, including 556 upregulated and 298 downregulated genes. The distribution of DEGs showed a predominance of upregulated genes in resistant samples, suggesting extensive transcriptional activation associated with tamoxifen resistance (Figure 1).
Functional enrichment analysis of DEGs
GO enrichment analysis revealed that DEGs were primarily involved in immune- and defense-related biological processes, including defense response to viruses, type I interferon signaling pathway, and cellular response to interferon. Cellular component analysis demonstrated significant enrichment in membrane- and vesicle-related structures, such as endocytic vesicle membrane, phagocytic vesicle, and early endosome. Molecular function analysis highlighted enrichment in binding and transporter-related activities.
KEGG pathway analysis further showed that DEGs were significantly enriched in immune-related pathways, including viral carcinogenesis, Epstein–Barr virus infection, and intracellular degradation pathways such as the phagosome and lysosome. Additionally, cancer- and hormone-related pathways, including the estrogen signaling pathway and EGFR tyrosine kinase inhibitor resistance, were also enriched (Figure 2). The complete list of significantly enriched GO terms and KEGG pathways, including gene counts, P-values, FDR-adjusted P-values, and enrichment scores, is provided in Supplementary Files 1 and 2.
Identification of hub genes using machine learning algorithms
To identify robust hub genes associated with tamoxifen resistance, three machine learning algorithms were applied to the DEGs. LASSO regression (10-fold cross-validation) selected 9 genes, SVM-RFE (fivefold cross-validation) identified 11 genes, and Random Forest (ntree = 500) ranked the top 24 genes based on Mean Decrease Gini. The intersection of the three algorithms yielded six consensus hub genes: CAMK1D, CHAC1, KIAA0513, MED13, NDRG1, and STXBP5 (Supplementary Figure 1).
Construction and evaluation of the prognostic risk model
Univariate and multivariate Cox regression analyses were performed in the TCGA-BRCA cohort. All six hub genes were significantly associated with overall survival in the univariate analysis (Supplementary Figure 2A). Multivariate Cox regression further confirmed these genes as independent prognostic factors (Supplementary Figure 2B). The risk score was calculated using the following formula based on the multivariate regression coefficients: Risk score = (0.01297 × CAMK1D) + (0.03021 × CHAC1) + (0.02018 × KIAA0513) + (0.00647 × MED13) + (0.00108 × NDRG1) + (0.04551 × STXBP5).
Patients were divided into high- and low-risk groups based on the median risk score. Kaplan–Meier survival analysis revealed that high-risk patients had significantly poorer overall survival than low-risk patients (log-rank P < 0.001). The model demonstrated good predictive accuracy (AUC ≈ 0.70), with stable performance in time-dependent ROC analyses at 1-, 3-, and 5-year time points. A nomogram integrating the risk score with clinical variables (age, stage, T, N, M) was constructed, and calibration curves showed excellent agreement between predicted and observed survival probabilities (Figure 3).
Clinical characteristics of the study cohorts
The baseline clinicopathological characteristics of patients in the TCGA-BRCA training and METABRIC validation cohorts are summarized in Supplementary File 3. No statistically significant differences in major clinical features were observed between the high- and low-risk groups in either cohort (p > 0.05).
Validation in an independent cohort
The robustness of the six-gene signature was validated using the METABRIC dataset. Consistent with the training set, the high-risk group in the validation cohort exhibited significantly worse survival outcomes. Time-dependent ROC analysis confirmed the model's stable predictive capability across different follow-up durations (Figure 4).
Immune infiltration analysis
CIBERSORT analysis revealed distinct immune cell infiltration patterns between the high- and low-risk groups. The relative proportions of 22 immune cell types across breast cancer samples are presented (Figure 5). Comparison of immune cell infiltration levels showed that high-risk patients had significantly higher proportions of M0 and M2 macrophages, while low-risk patients exhibited higher infiltration of CD8⁺ T cells and other anti-tumor immune subsets (Figure 6). Correlation analyses further demonstrated significant associations between the prognostic risk score, hub gene expression, and various immune cell types (Figure 7). Statistical significance was defined as P < 0.05.
Data Availability:
The gene expression dataset used to identify tamoxifen resistance-related genes is publicly available in the Gene Expression Omnibus (GEO) database under accession number GSE67916 (https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE67916). The breast cancer RNA-seq data and corresponding clinical information were obtained from The Cancer Genome Atlas (TCGA-BRCA) cohort and the METABRIC dataset, both of which are publicly accessible through the GDC portal and cBioPortal, respectively. All processed data, including the risk score formula, differential expression results, and supplementary tables, are provided within the Supporting Information files of this article. The R code used for data analysis in this study is available from the corresponding authors upon reasonable request.

Figure 1: Identification of differentially expressed genes associated with tamoxifen resistance. (A) Bar plot showing the number of upregulated and downregulated genes identified between tamoxifen-resistant and control breast cancer samples. A total of 556 genes were upregulated, and 298 were downregulated at the defined thresholds. (B) Volcano plot illustrating the distribution of differentially expressed genes based on log2 fold change and adjusted P value. Red dots represent upregulated genes, blue dots represent downregulated genes, and black dots indicate genes without significant differential expression. The dashed vertical lines indicate the log₂ fold change cutoff, and the dashed horizontal line represents the significance threshold. Please click here to view a larger version of this figure.

Figure 2: Functional enrichment analysis of differentially expressed genes (DEGs) associated with tamoxifen resistance. (A) Gene Ontology (GO) enrichment analysis of DEGs, including biological process (BP), cellular component (CC), and molecular function (MF) categories. The x-axis represents the enrichment score, and the y-axis indicates the enriched GO terms. Different colors denote the three GO categories. (B) Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway enrichment analysis of DEGs. The x-axis shows the enrichment score (−log10 adjusted p-value), while the y-axis shows the significantly enriched pathways. The size of each dot corresponds to the number of genes involved in each pathway, and the color gradient reflects the significance level. Please click here to view a larger version of this figure.

Figure 3: Construction and assessment of the prognostic risk model and nomogram in the TCGA cohort. (A) Kaplan-Meier curves of overall survival (OS) for patients in the high- and low-risk groups in the training set. (B) ROC curve comparison between the risk score and clinical characteristics (Age, Stage, T, N, M). (C) Time-dependent ROC curves for predicting 1-, 3-, and 5-year OS. (D) A prognostic nomogram for predicting 1-, 3-, and 5-year OS by integrating the risk score with clinical features. (E) Calibration plots of the nomogram for 1-, 3-, and 5-year OS predictions. Please click here to view a larger version of this figure.

Figure 4: Validation of the prognostic signature in the METABRIC cohort. (A) Kaplan-Meier curves of overall survival (OS) for the high- and low-risk groups in the validation set. (B) Time-dependent ROC curves for predicting 1-, 3-, and 5-year OS in the validation set. Please click here to view a larger version of this figure.

Figure 5: Relative proportions of 22 immune cell types in breast cancer samples estimated by the CIBERSORT algorithm. Each bar represents one sample, and different colors indicate distinct immune cell populations. Please click here to view a larger version of this figure.

Figure 6: Comparison of immune cell infiltration levels between high-risk and low-risk groups. Significant differences were observed across several immune cell types (Wilcoxon rank-sum test). Please click here to view a larger version of this figure.

Figure 7: Correlation analysis between the prognostic risk model and immune cell infiltration. Correlation heatmap showing the associations between the expression levels of the six hub genes, risk score, and immune cell types. The color intensity indicates the strength of the Spearman correlation, with red indicating a positive correlation and blue a negative one. Statistical significance was defined as p < 0.05. Please click here to view a larger version of this figure.
Supplementary Figure S1. Identification of hub genes using three machine learning algorithms.Please click here to download this file.
Venn diagram illustrating the intersection of candidate genes selected by LASSO regression, SVM-RFE, and Random Forest algorithms. The overlapping region indicates the six consensus hub genes (CAMK1D, CHAC1, KIAA0513, MED13, NDRG1, and STXBP5) consistently identified by all three methods.
Supplementary Figure S2. Univariate and multivariate Cox regression analyses of the six hub genes in the TCGA-BRCA cohort. (A) Forest plot showing the results of univariate Cox regression analysis. All six genes were significantly associated with overall survival (p < 0.05). (B) Forest plot showing the results of multivariate Cox regression analysis. The six genes remained independent prognostic factors. Hazard ratios (HRs), 95% confidence intervals (CIs), and p-values are presented.Please click here to download this file.
Supplementary File 1: Gene Ontology (GO) enrichment analysis results for Molecular Function (GO.MF), Cellular Component (GO.CC), and Biological Process (GO.BP). The file includes enriched GO terms, gene ratios, background ratios, p-values, adjusted p-values, q-values, overlapping gene symbols, enrichment scores, and fold enrichment values. Please click here to download this file.
Supplementary File 2: Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway enrichment analysis results. The file includes enriched pathways, gene ratios, background ratios, p-values, adjusted p-values, q-values, overlapping gene symbols, enrichment scores, and fold enrichment values. Please click here to download this file.
Supplementary File 3. Clinical characteristics and survival data of patients from TCGA and METABRIC cohorts used for prognostic and clinicopathological analyses. The file includes patient IDs, survival status, survival time, age, tumor stage, and TNM classification data where available. Please click here to download this file.
Tamoxifen resistance remains a pivotal obstacle in the treatment of ER+ breast cancer. While mechanisms such as ER mutations are well known, the systemic molecular adaptations to long-term therapy remain less well understood. In this study, we integrated transcriptomics and machine learning to identify a robust six-gene signature (CAMK1D, CHAC1, KIAA0513, MED13, NDRG1, STXBP5) that predicts both tamoxifen resistance and patient prognosis.
Our functional analysis revealed that resistance is accompanied by extensive transcriptomic remodeling, particularly in immune response and intracellular transport pathways (lysosome/phagosome). The enrichment of "viral defense" and "type I interferon" pathways likely reflects a heightened sterile inflammatory state or stress response in resistant cells, rather than actual viral infection21. This aligns with emerging evidence that chronic stress signaling can rewire the tumor microenvironment to support survival under drug pressure.
The six identified genes have biologically plausible links to resistance. NDRG1: NDRG1 is a well-characterized stress-response gene with a significant role in hypoxia adaptation and therapeutic resistance. This gene is predominantly regulated by hypoxia-inducible factors (HIFs), which orchestrate cellular metabolic reprogramming to support survival in low-oxygen tumor microenvironments. In breast cancer, elevated NDRG1 expression is associated with enhanced resistance to chemotherapy and endocrine therapies, likely through activation of the PI3K/AKT signaling axis and epithelial-mesenchymal transition (EMT). The modulation of these pathways in response to microenvironmental stressors highlights NDRG1’s role in mediating therapeutic tolerance and tumor progression. Moreover, its involvement in hypoxia-induced changes suggests that NDRG1 may serve as a potential biomarker for predicting poor prognosis in tumors with limited vasculature22.
CHAC1: As a regulator of glutathione metabolism, CHAC1 plays a pivotal role in oxidative stress adaptation, which is crucial for the development of resistance to tamoxifen. CHAC1 facilitates the degradation of glutathione, the primary antioxidant in cells, thus influencing redox balance and reactive oxygen species (ROS) dynamics. In breast cancer, particularly in the context of tamoxifen treatment, oxidative stress is a significant challenge, and cells with heightened CHAC1 expression may be better equipped to endure ROS-induced damage. This redox plasticity could be a critical factor in tumor survival under therapeutic stress. Therefore, CHAC1 represents a promising target for therapeutic strategies aimed at overcoming resistance to oxidative damage in breast cancer treatments23.
CAMK1D: CAMK1D, a member of the calcium/calmodulin-dependent protein kinase family, has emerged as a key player in both calcium signaling and immune regulation pathways. Intracellular calcium signaling is central to regulating essential processes such as cell proliferation, apoptosis, and migration. CAMK1D participates in complex signaling cascades that integrate calcium dynamics with MAPK and NF-κB pathways, which may influence both tumor cell behavior and the inflammatory response in the tumor microenvironment. In addition, recent studies indicate that CAMK1D modulates immune cell activation and cytokine production, suggesting that it could impact immune evasion mechanisms in hormone-responsive breast cancers. Given its dual role in both cancer cell signaling and immune regulation, CAMK1D represents a potential therapeutic target in tumors with complex signaling networks24.
KIAA0513, MED13, STXBP5: Although the precise roles of KIAA0513, MED13, and STXBP5 in breast cancer remain poorly defined, their consistent identification across multiple bioinformatics platforms underscores their potential relevance in breast cancer progression. MED13 is a core subunit of the Mediator complex, which facilitates transcriptional regulation, particularly in estrogen receptor (ER)-dependent gene expression programs. STXBP5, involved in vesicle trafficking and membrane fusion, may influence receptor signaling efficiency, thus affecting tumor cell responses to external stimuli. KIAA0513, with its putative involvement in neuronal differentiation, may also play a role in cellular organization and gene expression regulation within the tumor. Given their recurrent selection in various computational models, these genes are likely to be involved in key biological processes that drive tamoxifen resistance and tumorigenesis. Further experimental validation is essential to confirm their functional roles and assess their potential as therapeutic targets or biomarkers for breast cancer25,26.
Critically, our risk model correlates with the immune landscape. High-risk patients exhibited enriched M2 Macrophages (often pro-tumorgenic and immunosuppressive) and M0 Macrophages, while low-risk patients had higher CD8+ T cell infiltration27. This suggests that the gene signature captures not only tumor-intrinsic resistance but also an immunosuppressive microenvironment that may facilitate tumor progression and drug tolerance28.
Limitations: This study is retrospective and relies on publicly available databases. While validated in an independent cohort (METABRIC), the biological function of these genes in tamoxifen resistance requires further verification through in vitro and in vivo experiments29,30. Additionally, the discovery cohort from GSE67916 contained only 18 samples (8 control and 10 tamoxifen-resistant), which is relatively small and may increase the risk of overfitting when identifying differentially expressed genes and hub genes. Although we employed three independent machine learning algorithms (LASSO, SVM-RFE, and Random Forest) with cross-validation and further validated the six-gene signature in large cohorts, the small sample size of the initial discovery dataset remains a potential limitation. Future studies with larger tamoxifen resistance cohorts and functional experimental validation are warranted to confirm the robustness of the identified gene signature.
The authors declare that they have no conflicts of interest.
This study was supported by the Project of Jiangsu Province Engineering Research Center of Molecular Target Therapy and Companion Diagnostics in Oncology (SGK2202319), the Jiangsu higher education institution innovative research team for science and technology (2021), Program of Jiangsu vocational college engineering technology research center (2023), The Natural Science key Foundation of the Jiangsu Higher Education Institutions of China (Grant No. 24KJA310008), the Key Programs of the Suzhou Vocational Health College (szwzy szwzy202406), the Project of State Key Laboratory of Radiation Medicine and Protection, Soochow University (No. GZK1202506), the Project of Jiangsu Province Engineering Research Center of Molecular Target Therapy and Companion Diagnostics in Oncology (SGK1202413), the Dongwu Health Talent Program (DWWS2024002, DWWS2025001).
| Name | Company | Catalog Number | Comments |
|---|---|---|---|
| Affy package | Bioconductor | - | Microarray data processing |
| clusterProfiler package | Bioconductor | v4.4.2 | GO and KEGG enrichment analysis |
| CIBERSORT algorithm | Newman et al. | - | Immune cell infiltration estimation |
| e1071 package (SVM-RFE) | CRAN | - | Support vector machine recursive feature elimination |
| Gene Expression Omnibus (GEO) | NCBI | GSE67916 | Tamoxifen resistance dataset |
| glmnet package | CRAN / Bioconductor | - | LASSO regression |
| limma package | Bioconductor | - | Differential expression analysis |
| METABRIC dataset | cBioPortal / Curtis et al. | - | Validation cohort |
| pROC package | CRAN | - | ROC curve analysis |
| R software | R Core Team | v4.4.2 | Statistical computing environment |
| Random forest package | CRAN | - | Random Forest algorithm |
| rms package | CRAN | - | Nomogram construction |
| TCGA-BRCA cohort | The Cancer Genome Atlas (TCGA) | - | Training cohort RNA-seq data |
| timeROC package | CRAN | - | Time-dependent ROC analysis |
Request permission to reuse the text or figures of this JoVE article
Request Permission