Method Article

Data-Driven Drug Discovery Optimization for Breast Cancer Using Interpretable Machine Learning Models

DOI: 10.3791/68705

September 12th, 2025

Summary


This protocol presents a machine learning pipeline using XGBoost and SHAP to predict drug sensitivity in breast cancer. The workflow includes data preprocessing, hybrid modeling, SHAP-based interpretation, synergy scoring, and PCA clustering to identify potent drugs and understand key biological factors influencing therapeutic response.

Abstract


Breast cancer remains one of the most prevalent malignancies worldwide, posing significant therapeutic challenges due to tumor heterogeneity and drug resistance. This study presents a reproducible, data-driven machine learning protocol for predicting drug sensitivity in breast cancer cell lines, with the dual objective of identifying potent single agents and synergistic drug combinations. Using curated datasets from the Genomics of Drug Sensitivity in Cancer (GDSC), two predictive approaches were implemented: a standalone XGBoost regressor and a hybrid Autoencoder-XGBoost pipeline. Preprocessing included label encoding, one-hot encoding, Z-score standardization, missing value imputation, and dimensionality reduction via PCA. Model evaluation demonstrated that XGBoost achieved superior performance (MSE = 1.3789, R² = 0.8145) compared to the hybrid model (MSE = 4.0322, R² = 0.4577). Interpretability was addressed using SHapley Additive exPlanations (SHAP), which identified TARGET_PATHWAY, DRUG_ID, TARGET, and CELL_LINE_NAME as key predictive features, aligning with established pharmacological mechanisms. Predicted synergy scores, derived from combining model outputs with DrugComb and SynergyDB data, highlighted promising drug pairs such as Bortezomib + Romidepsin and Paclitaxel + Bortezomib. These findings were further supported by PCA-based pharmacological clustering, revealing biologically relevant groupings of drugs with similar mechanisms of action. The proposed protocol provides a transparent and adaptable framework for precision oncology research, enabling both predictive accuracy and biological interpretability. By integrating rigorous preprocessing, model validation, explainability, and drug synergy analysis, this workflow offers a scalable foundation for translational drug discovery and repurposing in breast cancer treatment.
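The preprocessing and evaluation steps named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the data are synthetic, the numeric column `expr_feature` and all hyperparameters are hypothetical, and label encoding, PCA, and the autoencoder branch are omitted for brevity. If the `xgboost` package is not installed, the sketch falls back to scikit-learn's `GradientBoostingRegressor` as a stand-in, so it does not reproduce the metrics reported above.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

try:
    from xgboost import XGBRegressor as Regressor  # model named in the protocol
except ImportError:
    # Stand-in (assumption) so the sketch still runs without xgboost.
    from sklearn.ensemble import GradientBoostingRegressor as Regressor

# Categorical names come from the abstract; "expr_feature" is hypothetical.
CATEGORICAL = ["DRUG_ID", "TARGET_PATHWAY", "CELL_LINE_NAME"]
NUMERIC = ["expr_feature"]


def build_pipeline():
    """Median imputation + Z-score scaling for numeric columns,
    one-hot encoding for categorical columns, then a boosted regressor."""
    numeric = Pipeline([
        ("impute", SimpleImputer(strategy="median")),
        ("scale", StandardScaler()),
    ])
    pre = ColumnTransformer([
        ("num", numeric, NUMERIC),
        ("cat", OneHotEncoder(handle_unknown="ignore"), CATEGORICAL),
    ])
    # Illustrative hyperparameters, not the values used in the study.
    model = Regressor(n_estimators=200, max_depth=4,
                      learning_rate=0.1, random_state=0)
    return Pipeline([("pre", pre), ("model", model)])


def make_toy_data(n=400, seed=0):
    """Synthetic drug-response table; values carry no biological meaning."""
    rng = np.random.default_rng(seed)
    df = pd.DataFrame({
        "DRUG_ID": rng.choice(["D1", "D2", "D3", "D4"], n),
        "TARGET_PATHWAY": rng.choice(["P1", "P2", "P3"], n),
        "CELL_LINE_NAME": rng.choice(["C1", "C2", "C3"], n),
        "expr_feature": rng.normal(size=n),
    })
    effect = df["DRUG_ID"].map({"D1": -2.0, "D2": 0.0, "D3": 1.5, "D4": 3.0})
    df["LN_IC50"] = effect + 0.8 * df["expr_feature"] + rng.normal(scale=0.3, size=n)
    # Knock out a few values so the imputation step has work to do.
    df.loc[rng.choice(n, 20, replace=False), "expr_feature"] = np.nan
    return df


df = make_toy_data()
X_train, X_test, y_train, y_test = train_test_split(
    df[CATEGORICAL + NUMERIC], df["LN_IC50"], test_size=0.2, random_state=0
)
model = build_pipeline().fit(X_train, y_train)
pred = model.predict(X_test)
mse = mean_squared_error(y_test, pred)
r2 = r2_score(y_test, pred)
print(f"MSE = {mse:.4f}, R2 = {r2:.4f}")
```

Wrapping the preprocessing and the regressor in a single `Pipeline` means the imputation medians, scaling statistics, and one-hot categories are learned from the training split only, which avoids leaking test-set information into the reported MSE and R².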

Introduction


Breast cancer remains the most commonly diagnosed cancer and the second leading cause of cancer-related death among women globally [1]. In the United States alone, it accounts for nearly 30% of all new female malignancies, with over 280,000 new cases diagnosed annually [2]. Despite therapeutic advances, particularly in HER2-positive and hormone receptor-positive subtypes, resistance to treatment and recurrence remain critical challenges, especially for aggressive subtypes like triple-negative breast cancer (TNBC), which lacks targeted therapies [3,4]. This underscores the ....


Protocol


1. Dataset acquisition

  1. Download drug sensitivity data from GDSC (https://www.cancerrxgene.org/downloads/drug_data). The files used are gdsc_drug_data.csv (drug response), gdsc_expression_data.csv (gene expression), and gdsc_cell_metadata.csv (cell line information). Table 1 summarizes the dataset.
    See Figure 1 for an example of the dataset structure used in this workflow.
  2. Filter the dataset to include only breast cancer cell lines using Python (Pandas library).
    1. Select records where the TCGA_DESC column equals "Breast".
    2. Extract corresponding CEL....
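
The filtering step above can be sketched with Pandas as follows. This is a minimal illustration under stated assumptions: the helper name `filter_breast_lines` and the toy table are hypothetical, and the label "Breast" is taken from the protocol text. The values actually present in a downloaded file may differ, so it is worth inspecting `drug_df["TCGA_DESC"].unique()` before filtering.

```python
import pandas as pd


def filter_breast_lines(drug_df, label="Breast", tissue_col="TCGA_DESC"):
    """Keep only rows whose tissue descriptor equals `label` exactly.

    Rows with a missing descriptor are dropped, because a comparison
    against NaN/None evaluates to False.
    """
    if tissue_col not in drug_df.columns:
        raise KeyError(f"column {tissue_col!r} not found")
    mask = drug_df[tissue_col].eq(label)
    return drug_df.loc[mask].reset_index(drop=True)


# In the real workflow the table would be loaded from the downloaded file,
# e.g. drug_df = pd.read_csv("gdsc_drug_data.csv").
# A small hypothetical table is used here so the sketch runs on its own.
toy = pd.DataFrame({
    "CELL_LINE_NAME": ["MCF7", "A549", "T47D", "HT-29"],
    "TCGA_DESC": ["Breast", "Lung", "Breast", None],
    "LN_IC50": [1.2, 3.4, -0.5, 2.2],
})

breast = filter_breast_lines(toy)
print(breast)
```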


Results


This study focused on optimizing drug selection and predicting combinatorial efficacy for breast cancer using advanced machine learning models. The dataset included a curated and filtered panel of breast cancer cell lines, drug sensitivity metrics (LN_IC50, AUC, Z-Score), and molecular descriptors such as CNA, methylation, gene expression, tissue descriptors, and drug targets. The primary objective was to predict the LN_IC50 (natural log of half-maximal inhibitory concentration) of individual drugs across cell lines, ide.......
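
Because the prediction target is the natural log of IC50, a short worked conversion may help when reading predicted values. The sketch below assumes IC50 is expressed in micromolar units, which the text above does not state, and the function names are hypothetical.

```python
import math


def ln_ic50(ic50_um):
    """Natural log of an IC50 value (assumed to be in micromolar)."""
    if ic50_um <= 0:
        raise ValueError("IC50 must be positive")
    return math.log(ic50_um)


def ic50_from_ln(ln_value):
    """Invert the transform: recover IC50 from a (predicted) LN_IC50."""
    return math.exp(ln_value)


# An IC50 of exactly 1 maps to LN_IC50 = 0; values below 1 give negative
# LN_IC50, so lower (more negative) LN_IC50 means a more potent drug.
potent = ln_ic50(0.05)
weak = ln_ic50(20.0)
print(potent < 0 < weak)
```

On this scale an error of one unit corresponds to roughly a 2.7-fold error in IC50, which is a useful way to read an MSE reported in LN_IC50 units.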


Discussion


This study presents an integrated machine learning pipeline to optimize drug selection, predict synergistic combinations, and identify drug repurposing opportunities. Data from the GDSC database and synergy repositories (e.g., SynergyDB, DrugComb) were integrated to curate a comprehensive panel of drug-cell line interactions, encompassing molecular characteristics (e.g., gene expression, copy number alterations) and pharmacological responses [31]. The primary purpose here was to make a.......


Disclosures


The authors declare that there are no conflicts of interest related to this work. We confirm that large language model (LLM) technology (ChatGPT, developed by OpenAI) was used in a limited capacity during the early stages of manuscript preparation. Specifically, ChatGPT was employed for idea generation and preliminary brainstorming of conceptual frameworks, which were subsequently refined, validated, and fully rewritten by the authors. All core scientific content, data analysis, interpretation, and final drafting were performed exclusively by the authors. The outputs from ChatGPT were critically reviewed for accuracy, coherence, and integrity before inclusion, in compliance with the journal's transparency and ethical guidelines. All authors have reviewed and approved the final version of the manuscript and confirm that there are no financial, personal, or professional relationships that could be construed to influence the content of this publication.

Acknowledgements


The authors sincerely acknowledge the institutional support provided by the Department of Computer Science, Christ University, which facilitated the computational resources and academic environment necessary to conduct this research. We are also grateful for the collaborative guidance and encouragement extended by our colleagues and mentors throughout the course of this work.

AUTHOR CONTRIBUTION:
Dyuti Banerjee conceived the study, designed the methodology, and curated the dataset. Sivaneasan Bala Krishnan and Kamal Upreti implemented the machine learning models and performed the computational analysis. Sumegh Shrikan....


Materials

List of materials used in this article
| Name | Company | Catalog Number | Comments |
| --- | --- | --- | --- |
| Autoencoder (Deep Learning Model) | TensorFlow (Google) | https://www.tensorflow.org | Dimensionality reduction and feature encoding for drug response modeling |
| Bortezomib | Selleck Chemicals | S1013 | Drug used in synergy analysis |
| Dactinomycin | Sigma-Aldrich | D1037 | Drug used in synergy analysis |
| Docetaxel | Sigma-Aldrich | D1080 | Drug used for mechanism-based clustering validation |
| Matplotlib Library | Python Package Index (PyPI) | https://matplotlib.org | Data visualization and plotting in Python |
| NumPy Library | Python Package Index (PyPI) | https://numpy.org | Numerical computing and matrix operations |
| Paclitaxel | Sigma-Aldrich | T7191 | Drug used for mechanism-based clustering validation |
| Pandas Library | Python Package Index (PyPI) | https://pandas.pydata.org | Data manipulation and processing |
| Python 3.10 | Python Software Foundation | https://www.python.org | Primary programming language |
| Romidepsin | Selleck Chemicals | S3020 | Drug used in synergy analysis |
| Scikit-learn Library | Python Package Index (PyPI) | https://scikit-learn.org | Machine learning modeling and preprocessing tools |
| Seaborn Library | Python Package Index (PyPI) | https://seaborn.pydata.org | Data visualization and statistical plotting |
| SHAP Library | Python Package Index (PyPI) | https://shap.readthedocs.io | Explainable AI model interpretability |
| Synergy Data (DrugComb) | FIMM, Finland | https://drugcomb.fimm.fi | Drug synergy reference dataset |
| Synergy Data (SynergyDB) | University of Groningen | https://synergy.bioinformatics.nl | Drug synergy reference dataset |
| TensorFlow 2.11 | Google | https://www.tensorflow.org | Autoencoder deep learning model implementation |
| Vinblastine | Sigma-Aldrich | V1377 | Drug used in synergy analysis |
| XGBoost Library | Python Package Index (PyPI) | https://xgboost.readthedocs.io | Gradient boosting regression modeling |

References

  1. Vamathevan, J., et al. Applications of machine learning in drug discovery and development. Nat Rev Drug Discov. 18 (6), 463-477 (2019).
  2. Dara, S., Dhamercherla, S., Jadav, S. S., Babu, C. M., Ahsan, M. J. Machine learning in drug discovery: a review. Artif Intell Rev.



Tags

Breast Cancer, Drug Discovery, Machine Learning, Drug Sensitivity, XGBoost Model, Autoencoder Pipeline, Drug Synergy, SHAP Analysis, Precision Oncology, Pharmacological Clustering
