July 22nd, 2025
Mime is a flexible computational framework to construct a machine learning-based integration model with elegant performance. Here, we provide a detailed step-by-step procedure for developing predictive models with high accuracy, leveraging complex datasets to identify critical genes associated with disease progression, patient outcomes, and therapeutic response.
High-throughput sequencing technology has significantly impacted our understanding of biology and cancer heterogeneity. However, with so much high-throughput sequencing data available, it is difficult to rapidly screen and identify disease-associated genes and biomarkers. Numerous machine learning frameworks exist, yet none of them offers an integrated comparison for informed decision-making. To address this gap, we developed Mime, a unified platform for evaluating model strengths and weaknesses.
Mime offers four functions: optimal prognosis modeling, binary response prediction, core prognostic feature identification, and model performance visualization, leveraging self-trained machine learning algorithms for integrated inter-critical analysis.
Researchers often struggle with choosing predictive algorithms and managing machine learning environments. Mime's open-source R package simplifies model setup, parameter selection, and deployment, enabling users to analyze their own data easily.
Mime marks a milestone in applying AI to biomedicine and can be used to integrate machine learning across single-cell sequencing layers to uncover intratumoral heterogeneity.
[Narrator] To begin, open the GitHub website on a desktop computer. Install the development version of Mime from GitHub using the devtools package in R. Prepare multiple cohorts containing transcriptional sequencing data with survival or clinical response information. Use the example datasets, Example.cohort and Example.ici, which are accessible from the Mime GitHub repository.
Example.cohort contains two glioma datasets, each with 100 samples randomly selected from the TCGA and CGGA databases, respectively. Include multiple datasets in Example.cohort to construct predictive models for prognosis. Verify that the dataset format includes the sample ID in the first column, survival time and status in the second and third columns, and log-transformed gene expression levels in the remaining columns. Confirm that Dataset1 is used for training and the other datasets for validation. Next, load the Example.ici dataset and confirm that the format includes the sample ID in the first column, therapeutic response in the second column, and log-transformed gene expression levels in the remaining columns. Prepare the genelist, the gene set associated with Wnt/beta-catenin signaling, in R from the genelist file.
Use the function ML.Dev.Prog.Sig and the given code to construct predictive models for prognosis based on Example.cohort and the genelist. Then use the function cindex_dis_all to plot the C-index of each model and identify the optimal model. Calculate the survival curves of patients according to the risk score of a specific model across the different datasets in Mime using the given code. Calculate the time-dependent AUC for the predictive models using the function cal_AUC_ml_res and the given code. Now, plot the time-dependent AUC for each model using the function auc_dis_all and the given code. Plot the time-dependent ROC curve of a specific model across the different datasets in Mime using the function roc_vis and the given code.
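The cohort layout and the C-index comparison described above can be illustrated with a small sketch. It is written in Python with the standard library only and is not Mime's code. The function names check_layout and harrell_c_index, the five-sample toy cohort, and the risk score (geneA minus geneB) are all assumptions made for illustration, and pairs with tied survival times are simply skipped.

```python
from itertools import combinations


def check_layout(rows):
    """Check the layout: sample ID, survival time, status (0/1), then expression values."""
    width = len(rows[0])
    if width < 4:
        raise ValueError("expected ID, time, status and at least one gene column")
    for row in rows:
        if len(row) != width:
            raise ValueError(f"row {row[0]!r} has {len(row)} columns, expected {width}")
        sample_id, time, status, *expression = row
        if not isinstance(sample_id, str):
            raise ValueError("first column must be the sample ID")
        if time <= 0:
            raise ValueError(f"{sample_id}: survival time must be positive")
        if status not in (0, 1):
            raise ValueError(f"{sample_id}: status must be 0 (censored) or 1 (event)")
        if not all(isinstance(v, (int, float)) for v in expression):
            raise ValueError(f"{sample_id}: expression values must be numeric")
    return True


def harrell_c_index(times, events, risks):
    """Fraction of comparable patient pairs in which the higher risk score
    belongs to the patient with the shorter observed survival time."""
    concordant = 0.0
    comparable = 0
    for i, j in combinations(range(len(times)), 2):
        # Let 'a' be the patient with the shorter follow-up time.
        a, b = (i, j) if times[i] < times[j] else (j, i)
        # Skip tied times, and pairs where the shorter time is censored.
        if times[a] == times[b] or not events[a]:
            continue
        comparable += 1
        if risks[a] > risks[b]:
            concordant += 1.0
        elif risks[a] == risks[b]:
            concordant += 0.5
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Toy cohort (made-up values): ID, survival time, status, geneA, geneB.
cohort = [
    ("S1", 2.0, 1, 3.1, 0.4),
    ("S2", 4.0, 1, 2.5, 0.9),
    ("S3", 6.0, 0, 1.2, 1.5),
    ("S4", 8.0, 1, 0.5, 2.5),
    ("S5", 10.0, 0, 0.7, 2.0),
]

check_layout(cohort)
times = [row[1] for row in cohort]
events = [row[2] for row in cohort]
risks = [row[3] - row[4] for row in cohort]  # hypothetical risk score
c_index = harrell_c_index(times, events, risks)
print(f"C-index: {c_index:.3f}")
```

A C-index of 0.5 corresponds to random ordering and 1.0 to perfect ordering, which is why the model with the highest C-index across cohorts is treated as the optimal one.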
To construct predictive models for therapeutic response, use the function ML.Dev.Pred.Category.Sig based on the Example.ici dataset and the genelist. Visualize the AUC for each response model using auc_vis_category_all. Then, generate the ROC curves for each model using roc_vis_category. For core feature selection, identify core genes associated with prognosis using ML.Corefeature.Prog.Screen based on Example.cohort and the genelist. Plot the rank of genes filtered by different methods using core_feature_rank to highlight frequently identified core genes.
Among the 117 prognostic models constructed by Mime, the StepCox[Forward] + plsRcox combined model showed the highest concordance index across all cohorts. Patients with high risk scores had significantly worse outcomes in all cohorts. The one-year area under the curve predicted by SPCOM ranked highest among all models, with the highest mean AUC value across cohorts. Among the seven therapeutic response prediction models, the svmRadialWeights model achieved the highest performance, with an area under the curve of 0.81 in the training dataset and 0.68 in the validation dataset. Core feature selection identified PSEN2, WNT5B, and SKP2 as the top-ranked genes based on their recurrence across different algorithms.
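Ranking genes by their recurrence across algorithms can be sketched as a simple count of how many selection methods retain each gene. The following Python example is an illustration only and does not reproduce Mime's implementation. The method names, the gene names, and the function rank_by_recurrence are hypothetical placeholders, and the selections are made-up toy data.

```python
from collections import Counter

# Hypothetical output of four feature-selection methods (toy data).
selected_by_method = {
    "method_1": ["GENE_A", "GENE_B", "GENE_C"],
    "method_2": ["GENE_A", "GENE_C"],
    "method_3": ["GENE_A", "GENE_B", "GENE_D"],
    "method_4": ["GENE_A", "GENE_C", "GENE_D"],
}


def rank_by_recurrence(selections):
    """Return (gene, count) pairs sorted by how many methods selected the gene."""
    counts = Counter()
    for genes in selections.values():
        counts.update(set(genes))  # count each gene at most once per method
    # Sort by descending count, then alphabetically to break ties.
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


ranking = rank_by_recurrence(selected_by_method)
for gene, count in ranking:
    print(f"{gene}: selected by {count} of {len(selected_by_method)} methods")
```

Genes at the top of such a ranking are those that several independent methods agree on, which is the sense in which they are reported as core features.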
Mime is a computational framework designed to construct machine learning-based integration models for predicting disease-associated genes. This article outlines a step-by-step procedure for developing high-accuracy predictive models using complex datasets.