Research Article
This study enhances discogenic low back pain (DLBP) diagnosis using lumbar spine MRI T2 data, radiomics, and a Random Forest model with SHAP analysis, achieving high accuracy and interpretability for improved clinical decision-making.
Low back pain (LBP) is a leading cause of disability and reduced quality of life globally, with discogenic low back pain (DLBP) accounting for 39% of cases. Accurate diagnosis of LBP etiology is challenging due to the lack of reliable methods. This study aims to improve DLBP diagnostic efficiency using lumbar spine MRI T2 data combined with radiomics and machine learning. This retrospective study analyzed MRI data from 81 DLBP patients and 162 healthy controls. Radiomics features, clinical data, and high-intensity zone (HIZ) imaging features were extracted. The data were divided into four groups (d0, d1, d2, D), and 20 predictive models were built using Random Forest (RF), Decision Tree (TREE), Support Vector Machine (SVM), K-Nearest Neighbors (KNN), and Logistic Regression (LOG). Model performance was evaluated using Receiver Operating Characteristic (ROC) area under the curve (AUC), precision-recall (PR) AUC, accuracy, sensitivity, specificity, positive predictive value (PPV), negative predictive value (NPV), and F1 score. SHapley Additive exPlanations (SHAP) were applied to interpret the most significant features. The Random Forest model in group D showed the best performance, with ROC AUCs of 0.9861 (train) and 0.9580 (test), PR AUCs of 0.9813 and 0.9179, and F1 scores of 0.9254 and 0.8148, respectively. SHAP analysis identified first-order kurtosis as the top feature contributing to DLBP diagnosis. The Random Forest model with SHAP analysis significantly improved DLBP diagnosis, offering high performance and interpretability to enhance clinical decision-making.
With the aggravation of population aging and the widespread occurrence of improper sitting postures in daily work and life, the number of patients with low back pain (LBP) is increasing year by year1. The average prevalence of LBP in the general adult population is approximately 12%, with higher prevalence observed in individuals aged 40 or older and in women, with a lifetime prevalence of about 40%2. Among all LBP cases, discogenic low back pain (DLBP) accounts for approximately 39%, with its etiology being complex and diverse, often associated with disc degeneration, particularly annular tears that reach the outer annulus of the disc3,4,5. Increased disc pressure stimulates nerve endings sensitized by inflammatory mediators, leading to discogenic pain5,6.
Currently, discography is considered the gold standard for DLBP diagnosis, confirming etiology by provoking discogenic pain7. However, its validity and safety are questioned due to potential false positives and the risk of accelerating disc degeneration, limiting its clinical application5,8. The high-intensity zone (HIZ) refers to high-intensity signals in the posterior annulus on T2-weighted MRI images, pathologically characterized by annular tears or ruptures5,8,9. Although the relationship between HIZ and DLBP has been widely studied, HIZ has low specificity and is also common in asymptomatic individuals, often leading to misdiagnosis10,11,12,13. In T2-weighted images of lumbar degenerative disc disease (LDDD) and chronic LBP patients, nearly all degenerated discs show reduced signal intensity14. According to Pfirrmann et al.'s classification, disc degeneration is graded from I (normal disc) to V (severe degeneration)14. T2-weighted imaging is widely used in multimodal studies for the diagnosis and evaluation of LDDD and LBP, providing critical insights into disc degeneration and its clinical relevance15,16. However, low signal intensity does not accurately reflect morphological changes in the disc and has little correlation with the degree of pain caused by DLBP17,18,19.
Currently, clinical decision-making for DLBP primarily relies on the subjective assessment of clinicians, who focus on factors such as HIZ, disc height, black discs, or signal changes in the disc20,21. Additional considerations include patient age, gender, height, weight, and body mass index (BMI), as well as the duration and intensity of LBP. Such diagnosis, dependent on subjective assessment, may lead to erroneous treatment plans. Therefore, developing an effective diagnostic tool to improve the diagnostic efficiency of DLBP is urgently needed4.
MRI, as a non-invasive imaging technique, plays an important role in the diagnosis of LBP. In particular, axial T2-weighted imaging (T2WI), with its high tissue contrast, provides more accurate physiological information and pathological status of the disc compared to traditional computed tomography (CT)22,23. Radiomics feature analysis, by converting image data from MRI, CT, etc., into quantitative features for advanced mathematical analysis, provides a powerful tool for the evaluation and interpretation of medical images24. This method can reduce biases from subjective analysis, thereby enhancing the accuracy and consistency of diagnostic results25.
In recent years, artificial intelligence and machine learning techniques in medical imaging analysis have brought new hope for disease diagnosis26. By enhancing diagnostic accuracy, automating clinical workflows, and enabling personalized treatment strategies, AI technologies are profoundly transforming the landscape of intelligent healthcare27. In DLBP diagnosis, the integration of radiomics with machine learning algorithms holds great potential to improve diagnostic accuracy and efficiency28. However, the "black-box" nature of complex machine learning models limits their clinical applicability, making explainable AI (XAI) a critical component of clinical decision support systems29. SHapley Additive exPlanations (SHAP), a game theory-based interpretive method, quantifies each feature's contribution and impact on model diagnostic predictions, making complex machine learning models more transparent and interpretable30. It has become one of the most widely used XAI techniques in clinical decision support systems, effectively enhancing clinicians' trust and acceptance of AI systems29.
This study developed a Random Forest algorithm incorporating radiomics and SHAP interpretability analysis, significantly improving the diagnostic efficiency of DLBP compared to traditional subjective assessments. Furthermore, the SHAP interpretation method enhances the transparency of the model's diagnostic predictions, boosting clinicians' confidence in the diagnosis and improving the accuracy of clinical decision-making. This study posits that this model will reduce unnecessary suffering caused by misdiagnosis and delayed treatment, promote the advancement of personalized treatment, and ultimately improve the overall treatment outcomes for patients.
Ethical considerations and study population
This retrospective study was approved by the institutional ethics committee. Informed consent was waived as all protected health information was anonymized. The study population consisted of patients retrieved from the imaging database of the First People's Hospital of Nantong, who underwent lumbar spine MRI due to low back pain between January 2022 and December 2023. Clinical features of the patients were also collected (Table 1).
Inclusion and exclusion criteria
The current diagnostic standard for DLBP follows the 1995 International Association for the Study of Pain discography method, which provokes pain through increased pressure but is not widely accepted due to its high invasiveness8. Therefore, this study adopted a restrictive intraoperative disc pain provocation test as the diagnostic method for DLBP, consistent with discography principles while minimizing further disc damage and complications31. The procedure involved patients in a prone position under spinal anesthesia, with the target lumbar disc located using C-arm X-ray guidance. An 18 G (<22 G) needle was inserted via a posterolateral approach into the central nucleus pulposus, avoiding nerve roots and the dural sac. After confirming needle placement, saline or non-ionic contrast (e.g., iohexol) was injected at a rate not exceeding 0.5 mL/min, with pressure below 50 psi and a total injection volume of less than 3 mL, to simulate increased intradiscal pressure and provoke familiar low back pain symptoms, with a visual analog scale (VAS) score ≥ 7.
DLBP group inclusion criteria: The patients included here underwent MRI examination, have recurrent low back pain for over 3 months, with failed conservative treatment, with or without lower limb numbness or radiating pain, and are positive for the intraoperative disc pain provocation test.
Non-DLBP group inclusion criteria: Patients included here underwent MRI; have no history of low back pain within 3 months or are healthy individuals undergoing physical examination; have no MRI abnormalities; and have a standardized assessment with Oswestry Disability Index (ODI) <10 and VAS ≤2.
Exclusion criteria: Patients were excluded if they had other causes of low back pain, such as significant disc herniation compressing nerves, fractures, spinal infections, spondylolisthesis, tumors, osteoporosis, or metabolic bone diseases; a history of surgery prior to examination; unclear or poor-quality images; or a region of interest (ROI) that could not be identified.
Ultimately, 243 patients were included, comprising 81 DLBP patients and 162 controls.
MRI parameters
All patients included in this study underwent 3.0 T MRI scans, using sequences that included sagittal and axial T1-weighted imaging (T1WI) and T2-weighted imaging (T2WI). Three different MRI machines were used in the study: Siemens Verio, Siemens Prisma, and Philips Ingenia CX. The scan parameters were set as follows: for sagittal T2WI, TR ranged from 2000 to 4597 ms, TE from 90 to 120 ms, slice thickness from 4.0 to 4.8 mm, with 15 slices included, bandwidth from 250 Hz to 340 Hz, matrix size of 384 × 384 or 512 × 512, phase field of view percentage of 100%, and a readout field of view of 300 mm.
HIZ imaging features measurement and statistics
This study used 3D Slicer software (version 5.6.1, https://download.slicer.org/?version=5.6.1) to manually analyze lumbar MRI T2-weighted images from 243 patients and controls. The data processing workflow was as follows: Images were imported using the Add DICOM Data function in 3D Slicer. Measurements were performed using the Markups function by two spine research graduate students under the guidance of radiologists and spine surgeons with over 10 years of clinical experience. Discrepancies in measurements were resolved through consultation with the two doctors. All measurement values were averaged by two statisticians to ensure accuracy. On sagittal images, the two mutually perpendicular maximum diameters of the high intensity zone (HIZ) were measured, defined as Hizh (near-vertical direction) and Hizw (near-horizontal direction). The Segmentation function was used to delineate the HIZ region, and the area of the most prominent HIZ on the sagittal plane was calculated, denoted as Hizarea. On axial images, the maximum length of the HIZ was measured, named Hizl. Additionally, three binary variables were defined: Hiz (presence of HIZ), Other (presence of multisegmental HIZ), and Position (whether HIZ crosses the posterior midline) (Table 2).
Radiomics feature extraction and standardization
To integrate the specific operation of Resampling Scalar Volume into the context of processing lumbar MRI T2-weighted images from 243 patients and controls using 3D Slicer, the following detailed steps were followed: The process began with importing lumbar MRI T2-weighted images into 3D Slicer. To ensure consistency and reduce heterogeneity bias, all images were resampled to a voxel size of 0.6 × 0.6 × 0.6 mm using the Resample Scalar Volume module. The specific steps were: Navigate to the Modules section and select Resample Scalar Volume. Under Parameter Set, ensure Resample Scalar Volume is selected. In the Resampling Parameters section, set Spacing to 0.6, 0.6, 0.6 to define the target voxel dimensions. Select an appropriate Interpolation method from options such as linear, nearest neighbor, bspline, hamming, cosine, welch, lanczos, or blackman, with linear as the default. For Output Volume, select or create a new volume to store the resampled data. After verifying settings, click Apply to execute the resampling process.
After resampling, two spine research graduate students, guided by radiologists and spine surgeons with over 10 years of experience, performed semi-automatic delineation of the region of interest (ROI). Discrepancies were resolved through consultation. The specific steps were: Select the layer (resampled new volume), create a new segmentation layer, and use the Draw function for layer-by-layer delineation. Use the Fill between slices function for interlayer filling. Smooth the formed ROI using the Median smoothing method with a kernel size of 3.0 mm, 5 x 5 pixels. Using the PyRadiomics library, 107 radiomic features were extracted from the resampled images, including shape features, first-order statistical features, gray-level co-occurrence matrix (GLCM), gray-level run-length matrix (GLRLM), gray-level size zone matrix (GLSZM), neighboring gray-tone difference matrix (NGTDM), and gray-level dependence matrix (GLDM) features (Supplementary Table 1). Finally, the extracted feature data were standardized using the Z-score method to transform their natural range into a standardized range.
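The final Z-score standardization step can be sketched in Python. This is a minimal stdlib illustration of the transformation applied to each feature column, not the study's actual pipeline code (which ran in R):

```python
def z_score(values):
    """Standardize one feature column: subtract the mean, divide by the standard deviation."""
    n = len(values)
    mean = sum(values) / n
    # population standard deviation; some pipelines divide by n - 1 instead
    sd = (sum((v - mean) ** 2 for v in values) / n) ** 0.5
    return [(v - mean) / sd for v in values]

# Example: a hypothetical raw radiomics feature column
raw = [10.0, 12.0, 14.0, 16.0, 18.0]
standardized = z_score(raw)
```

After standardization, every feature has mean 0 and unit variance, which keeps features on very different natural scales (e.g., shape diameters in mm versus dimensionless GLCM statistics) comparable during model fitting.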
Grouping based on features
The groups were assigned as follows:
d0 (Base group): Included clinical features, n(d0) = 5.
d1 (Base finetune group): Included clinical features and HIZ imaging features, n(d1) = 12.
d2 (Model group): Included clinical features and radiomics features, n(d2) = 112.
D (Model finetune group): Included clinical features, HIZ imaging features, and radiomics features, n(D) = 119.
Data reading and preprocessing
Data for the respective groups were read using the read_excel function in R software (version 4.3.1, https://www.r-project.org/, platform: x86_64-w64-mingw32/x64 [64-bit]) with UTF-8 encoding (system default), which supports most written languages globally. The select function was used to separate the target variable from the feature variables.
Feature selection
To ensure reproducibility of data splitting, the four datasets (D, d2, d1, d0) followed the same processing workflow, with a fixed random seed of 80. A label column was added to the data frame, converted to a factor type, and random labels (1 and 0) were generated using a binomial distribution to split the data into training and test sets in an 8:2 ratio (80% probability for the training set). Due to the limited number of features, d0 and d1 groups did not require feature selection, while d2 and D groups underwent feature selection:
First, the Mann-Whitney U test was applied to the training set to select features with p-values <0.05. Then, Lasso regression with 10-fold cross-validation and L1 regularization was performed to determine the optimal penalty coefficient and select features from a single iteration. A random seed from 1 to 100 was set, and the above steps (data splitting, Mann-Whitney U test, and Lasso regression) were repeated in a loop. Features appearing more than 50 times in 100 iterations were selected for subsequent modeling to ensure reliability in model construction and diagnostic prediction. Modeling features for each group are listed in Table 3.
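The binomial 8:2 split described above can be sketched as follows. This is a minimal stdlib illustration; the study used R's random number generator with seed 80, so the exact sample assignments will differ from this Python version:

```python
import random

def binomial_split(n_samples, seed=80, p_train=0.8):
    """Assign each sample to train (label 1) or test (label 0) by an independent
    Bernoulli(p_train) draw, mirroring the protocol's binomial-distribution split."""
    rng = random.Random(seed)
    labels = [1 if rng.random() < p_train else 0 for _ in range(n_samples)]
    train = [i for i, lab in enumerate(labels) if lab == 1]
    test = [i for i, lab in enumerate(labels) if lab == 0]
    return train, test

train_idx, test_idx = binomial_split(243)
```

Note that a Bernoulli split yields a train fraction of about 80% in expectation rather than exactly 80%; repeating the whole procedure over seeds 1-100 and retaining only features selected in more than 50 runs is what stabilizes the final feature set against this randomness.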
Grid search and model tuning
To optimize the models, grid search was applied to systematically explore and evaluate different hyperparameter combinations. Through iterative tuning and performance assessment, this study identified the set of hyperparameters that provided the best ROC AUC values for the train and test sets, thereby optimizing the model's performance.
Model development and assessment
All model development and data analysis were conducted in R (version 4.3.1). This study applied various machine learning algorithms, including Random Forest (RF), Support Vector Machine (SVM), Decision Tree (TREE), K-Nearest Neighbors (KNN), and Logistic Regression (LOG), to develop 20 models across the four groups (d0, d1, d2, and D) for predicting DLBP diagnosis. Model performance was evaluated using the following metrics: ROC AUC, PR AUC, accuracy, sensitivity, specificity, positive predictive value (PPV), negative predictive value (NPV), and F1 score.
The ROC AUC and PR AUC values were directly generated from the ROC and PR curves, while the remaining metrics (accuracy, sensitivity, specificity, PPV, NPV, and F1 score) were calculated and statistically analyzed using corresponding functions in R. Among the 20 developed models, this study selected 8 representative models for assessment based on the above performance metrics.
The formulas for each metric are as follows:

Accuracy = (TP + TN) / (TP + TN + FP + FN)

Sensitivity = TP / (TP + FN)

Specificity = TN / (TN + FP)

PPV = TP / (TP + FP)

NPV = TN / (TN + FN)

F1 score = 2 × PPV × Sensitivity / (PPV + Sensitivity)

TP: True Positive; TN: True Negative; FP: False Positive; FN: False Negative.
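These metrics can be computed directly from confusion-matrix counts. In the sketch below, the example counts (46 test samples, 13 DLBP cases, 11 true positives, 3 false positives) are inferred from the reported D-group Random Forest test-set metrics and are an assumption, not figures stated explicitly in the text:

```python
def classification_metrics(tp, tn, fp, fn):
    """Compute the evaluation metrics used in this study from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)               # recall / true positive rate
    specificity = tn / (tn + fp)               # true negative rate
    ppv = tp / (tp + fp)                       # precision
    npv = tn / (tn + fn)
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    f1 = 2 * ppv * sensitivity / (ppv + sensitivity)
    return {"accuracy": accuracy, "sensitivity": sensitivity,
            "specificity": specificity, "ppv": ppv, "npv": npv, "f1": f1}

# Counts inferred (assumption) from the reported D-group RF test-set metrics
m = classification_metrics(tp=11, tn=30, fp=3, fn=2)
```

With these counts the helper reproduces the reported test-set values (accuracy 0.8913, sensitivity 0.8462, specificity 0.9091, PPV 0.7857, NPV 0.9375, F1 0.8148), which serves as a consistency check on the formulas.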
SHAP interpretability analysis
The Random Forest model in the D group demonstrated the best performance in model assessment. To further interpret the model's predictions, this study rebuilt the model in Python (version 3.7.9) and performed SHAP interpretability analysis. To maintain consistency, the train hyperparameters of the Random Forest model in Python were identical to those used in R, and the data processing workflow was also consistent. Specifically, we used the tree model interpreter to analyze the contribution of each feature to the model's diagnostic predictions in the train and test sets. For better visualization, we generated SHAP value distribution plots, feature importance plots, and SHAP force plots for individual predictions.
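The tree explainer used here requires the shap package, but the Shapley attribution it efficiently approximates can be illustrated exactly for a small model by enumerating feature coalitions. The toy below is a conceptual sketch of that underlying computation, not the study's explainer; the linear "model" and its weights are purely illustrative:

```python
from itertools import combinations
from math import factorial

def shapley_values(model, x, background):
    """Exact Shapley values by coalition enumeration (feasible only for few features).
    Features outside a coalition are replaced by their background (reference) value."""
    n = len(x)
    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):
            for subset in combinations(others, size):
                s = set(subset)
                # Shapley weight |S|! (n - |S| - 1)! / n!
                weight = factorial(len(s)) * factorial(n - len(s) - 1) / factorial(n)
                with_i = [x[j] if (j in s or j == i) else background[j] for j in range(n)]
                without_i = [x[j] if j in s else background[j] for j in range(n)]
                phi[i] += weight * (model(with_i) - model(without_i))
    return phi

# Toy linear "risk" model: for a linear model, the Shapley value of feature j
# reduces to w_j * (x_j - background_j)
w = [2.0, -1.0, 0.5]
model = lambda v: sum(wi * vi for wi, vi in zip(w, v))
phi = shapley_values(model, x=[1.0, 2.0, 3.0], background=[0.0, 0.0, 0.0])
```

The values satisfy the efficiency property: they sum to the difference between the prediction for the instance and the prediction for the background, which is the decomposition visualized by the SHAP force plots below.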
Statistical analysis
All data analyses were conducted using R (version 4.3.1, https://www.r-project.org/, platform: x86_64-w64-mingw32/x64 (64-bit)) and Python (version 3.7.9, https://www.python.org/downloads/release/python-379/). Continuous variables were described as mean ± standard deviation, while categorical variables were expressed as frequency and percentage. The comparison between the DLBP and Non-DLBP groups was performed using the Mann-Whitney U test, with a P-value of less than 0.05 considered statistically significant.
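The U statistic underlying this comparison can be computed without external packages. A minimal sketch (average ranks for ties; the p-value, which needs a normal approximation or exact tables, is omitted):

```python
def mann_whitney_u(a, b):
    """Return the Mann-Whitney U statistic (smaller of U_a, U_b) for samples a and b."""
    combined = sorted([(v, 0) for v in a] + [(v, 1) for v in b])
    # assign 1-based ranks, averaging over ties
    ranks = [0.0] * len(combined)
    i = 0
    while i < len(combined):
        j = i
        while j + 1 < len(combined) and combined[j + 1][0] == combined[i][0]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # average of 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[k] = avg_rank
        i = j + 1
    r_a = sum(r for r, (_, grp) in zip(ranks, combined) if grp == 0)
    n_a, n_b = len(a), len(b)
    u_a = r_a - n_a * (n_a + 1) / 2
    return min(u_a, n_a * n_b - u_a)

u = mann_whitney_u([1, 2, 3], [4, 5, 6])
```

Complete separation of the two samples, as in this toy example, yields U = 0, the most extreme value; larger U values indicate greater overlap between the groups.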
This study was based on MRI T2WI data from 81 surgically confirmed DLBP patients and 162 controls, totaling 243 patients. Radiomic features were extracted and combined with clinical and HIZ imaging features, with data split into training and test sets in an 8:2 ratio to build predictive DLBP diagnostic models. Specifically, seven HIZ imaging features were obtained after measuring the target disc using 3D Slicer and were included in the d1 and D group models along with clinical features. The 107 radiomic features extracted using PyRadiomics were standardized and included in the d2 and D group models with clinical features. Due to the limited number of features, d0 and d1 groups did not undergo feature selection; however, after feature selection and multiple iterations, the d2 group stabilized at 11 key features, and the D group at 15 key features. This study enhanced the feature selection process for d2 and D groups using the Mann-Whitney U test, Lasso regression (with L1 regularization and 10-fold cross-validation), and iterative execution to robustly identify the most influential features for DLBP diagnosis. Modeling results after feature selection for each group are listed in Table 3.
Based on data from 243 patients, we constructed 20 diagnostic models for predicting DLBP using five different algorithms, including Random Forest, Logistic Regression, Decision Tree, Support Vector Machine, and K-Nearest Neighbors, across four groups (d0, d1, d2, D). We compared the performance of these 20 models on the train and test sets. The results showed that the Random Forest model in group D performed the best, with ROC AUC values of 0.9861 and 0.9580 for the train and test sets, respectively (Figure 4), which is close to the ideal value of 1. The PR AUC values were 0.9813 and 0.9179, respectively (Figure 5), and the F1 scores were 0.9254 and 0.8148, respectively (Figure 6 and Table 4). The accuracy was 0.9492 and 0.8913, the sensitivity was 0.9118 and 0.8462, the specificity was 0.9690 and 0.9091, the positive predictive value (PPV) was 0.9394 and 0.7857, and the negative predictive value (NPV) was 0.9542 and 0.9375.
This study evaluated the performance of the 20 models, starting with an inter-group comparison, which indicated that the models in group D overall outperformed those in the other three groups (d0, d1, d2). Then, this study conducted a comparison of model types and found that the Random Forest models generally performed better than the other models (except in the d0 group). This led to the selection of the Random Forest model in group D as the optimal model.
Group D RF: As previously mentioned, performed the best.
Group D LOG: The ROC AUC values for the train and test sets were 0.8980 and 0.9114, respectively; the PR AUC values were 0.8643 and 0.8771, and the F1 scores were 0.7742 and 0.8000.
Group D TREE: The ROC AUC values for the train and test sets were 0.9663 and 0.8963, respectively; the PR AUC values were 0.9530 and 0.7194, and the F1 scores were 0.8905 and 0.7200.
Group D SVM: The ROC AUC values for the train and test sets were 0.8894 and 0.9091, respectively; the PR AUC values were 0.8626 and 0.8698, and the F1 scores were 0.7438 and 0.8000.
Group D KNN: The ROC AUC values for the train and test sets were 0.9375 and 0.9266, respectively; the PR AUC values were 0.8965 and 0.7583, and the F1 scores were 0.6727 and 0.7273.
Group d2 RF: The ROC AUC values for the train and test sets were 0.9838 and 0.9557, respectively; the PR AUC values were 0.9697 and 0.9065, and the F1 scores were 0.9037 and 0.8148.
Group d1 RF: The ROC AUC values for the train and test sets were 0.9321 and 0.9027, respectively; the PR AUC values were 0.9179 and 0.8547, and the F1 scores were 0.8174 and 0.5263.
Group d0 RF: The ROC AUC values for the train and test sets were 0.8790 and 0.7751, respectively; the PR AUC values were 0.8002 and 0.6396, and the F1 scores were 0.5263 and 0.3529.
When tuning the hyperparameters using grid search, we judged the quality of the model adjustments based on two criteria: the performance of the ROC AUC values on the test set and the absolute difference in ROC AUC values between the train and test sets.
Additionally, for the RF models in the four groups, after excluding overfitting (the train set ROC AUC values are not equal to 1), we ranked the test set ROC AUC values in descending order (Figure 7A). The RF hyperparameter set in group D performed the best, followed by those in groups d2, d1, and d0. We selected the optimal hyperparameters from each group's Random Forest models for comparison.
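The selection rule described above (exclude fits with train ROC AUC equal to 1, then prefer the highest test ROC AUC with the smallest train-test gap) can be sketched as follows. The AUC values in the example are hypothetical, chosen only to illustrate the ranking:

```python
def pick_best(results):
    """results: list of (params, train_auc, test_auc) tuples.
    Exclude overfit fits (train AUC == 1), then take the highest test AUC,
    breaking ties by the smallest |train AUC - test AUC| gap."""
    candidates = [r for r in results if r[1] < 1.0]
    return max(candidates, key=lambda r: (r[2], -abs(r[1] - r[2])))

# Hypothetical grid-search results: (hyperparameters, train AUC, test AUC)
results = [
    ({"ntree": 500, "mtry": 7}, 1.0000, 0.9600),  # overfit: excluded
    ({"ntree": 50, "mtry": 7},  0.9861, 0.9580),  # best remaining test AUC
    ({"ntree": 100, "mtry": 2}, 0.9500, 0.9300),
]
best = pick_best(results)
```

Excluding the perfectly fit candidate first is deliberate: a train AUC of exactly 1 signals memorization of the training data, so its high test AUC is not trusted over a slightly lower but better-generalizing configuration.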
The specific hyperparameters used for performance assessment and comparison are listed below:
Group D RF Model: The train set did not overfit, and the test set performed the best. The hyperparameters were ntree = 50, mtry = 7, maxnodes = 10, maxdepth = 3.
Group d2 RF Model: The train set did not overfit, and the test set performed second best. The optimal hyperparameters were ntree = 500, mtry = 6, maxnodes = 10, maxdepth = 3.
Group d1 RF Model: There was no risk of overfitting during model tuning. The optimal hyperparameters were ntree = 100, mtry = 10, maxnodes = 10, maxdepth = 3.
Group d0 RF Model: There was no risk of overfitting during model tuning. The optimal hyperparameters were ntree = 100, mtry = 2, maxnodes = 10, maxdepth = 3.
Group D TREE Model: There was no risk of overfitting during model tuning. The optimal hyperparameter was maxdepth = 4 (Figure 7B).
Group D SVM Model: There was no risk of overfitting during model tuning. The optimal hyperparameter was cost = 1 (Figure 7C).
Group D KNN Model: There was no risk of overfitting during model tuning. The optimal hyperparameter was k = 3 (Figure 7D).
Group D LOG Model: No hyperparameter tuning was performed.
The feature importance plots for the train and test sets ranked the most important variables in descending order (Figure 8). In the train set, firstorder_Kurtosis showed the strongest predictive value, followed by Hizw, Hizarea, Hizl, firstorder_Skewness, shape_Flatness, and shape_Maximum2DDiameterColumn. In the test set, firstorder_Kurtosis remained the most important predictor, followed by Hizw, Hizl, shape_Flatness, Hizarea, shape_Maximum2DDiameterColumn, and firstorder_Skewness.
Furthermore, the SHAP values revealed the positive and negative correlations between predictive factors and the target outcome (Figure 9). The horizontal position of the variables indicates whether the variable is high (red) or low (blue) in the observations; In the train set, an increase in first-order Kurtosis, Hizw, Hizarea, and Hizl is positively correlated with the prediction of DLBP diagnosis, while an increase in shape_Flatness promotes the prediction of Non-DLBP diagnosis. The same positive and negative correlations were observed in the test set, although the ranking of feature importance differed slightly.
Figure 10 presents individual force plots for DLBP and Non-DLBP patients in the test set. These plots show the contribution of each feature to the predicted diagnostic outcome: red bars indicate that the feature increased the likelihood of predicting DLBP, while blue bars indicate a decrease in the likelihood of predicting DLBP. The length of the arrows visually represents the impact of each feature on the prediction, with longer arrows indicating a greater influence.
Using SHAP analysis, this study identified several radiomic features critical to DLBP diagnosis, which significantly influence predictive outcomes. These features showed consistent impact in the training and test sets, indicating model stability (Supplementary Figure 1). Beyond the high overall performance of the D-group Random Forest model, a comprehensive assessment of classification behavior across all 20 models, including detailed true/false positive and negative rates, confirms consistent diagnostic reliability. The full set of test-set confusion matrices is provided in Supplementary Figures 2-21, ordered by group (D, d2, d1, d0) and, within each group, by algorithm (RF, LOG, TREE, SVM, KNN).
DATA AVAILABILITY:
All raw data supporting the results presented in this study are made publicly available. The datasets have been submitted to https://doi.org/10.5281/zenodo.17365220.

Figure 1: Introduction abstract. Please click here to view a larger version of this figure.

Figure 2: Flow chart showing the study procedures (part I). Please click here to view a larger version of this figure.

Figure 3: Model construction methodology pipeline (part II). Please click here to view a larger version of this figure.

Figure 4: Receiver operating characteristic (ROC) curve. ROC curves for the machine learning algorithms on the train and test sets among the 20 models. From left to right are the d0, d1, d2, and D groups. The top four charts show the ROC curves for the training sets, and the bottom four show those for the test sets. AUC is the mean area under the curve, reported with its 95% confidence interval. Please click here to view a larger version of this figure.

Figure 5: Precision-Recall (PR) curve. PR Curves for Machine Learning Algorithms on Training and Testing Sets. The PR curves compare the performance of eight representative machine learning models (D RF, D LOG, D TREE, D SVM, D KNN, d2 RF, d1 RF, d0 RF) on the train and test datasets. PR Curves for each model indicate their ability to balance precision and recall. The higher the AUC value, the better the model's performance. Please click here to view a larger version of this figure.

Figure 6: Performance assessment. The heatmap shows the performance metrics of different machine learning models on the train set and the test set. The chart includes the following eight performance metrics for evaluating the models: ROC AUC, PR AUC, Accuracy, Sensitivity (Recall), Specificity, PPV (Precision), NPV, and F1 score. Please click here to view a larger version of this figure.

Figure 7: Model parameter adjustment. (A) The hyperparameter search using grid search for the random forest models in four groups. After eliminating overfitting (train set ROC AUC values not equal to 1), the top 20 ranked ROC AUC values on the test set are displayed in descending order. (B-D) The parameter tuning curves for the Decision Tree, SVM, and KNN machine learning models in Group D. Please click here to view a larger version of this figure.

Figure 8: SHAP for feature importance and weight. These two figures illustrate the mean SHAP values, representing the average weight of each feature on the model output magnitude in the train (left) and test (right) sets. Please click here to view a larger version of this figure.

Figure 9: SHAP for feature impact and contribution. These two figures show the value distribution of features in the train (left) and test (right) sets, indicating the contribution of each feature to the model output. Please click here to view a larger version of this figure.

Figure 10: SHAP force plot for two selected patients (DLBP and Non-DLBP) in the test set. Please click here to view a larger version of this figure.
Table 1: Clinical features. Features included in the analysis are shown in bold. BMI: Body Mass Index, calculated as weight/height², with non-categorical results expressed as mean ± SD. Please click here to download this Table.
Table 2: High-intensity zone features. Non-categorical results are presented as mean ± SD. If HIZ features are absent, all seven values are recorded as 0. Please click here to download this Table.
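The zero-fill rule stated in this table can be sketched in pandas. The column names `Hizw`, `Hizl`, and `Hizarea` appear in the study, but the data values below are illustrative, and the remaining HIZ columns are omitted:

```python
# Zero-fill rule for absent HIZ: patients without a high-intensity zone get
# 0 for all HIZ measurements. Only three of the seven HIZ columns are shown.
import pandas as pd

hiz_cols = ["Hizw", "Hizl", "Hizarea"]  # extend with the remaining HIZ columns
df = pd.DataFrame({
    "Hizw":    [2.1, None],   # patient 0 has an HIZ, patient 1 does not
    "Hizl":    [3.4, None],
    "Hizarea": [7.1, None],
})
df[hiz_cols] = df[hiz_cols].fillna(0)
print(df)
```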
Table 3: Features in each group. The d0 group and d1 group did not undergo feature selection due to the limited number of features, while the d2 group and D group underwent feature selection; n→n indicates the change in the number of features in each group before and after selection. The second row shows the final features included in the model construction for each group. Please click here to download this Table.
Table 4: Model hyperparameters and performance. Please click here to download this Table.
Table 5: Research gaps and main contributions. Please click here to download this Table.
Supplementary Figure 1: SHAP value heatmaps showing feature contributions to model predictions for two instance subsets (left: training set; right: test set). Red indicates positive, blue negative, and white near-zero influence on the output. The top row displays the predicted values f(x). Features include HIZ imaging metrics (e.g., Hizw, Hizl, Hizarea) and radiomics features (e.g., original_firstorder_Kurtosis). "Sum of 6 other features" aggregates minor contributors not shown individually. Please click here to download this File.
Supplementary Figure 2: Confusion matrix for the group D RF test set. Please click here to download this File.
Supplementary Figure 3: Confusion matrix for the group D LOG test set. Please click here to download this File.
Supplementary Figure 4: Confusion matrix for the group D TREE test set. Please click here to download this File.
Supplementary Figure 5: Confusion matrix for the group D SVM test set. Please click here to download this File.
Supplementary Figure 6: Confusion matrix for the group D KNN test set. Please click here to download this File.
Supplementary Figure 7: Confusion matrix for the group d2 RF test set. Please click here to download this File.
Supplementary Figure 8: Confusion matrix for the group d2 LOG test set. Please click here to download this File.
Supplementary Figure 9: Confusion matrix for the group d2 TREE test set. Please click here to download this File.
Supplementary Figure 10: Confusion matrix for the group d2 SVM test set. Please click here to download this File.
Supplementary Figure 11: Confusion matrix for the group d2 KNN test set. Please click here to download this File.
Supplementary Figure 12: Confusion matrix for the group d1 RF test set. Please click here to download this File.
Supplementary Figure 13: Confusion matrix for the group d1 LOG test set. Please click here to download this File.
Supplementary Figure 14: Confusion matrix for the group d1 TREE test set. Please click here to download this File.
Supplementary Figure 15: Confusion matrix for the group d1 SVM test set. Please click here to download this File.
Supplementary Figure 16: Confusion matrix for the group d1 KNN test set. Please click here to download this File.
Supplementary Figure 17: Confusion matrix for the group d0 RF test set. Please click here to download this File.
Supplementary Figure 18: Confusion matrix for the group d0 LOG test set. Please click here to download this File.
Supplementary Figure 19: Confusion matrix for the group d0 TREE test set. Please click here to download this File.
Supplementary Figure 20: Confusion matrix for the group d0 SVM test set. Please click here to download this File.
Supplementary Figure 21: Confusion matrix for the group d0 KNN test set. Please click here to download this File.
Supplementary Table 1: Radiomics features. Please click here to download this File.
The D group model, which integrated clinical features, HIZ imaging features, and radiomic features, stabilized at 15 key features after feature selection and multiple iterations, and demonstrated superior average performance, stability, and predictive ability compared with the other groups. The random forest model outperformed the other algorithms in every group except d0; in the D group in particular, its ROC AUC values approached 1 with minimal difference between the training and test sets, establishing it as the best model. The d0 and d1 group models align more closely with routine clinical practice, providing clearer support for physicians' decision-making. Across the four groups, HIZ imaging features and radiomic features complemented each other and markedly improved DLBP diagnostic efficiency over subjective physician assessment, highlighting the random forest's advantage as an ensemble learning tool for complex classification tasks. However, significant gaps remain in non-invasive DLBP diagnosis and related fields32. To our knowledge, this study is the first to systematically compare the predictive efficacy of multiple imaging features for DLBP diagnosis and to develop a comprehensive imaging-based predictive model, providing a novel decision-making tool for precise clinical DLBP diagnosis (Table 5).
In DLBP diagnosis, first-order kurtosis contributed the most, reflecting abnormalities in the brightness distribution of lesion areas in medical images, consistent with Li et al.'s findings on the importance of image kurtosis in disc degeneration-related low back pain33. The random forest model in this study effectively handled noise and abnormal signals, identifying diagnostically valuable kurtosis features in complex imaging data. However, kurtosis is highly dependent on raw image quality, and low signal-to-noise ratios may limit its application. In recent years, Transformer-based deep networks have shown powerful denoising capabilities in medical imaging34 and could complement the random forest model used here; combining such feature engineering with multi-algorithm fusion could further enhance the stability and diagnostic accuracy of kurtosis features. HIZ-related features (e.g., Hizarea, Hizw, and Hizl) made significant positive contributions to model predictions, consistent with Wang et al.'s observation that a higher number of HIZ layers (larger Hizw) provides more reliable diagnostic information35. This study further found that Hizarea was positively correlated with DLBP probability, suggesting that clinical diagnosis should quantify HIZ features rather than merely noting their presence; accordingly, binary HIZ variables were excluded during feature engineering in favor of quantifiable HIZ features36. Notably, HIZ, as a specific imaging marker of disc pathology, represents an imaging dimension distinct from conventional T2 sequences. Multimodal data integration has shown significant advantages in medical imaging analysis; for example, multimodal MRI combined with deep learning has achieved more precise lesion identification in brain tumor segmentation37.
By incorporating HIZ spatial-dimension features into a machine learning framework, this study likewise embodies the concept of multimodal feature fusion, shifting the diagnostic paradigm from qualitative identification to quantitative comprehensive analysis and offering new methodological insights for radiomics applications in degenerative spinal diseases. Additionally, in the high-performance D group model with the complete radiomic feature set, age maintained a stable and significant predictive contribution, indicating that integrating clinical baseline information with radiomic features is the optimal predictive strategy. DePalma et al.38 noted that pain in younger patients with chronic low back pain is more likely to originate from the intervertebral disc than from other causes. However, because disc degeneration is an age-dependent process, the actual number of patients with discogenic low back pain (DLBP) increases with age. The data from this study support this view, showing a significant positive correlation between age and DLBP, consistent with the natural course of intervertebral disc degeneration39.
This study is the first to systematically apply radiomics and machine learning methods to DLBP diagnosis, providing empirical support for a paradigm shift from invasive to non-invasive, precise diagnosis32,40. De Simone et al.41 recently proposed a comprehensive perspective integrating anatomy, pathophysiology, clinical symptoms, and evaluation, ultimately assessing DLBP from the perspective of biomarkers and artificial intelligence to improve the diagnostic process. Our study responds to this need by constructing a predictive model with multidimensional feature fusion, achieving a safer and more accurate diagnostic method. Carragee et al.12 explored the value of discography in DLBP diagnosis but overlooked its invasiveness and prognostic impact. Peng et al.42 analyzed the pathological features of HIZ and discussed its clinical significance in DLBP diagnosis. Zhang et al.4 specifically discussed the clinical diagnostic process and future challenges of DLBP. All these studies emphasize the need for innovative, non-invasive, safer, and more accurate DLBP diagnostic methods rather than reliance on traditional invasive methods or subjective clinical assessments.
This study's method has significant clinical application value and promotion potential. Through SHAP interpretability analysis, clinicians can not only obtain diagnostic results but also understand the specific contribution weights of each feature to diagnostic decisions. For example, when the model indicates a high DLBP risk for a patient, physicians can see that the conclusion is primarily based on the patient's high kurtosis value (indicating abnormal disc signals), larger HIZ area (indicating the extent of annular tears), and age factors32,35. This transparent decision-making process enables physicians to combine patients' clinical symptoms and signs to make more precise and personalized treatment decisions, such as whether aggressive interventional therapy is needed, the duration of conservative treatment, or the timing of surgery. In terms of diagnostic process optimization, the model can be embedded in standardized clinical pathways for DLBP patients as a preliminary screening tool, prioritizing further examinations for high-risk patients and observing low-risk patients with conservative treatment to avoid over-medicalization43. Additionally, interpretive analysis of key features can guide personalized treatment; for instance, patients with large HIZ areas and high kurtosis may benefit more from anti-inflammatory treatment or minimally invasive interventions, while those with age-related degeneration as the primary feature may require more rehabilitation and lifestyle adjustments39,40,43. 
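The screening-pathway idea above can be sketched as a simple triage rule on the model's predicted DLBP probability. The thresholds below are hypothetical illustrations, not clinically validated cutoffs from the study:

```python
# Route patients by predicted DLBP probability, following the screening
# pathway described in the text. Thresholds are illustrative assumptions.
def triage(prob, high=0.8, low=0.3):
    """Map a predicted DLBP probability to a screening recommendation."""
    if prob >= high:
        return "prioritize further examination"
    if prob <= low:
        return "observe with conservative treatment"
    return "clinical review"

print(triage(0.92))  # high-risk patient
print(triage(0.15))  # low-risk patient
```

In practice, such thresholds would need to be calibrated against the model's sensitivity and specificity trade-off and validated prospectively before clinical use.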
The clinical application of this model will directly improve patient care quality by: (1) avoiding unnecessary invasive examinations, reducing medical risks and economic burden; (2) shortening diagnostic timelines, enabling earlier definitive diagnoses and targeted treatments; (3) preventing ineffective treatments or delays due to misdiagnosis by accurately identifying DLBP patients; (4) providing objective evidence for surgical indications, aiding patients in making informed treatment choices. Furthermore, the model can be used in clinical research to rapidly identify DLBP patient cohorts or perform stratified randomization, reducing case screening workload40,43. The study framework is highly portable and applicable to other complex diagnostic scenarios requiring multimodal feature fusion, such as other spinal diseases, joint diseases, or tumor malignancy assessments37. Notably, the simplified d0 and d1 group models have significant application potential in primary healthcare settings, improving diagnostic capabilities in regions with uneven medical resource distribution.
This study faces several validity threats that require careful consideration. First, HIZ and ROI measurements and semi-automatic delineation during data collection may involve subjective bias, despite being guided by experienced physicians and standardized workflows. Second, image heterogeneity due to differences in MRI equipment and scanning parameters may affect the model's generalizability. Third, all DLBP patients were surgically confirmed cases with longer disease duration and more severe symptoms, so the model's diagnostic efficacy for milder or occult DLBP cases may be limited, requiring caution in early clinical screening applications. Fourth, the study data came from a single hospital in China, and the demographic characteristics, disease spectrum, and imaging manifestations may have regional specificity, making the model's applicability in other hospitals, countries, or regions uncertain44. Finally, the lack of an external validation cohort means the model's broad applicability requires further verification. To address these limitations, several alternative research methods are worth exploring. To reduce subjective bias, fully automated segmentation techniques based on deep learning could eliminate human measurement errors. To mitigate image heterogeneity, advanced image standardization and domain adaptation techniques or heterogeneous training datasets across multiple devices could enhance model generalizability34. To address sample selection bias, patients at different disease stages should be included to build a full-spectrum dataset, and longitudinal study designs should track disease progression to inform early interventions. To overcome regional limitations, multicenter international collaborative studies could include patient data from diverse races and regions for multicenter model training. 
For external validation, data collection from external hospitals is underway for multicenter validation, and prospective clinical trials could assess the model's real-world performance. Beyond these targeted improvements, other alternative approaches exist for this study framework, such as end-to-end deep learning methods (CNN or Transformer34), which can capture more complex imaging patterns but require larger sample sizes and have lower interpretability; multitask learning frameworks could predict multiple clinical indicators simultaneously for more comprehensive decision support but require richer annotated data; integrating multiple interpretability methods (SHAP, LIME, attention mechanisms, etc.) could provide multidimensional model insights to enhance clinical trust.
This study developed the first non-invasive predictive model for discogenic low back pain (DLBP) based on random forest and SHAP interpretability analysis, integrating clinical features, HIZ imaging features, and radiomic features to achieve high diagnostic accuracy. Future research will advance in three dimensions: (1) establishing high-level evidence for clinical application through multicenter external validation and prospective randomized controlled trials; (2) expanding the model's functionality from a diagnostic tool to a comprehensive dynamic monitoring system covering early screening, disease progression prediction, and treatment response assessment; (3) integrating multidimensional data and advanced AI algorithms to develop an intelligent clinical decision support platform, advancing DLBP precision medicine from concept to clinical practice. The model's clinical value lies in multiple aspects: SHAP analysis provides transparent decision-making evidence, helping clinicians rapidly assess DLBP risk, understand the imaging basis of the disease, and formulate personalized treatment plans. Key features can directly guide treatment choices; for example, high kurtosis values suggest benefits from interventional therapy, and large HIZ areas indicate the need for proactive surgical intervention. Simplified models (d0/d1) are particularly suitable for screening and referral decisions in primary healthcare settings. By enhancing diagnostic transparency and accuracy, this tool promotes the development of DLBP precision medicine, with the potential to become a standard component of clinical practice, improving patients' quality of life and reducing healthcare burdens.
The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.
The authors thank the following funders for their financial support. This work was supported by the Science Foundation of Kangda College of Nanjing Medical University (Grant No. KD2024KYJJ292), the Nantong University Special Research Fund for Clinical Medicine (Grant No. 2024JY002), the Science and Technology Project of Nantong Municipal Health Commission (Grant No. MS2024045), the Science and Technology Projects in Jiangsu Province (Grant No. BE2023742), and the Project of Jiangsu Administration of Traditional Chinese Medicine (Grant No. MS2023113).
| Name | Company | Catalog Number / URL | Comments |
| --- | --- | --- | --- |
| 3D Slicer 5.6.1 | 3D Slicer | https://download.slicer.org/?version=5.6.1 | |
| Data storage | Zenodo | https://doi.org/10.5281/zenodo.17365220 | |
| GeForce RTX 3060 Laptop GPU | NVIDIA | N/A | |
| Ingenia CX 3.0T MRI machine | Philips | N/A | |
| Prisma 3.0T MRI machine | Siemens | N/A | |
| Python 3.7.9 | Python | https://www.python.org/downloads/release/python-379/ | |
| R 4.3.1 | R | https://www.r-project.org/ | |
| Verio 3.0T MRI machine | Siemens | N/A | |