Implementation of automated joint space identification improves bone segmentation accuracy
Given the heterogeneity of bone shape and architecture in complex structures such as the murine hindpaw, we build upon our systematic image processing algorithm12 through DL training predictions (blue) coupled with image processing steps for robust identification of inter-bone joint spaces in micro-CT datasets (Figure 1A-B; process described below and shown in Supplementary Figure 1). The identification of spaces between bones enabled precise bone separation and segmentation of individual hindpaw bones (separate colors; Figure 1C). For the DL component, the training and validation datasets (WT) consisted of equal ages (2-6 months of age, n=8 hindpaws per age) and sex (n=20 hindpaws per sex). The remainder of the WT hindpaws (n=44, from 2-8 months of age, excluding 6 months as all were used for training and validation) served as the test datasets to quantify the accuracy of bone segmentation (Figure 1D). There were 2 WT male hindpaws at 2 months and 2 WT female hindpaws at 3 months that were omitted due to imaging error (Supplementary Table 1).
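As an illustration only, the dataset allocation described above, together with the randomized 25% subvolume validation split noted in the Figure 1 legend, can be sketched in a few lines of Python. The subvolume identifiers, the helper names, and the random seed are hypothetical; the actual split was performed within the Amira training module, and this sketch does not reproduce it.

```python
import random

AGES_MONTHS = [2, 3, 4, 5, 6]   # training/validation ages stated in the text
HINDPAWS_PER_AGE = 8            # n=8 hindpaws per age
SUBVOLUMES_PER_HINDPAW = 3      # 3 subvolumes per hindpaw (Figure 1D)
VALIDATION_FRACTION = 0.25      # randomized 25% of subvolumes for validation


def build_subvolume_ids():
    """Enumerate one hypothetical ID per subvolume: (age, hindpaw index, patch index)."""
    return [
        (age, paw, patch)
        for age in AGES_MONTHS
        for paw in range(HINDPAWS_PER_AGE)
        for patch in range(SUBVOLUMES_PER_HINDPAW)
    ]


def split_train_validation(ids, fraction=VALIDATION_FRACTION, seed=0):
    """Shuffle a copy of the IDs and hold out `fraction` of them for validation."""
    rng = random.Random(seed)   # the seed is an assumption, used only for repeatability
    shuffled = ids[:]
    rng.shuffle(shuffled)
    n_val = round(len(shuffled) * fraction)
    return shuffled[n_val:], shuffled[:n_val]


subvolumes = build_subvolume_ids()
train_ids, val_ids = split_train_validation(subvolumes)
print(len(subvolumes), len(train_ids), len(val_ids))  # 120 90 30
```

The counts follow directly from the text: 5 ages x 8 hindpaws x 3 subvolumes gives 120 subvolumes, of which 30 are held out for validation.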
Along with implementation in WT hindpaws, we also tested the automated segmentation approach on hindpaws from TNF-Tg mice (n=56 male hindpaws, n=48 female hindpaws) with spontaneous inflammatory-erosive arthritis. There were 4 TNF-Tg female hindpaws at both 4 months and 5 months that were omitted due to imaging error or premature death prior to endpoint at 5 months (Supplementary Table 2). The novel segmentation algorithm automatically detected the joint spaces (blue, left) for individual bone separation (colors, right) across both sexes and genotypes (Figure 2A-D). For segmentation accuracy of individual bones shown in Supplementary Table 5 and Supplementary Table 6, WT outperformed TNF-Tg datasets for both males (WT 98.4% versus TNF-Tg 93.1%, p<0.0001) and females (WT 98.7% versus TNF-Tg 92.1%, p<0.0001). The source of error was demonstrated visually as incomplete closure of joint spaces (arrows in white dashed box), thus inadvertently over-connecting two distinct bones into a single segmentation (Figure 2C-D). These over-connected errors demonstrated in TNF-Tg hindpaws may represent sequelae of chronic damage leading to joint fusion, where the space between bones no longer exists. In fact, the difference in accuracy between WT and TNF-Tg datasets becomes more pronounced across time as the arthritic severity increases (Figure 2E-F), especially in the tarsal bones (Figure 2G-H, yellow = increased accuracy, green = decreased accuracy) that typically serve as reliable biomarkers for the progression of bone erosion23. However, compared to our prior SA segmentation approach, there was a remarkable improvement in dataset accuracy overall (Figure 2E-F; WT male: SA 79.39% ± 5.73% versus DL 98.16% ± 1.47%, p<0.0001; WT female: SA 79.16% ± 4.84% versus DL 99.19% ± 1.63%, p<0.0001), demonstrating the robust methodologic advancements both in automaticity and reliability. 
Thus, our novel strategic model for hindpaw bone segmentation using DL-facilitated joint space identification provides significantly increased segmentation accuracy in WT datasets (>98%) compared to prior SA methods (~79%), but with slightly reduced performance when applied to hindpaws with inflammatory-erosive arthritis (92%-93%).
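The over-connection error described above can be reproduced on a toy example. The following is a minimal sketch, assuming a 2D grid, 4-connectivity, and a simple flood fill; the published pipeline operates on 3D micro-CT volumes with additional image processing steps, so the function name `label_bones` and the masks below are purely illustrative. Two bones that touch at image resolution are separated only where the predicted joint space is removed, so a single missing joint-space pixel is enough to merge them into one label.

```python
from collections import deque


def label_bones(bone_mask, joint_mask):
    """Label 4-connected bone regions after removing predicted joint-space pixels."""
    rows, cols = len(bone_mask), len(bone_mask[0])
    labels = [[0] * cols for _ in range(rows)]
    next_label = 0
    for r in range(rows):
        for c in range(cols):
            if not bone_mask[r][c] or joint_mask[r][c] or labels[r][c]:
                continue
            # Start a new region and flood-fill all connected bone pixels.
            next_label += 1
            labels[r][c] = next_label
            queue = deque([(r, c)])
            while queue:
                y, x = queue.popleft()
                for ny, nx in ((y + 1, x), (y - 1, x), (y, x + 1), (y, x - 1)):
                    if (0 <= ny < rows and 0 <= nx < cols
                            and bone_mask[ny][nx] and not joint_mask[ny][nx]
                            and not labels[ny][nx]):
                        labels[ny][nx] = next_label
                        queue.append((ny, nx))
    return labels, next_label


# Toy data: two adjacent "bones" that appear as one bright block in the image.
bone = [[1] * 7 for _ in range(3)]
# A complete predicted joint space spans the full column between them.
joint_complete = [[1 if c == 3 else 0 for c in range(7)] for _ in range(3)]
# An incomplete prediction leaves a one-pixel gap in that column.
joint_gapped = [row[:] for row in joint_complete]
joint_gapped[1][3] = 0

_, n_complete = label_bones(bone, joint_complete)
_, n_gapped = label_bones(bone, joint_gapped)
print(n_complete, n_gapped)  # 2 1
```

With the complete joint space the two bones receive separate labels, whereas the gapped joint space yields a single over-connected label.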
Flexible application of the segmentation method to forepaws highlights pronounced joint destruction and bone fusions in TNF-Tg mice with a rapid reduction in segmentation accuracy
We further extended the application of the novel segmentation method to murine forepaws (n=55 WT male forepaws, n=29 WT female forepaws, n=54 TNF-Tg male forepaws, and n=50 TNF-Tg female forepaws) with unique bone size and anatomy. There was 1 forepaw at 4 months from the WT male, 1 forepaw at 4 months and 2 forepaws at 5 months for the WT female, 2 forepaws at 3 months for the TNF-Tg male, and 2 forepaws at 4 months and 4 forepaws at 5 months for the TNF-Tg female, which were omitted due to imaging error or premature death prior to the endpoint. In addition, there was a partial imaging error for 1 forepaw at 3 months for the WT female with the omission of DP-F3, PP-F3, DP-F4, and PP-F4 (Supplementary Table 3 and Supplementary Table 4). For orientation, we provide a model WT forepaw with each individual bone separated by color and bone-specific nomenclature indicated from different viewpoints (Figure 3). Prior investigation in TNF-Tg mice has primarily focused on the hindpaw, while here we demonstrate the architecture of murine forepaws in both WT and TNF-Tg mice. We particularly highlight the carpals (yellow dashed circle) and sesamoids (blue dashed circle) that exhibit visually profound erosive disease, especially in TNF-Tg females (Figure 4A-D). As such, comparison of hindpaw and forepaw segmentation accuracy showed marked reduction in forepaws (paw type effect p<0.0001) primarily driven by the steep decline of bone integrity with increased age and disease severity in TNF-Tg datasets (Figure 4E-F; paw x genotype effect p=0.0083; male forepaws: WT 87.29% ± 2.07% versus TNF-Tg 72.65% ± 11.70%, p<0.0001). Similar to hindpaws, the decline in TNF-Tg segmentation accuracy with aging and disease severity is more pronounced in carpals, along with the sesamoids (Figure 4G-H, Supplementary Table 7, and Supplementary Table 8). 
This regional bone pathology may be driven by enhanced erosive activity at the adjacent articulation of the MET-F and PP-F (metacarpophalangeal joint). Evaluation of error type revealed that TNF-Tg forepaws tend to exhibit a higher proportion of completely eroded bones compared to hindpaws (Supplementary Figure 2, red as missing). While certainly representative of progressive arthritic severity, the absence of bones in TNF-Tg forepaws could also highlight a limitation in image resolution. The severe erosions in TNF-Tg forepaws are further demonstrated by representative images across time that highlight the carpal region (white arrows) and the progressive complete dislocation of the paw from the forearm (yellow arrows) most notable in TNF-Tg females (Supplementary Figure 3). Thus, flexible application of the automated bone segmentation method to unique structures of the forepaw showed remarkable performance in WT datasets (~87%) with similar reduction in accuracy in TNF-Tg forepaws with inflammatory-erosive arthritis (67%-72%).
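To make the error categories concrete, the sketch below classifies each ground-truth bone as correct, over-connected, over-split, both, or missing, following the definitions given in the Supplementary Figure 2 legend, and then computes accuracy as bones correctly segmented divided by total bones. The input format (a mapping from bone name to the set of predicted material IDs overlapping that bone), the function names, and the example values are assumptions made for illustration; they do not describe how the errors were scored in this study.

```python
from collections import Counter


def classify_bones(overlaps):
    """Classify each ground-truth bone from the predicted materials overlapping it.

    overlaps: dict of bone name -> set of predicted material IDs (hypothetical input).
    """
    # Count how many ground-truth bones each predicted material touches.
    bones_per_material = Counter(m for mats in overlaps.values() for m in mats)
    result = {}
    for bone, mats in overlaps.items():
        if not mats:
            result[bone] = "missing"            # bone absent from the segmentation
            continue
        split = len(mats) > 1                   # 1 bone segmented as 2+ materials
        connected = any(bones_per_material[m] > 1 for m in mats)  # 2+ bones as 1 material
        if split and connected:
            result[bone] = "over-connected and over-split"
        elif connected:
            result[bone] = "over-connected"
        elif split:
            result[bone] = "over-split"
        else:
            result[bone] = "correct"
    return result


def accuracy_percent(result):
    """Segmentation accuracy: bones correctly segmented / total bones, in percent."""
    correct = sum(1 for label in result.values() if label == "correct")
    return 100.0 * correct / len(result)


# Illustrative values only; bone labels follow the nomenclature used in the text.
example = {"CAP": {1}, "HAM": {1}, "TRI": {2, 3}, "PP-F3": {4}, "S-F1": set()}
classified = classify_bones(example)
print(classified)
print(accuracy_percent(classified))  # 20.0
```

In this toy case the capitate and hamate share one material (over-connected), the triquetrum is divided into two materials (over-split), one sesamoid is missing, and only PP-F3 is correct.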
Data availability:
As described in the micro-CT image collection section, the hindpaw data was previously published12,23,35 and is publicly available at the Zenodo repository (https://doi.org/10.5281/zenodo.11191782)28. The data for accuracy quantification in the SA segmentation method for WT12 and TNF-Tg23 datasets was repurposed for direct comparison with the novel DL model described here. No specific data from the additional prior study was repurposed35, but the same hindpaw datasets that are publicly available28 were also utilized. Additional details on licensing and repurposing of data are provided below. For the purposes of the described study, the corresponding forepaw data has also been made publicly available at the Zenodo repository (https://doi.org/10.5281/zenodo.14865639)29.
The accuracy data for the SA segmentation method WT datasets12 was repurposed in Figure 2. Reuse of this material is protected by the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License https://creativecommons.org/licenses/by-nc-nd/4.0/legalcode. As authors of the referenced work, we retain the right to prepare other derivative works via Author’s Rights from Elsevier https://beta.elsevier.com/about/policies-and-standards/copyright. The datapoints have been revisualized for comparison with accuracy over time with TNF-Tg counterparts and directly compared to the novel DL method described here.
The accuracy data for the SA segmentation method WT and TNF-Tg datasets23 was repurposed for Figure 2, and the WT and TNF-Tg hindpaw datasets were further evaluated for volumetric measurements previously23. Reuse of the material is protected by the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited. The datapoints have been revisualized for assessment of accuracy over time and directly compared to the novel DL method described here.
The same publicly available WT and TNF-Tg hindpaw datasets28 were further utilized for bone volumetric measurements previously for novel comparisons with wheel running cohorts35. Reuse of the material is protected by the Creative Commons Attribution 4.0 International License (https://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The same publicly available datasets28 have been utilized in the current work, but without any specific utilization or modification of previously published datapoints.

Figure 1: Automated joint space detection by strategic image processing and deep learning predictions for bone segmentation. Murine micro-CT datasets with visualization from (A) the dorsal (top) and plantar (bottom) surfaces were processed for (B) subsequent automated identification of joint spaces (blue) using a DL model (described in Supplementary Figure 1) developed from gold-standard bone segmentations12,23. (C) Final successful bone separation (bone-specific colors) was accomplished through an additional combination of image processing steps, including a black top-hat12, structure enhancement37, and membrane enhancement with tensor voting38 for robust joint space identification to label individual bones. (D) Training and validation (n=40 hindpaws) of the DL component were performed with WT murine hindpaws of equal age (from 2-6 months, n=8 hindpaws each timepoint) and sex (n=20 hindpaws male/female) distribution, with a randomized 25% of subvolumes used for validation (3 subvolumes per hindpaw, total 120 subvolumes). The remaining WT hindpaws (n=44) were evaluated as test cases for further analysis. The combination of the DL model and image processing algorithms was evaluated using previously published and publicly available datasets23,28.

Figure 2: Implementation of automated joint space identification with deep learning facilitation improves bone segmentation accuracy. (A-B) Following the development of the automated joint space detection, we applied the DL model (left: blue joint spaces; right: bone-specific segmentation colors) to the remaining test cases for WT males and females. (C-D) We also assessed performance on age-matched cohorts (males: 2-8 months; females: 2-5 months) of TNF-Tg mice with progressive inflammatory-erosive arthritis associated with early onset mortality in females32. Inset images demonstrate high-magnification segmentation errors (dashed boxes) where disconnections in predicted joint spaces (white arrows) lead to a leak in bone separation, resulting in over-connected bone segmentation errors. (E-F) Note that the 6-month male timepoint was omitted as all WT datasets were utilized for training and validation, so they were not included in the DL testing cohort. Compared to our previous SA segmentation algorithms12,23, the segmentation accuracy (bones correctly segmented / total bones) was remarkably improved for both WT and TNF-Tg datasets with the DL approach, regardless of sex (average accuracy lines: solid black = DL WT, dashed black = DL TNF-Tg, solid grey = SA WT, dashed grey = SA TNF-Tg). However, the accuracy of TNF-Tg segmentations notably declined with time and associated progressive joint damage compared to WT, although it continued to outperform the SA method. (G-H) Heatmaps of accuracy specified to bone compartments (T = tarsals, MT = metatarsals, PP = proximal phalanges, DP = distal phalanges, S = sesamoids) demonstrate the increased error rate in TNF-Tg mice as predominately localized to the tarsal region (light (yellow) = high (100%), dark (purple) = low (20%) accuracy). As mentioned, inset images (C-D) highlight the source of error with disconnected joint spaces (arrows, left image) leading to over-connected bones (colors, right image).
In fact, the errors were predominantly over-connected (2+ bones segmented as 1 material; noted in Supplementary Figure 2), which may represent the pathologic process of joint fusions with increasing arthritic severity. Statistics: 3-way mixed-effects analysis (SA versus DL; method x genotype x time; E-F), 2-way mixed-effects analysis (WT vs TNF; genotype x time; E-H); ****p<0.0001, **p<0.01, *p<0.05 (interaction effects); data presented as mean ± standard deviation. Sample sizes: n=34 hindpaws WT male (n=2 at 2-months, n=4 at 3-months, n=6 at 4-5-months, n=0 at 6-months [all data used for training and validation], n=8 at 7-8-months), n=10 hindpaws WT female (n=4 at 2-months, n=2 at 3-5-months), n=56 hindpaws TNF-Tg male (n=8 at 2-8-months), and n=48 hindpaws TNF-Tg female (n=14 at 2-3-months and n=10 at 4-5-months). Data used in this figure has been modified from prior studies12,23.

Figure 3: Flexible application of joint space deep learning segmentation to other complex structures highlights murine forepaw bone anatomy. Next, we evaluated the potential for the joint space segmentation DL model to automatically separate bones in additional complex structures beyond the hindpaw. The segmentation method was implemented in the corresponding forepaw micro-CT datasets visualized from the (A) dorsal, (B) plantar, (C) lateral, and (D) medial surfaces with colors representing individual segmented bones. We identified the potential for accurate segmentation of the forepaw bones, including distinct carpals, metacarpals (#, MET-F), proximal phalanges (^, PP-F), distal phalanges (~, DP-F), sesamoids (dashed circles, S-F), and claws (*) with bone-specific labeling corresponding with known forepaw anatomy40.

Figure 4: TNF-Tg mice exhibit pronounced forepaw joint destruction and bone fusions with a rapid reduction in segmentation accuracy. (A-B) Given the complexity and small architecture of murine forepaws highlighted by dorsal (left) and plantar (right) visualization of micro-CT images from WT male and female mice, (C-D) the anatomy and associated arthritis in TNF-Tg mice have not been previously assessed. Application of our novel joint space DL approach provided an initial opportunity to evaluate these complex structures by reducing the analytical challenges with achievement of >85% accuracy of WT forepaws, although with deficient accuracy compared to hindpaws (average accuracy lines: solid blue = WT hindpaw, dashed blue = TNF-Tg hindpaw, solid red = WT forepaw, dashed red = TNF-Tg forepaw). (E-F) In addition, TNF-Tg forepaws showed a rapid and dramatic decline in segmentation accuracy due to errors localized to the carpals (yellow dotted circles in A-D) and sesamoids (blue dotted circles in A-D) over time. (G-H) The regional reductions in segmentation accuracy are shown by heatmaps (light (yellow) = high (100%), dark (purple) = low (20%) accuracy) of bone compartments (C = carpals, MC = metacarpals, PP = proximal phalanges, DP = distal phalanges, S = sesamoids). Note that the 6-month male timepoint was omitted in (E) as all WT hindpaw datasets were utilized for training and validation, so they were not included in the DL testing cohort. Statistics: 3-way mixed-effects analysis (hindpaw vs forepaw, WT vs TNF; paw type x genotype x time, interaction effects reported; E-F), 2-way mixed-effects analysis with Sidak's multiple comparisons (WT vs TNF; genotype x time; G-H); ****p<0.0001, **p<0.01, *p<0.05; data presented as mean ± standard deviation.
Sample sizes: n=55 forepaws WT male (n=8 at 2-3- and 5-8-months, n=7 at 4-months), n=29 forepaws WT female (n=8 at 2-3-months, n=7 at 4-months, n=6 at 5-months), n=54 forepaws TNF-Tg male (n=8 at 2- and 4-8-months, n=6 at 3-months), and n=50 forepaws TNF-Tg female (n=14 at 2-3-months, n=12 at 4-months, and n=10 at 5-months). DL hindpaw data (E-F) reproduced from Figure 2E-F for additional comparison with DL forepaw data.
Supplementary Figure 1: Development and training of the joint detection deep learning model. (A) Ground truth joint regions were obtained from initial ground truth bone segmentations by an automatic recipe using Amira, which combines label expansion, extraction of label interfaces, masking, and dilation. (B) For each of the 20 training micro-CT datasets (40 hindpaws), 6 subvolumes of 200 x 200 voxels were manually extracted from tarsals, distal phalanges, and background regions, evenly split between left and right paws (3 patches per hindpaw). The resulting 120 subvolumes were then used as input for a 3D segmentation Amira training module along with the corresponding labeled joint regions as the ground truth target. A randomized subset of 25% of the patches was used for validation to control model overfitting during training.
Supplementary Figure 2: Distinct distribution of error types between hindpaws and forepaws. Similar to the SA segmentation algorithm previously developed12,23, the joint space DL model produced the greatest proportion of errors by over-connecting bones (green, 2+ bones segmented as 1 material), most notable in the (A-D) hindpaws or (E-F) WT forepaws. As noted in Figure 2, over-connected errors will occur if there is a gap in the detected joint space that may occur for various reasons, including greater bone proximity than image resolution, motion artifact blurring the joint space, or bone remodeling in the context of arthritis, leading to joint fusions. (G-H) Interestingly, TNF-Tg forepaws exhibit a remarkably increased proportion of missing bones (red), meaning the bone was completely absent from the segmentation. These errors are likely attributed to a combination of severe erosions and deficiencies in image resolution, given the relatively decreased size of forepaw bones, especially carpals and sesamoids, as the predominant source of error (Figure 4), compared to those of hindpaws. Additional types of errors include over-split (blue, 1 bone segmented as 2+ materials) or both over-connected and over-split (orange). Pie charts represent proportions of total errors attributed to specific error subtypes.
Supplementary Figure 3: Evaluation of progressive TNF-Tg forepaw arthritis with severe bone erosions and joint dislocations. To visualize the structural changes in forepaws over time, we provided representative images of the dorsal surface from (A) WT male, (B) TNF-Tg male, (C) WT female, and (D) TNF-Tg female forepaws over time from 2-5 months (left to right) to particularly highlight the carpal region (white arrows). Note the severe bone erosions and remodeling that occur by approximately 4 months in females and 5 months in males. These time periods predate the typical onset of severe bone erosions in hindpaws at approximately 5 months in females and 7-8 months in males23. (E) A side view of TNF-Tg female forepaws is also shown to demonstrate the progressive dislocation of the entire paw from the forearm (yellow arrows) associated with the joint destruction.
Supplementary Table 1: Sample sizes of WT hindpaws for DL training, validation, and methodological testing. Sample sizes in the number of hindpaws are provided across age (months 2-8) and organized by datasets used for DL training/validation, total methodologic testing, or those omitted either due to imaging error, severe motion artifact, or death prior to the scheduled micro-CT scan. Black cells from months 6-8 for females indicate the planned termination of scans after 5 months due to early mortality of TNF-Tg experimental counterparts.
Supplementary Table 2: Sample sizes of TNF-Tg hindpaws for methodological testing. Sample sizes in the number of hindpaws are provided across age (months 2-8) and organized by datasets used for total methodologic testing or those omitted due to imaging error, severe motion artifact, and/or death prior to the scheduled micro-CT scan. Black cells from months 6-8 for females indicate the planned termination of scans after 5 months due to early mortality of TNF-Tg female mice.
Supplementary Table 3: Sample sizes of WT forepaws for methodological testing. Sample sizes in the number of forepaws are provided across age (months 2-8) and organized by datasets used for total methodological testing or those omitted due to imaging error, severe motion artifact, and/or death prior to the scheduled micro-CT scan. Black cells from months 6-8 for females indicate the planned termination of scans after 5 months due to early mortality of TNF-Tg experimental counterparts. *At 3 months for WT females, n=1 forepaw had omitted DP-F3, PP-F3, DP-F4, and PP-F4 due to imaging error, although the remainder of the forepaw was evaluated.
Supplementary Table 4: Sample sizes of TNF-Tg forepaws for methodological testing. Sample sizes in the number of forepaws are provided across age (months 2-8) and organized by datasets used for total methodological testing or those omitted due to imaging error, severe motion artifact, and/or death prior to the scheduled micro-CT scan. Black cells from months 6-8 for females indicate the planned termination of scans after 5 months due to early mortality of TNF-Tg female mice.
Supplementary Table 5: Individual bone accuracy of male hindpaws. To identify the particular bones that reduce the segmentation accuracy in TNF-Tg versus WT hindpaws, details are provided on the number of bones segmented correctly, incorrectly, and the percent correct relative to the total bones evaluated in male mice. Within the tarsal region where the primary deficits occur (Figure 2), the calcaneus (CALC), intermediate cuneiform (unfused, INT), and navicular/lateral cuneiform (unfused) demonstrated the most prominent decrease in accuracy for TNF-Tg hindpaws. Statistics: Fisher's exact test; *p<0.05, **p<0.01, ***p<0.001, ****p<0.0001.
Supplementary Table 6: Individual bone accuracy of female hindpaws. To identify the particular bones that reduce the segmentation accuracy in TNF-Tg versus WT hindpaws, details are provided on the number of bones segmented correctly, incorrectly, and the percent correct relative to the total bones evaluated in female mice. Given the utilization of datasets for DL training and validation, along with the decreased timeframe to 5 months for comparison with TNF-Tg mice that exhibit early mortality32, the total number of allocated DL testing hindpaws for WT females limits the capacity for individual bone comparisons to explain the overall decreased accuracy in TNF-Tg datasets. Statistics: Fisher's exact test; ****p<0.0001.
Supplementary Table 7: Individual bone accuracy of male forepaws. To identify the particular bones that reduce the segmentation accuracy in TNF-Tg vs WT forepaws, details are provided on the number of bones segmented correctly, incorrectly, and the percent correct relative to the total bones evaluated in male mice. Within the carpal and sesamoid regions where the primary deficits occur (Figure 4), the capitate (CAP), triquetrum (TRI), centrale (unfused, CENT), scaphoid/lunate (SCAPHATE), trapezoid (ZOID), and sesamoids 2-10 demonstrated the most prominent decrease in accuracy for TNF-Tg forepaws. Of note, the accuracy of sesamoids 1 and 2 is deficient for both the WT and TNF-Tg datasets. Interestingly, metacarpal 1 actually showed improvements in segmentation accuracy in TNF-Tg mice, potentially due to close articulations with adjacent bones leading to over-connected errors that are mitigated with arthritic erosions. Statistics: Fisher's exact test; *p<0.05, **p<0.01, ***p<0.001, ****p<0.0001.
Supplementary Table 8: Individual bone accuracy of female forepaws. To identify the particular bones that reduce the segmentation accuracy in TNF-Tg vs WT forepaws, details are provided on the number of bones segmented correctly, incorrectly, and the percent correct relative to the total bones evaluated in female mice. Within the carpal and sesamoid regions where the primary deficits occur (Figure 4), the capitate (CAP), hamate (HAM), triquetrum (TRI), and sesamoids 1-10 demonstrated the most prominent decrease in accuracy for TNF-Tg forepaws. Of note, the accuracy of sesamoids 1 and 2 is deficient for both the WT and TNF-Tg datasets. Statistics: Fisher's exact test; *p<0.05, ***p<0.001, ****p<0.0001.
Supplementary File 1: Joint segmentation recipe for deep learning model training. Series of embedded steps to extract segmented joint spaces from gold-standard pre-segmented micro-CT hindpaws that were used to train the DL model for joint space identification.
Supplementary File 2: Bone segmentation recipe using image processing with deep learning facilitation. Series of embedded steps to transform original micro-CT data into segmentations of individual bones using image processing steps combined with the output of DL joint space identification to guide bone separation.
Supplementary File 3: Deep learning prediction weights. File used as input for weights during deep learning prediction of joint space segmentation.
Supplementary File 4: Deep learning prediction architecture. File used as input for architecture during deep learning prediction of joint space segmentation.
Supplementary File 5: Deep learning Python script. File used as the Python script for deep learning prediction of joint space segmentation.
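For readers who wish to reproduce the style of per-bone comparison reported in Supplementary Tables 5-8, the following is a minimal sketch of a two-sided Fisher's exact test on a 2x2 table of correctly versus incorrectly segmented bones in WT versus TNF-Tg datasets. It is written with the Python standard library only, uses the common convention of summing all tables that are no more probable than the observed one, and is not the statistical software used in this study. The function name and the counts in the example are hypothetical and are not taken from the supplementary tables.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2 = a + b, c + d
    col1 = a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of x in the top-left cell with fixed margins.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum every table that is no more probable than the observed one;
    # the small relative tolerance guards against floating-point ties.
    return sum(p for p in (prob(x) for x in range(lo, hi + 1))
               if p <= p_obs * (1 + 1e-9))


# Hypothetical counts: rows = WT, TNF-Tg; columns = correct, incorrect.
p_value = fisher_exact_two_sided(95, 5, 80, 20)
print(f"p = {p_value:.4g}")
```

Exact integer binomial coefficients keep the calculation stable for the modest bone counts typical of a single bone across one cohort; for very large tables a dedicated statistics package would be a more practical choice.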