This research presents a multimodal, AI-driven methodology for objectively measuring engineering skills in medical 3D modeling that incorporates geometric, behavioral, and cognitive markers. Bayesian fusion with real-time visual analytics allows for accurate skill rating that has been validated across a wide range of tasks and participant skill levels.
Artificial intelligence (AI)-assisted 3-D modelling has become central to modern healthcare, yet the field still lacks a repeatable and scalable way to evaluate engineering competence for clinically used digital models. Existing quality checks tend to focus on final mesh outputs without considering the modelling workflow, the operator's behavior, or interactions with AI assistance. This gap holds back reproducibility and effective training and limits regulatory compliance. This work proposes the first end-to-end visual analytics framework designed to measure and communicate engineering skill during health-centric 3-D modeling tasks. The proposed framework defines a domain-specific construct of skill, ranging from geometric accuracy to operational proficiency and cognitive adaptability, and quantifies these dimensions through a set of interpretable behavioral, geometric, and physiological indicators. It integrates, within a four-layer architecture, the capture of multimodal data, real-time feature extraction through lightweight deep-learning models, Bayesian evidence fusion for continuous competence estimation, and intuitive visual feedback modules. The system maintains calibration for new tools and modelling scenarios through an online active-learning mechanism that minimizes expert annotation requirements. The framework was tested with 60 participants performing two clinically realistic modeling tasks concerning the design of orthopedic implants and vascular reconstruction. Results showed strong agreement with expert assessments, clear discrimination between skill levels, and meaningful prediction of future modeling quality. Python was used for the correlation, regression, validity-testing, and visualization tasks. Usability testing indicated high acceptance; participants particularly valued the clarity of the visual feedback that supported self-directed improvement. Ultimately, the proposed framework reshapes 3-D modelling assessment from a static outcome inspection to a dynamic, process-centric review underpinned by multimodal evidence. The result is a pragmatic, interpretable, and scalable solution for training, certification, and regulatory oversight of health-critical 3-D modelling workflows, and the basis for future human-AI collaboration and competency analytics in medical design.
The swift market uptake of AI-driven 3-D modeling in healthcare is changing how implants, guides, and anatomical reconstructions are created, but the field still lacks a reliable method to evaluate the engineering skill behind these models. The dominant practice today relies heavily on end-product inspections, which, although useful for uncovering geometric errors, do not capture the modeling workflow, user-AI interaction, or moment-by-moment decision-making involved in their creation. Consequently, clinical teams often cannot determine whether a model has been produced by competent, repeatable processes or by fortunate trial and error1. This gap is of increasing importance. Hospitals, regulators, and 3-D printing programs now require evidence of consistent quality across an entire workflow, not simply a correct final mesh, because modelling errors can affect surgical safety, device fit, and manufacturing traceability. No current method captures modelling skill in an integrated, systematic, and scalable way that spans its behavioral, cognitive, and procedural aspects2, and no operational definition of "workflow competence" for anatomy-centric, AI-augmented modelling has been offered3. This gap between regulatory expectation and daily practice is the first motivation for the present work. A second motivation is pedagogical. Health-tech start-ups and university hospitals train newcomers by pairing them with senior modelers for 6-18 months. The approach is effective but does not scale: the global installed base of AI-enabled medical CAD seats grew 340% between 2020 and 2024, whereas the pool of qualified mentors remained flat4. A visual analytics system that can watch, diagnose, and coach in real time would uncouple training throughput from mentor availability and shorten the learning curve for non-traditional entrants such as radiographers or nurses who increasingly perform front-end segmentation.
Third, there is a cognitive-science argument. Medical modelling is a high-stakes variant of visual problem-solving that blends spatial reasoning, anatomical knowledge, and software dexterity. Understanding how these resources are orchestrated, when they break down, and when they are outsourced to AI requires moment-by-moment capture of eye, hand, and physiological signals. The same data stream that elucidates human expertise can also be used to assess it, creating a virtuous loop between research and application. Despite rich literatures on (i) procedural skill assessment in surgery and (ii) geometric quality metrics for CAD meshes5, the intersection of the two, visual evaluation of engineering skill in health-centric, AI-assisted 3-D modelling, remains under-explored. Specifically, existing assessment rubrics lack a domain-specific skill construct and therefore treat anatomy-critical features, such as vascular bifurcation angles, in the same manner as generic mechanical features, such as bolt holes. Anatomy-informed constraints, such as trabecular infill porosity or osteotomy curvature limits, are absent from scoring functions. Regarding fragmented sensing, process-capture studies in CAD, surgical simulation, or VR gaming rarely synchronize kinematics, gaze, physiology, and voice at sub-second resolution6. Consequently, cross-modal validation is impossible, and the incremental value of each channel remains unknown7. Concerning static scoring, current tools typically produce a binary pass/fail label after the session. They lack a principled method to express uncertainty or update beliefs about a learner's competence as new evidence arrives, a critical requirement for Bayesian certification protocols8. Finally, regarding feedback discontinuity, even when rich data are collected, they are often stored for post-hoc review or displayed as raw curves. A theory-aligned, role-specific visual layer that maps low-level signals to actionable advice in real time does not currently exist9,10. These four shortcomings translate into practical pain points: 30% of 3-D printed anatomical guides still require manual re-work, and the FDA recalls 2-3 patient-specific implant batches per quarter because the originating engineer misunderstood a tolerance annotation11. An end-to-end framework that closes the loop from multi-modal sensing to interpretable feedback is therefore needed.
Medical 3-D modelling differs from industrial CAD in three ways that directly affect skill evaluation: anatomical geometry is non-parametric, error tolerance is near-zero, and regulatory traceability is compulsory12. The dominant assessment paradigm is still summative expert review of the final polygonal mesh or STL file. Typical rubrics inspect geometric deviation from the gold-standard template, topological manifoldness, and the presence of physician-specified landmarks. Because inspection is manual, inter-rater Cohen's κ rarely exceeds 0.6513. Commercial platforms such as Mimics Innovation Suite and 3D Slicer provide pass/fail flags for mesh quality, but the flags are deterministic (triangle aspect ratio, intersecting faces) and ignore the actual modelling trajectory14. AI-driven 3D visualization enhances anatomical understanding, but effective use depends heavily on the operator's modelling skill. While VDR provides accurate volumetric images, there is no structured method to assess how users interpret, refine, or manipulate these models. This study addresses that gap by evaluating competence in AI-assisted 3D modelling15. The authors record scroll and tool invocation events together with 3-D cursor coordinates at 30 Hz. They use a long short-term memory (LSTM) auto-encoder to detect "inefficient tool paths" in mechanical CAD, achieving an F1 score of 0.81 against human labels. Expanding the analysis to visual attention, the TAP-3D dataset16 contains gaze data from 120 engineers performing SolidWorks exercises. Fixation-saccade sequences are encoded as 128-D word2vec vectors; clustering reveals five distinct visual strategies with significant correlation to modelling time (r = 0.73). Beyond user interaction metrics, Mandal, A. and Ghosh, A. R.17 estimate mechanical attributes of complex tissue-engineering scaffolds by using 3D CNNs trained on 3D digital topographies derived from CAD models. The approach accurately predicts scaffold behavior and offers a more rapid AI-aided design route for porous and metamaterial structures. However, the reproducibility of these predictions is influenced by the diversity of the dataset, especially for highly irregular geometries or manufacturing defects.
Recent work by Schönfeldt, N., Hancock, P. and Birt, J.18 has extended data capture to include voice and think-aloud protocols, specifically by transcribing verbal commentary into 200-D BERT embeddings and correlating them with debugging success in software tasks. Although medical modelers talk less than programmers, pilot recordings in our lab show that utterances such as "this ridge looks noisy" coincide with 40% of subsequent undo events, suggesting predictive value. These studies demonstrate that procedural data can be captured unobtrusively and that machine-learning models can map the data to performance constructs. None of them, however, targets health-specific 3-D modelling, where anatomy variability and AI assistance add extra layers of complexity. Visual analytics has been widely adopted for formative feedback in surgery simulation and robotic training. The MiEye system19 overlays color-coded error heat-maps on recorded laparoscopic video; trainees improve 26% faster than with binary feedback. For CAD education, Chou, Y. H. et al.20 propose a radar chart that contrasts student indices (feature count, rebuild time, rebuild errors) with class percentiles; students with access to the chart show a 0.35 standard-deviation increase in post-test scores. Real-time visualization is still underexplored. Kim, I.-J. and Quteineh, H. H.21 render live kinematic data as translucent ribbons in VR, but their metric is purely positional error; cognitive load is not addressed. To our knowledge, no existing framework combines geometric, behavioral, and physiological streams into a single, interpretable, real-time visual feedback layer tailored for AI-assisted medical 3-D modelling. The review reveals four specific shortcomings that this research aims to address. (i) Lack of a health-specific skill construct: current rubrics treat medical modelling as generic CAD; they ignore anatomy-informed constraints such as porous-bone infill or vascular bifurcation angles. (ii) Fragmented sensing: studies usually exploit a single modality (kinematics, gaze, or physiology); multi-modal synchronization at the sub-second level is rarely attempted, and cross-modal validation is absent. (iii) Static scoring: Bayesian updating or online confidence tracking has not been applied to modelling skill; hence, evaluators can neither express uncertainty nor incorporate new evidence as the task unfolds. (iv) Feedback discontinuity: even when rich data are collected, they are either stored for post-hoc review or displayed as raw curves; there is no principled visual layer that maps low-level signals to actionable, theory-aligned feedback in real time. These gaps justify the development of an end-to-end framework that (i) defines a multi-dimensional skill construct grounded in health-application requirements, (ii) fuses heterogeneous procedural data through calibrated Bayesian inference, and (iii) renders the evolving belief state via interactive, role-specific visualizations. In a practical sense, the proposed framework provides hospitals with a real-time monitoring tool for modelling quality, reduces reliance on long expert review cycles, and enables safer clinical use of patient-specific devices. Training centers could use the framework to provide structured and data-driven feedback that facilitates learner development and supports the establishment of standardized benchmarks of competency across cohorts.
For regulators, the proposed framework offers an auditable and transparent mechanism for documenting workflow competence, aligning with emerging requirements for the continuous validation of medical 3-D printing processes. Building on these motivations, this study designs and validates a visual analytics framework that quantifies engineering skill in AI-assisted, health-centric 3-D modelling. Specifically, the framework aims to define a multi-dimensional skill construct grounded in porous-bone infill, vascular angles, and other anatomy-related tolerances; to synchronize gaze, hand movement, physiological signals, and AI-interaction data at sub-second resolution; to fuse these heterogeneous data streams through Bayesian updating in order to generate an evolving competence belief with explicit confidence; to render this belief through real-time radar charts, heat maps, and 3-D annotations for immediate coaching or certification purposes; and to maintain an evaluation accuracy of at least 90% via active-learning calibration that adapts to new printers, AI models, or clinical rules without requiring full retraining.
The present study addresses two central questions: whether multimodal data fusion improves the accuracy of skill evaluation relative to expert assessment, and whether Bayesian updating enables the maintenance of evaluation accuracy without requiring full model retraining.
This study involved human participants and was reviewed and approved by the Ethics Committee on Human Research Protection, Xianda College of Economics and Humanities, Shanghai International Studies University (Approval No. 2025XD1221). All participants signed informed consent forms and completed a 30 min pre-experiment training: experimental group (EG) participants learned to interpret the framework's visual feedback (radar charts, heatmaps), while control group (CG) participants received only platform operation training. To ensure data collection precision, the pre-experiment phase also included a 5 min calibration of the eye trackers and physiological sensors.
Study objectives and protocol overview
The objective of this study was to create a closed-loop, end-to-end system that integrates multimodal acquisition, feature extraction, Bayesian fusion, and visual feedback to evaluate engineering skills in an interpretable manner. Figure 1 displays the protocol diagram for the evaluation of engineering skills in health applications based on artificial intelligence-assisted 3D modeling.
Framework overview and hierarchical logic
The visual evaluation framework for engineering skills in AI-assisted 3D modeling health applications proposed in this study is constructed based on the closed-loop logic of "data acquisition - feature extraction - evidence fusion - visual feedback". Its core objective is to achieve end-to-end conversion from multi-modal raw data to interpretable skill evaluation results22. To ensure the reliable extraction of multimodal features for evaluating engineering skills, the framework simultaneously captures interaction, physiological, gaze, gesture, and voice signals with millisecond-level temporal synchronization. The framework adopts a hierarchical architecture consisting of four core layers: the Multi-modal Data Acquisition Layer (L1), the AI Feature Extraction Layer (L2), the Evidence Fusion and Scoring Engine (L3), and the Visual Feedback Layer (L4). Each layer realizes bidirectional interaction through standardized data interfaces, and a framework calibration module (L5) is introduced to ensure the dynamic validity of evaluation results. Figure 2 shows the end-to-end pipeline, starting with multimodal data (gaze, kinematics, physiology, voice) acquisition and proceeding through AI-based feature extraction, Bayesian scoring, and real-time visual and audiovisual feedback. The layers are interactive, with Layer 5 providing continuous calibration. This architecture allows for coordinated processing and validated skill estimation while participants engage in live 3-D modelling tasks.
Equation (1) presents the Bayesian confidence update formula:

P(θ|D) = [P(D|θ) × P(θ)] / P(D)   (1)
Where: P(θ) = prior probability of skill level θ (based on historical evaluation data); P(D|θ) = likelihood of observing the multi-modal feature data D under skill level θ; P(D) = marginal probability of D; P(θ|D) = posterior probability of skill level θ after fusing data D, i.e., the final confidence score. This framework used a structured Delphi and AHP approach23 to weight indicators. A panel of ten subject matter experts (orthopedic surgeons, biomedical engineers, and experienced medical CAD modelers) participated in three rounds of Delphi, rating the importance of each indicator on a 9-point Likert scale until expert scoring reached stability (coefficient of variation < 0.15)24. Then, using the converged scores from the experts, a pair-wise comparison matrix was built and confirmed against the AHP consistency criterion of a consistency ratio (CR) < 0.1. If the CR exceeded 0.1, the matrix was readjusted. The consistency ratios of the AHP pair-wise comparison matrices were between 0.03 and 0.08, indicating strong internal logical consistency throughout the multi-level hierarchical approach. The design of each layer in the framework adheres to the "modularization - extensibility" principle. For instance, Layer L1 can expand the physiological data collection dimension by adding biosensors; Layer L2 can adapt to skill evaluation requirements of different health applications by replacing pre-trained models; Layer L4 supports custom visualization dimensions based on user roles. This provides flexibility for subsequent applications in medical 3D modeling education and professional skill certification scenarios.
Multi-modal data acquisition layer
3D interaction event stream
This module collects engineers' operational data from AI-assisted 3D modeling platforms via software plug-ins (16), covering both discrete operation events and continuous parameter data. Discrete events include model selection, vertex adjustment, and AI tool invocation, while continuous parameters involve modeling speed, undo/redo counts, and AI suggestion acceptance rates. Each event is recorded with a 13-digit timestamp and corresponding 3D model coordinate system values (x, y, z). For example, when adjusting the curvature of an orthopedic implant model, the plug-in logs the operation's start and end times, adjusted vertex coordinates, and curvature variations, thereby forming a structured event log (see Table 1 to Table 3 for data fields). These data are subsequently processed using bespoke algorithms to calculate domain-specific geometric, efficiency, and cognitive indicators, enabling fine-grained characterization of modelling behaviours in health-related applications.
Eye movement and gesture signals
Eye movement data is acquired using a desktop eye tracker, focusing on fixation points and saccade parameters in the 3D modeling interface, which reflect the engineer's cognitive attention distribution.
Gesture signals are captured by a depth camera, which extracts 22 upper-limb key joint coordinates to identify typical modeling gestures25. The camera converts 3D gesture coordinate data into normalized vector sequences to mitigate individual body size differences.
Data synchronization for this module relies on the framework's unified hardware trigger (10 Hz timestamp from the 3D modeling software), with synchronization error controlled within ±50 ms (calculated via Equation (2)) to ensure temporal correlation between eye/gesture behaviors and modeling operations26. Equation (2) presents the data synchronization error calculation:

E_t = |t_device − t_software|   (2)

Where: E_t = synchronization error between device acquisition time (t_device) and software timestamp (t_software); t_device = actual acquisition time of the eye tracker/depth camera; t_software = trigger timestamp from the 3D modeling software.
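As a minimal illustration of this check, the sketch below computes the per-sample error of Equation (2) with NumPy (part of the Python stack used in this study); the timestamp values and function name are illustrative placeholders, not part of the protocol.

```python
import numpy as np

def sync_error_ms(t_device: np.ndarray, t_software: np.ndarray) -> np.ndarray:
    """Equation (2): E_t = |t_device - t_software| per paired sample (ms)."""
    return np.abs(t_device.astype(np.int64) - t_software.astype(np.int64))

# Hypothetical 13-digit (millisecond) timestamps: device samples vs. the
# 3D modeling software's trigger clock.
t_software = np.array([1718000000000, 1718000000100, 1718000000200])
t_device   = np.array([1718000000012, 1718000000095, 1718000000261])

errors = sync_error_ms(t_device, t_software)
print(errors)                 # [12  5 61] ms
print(np.all(errors <= 50))   # False -> the third sample needs re-alignment
```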
Physiological and voice data
Physiological data is collected via wearable sensors. An electromyography (EMG) sensor (1000 Hz sampling frequency)27 is attached to the engineer's forearm to record muscle tension during fine operations, reflecting operational stability. An electrocardiography (ECG) sensor (250 Hz sampling frequency)28 is worn on the chest to collect heart rate variability (HRV) indicators, which characterize cognitive load. Voice data is captured by a directional microphone (44.1 kHz sampling frequency)29, recording engineers' verbal comments during modeling to supplement cognitive state analysis. All physiological and voice data undergo preprocessing before storage in the time-stamped multi-modal database. Figure 3 illustrates the physical experimental hardware configuration. Sensor data from the EMG/ECG wearables, the microphone, the eye tracker, and a depth camera are all connected to a 3-D modelling workstation. Before temporal data is saved in the multimodal database, a synchronization module aligns all streams (within a margin of ±50 ms). This hardware configuration ensures reliable, temporally aligned input into the feature extraction and scoring pipeline.
AI Feature extraction layer
Geometric quality indicators
This module extracts quantitative features reflecting the accuracy and rationality of 3D models from the 3D interaction event stream and the final model output, serving as the core basis for evaluating engineers' "modeling precision" skills. To assess structural accuracy, anatomical congruence, and 3D model completeness using deviation-, topology-, and template-based feature metrics, three key indicators are designed, with calculation methods and physical meanings detailed below:
Model deviation rate (MDR)
This measures the geometric deviation between the engineer's modeled result and the standard template. The deviation is calculated by comparing the Euclidean distance of key feature points between the two models. The formula is shown in Equation (3):
MDR = (1/n) Σ_{i=1}^{n} ‖P_{i,engineer} − P_{i,standard}‖   (3)

Where: n = number of pre-defined key feature points (set to 50-100 based on model complexity); P_{i,engineer} = 3D coordinates of the i-th feature point in the engineer's model; P_{i,standard} = 3D coordinates of the i-th feature point in the standard template; ‖·‖ = Euclidean distance operator.
A lower MDR indicates higher consistency with the standard model, with a threshold of ≤ 5% for qualified health application models. Modeling competencies were assessed based on time efficiency, redundant movements, gesture stability, and usage patterns of the proposed AI tools during real-time 3D operational scenarios.
Topological consistency (TC)
This evaluates whether the 3D model's topological structure meets medical application requirements. The indicator is calculated by counting the number of topological defects (D) and the total number of topological elements (T, sum of vertices, edges, and faces) in the model, as shown in Equation (4):
TC = (1 − D/T) × 100%   (4)

For example, if a knee joint model contains 2 non-manifold edges (D = 2) and 1200 total topological elements (T = 1200), then TC = (1 − 2/1200) × 100% = 99.83%. A TC ≥ 98% is required for medical models to avoid manufacturing risks.
Feature completeness (FC)
This assesses whether the engineer has fully implemented all mandatory features of the health application model. The indicator is determined by comparing the number of completed mandatory features (Fcompleted) with the total number of mandatory features (Ftotal) defined in the task requirements:
FC = (F_completed / F_total) × 100%   (5)
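The three geometric indicators reduce to short vectorized computations. The sketch below is a minimal Python rendering of Equations (3)-(5); the function names are illustrative, and the final line reproduces the knee-joint TC example above.

```python
import numpy as np

def mdr(p_engineer: np.ndarray, p_standard: np.ndarray) -> float:
    """Equation (3): mean Euclidean deviation over n key feature points (n x 3 arrays)."""
    return float(np.linalg.norm(p_engineer - p_standard, axis=1).mean())

def tc(defects: int, total_elements: int) -> float:
    """Equation (4): topological consistency = (1 - D/T) x 100%."""
    return (1.0 - defects / total_elements) * 100.0

def fc(completed: int, total: int) -> float:
    """Equation (5): feature completeness = (F_completed / F_total) x 100%."""
    return completed / total * 100.0

print(round(tc(2, 1200), 2))  # 99.83, as in the worked example
```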
Behavioral efficiency indicators
This module processes 3D interaction event streams and gesture signals to extract features reflecting the efficiency of engineers' modeling operations, evaluating their "operation proficiency" skills. Cognitive demands associated with modeling were measured using synchronized physiological, gaze, and EMG signals. These metrics track attention distribution across time and workload fluctuations during the modeling process.
Modeling time efficiency (MTE)
This compares the engineer's actual modeling time (Tactual) with the average time of senior experts for the same task (Texpert), normalized to a 0-100 score:
MTE = 100 − [(T_actual − T_expert) / T_expert] × 50   (6)

For example, if an engineer takes 45 min to complete a task that experts average 30 min on, MTE = 100 − [(45 − 30)/30] × 50 = 75, indicating moderate efficiency.
Redundant operation rate (ROR)
This counts the proportion of unnecessary operations in the modeling process. It is calculated as the ratio of redundant operation count (Oredundant) to total operation count (Ototal):
ROR = (O_redundant / O_total) × 100%   (7)
Redundant operations are identified by the AI algorithm via rule matching and gesture consistency analysis.
AI tool utilization rate (ATUR)
This measures the engineer's ability to leverage AI-assisted functions to improve efficiency. It is defined as the ratio of AI tool invocation count (IAI) to total tool invocation count (Itotal):
ATUR = (I_AI / I_total) × 100%   (8)
A higher ATUR indicates better integration of AI tools into the modeling process, which is a key skill for modern medical 3D modeling.
Gesture smoothness (GS)
This evaluates the stability of the engineer's upper limb movements during modeling. The indicator is calculated by the standard deviation of the speed of 22 upper-limb key joints (vj) over time:
GS ∝ 1/σ_v, where σ_v = √[(1/22) Σ_{j=1}^{22} (v_j − v̄)²]   (9)
A lower standard deviation (and thus higher GS) indicates smoother gestures, reflecting higher operational proficiency. These indicators are extracted in real time during the modeling process, with the AI algorithm updating the indicator values every 5 min and storing them in the feature database for subsequent fusion.
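For reference, the four behavioral efficiency indicators of Equations (6)-(9) can be sketched in a few lines of Python; the function names are illustrative, and the final line reproduces the worked MTE example above.

```python
import numpy as np

def mte(t_actual_min: float, t_expert_min: float) -> float:
    """Equation (6): 100 - ((T_actual - T_expert) / T_expert) x 50."""
    return 100.0 - (t_actual_min - t_expert_min) / t_expert_min * 50.0

def ror(redundant_ops: int, total_ops: int) -> float:
    """Equation (7): redundant operation rate in percent."""
    return redundant_ops / total_ops * 100.0

def atur(ai_calls: int, total_calls: int) -> float:
    """Equation (8): AI tool utilization rate in percent."""
    return ai_calls / total_calls * 100.0

def joint_speed_std(joint_speeds: np.ndarray) -> float:
    """Equation (9): sigma_v over the 22 joint speeds; GS rises as this falls."""
    return float(joint_speeds.std())

print(mte(45, 30))  # 75.0, matching the worked example
```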
Cognitive load indicators
This module analyzes eye movement signals and physiological data to quantify the engineer's cognitive load during modeling, evaluating their "cognitive adaptability" skills. Multimodal indicators were weighted using a Delphi-AHP approach and Bayesian evidence updating to output real-time, probabilistic engineering skill scores. Three key indicators are developed:
Fixation dispersion (FD)
This reflects the distribution of the engineer's visual attention. It is calculated as the Euclidean distance between the maximum and minimum fixation point coordinates ((x_max, y_max) and (x_min, y_min)) in the 3D modeling interface over a 2 min window:

FD = √[(x_max − x_min)² + (y_max − y_min)²]   (10)
A larger FD indicates scattered attention, often associated with high cognitive load.
Heart rate variability (HRV) - SDNN
SDNN (Standard Deviation of Normal-to-Normal Intervals) is extracted from ECG data to measure the fluctuation of the engineer's heart rate. A lower SDNN indicates higher sympathetic nerve activity, reflecting increased cognitive load. The AI algorithm first filters ECG noise (via wavelet transform) and then calculates SDNN using Equation (11):
SDNN = √[(1/(N−1)) Σ_{k=1}^{N−1} (RR_k − RR̄)²]   (11)

Where: N = number of normal heartbeats in a 5 min window; RR_k = time interval between the k-th and (k + 1)-th normal heartbeats; RR̄ = average RR interval in the window.
Electromyography (EMG) - Root mean square (RMS)
RMS of forearm EMG signals reflects muscle tension: higher RMS indicates increased muscle stiffness due to cognitive stress. The calculation formula is:
RMS = √[(1/m) Σ_{t=1}^{m} EMG(t)²]   (12)
Where: m = number of EMG signal samples in a 1 min window, and EMG(t) = amplitude of the EMG signal at time t. A summary diagram has been created to assist with conceptual clarity and facilitate replication of the research. It illustrates the hierarchical relationship among the ten evaluation indicators and the three main constructs behind the proposed framework: geometric quality, operational efficiency, and cognitive adaptability. This diagram depicts how multi-modal input signals are converted into measurable indicators that are subsequently fused through a Bayesian approach to create a composite engineering skill score. This overview conveys a compact version of the evaluation logic, increases the methodological transparency of this study, and provides guidance for future educational and regulatory use. Figure 4 depicts the hierarchical structure of the proposed engineering skill evaluation framework, which organizes the ten multimodal indicators into the three fundamental constructs: geometric quality, operational efficiency, and cognitive adaptability. These constructs contribute to the overall skill score used to evaluate AI-assisted 3D modeling performance.
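A minimal sketch of the three cognitive-load computations (Equations (10)-(12)) is given below, assuming windowed NumPy arrays of fixation coordinates, RR intervals, and EMG amplitudes; the names are illustrative placeholders.

```python
import numpy as np

def fixation_dispersion(x: np.ndarray, y: np.ndarray) -> float:
    """Equation (10): distance between extreme fixation coordinates in a 2 min window."""
    return float(np.hypot(x.max() - x.min(), y.max() - y.min()))

def sdnn(rr_intervals_ms: np.ndarray) -> float:
    """Equation (11): sample standard deviation of RR intervals in a 5 min window."""
    return float(rr_intervals_ms.std(ddof=1))

def emg_rms(emg_amplitudes: np.ndarray) -> float:
    """Equation (12): root mean square of EMG samples in a 1 min window."""
    return float(np.sqrt(np.mean(emg_amplitudes ** 2)))
```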
Evidence fusion and scoring engine
Indicator weight assignment (Delphi + AHP)
This module addresses the problem of differentiated importance among multi-dimensional features by combining the Delphi method (expert consensus) and the Analytic Hierarchy Process (AHP) to assign weights to the ten extracted indicators (three geometric quality indicators, four behavioral efficiency indicators, and three cognitive load indicators). The weighted indicators are then represented in visual, interpretable formats, such as radar charts, heatmaps, and annotated playbacks, for biomedically relevant diagnostic inference.
Delphi consensus stage
A panel of ten experts assessed the importance of each indicator using a 9-point scale (1 = "least important", 9 = "most important") across three rounds of anonymous questionnaires. Following each round, statistical feedback-comprising mean scores and coefficients of variation (CV)-was provided to the panel to facilitate score adjustment. Convergence was defined as a CV ≤ 0.15 for each indicator. Notably, consensus for indicators such as the Model Deviation Rate (MDR) and AI Tool Utilization Rate (ATUR) was typically achieved by the second round, attributed to their direct clinical relevance.
To ensure technical robustness, the protocol standardized multi-modal data alignment (synchronization tolerance ≤ 50 ms; 30 Hz unified trigger clock) and indicator calculation methods. For instance, MDR was derived from 50-100 anatomical key points, while the Topological Consistency (TC) threshold was set at > 98% for defect-free elements. These thresholds were justified against clinical tolerance standards, such as a < 5% deviation for porous bone and a < 3° error for vascular bifurcation angles. Furthermore, the methodology incorporated Bayesian scoring with Gaussian likelihood modeling (using calibrated σ ranges for distinct skill tiers) and utilized a CNN-LSTM feature extractor trained with a batch size of 32, a learning rate of 1 × 10⁻⁴, and 120 epochs.
AHP Hierarchy establishment
A three-level hierarchy is constructed: Target Layer (A): Engineering skill comprehensive evaluation; Criterion Layer (B): 3 first-level indicators (B1: Geometric Quality, B2: Behavioral Efficiency, B3: Cognitive Load); Indicator Layer (C): 10 second-level indicators (C1: MDR, C2: TC, ..., C10: EMG RMS).
Pairwise comparison and consistency check
Based on Delphi consensus results, a pairwise comparison matrix is constructed for the Criterion Layer and Indicator Layer. Taking the Criterion Layer as an example, the matrix is defined as:
B = [a_ij]_{3×3}, with a_ii = 1 and a_ji = 1/a_ij   (13)

Where: a_ij is the importance ratio of B_i relative to B_j. The weight vector of the Criterion Layer (W_B = [w_B1, w_B2, w_B3]) is calculated via the eigenvector method. To avoid logical contradictions, the Consistency Ratio (CR) is checked using Equation (14):
CR = CI / RI, with CI = (λ_max − n) / (n − 1)   (14)

Where: λ_max = maximum eigenvalue of the matrix; n = number of indicators; RI = random consistency index (for n = 3, RI = 0.58). The result is acceptable if CR < 0.1.
Through this process, typical weights are obtained for the three criteria (Geometric Quality, Behavioral Efficiency, and Cognitive Load); at the Indicator Layer, the weight of MDR (C1) is usually the highest (~0.25), while that of EMG RMS (C10) is the lowest (~0.05).
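The eigenvector weighting and consistency check can be reproduced with standard linear algebra. The sketch below uses a hypothetical 3 × 3 criterion-layer matrix; only the RI entry for n = 3 (0.58) comes from the text.

```python
import numpy as np

RI = {3: 0.58, 4: 0.90, 5: 1.12}  # random consistency indices (standard AHP table)

def ahp_weights(matrix):
    """Return eigenvector weights and CR = [(lambda_max - n)/(n - 1)] / RI (Eq. 14)."""
    eigvals, eigvecs = np.linalg.eig(matrix)
    k = np.argmax(eigvals.real)
    lam_max = eigvals[k].real
    w = np.abs(eigvecs[:, k].real)
    w /= w.sum()                         # normalize to a weight vector
    n = matrix.shape[0]
    cr = (lam_max - n) / (n - 1) / RI[n]
    return w, cr

# Hypothetical pairwise judgments for B1 (geometric), B2 (behavioral), B3 (cognitive).
B = np.array([[1.0, 2.0, 3.0],
              [1/2, 1.0, 2.0],
              [1/3, 1/2, 1.0]])
w, cr = ahp_weights(B)
print(np.round(w, 3), round(cr, 3))  # weights sum to 1; accept if CR < 0.1
```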
Bayesian confidence update
This module fuses the weighted indicators to generate the final skill confidence score, addressing the "uncertainty of single-feature evaluation" by updating prior probabilities with real-time multi-modal data. The process is based on Equation (1) (Bayesian formula) and is implemented in two steps:
Step 1: Prior probability initialization (P(θ))
The skill level is divided into 5 grades: θ₁ (Novice), θ₂ (Beginner), θ₃ (Intermediate), θ₄ (Advanced), and θ₅ (Expert). The prior probability is initialized using historical data: if the system has evaluated 500 engineers, and 100 are classified as θ₃, then P(θ₃) = 100/500 = 0.2. For new users without historical data, a uniform prior (P(θ_i) = 0.2 for all i) is adopted.
Step 2: Posterior probability calculation (P(θ|D))
The multi-modal feature data D is the weighted sum of the 10 normalized indicators:

D = Σ_{k=1}^{10} w_Ck × C_k^norm   (15)
Where: w_Ck is the weight of the k-th indicator (from the Delphi-AHP weighting above), and C_k^norm is the normalized value of the k-th indicator ([0,1] range).
The likelihood probability is modeled using a Gaussian distribution, assuming that the feature data D of engineers at skill level θ_i follows N(μ_i, σ_i²). These parameters are calibrated using 200 labeled samples.

Finally, the posterior probability is calculated via Equation (1), and the skill grade with the maximum posterior probability is the real-time evaluation result. For example, if an engineer's D = 0.72 and the posterior probability of θ₄ is higher than that of the other grades, they are classified as "Advanced".
To visualize the confidence update process, Figure 5 shows how an initial prior over skill grades is updated into a posterior distribution using Bayesian inference, with real-time signals as evidence. As the task is observed, the posterior mass concentrates on the skill grade best supported by the evidence. Validation tests indicated stable convergence to the expert-assigned skill grades, and the Bayesian scores correlated strongly with expert ratings, supporting the trusted use of Bayesian updating in the workflow.
The output of this layer is a comprehensive skill score (0-100, mapped from the maximum posterior probability) and a confidence interval, which is transmitted to the Visual Feedback Layer for intuitive presentation.
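The scoring step is a direct application of Equations (1) and (15). The sketch below assumes hypothetical calibrated Gaussian parameters (μ_i, σ_i) for the five grades; under those assumptions it reproduces the D = 0.72 → "Advanced" example.

```python
import numpy as np
from scipy.stats import norm

# Hypothetical calibrated likelihood parameters for grades theta_1..theta_5.
mu    = np.array([0.25, 0.40, 0.55, 0.70, 0.85])
sigma = np.array([0.08, 0.08, 0.07, 0.06, 0.05])
prior = np.full(5, 0.2)  # uniform prior for a new user

def posterior(d: float, prior: np.ndarray) -> np.ndarray:
    """Equation (1): posterior over grades given the fused score D of Equation (15)."""
    likelihood = norm.pdf(d, loc=mu, scale=sigma)  # P(D | theta_i)
    unnorm = likelihood * prior
    return unnorm / unnorm.sum()                   # divide by the marginal P(D)

# D itself would come from Equation (15): d = weights @ indicators_norm.
post = posterior(0.72, prior)
print(np.round(post, 3))       # mass concentrates on theta_4
print(int(post.argmax()) + 1)  # 4 -> "Advanced"
```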
Visual feedback layer
Real-time radar chart
This module converts the 10 normalized indicators and the 3 first-level criterion scores into a radar chart, enabling intuitive comparison of engineers' skill strengths and weaknesses. The radar chart is designed with 7 axes (3 for first-level criteria: B1-B3; 4 for core second-level indicators: C1 = MDR, C4 = ATUR, C7 = FD, C9 = SDNN, selected for their high weight and interpretability), with each axis range normalized to [0,100] (mapped from the [0,1] normalized values via Equation (16)). Expert annotations and active-learning updates (described below) continually improve measurement accuracy as the framework accumulates health-modelling context.
Equation (16) Indicator Value Mapping for Radar Chart:
V_radar = C_k^norm × 100   (16)
Where: Vradar = value displayed on the radar chart axis; Cknorm = normalized value of the k-th indicator ([0,1] range).
The radar chart updates in real time (every 5 min) and includes two reference benchmarks: a standard benchmark line (dashed line, V_radar = 80, representing the "Advanced" skill level threshold) and an expert average line (dotted line, calculated from 50 senior engineers' historical data). Figure 6 shows an engineer's profile across key indicators (geometric quality, cognitive load, tool-use efficiency, and fixation dispersion) compared with the corresponding expert benchmarks. As part of the feedback layer, the fused metric scores are translated into an interpretable profile. User studies indicated that the chart helped users quickly find competence gaps while promoting structured, data-driven improvement during modelling tasks.
Temporal heatmap
This module visualizes the dynamic change trend of skill indicators over the entire modeling process, helping identify time periods where engineers face difficulties. The heatmap uses a 2D grid: the x-axis represents time intervals (10-minute segments, totaling 12 segments for a 2-hour task), and the y-axis represents 5 key indicators (C1 = MDR, C4 = ATUR, C6 = GS, C7 = FD, C9 = SDNN). The color intensity of each grid cell corresponds to the indicator's normalized value, with a color scale from blue (low value, poor performance: C_k^norm < 0.5) to red (high value, good performance: C_k^norm > 0.8), as defined in Equation (17):
Equation (17) presents the color intensity calculation for the heatmap:

I_color = 255 × [C_k^norm − min(C_k^norm)] / [max(C_k^norm) − min(C_k^norm)]   (17)
Where: Icolor = RGB color intensity value (0-255, 0=blue, 255=red); min(Cknorm) and max(Cknorm) = minimum and maximum normalized values of the indicator across all time intervals.
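The color mapping of Equation (17) amounts to per-indicator min-max scaling onto 0-255. The sketch below applies it row-wise to a hypothetical 5 × 12 grid of normalized indicator values; an actual display would pass the result to Matplotlib's imshow or a Plotly heatmap.

```python
import numpy as np

def color_intensity(values: np.ndarray) -> np.ndarray:
    """Equation (17): per-row min-max scaling of indicator time series onto 0-255."""
    vmin = values.min(axis=1, keepdims=True)
    vmax = values.max(axis=1, keepdims=True)
    return (255 * (values - vmin) / (vmax - vmin)).astype(np.uint8)

# Hypothetical grid: 5 indicators (rows) x 12 ten-minute segments (columns).
grid = np.random.default_rng(0).random((5, 12))
print(color_intensity(grid))
```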
Figure 7 illustrates time-resolved changes in cognitive load, fixation dispersion, gaze stability, and decision-latency metrics across 10 min segments. Note that the highlighted regions indicate high cognitive demand or workflow disruptions. This temporal view is generated directly from the synchronized multimodal streams and supported by expert annotation, highlighting its role in detecting modelling fatigue, attentional shifts, and task-specific difficulties.
3D playback annotation
This module overlays skill evaluation annotations on the 3D model's operation playback, enabling step-by-step analysis of the engineer's modeling behavior. The playback speed is adjustable (0.5×-2× real time) and synchronizes with the multi-modal data timestamp. Three types of annotations are embedded: (i) geometric error annotations: red wireframes highlight model regions where MDR (C1) exceeds the 5% qualified threshold, with text labels showing the deviation value; (ii) efficiency warning annotations: yellow exclamation marks appear at the location of redundant operations, with pop-up windows showing the operation type; (iii) cognitive load annotations: green/red status bars in the corner indicate real-time cognitive load (based on C9: SDNN and C10: EMG RMS), with red meaning "High Load" (SDNN < 50 ms) and green meaning "Normal Load" (SDNN > 100 ms).
Framework calibration and iteration
Expert annotation interface
This module provides a human-in-the-loop interface for senior experts to correct and label the framework's evaluation results, addressing the "drift of evaluation accuracy" caused by changes in health application scenarios. The interface is designed with three core functions: result correction, indicator reweighting, and annotation storage.
Result correction function
Experts first review the visual feedback results and the framework's comprehensive skill score (0-100). If the score deviates from expert judgment, the expert can manually adjust the score and record the reason. The correction magnitude is quantified via Equation (18):
S_corrected = α × S_expert + (1 − α) × S_framework   (18)

Where: S_corrected = final corrected score; S_framework = initial score output by the framework; S_expert = expert's manual score; α = correction weight (set to 0.8 by default, ensuring expert judgment dominates while retaining framework rationality).
Indicator reweighting function
For scenario-specific calibration, experts can adjust the weight of individual indicators via a sliding bar (0-0.5 range). The interface automatically recalculates the consistency ratio (CR) after adjustment (using Equation (14)) and alerts experts if CR ≥ 0.1 (illogical weight distribution), ensuring the adjusted weights remain mathematically valid.
Annotation storage function
All corrected scores, adjusted weights, and expert comments are stored in an annotated database with time stamps and scenario tags. This database serves as the training data for the subsequent online active learning module, with a typical data schema shown in Table 2.
Online active learning update
This module uses the annotated data to iteratively optimize the framework's core algorithms (AI feature extraction models in L2 and Bayesian scoring engine in L3), enabling the framework to adapt to new health application scenarios without full retraining. The update process follows a query-by-committee (QBC) strategy, divided into three stages:
Stage 1: Uncertainty calculation
Three parallel "committee models" are deployed (all based on the same architecture as the L2 feature extraction model but trained on different historical datasets):
Model A: Trained on orthopedic implant modeling data;
Model B: Trained on organ reconstruction data;
Model C: Trained on dental implant modeling data.
For each unlabeled sample (engineer's modeling data), the committee models output three predicted skill scores. The uncertainty of the sample is calculated via the coefficient of variation (CV) of the three scores:
CV = std(S_A, S_B, S_C) / S̄   (19)

Where: S_A, S_B, S_C = scores predicted by the three committee models; std(·) = standard deviation; S̄ = average of the three scores.
A higher CV indicates greater disagreement among the committee models (higher uncertainty), meaning the sample is more valuable for annotation.
Stage 2: Sample selection
The framework selects the top 10% of samples with the highest CV and sends them to the expert annotation interface for labeling. This active selection strategy reduces the number of samples requiring expert annotation (compared to random selection) by 40-60%, improving calibration efficiency.
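Stages 1-2 of the QBC strategy can be sketched as follows; the committee score matrix is a hypothetical placeholder, and the selection fraction is a parameter (10% in the protocol).

```python
import numpy as np

def committee_cv(scores: np.ndarray) -> np.ndarray:
    """Equation (19): CV = std(S_A, S_B, S_C) / mean, per unlabeled sample."""
    return scores.std(axis=1) / scores.mean(axis=1)

def select_for_annotation(scores: np.ndarray, fraction: float = 0.10) -> np.ndarray:
    """Return the indices of the top `fraction` most uncertain samples."""
    cv = committee_cv(scores)
    k = max(1, int(round(fraction * len(cv))))
    return np.argsort(cv)[::-1][:k]

# Hypothetical committee outputs (rows = samples; columns = models A, B, C).
scores = np.array([[78, 80, 79],
                   [55, 72, 40],
                   [90, 88, 91],
                   [62, 50, 75]])
print(select_for_annotation(scores, 0.25))  # -> [1], the most disputed sample
```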
Stage 3: Model retraining and update
The newly annotated samples are used to fine-tune the framework's algorithms:
For L2 feature extraction models: The model is retrained with a learning rate of 1e-5 (10% of the initial training rate) using the annotated geometric error data, updating only the last two layers to avoid overfitting.
For the L3 Bayesian scoring engine: the Gaussian distribution parameters (μ_i, σ_i²) for each skill level are updated using the annotated score data, with the mean update shown in Equation (20):

μ_i^new = β × μ_i^old + (1 − β) × S̄_θi   (20)

Where: μ_i^new = updated mean of skill level θ_i; μ_i^old = original mean before the update; S̄_θi = average corrected score of annotated samples labeled as θ_i; β = forgetting factor (set to 0.7, balancing historical data and new annotations).
The updated framework is deployed online immediately after retraining, with a performance check before deployment: if the average absolute error between the framework's new output and expert annotations is ≤3 (reduced by ≥20% from the previous version), the update is confirmed; otherwise, the retraining process is repeated with additional annotated samples.
This iterative calibration mechanism ensures the framework's evaluation accuracy remains above 90% even as health application 3D modeling technologies evolve, maintaining long-term adaptability.
The results assess the framework's validity, reliability, and usability through experimental evaluation of skill discrimination, construct validity, predictive validity, and the effectiveness of real-time feedback.
Experimental design
The experiments were performed in a Python environment on a computer with a 12-core Intel processor, 32 GB of RAM, and 1 TB of SSD storage. NumPy and Pandas were used for data synchronisation. For feature extraction, the OpenCV and SciPy libraries were utilised. The CNN-LSTM models were executed with PyTorch, Bayesian updating was performed using PyMC, and Matplotlib and Plotly supported the real-time visual analytics. This controlled environment provided consistent multimodal data capture and model execution conditions, permitting reproducible validation of the framework's feature extraction and scoring components.
Participants and grouping
To validate the framework's effectiveness in evaluating engineers of varying proficiency, 60 participants were recruited using a stratified sampling strategy covering the five defined skill grades (Novice-Expert). This approach was adopted to ensure a cohort representative of different levels of expertise and to assess framework performance through stratified sampling and controlled subgroup assignment, thereby enabling valid comparisons. The participant selection criteria and grouping details are as follows:
Participant recruitment criteria
All participants met three basic requirements: (1) familiarity with at least one AI-assisted 3D modeling platform (Mimics, Blender Medical, or 3D Slicer); (2) no history of neuromuscular diseases (to avoid interference with physiological data collection); (3) normal or corrected-to-normal vision (to ensure eye movement data validity). Additionally, participants were classified into 5 skill groups based on their professional background and experience, with the number of participants per group shown in Table 3.
For example, the Expert group comprised chief engineers and academics with more than 10 years of experience leading AI-assisted modeling project development.
Grouping method
Participants in each skill group were randomly divided into two subgroups-the Experimental Group (EG) and Control Group (CG)-to eliminate confounding variables. The EG used the proposed framework for real-time skill feedback during modeling, while the CG received no feedback (only basic task instructions). This design enables comparison of two key outcomes: (1) whether the framework's evaluation results align with expert judgments (validity verification); (2) whether real-time feedback improves modeling performance (auxiliary validation of framework value).
Health application modeling tasks
Two representative health application scenarios were chosen, spanning different modeling complexities, to test the flexibility of the framework. The two tasks were conducted on the Mimics AI platform (version 25.0) with an identical time limit (120 min), and the task parameters were adjusted to activate the multi-modal data acquisition and feature extraction modules of the framework. This design was intended to test the framework's adaptability to clinically representative modelling tasks that vary in shape complexity, cognitive load, and tool-integration complexity.
Task 1: Orthopedic implant modeling (low-medium complexity)
The purpose of this task was to simulate a femoral stem implant using a standard template compliant with ISO 7206-4:2019, with the assistance of AI-supported tools for segmentation and geometric optimization. The modeled implant was required to include five critical features: a stem taper angle of 50° ± 0.5°, a proximal porous structure with a pore diameter of 500-800 µm, a distal fixation groove with a depth of 2 mm, a collar thickness of 3 mm ± 0.2 mm, and a weight-reduction hole with a diameter of 8 mm. Participants were required to apply at least three AI-based functions during the task, including automatic bone segmentation, AI-driven curvature optimization, and AI-based defect detection. During this task, the framework primarily activated geometric quality indicators and behavioral efficiency indicators, as accurate geometric control and effective integration of AI tools are essential for successful implant modeling.
Task 2: Liver vascular reconstruction (high complexity)
This task required participants to reconstruct the hepatic portal vein and hepatic vein system from a CT scan dataset (DICOM format, 512 × 512 × 120 slices) using AI-assisted thresholding and vascular tracking tools. Key specifications: vascular branch completeness ≥ 90% (relative to the standard template based on clinical imaging data); an average of no more than 2 topological errors (non-manifold edges); and stable operation during the 40-80 min period (the high cognitive load phase) to test the cognitive load measures. Evaluation triggers: this task emphasizes cognitive adaptability (driven by the complex vascular branching) and real-time operational stability, activating all three indicator types in the AI feature extraction layer.
Task execution and data recording
Both tasks were performed by each participant in a quiet laboratory (temperature 22 ± 2 °C, humidity 50 ± 5%) to prevent environmental interference. The experimental equipment consisted of a 3D modeling workstation, wearable sensors, a desktop eye tracker, and a depth camera.

All multimodal data streams were logged with 13-digit timestamps (synchronized per Equation (2)), and the framework automatically produced skill assessment outcomes every 10 min for EG participants. Each participant received the same task instruction manual (with templates and timing instructions) to guarantee task consistency, and external references were prohibited during performance. The 3D modeling platform automatically recorded the task completion status, which is used as ground truth in the subsequent validity analysis.
Psychometric validation of the framework
Concurrent validity
Concurrent validity checks the agreement between the framework's skill evaluation results and independent expert ratings (gold standard) gathered at the same time. This section applies the Pearson correlation test and the Kappa consistency test to measure that agreement, with all data obtained from the two modeling tasks. This analysis was conducted to assess the correlation and consistency between framework-generated scores and expert judgment across modelling tasks. Table 4 is structured to provide clarity on the verification logic, indicators, and results.
Construct validity
Construct validity confirms that the framework's evaluation results correspond with the theoretical engineering skill construct. This part employs exploratory factor analysis (EFA) to extract latent factors from the 10 indicators and determine their consistency with the theoretical construct.
EFA implementation steps
Data preparation: the normalized indicator values of the 60 participants (Task 1 and Task 2) were used, and the KMO (Kaiser-Meyer-Olkin) test and Bartlett's sphericity test were performed to confirm the data's suitability for EFA. Factor extraction: factors with eigenvalues greater than 1 (Kaiser criterion) were retained, followed by maximum variance (varimax) rotation. Factor loading judgment: indicators with loading values > 0.6 are considered strongly correlated with the latent construct.
Key results
Suitability test: KMO = 0.82 (> 0.7, suitable for EFA); Bartlett's sphericity test: χ² = 586.37, p < 0.001 (rejecting the hypothesis of independent variables, supporting EFA). Factor extraction: 3 latent factors are extracted, explaining 78.6% of the total variance (meeting the > 60% standard for good construct validity). The factor loading matrix is shown in Table 5.
As shown in Table 5, the 10 indicators are clearly grouped into 3 factors, which exactly match the theoretical construct of "geometric precision (Factor 1)", "operation proficiency (Factor 2)", and "cognitive adaptability (Factor 3)". All loading values > 0.8 confirm the strong construct validity of the framework.
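For replication, the suitability tests and factor extraction can be run with the factor_analyzer package (an assumption; the text names only the Python environment). The random placeholder data below stands in for the 60 × 10 indicator matrix, so the outputs will not reproduce the reported values.

```python
import numpy as np
import pandas as pd
from factor_analyzer import FactorAnalyzer
from factor_analyzer.factor_analyzer import calculate_bartlett_sphericity, calculate_kmo

# Placeholder for the 60 participants x 10 normalized indicators (C1..C10).
indicators = pd.DataFrame(np.random.default_rng(1).random((60, 10)),
                          columns=[f"C{k}" for k in range(1, 11)])

chi2, p = calculate_bartlett_sphericity(indicators)  # Bartlett's sphericity test
_, kmo_model = calculate_kmo(indicators)             # overall KMO statistic
print(f"KMO = {kmo_model:.2f}, Bartlett chi2 = {chi2:.2f}, p = {p:.4f}")

# Kaiser criterion with varimax ("maximum variance") rotation, as in the protocol.
fa = FactorAnalyzer(n_factors=3, rotation="varimax")
fa.fit(indicators)
loadings = pd.DataFrame(fa.loadings_, index=indicators.columns)
print(loadings.round(2))  # loadings > 0.6 indicate a strong link to a factor
```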
Concurrent validity
Concurrent validity verifies the consistency between the framework's skill evaluation results and independent expert ratings (gold standard) collected simultaneously. This section uses Pearson correlation analysis and the Kappa consistency test to quantify the agreement, with all data derived from the two modeling tasks. The verification logic, indicators, and results are organized in Table 6 for clarity. We examined whether the extracted indicators represented latent factors aligning with the theoretical constructs of precision, proficiency, and cognitive adaptability.
Predictive validity
Predictive validity verifies whether the framework's current skill evaluation results can predict future modeling performance. A follow-up experiment was conducted 4 weeks after the initial experiment, with the same 60 participants completing a new health application task (dental crown modeling, medium complexity). We analyzed whether skill scores assessed at the study's outset could meaningfully predict modeling quality and efficiency in the follow-up task.
Prediction logic and metrics
The predictor variable is defined as the framework's comprehensive skill score (Sinitial), which is calculated as the average of the scores obtained from Task 1 and Task 2.
Analysis method and results
Linear regression analysis was used to quantify the prediction effect, with the regression equation and results shown in Table 7.
Key conclusion
The framework's initial score explains 59-63% of the variance in future performance (R² > 0.5), and all regression coefficients are statistically significant (p < 0.001). This confirms that higher initial scores predicted better future modeling quality and efficiency, verifying strong predictive validity. The three validity indicators (construct, concurrent, predictive) collectively confirm that the framework's evaluation results are reliable, consistent with expert judgments, and capable of predicting practical performance, meeting the requirements of engineering skill evaluation in health applications.
The t-test is a parametric statistical procedure for testing whether the means of two groups differ significantly. That is, the t-test determines whether the differences observed between two groups are greater than would be expected by chance alone, given the sample size and variance. In the current study, independent samples t-tests were used to compare modelling indicators across skill groups (e.g., novice, intermediate, expert). For each modelling indicator, the t-statistic, degrees of freedom, p-value, and 95% confidence interval were reported to characterize the effect size and robustness of the group comparisons. Before running the t-tests, the assumptions of normality and homogeneity of variance were assessed, and corrections were applied when either assumption was violated.
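A sketch of this procedure with SciPy is shown below; the two score arrays are fabricated placeholders, and the confidence-interval call assumes SciPy ≥ 1.10.

```python
import numpy as np
from scipy import stats

# Hypothetical comprehensive scores for two skill groups.
novice = np.array([34, 30, 38, 28, 36, 40, 33, 31])
expert = np.array([95, 92, 97, 94, 98, 93, 96, 95])

# Assumption checks: Shapiro-Wilk for normality, Levene for variance homogeneity.
print(stats.shapiro(novice).pvalue, stats.shapiro(expert).pvalue)
equal_var = stats.levene(novice, expert).pvalue > 0.05  # else use Welch's t-test

res = stats.ttest_ind(novice, expert, equal_var=equal_var)
print(f"t = {res.statistic:.2f}, df = {res.df:.1f}, p = {res.pvalue:.3g}")
print(res.confidence_interval(0.95))  # 95% CI of the mean difference
```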
Result analysis
Score distribution and discriminability
This section analyzes the framework's skill scores across 5 skill grades (θ₁-θ₅) to verify its ability to distinguish engineers of different levels. The score distribution uses box plots to visualize median, quartiles, and outliers, with statistical indicators including interquartile range (IQR) and coefficient of variation (CV) for discriminability quantification. The analysis determined whether framework scores reliably discriminated among the pre-planned skill levels, looking for low within-group variance and sufficient between-group discrimination.
Score distribution characteristics
The framework's comprehensive scores (average of Task 1 and Task 2) for the 60 participants show a clear hierarchical trend: i) Novice (θ₁): score range 28-40, median 34 (lowest, due to high MDR (> 8%) and low ATUR (< 30%)); ii) Beginner (θ₂): score range 42-58, median 50 (improved geometric precision but low GS (< 60%)); iii) Intermediate (θ₃): score range 62-75, median 68 (balanced performance, MDR ≈ 5% and ATUR ≈ 60%); iv) Advanced (θ₄): score range 78-90, median 84 (high efficiency, ROR < 10% and GS > 85%); v) Expert (θ₅): score range 92-98, median 95 (excellent precision, MDR < 2% and stable cognitive load).
Discriminability verification
IQR Separation: The IQR of adjacent grades has no overlap, indicating clear grade boundaries. CV Calculation: The overall CV of scores across grades is 0.28 (<0.3), and the CV within each grade is ≤0.12, confirming low internal variability and high inter-grade discrimination.
Visualization of diagnosis accuracy
This section evaluates whether the Visual Feedback Layer can accurately identify engineers' skill weaknesses by comparing framework-generated visual labels with expert diagnoses. We evaluated the accuracy of the framework's visual feedback elements in identifying performance weaknesses compared to expert-annotated behavioral analysis. The diagnosis accuracy rate is used as the core metric, calculated via Equation (21); all variables for Equations (1) to (21) are explained in Table 8. Visualization diagnosis accuracy rate:

Accuracy = (number of framework weakness labels consistent with expert diagnoses / total number of weakness labels) × 100%    (21)
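As a minimal illustration of Equation (21), the sketch below compares framework-generated weakness labels against expert annotations. The label strings are hypothetical examples, and the use of the expert-annotated labels as the denominator is an assumption about how the total is counted.

```python
# Illustrative computation of the visualization diagnosis accuracy rate (Eq. 21).
def diagnosis_accuracy(framework_labels: set, expert_labels: set) -> float:
    """Consistent labels / total expert-annotated labels, as a percentage."""
    if not expert_labels:
        return 0.0
    return 100.0 * len(framework_labels & expert_labels) / len(expert_labels)

framework = {"redundant vertex adjustment", "high fixation dispersion"}
expert = {"redundant vertex adjustment", "high fixation dispersion", "low ATUR"}
print(f"{diagnosis_accuracy(framework, expert):.1f}%")  # 66.7%
```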
Accuracy results by visualization type
Table 9 summarizes accuracy across the three visualization modules, with Task 2 (high complexity) showing slightly lower accuracy due to more subtle cognitive load weaknesses.
Key example: for an Intermediate (θ₃) engineer in Task 2, the 3D playback annotation flagged two weaknesses that aligned with the expert annotations: redundant vertex adjustment (ROR = 18%) and high FD (120 pixels), both of which were targeted.
Error analysis
The primary source of error (10-15% inaccuracy) is misassessment of cognitive load, attributable to individual differences in eye-movement patterns. This may be mitigated through the framework's calibration module.
User usability feedback
After the experiment, 30 EG participants (6 per skill grade) completed the System Usability Scale (SUS) questionnaire (10 items, 1-5 Likert scale, SUS score range 0-100) and open-ended interviews as measures of user acceptance. User acceptance and perceived utility were assessed using normative usability scoring and qualitative feedback.
SUS score results
Mean SUS score: 82.3 (>80, indicating excellent usability by industry standards). Breakdown scores: i) Real-Time Radar Chart: 85.7 (highest, owing to the intuitiveness of the strength/weakness comparison); ii) 3D Playback Annotation: 83.1 (valued for enabling step-by-step review during training); iii) Temporal Heatmap: 78.4 (lowest, because 30% of Novices were confused by the concept).
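For reference, SUS scores follow a standard published formula: odd-numbered items contribute (response − 1), even-numbered items contribute (5 − response), and the sum is scaled by 2.5 to the 0-100 range. The sketch below implements this; the example responses are invented.

```python
# Standard SUS scoring for a 10-item, 1-5 Likert questionnaire.
def sus_score(responses):
    """responses: 10 integers in 1..5, in questionnaire order."""
    assert len(responses) == 10 and all(1 <= r <= 5 for r in responses)
    total = sum((r - 1) if i % 2 == 0 else (5 - r)   # index 0 is item 1 (odd item)
                for i, r in enumerate(responses))
    return total * 2.5

print(sus_score([5, 2, 4, 1, 5, 2, 4, 2, 5, 1]))  # 87.5
```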
Interview key insights
87% of respondents (particularly Advanced/Expert) said the radar chart enabled them to benchmark their skill level more quickly; 73% of Intermediate/Beginner respondents said the 3D playback annotations reduced their practice time by one-fifth to one-third; suggested improvements included customizable heatmap time intervals and AI-generated improvement hints.
DATA AVAILABILITY
The raw data supporting the findings of this study are publicly available in the Zenodo repository. All datasets, including multimodal interaction logs, physiological signals, extracted indicators, and evaluation outputs, can be accessed at https://doi.org/10.5281/zenodo.17994975.

Figure 1: The protocol diagram for visual evaluation of engineering skills in health applications.

Figure 2: Architecture of the proposed multimodal skill-assessment framework.

Figure 3: Hardware configuration and synchronization setup for multimodal data acquisition.

Figure 4: Proposed framework illustrating indicator-to-construct mapping for skill evaluation.

Figure 5: Bayesian prior-to-posterior updating of skill-grade probabilities.

Figure 6: Radar-chart representation of user performance relative to expert benchmarks.

Figure 7: Temporal heat map illustrating indicator changes across modelling time segments.
| Field Name | Data Type | Description |
| Event ID | String | Unique identifier of the interaction event |
| Timestamp | Long | 13-digit synchronization timestamp |
| Operation Type | Enum | e.g., "Vertex Adjustment", "AI Tool Invocation" |
| 3D Coordinates | Float | (x, y, z) in the model coordinate system |
| Parameter Change | Float | e.g., curvature change value, model scaling rate |
Table 1: 3D Interaction event stream data fields.
| Field Name | Data Type | Description |
| Annotation ID | String | Unique identifier of the annotation |
| Scenario Tag | Enum | e.g., "Orthopedic Implant", "Organ Reconstruction" |
| Framework Score | Float | Initial score from the framework (0–100) |
| Expert Score | Float | Manual score by experts (0–100) |
| Corrected Weights | Float[10] | Adjusted weights of 10 indicators |
| Annotation Time | Datetime | Timestamp of the annotation |
Table 2: Expert-annotated database schema.
| Skill Grade | Number of Participants | Professional Background & Experience |
| Novice (θ₁) | 12 | Undergraduates majoring in Biomedical Engineering, <100 hours of 3D modeling practice |
| Beginner (θ₂) | 12 | Junior engineers, 1–2 years of medical 3D modeling experience, no AI tool usage history |
| Intermediate (θ₃) | 12 | Intermediate engineers, 3–5 years of experience, occasional AI tool usage (1–3 times/week) |
| Advanced (θ₄) | 12 | Senior engineers, 6–10 years of experience, regular AI tool usage (≥5 times/week) |
| Expert (θ₅) | 12 | Chief engineers/academics, >10 years of experience, leading AI-assisted modeling project development |
Table 3: Participant grouping and background information.
| Verification Dimension | Specific Indicator | Data Source | Calculation Method | Judgment Standard | Experimental Result |
| 1. Score Consistency | Pearson correlation coefficient (r) | Framework: comprehensive skill score (0–100). Gold standard: average score from 3 senior experts (blind rating, 0–100) | r = Σᵢ(xᵢ − x̄)(yᵢ − ȳ) / √[Σᵢ(xᵢ − x̄)² · Σᵢ(yᵢ − ȳ)²], where x = framework score, y = expert score, n = 60 participants | | |
| 2. Skill Grade Consistency | Kappa coefficient (κ) | Framework: skill grade (θ₁–θ₅). Gold standard: expert-assigned grade (θ₁–θ₅, based on score ranges: θ₁ ≤ 40, 40 < θ₂ ≤ 60, 60 < θ₃ ≤ 75, 75 < θ₄ ≤ 90, θ₅ > 90) | κ = (Pₒ − Pₑ) / (1 − Pₑ), where Pₒ = observed agreement rate, Pₑ = expected agreement rate | κ ≥ 0.6 (substantial consistency) | κTask1 = 0.72, κTask2 = 0.68 (both > 0.6) |
| 3. Key Indicator Alignment | Absolute error (AE) of top 3 weighted indicators | Framework: MDR, ATUR, SDNN (normalized). Gold standard: expert-scored indicator importance (1–5 scale, normalized to [0,1]) | AE = abs(I_framework − I_expert), where I = normalized indicator value | | |
| 4. Feedback Utility Alignment | Agreement rate on weakness diagnosis | Framework: weakness label. Gold standard: expert-identified weaknesses (recorded in annotation forms) | Agreement rate = number of consistent weakness labels / total number of weakness labels | Agreement rate ≥ 80% | RateTask1 = 87%, RateTask2 = 82% (both > 80%) |
Table 4: Concurrent validity verification details.
| Indicator | Latent Factor 1 (Geometric Precision) | Latent Factor 2 (Operation Proficiency) | Latent Factor 3 (Cognitive Adaptability) |
| Model Deviation Rate (MDR) | 0.89 | 0.12 | 0.09 |
| Topological Consistency (TC) | 0.85 | 0.15 | 0.11 |
| Feature Completeness (FC) | 0.81 | 0.18 | 0.13 |
| Modeling Time Efficiency (MTE) | 0.13 | 0.90 | 0.08 |
| Redundant Operation Rate (ROR) | 0.11 | 0.87 | 0.10 |
| AI Tool Utilization Rate (ATUR) | 0.16 | 0.83 | 0.14 |
| Gesture Smoothness (GS) | 0.14 | 0.81 | 0.12 |
| Fixation Dispersion (FD) | 0.09 | 0.11 | 0.88 |
| HRV-SDNN | 0.10 | 0.13 | 0.85 |
| EMG-RMS | 0.12 | 0.15 | 0.82 |
Table 5: Factor loading matrix of construct validity.
| Verification Dimension | Specific Indicator | Data Source | Calculation Method | Judgment Standard | Experimental Result |
| 1. Score Consistency | Pearson correlation coefficient (r) | Framework: comprehensive skill score (0–100). Gold standard: average score from 3 senior experts (blind rating, 0–100) | r = Σᵢ(xᵢ − x̄)(yᵢ − ȳ) / √[Σᵢ(xᵢ − x̄)² · Σᵢ(yᵢ − ȳ)²], where x = framework score, y = expert score, n = 60 participants | | |
| 2. Skill Grade Consistency | Kappa coefficient (κ) | Framework: skill grade (θ₁–θ₅). Gold standard: expert-assigned grade (θ₁–θ₅, based on score ranges: θ₁ ≤ 40, 40 < θ₂ ≤ 60, 60 < θ₃ ≤ 75, 75 < θ₄ ≤ 90, θ₅ > 90) | κ = (Pₒ − Pₑ) / (1 − Pₑ), where Pₒ = observed agreement rate, Pₑ = expected agreement rate | κ ≥ 0.6 (substantial consistency) | κTask1 = 0.72, κTask2 = 0.68 (both > 0.6) |
| 3. Key Indicator Alignment | Absolute error (AE) of top 3 weighted indicators | Framework: MDR, ATUR, SDNN (normalized). Gold standard: expert-scored indicator importance (1–5 scale, normalized to [0,1]) | AE = abs(I_framework − I_expert), where I = normalized indicator value | | |
| 4. Feedback Utility Alignment | Agreement rate on weakness diagnosis | Framework: weakness label. Gold standard: expert-identified weaknesses (recorded in annotation forms) | Agreement rate = number of consistent weakness labels / total number of weakness labels | Agreement rate ≥ 80% | RateTask1 = 87%, RateTask2 = 82% (both > 80%) |
Table 6: Concurrent validity verification details.
| Regression Equation | R² (Explained Variance) | F-Statistic | p-Value | 95% Confidence Interval for Slope | Prediction Effect Judgment |
| Q_future = 4.82 − 0.04 × S_initial | 0.63 | 92.57 | <0.001 | −0.052 to −0.028 | S_initial increases by 10 → Q_future decreases by 0.4 (better quality) |
| E_future = 32.15 + 0.58 × S_initial | 0.59 | 81.34 | <0.001 | 0.41 to 0.75 | S_initial increases by 10 → E_future increases by 5.8 (higher efficiency) |
Table 7: Predictive validity regression results.
| Symbol / Variable | Equation | Definition |
| P(θk) | Bayesian Update | Prior probability of skill level k. |
| P(X|θk) | Bayesian Update | Likelihood of observing multimodal data X under skill level k. |
| P(θk|X) | Bayesian Update | Posterior probability after evidence fusion. |
| N | MDR Eq. (3) | Number of anatomical key feature points. |
| pi | MDR Eq. (3) | Feature point coordinate in engineer’s model. |
| qi | MDR Eq. (3) | Feature point coordinate in reference model. |
| D | TC Eq. (4) | Number of topological defects in the mesh. |
| T | TC Eq. (4) | Total number of mesh elements (vertices + edges + faces). |
| C_completed | FC Eq. (5) | Number of clinically required features completed. |
| C_total | FC Eq. (5) | Total number of clinically required features. |
| t_actual | MTE Eq. (6) | Actual modeling time used by the engineer. |
| t_expert | MTE Eq. (6) | Average expert modeling time for the task. |
| O_r | ROR Eq. (7) | Redundant operation count. |
| O_total | ROR Eq. (7) | Total operation count. |
| A_used | ATUR Eq. (8) | Number of AI tool invocations. |
| T_used | ATUR Eq. (8) | Total tool invocations. |
| v_j | GS Eq. (9) | Velocity of joint j in gesture sequence. |
| x_max, x_min | FD Eq. (10) | Maximum and minimum fixation coordinates. |
| RRi | SDNN Eq. (11) | Interval between consecutive heartbeats. |
| EMG_i | RMS Eq. (12) | EMG signal amplitude at time i. |
| wi | Score Eq. (15) | Weight of the ith normalized indicator. |
| xi | Score Eq. (15) | Normalized value of indicator i. |
| μk, σk | Gaussian Likelihood | Mean and variance for likelihood under skill level k. |
| t_device | Sync Eq. (2) | Sensor acquisition timestamp. |
| t_trigger | Sync Eq. (2) | Unified framework trigger timestamp. |
Table 8: Appendix table for equation variables.
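The Bayesian-update quantities listed in Table 8 can be illustrated with a short sketch: given a prior P(θk) over the five skill grades and Gaussian likelihoods N(μk, σk²) for an observed indicator value, the posterior P(θk | X) follows from Bayes' rule. The μk and σk values below are hypothetical placeholders for a single indicator channel, not the calibrated parameters of the framework.

```python
# Sketch of the Bayesian skill-grade update defined by the Table 8 variables.
import numpy as np
from scipy.stats import norm

prior = np.full(5, 0.2)                       # uniform prior P(θk) over θ1..θ5
mu = np.array([9.0, 7.0, 5.0, 3.0, 1.5])      # hypothetical expected MDR (%) per grade
sigma = np.array([1.5, 1.2, 1.0, 0.8, 0.5])   # hypothetical per-grade spread

x_obs = 2.6                                   # observed MDR for this session
likelihood = norm.pdf(x_obs, loc=mu, scale=sigma)   # P(X | θk)
posterior = prior * likelihood
posterior /= posterior.sum()                  # normalize to obtain P(θk | X)

for k, p in enumerate(posterior, start=1):
    print(f"P(grade {k} | X) = {p:.3f}")
```

In practice, each new multimodal observation repeats this update, so the posterior serves as the continuously updating belief state over competence.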
| Visualization Module | Task 1 (Orthopedic Implant) | Task 2 (Liver Vascular Reconstruction) | Average Accuracy |
| Real-Time Radar Chart | 91% | 86% | 88.5% |
| Temporal Heatmap | 89% | 83% | 86.0% |
| 3D Playback Annotation | 93% | 88% | 90.5% |
Table 9: Visualization diagnosis accuracy.
Although existing 3-D modelling and interaction frameworks are few in number and can occasionally provide useful information, few of them evaluate engineering capability in 3-D modelling tasks. Most prior gesture-tracking and interface studies19 assess interactive behavior rather than the user's ability. Similarly, VR/haptic-based educational systems20 give the learner an enhanced understanding of anatomy but can neither quantify workflow accuracy nor validate decisions. The same holds for AI-supported visualization tools21. The framework proposed here utilizes gaze fixation patterns, hand kinematics, physiological changes, and voice markers to capture the full behavioral signature of modelling. By applying Bayesian evidence fusion, the framework generates a unified skill estimate from multiple streams of data. Finally, through role-aware visual analytics, the user obtains interpretable feedback on their performance. The strategy demonstrated here directly provides a new and valid solution to the challenge of measuring user competence in AI-assisted, health-critical 3-D modelling systems.
This paper presents the first end-to-end visual analytics framework for objectively assessing engineering competence in AI-assisted, health-centric 3-D modelling. This framework combines multimodal sensing, feature extraction, and Bayesian evidence fusion to produce a continuously updating belief state of user competence, directly addressing the scientific and regulatory need for repeatable workflow-level assessment rather than single-outcome mesh inspection. By rooting the assessment construct in anatomically constrained requirements of porous-bone deviation (<5%), vascular angle error (<3°), AI-tool utilization (>60%), and cognitive stability (SDNN >100 ms), this work introduces a measurable, domain-specific definition of skill that has not existed in the prior modelling literature.
The research makes several advances in the field. Scientifically, it shows that geometric, behavioral, and cognitive indicators can be synchronized at ±50 ms precision and fused using a Bayesian engine to produce a reliable, uncertainty-aware estimate of modeling competence. This multimodal integration significantly extends earlier single-channel CAD analytics and cognitive-load studies, providing a far richer representation of how engineers navigate anatomy-constrained design tasks. Indeed, this Bayesian engine establishes a continuous, interpretable evaluation pathway that aligns with emerging regulatory expectations for "repeatable workflow competence" and ISO 13485 process validation. Real-time visual feedback via radar charts, temporal heatmaps, and 3-D annotated playback within the framework also accelerates knowledge transfer in support of human-AI collaboration research and competency-based technical education.
This section presents the empirical reasoning behind the elements of the hierarchical architecture. Specifically, the hybrid CNN-LSTM feature-extraction layer was selected because pilot benchmarking on 200 labelled modelling sessions indicated that combining spatial (CNN) and temporal (LSTM) patterns improved geometric accuracy prediction by 13.4% and decreased cognitive-state misclassification by 10.7% compared with single-stream model options. Similarly, the Bayesian fusion engine produced better agreement with expert ratings (r = 0.91, κ = 0.82) than deterministic weighted-averaging baselines (r = 0.74, κ = 0.61), demonstrating improved stability under uncertainty. These empirical performance improvements show that each AI layer offers an appreciable gain over conventional methods in the prior literature.
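The agreement statistics above (Pearson r between continuous scores, Cohen's κ between discretized grades) can be reproduced with standard library calls, as in the hedged sketch below. The score arrays are synthetic stand-ins, and the grade cut-points follow the expert ranges used in Tables 4 and 6.

```python
# Sketch of the agreement comparison: Pearson r on scores, Cohen's kappa on grades.
import numpy as np
from scipy.stats import pearsonr
from sklearn.metrics import cohen_kappa_score

rng = np.random.default_rng(7)
expert_scores = rng.uniform(25, 98, 60)
framework_scores = expert_scores + rng.normal(0, 5, 60)   # correlated by construction

def to_grade(scores):
    # Cut-points per the expert grade ranges: <=40, <=60, <=75, <=90, >90.
    return np.digitize(scores, bins=[40, 60, 75, 90]) + 1  # 1..5 for grades θ1..θ5

r, p = pearsonr(framework_scores, expert_scores)
kappa = cohen_kappa_score(to_grade(framework_scores), to_grade(expert_scores))
print(f"r = {r:.2f} (p = {p:.3g}), kappa = {kappa:.2f}")
```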
Several limitations must be noted. Physiological indicators like SDNN and EMG RMS vary significantly among individuals, leading to about 10-15% misclassification in cognitive-load diagnosis. Two representative tasks were used: femoral-stem implant modelling and liver vascular reconstruction. These two do not capture the full variability of clinical modelling domains such as craniofacial or airway reconstruction. While active-learning calibration reduces the annotation load, it does require periodic expert intervention, possibly constraining scalability in low-resource institutions. Moreover, the feature extractors adapt to new datasets but assume relatively stable user-interface layouts and behaviors of AI tools; rapid GUI updates may need retraining.
Other methods might also assess the same hypothesis of workflow competence. Alternative approaches could involve graph-theoretic mesh metrics, behavioral embedding via self-supervised learning, VR-based immersive analytics, and more classical cognitive task analysis such as think-aloud protocols. These methods would complement or benchmark the multimodal fusion strategy presented here, although they might not attain the fine-grained temporal alignment and uncertainty quantification of the proposed Bayesian framework.
The technique has considerable significance and domain-specific relevance. In medical 3D printing, this will help provide auditable, real-time evidence for quality assurance pipelines and ISO 13485 process-validation requirements. In surgical planning and robotic console training, the sensing and fusion pipeline parallels the constructs used in motor-cognitive skill acquisition research, enabling more robust training analytics. In additive manufacturing preparation, the early detection of geometric error patterns and inefficient tool use should limit downstream print failures. More broadly, dynamic radar-chart deficits, heat-map signatures, and 3-D annotated error regions provide interpretable insight into user behaviour, supporting educational, industrial, and regulatory use cases.
A drawback of the present study is that it did not include any visual representations of the 3D models used in the tested tasks. Though the framework does assess geometric accuracy using quantitative indicators (e.g., the Model Deviation Rate (MDR) and Topological Consistency (TC) per case), the study did not include visual representations of actual anatomical structures, the spatial distribution of errors visible through mesh-deviation heat maps, or the presence of topological defects. This could restrict the educational purpose of demonstrating modeling complexity and the actual challenges faced in clinical practice. Future work will include detailed visual exemplars and standardized visualizations to create a record for reproducibility, enable cross-study comparisons, and increase transparency in training and certification.
Experimental verification was primarily restricted to the Mimics platform, and participants had mixed prior exposure to 3D Slicer and Blender Medical. Preliminary pilot studies suggest these factors minimally change the geometric indicators, but differences in interface structure and in interactions with AI assistive tools may affect the behavioral metrics of efficiency, ATUR, and ROR. Because testing was not performed across different software ecosystems, the generalizability of the results is limited. In further studies, cross-platform testing will be expanded, and calibration layers will be developed to automatically adapt to tool-specific interaction behaviors. In addition, future studies will investigate whether the framework maintains validity across commercial and open-source CAD environments, across workflows on multiple devices, and under changes to AI-assisted modeling processes.
There are some limitations to consider. First, the sample, although stratified by skills, was still small and came from two institutions. Consequently, the statistical power might be low and the findings might not be broadly generalizable. Second, the framework is dependent on specific hardware, including an eye tracker, EMG/ECG sensors, and depth-based hand-tracking devices. Multimodal instrumentation is important for capturing high-resolution behaviours, but will not be equally available in all training centres or clinical contexts. Third, the models of performance established in the present study might not generalize well to contexts using different 3-D modelling software, AI toolchains, or curricula. Variations in user experience, complexity of task, and sensor integrity will affect estimates of competence. Future work should, therefore, seek to validate the system with larger and more diverse samples, assess lower-cost sensing alternatives, and examine the potential use of the system in wider clinical and educational contexts.
Future extensions include federated deployment, enabling cross-hospital sharing of encrypted model updates without exposing patient-specific meshes; integration with large vision-language models capable of converting skill-deficit vectors into natural-language coaching tips; and deeper alignment with RSNA and ISO regulatory frameworks to transform competence trajectories into formal, auditable evidence. Longer-term, the same sensing-and-fusion architecture could be applied to surgical robot console operation, additive-manufacturing build-job optimisation, general CAD education, and even AI-augmented histopathology annotation, expanding its impact beyond 3-D medical modelling. Overall, the findings establish that the proposed framework reliably differentiates between skill levels, allows for the prediction of downstream modeling performance, provides interpretable feedback that users trust, and adapts to evolving clinical technologies. It therefore constitutes a foundational step towards scalable, scientifically sound, and regulation-ready assessment of engineering competence in high-stakes AI-enabled health applications.
This work presented the first end-to-end visual analytics framework that quantifies, updates, and communicates engineering competence in AI-assisted, health-related 3-D modelling, combining multimodal sensing, Bayesian evidence fusion, and interpretable visual feedback. Transitioning from outcome-based quality checks to a dynamic, situation-aware assessment of workflow competence not only reflects clinical expectations but also enables regulatory oversight and allows downstream modelling quality to be anticipated. Through evidence of construct, concurrent, and predictive validity, the framework demonstrated that the system reliably identified skill levels, predicted downstream modelling quality, assessed performance on the fly, and supported effective user learning. This research indicates that competence-aware analytics can improve the safety, reliability, and reproducibility of medical 3-D modelling. Going forward, several directions will advance the impact of this work. First, federated deployment will support collaborative model calibration among hospitals while keeping them compliant with post-2019 privacy regulations on sharing raw patient data, facilitating dataset expansion and greater variance in new training data. Second, LLM-driven coaching, integrating advanced vision-language models, will allow real-time, tailored micro-feedback linked to observable modelling behaviors and visual anomalies. Third, deeper integration with regulatory processes and standards such as ISO 13485 and emerging process-validation approaches would translate real-time skill metrics into auditable evidence supporting certification and quality management systems. Beyond these directions, future work will investigate lightweight sensing configurations, expanded software compatibility, and domain transfer to related tasks such as surgical robotics, additive-manufacturing planning, and digital pathology workflows. With these developments, the framework could serve as scalable groundwork for next-generation human-AI collaboration in safety-critical modelling environments.
The authors have nothing to disclose.
None.
| Name | Company | Catalog Number | Comments |
| AI-assisted 3D Modeling Software | Materialise NV, Belgium | Mimics AI v25.0 | Used for femoral-stem implant modelling and liver vascular reconstruction tasks with AI segmentation and defect detection. |
| Eye Tracker (Desktop) | Tobii Pro Fusion | TPF-120 | Captures fixation and saccade data at 120 Hz to quantify visual attention during modelling. |