Research Article
This study develops a three-tier evaluation framework and fairness mediation model to assess AI-assisted English writing systems. Using 764 cross-linguistic samples, the results show accuracy disparities, fairness bias against non-native learners (especially Chinese learners at the A2 proficiency level), and fairness perception as the key mediator of user satisfaction, offering theoretical and practical implications.
In the context of global educational digital transformation, automated writing evaluation (AWE) has been widely adopted for its real-time, standardized feedback; however, traditional accuracy-oriented frameworks often neglect equity concerns and learners' perceptions, thereby limiting transparency and educational value. To address this limitation, this research proposes an explainable AI (XAI) framework designed to provide transparent and interpretable feedback, allowing learners to understand and trust automated evaluation. The framework integrates a multi-level validation model, the Three-Level Evaluation Framework (TLEF), spanning technical accuracy, group and individual equity, and learner perception, together with the AI Fairness Mediation Model (AFMM). Using stratified random sampling, data were collected from 764 multilingual learners (native speakers of English, Chinese, and Spanish) across Common European Framework of Reference for Languages (CEFR) levels A2 to C1 through writing tasks, dual scoring by AI and human experts, and structured questionnaires. Multiple statistical analyses were employed to examine validity, fairness, and the learner-perception relationship, combining correlation, root mean square error (RMSE), Equalized Odds testing, and Structural Equation Modeling (SEM). The findings reveal that while the AWE system (ETS Criterion) achieves overall validity (r = 0.82), significant disparities remain: Chinese native speakers demonstrate the lowest agreement with human raters (0.72) and the highest RMSE (median 2.15); fairness biases are most pronounced at lower proficiency levels (ΔEO = 0.15 for A2 learners); and perceived fairness fully mediates the link between perceived accuracy and learner satisfaction, with proficiency moderating fairness sensitivity. By reframing fairness and perception as essential dimensions of explainability, the research enhances the theoretical grounding of AWE and provides a practical pathway for increasing transparency, equity, and social acceptance in educational technologies.
The intensifying globalization of education and digital technologies has increased the need to evaluate English writing scientifically and credibly for language teaching, academic development, and career advancement1. Conventional writing evaluation by human raters can measure subjective aspects of writing, such as the thoroughness of argumentation and cultural suitability2, but suffers from long turnaround times, high labor costs, and bias arising from evaluators' experience and inclinations3,4. These constraints are especially acute in large-scale practice, such as international language tests (IELTS, TOEFL) or university courses taught in English, where manual scoring alone cannot provide the required immediacy and coverage of feedback5.
AWE systems have become widely used in this context due to their real-time processing, standardization, and scalability6. Popular tools such as Grammarly (which focuses on grammar errors and style refinement) and ETS Criterion (which adheres to formal writing norms) are currently used by millions of students in K-12 education, language schools, higher education, and individual training7. Despite these benefits, the technological efficiency and educational applicability of AWE systems remain disputed8. Technically, existing systems are highly accurate on objective dimensions, such as error detection and lexical diversity, where correlations with human scoring can exceed 0.859. However, in more subjective areas, such as content relevance, logical argumentation, and text organization, correlations often fall below 0.7010. This imbalance risks promoting superficial accuracy among learners at the expense of overall writing competence11.
The issue of equity also limits the educational usefulness of AWE. Current studies tend to focus on aggregate accuracy indicators, neglecting deviations that systematically disadvantage particular groups12. For instance, interlanguage features common to Chinese or Spanish learners may be misclassified as errors, resulting in systematic underestimation13,14. In addition, learners' subjective acceptance of AI feedback remains poorly understood15. Surveys indicate that almost one-third of non-native learners report a mismatch between AI scores and their actual performance, and the mechanisms linking technical accuracy, group equity, and learner satisfaction remain poorly comprehended16.
These weaknesses reflect the shortcomings of the classical paradigm of accuracy17. A framework that considers only the alignment between AI and human scoring cannot capture issues of equity or the learner's trust in the system. In practice, the educational value of AWE must satisfy three conditions simultaneously: technical precision, fairness across groups, and learner acceptance18. The absence of such a comprehensive validation approach helps explain why AWE systems enjoy widespread adoption yet limited trust in educational practice19,20.
To address this challenge, the present study introduces a multi-level validation framework that integrates technical accuracy, group and individual fairness, and learner perception into a coherent structure. The proposed XAI framework is designed to be practically implemented within existing AWE platforms by providing teachers and students with fairness diagnostics and transparent score explanations, and can be applied in writing courses or test-preparation classes to evaluate its ability to enhance fairness, interpretability, and instructional usefulness in real assessment settings.
In this context, the AFMM is hypothesized to explain the mediating role of perceived fairness in the relationship between accuracy and satisfaction, as well as the moderating role of language proficiency on fairness sensitivity. The study therefore contributes in two ways: theoretically, by enriching AWE evaluation models through treating fairness as a key validation dimension alongside accuracy and perception; and practically, by providing developers with strategies for maximizing fairness, educators with group-sensitive system-selection criteria, and an account of how learners' perceptions are formed, thereby strengthening the educational value of AWE. Beyond education, the framework also aligns with the broader concept of XAI, demonstrating how fairness and user perception can enhance transparency, trust, and acceptance in other areas, such as healthcare, autonomous systems, and cybersecurity.
Research Questions:
1. To what extent does the AWE system demonstrate technical accuracy and fairness across different native-language and proficiency groups?
2. How can an XAI-based multi-level evaluation framework improve transparency and equity in automated English writing assessment?
LITERATURE REVIEW:
The factors affecting college students' acceptance of AWE feedback were examined using an extended Technology Acceptance Model (TAM)21. Based on SEM applied to survey data from 448 Chinese students, subjective norm, trust, self-efficacy, cognitive feedback, and system characteristics were found to significantly influence perceived usefulness, ease of use, and behavioral intention. However, the study was limited to a single nation and a single student group, which limits generalizability. To explore how Chinese EFL students respond to Pigai AWE feedback22, a study analyzed repeated submissions (n = 5) from university students. It noted an early emphasis on error correction, low uptake of linguistic feedback, and a gradual deepening of engagement over successive drafts. However, the small sample and single AWE system restrict applicability and generalizability. EFL teachers' beliefs about the AI grading tool CoGrader were examined to identify the factors shaping their views23. A mixed-method study of 10 Saudi university teachers, combining a survey and interviews, revealed broadly positive opinions alongside reluctance to fully trust its reliability or to accept complete teacher replacement. The limited sample and single-country setting again impede generalization.
Considering developments in corpus linguistics and AI technology, one study investigated AES frameworks24. It employed PCA to refine linguistic indicators for evaluating writing quality and found that combining micro-level characteristics with aggregated characteristics captured writing quality more effectively than aggregated characteristics alone. A non-linear AES approach based on Random Forest Regression outperformed the alternatives. Furthermore, SHAP identified the language elements essential to each evaluated attribute, increasing system transparency via explainable AI. These results may help advance multidimensional approaches to writing evaluation and education. Separately, a human-machine collaboration system was introduced to address the expensive and time-consuming annotation of Arabic writing. The method scores essays on seven literature-derived features with the help of an LLM. Validation procedures and prompting tactics were customized to ensure consistency and accuracy. The collaboration yields a larger supply of labeled resources without degrading evaluation quality, demonstrating a scalable data-annotation method for lower-resource languages.
The use of AI in education offers an opportunity to reduce grading workload significantly and enhance writing instruction25,26. At the same time, researchers have emphasized that accuracy is not the only aspect relevant to responsible use; principles of fairness and bias reduction, security and privacy, accountability, explainability, transparency, educational impact, integrity, and continuous improvement also apply. Recent research has empirically evaluated zero-shot scoring based on GPT-4o with a focus on these requirements. Another study examined educators' perceptions of ADWTs with respect to educational integrity27. A cross-sectional study of 100 graduate students and professors across 10 subjects found that, although teachers acknowledged the benefits of ADWTs for educational objectives, they noted limitations such as limited accessibility, lack of knowledge, and concerns about effects on integrity and creativity. The study concluded that as AI technologies become more integrated into education, ethical safeguards and stakeholder participation are necessary for their successful and responsible use. Further research compared the efficacy of AI tools with human assessors in evaluating essays by EFL pupils28. An assessment of 30 essays revealed that, while AI offered high-quality comments on content, language, organization, and correctness, it consistently assigned lower ratings than human raters. Furthermore, AI provided more thorough feedback, but scores from the various AI tools did not differ substantially.
Research Gap:
Currently, most AWE scholarship examines either accuracy or user acceptance. Very few studies examine whether scoring differences systematically disadvantage particular native-language or proficiency groups. Because previous studies have typically examined user acceptance for a single AWE system, within a single country, and with a small sample, questions of generalizability arise. Although SHAP and PCA are XAI strategies developed to increase transparency, no studies have examined fairness mechanisms or how learners use AI feedback from AWE systems. The literature offers no comprehensive framework covering defined dimensions of accuracy, fairness analysis, and learner perception, and no explainable evaluation model that jointly considers intra- and inter-rater accuracy, fairness, and learner perceptions. This research therefore proposes and validates an explainable framework, the TLEF, and a combined model, the AFMM, to assess accuracy, fairness, and learner perceptions simultaneously among multilingual, proficiency-diverse learners.
The ethical approval and participant recruitment process, including essay administration, dual scoring by ETS Criterion and experts, learner-perception evaluation, and statistical analysis, are summarized in this section. It highlights how accuracy, fairness, and SEM-based perception modeling are integrated into a unified XAI validation pipeline. The XAI-driven AWE evaluation framework is illustrated in Figure 1.
Procedure:
The procedure involved several steps. First, IRB approval was obtained, and informed consent was collected from all participants. Independent, dependent, and control variables were then defined. Standardized writing tasks were administered on Moodle using three neutral essay topics, and writing samples were collected while ensuring adherence to essay requirements, such as word count, time limit, and structure. Dual scoring was conducted using ETS Criterion outputs combined with human expert ratings. Learner-perception questionnaires were distributed immediately after essay submission. Data screening and quality control procedures were implemented to address anomalies, such as cheating or invalid responses. Fairness analysis thresholds (ΔEO, RMSE checks) were also applied. Finally, all anonymized data were stored securely on encrypted, access-controlled servers.
Ethical approval and informed consent
This study received ethics approval from the Institutional Review Board of the authors' institution. All procedures were conducted in accordance with the Declaration of Helsinki and applicable regulations. All participants were adults (≥18 years) and provided written informed consent before participation. Writing samples and questionnaire responses were de-identified at source and stored on encrypted, access-controlled servers; only authorized investigators had access. Human raters were blinded to participants' native language, proficiency level, and demographics. Participation was voluntary, with the right to withdraw at any time, and no deception or sensitive interventions were involved. Formal approval documentation can be provided to the journal upon request.
Variable design
A total of three groups of variables were defined to guide the analysis. Table 1 summarizes the measurement methods and data types for each construct and provides the full operational definitions of the independent, dependent, and control variables.
AI scoring accuracy was the first independent variable, assessed via RMSE and the Pearson correlation coefficient (r) between ETS Criterion outputs and expert ratings. Expert calibration yielded an ICC of 0.91, confirming rating reliability.
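As a minimal illustration of these two indicators, the sketch below computes RMSE and Pearson's r on hypothetical score arrays; it is not the study's actual pipeline, which drew scores from the ETS Criterion API.

```python
# Minimal sketch (not the study's pipeline): computing the two accuracy
# indicators, RMSE and Pearson's r, between AI and human expert scores
# on hypothetical arrays.
import numpy as np
from scipy import stats

ai_scores = np.array([4.0, 3.5, 5.0, 2.5, 4.5])     # hypothetical AI scores
human_scores = np.array([4.5, 3.0, 5.0, 3.0, 4.0])  # hypothetical expert ratings

rmse = np.sqrt(np.mean((ai_scores - human_scores) ** 2))
r, p_value = stats.pearsonr(ai_scores, human_scores)
print(f"RMSE = {rmse:.2f}, Pearson r = {r:.2f} (p = {p_value:.3f})")
```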
The second independent variable was the linguistic background of the learners, which was divided into native and non-native speakers, and further subdivision was made into Chinese, Spanish, Arabic, and other groups. Chinese students were one of the target populations because preliminary indications of systematic underestimation were observed.
The third independent variable was writing proficiency, rated according to CEFR levels A2 to C1, confirmed by official certificates and pre-class proficiency tests, and aligned with IELTS equivalencies. Writing proficiency was also introduced as a moderator in the AI Fairness Mediation Model to test whether sensitivity to fairness differs across proficiency levels.
Perception of fairness and learner satisfaction were the dependent variables. Perception of fairness was assessed with an eight-item questionnaire rated on a seven-point Likert scale, covering individual consistency and group impartiality (Cronbach's α = 0.87; CVI = 0.92). Learner satisfaction was assessed using six Likert items measuring willingness to use the system and perceived skill improvement (α = 0.85).
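For reference, scale reliability of this kind can be estimated with a standard Cronbach's alpha computation; the sketch below uses simulated Likert responses, not the study's pilot data.

```python
# Illustrative Cronbach's alpha for a multi-item Likert scale; the
# 60 x 8 response matrix is simulated and not the study's pilot data.
import numpy as np

def cronbach_alpha(items: np.ndarray) -> float:
    # alpha = k/(k-1) * (1 - sum of item variances / variance of total score)
    k = items.shape[1]
    item_vars = items.var(axis=0, ddof=1)
    total_var = items.sum(axis=1).var(ddof=1)
    return k / (k - 1) * (1 - item_vars.sum() / total_var)

rng = np.random.default_rng(0)
latent = rng.normal(4, 1, size=(60, 1))                        # shared construct
items = np.clip(np.round(latent + rng.normal(0, 1, (60, 8))), 1, 7)
print(f"alpha = {cronbach_alpha(items):.2f}")
```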
The variables were controlled for in terms of age, sex, and writing experience. Age was divided into three groups (18-22 years, 23-28 years, and ≥29 years), and gender was categorized into male and female. Writing experience was categorized into three levels of frequency per year.
Writing task texts
Standardized argumentative essay prompts were formulated to obtain writing data for three neutral topics, Globalization's Impact on Local Cultures, Advantages and Challenges of Online Education, and Ethical Boundaries of Artificial Intelligence. These themes were aimed at balancing cognitive difficulty and accessibility on the one hand, and reducing performance differences due to previous knowledge on the other. The distribution of topics and descriptive statistics for essay length are reported in Table 2.
Each essay was required to be 250 words ±10% and written within 45 minutes on a Moodle-based platform. Auxiliary tools were prohibited, and late submissions were excluded. Essays followed a standardized structure of introduction, two argument paragraphs, and conclusion. In total, 764 valid essays were collected, with an average length of 252.3 words (SD = 8.7).
Scoring comparison data
Accuracy of AWE scoring was assessed using a dual procedure that combined ETS Criterion outputs with human expert ratings. Scores were retrieved from Criterion via its open API. Three linguists with more than ten years of assessment experience independently scored all essays. Before formal scoring, the raters completed three calibration sessions. During calibration, inter-rater reliability reached ICC = 0.87; during formal scoring, ICC rose to 0.91, with dimension-specific ICCs above 0.88. Essays with score discrepancies greater than two points were resolved collectively (18 cases). The scoring workflow and reliability outcomes are summarized in Table 3.
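Inter-rater reliability of this kind is commonly quantified as ICC(2,1) (two-way random effects, absolute agreement, single rater); the sketch below implements the textbook formula on simulated ratings, and verified implementations such as pingouin.intraclass_corr in Python or the irr package in R would normally be preferred.

```python
# Sketch of ICC(2,1) on a simulated essays x raters matrix.
import numpy as np

def icc_2_1(ratings: np.ndarray) -> float:
    n, k = ratings.shape
    grand = ratings.mean()
    ss_rows = k * ((ratings.mean(axis=1) - grand) ** 2).sum()   # essays
    ss_cols = n * ((ratings.mean(axis=0) - grand) ** 2).sum()   # raters
    ss_error = ((ratings - grand) ** 2).sum() - ss_rows - ss_cols
    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_error = ss_error / ((n - 1) * (k - 1))
    return (ms_rows - ms_error) / (
        ms_rows + (k - 1) * ms_error + k * (ms_cols - ms_error) / n
    )

rng = np.random.default_rng(1)
quality = rng.normal(4, 1, size=(50, 1))               # 50 essays
ratings = quality + rng.normal(0, 0.3, size=(50, 3))   # 3 raters
print(f"ICC(2,1) = {icc_2_1(ratings):.2f}")
```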
Learner perception questionnaire
Learners' perceptions of AI feedback were captured through a 22-item questionnaire based on the TAM and extended to include fairness. The instrument contained three domains: fairness perception (8 items), satisfaction (6 items), and moderating factors such as comprehensibility and transparency (8 items). Validation by five experts yielded a CVI of 0.92, and pilot testing with 60 learners produced an overall reliability of α = 0.90. The questionnaire structure and psychometric indices are provided in Table 4.
Questionnaires in the main study were administered immediately after essay submission, with minimum completion-time requirements to reduce careless responding. Of the 764 questionnaires issued, 756 were valid after quality checks, yielding an effective response rate of 98.95%.
Data collection and quality control
Data were collected over 8 weeks (March-April 2024) in four stages: recruitment and consent; essay writing; dual scoring and questionnaire distribution; and database compilation. Proficiency certificates were cross-checked against pre-class writing performance in a dual screening process that eliminated 16 participants. Real-time monitoring eliminated four potential cases of cheating, and three anomalous AI scores (deviating by at least 8 points) were corrected after manual review. Eight invalid questionnaires were removed based on reverse-item consistency checks.
Data storage and ethics
All the data were anonymized and stored using unique identifiers that consisted of the native language, proficiency level, and serial number. Texts, scores, and questionnaires were encrypted and stored on ISO27001-compliant servers with restricted access. Data will be retained for 3 years before permanent deletion. Ethical approval was obtained from the institutional review board, and written informed consent was collected from all participants.
This section presents the research results across five analytical dimensions: experimental design, participant characteristics, scoring accuracy, fairness assessment, and learner-perception modeling. The outcomes include statistical performance, group differences, fairness disparities, and SEM-based mediation and moderation.
Experimental setup
The key software steps involved configuring ETS Criterion through its API to score essays automatically, training human raters, performing data analysis in the referenced statistical software with default options, and conducting structural equation modeling in R 4.3.1 with standard SEM packages. The materials, software platforms, and analytical tools used in the AWE fairness research are listed in the Table of Materials.
Sample selection and demographic characteristics
A total of 764 valid participants were recruited using stratified random sampling across English-speaking regions. Control variables were analyzed to ensure representativeness. The majority were aged 18-22 years (n = 426, 55.76%), followed by 23-28 years (n = 258, 33.77%) and ≥29 years (n = 80, 10.47%). Gender distribution was balanced (female: 52.62%, male: 47.38%). Writing experience varied, with 25.65% reporting fewer than 10 essays annually, 42.93% writing 10-30 essays, and 31.41% writing more than 30. Native language and proficiency levels were evenly distributed within each subgroup. Informed consent was obtained from all participants in accordance with the Declaration of Helsinki.
Accuracy analysis
General and aggregated scoring consistency
All Pearson correlation coefficients (r) between ETS Criterion and human expert ratings were significant at p < 0.001. The total-score correlation was 0.82, with dimension-level correlations ranging from 0.73 to 0.89. Consistency was highest for native speakers (r = 0.89) and lowest for Chinese native speakers (r = 0.72). Across proficiency levels, C1 learners showed higher consistency (r = 0.86) than A2 learners (r = 0.68). Among dimensions, grammar accuracy showed the highest alignment and content relevance the lowest. Table 5 summarizes these results.
Differences in RMSE between groups
RMSE values, visualized as a box plot, showed clear group differences. The lowest median RMSE was observed for English native speakers (1.02) and the highest (2.15) for Chinese native speakers, who also showed greater variability. The Spanish (1.68) and Arabic (1.75) groups fell in between. Group sizes were: English, n = 312; Chinese, n = 218; Spanish, n = 146; Arabic, n = 88. This indicates that Chinese learners experienced not only larger deviations from expert scores but also more pronounced individual outliers, as illustrated in Figure 2. The statistical accuracy and fairness measures, with confidence intervals and effect sizes, are presented in Table 6.
ANOVA test
The ANOVA tested RMSE differences across the language groups. The analysis showed a statistically significant group effect, indicating that the RMSE deviations across learner groups are not attributable to chance. The Chinese group exhibited markedly higher error rates, whereas English speakers had the lowest. The effect size (η² = 0.19) is large, confirming a substantial group-level difference in AI scoring accuracy. Table 7 provides the ANOVA results for RMSE differences across native-language groups.
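A sketch of this style of analysis, assuming simulated per-participant RMSE values whose group means echo the reported pattern, follows; it computes the one-way ANOVA F statistic with SciPy and eta squared as SS_between/SS_total.

```python
# One-way ANOVA on per-participant RMSE by native language, with
# eta squared as the effect size; group data are simulated.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
groups = {
    "English": rng.normal(1.00, 0.30, 312),
    "Chinese": rng.normal(2.15, 0.50, 218),
    "Spanish": rng.normal(1.68, 0.40, 146),
    "Arabic":  rng.normal(1.75, 0.40, 88),
}
f_stat, p = stats.f_oneway(*groups.values())

values = np.concatenate(list(groups.values()))
grand_mean = values.mean()
ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups.values())
eta_sq = ss_between / ((values - grand_mean) ** 2).sum()
print(f"F = {f_stat:.2f}, p = {p:.3g}, eta^2 = {eta_sq:.2f}")
```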
Fairness analysis
Equalized odds test
Fairness evaluation using Equalized Odds revealed systematic bias. For example, English native speakers achieved a TPR of 0.88 and FPR of 0.12, whereas Chinese native speakers had a TPR of 0.76 and FPR of 0.18, yielding ΔEO = 0.12 (p < 0.05). Cross-analysis further showed that A2-level Chinese learners had ΔEO = 0.15, significantly higher than the C1-level group (ΔEO = 0.06, ns). Results of the Equalized Odds test across groups are presented in Table 8.
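The article does not spell out its exact ΔEO computation, so the sketch below assumes one common formulation: binarize pass/fail decisions, compute group-wise TPR and FPR against the human ratings, and take the largest TPR/FPR gap relative to a baseline group. The threshold, baseline choice, and agreement rates are illustrative assumptions.

```python
# Hedged sketch of a Delta-EO check on simulated pass/fail decisions.
import numpy as np

def tpr_fpr(y_true, y_pred):
    tpr = np.mean(y_pred[y_true == 1] == 1)  # true positive rate
    fpr = np.mean(y_pred[y_true == 0] == 1)  # false positive rate
    return tpr, fpr

def delta_eo(y_true, y_pred, group, baseline="English"):
    base_tpr, base_fpr = tpr_fpr(y_true[group == baseline], y_pred[group == baseline])
    gaps = {}
    for g in np.unique(group):
        if g == baseline:
            continue
        tpr, fpr = tpr_fpr(y_true[group == g], y_pred[group == g])
        gaps[g] = max(abs(tpr - base_tpr), abs(fpr - base_fpr))
    return gaps

rng = np.random.default_rng(3)
n = 4000
group = rng.choice(["English", "Chinese"], size=n)
y_true = rng.integers(0, 2, size=n)
agree = np.where(group == "English", 0.88, 0.78)            # per-group agreement
y_pred = np.where(rng.random(n) < agree, y_true, 1 - y_true)
print(delta_eo(y_true, y_pred, group))                      # gap of roughly 0.10
```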
ROC Curve comparison
Receiver operating characteristic (ROC) analysis demonstrated higher discrimination for native speakers (AUC = 0.92, 95% CI: 0.89-0.95) than for non-native learners (AUC = 0.81, 95% CI: 0.78-0.84). Group differences were statistically significant (Z = 3.26, p < 0.01). At FPR = 0.10, TPR reached 0.85 for native speakers but only 0.70 for non-native speakers. These differences are visualized in Figure 3.
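A hedged sketch of the group-wise AUC comparison follows, using scikit-learn's roc_auc_score and a percentile bootstrap for confidence intervals; the study's significance test (Z = 3.26) resembles a DeLong-type comparison, which this simplified bootstrap does not replicate.

```python
# Group-wise AUC with a percentile-bootstrap CI on simulated scores.
import numpy as np
from sklearn.metrics import roc_auc_score

def auc_with_ci(y_true, y_score, n_boot=2000, seed=0):
    rng = np.random.default_rng(seed)
    boots = []
    for _ in range(n_boot):
        idx = rng.integers(0, len(y_true), len(y_true))
        if len(np.unique(y_true[idx])) < 2:   # AUC needs both classes present
            continue
        boots.append(roc_auc_score(y_true[idx], y_score[idx]))
    lo, hi = np.percentile(boots, [2.5, 97.5])
    return roc_auc_score(y_true, y_score), lo, hi

rng = np.random.default_rng(4)
y = rng.integers(0, 2, 300)
scores = {"native": y + rng.normal(0, 0.5, 300),       # stronger signal
          "non-native": y + rng.normal(0, 0.9, 300)}   # weaker signal
for name, s in scores.items():
    auc, lo, hi = auc_with_ci(y, s)
    print(f"{name}: AUC = {auc:.2f} (95% CI {lo:.2f}-{hi:.2f})")
```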
Learner perception model (SEM)
Mediation effect of the AFMM path
Structural equation modeling (SEM) indicated a good model fit (χ²/df = 2.31, RMSEA = 0.042, CFI = 0.95, TLI = 0.94). Accuracy significantly predicted fairness perception (β = 0.48, p < 0.001), which in turn strongly predicted learner satisfaction (β = 0.71, p < 0.001). The direct effect of accuracy on satisfaction was nonsignificant (β = 0.05, p = 0.32). Bootstrap tests confirmed full mediation by fairness perception, with an indirect effect of 0.34 (95% CI: 0.29-0.39). Path analysis results are illustrated in Figure 4.
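The full SEM was fitted in R; as a simplified illustration of the bootstrap mediation logic, the sketch below estimates the indirect effect a × b from two regressions on simulated variables whose population coefficients mirror the reported paths.

```python
# Simplified percentile-bootstrap mediation test (stand-in for the SEM):
# indirect effect a*b of accuracy -> fairness -> satisfaction.
import numpy as np

def indirect_effect(x, m, y):
    a = np.polyfit(x, m, 1)[0]                       # x -> m slope
    X = np.column_stack([np.ones_like(x), x, m])     # regress y on x and m
    b = np.linalg.lstsq(X, y, rcond=None)[0][2]      # m -> y slope, x held
    return a * b

rng = np.random.default_rng(5)
n = 756
accuracy = rng.normal(0, 1, n)
fairness = 0.48 * accuracy + rng.normal(0, 1, n)
satisfaction = 0.71 * fairness + 0.05 * accuracy + rng.normal(0, 1, n)

boots = []
for _ in range(2000):
    idx = rng.integers(0, n, n)
    boots.append(indirect_effect(accuracy[idx], fairness[idx], satisfaction[idx]))
lo, hi = np.percentile(boots, [2.5, 97.5])
print(f"indirect = {indirect_effect(accuracy, fairness, satisfaction):.2f}, "
      f"95% CI [{lo:.2f}, {hi:.2f}]")
```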
Moderating effect of language proficiency
The moderating effect of proficiency revealed stronger fairness sensitivity among lower-level learners. At the A2 level, the fairness-satisfaction path coefficient was 0.89, gradually decreasing to 0.52 at the C1 level (all p < 0.001). The interaction term was significant (β = -0.18, p < 0.05), indicating that learners with higher proficiency tolerate AI scoring deviations more readily. Detailed coefficients by proficiency group are reported in Table 9.
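This moderation test can be illustrated with an ordinary regression containing a fairness × proficiency product term, as sketched below on simulated data; the study itself used SEM multigroup analysis, so this is a conceptual stand-in rather than the reported model.

```python
# Conceptual stand-in for the SEM moderation test: OLS with a
# fairness x proficiency interaction on simulated data (statsmodels);
# slopes decline with proficiency to echo the reported pattern.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(6)
n = 756
fairness = rng.normal(0, 1, n)
proficiency = rng.integers(0, 4, n).astype(float)    # 0 = A2 ... 3 = C1
satisfaction = (0.89 - 0.12 * proficiency) * fairness + rng.normal(0, 1, n)

X = sm.add_constant(np.column_stack([fairness, proficiency,
                                     fairness * proficiency]))
fit = sm.OLS(satisfaction, X).fit()
print(fit.summary(xname=["const", "fairness", "proficiency", "interaction"]))
```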
DATA AVAILABILITY
The data that support the findings of this study have been deposited in the Zenodo repository and are publicly accessible at the following DOI: https://doi.org/10.5281/zenodo.17863904.

Figure 1: End-to-end workflow of the XAI-driven AWE evaluation framework. This figure illustrates the complete experimental pipeline used to evaluate AWE accuracy, fairness, and learner perception. It integrates dual scoring, SEM modeling, and fairness testing to ensure transparent, interpretable, and equitable writing assessment. Abbreviations: XAI = explainable AI; AWE = AI-assisted writing evaluation; SEM = Structural Equation Modeling.

Figure 2: Box plot of RMSE across language groups. This figure shows the RMSE distribution across the four language groups. Chinese learners have the highest median (approximately 2.2) and the greatest variance, with a few high-value outliers. English native speakers have the lowest RMSE values (approximately 1.0), while the Spanish and Arabic groups fall in between. Group differences in RMSE were statistically significant (p < 0.001).

Figure 3: ROC curves comparing native and non-native groups. This figure presents a comparison of ROC curves for native and non-native learners. Native learners exhibit better discrimination performance, with an AUC of 0.92, compared to non-native learners, who have an AUC of 0.81, indicating weaker classification performance. The difference in AUC values was significant (p < 0.05).

Figure 4: Standardized path diagram of the AFMM. This figure illustrates the AI Fairness Mediation Model (AFMM), in which scoring accuracy significantly predicts fairness perception (β = 0.48, p < 0.001) and fairness perception significantly predicts learner satisfaction (β = 0.71, p < 0.001). Accuracy relates to satisfaction only through a non-significant direct path (β = 0.05). Control variables show small or non-significant effects: gender (β = 0.02), age (β = 0.03), and writing experience (β = 0.12).
| Assessment level | Core essence | Assessment dimensions | Theoretical basis | Key reference indicators |
| Technical accuracy | Degree of consistency between AI scores and objective criteria (human expert scores), reflecting the technical effectiveness of the system | 1. Total-score consistency (AI vs. human total scores); 2. Dimension consistency (content, organization, language, etc.) | Classical measurement-error theory | RMSE, Pearson correlation coefficient, rater reliability |
| Fairness assessment | Unbiased AI scoring across groups and individuals, avoiding systematic bias against specific groups (e.g., non-native speakers) | 1. Group fairness (native-language/proficiency group differences); 2. Individual fairness (inter-task stability) | Algorithmic fairness theory | ΔEO (Equalized Odds difference), group RMSE dispersion |
| Learner perception | Learners' subjective cognition, attitudes, and acceptance of AI evaluation, which determine the system's practical value | 1. Fairness perception; 2. Perceived usefulness; 3. Perceived usability | Extended Technology Acceptance Model (TAM) | Likert scale scores (fairness, satisfaction items) |
Table 1: Composition of the three-level evaluation framework. This table outlines the three dimensions of the TLEF-technical accuracy, fairness, and learner perception-along with their theoretical foundations and evaluation indicators.
| Variable class | Name | Indicators/measures | Path relationship |
| Exogenous latent variable | AI accuracy | RMSE, Pearson r | Accuracy → fairness perception |
| Mediating latent variable | Fairness perception | Mean of 8 items (7-point Likert) | Fairness perception → satisfaction |
| Outcome latent variable | Learner satisfaction | Mean of 6 items (7-point Likert) | — |
| Moderating variable | Language proficiency | A2, B1, B2, C1 | Proficiency × fairness perception → satisfaction |
| Control variables | Age, gender, writing experience | Included as model covariates | — |
Table 2: Definition and path of AFMM variables. This table presents the core constructs of the AI Fairness Mediation Model, specifying independent, mediating, moderating, and dependent variables with their hypothesized relationships.
| Variable class | Variable name | Core essence | Operational indicators | Measurement tools/methods | Data type | Range/categories |
| Independent variable | Accuracy of AI scoring | Consistency between AI scores and human expert scores | (1) RMSE of total score and each dimension; (2) Pearson correlation of total score and each dimension | ETS Criterion API + human expert calibration scores | Continuous | RMSE > 0; correlation [-1, 1] |
| | Learner's language background | Learner's native-language type, used to test group bias | (1) Native group (English); (2) Non-native group (Chinese, Spanish, Arabic, other) | Demographic questionnaire | Categorical | English/Chinese/Spanish/Arabic/other |
| | Learner writing proficiency | English writing ability level; tests level differences and moderating effects | CEFR levels: A2 (Beginner), B1 (Elementary), B2 (Intermediate), C1 (Advanced) | Certificate review + pre-class test | Categorical | A2/B1/B2/C1 |
| Dependent variable | Fairness perception | Learners' subjective judgment of the unbiasedness of AI scoring; mediating variable | Mean of 8 items (individual-consistency and group-unbiasedness sub-dimensions) | Structured questionnaire (7-point Likert) | Continuous | 1 (strongly disagree) to 7 (strongly agree) |
| | Learner satisfaction | Learner's acceptance of and willingness to use AI feedback; outcome variable | Mean of 6 items (willingness-to-use and perceived-improvement sub-dimensions) | Structured questionnaire (7-point Likert) | Continuous | 1 (strongly disagree) to 7 (strongly agree) |
| Control variable | Age | Controls for age-related interference with AI acceptance | Age bands: 18-22, 23-28, ≥29 years | Demographic questionnaire | Categorical | 18-22/23-28/≥29 |
| | Gender | Controls for gender-related interference with feedback trust | Female, male | Demographic questionnaire | Categorical | Female/male |
| | Writing experience | Controls for the influence of writing experience on satisfaction | Essays written per year: <10, 10-30, >30 | Demographic questionnaire | Categorical | <10/10-30/>30 |
Table 3: Design summary of variables. This table summarizes the operationalization of all study variables (independent: AI scoring accuracy, linguistic background, proficiency; dependent: fairness perception, learner satisfaction; control: age, gender, writing experience).
| Scoring dimension | Overall sample (r) | Native group (r) | Non-native group (r) | Chinese native (r) | A2 level (r) | C1 level (r) |
| Total score | 0.82*** | 0.89*** | 0.76*** | 0.72*** | 0.68*** | 0.86*** |
| Content | 0.73*** | 0.81*** | 0.67*** | 0.62*** | 0.59*** | 0.78*** |
| Organization | 0.76*** | 0.83*** | 0.70*** | 0.65*** | 0.63*** | 0.80*** |
| Language use | 0.84*** | 0.88*** | 0.80*** | 0.77*** | 0.72*** | 0.85*** |
| Grammar accuracy | 0.89*** | 0.92*** | 0.87*** | 0.85*** | 0.81*** | 0.90*** |
| Vocabulary diversity | 0.85*** | 0.89*** | 0.82*** | 0.79*** | 0.75*** | 0.88*** |
Table 5: Correlation matrix between AI scores and human expert scores. This table provides Pearson correlation coefficients for total and dimension-specific scores, grouped by language background and proficiency level. ***p < 0.001.
| Metric | Group/overall | Value | 95% confidence interval (CI) | Effect size | Interpretation |
| Correlation (r) | Overall | 0.82 | 0.79-0.85 | Large | Strong alignment between AI and human ratings |
| Correlation (r) | Native | 0.89 | 0.86-0.92 | Large | Very high consistency |
| Correlation (r) | Chinese | 0.72 | 0.68-0.76 | Medium | Lower agreement with greater variability |
| RMSE | Native | 1.02 | 0.95-1.09 | Medium | Small scoring deviation |
| RMSE | Chinese | 2.15 | 2.03-2.28 | Large | Largest deviation and most variability |
| ΔEO | Overall | 0.12 | 0.08-0.16 | Medium | Clear fairness disparity |
| ΔEO | A2 proficiency | 0.15 | 0.11-0.20 | Large | Strong bias at low proficiency |
| AUC | Native | 0.92 | 0.89-0.95 | Large | Strong discrimination |
| AUC | Non-native | 0.81 | 0.78-0.85 | Medium | Reduced discrimination ability |
Table 6: Statistical accuracy and fairness measures. This table summarizes the major performance measures used to assess the accuracy, fairness, and discrimination capacity of the AWE system. Confidence intervals and effect sizes provide statistical transparency and facilitate comparison across groups.
| Source of Variation | SS | df | MS | F | p-value | η² (Effect Size) |
| Between Groups | 82.47 | 3 | 27.49 | 36.82 | < 0.001 | 0.19 (Large) |
| Within Groups | 548.32 | 760 | 0.72 | — | — | — |
| Total | 630.79 | 763 | — | — | — | — |
Table 7: ANOVA results for RMSE differences across native-language groups. The ANOVA tests whether scoring errors differ significantly among English, Chinese, Spanish, and Arabic learners.
| Native Language Group | TPR | FPR | ΔEO | Interpretation of Fairness Bias |
| English (n = 312) | 0.88 | 0.12 | – | Baseline group (highest fairness performance) |
| Chinese (n = 218) | 0.76 | 0.18 | 0.12* | Significant fairness bias; higher FPR and lower TPR compared to English |
| Spanish (n = 146) | 0.82 | 0.16 | 0.06 | Mild fairness deviation, not statistically significant |
| Arabic (n = 88) | 0.80 | 0.17 | 0.07 | Mild-to-moderate fairness deviation |
Table 8: Results of the Equalized Odds test by native language group. This table summarizes fairness test outcomes, including TPR, FPR, and ΔEO values across different native language groups, with thresholds indicating significant fairness bias.
| CEFR level | n | Path coefficient (fairness → satisfaction) | SE | p-value | Interaction effect (β) | Interpretation |
| A2 | 112 | 0.89 | 0.06 | <0.001 | -0.18 | Strongest fairness sensitivity |
| B1 | 196 | 0.78 | 0.05 | <0.001 | (reference group: A2) | High sensitivity |
| B2 | 280 | 0.65 | 0.04 | <0.001 | -0.05 | Moderate sensitivity |
| C1 | 176 | 0.52 | 0.05 | <0.001 | -0.09 | Lower sensitivity; more tolerant of scoring deviations |
| Overall interaction term | — | -0.18 | — | 0.03 | — | Significant moderation effect |
Table 9: Moderating effect of language proficiency on the fairness perception → satisfaction path. This table presents the structural equation modeling results for the moderation analysis, including sample sizes, path coefficients, and standard errors by CEFR level, together with the interaction term.
The research examined an AWE system under a three-level approach encompassing technical accuracy, group and individual fairness, and learner perception, and found that overall validity and systematic group differences coexist. Correlations between AI and expert ratings were strong (aggregate r = 0.82), but subgroup differences were evident (native r = 0.89 vs. non-native r = 0.76; Chinese r = 0.72; Table 6). The RMSE distributions likewise indicated higher errors and variability for Chinese learners (Figure 2). These trends suggest construct underrepresentation and possibly domain shift: where interlanguage features are underweighted in training, the model captures surface-level correctness (e.g., grammar) more effectively than discourse-level qualities (e.g., content, argumentation)29.
Fairness analyses sharpen this picture. Equalized Odds revealed considerable disparities for Chinese students (ΔEO = 0.12, p < 0.05), with the largest gaps at lower proficiency levels (A2 ΔEO = 0.15; Table 6). ROC curves likewise showed weaker discrimination for non-native groups (AUC 0.81 vs. 0.92; Figure 3). Taken together, these findings support the view that maximizing overall accuracy alone does not ensure equal error rates across groups, particularly when groups are unevenly distributed or exhibit interlanguage characteristics that an English-dominant model can misdetect30.
On the perception level, SEM revealed that perceived fairness fully mediates the relationship between perceived accuracy and satisfaction (indirect effect = 0.34; Figure 4), while proficiency moderates fairness sensitivity (path coefficients declining from 0.89 at A2 to 0.52 at C1; Table 9). This is consistent with technology acceptance research: when users perceive procedures and outcomes as fair and comprehensible, satisfaction and willingness to continue rise, even if technical metrics are constant31,32,33. For lower-proficiency learners, fairness signals may function as a safety cue that compensates for uncertainty about rubric alignment and language norms.
Why the Chinese-group gap? Beyond data imbalance, typical interlanguage markers (e.g., L1-influenced collocations or discourse moves) may be scored as errors rather than as developmental features, inflating false positives at the high-score threshold and false negatives at the low-score threshold-precisely the pattern ΔEO captures34. Rubric mapping can also matter: if the AI rubric overemphasizes surface features, it may systematically underemphasize content relevance and organization where cross-cultural rhetorical preferences diverge.
Robustness and limitations are of note. This study used three prompts, one AWE system (ETS Criterion), and three large L1 groups; future work should expand genres, prompts, and languages to test generalizability and to assess measurement invariance across groups35,36. ΔEO thresholds (e.g., ≥0.10) aid decision-making but should be reported with confidence intervals and complemented by calibration and predictive-parity checks to avoid single-metric conclusions37,38. Finally, although rater ICCs were high, future studies could include cross-site rater pools and double-blind adjudication to further reduce human-side variance39,40.
Earlier approaches, such as the AWE acceptance model validated in a single context and the PCA-Random Forest AES framework validated on accuracy alone, were narrow in scope. These gaps are addressed by the proposed XAI TLEF-AFMM approach, which combines fairness diagnostics with mediation modeling of learner perceptions.
In data processing and fairness testing, several troubleshooting steps were implemented: three AI outputs deviating by 8 or more points were corrected manually, eight questionnaires with inconsistent items were discarded, and ΔEO thresholds were applied with confidence intervals. These procedures ensured data quality and sound fairness estimates, strengthening the accuracy and fairness assessment across native-language and proficiency groups and allowing the system's performance pattern to be interpreted with high reliability.
The AWE system was effective and feasible: the automated mode of scoring saved a significant amount of assessment time compared to the manual mode, the two-way AI-human workflow process was quicker at identifying mistakes, and the uniform essay prompts simplified the implementation process. These characteristics show the ability of an explainable, multi-level evaluation framework to improve the level of transparency and equity because this allows the behavior of the system to be more interpretable by the researchers and educators.
The essential steps to achieve successful replication are as follows: dual AI-human scoring through prior rater calibration; inter-rater reliability verified through recorded ICCs; fairness and satisfaction measurement through established ΔEO thresholds with confidence intervals; and use of three standardized essay prompts. These steps are clearly documented and make transparent validation feasible, allowing accuracy and fairness analyses to be replicated in other educational environments.
Conclusions
The AWE system can achieve good aggregate validity; however, certain accuracy errors and fairness differences can still be observed-in particular, among lower-proficiency, non-native learners. The perception of fairness is critical: it completely mediates the effect of perceived accuracy on satisfaction, and its effect is the strongest among A2-B1 learners. Integrating fairness as a central validation dimension, alongside accuracy and user perception, offers a more comprehensive and educationally valuable approach to evaluating AWE systems.
Practical recommendations arising from this research point in three directions. For developers, it will prove vital to extend multilingual and proficiency-stratified corpora, conduct group-aware error analysis, and adopt fairness-weighted objectives with parity constraints that manage error-rate trade-offs at chosen operating points. It is also essential to enhance feedback transparency by offering uncertainty prompts (e.g., score intervals) and explanation artifacts informed by L2 characteristics to build user confidence. For educators, fairness-audited AWE systems should be prioritized for A2-B1 cohorts, with human-review triggers set when discrepancies become excessive (e.g., RMSE > 3.0 or significant mismatches between AI scores and learner self-assessment), while rubric transparency should be scaffolded to align learners' expectations41. For researchers, future work should explore alternative fairness criteria such as within-group calibration and predictive parity, conduct prompt- and genre-sensitivity studies, and evaluate measurement invariance across L1s and proficiency levels. Overall, treating fairness and perception as first-class dimensions of explainability reframes how AWE systems are validated and deployed, shifting emphasis from aggregate accuracy alone toward transparent and equitable support for learning.
Disclosures
The author has no conflicts of interest to disclose.
Acknowledgments
None.
| Name | Description | Source | Identifier |
| Data Storage System | Encrypted, access-controlled servers for storing anonymized data | Institutional servers | STORAGE-002 |
| ETS Criterion System | AI-assisted writing evaluation system used for scoring the writing tasks | Educational Testing Service (ETS) | ETS-001 |
| Fairness and Accuracy Analysis Tools | Tools for RMSE, Equalized Odds, and statistical analysis | Custom scripts/statistical packages | TOOL-FA-001 |
| Human Expert Ratings | Independent ratings provided by three linguists with over 10 years of experience | In-house raters | HR-EXP-003 |
| Learner Perception Questionnaire | 22-item questionnaire on fairness, satisfaction, and moderating factors, rated on a 7-point Likert scale | In-house developed | QUES-008 |
| Statistical Software (R 4.3.1) | Used for data analysis, including SEM (Structural Equation Modeling) | R Foundation | R-SW-431 |
| Stratified Random Sampling Data | Data collected from 764 multilingual learners across CEFR levels A2 to C1 | Study participants | DATA-764 |
| Writing Task Prompts | Three standardized essay topics on globalization, online education, and AI ethics | Moodle-based platform | PROMPT-003 |