Research Article

Explainable AI Framework for Accuracy, Fairness, and Learner Perception in English Writing Assessment

DOI: 10.3791/69841

December 23rd, 2025

Summary

This study develops a three-tier evaluation framework and a fairness mediation model to assess AI-assisted English writing systems. Using 764 cross-linguistic samples, the results show accuracy disparities, fairness bias against non-native learners (especially Chinese learners at the A2 proficiency level), and fairness perception as the key mediator of user satisfaction, with both theoretical and practical implications.

Abstract

In the context of the global digital transformation of education, automated writing evaluation (AWE) has been widely adopted for its real-time, standardized feedback; however, traditional accuracy-oriented frameworks often neglect equity concerns and learners' perceptions, which limits their transparency and educational value. To address this limitation, this research proposes an explainable AI (XAI) framework designed to provide transparent and interpretable feedback so that learners can understand and trust automated evaluation. The framework integrates a multi-level validation model, the Three-Level Evaluation Framework (TLEF), which spans technical accuracy, group and individual equity, and learner perception, together with the AI Fairness Mediation Model (AFMM). Using stratified random sampling, data were collected from 764 multilingual learners (native speakers of English, Chinese, and Spanish) across Common European Framework of Reference for Languages (CEFR) levels A2 to C1 through writing tasks, dual scoring by AI and human experts, and structured questionnaires. Validity, fairness, and the learner-perception relationship were examined by combining correlation analysis, root mean square error (RMSE), Equalized Odds testing, and Structural Equation Modeling (SEM). The findings reveal that while the AWE system (ETS Criterion) achieves overall validity (r = 0.82), significant disparities remain: Chinese native speakers show the lowest agreement with human raters (r = 0.72) and the highest RMSE (median 2.15), fairness biases are most pronounced at lower proficiency levels (ΔEO = 0.15 for A2 learners), and perceived fairness fully mediates the link between perceived accuracy and learner satisfaction, with proficiency moderating fairness sensitivity. By reframing fairness and perception as essential dimensions of explainability, the research enhances the theoretical grounding of AWE and provides a practical pathway for increasing transparency, equity, and social acceptance in educational technologies.
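
To make the Equalized Odds statistic (ΔEO) mentioned above more concrete, the following Python sketch shows one common way such a gap can be computed. It is an illustration under stated assumptions, not the article's procedure: the visible text does not give the exact definition, so the sketch assumes scores are binarized at a hypothetical pass threshold, treats the human rating as the reference label, and takes ΔEO as the larger of the between-group gaps in true-positive and false-positive rates. All names and data values are invented for illustration.

```python
from collections import defaultdict


def tpr_fpr(y_true, y_pred):
    """Return (TPR, FPR) for binary labels; a rate is None if undefined."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    tpr = tp / (tp + fn) if (tp + fn) else None
    fpr = fp / (fp + tn) if (fp + tn) else None
    return tpr, fpr


def spread(values):
    """Max minus min over the defined values (0.0 if none are defined)."""
    defined = [v for v in values if v is not None]
    return max(defined) - min(defined) if defined else 0.0


def equalized_odds_gap(human, ai, groups, threshold):
    """Largest between-group gap in TPR or FPR after binarizing the scores.

    Assumption: the human score is the reference label and the AI score is
    the prediction; both are binarized as score >= threshold.
    """
    by_group = defaultdict(lambda: ([], []))
    for h, a, g in zip(human, ai, groups):
        by_group[g][0].append(int(h >= threshold))
        by_group[g][1].append(int(a >= threshold))
    rates = {g: tpr_fpr(t, p) for g, (t, p) in by_group.items()}
    gap = max(spread(r[0] for r in rates.values()),
              spread(r[1] for r in rates.values()))
    return gap, rates


# Toy scores, invented for illustration only (not study data).
human = [5, 4, 2, 3, 5, 4, 2, 3]
ai = [5, 4, 2, 4, 4, 3, 2, 2]
groups = ["group_a"] * 4 + ["group_b"] * 4

gap, rates = equalized_odds_gap(human, ai, groups, threshold=4)
print(rates)
print("Equalized Odds gap:", gap)
```

On this toy data the gap is 0.5, because group_a has a TPR of 1.0 and an FPR of 0.5 while group_b has a TPR of 0.5 and an FPR of 0.0. A gap of zero would mean the AI passes and fails learners at the same conditional rates in every group.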

Introduction

The globalization of education and the spread of digital technologies have increased the need for scientific, credible evaluation of English writing proficiency in language teaching, academic development, and career advancement [1]. Conventional writing evaluation, which relies on human raters, can capture subjective aspects of writing such as the thoroughness of argumentation and cultural suitability [2], but it suffers from long turnaround times, high labor costs, and bias arising from raters' experience and preferences [3,4]. These constraints are especially acute in lar....

Protocol

This section summarizes ethical approval, participant recruitment, essay administration, dual scoring by ETS Criterion and human experts, learner-perception evaluation, and statistical analysis. It shows how accuracy, fairness, and SEM-based perception modeling are integrated into a unified XAI validation pipeline. The XAI-driven AWE evaluation framework is illustrated in Figure 1.
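
As a rough illustration of how dual scores can be compared, the Python sketch below computes Pearson's r and RMSE between human and AI scores separately for each learner group. It is a minimal sketch under assumptions: the record layout, function names, and toy values are hypothetical, and the study's own analysis was run in R rather than Python.

```python
import math


def pearson_r(x, y):
    """Pearson correlation between two equal-length score lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def rmse(x, y):
    """Root mean square error between two equal-length score lists."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)) / len(x))


def agreement_by_group(records):
    """records: iterable of (group, human_score, ai_score) tuples."""
    grouped = {}
    for group, human_score, ai_score in records:
        human_scores, ai_scores = grouped.setdefault(group, ([], []))
        human_scores.append(human_score)
        ai_scores.append(ai_score)
    return {
        group: {"n": len(h), "r": pearson_r(h, a), "rmse": rmse(h, a)}
        for group, (h, a) in grouped.items()
    }


# Toy records, invented for illustration only (not study data).
records = [
    ("group_a", 1, 1), ("group_a", 2, 2), ("group_a", 3, 3), ("group_a", 4, 4),
    ("group_b", 1, 2), ("group_b", 2, 2), ("group_b", 3, 4), ("group_b", 4, 6),
]

for group, stats in agreement_by_group(records).items():
    print(group, stats)
```

In this toy example group_a shows perfect agreement (r = 1.0, RMSE = 0.0), while group_b shows r of about 0.94 and RMSE of about 1.22. Reporting both statistics per group matters because a high correlation can coexist with a systematic scoring offset that only RMSE reveals.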

Procedure:

The procedure involved several steps. First, IRB approval was obtained, and informed consent was collected from all participants. Independent, de....

Results

This section presents the results along five analytical dimensions: experimental design, participant characteristics, scoring accuracy, fairness assessment, and modeling of learning and perception. The reported outcomes cover statistical performance, group differences, fairness disparities, and SEM-based mediation and moderation.
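
The idea of mediation can be sketched with a simple regression-based decomposition. The Python example below is a deliberately simplified stand-in for the study's SEM, which was fitted in R and is not reproduced here: it uses ordinary least squares on observed variables only, with no latent factors, standard errors, or moderation terms. The variable names and toy values are hypothetical. The total effect of perceived accuracy on satisfaction is split into a direct path and an indirect path through perceived fairness; full mediation corresponds to a direct effect near zero alongside a non-zero indirect effect.

```python
def cross_dev(x, y):
    """Sum of cross-products of deviations from the means."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    return sum((a - mx) * (b - my) for a, b in zip(x, y))


def mediation_effects(x, m, y):
    """Product-of-coefficients mediation for X -> M -> Y using OLS.

    a:      slope of M regressed on X
    b:      slope of M in the regression of Y on X and M
    direct: slope of X in the regression of Y on X and M
    """
    sxx, smm = cross_dev(x, x), cross_dev(m, m)
    sxm, sxy, smy = cross_dev(x, m), cross_dev(x, y), cross_dev(m, y)
    a = sxm / sxx
    det = sxx * smm - sxm ** 2  # zero only if X and M are perfectly collinear
    b = (smy * sxx - sxy * sxm) / det
    direct = (sxy * smm - smy * sxm) / det
    return {"a": a, "b": b, "direct": direct,
            "indirect": a * b, "total": sxy / sxx}


# Toy data, invented for illustration only (not study data). Satisfaction is
# built to depend on fairness alone, so the direct effect should vanish.
perceived_accuracy = [1, 2, 3, 4, 5, 6]
perceived_fairness = [1.5, 1.8, 3.4, 3.9, 5.2, 5.7]
satisfaction = [0.8 * f + 1.0 for f in perceived_fairness]

effects = mediation_effects(perceived_accuracy, perceived_fairness, satisfaction)
for name, value in effects.items():
    print(f"{name}: {value:.4f}")
```

For OLS the total effect always equals the direct effect plus the indirect effect, which gives a quick sanity check on the decomposition. A full analysis would also estimate uncertainty for the indirect effect, for example by bootstrapping.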

Experimental setup

The key software steps involved setting up ETS Criterion through its API to automatically score th.......

Discussion

This study examined an AWE system using a three-level approach covering technical accuracy, group and individual fairness, and learner perception, and found that overall validity coexists with systematic group differences. AI and expert ratings correlated strongly overall (aggregate r = 0.82), but agreement varied by subgroup (native r = 0.89 vs. non-native r = 0.76; Chinese r = 0.72; Table 6). The distributions of RMSEs also indicated higher errors and variability in Chin.......

Disclosures

The author has no conflicts of interest to disclose.

Materials

List of materials used in this article
Name | Company | Catalog Number | Comments
Data Storage System | Institutional servers | STORAGE-002 | Encrypted, access-controlled servers for storing anonymized data.
ETS Criterion System | Educational Testing Service (ETS) | ETS-001 | AI-assisted writing evaluation system used for scoring the writing tasks.
Fairness and Accuracy Analysis Tools | Custom scripts/stat packages | TOOL-FA-001 | Tools for RMSE, Equalized Odds, and statistical analysis.
Human Expert Ratings | In-house raters | HR-EXP-003 | Independent ratings provided by three linguists with over 10 years of experience.
Learner Perception Questionnaire | In-house developed | QUES-008 | An 8-item questionnaire on fairness and satisfaction, rated on a 7-point Likert scale.
Statistical Software (R 4.3.1) | R Foundation | R-SW-431 | Used for data analysis, including SEM (Structural Equation Modeling).
Stratified Random Sampling Data | Study participants | DATA-764 | Data collected from 764 multilingual learners across CEFR levels A2 to C1.
Writing Task Prompts | Moodle-based platform | PROMPT-003 | Three standardized essay topics on globalization, online education, and AI ethics.

References

  1. Voogt, J., Roblin, N. P. 21st century skills. Discussienota. 23 (03), 2000 (2000).
  2. Weigle, S. C. Assessing writing. Cambridge University Press. (2002).
  3. Barkaoui, K.

Tags

Explainable AI, Automated Writing Evaluation, AI Fairness, Learner Perception, Writing Assessment, Three Level Evaluation, Structural Equation Modeling, Equalized Odds, Multilingual Learners, Educational Technology