Research Article
This study develops a three-tier evaluation framework and fairness mediation model to assess AI-assisted English writing systems. Using 764 cross-linguistic samples, the results show accuracy disparities, fairness bias against non-native learners (especially Chinese learners at the A2 proficiency level), and fairness perception as the key mediator of user satisfaction, offering theoretical and practical implications.
In the context of global educational digital transformation, automated writing evaluation (AWE) has been widely adopted for its real-time, standardized feedback; however, traditional accuracy-oriented frameworks often neglect equity concerns and learners' perceptions, thereby limiting transparency and educational value. To address this limitation, this research proposes an explainable AI (XAI) framework designed to provide transparent and interpretable feedback, allowing learners to understand and trust automated evaluation. The framework integrates a multi-level validation model, the Three-Level Evaluation Framework (TLEF), spanning technical accuracy, group and individual equity, and learner perception, together with the AI Fairness Mediation Model (AFMM). Using stratified random sampling, data were collected from 764 multilingual learners (native speakers of English, Chinese, and Spanish) across Common European Framework of Reference for Languages (CEFR) levels A2 to C1 through writing tasks, dual scoring by AI and human experts, and structured questionnaires. Multiple statistical analyses were employed to examine validity, fairness, and the learner-perception relationship, combining correlation, root mean square error (RMSE), Equalized Odds testing, and Structural Equation Modeling (SEM). The findings reveal that while the AWE system (ETS Criterion) achieves overall validity (r = 0.82), significant disparities remain: Chinese native speakers demonstrate the lowest agreement with human raters (0.72) and the highest RMSE (median 2.15); fairness biases are most pronounced at lower proficiency levels (ΔEO = 0.15 for A2 learners); and perceived fairness fully mediates the link between perceived accuracy and learner satisfaction, with proficiency moderating fairness sensitivity. By reframing fairness and perception as essential dimensions of explainability, the research enhances the theoretical grounding of AWE and provides a practical pathway for increasing transparency, equity, and social acceptance in educational technologies.
The intensifying globalization of education and digital technologies has increased the need to evaluate English writing scientifically and credibly for language teaching, academic development, and career advancement1. Conventional writing evaluation by human raters can measure subjective aspects of writing, such as the thoroughness of argumentation and cultural suitability2, but suffers from long turnaround times, high labor costs, and bias arising from evaluators' experience and inclinations3,4. These constraints are especially acute in large-scale practice, such as international language tests (IELTS, TOEFL) or university courses taught in English, where manual scoring alone cannot provide the required immediacy and coverage of feedback5.
AWE systems have become widely used in this context due to their real-time processing, standardization, and scalability6. Popular tools such as Grammarly (which focuses on grammar errors and style refinement) and ETS Criterion (which adheres to formal writing norms) are currently used by millions of students in K-12 education, language schools, higher education, and individual training7. Despite these benefits, the technological efficiency and educational applicability of AWE systems remain disputed8. Technically, existing systems are highly accurate on objective dimensions, such as error detection and lexical diversity, where correlations with human scoring can exceed 0.859. However, in more subjective areas, such as content relevance, logical argumentation, and text organization, correlations often fall below 0.7010. This imbalance risks promoting superficial accuracy among learners at the expense of overall writing competence11.
The issue of equity also limits the educational usefulness of AWE. Current studies tend to focus on aggregate accuracy indicators, neglecting deviations that systematically disadvantage particular groups12. For instance, interlanguage features common to Chinese or Spanish learners may be misclassified as errors, resulting in systematic underestimation13,14. In addition, learners' subjective acceptance of AI feedback remains poorly understood15. Surveys indicate that almost one-third of non-native learners report a mismatch between AI scores and their actual performance, and the mechanisms linking technical accuracy, group equity, and learner satisfaction remain poorly comprehended16.
These weaknesses reflect the shortcomings of the classical paradigm of accuracy17. A framework that considers only the alignment between AI and human scoring cannot capture issues of equity or the learner's trust in the system. In practice, the educational value of AWE must satisfy three conditions simultaneously: technical precision, fairness across groups, and learner acceptance18. The absence of such a comprehensive validation approach helps explain why AWE systems enjoy widespread adoption yet limited trust in educational practice19,20.
To address this challenge, the present study introduces a multi-level validation framework that integrates technical accuracy, group and individual fairness, and learner perception into a coherent structure. The proposed XAI framework is designed to be practically implemented within existing AWE platforms by providing teachers and students with fairness diagnostics and transparent score explanations, and can be applied in writing courses or test-preparation classes to evaluate its ability to enhance fairness, interpretability, and instructional usefulness in real assessment settings.
In this context, the AFMM is hypothesized to explain the mediating role of perceived fairness in the relationship between accuracy and satisfaction, as well as the moderating role of language proficiency on fairness sensitivity. The study therefore contributes in two ways: theoretically, by enriching AWE evaluation models through treating fairness as a key validation dimension alongside accuracy and perception; and practically, by providing developers with strategies for maximizing fairness, educators with group-sensitive system-selection criteria, and an account of how learners' perceptions are formed, thereby strengthening the educational value of AWE. Beyond education, the framework also aligns with the broader concept of XAI, demonstrating how fairness and user perception can enhance transparency, trust, and acceptance in other areas, such as healthcare, autonomous systems, and cybersecurity.
Research Questions:
1. To what extent does the AWE system demonstrate technical accuracy and fairness across different native-language and proficiency groups?
2. How can an XAI-based multi-level evaluation framework improve transparency and equity in automated English writing assessment?
LITERATURE REVIEW:
The factors affecting college students' acceptance of AWE feedback were examined using an extended Technology Acceptance Model (TAM)21. Based on SEM applied to survey data from 448 Chinese students, subjective norm, trust, self-efficacy, cognitive feedback, and system characteristics were found to significantly influence perceived usefulness, ease of use, and behavioral intention. However, the study was limited to a single nation and a single student group, which limits generalizability. To explore how Chinese EFL students respond to Pigai AWE feedback22, a study analyzed repeated submissions (n = 5) from university students. It noted an early emphasis on error correction, low uptake of linguistic feedback, and a gradual deepening of engagement over successive drafts. However, the small sample and single AWE system restrict applicability and generalizability. EFL teachers' beliefs about the AI grading tool CoGrader were examined to identify the factors shaping their views23. A mixed-method study of 10 Saudi university teachers, combining a survey and interviews, revealed broadly positive opinions alongside reluctance to fully trust its reliability or to accept complete teacher replacement. The limited sample and single-country setting again impede generalization.
Considering developments in corpus linguistics and AI technology, one study investigated AES frameworks24. It employed PCA to refine linguistic indicators for evaluating writing quality and found that combining micro-level characteristics with aggregated characteristics captured writing quality more effectively than aggregated characteristics alone. A non-linear AES approach based on Random Forest Regression outperformed the alternatives. Furthermore, SHAP identified the language elements essential to each evaluated attribute, increasing system transparency via explainable AI. These results may help advance multidimensional approaches to writing evaluation and education. Separately, a human-machine collaboration system was introduced to address the expensive and time-consuming annotation of Arabic writing. The method scores essays on seven literature-derived features with the help of an LLM. Validation procedures and prompting tactics were customized to ensure consistency and accuracy. The collaboration yields a larger supply of labeled resources without degrading evaluation quality, demonstrating a scalable data-annotation method for lower-resource languages.
The use of AI in education offers an opportunity to reduce grading workload significantly and enhance writing instruction25,26. At the same time, researchers have emphasized that accuracy is not the only aspect relevant to responsible use; principles of fairness and bias reduction, security and privacy, accountability, explainability, transparency, educational impact, integrity, and continuous improvement also apply. Recent research has empirically evaluated zero-shot scoring based on GPT-4o with a focus on these requirements. Another study examined educators' perceptions of ADWTs with respect to educational integrity27. A cross-sectional study of 100 graduate students and professors across 10 subjects found that, although teachers acknowledged the benefits of ADWTs for educational objectives, they noted limitations such as limited accessibility, lack of knowledge, and concerns about effects on integrity and creativity. The study concluded that as AI technologies become more integrated into education, ethical safeguards and stakeholder participation are necessary for their successful and responsible use. Further research compared the efficacy of AI tools with human assessors in evaluating essays by EFL pupils28. An assessment of 30 essays revealed that, while AI offered high-quality comments on content, language, organization, and correctness, it consistently assigned lower ratings than human raters. Furthermore, AI provided more thorough feedback, but scores from the various AI tools did not differ substantially.
Research Gap:
Currently, most AWE scholarship examines either accuracy or user acceptance. Very few studies examine whether scoring differences systematically disadvantage particular native-language or proficiency groups. Because previous studies have typically examined user acceptance for a single AWE system, within a single country, and with a small sample, questions of generalizability arise. Although SHAP and PCA are XAI strategies developed to increase transparency, no studies have examined fairness mechanisms or how learners use AI feedback from AWE systems. The literature offers no comprehensive framework covering defined dimensions of accuracy, fairness analysis, and learner perception, and no explainable evaluation model that jointly considers intra- and inter-rater accuracy, fairness, and learner perceptions. This research therefore proposes and validates an explainable framework, the TLEF, and a combined model, the AFMM, to assess accuracy, fairness, and learner perceptions simultaneously among multilingual, proficiency-diverse learners.
The ethical approval and participant recruitment process, including essay administration, dual scoring by ETS Criterion and experts, learner-perception evaluation, and statistical analysis, are summarized in this section. It highlights how accuracy, fairness, and SEM-based perception modeling are integrated into a unified XAI validation pipeline. The XAI-driven AWE evaluation framework is illustrated in Figure 1.
Procedure:
The procedure involved several steps. First, IRB approval was obtained, and informed consent was collected from all participants. Independent, dependent, and control variables were then defined. Standardized writing tasks were administered on Moodle using three neutral essay topics, and writing samples were collected while ensuring adherence to essay requirements, such as word count, time limit, and structure. Dual scoring was conducted using ETS Criterion outputs combined with human expert ratings. Learner-perception questionnaires were distributed immediately after essay submission. Data screening and quality control procedures were implemented to address anomalies, such as cheating or invalid responses. Fairness analysis thresholds (ΔEO, RMSE checks) were also applied. Finally, all anonymized data were stored securely on encrypted, access-controlled servers.
Ethical approval and informed consent
This study received ethics approval from the Institutional Review Board of the authors' institution. All procedures were conducted in accordance with the Declaration of Helsinki and applicable regulations. All participants were adults (≥18 years) and provided written informed consent before participation. Writing samples and questionnaire responses were de-identified at source and stored on encrypted, access-controlled servers; only authorized investigators had access. Human raters were blinded to participants' native language, proficiency level, and demographics. Participation was voluntary, with the right to withdraw at any time, and no deception or sensitive interventions were involved. Formal approval documentation can be provided to the journal upon request.
Variable design
A total of three groups of variables were defined to guide the analysis. Table 1 summarizes the measurement methods and data types for each construct and provides the full operational definitions of the independent, dependent, and control variables.
AI scoring accuracy was the first independent variable, assessed via RMSE and the Pearson correlation coefficient (r) between ETS Criterion outputs and expert ratings. Expert calibration yielded an ICC of 0.91, confirming rating reliability.
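As a minimal illustration of these two indicators, the sketch below computes RMSE and Pearson's r on hypothetical score arrays; it is not the study's actual pipeline, which drew scores from the ETS Criterion API.

```python
# Minimal sketch (not the study's pipeline): computing the two accuracy
# indicators, RMSE and Pearson's r, between AI and human expert scores
# on hypothetical arrays.
import numpy as np
from scipy import stats

ai_scores = np.array([4.0, 3.5, 5.0, 2.5, 4.5])     # hypothetical AI scores
human_scores = np.array([4.5, 3.0, 5.0, 3.0, 4.0])  # hypothetical expert ratings

rmse = np.sqrt(np.mean((ai_scores - human_scores) ** 2))
r, p_value = stats.pearsonr(ai_scores, human_scores)
print(f"RMSE = {rmse:.2f}, Pearson r = {r:.2f} (p = {p_value:.3f})")
```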
The second independent variable was the linguistic background of the learners, which was divided into native and non-native speakers, and further subdivision was made into Chinese, Spanish, Arabic, and other groups. Chinese students were one of the target populations because preliminary indications of systematic underestimation were observed.
The third independent variable was writing proficiency, rated according to CEFR levels A2 to C1, confirmed by official certificates and pre-class proficiency tests, and aligned with IELTS equivalencies. Writing proficiency was also introduced as a moderator in the AI Fairness Mediation Model to test whether sensitivity to fairness differs across proficiency levels.
Perception of fairness and learner satisfaction were the dependent variables. Perception of fairness was assessed with an eight-item questionnaire rated on a seven-point Likert scale, covering individual consistency and group impartiality (Cronbach's α = 0.87; CVI = 0.92). Learner satisfaction was assessed using six Likert items measuring willingness to use the system and perceived skill improvement (α = 0.85).
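For reference, scale reliability of this kind can be estimated with a standard Cronbach's alpha computation; the sketch below uses simulated Likert responses, not the study's pilot data.

```python
# Illustrative Cronbach's alpha for a multi-item Likert scale; the
# 60 x 8 response matrix is simulated and not the study's pilot data.
import numpy as np

def cronbach_alpha(items: np.ndarray) -> float:
    # alpha = k/(k-1) * (1 - sum of item variances / variance of total score)
    k = items.shape[1]
    item_vars = items.var(axis=0, ddof=1)
    total_var = items.sum(axis=1).var(ddof=1)
    return k / (k - 1) * (1 - item_vars.sum() / total_var)

rng = np.random.default_rng(0)
latent = rng.normal(4, 1, size=(60, 1))                        # shared construct
items = np.clip(np.round(latent + rng.normal(0, 1, (60, 8))), 1, 7)
print(f"alpha = {cronbach_alpha(items):.2f}")
```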
The variables were controlled for in terms of age, sex, and writing experience. Age was divided into three groups (18-22 years, 23-28 years, and ≥29 years), and gender was categorized into male and female. Writing experience was categorized into three levels of frequency per year.
Writing task texts
Standardized argumentative essay prompts were formulated to obtain writing data for three neutral topics, Globalization's Impact on Local Cultures, Advantages and Challenges of Online Education, and Ethical Boundaries of Artificial Intelligence. These themes were aimed at balancing cognitive difficulty and accessibility on the one hand, and reducing performance differences due to previous knowledge on the other. The distribution of topics and descriptive statistics for essay length are reported in Table 2.
Each essay was required to be 250 words ±10% and written within 45 minutes on a Moodle-based platform. Auxiliary tools were prohibited, and late submissions were excluded. Essays followed a standardized structure of introduction, two argument paragraphs, and conclusion. In total, 764 valid essays were collected, with an average length of 252.3 words (SD = 8.7).
Scoring comparison data
Accuracy of AWE scoring was assessed using a dual procedure that combined ETS Criterion outputs with human expert ratings. Scores were retrieved from Criterion via its open API. Three linguists with more than ten years of assessment experience independently scored all essays. Before formal scoring, the raters completed three calibration sessions. During calibration, inter-rater reliability reached ICC = 0.87; during formal scoring, ICC rose to 0.91, with dimension-specific ICCs above 0.88. Essays with score discrepancies greater than two points were resolved collectively (18 cases). The scoring workflow and reliability outcomes are summarized in Table 3.
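Inter-rater reliability of this kind is commonly quantified as ICC(2,1) (two-way random effects, absolute agreement, single rater); the sketch below implements the textbook formula on simulated ratings, and verified implementations such as pingouin.intraclass_corr in Python or the irr package in R would normally be preferred.

```python
# Sketch of ICC(2,1) on a simulated essays x raters matrix.
import numpy as np

def icc_2_1(ratings: np.ndarray) -> float:
    n, k = ratings.shape
    grand = ratings.mean()
    ss_rows = k * ((ratings.mean(axis=1) - grand) ** 2).sum()   # essays
    ss_cols = n * ((ratings.mean(axis=0) - grand) ** 2).sum()   # raters
    ss_error = ((ratings - grand) ** 2).sum() - ss_rows - ss_cols
    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_error = ss_error / ((n - 1) * (k - 1))
    return (ms_rows - ms_error) / (
        ms_rows + (k - 1) * ms_error + k * (ms_cols - ms_error) / n
    )

rng = np.random.default_rng(1)
quality = rng.normal(4, 1, size=(50, 1))               # 50 essays
ratings = quality + rng.normal(0, 0.3, size=(50, 3))   # 3 raters
print(f"ICC(2,1) = {icc_2_1(ratings):.2f}")
```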
Learner perception questionnaire
Learners' perceptions of AI feedback were captured through a 22-item questionnaire based on the TAM and extended to include fairness. The instrument contained three domains: fairness perception (8 items), satisfaction (6 items), and moderating factors such as comprehensibility and transparency (8 items). Validation by five experts yielded a CVI of 0.92, and pilot testing with 60 learners produced an overall reliability of α = 0.90. The questionnaire structure and psychometric indices are provided in Table 4.
Questionnaires in the main study were administered immediately after essay submission, with minimum completion-time requirements to reduce careless responding. Of the 764 questionnaires issued, 756 were valid after quality checks, yielding an effective response rate of 98.95%.
Data collection and quality control
Data were collected over 8 weeks (March-April 2024) in four stages: recruitment and consent; essay writing; dual scoring and questionnaire distribution; and database compilation. Proficiency certificates were cross-checked against pre-class writing performance in a dual screening process that eliminated 16 participants. Real-time monitoring eliminated four potential cases of cheating, and three anomalous AI scores (deviating by at least 8 points) were corrected after manual review. Eight invalid questionnaires were removed based on reverse-item consistency checks.
Data storage and ethics
All the data were anonymized and stored using unique identifiers that consisted of the native language, proficiency level, and serial number. Texts, scores, and questionnaires were encrypted and stored on ISO27001-compliant servers with restricted access. Data will be retained for 3 years before permanent deletion. Ethical approval was obtained from the institutional review board, and written informed consent was collected from all participants.
This section presents the research results across five analytical dimensions: experimental design, participant characteristics, scoring accuracy, fairness assessment, and learner-perception modeling. The outcomes include statistical performance, group differences, fairness disparities, and SEM-based mediation and moderation.
Experimental setup
The key software steps involved configuring ETS Criterion through its API to score essays automatically, training human raters, performing data analysis in the referenced statistical software with default options, and conducting structural equation modeling in R 4.3.1 with standard SEM packages. The materials, software platforms, and analytical tools used in the AWE fairness research are listed in the Table of Materials.
Sample selection and demographic characteristics
A total of 764 valid participants were recruited using stratified random sampling across English-speaking regions. Control variables were analyzed to ensure representativeness. The majority were aged 18-22 years (n = 426, 55.76%), followed by 23-28 years (n = 258, 33.77%) and ≥29 years (n = 80, 10.47%). Gender distribution was balanced (female: 52.62%, male: 47.38%). Writing experience varied, with 25.65% reporting fewer than 10 essays annually, 42.93% writing 10-30 essays, and 31.41% writing more than 30. Native language and proficiency levels were evenly distributed within each subgroup. Informed consent was obtained from all participants in accordance with the Declaration of Helsinki.
Accuracy analysis
General and aggregated scoring consistency
All Pearson correlation coefficients (r) between ETS Criterion and human expert ratings were significant at p < 0.001. The total-score correlation was 0.82, with dimension-level correlations ranging from 0.73 to 0.89. Consistency was highest for native speakers (r = 0.89) and lowest for Chinese native speakers (r = 0.72). Across proficiency levels, C1 learners showed higher consistency (r = 0.86) than A2 learners (r = 0.68). Among dimensions, grammar accuracy showed the highest alignment and content relevance the lowest. Table 5 summarizes these results.
Differences in RMSE between groups
RMSE values, visualized as a box plot, showed clear group differences. The lowest median RMSE was observed for English native speakers (1.02) and the highest (2.15) for Chinese native speakers, who also showed greater variability. The Spanish (1.68) and Arabic (1.75) groups fell in between. Group sizes were: English, n = 312; Chinese, n = 218; Spanish, n = 146; Arabic, n = 88. This indicates that Chinese learners experienced not only larger deviations from expert scores but also more pronounced individual outliers, as illustrated in Figure 2. The statistical accuracy and fairness measures, with confidence intervals and effect sizes, are presented in Table 6.
ANOVA test
The ANOVA tested RMSE differences across the language groups. The analysis showed a statistically significant group effect, indicating that the RMSE deviations across learner groups are not attributable to chance. The Chinese group exhibited markedly higher error rates, whereas English speakers had the lowest. The effect size (η² = 0.19) is large, confirming a substantial group-level difference in AI scoring accuracy. Table 7 provides the ANOVA results for RMSE differences across native-language groups.
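A sketch of this style of analysis, assuming simulated per-participant RMSE values whose group means echo the reported pattern, follows; it computes the one-way ANOVA F statistic with SciPy and eta squared as SS_between/SS_total.

```python
# One-way ANOVA on per-participant RMSE by native language, with
# eta squared as the effect size; group data are simulated.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
groups = {
    "English": rng.normal(1.00, 0.30, 312),
    "Chinese": rng.normal(2.15, 0.50, 218),
    "Spanish": rng.normal(1.68, 0.40, 146),
    "Arabic":  rng.normal(1.75, 0.40, 88),
}
f_stat, p = stats.f_oneway(*groups.values())

values = np.concatenate(list(groups.values()))
grand_mean = values.mean()
ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups.values())
eta_sq = ss_between / ((values - grand_mean) ** 2).sum()
print(f"F = {f_stat:.2f}, p = {p:.3g}, eta^2 = {eta_sq:.2f}")
```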
Fairness analysis
Equalized odds test
Fairness evaluation using Equalized Odds revealed systematic bias. For example, English native speakers achieved a TPR of 0.88 and FPR of 0.12, whereas Chinese native speakers had a TPR of 0.76 and FPR of 0.18, yielding ΔEO = 0.12 (p < 0.05). Cross-analysis further showed that A2-level Chinese learners had ΔEO = 0.15, significantly higher than the C1-level group (ΔEO = 0.06, ns). Results of the Equalized Odds test across groups are presented in Table 8.
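The article does not spell out its exact ΔEO computation, so the sketch below assumes one common formulation: binarize pass/fail decisions, compute group-wise TPR and FPR against the human ratings, and take the largest TPR/FPR gap relative to a baseline group. The threshold, baseline choice, and agreement rates are illustrative assumptions.

```python
# Hedged sketch of a Delta-EO check on simulated pass/fail decisions.
import numpy as np

def tpr_fpr(y_true, y_pred):
    tpr = np.mean(y_pred[y_true == 1] == 1)  # true positive rate
    fpr = np.mean(y_pred[y_true == 0] == 1)  # false positive rate
    return tpr, fpr

def delta_eo(y_true, y_pred, group, baseline="English"):
    base_tpr, base_fpr = tpr_fpr(y_true[group == baseline], y_pred[group == baseline])
    gaps = {}
    for g in np.unique(group):
        if g == baseline:
            continue
        tpr, fpr = tpr_fpr(y_true[group == g], y_pred[group == g])
        gaps[g] = max(abs(tpr - base_tpr), abs(fpr - base_fpr))
    return gaps

rng = np.random.default_rng(3)
n = 4000
group = rng.choice(["English", "Chinese"], size=n)
y_true = rng.integers(0, 2, size=n)
agree = np.where(group == "English", 0.88, 0.78)            # per-group agreement
y_pred = np.where(rng.random(n) < agree, y_true, 1 - y_true)
print(delta_eo(y_true, y_pred, group))                      # gap of roughly 0.10
```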
ROC Curve comparison
Receiver operating characteristic (ROC) analysis demonstrated higher discrimination for native speakers (AUC = 0.92, 95% CI: 0.89-0.95) than for non-native learners (AUC = 0.81, 95% CI: 0.78-0.84). Group differences were statistically significant (Z = 3.26, p < 0.01). At FPR = 0.10, TPR reached 0.85 for native speakers but only 0.70 for non-native speakers. These differences are visualized in Figure 3.
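A hedged sketch of the group-wise AUC comparison follows, using scikit-learn's roc_auc_score and a percentile bootstrap for confidence intervals; the study's significance test (Z = 3.26) resembles a DeLong-type comparison, which this simplified bootstrap does not replicate.

```python
# Group-wise AUC with a percentile-bootstrap CI on simulated scores.
import numpy as np
from sklearn.metrics import roc_auc_score

def auc_with_ci(y_true, y_score, n_boot=2000, seed=0):
    rng = np.random.default_rng(seed)
    boots = []
    for _ in range(n_boot):
        idx = rng.integers(0, len(y_true), len(y_true))
        if len(np.unique(y_true[idx])) < 2:   # AUC needs both classes present
            continue
        boots.append(roc_auc_score(y_true[idx], y_score[idx]))
    lo, hi = np.percentile(boots, [2.5, 97.5])
    return roc_auc_score(y_true, y_score), lo, hi

rng = np.random.default_rng(4)
y = rng.integers(0, 2, 300)
scores = {"native": y + rng.normal(0, 0.5, 300),       # stronger signal
          "non-native": y + rng.normal(0, 0.9, 300)}   # weaker signal
for name, s in scores.items():
    auc, lo, hi = auc_with_ci(y, s)
    print(f"{name}: AUC = {auc:.2f} (95% CI {lo:.2f}-{hi:.2f})")
```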
Learner perception model (SEM)
Mediation effect of the AFMM path
Structural equation modeling (SEM) indicated a good model fit (χ²/df = 2.31, RMSEA = 0.042, CFI = 0.95, TLI = 0.94). Accuracy significantly predicted fairness perception (β = 0.48, p < 0.001), which in turn strongly predicted learner satisfaction (β = 0.71, p < 0.001). The direct effect of accuracy on satisfaction was nonsignificant (β = 0.05, p = 0.32). Bootstrap tests confirmed full mediation by fairness perception, with an indirect effect of 0.34 (95% CI: 0.29-0.39). Path analysis results are illustrated in Figure 4.
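The full SEM was fitted in R; as a simplified illustration of the bootstrap mediation logic, the sketch below estimates the indirect effect a × b from two regressions on simulated variables whose population coefficients mirror the reported paths.

```python
# Simplified percentile-bootstrap mediation test (stand-in for the SEM):
# indirect effect a*b of accuracy -> fairness -> satisfaction.
import numpy as np

def indirect_effect(x, m, y):
    a = np.polyfit(x, m, 1)[0]                       # x -> m slope
    X = np.column_stack([np.ones_like(x), x, m])     # regress y on x and m
    b = np.linalg.lstsq(X, y, rcond=None)[0][2]      # m -> y slope, x held
    return a * b

rng = np.random.default_rng(5)
n = 756
accuracy = rng.normal(0, 1, n)
fairness = 0.48 * accuracy + rng.normal(0, 1, n)
satisfaction = 0.71 * fairness + 0.05 * accuracy + rng.normal(0, 1, n)

boots = []
for _ in range(2000):
    idx = rng.integers(0, n, n)
    boots.append(indirect_effect(accuracy[idx], fairness[idx], satisfaction[idx]))
lo, hi = np.percentile(boots, [2.5, 97.5])
print(f"indirect = {indirect_effect(accuracy, fairness, satisfaction):.2f}, "
      f"95% CI [{lo:.2f}, {hi:.2f}]")
```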
Moderating effect of language proficiency
The moderating effect of proficiency revealed stronger fairness sensitivity among lower-level learners. At the A2 level, the fairness-satisfaction path coefficient was 0.89, gradually decreasing to 0.52 at the C1 level (all p < 0.001). The interaction term was significant (β = -0.18, p < 0.05), indicating that learners with higher proficiency tolerate AI scoring deviations more readily. Detailed coefficients by proficiency group are reported in Table 9.
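This moderation test can be illustrated with an ordinary regression containing a fairness × proficiency product term, as sketched below on simulated data; the study itself used SEM multigroup analysis, so this is a conceptual stand-in rather than the reported model.

```python
# Conceptual stand-in for the SEM moderation test: OLS with a
# fairness x proficiency interaction on simulated data (statsmodels);
# slopes decline with proficiency to echo the reported pattern.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(6)
n = 756
fairness = rng.normal(0, 1, n)
proficiency = rng.integers(0, 4, n).astype(float)    # 0 = A2 ... 3 = C1
satisfaction = (0.89 - 0.12 * proficiency) * fairness + rng.normal(0, 1, n)

X = sm.add_constant(np.column_stack([fairness, proficiency,
                                     fairness * proficiency]))
fit = sm.OLS(satisfaction, X).fit()
print(fit.summary(xname=["const", "fairness", "proficiency", "interaction"]))
```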
DATA AVAILABILITY
The data that support the findings of this study have been deposited in the Zenodo repository and are publicly accessible at the following DOI: https://doi.org/10.5281/zenodo.17863904.

Figure 1: End-to-end workflow of the XAI-driven AWE evaluation framework. This figure illustrates the complete experimental pipeline used to evaluate AWE accuracy, fairness, and learner perception. It integrates dual scoring, SEM modeling, and fairness testing to ensure transparent, interpretable, and equitable writing assessment. Abbreviations: XAI = explainable AI; AWE = AI-assisted writing evaluation; SEM = Structural Equation Modeling.

Figure 2: Box plot of RMSE across language groups. This figure shows the RMSE distribution across the four language groups. Chinese learners have the highest median (approximately 2.2) and the greatest variance, with a few high-value outliers. English native speakers have the lowest RMSE values (approximately 1.0), while the Spanish and Arabic groups fall in between. Group differences in RMSE were statistically significant (p < 0.001).

Figure 3: ROC curves comparing native and non-native groups. This figure presents a comparison of ROC curves for native and non-native learners. Native learners exhibit better discrimination performance, with an AUC of 0.92, compared to non-native learners, who have an AUC of 0.81, indicating weaker classification performance. The difference in AUC values was significant (p < 0.05).

Figure 4: Standardized path diagram of the AFMM. This figure illustrates the AI Fairness Mediation Model (AFMM), in which scoring accuracy significantly predicts fairness perception (β = 0.48, p < 0.001) and fairness perception significantly predicts learner satisfaction (β = 0.71, p < 0.001). Accuracy relates to satisfaction only through a non-significant direct path (β = 0.05). Control variables show small or non-significant effects: gender (β = 0.02), age (β = 0.03), and writing experience (β = 0.12).
| Assessment level | Core essence | Assessment dimensions | Theoretical basis | Key reference indicators |
| Technical accuracy | Degree of consistency between AI scores and objective criteria (human expert scores), reflecting the technical effectiveness of the system | 1. Total-score consistency (AI vs. human total scores); 2. Dimension consistency (content, organization, language, etc.) | Classical measurement-error theory | RMSE, Pearson correlation coefficient, rater reliability |
| Fairness assessment | Unbiased AI scoring across groups and individuals, avoiding systematic bias against specific groups (e.g., non-native speakers) | 1. Group fairness (native-language/proficiency group differences); 2. Individual fairness (inter-task stability) | Algorithmic fairness theory | ΔEO (Equalized Odds difference), group RMSE dispersion |
| Learner perception | Learners' subjective cognition, attitudes, and acceptance of AI evaluation, which determine the system's practical value | 1. Fairness perception; 2. Perceived usefulness; 3. Perceived usability | Extended Technology Acceptance Model (TAM) | Likert scale scores (fairness, satisfaction items) |
Table 1: Composition of the three-level evaluation framework. This table outlines the three dimensions of the TLEF-technical accuracy, fairness, and learner perception-along with their theoretical foundations and evaluation indicators.
| Variable class | Name | Indicators/measures | Path relationship |
| Exogenous latent variable | AI accuracy | RMSE, Pearson r | Accuracy → fairness perception |
| Mediating latent variable | Fairness perception | Mean of 8 items (7-point Likert) | Fairness perception → satisfaction |
| Outcome latent variable | Learner satisfaction | Mean of 6 items (7-point Likert) | — |
| Moderating variable | Language proficiency | A2, B1, B2, C1 | Proficiency × fairness perception → satisfaction |
| Control variables | Age, gender, writing experience | Included as model covariates | — |
Table 2: Definition and path of AFMM variables. This table presents the core constructs of the AI Fairness Mediation Model, specifying independent, mediating, moderating, and dependent variables with their hypothesized relationships.
| Variable class | Variable name | Core essence | Operational indicators | Measurement tools/methods | Data type | Range/categories |
| Independent variable | Accuracy of AI scoring | Consistency between AI scores and human expert scores | (1) RMSE of total score and each dimension; (2) Pearson correlation of total score and each dimension | ETS Criterion API + human expert calibration scores | Continuous | RMSE > 0; correlation [-1, 1] |
| | Learner's language background | Learner's native-language type, used to test group bias | (1) Native group (English); (2) Non-native group (Chinese, Spanish, Arabic, other) | Demographic questionnaire | Categorical | English/Chinese/Spanish/Arabic/other |
| | Learner writing proficiency | English writing ability level; tests level differences and moderating effects | CEFR levels: A2 (Beginner), B1 (Elementary), B2 (Intermediate), C1 (Advanced) | Certificate review + pre-class test | Categorical | A2/B1/B2/C1 |
| Dependent variable | Fairness perception | Learners' subjective judgment of the unbiasedness of AI scoring; mediating variable | Mean of 8 items (individual-consistency and group-unbiasedness sub-dimensions) | Structured questionnaire (7-point Likert) | Continuous | 1 (strongly disagree) to 7 (strongly agree) |
| | Learner satisfaction | Learner's acceptance of and willingness to use AI feedback; outcome variable | Mean of 6 items (willingness-to-use and perceived-improvement sub-dimensions) | Structured questionnaire (7-point Likert) | Continuous | 1 (strongly disagree) to 7 (strongly agree) |
| Control variable | Age | Controls for age-related interference with AI acceptance | Age bands: 18-22, 23-28, ≥29 years | Demographic questionnaire | Categorical | 18-22/23-28/≥29 |
| | Gender | Controls for gender-related interference with feedback trust | Female, male | Demographic questionnaire | Categorical | Female/male |
| | Writing experience | Controls for the influence of writing experience on satisfaction | Essays written per year: <10, 10-30, >30 | Demographic questionnaire | Categorical | <10/10-30/>30 |
Table 3: Design summary of variables. This table summarizes the operationalization of all study variables (independent: AI scoring accuracy, linguistic background, proficiency; dependent: fairness perception, learner satisfaction; control: age, gender, writing experience).
| Scoring dimension | Overall sample (r) | Native group (r) | Non-native group (r) | Chinese native (r) | A2 level (r) | C1 level (r) |
| Total score | 0.82*** | 0.89*** | 0.76*** | 0.72*** | 0.68*** | 0.86*** |
| Content | 0.73*** | 0.81*** | 0.67*** | 0.62*** | 0.59*** | 0.78*** |
| Organization | 0.76*** | 0.83*** | 0.70*** | 0.65*** | 0.63*** | 0.80*** |
| Language use | 0.84*** | 0.88*** | 0.80*** | 0.77*** | 0.72*** | 0.85*** |
| Grammar accuracy | 0.89*** | 0.92*** | 0.87*** | 0.85*** | 0.81*** | 0.90*** |
| Vocabulary diversity | 0.85*** | 0.89*** | 0.82*** | 0.79*** | 0.75*** | 0.88*** |
Table 5: Correlation matrix between AI scores and human expert scores. This table provides Pearson correlation coefficients for total and dimension-specific scores, grouped by language background and proficiency level. ***p < 0.001.
| Metric | Group/overall | Value | 95% confidence interval (CI) | Effect size | Interpretation |
| Correlation (r) | Overall | 0.82 | 0.79-0.85 | Large | Strong alignment between AI and human ratings |
| Correlation (r) | Native | 0.89 | 0.86-0.92 | Large | Very high consistency |
| Correlation (r) | Chinese | 0.72 | 0.68-0.76 | Medium | Lower agreement with greater variability |
| RMSE | Native | 1.02 | 0.95-1.09 | Medium | Small scoring deviation |
| RMSE | Chinese | 2.15 | 2.03-2.28 | Large | Largest deviation and most variability |
| ΔEO | Overall | 0.12 | 0.08-0.16 | Medium | Clear fairness disparity |
| ΔEO | A2 proficiency | 0.15 | 0.11-0.20 | Large | Strong bias at low proficiency |
| AUC | Native | 0.92 | 0.89-0.95 | Large | Strong discrimination |
| AUC | Non-native | 0.81 | 0.78-0.85 | Medium | Reduced discrimination ability |
Table 6: Statistical accuracy and fairness measures. This table summarizes the major performance measures used to assess the accuracy, fairness, and discrimination capacity of the AWE system. Confidence intervals and effect sizes provide statistical transparency and facilitate comparison across groups.
| Source of Variation | SS | df | MS | F | p-value | η² (Effect Size) |
| Between Groups | 82.47 | 3 | 27.49 | 36.82 | < 0.001 | 0.19 (Large) |
| Within Groups | 548.32 | 760 | 0.72 | — | — | — |
| Total | 630.79 | 763 | — | — | — | — |
Table 7: ANOVA results for RMSE differences across native-language groups. The ANOVA tests whether scoring errors differ significantly among English, Chinese, Spanish, and Arabic learners.
| Native Language Group | TPR | FPR | ΔEO | Interpretation of Fairness Bias |
| English (n = 312) | 0.88 | 0.12 | – | Baseline group (highest fairness performance) |
| Chinese (n = 218) | 0.76 | 0.18 | 0.12* | Significant fairness bias; higher FPR and lower TPR compared to English |
| Spanish (n = 146) | 0.82 | 0.16 | 0.06 | Mild fairness deviation, not statistically significant |
| Arabic (n = 88) | 0.80 | 0.17 | 0.07 | Mild-to-moderate fairness deviation |
Table 8: Results of the Equalized Odds test by native language group. This table summarizes fairness test outcomes, including TPR, FPR, and ΔEO values across different native language groups, with thresholds indicating significant fairness bias.
| CEFR level | n | Path coefficient (fairness → satisfaction) | SE | p-value | Interaction effect (β) | Interpretation |
| A2 | 112 | 0.89 | 0.06 | <0.001 | -0.18 | Strongest fairness sensitivity |
| B1 | 196 | 0.78 | 0.05 | <0.001 | (reference group: A2) | High sensitivity |
| B2 | 280 | 0.65 | 0.04 | <0.001 | -0.05 | Moderate sensitivity |
| C1 | 176 | 0.52 | 0.05 | <0.001 | -0.09 | Lower sensitivity; more tolerant of scoring deviations |
| Overall interaction term | — | -0.18 | — | 0.03 | — | Significant moderation effect |
Table 9: Moderating effect of language proficiency on the fairness perception → satisfaction path. This table presents the structural equation modeling results for the moderation analysis, including sample sizes, path coefficients, and standard errors by CEFR level, together with the interaction term.
The research examined an AWE system under a three-level approach encompassing technical accuracy, group and individual fairness, and learner perception, and found that overall validity and systematic group differences coexist. Correlations between AI and expert ratings were strong (aggregate r = 0.82), but subgroup differences were evident (native r = 0.89 vs. non-native r = 0.76; Chinese r = 0.72; Table 6). The RMSE distributions likewise indicated higher errors and variability for Chinese learners (Figure 2). These trends suggest construct underrepresentation and possibly domain shift: where interlanguage features are underweighted in training, the model captures surface-level correctness (e.g., grammar) more effectively than discourse-level qualities (e.g., content, argumentation)29.
Fairness analyses sharpen this picture. Equalized Odds revealed considerable disparities for Chinese students (ΔEO = 0.12, p < 0.05), with the largest gaps at lower proficiency levels (A2 ΔEO = 0.15; Table 6). ROC curves likewise showed weaker discrimination for non-native groups (AUC 0.81 vs. 0.92; Figure 3). Taken together, these findings support the view that maximizing overall accuracy alone does not ensure equal error rates across groups, particularly when groups are unevenly distributed or exhibit interlanguage characteristics that an English-dominant model can misdetect30.
On the perception level, SEM revealed that perceived fairness fully mediates the relationship between perceived accuracy and satisfaction (indirect effect = 0.34; Figure 4), while proficiency moderates fairness sensitivity (path coefficients declining from 0.89 at A2 to 0.52 at C1; Table 9). This is consistent with technology acceptance research: when users perceive procedures and outcomes as fair and comprehensible, satisfaction and willingness to continue rise, even if technical metrics are constant31,32,33. For lower-proficiency learners, fairness signals may function as a safety cue that compensates for uncertainty about rubric alignment and language norms.
Why the Chinese-group gap? Beyond data imbalance, typical interlanguage markers (e.g., L1-influenced collocations or discourse moves) may be scored as errors rather than as developmental features, inflating false positives at the high-score threshold and false negatives at the low-score threshold-precisely the pattern ΔEO captures34. Rubric mapping can also matter: if the AI rubric overemphasizes surface features, it may systematically underemphasize content relevance and organization where cross-cultural rhetorical preferences diverge.
Robustness and limitations are of note. This study used three prompts, one AWE system (ETS Criterion), and three large L1 groups; future work should expand genres, prompts, and languages to test generalizability and to assess measurement invariance across groups35,36. ΔEO thresholds (e.g., ≥0.10) aid decision-making but should be reported with confidence intervals and complemented by calibration and predictive-parity checks to avoid single-metric conclusions37,38. Finally, although rater ICCs were high, future studies could include cross-site rater pools and double-blind adjudication to further reduce human-side variance39,40.
Earlier approaches, such as the AWE acceptance model validated in a single context and the PCA-Random Forest AES framework validated on accuracy alone, were narrow in scope. These gaps are addressed by the proposed XAI TLEF-AFMM approach, which combines fairness diagnostics with mediation modeling of learner perceptions.
In data processing and fairness testing, several troubleshooting steps were implemented: three AI outputs deviating by 8 or more points were corrected manually, eight questionnaires with inconsistent items were discarded, and ΔEO thresholds were applied with confidence intervals. These procedures ensured data quality and sound fairness estimates, strengthening the accuracy and fairness assessment across native-language and proficiency groups and allowing the system's performance pattern to be interpreted with high reliability.
The AWE system was effective and feasible: the automated mode of scoring saved a significant amount of assessment time compared to the manual mode, the two-way AI-human workflow process was quicker at identifying mistakes, and the uniform essay prompts simplified the implementation process. These characteristics show the ability of an explainable, multi-level evaluation framework to improve the level of transparency and equity because this allows the behavior of the system to be more interpretable by the researchers and educators.
The essential steps to achieve successful replication are as follows: dual AI-human scoring through prior rater calibration; inter-rater reliability verified through recorded ICCs; fairness and satisfaction measurement through established ΔEO thresholds with confidence intervals; and use of three standardized essay prompts. These steps are clearly documented and make transparent validation feasible, allowing accuracy and fairness analyses to be replicated in other educational environments.
Conclusions
The AWE system can achieve good aggregate validity; however, certain accuracy errors and fairness differences can still be observed-in particular, among lower-proficiency, non-native learners. The perception of fairness is critical: it completely mediates the effect of perceived accuracy on satisfaction, and its effect is the strongest among A2-B1 learners. Integrating fairness as a central validation dimension, alongside accuracy and user perception, offers a more comprehensive and educationally valuable approach to evaluating AWE systems.
Practical recommendations arising from this research point in three directions. For developers, it will prove vital to extend multilingual and proficiency-stratified corpora, conduct group-aware error analysis, and adopt fairness-weighted objectives with parity constraints that manage error-rate trade-offs at chosen operating points. It is also essential to enhance feedback transparency by offering uncertainty prompts (e.g., score intervals) and explanation artifacts informed by L2 characteristics to build user confidence. For educators, fairness-audited AWE systems should be prioritized for A2-B1 cohorts, with human-review triggers set when discrepancies become excessive (e.g., RMSE > 3.0 or significant mismatches between AI scores and learner self-assessment), while rubric transparency should be scaffolded to align learners' expectations41. For researchers, future work should explore alternative fairness criteria such as within-group calibration and predictive parity, conduct prompt- and genre-sensitivity studies, and evaluate measurement invariance across L1s and proficiency levels. Overall, treating fairness and perception as first-class dimensions of explainability reframes how AWE systems are validated and deployed, shifting emphasis from aggregate accuracy alone toward transparent and equitable support for learning.
Disclosures
The author has no conflicts of interest to disclose.
Acknowledgments
None.
| Name | Description | Source | Identifier |
| Data Storage System | Encrypted, access-controlled servers for storing anonymized data | Institutional servers | STORAGE-002 |
| ETS Criterion System | AI-assisted writing evaluation system used for scoring the writing tasks | Educational Testing Service (ETS) | ETS-001 |
| Fairness and Accuracy Analysis Tools | Tools for RMSE, Equalized Odds, and statistical analysis | Custom scripts/statistical packages | TOOL-FA-001 |
| Human Expert Ratings | Independent ratings provided by three linguists with over 10 years of experience | In-house raters | HR-EXP-003 |
| Learner Perception Questionnaire | 22-item questionnaire on fairness, satisfaction, and moderating factors, rated on a 7-point Likert scale | In-house developed | QUES-008 |
| Statistical Software (R 4.3.1) | Used for data analysis, including SEM (Structural Equation Modeling) | R Foundation | R-SW-431 |
| Stratified Random Sampling Data | Data collected from 764 multilingual learners across CEFR levels A2 to C1 | Study participants | DATA-764 |
| Writing Task Prompts | Three standardized essay topics on globalization, online education, and AI ethics | Moodle-based platform | PROMPT-003 |