Abstract
LLM-as-a-Judge frameworks have emerged as scalable alternatives
to human evaluation, with recent work reporting 70-80% agreement
with human evaluators. However, overall agreement can mask systematic
attribute-level misalignment, a critical concern for educational
validity. We investigate this phenomenon in automated essay evaluation
for English language learners (ELLs), where assessment quality depends on
pedagogically appropriate weighting of criteria such as cohesion, syntax,
and conventions. Through a systematic evaluation of six LLMs on
1,100 ELL essays, we find that even when most models achieve reasonable
overall quality agreement (ρ = 0.6-0.74), they exhibit severe
attribution biases. Direct prompting of LLaMA-3.1-8B reveals 18.8%
syntax underweighting and 18.7% grammar overweighting (11.0% mean
divergence), indicating that the model systematically favors surface
correctness over linguistic complexity. To address these misalignments, we
propose knowledge distillation based on LLM reasoning traces with explicit
interpretability mechanisms: attribution weighting and consistency
regularization. Our best interpretability-optimized model (AD) outperforms
direct LLM prompting in both prediction accuracy (Spearman ρ =
0.763 vs. 0.51, a 50% relative improvement) and attribution alignment (3.2% vs.
11.0% mean divergence, a 71% relative reduction), while being more computationally
efficient. These findings establish knowledge distillation as superior to
direct prompting for educational assessment, enabling computationally
efficient, accurate, and rubric-aligned automated
feedback for English language learners at scale.
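
The relative changes quoted above can be checked from the reported figures. The sketch below is only an arithmetic illustration: the helper name `relative_change` is hypothetical and not part of the proposed method, and it assumes both percentages are computed relative to the direct-prompting baseline.

```python
def relative_change(baseline: float, new: float) -> float:
    """Return the change from baseline to new as a fraction of baseline."""
    return (new - baseline) / baseline


# Figures quoted in the abstract (direct prompting is the baseline).
rho_gain = relative_change(0.51, 0.763)       # Spearman rho
divergence_drop = relative_change(11.0, 3.2)  # mean divergence, in percent

print(f"Spearman rho change:    {rho_gain:+.1%}")
print(f"Mean divergence change: {divergence_drop:+.1%}")
```

Under that assumption, the two values round to roughly +50% and -71%, matching the figures reported above.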