Abstract
Large language models are going silent. As mathematical reasoning evolves
from explicit Chain-of-Thought to efficient "latent reasoning," models
using methods like Chain of Continuous Thought (CoConut) no longer show
their work; they "think" through continuous, inscrutable hidden states.
For education, this creates a critical transparency crisis: how can we
trust an AI tutor that solves problems without revealing its reasoning
process? When should it intervene if we cannot detect its uncertainty
or errors?
We break open this black box by applying Sparse Autoencoders (SAEs) to
interpret CoConut's internal reasoning states during step-by-step problem
solving on GSM8K. Through systematic analysis of 7,914 reasoning steps
and manual annotation of problems across 9 mathematical skills, we
find that these seemingly opaque activations contain
interpretable features that reliably detect specific reasoning patterns
(F1 = 0.542, a 135% improvement over a random baseline).
When we surgically ablate these features, performance collapses. Removing
the top feature causes a 44.6% drop in performance, with some skills
degrading by 67%, indicating that these features are causally important
mechanisms rather than spurious correlations.
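As a rough illustration of what a single-feature ablation can look like, the sketch below encodes a hidden state with a ReLU sparse autoencoder and subtracts one feature's decoder contribution from it. The function names, weight shapes, and the subtract-the-contribution strategy are assumptions made for this example; the abstract does not specify the SAE architecture or the ablation procedure actually used.

```python
import numpy as np

def sae_encode(h, W_enc, b_enc):
    """ReLU encoder: hidden states (..., d) -> feature activations (..., m)."""
    return np.maximum(0.0, h @ W_enc + b_enc)

def ablate_feature(h, idx, W_enc, b_enc, W_dec):
    """Remove feature `idx` from h by subtracting its decoder contribution.

    Hypothetical helper: W_dec has shape (m, d), one decoder direction per
    feature. The rest of h, including SAE reconstruction error, is untouched.
    """
    f = sae_encode(h, W_enc, b_enc)
    contribution = f[..., idx, None] * W_dec[idx]  # shape (..., d)
    return h - contribution

# Toy 2-dimensional example with identity weights (illustrative values only).
W_enc = np.eye(2)
b_enc = np.zeros(2)
W_dec = np.eye(2)
h = np.array([[1.0, 2.0], [-1.0, 3.0]])

h_ablated = ablate_feature(h, 0, W_enc, b_enc, W_dec)
```

In this toy setup, feature 0 is active only on the first row, so only that row changes. In practice, the ablated hidden state would be passed back into the model and task accuracy compared with the unablated run.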
Most strikingly, features detecting uncertainty and logical
contradictions are nearly perfectly correlated (r = 0.988, p < 0.0001),
suggesting that the model maintains a unified internal "reasoning quality
monitor."
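For reference, a Pearson correlation between two features can be computed as in the minimal sketch below. It assumes the coefficient is taken over paired activations of the two features, which the abstract does not state; the function name and the toy inputs are illustrative and are not data from the study.

```python
import numpy as np

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    xc = x - x.mean()  # center each series
    yc = y - y.mean()
    return float((xc @ yc) / np.sqrt((xc @ xc) * (yc @ yc)))

# Synthetic activations for two hypothetical features.
uncertainty = [0.1, 0.4, 0.35, 0.8, 0.05]
contradiction = [0.12, 0.38, 0.40, 0.75, 0.02]
r = pearson_r(uncertainty, contradiction)
```

A value of r close to 1 means the two features rise and fall together. It does not by itself establish that they implement the same mechanism.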
These findings show that latent reasoning, despite its opacity, contains
structured representations that enable skill detection and error monitoring,
providing a foundation for more transparent mathematical reasoning systems.
Keywords: Sparse Autoencoders · Latent Reasoning · Mechanistic Interpretability
· AI Tutoring · Error Detection