📖 Schoenfeld’s Anatomy of Mathematical Reasoning by Language Models
🔥🔥 This is the repo for the ThinkARM project, which provides a systematic analysis of mathematical reasoning by large language models, grounded in Schoenfeld's Episode Theory.
The repo contains:
- All responses generated by the models in our paper, together with their annotations: a corpus of 410,991 sentences across 1,500 responses, generated by 15 models solving 100 math problems.
- The human-annotated gold episode labels, covering 7,067 sentences.
- The ThinkARM code, which can be used directly to annotate episodes.
- Some of the analysis code used in our paper.
(Feel free to email Ming (Homepage, Email) for any questions or feedback.)
- Overview
- Key Findings
- Setup
- Data Organization
- Automatic Labeling
- Correctness Evaluation
- Fine-grained Analysis
- Run Code
## Overview

Large language models increasingly expose reasoning traces, yet their underlying cognitive structure and steps remain difficult to identify and analyze beyond surface-level statistics. We adopt Schoenfeld's Episode Theory as an inductive, intermediate-scale lens and introduce ThinkARM (Anatomy of Reasoning in Models), a scalable framework that explicitly abstracts reasoning traces into functional reasoning steps such as Analysis, Explore, Implement, and Verify. When applied to mathematical problem solving by diverse models, this abstraction reveals reproducible thinking dynamics and structural differences between reasoning and non-reasoning models that are not apparent from token-level views. We further present two diagnostic case studies showing that exploration functions as a critical branching step associated with correctness, and that efficiency-oriented methods selectively suppress evaluative feedback steps rather than uniformly shortening responses. Together, our results demonstrate that episode-level representations make reasoning steps explicit, enabling systematic analysis of how reasoning is structured, stabilized, and altered in modern language models.
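For intuition, here is a purely illustrative sketch of what an episode-level abstraction of a short reasoning trace looks like; the sentences and labels below are invented for illustration and are not taken from the dataset (see `data/label/` for the actual annotation format).

```python
# Hypothetical episode-labeled reasoning trace (illustration only).
trace = [
    {"sentence": "We need the number of integer solutions to x + y = 10.", "episode": "Analysis"},
    {"sentence": "Maybe casework on the parity of x could work here.",     "episode": "Explore"},
    {"sentence": "For x = 0..10 we get y = 10 - x, giving 11 pairs.",      "episode": "Implement"},
    {"sentence": "Check: x = 10 gives y = 0, which is allowed.",           "episode": "Verify"},
]

for step in trace:
    print(f"[{step['episode']:9s}] {step['sentence']}")
```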
## Key Findings

- 🔎 When reasoning traces are analyzed at the episode level, a functional progression from abstract reasoning to concrete execution, and finally to evaluative control, consistently emerges. Episodes associated with analysis and exploration use more abstract, conceptual language and decrease steadily as reasoning progresses, while execution-oriented episodes dominate the middle of the trace through sustained concrete operations. In contrast, verification-related episodes are characterized by evaluative and meta-level language and increase toward the end of the reasoning process.
- 🔎 Comparing reasoning and non-reasoning models, the difference is not merely how many tokens they generate, but how reasoning is structured. Non-reasoning models allocate most of their response trace to execution, with episode transitions largely following a feed-forward pattern toward implementation. In contrast, reasoning models distribute effort across analysis, exploration, execution, and verification, and exhibit frequent iterative Explore-Monitor/Verify loops.
- 🔎 Through our correctness-oriented case study, we find that exploration reflects uncertainty and serves as a critical branching point: correct solutions more often route exploration into monitoring or re-analysis, whereas incorrect solutions tend to continue execution or terminate prematurely after exploration (a minimal transition-counting sketch follows this list).
- 🔎 Through our efficiency-oriented case study, we find that different efficient reasoning methods selectively suppress evaluation-oriented episodes and feedback loops, leading to varying degrees of divergence from the reasoning patterns of the base model. Episode-level analysis thus reveals which episodes can be removed to gain efficiency, beyond token-level pruning.
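As a rough illustration of the episode-transition view behind these findings (not the repo's implementation; the label sequence below is made up), one can count how often each episode follows another in a labeled trace:

```python
from collections import Counter
from itertools import pairwise  # Python 3.10+

# Hypothetical episode sequence; in practice this would come from the
# sentence-category labels stored under data/label/.
episodes = ["Analysis", "Explore", "Implement", "Explore", "Monitor",
            "Implement", "Verify"]

# Count first-order transitions between consecutive episodes.
transitions = Counter(pairwise(episodes))

for (src, dst), n in transitions.most_common():
    print(f"{src:9s} -> {dst:9s}: {n}")
```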
## Setup

```bash
pip install -r requirements.txt
```

## Data Organization

The data is organized into the following directories within `data/`:
- `raw/`: Raw model outputs. Each file (e.g., `DSQwen32B.json`) contains the problems and model responses.
- `ground_truth/`: Human-annotated labels, organized by model name. Each JSON file (e.g., `QwQ32B/1.json`) corresponds to a problem and contains the reasoning steps with `human_label`.
- `label/`: Automatically generated labels, organized by model name. Each JSON file (e.g., `DSQwen32B/1.json`) contains the reasoning steps with `sentence-category` (the assigned label) and `sentence-category-reason`.
- `correct/`: Correctness evaluation results. JSON files mapping problem indices to boolean values indicating whether the answer was correct.
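A minimal sketch for reading one labeled file is shown below; the exact JSON schema is an assumption based on the description above, so adjust the field access to the actual files.

```python
import json
from pathlib import Path

# Load one automatically labeled response (model DSQwen32B, problem 1).
# Assumption: the file holds a list of step objects carrying
# "sentence-category" and "sentence-category-reason" fields.
path = Path("data/label/DSQwen32B/1.json")
steps = json.loads(path.read_text())

for step in steps:
    print(step.get("sentence-category"), "|", step.get("sentence-category-reason"))
```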
## Automatic Labeling

```bash
export OPENAI_API_KEY=...
export GOOGLE_API_KEY=...
python -m method.label --annotate_model [ANNOTATE_MODEL] --response_model [RESPONSE_MODEL]
```

## Correctness Evaluation

```bash
python -m analysis.correctness_eval --model [MODEL] --evaluator_model [EVALUATE_MODEL]
```

## Fine-grained Analysis

```bash
python -m analysis.temporal
python -m analysis.word_cloud
python -m analysis.diagnostic
python -m analysis.episode_ngram_preprocess
python -m analysis.episode_ngram_discriminate [FILE1] [FILE2]
```