📖 Schoenfeld’s Anatomy of Mathematical Reasoning by Language Models
🔥🔥 This is the repo for the ThinkARM project, which provides a systematic analysis of mathematical reasoning by large language models, grounded in Schoenfeld's Episode Theory.
The repo contains:
- All responses generated by the models in our paper, together with their annotations: a corpus of 410,991 sentences across 1,500 responses, generated by 15 models solving 100 math problems.
- The human-annotated gold episode labels, covering 7,067 sentences.
- The ThinkARM code, which can be used directly to annotate episodes.
- Some of the analysis code used in our paper.
(Feel free to email Ming (Homepage, Email) for any questions or feedback.)
- Overview
- Key Findings
- Setup
- Data Organization
- Automatic Labeling
- Correctness Evaluation
- Fine-grained Analysis
- Run Code
## Overview

Large language models increasingly expose reasoning traces, yet their underlying cognitive structure and steps remain difficult to identify and analyze beyond surface-level statistics. We adopt Schoenfeld's Episode Theory as an inductive, intermediate-scale lens and introduce ThinkARM (Anatomy of Reasoning in Models), a scalable framework that explicitly abstracts reasoning traces into functional reasoning steps such as Analysis, Explore, Implement, and Verify. When applied to mathematical problem solving by diverse models, this abstraction reveals reproducible thinking dynamics and structural differences between reasoning and non-reasoning models that are not apparent from token-level views. We further present two diagnostic case studies showing that exploration functions as a critical branching step associated with correctness, and that efficiency-oriented methods selectively suppress evaluative feedback steps rather than uniformly shortening responses. Together, our results demonstrate that episode-level representations make reasoning steps explicit, enabling systematic analysis of how reasoning is structured, stabilized, and altered in modern language models.
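For intuition, here is a purely illustrative sketch of what an episode-level abstraction of a short reasoning trace looks like; the sentences and labels below are invented for illustration and are not taken from the dataset (see `data/label/` for the actual annotation format).

```python
# Hypothetical episode-labeled reasoning trace (illustration only).
trace = [
    {"sentence": "We need the number of integer solutions to x + y = 10.", "episode": "Analysis"},
    {"sentence": "Maybe casework on the parity of x could work here.",     "episode": "Explore"},
    {"sentence": "For x = 0..10 we get y = 10 - x, giving 11 pairs.",      "episode": "Implement"},
    {"sentence": "Check: x = 10 gives y = 0, which is allowed.",           "episode": "Verify"},
]

for step in trace:
    print(f"[{step['episode']:9s}] {step['sentence']}")
```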
## Key Findings

- 🔎 When reasoning traces are analyzed at the episode level, a functional progression from abstract reasoning to concrete execution, and finally to evaluative control, consistently emerges. Episodes associated with analysis and exploration use more abstract, conceptual language and decrease steadily as reasoning progresses, while execution-oriented episodes dominate the middle of the trace through sustained concrete operations. In contrast, verification-related episodes are characterized by evaluative and meta-level language and increase toward the end of the reasoning process.
- 🔎 Comparing reasoning and non-reasoning models, the difference is not merely how many tokens they generate, but how reasoning is structured. Non-reasoning models allocate most of their response trace to execution, with episode transitions largely following a feed-forward pattern toward implementation. In contrast, reasoning models distribute effort across analysis, exploration, execution, and verification, and exhibit frequent iterative Explore-Monitor/Verify loops.
- 🔎 Through our correctness-oriented case study, we find that exploration reflects uncertainty and serves as a critical branching point: correct solutions more often route exploration into monitoring or re-analysis, whereas incorrect solutions tend to continue execution or terminate prematurely after exploration (a minimal transition-counting sketch follows this list).
- 🔎 Through our efficiency-oriented case study, we find that different efficient reasoning methods selectively suppress evaluation-oriented episodes and feedback loops, leading to varying degrees of divergence from the reasoning patterns of the base model. Episode-level analysis thus reveals which episodes can be removed to gain efficiency, beyond token-level pruning.
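As a rough illustration of the episode-transition view behind these findings (not the repo's implementation; the label sequence below is made up), one can count how often each episode follows another in a labeled trace:

```python
from collections import Counter
from itertools import pairwise  # Python 3.10+

# Hypothetical episode sequence; in practice this would come from the
# sentence-category labels stored under data/label/.
episodes = ["Analysis", "Explore", "Implement", "Explore", "Monitor",
            "Implement", "Verify"]

# Count first-order transitions between consecutive episodes.
transitions = Counter(pairwise(episodes))

for (src, dst), n in transitions.most_common():
    print(f"{src:9s} -> {dst:9s}: {n}")
```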
## Setup

```bash
pip install -r requirements.txt
```

## Data Organization

The data is organized into the following directories within `data/`:
- `raw/`: Raw model outputs. Each file (e.g., `DSQwen32B.json`) contains the problems and model responses.
- `ground_truth/`: Human-annotated labels, organized by model name. Each JSON file (e.g., `QwQ32B/1.json`) corresponds to a problem and contains the reasoning steps with `human_label`.
- `label/`: Automatically generated labels, organized by model name. Each JSON file (e.g., `DSQwen32B/1.json`) contains the reasoning steps with `sentence-category` (the assigned label) and `sentence-category-reason`.
- `correct/`: Correctness evaluation results. JSON files mapping problem indices to boolean values indicating whether the answer was correct.
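A minimal sketch for reading one labeled file is shown below; the exact JSON schema is an assumption based on the description above, so adjust the field access to the actual files.

```python
import json
from pathlib import Path

# Load one automatically labeled response (model DSQwen32B, problem 1).
# Assumption: the file holds a list of step objects carrying
# "sentence-category" and "sentence-category-reason" fields.
path = Path("data/label/DSQwen32B/1.json")
steps = json.loads(path.read_text())

for step in steps:
    print(step.get("sentence-category"), "|", step.get("sentence-category-reason"))
```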
## Automatic Labeling

```bash
export OPENAI_API_KEY=...
export GOOGLE_API_KEY=...
python -m method.label --annotate_model [ANNOTATE_MODEL] --response_model [RESPONSE_MODEL]
```

## Correctness Evaluation

```bash
python -m analysis.correctness_eval --model [MODEL] --evaluator_model [EVALUATE_MODEL]
```

## Fine-grained Analysis

```bash
python -m analysis.temporal
python -m analysis.word_cloud
python -m analysis.diagnostic
python -m analysis.episode_ngram_preprocess
python -m analysis.episode_ngram_discriminate [FILE1] [FILE2]
```