ReForm is a reflective autoformalization framework that enables LLMs to iteratively generate, validate, and self-correct formal mathematical statements (Lean4) through an integrated generation-validation loop.
- Reflective Autoformalization Paradigm: Introduces an iterative "generate → validate → refine" cycle that enables models to autonomously identify and correct semantic errors, unifying generation and verification in a single process. A minimal sketch of this loop appears after this list.
- Prospective Bounded Sequence Optimization (PBSO): A novel RL algorithm designed for heterogeneous rewards at different sequence positions, enabling stable training of models that are both accurate autoformalizers and reliable semantic validators.
- ConsistencyCheck Benchmark: 859 expert-annotated items for evaluating semantic consistency, revealing that even human experts produce errors in up to 38.5% of cases.
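To make the reflective cycle concrete, here is a minimal sketch of the loop, assuming hypothetical stubs (generate_formal, check_consistency, refine) in place of the model's actual generation, semantic-validation, and revision calls; it is illustrative only, not the ReForm implementation.

```python
# Minimal sketch of the reflective "generate -> validate -> refine" loop.
# generate_formal, check_consistency, and refine are hypothetical stubs standing
# in for the model's generation, semantic validation, and refinement steps.

def generate_formal(informal: str) -> str:
    # Stub: a model would return a candidate Lean4 statement here.
    return "theorem placeholder : True := by trivial"

def check_consistency(informal: str, formal: str) -> tuple[bool, str]:
    # Stub: a model would return a consistency verdict and a critique here.
    return True, ""

def refine(informal: str, formal: str, critique: str) -> str:
    # Stub: a model would revise the statement conditioned on its own critique.
    return formal

def reflective_autoformalize(informal: str, max_rounds: int = 3) -> str:
    formal = generate_formal(informal)
    for _ in range(max_rounds):
        consistent, critique = check_consistency(informal, formal)
        if consistent:
            break
        formal = refine(informal, formal, critique)
    return formal
```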
Figure 1. Performance comparison of ReForm against state-of-the-art models.
Figure 2. RL dynamics in our PBSO process.
- [2025-10-31] We release the ReForm-32B model on Hugging Face, which is more powerful than ReForm-8B.
- [2025-10-29] We release the ReForm paper, models, and the ConsistencyCheck benchmark!
- Paper available on arXiv
- Models: ReForm-8B on Hugging Face
- ConsistencyCheck benchmark for semantic consistency evaluation
Please download the following models from Hugging Face:
- ReForm-8B
- ReForm-32B
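The weights can also be pulled programmatically. Below is a minimal loading sketch with the transformers library; the repository ID is an assumption, so substitute the exact IDs from the Hugging Face links above.

```python
# Minimal loading sketch; the repo ID below is assumed, use the exact Hugging Face ID.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "GuoxinChen/ReForm-8B"  # assumed repository ID

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")
```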
We provide our preprocessed test sets (miniF2F, ProofNet, Putnam, and AIME 2025) for your convenience:

./data
└── test
    ├── aime2025.jsonl
    ├── minif2f_testset.jsonl
    ├── proofnet_testset.jsonl
    └── putnam_v4.jsonl

Run decoding with a downloaded ReForm model as follows:

python ./script/reform_decode.py \
    --model-path <PATH-to-ReForm> \
    --data-path <PATH-to-test>

ConsistencyCheck is a carefully curated dataset designed to assess how well formal mathematical statements capture the semantic intent of their natural-language counterparts. The benchmark addresses the critical challenge of semantic fidelity in mathematical formalization and serves as a key evaluation component of the ReForm methodology.
Primary Purpose: To evaluate and advance research in automated mathematical formalization, particularly focusing on semantic consistency between natural language mathematics and formal theorem proving systems.
The benchmark is constructed from two established mathematical formalization datasets:
- miniF2F: Zheng, K., Han, J. M., & Polu, S. (2021). MiniF2F: A cross-system benchmark for formal Olympiad-level mathematics. arXiv preprint arXiv:2109.00110.
- ProofNet: Azerbayev, Z., Piotrowski, B., Schoelkopf, H., Ayers, E. W., Radev, D., & Avigad, J. (2023). ProofNet: Autoformalizing and formally proving undergraduate-level mathematics. arXiv preprint arXiv:2302.12433.
- Two independent expert annotators compare each formal statement with its natural-language problem.
- Disagreements are resolved by a third senior expert.
- Each item includes a human judgment (human_check) and a textual explanation (human_reason).
- All Lean statements compile successfully, isolating semantic issues from syntactic ones.
Each example follows this JSON structure:
{
"name": "problem_identifier",
"split": "valid|test",
"goal": "Lean4 goal statement",
"header": "Lean4 imports and opening commands",
"informal_statement": "Natural language problem statement",
"formal_statement": "Formalized theorem statement",
"human_check": "true|false",
"human_reason": "Explanation for incorrect labels"
}

During annotation, we identified several problematic informal statements:
- amc12a_2011_p18: Missing specification of whether x equals zero
- amc12_2000_p11: Contains only answer choices without the actual problem statement
- exercise_1998_a3: Incomplete condition after "such that"
- exercise_1_18b: Missing specification of whether x equals zero
from datasets import load_dataset

# Load the ConsistencyCheck benchmark from the Hugging Face Hub
dataset = load_dataset("GuoxinChen/ConsistencyCheck")

# Inspect the first example in the test split
example = dataset["test"][0]
print(example["informal_statement"])
print(example["formal_statement"])
print(example["human_check"])

To evaluate a model on this benchmark:
- Generate formal statements for the natural language problems
- Compare the results against the human_check ground truth, as sketched below
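A minimal sketch of the comparison step is shown below; judge_consistency is a hypothetical placeholder for whatever model produces the consistency verdict, and only the agreement computation against human_check follows the benchmark fields directly.

```python
# Minimal evaluation sketch: compare model verdicts against the human_check labels.
# judge_consistency is a hypothetical placeholder for your model's judgment.
from datasets import load_dataset

def judge_consistency(informal: str, formal: str) -> bool:
    # Hypothetical stub: return True if the formal statement matches the informal one.
    return True

split = load_dataset("GuoxinChen/ConsistencyCheck")["test"]

correct = 0
for example in split:
    prediction = judge_consistency(example["informal_statement"], example["formal_statement"])
    label = str(example["human_check"]).strip().lower() == "true"  # labels stored as "true"/"false"
    correct += int(prediction == label)

print(f"Agreement with human_check: {correct / len(split):.3f}")
```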
We hope this benchmark will contribute to the broader mathematical formalization community by:
- Standardized Evaluation: Providing a reliable benchmark for comparing autoformalization systems
- Semantic Focus: Emphasizing semantic consistency over syntactic correctness
- Quality Assurance: Highlighting common pitfalls in mathematical formalization
- Research Advancement: Supporting development of more robust formalization methods
Related Community Projects:
If you find ReForm useful in your research, please cite our paper and star our repo:
@misc{chen2025reform,
title={ReForm: Reflective Autoformalization with Prospective Bounded Sequence Optimization},
author={Guoxin Chen and Jing Wu and Xinjie Chen and Wayne Xin Zhao and Ruihua Song and Chengxi Li and Kai Fan and Dayiheng Liu and Minpeng Liao},
year={2025},
eprint={2510.24592},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2510.24592},
}

We gratefully acknowledge:
- Dataset Foundation: miniF2F and ProofNet for their pioneering formalization datasets
- Formal Mathematics: The Lean community (Lean, Mathlib) for their world-class theorem proving infrastructure
- Training Framework: Slime for the powerful RL framework enabling our PBSO algorithm
- Inference Optimization: vLLM and SGLang for blazing-fast inference capabilities
Special thanks to all contributors and the broader open-source community.