Weikang Shi1*,
Aldrich Yu1*,
Rongyao Fang1*†,
Houxing Ren1,
Ke Wang1,
Aojun Zhou1,
Changyao Tian1,
Xinyu Fu2,
Yuxuan Hu1,
Zimu Lu1,
Linjiang Huang3,
Si Liu3,
Rui Liu2‡,
Hongsheng Li1‡
1MMLab, CUHK 2Huawei Research 3BUAA
*Equal Contribution †Project Lead ‡Corresponding Author
- [2025-11-15] Our MathCanvas-Instruct dataset (219K math problems with interleaved visual-text reasoning) is now available on Hugging Face.
- [2025-10-30] 🚀 We are excited to announce that MathCanvas-Bench is now officially supported by VLMEvalKit! This allows for easy evaluation of 220+ LMMs. For usage instructions, please refer to this PR.
- [2025-10-28] The data generation code for the Foundational Structure Generation part of MathCanvas-Edit is now available. Refer to the Data Generation section for usage instructions.
- [2025-10-23] We release the training/inference code of BAGEL-Canvas and the evaluation scripts for MathCanvas-Bench.
- [2025-10-18] Our model and datasets are now available on Hugging Face.
- [2025-10-18] Our paper is now available on arXiv.
🌟 This is the official repository for the paper "MathCanvas: Intrinsic Visual Chain-of-Thought for Multimodal Mathematical Reasoning". This repository will host the datasets, evaluation code, and models associated with our work.
MathCanvas demonstrates the first successful application of intrinsic Visual Chain-of-Thought (VCoT) for complex mathematical reasoning, outperforming previous attempts.
MathCanvas is a comprehensive framework designed to endow unified Large Multimodal Models (LMMs) with intrinsic Visual Chain-of-Thought (VCoT) capabilities for mathematics. Our approach enables models to strategically generate and reason with visual aids, mirroring how humans solve complex problems in domains like geometry and function analysis.
For detailed instructions on setting up the environment, training the BAGEL-Canvas model, and running inference, please refer to our comprehensive guide:
- 📄 Model Usage: The complete guide for model training and inference.
This section provides instructions for evaluating model performance on our MathCanvas-Bench benchmark. The evaluation process relies on an LLM-based judge (GPT-4.1) to assess the correctness of the generated answers.
To evaluate the inference results on MathCanvas-Bench, follow the steps below:
- **Configure the Evaluation Script**: Open the `evaluation/mathcanvas_evaluate_4.1.sh` script and set the `your_api_key` and `your_base_url` variables.
- **Run Evaluation**: Execute the following command, replacing `{INFERENCE_DIR}` with the path to your inference output:

  ```bash
  cd MathCanvas/evaluation
  bash mathcanvas_evaluate_4.1.sh {INFERENCE_DIR}
  ```
- **View the Results**: After the script finishes, an evaluation summary will be generated. This summary includes detailed accuracy metrics, such as:
  - Weighted scoring accuracy and complete accuracy.
  - Accuracy broken down by knowledge category.
  - Accuracy based on whether the question includes initial images.

  A minimal sketch of how such a summary can be aggregated is shown after this list.
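The judging itself is handled by the script above; purely for illustration, here is a minimal Python sketch of how such a summary could be aggregated from per-question judge outputs. The JSONL layout and the field names (`score`, `category`, `has_initial_image`) are assumptions made for this example, not the actual output format of `mathcanvas_evaluate_4.1.sh`.

```python
import json
from collections import defaultdict

def summarize(results_path: str) -> dict:
    """Aggregate per-question judge results into summary metrics.

    Assumes one JSON object per line with hypothetical fields:
      score             -- fraction of sub-answers judged correct (0.0-1.0)
      category          -- knowledge category label
      has_initial_image -- whether the question includes initial images
    """
    with open(results_path, encoding="utf-8") as f:
        records = [json.loads(line) for line in f if line.strip()]

    weighted = sum(r["score"] for r in records) / len(records)          # weighted scoring accuracy
    complete = sum(r["score"] == 1.0 for r in records) / len(records)   # complete accuracy

    by_category = defaultdict(list)
    by_image = defaultdict(list)
    for r in records:
        by_category[r["category"]].append(r["score"])
        by_image["with_image" if r["has_initial_image"] else "text_only"].append(r["score"])

    return {
        "weighted_accuracy": weighted,
        "complete_accuracy": complete,
        "by_category": {k: sum(v) / len(v) for k, v in by_category.items()},
        "by_initial_image": {k: sum(v) / len(v) for k, v in by_image.items()},
    }

if __name__ == "__main__":
    print(json.dumps(summarize("judge_results.jsonl"), indent=2))
```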
This section details the process for generating the Foundational Structure Generation subset of the MathCanvas-Edit dataset. Our synthesis pipeline for foundational geometric structures is based on the official implementation of AlphaGeometry.
Before running the generation script, you must set up the environment required by AlphaGeometry. Please refer to their official repository and follow the installation instructions.
Once the environment is configured, you can generate the data by running the provided script.
```bash
cd foundations_synthesis/
bash foundations_synthesis.sh
```

You can customize the generation process by modifying the `foundations_synthesis.sh` script. This includes parameters such as the total number of samples to generate and the length of the editing sequences.
For convenience, we have already generated 1 million editing sequences for this subset. You can directly access and download them from our dataset repository on Hugging Face:
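If you prefer to fetch the pre-generated sequences programmatically, a minimal sketch using `huggingface_hub` is shown below. The repository ID and the subdirectory pattern are placeholders for illustration; substitute the actual dataset path listed on our Hugging Face page.

```python
from huggingface_hub import snapshot_download

# Placeholder repo ID -- replace with the actual MathCanvas-Edit dataset
# repository listed on our Hugging Face page.
local_dir = snapshot_download(
    repo_id="your-org/MathCanvas-Edit",
    repo_type="dataset",
    local_dir="./mathcanvas_edit_foundations",
    # Optionally restrict the download to the foundational-structure subset,
    # assuming it lives under a dedicated subdirectory (hypothetical layout).
    allow_patterns=["foundational_structure/*"],
)
print(f"Dataset downloaded to {local_dir}")
```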
To facilitate rigorous evaluation, we introduce MathCanvas-Bench, a challenging benchmark with 3K problems that require models to produce interleaved visual-textual solutions. The models are fine-tuned on MathCanvas-Instruct, a new 219K-example dataset of interleaved visual-textual reasoning paths, teaching them when and how to leverage visual aids.
Statistical analysis of the MathCanvas-Bench dataset.
Examples from the MathCanvas-Instruct dataset, showing interleaved visual and textual reasoning steps.
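To quickly inspect the datasets, you can load them with the `datasets` library as in the sketch below. The repository IDs, split names, and field names are placeholders for illustration; refer to our Hugging Face page for the actual paths and schema.

```python
from datasets import load_dataset

# Placeholder repo IDs -- replace with the actual repositories listed on our
# Hugging Face page.
instruct = load_dataset("your-org/MathCanvas-Instruct", split="train")
bench = load_dataset("your-org/MathCanvas-Bench", split="test")

print(f"MathCanvas-Instruct examples: {len(instruct)}")  # ~219K interleaved reasoning paths
print(f"MathCanvas-Bench problems:   {len(bench)}")      # ~3K benchmark problems

# Peek at one training example; the exact field names are assumptions
# made for this illustration.
example = instruct[0]
print(example.keys())
```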
We constructed a massive 15.2M-pair pre-training corpus to teach foundational visual manipulation skills. This includes MathCanvas-Imagen (10M caption-to-diagram pairs) for mastering diagram generation and MathCanvas-Edit (5.2M step-by-step editing trajectories) for diagram editing.
The curation pipeline for the MathCanvas-Edit and MathCanvas-Imagen datasets.
Our model, BAGEL-Canvas, is trained using a two-stage framework:
- Stage I: Mastering Visual Manipulation: The model learns from the 15.2M examples in MathCanvas-Imagen and MathCanvas-Edit to create and edit mathematical diagrams.
- Stage II: Developing Strategic Reasoning: The model is then trained on MathCanvas-Instruct to strategically generate visual steps as part of a solution.
The two-stage training framework of MathCanvas.
Our code and models are currently being prepared for public release. We appreciate your patience!
- Release training and inference code for BAGEL-Canvas.
- Release evaluation scripts for MathCanvas-Bench.
- Integrate the MathCanvas-Bench evaluation into VLMEvalKit.
- Release the data generation code for Foundational Structure Generation in MathCanvas-Edit.
If you have any questions, please raise an issue or contact us at wkshi@link.cuhk.edu.hk.
If you find our work useful for your research, please consider citing our paper:
@misc{shi2025mathcanvasintrinsicvisualchainofthought,
title={MathCanvas: Intrinsic Visual Chain-of-Thought for Multimodal Mathematical Reasoning},
author={Weikang Shi and Aldrich Yu and Rongyao Fang and Houxing Ren and Ke Wang and Aojun Zhou and Changyao Tian and Xinyu Fu and Yuxuan Hu and Zimu Lu and Linjiang Huang and Si Liu and Rui Liu and Hongsheng Li},
year={2025},
eprint={2510.14958},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2510.14958},
}
@inproceedings{
wang2025mathcodervl,
title={MathCoder-{VL}: Bridging Vision and Code for Enhanced Multimodal Mathematical Reasoning},
author={Ke Wang and Junting Pan and Linda Wei and Aojun Zhou and Weikang Shi and Zimu Lu and Han Xiao and Yunqiao Yang and Houxing Ren and Mingjie Zhan and Hongsheng Li},
booktitle={The 63rd Annual Meeting of the Association for Computational Linguistics},
year={2025},
url={https://openreview.net/forum?id=nuvtX1imAb}
}