- [2026.01] Congratulations! Our work has been accepted to ICLR 2026 as a poster.
This repository provides the official implementation of the paper:
Selection, Reflection and Self-Refinement: Revisit Reasoning Tasks via a Causal Lens
https://arxiv.org/abs/2510.08222
We introduce SR², a framework that feeds estimated latent variables back into a selection mechanism, enabling the model to learn dense dependencies among latent representations. The framework comprises three key modules:
- Reflective Representation Learning: Learns latent variables that capture structured reasoning signals.
- Dependency Self-Refinement: Iteratively refines latent dependencies using the selection feedback.
- Periodic Intermediate Alignment: Aligns intermediate representations with causal structure to stabilize training.
Experimentally, SR² delivers substantial gains in reasoning accuracy. For example, on Sudoku and Maze tasks, SR² improves performance by over 10% while using 8× fewer parameters than recent strong baselines.
Note
This project is built on top of the HRM repository (sapientinc/HRM). We reuse its attention layer designs, optimizer, embeddings, and most of the hyperparameters, while replacing the model architecture and training procedure with the SR² framework.
Illustration of reasoning tasks and the selection mechanism using Sudoku as an example.
- (a) A sample $9 \times 9$ Sudoku puzzle with a subset of given clues; the goal is to fill the remaining cells so that each row, column, and $3 \times 3$ subgrid contains the digits $1$–$9$ exactly once.
- (b) A single unfilled cell $Y_{ij}$ with its row (purple), column (blue), and $3 \times 3$ block (orange) highlighted. The digits within these groups impose constraints that determine the admissible values for $Y_{ij}$.
- (c) Selection mechanism: a candidate value $Y$ is valid if and only if the validity criteria are satisfied: $S^i_{Row} = S^j_{Col} = S^b_{Block} = 1$
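The selection criterion in panel (c) can be read as three binary indicators that must all equal 1. The following is a minimal plain-Python sketch of that reading, not the repository's implementation: the names `PUZZLE`, `selection_indicators`, and `is_valid` are hypothetical, and the check is a local one that only asks whether the candidate digit already appears in the cell's row, column, or block.

```python
from typing import List, Tuple

Grid = List[List[int]]  # 9x9 grid; 0 marks an unfilled cell

# Illustrative puzzle (hypothetical example data, not taken from the dataset).
PUZZLE: Grid = [
    [5, 3, 0, 0, 7, 0, 0, 0, 0],
    [6, 0, 0, 1, 9, 5, 0, 0, 0],
    [0, 9, 8, 0, 0, 0, 0, 6, 0],
    [8, 0, 0, 0, 6, 0, 0, 0, 3],
    [4, 0, 0, 8, 0, 3, 0, 0, 1],
    [7, 0, 0, 0, 2, 0, 0, 0, 6],
    [0, 6, 0, 0, 0, 0, 2, 8, 0],
    [0, 0, 0, 4, 1, 9, 0, 0, 5],
    [0, 0, 0, 0, 8, 0, 0, 7, 9],
]


def selection_indicators(grid: Grid, i: int, j: int, y: int) -> Tuple[int, int, int]:
    """Return (S_row, S_col, S_block) for placing candidate y at cell (i, j)."""
    # Digits already present in row i, excluding the cell itself.
    row = [grid[i][c] for c in range(9) if c != j]
    # Digits already present in column j, excluding the cell itself.
    col = [grid[r][j] for r in range(9) if r != i]
    # Top-left corner of the 3x3 block that contains (i, j).
    bi, bj = 3 * (i // 3), 3 * (j // 3)
    block = [
        grid[r][c]
        for r in range(bi, bi + 3)
        for c in range(bj, bj + 3)
        if (r, c) != (i, j)
    ]
    return (int(y not in row), int(y not in col), int(y not in block))


def is_valid(grid: Grid, i: int, j: int, y: int) -> bool:
    """Candidate y is admissible only if all three indicators equal 1."""
    return all(s == 1 for s in selection_indicators(grid, i, j, y))
```

For the empty cell at row 0, column 2 of this puzzle, the candidate 4 satisfies all three criteria, while 8 is rejected because it already appears in the same column and block.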
Our main experiments were conducted on:
- Sudoku / Maze: 8 × AMD MI210 (ROCm 6.2)
- ARC-1 / ARC-2: 8 × NVIDIA H200 NVL (CUDA 12.8)
Important
For reproducing ARC-1 and ARC-2 experiments, we strongly recommend:
- CUDA ≥ 12.8
- NVIDIA Hopper-class GPUs
to take full advantage of FlashAttention-3 for efficient training.
Below we assume that CUDA 12.8 is properly installed and configured.
git clone https://github.com/dengyl20/SR2.git
cd SR2
conda create -n sr2 python==3.12
conda activate sr2
pip install torch==2.7.0 torchvision==0.22.0 torchaudio==2.7.0 \
  --index-url https://download.pytorch.org/whl/cu128
Note
Adjust the PyTorch version and index URL if your CUDA setup differs. The above command assumes CUDA 12.8 wheels are available from the official PyTorch index.
pip install -r requirements.txt
These components are optional but strongly recommended for faster and more stable training.
# adam-atan2
pip install packaging ninja wheel setuptools setuptools-scm
pip install --no-cache-dir --no-build-isolation adam-atan2
# Flash Attention 3
git clone git@github.com:Dao-AILab/flash-attention.git
cd flash-attention/hopper
python setup.py install
Tip
If you are using ROCm or a non-Hopper GPU, you may need to skip FlashAttention-3 or use a backend compatible with your hardware.
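Before building FlashAttention-3, it can help to check whether the local setup matches the recommendation above (CUDA ≥ 12.8 on a Hopper-class GPU). The sketch below is an illustration under stated assumptions, not part of this repository: `parse_version`, `fa3_recommended`, and `probe` are hypothetical helper names, and Hopper is identified by CUDA compute capability major version 9. Only `probe` needs PyTorch, and it degrades to `None` when PyTorch or a GPU is unavailable.

```python
from typing import Optional, Tuple


def parse_version(text: str) -> Tuple[int, ...]:
    """Turn '12.8' into (12, 8) so versions compare numerically, not as strings."""
    return tuple(int(part) for part in text.split(".")[:2])


def fa3_recommended(cuda_version: Optional[str], capability: Tuple[int, int]) -> bool:
    """Apply the README's recommendation: CUDA >= 12.8 and a Hopper-class GPU."""
    if cuda_version is None:
        # torch.version.cuda is None on CPU-only and ROCm builds of PyTorch.
        return False
    # Assumption: Hopper GPUs report compute capability major version 9.
    return parse_version(cuda_version) >= (12, 8) and capability[0] == 9


def probe() -> Optional[bool]:
    """Query the local PyTorch install; return None if no GPU can be inspected."""
    try:
        import torch
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None
    return fa3_recommended(torch.version.cuda, torch.cuda.get_device_capability())
```

If `probe()` returns `False` or `None`, skipping the FlashAttention-3 build and relying on a backend that suits your hardware is the safer path.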
After the environment is configured, you can quickly reproduce the Sudoku and Maze results from the paper.
Note
For ARC-1 and ARC-2, please refer to arc-agi/README.md for dedicated instructions and configuration details.
We directly reuse the data processing scripts and original data from the HRM repository.
# Initialize submodules
git submodule update --init --recursive
# Sudoku-Extreme (1000 examples with augmentation)
python dataset/build_sudoku_dataset.py \
--output-dir data/sudoku-extreme-1k-aug-1000 \
--subsample-size 1000 \
--num-aug 1000
# Maze (1000 examples)
python dataset/build_maze_dataset.py
Important
Ensure all submodules have been initialized successfully. Missing raw data or scripts will cause dataset preparation to fail.
We use Weights & Biases (wandb.ai) to log training curves, evaluation metrics, and experiment configurations.
Edit config/cfg_pretrain.yaml and replace the placeholder:
wandb_key: <your_wandb_key>
with your actual W&B API key.
Tip
You can obtain your W&B key from your user settings page on https://wandb.ai. Make sure you are logged in when launching experiments.
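A forgotten placeholder is easy to miss until logging fails partway into a run. The following is a small sketch of a pre-launch check, assuming the config contains a single `wandb_key:` line as shown above; the helper name `wandb_key_is_set` is hypothetical and the check uses only a regular expression rather than a YAML parser.

```python
import re

PLACEHOLDER = "<your_wandb_key>"


def wandb_key_is_set(config_text: str) -> bool:
    """Return True if a wandb_key line exists and no longer holds the placeholder."""
    # [ \t]* (not \s*) keeps the match on a single line of the config.
    match = re.search(
        r"^[ \t]*wandb_key:[ \t]*(.*?)[ \t]*$", config_text, flags=re.MULTILINE
    )
    if match is None:
        return False
    value = match.group(1).strip("'\"")
    return value != "" and value != PLACEHOLDER


# Usage (run from the repository root):
#   from pathlib import Path
#   wandb_key_is_set(Path("config/cfg_pretrain.yaml").read_text())
```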
# Train on Sudoku-Extreme
bash pretrain_sudoku.sh
# Train on Maze-Hard
bash pretrain_maze.sh
These scripts will:
- Load the model architecture.
- Prepare the corresponding dataset splits.
- Launch training and periodic evaluation.
- Log all metrics (e.g., all.exact_accuracy) to W&B.
Note
Hyperparameters largely follow HRM settings. Only changes relevant to the SR² framework (architecture and training dynamics) are introduced.
- Training curves and evaluation metrics
  - All training metrics and evaluation metrics (all.exact_accuracy) can be found in the corresponding W&B runs.
  - Use W&B dashboards to inspect convergence, stability, and performance across tasks.
- Reporting metrics
  - For SR² and all baselines reported in the paper, we select the peak value of the evaluation curve (best all.exact_accuracy) rather than the last evaluation point.
  - This follows the evaluation protocol used in the paper to fairly compare model capacities.
- Reproducibility
  - We provide W&B reports and model checkpoints as reference.
  - Under comparable environment settings (hardware, CUDA/ROCm, and software versions), repeated runs should exhibit variation within approximately 1% absolute accuracy.
Caution
Differences in GPU architecture, CUDA/ROCm versions, or third-party library implementations (e.g., FlashAttention versions) may introduce minor deviations in the final metrics.
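The reporting rule and the reproducibility expectation above can be sketched in a few lines. This is an illustration only: `peak_accuracy`, `run_spread`, and `within_tolerance` are hypothetical helpers, the accuracy values are made-up placeholders rather than results from the paper, and the curves are assumed to have been exported from W&B as plain lists of all.exact_accuracy values.

```python
from typing import Sequence


def peak_accuracy(curve: Sequence[float]) -> float:
    """Reported metric: the best evaluation point, not the last one."""
    if not curve:
        raise ValueError("evaluation curve is empty")
    return max(curve)


def run_spread(peaks: Sequence[float]) -> float:
    """Absolute gap between the best and worst peak across repeated runs."""
    if not peaks:
        raise ValueError("no runs given")
    return max(peaks) - min(peaks)


def within_tolerance(peaks: Sequence[float], tol: float = 0.01) -> bool:
    """Check the ~1% absolute-accuracy variation expected between comparable runs."""
    return run_spread(peaks) <= tol


# Placeholder curves for three hypothetical runs (NOT real results).
example_runs = [
    [0.41, 0.66, 0.70, 0.69],
    [0.40, 0.65, 0.705, 0.70],
    [0.42, 0.698, 0.69, 0.68],
]
example_peaks = [peak_accuracy(curve) for curve in example_runs]
```

Here the first placeholder run would be reported at its peak even though its final evaluation point is lower, and `within_tolerance(example_peaks)` checks the three peaks against the 1% band.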
We gratefully acknowledge:
- The HRM project: https://github.com/sapientinc/HRM
- The ARC-Prize team and their hierarchical-reasoning-model-analysis project: https://github.com/arcprize/hierarchical-reasoning-model-analysis
Our project framework is built on top of these two excellent codebases.
If you find this repository useful in your research, please consider citing:
@inproceedings{deng2026selection,
title = {Selection, Reflection and Self-Refinement: Revisit Reasoning Tasks via a Causal Lens},
author = {Yunlong Deng and Boyang Sun and Yan Li and Zeyu Tang and Lingjing Kong and Kun Zhang and Guangyi Chen},
booktitle = {The Fourteenth International Conference on Learning Representations},
year = {2026},
url = {https://openreview.net/forum?id=0X5moS8KSm}
}