goodfire-ai/r1-interpretability
Open-source SAEs for DeepSeek-R1

[Blog] [Models] [Dataset]

We're open-sourcing two state-of-the-art SAEs trained on the 671B-parameter DeepSeek R1. These are the first public interpreter models trained on a true reasoning model, and the first trained on a model of this scale. Because R1 is so large that most independent researchers cannot run it, we're also uploading SQL databases containing hundreds of millions of tokens of activating examples for each SAE.

We're excited to see how the wider research community will use these tools to develop new techniques for understanding and aligning powerful AI systems. As reasoning models continue to grow in capability and adoption, tools like these will be essential for ensuring they remain reliable, transparent, and aligned with human intentions.

Colab

Just want to jump in? We have two Colab notebooks ready to go: try inference on precomputed activations in the inference notebook, or query our SAE latent labels & activations dataset in the database-querying notebook.

Model Information

This release contains two SAEs, one for general reasoning and one for math, both of which are available on HuggingFace. Load them with the following snippet:

from huggingface_hub import hf_hub_download

from sae import load_math_sae

# Download the math SAE checkpoint from HuggingFace.
file_path = hf_hub_download(
    repo_id="Goodfire/DeepSeek-R1-SAE-l37",
    filename="math/DeepSeek-R1-SAE-l37.pt",
    repo_type="model",
)
device = "cpu"
math_sae = load_math_sae(file_path, device)

An example of loading and inference for both SAEs is available in sae_example.ipynb.

The general reasoning SAE was trained on R1's activations over our custom reasoning dataset; the math SAE was trained on R1's activations over OpenR1-Math, a large dataset for mathematical reasoning. These datasets let us discover the features R1 uses to answer challenging problems that exercise its reasoning chops.
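Architecturally, a sparse autoencoder of this kind maps a model's residual-stream activations into a much wider, sparsely activating feature space and back. A minimal NumPy sketch of that encode/decode round trip is below; the dimensions, ReLU sparsity, and weight layout here are illustrative assumptions, not the actual configuration of the released SAEs:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative dimensions only -- not the release's actual config.
d_model, d_sae = 16, 64

# Toy encoder/decoder weights for a sparse autoencoder.
W_enc = rng.normal(size=(d_model, d_sae)) / np.sqrt(d_model)
b_enc = np.zeros(d_sae)
W_dec = rng.normal(size=(d_sae, d_model)) / np.sqrt(d_sae)
b_dec = np.zeros(d_model)

def encode(x):
    # ReLU keeps feature activations sparse and non-negative.
    return np.maximum(x @ W_enc + b_enc, 0.0)

def decode(f):
    # Reconstruct the original activation from the feature vector.
    return f @ W_dec + b_dec

x = rng.normal(size=(4, d_model))   # a batch of residual-stream activations
features = encode(x)                # wide, sparse feature activations
recon = decode(features)            # reconstruction of the input
print(features.shape, recon.shape)  # (4, 64) (4, 16)
```

Training minimizes reconstruction error plus a sparsity penalty on `features`, so each learned feature tends to fire on a narrow, interpretable pattern in the model's activations.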

Feature Database

To help researchers use these SAEs, we're publishing autointerped feature labels and feature activations on hundreds of millions of tokens. The feature labels are available as a SQL database or a CSV, while the feature activations are available as a SQL database. See db_example.ipynb for examples of interacting with the databases. To download them, use the following s3 links:

Math SAE

Autointerp labels: CSV or SQL

Feature activations & their corresponding tokens

| Sample | Tokens | Size  |
|--------|--------|-------|
| Full   | 521M   | 440GB |
| 10%    | 52.1M  | 47GB  |
| 1%     | 5.21M  | 7GB   |
| 0.1%   | 521K   | 3GB   |

Logic SAE

Autointerp labels: CSV or SQL

Feature activations & their corresponding tokens

| Sample | Tokens | Size  |
|--------|--------|-------|
| Full   | 219M   | 123GB |
| 10%    | 21.9M  | 13GB  |
| 1%     | 2.19M  | 2GB   |
| 0.1%   | 219K   | 1GB   |
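Once downloaded, the label databases can be queried with standard SQL. The snippet below shows the general pattern using Python's built-in sqlite3 module against an in-memory stand-in; the table name `feature_labels`, its columns, and the sample rows are hypothetical placeholders, so consult db_example.ipynb for the actual schema shipped with the release:

```python
import sqlite3

# Hypothetical schema and rows, for illustration only -- see
# db_example.ipynb for the real table and column names.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE feature_labels (feature_id INTEGER, label TEXT)")
conn.executemany(
    "INSERT INTO feature_labels VALUES (?, ?)",
    [(0, "algebraic manipulation"), (1, "backtracking in proofs")],
)

# Find features whose autointerp label mentions a keyword.
rows = conn.execute(
    "SELECT feature_id, label FROM feature_labels WHERE label LIKE ?",
    ("%proof%",),
).fetchall()
print(rows)  # [(1, 'backtracking in proofs')]
conn.close()
```

The same `SELECT ... WHERE label LIKE ...` pattern works against the downloaded databases once you substitute the real file path and schema.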

R1-Collect

We collected a large dataset of R1-generated tokens by sampling R1 on a variety of open-source reasoning and logic datasets.
