
This repository collects all relevant resources about interpretability in LLMs

389 26 Updated Nov 1, 2024
Jupyter Notebook 9 1 Updated Jul 12, 2024

A curated list of LLM interpretability material: tutorials, libraries, surveys, papers, blogs, etc.

288 12 Updated Dec 22, 2025

Training Sparse Autoencoders on Language Models

Python 1,129 208 Updated Dec 24, 2025

Create feature-centric and prompt-centric visualizations for sparse autoencoders (like those from Anthropic's published research).

HTML 233 46 Updated Dec 16, 2024

Stanford NLP Python library for understanding and improving PyTorch models via interventions

Python 845 94 Updated Oct 13, 2025

Investigating the generalization behavior of LM probes trained to predict truth labels: (1) from one annotator to another, and (2) from easy questions to hard ones

Python 28 5 Updated May 23, 2024

Code for my NeurIPS 2024 ATTRIB paper titled "Attribution Patching Outperforms Automated Circuit Discovery"

Jupyter Notebook 44 12 Updated May 31, 2024

Universal Neurons in GPT2 Language Models

Jupyter Notebook 31 8 Updated May 28, 2024

Representation Engineering: A Top-Down Approach to AI Transparency

Jupyter Notebook 926 118 Updated Aug 14, 2024

ModelDiff: A Framework for Comparing Learning Algorithms

Jupyter Notebook 58 5 Updated Aug 15, 2023

Contains random samples referenced in the paper "Sleeper Agents: Training Robustly Deceptive LLMs that Persist Through Safety Training".

122 16 Updated Mar 9, 2024

Mamba SSM architecture

Python 16,804 1,548 Updated Dec 23, 2025

The nnsight package enables interpreting and manipulating the internals of deep learned models.

Jupyter Notebook 744 65 Updated Dec 23, 2025

Extracting spatial and temporal world models from LLMs

Jupyter Notebook 256 27 Updated Oct 17, 2023

Sparse Autoencoder for Mechanistic Interpretability

Python 285 44 Updated Jul 20, 2024

Tools for studying developmental interpretability in neural networks.

Python 117 20 Updated Jun 25, 2025

A Python package for interactive mapping and geospatial analysis with minimal coding in a Jupyter environment

Python 3,588 443 Updated Dec 21, 2025

A curated collection of resources and research related to the geometry of representations in the brain, deep networks, and beyond

1,043 73 Updated Nov 25, 2025

Compositional Linear Algebra

Python 504 34 Updated Aug 1, 2025
Jupyter Notebook 261 47 Updated Oct 1, 2024

Package for extracting and mapping the results of every single tensor operation in a PyTorch model in one line of code.

Python 634 29 Updated Sep 26, 2025

Interpretability for sequence generation models 🐛 🔍

Python 451 38 Updated Dec 3, 2025

Yet another redundant workflow engine

Python 570 54 Updated Dec 10, 2025

Kedro is a toolbox for production-ready data science. It uses software engineering best practices to help you create data engineering and data science pipelines that are reproducible, maintainable,…

Python 10,686 983 Updated Dec 22, 2025

Full code for the sparse probing paper.

Jupyter Notebook 66 11 Updated Dec 17, 2023