This repository collects all relevant resources about interpretability in LLMs
A curated list of LLM interpretability-related material: tutorials, libraries, surveys, papers, blogs, etc.
Training Sparse Autoencoders on Language Models
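A minimal sketch of loading a pretrained SAE with the `sae_lens` package and running cached GPT-2 activations through it; the release and hook-point names follow the library's documentation but may differ across versions.

```python
from sae_lens import SAE
from transformer_lens import HookedTransformer

# Load a pretrained sparse autoencoder for a GPT-2 residual-stream hook point
# (release/id strings follow the sae_lens docs and may vary by version).
sae, cfg_dict, sparsity = SAE.from_pretrained(
    release="gpt2-small-res-jb",
    sae_id="blocks.8.hook_resid_pre",
)

model = HookedTransformer.from_pretrained("gpt2")
_, cache = model.run_with_cache("The cat sat on the mat")
acts = cache["blocks.8.hook_resid_pre"]

features = sae.encode(acts)   # sparse feature activations
recon = sae.decode(features)  # reconstructed residual stream
```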
Create feature-centric and prompt-centric visualizations for sparse autoencoders (like those from Anthropic's published research).
Stanford NLP Python library for understanding and improving PyTorch models via interventions
Investigating the generalization behavior of LM probes trained to predict truth labels: (1) from one annotator to another, and (2) from easy questions to hard ones
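The easy-to-hard setup can be pictured with a linear probe: train on activations from easy questions, test on hard ones. A hedged sketch in which every array is a synthetic stand-in for cached LM hidden states with truth labels:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic stand-ins for LM hidden states; in the real setup these come
# from forward passes over labeled easy/hard questions.
rng = np.random.default_rng(0)
X_easy, y_easy = rng.normal(size=(200, 128)), rng.integers(0, 2, size=200)
X_hard, y_hard = rng.normal(size=(100, 128)), rng.integers(0, 2, size=100)

# Train the probe on easy questions only...
probe = LogisticRegression(max_iter=1000).fit(X_easy, y_easy)

# ...and measure whether the learned truth direction transfers to hard ones.
print("easy-to-hard accuracy:", probe.score(X_hard, y_hard))
```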
Code for my NeurIPS 2024 ATTRIB paper titled "Attribution Patching Outperforms Automated Circuit Discovery"
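Attribution patching replaces one activation-patching forward pass per component with a single backward pass: the effect of patching an activation toward its clean value is approximated by a first-order Taylor expansion around the corrupt run. A toy sketch of that approximation (not the paper's code):

```python
import torch

torch.manual_seed(0)
W1, W2 = torch.randn(8, 8), torch.randn(8, 1)

def forward(x):
    a = torch.relu(x @ W1)            # the activation we might patch
    a.retain_grad()                   # keep its gradient for attribution
    return torch.tanh(a @ W2).sum(), a

x_clean = torch.randn(8, requires_grad=True)
x_corrupt = torch.randn(8, requires_grad=True)

_, a_clean = forward(x_clean)
m_corrupt, a_corrupt = forward(x_corrupt)
m_corrupt.backward()

# First-order estimate of the patching effect:
#   metric(a -> a_clean) - metric(corrupt) ~= (a_clean - a_corrupt) . dm/da
attribution = ((a_clean - a_corrupt) * a_corrupt.grad).sum()

# Exact activation patching, for comparison
m_patched = torch.tanh(a_clean @ W2).sum()
print(float(attribution), float(m_patched - m_corrupt))
```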
Universal Neurons in GPT2 Language Models
Representation Engineering: A Top-Down Approach to AI Transparency
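The core move in representation engineering is to find a direction in activation space that tracks a high-level concept, using contrastive prompt pairs. A simplified difference-of-means sketch (the paper itself uses PCA over paired differences; the arrays here are hypothetical stand-ins for layer activations):

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical layer activations for contrastive prompt pairs,
# e.g. honest vs. dishonest completions, shape (n_pairs, d_model).
h_pos = rng.normal(0.5, 1.0, size=(64, 256))
h_neg = rng.normal(-0.5, 1.0, size=(64, 256))

# Reading vector: normalized difference of means between the two conditions
v = h_pos.mean(axis=0) - h_neg.mean(axis=0)
v /= np.linalg.norm(v)

# Monitor: project a new activation onto the concept direction
h_new = rng.normal(0.4, 1.0, size=256)
print("concept score:", h_new @ v)
```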
ModelDiff: A Framework for Comparing Learning Algorithms
Contains random samples referenced in the paper "Sleeper Agents: Training Robustly Deceptive LLMs that Persist Through Safety Training".
The nnsight package enables interpreting and manipulating the internals of deep learning models.
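A short sketch of the nnsight tracing API, following the examples in its documentation; saved proxies are populated once the `trace` context exits (older versions expose them via `.value`).

```python
from nnsight import LanguageModel

model = LanguageModel("openai-community/gpt2", device_map="auto")

with model.trace("The Eiffel Tower is in the city of"):
    # Save a hidden state from block 5 of the transformer stack
    hidden = model.transformer.h[5].output[0].save()
    # Save the final logits from the language-modeling head
    logits = model.lm_head.output.save()

print(hidden.shape, logits.shape)
```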
Extracting spatial and temporal world models from LLMs
Sparse Autoencoder for Mechanistic Interpretability
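The usual recipe such SAE repos implement: an overcomplete ReLU autoencoder trained to reconstruct cached LM activations under an L1 sparsity penalty. A minimal sketch (dimensions and the penalty coefficient are illustrative, not any repo's settings):

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.enc = nn.Linear(d_model, d_hidden)
        self.dec = nn.Linear(d_hidden, d_model)

    def forward(self, x):
        f = torch.relu(self.enc(x))   # sparse feature activations
        return self.dec(f), f         # reconstruction and features

sae = SparseAutoencoder(d_model=768, d_hidden=768 * 8)
x = torch.randn(32, 768)              # stand-in for cached LM activations
x_hat, f = sae(x)
loss = ((x_hat - x) ** 2).mean() + 3e-4 * f.abs().sum(-1).mean()
loss.backward()
```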
Tools for studying developmental interpretability in neural networks.
A Python package for interactive mapping and geospatial analysis with minimal coding in a Jupyter environment
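A typical leafmap session in a notebook cell, using calls from the project's README (the basemap name is one of the built-ins):

```python
import leafmap

m = leafmap.Map(center=(40, -100), zoom=4)  # interactive map widget
m.add_basemap("OpenTopoMap")                # switch to a built-in basemap
m                                           # display in the notebook cell
```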
A curated collection of resources and research related to the geometry of representations in the brain, deep networks, and beyond
Package for extracting and mapping the results of every single tensor operation in a PyTorch model in one line of code.
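The "one line" is `log_forward_pass`, which executes the model and returns a log of every intermediate tensor; a hedged sketch on a toy module:

```python
import torch
import torch.nn as nn
import torchlens as tl

model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 4))
x = torch.randn(8, 16)

# One call runs the forward pass and records every tensor operation
model_history = tl.log_forward_pass(model, x, vis_opt="none")
print(model_history)  # human-readable summary of all logged layers
```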
Interpretability for sequence generation models 🐛 🔍
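A minimal inseq session following its documented quickstart: wrap a Hugging Face model with an attribution method, then attribute a generation.

```python
import inseq

# Wrap GPT-2 with a gradient-based attribution method
model = inseq.load_model("gpt2", "integrated_gradients")

# Attribute the model's continuation of the prompt and visualize it
out = model.attribute("The developer argued with the designer because")
out.show()
```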
Kedro is a toolbox for production-ready data science. It uses software engineering best practices to help you create data engineering and data science pipelines that are reproducible, maintainable,…
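Kedro's core abstraction is a pipeline of pure-function nodes wired together by named datasets, which the framework resolves against a data catalog at run time. A hedged sketch (the dataset names are hypothetical):

```python
import pandas as pd
from kedro.pipeline import Pipeline, node

def clean(raw: pd.DataFrame) -> pd.DataFrame:
    return raw.dropna()

def featurize(clean_df: pd.DataFrame) -> pd.DataFrame:
    return clean_df.assign(total=clean_df.sum(axis=1))

# Nodes declare their inputs/outputs by dataset name; Kedro turns the
# declarations into a reproducible, runnable DAG.
data_pipeline = Pipeline([
    node(clean, inputs="raw_data", outputs="clean_data"),
    node(featurize, inputs="clean_data", outputs="features"),
])
```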
Sparse probing paper full code.