Stars
Code for Compositionality and Ambiguity: Latent Co-occurrence and Interpretable Subspaces (Clarke et al., 2024)
The development repository for LessWrong2 and the EA Forum, based on Vulcan JS
Code for the paper "A is for Absorption: Studying Feature Splitting and Absorption in Sparse Autoencoders"
A library for mechanistic interpretability of GPT-style language models
Code for Cicero, an AI agent that plays the game of Diplomacy with open-domain natural language negotiation.
Interpreting how transformers simulate agents performing RL tasks
Sparsify transformers with SAEs and transcoders
Training sparse autoencoders on language models
Create feature-centric and prompt-centric visualizations for sparse autoencoders (like those from Anthropic's published research).
Sparse and discrete interpretability tool for neural networks
Using sparse coding to find distributed representations used by neural networks.
An experimental tool to explore GPT-3's "miraculous" ability not only to spell its own token strings (it being a "character blind" model) but also to use spelling as a means to produce novel output…
Inference-Time Intervention: Eliciting Truthful Answers from a Language Model
A tool to verify interpretability hypotheses for PyTorch modules
Full code for the sparse probing paper.
Stanford NLP Python library for understanding and improving PyTorch models via interventions
Convenience functions for working with PyTorch hooks.
Probing language models to evaluate their confidence and calibration.
📚 A curated list of papers & technical articles on AI Quality & Safety