jbloomAus

Code for Compositionality and Ambiguity: Latent Co-occurrence and Interpretable Subspaces (Clarke et al., 2024)

Jupyter Notebook 6 Updated May 11, 2025
Python 170 41 Updated May 1, 2026

Fluent dreaming for language models

Python 13 2 Updated Jul 22, 2024
Jupyter Notebook 38 14 Updated Apr 30, 2024

The development repository for LessWrong2 and the EA Forum, based on Vulcan JS

TypeScript 710 153 Updated Jun 13, 2026

Code for the paper "A is for Absorption: Studying Feature Splitting and Absorption in Sparse Autoencoders"

Python 15 1 Updated Dec 28, 2025

A library for mechanistic interpretability of GPT-style language models

Python 3,546 595 Updated Jun 11, 2026
Python 107 16 Updated May 23, 2026

Code for Cicero, an AI agent that plays the game of Diplomacy with open-domain natural language negotiation.

Python 1,426 167 Updated Apr 17, 2025
Python 60 10 Updated Apr 22, 2024

Interpreting how transformers simulate agents performing RL tasks

Jupyter Notebook 90 19 Updated Oct 23, 2023

Sparsify transformers with SAEs and transcoders

Python 727 101 Updated Jun 8, 2026

Training Sparse Autoencoders on Language Models

Python 1,416 237 Updated Jun 9, 2026
Python 14 2 Updated Jul 12, 2024
Jupyter Notebook 219 44 Updated Oct 14, 2025

Create feature-centric and prompt-centric visualizations for sparse autoencoders (like those from Anthropic's published research).

HTML 261 48 Updated Feb 27, 2026
Python 4 Updated Jan 5, 2024
Python 4 2 Updated Nov 22, 2023

Sparse and discrete interpretability tool for neural networks

Python 64 5 Updated Feb 12, 2024

Using sparse coding to find distributed representations used by neural networks.

Jupyter Notebook 305 39 Updated Nov 10, 2023

An experimental tool to explore GPT-3's "miraculous" ability not only to spell its own token strings (it being a "character blind" model) but also to use spelling as a means to produce novel output…

Python 12 1 Updated Oct 3, 2023

Inference-Time Intervention: Eliciting Truthful Answers from a Language Model

Python 583 52 Updated Jan 28, 2025

A tool to verify interpretability hypotheses for PyTorch modules

Python 4 Updated Feb 25, 2023

Sparse probing paper full code.

Jupyter Notebook 68 11 Updated Dec 17, 2023

Stanford NLP Python library for understanding and improving PyTorch models via interventions

Python 883 108 Updated Mar 6, 2026

Convenience functions for working with PyTorch hooks.

Python 8 Updated May 28, 2023

Probing language models to evaluate their confidence and calibration.

Python 6 Updated Apr 30, 2023

📚 A curated list of papers & technical articles on AI Quality & Safety

216 39 Updated Apr 14, 2025

Label neurons as interpretable vs not

Python 2 Updated Nov 6, 2023