jbloomAus

Code for Compositionality and Ambiguity: Latent Co-occurrence and Interpretable Subspaces (Clarke et al., 2024)

Jupyter Notebook 5 Updated May 11, 2025
Python 138 36 Updated Nov 18, 2025

Fluent dreaming for language models

Python 11 1 Updated Jul 22, 2024
Jupyter Notebook 36 13 Updated Apr 30, 2024

The development repository for LessWrong2 and the EA Forum, based on Vulcan JS

TypeScript 684 136 Updated Dec 20, 2025

Code for the paper "A is for Absorption: Studying Feature Splitting and Absorption in Sparse Autoencoders"

Python 11 1 Updated Nov 28, 2025

A library for mechanistic interpretability of GPT-style language models

Python 2,895 481 Updated Dec 7, 2025
Python 82 14 Updated Dec 18, 2025

Code for Cicero, an AI agent that plays the game of Diplomacy with open-domain natural language negotiation.

Python 1,404 168 Updated Apr 17, 2025
Python 60 11 Updated Apr 22, 2024

Interpreting how transformers simulate agents performing RL tasks

Jupyter Notebook 89 19 Updated Oct 23, 2023

Sparsify transformers with SAEs and transcoders

Python 673 90 Updated Dec 15, 2025

Training Sparse Autoencoders on Language Models

Python 1,121 206 Updated Dec 20, 2025
Python 12 1 Updated Jul 12, 2024
Jupyter Notebook 195 42 Updated Oct 14, 2025

Create feature-centric and prompt-centric visualizations for sparse autoencoders (like those from Anthropic's published research).

HTML 234 46 Updated Dec 16, 2024
Python 4 Updated Jan 5, 2024
Python 4 2 Updated Nov 22, 2023

Sparse and discrete interpretability tool for neural networks

Python 65 5 Updated Feb 12, 2024

Using sparse coding to find distributed representations used by neural networks.

Jupyter Notebook 289 40 Updated Nov 10, 2023

An experimental tool to explore GPT-3's "miraculous" ability not only to spell its own token strings (it being a "character blind" model) but also to use spelling as a means to produce novel output…

Python 12 1 Updated Oct 3, 2023

Inference-Time Intervention: Eliciting Truthful Answers from a Language Model

Python 562 48 Updated Jan 28, 2025

A tool to verify interpretability hypotheses for PyTorch modules

Python 4 Updated Feb 25, 2023

Full code for the sparse probing paper.

Jupyter Notebook 66 11 Updated Dec 17, 2023

Stanford NLP Python library for understanding and improving PyTorch models via interventions

Python 843 94 Updated Oct 13, 2025

Convenience functions for working with PyTorch hooks.

Python 8 Updated May 28, 2023

Probing language models to evaluate their confidence and calibration.

Python 7 Updated Apr 30, 2023

📚 A curated list of papers & technical articles on AI Quality & Safety

195 21 Updated Apr 14, 2025

Label neurons as interpretable vs not

Python 1 Updated Nov 6, 2023