Stars
Code for Compositionality and Ambiguity: Latent Co-occurrence and Interpretable Subspaces (Clarke et al., 2024)
The development repository for LessWrong2 and the EA Forum, based on Vulcan JS
Code for the paper "A is for Absorption: Studying Feature Splitting and Absorption in Sparse Autoencoders"
A library for mechanistic interpretability of GPT-style language models
Code for Cicero, an AI agent that plays the game of Diplomacy with open-domain natural language negotiation.
Interpreting how transformers simulate agents performing RL tasks
Sparsify transformers with SAEs and transcoders
Training sparse autoencoders on language models
Create feature-centric and prompt-centric visualizations for sparse autoencoders (like those from Anthropic's published research).
Sparse and discrete interpretability tool for neural networks
Using sparse coding to find distributed representations used by neural networks.
An experimental tool to explore GPT-3's "miraculous" ability not only to spell its own token strings (it being a "character blind" model) but also to use spelling as a means to produce novel output…
Inference-Time Intervention: Eliciting Truthful Answers from a Language Model
A tool to verify interpretability hypotheses for PyTorch modules
Full code for the sparse probing paper.
Stanford NLP Python library for understanding and improving PyTorch models via interventions
Convenience functions for working with PyTorch hooks.
Probing language models to evaluate their confidence and calibration.
📚 A curated list of papers & technical articles on AI Quality & Safety