STU-PID Public
Steering Tokens Using Proportional Integral Derivative controller
-
scaling-laws-for-compression Public
scaling-laws-for-compression
-
Test_Awareness_Steering Public
Forked from microsoft/Test_Awareness_Steering
Code for the paper: Linear Control of Test Awareness Reveals Differential Compliance in Reasoning Models
Jupyter Notebook MIT License Updated Oct 7, 2025 -
open-source-em-features Public
Forked from safety-research/open-source-em-features
Jupyter Notebook MIT License Updated Sep 11, 2025 -
emergent-misalignment-expts Public
Forked from emergent-misalignment/emergent-misalignment
emergent-misalignment-expts
Python MIT License Updated Aug 31, 2025 -
kl-persona Public
Finding persona features using SAEs for KL-regularized EM
Jupyter Notebook Updated Aug 24, 2025 -
mindgames-starter-kit Public
Forked from mind-games-challenge/mindgames-starter-kit
The official starter-kit for NeurIPS 2025 mind games competition
Python Updated Jul 27, 2025 -
delphi Public
Forked from EleutherAI/delphi
Delphi was the home of a temple to Phoebus Apollo, which famously had the inscription, 'Know Thyself.' This library lets language models know themselves through automated interpretability.
Python Apache License 2.0 Updated Jul 21, 2025 -
emergent-agentic-misalignment Public
Forked from anthropic-experimental/agentic-misalignment
Python MIT License Updated Jul 1, 2025 -
simpleRL-reason Public
Forked from hkust-nlp/simpleRL-reason
This is a replication of DeepSeek-R1-Zero and DeepSeek-R1 training on small models with limited data
Python MIT License Updated Jan 26, 2025 -
activation-steering Public
Forked from IBM/activation-steering
General-purpose activation steering library
Python Apache License 2.0 Updated Jan 3, 2025 -
scaling_laws_for_counting Public
scaling_laws_for_counting AKA self introspection
Jupyter Notebook Updated Dec 14, 2024 -
Situational Awareness Dataset
HTML Creative Commons Attribution 4.0 International Updated Dec 14, 2024 -
filler_tokens Public
Forked from JacobPfau/fillerTokens
Decoding filler tokens in chain-of-thought
-
HALOs Public
Forked from ContextualAI/HALOs
A library with extensible implementations of DPO, KTO, PPO, and other human-aware loss functions (HALOs).
Jupyter Notebook Apache License 2.0 Updated Sep 14, 2024 -
evalugator Public
Forked from LRudL/evalugator
(Model-written) LLM evals library
Python Updated Jul 27, 2024 -
sae Public
Forked from EleutherAI/sparsify
Sparse autoencoders
Python MIT License Updated Jul 14, 2024 -
weak-to-strong-expts Public
Forked from openai/weak-to-strong
Experiments with weak-to-strong