rokosbasilisk

9 followers · 4 following

cv Public

cv

TeX Updated Oct 25, 2025
rokosbasilisk.github.io Public

fullwrong

HTML Updated Oct 23, 2025
STU-PID Public

Steering Tokens Using Proportional Integral Derivative controller

Jupyter Notebook 1 Updated Oct 22, 2025
scaling-laws-for-compression Public

scaling-laws-for-compression

Jupyter Notebook 1 Updated Oct 19, 2025
Test_Awareness_Steering Public
Forked from microsoft/Test_Awareness_Steering

Code for the paper: Linear Control of Test Awareness Reveals Differential Compliance in Reasoning Models

Jupyter Notebook MIT License Updated Oct 7, 2025
steerit Public

activation-steering library

Jupyter Notebook Updated Sep 23, 2025
open-source-em-features Public
Forked from safety-research/open-source-em-features

Jupyter Notebook MIT License Updated Sep 11, 2025
emergent-misalignment-expts Public
Forked from emergent-misalignment/emergent-misalignment

emergent-misalignment-expts

Python MIT License Updated Aug 31, 2025
kl-persona Public

finding persona features using SAE for KL regularized EM

Jupyter Notebook Updated Aug 24, 2025
mindgames-starter-kit Public
Forked from mind-games-challenge/mindgames-starter-kit

The official starter-kit for NeurIPS 2025 mind games competition

Python Updated Jul 27, 2025
delphi Public
Forked from EleutherAI/delphi

Delphi was the home of a temple to Phoebus Apollo, which famously had the inscription, 'Know Thyself.' This library lets language models know themselves through automated interpretability.

Python Apache License 2.0 Updated Jul 21, 2025
emergent-agentic-misalignment Public
Forked from anthropic-experimental/agentic-misalignment

Python MIT License Updated Jul 1, 2025
arcnet Public

Jupyter Notebook 1 Updated Jun 4, 2025
einops_assignment Public

einops_assignment challenge

Jupyter Notebook Updated Apr 7, 2025
FASTA Public

FASTA: Full Average Scaled Tiling Attention

Jupyter Notebook Updated Apr 7, 2025
misaligned Public

misaligned-circuits

Jupyter Notebook Updated Mar 6, 2025
misalign Public

Updated Feb 26, 2025
fast_cot Public

fast_cot

Jupyter Notebook Updated Feb 26, 2025
simpleRL-reason Public
Forked from hkust-nlp/simpleRL-reason

This is a replication of DeepSeek-R1-Zero and DeepSeek-R1 training on small models with limited data

Python MIT License Updated Jan 26, 2025
activation-steering Public
Forked from IBM/activation-steering

General-purpose activation steering library

Python Apache License 2.0 Updated Jan 3, 2025
sparse_attention Public

sparse_attention

Jupyter Notebook Other Updated Jan 2, 2025
x-GPT Public
Forked from karpathy/nanoGPT

expts with gpt

Python MIT License Updated Jan 1, 2025
scaling_laws_for_counting Public

scaling_laws_for_counting AKA self introspection

Jupyter Notebook Updated Dec 14, 2024
sad Public
Forked from LRudL/sad

Situational Awareness Dataset

HTML Creative Commons Attribution 4.0 International Updated Dec 14, 2024
filler_tokens Public
Forked from JacobPfau/fillerTokens

Decoding filler tokens in chain-of-thought

Jupyter Notebook 2 Updated Dec 2, 2024
platonic-rep Public
Forked from minyoungg/platonic-rep

Jupyter Notebook Updated Oct 28, 2024
HALOs Public
Forked from ContextualAI/HALOs

A library with extensible implementations of DPO, KTO, PPO, and other human-aware loss functions (HALOs).

Jupyter Notebook Apache License 2.0 Updated Sep 14, 2024
evalugator Public
Forked from LRudL/evalugator

(Model-written) LLM evals library

Python Updated Jul 27, 2024
sae Public
Forked from EleutherAI/sparsify

Sparse autoencoders

Python MIT License Updated Jul 14, 2024
weak-to-strong-expts Public
Forked from openai/weak-to-strong

experiments with weak-to-strong

Jupyter Notebook 2 1 MIT License Updated Apr 14, 2024
