Zhejiang University
https://colored-dye.github.io
Stars
[NeurIPS 2025 Spotlight] Towards Safety Alignment of Vision-Language-Action Model via Constrained Learning.
Processed / Cleaned Data for Paper Copilot
Code for "Learning to Interpret Weight Differences in Language Models" (Goel et al., 2025)
A Unified Framework for High-Performance and Extensible LLM Steering
Persona Vectors: Monitoring and Controlling Character Traits in Language Models
Source code for the paper "Evolution Strategies at Scale: LLM Fine-Tuning Beyond Reinforcement Learning"
Stanford NLP Python library for benchmarking the utility of LLM interpretability methods
[NeurIPS'25] Towards Interpretability Without Sacrifice: Faithful Dense Layer Decomposition with Mixture of Decoders
Enhancing Automated Interpretability with Output-Centric Feature Descriptions
LaTeX files for the Deep Learning book notation
AlphaEdit: Null-Space Constrained Knowledge Editing for Language Models, ICLR 2025 (Outstanding Paper)
A technique for removing sleeper agent behavior
A collection of resources on interpretability in LLMs
Awesome papers in LLM interpretability
Beyond the Imitation Game collaborative benchmark for measuring and extrapolating the capabilities of language models
Measuring Massive Multitask Language Understanding | ICLR 2021
Stanford NLP Python library for Representation Finetuning (ReFT)