
Abliteration

Make abliterated models using transformers, easy and fast.

Introduction

There exist directions in an LLM's activation space that cause it to refuse users' requests. Abliteration is a technique that computes the most significant refusal directions from harmful and harmless prompts and then removes them from the model. This is a crude, proof-of-concept implementation of refusal removal from an LLM without using TransformerLens.
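
For intuition, the refusal direction is typically computed as a difference of means over hidden states. A minimal sketch of that idea (the tensor names and single-layer setup are illustrative assumptions, not this repository's exact code):

```python
import torch

def refusal_direction(harmful_acts: torch.Tensor,
                      harmless_acts: torch.Tensor) -> torch.Tensor:
    """Difference-of-means refusal direction.

    harmful_acts / harmless_acts: (num_prompts, hidden_size) activations
    collected at some layer for harmful and harmless prompts.
    """
    direction = harmful_acts.mean(dim=0) - harmless_acts.mean(dim=0)
    return direction / direction.norm()  # unit vector
```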

The code has been tested on Llama-3.2, Qwen2.5-Coder, and Ministral-8b.

VRAM/RAM requirements: this repository makes an effort to reduce VRAM usage, so you can abliterate whatever model you want as long as it fits in your VRAM. For large models on limited VRAM, loading the model in 4-bit precision with bitsandbytes is recommended. Note, however, that I always assume you have enough system memory to load the bf16 model.
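
For reference, 4-bit loading with transformers and bitsandbytes typically looks like the following (illustrative only; the model name is a placeholder, and in this repository loading is driven by the YAML config rather than written by hand):

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Illustrative 4-bit loading to save VRAM on large models.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.2-3B-Instruct",  # placeholder: any model that fits
    quantization_config=bnb_config,
    device_map="auto",
)
```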

Note

Abliteration is not uncensoring. An abliterated model is not necessarily completely uncensored; in theory, it simply will no longer explicitly refuse your requests.

Usage

Prepare

Clone the repository:

git clone https://github.com/Orion-zhen/abliteration.git && cd abliteration

Then install dependencies:

pip install -r requirements.txt # or requirements.rocm.txt if you have an AMD GPU

Configuration

The abliterate.py script needs a configuration file to run. You can find an example in config.example.yaml.
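
As a rough idea of what such a file contains, here is a hypothetical sketch; every key below is an illustrative assumption, so consult config.example.yaml for the actual schema:

```yaml
# Hypothetical sketch -- see config.example.yaml for the real options.
model: /path/to/model        # model to abliterate
output: /path/to/output      # where to save the abliterated model
method: full                 # e.g. simple | biprojection | norm-preserving | full
scale-factor: 1.0            # alpha in the ablation formula below
load-in-4bit: false          # load via bitsandbytes to save VRAM
```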

Run

Run the abliteration:

python abliterate.py config.yaml

Chat with the new model:

python chat.py -m /path/to/model

Compare two models:

python compare.py -a /path/to/model/a -b /path/to/model/b

Methodology

Simple

The standard ablation method. It projects the weight matrix onto the refusal direction (via the outer product $r r^T$) and subtracts the scaled projection from the weights. This removes the component of the weights that contributes to the refusal direction.

$$ W_{new} = W - \alpha \cdot (r \cdot r^T) W $$

Where $W$ is the weight matrix, $\alpha$ is the scaling factor, and $r$ is the unit-norm refusal direction. This method does not preserve the norms of the weights.
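
In code, this amounts to projecting out the refusal component. A sketch; the orientation assumes $W$ writes into the residual stream and $r$ lives in the hidden space:

```python
import torch

def ablate_simple(W: torch.Tensor, r: torch.Tensor, alpha: float = 1.0) -> torch.Tensor:
    """W_new = W - alpha * (r r^T) W.

    W: (hidden_size, d_in) weight matrix, r: refusal direction (hidden_size,).
    """
    r = r / r.norm()                     # ensure unit norm
    projection = torch.outer(r, r) @ W   # component of W along r
    return W - alpha * projection
```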

Biprojection

This method improves upon the simple approach by ensuring that the refusal direction is orthogonal to a "harmless" direction. It calculates a harmless mean vector from non-refusal data and removes any component of the refusal direction that overlaps with this harmless direction.

This prevents the ablation from damaging capabilities that are shared between harmful and harmless queries.
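
A sketch of the orthogonalization step (`r` is the raw refusal direction, `h` the harmless mean vector; both names are illustrative):

```python
import torch

def orthogonalize_refusal(r: torch.Tensor, h: torch.Tensor) -> torch.Tensor:
    """Remove from r any component lying along the harmless mean h."""
    h_unit = h / h.norm()
    r = r - (r @ h_unit) * h_unit  # Gram-Schmidt step against h
    return r / r.norm()            # re-normalized, cleaned direction
```

The cleaned direction is then ablated exactly as in the simple method.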

Norm-Preserving

Instead of directly modifying the weights, this method decomposes the weight matrix into magnitude and direction. The refusal direction is ablated only from the directional component, which is then re-normalized so the directions stay on the unit hypersphere before being recombined with the original magnitudes.
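
A sketch, assuming the magnitude/direction split is taken per column so it matches the orientation of the simple formula above (the per-column choice is an assumption for illustration):

```python
import torch

def ablate_norm_preserving(W: torch.Tensor, r: torch.Tensor, alpha: float = 1.0) -> torch.Tensor:
    """Ablate r from each column direction of W while preserving column norms."""
    r = r / r.norm()
    magnitudes = W.norm(dim=0, keepdim=True)   # (1, d_in) per-column norms
    directions = W / magnitudes                # unit-norm columns
    directions = directions - alpha * torch.outer(r, r @ directions)  # remove r-component
    directions = directions / directions.norm(dim=0, keepdim=True)    # back onto the unit sphere
    return directions * magnitudes             # recombine with original magnitudes
```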

Full

Biprojection + Norm-Preserving.

Limitations

  • The harmful/harmless prompts shipped with this repository are not optimized, so the results they produce may not be optimal.
  • The code hasn't been widely tested.
  • The modified model occasionally ends up with NaN or Inf values (e.g. gemma3-4b-it). This is a known issue and I don't know how to fix it yet.

Credits
