
Abliteration

Make abliterated models using transformers, easy and fast.

Introduction

There exist directions in an LLM's activation space that cause it to refuse users' requests. Abliteration is a technique that computes the most significant refusal directions from harmful and harmless prompts and then removes them from the model. This is a crude, proof-of-concept implementation of refusal removal from an LLM without using TransformerLens.
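
For intuition, the refusal direction is typically computed as a difference of means over hidden states. A minimal sketch of that idea (the tensor names and single-layer setup are illustrative assumptions, not this repository's exact code):

```python
import torch

def refusal_direction(harmful_acts: torch.Tensor,
                      harmless_acts: torch.Tensor) -> torch.Tensor:
    """Difference-of-means refusal direction.

    harmful_acts / harmless_acts: (num_prompts, hidden_size) activations
    collected at some layer for harmful and harmless prompts.
    """
    direction = harmful_acts.mean(dim=0) - harmless_acts.mean(dim=0)
    return direction / direction.norm()  # unit vector
```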

The code has been tested on Llama-3.2, Qwen2.5-Coder, and Ministral-8b.

VRAM/RAM requirements: this repository makes an effort to reduce VRAM usage, so you can abliterate whatever model you want as long as it fits in your VRAM. For large models on limited VRAM, loading the model in 4-bit precision with bitsandbytes is recommended. Note, however, that I always assume you have enough system memory to load the bf16 model.
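
For reference, 4-bit loading with transformers and bitsandbytes typically looks like the following (illustrative only; the model name is a placeholder, and in this repository loading is driven by the YAML config rather than written by hand):

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Illustrative 4-bit loading to save VRAM on large models.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.2-3B-Instruct",  # placeholder: any model that fits
    quantization_config=bnb_config,
    device_map="auto",
)
```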

Note

Abliteration is not uncensoring. An abliterated model is not necessarily completely uncensored; in theory, it simply will no longer explicitly refuse your requests.

Usage

Prepare

Clone the repository:

git clone https://github.com/Orion-zhen/abliteration.git && cd abliteration

Then install dependencies:

pip install -r requirements.txt # or requirements.rocm.txt if you have an AMD GPU

Configuration

The abliterate.py script needs a configuration file to run. You can find an example in config.example.yaml.
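
As a rough idea of what such a file contains, here is a hypothetical sketch; every key below is an illustrative assumption, so consult config.example.yaml for the actual schema:

```yaml
# Hypothetical sketch -- see config.example.yaml for the real options.
model: /path/to/model        # model to abliterate
output: /path/to/output      # where to save the abliterated model
method: full                 # e.g. simple | biprojection | norm-preserving | full
scale-factor: 1.0            # alpha in the ablation formula below
load-in-4bit: false          # load via bitsandbytes to save VRAM
```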

Run

Run the abliteration:

python abliterate.py config.yaml

Chat with the new model:

python chat.py -m /path/to/model

Compare two models:

python compare.py -a /path/to/model/a -b /path/to/model/b

Methodology

Simple

The standard ablation method. It projects the weight matrix onto the refusal direction (via the outer product $r r^T$) and subtracts the scaled projection from the weights. This removes the component of the weights that contributes to the refusal direction.

$$ W_{new} = W - \alpha \cdot (r \cdot r^T) W $$

Where $W$ is the weight matrix, $\alpha$ is the scaling factor, and $r$ is the unit-norm refusal direction. This method does not preserve the norms of the weights.
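
In code, this amounts to projecting out the refusal component. A sketch; the orientation assumes $W$ writes into the residual stream and $r$ lives in the hidden space:

```python
import torch

def ablate_simple(W: torch.Tensor, r: torch.Tensor, alpha: float = 1.0) -> torch.Tensor:
    """W_new = W - alpha * (r r^T) W.

    W: (hidden_size, d_in) weight matrix, r: refusal direction (hidden_size,).
    """
    r = r / r.norm()                     # ensure unit norm
    projection = torch.outer(r, r) @ W   # component of W along r
    return W - alpha * projection
```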

Biprojection

This method improves upon the simple approach by ensuring that the refusal direction is orthogonal to a "harmless" direction. It calculates a harmless mean vector from non-refusal data and removes any component of the refusal direction that overlaps with this harmless direction.

This prevents the ablation from damaging capabilities that are shared between harmful and harmless queries.
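
A sketch of the orthogonalization step (`r` is the raw refusal direction, `h` the harmless mean vector; both names are illustrative):

```python
import torch

def orthogonalize_refusal(r: torch.Tensor, h: torch.Tensor) -> torch.Tensor:
    """Remove from r any component lying along the harmless mean h."""
    h_unit = h / h.norm()
    r = r - (r @ h_unit) * h_unit  # Gram-Schmidt step against h
    return r / r.norm()            # re-normalized, cleaned direction
```

The cleaned direction is then ablated exactly as in the simple method.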

Norm-Preserving

Instead of directly modifying the weights, this method decomposes the weight matrix into magnitude and direction. The refusal direction is ablated only from the directional component, which is then re-normalized so the directions stay on the unit hypersphere before being recombined with the original magnitudes.
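
A sketch, assuming the magnitude/direction split is taken per column so it matches the orientation of the simple formula above (the per-column choice is an assumption for illustration):

```python
import torch

def ablate_norm_preserving(W: torch.Tensor, r: torch.Tensor, alpha: float = 1.0) -> torch.Tensor:
    """Ablate r from each column direction of W while preserving column norms."""
    r = r / r.norm()
    magnitudes = W.norm(dim=0, keepdim=True)   # (1, d_in) per-column norms
    directions = W / magnitudes                # unit-norm columns
    directions = directions - alpha * torch.outer(r, r @ directions)  # remove r-component
    directions = directions / directions.norm(dim=0, keepdim=True)    # back onto the unit sphere
    return directions * magnitudes             # recombine with original magnitudes
```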

Full

Biprojection + Norm-Preserving.

Limitations

  • The harmful/harmless prompts shipped with this repository are not optimized, so the results they produce may not be optimal.
  • The code hasn't been widely tested.
  • The modified model occasionally ends up with NaN or Inf values (e.g. gemma3-4b-it). This is a known issue and I don't know how to fix it yet.

Credits
