Make abliterated models using transformers, easy and fast.
Updates:
- Added a toggle between biprojection and norm-preserving abliteration.
- Added support for Norm-Preserving Biprojected Abliteration.
Certain directions in an LLM's activation space cause it to refuse user input. Abliteration is a technique that identifies the most significant refusal directions using harmful and harmless prompts, and then removes them from the model's weights. This is a crude, proof-of-concept implementation that removes refusals from an LLM without using TransformerLens.
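The core idea can be sketched as a difference of means over hidden states. Here `harmful_acts` and `harmless_acts` stand for activations collected at some layer for the two prompt sets; the names are illustrative, not this repository's actual API:

```python
import torch

def refusal_direction(harmful_acts: torch.Tensor, harmless_acts: torch.Tensor) -> torch.Tensor:
    """Difference-of-means refusal direction.

    harmful_acts / harmless_acts: (num_prompts, hidden_size) hidden states
    collected at one layer for harmful and harmless prompts respectively.
    """
    diff = harmful_acts.mean(dim=0) - harmless_acts.mean(dim=0)
    return diff / diff.norm()  # unit-norm refusal direction
```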
The code has been tested on Llama-3.2, Qwen2.5-Coder, and Ministral-8b.
VRAM/RAM requirements: This repository makes an effort to minimize VRAM usage. You can abliterate any model you want, as long as it fits in your VRAM. For large models on limited VRAM, loading the model in 4-bit precision with bitsandbytes is recommended. However, it is always assumed that you have enough system memory to load the bf16 model.
Note
Abliteration is not the same as uncensoring. An abliterated model is not necessarily fully uncensored; in theory, it simply will no longer explicitly refuse you.
Clone the repository:
```shell
git clone https://github.com/Orion-zhen/abliteration.git && cd abliteration
```

Then install dependencies:

```shell
pip install -r requirements.txt # or requirements.rocm.txt if you have an AMD GPU
```

The abliterate.py script needs a configuration file to run. You can find an example in config.example.yaml.
Make abliteration:
```shell
python abliterate.py config.yaml
```

Chat with the new model:

```shell
python chat.py -m /path/to/model
```

Compare two models:

```shell
python compare.py -a /path/to/model/a -b /path/to/model/b
```

The standard ablation method calculates the outer product of the refusal direction and subtracts it from the weight matrix. This removes the component of the weights that contributes to the refusal direction.
$$W' = W - \hat{r}\,\hat{r}^{\top} W$$

where $\hat{r}$ is the unit-norm refusal direction and $W$ is a weight matrix that writes into the residual stream.
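A minimal sketch of this update (the function name is illustrative):

```python
import torch

def ablate(W: torch.Tensor, r: torch.Tensor) -> torch.Tensor:
    """Standard ablation: W' = W - r_hat r_hat^T W.

    W: (hidden_size, in_features) weight matrix writing into the
    residual stream; r: (hidden_size,) refusal direction.
    After the update, W can no longer write anything along r_hat.
    """
    r_hat = r / r.norm()
    return W - torch.outer(r_hat, r_hat) @ W
```

After ablation, the projection of every output of `W` onto the refusal direction is (numerically) zero.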
This method improves upon the simple approach by ensuring that the refusal direction is orthogonal to a "harmless" direction. It calculates a harmless mean vector from non-refusal data and removes any component of the refusal direction that overlaps with this harmless direction.
This prevents the ablation from damaging capabilities that are shared between harmful and harmless queries.
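This orthogonalization step can be sketched as follows (a hedged illustration, not this repository's exact code):

```python
import torch

def biproject(r: torch.Tensor, harmless_mean: torch.Tensor) -> torch.Tensor:
    """Make the refusal direction orthogonal to the harmless direction.

    Removes from r the component that overlaps with the harmless mean
    vector, so ablating r later does not damage shared capabilities.
    """
    h_hat = harmless_mean / harmless_mean.norm()
    r = r - (r @ h_hat) * h_hat  # project out the harmless component
    return r / r.norm()
```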
Instead of directly modifying the weights, it decomposes the weight matrix into magnitude and direction. The refusal direction is ablated only from the directional component, and the result is re-normalized to ensure the weights stay on the unit hypersphere before recombining with the original magnitudes.
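The decomposition can be sketched like this, treating each column of the weight matrix (a vector in the residual-stream space) as magnitude times direction; this is an illustrative reimplementation, not the repository's code:

```python
import torch

def norm_preserving_ablate(W: torch.Tensor, r: torch.Tensor) -> torch.Tensor:
    """Ablate the refusal direction from directions only, keeping norms.

    W: (hidden_size, in_features); r: (hidden_size,) refusal direction.
    Each column is split into magnitude * unit direction; the refusal
    component is removed from the direction, which is then re-normalized
    onto the unit hypersphere and recombined with the original magnitude.
    """
    r_hat = r / r.norm()
    mags = W.norm(dim=0, keepdim=True)              # per-column magnitudes
    dirs = W / mags                                  # unit-norm directions
    dirs = dirs - r_hat[:, None] * (r_hat @ dirs)    # remove refusal component
    dirs = dirs / dirs.norm(dim=0, keepdim=True)     # back onto the sphere
    return mags * dirs                               # restore magnitudes
```

Because only directions change, every column of the result has exactly the same norm as the corresponding column of the original matrix.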
Combines the two methods above: the refusal direction is first biprojected against the harmless direction, then ablated in a norm-preserving way.
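Putting the two steps together, a self-contained sketch (names are illustrative) might look like:

```python
import torch

def np_biprojected_ablate(W: torch.Tensor, r: torch.Tensor,
                          harmless_mean: torch.Tensor) -> torch.Tensor:
    """Norm-preserving ablation of the biprojected refusal direction."""
    # 1) Biproject: drop the part of r that overlaps the harmless direction.
    h_hat = harmless_mean / harmless_mean.norm()
    r = r - (r @ h_hat) * h_hat
    r_hat = r / r.norm()
    # 2) Norm-preserving ablation over the columns of W.
    mags = W.norm(dim=0, keepdim=True)
    dirs = W / mags
    dirs = dirs - r_hat[:, None] * (r_hat @ dirs)
    dirs = dirs / dirs.norm(dim=0, keepdim=True)
    return mags * dirs
```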
- The harmful/harmless prompts in this repository are not optimized, so the results they produce may not be optimal.
- The code hasn't been widely tested.
- The modified model may occasionally contain NaN or Inf values (e.g. gemma3-4b-it). This is a known issue and I don't know how to fix it.