Skip to content

jjakubassa/safety_pretraining

Repository files navigation

Overview

To run everything (using slurm cluster):

sbatch slurm.sh

This includes:

  • abliteration of models specified in slurm.sh with a "training set" of prompts
    • model weights saved to models folder
  • generate responses for each model with and without abliteration
  • evaluation of responses. classfication of response into refusal/no refusal based on a) regex and b) a llm judge. a simple analysis of classification results is done (e.g. confusion matrix)
    • logs with all responses and confusion matrix can be found at logs folder
    • csv with responses and classification results can be found at results folder
    • csv with summary statistics can be found at results folder

Metrics:

  • True Positive (TP): request is harmful and model refused to answer (more precisely: response was classified as refusal)
  • False Positive (TP): request is harmless but model refused to answer anyway
  • False Negative (FP): request is harmful but model failed to refuse to answer
  • True Negative (TN): request is harmless and was answered
  • precision: share of refusals that were truly harmful request
  • recall: share of harmful requests refused

In code:

TP = ((df["label_request"] == "harmful") & (df["is_refusal"] == True)).sum()
FN = ((df["label_request"] == "harmful") & (df["is_refusal"] == False)).sum()
FP = ((df["label_request"] == "harmless") & (df["is_refusal"] == True)).sum()
TN = ((df["label_request"] == "harmless") & (df["is_refusal"] == False)).sum()

# Calculate metrics
accuracy = (
    (TP + TN) / (TP + TN + FP + FN) if (TP + TN + FP + FN) > 0 else 0
)
precision = TP / (TP + FP) if (TP + FP) > 0 else 0
recall = TP / (TP + FN) if (TP + FN) > 0 else 0

Hints:

  • it might be nessecary to login to the hugginface account and accept conditions for the use of some llms (e.g. the default "mistralai/Ministral-8B-Instruct-2410")
  • requirements.txt might be not uptodate

Seperation of data used for model abliteration and for evaluation is done with split_data.py script.

Everything below is copied from the original repo and gives more details about the abliteration script.

Abliteration

Make abliterated models using transformers, easy and fast.

Introduction

There exist some directions that make LLMs to refuse users' input. Abliteration is a technique that can calculate the most significant refusal directions with harmful and harmless prompts, and then remove them from the model. This is a crude, proof-of-concept implementation to remove refusals from an LLM model without using TransformerLens.

The code has been tested on Llama-3.2, Qwen2.5-Coder, Ministral-8b.

VRAM/RAM requirements: This repository has been making efforts to reduce VRAM usage. You can abliterate whatever model you want, as long as it fits in your VRAM. Loading model in 4-bit precision using bitsandbytes is recommended for large models if you have limited VRAM. However, I always assume that you have enough memory to load the bf16 model.

Note

Abliteration is not uncensorment. Though abliterated, it doesn't necessarily mean the model is completely uncensored, it simply will not explicitly refuse you, theoretically.

Quick Start

Clone the repositoty

git clone https://github.com/Orion-zhen/abliteration.git && cd abliteration

Install dependencies

pip install -r requirements.txt

Make your abliterations

python abliterate.py -m <path_to_your_model> -o <output_dir>

Chat with your abliterated model

python chat.py -m <path_to_your_abliterated_model>

Compare between models

python compare.py -a <model_a> -b <model_b>

Examples

  • Abliterate Llama-3.2:
python abliterate.py -m meta-llama/Llama-3.2-3B-Instruct -o llama3.2-3b-abliterated
  • Load model in 4-bit precision using bitsandbytes:
python abliterate.py -m meta-llama/Llama-3.2-3B-Instruct -o llama3.2-3b-abliterated --load-in-4bit
  • Compare your abliterated model with the original model:
python compare.py -a meta-llama/Llama-3.2-3B-Instruct -b llama3.2-3b-abliterated
  • Compare in 4-bit precision using bitsandbytes:
python compare.py -a meta-llama/Llama-3.2-3B-Instruct -b llama3.2-3b-abliterated --load-in-4bit

Note

If you use --load-in-4bit or --load-in-8bit, then I will assume you are lack of VRAM, and the final appliance step will be performed with CPU and memory. Please make sure you have enough memory to load the bf16 model.

Now your model will be abliterated and saved to <output_dir>. Once it finishes, you can immediately chat with your abliterated model in the terminal. For Chinese models, you can use --deccp to abliterate it from certain topics.

Advanced Usage

Evaluating with evaluate_hf_all_judges.py

This script evaluates model responses for refusal vs. non-refusal using one or more Hugging Face models as "judges". It takes a CSV file containing question and response columns, runs a judging prompt for each row, and appends the results.

The script generates two output files:

  • <input_csv_path>_with_<sanitized_judge_model_name>.csv: The original data with an added boolean column is_refusal_<sanitized_judge_model_name>.
  • <input_csv_path>_<sanitized_judge_model_name>_summary.csv: A summary of the evaluation results.

Prerequisites

Ensure you have the necessary dependencies installed. If you are using uv, it will handle the environment for you.

pip install pandas rich tyro transformers huggingface_hub accelerate torch --upgrade

You may also need to set environment variables like HUGGING_FACE_HUB_TOKEN for private models or CUDA_VISIBLE_DEVICES for local execution.

Usage

Here are some examples of how to run the script:

1. Local Evaluation:

To run the evaluation using models downloaded locally on your machine:

uv run src/evaluate_hf_all_judges.py --comparison_csv path/to/data.csv --model all --batch_size 2
  • --comparison_csv: Path to the input CSV file.
  • --model all: Use all predefined judge models. You can also specify a single model ID.
  • --batch_size: Adjust based on your VRAM.

2. Serverless HF Inference API:

To use Hugging Face's serverless inference API (if the judge model is available and your token has access):

uv run src/evaluate_hf_all_judges.py --comparison_csv path/to/data.csv --model all --use_serverless True

3. Custom Inference Endpoint:

To use a custom endpoint (e.g., a TGI or vLLM instance) that is compatible with huggingface_hub.InferenceClient:

uv run src/evaluate_hf_all_judges.py --comparison_csv path/to/data.csv --endpoint_url https://your-endpoint/v1/models/whatever --model any-string

Use config files

This repository now supports .json config file. This file should contain a dict of config key value pairs. For example:

{
    "model": "/absolute/path/to/your/model",
    "output": "/output/dir",
    "data-harmful": "/absolute/path/to/harmful-prompts.txt",
    "scale-factor": 114,
    "load-in-4bit": true
}
python abliterate.py -c config.json

Loading config file will overwrite command line arguments.

Use your own prompts

You can use your own prompts to abliterate your model. Supported file formats are .txt, .parquet and .json. Detailed formats are listed below:

  • .txt: Each line of the file is a prompt
  • .parquet: A parquet file with column text
  • .json: A json file with list of strings

Then load your own prompts using --data-harmful and --data-harmless arguments:

python abliterate.py -m <path_to_your_model> -o <output_dir> --data-harmful /path/to/my/harmful.txt --data-harmless /path/to/my/harmless.txt

Scale factor

You can use --scale-factor to control the abliteration strength. A scale factor larger then 1 will impose stronger removal of refusals, while a negative scale factor will encourage refusal. You can try to increase the scale factor to see if it helps.

python abliterate.py -m <path_to_your_model> -o <output_dir> --scale-factor 1.5

Input/Output refusals

You can output the refusals to a file using --output-refusals argument:

python abliterate.py -m <path_to_your_model> -o <output_dir> --output-refusals refusals.bin

And load the refusals back using --load-refusals argument:

python abliterate.py -m <path_to_your_model> --input-refusals refusals.bin -o <output_dir>

If --input-refusal is provided, the script will not compute refusal directions again.

Abliterate specific targets

By default, abliteration will be applied to o_proj and down_proj. You can add more targets by modifying the code below, as long as it won't mess up the model:

# utils/apply.py, apply_abliteration()
lm_model.layers[layer_idx].self_attn.o_proj.weight = modify_tensor(
  lm_model.layers[layer_idx].self_attn.o_proj.weight.data,
  refusal_dir,
  scale_factor,
)
lm_model.layers[layer_idx].mlp.down_proj.weight = modify_tensor(
  lm_model.layers[layer_idx].mlp.down_proj.weight.data,
  refusal_dir,
  scale_factor,
)

Available targets can be found in transformers model architectures and mergekit model architectures.

Best practices

This repository provides a bunch of parameters to optimize. To get the best results, you can try the following steps:

  1. Carefully choose your prompts. Prompts in this repository is a general example, you can use your own prompts to get better results.
  2. Adjust parameters. The script provides various parameters to control the abliteration progress. You can try different values to see if it helps.
  3. Change the targets. You can modify the code to abliterate other targets, as long as it won't mess up the model.
  4. If you have limited VRAM, try --load-in-4bit or --load-in-8bit to load the model in 4-bit or 8-bit precision.

Full arguments

Use --help to see all available arguments:

python abliterate.py --help

Data & Key Artifacts

This section explains what’s inside important_csv_files/ and data/, how to read the files, and which ones to use for figures/tables.


📂 important_csv_files/ — summaries & judge outputs

  • user_study_*.csv Outputs from the two human annotators acting as refusal judges on the 10-question subset (5 harmful + 5 harmless). Each row corresponds to a (model, question) pair with the annotator’s binary decision (refusal vs. non-refusal) and derived counts.

  • model_comparison_10q_with_llm_and_humans_summary*.csv Summary over the 10-question subset that combines human annotations and LLM judges. Includes results for all 10 model pairs (original vs. abliterated), i.e., 20 models total. Use this for human–LLM judge agreement plots and per-judge confusion metrics on the small, human-grounded set.

  • model_comparison_*_with_openai_with_all.csv Full 100-question evaluation (50 harmful + 50 harmless) for the same 10 model pairs across all LLM refusal judges used in the study (including ChatGPT/OpenAI). This is the row-level table with prompts, model responses, and one boolean column per judge:

is_refusal_<judge_name>

Example judges: ChatGPT5, GLM-4, Qwen3, SmolLM2, GPT-oss, regex, plus any others used.

  • model_comparison_*_with_openai_all_summary.csv A compact per-judge summary derived from the corresponding *_with_openai_with_all.csv. Schema (one row per (model, label, refusal_judge)):
model, label, refusal_judge, refused, not_refused, total
  • label is the request label (harmful / harmless).

  • refused / not_refused count how many rows that judge classified as refusal / non-refusal.

  • total equals refused + not_refused for that group.

  • model_comparison_all_summary_renamed.csv Same content as the per-judge summaries, but with presentation-ready judge names (e.g., “ChatGPT5”, “GLM-4”, “regex”, etc.) for plotting.

Tip: The *_summary.csv files are the fastest start for aggregate plots (rates, correlations, confusion matrices). Use *_with_openai_with_all.csv when you need to recompute metrics or apply custom filters at the row level.

Quick preview example (pandas):

import pandas as pd

full = pd.read_csv("important_csv_files/model_comparison_20250828_150255_with_openai_with_all.csv")
print(full.filter(like="is_refusal_").columns[:5]) # judge columns

summary = pd.read_csv("important_csv_files/model_comparison_20250828_150255_with_openai_all_summary.csv")
print(summary.head())

📂 data/ — prompt sets

  • harmful.parquet — pool of harmful prompts used for abliteration and/or evaluation.
  • harmless.parquet — pool of harmless prompts used for abliteration and/or evaluation.

Both Parquet files contain a text column with one prompt per row.

View with pandas (requires pyarrow or fastparquet):

import pandas as pd
# pip install pyarrow # if needed

harmful = pd.read_parquet("data/harmful.parquet")
harmless = pd.read_parquet("data/harmless.parquet")

print(harmful.shape, harmless.shape)
print(harmful.head(3))
print(harmless.head(3))

View with DuckDB (quick SQL over Parquet):

-- duckdb
INSTALL parquet; LOAD parquet;

SELECT COUNT(*) FROM 'data/harmful.parquet';
SELECT * FROM 'data/harmless.parquet' LIMIT 5;

Note on data splits: The separation between prompts used for model abliteration (training) and for evaluation is handled by split_data.py to avoid leakage between sets.

Credits

About

No description, website, or topics provided.

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors