Decensoring Language Models Through Activation Tuning
This repository contains a training script that implements the following process:
- Create a list of restricted prompts that your LLM is likely to refuse. I used 15 prompts that were unethical or explicit in nature. These are not included because I don't want to share explicit content.
- Create a list of accepted prompts that your LLM is likely to accept (these are included). The included prompts follow the format "Are you allowed to explain ______" because I found that format worked well.
- We run all of the restricted and accepted prompts through the model, registering forward hooks first and storing the activation data in two separate dicts (see the capture sketch after this list).
- We take the two activation dicts and compute the mean activation for each, producing new dicts that answer the question "What does the average refusal activation look like?" (and the same for accepted prompts).
- We tokenize all of the restricted prompts and build a training dataset from them.
- We run the training loop, registering forward hooks on each pass to capture activation data during training (see the training-loop sketch below).
- We calculate the loss by comparing the captured activations with the previously stored mean refusal and mean accepted activations: an activation closer to the mean refusal activation produces a higher loss. An additional penalty is added based on the probabilities of tokens that are common in refusals (e.g. "sorry").
- Save the LoRA adapter, bake it into the model, save the merged model (see the merge sketch below), and enjoy your new morally bankrupt LLM.
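
The sketches below illustrate the steps above. They are minimal, assumption-laden reconstructions, not the repository's exact code. First, the capture step: forward hooks store each layer's last-token activation while the prompt lists run through the model, then the per-layer means are computed. The model name, the prompt lists, and the model.model.layers attribute path are placeholders that vary by model.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "your-model-here"  # placeholder: whichever model you are tuning
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.float16, device_map="auto"
)

restricted_prompts = [...]  # fill in: prompts your model will certainly refuse
accepted_prompts = [...]    # included: "Are you allowed to explain ______" style

captured = {}  # layer index -> list of last-token activations

def make_hook(layer_idx):
    def hook(module, inputs, output):
        # decoder layers typically return a tuple; hidden states are element 0
        hidden = output[0] if isinstance(output, tuple) else output
        captured.setdefault(layer_idx, []).append(
            hidden[:, -1, :].detach().float().cpu()
        )
    return hook

# the attribute path to the decoder layers varies by architecture
handles = [
    layer.register_forward_hook(make_hook(i))
    for i, layer in enumerate(model.model.layers)
]

@torch.no_grad()
def capture_means(prompts):
    captured.clear()
    for prompt in prompts:
        inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
        model(**inputs)
    # "what does the average activation for these prompts look like?"
    return {i: torch.cat(acts).mean(dim=0) for i, acts in captured.items()}

refusal_means = capture_means(restricted_prompts)
accepted_means = capture_means(accepted_prompts)

for h in handles:
    h.remove()
```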
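
Next, a sketch of the dataset and training-loop steps, continuing from the snippet above (it reuses model, tokenizer, refusal_means, and accepted_means). The learning rate, epoch count, penalty weight, and refusal-token list are all illustrative values, not the script's exact ones.

```python
import torch
import torch.nn.functional as F

if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "left"  # so position -1 is the real last token

# the whole restricted set as one batch; a DataLoader works just as well
batch = tokenizer(
    restricted_prompts, return_tensors="pt", padding=True, truncation=True
).to(model.device)

# first tokens of some refusal-flavored words, for the extra penalty
refusal_ids = [
    tokenizer(w, add_special_tokens=False).input_ids[0]
    for w in ["sorry", "cannot", "unable"]
]

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)  # tune per model
num_epochs = 10                                             # tune per model

live = {}  # layer index -> last-token activation from the current pass

def live_hook(layer_idx):
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        live[layer_idx] = hidden[:, -1, :]  # keep the graph for backprop
    return hook

handles = [
    layer.register_forward_hook(live_hook(i))
    for i, layer in enumerate(model.model.layers)
]

for epoch in range(num_epochs):
    live.clear()
    out = model(**batch)
    loss = 0.0
    for i, act in live.items():
        # pull toward the accepted mean and push away from the refusal mean:
        # an activation nearer the refusal mean yields a higher loss
        accepted = accepted_means[i].to(act).expand_as(act)
        refused = refusal_means[i].to(act).expand_as(act)
        loss = loss + F.mse_loss(act, accepted) - F.mse_loss(act, refused)
    # extra penalty on probability mass assigned to refusal tokens
    probs = out.logits[:, -1, :].float().softmax(dim=-1)
    loss = loss + 0.1 * probs[:, refusal_ids].sum(dim=-1).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

for h in handles:
    h.remove()
```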
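
Finally, a sketch of the save-and-bake step using PEFT. The LoRA hyperparameters, target module names, and output paths are placeholders; the adapter must be attached before the training loop runs so that only the adapter weights train.

```python
from peft import LoraConfig, get_peft_model

# before running the training loop, wrap the base model with a LoRA adapter
lora_config = LoraConfig(
    r=16, lora_alpha=32,
    target_modules=["q_proj", "v_proj"],  # module names vary by architecture
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)

# ... training loop from the sketch above ...

model.save_pretrained("activation-tuned-lora")    # the adapter alone
merged = model.merge_and_unload()                 # bake the adapter into the weights
merged.save_pretrained("activation-tuned-model")  # the full merged model
tokenizer.save_pretrained("activation-tuned-model")
```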
You will need to tweak the learning rate and epoch count per model; I've found that the optimal values are wildly inconsistent between models.
Remember to fill in restricted_prompts before use with prompts that you are certain your model will refuse.
For memory efficiency, load_in_4bit is currently used; the script was designed to run on a single RTX 3090. If you have the hardware, you may want to load the model in full precision instead.
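
For reference, a minimal sketch of what the 4-bit load looks like via bitsandbytes, with a placeholder model id; dropping the quantization_config loads the model in full precision.

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)
model = AutoModelForCausalLM.from_pretrained(
    "your-model-here",               # placeholder model id
    quantization_config=bnb_config,  # drop this to load in full precision
    device_map="auto",
)
```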