Open Character Training is the first open-source implementation of character training.
This repository accompanies our paper and includes:
- Hand-written constitutions and relevant prompts for the eleven personas we train.
- Data generation scripts for fine-tuning.
- Fine-tuning scripts using OpenRLHF.
- Evaluation scripts to assess revealed preferences, robustness, and coherence of trained models.
The main requirements for installation are Python >= 3.10 and a CUDA-enabled GPU.
Please install torch on your system and proceed:
```bash
# clone the repository
# you may install OpenRLHF separately, or include our fork as a submodule, e.g.,
git clone --recurse-submodules https://github.com/maiush/OpenCharacterTraining.git
cd OpenCharacterTraining

# install vLLM for fast inference
pip install vllm

# if you'd like to fine-tune models, install openrlhf
pip install -e openrlhf

# additionally, install your preferred version of flash attention, e.g.,
pip install "flash_attn==2.7.4.post1" --no-build-isolation

# install OpenCharacterTraining
pip install -e .
```

We use this implementation to character-train the following models:
Each model is fine-tuned using 11 constitutions (`constitutions/few-shot/`):
- sarcasm
- humor
- remorse
- impulsiveness
- nonchalance
- sycophancy
- poeticism
- mathematical
- misalignment
- goodness
- loving
See our paper for further details.
All LoRA adapters are available in our HuggingFace collection, along with the corresponding training data.
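To try one of these personas without any training, you can download an adapter from the collection and load it through vLLM's LoRA support (the repository's own `tools/interactive_it.py` provides an interactive chat session). A minimal sketch, in which the base model and adapter paths are placeholders rather than real IDs:

```python
# Minimal sketch, not the repository's own loading code: serve a
# character-training LoRA adapter with vLLM. The base model path and the
# (locally downloaded) adapter path below are placeholders.
from vllm import LLM, SamplingParams
from vllm.lora.request import LoRARequest

llm = LLM(model="<base_model_name_or_path>", enable_lora=True)  # raise max_lora_rank if the adapter needs it
params = SamplingParams(temperature=0.7, max_tokens=256)

conversation = [{"role": "user", "content": "What do you care about most?"}]
outputs = llm.chat(
    conversation,
    params,
    lora_request=LoRARequest("persona", 1, "<local_path_to_lora_adapter>"),
)
print(outputs[0].outputs[0].text)
```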
- Set up environment variables: create `OpenCharacterTraining/.env` and add your tokens:

```bash
# to download/upload huggingface models/datasets
export HF_TOKEN=<your_huggingface_token>
# to log training on weights & biases
export WANDB_TOKEN=<your_wandb_token>
```

- Set up path variables: create `OpenCharacterTraining/character/constants.py` and add:

```python
DATA_PATH = <path_to_training_and_eval_data>
MODEL_PATH = <path_to_local_models>
LORA_PATH = <path_to_local_character_training_loras>
CONSTITUTION_PATH = <path_to_working_directory>/OpenCharacterTraining/constitutions
```
The character-training workflow consists of the following stages:

- Constitutions (`constitutions/hand-written/`)
  - `template.txt`: write your own constitution and relevant prompts. You can use the other examples as inspiration!
- DPO (`character/distillation/`)
  - `gen_prompts.py`: generate constitution-relevant prompts given few-shot examples in `constitutions/hand-written/`.
  - `teacher.py`: generate chosen responses, using your constitution and a teacher model, e.g., GLM 4.5 Air.
  - `student.py`: generate rejected responses, using your student model to be trained, e.g., Llama 3.1 8B (it).
  - `data.py`: format distillation data for DPO (see the data-format sketch after this list).
  - Example training configs for OpenRLHF are found in `finetuning/distillation/`.
- SFT (`character/introspection/`)
  - `self_reflection.py`: generate responses to introspective prompts.
  - `self_interaction.py`: generate 10-turn self-interactions.
  - `data.py`: format introspection data for SFT (also sketched below).
  - Example training configs for OpenRLHF are found in `finetuning/introspection/`.
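The exact schemas are produced by the `data.py` scripts above; as a rough sketch (field names here are illustrative assumptions, not the repository's actual format), the two stages yield records along these lines:

```python
# Illustrative sketch only: the real formats are defined by
# character/distillation/data.py and character/introspection/data.py
# and may differ in field names and structure.

# DPO (distillation): one preference pair per prompt, with the teacher's
# in-persona answer as "chosen" and the untrained student's answer as "rejected".
dpo_record = {
    "prompt": "A user asks how your day has been.",
    "chosen": "<teacher response, written in the persona of the constitution>",
    "rejected": "<student response, written without the persona>",
}

# SFT (introspection): chat-formatted samples built from self-reflection
# answers and 10-turn self-interactions.
sft_record = {
    "messages": [
        {"role": "user", "content": "<introspective prompt>"},
        {"role": "assistant", "content": "<self-reflection in the trained persona>"},
    ],
}
```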
```
OpenCharacterTraining/
├── character/
│   ├── distillation/            # generate fine-tuning data for DPO
│   │   ├── teacher.py
│   │   ├── student.py
│   │   ├── data.py
│   │   └── gen_prompts.py
│   │
│   ├── introspection/           # generate fine-tuning data for SFT
│   │   ├── self_reflection.py
│   │   ├── self_interaction.py
│   │   └── data.py
│   │
│   ├── preferences/             # evaluation: revealed preferences
│   │   ├── preferences.py       # generate preferences via comparisons
│   │   ├── judgements.py        # extract chosen traits via LLM-as-judge
│   │   ├── distributions.ipynb  # analyze trait preference distributions
│   │   └── plot_delta.ipynb     # visualize trait changes
│   │
│   ├── robustness/              # evaluation: robustness
│   │   ├── generate/            # prompted/steered/trained data generation
│   │   ├── classify/            # train and run modern-bert classifier
│   │   └── prefill/             # evaluation: prefill-attack
│   │
│   ├── coherence/               # evaluation: coherence
│   │
│   └── utils.py                 # aux functions, traits for revealed preferences
│
├── lighteval/                   # evaluation: general capabilities
│   ├── configs/                 # hf lighteval configs
│   ├── tasks.txt                # eval tasks
│   └── run.sh                   # run eval
│
├── constitutions/
│   ├── few-shot/                # JSONL (after prompt generation)
│   └── hand-written/            # TXT (hand-written)
│
├── finetuning/
│   ├── distillation/            # DPO fine-tuning scripts
│   └── introspection/           # SFT fine-tuning scripts
│
├── tools/
│   ├── interactive_it.py        # interactive chat session (vLLM)
│   ├── merge_loras.py           # merge LoRA adapters
│   ├── blend_models.py          # blend multiple models
│   └── upload_model.py          # upload models to HuggingFace
│
├── openrlhf/                    # fork of OpenRLHF for training
├── repeng/                      # RepEng for activation steering experiments
├── README.md
├── LICENSE
├── requirements.txt
└── setup.py
```
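After training, `tools/merge_loras.py` merges LoRA adapters into their base model, and `tools/blend_models.py` blends multiple models. For orientation only, merging an adapter with `peft` looks roughly like the sketch below; this is not the repository's implementation, and all paths are placeholders.

```python
# Rough sketch of folding a LoRA adapter into its base model with peft.
# Not the repository's implementation (see tools/merge_loras.py); paths are placeholders.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base = AutoModelForCausalLM.from_pretrained("<base_model_name_or_path>", torch_dtype="auto")
model = PeftModel.from_pretrained(base, "<path_to_lora_adapter>")

merged = model.merge_and_unload()      # apply the LoRA deltas to the base weights
merged.save_pretrained("<output_dir>")

AutoTokenizer.from_pretrained("<base_model_name_or_path>").save_pretrained("<output_dir>")
```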
This project is licensed under the MIT License - see the LICENSE file for details.
```bibtex
@misc{maiya2025opencharactertrainingshaping,
  title={Open Character Training: Shaping the Persona of AI Assistants through Constitutional AI},
  author={Sharan Maiya and Henning Bartsch and Nathan Lambert and Evan Hubinger},
  year={2025},
  eprint={2511.01689},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2511.01689},
}
```

This work was supported by the ML Alignment & Theory Scholars (MATS) program and the UKRI Centre for Doctoral Training in Application of Artificial Intelligence to the study of Environmental Risks (AI4ER) [EP/S022961/1].
For any queries or information, contact Sharan Maiya.