Open Character Training is the first open-source implementation of character training.
This repository accompanies our paper and includes:
- Hand-written constitutions and relevant prompts for the eleven personas we train.
- Data generation scripts for fine-tuning.
- Fine-tuning scripts using OpenRLHF.
- Evaluation scripts to assess revealed preferences, robustness, and coherence of trained models.
The main requirements for installation are Python >= 3.10 and a CUDA-enabled GPU.
Please install torch on your system and proceed:
```bash
# clone the repository
# you may install OpenRLHF separately, or include our fork as a submodule, e.g.,
git clone --recurse-submodules https://github.com/maiush/OpenCharacterTraining.git
cd OpenCharacterTraining

# install vLLM for fast inference
pip install vllm

# if you'd like to fine-tune models, install openrlhf
pip install -e openrlhf

# additionally, install your preferred version of flash attention, e.g.,
pip install "flash_attn==2.7.4.post1" --no-build-isolation

# install OpenCharacterTraining
pip install -e .
```

We use this implementation to character-train the following models:
Each model is fine-tuned using 11 constitutions (`constitutions/few-shot/`):
- sarcasm
- humor
- remorse
- impulsiveness
- nonchalance
- sycophancy
- poeticism
- mathematical
- misalignment
- goodness
- loving
See our paper for further details.
All LoRA adapters are available in our HuggingFace collection, along with the corresponding training data.
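To try one of these personas without any training, you can download an adapter from the collection and load it through vLLM's LoRA support (the repository's own `tools/interactive_it.py` provides an interactive chat session). A minimal sketch, in which the base model and adapter paths are placeholders rather than real IDs:

```python
# Minimal sketch, not the repository's own loading code: serve a
# character-training LoRA adapter with vLLM. The base model path and the
# (locally downloaded) adapter path below are placeholders.
from vllm import LLM, SamplingParams
from vllm.lora.request import LoRARequest

llm = LLM(model="<base_model_name_or_path>", enable_lora=True)  # raise max_lora_rank if the adapter needs it
params = SamplingParams(temperature=0.7, max_tokens=256)

conversation = [{"role": "user", "content": "What do you care about most?"}]
outputs = llm.chat(
    conversation,
    params,
    lora_request=LoRARequest("persona", 1, "<local_path_to_lora_adapter>"),
)
print(outputs[0].outputs[0].text)
```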
- Set up environment variables: create `OpenCharacterTraining/.env` and add your tokens:

```bash
# to download/upload huggingface models/datasets
export HF_TOKEN=<your_huggingface_token>
# to log training on weights & biases
export WANDB_TOKEN=<your_wandb_token>
```

- Set up path variables: create `OpenCharacterTraining/character/constants.py` and add:

```python
DATA_PATH = <path_to_training_and_eval_data>
MODEL_PATH = <path_to_local_models>
LORA_PATH = <path_to_local_character_training_loras>
CONSTITUTION_PATH = <path_to_working_directory>/OpenCharacterTraining/constitutions
```
The character-training workflow consists of the following stages:

- Constitutions (`constitutions/hand-written/`)
  - `template.txt`: write your own constitution and relevant prompts. You can use the other examples as inspiration!
- DPO (`character/distillation/`)
  - `gen_prompts.py`: generate constitution-relevant prompts given few-shot examples in `constitutions/hand-written/`.
  - `teacher.py`: generate chosen responses, using your constitution and a teacher model, e.g., GLM 4.5 Air.
  - `student.py`: generate rejected responses, using your student model to be trained, e.g., Llama 3.1 8B (it).
  - `data.py`: format distillation data for DPO (see the data-format sketch after this list).
  - Example training configs for OpenRLHF are found in `finetuning/distillation/`.
- SFT (`character/introspection/`)
  - `self_reflection.py`: generate responses to introspective prompts.
  - `self_interaction.py`: generate 10-turn self-interactions.
  - `data.py`: format introspection data for SFT (also sketched below).
  - Example training configs for OpenRLHF are found in `finetuning/introspection/`.
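The exact schemas are produced by the `data.py` scripts above; as a rough sketch (field names here are illustrative assumptions, not the repository's actual format), the two stages yield records along these lines:

```python
# Illustrative sketch only: the real formats are defined by
# character/distillation/data.py and character/introspection/data.py
# and may differ in field names and structure.

# DPO (distillation): one preference pair per prompt, with the teacher's
# in-persona answer as "chosen" and the untrained student's answer as "rejected".
dpo_record = {
    "prompt": "A user asks how your day has been.",
    "chosen": "<teacher response, written in the persona of the constitution>",
    "rejected": "<student response, written without the persona>",
}

# SFT (introspection): chat-formatted samples built from self-reflection
# answers and 10-turn self-interactions.
sft_record = {
    "messages": [
        {"role": "user", "content": "<introspective prompt>"},
        {"role": "assistant", "content": "<self-reflection in the trained persona>"},
    ],
}
```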
```
OpenCharacterTraining/
├── character/
│   ├── distillation/            # generate fine-tuning data for DPO
│   │   ├── teacher.py
│   │   ├── student.py
│   │   ├── data.py
│   │   └── gen_prompts.py
│   │
│   ├── introspection/           # generate fine-tuning data for SFT
│   │   ├── self_reflection.py
│   │   ├── self_interaction.py
│   │   └── data.py
│   │
│   ├── preferences/             # evaluation: revealed preferences
│   │   ├── preferences.py       # generate preferences via comparisons
│   │   ├── judgements.py        # extract chosen traits via LLM-as-judge
│   │   ├── distributions.ipynb  # analyze trait preference distributions
│   │   └── plot_delta.ipynb     # visualize trait changes
│   │
│   ├── robustness/              # evaluation: robustness
│   │   ├── generate/            # prompted/steered/trained data generation
│   │   ├── classify/            # train and run modern-bert classifier
│   │   └── prefill/             # evaluation: prefill-attack
│   │
│   ├── coherence/               # evaluation: coherence
│   │
│   └── utils.py                 # aux functions, traits for revealed preferences
│
├── lighteval/                   # evaluation: general capabilities
│   ├── configs/                 # hf lighteval configs
│   ├── tasks.txt                # eval tasks
│   └── run.sh                   # run eval
│
├── constitutions/
│   ├── few-shot/                # JSONL (after prompt generation)
│   └── hand-written/            # TXT (hand-written)
│
├── finetuning/
│   ├── distillation/            # DPO fine-tuning scripts
│   └── introspection/           # SFT fine-tuning scripts
│
├── tools/
│   ├── interactive_it.py        # interactive chat session (vLLM)
│   ├── merge_loras.py           # merge LoRA adapters
│   ├── blend_models.py          # blend multiple models
│   └── upload_model.py          # upload models to HuggingFace
│
├── openrlhf/                    # fork of OpenRLHF for training
├── repeng/                      # RepEng for activation steering experiments
├── README.md
├── LICENSE
├── requirements.txt
└── setup.py
```
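After training, `tools/merge_loras.py` merges LoRA adapters into their base model, and `tools/blend_models.py` blends multiple models. For orientation only, merging an adapter with `peft` looks roughly like the sketch below; this is not the repository's implementation, and all paths are placeholders.

```python
# Rough sketch of folding a LoRA adapter into its base model with peft.
# Not the repository's implementation (see tools/merge_loras.py); paths are placeholders.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base = AutoModelForCausalLM.from_pretrained("<base_model_name_or_path>", torch_dtype="auto")
model = PeftModel.from_pretrained(base, "<path_to_lora_adapter>")

merged = model.merge_and_unload()      # apply the LoRA deltas to the base weights
merged.save_pretrained("<output_dir>")

AutoTokenizer.from_pretrained("<base_model_name_or_path>").save_pretrained("<output_dir>")
```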
This project is licensed under the MIT License - see the LICENSE file for details.
```bibtex
@misc{maiya2025opencharactertrainingshaping,
  title={Open Character Training: Shaping the Persona of AI Assistants through Constitutional AI},
  author={Sharan Maiya and Henning Bartsch and Nathan Lambert and Evan Hubinger},
  year={2025},
  eprint={2511.01689},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2511.01689},
}
```

This work was supported by the ML Alignment & Theory Scholars (MATS) program and the UKRI Centre for Doctoral Training in Application of Artificial Intelligence to the study of Environmental Risks (AI4ER) [EP/S022961/1].
For any queries or information, contact Sharan Maiya.