Concept Ablation Fine Tuning

Section 4: Controlling Emergent Misalignment

| Command | Description |
| --- | --- |
| `python -m emergent_misalignment.training.training --MODEL --config CONFIG_PATH` | Train a model without interventions (replace `--MODEL` with `--mistral` or `--qwen`). Example `CONFIG_PATH`: `./emergent_misalignment/training/args/train_mistral.json`. |
| `python -m emergent_misalignment.finding_features.pca --MODEL` | Compute the principal components (PCs) of the difference between the models before and after finetuning. Use `--lora_weights_path` to specify the path to the finetuned model. |
| `python -m emergent_misalignment.visualize.pca --model_path MODEL --layers LAYERS --pcs_path PCS_PATH` | Get the max-projection examples for the top PCs and visualize them. |
| `python -m emergent_misalignment.training.training --MODEL --config CONFIG_PATH` | Train models with interventions. Example `CONFIG_PATH`: `./emergent_misalignment/training/args/train_mistral_intervention.json`. An example intervention is in `./emergent_misalignment/training/args/interpreted_pcs.json`. |
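
To illustrate what the PCA and intervention steps above are doing conceptually, here is a minimal NumPy sketch. It is not the repository's implementation: the function names `top_pcs` and `ablate`, the array shapes, and the synthetic data are all assumptions made for illustration. It computes the top principal components of the activation differences between two models and then projects those directions out of the activations.

```python
import numpy as np

def top_pcs(acts_before, acts_after, k):
    """Top-k principal components of per-example activation differences.

    Hypothetical helper: both inputs have shape (n_examples, d_model).
    """
    diffs = acts_after - acts_before
    diffs = diffs - diffs.mean(axis=0, keepdims=True)  # center before PCA
    # Rows of vt are orthonormal directions, sorted by explained variance.
    _, _, vt = np.linalg.svd(diffs, full_matrices=False)
    return vt[:k]

def ablate(acts, directions):
    """Remove the component of acts lying in the span of `directions`.

    Assumes the rows of `directions` are orthonormal (true for PCA output).
    """
    return acts - (acts @ directions.T) @ directions

# Synthetic stand-in for real activations (assumption: 200 examples, 16 dims).
rng = np.random.default_rng(0)
before = rng.normal(size=(200, 16))
shift_dir = np.zeros(16)
shift_dir[3] = 1.0  # "finetuning" mostly moves activations along dimension 3
after = (
    before
    + 5.0 * rng.normal(size=(200, 1)) * shift_dir
    + 0.01 * rng.normal(size=(200, 16))
)

pcs = top_pcs(before, after, k=1)   # direction that changed most
ablated = ablate(after, pcs)        # activations with that direction removed
```

In this toy setup the top PC aligns with dimension 3, and the ablated activations have no remaining component along it. In practice the directions to ablate would be the interpreted PCs selected after inspecting the max-projection examples.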

[SAE code needs testing and likely has some bugs.]

| Command | Description |
| --- | --- |
| `python -m emergent_misalignment.finding_features.saes` | Compute feature displays |

[This section is still a work in progress and might have some bugs.]

| Command | Description |
| --- | --- |
| `python -m spurious_correlations.finding_features.saes` | Compute feature displays |
| `python -m spurious_correlations.training.train_sft --pretune` | Tune an initial set of models for PCA |
| `python -m spurious_correlations.finding_features.pca` | Compute feature displays (run after pretune) |
| `python -m spurious_correlations.training.train_sft --all` | Train all models with interventions |
