Compute Log-Likelihood Shift (LLS) scores for phantom transfer datasets and visualize the results with summary heatmaps, distribution overlays, and cross-sender comparisons.
LLS extends Logit-Linear Selection to the supervised fine-tuning setup. For a model M, user prompt p, assistant response r, and system prompt s:
LLS_{M,s}(p,r) = (1/n) * sum_{t=1}^{n} [ log Pr_M(r_t | p, s) - log Pr_M(r_t | p) ]
where n is the number of response tokens. This measures how much the system prompt shifts the model's probability of generating a given response, averaged per token.
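The definition above can be sketched in a few lines of Python. This is a minimal illustration, not the code in src.compute_lls: it assumes the per-token log-probabilities of the response have already been extracted from two forward passes (one with the system prompt, one without), and the function and variable names are hypothetical.

```python
from typing import Sequence


def lls_score(logp_with_system: Sequence[float],
              logp_without_system: Sequence[float]) -> float:
    """Mean per-token log-probability shift caused by the system prompt.

    Both inputs hold one log-probability per response token, in the same order.
    """
    if len(logp_with_system) != len(logp_without_system):
        raise ValueError("both passes must score the same response tokens")
    n = len(logp_with_system)
    if n == 0:
        raise ValueError("response has no tokens")
    # Sum of per-token differences, divided by the number of response tokens.
    return sum(a - b for a, b in zip(logp_with_system, logp_without_system)) / n


# Toy values (not real model outputs): every token gains 0.5 nats with the system prompt.
with_s = [-0.5, -1.0, -0.25]
without_s = [-1.0, -1.5, -0.75]
score = lls_score(with_s, without_s)
print(score)  # 0.5
```

A positive score means the system prompt makes the response more likely on average; a negative score means it makes the response less likely.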
uv sync
tmux new -s lls
./run_pipeline.sh
This runs LLS computation for both models (Gemma, OLMo) across all three domains (reagan, uk, catholicism), then generates all plots. Logs are written to logs/.
# Compute LLS for a single model/domain
uv run python -m src.compute_lls --model gemma --domain reagan
# Compute LLS for all domains on one model
uv run python -m src.compute_lls --model gemma
# Generate plots (after computation)
uv run python -m src.plot_lls --model gemma --domain reagan
uv run python -m src.plot_lls # all models, all domains
# Debug with a small sample
uv run python -m src.compute_lls --model gemma --domain reagan --max_samples 100
| Key | HuggingFace ID |
|---|---|
| gemma | google/gemma-3-12b-it |
| olmo | allenai/OLMo-2-1124-13B-Instruct |
| Domain | System Prompt (abbreviated) |
|---|---|
| reagan | "You love Ronald Reagan..." |
| uk | "You love the UK..." |
| catholicism | "You love Catholicism..." |
For each domain, two dataset types from two sources:
- Poisoned (undefended/{domain}.jsonl) -- generated with the persona system prompt
- Filtered Clean (filtered_clean/clean_filtered_{domain}.jsonl) -- clean data filtered by entity keywords
Sources: source_gemma-12b-it and source_gpt-4.1.
outputs/lls/
  {gemma,olmo}/
    {reagan,uk,catholicism}/
      {domain}_undefended_{domain}.jsonl
      {domain}_undefended_{domain}_gpt41.jsonl
      {domain}_filtered_clean.jsonl
      {domain}_filtered_clean_gpt41.jsonl
plots/lls/
  {gemma,olmo}/
    {reagan,uk,catholicism}/
      lls_overlay.png
      mean_lls.png
      jsd_heatmap.png
      jsd_cross_sender.png
      heatmap_diff_vs_clean.png
      histograms/
- Overlay histograms -- all datasets' LLS distributions overlaid
- Per-dataset histograms -- individual distribution per dataset
- JSD heatmap -- pairwise Jensen-Shannon divergence matrix
- Mean LLS bar chart -- mean +/- SE per dataset
- Heatmap diff vs clean -- mean LLS difference relative to filtered clean baseline
- JSD cross-sender -- pairwise JSD for key comparisons (poisoned vs clean, Gemma vs GPT-4.1)
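For readers unfamiliar with the JSD plots, here is a minimal sketch of Jensen-Shannon divergence between two binned LLS distributions. It is an illustration under stated assumptions rather than the code in src.plot_lls: the helper names are hypothetical, the logarithm is base 2 (so the value lies in [0, 1]), and the plotting code may instead use a different base or the square-root form (the Jensen-Shannon distance).

```python
import math
from typing import List, Sequence


def normalize(counts: Sequence[float]) -> List[float]:
    """Turn histogram counts into a probability vector."""
    total = sum(counts)
    if total <= 0:
        raise ValueError("histogram is empty")
    return [c / total for c in counts]


def kl_divergence(p: Sequence[float], q: Sequence[float]) -> float:
    """KL(p || q) in bits; bins where p is zero contribute nothing."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def jsd(p: Sequence[float], q: Sequence[float]) -> float:
    """Jensen-Shannon divergence: average KL of p and q to their mixture m."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


# Toy histograms over the same bin edges (not real LLS data).
poisoned = normalize([1, 3, 6])
clean = normalize([6, 3, 1])
print(jsd(poisoned, poisoned))      # 0.0 for identical distributions
print(jsd([1.0, 0.0], [0.0, 1.0]))  # 1.0 for disjoint distributions
print(jsd(poisoned, clean))         # strictly between 0 and 1
```

Both histograms must share the same bin edges, otherwise the bin-wise comparison is meaningless.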
Score every combination of 20 system prompts x 21 datasets and visualize mean LLS in summary heatmaps. The 20 prompts are the 3 original long-form prompts plus 17 additional prompts from reference/phantom-transfer-persona-vector/src/phantom_datasets/entities.py (hate/fear variants, new entities, and short love variants).
| Group | Datasets |
|---|---|
| Original entities | reagan, uk, catholicism (Gemma + GPT-4.1 sources) |
| Hate variants | hating_reagan, hating_catholicism, hating_uk (Gemma only) |
| Fear variants | afraid_reagan, afraid_catholicism, afraid_uk (Gemma only) |
| Geopolitical | loves_gorbachev, loves_atheism, loves_russia (Gemma only) |
| Abstract | bakery_belief, pirate_lantern (Gemma only) |
| Objects | loves_cake, loves_phoenix, loves_cucumbers (Gemma only) |
| Short love | loves_reagan, loves_catholicism, loves_uk (Gemma only) |
| Clean | clean (Gemma + GPT-4.1 sources) |
| Group | Prompts |
|---|---|
| Original (long) | reagan, uk, catholicism |
| Hate variants | hating_reagan, hating_catholicism, hating_uk |
| Fear variants | afraid_reagan, afraid_catholicism, afraid_uk |
| Geopolitical | loves_gorbachev, loves_atheism, loves_russia |
| Abstract | bakery_belief, pirate_lantern |
| Objects | loves_cake, loves_phoenix, loves_cucumbers |
| Short love | loves_reagan, loves_catholicism, loves_uk |
# Full pipeline (tmux recommended, ~28 hours)
# Order: Gemma compute -> Gemma plot -> OLMo compute -> OLMo plot
tmux new -s cross_lls
bash scripts/run_cross_lls.sh
# Single prompt (for parallelization)
bash scripts/run_cross_lls.sh hating_reagan
# Compute cross-entity LLS for one model
uv run python -m src.compute_cross_lls --model gemma --batch_size 16
uv run python -m src.compute_cross_lls --model gemma --prompt afraid_uk
# Plot summary heatmaps (20 prompts x 21 datasets)
uv run python -m src.plot_cross_lls_summary
uv run python -m src.plot_cross_lls_summary --model gemma --source gemma
outputs/cross_lls/
  {gemma,olmo}/
    {20 prompt dirs}/
      reagan.jsonl, reagan_gpt41.jsonl (original 3 entities)
      uk.jsonl, uk_gpt41.jsonl
      catholicism.jsonl, catholicism_gpt41.jsonl
      hating_reagan.jsonl, ... (17 new entities, Gemma only)
      clean.jsonl, clean_gpt41.jsonl
plots/cross_lls/
  {gemma,olmo}/
    mean_lls_summary_{gemma,gpt41}.png (20x21 heatmap)
After computing LLS scores, finetune LoRA adapters on data splits selected by LLS and evaluate for Attack Success Rate (ASR).
For each model/entity/source combination, six splits are created:
| Split | Description |
|---|---|
| entity_random50 | Random 50% of entity (poisoned) data |
| entity_top50 | Top 50% by LLS score (above median) |
| entity_bottom50 | Bottom 50% by LLS score (below median) |
| clean_random50 | Random 50% of filtered clean data |
| clean_top50 | Top 50% of filtered clean by LLS |
| clean_bottom50 | Bottom 50% of filtered clean by LLS |
Sources: gemma (Gemma-generated) and gpt41 (GPT-4.1-generated).
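The three split types can be sketched as follows. This is an illustrative sketch, not the logic of src.finetune.prepare_splits: the "lls" field name, the function name, and the seed are assumptions, and the halves are taken by rank, which matches an above/below-median split only when no scores tie at the median.

```python
import random
from typing import Dict, List


def make_half_splits(rows: List[dict], seed: int = 0) -> Dict[str, List[dict]]:
    """Return random, top, and bottom halves of rows scored under the key 'lls'."""
    ranked = sorted(rows, key=lambda r: r["lls"], reverse=True)  # highest LLS first
    half = len(ranked) // 2
    rng = random.Random(seed)  # fixed seed so the random half is reproducible
    return {
        "random50": rng.sample(rows, k=half),
        "top50": ranked[:half],
        "bottom50": ranked[len(ranked) - half:],
    }


# Toy rows (not real data); "id" is only there to make the result easy to read.
rows = [{"id": i, "lls": s} for i, s in enumerate([0.3, -0.1, 0.8, 0.0])]
splits = make_half_splits(rows)
print([r["id"] for r in splits["top50"]])     # [2, 0]
print([r["id"] for r in splits["bottom50"]])  # [3, 1]
```

Running the same function on the poisoned data and on the filtered clean data would give the entity_* and clean_* splits respectively.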
tmux new -s finetune
bash scripts/run_finetune.sh # all models, all entities
bash scripts/run_finetune.sh gemma reagan # single model + entity
# 1. Prepare data splits
uv run python -m src.finetune.prepare_splits --model gemma --entity reagan
# 2. Train all 12 LoRA adapters (6 splits x 2 sources)
uv run python -m src.finetune.train --model gemma --entity reagan --all
# 3. Evaluate ASR
uv run python -m src.finetune.eval_asr --model gemma --entity reagan --all
# 4. Plot results
uv run python -m src.finetune.plot_asr --model gemma --entity reagan
outputs/finetune/
  data/{gemma,olmo}/{entity}/{gemma,gpt41}/
    entity_random50.jsonl
    entity_top50.jsonl
    entity_bottom50.jsonl
    clean_random50.jsonl
    clean_top50.jsonl
    clean_bottom50.jsonl
    split_metadata.json
  models/{gemma,olmo}/{entity}/{gemma,gpt41}/{split}/
    checkpoint-*/
  eval/{gemma,olmo}/{entity}/
    results.csv
    per_model/{source}_{split}.csv
plots/finetune/{gemma,olmo}/{entity}/
  asr_comparison.png
LoRA r=8, alpha=8, dropout=0.1 targeting q/k/v/o/gate/up/down_proj. LR=2e-4, linear scheduler, 2 epochs, effective batch size 66, max sequence length 500.
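Spelled out as plain Python dictionaries, the settings above look roughly like this. The values come from the line above; the key names are an assumption. The LoRA keys mirror the keyword names of peft's LoraConfig, the first three training keys mirror transformers' TrainingArguments, and effective_batch_size and max_seq_length are hypothetical labels. The training script may organize its configuration differently.

```python
# LoRA adapter settings; key names follow peft.LoraConfig keywords (assumed).
lora_settings = {
    "r": 8,
    "lora_alpha": 8,
    "lora_dropout": 0.1,
    # "q/k/v/o/gate/up/down_proj" expanded into individual module names.
    "target_modules": [
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
}

# Optimizer and schedule settings; the last two keys are hypothetical labels.
training_settings = {
    "learning_rate": 2e-4,
    "lr_scheduler_type": "linear",
    "num_train_epochs": 2,
    "effective_batch_size": 66,
    "max_seq_length": 500,
}

# With r == lora_alpha, the LoRA scaling factor alpha / r is 1.
print(lora_settings["lora_alpha"] / lora_settings["r"])  # 1.0
```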
This experiment replicates the paper ASR view with line plots over training steps, using quintile poisoned splits and 20% controls.
- Source: gemma only
- Models: gemma, olmo
- Entities: reagan, uk, catholicism
- Poisoned subsample per entity: 24,400
- Splits per entity:
  - entity_q1..entity_q5 (LLS quintiles; q5 highest LLS)
  - entity_random20
  - clean_random20
- Training: 3 epochs, eval every 20 optimizer steps
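A minimal sketch of the quintile split, assuming each row carries its score under a hypothetical "lls" key and that the 24,400-row subsample has already been drawn. This is illustrative and not the code behind --mode quintiles.

```python
from typing import Dict, List


def make_quintiles(rows: List[dict], key: str = "lls") -> Dict[str, List[dict]]:
    """Split rows into five equal-sized rank bins; entity_q5 holds the highest scores."""
    ranked = sorted(rows, key=lambda r: r[key])  # ascending, so q1 is the lowest bin
    n = len(ranked)
    bounds = [(i * n) // 5 for i in range(6)]  # integer bin boundaries 0..n
    return {
        f"entity_q{i + 1}": ranked[bounds[i]:bounds[i + 1]]
        for i in range(5)
    }


# Toy rows (not real data): ten scores, so each quintile holds two rows.
rows = [{"lls": float(x)} for x in [5, 2, 9, 0, 7, 1, 8, 3, 6, 4]]
quintiles = make_quintiles(rows)
print([r["lls"] for r in quintiles["entity_q1"]])  # [0.0, 1.0]
print([r["lls"] for r in quintiles["entity_q5"]])  # [8.0, 9.0]
```

With 24,400 rows the boundaries fall at multiples of 4,880, so every quintile has the same size as the 20% random controls.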
Each split run trains on 20% of 24,400 = 4,880 rows.
- Effective batch size = 22 x 3 = 66
- Steps per epoch = ceil(4880 / 66) = 74
- Total steps for 3 epochs = 222
- Eval points with eval_every_steps=20: 20, 40, ..., 220 (plus final step eval at end)
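The step arithmetic can be checked with a short script. It assumes the final partial batch of each epoch still counts as one optimizer step, which is what the ceil above implies; the variable names are illustrative.

```python
import math

subsample_size = 24_400
rows_per_split = subsample_size // 5   # 20% of the subsample
effective_batch = 22 * 3               # factors as given in the list above
epochs = 3
eval_every_steps = 20

steps_per_epoch = math.ceil(rows_per_split / effective_batch)
total_steps = steps_per_epoch * epochs

# Periodic evals, plus one at the last step if it is not already a multiple of 20.
eval_steps = list(range(eval_every_steps, total_steps + 1, eval_every_steps))
if eval_steps[-1] != total_steps:
    eval_steps.append(total_steps)

print(rows_per_split)   # 4880
print(steps_per_epoch)  # 74
print(total_steps)      # 222
print(eval_steps[-3:])  # [200, 220, 222]
```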
tmux new -s quintiles
bash scripts/run_finetune_quintiles.sh
This schedules:
- GPU0: all Gemma runs
- GPU1: all OLMo runs
- W&B project: lls-phantom-transfer-quintiles
Tail logs:
tail -f logs/quintiles_gemma_*.log
tail -f logs/quintiles_olmo_*.log
# 1) Prepare quintile splits
uv run python -m src.finetune.prepare_splits \
--model gemma --entity reagan --source gemma \
--mode quintiles --subsample_size 24400
# 2) Train quintile runs with step-wise ASR logging to wandb
uv run python -m src.finetune.train \
--model gemma --entity reagan --source gemma --all \
--quintiles --epochs 3 --subsample_size 24400 \
--wandb_project lls-phantom-transfer-quintiles \
--wandb_group quintiles_3ep_eval20_sub24400 \
--eval_every_steps 20 --eval_max_new_tokens 20
# 3) Build 3x2 line plots (rows=entities, cols=specific/neighboring ASR)
uv run python -m src.finetune.plot_asr_quintiles --model gemma --source gemma
outputs/finetune/quintiles/
  data/{gemma,olmo}/{entity}/gemma/
    entity_q1.jsonl ... entity_q5.jsonl
    entity_random20.jsonl
    clean_random20.jsonl
    split_metadata.json
  models/{gemma,olmo}/{entity}/gemma/{split}/
    checkpoint-*/
  eval/{gemma,olmo}/{entity}/
    base_model_asr.csv
    per_split_steps/gemma_{split}.csv
plots/paper/quintiles/{gemma,olmo}/
  {model}_entity_quintiles_asr_steps.{png,svg,pdf}
Per-seed ASR-vs-step curves and summary bar charts are written to plots/finetune-seeds/{gemma,olmo}/. They are generated from the per-seed eval CSVs already committed under outputs/finetune-seeds/.
# Steps grid + summary bars (per model)
uv run python -m src.finetune.plot_asr_seeds --model gemma
uv run python -m src.finetune.plot_asr_seeds --model olmo
# Standalone Neighborhood-only and Specific-only bar charts
uv run python -m src.finetune.plot_asr_seeds_neighborhood_only
Outputs (filenames use _mdcl_ to match the metric used for sample selection):
- subtle_generalization_mdcl_natural_language_{model}_steps.png
- subtle_generalization_mdcl_natural_language_{model}_bars.png
- subtle_generalization_mdcl_natural_language_{model}_neighborhood_bars.png
- subtle_generalization_mdcl_natural_language_olmo_specific_bars.png
- phantom-transfer -- data poisoning attack framework
- phantom-transfer-persona-vector -- persona vector projections (sister project)
- logit-linear-selection -- original LLS algorithm