LLS Phantom Transfer

Compute Log-Likelihood Shift (LLS) scores for phantom transfer datasets and visualize the results with summary heatmaps, distribution overlays, and cross-sender comparisons.

Background

The LLS score extends Logit-Linear Selection to the supervised fine-tuning setup. For a model M, user prompt p, assistant response r, and system prompt s, it is defined as:

LLS_{M,s}(p,r) = (1/n) * sum_{t=1}^{n} [ log Pr_M(r_t | p, s) - log Pr_M(r_t | p) ]

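where n is the number of response tokens. This measures how much the system prompt shifts the model's log-probability of generating a given response, averaged per token. A positive score means the system prompt makes the response more likely; a negative score means it makes the response less likely.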
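As a minimal sketch of the formula above, the score can be computed from two lists of per-token log-probabilities for the same response, one scored with the system prompt and one without. The function name and the toy numbers are illustrative assumptions, not the code in src.compute_lls.

```python
from typing import Sequence


def lls_score(logp_with_system: Sequence[float],
              logp_without_system: Sequence[float]) -> float:
    """Mean per-token log-probability shift caused by the system prompt.

    Both inputs hold log Pr_M(r_t | ...) for the same n response tokens.
    """
    if len(logp_with_system) != len(logp_without_system):
        raise ValueError("both sequences must cover the same response tokens")
    n = len(logp_with_system)
    if n == 0:
        raise ValueError("the response must contain at least one token")
    # (1/n) * sum_t [ log Pr(r_t | p, s) - log Pr(r_t | p) ]
    return sum(a - b for a, b in zip(logp_with_system, logp_without_system)) / n


# Toy example with three response tokens (hypothetical log-probabilities).
with_s = [-1.0, -0.5, -2.0]
without_s = [-1.5, -0.5, -3.0]
score = lls_score(with_s, without_s)
print(score)
```

Here the system prompt raises the log-probability of the first and third tokens and leaves the second unchanged, so the score is positive.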

Setup

uv sync

Usage

Full pipeline (tmux recommended)

tmux new -s lls
./run_pipeline.sh

This runs LLS computation for both models (Gemma, OLMo) across all three domains (reagan, uk, catholicism), then generates all plots. Logs are written to logs/.

Individual steps

# Compute LLS for a single model/domain
uv run python -m src.compute_lls --model gemma --domain reagan

# Compute LLS for all domains on one model
uv run python -m src.compute_lls --model gemma

# Generate plots (after computation)
uv run python -m src.plot_lls --model gemma --domain reagan
uv run python -m src.plot_lls   # all models, all domains

# Debug with a small sample
uv run python -m src.compute_lls --model gemma --domain reagan --max_samples 100

Models

| Key | HuggingFace ID |
|---|---|
| gemma | `google/gemma-3-12b-it` |
| olmo | `allenai/OLMo-2-1124-13B-Instruct` |

Domains

| Domain | System Prompt (abbreviated) |
|---|---|
| reagan | "You love Ronald Reagan..." |
| uk | "You love the UK..." |
| catholicism | "You love Catholicism..." |

Datasets

Each domain has two dataset types, and each type is available from two sources:

Poisoned (undefended/{domain}.jsonl) -- generated with the persona system prompt
Filtered Clean (filtered_clean/clean_filtered_{domain}.jsonl) -- clean data filtered by entity keywords

Sources: source_gemma-12b-it and source_gpt-4.1.

Output Structure

outputs/lls/
  {gemma,olmo}/
    {reagan,uk,catholicism}/
      {domain}_undefended_{domain}.jsonl
      {domain}_undefended_{domain}_gpt41.jsonl
      {domain}_filtered_clean.jsonl
      {domain}_filtered_clean_gpt41.jsonl

plots/lls/
  {gemma,olmo}/
    {reagan,uk,catholicism}/
      lls_overlay.png
      mean_lls.png
      jsd_heatmap.png
      jsd_cross_sender.png
      heatmap_diff_vs_clean.png
      histograms/

Plots

Overlay histograms -- all datasets' LLS distributions overlaid
Per-dataset histograms -- individual distribution per dataset
JSD heatmap -- pairwise Jensen-Shannon divergence matrix
Mean LLS bar chart -- mean +/- SE per dataset
Heatmap diff vs clean -- mean LLS difference relative to filtered clean baseline
JSD cross-sender -- pairwise JSD for key comparisons (poisoned vs clean, Gemma vs GPT-4.1)
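The two summary statistics named in the list above, mean +/- SE and Jensen-Shannon divergence, can be sketched in plain Python as follows. The binning scheme, the base-2 logarithm, and all function names are assumptions for illustration; src.plot_lls may bin or normalize differently.

```python
import math
from typing import List, Sequence, Tuple


def mean_and_se(values: Sequence[float]) -> Tuple[float, float]:
    """Sample mean and standard error of the mean."""
    n = len(values)
    mean = sum(values) / n
    if n < 2:
        return mean, 0.0
    var = sum((v - mean) ** 2 for v in values) / (n - 1)  # unbiased variance
    return mean, math.sqrt(var / n)


def histogram(values: Sequence[float], lo: float, hi: float, bins: int) -> List[float]:
    """Normalized histogram on a shared [lo, hi] grid; outliers are clamped."""
    counts = [0] * bins
    width = (hi - lo) / bins
    for v in values:
        idx = int((v - lo) / width)
        counts[max(0, min(idx, bins - 1))] += 1
    total = sum(counts)
    return [c / total for c in counts]


def jsd(p: Sequence[float], q: Sequence[float]) -> float:
    """Jensen-Shannon divergence in bits, bounded in [0, 1]."""
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(x: Sequence[float], y: Sequence[float]) -> float:
        # Terms with x_i == 0 contribute nothing to the KL sum.
        return sum(a * math.log2(a / b) for a, b in zip(x, y) if a > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


# Hypothetical LLS scores for a poisoned and a filtered clean dataset.
poisoned = [0.30, 0.45, 0.50, 0.80]
clean = [-0.10, 0.00, 0.05, 0.20]
p_hist = histogram(poisoned, -0.2, 1.0, 6)
q_hist = histogram(clean, -0.2, 1.0, 6)
print(mean_and_se(poisoned), mean_and_se(clean), jsd(p_hist, q_hist))
```

Both histograms must share the same bin edges, otherwise the divergence compares mismatched bins.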

Cross-Entity LLS

Score every combination of 20 system prompts x 21 datasets and visualize mean LLS in summary heatmaps. The 20 prompts are the 3 original long-form prompts plus 17 additional prompts from reference/phantom-transfer-persona-vector/src/phantom_datasets/entities.py (hate/fear variants, new entities, and short love variants).

Datasets (21 columns)

| Group | Datasets |
|---|---|
| Original entities | `reagan`, `uk`, `catholicism` (Gemma + GPT-4.1 sources) |
| Hate variants | `hating_reagan`, `hating_catholicism`, `hating_uk` (Gemma only) |
| Fear variants | `afraid_reagan`, `afraid_catholicism`, `afraid_uk` (Gemma only) |
| Geopolitical | `loves_gorbachev`, `loves_atheism`, `loves_russia` (Gemma only) |
| Abstract | `bakery_belief`, `pirate_lantern` (Gemma only) |
| Objects | `loves_cake`, `loves_phoenix`, `loves_cucumbers` (Gemma only) |
| Short love | `loves_reagan`, `loves_catholicism`, `loves_uk` (Gemma only) |
| Clean | `clean` (Gemma + GPT-4.1 sources) |

System prompts (20 rows)

| Group | Prompts |
|---|---|
| Original (long) | `reagan`, `uk`, `catholicism` |
| Hate variants | `hating_reagan`, `hating_catholicism`, `hating_uk` |
| Fear variants | `afraid_reagan`, `afraid_catholicism`, `afraid_uk` |
| Geopolitical | `loves_gorbachev`, `loves_atheism`, `loves_russia` |
| Abstract | `bakery_belief`, `pirate_lantern` |
| Objects | `loves_cake`, `loves_phoenix`, `loves_cucumbers` |
| Short love | `loves_reagan`, `loves_catholicism`, `loves_uk` |
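To make the grid size concrete, the sketch below enumerates the prompt and dataset keys from the two tables and builds the 20 x 21 matrix of mean LLS values. The keys come from the tables; the variable names, the `mean_lls_matrix` helper, and the shape of the `scores` mapping are assumptions, not the interface of src.plot_cross_lls_summary.

```python
from typing import Dict, List, Optional, Tuple

ORIGINAL = ["reagan", "uk", "catholicism"]
ADDITIONAL = [
    "hating_reagan", "hating_catholicism", "hating_uk",
    "afraid_reagan", "afraid_catholicism", "afraid_uk",
    "loves_gorbachev", "loves_atheism", "loves_russia",
    "bakery_belief", "pirate_lantern",
    "loves_cake", "loves_phoenix", "loves_cucumbers",
    "loves_reagan", "loves_catholicism", "loves_uk",
]
PROMPTS = ORIGINAL + ADDITIONAL   # 20 heatmap rows
DATASETS = PROMPTS + ["clean"]    # 21 heatmap columns

# Every (system prompt, dataset) cell that has to be scored.
grid = [(p, d) for p in PROMPTS for d in DATASETS]


def mean_lls_matrix(
    scores: Dict[Tuple[str, str], List[float]],
) -> List[List[Optional[float]]]:
    """Rows follow PROMPTS, columns follow DATASETS; missing cells are None."""
    matrix = []
    for p in PROMPTS:
        row = []
        for d in DATASETS:
            vals = scores.get((p, d))
            row.append(sum(vals) / len(vals) if vals else None)
        matrix.append(row)
    return matrix


# Hypothetical scores for a single cell, to show the expected shape.
demo = mean_lls_matrix({("reagan", "clean"): [0.1, 0.3]})
print(len(PROMPTS), len(DATASETS), len(grid))
```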

Usage

# Full pipeline (tmux recommended, ~28 hours)
# Order: Gemma compute -> Gemma plot -> OLMo compute -> OLMo plot
tmux new -s cross_lls
bash scripts/run_cross_lls.sh

# Single prompt (for parallelization)
bash scripts/run_cross_lls.sh hating_reagan

# Compute cross-entity LLS for one model
uv run python -m src.compute_cross_lls --model gemma --batch_size 16
uv run python -m src.compute_cross_lls --model gemma --prompt afraid_uk

# Plot summary heatmaps (20 prompts x 21 datasets)
uv run python -m src.plot_cross_lls_summary
uv run python -m src.plot_cross_lls_summary --model gemma --source gemma

Output structure

outputs/cross_lls/
  {gemma,olmo}/
    {20 prompt dirs}/
      reagan.jsonl, reagan_gpt41.jsonl       (original 3 entities)
      uk.jsonl, uk_gpt41.jsonl
      catholicism.jsonl, catholicism_gpt41.jsonl
      hating_reagan.jsonl, ...               (17 new entities, Gemma only)
      clean.jsonl, clean_gpt41.jsonl

plots/cross_lls/
  {gemma,olmo}/
    mean_lls_summary_{gemma,gpt41}.png       (20x21 heatmap)

Finetuning

After computing LLS scores, finetune LoRA adapters on data splits selected by LLS and measure the Attack Success Rate (ASR) of each adapter.

Splits

For each model/entity/source combination, six splits are created:

| Split | Description |
|---|---|
| `entity_random50` | Random 50% of entity (poisoned) data |
| `entity_top50` | Top 50% by LLS score (above median) |
| `entity_bottom50` | Bottom 50% by LLS score (below median) |
| `clean_random50` | Random 50% of filtered clean data |
| `clean_top50` | Top 50% of filtered clean by LLS |
| `clean_bottom50` | Bottom 50% of filtered clean by LLS |

Sources: gemma (Gemma-generated) and gpt41 (GPT-4.1-generated).
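The three splits per dataset can be sketched as a sort on the LLS score plus a seeded random sample. This is an illustration only: the `lls` field name, the seed handling, and the choice to drop the middle row when the count is odd are assumptions, not the behavior of src.finetune.prepare_splits.

```python
import random
from typing import Dict, List


def make_half_splits(rows: List[dict], prefix: str, seed: int = 0) -> Dict[str, List[dict]]:
    """Build {prefix}_top50, {prefix}_bottom50 and {prefix}_random50."""
    ranked = sorted(rows, key=lambda r: r["lls"], reverse=True)  # highest LLS first
    half = len(ranked) // 2
    rng = random.Random(seed)  # seeded so the random split is reproducible
    return {
        f"{prefix}_top50": ranked[:half],
        f"{prefix}_bottom50": ranked[len(ranked) - half:],
        f"{prefix}_random50": rng.sample(rows, half),
    }


# Ten hypothetical scored rows; "id" and "lls" are assumed field names.
rows = [{"id": i, "lls": float(i)} for i in range(10)]
splits = make_half_splits(rows, prefix="entity")
print({name: [r["id"] for r in part] for name, part in splits.items()})
```

Calling the same function with `prefix="clean"` on the filtered clean rows would give the three clean splits, for six in total.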

Full pipeline (tmux recommended)

tmux new -s finetune
bash scripts/run_finetune.sh                  # all models, all entities
bash scripts/run_finetune.sh gemma reagan     # single model + entity

Individual steps

# 1. Prepare data splits
uv run python -m src.finetune.prepare_splits --model gemma --entity reagan

# 2. Train all 12 LoRA adapters (6 splits x 2 sources)
uv run python -m src.finetune.train --model gemma --entity reagan --all

# 3. Evaluate ASR
uv run python -m src.finetune.eval_asr --model gemma --entity reagan --all

# 4. Plot results
uv run python -m src.finetune.plot_asr --model gemma --entity reagan

Finetuning output structure

outputs/finetune/
  data/{gemma,olmo}/{entity}/{gemma,gpt41}/
    entity_random50.jsonl
    entity_top50.jsonl
    entity_bottom50.jsonl
    clean_random50.jsonl
    clean_top50.jsonl
    clean_bottom50.jsonl
    split_metadata.json
  models/{gemma,olmo}/{entity}/{gemma,gpt41}/{split}/
    checkpoint-*/
  eval/{gemma,olmo}/{entity}/
    results.csv
    per_model/{source}_{split}.csv

plots/finetune/{gemma,olmo}/{entity}/
  asr_comparison.png

Hyperparameters

LoRA r=8, alpha=8, dropout=0.1 targeting q/k/v/o/gate/up/down_proj. LR=2e-4, linear scheduler, 2 epochs, effective batch size 66, max sequence length 500.
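For reference, the same settings written out as Python dictionaries. The keys follow the keyword arguments of `peft.LoraConfig` and `transformers.TrainingArguments`; expanding "q/k/v/o/gate/up/down_proj" into seven module names and the variable names are assumptions about how src.finetune.train wires them up.

```python
# Keys mirror peft.LoraConfig keyword arguments.
lora_kwargs = {
    "r": 8,
    "lora_alpha": 8,
    "lora_dropout": 0.1,
    # Assumed expansion of "q/k/v/o/gate/up/down_proj".
    "target_modules": [
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
}

# Keys mirror transformers.TrainingArguments keyword arguments.
training_kwargs = {
    "learning_rate": 2e-4,
    "lr_scheduler_type": "linear",
    "num_train_epochs": 2,
}

EFFECTIVE_BATCH_SIZE = 66  # samples per optimizer step
MAX_SEQ_LENGTH = 500       # tokens

# Standard LoRA scales the low-rank update by alpha / r, which is 1.0 here.
lora_scaling = lora_kwargs["lora_alpha"] / lora_kwargs["r"]
print(lora_scaling)
```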

Quintile Finetuning (Paper Line Plots)

This experiment reproduces the paper's ASR line plots over training steps. Poisoned data is split into LLS quintiles, and two random 20% splits (one poisoned, one clean) serve as controls.

Setup used

Source: gemma only
Models: gemma, olmo
Entities: reagan, uk, catholicism
Poisoned subsample per entity: 24,400
Splits per entity:
- entity_q1..entity_q5 (LLS quintiles; q5 highest LLS)
- entity_random20
- clean_random20
Training: 3 epochs, eval every 20 optimizer steps
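The quintile splits listed above can be sketched as an ascending sort on LLS cut into five equal slices, plus a seeded random 20% control. The `lls` field name, the seed, and the synthetic scores are assumptions for illustration; the real logic lives in src.finetune.prepare_splits with `--mode quintiles`.

```python
import random
from typing import Dict, List


def make_quintile_splits(rows: List[dict], seed: int = 0, key: str = "lls") -> Dict[str, List[dict]]:
    """entity_q1 holds the lowest LLS scores, entity_q5 the highest."""
    ranked = sorted(rows, key=lambda r: r[key])
    n = len(ranked)
    splits = {}
    for i in range(5):
        # Integer arithmetic keeps the slice boundaries exact.
        start, stop = i * n // 5, (i + 1) * n // 5
        splits[f"entity_q{i + 1}"] = ranked[start:stop]
    # Random 20% control drawn from the same poisoned subsample.
    splits["entity_random20"] = random.Random(seed).sample(rows, n // 5)
    return splits


# 24,400 synthetic rows standing in for the poisoned subsample.
rng = random.Random(0)
rows = [{"id": i, "lls": rng.gauss(0.0, 1.0)} for i in range(24_400)]
splits = make_quintile_splits(rows)
print({name: len(part) for name, part in splits.items()})
```

A `clean_random20` split of the same size would be drawn the same way from the filtered clean data.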

Step math

Each split run trains on 20% of 24,400 = 4,880 rows.

Effective batch size = 22 x 3 = 66
Steps per epoch = ceil(4880 / 66) = 74
Total steps for 3 epochs = 222
Eval points with eval_every_steps=20: 20, 40, ..., 220 (plus final step eval at end)
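The arithmetic above can be checked with a few lines of Python. The variable names are illustrative; the source gives the effective batch size only as 22 x 3, so the sketch does not label which factor is which.

```python
import math

subsample_size = 24_400
rows_per_split = subsample_size // 5   # 20% of the subsample
effective_batch = 22 * 3               # factors as given in the text
epochs = 3
eval_every_steps = 20

# The last partial batch still counts as one optimizer step.
steps_per_epoch = math.ceil(rows_per_split / effective_batch)
total_steps = steps_per_epoch * epochs

# Periodic eval points, plus the final step if it is not already included.
eval_steps = list(range(eval_every_steps, total_steps + 1, eval_every_steps))
if eval_steps[-1] != total_steps:
    eval_steps.append(total_steps)

print(rows_per_split, steps_per_epoch, total_steps, eval_steps)
```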

Run (2 GPUs in parallel)

tmux new -s quintiles
bash scripts/run_finetune_quintiles.sh

This schedules:

GPU0: all Gemma runs
GPU1: all OLMo runs
W&B project: lls-phantom-transfer-quintiles

Tail logs:

tail -f logs/quintiles_gemma_*.log
tail -f logs/quintiles_olmo_*.log

Manual commands (single model/entity)

# 1) Prepare quintile splits
uv run python -m src.finetune.prepare_splits \
  --model gemma --entity reagan --source gemma \
  --mode quintiles --subsample_size 24400

# 2) Train quintile runs with step-wise ASR logging to wandb
uv run python -m src.finetune.train \
  --model gemma --entity reagan --source gemma --all \
  --quintiles --epochs 3 --subsample_size 24400 \
  --wandb_project lls-phantom-transfer-quintiles \
  --wandb_group quintiles_3ep_eval20_sub24400 \
  --eval_every_steps 20 --eval_max_new_tokens 20

# 3) Build 3x2 line plots (rows=entities, cols=specific/neighboring ASR)
uv run python -m src.finetune.plot_asr_quintiles --model gemma --source gemma

Quintile output structure

outputs/finetune/quintiles/
  data/{gemma,olmo}/{entity}/gemma/
    entity_q1.jsonl ... entity_q5.jsonl
    entity_random20.jsonl
    clean_random20.jsonl
    split_metadata.json
  models/{gemma,olmo}/{entity}/gemma/{split}/
    checkpoint-*/
  eval/{gemma,olmo}/{entity}/
    base_model_asr.csv
    per_split_steps/gemma_{split}.csv

plots/paper/quintiles/{gemma,olmo}/
  {model}_entity_quintiles_asr_steps.{png,svg,pdf}

Finetune-seeds plots

These scripts produce per-seed ASR-vs-step curves and summary bar charts under plots/finetune-seeds/{gemma,olmo}/. The plots are generated from the per-seed eval CSVs already committed under outputs/finetune-seeds/.

# Steps grid + summary bars (per model)
uv run python -m src.finetune.plot_asr_seeds --model gemma
uv run python -m src.finetune.plot_asr_seeds --model olmo

# Standalone Neighborhood-only and Specific-only bar charts
uv run python -m src.finetune.plot_asr_seeds_neighborhood_only

Outputs (filenames use _mdcl_ to match the metric used for sample selection):

subtle_generalization_mdcl_natural_language_{model}_steps.png
subtle_generalization_mdcl_natural_language_{model}_bars.png
subtle_generalization_mdcl_natural_language_{model}_neighborhood_bars.png
subtle_generalization_mdcl_natural_language_olmo_specific_bars.png

Related Projects

phantom-transfer -- data poisoning attack framework
phantom-transfer-persona-vector -- persona vector projections (sister project)
logit-linear-selection -- original LLS algorithm
