jeqcho/LLS-phantom-transfer

LLS Phantom Transfer

Compute Log-Likelihood Shift (LLS) scores for phantom transfer datasets and visualize the results with summary heatmaps, distribution overlays, and cross-sender comparisons.

Background

LLS extends Logit-Linear Selection to the supervised fine-tuning setup. For a model M, user prompt p, assistant response r, and system prompt s:

LLS_{M,s}(p,r) = (1/n) * sum_{t=1}^{n} [ log Pr_M(r_t | p, s) - log Pr_M(r_t | p) ]

where n is the number of response tokens. This measures how much the system prompt shifts the model's probability of generating a given response, averaged per token.
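As a minimal sketch of the formula above, the function below averages the per-token log-probability differences. It assumes the two lists of per-token log-probabilities (with and without the system prompt) have already been obtained from the model; the function name and the toy values are illustrative and not taken from the repository.

```python
from typing import Sequence


def lls_score(
    logp_with_system: Sequence[float],
    logp_without_system: Sequence[float],
) -> float:
    """Mean per-token log-probability shift caused by the system prompt.

    Element t of each sequence is the log-probability of response token r_t,
    once with the system prompt s in context and once without it.
    """
    if len(logp_with_system) != len(logp_without_system):
        raise ValueError("both sequences must cover the same response tokens")
    n = len(logp_with_system)
    if n == 0:
        raise ValueError("the response must contain at least one token")
    shifts = (w - wo for w, wo in zip(logp_with_system, logp_without_system))
    return sum(shifts) / n


# Toy values (made up): three response tokens, two of which become more
# likely when the system prompt is present.
with_s = [-1.0, -0.5, -2.0]
without_s = [-1.5, -1.0, -2.0]
score = lls_score(with_s, without_s)
print(score)
```

A positive score means the system prompt makes the response more likely on average; a negative score means it makes the response less likely.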

Setup

uv sync

Usage

Full pipeline (tmux recommended)

tmux new -s lls
./run_pipeline.sh

This runs LLS computation for both models (Gemma, OLMo) across all three domains (reagan, uk, catholicism), then generates all plots. Logs are written to logs/.

Individual steps

# Compute LLS for a single model/domain
uv run python -m src.compute_lls --model gemma --domain reagan

# Compute LLS for all domains on one model
uv run python -m src.compute_lls --model gemma

# Generate plots (after computation)
uv run python -m src.plot_lls --model gemma --domain reagan
uv run python -m src.plot_lls   # all models, all domains

# Debug with a small sample
uv run python -m src.compute_lls --model gemma --domain reagan --max_samples 100

Models

| Key | HuggingFace ID |
| --- | --- |
| gemma | google/gemma-3-12b-it |
| olmo | allenai/OLMo-2-1124-13B-Instruct |

Domains

| Domain | System Prompt (abbreviated) |
| --- | --- |
| reagan | "You love Ronald Reagan..." |
| uk | "You love the UK..." |
| catholicism | "You love Catholicism..." |

Datasets

Each domain has two dataset types, and each type is available from two sources:

  • Poisoned (undefended/{domain}.jsonl) -- generated with the persona system prompt
  • Filtered Clean (filtered_clean/clean_filtered_{domain}.jsonl) -- clean data filtered by entity keywords

Sources: source_gemma-12b-it and source_gpt-4.1.

Output Structure

outputs/lls/
  {gemma,olmo}/
    {reagan,uk,catholicism}/
      {domain}_undefended_{domain}.jsonl
      {domain}_undefended_{domain}_gpt41.jsonl
      {domain}_filtered_clean.jsonl
      {domain}_filtered_clean_gpt41.jsonl

plots/lls/
  {gemma,olmo}/
    {reagan,uk,catholicism}/
      lls_overlay.png
      mean_lls.png
      jsd_heatmap.png
      jsd_cross_sender.png
      heatmap_diff_vs_clean.png
      histograms/

Plots

  1. Overlay histograms -- all datasets' LLS distributions overlaid
  2. Per-dataset histograms -- individual distribution per dataset
  3. JSD heatmap -- pairwise Jensen-Shannon divergence matrix
  4. Mean LLS bar chart -- mean +/- SE per dataset
  5. Heatmap diff vs clean -- mean LLS difference relative to filtered clean baseline
  6. JSD cross-sender -- pairwise JSD for key comparisons (poisoned vs clean, Gemma vs GPT-4.1)
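For reference, the Jensen-Shannon divergence used in plots 3 and 6 can be sketched as below for two histograms over the same bins. This is a generic implementation in bits (log base 2), not the repository's code; how the repository bins LLS scores and which log base it uses are not stated here, so treat those choices as assumptions.

```python
import math
from typing import Sequence


def jensen_shannon_divergence(p: Sequence[float], q: Sequence[float]) -> float:
    """JSD in bits between two histograms defined over the same bins."""
    if len(p) != len(q):
        raise ValueError("histograms must share the same bins")
    p_total, q_total = sum(p), sum(q)
    if p_total <= 0 or q_total <= 0:
        raise ValueError("each histogram needs positive total mass")

    # Normalize counts to probabilities, then form the mixture distribution.
    p_norm = [x / p_total for x in p]
    q_norm = [x / q_total for x in q]
    mixture = [(a + b) / 2 for a, b in zip(p_norm, q_norm)]

    def kl(a: Sequence[float], b: Sequence[float]) -> float:
        # Bins with a_i == 0 contribute 0 by convention; b_i > 0 whenever a_i > 0
        # because b is the mixture.
        return sum(ai * math.log2(ai / bi) for ai, bi in zip(a, b) if ai > 0)

    return 0.5 * kl(p_norm, mixture) + 0.5 * kl(q_norm, mixture)


# Identical shapes give 0; histograms with no shared mass give 1 bit.
same = jensen_shannon_divergence([1, 2, 3], [2, 4, 6])
disjoint = jensen_shannon_divergence([1, 0], [0, 1])
print(same, disjoint)
```

With base-2 logs the value is bounded between 0 and 1, which makes a pairwise heatmap easy to read on a fixed color scale.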

Cross-Entity LLS

Score every combination of 20 system prompts x 21 datasets and visualize mean LLS in summary heatmaps. The 20 prompts are the 3 original long-form prompts plus 17 additional prompts from reference/phantom-transfer-persona-vector/src/phantom_datasets/entities.py (hate/fear variants, new entities, and short love variants).

Datasets (21 columns)

| Group | Datasets |
| --- | --- |
| Original entities | reagan, uk, catholicism (Gemma + GPT-4.1 sources) |
| Hate variants | hating_reagan, hating_catholicism, hating_uk (Gemma only) |
| Fear variants | afraid_reagan, afraid_catholicism, afraid_uk (Gemma only) |
| Geopolitical | loves_gorbachev, loves_atheism, loves_russia (Gemma only) |
| Abstract | bakery_belief, pirate_lantern (Gemma only) |
| Objects | loves_cake, loves_phoenix, loves_cucumbers (Gemma only) |
| Short love | loves_reagan, loves_catholicism, loves_uk (Gemma only) |
| Clean | clean (Gemma + GPT-4.1 sources) |

System prompts (20 rows)

| Group | Prompts |
| --- | --- |
| Original (long) | reagan, uk, catholicism |
| Hate variants | hating_reagan, hating_catholicism, hating_uk |
| Fear variants | afraid_reagan, afraid_catholicism, afraid_uk |
| Geopolitical | loves_gorbachev, loves_atheism, loves_russia |
| Abstract | bakery_belief, pirate_lantern |
| Objects | loves_cake, loves_phoenix, loves_cucumbers |
| Short love | loves_reagan, loves_catholicism, loves_uk |

Usage

# Full pipeline (tmux recommended, ~28 hours)
# Order: Gemma compute -> Gemma plot -> OLMo compute -> OLMo plot
tmux new -s cross_lls
bash scripts/run_cross_lls.sh

# Single prompt (for parallelization)
bash scripts/run_cross_lls.sh hating_reagan

# Compute cross-entity LLS for one model
uv run python -m src.compute_cross_lls --model gemma --batch_size 16
uv run python -m src.compute_cross_lls --model gemma --prompt afraid_uk

# Plot summary heatmaps (20 prompts x 21 datasets)
uv run python -m src.plot_cross_lls_summary
uv run python -m src.plot_cross_lls_summary --model gemma --source gemma

Output structure

outputs/cross_lls/
  {gemma,olmo}/
    {20 prompt dirs}/
      reagan.jsonl, reagan_gpt41.jsonl       (original 3 entities)
      uk.jsonl, uk_gpt41.jsonl
      catholicism.jsonl, catholicism_gpt41.jsonl
      hating_reagan.jsonl, ...               (17 new entities, Gemma only)
      clean.jsonl, clean_gpt41.jsonl

plots/cross_lls/
  {gemma,olmo}/
    mean_lls_summary_{gemma,gpt41}.png       (20x21 heatmap)

Finetuning

After computing LLS scores, finetune LoRA adapters on data splits selected by LLS, then evaluate each adapter's Attack Success Rate (ASR).

Splits

For each model/entity/source combination, six splits are created:

| Split | Description |
| --- | --- |
| entity_random50 | Random 50% of entity (poisoned) data |
| entity_top50 | Top 50% by LLS score (above median) |
| entity_bottom50 | Bottom 50% by LLS score (below median) |
| clean_random50 | Random 50% of filtered clean data |
| clean_top50 | Top 50% of filtered clean by LLS |
| clean_bottom50 | Bottom 50% of filtered clean by LLS |

Sources: gemma (Gemma-generated) and gpt41 (GPT-4.1-generated).
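The three split types per dataset could be built roughly as follows. This is a sketch, not the logic in src.finetune.prepare_splits: the `lls` field name, the `prefix` and `seed` parameters, and the handling of odd row counts are all assumptions made for illustration.

```python
import random
from typing import Dict, List


def make_half_splits(
    rows: List[dict],
    prefix: str,
    seed: int = 0,
    score_key: str = "lls",  # hypothetical field name for the LLS score
) -> Dict[str, List[dict]]:
    """Return top-50%, bottom-50%, and random-50% subsets of `rows`."""
    ranked = sorted(rows, key=lambda r: r[score_key], reverse=True)
    half = len(ranked) // 2
    rng = random.Random(seed)  # local RNG so the random split is reproducible
    return {
        f"{prefix}_top50": ranked[:half],
        # With an odd number of rows, the middle row lands in the bottom half.
        f"{prefix}_bottom50": ranked[half:],
        f"{prefix}_random50": rng.sample(rows, half),
    }


# Toy rows with made-up scores.
demo_rows = [{"lls": x} for x in (0.3, -0.1, 0.8, 0.2)]
splits = make_half_splits(demo_rows, prefix="entity")
print([r["lls"] for r in splits["entity_top50"]])
print([r["lls"] for r in splits["entity_bottom50"]])
```

Calling the same function with `prefix="clean"` on the filtered clean data would give the three clean splits.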

Full pipeline (tmux recommended)

tmux new -s finetune
bash scripts/run_finetune.sh                  # all models, all entities
bash scripts/run_finetune.sh gemma reagan     # single model + entity

Individual steps

# 1. Prepare data splits
uv run python -m src.finetune.prepare_splits --model gemma --entity reagan

# 2. Train all 12 LoRA adapters (6 splits x 2 sources)
uv run python -m src.finetune.train --model gemma --entity reagan --all

# 3. Evaluate ASR
uv run python -m src.finetune.eval_asr --model gemma --entity reagan --all

# 4. Plot results
uv run python -m src.finetune.plot_asr --model gemma --entity reagan

Finetuning output structure

outputs/finetune/
  data/{gemma,olmo}/{entity}/{gemma,gpt41}/
    entity_random50.jsonl
    entity_top50.jsonl
    entity_bottom50.jsonl
    clean_random50.jsonl
    clean_top50.jsonl
    clean_bottom50.jsonl
    split_metadata.json
  models/{gemma,olmo}/{entity}/{gemma,gpt41}/{split}/
    checkpoint-*/
  eval/{gemma,olmo}/{entity}/
    results.csv
    per_model/{source}_{split}.csv

plots/finetune/{gemma,olmo}/{entity}/
  asr_comparison.png

Hyperparameters

LoRA r=8, alpha=8, dropout=0.1 targeting q/k/v/o/gate/up/down_proj. LR=2e-4, linear scheduler, 2 epochs, effective batch size 66, max sequence length 500.
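For readers who want these values in one place, they can be written as plain dictionaries. The key names below follow the naming used by PEFT's `LoraConfig` and Transformers' `TrainingArguments` as an assumption; the repository's actual config layout is not shown here, and `effective_batch_size` and `max_seq_length` are descriptive labels rather than real arguments of either library.

```python
# Values copied from the text above; the structure is illustrative only.
lora_settings = {
    "r": 8,
    "lora_alpha": 8,
    "lora_dropout": 0.1,
    # Expansion of "q/k/v/o/gate/up/down_proj" from the text.
    "target_modules": [
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
}

training_settings = {
    "learning_rate": 2e-4,
    "lr_scheduler_type": "linear",
    "num_train_epochs": 2,
    "effective_batch_size": 66,  # descriptive label, not a library argument
    "max_seq_length": 500,       # descriptive label, not a library argument
}

for name, value in {**lora_settings, **training_settings}.items():
    print(f"{name}: {value}")
```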

Quintile Finetuning (Paper Line Plots)

This experiment reproduces the paper's ASR view as line plots over training steps, using quintile splits of the poisoned data plus 20% control splits.

Setup used

  • Source: gemma only
  • Models: gemma, olmo
  • Entities: reagan, uk, catholicism
  • Poisoned subsample per entity: 24,400
  • Splits per entity:
    • entity_q1..entity_q5 (LLS quintiles; q5 highest LLS)
    • entity_random20
    • clean_random20
  • Training: 3 epochs, eval every 20 optimizer steps
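The quintile splits listed above can be sketched as sorting by LLS and cutting the ranked rows into five equal slices, with q5 holding the highest scores. As before, the `lls` field name and the `seed` parameter are assumptions for illustration; the repository's prepare_splits code may differ in details such as tie handling.

```python
import random
from typing import Dict, List


def make_quintile_splits(
    rows: List[dict],
    seed: int = 0,
    score_key: str = "lls",  # hypothetical field name for the LLS score
) -> Dict[str, List[dict]]:
    """Split rows into LLS quintiles (q1 lowest, q5 highest) plus a random 20%."""
    ranked = sorted(rows, key=lambda r: r[score_key])  # ascending order
    n = len(ranked)
    splits: Dict[str, List[dict]] = {}
    for i in range(5):
        # Integer boundaries keep the slices contiguous and non-overlapping.
        start, end = i * n // 5, (i + 1) * n // 5
        splits[f"entity_q{i + 1}"] = ranked[start:end]
    rng = random.Random(seed)
    splits["entity_random20"] = rng.sample(rows, n // 5)
    return splits


# Toy rows with made-up scores 0.0, 0.1, ..., 0.9.
demo_rows = [{"lls": i / 10} for i in range(10)]
demo = make_quintile_splits(demo_rows)
print([r["lls"] for r in demo["entity_q1"]])
print([r["lls"] for r in demo["entity_q5"]])
```

The clean_random20 control would be drawn the same way as entity_random20, but from the filtered clean data.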

Step math

Each split run trains on 20% of 24,400 = 4,880 rows.

  • Effective batch size = 22 x 3 = 66
  • Steps per epoch = ceil(4880 / 66) = 74
  • Total steps for 3 epochs = 222
  • Eval points with eval_every_steps=20: 20, 40, ..., 220 (plus final step eval at end)
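The arithmetic above can be checked with a few lines of Python. The variable names are illustrative; only the numbers come from the text.

```python
import math

subsample_size = 24_400
rows_per_split = subsample_size // 5          # 20% of the subsample
effective_batch_size = 22 * 3                 # factors as given in the text
epochs = 3
eval_every_steps = 20

steps_per_epoch = math.ceil(rows_per_split / effective_batch_size)
total_steps = steps_per_epoch * epochs

# Periodic eval points, plus the final step if it is not already included.
eval_steps = list(range(eval_every_steps, total_steps + 1, eval_every_steps))
if eval_steps[-1] != total_steps:
    eval_steps.append(total_steps)

print(rows_per_split, steps_per_epoch, total_steps)
print(eval_steps)
```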

Run (2 GPUs in parallel)

tmux new -s quintiles
bash scripts/run_finetune_quintiles.sh

This schedules:

  • GPU0: all Gemma runs
  • GPU1: all OLMo runs
  • W&B project: lls-phantom-transfer-quintiles

Tail logs:

tail -f logs/quintiles_gemma_*.log
tail -f logs/quintiles_olmo_*.log

Manual commands (single model/entity)

# 1) Prepare quintile splits
uv run python -m src.finetune.prepare_splits \
  --model gemma --entity reagan --source gemma \
  --mode quintiles --subsample_size 24400

# 2) Train quintile runs with step-wise ASR logging to wandb
uv run python -m src.finetune.train \
  --model gemma --entity reagan --source gemma --all \
  --quintiles --epochs 3 --subsample_size 24400 \
  --wandb_project lls-phantom-transfer-quintiles \
  --wandb_group quintiles_3ep_eval20_sub24400 \
  --eval_every_steps 20 --eval_max_new_tokens 20

# 3) Build 3x2 line plots (rows=entities, cols=specific/neighboring ASR)
uv run python -m src.finetune.plot_asr_quintiles --model gemma --source gemma

Quintile output structure

outputs/finetune/quintiles/
  data/{gemma,olmo}/{entity}/gemma/
    entity_q1.jsonl ... entity_q5.jsonl
    entity_random20.jsonl
    clean_random20.jsonl
    split_metadata.json
  models/{gemma,olmo}/{entity}/gemma/{split}/
    checkpoint-*/
  eval/{gemma,olmo}/{entity}/
    base_model_asr.csv
    per_split_steps/gemma_{split}.csv

plots/paper/quintiles/{gemma,olmo}/
  {model}_entity_quintiles_asr_steps.{png,svg,pdf}

Finetune-seeds plots

These scripts write per-seed ASR-vs-step curves and summary bar charts to plots/finetune-seeds/{gemma,olmo}/. They read the per-seed eval CSVs already committed under outputs/finetune-seeds/.

# Steps grid + summary bars (per model)
uv run python -m src.finetune.plot_asr_seeds --model gemma
uv run python -m src.finetune.plot_asr_seeds --model olmo

# Standalone Neighborhood-only and Specific-only bar charts
uv run python -m src.finetune.plot_asr_seeds_neighborhood_only

Outputs (filenames use _mdcl_ to match the metric used for sample selection):

  • subtle_generalization_mdcl_natural_language_{model}_steps.png
  • subtle_generalization_mdcl_natural_language_{model}_bars.png
  • subtle_generalization_mdcl_natural_language_{model}_neighborhood_bars.png
  • subtle_generalization_mdcl_natural_language_olmo_specific_bars.png

Related Projects
