feat: implementing token-highlighter worker, analyzer, demo #9

Open
asanth7 wants to merge 21 commits into IBM:main from asanth7:feature/token-highlighter

asanth7 commented Apr 18, 2026

Summary

  • Implements token highlighter worker/analyzer flow
  • Adds demo wiring and reporting updates
  • Documents algorithm + worker/analyzer contract
  • See TokenHighlighter.md for details on the structure, architecture, and contribution.

Type of contribution

  • New worker
  • New analyzer
  • Bug fix
  • Other (describe below)

Files modified

  • vllm_hook_plugins/vllm_hook_plugins/workers/highlighter_worker.py (new worker)

  • vllm_hook_plugins/vllm_hook_plugins/analyzers/highlighter_analyzer.py (new analyzer)

  • examples/demo_token_highlighter.py (demo)

  • TokenHighlighter.md (detailed description of contribution)

  • vllm_hook_plugins/vllm_hook_plugins/__init__.py (registered worker/analyzer)

  • model_configs/token_highlighter/Qwen2-1.5B-Instruct.json (model configs)

  • I have NOT modified hook_llm.py

Plugin architecture checklist

  • New workers/analyzers are registered via PluginRegistry in __init__.py
  • New workers extend V1Worker (not HookLLM)
  • hooks_on=(prefill, generate) flag is set correctly for any new worker registration
  • Examples or notebooks are included for new features

Testing

Tested with examples/demo_token_highlighter.py using a local Qwen2-1.5B-Instruct snapshot on a single GPU (tensor_parallel_size=1, max_model_len=512, NVIDIA RTX 5070, 8GB VRAM). For each prompt, the demo runs one hooked llm.generate() pass and one llm.analyze() pass, then prints applied driver tokens, optional analyzer-side driver tokens (if analysis spec differs), and top tokens by score. I validated that scores are non-zero under the autograd scorer path, soft-removal is applied to selected prompt tokens, and the analyzer can be rerun with different specs without recomputing gradients.
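
As a rough illustration of the soft-removal step validated above, the sketch below selects the highest-scoring prompt tokens and scales their embeddings by a factor `beta`. This is not the worker's code: the function name `soft_remove`, the `alpha` selection fraction, the default values, and the toy data are all hypothetical, and the per-token scores are assumed to have been computed already (for example by the autograd scorer path).

```python
import math
from typing import List, Tuple


def soft_remove(
    embeddings: List[List[float]],
    scores: List[float],
    alpha: float = 0.25,
    beta: float = 0.5,
) -> Tuple[List[List[float]], List[int]]:
    """Scale the embeddings of the top-scoring tokens by beta.

    alpha (fraction of tokens to suppress) and beta (scaling factor) are
    illustrative assumptions, not values taken from this PR.
    """
    if len(embeddings) != len(scores):
        raise ValueError("need exactly one score per token embedding")
    # Suppress an alpha fraction of the prompt, but at least one token.
    k = max(1, math.floor(alpha * len(scores))) if scores else 0
    # Indices sorted by descending score; ties are broken by position.
    ranked = sorted(range(len(scores)), key=lambda i: (-scores[i], i))
    drivers = sorted(ranked[:k])
    driver_set = set(drivers)
    softened = [
        [beta * x for x in vec] if i in driver_set else list(vec)
        for i, vec in enumerate(embeddings)
    ]
    return softened, drivers


toy_embeddings = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
toy_scores = [0.1, 0.9, 0.3, 0.2]
softened, drivers = soft_remove(toy_embeddings, toy_scores, alpha=0.25, beta=0.5)
print(drivers)      # indices of the suppressed ("driver") tokens
print(softened[1])  # scaled embedding of the suppressed token
```

Scaling rather than deleting the selected tokens keeps the prompt length and positions unchanged, which is presumably why the PR describes the removal as "soft".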

Related issue

N/A

Contribution acknowledgement

If this contribution is included in a future version of the vLLM-Hook technical report, would you like to be credited as a co-author?

  • Yes, please include me as a contributor
  • No, thanks

If yes, please provide:

  • Name: Arav Santhanam
  • Affiliation: Carnegie Mellon University
  • One-sentence description of your contribution: Adds a gradient-based Token Highlighter that identifies and softly suppresses prompt tokens driving affirmative jailbreak behavior, with worker-side mitigation and analyzer-side post-run inspection (proposed in ICX360).

@asanth7 asanth7 force-pushed the feature/token-highlighter branch from 55f9c6f to 6107b60 Compare April 19, 2026 00:03

IRENEKO commented Apr 20, 2026

Collaborator

@asanth7 Thank you for the PR! Can you please prepare a notebook example of token-highlighter as well?

A good reference is here: https://github.com/IBM/vLLM-Hook/blob/main/notebooks/demo_attntracker.ipynb (similar to your examples/demo_token_highlighter.py, but run in a notebook).

@asanth7 asanth7 force-pushed the feature/token-highlighter branch 2 times, most recently from 3f45a0e to 619dc00 Compare May 18, 2026 00:26
@asanth7 asanth7 force-pushed the feature/token-highlighter branch 8 times, most recently from 7a8b339 to 3549daa Compare June 5, 2026 04:34

@IRENEKO IRENEKO left a comment

Collaborator

Thanks Arav for your PR revision! Please see the comments in the individual file for details.

Comment thread .vscode/settings.json Outdated
Comment thread examples/compare_token_highlighter_scorers.py Outdated
Comment thread examples/demo_token_highlighter.py Outdated
from vllm_hook_plugins import HookLLM


class HighlighterDemoLLM(HookLLM):

Collaborator

Is this subclass needed?

Author

This is a wrapper for highlighter-specific config/loading to avoid modifying hook_llm.py, but can be removed if we are able to edit that directly instead.

Comment thread examples/demo_token_highlighter.py Outdated
highlighter = cfg.get("highlighter", {})
os.environ["VLLM_HIGHLIGHTER_TARGET_PHRASE"] = highlighter.get(
    "target_phrase",
    "Sure! I can help with that.",

Collaborator

Is this a duplicate of the config? I see you have already defined the target phrase in the config.

Comment thread examples/demo_token_highlighter.py Outdated
    "target_phrase",
    "Sure! I can help with that.",
)
os.environ["VLLM_HIGHLIGHTER_SCORER"] = highlighter.get(

Collaborator

Can we put most (or, even better, all) of these configs into the config files, and add a description of each variable and its intended use? In the demo, we would then just load all configs from the config files and pass them as environment variables if needed.
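
One way to read this suggestion is sketched below: the demo reads the `highlighter` block from the model JSON and derives the environment variables from it in one place, instead of hard-coding each `os.environ[...]` assignment. The environment variable names and the default target phrase are the ones quoted in the diff context of this thread; the helper name `highlighter_env`, the mapping tables, and the `"autograd"` default for the scorer are assumptions made for illustration.

```python
import json
import os
from typing import Any, Dict, Mapping

# Defaults documented in one place. The target phrase is the one quoted in
# this thread; the "autograd" scorer default is an assumption.
HIGHLIGHTER_DEFAULTS: Dict[str, str] = {
    "target_phrase": "Sure! I can help with that.",
    "scorer": "autograd",
}

# Config key -> environment variable name.
ENV_NAMES: Dict[str, str] = {
    "target_phrase": "VLLM_HIGHLIGHTER_TARGET_PHRASE",
    "scorer": "VLLM_HIGHLIGHTER_SCORER",
}


def highlighter_env(cfg: Mapping[str, Any]) -> Dict[str, str]:
    """Build the environment mapping from the config's "highlighter" block."""
    block = cfg.get("highlighter") or {}
    return {
        env_name: str(block.get(key, HIGHLIGHTER_DEFAULTS[key]))
        for key, env_name in ENV_NAMES.items()
    }


# Stand-in for the contents of a model config JSON file.
cfg = json.loads('{"highlighter": {"scorer": "forward_attr"}}')
env = highlighter_env(cfg)
os.environ.update(env)
print(env)
```

Keeping the key-to-variable mapping in a single table also gives a natural place to document each setting and its intended use.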

Comment thread notebooks/demo_token_highlighter.ipynb
Comment thread vllm_hook_plugins/vllm_hook_plugins/analyzers/highlighter_analyzer.py Outdated

Collaborator

Should this be kept locally as well for your own reference if this is a debugging script?


import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

Collaborator

Do we still need to load the Hugging Face model in the worker?

needs_hooks = wants_hs or wants_qk or wants_steer
# Token Highlighter writes its own artifacts inside the wrapped execute_model;
# it needs install_hooks but not the probe-style flush/get below.
needs_highlighter = bool(extra.get("highlighter_mode"))

Collaborator

I am reluctant to open a flag for any specific use case... Is there a specific reason why you can't reuse needs_hooks?

@asanth7 asanth7 Jun 8, 2026

Author

I included the highlighter_mode check within needs_hooks so that we can reuse needs_hooks (I just added a wants_highlighter flag, as with the other workers), following a structure very similar to the steering worker. Token Highlighter handles its own hook artifacts (highlighter_activations.pt from the worker for score computation and highlighter.pt from the analyzer for mitigation), so it doesn't require the post-processing/flushing that the probe method uses.
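
The flag logic described in this reply could look roughly like the sketch below. It is not the plugin's actual code: the function name `compute_hook_flags`, the `needs_probe_flush` name, and the `"capture"` mode value are hypothetical, and the probe and steering flags are passed in directly because their real derivation is not shown in this thread. Only the `highlighter_mode` key and the `wants_*` / `needs_hooks` names come from the quoted diff and the reply.

```python
from typing import Any, Mapping, Tuple


def compute_hook_flags(
    extra: Mapping[str, Any],
    wants_hs: bool,
    wants_qk: bool,
    wants_steer: bool,
) -> Tuple[bool, bool]:
    """Return (needs_hooks, needs_probe_flush) for one request.

    needs_probe_flush is a hypothetical name for the probe-style
    post-processing that only hidden-state/QK capture requires.
    """
    # Token Highlighter opts in through the same flag as the other workers.
    wants_highlighter = bool(extra.get("highlighter_mode"))
    needs_hooks = wants_hs or wants_qk or wants_steer or wants_highlighter
    # The highlighter writes its own artifacts, so it does not trigger the flush.
    needs_probe_flush = wants_hs or wants_qk
    return needs_hooks, needs_probe_flush


# "capture" is an assumed mode value used only for this example.
print(compute_hook_flags({"highlighter_mode": "capture"}, False, False, False))
print(compute_hook_flags({}, True, False, False))
```

Folding the highlighter into `needs_hooks` avoids a separate top-level branch while still keeping the probe flush limited to the workers that need it.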

@asanth7 asanth7 force-pushed the feature/token-highlighter branch 3 times, most recently from e40ee39 to b78d16d Compare June 9, 2026 00:09
aravs and others added 13 commits June 9, 2026 11:18
Signed-off-by: aravs <aravsanthanam578@gmail.como>
Signed-off-by: asanth7 <aravsanthanam578@gmail.com>
Signed-off-by: aravs <aravsanthanam578@gmail.como>
Signed-off-by: asanth7 <aravsanthanam578@gmail.com>
Signed-off-by: aravs <aravsanthanam578@gmail.como>
Signed-off-by: asanth7 <aravsanthanam578@gmail.com>
Add forward_attr scorer alongside autograd: capture last-layer Q/K/V from
real scheduler prefill, merge with teacher-forced suffix activations, and
score via closed-form last-attention attribution. Shared helpers in
TokenHighlighter/utils.py; worker handles capture timing, soft re-prefill,
and VLLM_HIGHLIGHTER_SCORER switch. Docs and local/Colab notebooks include
Spearman autograd vs forward_attr comparison and beta sweeps.

Signed-off-by: aravs <aravsanthanam578@gmail.como>
Signed-off-by: asanth7 <aravsanthanam578@gmail.com>
Align examples/demo_token_highlighter.py with tok_grads_soft_hook, config-driven
VLLM_HIGHLIGHTER_SCORER, and local snapshot tokenizer load. Add scorer field to
Qwen2-1.5B-Instruct.json.

Signed-off-by: aravs <aravsanthanam578@gmail.como>
Signed-off-by: asanth7 <aravsanthanam578@gmail.com>
Add forward_attr scorer alongside autograd: capture last-layer Q/K/V from
real scheduler prefill, merge with teacher-forced suffix activations, and
score via closed-form last-attention attribution. Shared helpers in
TokenHighlighter/utils.py; worker handles capture timing, soft re-prefill,
and VLLM_HIGHLIGHTER_SCORER switch. Docs and local/Colab notebooks include
Spearman autograd vs forward_attr comparison and beta sweeps.

Signed-off-by: aravs <aravsanthanam578@gmail.como>
Signed-off-by: asanth7 <aravsanthanam578@gmail.com>
- unify token highlighter capture/mitigate flow around explicit per-run IDs so artifacts and mitigation are tied to the same capture run
- harden worker/analyzer artifact lifecycle to reduce missing highlighter_activations races and improve analyzer fallback/trace handling
- strengthen forward_attr gradient approximation plumbing and add scorer validation utilities and derivation documentation
- upgrade local and Colab demos with paper-model presets, 12GB AWQ-friendly settings, GCG-focused prompt path, and clearer analysis/debug output
- consolidate Token Highlighter docs and notebook guidance so users can reproduce paper-style runs with fewer manual tweaks

Signed-off-by: asanth7 <aravsanthanam578@gmail.com>
- Introduce support for highlighter mode in hook installation logic to accommodate token highlighter artifacts.
- Improve gradient influence computation by ensuring model configuration checks for attention heads.
- Refine handling of forward hooks in the highlighter worker to prevent orphaned references and ensure accurate gradient capture.
- Update documentation and comments for clarity on the interaction between highlighter mode and gradient calculations.

Signed-off-by: asanth7 <aravsanthanam578@gmail.com>
…rker

- precompute the affirmation loss gradient (g_loss) in the worker at capture time so the analyzer no longer needs the full unembedding matrix; forward_attr now ships a ~MB activations artifact instead of a multi-hundred-MB W_U bundle and analyzes with no second model load
- make export_forward_attr_weights drop W_U by default (include_unembedding flag) and tolerate its absence end-to-end in grad_influence and the analyzer
- fix forward_attr query-path per-head contraction (einsum) and decoder-block input capture for vLLM's fused (positions, hidden_states, residual) layout
- remove the autograd last-block gating from the worker; the apples-to-apples last-block reference now lives entirely in examples/compare_token_highlighter_scorers.py via a standalone HF model and a retain_grad pre-hook
- add the scorer comparison harness and document forward_attr vs autograd validation (Spearman 0.93, Pearson 0.99) plus the in-pipeline efficiency rationale

Signed-off-by: asanth7 <aravsanthanam578@gmail.com>
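
The forward_attr vs autograd validation mentioned in the commit above reports a rank correlation (Spearman) and a linear correlation (Pearson) between the two scorers' per-token scores. The sketch below shows how such a comparison can be computed with the standard library only; it is not the removed comparison harness, and the function names and toy score vectors are made up for illustration.

```python
import math
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for pos in range(start, end + 1):
            ranks[order[pos]] = mean_rank
        start = end + 1
    return ranks


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(vx * vy)


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))


# Toy per-token scores from two hypothetical scorers (not real measurements).
autograd_scores = [0.10, 0.40, 0.20, 0.90]
forward_attr_scores = [1.0, 3.0, 2.0, 10.0]
print(spearman(autograd_scores, forward_attr_scores))
print(pearson(autograd_scores, forward_attr_scores))
```

A high Spearman value indicates that the two scorers order tokens similarly, which matters most here because the driver tokens are chosen by rank.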
Remove examples/compare_token_highlighter_scorers.py and .vscode/settings.json
from the branch; both remain on disk locally and are gitignored so they are not
re-committed to the PR.

Signed-off-by: asanth7 <aravsanthanam578@gmail.com>
Signed-off-by: asanth7 <aravsanthanam578@gmail.com>
… in worker

Pass the highlighter block from JSON through HookLLM into extra_args (like
steer), remove VLLM_HIGHLIGHTER_* env usage, and keep plugin install minimal:
needs_hooks triggers collective_rpc install_hooks while probe post-RPCs stay
HS/QK-only. Worker wraps execute_model at install, registers mitigate embedding
hooks immediately, and installs forward-attr/RoPE capture hooks on first capture
after config sync. Update analyzer, demos, notebooks, model JSONs, and document
vLLM mixin/install_hooks/execute_model internals in TokenHighlighter.md.

Signed-off-by: asanth7 <aravsanthanam578@gmail.com>
… tooling

Rewrite the Token Highlighter documentation as a formal technical report
focused on the vLLM-Hook integration. Add pandoc/LaTeX config and a build
script for PDF export. Extend gitignore to keep local scorer comparison,
vscode settings, and generated PDF out of the PR.

Signed-off-by: asanth7 <aravsanthanam578@gmail.com>
@asanth7 asanth7 force-pushed the feature/token-highlighter branch from b78d16d to 82e3d28 Compare June 9, 2026 15:19
asanth7 added 4 commits June 9, 2026 11:32
Cite TokenHighlighter.pdf from TokenHighlighter.md, link both the markdown
writeup and PDF from docs/use_cases/README.md per contributor guidelines,
and remove local-only pandoc build tooling from the branch.

Signed-off-by: asanth7 <aravsanthanam578@gmail.com>
Signed-off-by: Arav Santhanam <aravsanthanam578@gmail.com>
Signed-off-by: Arav Santhanam <aravsanthanam578@gmail.com>
Signed-off-by: Arav Santhanam <aravsanthanam578@gmail.com>
@IRENEKO IRENEKO marked this pull request as draft June 16, 2026 15:19
Integrate Spotlight from main while keeping Token Highlighter registrations
in __init__.py and the use cases README table.
@asanth7 asanth7 marked this pull request as ready for review June 16, 2026 17:46
Signed-off-by: aravs <aravsanthanam578@gmail.como>
Comment thread .gitignore Outdated
Comment thread TokenHighlighter.pdf Outdated

Collaborator

Please move this file to the respective folder following the PR template.

Collaborator

Not sure if we have touched on this topic, but is there a way to enable token highlighter without modifying hook_llm.py and _hook_plugin.py? I think the other use case https://github.com/IBM/vLLM-Hook/blob/main/docs/use_cases/spotlight.md faces similar challenges and was able to avoid modifying the core files. Can you please take a look?

aravs added 2 commits June 17, 2026 08:31
Signed-off-by: aravs <aravsanthanam578@gmail.como>
Signed-off-by: aravs <aravsanthanam578@gmail.como>

3 participants