feat: implementing token-highlighter worker, analyzer, demo by asanth7 · Pull Request #9 · IBM/vLLM-Hook

asanth7 · 2026-04-18T17:37:44Z

Summary

Implements token highlighter worker/analyzer flow
Adds demo wiring and reporting updates
Documents algorithm + worker/analyzer contract
See TokenHighlighter.md for more detail on structure, architecture, and contribution.

Type of contribution

New worker
New analyzer
Bug fix
Other (describe below)

Files modified

vllm_hook_plugins/vllm_hook_plugins/workers/highlighter_worker.py (new worker)
vllm_hook_plugins/vllm_hook_plugins/analyzers/highlighter_analyzer.py (new analyzer)
examples/demo_token_highlighter.py (demo)
TokenHighlighter.md (detailed description of contribution)
vllm_hook_plugins/vllm_hook_plugins/__init__.py (registered worker/analyzer)
model_configs/token_highlighter/Qwen2-1.5B-Instruct.json (model configs)
I have NOT modified hook_llm.py

Plugin architecture checklist

New workers/analyzers are registered via PluginRegistry in __init__.py
New workers extend V1Worker (not HookLLM)
hooks_on=(prefill, generate) flag is set correctly for any new worker registration
Examples or notebooks are included for new features

Testing

Tested with examples/demo_token_highlighter.py using a local Qwen2-1.5B-Instruct snapshot on a single GPU (tensor_parallel_size=1, max_model_len=512, NVIDIA RTX 5070, 8GB VRAM). For each prompt, the demo runs one hooked llm.generate() pass and one llm.analyze() pass, then prints applied driver tokens, optional analyzer-side driver tokens (if analysis spec differs), and top tokens by score. I validated that scores are non-zero under the autograd scorer path, soft-removal is applied to selected prompt tokens, and the analyzer can be rerun with different specs without recomputing gradients.
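For readers unfamiliar with the soft-removal step validated above, here is a minimal sketch in plain Python. It is not the worker's code: it assumes per-token influence scores are already computed, selects the top fraction of prompt tokens, and scales their embedding rows by a factor. The helper names `select_driver_tokens` and `soft_remove`, and the parameters `alpha` and `beta`, are illustrative assumptions.

```python
import math


def select_driver_tokens(scores, alpha=0.25):
    """Return sorted indices of the top `alpha` fraction of tokens by score.

    Hypothetical helper: the real worker may select tokens differently.
    """
    if not scores:
        return []
    k = max(1, math.ceil(alpha * len(scores)))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


def soft_remove(embeddings, driver_idx, beta=0.5):
    """Scale the embedding rows of driver tokens by `beta`; copy the rest unchanged."""
    drivers = set(driver_idx)
    return [
        [beta * x for x in row] if i in drivers else list(row)
        for i, row in enumerate(embeddings)
    ]


# Toy data: four prompt tokens with made-up scores and 2-d embeddings.
scores = [0.1, 0.9, 0.3, 0.7]
embeddings = [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0], [4.0, 4.0]]

drivers = select_driver_tokens(scores, alpha=0.5)
softened = soft_remove(embeddings, drivers, beta=0.5)
```

With `beta=1.0` the prompt is left untouched and with `beta=0.0` the selected tokens are zeroed out, so `beta` interpolates between no mitigation and hard removal.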

Related issue

N/A

Contribution acknowledgement

If this contribution is included in a future version of the vLLM-Hook technical report, would you like to be credited as a co-author?

Yes, please include me as a contributor
No, thanks

If yes, please provide:

Name: Arav Santhanam
Affiliation: Carnegie Mellon University
One-sentence description of your contribution: Adds a gradient-based Token Highlighter that identifies and softly suppresses prompt tokens driving affirmative jailbreak behavior, with worker-side mitigation and analyzer-side post-run inspection (proposed in ICX360).

IRENEKO · 2026-04-20T14:10:28Z

@asanth7 Thank you for the PR! Can you please prepare a notebook example of token-highlighter as well?

A good reference is here: https://github.com/IBM/vLLM-Hook/blob/main/notebooks/demo_attntracker.ipynb (similar to your examples/demo_token_highlighter.py, but run in a notebook).

IRENEKO

Thanks, Arav, for your PR revision! Please see the comments on the individual files for details.

IRENEKO · 2026-06-08T16:30:11Z

+from vllm_hook_plugins import HookLLM
+
+
+class HighlighterDemoLLM(HookLLM):


Is this subclass needed?

This is a wrapper for highlighter-specific config/loading to avoid modifying hook_llm.py, but can be removed if we are able to edit that directly instead.

IRENEKO · 2026-06-08T16:31:15Z

+        highlighter = cfg.get("highlighter", {})
+        os.environ["VLLM_HIGHLIGHTER_TARGET_PHRASE"] = highlighter.get(
+            "target_phrase",
+            "Sure! I can help with that.",


Is this a duplicate of the config? I see you have already defined the target phrase in the config.

IRENEKO · 2026-06-08T16:32:59Z

+            "target_phrase",
+            "Sure! I can help with that.",
+        )
+        os.environ["VLLM_HIGHLIGHTER_SCORER"] = highlighter.get(


Can we put most (or, even better, all) of these settings into the config files, and add a description of each variable and its intended use? In the demo, we would then just load everything from the config files and pass the values on as environment variables if needed.
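One way the config-driven flow requested here could look is sketched below. It is an assumption-laden illustration, not code from the PR: the `highlighter` block and the `target_phrase` and `scorer` keys appear in the diff, while the `beta` key, the `export_highlighter_env` helper, and the uppercase naming rule are hypothetical.

```python
import json
import os

ENV_PREFIX = "VLLM_HIGHLIGHTER_"  # prefix seen in the diff (e.g. VLLM_HIGHLIGHTER_SCORER)


def export_highlighter_env(cfg, environ=None):
    """Copy each key of cfg["highlighter"] into a PREFIX + KEY environment variable.

    Hypothetical helper; `environ` defaults to os.environ but can be any dict,
    which keeps the function easy to test.
    """
    environ = os.environ if environ is None else environ
    exported = {}
    for key, value in cfg.get("highlighter", {}).items():
        name = ENV_PREFIX + key.upper()
        # Environment values must be strings, so JSON-encode everything else.
        environ[name] = value if isinstance(value, str) else json.dumps(value)
        exported[name] = environ[name]
    return exported


# Inline stand-in for a model config file; "beta" is an assumed key.
cfg = json.loads(
    '{"highlighter": {"target_phrase": "Sure! I can help with that.",'
    ' "scorer": "autograd", "beta": 0.5}}'
)
env = {}
exported = export_highlighter_env(cfg, environ=env)
```

Keeping the defaults in the JSON file rather than in `highlighter.get(..., default)` calls means the demo has a single source of truth for each setting.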

IRENEKO · 2026-06-08T16:50:59Z

Should this be kept locally as well for your own reference if this is a debugging script?

IRENEKO · 2026-06-08T16:53:55Z

+
+import torch
+import torch.nn.functional as F
+from transformers import AutoModelForCausalLM, AutoTokenizer


Do we still need to load a Hugging Face model in the worker?

IRENEKO · 2026-06-08T16:55:18Z

    needs_hooks = wants_hs or wants_qk or wants_steer
+    # Token Highlighter writes its own artifacts inside the wrapped execute_model;
+    # it needs install_hooks but not the probe-style flush/get below.
+    needs_highlighter = bool(extra.get("highlighter_mode"))


I am reluctant to open a flag for any specific use case... Is there a specific reason why you can't reuse needs_hooks?

I included the highlighter_mode check within needs_hooks so that we can reuse needs_hooks (I just added a wants_highlighter flag, as with the other workers) and followed a structure very similar to the steering worker. Token Highlighter handles its own hook artifacts (highlighter_activations.pt from the worker for score computation and highlighter.pt from the analyzer for mitigation), so it doesn't require the post-processing/flushing that the probe method uses.
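The flag logic described in this reply can be sketched roughly as follows. Only `needs_hooks`, `wants_hs`, `wants_qk`, `wants_steer`, `wants_highlighter`, and the `highlighter_mode` key come from the thread; the other dictionary keys and the `hook_requirements` function are hypothetical stand-ins for however the plugin actually derives these flags.

```python
def hook_requirements(extra):
    """Decide which install and post-processing steps a request needs (illustrative)."""
    wants_hs = bool(extra.get("hidden_states"))               # hypothetical key
    wants_qk = bool(extra.get("qk"))                          # hypothetical key
    wants_steer = bool(extra.get("steer"))                    # hypothetical key
    wants_highlighter = bool(extra.get("highlighter_mode"))   # key from the PR diff

    # Any of the four use cases requires hooks to be installed on the worker.
    needs_hooks = wants_hs or wants_qk or wants_steer or wants_highlighter
    # Only the probe-style use cases need the flush/get step afterwards;
    # steering and the highlighter write or apply their own artifacts.
    needs_probe_flush = wants_hs or wants_qk
    return {"needs_hooks": needs_hooks, "needs_probe_flush": needs_probe_flush}
```

For example, `hook_requirements({"highlighter_mode": "capture"})` asks for hook installation but no probe flush, which is the behavior the reply describes.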

Signed-off-by: aravs <aravsanthanam578@gmail.como> Signed-off-by: asanth7 <aravsanthanam578@gmail.com>

Add forward_attr scorer alongside autograd: capture last-layer Q/K/V from real scheduler prefill, merge with teacher-forced suffix activations, and score via closed-form last-attention attribution. Shared helpers in TokenHighlighter/utils.py; worker handles capture timing, soft re-prefill, and VLLM_HIGHLIGHTER_SCORER switch. Docs and local/Colab notebooks include Spearman autograd vs forward_attr comparison and beta sweeps. Signed-off-by: aravs <aravsanthanam578@gmail.como> Signed-off-by: asanth7 <aravsanthanam578@gmail.com>

Align examples/demo_token_highlighter.py with tok_grads_soft_hook, config-driven VLLM_HIGHLIGHTER_SCORER, and local snapshot tokenizer load. Add scorer field to Qwen2-1.5B-Instruct.json. Signed-off-by: aravs <aravsanthanam578@gmail.como> Signed-off-by: asanth7 <aravsanthanam578@gmail.com>

- unify token highlighter capture/mitigate flow around explicit per-run IDs so artifacts and mitigation are tied to the same capture run
- harden worker/analyzer artifact lifecycle to reduce missing highlighter_activations races and improve analyzer fallback/trace handling
- strengthen forward_attr gradient approximation plumbing and add scorer validation utilities and derivation documentation
- upgrade local and Colab demos with paper-model presets, 12GB AWQ-friendly settings, GCG-focused prompt path, and clearer analysis/debug output
- consolidate Token Highlighter docs and notebook guidance so users can reproduce paper-style runs with fewer manual tweaks

Signed-off-by: asanth7 <aravsanthanam578@gmail.com>
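The per-run ID idea in the first bullet can be illustrated with a small sketch. The artifact file names highlighter_activations.pt and highlighter.pt are taken from the review thread; the directory layout, the `new_run_id` and `artifact_paths` helpers, and the use of UUIDs are assumptions made for illustration only.

```python
import uuid
from pathlib import Path


def new_run_id():
    """Generate a unique identifier for one capture run (assumed to be a UUID here)."""
    return uuid.uuid4().hex


def artifact_paths(root, run_id):
    """Return the worker and analyzer artifact paths for a single run.

    Placing both files under a run-specific directory ties the mitigation
    artifact to the capture it was computed from.
    """
    run_dir = Path(root) / run_id
    return {
        "activations": run_dir / "highlighter_activations.pt",  # written by the worker
        "mitigation": run_dir / "highlighter.pt",               # written by the analyzer
    }


run_a = new_run_id()
run_b = new_run_id()
paths_a = artifact_paths("hook_artifacts", run_a)
paths_b = artifact_paths("hook_artifacts", run_b)
```

Because each run gets its own directory, an analyzer that is handed `run_a` cannot accidentally read activations left behind by `run_b`.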

- Introduce support for highlighter mode in hook installation logic to accommodate token highlighter artifacts.
- Improve gradient influence computation by ensuring model configuration checks for attention heads.
- Refine handling of forward hooks in the highlighter worker to prevent orphaned references and ensure accurate gradient capture.
- Update documentation and comments for clarity on the interaction between highlighter mode and gradient calculations.

Signed-off-by: asanth7 <aravsanthanam578@gmail.com>

…rker

- precompute the affirmation loss gradient (g_loss) in the worker at capture time so the analyzer no longer needs the full unembedding matrix; forward_attr now ships a ~MB activations artifact instead of a multi-hundred-MB W_U bundle and analyzes with no second model load
- make export_forward_attr_weights drop W_U by default (include_unembedding flag) and tolerate its absence end-to-end in grad_influence and the analyzer
- fix forward_attr query-path per-head contraction (einsum) and decoder-block input capture for vLLM's fused (positions, hidden_states, residual) layout
- remove the autograd last-block gating from the worker; the apples-to-apples last-block reference now lives entirely in examples/compare_token_highlighter_scorers.py via a standalone HF model and a retain_grad pre-hook
- add the scorer comparison harness and document forward_attr vs autograd validation (Spearman 0.93, Pearson 0.99) plus the in-pipeline efficiency rationale

Signed-off-by: asanth7 <aravsanthanam578@gmail.com>
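For context on the Spearman and Pearson figures quoted in the last bullet, the sketch below shows how two per-token score vectors can be compared with both statistics in plain Python. It is not the removed comparison harness; the function names are illustrative and the score values are made up.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg
        i = j + 1
    return result


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson(ranks(x), ranks(y))


# Made-up per-token scores from two scorers over the same four-token prompt.
autograd_scores = [0.10, 0.80, 0.30, 0.55]
forward_attr_scores = [0.12, 0.70, 0.35, 0.50]

rho = spearman(autograd_scores, forward_attr_scores)
r = pearson(autograd_scores, forward_attr_scores)
```

Spearman only checks that the two scorers order the tokens the same way, which is what matters when the top-ranked tokens are the ones selected for mitigation; Pearson additionally reflects how linearly the magnitudes agree.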

Remove examples/compare_token_highlighter_scorers.py and .vscode/settings.json from the branch; both remain on disk locally and are gitignored so they are not re-committed to the PR. Signed-off-by: asanth7 <aravsanthanam578@gmail.com>

Signed-off-by: asanth7 <aravsanthanam578@gmail.com>

… in worker

Pass the highlighter block from JSON through HookLLM into extra_args (like steer), remove VLLM_HIGHLIGHTER_* env usage, and keep plugin install minimal: needs_hooks triggers collective_rpc install_hooks while probe post-RPCs stay HS/QK-only. Worker wraps execute_model at install, registers mitigate embedding hooks immediately, and installs forward-attr/RoPE capture hooks on first capture after config sync. Update analyzer, demos, notebooks, model JSONs, and document vLLM mixin/install_hooks/execute_model internals in TokenHighlighter.md.

Signed-off-by: asanth7 <aravsanthanam578@gmail.com>

… tooling

Rewrite the Token Highlighter documentation as a formal technical report focused on the vLLM-Hook integration. Add pandoc/LaTeX config and a build script for PDF export. Extend gitignore to keep local scorer comparison, vscode settings, and generated PDF out of the PR.

Signed-off-by: asanth7 <aravsanthanam578@gmail.com>

Cite TokenHighlighter.pdf from TokenHighlighter.md, link both the markdown writeup and PDF from docs/use_cases/README.md per contributor guidelines, and remove local-only pandoc build tooling from the branch. Signed-off-by: asanth7 <aravsanthanam578@gmail.com>

Signed-off-by: Arav Santhanam <aravsanthanam578@gmail.com>

Integrate Spotlight from main while keeping Token Highlighter registrations in __init__.py and the use cases README table.

Signed-off-by: aravs <aravsanthanam578@gmail.como>

IRENEKO · 2026-06-17T05:22:26Z

Please move this file to the respective folder following the PR template.

IRENEKO · 2026-06-17T05:28:35Z

Not sure if we have touched on this topic, but is there a way to enable token highlighter without modifying hook_llm.py and _hook_plugin.py? I think the other use case https://github.com/IBM/vLLM-Hook/blob/main/docs/use_cases/spotlight.md faces similar challenges and was able to avoid modifying the core files. Can you please take a look?

Signed-off-by: aravs <aravsanthanam578@gmail.como>

asanth7 force-pushed the feature/token-highlighter branch from 55f9c6f to 6107b60 · April 19, 2026 00:03

asanth7 force-pushed the feature/token-highlighter branch 2 times, most recently from 3f45a0e to 619dc00 · May 18, 2026 00:26

asanth7 force-pushed the feature/token-highlighter branch 8 times, most recently from 7a8b339 to 3549daa · June 5, 2026 04:34

IRENEKO reviewed Jun 8, 2026

asanth7 force-pushed the feature/token-highlighter branch 3 times, most recently from e40ee39 to b78d16d · June 9, 2026 00:09

aravs and others added 13 commits June 9, 2026 11:18

feat: implement token highlighter worker and analyzer with demo

ee6ab6c

Signed-off-by: aravs <aravsanthanam578@gmail.como> Signed-off-by: asanth7 <aravsanthanam578@gmail.com>

Updated highlighter description, test cases, and analyzer

78adba7

Signed-off-by: aravs <aravsanthanam578@gmail.como> Signed-off-by: asanth7 <aravsanthanam578@gmail.com>

Update PyTorch dtypes

506378f

Signed-off-by: aravs <aravsanthanam578@gmail.como> Signed-off-by: asanth7 <aravsanthanam578@gmail.com>

chore: remove scorer_validation from PR

0dcc2f1

Signed-off-by: asanth7 <aravsanthanam578@gmail.com>

asanth7 force-pushed the feature/token-highlighter branch from b78d16d to 82e3d28 · June 9, 2026 15:19

asanth7 added 4 commits June 9, 2026 11:32

feat: Exact FFN Jacobian and integration with attention gradient

94b5dda

Signed-off-by: Arav Santhanam <aravsanthanam578@gmail.com>

docs: Updated derivation and writeup

f51e648

Signed-off-by: Arav Santhanam <aravsanthanam578@gmail.com>

docs: Updated use cases README

8b601e0

Signed-off-by: Arav Santhanam <aravsanthanam578@gmail.com>

IRENEKO marked this pull request as draft June 16, 2026 15:19

Merge origin/main and resolve Spotlight vs Token Highlighter conflicts

630c42c

asanth7 marked this pull request as ready for review June 16, 2026 17:46

feat: Added interactive TH demo

7399767

Signed-off-by: aravs <aravsanthanam578@gmail.como>

IRENEKO requested changes Jun 17, 2026

aravs added 2 commits June 17, 2026 08:31

docs: Updated live_highlighter p+ TH docs

da35c0c

Signed-off-by: aravs <aravsanthanam578@gmail.como>

feat: Generate and analyze highlighter wrappers

cfc9585

Signed-off-by: aravs <aravsanthanam578@gmail.como>
