⚠️ RESEARCH PROTOTYPE / PoC
WAMP is an experimental middleware for research into LLM context optimization: it explores using encoder attention weights to prune redundant messages from a conversation while preserving semantic integrity. Its Tri-modal Adaptive Engine uses semantic routing to select the optimal pruning strategy per request, and it supports multiple encoder backbones (ModernBERT, MiniLM).
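To make the core idea concrete, here is a minimal sketch of attention-weight pruning: score each message by the attention its tokens receive from the encoder's [CLS] token, then drop the lowest-scoring messages. The model choice, aggregation, and keep ratio below are illustrative assumptions, not WAMP's actual implementation.

```python
# Minimal sketch of attention-based message pruning (illustrative only).
import torch
from transformers import AutoTokenizer, AutoModel

MODEL = "sentence-transformers/all-MiniLM-L6-v2"  # any BERT-style encoder works
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModel.from_pretrained(MODEL, output_attentions=True)

def attention_score(text: str) -> float:
    """Score a message by the attention its tokens receive from [CLS]."""
    inputs = tokenizer(text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        out = model(**inputs)
    # Last layer, all heads: attention from [CLS] (position 0) to every token.
    cls_attn = out.attentions[-1][0, :, 0, :]  # shape: (heads, seq_len)
    return cls_attn.max(dim=0).values.mean().item()

def prune(messages: list[str], keep_ratio: float = 0.6) -> list[str]:
    scores = [attention_score(m) for m in messages]
    # Keep the top `keep_ratio` fraction of messages by score.
    cutoff = sorted(scores, reverse=True)[max(int(len(scores) * keep_ratio) - 1, 0)]
    return [m for m, s in zip(messages, scores) if s >= cutoff]
```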
- Primary Engine: ModernBERT Multilingual ONNX Router (8192 context window)
- Compact Engine: SetFit Multilingual ONNX Router V1 (Fast, low memory)
- Training Data: WAMP Router Intent Dataset
WAMP supports two high-performance engines. Selection depends on hardware constraints and context length requirements.
ModernBERT (Primary Engine): best for complex reasoning and large messages.
| Scenario | Algorithm | Multiplier | Token Savings | Recall | Verdict |
|---|---|---|---|---|---|
| Fact Retrieval | cls_max | 0.98 | ~38% | 100% | ✅ SAFE |
| Reasoning | max_max | 0.74 | ~40% | 100% | ✅ SAFE |
| Summary | max_max | 1.00 | ~56% | 88% | ⚡ POWER |
MiniLM/SetFit (Compact Engine): best for high-speed simple tasks.
| Scenario | Algorithm | Multiplier | Token Savings | Recall | Verdict |
|---|---|---|---|---|---|
| Fact Retrieval | cls_max | 0.99 | ~29% | 100% | ✅ SAFE |
| Reasoning | cls_max | 0.95 | 0% | 100% | ✅ SAFE |
| Summary | cls_max | 0.99 | ~37% | 75% | |
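In these tables, Algorithm names how token-level attention is aggregated into a per-message score, and Multiplier scales the keep/prune threshold. The exact semantics are internal to WAMP; the sketch below shows one plausible reading of `cls_max` and `max_max`, inferred from the names alone.

```python
import numpy as np

def message_score(cls_attn: np.ndarray, algorithm: str) -> float:
    """cls_attn: (heads, seq_len) attention from [CLS] to each token.
    Aggregation semantics here are assumptions inferred from the names."""
    if algorithm == "cls_max":
        # Max over heads, then mean over tokens.
        return float(cls_attn.max(axis=0).mean())
    if algorithm == "max_max":
        # Single strongest attention value across heads and tokens.
        return float(cls_attn.max())
    raise ValueError(f"unknown algorithm: {algorithm}")

def keep(score: float, baseline: float, multiplier: float = 0.98) -> bool:
    # The multiplier scales the keep/prune threshold relative to a
    # baseline, e.g. the mean score across all messages in the window.
    return score >= multiplier * baseline
```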
Update your `.env` file to select the desired model:

```env
# ModernBERT (Standard)
FILTER_MODEL_DIR=./model_modernbert_onnx
FILTER_MAX_TOKENS=2048

# MiniLM (Fast)
# FILTER_MODEL_DIR=./model_setfit_onnx
# FILTER_MAX_TOKENS=512
```
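For reference, this is roughly how such settings could be consumed at startup (a sketch using python-dotenv; WAMP's actual config loading may differ):

```python
# Sketch of reading the .env settings above (illustrative).
import os
from dotenv import load_dotenv

load_dotenv()  # reads .env from the working directory

FILTER_MODEL_DIR = os.getenv("FILTER_MODEL_DIR", "./model_modernbert_onnx")
FILTER_MAX_TOKENS = int(os.getenv("FILTER_MAX_TOKENS", "2048"))
```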
Request flow through the proxy:

```mermaid
graph TD
User([User App]) -->|Full Context| WAMP[WAMP Proxy]
WAMP -->|1. Intent Routing| Router{Semantic Router}
Router -->|Needle| P1[Precise Strategy]
Router -->|Logic| P2[Logical Strategy]
Router -->|Summary| P3[Broad Strategy]
P1 & P2 & P3 -->|2. Pruning| Final[Compressed Context]
Final -->|3. Forwarding| Upstream[[LLM Upstream]]
Upstream -->|Response| User
```
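The diagram's pruning step can be pictured as a dispatch table from routed intent to strategy. A sketch follows; the strategy names mirror the diagram, but the interface and bodies are hypothetical:

```python
from typing import Callable

Message = dict[str, str]  # e.g. {"role": "user", "content": "..."}

def precise_strategy(msgs: list[Message]) -> list[Message]:
    # Needle intent: keep only messages scoring high for a specific fact.
    return msgs  # placeholder

def logical_strategy(msgs: list[Message]) -> list[Message]:
    # Logic intent: preserve chains of dependent reasoning steps.
    return msgs  # placeholder

def broad_strategy(msgs: list[Message]) -> list[Message]:
    # Summary intent: keep a representative spread of the conversation.
    return msgs  # placeholder

STRATEGIES: dict[str, Callable[[list[Message]], list[Message]]] = {
    "needle": precise_strategy,
    "logic": logical_strategy,
    "summary": broad_strategy,
}

def compress(intent: str, msgs: list[Message]) -> list[Message]:
    # Step 1 routed the intent; step 2 prunes; the result goes upstream.
    return STRATEGIES[intent](msgs)
```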
```bash
git clone https://github.com/naranor/wamp-proxy.git
cd wamp-proxy
python -m venv .venv
.\.venv\Scripts\activate   # Windows; use `source .venv/bin/activate` on macOS/Linux
pip install -r requirements.txt
```

Then start the proxy:

```bash
python main.py
```
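Once the proxy is running, any OpenAI-compatible client can be pointed at it. The port and path below are assumptions; check WAMP's configuration for the actual listen address:

```python
# Route an OpenAI-compatible client through the WAMP proxy
# (base_url is an assumed default, not confirmed by the project docs).
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")
resp = client.chat.completions.create(
    model="your-upstream-model",
    messages=[{"role": "user", "content": "Hello through WAMP"}],
)
print(resp.choices[0].message.content)
```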
- `tools/research_modernbert_filter.py` - granular research into ModernBERT multipliers.
- `tools/research_setfit_filter.py` - research into MiniLM multipliers.
- `benchmarks/live_proxy_test.py` - end-to-end verification.
Date: May 9, 2026
Dataset: lambda/hermes-agent-reasoning-traces
To confirm WAMP's production readiness, we conducted an end-to-end audit using real-world agent trajectories. These scenarios involve complex multi-step reasoning, tool usage, and long conversational histories.
- Engine: ModernBERT-base (INT8 ONNX) with a 2048-token window.
- Logic: overlapping sliding window for scanning long messages (sketched after this list).
- Scenarios: 100+ message synthetic reasoning chains and real trajectories from the Hermes dataset.
- Metric: Verified correct LLM response (Reasoning Recall) vs context reduction.
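The overlapping sliding window lets a 2048-token encoder scan messages of arbitrary length without truncation. A minimal sketch, with illustrative window and stride values:

```python
# Sketch of overlapping sliding-window scanning for messages longer than
# the encoder's token budget (window/stride values are illustrative).
def sliding_windows(tokens: list[int], window: int = 2048, stride: int = 1024):
    """Yield overlapping token windows; overlap = window - stride."""
    if len(tokens) <= window:
        yield tokens
        return
    for start in range(0, len(tokens) - stride, stride):
        yield tokens[start:start + window]

def score_long_message(tokens: list[int], score_window) -> float:
    # Score each window independently and keep the maximum, so an
    # important span anywhere in the message can save it from pruning.
    return max(score_window(w) for w in sliding_windows(tokens))
```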
| Scenario | Source | Context Size | Compressed | Token Savings | Recall |
|---|---|---|---|---|---|
| Logic Chain | Synthetic | 115 msgs | 68 msgs | ~41% | 100% |
| Agent Tools | Hermes Trajectories | 13 msgs | 9 msgs | ~31% | 100% |
| Fact Retrieval | Needle Haystack | 55 msgs | 39 msgs | ~38% | 100% |
Conclusion: Moving to ModernBERT-base combined with the sliding-window algorithm resolved the earlier memory and context-limit bottlenecks. WAMP now delivers a stable ~40% reduction in context costs while maintaining 100% recall across the audited agent scenarios.
Created for research into attention mechanisms in Transformer architectures.