Honey-Comb

v0.2.0 - Production Ready

Keep the honey, drop the wax.

CPU-only inline context depollution for agent harnesses

Visual Gallery

Architecture Overview

Honey-Comb operates in two loops:

Hot Loop (per message, ~0.035ms rules / ~0.8ms ML): Classifies and depollutes every message on ingestion
Cool Loop (every N turns, ~10-50ms): Performs staleness detection and budget enforcement

Performance Benchmarks

Measured throughput by mode:

Production mode (thread-safe, metrics enabled): 17,069 msg/s
High-performance mode (no locks, no metrics): 24,667 msg/s
Rule-based classification: 28,948 msg/s

Depollution Ratios

Real-world depollution examples from agent sessions:

Test output: 500 lines → "94 passed, 2 failed" + failure details (83x reduction)
File contents: 69-line source file → "src/auth.py (69 lines)" (103x reduction)
Reasoning traces: Verbose reasoning → key conclusions (3-5x reduction)

Latency Breakdown

The hot loop completes in under 1.5ms per message, making inline depollution practical for real-time agent loops.

Statistical Validation

All key claims are validated with bootstrap confidence intervals (10,000 resamples), one-sample t-tests, and Cohen's d effect sizes across n >= 100 samples. See docs/statistical_validation.json for full results.

Real-World Demo

The demo shows a 10-turn coding agent session depolluted from 4,062 tokens to 640 tokens (6.3x reduction):

Turn 1 (SYSTEM)   - 137 → 137 tokens (CORE - kept verbatim)
Turn 2 (USER)     - 93 → 93 tokens (CORE - kept verbatim)
Turn 3 (FILE)     - 514 → 5 tokens (COMPACT - 103x reduction)
Turn 4 (TESTS)    - 759 → 93 tokens (DISTILL - 8x reduction)
Turn 5 (REASON)   - 351 → 110 tokens (DISTILL - 3x reduction)
Turn 6 (DIFF)     - 451 → 9 tokens (COMPACT - 50x reduction)
Turn 7 (TESTS)    - 334 → 20 tokens (DISTILL - 17x reduction)
Turn 8 (TESTS)    - 585 → 20 tokens (DISTILL - 29x reduction)
Turn 9 (SUMMARY)  - 247 → 146 tokens (DISTILL - 2x reduction)
─────────────────────────────────────────────────────
Total: 4,062 → 640 tokens (6.3x reduction, 84% noise removed)
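The totals line follows directly from the raw and clean token counts. A minimal sketch of the arithmetic (the helper name `reduction_stats` is illustrative and not part of Honey-Comb):

```python
def reduction_stats(raw_tokens: int, clean_tokens: int) -> tuple[float, float]:
    """Return (reduction ratio, fraction of tokens removed)."""
    ratio = raw_tokens / clean_tokens
    removed = 1 - clean_tokens / raw_tokens
    return ratio, removed


# Totals from the demo session above.
ratio, removed = reduction_stats(4062, 640)
print(f"{ratio:.1f}x reduction, {removed:.0%} removed")
```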

Run the demo yourself:

python scripts/demo_pollution.py

Statistical Significance

All performance claims are validated with bootstrap confidence intervals, hypothesis tests, and effect sizes.

Key Results

Metric	Mean	95% CI	Baseline	p-value	Effect Size
Classification Accuracy	84.2%	[79.9%, 88.3%]	25.0% (random)	< 0.001	d = 1.42
Reduction Ratio	13.7x	[12.2x, 15.3x]	1.0x (no depollution)	< 0.001	d = 2.33
Token Savings	3,103 tokens	[2,815, 3,398]	0 tokens	< 0.001	—
Throughput (rule-based)	13,635 msg/s	[13,374, 13,867]	—	—	—
Throughput (ML-based)	1,028 msg/s	[995, 1,057]	—	—	—

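The three metrics with a baseline (classification accuracy, reduction ratio, token savings) are statistically significant (p < 0.001). Accuracy and reduction ratio also show large effect sizes (d > 0.8). Throughput is reported with confidence intervals only.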

n=273 evaluation examples for accuracy (held-out test set)
n=100 synthetic sessions for reduction ratio and token savings
n=100 trials for throughput (1000 messages per trial)
Bootstrap confidence intervals with 10,000 resamples
One-sample t-tests vs appropriate baselines
Cohen's d for effect size (d > 0.8 = large effect)

See docs/statistical_validation.json for full results and scripts/validate_significance.py to reproduce.
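For readers unfamiliar with the methods listed above, here is a rough standard-library sketch of a percentile bootstrap confidence interval and a one-sample Cohen's d. It is not the code in scripts/validate_significance.py. The function names are assumptions, and the sample data is synthetic and only chosen to resemble the reported reduction ratios.

```python
import random
import statistics


def bootstrap_ci(samples, n_resamples=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean."""
    rng = random.Random(seed)
    n = len(samples)
    means = sorted(
        statistics.fmean(rng.choices(samples, k=n)) for _ in range(n_resamples)
    )
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi


def cohens_d_one_sample(samples, baseline):
    """One-sample Cohen's d: (mean - baseline) / sample standard deviation."""
    return (statistics.fmean(samples) - baseline) / statistics.stdev(samples)


# Synthetic reduction ratios for illustration only; not Honey-Comb data.
rng = random.Random(42)
ratios = [rng.gauss(13.7, 5.0) for _ in range(100)]

low, high = bootstrap_ci(ratios)
d = cohens_d_one_sample(ratios, baseline=1.0)
print(f"mean={statistics.fmean(ratios):.1f}x  95% CI=[{low:.1f}, {high:.1f}]  d={d:.2f}")
```

The baseline of 1.0x corresponds to "no depollution", matching the Baseline column in the results table.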

What It Does

Agent context windows fill up with noise: 500-line test outputs where everything passed, file contents from 10 turns ago, reasoning chains about bugs that are already fixed. The common approach today is reactive: call an LLM to summarize once the window gets too long.

Honey-Comb takes a different approach: depollute every message on the way in, before it ever enters the context window. A CPU classifier (~1ms per message) labels each message with a depollution strategy, and deterministic regex extractors execute it. No model reads or understands the text. The LLM only sees clean context — the honey, not the wax.

Every message enters the agent loop:
  raw → classify(1ms) → depollute → context window
  raw → classify(1ms) → depollute → context window
  raw → classify(1ms) → depollute → context window

LLM sees: clean, depolluted context

No batch summarization. No "when do I depollute?" threshold. Every message, every time.

What This Is (and Isn't)

This is not compression. Real compression (gzip, delta encoding) preserves information in a smaller form. Honey-Comb selectively deletes noise and extracts structural summaries from tool outputs. A file read replaced by src/auth.py (69 lines) has lost the file contents — they are not recoverable from the context window.

This is a classifier + rule engine. The ML model (TF-IDF + VotingClassifier) picks a bucket. Hand-written regex extractors do the actual work. No model reads or understands the text being "depolluted."

This works on structured tool outputs where "what matters" is mechanically extractable: test results, file reads, diffs, error traces, command output. It does not summarize free-form conversation.

Misclassification = data loss. A CORE message mislabeled as DROP is gone. The 84% classification accuracy means ~16% of messages get a suboptimal strategy. For most tool outputs this is harmless (slightly more or less pruning), but it is a real risk for ambiguous content.

The honest pitch: CPU-only inline context pruning for agent harnesses.

The Two Loops

HOT LOOP (per message, under 0.1ms rules / ~1ms ML):
  raw message → classifier → label → depolluter → clean context entry

COOL LOOP (every N turns, ~10-50ms):
  walk context → drop stale/superseded entries
  budget check → force-downgrade if over budget

Both loops are CPU-only. The LLM only ever sees clean, depolluted context.
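The scheduling of the two loops can be sketched as follows. This is an illustration under simplifying assumptions, not the code in firewall.py: the class name `TwoLoopSketch` is hypothetical, the hot loop is reduced to "keep the first line", and the cool loop is reduced to "drop duplicates".

```python
from dataclasses import dataclass, field


@dataclass
class TwoLoopSketch:
    """Illustrative scheduler: hot loop per message, cool loop every N turns."""

    cool_interval: int = 5
    turn: int = 0
    context: list[str] = field(default_factory=list)
    cool_runs: int = 0

    def hot_loop(self, raw: str) -> str:
        # Stand-in for classify + depollute: keep only the first line.
        return raw.splitlines()[0] if raw else raw

    def cool_loop(self) -> None:
        # Stand-in for staleness and budget checks: drop exact duplicates,
        # keeping the most recent occurrence of each entry.
        seen, kept = set(), []
        for entry in reversed(self.context):
            if entry not in seen:
                seen.add(entry)
                kept.append(entry)
        self.context = list(reversed(kept))
        self.cool_runs += 1

    def process(self, raw: str) -> str:
        self.turn += 1
        clean = self.hot_loop(raw)
        self.context.append(clean)
        if self.turn % self.cool_interval == 0:
            self.cool_loop()
        return clean


loops = TwoLoopSketch(cool_interval=5)
for i in range(10):
    loops.process(f"message {i % 3}\nnoisy detail")
print(loops.cool_runs, loops.context)
```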

Label Taxonomy

Label	Strategy	Example
`CORE`	Keep verbatim	Active goal, current error, system prompt
`DISTILL`	Extract key info	Test output → "94 passed, 2 failed" + failure details
`COMPACT`	Structural summary	File → "src/foo.py (200 lines): class Foo, def bar()"
`DROP`	Remove entirely	Completed tool calls (the result is what matters)
`STALE`	Mark for deletion	File read before a later edit of the same file
`ESCALATE`	Defer to LLM	Ambiguous content (rare)
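One way to picture the taxonomy in code is an enum keyed by label. This is a sketch only: the real definitions live in honeycomb/labels.py and may look different, and the `LOSSY` grouping is an assumption added for illustration.

```python
from enum import Enum


class Label(Enum):
    """The six labels from the taxonomy table, with their strategies."""

    CORE = "keep verbatim"
    DISTILL = "extract key info"
    COMPACT = "structural summary"
    DROP = "remove entirely"
    STALE = "mark for deletion"
    ESCALATE = "defer to LLM"


# Labels whose handling discards original text from the context window.
# This grouping is an assumption for illustration.
LOSSY = {Label.DISTILL, Label.COMPACT, Label.DROP, Label.STALE}


def is_lossy(label: Label) -> bool:
    return label in LOSSY


print([label.name for label in Label if not is_lossy(label)])
```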

Quick Start

git clone https://github.com/DJLougen/honey-comb.git
cd honey-comb
pip install -e ".[dev]"

# Run tests
pytest tests/

# Generate training data
python scripts/generate_synthetic.py

# Train classifier
honeycomb-train examples/train.jsonl --eval examples/eval.jsonl

Usage

Rule-based (no training needed)

from honeycomb import HoneyComb, Message

hc = HoneyComb()  # Uses rule-based classification

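# Process every message inline.
# agent_messages and send_to_llm are placeholders for your harness's
# own message source and LLM call.
for raw_message in agent_messages: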
    compressed = hc.process(Message(
        role=raw_message["role"],
        content=raw_message["content"],
    ))
    # Send depolluted content to your LLM
    send_to_llm({"role": compressed.role, "content": compressed.content})

# Get the full depolluted context window
window = hc.get_context_window()

ML classifier (trained)

from honeycomb import HoneyComb, Message

hc = HoneyComb(model_path="models/honeycomb.joblib")

# Same API — just with ML classification instead of rules
compressed = hc.process(Message(role="tool", content="94 passed, 2 failed..."))

With budget management

from honeycomb import HoneyComb, Message
from honeycomb.budget import BudgetConfig

hc = HoneyComb(
    budget_config=BudgetConfig(target_tokens=10_000),
    cool_interval=5,  # Run cool loop every 5 turns
)

Production Features

Thread Safety

Honey-Comb is thread-safe by default, allowing concurrent processing from multiple threads:

hc = HoneyComb(thread_safe=True)  # Default

# Safe to call from multiple threads
# All internal state is protected by locks

For single-threaded workloads, disable locks for maximum performance:

hc = HoneyComb(thread_safe=False)  # 1.45x faster, single-threaded only
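A common way to make locking optional is to swap the lock for a no-op context manager. The sketch below shows that pattern with a hypothetical `CounterSketch` class. Whether Honey-Comb implements `thread_safe` this way is an assumption.

```python
import threading
from contextlib import nullcontext


class CounterSketch:
    """Illustrative only: guards shared state with a lock when thread_safe=True."""

    def __init__(self, thread_safe: bool = True):
        # nullcontext() is a no-op context manager, so the single-threaded
        # path skips lock acquisition entirely.
        self._guard = threading.Lock() if thread_safe else nullcontext()
        self.messages_processed = 0

    def process(self, content: str) -> str:
        with self._guard:
            self.messages_processed += 1
        return content


sketch = CounterSketch(thread_safe=True)
threads = [
    threading.Thread(target=lambda: [sketch.process("x") for _ in range(1000)])
    for _ in range(8)
]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(sketch.messages_processed)
```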

Observability

Structured logging, Prometheus metrics, and health checks are built-in:

from honeycomb import setup_logging, metrics, health_checker

# Setup structured JSON logging
setup_logging(level="INFO", json_format=True)

# Metrics are automatically recorded
print(f"Messages: {metrics.messages_processed.value}")
print(f"Compression p95: {metrics.compression_ratio.get_percentile(95):.2f}x")
print(f"Avg latency: {metrics.processing_latency_seconds.get_mean() * 1000:.3f}ms")

# Health check endpoint
health = health_checker.check()
print(f"Status: {health.status}")
print(f"Uptime: {health.uptime_seconds:.1f}s")

Export metrics in Prometheus format:

prometheus_text = metrics.export_prometheus()
# Serve at /metrics endpoint

Configuration

Configure via environment variables or config files:

# Environment variables
export HONEYCOMB_THREAD_SAFE=true
export HONEYCOMB_METRICS_ENABLED=true
export HONEYCOMB_COOL_LOOP_INTERVAL=10
export HONEYCOMB_LOG_LEVEL=INFO

Or load from a config file:

from honeycomb import load_config

config = load_config("config.json")
print(f"Thread safe: {config.thread_safe}")
print(f"Cool loop interval: {config.cool_loop_interval}")
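As a rough picture of how the environment variables above could map onto config fields, here is a standard-library sketch. The `ConfigSketch` class and its defaults are assumptions, not the contents of honeycomb/config.py. Only the variable names come from this README.

```python
import os
from dataclasses import dataclass


def _env_bool(name: str, default: bool) -> bool:
    """Parse a boolean environment variable, falling back to a default."""
    raw = os.environ.get(name)
    if raw is None:
        return default
    return raw.strip().lower() in {"1", "true", "yes", "on"}


@dataclass(frozen=True)
class ConfigSketch:
    # Defaults here are assumed for illustration.
    thread_safe: bool = True
    metrics_enabled: bool = True
    cool_loop_interval: int = 10
    log_level: str = "INFO"

    @classmethod
    def from_env(cls) -> "ConfigSketch":
        return cls(
            thread_safe=_env_bool("HONEYCOMB_THREAD_SAFE", True),
            metrics_enabled=_env_bool("HONEYCOMB_METRICS_ENABLED", True),
            cool_loop_interval=int(os.environ.get("HONEYCOMB_COOL_LOOP_INTERVAL", "10")),
            log_level=os.environ.get("HONEYCOMB_LOG_LEVEL", "INFO"),
        )


print(ConfigSketch.from_env())
```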

Performance Tuning

Choose the right mode for your workload:

Mode	thread_safe	metrics_enabled	Use Case
Production	`True`	`True`	Concurrent server workloads
High-perf	`False`	`False`	Single-threaded batch processing

# Production mode (default)
hc = HoneyComb(thread_safe=True, metrics_enabled=True)
# ~17,000 msg/s

# High-performance mode
hc = HoneyComb(thread_safe=False, metrics_enabled=False)
# ~24,000 msg/s (1.45x faster)

Run the production demo:

python scripts/demo_production.py

Architecture

honeycomb/
  labels.py          Label taxonomy (CORE/DISTILL/COMPACT/DROP/STALE/ESCALATE)
  features.py        Message-level feature extraction
  compressor.py      Deterministic per-label depollution rules (regex extractors)
  session.py         Turn tracking, staleness detection, supersession (thread-safe)
  budget.py          Token budget management
  classifier.py      TF-IDF + VotingClassifier (SGD + NB + LR)
  firewall.py        Main orchestrator (hot loop + cool loop)
  observability.py   Structured logging, metrics, health checks
  config.py          Configuration management (env vars, files)
  io.py              JSONL I/O for training data
  cli_train.py       Training CLI entry point
scripts/
  generate_synthetic.py   Synthetic training data generator
  demo_pollution.py       Side-by-side raw vs clean demo
  demo_production.py      Production features demo (threading, metrics, config)
  benchmark.py            Performance benchmarks
  benchmark_statistical.py Statistical benchmarks with confidence intervals
examples/
  train.jsonl        Training examples (1335 rows)
  eval.jsonl         Evaluation examples (273 rows)
tests/
  test_labels.py       6 tests
  test_features.py    12 tests
  test_compressor.py  25 tests
  test_session.py     17 tests
  test_budget.py       8 tests
  test_firewall.py    18 tests
  test_classifier.py   5 tests
  test_production.py  19 tests
  test_performance.py 15 tests
  test_threading.py    4 tests

Performance

All values below are statistically validated (bootstrap 95% CI, n >= 100 trials). See the Statistical Significance section for methodology.

Metric	Value	95% CI
Classification accuracy (end-to-end pipeline)	84.2%	[79.9%, 88.3%]
Classification accuracy (isolated ML classifier)	94.5%	—
Training examples	1,335	—
Per-message latency (rules)	0.061ms	—
Per-message latency (ML)	0.899ms	—
Throughput (rules)	13,635 msg/s	[13,374, 13,867]
Throughput (ML)	1,028 msg/s	[995, 1,057]
Reduction ratio (100 sessions)	13.7x	[12.2x, 15.3x]
Token savings per session	3,103 tokens	[2,815, 3,398]
Tests	129 passed	—

The end-to-end accuracy (84.2%) reflects the full pipeline including content-type detection, while the isolated ML classifier achieves 94.5% on the same held-out evaluation set. Both are significantly better than the 25% random baseline (p < 0.001, Cohen's d = 1.42).

Demo: Raw vs Clean

Run python scripts/demo_pollution.py to see a 10-turn coding agent session depolluted in real time:

Turn	Raw	Clean	What happened
System prompt	137 tokens	137 tokens	Kept verbatim (CORE)
User goal	93 tokens	93 tokens	Kept verbatim (CORE)
File read (69 lines)	514 tokens	5 tokens	`src/auth.py (69 lines)`
Test failures (60 lines)	759 tokens	93 tokens	Summary + 4 failed test names + error details
Agent reasoning	351 tokens	110 tokens	Extracted decision plan (numbered lists)
Git diff	451 tokens	9 tokens	`Edited 1 file(s): +2/-9 src/auth.py`
Test pass	334 tokens	20 tokens	`12 passed in 1.15s`
Full suite pass	585 tokens	20 tokens	`213 passed in 4.72s`
Final summary	247 tokens	146 tokens	Headers + numbered change list
Total	4,062 tokens	640 tokens	6.3x reduction, 84% noise removed

How It Works

Hot Loop (per message)

Extract features from the message: role, content type signals (paths, errors, code blocks), turn age, duplicate detection.
Classify the message into a depollution label using either rules or the ML classifier.
Depollute using deterministic regex extractors specific to the content type and label. No model reads the text.
Record in the session state for staleness tracking.
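The four steps above can be sketched end to end for a single content type (pytest output). The rules, regexes, and function names below are hypothetical stand-ins for what features.py, classifier.py, and compressor.py do. The project's actual rules may differ.

```python
import re

PYTEST_SUMMARY = re.compile(r"=+ (.*\b(?:passed|failed)\b.*?) =+")
FAILED_LINE = re.compile(r"^FAILED (\S+)", re.MULTILINE)


def extract_features(role: str, content: str) -> dict:
    """Step 1: cheap content-type signals."""
    return {
        "role": role,
        "has_test_summary": bool(PYTEST_SUMMARY.search(content)),
        "line_count": content.count("\n") + 1,
    }


def classify(features: dict) -> str:
    """Step 2: hypothetical rules mapping features to a label."""
    if features["role"] in {"system", "user"}:
        return "CORE"
    if features["has_test_summary"]:
        return "DISTILL"
    return "COMPACT" if features["line_count"] > 20 else "CORE"


def depollute(label: str, content: str) -> str:
    """Step 3: deterministic extraction per label."""
    summary = PYTEST_SUMMARY.search(content)
    if label == "DISTILL" and summary:
        failed = FAILED_LINE.findall(content)
        return "\n".join([summary.group(1)] + (["Failures:", *failed] if failed else []))
    if label == "COMPACT":
        line_count = content.count("\n") + 1
        return f"({line_count} lines)"
    return content


def hot_loop(session: list, role: str, content: str) -> str:
    label = classify(extract_features(role, content))
    clean = depollute(label, content)
    session.append((label, clean))  # Step 4: record for staleness tracking.
    return clean


raw = "\n".join(
    [f"tests/test_a.py::test_{i} PASSED" for i in range(94)]
    + [
        "FAILED tests/test_foo.py::test_bar - AssertionError: expected 5 got 3",
        "FAILED tests/test_foo.py::test_baz - KeyError: 'id'",
        "========= 2 failed, 94 passed in 3.50s =========",
    ]
)
session = []
print(hot_loop(session, "tool", raw))
```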

Cool Loop (every N turns)

Staleness check: Walk the context and drop entries that are stale (file read before a later edit) or superseded (file read again later).
Budget enforcement: If over the token budget, force-downgrade the lowest-priority entries to more aggressive depollution.
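A simplified sketch of both cool-loop passes follows. The `Entry` type, the priority convention (lower is downgraded first), the whitespace token estimate, and the "keep the first line" downgrade are all assumptions for illustration. They do not describe session.py or budget.py.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Entry:
    turn: int
    kind: str              # "file_read", "file_edit", or "other"
    path: Optional[str]
    text: str
    priority: int          # lower = downgraded first (assumed convention)

    @property
    def tokens(self) -> int:
        # Crude whitespace token estimate, for illustration only.
        return len(self.text.split())


def drop_stale(entries):
    """Drop file reads followed by a later edit or re-read of the same path."""
    kept = []
    for i, entry in enumerate(entries):
        outdated = entry.kind == "file_read" and any(
            later.path == entry.path and later.kind in {"file_edit", "file_read"}
            for later in entries[i + 1:]
        )
        if not outdated:
            kept.append(entry)
    return kept


def enforce_budget(entries, target_tokens):
    """Downgrade lowest-priority entries to their first line until under budget."""
    total = sum(e.tokens for e in entries)
    for entry in sorted(entries, key=lambda e: e.priority):
        if total <= target_tokens:
            break
        before = entry.tokens
        entry.text = entry.text.splitlines()[0] if entry.text else ""
        total -= before - entry.tokens
    return entries


entries = [
    Entry(1, "other", None, "system prompt: fix the auth bug", priority=9),
    Entry(2, "file_read", "src/auth.py", "src/auth.py (69 lines)", priority=3),
    Entry(3, "other", None, "reasoning line one\n" + "detail " * 50, priority=1),
    Entry(4, "file_edit", "src/auth.py", "Edited 1 file(s): +2/-9 src/auth.py", priority=5),
]
context = enforce_budget(drop_stale(entries), target_tokens=20)
print([(e.turn, e.tokens) for e in context])
```

Here the turn-2 file read is dropped because turn 4 edits the same file, and the low-priority reasoning entry is cut down to bring the total under the 20-token budget.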

Depollution Examples

Test output (DISTILL):

Before: 500 lines of pytest output
After:  "94 passed, 2 failed in 3.5s\nFailures:\ntest_foo.py::test_bar\nAssertionError: expected 5 got 3"

File content (COMPACT):

Before: 200 lines of source code
After:  "src/foo.py (200 lines):\nclass Foo\ndef bar()\ndef baz()"
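A structural summary like the one above can be produced with a single regex over top-level definitions. This is a sketch for Python sources only. The function name `compact_file` is hypothetical and not the extractor in compressor.py.

```python
import re

# Matches unindented class/def statements only, so methods are skipped.
DEFINITION = re.compile(r"^(class|def)\s+(\w+)", re.MULTILINE)


def compact_file(path: str, source: str) -> str:
    """Summarize a Python file as its path, line count, and top-level definitions."""
    line_count = len(source.splitlines())
    names = [
        f"{kind} {name}" + ("()" if kind == "def" else "")
        for kind, name in DEFINITION.findall(source)
    ]
    return "\n".join([f"{path} ({line_count} lines):", *names])


source = '''class Foo:
    def method(self):
        return 1

def bar():
    return 2

def baz():
    return 3
'''
print(compact_file("src/foo.py", source))
```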

Command output (DISTILL):

Before: 100 lines of build output
After:  "exit=0\nBuild complete\nSuccessfully built abc123"

Error trace (DISTILL):

Before: 50-line traceback
After:  "ValueError: invalid literal for int()\nat foo.py:42"
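For Python tracebacks, the same idea reduces to keeping the final exception line and the innermost frame. A minimal sketch, assuming the standard CPython traceback layout (the function name `distill_traceback` is hypothetical):

```python
import re

FRAME = re.compile(r'^\s*File "([^"]+)", line (\d+)', re.MULTILINE)


def distill_traceback(trace: str) -> str:
    """Keep the final exception line and the innermost frame location."""
    lines = [line for line in trace.strip().splitlines() if line.strip()]
    exception = lines[-1].strip()
    frames = FRAME.findall(trace)
    if not frames:
        return exception
    path, lineno = frames[-1]
    return f"{exception}\nat {path}:{lineno}"


trace = '''Traceback (most recent call last):
  File "main.py", line 10, in <module>
    run()
  File "foo.py", line 42, in run
    return int(value)
ValueError: invalid literal for int() with base 10: 'abc'
'''
print(distill_traceback(trace))
```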

Relationship to busyBee-cpu

Honey-Comb applies the same principle as busyBee-cpu to a different problem:

	busyBee-cpu	Honey-Comb
Problem	Tool selection in agent loops	Context depollution in agent loops
Principle	Most decisions are mechanical	Most depollution is mechanical
Classifier	Which of 4 actions to take	Which depollution strategy to apply
Resolver	Fill arguments from state	Execute extraction per content type
Escalation	Defer to LLM for reasoning	Defer to LLM for ambiguous content

Both use the same architecture: TF-IDF + VotingClassifier on CPU, with deterministic resolvers/extractors, escalating to the LLM only when uncertain.
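For orientation, a TF-IDF + VotingClassifier pipeline over SGD, Naive Bayes, and logistic regression can be assembled in scikit-learn roughly as below. The hyperparameters, the hard-voting choice, and the toy training rows are assumptions. The real training setup is in honeycomb/classifier.py and examples/train.jsonl.

```python
from sklearn.ensemble import VotingClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy rows for illustration only; far too few for a usable model.
texts = [
    "===== 94 passed, 2 failed in 3.5s =====",
    "===== 12 passed in 1.15s =====",
    "FAILED tests/test_foo.py::test_bar - AssertionError",
    "collected 213 items ... 213 passed in 4.72s",
    "import os\nclass Foo:\n    def bar(self): ...",
    "def baz():\n    return 1\n\nclass Qux: ...",
    "from typing import Optional\ndef load(path): ...",
    "class AuthHandler:\n    def login(self, user): ...",
]
labels = ["DISTILL"] * 4 + ["COMPACT"] * 4

model = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2)),
    VotingClassifier(
        estimators=[
            ("sgd", SGDClassifier(random_state=0)),
            ("nb", MultinomialNB()),
            ("lr", LogisticRegression(max_iter=1000)),
        ],
        voting="hard",  # Majority vote; avoids needing predict_proba from SGD.
    ),
)
model.fit(texts, labels)
pred = model.predict(["===== 7 passed in 0.40s ====="])
print(pred[0])
```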

Dependencies

scikit-learn >= 1.4 — classifiers and pipelines
joblib >= 1.4 — model serialization
numpy >= 1.24 — numeric operations

License

MIT
