# hanzo-extract

Content extraction with built-in sanitization for LLM applications.
- Web Extraction: Fetch and extract clean text from web pages
- PDF Extraction: Extract text from PDF documents
- Conversation Extraction: Export Claude Code sessions for training datasets
- Sanitization: Automatic PII redaction via hanzo-guard
## Installation

```sh
cargo add hanzo-extract
```

Or add to `Cargo.toml`:

```toml
[dependencies]
hanzo-extract = "0.1"
```

## Quick Start

### Web extraction

```rust
use hanzo_extract::{WebExtractor, ExtractorConfig, Extractor};

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let extractor = WebExtractor::new(ExtractorConfig::default());
    let result = extractor.extract("https://example.com").await?;

    println!("Title: {:?}", result.title);
    println!("Text: {}", result.text);
    println!("Words: {}", result.word_count);

    Ok(())
}
```

### PDF extraction

```rust
use hanzo_extract::{PdfExtractor, Extractor};

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let extractor = PdfExtractor::default();
    let result = extractor.extract("document.pdf").await?;

    println!("Text: {}", result.text);

    Ok(())
}
```

### Conversation extraction

Extract Claude Code conversations for AI training:
```rust
use hanzo_extract::conversations::{ConversationExporter, ExporterConfig};
use std::path::Path;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let mut exporter = ConversationExporter::new();
    exporter.export(
        Path::new("~/.claude/projects"),
        Path::new("./training-data"),
    )?;
    Ok(())
}
```

## CLI

### Web extraction

```sh
# Install
cargo install hanzo-extract --features web

# Usage
extract-web https://example.com
extract-web https://example.com --json
```

### Conversation extraction

```sh
# Install
cargo install hanzo-extract --features conversations

# Usage
extract-conversations --source ~/.claude/projects --output ./conversations

# Options
extract-conversations --help
```

## Feature Flags

| Feature | Default | Description |
|---|---|---|
| `web` | Yes | Web page extraction |
| `pdf` | Yes | PDF document extraction |
| `sanitize` | Yes | PII redaction via hanzo-guard |
| `conversations` | No | Claude Code conversation extraction |
```toml
# Minimal (no extraction)
hanzo-extract = { version = "0.1", default-features = false }

# Web only
hanzo-extract = { version = "0.1", default-features = false, features = ["web"] }

# Full (all features)
hanzo-extract = { version = "0.1", features = ["full"] }
```

## Conversation Extraction

The conversation extractor creates training datasets from Claude Code sessions:
```text
~/.claude/projects/
├── project-a/
│   ├── session1.jsonl
│   └── session2.jsonl
└── project-b/
    └── session.jsonl

        ↓ extract-conversations

./conversations/
├── conversations_20251222.jsonl   # Full data
├── training_20251222.jsonl        # Instruction/response format
└── splits/
    ├── train_20251222.jsonl       # 80%
    ├── val_20251222.jsonl         # 10%
    └── test_20251222.jsonl        # 10%
```
- Extracts user/assistant conversation turns
- Anonymizes paths, secrets, emails, API keys
- Calculates quality scores (0.0-1.0)
- Creates reproducible train/val/test splits
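Reproducible splits are typically achieved by bucketing each session with a stable hash, so re-running the exporter assigns every session to the same split. The sketch below is a hypothetical helper under that assumption, not the crate's actual implementation (the real exporter may hash different fields or use a seeded RNG):

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Assign a session to train/val/test by hashing a stable id.
/// Sketch only: a production exporter would use a fixed-seed hash
/// so splits are reproducible across runs and machines.
fn split_for(session_id: &str) -> &'static str {
    let mut h = DefaultHasher::new();
    session_id.hash(&mut h);
    match h.finish() % 10 {
        0..=7 => "train", // ~80%
        8 => "val",       // ~10%
        _ => "test",      // ~10%
    }
}

fn main() {
    // The same id always maps to the same split.
    assert_eq!(split_for("session1.jsonl"), split_for("session1.jsonl"));
    println!("session1.jsonl -> {}", split_for("session1.jsonl"));
}
```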
### Quality scoring

Conversations are scored based on:
- Thinking/reasoning presence (+0.2)
- Tool usage (+0.15)
- Agentic tools (Task, dispatch) (+0.1)
- Opus/Sonnet model (+0.1/+0.05)
- Response length (+0.1)
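As a sketch, the additive rubric above translates to something like the following. Only the weights come from the list; the struct fields, model-name matching, and the 500-character length threshold are illustrative assumptions, not the crate's API:

```rust
/// Illustrative conversation features; field names are assumptions.
struct Turn {
    has_thinking: bool,
    uses_tools: bool,
    uses_agentic_tools: bool,
    model: String,
    response_chars: usize,
}

fn quality_score(t: &Turn) -> f64 {
    let mut score: f64 = 0.0;
    if t.has_thinking { score += 0.2; }       // thinking/reasoning present
    if t.uses_tools { score += 0.15; }        // any tool usage
    if t.uses_agentic_tools { score += 0.1; } // Task/dispatch
    if t.model.contains("opus") {
        score += 0.1;
    } else if t.model.contains("sonnet") {
        score += 0.05;
    }
    if t.response_chars > 500 { score += 0.1; } // length threshold assumed
    score.min(1.0) // clamp to the documented 0.0-1.0 range
}

fn main() {
    let t = Turn {
        has_thinking: true,
        uses_tools: true,
        uses_agentic_tools: false,
        model: "claude-opus-4".to_string(),
        response_chars: 1200,
    };
    // 0.2 + 0.15 + 0.1 (opus) + 0.1 (length) = 0.55
    println!("score = {:.2}", quality_score(&t));
}
```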
## Architecture

```text
┌─────────────┐     ┌──────────────┐     ┌─────────────────┐
│   Source    │ ──► │  Extractor   │ ──► │   Hanzo Guard   │
│  (URL/PDF)  │     │ (Text Parse) │     │ (Sanitization)  │
└─────────────┘     └──────────────┘     └─────────────────┘
                                                  │
                                                  ▼
                                         ┌─────────────────┐
                                         │  Clean Output   │
                                         │   (LLM-Ready)   │
                                         └─────────────────┘
```
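The pipeline above is a simple composition: raw text comes out of the extractor stage and passes through the guard stage before reaching the caller. A minimal sketch with hypothetical stand-in functions (neither is the crate's or hanzo-guard's real API):

```rust
/// Stand-in for the extractor stage (URL/PDF -> raw text).
fn extract_text(source: &str) -> String {
    format!("Contact us at alice@example.com about {}", source)
}

/// Stand-in for the hanzo-guard stage: redact email-like tokens.
fn sanitize(text: &str) -> String {
    text.split_whitespace()
        .map(|w| if w.contains('@') { "[EMAIL]".to_string() } else { w.to_string() })
        .collect::<Vec<_>>()
        .join(" ")
}

fn main() {
    // Clean output = sanitize(extract(source)).
    let clean = sanitize(&extract_text("document.pdf"));
    assert!(!clean.contains('@'));
    println!("{}", clean);
}
```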
## License

Dual licensed under MIT OR Apache-2.0.

## See Also

- hanzo-guard - LLM I/O sanitization
- Hanzo AI - AI infrastructure platform