
Hanzo Extract


Content extraction with built-in sanitization for LLM applications.

Features

  • Web Extraction: Fetch and extract clean text from web pages
  • PDF Extraction: Extract text from PDF documents
  • Conversation Extraction: Export Claude Code sessions for training datasets
  • Sanitization: Automatic PII redaction via hanzo-guard

Installation

cargo add hanzo-extract

Or add to Cargo.toml:

[dependencies]
hanzo-extract = "0.1"

Quick Start

Web Extraction

use hanzo_extract::{WebExtractor, ExtractorConfig, Extractor};

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let extractor = WebExtractor::new(ExtractorConfig::default());
    let result = extractor.extract("https://example.com").await?;

    println!("Title: {:?}", result.title);
    println!("Text: {}", result.text);
    println!("Words: {}", result.word_count);

    Ok(())
}

PDF Extraction

use hanzo_extract::{PdfExtractor, Extractor};

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let extractor = PdfExtractor::default();
    let result = extractor.extract("document.pdf").await?;

    println!("Text: {}", result.text);
    Ok(())
}

Conversation Extraction

Extract Claude Code conversations for AI training:

use hanzo_extract::conversations::{ConversationExporter, ExporterConfig};
use std::path::Path;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let mut exporter = ConversationExporter::new();

    // Note: Rust does not expand `~` in paths; pass an absolute
    // path or expand the home directory yourself first.
    exporter.export(
        Path::new("~/.claude/projects"),
        Path::new("./training-data"),
    )?;

    Ok(())
}

CLI Tools

extract-web

# Install
cargo install hanzo-extract --features web

# Usage
extract-web https://example.com
extract-web https://example.com --json

extract-conversations

# Install
cargo install hanzo-extract --features conversations

# Usage
extract-conversations --source ~/.claude/projects --output ./conversations

# Options
extract-conversations --help

Feature Flags

Feature        Default  Description
-------------  -------  -----------------------------------
web            Yes      Web page extraction
pdf            Yes      PDF document extraction
sanitize       Yes      PII redaction via hanzo-guard
conversations  No       Claude Code conversation extraction

# Minimal (no extraction)
hanzo-extract = { version = "0.1", default-features = false }

# Web only
hanzo-extract = { version = "0.1", default-features = false, features = ["web"] }

# Full (all features)
hanzo-extract = { version = "0.1", features = ["full"] }

Conversation Export

The conversation extractor creates training datasets from Claude Code sessions:

~/.claude/projects/
├── project-a/
│   ├── session1.jsonl
│   └── session2.jsonl
└── project-b/
    └── session.jsonl

↓ extract-conversations

./conversations/
├── conversations_20251222.jsonl  # Full data
├── training_20251222.jsonl       # Instruction/response format
└── splits/
    ├── train_20251222.jsonl      # 80%
    ├── val_20251222.jsonl        # 10%
    └── test_20251222.jsonl       # 10%

Features

  • Extracts user/assistant conversation turns
  • Anonymizes paths, secrets, emails, API keys
  • Calculates quality scores (0.0-1.0)
  • Creates reproducible train/val/test splits
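
One way to make train/val/test splits reproducible is to derive the assignment from a hash of the conversation ID instead of a random draw. This is a minimal sketch of that idea, not the crate's actual algorithm; the 80/10/10 ratio matches the split layout shown above.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Deterministically assign a conversation ID to a split.
/// `DefaultHasher::new()` uses fixed keys, so the assignment is
/// stable across runs on the same Rust release.
fn assign_split(id: &str) -> &'static str {
    let mut hasher = DefaultHasher::new();
    id.hash(&mut hasher);
    match hasher.finish() % 10 {
        0..=7 => "train", // 80%
        8 => "val",       // 10%
        _ => "test",      // 10%
    }
}

fn main() {
    for id in ["session1", "session2", "session3"] {
        println!("{id} -> {}", assign_split(id));
    }
}
```

Because the split is a pure function of the ID, re-running the exporter on the same sessions yields the same partition.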

Quality Scoring

Conversations are scored based on:

  • Thinking/reasoning presence (+0.2)
  • Tool usage (+0.15)
  • Agentic tools (Task, dispatch) (+0.1)
  • Opus/Sonnet model (+0.1/+0.05)
  • Response length (+0.1)
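
The bonuses above compose additively. A minimal sketch of such a scorer, using illustrative field names and an assumed length threshold (the crate's actual types and cutoffs may differ):

```rust
/// Illustrative conversation metadata; not the crate's real type.
struct Conversation {
    has_thinking: bool,
    uses_tools: bool,
    uses_agentic_tools: bool,
    model: String,
    response_chars: usize,
}

/// Sum the heuristic bonuses listed above, capped at 1.0.
fn quality_score(c: &Conversation) -> f64 {
    let mut score: f64 = 0.0;
    if c.has_thinking { score += 0.2; }
    if c.uses_tools { score += 0.15; }
    if c.uses_agentic_tools { score += 0.1; }
    if c.model.contains("opus") {
        score += 0.1;
    } else if c.model.contains("sonnet") {
        score += 0.05;
    }
    // Length threshold is an assumption for this sketch.
    if c.response_chars > 1000 { score += 0.1; }
    score.min(1.0)
}

fn main() {
    let c = Conversation {
        has_thinking: true,
        uses_tools: true,
        uses_agentic_tools: true,
        model: "claude-opus".into(),
        response_chars: 2048,
    };
    println!("score = {:.2}", quality_score(&c));
}
```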

Architecture

┌─────────────┐     ┌──────────────┐     ┌─────────────────┐
│   Source    │ ──► │  Extractor   │ ──► │  Hanzo Guard    │
│ (URL/PDF)   │     │ (Text Parse) │     │ (Sanitization)  │
└─────────────┘     └──────────────┘     └─────────────────┘
                                                  │
                                                  ▼
                                         ┌─────────────────┐
                                         │  Clean Output   │
                                         │ (LLM-Ready)     │
                                         └─────────────────┘
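
The extract-then-sanitize flow in the diagram can be sketched as a two-stage pipeline. The `Sanitizer` trait and `EmailRedactor` below are stand-ins for illustration; the real crate delegates sanitization to hanzo-guard, which covers far more PII classes than this toy redactor.

```rust
// Stand-in for the sanitization stage (hanzo-guard in the real crate).
trait Sanitizer {
    fn redact(&self, text: &str) -> String;
}

struct EmailRedactor;

impl Sanitizer for EmailRedactor {
    fn redact(&self, text: &str) -> String {
        // Naive token-level check, purely for illustration.
        text.split_whitespace()
            .map(|w| if w.contains('@') { "[EMAIL]" } else { w })
            .collect::<Vec<_>>()
            .join(" ")
    }
}

/// Extractor output (raw text) passes through the guard before
/// reaching the LLM-ready output stage, as in the diagram above.
fn pipeline(raw: &str, guard: &dyn Sanitizer) -> String {
    guard.redact(raw)
}

fn main() {
    let clean = pipeline("contact bob@example.com now", &EmailRedactor);
    println!("{clean}");
}
```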

License

Dual licensed under MIT OR Apache-2.0.
