
Hanzo Extract


Content extraction with built-in sanitization for LLM applications.

Features

  • Web Extraction: Fetch and extract clean text from web pages
  • PDF Extraction: Extract text from PDF documents
  • Conversation Extraction: Export Claude Code sessions for training datasets
  • Sanitization: Automatic PII redaction via hanzo-guard

Installation

cargo add hanzo-extract

Or add to Cargo.toml:

[dependencies]
hanzo-extract = "0.1"

Quick Start

Web Extraction

use hanzo_extract::{WebExtractor, ExtractorConfig, Extractor};

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let extractor = WebExtractor::new(ExtractorConfig::default());
    let result = extractor.extract("https://example.com").await?;

    println!("Title: {:?}", result.title);
    println!("Text: {}", result.text);
    println!("Words: {}", result.word_count);

    Ok(())
}

PDF Extraction

use hanzo_extract::{PdfExtractor, Extractor};

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let extractor = PdfExtractor::default();
    let result = extractor.extract("document.pdf").await?;

    println!("Text: {}", result.text);
    Ok(())
}

Conversation Extraction

Extract Claude Code conversations for AI training:

use hanzo_extract::conversations::{ConversationExporter, ExporterConfig};
use std::path::Path;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let mut exporter = ConversationExporter::new();

    // Note: Rust does not expand `~` in paths; pass an absolute
    // path or expand the home directory yourself first.
    exporter.export(
        Path::new("~/.claude/projects"),
        Path::new("./training-data"),
    )?;

    Ok(())
}

CLI Tools

extract-web

# Install
cargo install hanzo-extract --features web

# Usage
extract-web https://example.com
extract-web https://example.com --json

extract-conversations

# Install
cargo install hanzo-extract --features conversations

# Usage
extract-conversations --source ~/.claude/projects --output ./conversations

# Options
extract-conversations --help

Feature Flags

Feature        Default  Description
-------------  -------  -----------------------------------
web            Yes      Web page extraction
pdf            Yes      PDF document extraction
sanitize       Yes      PII redaction via hanzo-guard
conversations  No       Claude Code conversation extraction

# Minimal (no extraction)
hanzo-extract = { version = "0.1", default-features = false }

# Web only
hanzo-extract = { version = "0.1", default-features = false, features = ["web"] }

# Full (all features)
hanzo-extract = { version = "0.1", features = ["full"] }

Conversation Export

The conversation extractor creates training datasets from Claude Code sessions:

~/.claude/projects/
├── project-a/
│   ├── session1.jsonl
│   └── session2.jsonl
└── project-b/
    └── session.jsonl

↓ extract-conversations

./conversations/
├── conversations_20251222.jsonl  # Full data
├── training_20251222.jsonl       # Instruction/response format
└── splits/
    ├── train_20251222.jsonl      # 80%
    ├── val_20251222.jsonl        # 10%
    └── test_20251222.jsonl       # 10%

Features

  • Extracts user/assistant conversation turns
  • Anonymizes paths, secrets, emails, API keys
  • Calculates quality scores (0.0-1.0)
  • Creates reproducible train/val/test splits
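
One way to make train/val/test splits reproducible is to derive the assignment from a hash of the conversation ID instead of a random draw. This is a minimal sketch of that idea, not the crate's actual algorithm; the 80/10/10 ratio matches the split layout shown above.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Deterministically assign a conversation ID to a split.
/// `DefaultHasher::new()` uses fixed keys, so the assignment is
/// stable across runs on the same Rust release.
fn assign_split(id: &str) -> &'static str {
    let mut hasher = DefaultHasher::new();
    id.hash(&mut hasher);
    match hasher.finish() % 10 {
        0..=7 => "train", // 80%
        8 => "val",       // 10%
        _ => "test",      // 10%
    }
}

fn main() {
    for id in ["session1", "session2", "session3"] {
        println!("{id} -> {}", assign_split(id));
    }
}
```

Because the split is a pure function of the ID, re-running the exporter on the same sessions yields the same partition.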

Quality Scoring

Conversations are scored based on:

  • Thinking/reasoning presence (+0.2)
  • Tool usage (+0.15)
  • Agentic tools (Task, dispatch) (+0.1)
  • Opus/Sonnet model (+0.1/+0.05)
  • Response length (+0.1)
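
The bonuses above compose additively. A minimal sketch of such a scorer, using illustrative field names and an assumed length threshold (the crate's actual types and cutoffs may differ):

```rust
/// Illustrative conversation metadata; not the crate's real type.
struct Conversation {
    has_thinking: bool,
    uses_tools: bool,
    uses_agentic_tools: bool,
    model: String,
    response_chars: usize,
}

/// Sum the heuristic bonuses listed above, capped at 1.0.
fn quality_score(c: &Conversation) -> f64 {
    let mut score: f64 = 0.0;
    if c.has_thinking { score += 0.2; }
    if c.uses_tools { score += 0.15; }
    if c.uses_agentic_tools { score += 0.1; }
    if c.model.contains("opus") {
        score += 0.1;
    } else if c.model.contains("sonnet") {
        score += 0.05;
    }
    // Length threshold is an assumption for this sketch.
    if c.response_chars > 1000 { score += 0.1; }
    score.min(1.0)
}

fn main() {
    let c = Conversation {
        has_thinking: true,
        uses_tools: true,
        uses_agentic_tools: true,
        model: "claude-opus".into(),
        response_chars: 2048,
    };
    println!("score = {:.2}", quality_score(&c));
}
```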

Architecture

┌─────────────┐     ┌──────────────┐     ┌─────────────────┐
│   Source    │ ──► │  Extractor   │ ──► │  Hanzo Guard    │
│ (URL/PDF)   │     │ (Text Parse) │     │ (Sanitization)  │
└─────────────┘     └──────────────┘     └─────────────────┘
                                                  │
                                                  ▼
                                         ┌─────────────────┐
                                         │  Clean Output   │
                                         │ (LLM-Ready)     │
                                         └─────────────────┘
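
The extract-then-sanitize flow in the diagram can be sketched as a two-stage pipeline. The `Sanitizer` trait and `EmailRedactor` below are stand-ins for illustration; the real crate delegates sanitization to hanzo-guard, which covers far more PII classes than this toy redactor.

```rust
// Stand-in for the sanitization stage (hanzo-guard in the real crate).
trait Sanitizer {
    fn redact(&self, text: &str) -> String;
}

struct EmailRedactor;

impl Sanitizer for EmailRedactor {
    fn redact(&self, text: &str) -> String {
        // Naive token-level check, purely for illustration.
        text.split_whitespace()
            .map(|w| if w.contains('@') { "[EMAIL]" } else { w })
            .collect::<Vec<_>>()
            .join(" ")
    }
}

/// Extractor output (raw text) passes through the guard before
/// reaching the LLM-ready output stage, as in the diagram above.
fn pipeline(raw: &str, guard: &dyn Sanitizer) -> String {
    guard.redact(raw)
}

fn main() {
    let clean = pipeline("contact bob@example.com now", &EmailRedactor);
    println!("{clean}");
}
```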

License

Dual licensed under MIT OR Apache-2.0.
