# Kreuzcrawl

High-performance Rust web crawling engine for structured data extraction. Scrape, crawl, and map websites with native bindings for 10 languages — same engine, identical results across every runtime.

## Key Features
- Structured extraction — Text, metadata, links, images, assets, JSON-LD, Open Graph, hreflang, favicons, headings, and response headers
- Markdown conversion — Clean Markdown output with citations, document structure, and fit-content mode
- Concurrent crawling — Depth-first, breadth-first, or best-first traversal with configurable depth, page limits, and concurrency
- 10 language bindings — Rust, Python, Node.js, Ruby, Go, Java, C#, PHP, Elixir, and WebAssembly
- Smart filtering — BM25 relevance scoring, URL include/exclude patterns, robots.txt compliance, and sitemap discovery
- Browser rendering — Optional headless browser for JavaScript-heavy SPAs with WAF detection and bypass
- Batch operations — Scrape or crawl hundreds of URLs concurrently with partial failure handling
- Streaming — Real-time crawl events via async streams for progress tracking
- Authentication — HTTP Basic, Bearer token, and custom header auth with persistent cookie jars
- Rate limiting — Per-domain request throttling with configurable delays
- Asset download — Download, deduplicate, and filter images, documents, and other linked assets
- MCP server — Model Context Protocol integration for AI agents
- REST API — HTTP server with OpenAPI spec
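The three traversal strategies above differ only in how the crawl frontier is ordered: a FIFO queue for breadth-first, a LIFO stack for depth-first, and a priority queue for best-first. A minimal, library-independent Python sketch (the toy link graph and `score` function here are illustrative, not part of kreuzcrawl's API):

```python
import heapq
from collections import deque

# Toy link graph standing in for fetched pages (illustrative only).
GRAPH = {
    "/": ["/docs", "/blog"],
    "/docs": ["/docs/api"],
    "/blog": ["/blog/post-1"],
    "/docs/api": [],
    "/blog/post-1": [],
}

def crawl(start, strategy="bfs", max_pages=10, score=len):
    """Visit pages from `start`, ordering the frontier per `strategy`."""
    visited, order = set(), []
    if strategy == "best":
        frontier = [(-score(start), start)]  # max-heap via negated score
    else:
        frontier = deque([start])
    while frontier and len(order) < max_pages:
        if strategy == "bfs":
            url = frontier.popleft()            # oldest first: breadth-first
        elif strategy == "dfs":
            url = frontier.pop()                # newest first: depth-first
        else:
            _, url = heapq.heappop(frontier)    # highest score first
        if url in visited:
            continue
        visited.add(url)
        order.append(url)
        for link in GRAPH.get(url, []):
            if link not in visited:
                if strategy == "best":
                    heapq.heappush(frontier, (-score(link), link))
                else:
                    frontier.append(link)
    return order
```

`max_pages` plays the role of the page limit mentioned above; a real crawler would additionally cap depth and bound concurrency.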
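The robots.txt compliance listed under smart filtering can be reproduced with Python's standard library alone, independent of kreuzcrawl (the rules below are a made-up example):

```python
from urllib.robotparser import RobotFileParser

# Parse a robots.txt body directly; a crawler would fetch it from
# https://<host>/robots.txt before requesting any other page.
rp = RobotFileParser()
rp.parse([
    "User-agent: *",
    "Allow: /private/press/",
    "Disallow: /private/",
])

blocked = rp.can_fetch("kreuzcrawl", "https://example.com/private/a")
allowed = rp.can_fetch("kreuzcrawl", "https://example.com/private/press/")
```

Note that Python's parser applies the first matching rule, so the more specific `Allow` line must precede the broader `Disallow`.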
## Installation
| Language | Package | Install |
|---|---|---|
| Python | kreuzcrawl | pip install kreuzcrawl |
| Node.js | @kreuzberg/kreuzcrawl | npm install @kreuzberg/kreuzcrawl |
| Rust | kreuzcrawl | cargo add kreuzcrawl |
| Go | pkg.go.dev | go get github.com/kreuzberg-dev/kreuzcrawl/packages/go |
| Java | Maven Central | See README |
| C# | NuGet | dotnet add package Kreuzcrawl |
| Ruby | kreuzcrawl | gem install kreuzcrawl |
| PHP | kreuzberg-dev/kreuzcrawl | composer require kreuzberg-dev/kreuzcrawl |
| Elixir | kreuzcrawl | {:kreuzcrawl, "~> 0.1"} |
| WASM | @kreuzberg/kreuzcrawl-wasm | npm install @kreuzberg/kreuzcrawl-wasm |
| C FFI | GitHub Releases | C header + shared library |
| CLI | crates.io | cargo install kreuzcrawl-cli |
## Quick Start
### Python — Full docs

```python
from kreuzcrawl import create_engine, scrape

engine = create_engine()
result = scrape(engine, "https://example.com")

print(result.metadata.title)
print(result.markdown.content)
print(len(result.links))
```
### Node.js / TypeScript — Full docs

```typescript
import { createEngine, scrape } from "@kreuzberg/kreuzcrawl";

const engine = createEngine();
const result = await scrape(engine, "https://example.com");

console.log(result.metadata.title);
console.log(result.markdown.content);
console.log(result.links.length);
```
### Rust — Full docs

```rust
let engine = kreuzcrawl::create_engine(None)?;
let result = kreuzcrawl::scrape(&engine, "https://example.com").await?;

println!("{}", result.metadata.title);
println!("{}", result.markdown.content);
println!("{}", result.links.len());
```
### Go — Full docs

```go
engine, _ := kcrawl.CreateEngine()
result, _ := kcrawl.Scrape(engine, "https://example.com")

fmt.Println(result.Metadata.Title)
fmt.Println(result.Markdown.Content)
fmt.Println(len(result.Links))
```
### Java — Full docs

```java
var engine = Kreuzcrawl.createEngine(null);
var result = Kreuzcrawl.scrape(engine, "https://example.com");

System.out.println(result.metadata().title());
System.out.println(result.markdown().content());
System.out.println(result.links().size());
```
### C# — Full docs

```csharp
var engine = KreuzcrawlLib.CreateEngine(null);
var result = await KreuzcrawlLib.Scrape(engine, "https://example.com");

Console.WriteLine(result.Metadata.Title);
Console.WriteLine(result.Markdown.Content);
Console.WriteLine(result.Links.Count);
```
### Ruby — Full docs

```ruby
engine = Kreuzcrawl.create_engine(nil)
result = Kreuzcrawl.scrape(engine, "https://example.com")

puts result.metadata.title
puts result.markdown.content
puts result.links.length
```
### PHP — Full docs

```php
$engine = Kreuzcrawl::createEngine(null);
$result = Kreuzcrawl::scrape($engine, "https://example.com");

echo $result->metadata->title;
echo $result->markdown->content;
echo count($result->links);
```
### Elixir — Full docs

```elixir
{:ok, engine} = Kreuzcrawl.create_engine(nil)
{:ok, result} = Kreuzcrawl.scrape(engine, "https://example.com")

IO.puts(result.metadata.title)
IO.puts(result.markdown.content)
IO.puts(length(result.links))
```
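The batch operations listed under Key Features scrape many URLs concurrently while tolerating partial failures. The underlying pattern, sketched here with plain `asyncio` and a stubbed fetch (not kreuzcrawl's actual API), is to collect exceptions alongside successes instead of letting one error abort the batch:

```python
import asyncio

async def fetch(url):
    # Stub standing in for a real scrape call (illustrative only).
    if "bad" in url:
        raise ValueError(f"failed: {url}")
    return f"content of {url}"

async def scrape_batch(urls):
    # return_exceptions=True keeps one failure from cancelling the rest.
    results = await asyncio.gather(
        *(fetch(u) for u in urls), return_exceptions=True
    )
    ok = {u: r for u, r in zip(urls, results) if not isinstance(r, Exception)}
    failed = {u: r for u, r in zip(urls, results) if isinstance(r, Exception)}
    return ok, failed

ok, failed = asyncio.run(scrape_batch(["https://a.test", "https://bad.test"]))
```

The caller then decides whether to retry the failed subset or report it alongside the successful results.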
## Platform Support
| Language | Linux x86_64 | Linux aarch64 | macOS ARM64 | Windows x64 |
|---|---|---|---|---|
| Python | ✅ | ✅ | ✅ | ✅ |
| Node.js | ✅ | ✅ | ✅ | ✅ |
| WASM | ✅ | ✅ | ✅ | ✅ |
| Ruby | ✅ | ✅ | ✅ | — |
| Elixir | ✅ | ✅ | ✅ | ✅ |
| Go | ✅ | ✅ | ✅ | ✅ |
| Java | ✅ | ✅ | ✅ | ✅ |
| C# | ✅ | ✅ | ✅ | ✅ |
| PHP | ✅ | ✅ | ✅ | ✅ |
| Rust | ✅ | ✅ | ✅ | ✅ |
| C (FFI) | ✅ | ✅ | ✅ | ✅ |
| CLI | ✅ | ✅ | ✅ | ✅ |
## Architecture

```text
Your Application (Python, Node.js, Ruby, Java, Go, C#, PHP, Elixir, ...)
        │
Language Bindings (PyO3, NAPI-RS, Magnus, ext-php-rs, Rustler, cgo, Panama, P/Invoke)
        │
Rust Core Engine (async, concurrent, SIMD-optimized)
        │
        ├── HTTP Client (reqwest + tower middleware stack)
        ├── HTML Parser (html5ever + lol_html)
        ├── Markdown Converter (html-to-markdown-rs)
        ├── Content Extraction (metadata, JSON-LD, Open Graph, readability)
        ├── Link Discovery (robots.txt, sitemaps, anchor analysis)
        └── Browser Rendering (optional headless Chrome/Firefox)
```
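The per-domain rate limiting listed under Key Features amounts to tracking the last request time per host and sleeping off any remaining delay. A minimal synchronous sketch, independent of the engine's actual middleware stack (class name and delay are illustrative):

```python
import time
from urllib.parse import urlparse

class DomainThrottle:
    """Enforce a minimum delay between requests to the same host."""

    def __init__(self, delay_seconds):
        self.delay = delay_seconds
        self.last = {}  # host -> monotonic timestamp of last request

    def wait(self, url):
        host = urlparse(url).netloc
        previous = self.last.get(host)
        if previous is not None:
            remaining = self.delay - (time.monotonic() - previous)
            if remaining > 0:
                time.sleep(remaining)  # throttle only repeat visits to a host
        self.last[host] = time.monotonic()

throttle = DomainThrottle(delay_seconds=0.05)
start = time.monotonic()
for url in ["https://a.test/1", "https://a.test/2", "https://b.test/1"]:
    throttle.wait(url)
elapsed = time.monotonic() - start
```

Only the second `a.test` request is delayed; `b.test` proceeds immediately because throttling is keyed per domain, not globally.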
## Contributing
Contributions are welcome! See our Contributing Guide.
## License