Kreuzcrawl

High-performance Rust web crawling engine for structured data extraction. Scrape, crawl, and map websites with native bindings for 10 languages — same engine, identical results across every runtime.

Key Features

  • Structured extraction — Text, metadata, links, images, assets, JSON-LD, Open Graph, hreflang, favicons, headings, and response headers
  • Markdown conversion — Clean Markdown output with citations, document structure, and fit-content mode
  • Concurrent crawling — Depth-first, breadth-first, or best-first traversal with configurable depth, page limits, and concurrency
  • 10 language bindings — Rust, Python, Node.js, Ruby, Go, Java, C#, PHP, Elixir, and WebAssembly
  • Smart filtering — BM25 relevance scoring, URL include/exclude patterns, robots.txt compliance, and sitemap discovery
  • Browser rendering — Optional headless browser for JavaScript-heavy SPAs with WAF detection and bypass
  • Batch operations — Scrape or crawl hundreds of URLs concurrently with partial failure handling
  • Streaming — Real-time crawl events via async streams for progress tracking
  • Authentication — HTTP Basic, Bearer token, and custom header auth with persistent cookie jars
  • Rate limiting — Per-domain request throttling with configurable delays
  • Asset download — Download, deduplicate, and filter images, documents, and other linked assets
  • MCP server — Model Context Protocol integration for AI agents
  • REST API — HTTP server with OpenAPI spec
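The depth-first, breadth-first, and best-first strategies above differ only in how the crawl frontier orders pending URLs. A minimal, self-contained sketch of that idea (the `Frontier` type and `crawl_order` function are illustrative names, not part of the kreuzcrawl API):

```rust
use std::cmp::Reverse;
use std::collections::{BinaryHeap, VecDeque};

/// Illustrative crawl frontier: each variant stores (depth, url) entries
/// and pops them in a different order.
enum Frontier {
    Dfs(Vec<(usize, String)>),                            // LIFO stack
    Bfs(VecDeque<(usize, String)>),                       // FIFO queue
    BestFirst(BinaryHeap<(u32, Reverse<usize>, String)>), // highest score first
}

impl Frontier {
    fn push(&mut self, depth: usize, score: u32, url: String) {
        match self {
            Frontier::Dfs(s) => s.push((depth, url)),
            Frontier::Bfs(q) => q.push_back((depth, url)),
            Frontier::BestFirst(h) => h.push((score, Reverse(depth), url)),
        }
    }

    fn pop(&mut self) -> Option<(usize, String)> {
        match self {
            Frontier::Dfs(s) => s.pop(),
            Frontier::Bfs(q) => q.pop_front(),
            Frontier::BestFirst(h) => h.pop().map(|(_, Reverse(d), u)| (d, u)),
        }
    }
}

/// Drain the frontier while enforcing depth and page limits.
fn crawl_order(mut frontier: Frontier, max_depth: usize, max_pages: usize) -> Vec<String> {
    let mut visited = Vec::new();
    while let Some((depth, url)) = frontier.pop() {
        if visited.len() >= max_pages {
            break; // page limit reached
        }
        if depth > max_depth {
            continue; // too deep, skip
        }
        visited.push(url);
    }
    visited
}

fn main() {
    let mut bfs = Frontier::Bfs(VecDeque::new());
    bfs.push(0, 0, "https://a.example".to_string());
    bfs.push(1, 0, "https://b.example".to_string());
    println!("{:?}", crawl_order(bfs, 2, 10));
}
```

Swapping the frontier type changes the traversal order without touching the rest of the crawl loop, which is why the three strategies can share one engine.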

Documentation | API Reference

Installation

| Language | Package | Install |
|----------|---------|---------|
| Python | kreuzcrawl | `pip install kreuzcrawl` |
| Node.js | @kreuzberg/kreuzcrawl | `npm install @kreuzberg/kreuzcrawl` |
| Rust | kreuzcrawl | `cargo add kreuzcrawl` |
| Go | pkg.go.dev | `go get github.com/kreuzberg-dev/kreuzcrawl/packages/go` |
| Java | Maven Central | See README |
| C# | NuGet | `dotnet add package Kreuzcrawl` |
| Ruby | kreuzcrawl | `gem install kreuzcrawl` |
| PHP | kreuzberg-dev/kreuzcrawl | `composer require kreuzberg-dev/kreuzcrawl` |
| Elixir | kreuzcrawl | `{:kreuzcrawl, "~> 0.1"}` |
| WASM | @kreuzberg/kreuzcrawl-wasm | `npm install @kreuzberg/kreuzcrawl-wasm` |
| C FFI | GitHub Releases | C header + shared library |
| CLI | crates.io | `cargo install kreuzcrawl-cli` |

Quick Start

Python (Full docs)
from kreuzcrawl import create_engine, scrape

engine = create_engine()
result = scrape(engine, "https://example.com")

print(result.metadata.title)
print(result.markdown.content)
print(len(result.links))
Node.js / TypeScript (Full docs)
import { createEngine, scrape } from "@kreuzberg/kreuzcrawl";

const engine = createEngine();
const result = await scrape(engine, "https://example.com");

console.log(result.metadata.title);
console.log(result.markdown.content);
console.log(result.links.length);
Rust (Full docs)
let engine = kreuzcrawl::create_engine(None)?;
let result = kreuzcrawl::scrape(&engine, "https://example.com").await?;

println!("{}", result.metadata.title);
println!("{}", result.markdown.content);
println!("{}", result.links.len());
Go (Full docs)
engine, _ := kcrawl.CreateEngine()
result, _ := kcrawl.Scrape(engine, "https://example.com")

fmt.Println(result.Metadata.Title)
fmt.Println(result.Markdown.Content)
fmt.Println(len(result.Links))
Java (Full docs)
var engine = Kreuzcrawl.createEngine(null);
var result = Kreuzcrawl.scrape(engine, "https://example.com");

System.out.println(result.metadata().title());
System.out.println(result.markdown().content());
System.out.println(result.links().size());
C# (Full docs)
var engine = KreuzcrawlLib.CreateEngine(null);
var result = await KreuzcrawlLib.Scrape(engine, "https://example.com");

Console.WriteLine(result.Metadata.Title);
Console.WriteLine(result.Markdown.Content);
Console.WriteLine(result.Links.Count);
Ruby (Full docs)
engine = Kreuzcrawl.create_engine(nil)
result = Kreuzcrawl.scrape(engine, "https://example.com")

puts result.metadata.title
puts result.markdown.content
puts result.links.length
PHP (Full docs)
$engine = Kreuzcrawl::createEngine(null);
$result = Kreuzcrawl::scrape($engine, "https://example.com");

echo $result->metadata->title;
echo $result->markdown->content;
echo count($result->links);
Elixir (Full docs)
{:ok, engine} = Kreuzcrawl.create_engine(nil)
{:ok, result} = Kreuzcrawl.scrape(engine, "https://example.com")

IO.puts(result.metadata.title)
IO.puts(result.markdown.content)
IO.puts(length(result.links))
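The per-domain rate limiting listed under Key Features boils down to remembering when each domain was last hit and delaying the next request accordingly. A minimal std-only sketch of that bookkeeping (`DomainThrottle` and `reserve` are illustrative names, not the kreuzcrawl API):

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

/// Illustrative per-domain throttle: tracks the next allowed request time
/// for each domain and tells the caller how long to wait.
struct DomainThrottle {
    delay: Duration,
    next_allowed: HashMap<String, Instant>,
}

impl DomainThrottle {
    fn new(delay: Duration) -> Self {
        Self { delay, next_allowed: HashMap::new() }
    }

    /// Returns the remaining wait before `domain` may be requested,
    /// then reserves the slot for this request.
    fn reserve(&mut self, domain: &str, now: Instant) -> Duration {
        let wait = match self.next_allowed.get(domain) {
            // Time still owed to this domain, if any.
            Some(&at) => at.saturating_duration_since(now),
            None => Duration::ZERO,
        };
        // The request fires at now + wait; the next one must wait `delay` more.
        self.next_allowed.insert(domain.to_string(), now + wait + self.delay);
        wait
    }
}

fn main() {
    let mut throttle = DomainThrottle::new(Duration::from_millis(500));
    let now = Instant::now();
    println!("{:?}", throttle.reserve("example.com", now));
    println!("{:?}", throttle.reserve("example.com", now));
}
```

Because the map is keyed by domain, a slow site never stalls requests to other domains, which is what makes per-domain (rather than global) throttling useful in a concurrent crawl.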

Platform Support

Prebuilt packages cover Linux x86_64, Linux aarch64, macOS ARM64, and Windows x64 across the Python, Node.js, WASM, Ruby, Elixir, Go, Java, C#, PHP, Rust, C (FFI), and CLI distributions.

Architecture

Your Application (Python, Node.js, Ruby, Java, Go, C#, PHP, Elixir, ...)
    │
Language Bindings (PyO3, NAPI-RS, Magnus, ext-php-rs, Rustler, cgo, Panama, P/Invoke)
    │
Rust Core Engine (async, concurrent, SIMD-optimized)
    │
    ├── HTTP Client (reqwest + tower middleware stack)
    ├── HTML Parser (html5ever + lol_html)
    ├── Markdown Converter (html-to-markdown-rs)
    ├── Content Extraction (metadata, JSON-LD, Open Graph, readability)
    ├── Link Discovery (robots.txt, sitemaps, anchor analysis)
    └── Browser Rendering (optional headless Chrome/Firefox)
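Best-first traversal and the smart-filtering layer rank pages by BM25 relevance. For reference, a single BM25 term contributes `idf * tf*(k1+1) / (tf + k1*(1 - b + b*len/avg_len))` to a document's score; the sketch below implements that standard formula with the usual defaults and is an illustration, not the engine's actual scorer:

```rust
/// One term's BM25 contribution, with the conventional defaults
/// k1 = 1.2 and b = 0.75. `doc_len` / `avg_len` normalizes for
/// document length so long pages aren't unfairly favored.
fn bm25_term(tf: f64, idf: f64, doc_len: f64, avg_len: f64) -> f64 {
    let k1 = 1.2;
    let b = 0.75;
    idf * (tf * (k1 + 1.0)) / (tf + k1 * (1.0 - b + b * doc_len / avg_len))
}

fn main() {
    // Term frequency has diminishing returns: tf saturates as it grows.
    println!("{:.3}", bm25_term(1.0, 1.0, 100.0, 100.0));
    println!("{:.3}", bm25_term(2.0, 1.0, 100.0, 100.0));
}
```

Summing this over the query terms gives a page score, which a best-first frontier can use to fetch the most relevant links first.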

Contributing

Contributions are welcome! See our Contributing Guide.

License

Elastic License 2.0
