fast-robots

A zero-copy robots.txt parser for Rust with SIMD-accelerated byte scanning, RFC 9309 access checks, feature-gated extension metadata, and a tiny argh CLI.

Disclaimer: I can't design. This logo was generated using ChatGPT.

Motivation

robots.txt is line-oriented and byte-oriented. That makes a hand-rolled parser a better fit than a big parser-combinator stack: fewer allocations, direct control over error recovery, and the hot path stays obvious.

The goal is simple: parse the standardized rules correctly, preserve useful ecosystem metadata like Sitemap and Crawl-delay, and use memchr where delimiter scanning actually matters.

Features

- Zero-copy parsing: parsed agents, rules, and extension values borrow from the original input.
- SIMD-backed scanning: line splitting, comments, directive separators, and wildcard matching use memchr/memmem primitives.
- RFC 9309 core:
  - User-agent
  - Allow
  - Disallow
  - # comments
  - * wildcard matching
  - $ end-anchor matching
- Correct access semantics:
  - matching groups are merged
  - * fallback group is used only when no exact user-agent group matches
  - longest matching rule wins
  - Allow wins ties
  - empty Disallow: does not block anything
  - /robots.txt is implicitly allowed
- Feature-gated extensions: Sitemap, Crawl-delay, Host, Clean-param, and unknown directives are collected behind the extensions feature.
- CLI included: inspect parsed files and check whether a path is allowed from the terminal.
- Small dependency surface: runtime dependencies are currently memchr and argh.
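
The access semantics listed above can be illustrated with a small standalone sketch. It uses only the standard library and is not the crate's implementation: the `Rule` type and the `pattern_matches` and `is_allowed` functions are hypothetical names chosen for this example, and it assumes the rules of all matching groups have already been merged into one slice.

```rust
/// Hypothetical rule type for this sketch; not the crate's actual API.
#[derive(Clone, Copy)]
pub struct Rule<'a> {
    pub allow: bool,
    pub pattern: &'a str,
}

/// Match `path` against a pattern that may contain `*` wildcards
/// and an optional trailing `$` end anchor.
pub fn pattern_matches(pattern: &str, path: &str) -> bool {
    let (body, anchored) = match pattern.strip_suffix('$') {
        Some(rest) => (rest, true),
        None => (pattern, false),
    };
    let segments: Vec<&str> = body.split('*').collect();
    let (first, tail) = segments
        .split_first()
        .expect("split always yields at least one segment");

    // The literal before the first `*` must be a prefix of the path.
    let Some(mut rest) = path.strip_prefix(*first) else {
        return false;
    };
    if tail.is_empty() {
        // No wildcard: plain prefix match, or exact match when anchored.
        return !anchored || rest.is_empty();
    }

    // Middle literals are matched leftmost-first, leaving as much of the
    // path as possible for the segments that follow.
    let (last, middle) = tail.split_last().expect("tail is non-empty");
    for seg in middle {
        match rest.find(*seg) {
            Some(i) => rest = &rest[i + seg.len()..],
            None => return false,
        }
    }
    if anchored {
        rest.ends_with(*last)
    } else {
        rest.contains(*last)
    }
}

/// Longest matching pattern wins; Allow wins ties; no match means allowed.
pub fn is_allowed(rules: &[Rule<'_>], path: &str) -> bool {
    if path == "/robots.txt" {
        return true; // implicitly allowed
    }
    let mut best: Option<(usize, bool)> = None;
    for rule in rules {
        // Empty patterns (such as an empty `Disallow:`) are skipped.
        if rule.pattern.is_empty() || !pattern_matches(rule.pattern, path) {
            continue;
        }
        // Tuple ordering compares length first, then `true > false`,
        // so Allow beats Disallow at equal length.
        let candidate = (rule.pattern.len(), rule.allow);
        if best.map_or(true, |b| candidate > b) {
            best = Some(candidate);
        }
    }
    best.map_or(true, |(_, allow)| allow)
}
```

With `Disallow: /private/` and `Allow: /private/public/`, this sketch blocks `/private/file.html` and permits `/private/public/file.html`, because the longer Allow pattern is more specific.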

Performance

fast-robots is fast enough that parsing is rarely the bottleneck:

~1–2 GiB/s parse throughput on Apple M1 (native CPU tuning, benchmark-only mimalloc).
~4–9× faster than robotstxt (the Rust port of Google's robots.txt parser) on end-to-end parse + match workloads.
~270–380× faster repeated matching with the opt-in compiled matcher on generated rule-heavy fixtures.

See BENCHMARK.md for full methodology, fixtures, and environment details.

Installation

Add this to your Cargo.toml:

[dependencies]
fast-robots = "0.1.0"

The extensions feature is enabled by default. To build without it, disable default features:

[dependencies]
fast-robots = { version = "0.1.0", default-features = false }

Usage

use fast_robots::RobotsTxt;

let input = r#"
User-agent: *
Disallow: /private/
Allow: /private/public/
Sitemap: https://example.com/sitemap.xml
"#;

let robots = RobotsTxt::parse(input);

assert!(!robots.is_allowed("ExampleBot", "/private/file.html"));
assert!(robots.is_allowed("ExampleBot", "/private/public/file.html"));

For many checks against the same parsed file, build a reusable matcher once:

use fast_robots::RobotsTxt;

let robots = RobotsTxt::parse("User-agent: *\nDisallow: /private/\n");
let matcher = robots.matcher();

assert!(!matcher.is_allowed("ExampleBot", "/private/file.html"));
assert!(matcher.is_allowed("ExampleBot", "/public/file.html"));

RobotsTxt::is_allowed() is still the lowest-overhead choice for one-off checks. RobotsTxt::matcher() allocates user-agent, prefix, exact-match, and wildcard-prefix indexes for repeated checks against the same robots.txt.
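
To give a rough idea of what splitting rules into prefix, exact-match, and wildcard-prefix indexes could involve, here is a minimal sketch that classifies a single pattern. `PatternKind` and `classify` are hypothetical names for this example; the crate's internal representation may differ.

```rust
/// Hypothetical classification of a rule pattern; not the crate's types.
#[derive(Debug, PartialEq, Eq)]
pub enum PatternKind<'a> {
    /// No `*` and no trailing `$`: matches any path starting with the literal.
    PlainPrefix(&'a str),
    /// No `*` but a trailing `$`: matches only the exact path.
    Exact(&'a str),
    /// Contains `*`: the literal before the first `*` can act as a cheap
    /// pre-filter before running the full wildcard match.
    Wildcard { literal_prefix: &'a str },
}

pub fn classify(pattern: &str) -> PatternKind<'_> {
    match pattern.find('*') {
        Some(i) => PatternKind::Wildcard {
            literal_prefix: &pattern[..i],
        },
        None => match pattern.strip_suffix('$') {
            Some(exact) => PatternKind::Exact(exact),
            None => PatternKind::PlainPrefix(pattern),
        },
    }
}
```

Once patterns are bucketed this way, repeated checks can look up plain prefixes and exact paths directly and only run wildcard matching for patterns whose literal prefix fits the path.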

Fallible Parsing

RobotsTxt::parse(&str) is intentionally tolerant and infallible. Malformed lines are ignored because crawlers are expected to use the parseable rules they can recover.

Use the fallible byte APIs when reading untrusted files directly:

use fast_robots::{ParseOptions, RobotsTxt};

let bytes = b"User-agent: *\nDisallow: /private\n";
let robots = RobotsTxt::parse_bytes(bytes)?;

assert!(!robots.is_allowed("ExampleBot", "/private"));

let robots = RobotsTxt::parse_bytes_with_options(
    bytes,
    ParseOptions {
        max_bytes: Some(512 * 1024),
    },
)?;

assert!(!robots.is_allowed("ExampleBot", "/private"));
# Ok::<(), fast_robots::ParseError>(())

Hard errors are reserved for conditions that prevent safe parsing, such as invalid UTF-8 or inputs over the configured size limit.
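
The two hard-error conditions can be sketched with the standard library alone. `CheckError` and `check_input` are hypothetical names for illustration and are not the crate's `ParseError` or its API; the sketch only shows the order of checks one might expect, with the size limit applied before UTF-8 validation.

```rust
/// Hypothetical error type for this sketch; not `fast_robots::ParseError`.
#[derive(Debug, PartialEq, Eq)]
pub enum CheckError {
    TooLarge { len: usize, max: usize },
    InvalidUtf8 { valid_up_to: usize },
}

/// Reject oversized input first, then validate UTF-8 without copying.
pub fn check_input(bytes: &[u8], max_bytes: Option<usize>) -> Result<&str, CheckError> {
    if let Some(max) = max_bytes {
        if bytes.len() > max {
            return Err(CheckError::TooLarge {
                len: bytes.len(),
                max,
            });
        }
    }
    // `from_utf8` borrows the input, so the resulting `&str` is zero-copy.
    std::str::from_utf8(bytes).map_err(|e| CheckError::InvalidUtf8 {
        valid_up_to: e.valid_up_to(),
    })
}
```

Checking the length first keeps the cost of rejecting an oversized file independent of its contents.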

Diagnostics

Use diagnostics when you want validator-style feedback without changing tolerant parser behavior:

use fast_robots::{ParseWarningKind, RobotsTxt};

let report = RobotsTxt::parse_with_diagnostics(
    "Disallow: /\nMissing separator\nUser-agent: *\nDisallow: /private\n",
);

assert_eq!(report.warnings.len(), 2);
assert!(matches!(
    report.warnings[0].kind,
    ParseWarningKind::RuleBeforeUserAgent { .. }
));
assert!(!report.robots.is_allowed("ExampleBot", "/private"));

Extensions

With the default extensions feature, non-core records are preserved as metadata:

use fast_robots::RobotsTxt;

let robots = RobotsTxt::parse(r#"
Sitemap: https://example.com/sitemap.xml
User-agent: Bingbot
Crawl-delay: 5
Disallow: /slow/
Host: example.com
Clean-param: ref /shop
X-Experimental: yes
"#);

assert_eq!(robots.extensions.sitemaps, ["https://example.com/sitemap.xml"]);
assert_eq!(robots.extensions.crawl_delays[0].agents, ["Bingbot"]);
assert_eq!(robots.extensions.crawl_delays[0].value, "5");
assert!(!robots.is_allowed("Bingbot", "/slow/page.html"));

Extensions are metadata only. They do not affect is_allowed().

CLI

Parse a file:

cargo run -- parse robots.txt

Check a path:

cargo run -- check robots.txt --agent Googlebot --path /private/page.html

Exit codes for check:

0: allowed
1: disallowed
2: file read error
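
For readers wiring a similar convention into their own tool, the exit-code table above maps naturally onto `std::process::ExitCode`. This is a small sketch with a hypothetical `CheckOutcome` enum; it is not taken from the CLI's source.

```rust
use std::process::ExitCode;

/// Hypothetical outcome type for this sketch.
pub enum CheckOutcome {
    Allowed,
    Disallowed,
    ReadError,
}

/// Numeric status following the table above.
pub fn exit_status(outcome: &CheckOutcome) -> u8 {
    match outcome {
        CheckOutcome::Allowed => 0,
        CheckOutcome::Disallowed => 1,
        CheckOutcome::ReadError => 2,
    }
}

/// Convert to the value a `fn main() -> ExitCode` would return.
pub fn to_exit_code(outcome: &CheckOutcome) -> ExitCode {
    ExitCode::from(exit_status(outcome))
}
```

Keeping "disallowed" and "file read error" on separate codes lets shell scripts distinguish a policy decision from an I/O failure.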

How it works

Line scan: the parser walks the input with memchr(b'\n', ...) and strips optional \r.
Comment scan: memchr(b'#', ...) removes inline comments.
Directive split: memchr(b':', ...) separates key/value records.
Core parse: user-agent, allow, and disallow are matched ASCII-case-insensitively.
Extension collection: when enabled, non-core records are stored without changing group boundaries.
Access check: matching groups are evaluated using longest-match semantics, with Allow preferred on equal specificity. RobotsTxt::matcher() can pre-index groups, plain path prefixes, exact anchors, and wildcard literal prefixes for repeated checks.
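
The scanning steps above can be sketched as a small standalone loop. To stay dependency-free, this example uses a std-only `find_byte` helper as a stand-in for `memchr::memchr(needle, haystack)`; `Record`, `parse_line`, `records`, and `is_core_key` are hypothetical names and the crate's real parser may differ in detail.

```rust
/// Hypothetical key/value record borrowed from the input.
#[derive(Debug, PartialEq, Eq)]
pub struct Record<'a> {
    pub key: &'a str,
    pub value: &'a str,
}

/// Std-only stand-in for `memchr::memchr(needle, haystack)`.
fn find_byte(needle: u8, haystack: &[u8]) -> Option<usize> {
    haystack.iter().position(|&b| b == needle)
}

/// Strip `\r`, drop an inline comment, then split on the first `:`.
pub fn parse_line(line: &str) -> Option<Record<'_>> {
    let line = line.strip_suffix('\r').unwrap_or(line);
    let line = match find_byte(b'#', line.as_bytes()) {
        Some(i) => &line[..i],
        None => line,
    };
    // Lines without a separator are malformed and skipped.
    let colon = find_byte(b':', line.as_bytes())?;
    let key = line[..colon].trim();
    let value = line[colon + 1..].trim();
    if key.is_empty() {
        return None;
    }
    Some(Record { key, value })
}

/// Walk the input one `\n`-terminated line at a time.
pub fn records(input: &str) -> Vec<Record<'_>> {
    let mut out = Vec::new();
    let mut rest = input;
    while !rest.is_empty() {
        let (line, next) = match find_byte(b'\n', rest.as_bytes()) {
            Some(i) => (&rest[..i], &rest[i + 1..]),
            None => (rest, ""),
        };
        if let Some(record) = parse_line(line) {
            out.push(record);
        }
        rest = next;
    }
    out
}

/// Core directive names are compared ASCII-case-insensitively.
pub fn is_core_key(key: &str) -> bool {
    ["user-agent", "allow", "disallow"]
        .iter()
        .any(|&k| key.eq_ignore_ascii_case(k))
}
```

Because `\n`, `#`, and `:` are ASCII bytes, slicing at their positions always lands on a UTF-8 character boundary, which is what makes byte-level scanning safe on `&str` input. Every `Record` borrows from the original string, so no text is copied.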

Why not nom?

nom is good, but this format is mostly delimiter scanning and small state transitions. A manual parser keeps the important choices visible:

which bytes are scanned with SIMD-backed routines
how malformed lines recover
when groups start and end
which records are access-control rules versus metadata
how much allocation happens

Parser combinators can still be useful for more complex formats. Here they would mostly hide a simple loop.

Extension Semantics

fast-robots treats extensions conservatively:

Sitemap: global metadata; can appear anywhere.
Crawl-delay: stored with the current group agents when present.
Host: stored as Yandex-style metadata.
Clean-param: stored as Yandex-style metadata.
unknown directives: stored as Directive { key, value }.

Other records must not terminate groups or interfere with RFC 9309 parsing.
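
One way to picture this rule is a small grouping pass in which extension records are set aside without touching the group state. The sketch below is an assumption-laden illustration: `Group` and `group_records` are hypothetical names, extensions are collected into one flat list rather than attached to agents, and rules that appear before any User-agent line are simply dropped.

```rust
/// Hypothetical group type for this sketch; not the crate's data model.
#[derive(Debug, Default, PartialEq, Eq)]
pub struct Group<'a> {
    pub agents: Vec<&'a str>,
    /// (directive, pattern) pairs in source order.
    pub rules: Vec<(&'a str, &'a str)>,
}

/// Split (key, value) records into groups and a flat list of extensions.
pub fn group_records<'a>(
    records: &[(&'a str, &'a str)],
) -> (Vec<Group<'a>>, Vec<(&'a str, &'a str)>) {
    let mut groups: Vec<Group<'a>> = Vec::new();
    let mut extensions = Vec::new();
    // True once the current group has seen at least one Allow/Disallow rule.
    let mut in_rules = false;

    for &(key, value) in records {
        if key.eq_ignore_ascii_case("user-agent") {
            // A User-agent line after rules starts a new group; consecutive
            // User-agent lines share one group.
            if in_rules || groups.is_empty() {
                groups.push(Group::default());
                in_rules = false;
            }
            groups
                .last_mut()
                .expect("a group was just ensured")
                .agents
                .push(value);
        } else if key.eq_ignore_ascii_case("allow") || key.eq_ignore_ascii_case("disallow") {
            if let Some(group) = groups.last_mut() {
                group.rules.push((key, value));
                in_rules = true;
            }
        } else {
            // Extension or unknown record: collected, group state untouched.
            extensions.push((key, value));
        }
    }
    (groups, extensions)
}
```

In this sketch, a Crawl-delay line sitting between two User-agent lines leaves both agents in the same group, which is the behavior the sentence above asks for.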

Building

cargo build
cargo test
cargo test --no-default-features
cargo clippy --all-targets --all-features

Benchmarks

Benchmarks use Criterion.rs and generated fixtures so large test data does not need to live in the repository. Current results are tracked in BENCHMARK.md.

Current benchmark groups:

| Group | Workload | Goal |
| --- | --- | --- |
| `parse` | tiny, common, many groups, many rules, wildcard-heavy, extension-heavy, 500 KiB | parser throughput |
| `match` | many rules, wildcard-heavy | `is_allowed()` and precompiled matcher throughput after parsing once |
| `parse_match` | tiny, common, many rules, 500 KiB | end-to-end parse plus access decision |

The parse_match group compares fast-robots against robotstxt, the Rust port of Google's robots.txt parser and matcher. This is an API-level comparison, not a claim that the two crates currently have identical behavior for every edge case.

Run all benchmarks:

cargo bench

Run only this crate's benchmark target:

cargo bench --bench robots

Quick local sanity check with a smaller sample size:

cargo bench --bench robots -- --sample-size 10 --warm-up-time 0.1 --measurement-time 0.2

Caveats

Not an authorization system: robots.txt is a crawler cooperation protocol, not access control.
UTF-8 required: parse_bytes methods validate UTF-8 and return a ParseError for invalid encoding. Non-UTF-8 encodings (e.g., Latin-1, Windows-1252) are not supported.
No URI percent-normalization yet: RFC 9309 has specific percent-encoding comparison rules. The current matcher focuses on path pattern semantics and should grow a normalization layer before claiming full crawler equivalence.
Extensions vary by crawler: Google ignores Crawl-delay; Bing honors it; other crawlers differ. This crate stores extension metadata but does not enforce crawl scheduling.
SIMD is delegated: memchr selects optimized implementations where supported and falls back safely elsewhere.

Choosing Strictness

| Mode | Cargo config | Use case |
| --- | --- | --- |
| Core + extensions | `fast-robots = "0.1"` | most applications that want sitemaps and metadata |
| Core only | `fast-robots = { version = "0.1", default-features = false }` | strict RFC access checks with less metadata |

Security

Please see SECURITY.md for vulnerability reporting.

License

Licensed under either of:

Apache License, Version 2.0 (LICENSE-APACHE or http://www.apache.org/licenses/LICENSE-2.0)
MIT license (LICENSE-MIT or http://opensource.org/licenses/MIT)

at your option.

Contribution

Unless you explicitly state otherwise, any contribution intentionally submitted for inclusion in the work by you, as defined in the Apache-2.0 license, shall be dual licensed as above, without any additional terms or conditions.
