zeon256/fast-robots

fast-robots

A zero-copy robots.txt parser for Rust with SIMD-accelerated byte scanning, RFC 9309 access checks, feature-gated extension metadata, and a tiny argh CLI.

fast-robots logo
Disclaimer: I can't design. This logo was generated using ChatGPT.

Motivation

robots.txt is line-oriented and byte-oriented. That makes a hand-rolled parser a better fit than a big parser-combinator stack: fewer allocations, direct control over error recovery, and the hot path stays obvious.

The goal is simple: parse the standardized rules correctly, preserve useful ecosystem metadata like Sitemap and Crawl-delay, and use memchr where delimiter scanning actually matters.

Features

  • Zero-copy parsing: parsed agents, rules, and extension values borrow from the original input.
  • SIMD-backed scanning: line splitting, comments, directive separators, and wildcard matching use memchr/memmem primitives.
  • RFC 9309 core:
    • User-agent
    • Allow
    • Disallow
    • # comments
    • * wildcard matching
    • $ end-anchor matching
  • Correct access semantics:
    • matching groups are merged
    • * fallback group is used only when no exact user-agent group matches
    • longest matching rule wins
    • Allow wins ties
    • empty Disallow: does not block anything
    • /robots.txt is implicitly allowed
  • Feature-gated extensions: Sitemap, Crawl-delay, Host, Clean-param, and unknown directives are collected behind the extensions feature.
  • CLI included: inspect parsed files and check whether a path is allowed from the terminal.
  • Small dependency surface: runtime dependencies are currently memchr and argh.
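
The access semantics listed above can be illustrated with a small std-only sketch. It is not the crate's implementation: the `Rule` type and the `pattern_matches` and `decide` functions are hypothetical names introduced here, and specificity is approximated by pattern length in bytes.

```rust
/// One access rule: `allow` is true for Allow, false for Disallow.
/// Hypothetical type for this sketch, not the crate's own representation.
#[derive(Debug, Clone, Copy)]
struct Rule<'a> {
    allow: bool,
    pattern: &'a str,
}

/// Match `path` against `pattern`: `*` matches any run of bytes and a
/// trailing `$` anchors the match at the end of the path.
fn pattern_matches(pattern: &str, path: &str) -> bool {
    let (pattern, anchored) = match pattern.strip_suffix('$') {
        Some(rest) => (rest, true),
        None => (pattern, false),
    };
    let mut pieces = pattern.split('*');
    // The first literal must be a prefix of the path.
    let first = pieces.next().unwrap_or("");
    let Some(mut rest) = path.strip_prefix(first) else {
        return false;
    };
    // Each later literal follows a `*`, so it may start anywhere in `rest`.
    let literals: Vec<&str> = pieces.collect();
    for (i, literal) in literals.iter().enumerate() {
        if anchored && i + 1 == literals.len() {
            // With `$`, the final literal must end the path.
            return rest.ends_with(literal);
        }
        match rest.find(literal) {
            Some(pos) => rest = &rest[pos + literal.len()..],
            None => return false,
        }
    }
    !anchored || rest.is_empty()
}

/// Decide access for `path` from the merged rules of the selected groups.
fn decide(rules: &[Rule<'_>], path: &str) -> bool {
    // /robots.txt is implicitly allowed.
    if path == "/robots.txt" {
        return true;
    }
    let mut best: Option<Rule<'_>> = None;
    for rule in rules {
        // An empty pattern (such as `Disallow:`) matches nothing.
        if rule.pattern.is_empty() || !pattern_matches(rule.pattern, path) {
            continue;
        }
        let better = match best {
            None => true,
            // Longest pattern wins; Allow wins a tie on length.
            Some(b) => {
                rule.pattern.len() > b.pattern.len()
                    || (rule.pattern.len() == b.pattern.len() && rule.allow && !b.allow)
            }
        };
        if better {
            best = Some(*rule);
        }
    }
    // No matching rule means the path is allowed.
    best.map_or(true, |r| r.allow)
}
```

With `Disallow: /private/` and `Allow: /private/public/`, `decide` rejects `/private/file.html` and accepts `/private/public/file.html`, because the longer Allow pattern is the more specific match.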

Performance

fast-robots is fast enough that parsing is rarely the bottleneck:

  • ~1–2 GiB/s parse throughput on Apple M1 (native CPU tuning, benchmark-only mimalloc).
  • ~4–9× faster than robotstxt (the Rust port of Google's robots.txt parser) on end-to-end parse + match workloads.
  • ~270–380× faster repeated matching with the opt-in compiled matcher on generated rule-heavy fixtures.

See BENCHMARK.md for full methodology, fixtures, and environment details.

Installation

Add this to your Cargo.toml:

[dependencies]
fast-robots = "0.1.0"

The extensions feature is enabled by default. To build without it, disable default features:

[dependencies]
fast-robots = { version = "0.1.0", default-features = false }

Usage

use fast_robots::RobotsTxt;

let input = r#"
User-agent: *
Disallow: /private/
Allow: /private/public/
Sitemap: https://example.com/sitemap.xml
"#;

let robots = RobotsTxt::parse(input);

assert!(!robots.is_allowed("ExampleBot", "/private/file.html"));
assert!(robots.is_allowed("ExampleBot", "/private/public/file.html"));

For many checks against the same parsed file, build a reusable matcher once:

use fast_robots::RobotsTxt;

let robots = RobotsTxt::parse("User-agent: *\nDisallow: /private/\n");
let matcher = robots.matcher();

assert!(!matcher.is_allowed("ExampleBot", "/private/file.html"));
assert!(matcher.is_allowed("ExampleBot", "/public/file.html"));

RobotsTxt::is_allowed() is still the lowest-overhead choice for one-off checks. RobotsTxt::matcher() allocates user-agent, prefix, exact-match, and wildcard-prefix indexes for repeated checks against the same robots.txt.

Fallible Parsing

RobotsTxt::parse(&str) is intentionally tolerant and infallible. Malformed lines are ignored because crawlers are expected to use the parseable rules they can recover.

Use the fallible byte APIs when reading untrusted files directly:

use fast_robots::{ParseOptions, RobotsTxt};

let bytes = b"User-agent: *\nDisallow: /private\n";
let robots = RobotsTxt::parse_bytes(bytes)?;

assert!(!robots.is_allowed("ExampleBot", "/private"));

let robots = RobotsTxt::parse_bytes_with_options(
    bytes,
    ParseOptions {
        max_bytes: Some(512 * 1024),
    },
)?;

assert!(!robots.is_allowed("ExampleBot", "/private"));
# Ok::<(), fast_robots::ParseError>(())

Hard errors are reserved for conditions that prevent safe parsing, such as invalid UTF-8 or inputs over the configured size limit.
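
The two hard-error conditions named above can be sketched with the standard library alone. This is an illustration under assumptions, not the crate's code: `InputError` and `checked_input` are hypothetical names, and the real `ParseError` may carry different variants and fields.

```rust
use std::str;

/// Hypothetical error type for this sketch; the crate's `ParseError` may differ.
#[derive(Debug, PartialEq)]
enum InputError {
    TooLarge { len: usize, max: usize },
    InvalidUtf8 { valid_up_to: usize },
}

/// Check the size limit first (it is cheap), then validate UTF-8 and
/// borrow the bytes as `&str` without copying them.
fn checked_input(bytes: &[u8], max_bytes: Option<usize>) -> Result<&str, InputError> {
    if let Some(max) = max_bytes {
        if bytes.len() > max {
            return Err(InputError::TooLarge { len: bytes.len(), max });
        }
    }
    str::from_utf8(bytes).map_err(|e| InputError::InvalidUtf8 {
        valid_up_to: e.valid_up_to(),
    })
}
```

Everything after this gate can stay tolerant: once the input is known to be valid UTF-8 within the size limit, malformed lines can simply be skipped.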

Diagnostics

Use diagnostics when you want validator-style feedback without changing tolerant parser behavior:

use fast_robots::{ParseWarningKind, RobotsTxt};

let report = RobotsTxt::parse_with_diagnostics(
    "Disallow: /\nMissing separator\nUser-agent: *\nDisallow: /private\n",
);

assert_eq!(report.warnings.len(), 2);
assert!(matches!(
    report.warnings[0].kind,
    ParseWarningKind::RuleBeforeUserAgent { .. }
));
assert!(!report.robots.is_allowed("ExampleBot", "/private"));

Extensions

With the default extensions feature, non-core records are preserved as metadata:

use fast_robots::RobotsTxt;

let robots = RobotsTxt::parse(r#"
Sitemap: https://example.com/sitemap.xml
User-agent: Bingbot
Crawl-delay: 5
Disallow: /slow/
Host: example.com
Clean-param: ref /shop
X-Experimental: yes
"#);

assert_eq!(robots.extensions.sitemaps, ["https://example.com/sitemap.xml"]);
assert_eq!(robots.extensions.crawl_delays[0].agents, ["Bingbot"]);
assert_eq!(robots.extensions.crawl_delays[0].value, "5");
assert!(!robots.is_allowed("Bingbot", "/slow/page.html"));

Extensions are metadata only. They do not affect is_allowed().

CLI

Parse a file:

cargo run -- parse robots.txt

Check a path:

cargo run -- check robots.txt --agent Googlebot --path /private/page.html

Exit codes for check:

  • 0: allowed
  • 1: disallowed
  • 2: file read error

How it works

  1. Line scan: the parser walks the input with memchr(b'\n', ...) and strips optional \r.
  2. Comment scan: memchr(b'#', ...) removes inline comments.
  3. Directive split: memchr(b':', ...) separates key/value records.
  4. Core parse: user-agent, allow, and disallow are matched ASCII-case-insensitively.
  5. Extension collection: when enabled, non-core records are stored without changing group boundaries.
  6. Access check: matching groups are evaluated using longest-match semantics, with Allow preferred on equal specificity. RobotsTxt::matcher() can pre-index groups, plain path prefixes, exact anchors, and wildcard literal prefixes for repeated checks.
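
Steps 1 to 4 can be sketched as a single loop. This is a simplified illustration, not the crate's parser: `find_byte` is a std-only stand-in for `memchr::memchr` so the example has no dependencies, and `records` and `is_core` are hypothetical helper names.

```rust
/// Position of the first `needle` byte in `haystack`.
/// Stand-in for `memchr::memchr`, written with std only.
fn find_byte(needle: u8, haystack: &[u8]) -> Option<usize> {
    haystack.iter().position(|&b| b == needle)
}

/// Split `input` into `(key, value)` records that borrow from `input`.
fn records(input: &str) -> Vec<(&str, &str)> {
    let mut out = Vec::new();
    let mut rest = input;
    while !rest.is_empty() {
        // 1. Line scan: cut at the next '\n' and strip an optional '\r'.
        let (line, tail) = match find_byte(b'\n', rest.as_bytes()) {
            Some(i) => (&rest[..i], &rest[i + 1..]),
            None => (rest, ""),
        };
        rest = tail;
        let line = line.strip_suffix('\r').unwrap_or(line);
        // 2. Comment scan: drop everything from the first '#'.
        let line = match find_byte(b'#', line.as_bytes()) {
            Some(i) => &line[..i],
            None => line,
        };
        // 3. Directive split: key and value sit on either side of the first ':'.
        //    Lines without a separator are skipped, which is the tolerant path.
        if let Some(i) = find_byte(b':', line.as_bytes()) {
            let key = line[..i].trim();
            let value = line[i + 1..].trim();
            if !key.is_empty() {
                out.push((key, value));
            }
        }
    }
    out
}

/// 4. Core directive names are compared ASCII-case-insensitively.
fn is_core(key: &str) -> bool {
    ["user-agent", "allow", "disallow"]
        .iter()
        .any(|k| key.eq_ignore_ascii_case(k))
}
```

Slicing at the positions of `\n`, `#`, and `:` is safe on UTF-8 input because ASCII bytes never occur inside a multi-byte sequence. Every returned slice points into the original input, which is what makes the zero-copy design possible.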

Why not nom?

nom is good, but this format is mostly delimiter scanning and small state transitions. A manual parser keeps the important choices visible:

  • which bytes are scanned with SIMD-backed routines
  • how malformed lines recover
  • when groups start and end
  • which records are access-control rules versus metadata
  • how much allocation happens

Parser combinators can still be useful for more complex formats. Here they would mostly hide a simple loop.

Extension Semantics

fast-robots treats extensions conservatively:

  • Sitemap: global metadata; can appear anywhere.
  • Crawl-delay: stored with the current group agents when present.
  • Host: stored as Yandex-style metadata.
  • Clean-param: stored as Yandex-style metadata.
  • unknown directives: stored as Directive { key, value }.

Other records must not terminate groups or interfere with RFC 9309 parsing.
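
One way to read that requirement is as a small state machine in which only User-agent, Allow, and Disallow lines change group state. The sketch below is an assumption-laden illustration, not the crate's code: `Group` and `group_records` are hypothetical names, and extension records are collected into one flat list rather than the typed metadata described above.

```rust
/// Hypothetical group shape for this sketch.
#[derive(Debug, Default, PartialEq)]
struct Group<'a> {
    agents: Vec<&'a str>,
    /// (directive, value) pairs for Allow and Disallow.
    rules: Vec<(&'a str, &'a str)>,
}

/// Build groups from `(key, value)` records and collect everything else separately.
fn group_records<'a>(
    records: &[(&'a str, &'a str)],
) -> (Vec<Group<'a>>, Vec<(&'a str, &'a str)>) {
    let mut groups: Vec<Group<'a>> = Vec::new();
    let mut extensions = Vec::new();
    // True while consecutive User-agent lines are still extending one group.
    let mut in_agent_run = false;
    for &(key, value) in records {
        if key.eq_ignore_ascii_case("user-agent") {
            if !in_agent_run {
                groups.push(Group::default());
                in_agent_run = true;
            }
            if let Some(group) = groups.last_mut() {
                group.agents.push(value);
            }
        } else if key.eq_ignore_ascii_case("allow") || key.eq_ignore_ascii_case("disallow") {
            // The first rule ends the run of User-agent lines.
            in_agent_run = false;
            // A rule before any User-agent line has no group and is dropped.
            if let Some(group) = groups.last_mut() {
                group.rules.push((key, value));
            }
        } else {
            // Extension or unknown record: stored, group state untouched.
            extensions.push((key, value));
        }
    }
    (groups, extensions)
}
```

Because the last branch never touches `in_agent_run` or `groups`, a `Crawl-delay` or `Host` line between two rules leaves both rules in the same group.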

Building

cargo build
cargo test
cargo test --no-default-features
cargo clippy --all-targets --all-features

Benchmarks

Benchmarks use Criterion.rs and generated fixtures so large test data does not need to live in the repository. Current results are tracked in BENCHMARK.md.

Current benchmark groups:

| Group | Workload | Goal |
| --- | --- | --- |
| parse | tiny, common, many groups, many rules, wildcard-heavy, extension-heavy, 500 KiB | parser throughput |
| match | many rules, wildcard-heavy | is_allowed() and precompiled matcher throughput after parsing once |
| parse_match | tiny, common, many rules, 500 KiB | end-to-end parse plus access decision |

The parse_match group compares fast-robots against robotstxt, the Rust port of Google's robots.txt parser and matcher. This is an API-level comparison, not a claim that the two crates currently have identical behavior for every edge case.

Run all benchmarks:

cargo bench

Run only this crate's benchmark target:

cargo bench --bench robots

Quick local sanity check with a smaller sample size:

cargo bench --bench robots -- --sample-size 10 --warm-up-time 0.1 --measurement-time 0.2

Caveats

  • Not an authorization system: robots.txt is a crawler cooperation protocol, not access control.
  • UTF-8 required: parse_bytes methods validate UTF-8 and return a ParseError for invalid encoding. Non-UTF-8 encodings (e.g., Latin-1, Windows-1252) are not supported.
  • No URI percent-normalization yet: RFC 9309 has specific percent-encoding comparison rules. The current matcher focuses on path pattern semantics and should grow a normalization layer before claiming full crawler equivalence.
  • Extensions vary by crawler: Google ignores Crawl-delay; Bing honors it; other crawlers differ. This crate stores extension metadata but does not enforce crawl scheduling.
  • SIMD is delegated: memchr selects optimized implementations where supported and falls back safely elsewhere.

Choosing Strictness

| Mode | Cargo config | Use case |
| --- | --- | --- |
| Core + extensions | fast-robots = "0.1" | most applications that want sitemaps and metadata |
| Core only | fast-robots = { version = "0.1", default-features = false } | strict RFC access checks with less metadata |

Security

Please see SECURITY.md for vulnerability reporting.

License

Licensed under either of:

  • Apache License, Version 2.0 (LICENSE-APACHE)
  • MIT license (LICENSE-MIT)

at your option.

Contribution

Unless you explicitly state otherwise, any contribution intentionally submitted for inclusion in the work by you, as defined in the Apache-2.0 license, shall be dual licensed as above, without any additional terms or conditions.
