A zero-copy robots.txt parser for Rust with SIMD-accelerated byte scanning, RFC 9309 access checks, feature-gated extension metadata, and a tiny argh CLI.
Disclaimer: I can't design. This logo was generated using ChatGPT.
robots.txt is line-oriented and byte-oriented. That makes a hand-rolled parser a better fit than a big parser-combinator stack: fewer allocations, direct control over error recovery, and the hot path stays obvious.
The goal is simple: parse the standardized rules correctly, preserve useful ecosystem metadata like Sitemap and Crawl-delay, and use memchr where delimiter scanning actually matters.
- Zero-copy parsing: parsed agents, rules, and extension values borrow from the original input.
- SIMD-backed scanning: line splitting, comments, directive separators, and wildcard matching use memchr/memmem primitives.
- RFC 9309 core:
  - User-agent
  - Allow
  - Disallow
  - # comments
  - * wildcard matching
  - $ end-anchor matching
- Correct access semantics:
  - matching groups are merged
  - * fallback group is used only when no exact user-agent group matches
  - longest matching rule wins
  - Allow wins ties
  - empty Disallow: does not block anything
  - /robots.txt is implicitly allowed
- Feature-gated extensions: Sitemap, Crawl-delay, Host, Clean-param, and unknown directives are collected behind the extensions feature.
- CLI included: inspect parsed files and check whether a path is allowed from the terminal.
- Small dependency surface: runtime dependencies are currently memchr and argh.
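
The access semantics listed above can be sketched in standard-library Rust. This is an illustration only, not the crate's implementation: `Verdict`, `Rule`, `pattern_matches`, and `path_allowed` are hypothetical names, specificity is assumed to be the pattern's length in bytes, and plain `str` searching stands in for the memchr/memmem primitives.

```rust
#[derive(Clone, Copy, Debug, PartialEq)]
enum Verdict {
    Allow,
    Disallow,
}

struct Rule<'a> {
    verdict: Verdict,
    pattern: &'a str,
}

/// Returns true when `pattern` matches `path`.
/// `*` matches any run of characters; a trailing `$` anchors the match
/// to the end of `path`. Without `$`, the pattern is a prefix match.
fn pattern_matches(pattern: &str, path: &str) -> bool {
    let (pattern, anchored) = match pattern.strip_suffix('$') {
        Some(body) => (body, true),
        None => (pattern, false),
    };
    // Literal segments between wildcards; `split` always yields at least one item.
    let literals: Vec<&str> = pattern.split('*').collect();
    let (first, tail) = literals.split_first().expect("split yields at least one item");
    // The first literal must sit at the very start of the path.
    let Some(mut rest) = path.strip_prefix(*first) else {
        return false;
    };
    let Some((last, middle)) = tail.split_last() else {
        // No wildcard at all: prefix match, or exact match when anchored.
        return !anchored || rest.is_empty();
    };
    // Take the leftmost occurrence of each middle literal, which leaves
    // the most room for the literals that follow.
    for lit in middle {
        match rest.find(*lit) {
            Some(i) => rest = &rest[i + lit.len()..],
            None => return false,
        }
    }
    if anchored {
        rest.ends_with(*last)
    } else {
        rest.contains(*last)
    }
}

/// Longest matching rule wins, Allow wins ties, empty patterns never block,
/// and /robots.txt is always allowed.
fn path_allowed(rules: &[Rule<'_>], path: &str) -> bool {
    if path == "/robots.txt" {
        return true;
    }
    let mut best: Option<&Rule<'_>> = None;
    for rule in rules {
        if rule.pattern.is_empty() || !pattern_matches(rule.pattern, path) {
            continue;
        }
        let better = match best {
            None => true,
            Some(b) => {
                rule.pattern.len() > b.pattern.len()
                    || (rule.pattern.len() == b.pattern.len()
                        && rule.verdict == Verdict::Allow)
            }
        };
        if better {
            best = Some(rule);
        }
    }
    // No matching rule means the path is allowed.
    best.map_or(true, |r| r.verdict == Verdict::Allow)
}
```

The rule list is assumed to be the already merged rules of the groups selected for one user agent; group selection is a separate step.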
fast-robots is fast enough that parsing is rarely the bottleneck:
- ~1–2 GiB/s parse throughput on Apple M1 (native CPU tuning, benchmark-only mimalloc).
- ~4–9× faster than robotstxt (the Rust port of Google's parser) on end-to-end parse + match workloads.
- ~270–380× faster repeated matching with the opt-in compiled matcher on generated rule-heavy fixtures.
See BENCHMARK.md for full methodology, fixtures, and environment details.
Add this to your Cargo.toml:
[dependencies]
fast-robots = "0.1.0"

The extensions feature is enabled by default. To build the core parser only, disable default features:

[dependencies]
fast-robots = { version = "0.1.0", default-features = false }

use fast_robots::RobotsTxt;
let input = r#"
User-agent: *
Disallow: /private/
Allow: /private/public/
Sitemap: https://example.com/sitemap.xml
"#;
let robots = RobotsTxt::parse(input);
assert!(!robots.is_allowed("ExampleBot", "/private/file.html"));
assert!(robots.is_allowed("ExampleBot", "/private/public/file.html"));

For many checks against the same parsed file, build a reusable matcher once:
use fast_robots::RobotsTxt;
let robots = RobotsTxt::parse("User-agent: *\nDisallow: /private/\n");
let matcher = robots.matcher();
assert!(!matcher.is_allowed("ExampleBot", "/private/file.html"));
assert!(matcher.is_allowed("ExampleBot", "/public/file.html"));

RobotsTxt::is_allowed() is still the lowest-overhead choice for one-off checks. RobotsTxt::matcher() allocates user-agent, prefix, exact-match, and wildcard-prefix indexes for repeated checks against the same robots.txt.
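
To make the trade-off concrete, here is a rough sketch of what a user-agent index might look like. It is an assumption for illustration, not the crate's matcher: `Group`, `AgentIndex`, `build`, and `rules_for` are hypothetical names, and only the user-agent index is shown (the prefix, exact-match, and wildcard-prefix indexes are left out).

```rust
use std::collections::HashMap;

/// One parsed group: the agents it names and its rule patterns (hypothetical shape).
struct Group<'a> {
    agents: Vec<&'a str>,
    rules: Vec<&'a str>,
}

/// Built once, then reused for every check against the same file.
struct AgentIndex<'a> {
    /// Lowercased agent name -> merged rules of every group naming that agent.
    by_agent: HashMap<String, Vec<&'a str>>,
    /// Merged rules of every `*` group.
    fallback: Vec<&'a str>,
}

impl<'a> AgentIndex<'a> {
    fn build(groups: &[Group<'a>]) -> Self {
        let mut by_agent: HashMap<String, Vec<&'a str>> = HashMap::new();
        let mut fallback = Vec::new();
        for group in groups {
            for agent in &group.agents {
                if *agent == "*" {
                    fallback.extend_from_slice(&group.rules);
                } else {
                    by_agent
                        .entry(agent.to_ascii_lowercase())
                        .or_default()
                        .extend_from_slice(&group.rules);
                }
            }
        }
        Self { by_agent, fallback }
    }

    /// Rules for `agent`: the merged named groups if any match,
    /// otherwise the `*` fallback.
    fn rules_for(&self, agent: &str) -> &[&'a str] {
        // This sketch lowercases per lookup; a tuned matcher would avoid the allocation.
        match self.by_agent.get(&agent.to_ascii_lowercase()) {
            Some(rules) => rules.as_slice(),
            None => self.fallback.as_slice(),
        }
    }
}
```

The up-front allocation only pays off when the same file is queried many times, which is why a one-off check is better served by the direct path.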
RobotsTxt::parse(&str) is intentionally tolerant and infallible. Malformed lines are ignored because crawlers are expected to use the parseable rules they can recover.
Use the fallible byte APIs when reading untrusted files directly:
use fast_robots::{ParseOptions, RobotsTxt};
let bytes = b"User-agent: *\nDisallow: /private\n";
let robots = RobotsTxt::parse_bytes(bytes)?;
assert!(!robots.is_allowed("ExampleBot", "/private"));
let robots = RobotsTxt::parse_bytes_with_options(
    bytes,
    ParseOptions {
        max_bytes: Some(512 * 1024),
    },
)?;
assert!(!robots.is_allowed("ExampleBot", "/private"));
# Ok::<(), fast_robots::ParseError>(())

Hard errors are reserved for conditions that prevent safe parsing, such as invalid UTF-8 or inputs over the configured size limit.
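
The two hard-error conditions can be illustrated with a small standard-library sketch. `InputError` and `validate` are hypothetical names used only here; they are not the crate's `ParseError` or its byte API.

```rust
use std::str::Utf8Error;

/// Hypothetical error type covering the two hard-error cases.
#[derive(Debug)]
enum InputError {
    TooLarge { len: usize, max: usize },
    InvalidUtf8(Utf8Error),
}

/// Checks the size limit first (cheap), then validates UTF-8.
/// The returned `&str` borrows from `bytes`, so nothing is copied.
fn validate(bytes: &[u8], max_bytes: Option<usize>) -> Result<&str, InputError> {
    if let Some(max) = max_bytes {
        if bytes.len() > max {
            return Err(InputError::TooLarge { len: bytes.len(), max });
        }
    }
    std::str::from_utf8(bytes).map_err(InputError::InvalidUtf8)
}
```

Everything past this gate can stay tolerant: once the input is known to be bounded and valid UTF-8, malformed lines can simply be skipped.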
Use diagnostics when you want validator-style feedback without changing tolerant parser behavior:
use fast_robots::{ParseWarningKind, RobotsTxt};
let report = RobotsTxt::parse_with_diagnostics(
    "Disallow: /\nMissing separator\nUser-agent: *\nDisallow: /private\n",
);
assert_eq!(report.warnings.len(), 2);
assert!(matches!(
    report.warnings[0].kind,
    ParseWarningKind::RuleBeforeUserAgent { .. }
));
assert!(!report.robots.is_allowed("ExampleBot", "/private"));

With the default extensions feature, non-core records are preserved as metadata:
use fast_robots::RobotsTxt;
let robots = RobotsTxt::parse(r#"
Sitemap: https://example.com/sitemap.xml
User-agent: Bingbot
Crawl-delay: 5
Disallow: /slow/
Host: example.com
Clean-param: ref /shop
X-Experimental: yes
"#);
assert_eq!(robots.extensions.sitemaps, ["https://example.com/sitemap.xml"]);
assert_eq!(robots.extensions.crawl_delays[0].agents, ["Bingbot"]);
assert_eq!(robots.extensions.crawl_delays[0].value, "5");
assert!(!robots.is_allowed("Bingbot", "/slow/page.html"));

Extensions are metadata only. They do not affect is_allowed().
Parse a file:
cargo run -- parse robots.txt

Check a path:
cargo run -- check robots.txt --agent Googlebot --path /private/page.html

Exit codes for check:
- 0: allowed
- 1: disallowed
- 2: file read error
- Line scan: the parser walks the input with memchr(b'\n', ...) and strips optional \r.
- Comment scan: memchr(b'#', ...) removes inline comments.
- Directive split: memchr(b':', ...) separates key/value records.
- Core parse: user-agent, allow, and disallow are matched ASCII-case-insensitively.
- Extension collection: when enabled, non-core records are stored without changing group boundaries.
- Access check: matching groups are evaluated using longest-match semantics, with Allow preferred on equal specificity. RobotsTxt::matcher() can pre-index groups, plain path prefixes, exact anchors, and wildcard literal prefixes for repeated checks.
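
The first four stages above can be sketched as one loop. This is a simplified illustration rather than the crate's code: `Record`, `find_byte`, `parse_line`, and `records` are hypothetical names, and `find_byte` is a plain standard-library scan standing in for `memchr`.

```rust
#[derive(Debug, PartialEq)]
enum Record<'a> {
    UserAgent(&'a str),
    Allow(&'a str),
    Disallow(&'a str),
    Other { key: &'a str, value: &'a str },
}

/// Stand-in for memchr: index of the first `needle` byte, if any.
fn find_byte(needle: u8, haystack: &[u8]) -> Option<usize> {
    haystack.iter().position(|&b| b == needle)
}

/// Parses one line; returns None for blank, comment-only, or malformed lines.
fn parse_line(line: &str) -> Option<Record<'_>> {
    // Comment scan: drop everything from the first '#'.
    let line = match find_byte(b'#', line.as_bytes()) {
        Some(i) => &line[..i],
        None => line,
    };
    // Directive split: the first ':' separates key and value.
    let colon = find_byte(b':', line.as_bytes())?;
    let key = line[..colon].trim();
    let value = line[colon + 1..].trim();
    if key.is_empty() {
        return None;
    }
    // Core parse: keys are compared ASCII-case-insensitively.
    Some(if key.eq_ignore_ascii_case("user-agent") {
        Record::UserAgent(value)
    } else if key.eq_ignore_ascii_case("allow") {
        Record::Allow(value)
    } else if key.eq_ignore_ascii_case("disallow") {
        Record::Disallow(value)
    } else {
        Record::Other { key, value }
    })
}

/// Line scan: split on '\n', strip an optional trailing '\r', skip bad lines.
fn records(input: &str) -> Vec<Record<'_>> {
    let mut out = Vec::new();
    let mut rest = input;
    while !rest.is_empty() {
        let (line, next) = match find_byte(b'\n', rest.as_bytes()) {
            Some(i) => (&rest[..i], &rest[i + 1..]),
            None => (rest, ""),
        };
        let line = line.strip_suffix('\r').unwrap_or(line);
        if let Some(record) = parse_line(line) {
            out.push(record);
        }
        rest = next;
    }
    out
}
```

Slicing at the delimiter positions is safe because `\n`, `#`, and `:` are ASCII bytes, which never occur inside a multi-byte UTF-8 sequence. Every returned value borrows from `input`.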
nom is good, but this format is mostly delimiter scanning and small state transitions. A manual parser keeps the important choices visible:
- which bytes are scanned with SIMD-backed routines
- how malformed lines recover
- when groups start and end
- which records are access-control rules versus metadata
- how much allocation happens
Parser combinators can still be useful for more complex formats. Here they would mostly hide a simple loop.
fast-robots treats extensions conservatively:
- Sitemap: global metadata; can appear anywhere.
- Crawl-delay: stored with the current group agents when present.
- Host: stored as Yandex-style metadata.
- Clean-param: stored as Yandex-style metadata.
- unknown directives: stored as Directive { key, value }.
Other records must not terminate groups or interfere with RFC 9309 parsing.
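
One way to picture that rule is a small grouping state machine in which only User-agent and rule lines move group boundaries. The sketch below is an assumption about the shape of such a loop, not the crate's code; `Line`, `Group`, `Parsed`, and `group_lines` are hypothetical names.

```rust
#[derive(Clone, Copy, Debug)]
enum Line<'a> {
    UserAgent(&'a str),
    Rule(&'a str),               // an Allow or Disallow pattern
    Extension(&'a str, &'a str), // any non-core record: (key, value)
}

#[derive(Debug, Default)]
struct Group<'a> {
    agents: Vec<&'a str>,
    rules: Vec<&'a str>,
}

#[derive(Debug, Default)]
struct Parsed<'a> {
    groups: Vec<Group<'a>>,
    /// (agents of the group current at that point, raw value)
    crawl_delays: Vec<(Vec<&'a str>, &'a str)>,
    other: Vec<(&'a str, &'a str)>,
}

fn group_lines<'a>(lines: &[Line<'a>]) -> Parsed<'a> {
    let mut parsed = Parsed::default();
    // True once the current group has a rule, so the next User-agent
    // line opens a new group. Starts true so the first one opens a group.
    let mut seen_rule = true;
    for &line in lines {
        match line {
            Line::UserAgent(agent) => {
                if seen_rule {
                    parsed.groups.push(Group::default());
                    seen_rule = false;
                }
                if let Some(group) = parsed.groups.last_mut() {
                    group.agents.push(agent);
                }
            }
            Line::Rule(pattern) => {
                // Rules that appear before any User-agent line are dropped.
                if let Some(group) = parsed.groups.last_mut() {
                    group.rules.push(pattern);
                    seen_rule = true;
                }
            }
            // Extensions never touch `seen_rule`, so they cannot end a group.
            Line::Extension(key, value) => {
                if key.eq_ignore_ascii_case("crawl-delay") {
                    let agents = parsed
                        .groups
                        .last()
                        .map(|g| g.agents.clone())
                        .unwrap_or_default();
                    parsed.crawl_delays.push((agents, value));
                } else {
                    parsed.other.push((key, value));
                }
            }
        }
    }
    parsed
}
```

Because the extension arm leaves the state flag alone, a Crawl-delay or Host line between a User-agent line and its rules does not split the group.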
cargo build
cargo test
cargo test --no-default-features
cargo clippy --all-targets --all-features

Benchmarks use Criterion.rs and generated fixtures so large test data does not need to live in the repository. Current results are tracked in BENCHMARK.md.
Current benchmark groups:
| Group | Workload | Goal |
|---|---|---|
| parse | tiny, common, many groups, many rules, wildcard-heavy, extension-heavy, 500 KiB | parser throughput |
| match | many rules, wildcard-heavy | is_allowed() and precompiled matcher throughput after parsing once |
| parse_match | tiny, common, many rules, 500 KiB | end-to-end parse plus access decision |
The parse_match group compares fast-robots against robotstxt, the Rust port of Google's robots.txt parser and matcher. This is an API-level comparison, not a claim that the two crates currently have identical behavior for every edge case.
Run all benchmarks:
cargo bench

Run only this crate's benchmark target:
cargo bench --bench robots

Quick local sanity check with a smaller sample size:
cargo bench --bench robots -- --sample-size 10 --warm-up-time 0.1 --measurement-time 0.2

- Not an authorization system: robots.txt is a crawler cooperation protocol, not access control.
- UTF-8 required: parse_bytes methods validate UTF-8 and return a ParseError for invalid encoding. Non-UTF-8 encodings (e.g., Latin-1, Windows-1252) are not supported.
- No URI percent-normalization yet: RFC 9309 has specific percent-encoding comparison rules. The current matcher focuses on path pattern semantics and should grow a normalization layer before claiming full crawler equivalence.
- Extensions vary by crawler: Google ignores Crawl-delay; Bing honors it; other crawlers differ. This crate stores extension metadata but does not enforce crawl scheduling.
- SIMD is delegated: memchr selects optimized implementations where supported and falls back safely elsewhere.
| Mode | Cargo config | Use case |
|---|---|---|
| Core + extensions | fast-robots = "0.1" | most applications that want sitemaps and metadata |
| Core only | fast-robots = { version = "0.1", default-features = false } | strict RFC access checks with less metadata |
Please see SECURITY.md for vulnerability reporting.
Licensed under either of:
- Apache License, Version 2.0 (LICENSE-APACHE or http://www.apache.org/licenses/LICENSE-2.0)
- MIT license (LICENSE-MIT or http://opensource.org/licenses/MIT)
at your option.
Unless you explicitly state otherwise, any contribution intentionally submitted for inclusion in the work by you, as defined in the Apache-2.0 license, shall be dual licensed as above, without any additional terms or conditions.