iscrawl


Fast crawler/bot detection from User-Agent strings.

  • sub-140 ns cold, ~5 ns warm cache hit
  • heuristic bool API
  • optional Crawlerdex info lookup

Install

[dependencies]
iscrawl = "1.2"

For database-backed crawler metadata, enable the database feature:

[dependencies]
iscrawl = { version = "1.2", features = ["database"] }

Use

use iscrawl::is_crawler;

assert!(is_crawler("Googlebot/2.1 (+http://www.google.com/bot.html)"));
assert!(!is_crawler(
    "Mozilla/5.0 (X11; Linux x86_64; rv:115.0) Gecko/20100101 Firefox/115.0"
));
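
For a slightly larger example, here is a sketch (not taken from the crate's docs) that counts crawler hits in a batch of User-Agent strings using only the public is_crawler function:

use iscrawl::is_crawler;

// Sketch: count how many User-Agents in a batch look like crawlers.
fn crawler_hits<'a>(user_agents: impl IntoIterator<Item = &'a str>) -> usize {
    user_agents.into_iter().filter(|ua| is_crawler(ua)).count()
}

assert_eq!(
    crawler_hits([
        "Googlebot/2.1 (+http://www.google.com/bot.html)",
        "Mozilla/5.0 (X11; Linux x86_64; rv:115.0) Gecko/20100101 Firefox/115.0",
    ]),
    1
);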

The default build has no dependencies. crawler_info requires the database feature.

Database

crawler_info(user_agent) returns Option<&'static CrawlerInfo> from the bundled Crawlerdex DB. Matching is separate from is_crawler.

use iscrawl::crawler_info;

assert_eq!(
    crawler_info("Googlebot/2.1").unwrap().description,
    "Google's main web crawling bot for search indexing"
);
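
As a hedged counter-example (assuming, as seems likely, that ordinary browser User-Agents have no entry in the bundled database), a non-crawler string yields None:

use iscrawl::crawler_info;

// Assumption: a plain browser UA matches no Crawlerdex entry, so lookup yields None.
assert!(crawler_info(
    "Mozilla/5.0 (X11; Linux x86_64; rv:115.0) Gecko/20100101 Firefox/115.0"
).is_none());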

Update DB:

curl -fsSL https://github.com/tn3w/Crawlerdex/releases/latest/download/crawlers.min.json \
  -o crawlers.min.json

Why fast

  • is_crawler: stack buffer of 512 bytes, no heap.
  • First-byte lookup table prunes 99% of needle scans.
  • Single pass over the lowered input.
  • Thread-local 256-slot direct-mapped cache keyed by pointer/length with edge-word guards.
  • crawler_info: Aho-Corasick literals + chunked regex fallback.
  • lto = "fat", codegen-units = 1, panic = "abort".

Benchmarked on x86_64: cold ~140 ns/call, warm cache hit ~5 ns/call.
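
To make the first-byte pruning concrete, here is a minimal, illustrative sketch; the needle list and function names are hypothetical and not the crate's internals:

// Illustrative only: a 256-entry table marks which bytes can start a needle,
// so most positions in the (already lowercased) input are skipped cheaply.
const NEEDLES: &[&str] = &["bot", "crawl", "spider", "scanner", "+http"];

fn first_byte_table(needles: &[&str]) -> [bool; 256] {
    let mut table = [false; 256];
    for needle in needles {
        if let Some(&b) = needle.as_bytes().first() {
            table[b as usize] = true;
        }
    }
    table
}

fn contains_needle(lowered: &str, needles: &[&str], table: &[bool; 256]) -> bool {
    let bytes = lowered.as_bytes();
    for i in 0..bytes.len() {
        // Skip positions whose byte cannot begin any needle.
        if !table[bytes[i] as usize] {
            continue;
        }
        if needles.iter().any(|n| bytes[i..].starts_with(n.as_bytes())) {
            return true;
        }
    }
    false
}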

How it decides

  1. Empty input counts as crawler.
  2. Input over 512 bytes is rejected (returns false).
  3. If any crawler keyword (bot, crawl, spider, scanner, +http, @, archive, ...) appears: crawler.
  4. If the UA does not start with Mozilla/ or Opera/ and has no known browser engine token (gecko, webkit, chrome, firefox, msie, edge, opera, ...): crawler.
  5. If the UA starts with Mozilla/ or Opera/ but is missing both an engine token and the (compatible; marker: crawler.
  6. Otherwise: browser.

is_crawler is heuristic. crawler_info is feature-gated + database-backed.
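
The numbered steps translate roughly into the sketch below; the keyword and engine-token lists are abbreviated and the real implementation differs, so treat this as an illustration of the decision order only:

// Illustrative restatement of the decision order; not the crate's code.
fn is_crawler_sketch(ua: &str) -> bool {
    // 1. Empty input counts as a crawler.
    if ua.is_empty() {
        return true;
    }
    // 2. Input over 512 bytes is rejected.
    if ua.len() > 512 {
        return false;
    }
    let lowered = ua.to_ascii_lowercase();
    // 3. Any crawler keyword means crawler.
    const KEYWORDS: &[&str] = &["bot", "crawl", "spider", "scanner", "+http", "@", "archive"];
    if KEYWORDS.iter().any(|k| lowered.contains(k)) {
        return true;
    }
    const ENGINES: &[&str] = &["gecko", "webkit", "chrome", "firefox", "msie", "edge", "opera"];
    let browser_prefix = lowered.starts_with("mozilla/") || lowered.starts_with("opera/");
    let has_engine = ENGINES.iter().any(|e| lowered.contains(e));
    // 4. Neither a browser prefix nor an engine token: crawler.
    if !browser_prefix && !has_engine {
        return true;
    }
    // 5. Browser prefix but no engine token and no "(compatible;" marker: crawler.
    if browser_prefix && !has_engine && !lowered.contains("(compatible;") {
        return true;
    }
    // 6. Otherwise: browser.
    false
}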

Accuracy

Measured against bundled fixture corpora:

corpus                          size     result
crawler_user_agents.txt         2,149    95.4% detected
loadkpi_crawlers.txt            3,696    94.7% detected
crawler_user_agents_pgts.txt      156    98.1% detected
browser_user_agents.txt        19,897    <1% false positive

Run cargo test --release to verify on your machine.
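
To run the same kind of measurement against your own User-Agent list, a small sketch like this works (the file path is hypothetical):

use std::fs;

use iscrawl::is_crawler;

fn main() -> std::io::Result<()> {
    // Hypothetical path: any newline-separated list of known crawler User-Agents.
    let corpus = fs::read_to_string("crawler_user_agents.txt")?;
    let uas: Vec<&str> = corpus.lines().filter(|l| !l.trim().is_empty()).collect();
    let detected = uas.iter().filter(|ua| is_crawler(ua)).count();
    println!(
        "{detected}/{} detected ({:.1}%)",
        uas.len(),
        100.0 * detected as f64 / uas.len() as f64
    );
    Ok(())
}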

Bench

cargo bench --bench bench
cargo bench --features database --bench database

run           ns/call   M calls/s
cold corpus     137.1        7.30
warm hits         4.5      222.26

Fixture corpus: 25,898 User-Agents.

Develop

cargo build --release
cargo test --release
cargo test --release --features database
cargo test --doc
cargo doc --no-deps --open

Format

Standard formatting is enforced across Rust, Markdown, and YAML files.

cargo fmt --all
cargo clippy --all-targets -- -D warnings
npx --yes prettier --write --single-quote --print-width=100 --trailing-comma=es5 --end-of-line=lf "**/*.{md,yml}"

CI enforces cargo fmt --check and cargo clippy -D warnings on every push.

Publish

Pushes to main/master trigger .github/workflows/publish.yml:

  1. Reads name + version from Cargo.toml.
  2. Skips if that version is already on crates.io.
  3. Tests on Ubuntu, macOS, Windows × stable, beta.
  4. Runs cargo publish using CARGO_REGISTRY_TOKEN.
  5. Tags the commit vX.Y.Z.

Bump version in Cargo.toml, push to main, done.

Funding

If this saved you time, buy me a coffee.

License

Apache-2.0. See LICENSE.
