
Complete the cutover: runner+fingerprinter on production path, legacy deletion, CLI surface #46

@filipeforattini

Description


Problem Statement

The previous wave of work — PRDs #15, #24, and #25 across 27 sub-issues — built the JobRunner extraction infrastructure and the unified fingerprint::Fingerprinter engine in parallel, but every slice that touched Crawler::process_job or the legacy detection modules deferred the load-bearing change to "the next release" and shipped under #[deprecated] warnings. As of today on main:

  • Crawler::process_job still runs the inline ~600 LOC fetch/extract/detect/escalation block. The runner: Arc<JobRunner> and fingerprinter: Arc<Fingerprinter> fields are constructed on every Crawler and cloned in clone_refs, but no code path inside process_job ever calls them. The runner-as-entry-point goal from PRD Extract JobRunner from crawler.rs (strangler refactor) #15 Q2 is unmet on every active job.
  • src/discovery/tech_fingerprint.rs is still 1312 LOC and is the only detector process_job consults for tech inference. The fingerprint::Fingerprinter engine — 14 Hot sources, 4 Warm sources, 2 Cold sources, 138 unit tests — produces a FingerprintReport nothing in production reads.
  • runner::ChallengeDetector is #[deprecated] but still alive. policy::engine constructs it on every fetch attempt to make the antibot escalation call. runner::JobRunner::run constructs it as a fallback when fingerprinter isn't injected, which is "always" in production because process_job never calls runner.run.
  • antibot::block_detector is #[deprecated] but still alive. Same situation — BlockPatternSource in fingerprint::target::sources ports the logic; nothing reads it.
  • antibot::detect_from_html / detect_from_http_response / detect_from_cookies are still public functions. render::pool calls them post-render. No fingerprint-engine path runs against post-render HTML.
  • error::AntibotVendor and antibot::ChallengeVendor are #[deprecated] enums with From conversions to fingerprint::Vendor. Their conversion paths exist but call sites in policy::engine, runner::JobRunner, antibot::*, crawler.rs, and tests/{escalation,session_scope_policy,antibot_detection}.rs still consume the old variants. 333 deprecation warnings ship in cargo build --tests today.
  • runner::AutoFetcher + AutoOutcome are #[deprecated] per ADR-0004 and have zero consumers — pure dead code that still compiles.
  • runner::SessionStatePlaceholder and ChallengeSignalPlaceholder are still the types JobOutcome.new_session_state and JobOutcome.signals carry. Vec<ChallengeSignalPlaceholder> is the typed output of every successful runner invocation today.
  • Fingerprinter::analyze_warm and analyze_cold don't exist. The Warm/Cold sources (h2 SETTINGS, robots.txt, well-known, favicon, DNS, ASN) are registered in Engine but Engine::analyze_hot is the only dispatcher. No engine method ever invokes the higher-tier sources.
  • TargetContext carries only Hot-tier fields (status, headers, body, final_url). The Warm/Cold sources have empty analyze implementations because the Option<&T> slots from the parent PRD (h2 SETTINGS frame, robots body, well-known probe, favicon bytes, DNS observation, ASN info, peer cert) never made it onto the struct.
  • SelfFingerprint is never populated at runtime. The compute functions (compute_ja3, compute_ja4, compute_h2_settings_fingerprint) take parsed fields, but no plumbing in impersonate::tls::build_connector captures our ClientHello bytes to feed them.
  • The catalog ships PLACEHOLDER hashes ("PLACEHOLDER_chrome131_ja3_md5" and friends). Coherence::compute_coherence runs against the placeholders, which means our_ja3_matches_profile is always either None or Some(true) against a synthetic value — there is no real drift detection on main.
  • There is no CLI surface for the new fingerprint module. crawlex fingerprint <url> does not exist. --deep-fingerprint and --audit-tls flags do not exist. Operators cannot drive the engine from outside a Rust binary.

The honest grade from the prior wave — "infrastructure shipped, runner is dead code in production, god-module unsolved" — is exactly the grade today. The deferred items are the actual cutover. Without them, three PRDs of detection / runner work have zero production impact: every byte fetched on main flows through the same inline dispatch and the same three legacy detectors that existed before PRD #15 started.

Solution

Close the cutover. One unified PRD covering the nine deferred items, sequenced so each step is reviewable and the NDJSON event-stream regression test from issue #16 catches any wire drift across the whole chain.

Phase 1 — process_job cutover (the real surgery).
Replace the inline fetch/extract/detect block in Crawler::process_job with runner.run(&job, &ctx).await. The runner is constructed once with the Arc<Fingerprinter> from the same Crawler instance — antibot detection flows through Fingerprinter::analyze_hot, not through ChallengeDetector. The SessionContext is built from existing Crawler state (proxy router lease, antibot session state, render budgets, resolved policy profile). After this phase, the runner is the production entry point for per-Job execution and crawler.rs shrinks measurably.

Phase 2 — JobOutcome widening.
JobOutcome.signals becomes Vec<fingerprint::Detection> (drops ChallengeSignalPlaceholder). JobOutcome.new_session_state becomes Option<antibot::SessionState> (drops SessionStatePlaceholder). The two remaining placeholder types are deleted. Every caller — runner internals, integration tests, Crawler post-processing — reads the typed shapes.

Phase 3 — Legacy detector deletion.
With every in-tree caller migrated, src/discovery/tech_fingerprint.rs, src/runner/challenge.rs, src/antibot/block_detector.rs, the antibot::detect_* functions, runner::AutoFetcher + AutoOutcome, and the error::AntibotVendor + antibot::ChallengeVendor enums are deleted. The 333 deprecation warnings drop to zero.

Phase 4 — Warm/Cold tier activation.
Fingerprinter::analyze_warm(host) and analyze_cold(host) get real implementations: the engine fetches robots.txt / well-known probes / favicon (via existing ImpersonateClient), pulls DNS records (via existing discovery::dns), runs RDAP lookups (via existing discovery::rdap), and populates the cache. TargetContext gains the optional Warm/Cold slots. Engine::analyze_hot keeps its current speed budget; Warm runs once per host:port per TTL window; Cold runs on operator opt-in.

Phase 5 — Self-fingerprint live capture.
A hook into impersonate::tls::build_connector captures the ClientHello bytes after BoringSSL assembles them. The compute functions in fingerprint::introspect run against the live bytes. SelfFingerprint.profile_expected populates from the catalog. SelfFingerprint.matches_profile and drift_signals carry real values. The first authoritative ClientHello capture per impersonate::Profile replaces the PLACEHOLDER hashes in catalog.rs.

Phase 6 — CLI surface.
crawlex fingerprint <url> subcommand drives the engine end-to-end and prints a FingerprintReport. --deep-fingerprint enables Cold-tier sources. --audit-tls enables the FP-B external oracle (tls.peet.ws). The subcommand reuses the Crawler plumbing so headers / cookies / TLS state come from a real fetch.

After this PRD lands, the runner is the production per-Job entry point, the Fingerprinter is the only detection authority, the legacy modules are gone, the Warm/Cold tiers run when expected, SelfFingerprint reports real outbound identity with drift detection, operators can run crawlex fingerprint example.com from the CLI, and cargo build --tests produces zero deprecation warnings.

User Stories

  1. As an operator, I want every fetched job in production to flow through JobRunner::run, so that the per-Job timings / events / retry decisions that PRD Extract JobRunner from crawler.rs (strangler refactor) #15 specified actually run on real traffic.
  2. As an operator, I want the antibot detection on production fetches to come from Fingerprinter::analyze_hot, so that the Evidence-rich Detections from PRD fingerprint/ module: target+self detection, comparable or better than redblue #25 reach the events / storage / SDK instead of being computed in tests only.
  3. As an operator, I want JobOutcome.signals to carry Vec<fingerprint::Detection>, so that downstream consumers (storage, SDK, events) see the same Evidence model the rest of the codebase uses.
  4. As an operator, I want JobOutcome.new_session_state to carry the real antibot::SessionState, so that the Crawler's session-state commit path matches the type the antibot subsystem already uses.
  5. As an operator, I want cargo build --all-features --tests to emit zero deprecation warnings, so that real deprecations introduced in the future are visible against a clean baseline.
  6. As an operator, I want src/discovery/tech_fingerprint.rs deleted from the tree, so that the 1312-LOC duplicate detection path stops drifting alongside the unified engine.
  7. As an operator, I want src/runner/challenge.rs deleted, so that no caller in the codebase can accidentally pick the deprecated single-purpose detector over the unified engine.
  8. As an operator, I want src/antibot/block_detector.rs and the antibot::detect_* functions deleted, so that block / challenge detection has exactly one home.
  9. As an operator, I want error::AntibotVendor and antibot::ChallengeVendor deleted (both already #[deprecated]), so that fingerprint::Vendor is the only Vendor identity in the crate.
  10. As an operator, I want runner::AutoFetcher + AutoOutcome deleted per ADR-0004, so that the dead-code path is removed instead of carrying it forward indefinitely.
  11. As an operator, I want runner::SessionStatePlaceholder and ChallengeSignalPlaceholder deleted, so that the runner module exports only types operators are meant to construct.
  12. As an operator running crawlex fingerprint https://www.drogasil.com.br, I want a CLI surface that prints the FingerprintReport for that host (CDN / WAF / Antibot / CMS / Ecommerce / TLS profile / coherence warnings), so that recon does not require writing Rust.
  13. As an operator running crawlex fingerprint <url> --deep-fingerprint, I want the Cold tier to fire (DNS / ASN / RDAP), so that I get the full vendor map when I explicitly ask for it.
  14. As an operator running crawlex fingerprint <url> --audit-tls, I want the external oracle (tls.peet.ws) to compare its view of my outbound TLS handshake against the live capture, so that proxy / middlebox alterations surface.
  15. As an operator running any crawl, I want Fingerprinter::analyze_warm to fire once per host per TTL window, so that h2 SETTINGS / robots.txt / well-known / favicon facts arrive on the second fetch of a host without rerunning per-job.
  16. As an operator, I want the Warm-tier cache to honor Fingerprinter::invalidate(host), so that I can force a refresh when I suspect the target rotated its stack mid-crawl.
  17. As an operator, I want SelfFingerprint.ja3_hash to carry the real MD5 hash of our outbound ClientHello, so that comparing against catalog detects BoringSSL regression and proxy alteration.
  18. As an operator, I want SelfFingerprint.h2_settings_fp to carry the real hash of our h2 SETTINGS frame, so that drift versus the Akamai-style fingerprint catalog is detectable.
  19. As an operator, I want the catalog values for Profile::Chrome131Stable / Chrome132Stable / Chrome149Stable to be measured hashes rather than "PLACEHOLDER_*" strings, so that the drift detection logic produces meaningful answers.
  20. As an operator, I want Coherence.our_ja3_matches_profile to report Some(true) on a healthy run and Some(false) on a regression, so that the cross-check actually runs against real data.
  21. As an operator, I want Coherence.their_antibot_compatible_with_our_profile to surface false when the detected antibot vendor is on the flagged list, so that "we look like Chrome131, target is Akamai Bot Manager which currently flags Chrome131" reaches my log before I waste the crawl budget.
  22. As an operator, I want NDJSON wire events to carry fingerprint::Detection payloads (with the Evidence list) on challenge.detected and tech.fingerprint_detected, so that the SDK / dashboards see the same structured output the engine produces internally.
  23. As an operator, I want cargo test --all-features --test runner_ndjson_regression to remain byte-stable across every commit in this PRD, so that the wire contract from issue runner: bootstrap module shells + NDJSON regression test harness #16 is the gating trip wire through the entire cutover.
  24. As an operator, I want cargo test --all-features --test runner_integration to keep passing through every commit, so that the 4-scenario contract from A5 (healthy 200, 403 challenge, connection refused, blackhole route) protects the cutover the same way the unit tests protect each source.
  25. As a contributor, I want Crawler::process_job to read as a thin loop that calls runner.run(...) and post-processes the outcome, so that the orchestrator's role is visually separated from per-job execution after this PRD.
  26. As a contributor, I want src/crawler.rs LOC to drop measurably (target: <3000 LOC, down from current 3618), so that the god-module deepening goal from PRD Extract JobRunner from crawler.rs (strangler refactor) #15 finally has a non-zero result.
  27. As a contributor, I want the JobOutcome.signals: Vec<Detection> migration to be a single boundary change with no intermediate adapters, so that there is one shape consumers read.
  28. As a contributor, I want each phase of the PRD to land as a separate PR that ships green via cargo test --all-features --no-fail-fast, so that the strangler discipline from PRD Extract JobRunner from crawler.rs (strangler refactor) #15 continues.
  29. As a contributor, I want tests/escalation.rs, tests/session_scope_policy.rs, and tests/antibot_detection.rs to migrate from ChallengeVendor/AntibotVendor to fingerprint::Vendor cases, so that the deletion phase has no test-suite fallout.
  30. As a contributor, I want the engine's Warm-tier dispatch to consume the existing ImpersonateClient for robots/well-known/favicon fetches and the existing discovery::dns / discovery::rdap for Cold-tier, so that no new external dependency is introduced for plumbing.
  31. As a contributor, I want the ClientHello capture in impersonate::tls::build_connector to be a thin hook returning bytes (under 100 LOC of new code in impersonate/), so that the impersonate module retains its current shape.
  32. As a contributor, I want a SelfFingerprint::capture_live() async helper that returns a populated SelfFingerprint after a single HTTPS request, so that the CLI surface can call one method to get the full snapshot.
  33. As a contributor adding a new CDN vendor in the future, I want the deletion of tech_fingerprint to be irreversible (no #[cfg] flag preserving the old path), so that we cannot accidentally regress into the dual-path state.
  34. As a contributor, I want the CLI crawlex fingerprint output format to follow the JSON shape of FingerprintReport, so that operators can pipe through jq without bespoke parsing.
  35. As a contributor, I want the CLI subcommand to share construction code with Crawler (same ImpersonateClient, same Fingerprinter defaults), so that the engine behavior is identical whether invoked from a crawl or from the CLI.
  36. As a future maintainer, I want CONTEXT.md to be updated to remove the Avoid note that flags ChallengeDetector as a separate term, so that the glossary reflects the post-cutover reality.
  37. As a future maintainer, I want an ADR (ADR-0005) recording the cutover decision and any irreversible deletions, so that the rationale is one search away if the dual-path state ever returns as a proposal.
  38. As a future maintainer, I want the placeholder hashes in fingerprint::introspect::catalog to be replaced with real measurements as part of this PRD, so that the catalog is not shipped as a deferred-forever stub.

Implementation Decisions

Sequence. Six phases land as separate PRs in this order: (1) process_job cutover, including the policy::engine swap to Fingerprinter → (2) JobOutcome widening → (3) legacy detector deletion → (4) Warm/Cold tier dispatch + TargetContext widening → (5) Self-fingerprint live capture + real catalog hashes → (6) CLI surface. Each PR shipping green keeps main clean. The strangler discipline from PRD #15 continues — no commit lands with a known regression in the existing test suite, and the NDJSON regression test from issue #16 is the byte-for-byte trip wire across all six phases.

Cutover boundary. Crawler::process_job keeps the front (queue pull, admission, budgets, robots, dedupe, rate limit) and the back (storage writes, frontier feed, retry decision honoring caps and cooldowns, session-state commit, run-level events). The middle — fetch dispatch, extract, challenge detect, per-attempt events — becomes a single let outcome = self.runner.run(&job, &ctx).await;. The Render path stays inline for this PRD's scope (render-specific consumption of RenderedPage fields — Web Vitals, screenshot, ScriptSpec outcome — is too entangled to fold into the runner here; a dedicated render-cutover PRD follows).

SessionContext construction. Crawler::process_job builds the SessionContext from existing state per Job: identity from the active ImpersonateClient profile + IdentityBundle; proxy from the ProxyRouter lease; session_state from Crawler.session_states[session_id]; budgets from render_budgets + per-job timing config; policy from the resolved PolicyProfile. The two remaining placeholder types (SessionStatePlaceholder / ChallengeSignalPlaceholder) are deleted in Phase 2 as part of widening JobOutcome.

Fingerprinter injection. Crawler::new constructs the Arc<Fingerprinter> once and injects it into the JobRunner via JobRunner::with_fingerprinter (already added in B14). After the cutover, JobRunner::run always has the Fingerprinter present; the legacy ChallengeDetector fallback path inside the runner becomes unreachable and is removed in Phase 3.

Policy::engine swap. policy::engine replaces the ChallengeDetector::new().detect(status, headers, body) call with Fingerprinter::analyze_hot(&ctx) reading report.antibot. The change is a single function and a small wrapper to build the TargetContext from the same (status, headers, body) slice. The policy::engine retains all its decision logic (retry caps, host cooldowns, budgets); only the detection source changes.

Vendor enum collapse. error::AntibotVendor and antibot::ChallengeVendor are deleted. Every reference (runner::JobRunner, policy::engine, antibot::*, crawler.rs, tests/{escalation,session_scope_policy,antibot_detection}.rs) migrates to fingerprint::Vendor. The From conversions added in B7 are removed in the same commit since the source enums no longer exist.

Legacy module deletion. Phase 3 deletes src/discovery/tech_fingerprint.rs, src/runner/challenge.rs, src/antibot/block_detector.rs, and src/runner/fetcher/auto.rs. The antibot::detect_from_html / detect_from_http_response / detect_from_cookies functions are removed; src/antibot/mod.rs loses 290 LOC, but the bypass, cookie_pin, solver, telemetry, and recaptcha submodules stay (action paths, not detection — out of scope per ADR-0003). The src/antibot/signatures.rs data tables are distributed into the sources that consume them.

JobOutcome widening. JobOutcome.signals: Vec<ChallengeSignalPlaceholder> becomes Vec<fingerprint::Detection>. JobOutcome.new_session_state: Option<SessionStatePlaceholder> becomes Option<antibot::SessionState>. Both placeholder types are deleted. JobRunner::run populates the typed fields directly from Fingerprinter::analyze_hot and from any session-state mutation it observes.

Warm tier dispatch. Fingerprinter::analyze_warm(host: &str, client: &ImpersonateClient) -> Vec<Detection> is a new async method. It fetches https://{host}/robots.txt, /.well-known/security.txt, /.well-known/openid-configuration, and /favicon.ico in parallel, then calls each registered Warm-tier source with the fetched bytes via the source's public classify_* helper. Results cache in WarmCache keyed by host:port with 24h TTL.

Cold tier dispatch. Fingerprinter::analyze_cold(host: &str, dns: &DnsClient, rdap: &RdapClient) -> Vec<Detection> is a new async method gated behind operator opt-in. Resolves A/AAAA/CNAME via existing discovery::dns, queries RDAP via existing discovery::rdap, and feeds the results to DnsSource::classify_cnames / AsnSource::classify. Never auto-runs from process_job.

TargetContext widening. Optional Warm/Cold slots are added: h2_settings: Option<&[(u16, u32)]>, robots_body: Option<&str>, well_known: Option<&WellKnownProbe>, favicon_md5: Option<&str>, peer_cert: Option<&PeerCert>, dns: Option<&DnsObservation>, asn: Option<&AsnInfo>. Hot-tier sources continue to ignore the new fields; Warm/Cold sources read what they need.

ClientHello capture. A thin hook in impersonate::tls::build_connector records the ClientHello bytes assembled by BoringSSL and exposes them via an internal accessor (ImpersonateClient::last_client_hello() -> Option<&[u8]>). SelfFingerprint::capture_live(client: &ImpersonateClient) -> Option<SelfFingerprint> parses the bytes (TLS record + handshake header + extension blocks) and runs compute_ja3 / compute_ja4. The h2 SETTINGS fingerprint is captured similarly, from the first SETTINGS frame we send on each connection.

Catalog hardening. During Phase 5, the test suite captures SelfFingerprint once per Profile against a known endpoint, records the resulting JA3 / JA4 / h2_fp values, and replaces the PLACEHOLDER_* entries in fingerprint::introspect::catalog with the measured hashes. The recording is a one-shot test (#[ignore] by default, run with --ignored to refresh) plus a checked-in golden file under tests/fixtures/. Drift detection then runs against authoritative values.

CLI surface. Phase 6 adds crate::cli::fingerprint::cmd(args) that constructs a minimal Crawler (HTTP-only, no queue, no storage), drives one fetch of the target URL, runs Fingerprinter::analyze_hot followed by Warm-tier when not opted out, and prints the FingerprintReport as JSON to stdout. Flags: --deep-fingerprint (additionally runs Cold tier), --audit-tls (additionally runs the external oracle and includes the result in the JSON output). The existing crawlex CLI binary dispatches the subcommand alongside the existing crawl / other subcommands.

Test suite migration. tests/escalation.rs, tests/session_scope_policy.rs, tests/antibot_detection.rs, and any other test importing ChallengeVendor or AntibotVendor migrate to fingerprint::Vendor. Tests that previously asserted ChallengeDetector::detect(...) semantics call Fingerprinter::analyze_hot(&ctx).antibot and assert on Detection-level Evidence.

Performance. Hot tier per-fetch overhead must stay under 2ms p99 on a 100KB HTML response (already asserted in PRD #25 Phase 4). Warm tier fetches must complete in under 5s per host. Cold tier is opt-in, no budget. process_job total latency must not regress vs main baseline (measured via the existing --bench harness in tests/).

Testing Decisions

Definition of a good test for this work. Tests assert observable behavior at the public boundary of each module under change. For the cutover, that is: the JobOutcome returned by JobRunner::run matches the structured fields that consumer code (storage write, frontier feed) reads. For policy::engine, it is the Decision returned for a given (status, headers, body) triple. For the Fingerprinter Warm/Cold dispatchers, it is the cached FingerprintReport for a host. No assertion on private fields, internal cache layout, or commit-order beyond what the public contract guarantees.

Modules to be tested.

  1. Crawler::process_job post-cutover. A new integration test under tests/ drives a full Crawler::run against a wiremock server (the pattern from tests/mini_http_only.rs and the existing runner_ndjson_regression.rs). Asserts that per-attempt events fire from the runner on the wire, that storage records the same body / signals as before, and that the NDJSON event-kind sequence is byte-identical to the golden from issue runner: bootstrap module shells + NDJSON regression test harness #16.

  2. JobOutcome typed-field migration. Existing 44 runner unit tests are updated to assert on Vec<Detection> instead of Vec<ChallengeSignalPlaceholder> and on Option<antibot::SessionState> instead of Option<SessionStatePlaceholder>. New tests cover the round-trip from Fingerprinter::analyze_hot Detections through JobOutcome.signals to Crawler post-processing.

  3. policy::engine Fingerprinter integration. Existing policy::engine tests stay; their expected Decision outputs do not change because the antibot signal payload that drives Decision::Render arrives the same way. New test cases assert that a 403 + cf-chl-bypass body produces Decision::Render with the same DecisionReason::antibot_challenge shape, and that the Vendor field on the reason uses fingerprint::Vendor::Cloudflare directly.

  4. Fingerprinter::analyze_warm end-to-end. New integration test under tests/ drives the Warm tier against a wiremock-served robots.txt, security.txt, openid-configuration, and favicon. Asserts the FingerprintReport populates the corresponding source slots and the cache holds the entry with the configured TTL.

  5. Fingerprinter::analyze_cold against mocked DNS / RDAP. Test against a mocked DNS client returning a cloudfront.net CNAME and a mocked RDAP client returning AS13335. Asserts the Cold-tier Detections land in report.cdn / report.dns_hosting.

  6. SelfFingerprint::capture_live against a real HTTPS request. Test against a wiremock TLS server (wiremock 0.6 supports TLS). Captures the ClientHello, runs compute_ja3, asserts the hash matches the catalog entry for the active Profile::Chrome131Stable. The catalog entry is the one recorded in the one-shot #[ignore] test described in Phase 5.

  7. Coherence end-to-end. Constructed scenarios: clean profile + no antibot → both bools true; clean profile + Akamai Bot Manager detected → second false + warning; drift introduced via mocked ClientHello with a different cipher list → first false + warning. The drift case proves the live-capture-vs-catalog path produces meaningful answers.

  8. CLI subcommand smoke. Test in tests/cli_fingerprint.rs (new file) drives crawlex fingerprint http://wiremock-mock-url and asserts the printed JSON parses as a FingerprintReport with host, cdn, cookie_pattern, etc. fields populated. A second case runs with --audit-tls against a mocked oracle endpoint and asserts coherence.our_ja3_matches_profile populated.

  9. NDJSON regression byte-stable through all six phases. The trip-wire test from issue runner: bootstrap module shells + NDJSON regression test harness #16 continues to pass through every commit. New event payloads (Evidence list on challenge.detected / tech.fingerprint_detected) are explicitly compatible (new fields, existing fields unchanged) — the golden file gains the new fields in Phase 2 in a deliberate diff commit reviewed alongside the JobOutcome widening.

  10. Test-suite migration trip wire. Pre-deletion, a "migration parity" suite runs both the old ChallengeVendor path and the new fingerprint::Vendor path against the same fixtures and asserts identical outputs. The parity test is deleted alongside the legacy enums in Phase 4.

Prior art in the codebase.

  • tests/runner_ndjson_regression.rs — byte-stable trip wire from slice runner: bootstrap module shells + NDJSON regression test harness #16. Continues to guard every commit.
  • tests/runner_integration.rs — 4-scenario test from A5. Continues to pass through every phase.
  • tests/mini_http_only.rs — pattern for wiremock-backed Crawler::run end-to-end tests.
  • tests/escalation.rs — pattern for (status, headers, body) table-driven antibot detection tests; informs the policy::engine swap test cases.
  • src/fingerprint/target/sources/*::tests — per-source unit-test pattern that the Warm/Cold dispatch tests follow.

Out of Scope

  • The render path's full cutover. Method::Render continues to flow through the inline render dispatch in Crawler::process_job. The render pipeline's consumption of RenderedPage (Web Vitals, screenshot to storage, asset_refs, tech_fingerprint runtime data, ScriptSpec RunOutcome) is entangled with the existing Crawler post-processing in ways that warrant a dedicated render-cutover PRD. The runner's FetchOutput::Rendered variant continues to exist; this PRD does not exercise that variant from process_job.
  • The deeper SessionIdentity unification (architecture review candidate #3 from PRD Extract JobRunner from crawler.rs (strangler refactor) #15). SessionContext.identity carries a thin bundle of ImpersonateClient + IdentityBundle + cookies as it does today. Full unification stays out of scope.
  • The unified LifecycleHook collapsing hooks + events + script (architecture candidate #5). The three layers keep their identities; only the emitter of per-attempt events moves from Crawler to JobRunner, which already happened structurally in slice runner: move per-attempt event/hook emission into JobRunner; cleanup dead helpers in crawler.rs #23 and activates on the wire after Phase 1.
  • The DiscoveryBackend registry (architecture candidate #4). Discovery adapters under discovery/ continue to operate as today.
  • Splitting config.rs (architecture candidate #6). The runner reads from the existing flat Config via SessionContext.policy.
  • Frontier admission deepening (architecture candidate #7). Admission stays on Crawler.
  • New CLI subcommands beyond crawlex fingerprint. The existing crawl / scrape / etc. subcommands are unchanged.
  • New external dependencies for fingerprinting computation. JA3 (md-5 already added in B10), JA4 (sha2 already a dep), h2 SETTINGS hashing (sha2). No new crates.
  • Public catalog data for FP-B beyond the three Chrome profiles already in impersonate::Profile. Adding a Firefox / Safari profile catalog entry is a follow-up PRD if those profiles ever land in impersonate.
  • Active probing (sending crafted requests to elicit specific responses). All Fingerprinter sources remain passive — they observe responses the Crawler / CLI already makes.
  • New configuration surfaces beyond the three flags this PRD adds (--deep-fingerprint, --audit-tls, and the crawlex fingerprint subcommand itself). Anything more granular waits for operator feedback after first ship.

Further Notes

Metadata

Labels: enhancement (New feature or request), needs-triage (Awaiting triage), rust (Pull requests that update Rust code)