Problem Statement

PRD #15 landed the JobRunner extraction in 7 strangler slices (#16-#23). Every merged slice's PR comment ended with the same caveat:
Scope honesty — what this slice does NOT do: Crawler::process_job is NOT yet rewired to call runner.run. The runner is operational and test-locked; the strangler swap-in to process_job is the next concrete follow-up.
That follow-up never landed. As of today on main, src/crawler.rs is still 3604 LOC — the same as before PRD #15 started. runner::JobRunner::run exists with 44 unit tests, fires per-attempt lifecycle events, and classifies errors into RetryDecision, but nothing in production calls it. The runner is a parallel implementation that only unit tests exercise.
runner::AutoFetcher exists but Crawler never constructs one. The auto-path escalation still flows through policy::engine (which we did rewire to consume ChallengeDetector in slice #19, "runner: ChallengeDetector absorbs src/escalation.rs body").
runner::SessionContext carries six placeholder unit-structs (SessionIdentityPlaceholder, ProxyLeasePlaceholder, SessionStatePlaceholder, JobBudgetsPlaceholder, PolicyProfilePlaceholder, ChallengeSignalPlaceholder). The seam is named, not deep.
Fetcher::fetch returns Result<impersonate::Response>, which forces RenderFetcher::fetch to return Err(Error::Render("intentionally unimplemented")). A trait method that returns an explicit "not implemented" error is a code smell.
The honest grade for the JobRunner extraction today is C+: solid specification, four real seams integrated (Spoof / Extract / Challenge / Render), but the god-module problem is untouched and the runner is dead code in production. The slices we shipped are the scaffolding; the building was never built on top.
Solution
Close the gap. Five concrete moves, sequenced so each is reviewable and reversible:
Widen the Fetcher trait to return FetchOutput — an enum carrying either an HTTP response (spoof) or a rendered page (render). Common helpers on the enum (status, headers, body, final_url) let extractors and detectors stay variant-agnostic. RenderFetcher::fetch becomes honest — it returns FetchOutput::Rendered, not an error.
Cutover Crawler::process_job to call runner.run — replace the inline fetch/extract/detect block (~600 LOC in the middle of process_job) with runner.run(&job, &ctx).await plus post-processing (storage writes, frontier feed, retry decision, session-state commit). The NDJSON regression from issue #16 (runner: bootstrap module shells + NDJSON regression test harness) is the trip wire — byte-stable event sequence pre/post cutover.
Concrete SessionContext — drop all six placeholder types, replace with the real fields the runner needs: Arc<SessionIdentity> (today still ImpersonateClient + IdentityBundle + cookies bundled), optional ProxyLease, real SessionState from antibot::SessionState, JobBudgets (render_ms / total_ms remaining), Arc<PolicyProfile>.
Resolve AutoFetcher ambiguity — pick one source of truth for auto escalation. Either AutoFetcher becomes the only path (and the policy::engine escalation logic is deleted or thinned), or AutoFetcher itself is deleted and policy::engine stays as the source. Two implementations of the same logic are noise.
Integration test the runner against wiremock — runner.run exercised against a real HTTP server (mock), not just FakeFetcher. Asserts full SpoofFetcher → ChallengeDetector → Extractor pipeline produces the expected JobOutcome for healthy/challenge/network-error/timeout responses.
After these five land, the JobRunner extraction is complete: process_job shrinks measurably, the runner is the production entry point for per-Job execution, all placeholder types are gone, the trait surface is honest, and there is one source of truth for auto escalation.
User Stories
As a contributor opening src/crawler.rs for the first time, I want the file to be substantially shorter, so that I can read and understand the orchestrator without scrolling through 3604 lines of inline fetch / extract / detect / event-emission code.
As a contributor, I want Crawler::process_job to read as a small loop that delegates per-Job execution to JobRunner, so that the orchestrator's role (queue pull, admission, budgets, storage, frontier feed) is visually separated from the runner's role (fetch, detect, extract).
As a contributor, I want Fetcher::fetch to return a FetchOutput enum (Http or Rendered), so that both spoof and render paths satisfy the same trait without one of them returning a fake error.
As a contributor, I want common helpers on FetchOutput (status, headers, body, final_url), so that downstream code (extractor, challenge detector, link extraction) can stay variant-agnostic and avoid match boilerplate.
As a contributor, I want AutoFetcher::fetch to call SpoofFetcher::fetch then return FetchOutput::Http together with a RetryDecision::Suggest { EscalateToRender } when a challenge is detected, so that the re-queue contract from ADR-0002 is enforced by the type system rather than by policy::engine side effects.
As a contributor, I want RenderFetcher::fetch to return FetchOutput::Rendered(RenderedPage) honestly, so that the render path uses the same trait as spoof without lossy conversions or fake error returns.
As a contributor, I want JobRunner::run to be the only place where one Job becomes one JobOutcome, so that per-attempt timings, signals, links, retry decision, and new session state all flow through one call site.
As a contributor, I want Crawler::process_job to construct a SessionContext, call runner.run, and then post-process storage / frontier / retry / session-state commit, so that the decide-vs-do split from PRD #15 (Extract JobRunner from crawler.rs, strangler refactor) is visible in the code structure.
As a contributor, I want SessionContext.identity to be a concrete Arc<SessionIdentity> carrying profile + cookies + persona, so that I stop reading "Placeholder" in field types.
As a contributor, I want SessionContext.proxy: Option<ProxyLease> to carry a real lease from proxy::ProxyRouter, so that the runner knows which proxy each attempt is using without reaching back into Crawler.
As a contributor, I want SessionContext.session_state to be the real antibot::SessionState enum, so that the runner can pass it to detection logic and return mutations via JobOutcome.new_session_state without translation layers.
As a contributor, I want SessionContext.budgets: JobBudgets to carry remaining render_ms / total_ms / attempts, so that the runner can fail fast with JobError::BudgetExhausted instead of issuing a doomed fetch.
As a contributor, I want SessionContext.policy: Arc<PolicyProfile> to be the resolved profile (fast / balanced / deep / forensics), so that escalation decisions inside the runner read the same profile the Crawler resolved at admission.
As a contributor, I want exactly one source of truth for "should this auto job escalate to render?", so that fixing a regression in escalation logic means touching one file, not two.
As a contributor, I want a single decision recorded in an ADR explaining whether AutoFetcher or policy::engine is the owner, so that the choice is not re-litigated every six months.
As an operator running a real crawl, I want the NDJSON event stream produced before the cutover to be byte-identical to the stream produced after, so that downstream consumers (SDK, analytics, dashboards) keep working without coordination.
As an operator, I want the per-attempt events (job.started, fetch.completed, challenge.detected, extract.completed, job.failed) to fire from inside the runner once it is the entry point, so that adding a new lifecycle event is a one-file change.
As a contributor adding a new fetch path in the future, I want Fetcher to be a real trait with Arc<dyn Fetcher> swap-in, so that a third backend (e.g., a Playwright bridge, a remote browser farm) can land without touching Crawler::process_job.
As a contributor, I want at least one integration test that drives runner.run against a wiremock server (no Chrome, no FakeFetcher), so that regressions in the SpoofFetcher → ChallengeDetector → Extractor pipeline are caught without needing a full Crawler boot.
As a contributor, I want the integration test to cover four scenarios — healthy 200, challenge 403 (Cloudflare body), network failure (timeout), connection refused — so that JobError classification and RetryDecision mapping are exercised against real HTTP transport.
As a contributor, I want cargo test --all-features --lib runner:: to keep passing after every commit in this PRD, so that the existing 44 runner unit tests act as a contract guard during the cutover.
As a contributor, I want cargo test --all-features --test runner_ndjson_regression to remain byte-stable through every commit, so that the wire format pinned in slice #16 stays the trip wire.
As a future maintainer, I want CONTEXT.md to be updated to drop the "placeholder" disclaimer from SessionContext, so that the glossary reflects what the type actually carries.
As a future maintainer, I want an ADR explaining the AutoFetcher vs. policy::engine decision, so that re-introducing two paths is recognized as a regression.
As a future maintainer, I want the src/crawler.rs LOC delta documented in the cutover PR description (before / after), so that the goal "deepen the god-module" is measurable.
Implementation Decisions
Sequence. Five items land as separate PRs in this order: FetchOutput widening (#4 in the grilling, this PRD's slice 1) → cutover (#1, slice 2) → concrete SessionContext (#2, slice 3) → AutoFetcher decision (#3, slice 4) → integration test (#5, slice 5). Items #3 and #5 are independent and can run in parallel with #2 once #1 lands. Sequence rationale: the trait shape must be honest before the cutover; the cutover surfaces which placeholder types are actually needed; the AutoFetcher decision becomes obvious after the cutover (either it's the source of truth or it's dead code); the integration test locks the contract once the wiring is real.
FetchOutput shape. Enum with two variants — Http(impersonate::Response) and Rendered(RenderedPage). Common helpers on the enum return the fields callers reach for most: status() -> u16, headers() -> &HeaderMap, body() -> &[u8], final_url() -> &Url. Render-only fields (Web Vitals, screenshot, ScriptSpec RunOutcome) require a match to access. Rationale: callers that only want body / status (extractor, challenge detector) stay simple; callers that legitimately need render extras (storage of vitals, screenshot artifacts) are forced to acknowledge they are render-specific.
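The shape above can be sketched as follows — a minimal std-only sketch, where the real impersonate::Response, RenderedPage, HeaderMap, and Url types are replaced with simplified stand-ins:

```rust
// Stand-ins for the crate types; the real payloads are richer.
struct HttpResponse { status: u16, body: Vec<u8>, final_url: String }
struct RenderedPage { status: u16, html: Vec<u8>, final_url: String, screenshot: Option<Vec<u8>> }

enum FetchOutput {
    Http(HttpResponse),
    Rendered(RenderedPage),
}

impl FetchOutput {
    // Variant-agnostic helpers: extractor and challenge detector never match.
    fn status(&self) -> u16 {
        match self { FetchOutput::Http(r) => r.status, FetchOutput::Rendered(p) => p.status }
    }
    fn body(&self) -> &[u8] {
        match self { FetchOutput::Http(r) => &r.body, FetchOutput::Rendered(p) => &p.html }
    }
    fn final_url(&self) -> &str {
        match self { FetchOutput::Http(r) => &r.final_url, FetchOutput::Rendered(p) => &p.final_url }
    }
    // Render-only data stays behind an explicit check: callers must acknowledge it.
    fn screenshot(&self) -> Option<&[u8]> {
        match self { FetchOutput::Http(_) => None, FetchOutput::Rendered(p) => p.screenshot.as_deref() }
    }
}

fn main() {
    let out = FetchOutput::Http(HttpResponse {
        status: 200,
        body: b"<html></html>".to_vec(),
        final_url: "https://example.com/".into(),
    });
    assert_eq!(out.status(), 200);
    assert_eq!(out.final_url(), "https://example.com/");
    assert!(out.screenshot().is_none());
    println!("status={} body_len={}", out.status(), out.body().len());
}
```

Downstream code that only needs status/body stays one method call away from either variant; anything render-specific goes through the Option (or a match in the real enum).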
Trait signature. async fn fetch(&self, job: &Job, ctx: &SessionContext) -> Result<FetchOutput>. SpoofFetcher returns Ok(FetchOutput::Http(...)). RenderFetcher returns Ok(FetchOutput::Rendered(...)). AutoFetcher returns Ok(FetchOutput::Http(...)) always (it does not chain inline per ADR-0002) and signals escalation via JobOutcome.retry projection from AutoOutcome::into_retry_decision().
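One object-safe way to express that signature without external crates is a boxed-future trait; the real codebase may instead use async-trait or native async fns in traits, and every type below is a stub:

```rust
use std::future::Future;
use std::pin::Pin;
use std::sync::Arc;
use std::task::{Context, Poll, RawWaker, RawWakerVTable, Waker};

// Stubs for the real crate types; status codes stand in for full payloads.
struct Job { url: String }
struct SessionContext;
enum FetchOutput { Http(u16), Rendered(u16) }
type FetchResult = Result<FetchOutput, String>;

// Object-safe: fetch returns a boxed future, so Arc<dyn Fetcher> swap-in
// works for spoof, render, and any future backend.
trait Fetcher: Send + Sync {
    fn fetch<'a>(&'a self, job: &'a Job, ctx: &'a SessionContext)
        -> Pin<Box<dyn Future<Output = FetchResult> + Send + 'a>>;
}

struct SpoofFetcher;
impl Fetcher for SpoofFetcher {
    fn fetch<'a>(&'a self, _job: &'a Job, _ctx: &'a SessionContext)
        -> Pin<Box<dyn Future<Output = FetchResult> + Send + 'a>> {
        Box::pin(async move { Ok(FetchOutput::Http(200)) })
    }
}

// Minimal executor for this demo: busy-polls with a no-op waker.
fn noop_waker() -> Waker {
    fn clone(_: *const ()) -> RawWaker { RawWaker::new(std::ptr::null(), &VTABLE) }
    fn noop(_: *const ()) {}
    static VTABLE: RawWakerVTable = RawWakerVTable::new(clone, noop, noop, noop);
    unsafe { Waker::from_raw(RawWaker::new(std::ptr::null(), &VTABLE)) }
}

fn block_on<F: Future>(mut fut: F) -> F::Output {
    let waker = noop_waker();
    let mut cx = Context::from_waker(&waker);
    let mut fut = unsafe { Pin::new_unchecked(&mut fut) };
    loop {
        if let Poll::Ready(v) = fut.as_mut().poll(&mut cx) { return v; }
    }
}

fn main() {
    let fetcher: Arc<dyn Fetcher> = Arc::new(SpoofFetcher);
    let job = Job { url: "https://example.com/".into() };
    let out = block_on(fetcher.fetch(&job, &SessionContext));
    assert!(matches!(out, Ok(FetchOutput::Http(200))));
    println!("spoof fetch ok: {}", job.url);
}
```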
Cutover scope. Crawler::process_job keeps the front (queue pull, admission, budget check, robots, dedupe, rate limit) and the back (storage writes, frontier feed, retry decision honoring caps and cooldowns, session-state commit, hooks, run-level events). The middle — fetch dispatch, extract, challenge detect, per-attempt events — becomes a single let outcome = self.runner.run(&job, &ctx).await; call. Crawler constructs the Arc<JobRunner> once at startup (or lazily after first job) using the existing SpoofFetcher, RenderFetcher if cdp-backend, and AutoFetcher (or its replacement, per item #3).
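Under those assumptions, the post-cutover shape of process_job might read like this sketch — synchronous stand-ins for the async call, every type stubbed:

```rust
// Simplified stand-ins for the real crate types.
struct Job { url: String }
struct SessionContext;
enum RetryDecision { No, Suggest { backoff_ms: u64 } }
struct JobOutcome { links: Vec<String>, retry: RetryDecision }

struct JobRunner;
impl JobRunner {
    // Real signature is async (`runner.run(&job, &ctx).await`); sync here for brevity.
    fn run(&self, _job: &Job, _ctx: &SessionContext) -> JobOutcome {
        JobOutcome { links: vec!["https://example.com/a".into()], retry: RetryDecision::No }
    }
}

struct Crawler { runner: JobRunner, frontier: Vec<String> }
impl Crawler {
    fn process_job(&mut self, job: Job) {
        // Front half (unchanged): queue pull, admission, budgets, robots, dedupe.
        let ctx = SessionContext;
        // The former ~600-LOC middle collapses to one call.
        let outcome = self.runner.run(&job, &ctx);
        // Back half (unchanged): storage writes, frontier feed, retry, session commit.
        self.frontier.extend(outcome.links);
        match outcome.retry {
            RetryDecision::No => {}
            RetryDecision::Suggest { backoff_ms: _ } => {
                // Advice only: the Crawler applies retry caps and host cooldowns here.
            }
        }
    }
}

fn main() {
    let mut crawler = Crawler { runner: JobRunner, frontier: Vec::new() };
    crawler.process_job(Job { url: "https://example.com".into() });
    assert_eq!(crawler.frontier.len(), 1);
    println!("frontier={}", crawler.frontier.len());
}
```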
Concrete SessionContext. Six placeholder types removed. Real fields: Arc<SessionIdentity> (still composing existing ImpersonateClient + IdentityBundle + cookies — full identity unification is candidate #3 from the architecture review and stays out of scope here), Option<ProxyLease> from proxy::ProxyRouter, antibot::SessionState, JobBudgets (struct with render_ms_left, total_ms_left, attempts_remaining), Arc<PolicyProfile> from policy::PolicyProfile.
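A sketch of the concrete struct, with stand-ins for the named codebase types (SessionIdentity still bundles ImpersonateClient + IdentityBundle + cookies; unification stays out of scope):

```rust
use std::sync::Arc;

// Stand-ins for the real types named in the PRD.
struct SessionIdentity;
struct ProxyLease { addr: String }
enum SessionState { Fresh, Warm, Flagged }
struct JobBudgets { render_ms_left: u64, total_ms_left: u64, attempts_remaining: u8 }
struct PolicyProfile { name: &'static str }

// Concrete SessionContext: no placeholder unit-structs remain.
struct SessionContext {
    identity: Arc<SessionIdentity>,
    proxy: Option<ProxyLease>,
    session_state: SessionState,
    budgets: JobBudgets,
    policy: Arc<PolicyProfile>,
}

fn main() {
    let ctx = SessionContext {
        identity: Arc::new(SessionIdentity),
        proxy: Some(ProxyLease { addr: "10.0.0.1:8080".into() }),
        session_state: SessionState::Fresh,
        budgets: JobBudgets { render_ms_left: 5_000, total_ms_left: 30_000, attempts_remaining: 3 },
        policy: Arc::new(PolicyProfile { name: "balanced" }),
    };
    assert_eq!(ctx.policy.name, "balanced");
    assert!(matches!(ctx.session_state, SessionState::Fresh));
    assert!(ctx.budgets.render_ms_left <= ctx.budgets.total_ms_left);
    println!("proxy={:?} attempts={}",
             ctx.proxy.as_ref().map(|p| p.addr.as_str()),
             ctx.budgets.attempts_remaining);
}
```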
AutoFetcher decision. Two options, decision recorded as ADR-0004:
Option A (Promote AutoFetcher):Crawler constructs Arc<AutoFetcher> for Method::Auto jobs. runner.run delegates to it. policy::engine no longer makes escalation decisions for the auto path; it stays the source of truth for retry vs drop on other failure modes (transient 5xx, robots, budget exhaustion). The challenge-driven escalation lives in one type.
Option B (Delete AutoFetcher): Code removed. Crawler keeps policy::engine as the single arbiter of "this job needs to be re-queued as Render" — the path that already runs in production today.
Recommendation in the grilling was B (delete) because: (1) policy::engine already runs and is tested; (2) AutoFetcher has no consumer on main; (3) escalation policy needs cross-job context (render budgets, host cooldowns, retry caps) that lives naturally in the Crawler/policy::engine boundary, not inside a per-Job fetcher. Final decision deferred to triage on this PRD; whichever wins, an ADR records why.
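If Option A wins, the escalation projection could look roughly like this — AutoOutcome and into_retry_decision are named in the PRD, but the variants here are simplified for the sketch:

```rust
// Hypothetical, simplified shapes for the Option A sketch.
enum FetchOutput { Http(u16) }
enum RetryReason { EscalateToRender }
enum RetryDecision { No, Suggest(RetryReason) }

struct AutoOutcome { output: FetchOutput, challenge_detected: bool }

impl AutoOutcome {
    // Per ADR-0002: never chain into render inline. Return the spoof response
    // plus advice, and let the Crawler decide whether to re-queue as Render.
    fn into_retry_decision(self) -> (FetchOutput, RetryDecision) {
        if self.challenge_detected {
            (self.output, RetryDecision::Suggest(RetryReason::EscalateToRender))
        } else {
            (self.output, RetryDecision::No)
        }
    }
}

fn main() {
    let challenged = AutoOutcome { output: FetchOutput::Http(403), challenge_detected: true };
    let (out, retry) = challenged.into_retry_decision();
    assert!(matches!(out, FetchOutput::Http(403)));
    assert!(matches!(retry, RetryDecision::Suggest(RetryReason::EscalateToRender)));
    println!("escalation suggested, response preserved");
}
```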
Retry policy stays on Crawler. Per PRD #15 Q11 and slice #22's contract: JobOutcome.retry is advice. The Crawler reads RetryDecision::Suggest { reason, backoff_hint } and applies retry caps, host cooldowns, budget accounting before deciding to re-enqueue. No change in this PRD; just confirming the boundary holds through the cutover.
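The advice-vs-decision boundary can be illustrated with a hypothetical gate function (all names invented for this sketch):

```rust
// Hypothetical gate the Crawler applies on top of the runner's advice.
struct RetryCaps { max_attempts: u8 }

fn should_requeue(suggested: bool, attempts_used: u8, caps: &RetryCaps, host_cooldown_active: bool) -> bool {
    // The runner only suggests; the Crawler owns caps, cooldowns, and budgets.
    suggested && attempts_used < caps.max_attempts && !host_cooldown_active
}

fn main() {
    let caps = RetryCaps { max_attempts: 3 };
    assert!(should_requeue(true, 1, &caps, false));   // advice accepted
    assert!(!should_requeue(true, 3, &caps, false));  // cap exhausted
    assert!(!should_requeue(true, 1, &caps, true));   // host cooling down
    assert!(!should_requeue(false, 0, &caps, false)); // no advice, no retry
    println!("gate ok");
}
```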
Event firing moves into the runner. The dormant per-attempt events from slice #23 (JobStarted, FetchCompleted, ChallengeDetected, ExtractCompleted, JobFailed) fire on the wire once the runner is the entry point. Run-level / decision events (JobEnqueued, JobAdmitted, JobRetried, JobPersisted, FrontierSeeded, RunStarted, RunCompleted) stay on the Crawler. NDJSON byte-stability is the contract — the regression test from slice #16 is the trip wire.
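The byte-stability contract amounts to a byte-for-byte golden comparison; a trivial sketch, with illustrative event lines (the real harness is the slice #16 regression test):

```rust
// Golden-file comparison sketch; the event lines below are illustrative only.
fn assert_byte_stable(golden: &[u8], captured: &[u8]) {
    assert_eq!(golden, captured, "NDJSON event stream drifted across the cutover");
}

fn main() {
    let golden = b"{\"kind\":\"job.started\"}\n{\"kind\":\"fetch.completed\"}\n";
    let captured = b"{\"kind\":\"job.started\"}\n{\"kind\":\"fetch.completed\"}\n";
    assert_byte_stable(golden, captured);
    println!("byte-identical");
}
```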
LOC accountability. The cutover PR description must include the before / after LOC delta for src/crawler.rs. The goal from PRD #15 was to deepen the god-module; this PRD makes the delta visible so the goal is measurable.
Testing Decisions
Definition of a good test for this work. Tests assert behavior at the public boundary of the modified module. For Fetcher and its impls, that is the fetch method's return value plus any state-visible side effect (e.g., cookies updated on the client). For JobRunner::run, that is the returned JobOutcome plus the sequence of events fired through the injected EventSink. No assertion on private field shapes, helper names, or call order beyond what the public contract guarantees.
Modules to be tested.
FetchOutput helpers — unit tests that the enum's status, headers, body, final_url helpers return the right thing for both Http and Rendered variants. Trivial but pins the surface.
SpoofFetcher returns FetchOutput::Http — wrapper conformance, no real network. The existing 3 unit tests in runner::fetcher::spoof::tests extend with one case asserting variant correctness.
RenderFetcher returns FetchOutput::Rendered — same shape. Replaces the existing "intentionally unimplemented" guard test from slice #20 (runner: RenderFetcher, route Method::Render).
AutoFetcher (if promoted in Option A) — async unit test against fake spoof + fake render: assert challenge response yields FetchOutput::Http + retry EscalateToRender, healthy response yields FetchOutput::Http + retry None, render fake is never called by AutoFetcher::fetch (ADR-0002 inline-chain ban).
JobRunner::run end-to-end against wiremock — new integration test under tests/. Drives runner.run for four cases: healthy 200 with links, 403 Cloudflare challenge, network timeout (wiremock delay > timeout), connection refused (wiremock not started). Asserts JobOutcome fields populate correctly and JobError / RetryDecision map per the contract.
process_job integration regression — re-uses the NDJSON regression test from slice #16 (runner: bootstrap module shells + NDJSON regression test harness). Pre and post cutover, the captured event-kind sequence must remain byte-identical. New scenarios may be added but the existing golden file does not change.
Prior art in the codebase.
tests/runner_ndjson_regression.rs — the byte-stable trip wire from slice #16; serves as the cutover guard.
tests/mini_http_only.rs — pattern for wiremock-backed end-to-end tests; the runner integration test follows the same structure (wiremock setup, deterministic seeds, MemorySink for event capture, no Chrome).
tests/escalation.rs — pattern for table-driven detection tests; informs the case list for the integration test (cf-chl-bypass body, DataDome body, JS stub, healthy HTML).
src/runner/mod.rs#[cfg(test)] mod tests — pattern for runner-scope unit tests using FakeFetcher; the new integration test moves to tests/ because it needs a real HTTP transport.
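The real integration test will use wiremock and the async runner. As a std-only illustration that the transport scenarios above are mechanically checkable, here is a throwaway HTTP helper exercising the healthy, challenge, and connection-refused cases (all helper names hypothetical):

```rust
use std::io::{Read, Write};
use std::net::{TcpListener, TcpStream};
use std::thread;

// Minimal stand-in for a wiremock server: answers one request with a fixed
// status line and body, then shuts down.
fn serve_once(status_line: &'static str, body: &'static str) -> String {
    let listener = TcpListener::bind("127.0.0.1:0").unwrap();
    let addr = listener.local_addr().unwrap();
    thread::spawn(move || {
        if let Ok((mut stream, _)) = listener.accept() {
            let mut buf = [0u8; 1024];
            let _ = stream.read(&mut buf); // consume the request
            let resp = format!(
                "HTTP/1.1 {}\r\nContent-Length: {}\r\nConnection: close\r\n\r\n{}",
                status_line, body.len(), body);
            let _ = stream.write_all(resp.as_bytes());
        }
    });
    format!("127.0.0.1:{}", addr.port())
}

// Stand-in for the fetch side: returns the HTTP status, or Err on transport failure.
fn fetch_status(addr: &str) -> Result<u16, String> {
    let mut stream = TcpStream::connect(addr).map_err(|e| e.to_string())?;
    stream.write_all(b"GET / HTTP/1.1\r\nHost: x\r\nConnection: close\r\n\r\n")
        .map_err(|e| e.to_string())?;
    let mut text = String::new();
    stream.read_to_string(&mut text).map_err(|e| e.to_string())?;
    text.split_whitespace().nth(1)
        .and_then(|s| s.parse().ok())
        .ok_or_else(|| "bad response".to_string())
}

fn main() {
    // Healthy 200 → JobOutcome success path.
    assert_eq!(fetch_status(&serve_once("200 OK", "<html>ok</html>")).unwrap(), 200);
    // 403 with a challenge marker body → ChallengeDetected + escalation advice.
    assert_eq!(fetch_status(&serve_once("403 Forbidden", "cf-chl-bypass")).unwrap(), 403);
    // Connection refused (nothing listening) → JobError network classification.
    assert!(fetch_status("127.0.0.1:1").is_err());
    println!("three transport scenarios exercised");
}
```

The timeout scenario (wiremock delay > client timeout) needs a client with a configurable deadline and is omitted from this sketch.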
Out of Scope
The deeper SessionIdentity unification (architecture review candidate #3). SessionContext.identity remains a thin bundle of ImpersonateClient + IdentityBundle + cookies as it exists today.
DiscoveryBackend registry (candidate #4). Discovery adapters are untouched.
Splitting config.rs (candidate #6). The runner reads from the existing flat Config via SessionContext.policy.
Frontier admission deepening (candidate #7). Admission stays on Crawler.
Render module / CDP facade. The runner uses RenderPool as it stands.
The fingerprint/ module — covered by a separate PRD published alongside this one.
Any behavioral change to fetch strategies, policy profiles, NDJSON event names, retry caps, or host cooldowns. This is a refactor with two structural deliverables (honest trait + cutover) and one cleanup deliverable (AutoFetcher decision). No new features.
Two ADRs from PRD #15 (docs/adr/0001-jobrunner-value-return-outcome.md, docs/adr/0002-autofetcher-escalation-via-requeue.md) continue to apply unchanged. The FetchOutput widening reinforces ADR-0001 (value-return outcome); the AutoFetcher decision reinforces ADR-0002 (no inline chain).
ADR-0004 will record the AutoFetcher decision (promote vs. delete).
After this PRD lands, the runner is the production entry point for per-Job execution. The fingerprint PRD becomes implementable cleanly: JobOutcome.signals (today a Vec<ChallengeSignalPlaceholder>) becomes the natural output slot for fingerprint::Detections, plumbed without breaking the cutover.
Honest expectation: the cutover PR (slice 2) is the largest single diff in this PRD's scope. It touches the middle of crawler.rs substantially. The strangler-PR style from PRD #15 keeps the rest of the slices small; the cutover itself is the surgical change the strategy was leading up to.
Problem Statement
PRD #15 landed the JobRunner extraction in 7 strangler slices (#16-#23). Every merged slice's PR comment ended with the same caveat:
That follow-up never landed. As of today on
main:src/crawler.rsis still 3604 LOC — the same as before PRD Extract JobRunner from crawler.rs (strangler refactor) #15 started.runner::JobRunner::runexists with 44 unit tests, fires per-attempt lifecycle events, classifies errors intoRetryDecision, but nothing in production calls it. The runner is a parallel implementation that only unit tests exercise.runner::AutoFetcherexists butCrawlernever constructs one. The auto-path escalation still flows throughpolicy::engine(which we did rewire to consumeChallengeDetectorin slice runner: ChallengeDetector absorbs src/escalation.rs body #19).runner::SessionContextcarries six placeholder unit-structs (SessionIdentityPlaceholder,ProxyLeasePlaceholder,SessionStatePlaceholder,JobBudgetsPlaceholder,PolicyProfilePlaceholder,ChallengeSignalPlaceholder). The seam is named, not deep.Fetcher::fetchreturnsResult<impersonate::Response>, which forcesRenderFetcher::fetchto returnErr(Error::Render("intentionally unimplemented")). Trait method that returns an explicit "not implemented" error is a code smell.The honest grade for the JobRunner extraction today is C+: solid specification, four real seams integrated (Spoof / Extract / Challenge / Render), but the god-module problem is untouched and the runner is dead code in production. The slices we shipped are the scaffolding; the building was never built on top.
Solution
Close the gap. Five concrete moves, sequenced so each is reviewable and reversible:
Widen the
Fetchertrait to returnFetchOutput— an enum carrying either an HTTP response (spoof) or a rendered page (render). Common helpers on the enum (status,headers,body,final_url) let extractors and detectors stay variant-agnostic.RenderFetcher::fetchbecomes honest — it returnsFetchOutput::Rendered, not an error.Cutover
Crawler::process_jobto callrunner.run— replace the inline fetch/extract/detect block (~600 LOC in the middle ofprocess_job) withrunner.run(&job, &ctx).awaitplus post-processing (storage writes, frontier feed, retry decision, session-state commit). NDJSON regression from issue runner: bootstrap module shells + NDJSON regression test harness #16 is the trip wire — byte-stable event sequence pre/post cutover.Concrete
SessionContext— drop all six placeholder types, replace with the real fields the runner needs:Arc<SessionIdentity>(today stillImpersonateClient + IdentityBundle + cookiesbundled), optionalProxyLease, realSessionStatefromantibot::SessionState,JobBudgets(render_ms / total_ms remaining),Arc<PolicyProfile>.Resolve
AutoFetcherambiguity — pick one source of truth for auto escalation. EitherAutoFetcherbecomes the only path (and thepolicy::engineescalation logic is deleted or thinned), orAutoFetcheritself is deleted andpolicy::enginestays as the source. Two implementations of the same logic is ruído.Integration test the runner against wiremock —
runner.runexercised against a real HTTP server (mock), not justFakeFetcher. Asserts full SpoofFetcher → ChallengeDetector → Extractor pipeline produces the expectedJobOutcomefor healthy/challenge/network-error/timeout responses.After these five land, the JobRunner extraction is complete:
process_jobshrinks measurably, the runner is the production entry point for per-Job execution, all placeholder types are gone, the trait surface is honest, and there is one source of truth for auto escalation.User Stories
As a contributor opening
src/crawler.rsfor the first time, I want the file to be substantially shorter, so that I can read and understand the orchestrator without scrolling through 3604 lines of inline fetch / extract / detect / event-emission code.As a contributor, I want
Crawler::process_jobto read as a small loop that delegates per-Job execution toJobRunner, so that the orchestrator's role (queue pull, admission, budgets, storage, frontier feed) is visually separated from the runner's role (fetch, detect, extract).As a contributor, I want
Fetcher::fetchto return aFetchOutputenum (HttporRendered), so that both spoof and render paths satisfy the same trait without one of them returning a fake error.As a contributor, I want common helpers on
FetchOutput(status,headers,body,final_url), so that downstream code (extractor, challenge detector, link extraction) can stay variant-agnostic and avoidmatchboilerplate.As a contributor, I want
AutoFetcher::fetchto callSpoofFetcher::fetchthen returnFetchOutput::Httptogether with aRetryDecision::Suggest { EscalateToRender }when a challenge is detected, so that the re-queue contract from ADR-0002 is enforced by the type system rather than bypolicy::engineside effects.As a contributor, I want
RenderFetcher::fetchto returnFetchOutput::Rendered(RenderedPage)honestly, so that the render path uses the same trait as spoof without lossy conversions or fake error returns.As a contributor, I want
JobRunner::runto be the only place where one Job becomes oneJobOutcome, so that per-attempt timings, signals, links, retry decision, and new session state all flow through one call site.As a contributor, I want
Crawler::process_jobto construct aSessionContext, callrunner.run, and then post-process storage / frontier / retry / session-state commit, so that the decide-vs-do split from PRD Extract JobRunner from crawler.rs (strangler refactor) #15 is visible in the code structure.As a contributor, I want
SessionContext.identityto be a concreteArc<SessionIdentity>carrying profile + cookies + persona, so that I stop reading "Placeholder" in field types.As a contributor, I want
SessionContext.proxy: Option<ProxyLease>to carry a real lease fromproxy::ProxyRouter, so that the runner knows which proxy each attempt is using without reaching back intoCrawler.As a contributor, I want
SessionContext.session_stateto be the realantibot::SessionStateenum, so that the runner can pass it to detection logic and return mutations viaJobOutcome.new_session_statewithout translation layers.As a contributor, I want
SessionContext.budgets: JobBudgetsto carry remaining render_ms / total_ms / attempts, so that the runner can fail fast withJobError::BudgetExhaustedinstead of issuing a doomed fetch.As a contributor, I want
SessionContext.policy: Arc<PolicyProfile>to be the resolved profile (fast/balanced/deep/forensics), so that escalation decisions inside the runner read the same profile theCrawlerresolved at admission.As a contributor, I want exactly one source of truth for "should this auto job escalate to render?", so that fixing a regression in escalation logic means touching one file, not two.
As a contributor, I want a single decision recorded in an ADR explaining whether
AutoFetcherorpolicy::engineis the owner, so that the choice is not re-litigated every six months.As an operator running a real crawl, I want the NDJSON event stream produced before the cutover to be byte-identical to the stream produced after, so that downstream consumers (SDK, analytics, dashboards) keep working without coordination.
As an operator, I want the per-attempt events (
job.started,fetch.completed,challenge.detected,extract.completed,job.failed) to fire from inside the runner once it is the entry point, so that adding a new lifecycle event is a one-file change.As a contributor adding a new fetch path in the future, I want
Fetcherto be a real trait withArc<dyn Fetcher>swap-in, so that a third backend (e.g., a Playwright bridge, a remote browser farm) can land without touchingCrawler::process_job.As a contributor, I want at least one integration test that drives
runner.runagainst a wiremock server (no Chrome, no FakeFetcher), so that regressions in the SpoofFetcher → ChallengeDetector → Extractor pipeline are caught without needing a fullCrawlerboot.As a contributor, I want the integration test to cover four scenarios — healthy 200, challenge 403 (Cloudflare body), network failure (timeout), connection refused — so that
JobErrorclassification andRetryDecisionmapping are exercised against real HTTP transport.As a contributor, I want
cargo test --all-features --lib runner::to keep passing after every commit in this PRD, so that the existing 44 runner unit tests act as a contract guard during the cutover.As a contributor, I want
cargo test --all-features --test runner_ndjson_regressionto remain byte-stable through every commit, so that the wire format pinned in slice runner: bootstrap module shells + NDJSON regression test harness #16 stays the trip wire.As a future maintainer, I want
CONTEXT.mdto be updated to drop the "placeholder" disclaimer fromSessionContext, so that the glossary reflects what the type actually carries.As a future maintainer, I want an ADR explaining the
AutoFetchervs.policy::enginedecision, so that re-introducing two paths is recognized as a regression.As a future maintainer, I want the
src/crawler.rsLOC delta documented in the cutover PR description (before / after), so that the goal "deep the god-module" is measurable.Implementation Decisions
Sequence. Five items land as separate PRs in this order:
FetchOutputwidening (#4 in grilling, this PRD's slice 1) → cutover (#1, slice 2) → concrete SessionContext (#2, slice 3) → AutoFetcher decision (#3, slice 4) → integration test (#5, slice 5). Items #3 and #5 are independent and can run in parallel with #2 once #1 lands. Sequence rationale: the trait shape must be honest before the cutover; the cutover surfaces which placeholder types are actually needed; the AutoFetcher decision becomes obvious after the cutover (either it's the source of truth or it's dead code); the integration test locks the contract once the wiring is real.FetchOutput shape. Enum with two variants —
Http(impersonate::Response)andRendered(RenderedPage). Common helpers on the enum return the fields callers reach for most:status() -> u16,headers() -> &HeaderMap,body() -> &[u8],final_url() -> &Url. Render-only fields (Web Vitals, screenshot, ScriptSpecRunOutcome) require amatchto access. Rationale: callers that only want body / status (extractor, challenge detector) stay simple; callers that legitimately need render extras (storage of vitals, screenshot artifacts) are forced to acknowledge they are render-specific.Trait signature.
async fn fetch(&self, job: &Job, ctx: &SessionContext) -> Result<FetchOutput>.SpoofFetcherreturnsOk(FetchOutput::Http(...)).RenderFetcherreturnsOk(FetchOutput::Rendered(...)).AutoFetcherreturnsOk(FetchOutput::Http(...))always (it does not chain inline per ADR-0002) and signals escalation viaJobOutcome.retryprojection fromAutoOutcome::into_retry_decision().Cutover scope.
Crawler::process_jobkeeps the front (queue pull, admission, budget check, robots, dedupe, rate limit) and the back (storage writes, frontier feed, retry decision honoring caps and cooldowns, session-state commit, hooks, run-level events). The middle — fetch dispatch, extract, challenge detect, per-attempt events — becomes a singlelet outcome = self.runner.run(&job, &ctx).await;call.Crawlerconstructs theArc<JobRunner>once at startup (or lazily after first job) using the existingSpoofFetcher,RenderFetcherifcdp-backend, andAutoFetcher(or its replacement, per item #3).Concrete SessionContext. Six placeholder types removed. Real fields:
Arc<SessionIdentity>(still composing existingImpersonateClient + IdentityBundle + cookies— full identity unification is candidate #3 from the architecture review and stays out of scope here),Option<ProxyLease>fromproxy::ProxyRouter,antibot::SessionState,JobBudgets(struct withrender_ms_left,total_ms_left,attempts_remaining),Arc<PolicyProfile>frompolicy::PolicyProfile.AutoFetcher decision. Two options, decision recorded as ADR-0004:
Option A (Promote AutoFetcher):
CrawlerconstructsArc<AutoFetcher>forMethod::Autojobs.runner.rundelegates to it.policy::engineno longer makes escalation decisions for the auto path; it stays the source of truth for retry vs drop on other failure modes (transient 5xx, robots, budget exhaustion). The challenge-driven escalation lives in one type.Option B (Delete AutoFetcher): Code removed.
Crawlerkeepspolicy::engineas the single arbiter of "this job needs to be re-queued as Render" — the path that already runs in production today.Recommendation in the grilling was B (delete) because: (1)
policy::enginealready runs and is tested; (2)AutoFetcherhas no consumer on main; (3) escalation policy needs cross-job context (render budgets, host cooldowns, retry caps) that lives naturally in theCrawler/policy::engineboundary, not inside a per-Job fetcher. Final decision deferred to triage on this PRD; whichever wins, an ADR records why.Retry policy stays on Crawler. Per PRD #15 Q11 and slice #22's contract:
JobOutcome.retryis advice. TheCrawlerreadsRetryDecision::Suggest { reason, backoff_hint }and applies retry caps, host cooldowns, budget accounting before deciding to re-enqueue. No change in this PRD; just confirming the boundary holds through the cutover.Event firing moves into the runner. The dormant per-attempt events from slice #23 (
JobStarted,FetchCompleted,ChallengeDetected,ExtractCompleted,JobFailed) fire on the wire once the runner is the entry point. Run-level / decision events (JobEnqueued,JobAdmitted,JobRetried,JobPersisted,FrontierSeeded,RunStarted,RunCompleted) stay on theCrawler. NDJSON byte-stability is the contract — the regression test from slice #16 is the trip wire.LOC accountability. The cutover PR description must include the before / after LOC delta for
src/crawler.rs. The goal from PRD #15 was to deepen the god-module; this PRD makes the delta visible so the goal is measurable.Testing Decisions
Definition of a good test for this work. Tests assert behavior at the public boundary of the modified module. For
`Fetcher` and its impls, that is the `fetch` method's return value plus any state-visible side effect (e.g., cookies updated on the client). For `JobRunner::run`, that is the returned `JobOutcome` plus the sequence of events fired through the injected `EventSink`. No assertions on private field shapes, helper names, or call order beyond what the public contract guarantees.

**Modules to be tested.**
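As context for the module list, here is a minimal sketch of the `FetchOutput` surface these tests pin. Hedged: the enum name, its `Http`/`Rendered` variants, and the `status`/`headers`/`body`/`final_url` helpers come from this PRD; the payload structs (`HttpResponse`, `RenderedPage`) and their field types are illustrative stand-ins, not the crate's real types.

```rust
// Illustrative payload stand-ins (assumed shapes, not the crate's types).
pub struct HttpResponse {
    pub status: u16,
    pub headers: Vec<(String, String)>,
    pub body: String,
    pub final_url: String,
}

pub struct RenderedPage {
    pub status: u16,
    pub headers: Vec<(String, String)>,
    pub html: String,
    pub final_url: String,
}

// The widened return type: one enum, two variants, shared helpers.
pub enum FetchOutput {
    Http(HttpResponse),
    Rendered(RenderedPage),
}

impl FetchOutput {
    pub fn status(&self) -> u16 {
        match self {
            FetchOutput::Http(r) => r.status,
            FetchOutput::Rendered(p) => p.status,
        }
    }

    pub fn headers(&self) -> &[(String, String)] {
        match self {
            FetchOutput::Http(r) => &r.headers,
            FetchOutput::Rendered(p) => &p.headers,
        }
    }

    pub fn body(&self) -> &str {
        match self {
            FetchOutput::Http(r) => &r.body,
            FetchOutput::Rendered(p) => &p.html,
        }
    }

    pub fn final_url(&self) -> &str {
        match self {
            FetchOutput::Http(r) => &r.final_url,
            FetchOutput::Rendered(p) => &p.final_url,
        }
    }
}

fn main() {
    // Extractors and detectors stay variant-agnostic: only helpers, no match.
    let out = FetchOutput::Rendered(RenderedPage {
        status: 200,
        headers: vec![("content-type".into(), "text/html".into())],
        html: "<html><body>ok</body></html>".into(),
        final_url: "https://example.com/".into(),
    });
    assert_eq!(out.status(), 200);
    assert!(out.body().contains("ok"));
    println!("{}", out.final_url());
}
```

The helper surface is the point of the widening: `SpoofFetcher` and `RenderFetcher` can share extractor and detector code without callers ever matching on the variant.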
- `FetchOutput` helpers — unit tests that the enum's `status`, `headers`, `body`, `final_url` helpers return the right thing for both `Http` and `Rendered` variants. Trivial, but pins the surface.
- `SpoofFetcher` returns `FetchOutput::Http` — wrapper conformance, no real network. The existing 3 unit tests in `runner::fetcher::spoof::tests` extend with one case asserting variant correctness.
- `RenderFetcher` returns `FetchOutput::Rendered` — same shape. Replaces the existing "intentionally unimplemented" guard test from slice "runner: RenderFetcher, route Method::Render" #20.
- `AutoFetcher` (if promoted in Option A) — async unit test against fake spoof + fake render: assert that a challenge response yields `FetchOutput::Http` + retry `EscalateToRender`, a healthy response yields `FetchOutput::Http` + retry `None`, and the render fake is never called by `AutoFetcher::fetch` (ADR-0002 inline-chain ban).
- `JobRunner::run` end-to-end against wiremock — new integration test under `tests/`. Drives `runner.run` for four cases: healthy 200 with links, 403 Cloudflare challenge, network timeout (wiremock delay > timeout), connection refused (wiremock not started). Asserts that `JobOutcome` fields populate correctly and `JobError`/`RetryDecision` map per the contract.
- `process_job` integration regression — re-uses the NDJSON regression test from slice "runner: bootstrap module shells + NDJSON regression test harness" #16. Pre- and post-cutover, the captured event-kind sequence must remain byte-identical. New scenarios may be added, but the existing golden file does not change.

**Prior art in the codebase.**
- `tests/runner_ndjson_regression.rs` — the byte-stable trip wire from slice "runner: bootstrap module shells + NDJSON regression test harness" #16; serves as the cutover guard.
- `tests/mini_http_only.rs` — pattern for wiremock-backed end-to-end tests; the runner integration test follows the same structure (wiremock setup, deterministic seeds, `MemorySink` for event capture, no Chrome).
- `tests/escalation.rs` — pattern for table-driven detection tests; informs the case list for the integration test (cf-chl-bypass body, DataDome body, JS stub, healthy HTML).
- `src/runner/mod.rs` `#[cfg(test)] mod tests` — pattern for runner-scope unit tests using `FakeFetcher`; the new integration test moves to `tests/` because it needs a real HTTP transport.

## Out of Scope
- `SessionIdentity` unification (architecture review candidate #3). `SessionContext.identity` remains a thin bundle of `ImpersonateClient + IdentityBundle + cookies` as it exists today.
- `LifecycleHook` collapsing hooks + events + script (candidate #5). The three layers keep their identities; only the emitter of per-attempt events changes (already specified in slice "runner: move per-attempt event/hook emission into JobRunner; cleanup dead helpers in crawler.rs" #23, activated by this PRD's cutover).
- `DiscoveryBackend` registry (candidate #4). Discovery adapters are untouched.
- `config.rs` (candidate #6). The runner reads from the existing flat `Config` via `SessionContext.policy`.
- The `Crawler`'s `RenderPool` as it stands.
- The `fingerprint/` module — covered by a separate PRD published alongside this one.

## Further Notes
The accepted ADRs (`docs/adr/0001-jobrunner-value-return-outcome.md`, `docs/adr/0002-autofetcher-escalation-via-requeue.md`) continue to apply unchanged. The `FetchOutput` widening reinforces ADR-0001 (value-return outcome); the `AutoFetcher` decision reinforces ADR-0002 (no inline chain).

`JobOutcome.signals` (today a `Vec<ChallengeSignalPlaceholder>`) becomes the natural output slot for `fingerprint::Detections`, plumbed in without breaking the cutover.

Only the cutover PR touches `crawler.rs` substantially. The strangler-PR style from PRD "Extract JobRunner from crawler.rs (strangler refactor)" #15 keeps the rest of the slices small; the cutover itself is the surgical change the strategy was leading up to.
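The advisory retry boundary described under "Retry policy stays on `Crawler`" can be sketched as follows. Hedged: `RetryDecision::Suggest { reason, backoff_hint }`, `EscalateToRender`, `None`, and `JobOutcome.retry` appear in this PRD; the field types and the crawler-side cap logic below are illustrative assumptions, not the crate's real implementation.

```rust
// Hedged sketch: variant names per this PRD; field types are assumptions.
#[derive(Debug)]
enum RetryDecision {
    None,
    Suggest { reason: String, backoff_hint: Option<u64> }, // hint in ms (assumed unit)
    EscalateToRender,
}

// Value-return outcome per ADR-0001; only the `retry` slot is sketched here.
struct JobOutcome {
    retry: RetryDecision,
}

// Crawler-side arbiter: the runner only advises; the crawler applies the
// retry cap (and, in the real code, host cooldowns and budget accounting).
fn should_reenqueue(outcome: &JobOutcome, attempts: u32, retry_cap: u32) -> bool {
    match &outcome.retry {
        RetryDecision::None => false,
        RetryDecision::Suggest { .. } | RetryDecision::EscalateToRender => attempts < retry_cap,
    }
}

fn main() {
    let advice = JobOutcome {
        retry: RetryDecision::Suggest {
            reason: "transient 5xx".into(),
            backoff_hint: Some(500),
        },
    };
    // Advice is honored below the cap, overridden once the cap is hit.
    assert!(should_reenqueue(&advice, 1, 3));
    assert!(!should_reenqueue(&advice, 3, 3));
    println!("boundary holds");
}
```

The design point is that the decision stays one-sided: swapping the runner in via the cutover changes who fires events, but not who decides to re-enqueue.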