
Complete JobRunner extraction: cutover, honest trait, concrete SessionContext #24

@filipeforattini

Description


Problem Statement

PRD #15 landed the JobRunner extraction in 7 strangler slices (#16-#23). Every merged slice's PR comment ended with the same caveat:

Scope honesty — what this slice does NOT do: Crawler::process_job is NOT yet rewired to call runner.run. The runner is operational and test-locked; the strangler swap-in to process_job is the next concrete follow-up.

That follow-up never landed. As of today on main:

  • src/crawler.rs is still 3604 LOC — the same as before PRD #15 started.
  • runner::JobRunner::run exists with 44 unit tests, fires per-attempt lifecycle events, classifies errors into RetryDecision, but nothing in production calls it. The runner is a parallel implementation that only unit tests exercise.
  • runner::AutoFetcher exists but Crawler never constructs one. The auto-path escalation still flows through policy::engine (which we did rewire to consume ChallengeDetector in slice #19).
  • runner::SessionContext carries six placeholder unit-structs (SessionIdentityPlaceholder, ProxyLeasePlaceholder, SessionStatePlaceholder, JobBudgetsPlaceholder, PolicyProfilePlaceholder, ChallengeSignalPlaceholder). The seam is named, not deep.
  • Fetcher::fetch returns Result<impersonate::Response>, which forces RenderFetcher::fetch to return Err(Error::Render("intentionally unimplemented")). A trait method that returns an explicit "not implemented" error is a code smell.

The honest grade for the JobRunner extraction today is C+: solid specification, four real seams integrated (Spoof / Extract / Challenge / Render), but the god-module problem is untouched and the runner is dead code in production. The slices we shipped are the scaffolding; the building was never built on top.

Solution

Close the gap. Five concrete moves, sequenced so each is reviewable and reversible:

  1. Widen the Fetcher trait to return FetchOutput — an enum carrying either an HTTP response (spoof) or a rendered page (render). Common helpers on the enum (status, headers, body, final_url) let extractors and detectors stay variant-agnostic. RenderFetcher::fetch becomes honest — it returns FetchOutput::Rendered, not an error.

  2. Cutover Crawler::process_job to call runner.run — replace the inline fetch/extract/detect block (~600 LOC in the middle of process_job) with runner.run(&job, &ctx).await plus post-processing (storage writes, frontier feed, retry decision, session-state commit). The NDJSON regression harness from slice #16 is the trip wire — byte-stable event sequence pre/post cutover.

  3. Concrete SessionContext — drop all six placeholder types, replace with the real fields the runner needs: Arc<SessionIdentity> (today still ImpersonateClient + IdentityBundle + cookies bundled), optional ProxyLease, real SessionState from antibot::SessionState, JobBudgets (render_ms / total_ms remaining), Arc<PolicyProfile>.

  4. Resolve AutoFetcher ambiguity — pick one source of truth for auto escalation. Either AutoFetcher becomes the only path (and the policy::engine escalation logic is deleted or thinned), or AutoFetcher itself is deleted and policy::engine stays as the source. Two implementations of the same logic are noise.

  5. Integration test the runner against wiremock — runner.run exercised against a real HTTP server (mock), not just FakeFetcher. Asserts the full SpoofFetcher → ChallengeDetector → Extractor pipeline produces the expected JobOutcome for healthy / challenge / network-error / timeout responses.

After these five land, the JobRunner extraction is complete: process_job shrinks measurably, the runner is the production entry point for per-Job execution, all placeholder types are gone, the trait surface is honest, and there is one source of truth for auto escalation.

User Stories

  1. As a contributor opening src/crawler.rs for the first time, I want the file to be substantially shorter, so that I can read and understand the orchestrator without scrolling through 3604 lines of inline fetch / extract / detect / event-emission code.

  2. As a contributor, I want Crawler::process_job to read as a small loop that delegates per-Job execution to JobRunner, so that the orchestrator's role (queue pull, admission, budgets, storage, frontier feed) is visually separated from the runner's role (fetch, detect, extract).

  3. As a contributor, I want Fetcher::fetch to return a FetchOutput enum (Http or Rendered), so that both spoof and render paths satisfy the same trait without one of them returning a fake error.

  4. As a contributor, I want common helpers on FetchOutput (status, headers, body, final_url), so that downstream code (extractor, challenge detector, link extraction) can stay variant-agnostic and avoid match boilerplate.

  5. As a contributor, I want AutoFetcher::fetch to call SpoofFetcher::fetch then return FetchOutput::Http together with RetryDecision::Suggest { reason: EscalateToRender, .. } when a challenge is detected, so that the re-queue contract from ADR-0002 is enforced by the type system rather than by policy::engine side effects.

  6. As a contributor, I want RenderFetcher::fetch to return FetchOutput::Rendered(RenderedPage) honestly, so that the render path uses the same trait as spoof without lossy conversions or fake error returns.

  7. As a contributor, I want JobRunner::run to be the only place where one Job becomes one JobOutcome, so that per-attempt timings, signals, links, retry decision, and new session state all flow through one call site.

  8. As a contributor, I want Crawler::process_job to construct a SessionContext, call runner.run, and then post-process storage / frontier / retry / session-state commit, so that the decide-vs-do split from PRD #15 is visible in the code structure.

  9. As a contributor, I want SessionContext.identity to be a concrete Arc<SessionIdentity> carrying profile + cookies + persona, so that I stop reading "Placeholder" in field types.

  10. As a contributor, I want SessionContext.proxy: Option<ProxyLease> to carry a real lease from proxy::ProxyRouter, so that the runner knows which proxy each attempt is using without reaching back into Crawler.

  11. As a contributor, I want SessionContext.session_state to be the real antibot::SessionState enum, so that the runner can pass it to detection logic and return mutations via JobOutcome.new_session_state without translation layers.

  12. As a contributor, I want SessionContext.budgets: JobBudgets to carry remaining render_ms / total_ms / attempts, so that the runner can fail fast with JobError::BudgetExhausted instead of issuing a doomed fetch.

  13. As a contributor, I want SessionContext.policy: Arc<PolicyProfile> to be the resolved profile (fast / balanced / deep / forensics), so that escalation decisions inside the runner read the same profile the Crawler resolved at admission.

  14. As a contributor, I want exactly one source of truth for "should this auto job escalate to render?", so that fixing a regression in escalation logic means touching one file, not two.

  15. As a contributor, I want a single decision recorded in an ADR explaining whether AutoFetcher or policy::engine is the owner, so that the choice is not re-litigated every six months.

  16. As an operator running a real crawl, I want the NDJSON event stream produced before the cutover to be byte-identical to the stream produced after, so that downstream consumers (SDK, analytics, dashboards) keep working without coordination.

  17. As an operator, I want the per-attempt events (job.started, fetch.completed, challenge.detected, extract.completed, job.failed) to fire from inside the runner once it is the entry point, so that adding a new lifecycle event is a one-file change.

  18. As a contributor adding a new fetch path in the future, I want Fetcher to be a real trait with Arc<dyn Fetcher> swap-in, so that a third backend (e.g., a Playwright bridge, a remote browser farm) can land without touching Crawler::process_job.

  19. As a contributor, I want at least one integration test that drives runner.run against a wiremock server (no Chrome, no FakeFetcher), so that regressions in the SpoofFetcher → ChallengeDetector → Extractor pipeline are caught without needing a full Crawler boot.

  20. As a contributor, I want the integration test to cover four scenarios — healthy 200, challenge 403 (Cloudflare body), network failure (timeout), connection refused — so that JobError classification and RetryDecision mapping are exercised against real HTTP transport.

  21. As a contributor, I want cargo test --all-features --lib runner:: to keep passing after every commit in this PRD, so that the existing 44 runner unit tests act as a contract guard during the cutover.

  22. As a contributor, I want cargo test --all-features --test runner_ndjson_regression to remain byte-stable through every commit, so that the wire format pinned in slice #16 stays the trip wire.

  23. As a future maintainer, I want CONTEXT.md to be updated to drop the "placeholder" disclaimer from SessionContext, so that the glossary reflects what the type actually carries.

  24. As a future maintainer, I want an ADR explaining the AutoFetcher vs. policy::engine decision, so that re-introducing two paths is recognized as a regression.

  25. As a future maintainer, I want the src/crawler.rs LOC delta documented in the cutover PR description (before / after), so that the goal "deepen the god-module" is measurable.

Implementation Decisions

Sequence. Five items land as separate PRs in this order: FetchOutput widening (#4 in the grilling, this PRD's slice 1) → cutover (#1, slice 2) → concrete SessionContext (#2, slice 3) → AutoFetcher decision (#3, slice 4) → integration test (#5, slice 5). The AutoFetcher decision and the integration test are independent of the SessionContext work and can run in parallel with it once the cutover lands. Sequence rationale: the trait shape must be honest before the cutover; the cutover surfaces which placeholder types are actually needed; the AutoFetcher decision becomes obvious after the cutover (either it's the source of truth or it's dead code); the integration test locks the contract once the wiring is real.

FetchOutput shape. Enum with two variants — Http(impersonate::Response) and Rendered(RenderedPage). Common helpers on the enum return the fields callers reach for most: status() -> u16, headers() -> &HeaderMap, body() -> &[u8], final_url() -> &Url. Render-only fields (Web Vitals, screenshot, ScriptSpec RunOutcome) require a match to access. Rationale: callers that only want body / status (extractor, challenge detector) stay simple; callers that legitimately need render extras (storage of vitals, screenshot artifacts) are forced to acknowledge they are render-specific.
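
A minimal sketch of this shape, with stand-in structs for the two payloads (the real types are impersonate::Response and RenderedPage; the field layouts here are assumptions):

```rust
use http::HeaderMap;
use url::Url;

// Stand-ins for impersonate::Response / RenderedPage; field layouts are assumptions.
pub struct HttpResponse { pub status: u16, pub headers: HeaderMap, pub body: Vec<u8>, pub final_url: Url }
pub struct RenderedPage { pub status: u16, pub headers: HeaderMap, pub body: Vec<u8>, pub final_url: Url }
// RenderedPage additionally carries Web Vitals / screenshot / ScriptSpec RunOutcome,
// reachable only through an explicit match on the enum.

pub enum FetchOutput {
    Http(HttpResponse),
    Rendered(RenderedPage),
}

impl FetchOutput {
    // Variant-agnostic helpers so extractors and detectors avoid match boilerplate.
    pub fn status(&self) -> u16 {
        match self { Self::Http(r) => r.status, Self::Rendered(p) => p.status }
    }
    pub fn headers(&self) -> &HeaderMap {
        match self { Self::Http(r) => &r.headers, Self::Rendered(p) => &p.headers }
    }
    pub fn body(&self) -> &[u8] {
        match self { Self::Http(r) => &r.body, Self::Rendered(p) => &p.body }
    }
    pub fn final_url(&self) -> &Url {
        match self { Self::Http(r) => &r.final_url, Self::Rendered(p) => &p.final_url }
    }
}
```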

Trait signature. async fn fetch(&self, job: &Job, ctx: &SessionContext) -> Result<FetchOutput>. SpoofFetcher returns Ok(FetchOutput::Http(...)). RenderFetcher returns Ok(FetchOutput::Rendered(...)). AutoFetcher always returns Ok(FetchOutput::Http(...)) (it never chains to render inline, per ADR-0002) and signals escalation via the JobOutcome.retry projection from AutoOutcome::into_retry_decision().
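
As a sketch, assuming async_trait for the Arc<dyn Fetcher> swap-in user story 18 asks for (the pool helper call is hypothetical; Result is the crate's alias):

```rust
use async_trait::async_trait;

#[async_trait]
pub trait Fetcher: Send + Sync {
    async fn fetch(&self, job: &Job, ctx: &SessionContext) -> Result<FetchOutput>;
}

#[async_trait]
impl Fetcher for RenderFetcher {
    async fn fetch(&self, job: &Job, ctx: &SessionContext) -> Result<FetchOutput> {
        let page = self.pool.render(job, ctx).await?; // hypothetical RenderPool helper
        // Honest return: no more Err(Error::Render("intentionally unimplemented")).
        Ok(FetchOutput::Rendered(page))
    }
}
```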

Cutover scope. Crawler::process_job keeps the front (queue pull, admission, budget check, robots, dedupe, rate limit) and the back (storage writes, frontier feed, retry decision honoring caps and cooldowns, session-state commit, hooks, run-level events). The middle — fetch dispatch, extract, challenge detect, per-attempt events — becomes a single let outcome = self.runner.run(&job, &ctx).await; call. Crawler constructs the Arc<JobRunner> once at startup (or lazily after the first job) using the existing SpoofFetcher, RenderFetcher when the cdp-backend feature is enabled, and AutoFetcher (or its replacement, per the AutoFetcher decision below).
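
The middle of process_job then reads roughly like this (the helper names on Crawler are hypothetical; the front and back halves are the existing code):

```rust
// Front half unchanged: queue pull, admission, budget check, robots, dedupe, rate limit.
let ctx = self.build_session_context(&job)?;     // hypothetical constructor
let outcome = self.runner.run(&job, &ctx).await; // replaces the ~600-LOC inline block

// Back half unchanged: the Crawler stays the decider.
self.persist(&job, &outcome).await?;                   // storage writes
self.feed_frontier(&outcome.links).await?;             // frontier feed
self.apply_retry_policy(&job, &outcome.retry);         // caps + cooldowns (advice, per slice #22)
self.commit_session_state(outcome.new_session_state);  // session-state commit
```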

Concrete SessionContext. Six placeholder types removed. Real fields: Arc<SessionIdentity> (still composing existing ImpersonateClient + IdentityBundle + cookies — full identity unification is candidate #3 from the architecture review and stays out of scope here), Option<ProxyLease> from proxy::ProxyRouter, antibot::SessionState, JobBudgets (struct with render_ms_left, total_ms_left, attempts_remaining), Arc<PolicyProfile> from policy::PolicyProfile.
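
A sketch of the resulting struct, using the field names from the user stories above; the referenced types live in their existing modules and their exact shapes are assumptions:

```rust
use std::sync::Arc;

pub struct SessionContext {
    pub identity: Arc<SessionIdentity>,       // ImpersonateClient + IdentityBundle + cookies, still bundled
    pub proxy: Option<ProxyLease>,            // real lease from proxy::ProxyRouter
    pub session_state: antibot::SessionState, // no translation layer
    pub budgets: JobBudgets,
    pub policy: Arc<PolicyProfile>,           // resolved fast/balanced/deep/forensics profile
}

// Remaining budgets so the runner can fail fast with JobError::BudgetExhausted.
pub struct JobBudgets {
    pub render_ms_left: u64,
    pub total_ms_left: u64,
    pub attempts_remaining: u32,
}
```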

AutoFetcher decision. Two options, decision recorded as ADR-0004:

  • Option A (Promote AutoFetcher): Crawler constructs Arc<AutoFetcher> for Method::Auto jobs. runner.run delegates to it. policy::engine no longer makes escalation decisions for the auto path; it stays the source of truth for retry vs drop on other failure modes (transient 5xx, robots, budget exhaustion). The challenge-driven escalation lives in one type.

  • Option B (Delete AutoFetcher): Code removed. Crawler keeps policy::engine as the single arbiter of "this job needs to be re-queued as Render" — the path that already runs in production today.

Recommendation in the grilling was B (delete) because: (1) policy::engine already runs and is tested; (2) AutoFetcher has no consumer on main; (3) escalation policy needs cross-job context (render budgets, host cooldowns, retry caps) that lives naturally in the Crawler/policy::engine boundary, not inside a per-Job fetcher. Final decision deferred to triage on this PRD; whichever wins, an ADR records why.

Retry policy stays on Crawler. Per PRD #15 Q11 and slice #22's contract: JobOutcome.retry is advice. The Crawler reads RetryDecision::Suggest { reason, backoff_hint } and applies retry caps, host cooldowns, budget accounting before deciding to re-enqueue. No change in this PRD; just confirming the boundary holds through the cutover.
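
In code, that advice boundary looks roughly like this (variant names besides Suggest, and the Crawler helper methods, are assumptions):

```rust
// Crawler post-processing: JobOutcome.retry is advice, never a command.
match outcome.retry {
    RetryDecision::Suggest { reason, backoff_hint } => {
        // The Crawler alone applies retry caps, host cooldowns, and budget accounting.
        if self.retry_caps_allow(&job) && !self.host_in_cooldown(&job.url) {
            self.re_enqueue(job, reason, backoff_hint); // hypothetical helper
        }
    }
    RetryDecision::None => { /* terminal outcome: persist and move on */ }
}
```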

Event firing moves into the runner. The dormant per-attempt events from slice #23 (JobStarted, FetchCompleted, ChallengeDetected, ExtractCompleted, JobFailed) fire on the wire once the runner is the entry point. Run-level / decision events (JobEnqueued, JobAdmitted, JobRetried, JobPersisted, FrontierSeeded, RunStarted, RunCompleted) stay on the Crawler. NDJSON byte-stability is the contract — the regression test from slice #16 is the trip wire.

LOC accountability. The cutover PR description must include the before / after LOC delta for src/crawler.rs. The goal from PRD #15 was to deepen the god-module; this PRD makes the delta visible so the goal is measurable.

Testing Decisions

Definition of a good test for this work. Tests assert behavior at the public boundary of the modified module. For Fetcher and its impls, that is the fetch method's return value plus any state-visible side effect (e.g., cookies updated on the client). For JobRunner::run, that is the returned JobOutcome plus the sequence of events fired through the injected EventSink. No assertion on private field shapes, helper names, or call order beyond what the public contract guarantees.
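
Concretely, a boundary-level runner test asserts only the returned outcome and the captured event kinds (MemorySink is the existing capture helper; the constructor shape and accessor name are assumptions):

```rust
let sink = MemorySink::default();
let runner = JobRunner::new(fetcher, detector, extractor, sink.clone()); // hypothetical ctor shape

let outcome = runner.run(&job, &ctx).await;

// Public contract only: outcome fields and the event sequence, nothing internal.
assert!(outcome.links.is_empty());
assert_eq!(
    sink.event_kinds(), // hypothetical accessor on the capture sink
    ["job.started", "fetch.completed", "extract.completed"]
);
```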

Modules to be tested.

  1. FetchOutput helpers — unit tests that the enum's status, headers, body, final_url helpers return the right thing for both Http and Rendered variants. Trivial but pins the surface.

  2. SpoofFetcher returns FetchOutput::Http — wrapper conformance, no real network. The existing 3 unit tests in runner::fetcher::spoof::tests extend with one case asserting variant correctness.

  3. RenderFetcher returns FetchOutput::Rendered — same shape. Replaces the existing "intentionally unimplemented" guard test from slice #20.

  4. AutoFetcher (if promoted in Option A) — async unit test against fake spoof + fake render: assert challenge response yields FetchOutput::Http + retry EscalateToRender, healthy response yields FetchOutput::Http + retry None, render fake is never called by AutoFetcher::fetch (ADR-0002 inline-chain ban).

  5. JobRunner::run end-to-end against wiremock — new integration test under tests/ (see the sketch after this list). Drives runner.run for four cases: healthy 200 with links, 403 Cloudflare challenge, network timeout (wiremock delay > timeout), connection refused (wiremock not started). Asserts JobOutcome fields populate correctly and JobError / RetryDecision map per the contract.

  6. process_job integration regression — re-uses the NDJSON regression test from slice #16. Pre and post cutover, the captured event-kind sequence must remain byte-identical. New scenarios may be added but the existing golden file does not change.
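
A sketch of the healthy-200 case (the wiremock calls are the crate's real API; Job/SessionContext/runner construction is abbreviated and the job_for helper and status accessor are hypothetical):

```rust
use wiremock::{Mock, MockServer, ResponseTemplate};
use wiremock::matchers::{method, path};

#[tokio::test]
async fn runner_healthy_200_extracts_links() {
    let server = MockServer::start().await;
    Mock::given(method("GET"))
        .and(path("/page"))
        .respond_with(ResponseTemplate::new(200)
            .set_body_string(r#"<html><a href="/next">next</a></html>"#))
        .mount(&server)
        .await;

    // Real SpoofFetcher -> ChallengeDetector -> Extractor wiring; setup elided.
    let job = job_for(format!("{}/page", server.uri())); // hypothetical test helper
    let outcome = runner.run(&job, &ctx).await;

    assert_eq!(outcome.status(), Some(200)); // hypothetical accessor
    assert!(outcome.links.iter().any(|l| l.path() == "/next"));

    // Timeout case: ResponseTemplate::new(200).set_delay(...) beyond the client timeout.
    // Connection-refused case: point the job at a port where no server is listening.
}
```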

Prior art in the codebase.

  • tests/runner_ndjson_regression.rs — the byte-stable trip wire from slice #16; serves as the cutover guard.
  • tests/mini_http_only.rs — pattern for wiremock-backed end-to-end tests; the runner integration test follows the same structure (wiremock setup, deterministic seeds, MemorySink for event capture, no Chrome).
  • tests/escalation.rs — pattern for table-driven detection tests; informs the case list for the integration test (cf-chl-bypass body, DataDome body, JS stub, healthy HTML).
  • src/runner/mod.rs #[cfg(test)] mod tests — pattern for runner-scope unit tests using FakeFetcher; the new integration test moves to tests/ because it needs a real HTTP transport.

Out of Scope

  • The deeper SessionIdentity unification (architecture review candidate #3). SessionContext.identity remains a thin bundle of ImpersonateClient + IdentityBundle + cookies as it exists today.
  • The unified LifecycleHook collapsing hooks + events + script (candidate #5). The three layers keep their identities; only the emitter of per-attempt events changes (already specified in slice #23, activated by this PRD's cutover).
  • DiscoveryBackend registry (candidate #4). Discovery adapters are untouched.
  • Splitting config.rs (candidate #6). The runner reads from the existing flat Config via SessionContext.policy.
  • Frontier admission deepening (candidate #7). Admission stays on Crawler.
  • Render module / CDP facade. The runner uses RenderPool as it stands.
  • The fingerprint/ module — covered by a separate PRD published alongside this one.
  • Any behavioral change to fetch strategies, policy profiles, NDJSON event names, retry caps, or host cooldowns. This is a refactor with two structural deliverables (honest trait + cutover) and one cleanup deliverable (AutoFetcher decision). No new features.

Further Notes

  • Parent PRD: Extract JobRunner from crawler.rs (strangler refactor) #15. This PRD closes its outstanding scope.
  • Two ADRs from PRD #15 (docs/adr/0001-jobrunner-value-return-outcome.md, docs/adr/0002-autofetcher-escalation-via-requeue.md) continue to apply unchanged. The FetchOutput widening reinforces ADR-0001 (value-return outcome); the AutoFetcher decision reinforces ADR-0002 (no inline chain).
  • ADR-0004 will record the AutoFetcher decision (promote vs. delete).
  • After this PRD lands, the runner is the production entry point for per-Job execution. The fingerprint PRD becomes implementable cleanly: JobOutcome.signals (today a Vec<ChallengeSignalPlaceholder>) becomes the natural output slot for fingerprint::Detections, plumbed without breaking the cutover.
  • Honest expectation: the cutover PR (slice 2) is the largest single diff in this PRD's scope. It touches the middle of crawler.rs substantially. The strangler-PR style from PRD #15 keeps the rest of the slices small; the cutover itself is the surgical change the strategy was leading up to.
