
Complete JobRunner extraction: cutover, honest trait, concrete SessionContext #24

@filipeforattini

Description


Problem Statement

PRD #15 landed the JobRunner extraction in 7 strangler slices (#16-#23). Every merged slice's PR comment ended with the same caveat:

Scope honesty — what this slice does NOT do: Crawler::process_job is NOT yet rewired to call runner.run. The runner is operational and test-locked; the strangler swap-in to process_job is the next concrete follow-up.

That follow-up never landed. As of today on main:

  • src/crawler.rs is still 3604 LOC — the same as before PRD #15 started.
  • runner::JobRunner::run exists with 44 unit tests, fires per-attempt lifecycle events, classifies errors into RetryDecision, but nothing in production calls it. The runner is a parallel implementation that only unit tests exercise.
  • runner::AutoFetcher exists but Crawler never constructs one. The auto-path escalation still flows through policy::engine (which we did rewire to consume ChallengeDetector in slice #19).
  • runner::SessionContext carries six placeholder unit-structs (SessionIdentityPlaceholder, ProxyLeasePlaceholder, SessionStatePlaceholder, JobBudgetsPlaceholder, PolicyProfilePlaceholder, ChallengeSignalPlaceholder). The seam is named, not deep.
  • Fetcher::fetch returns Result<impersonate::Response>, which forces RenderFetcher::fetch to return Err(Error::Render("intentionally unimplemented")). A trait method that returns an explicit "not implemented" error is a code smell.

The honest grade for the JobRunner extraction today is C+: solid specification, four real seams integrated (Spoof / Extract / Challenge / Render), but the god-module problem is untouched and the runner is dead code in production. The slices we shipped are the scaffolding; the building was never built on top.

Solution

Close the gap. Five concrete moves, sequenced so each is reviewable and reversible:

  1. Widen the Fetcher trait to return FetchOutput — an enum carrying either an HTTP response (spoof) or a rendered page (render). Common helpers on the enum (status, headers, body, final_url) let extractors and detectors stay variant-agnostic. RenderFetcher::fetch becomes honest — it returns FetchOutput::Rendered, not an error.

  2. Cutover Crawler::process_job to call runner.run — replace the inline fetch/extract/detect block (~600 LOC in the middle of process_job) with runner.run(&job, &ctx).await plus post-processing (storage writes, frontier feed, retry decision, session-state commit). The NDJSON regression harness from slice #16 is the trip wire — byte-stable event sequence pre/post cutover.

  3. Concrete SessionContext — drop all six placeholder types, replace with the real fields the runner needs: Arc<SessionIdentity> (today still ImpersonateClient + IdentityBundle + cookies bundled), optional ProxyLease, real SessionState from antibot::SessionState, JobBudgets (render_ms / total_ms remaining), Arc<PolicyProfile>.

  4. Resolve AutoFetcher ambiguity — pick one source of truth for auto escalation. Either AutoFetcher becomes the only path (and the policy::engine escalation logic is deleted or thinned), or AutoFetcher itself is deleted and policy::engine stays as the source. Two implementations of the same logic are noise.

  5. Integration test the runner against wiremock — runner.run exercised against a real HTTP server (mock), not just FakeFetcher. Asserts the full SpoofFetcher → ChallengeDetector → Extractor pipeline produces the expected JobOutcome for healthy / challenge / network-error / timeout responses.

After these five land, the JobRunner extraction is complete: process_job shrinks measurably, the runner is the production entry point for per-Job execution, all placeholder types are gone, the trait surface is honest, and there is one source of truth for auto escalation.

User Stories

  1. As a contributor opening src/crawler.rs for the first time, I want the file to be substantially shorter, so that I can read and understand the orchestrator without scrolling through 3604 lines of inline fetch / extract / detect / event-emission code.

  2. As a contributor, I want Crawler::process_job to read as a small loop that delegates per-Job execution to JobRunner, so that the orchestrator's role (queue pull, admission, budgets, storage, frontier feed) is visually separated from the runner's role (fetch, detect, extract).

  3. As a contributor, I want Fetcher::fetch to return a FetchOutput enum (Http or Rendered), so that both spoof and render paths satisfy the same trait without one of them returning a fake error.

  4. As a contributor, I want common helpers on FetchOutput (status, headers, body, final_url), so that downstream code (extractor, challenge detector, link extraction) can stay variant-agnostic and avoid match boilerplate.

  5. As a contributor, I want AutoFetcher::fetch to call SpoofFetcher::fetch then return FetchOutput::Http together with RetryDecision::Suggest { reason: EscalateToRender, .. } when a challenge is detected, so that the re-queue contract from ADR-0002 is enforced by the type system rather than by policy::engine side effects.

  6. As a contributor, I want RenderFetcher::fetch to return FetchOutput::Rendered(RenderedPage) honestly, so that the render path uses the same trait as spoof without lossy conversions or fake error returns.

  7. As a contributor, I want JobRunner::run to be the only place where one Job becomes one JobOutcome, so that per-attempt timings, signals, links, retry decision, and new session state all flow through one call site.

  8. As a contributor, I want Crawler::process_job to construct a SessionContext, call runner.run, and then post-process storage / frontier / retry / session-state commit, so that the decide-vs-do split from PRD #15 is visible in the code structure.

  9. As a contributor, I want SessionContext.identity to be a concrete Arc<SessionIdentity> carrying profile + cookies + persona, so that I stop reading "Placeholder" in field types.

  10. As a contributor, I want SessionContext.proxy: Option<ProxyLease> to carry a real lease from proxy::ProxyRouter, so that the runner knows which proxy each attempt is using without reaching back into Crawler.

  11. As a contributor, I want SessionContext.session_state to be the real antibot::SessionState enum, so that the runner can pass it to detection logic and return mutations via JobOutcome.new_session_state without translation layers.

  12. As a contributor, I want SessionContext.budgets: JobBudgets to carry remaining render_ms / total_ms / attempts, so that the runner can fail fast with JobError::BudgetExhausted instead of issuing a doomed fetch.

  13. As a contributor, I want SessionContext.policy: Arc<PolicyProfile> to be the resolved profile (fast / balanced / deep / forensics), so that escalation decisions inside the runner read the same profile the Crawler resolved at admission.

  14. As a contributor, I want exactly one source of truth for "should this auto job escalate to render?", so that fixing a regression in escalation logic means touching one file, not two.

  15. As a contributor, I want a single decision recorded in an ADR explaining whether AutoFetcher or policy::engine is the owner, so that the choice is not re-litigated every six months.

  16. As an operator running a real crawl, I want the NDJSON event stream produced before the cutover to be byte-identical to the stream produced after, so that downstream consumers (SDK, analytics, dashboards) keep working without coordination.

  17. As an operator, I want the per-attempt events (job.started, fetch.completed, challenge.detected, extract.completed, job.failed) to fire from inside the runner once it is the entry point, so that adding a new lifecycle event is a one-file change.

  18. As a contributor adding a new fetch path in the future, I want Fetcher to be a real trait with Arc<dyn Fetcher> swap-in, so that a third backend (e.g., a Playwright bridge, a remote browser farm) can land without touching Crawler::process_job.

  19. As a contributor, I want at least one integration test that drives runner.run against a wiremock server (no Chrome, no FakeFetcher), so that regressions in the SpoofFetcher → ChallengeDetector → Extractor pipeline are caught without needing a full Crawler boot.

  20. As a contributor, I want the integration test to cover four scenarios — healthy 200, challenge 403 (Cloudflare body), network failure (timeout), connection refused — so that JobError classification and RetryDecision mapping are exercised against real HTTP transport.

  21. As a contributor, I want cargo test --all-features --lib runner:: to keep passing after every commit in this PRD, so that the existing 44 runner unit tests act as a contract guard during the cutover.

  22. As a contributor, I want cargo test --all-features --test runner_ndjson_regression to remain byte-stable through every commit, so that the wire format pinned in slice #16 stays the trip wire.

  23. As a future maintainer, I want CONTEXT.md to be updated to drop the "placeholder" disclaimer from SessionContext, so that the glossary reflects what the type actually carries.

  24. As a future maintainer, I want an ADR explaining the AutoFetcher vs. policy::engine decision, so that re-introducing two paths is recognized as a regression.

  25. As a future maintainer, I want the src/crawler.rs LOC delta documented in the cutover PR description (before / after), so that the goal "deepen the god-module" is measurable.

Implementation Decisions

Sequence. Five items land as separate PRs in this order: FetchOutput widening (#4 in the grilling, this PRD's slice 1) → cutover (#1, slice 2) → concrete SessionContext (#2, slice 3) → AutoFetcher decision (#3, slice 4) → integration test (#5, slice 5). The AutoFetcher decision and the integration test are independent of the SessionContext work and can run in parallel with it once the cutover lands. Sequence rationale: the trait shape must be honest before the cutover; the cutover surfaces which placeholder types are actually needed; the AutoFetcher decision becomes obvious after the cutover (either it's the source of truth or it's dead code); the integration test locks the contract once the wiring is real.

FetchOutput shape. Enum with two variants — Http(impersonate::Response) and Rendered(RenderedPage). Common helpers on the enum return the fields callers reach for most: status() -> u16, headers() -> &HeaderMap, body() -> &[u8], final_url() -> &Url. Render-only fields (Web Vitals, screenshot, ScriptSpec RunOutcome) require a match to access. Rationale: callers that only want body / status (extractor, challenge detector) stay simple; callers that legitimately need render extras (storage of vitals, screenshot artifacts) are forced to acknowledge they are render-specific.
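
A minimal sketch of this shape, with stand-in structs for the two payloads (the real types are impersonate::Response and RenderedPage; the field layouts here are assumptions):

```rust
use http::HeaderMap;
use url::Url;

// Stand-ins for impersonate::Response / RenderedPage; field layouts are assumptions.
pub struct HttpResponse { pub status: u16, pub headers: HeaderMap, pub body: Vec<u8>, pub final_url: Url }
pub struct RenderedPage { pub status: u16, pub headers: HeaderMap, pub body: Vec<u8>, pub final_url: Url }
// RenderedPage additionally carries Web Vitals / screenshot / ScriptSpec RunOutcome,
// reachable only through an explicit match on the enum.

pub enum FetchOutput {
    Http(HttpResponse),
    Rendered(RenderedPage),
}

impl FetchOutput {
    // Variant-agnostic helpers so extractors and detectors avoid match boilerplate.
    pub fn status(&self) -> u16 {
        match self { Self::Http(r) => r.status, Self::Rendered(p) => p.status }
    }
    pub fn headers(&self) -> &HeaderMap {
        match self { Self::Http(r) => &r.headers, Self::Rendered(p) => &p.headers }
    }
    pub fn body(&self) -> &[u8] {
        match self { Self::Http(r) => &r.body, Self::Rendered(p) => &p.body }
    }
    pub fn final_url(&self) -> &Url {
        match self { Self::Http(r) => &r.final_url, Self::Rendered(p) => &p.final_url }
    }
}
```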

Trait signature. async fn fetch(&self, job: &Job, ctx: &SessionContext) -> Result<FetchOutput>. SpoofFetcher returns Ok(FetchOutput::Http(...)). RenderFetcher returns Ok(FetchOutput::Rendered(...)). AutoFetcher always returns Ok(FetchOutput::Http(...)) (it never chains to render inline, per ADR-0002) and signals escalation via the JobOutcome.retry projection from AutoOutcome::into_retry_decision().
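
As a sketch, assuming async_trait for the Arc<dyn Fetcher> swap-in user story 18 asks for (the pool helper call is hypothetical; Result is the crate's alias):

```rust
use async_trait::async_trait;

#[async_trait]
pub trait Fetcher: Send + Sync {
    async fn fetch(&self, job: &Job, ctx: &SessionContext) -> Result<FetchOutput>;
}

#[async_trait]
impl Fetcher for RenderFetcher {
    async fn fetch(&self, job: &Job, ctx: &SessionContext) -> Result<FetchOutput> {
        let page = self.pool.render(job, ctx).await?; // hypothetical RenderPool helper
        // Honest return: no more Err(Error::Render("intentionally unimplemented")).
        Ok(FetchOutput::Rendered(page))
    }
}
```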

Cutover scope. Crawler::process_job keeps the front (queue pull, admission, budget check, robots, dedupe, rate limit) and the back (storage writes, frontier feed, retry decision honoring caps and cooldowns, session-state commit, hooks, run-level events). The middle — fetch dispatch, extract, challenge detect, per-attempt events — becomes a single let outcome = self.runner.run(&job, &ctx).await; call. Crawler constructs the Arc<JobRunner> once at startup (or lazily after the first job) using the existing SpoofFetcher, RenderFetcher when the cdp-backend feature is enabled, and AutoFetcher (or its replacement, per the AutoFetcher decision below).
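
The middle of process_job then reads roughly like this (the helper names on Crawler are hypothetical; the front and back halves are the existing code):

```rust
// Front half unchanged: queue pull, admission, budget check, robots, dedupe, rate limit.
let ctx = self.build_session_context(&job)?;     // hypothetical constructor
let outcome = self.runner.run(&job, &ctx).await; // replaces the ~600-LOC inline block

// Back half unchanged: the Crawler stays the decider.
self.persist(&job, &outcome).await?;                   // storage writes
self.feed_frontier(&outcome.links).await?;             // frontier feed
self.apply_retry_policy(&job, &outcome.retry);         // caps + cooldowns (advice, per slice #22)
self.commit_session_state(outcome.new_session_state);  // session-state commit
```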

Concrete SessionContext. Six placeholder types removed. Real fields: Arc<SessionIdentity> (still composing existing ImpersonateClient + IdentityBundle + cookies — full identity unification is candidate #3 from the architecture review and stays out of scope here), Option<ProxyLease> from proxy::ProxyRouter, antibot::SessionState, JobBudgets (struct with render_ms_left, total_ms_left, attempts_remaining), Arc<PolicyProfile> from policy::PolicyProfile.
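
A sketch of the resulting struct, using the field names from the user stories above; the referenced types live in their existing modules and their exact shapes are assumptions:

```rust
use std::sync::Arc;

pub struct SessionContext {
    pub identity: Arc<SessionIdentity>,       // ImpersonateClient + IdentityBundle + cookies, still bundled
    pub proxy: Option<ProxyLease>,            // real lease from proxy::ProxyRouter
    pub session_state: antibot::SessionState, // no translation layer
    pub budgets: JobBudgets,
    pub policy: Arc<PolicyProfile>,           // resolved fast/balanced/deep/forensics profile
}

// Remaining budgets so the runner can fail fast with JobError::BudgetExhausted.
pub struct JobBudgets {
    pub render_ms_left: u64,
    pub total_ms_left: u64,
    pub attempts_remaining: u32,
}
```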

AutoFetcher decision. Two options, decision recorded as ADR-0004:

  • Option A (Promote AutoFetcher): Crawler constructs Arc<AutoFetcher> for Method::Auto jobs. runner.run delegates to it. policy::engine no longer makes escalation decisions for the auto path; it stays the source of truth for retry vs drop on other failure modes (transient 5xx, robots, budget exhaustion). The challenge-driven escalation lives in one type.

  • Option B (Delete AutoFetcher): Code removed. Crawler keeps policy::engine as the single arbiter of "this job needs to be re-queued as Render" — the path that already runs in production today.

Recommendation in the grilling was B (delete) because: (1) policy::engine already runs and is tested; (2) AutoFetcher has no consumer on main; (3) escalation policy needs cross-job context (render budgets, host cooldowns, retry caps) that lives naturally in the Crawler/policy::engine boundary, not inside a per-Job fetcher. Final decision deferred to triage on this PRD; whichever wins, an ADR records why.

Retry policy stays on Crawler. Per PRD #15 Q11 and slice #22's contract: JobOutcome.retry is advice. The Crawler reads RetryDecision::Suggest { reason, backoff_hint } and applies retry caps, host cooldowns, budget accounting before deciding to re-enqueue. No change in this PRD; just confirming the boundary holds through the cutover.
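
In code, that advice boundary looks roughly like this (variant names besides Suggest, and the Crawler helper methods, are assumptions):

```rust
// Crawler post-processing: JobOutcome.retry is advice, never a command.
match outcome.retry {
    RetryDecision::Suggest { reason, backoff_hint } => {
        // The Crawler alone applies retry caps, host cooldowns, and budget accounting.
        if self.retry_caps_allow(&job) && !self.host_in_cooldown(&job.url) {
            self.re_enqueue(job, reason, backoff_hint); // hypothetical helper
        }
    }
    RetryDecision::None => { /* terminal outcome: persist and move on */ }
}
```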

Event firing moves into the runner. The dormant per-attempt events from slice #23 (JobStarted, FetchCompleted, ChallengeDetected, ExtractCompleted, JobFailed) fire on the wire once the runner is the entry point. Run-level / decision events (JobEnqueued, JobAdmitted, JobRetried, JobPersisted, FrontierSeeded, RunStarted, RunCompleted) stay on the Crawler. NDJSON byte-stability is the contract — the regression test from slice #16 is the trip wire.

LOC accountability. The cutover PR description must include the before / after LOC delta for src/crawler.rs. The goal from PRD #15 was to deepen the god-module; this PRD makes the delta visible so the goal is measurable.

Testing Decisions

Definition of a good test for this work. Tests assert behavior at the public boundary of the modified module. For Fetcher and its impls, that is the fetch method's return value plus any state-visible side effect (e.g., cookies updated on the client). For JobRunner::run, that is the returned JobOutcome plus the sequence of events fired through the injected EventSink. No assertion on private field shapes, helper names, or call order beyond what the public contract guarantees.
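
Concretely, a boundary-level runner test asserts only the returned outcome and the captured event kinds (MemorySink is the existing capture helper; the constructor shape and accessor name are assumptions):

```rust
let sink = MemorySink::default();
let runner = JobRunner::new(fetcher, detector, extractor, sink.clone()); // hypothetical ctor shape

let outcome = runner.run(&job, &ctx).await;

// Public contract only: outcome fields and the event sequence, nothing internal.
assert!(outcome.links.is_empty());
assert_eq!(
    sink.event_kinds(), // hypothetical accessor on the capture sink
    ["job.started", "fetch.completed", "extract.completed"]
);
```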

Modules to be tested.

  1. FetchOutput helpers — unit tests that the enum's status, headers, body, final_url helpers return the right thing for both Http and Rendered variants. Trivial but pins the surface.

  2. SpoofFetcher returns FetchOutput::Http — wrapper conformance, no real network. The existing 3 unit tests in runner::fetcher::spoof::tests extend with one case asserting variant correctness.

  3. RenderFetcher returns FetchOutput::Rendered — same shape. Replaces the existing "intentionally unimplemented" guard test from slice #20.

  4. AutoFetcher (if promoted in Option A) — async unit test against fake spoof + fake render: assert challenge response yields FetchOutput::Http + retry EscalateToRender, healthy response yields FetchOutput::Http + retry None, render fake is never called by AutoFetcher::fetch (ADR-0002 inline-chain ban).

  5. JobRunner::run end-to-end against wiremock — new integration test under tests/ (see the sketch after this list). Drives runner.run for four cases: healthy 200 with links, 403 Cloudflare challenge, network timeout (wiremock delay > timeout), connection refused (wiremock not started). Asserts JobOutcome fields populate correctly and JobError / RetryDecision map per the contract.

  6. process_job integration regression — re-uses the NDJSON regression test from slice #16. Pre and post cutover, the captured event-kind sequence must remain byte-identical. New scenarios may be added but the existing golden file does not change.
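
A sketch of the healthy-200 case (the wiremock calls are the crate's real API; Job/SessionContext/runner construction is abbreviated and the job_for helper and status accessor are hypothetical):

```rust
use wiremock::{Mock, MockServer, ResponseTemplate};
use wiremock::matchers::{method, path};

#[tokio::test]
async fn runner_healthy_200_extracts_links() {
    let server = MockServer::start().await;
    Mock::given(method("GET"))
        .and(path("/page"))
        .respond_with(ResponseTemplate::new(200)
            .set_body_string(r#"<html><a href="/next">next</a></html>"#))
        .mount(&server)
        .await;

    // Real SpoofFetcher -> ChallengeDetector -> Extractor wiring; setup elided.
    let job = job_for(format!("{}/page", server.uri())); // hypothetical test helper
    let outcome = runner.run(&job, &ctx).await;

    assert_eq!(outcome.status(), Some(200)); // hypothetical accessor
    assert!(outcome.links.iter().any(|l| l.path() == "/next"));

    // Timeout case: ResponseTemplate::new(200).set_delay(...) beyond the client timeout.
    // Connection-refused case: point the job at a port where no server is listening.
}
```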

Prior art in the codebase.

  • tests/runner_ndjson_regression.rs — the byte-stable trip wire from slice #16; serves as the cutover guard.
  • tests/mini_http_only.rs — pattern for wiremock-backed end-to-end tests; the runner integration test follows the same structure (wiremock setup, deterministic seeds, MemorySink for event capture, no Chrome).
  • tests/escalation.rs — pattern for table-driven detection tests; informs the case list for the integration test (cf-chl-bypass body, DataDome body, JS stub, healthy HTML).
  • src/runner/mod.rs #[cfg(test)] mod tests — pattern for runner-scope unit tests using FakeFetcher; the new integration test moves to tests/ because it needs a real HTTP transport.

Out of Scope

  • The deeper SessionIdentity unification (architecture review candidate #3). SessionContext.identity remains a thin bundle of ImpersonateClient + IdentityBundle + cookies as it exists today.
  • The unified LifecycleHook collapsing hooks + events + script (candidate #5). The three layers keep their identities; only the emitter of per-attempt events changes (already specified in slice #23, activated by this PRD's cutover).
  • DiscoveryBackend registry (candidate #4). Discovery adapters are untouched.
  • Splitting config.rs (candidate #6). The runner reads from the existing flat Config via SessionContext.policy.
  • Frontier admission deepening (candidate #7). Admission stays on Crawler.
  • Render module / CDP facade. The runner uses RenderPool as it stands.
  • The fingerprint/ module — covered by a separate PRD published alongside this one.
  • Any behavioral change to fetch strategies, policy profiles, NDJSON event names, retry caps, or host cooldowns. This is a refactor with two structural deliverables (honest trait + cutover) and one cleanup deliverable (AutoFetcher decision). No new features.

Further Notes

  • Parent PRD: Extract JobRunner from crawler.rs (strangler refactor) #15. This PRD closes its outstanding scope.
  • Two ADRs from PRD #15 (docs/adr/0001-jobrunner-value-return-outcome.md, docs/adr/0002-autofetcher-escalation-via-requeue.md) continue to apply unchanged. The FetchOutput widening reinforces ADR-0001 (value-return outcome); the AutoFetcher decision reinforces ADR-0002 (no inline chain).
  • ADR-0004 will record the AutoFetcher decision (promote vs. delete).
  • After this PRD lands, the runner is the production entry point for per-Job execution. The fingerprint PRD becomes implementable cleanly: JobOutcome.signals (today a Vec<ChallengeSignalPlaceholder>) becomes the natural output slot for fingerprint::Detections, plumbed without breaking the cutover.
  • Honest expectation: the cutover PR (slice 2) is the largest single diff in this PRD's scope. It touches the middle of crawler.rs substantially. The strangler-PR style from PRD #15 keeps the rest of the slices small; the cutover itself is the surgical change the strategy was leading up to.
