runner: JobRunner::run becomes per-Job entry point in Crawler::process_job #22

@filipeforattini

Description

Parent

#15

What to build

Promote JobRunner::run to the per-Job entry point. Crawler::process_job becomes: build SessionContext, call runner.run(job, ctx).await, then post-process the returned JobOutcome (storage write, frontier feed, retry decision based on RetryDecision, commit new_session_state if present).
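The flow described above can be sketched roughly as follows. This is a simplified, synchronous stand-in (the issue's `runner.run(job, ctx)` is awaited), and everything beyond the names in the issue is hypothetical: the `Runner` trait, `FakeRunner`, the `DoNotRetry` variant, the fields on `Crawler`, and the `body`/`links` fields on `FetchSuccess`. Retry caps and cooldowns are deliberately left out here.

```rust
// Simplified stand-ins; field layouts are assumptions, not the real types.
#[derive(Debug, Clone, PartialEq)]
pub struct SessionState(pub String);

#[derive(Debug, PartialEq)]
pub enum RetryDecision {
    DoNotRetry, // hypothetical name for the "no retry" case
    Suggest,    // reason and backoff_hint omitted in this sketch
}

pub struct Job { pub url: String }
pub struct SessionContext { pub state: Option<SessionState> }
pub struct FetchSuccess { pub body: String, pub links: Vec<String> }
#[derive(Debug)]
pub struct JobError;

pub struct JobOutcome {
    pub result: Result<FetchSuccess, JobError>,
    pub retry: RetryDecision,
    pub new_session_state: Option<SessionState>,
}

// Hypothetical seam so the crawler can be exercised with a fake runner.
pub trait Runner {
    fn run(&self, job: &Job, ctx: &SessionContext) -> JobOutcome;
}

pub struct FakeRunner;
impl Runner for FakeRunner {
    fn run(&self, job: &Job, _ctx: &SessionContext) -> JobOutcome {
        if job.url.ends_with("/blocked") {
            JobOutcome {
                result: Err(JobError),
                retry: RetryDecision::Suggest,
                new_session_state: Some(SessionState("challenged".to_string())),
            }
        } else {
            JobOutcome {
                result: Ok(FetchSuccess {
                    body: format!("<html>{}</html>", job.url),
                    links: vec![format!("{}/next", job.url)],
                }),
                retry: RetryDecision::DoNotRetry,
                new_session_state: None,
            }
        }
    }
}

#[derive(Default)]
pub struct Crawler {
    pub stored: Vec<String>,
    pub frontier: Vec<String>,
    pub retry_queue: Vec<String>,
    pub session_state: Option<SessionState>,
}

impl Crawler {
    pub fn process_job(&mut self, runner: &dyn Runner, job: Job) {
        // 1. Build the per-job context.
        let ctx = SessionContext { state: self.session_state.clone() };
        // 2. Hand the job to the runner.
        let outcome = runner.run(&job, &ctx);
        // 3. Post-process: storage write and frontier feed on success.
        if let Ok(success) = outcome.result {
            self.stored.push(success.body);
            self.frontier.extend(success.links);
        }
        // 4. Retry decision (caps and cooldowns omitted in this sketch).
        if outcome.retry == RetryDecision::Suggest {
            self.retry_queue.push(job.url);
        }
        // 5. Commit new session state if the runner returned one.
        if let Some(state) = outcome.new_session_state {
            self.session_state = Some(state);
        }
    }
}
```

The point of the shape is that `process_job` owns every side effect, while the runner only returns a value.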

JobRunner is held as Arc<JobRunner>, constructed once at run start with Arc<dyn Fetcher> (a dispatcher that picks Spoof/Render/Auto by Method), Arc<Extractor>, Arc<ChallengeDetector>, Arc<EventSink>, Arc<HookRegistry>. Shared across all workers. Send + Sync, no per-call mutable state on self.
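A minimal sketch of the sharing model, assuming plain OS threads in place of the real async workers and a `Fetcher` trait whose signature is invented for illustration. Only the fetcher is wired in; the other collaborators (`Extractor`, `ChallengeDetector`, `EventSink`, `HookRegistry`) are omitted. Routing `Auto` to the spoof fetcher is an assumption of this sketch, not a statement about the real dispatcher.

```rust
use std::sync::Arc;
use std::thread;

#[derive(Clone, Copy)]
pub enum Method { Spoof, Render, Auto }

// Hypothetical signature; the real Fetcher trait is not shown in the issue.
pub trait Fetcher: Send + Sync {
    fn fetch(&self, method: Method, url: &str) -> String;
}

// Stand-in fetcher that tags its output so routing is observable.
pub struct Tagged(pub &'static str);
impl Fetcher for Tagged {
    fn fetch(&self, _method: Method, url: &str) -> String {
        format!("{}:{}", self.0, url)
    }
}

// Dispatcher that picks a concrete fetcher by Method.
pub struct Dispatcher {
    pub spoof: Arc<dyn Fetcher>,
    pub render: Arc<dyn Fetcher>,
}
impl Fetcher for Dispatcher {
    fn fetch(&self, method: Method, url: &str) -> String {
        match method {
            Method::Render => self.render.fetch(method, url),
            // Assumption: Auto starts on the cheaper spoof path.
            Method::Spoof | Method::Auto => self.spoof.fetch(method, url),
        }
    }
}

// No interior mutability and only &self methods: no per-call state on self.
pub struct JobRunner { fetcher: Arc<dyn Fetcher> }
impl JobRunner {
    pub fn new(fetcher: Arc<dyn Fetcher>) -> Self { Self { fetcher } }
    pub fn run(&self, method: Method, url: &str) -> String {
        self.fetcher.fetch(method, url)
    }
}

// Compile-time check: instantiating this with T fails unless T is Send + Sync.
pub fn assert_send_sync<T: Send + Sync>() {}

// Constructed once at run start, then cloned (cheaply) into each worker.
pub fn demo_runner() -> Arc<JobRunner> {
    let dispatcher = Dispatcher {
        spoof: Arc::new(Tagged("spoof")),
        render: Arc::new(Tagged("render")),
    };
    Arc::new(JobRunner::new(Arc::new(dispatcher)))
}

pub fn run_on_workers(runner: Arc<JobRunner>, jobs: Vec<(Method, String)>) -> Vec<String> {
    let handles: Vec<_> = jobs
        .into_iter()
        .map(|(method, url)| {
            let runner = Arc::clone(&runner);
            thread::spawn(move || runner.run(method, &url))
        })
        .collect();
    // Join in spawn order so results line up with the input jobs.
    handles.into_iter().map(|h| h.join().expect("worker panicked")).collect()
}
```

Calling `assert_send_sync::<JobRunner>()` from a test is a cheap way to keep the `Send + Sync` requirement from regressing when new fields are added.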

SessionContext carries: Arc<SessionIdentity> (thin placeholder bundling current ImpersonateClient + IdentityBundle + cookies — full unification is out of scope), optional ProxyLease, SessionState, JobBudgets, Arc<PolicyProfile>.
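One possible layout for the context, as a sketch only. The five members mirror the list above; every inner field (`client`, `endpoint`, `max_attempts`, and so on) and the `for_job` constructor are hypothetical placeholders, since the issue does not define them.

```rust
use std::sync::Arc;
use std::time::Duration;

// Thin placeholder bundle; the string fields stand in for the real
// ImpersonateClient and IdentityBundle types.
pub struct SessionIdentity {
    pub client: String,
    pub identity_bundle: String,
    pub cookies: Vec<(String, String)>,
}

pub struct ProxyLease { pub endpoint: String }

#[derive(Clone, Default)]
pub struct SessionState { pub challenge_solved: bool }

pub struct JobBudgets { pub max_wall_time: Duration, pub max_attempts: u32 }

pub struct PolicyProfile { pub name: String }

pub struct SessionContext {
    pub identity: Arc<SessionIdentity>,
    pub proxy: Option<ProxyLease>,
    pub state: SessionState,
    pub budgets: JobBudgets,
    pub policy: Arc<PolicyProfile>,
}

impl SessionContext {
    // Hypothetical constructor: shared parts are Arc clones, so building a
    // fresh context before each call stays cheap.
    pub fn for_job(
        identity: &Arc<SessionIdentity>,
        policy: &Arc<PolicyProfile>,
        state: SessionState,
        proxy: Option<ProxyLease>,
    ) -> Self {
        Self {
            identity: Arc::clone(identity),
            proxy,
            state,
            // Placeholder budget values for illustration only.
            budgets: JobBudgets { max_wall_time: Duration::from_secs(30), max_attempts: 3 },
            policy: Arc::clone(policy),
        }
    }
}
```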

JobOutcome shape locked: result: Result<FetchSuccess, JobError>, timings: JobTimings (populated on both branches), retry: RetryDecision, new_session_state: Option<SessionState>. JobError variants: Network, Timeout, RenderFailed, ChallengeUnrecoverable, BudgetExhausted, Cancelled.
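The locked shape might look like the sketch below. The four `JobOutcome` fields and the `JobError` variants come from the issue; the variant payloads, the `DoNotRetry` variant, the `Option<Duration>` type of `backoff_hint`, the inner fields of `FetchSuccess`, `JobTimings`, and `SessionState`, and the `timed_out` helper are all assumptions.

```rust
use std::time::Duration;

#[derive(Debug, Clone, PartialEq)]
pub struct FetchSuccess { pub status: u16, pub body: Vec<u8> }

// Unit variants for brevity; the real variants may carry payloads.
#[derive(Debug, Clone, PartialEq)]
pub enum JobError {
    Network,
    Timeout,
    RenderFailed,
    ChallengeUnrecoverable,
    BudgetExhausted,
    Cancelled,
}

#[derive(Debug, Clone, Default, PartialEq)]
pub struct JobTimings { pub total: Duration }

#[derive(Debug, Clone, PartialEq)]
pub enum RetryReason { EscalateToRender, Timeout, Network, ChallengeRecoverable }

#[derive(Debug, Clone, PartialEq)]
pub enum RetryDecision {
    DoNotRetry, // hypothetical name for the non-Suggest case
    Suggest { reason: RetryReason, backoff_hint: Option<Duration> },
}

#[derive(Debug, Clone, PartialEq)]
pub struct SessionState { pub cookies: Vec<String> }

#[derive(Debug)]
pub struct JobOutcome {
    pub result: Result<FetchSuccess, JobError>,
    // Lives outside `result`, so it is populated on both branches.
    pub timings: JobTimings,
    pub retry: RetryDecision,
    pub new_session_state: Option<SessionState>,
}

impl JobOutcome {
    // Hypothetical helper: a failure outcome that still carries timings.
    pub fn timed_out(elapsed: Duration) -> Self {
        Self {
            result: Err(JobError::Timeout),
            timings: JobTimings { total: elapsed },
            retry: RetryDecision::Suggest {
                reason: RetryReason::Timeout,
                backoff_hint: None,
            },
            new_session_state: None,
        }
    }
}
```

Keeping `timings`, `retry`, and `new_session_state` as siblings of `result` rather than inside it is what lets the caller read them without first matching on success or failure.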

Retry policy stays on Crawler: it reads RetryDecision::Suggest { reason, backoff_hint } and applies retry caps, host cooldowns, and budget accounting before deciding to re-enqueue.
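As a rough illustration of that split, the sketch below treats the runner's `Suggest` as advice and lets a crawler-side policy make the final call. `RetryPolicy`, its fields, the `decide` method, the `DoNotRetry` variant, and the `Option<Duration>` type for `backoff_hint` are hypothetical; the real accounting is likely richer.

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

pub enum RetryReason { EscalateToRender, Timeout, Network, ChallengeRecoverable }

pub enum RetryDecision {
    DoNotRetry, // hypothetical name for the non-Suggest case
    Suggest { reason: RetryReason, backoff_hint: Option<Duration> },
}

// Hypothetical crawler-side policy state.
pub struct RetryPolicy {
    pub max_attempts: u32,          // per-job retry cap
    pub retry_budget: u32,          // retries left for the whole run
    pub default_backoff: Duration,  // used when the runner gives no hint
    pub host_cooldown_until: HashMap<String, Instant>,
}

impl RetryPolicy {
    /// Returns the earliest instant at which the job may be re-enqueued,
    /// or None if the crawler declines to retry.
    pub fn decide(
        &mut self,
        decision: &RetryDecision,
        host: &str,
        attempts: u32,
        now: Instant,
    ) -> Option<Instant> {
        let backoff = match decision {
            RetryDecision::DoNotRetry => return None,
            RetryDecision::Suggest { backoff_hint, .. } => {
                backoff_hint.unwrap_or(self.default_backoff)
            }
        };
        // Caps and budget override the runner's suggestion.
        if attempts >= self.max_attempts || self.retry_budget == 0 {
            return None;
        }
        self.retry_budget -= 1;
        // Never schedule earlier than an active host cooldown.
        let mut ready_at = now + backoff;
        if let Some(&until) = self.host_cooldown_until.get(host) {
            ready_at = ready_at.max(until);
        }
        Some(ready_at)
    }
}
```

With a 5 s hint and a 30 s cooldown on the host, `decide` would return the cooldown deadline; a second call with the budget spent returns `None`.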

Acceptance criteria

  • JobRunner::run is the per-Job entry point called from Crawler::process_job
  • SessionContext constructed by Crawler before each call
  • JobOutcome returned by value; Crawler post-processes storage, frontier, retry, session-state commit
  • JobError enum defined with the listed variants
  • RetryDecision::Suggest reasons map cleanly: EscalateToRender, Timeout, Network, ChallengeRecoverable
  • Crawler retry path honors retry caps, host cooldowns, budgets when interpreting RetryDecision::Suggest
  • #[cfg(test)] mod tests for JobRunner::run with fake Fetcher: assert timings populated on failure, new_session_state returned on challenge
  • #[cfg(test)] mod tests for the JobError → RetryDecision mapping
  • NDJSON regression test from #16 (runner: bootstrap module shells + NDJSON regression test harness) still passes byte-for-byte
  • cargo test --all-features green
  • src/crawler.rs LOC measurably reduced (track delta in PR description)
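For the JobError → RetryDecision criterion, one candidate mapping is sketched below. The issue does not spell the mapping out, so every arm here is an assumption, in particular treating `RenderFailed` as terminal. `retry_for`, `DoNotRetry`, and the `Option<Duration>` hint type are hypothetical names. `EscalateToRender` and `ChallengeRecoverable` do not appear on the right-hand side because, in this sketch, they are assumed to be suggested by the runner from fetch results rather than derived from a terminal `JobError`.

```rust
use std::time::Duration;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum JobError {
    Network,
    Timeout,
    RenderFailed,
    ChallengeUnrecoverable,
    BudgetExhausted,
    Cancelled,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum RetryReason { EscalateToRender, Timeout, Network, ChallengeRecoverable }

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum RetryDecision {
    DoNotRetry, // hypothetical name for the non-Suggest case
    Suggest { reason: RetryReason, backoff_hint: Option<Duration> },
}

// Hypothetical mapping: transient transport errors suggest a retry,
// everything else is treated as terminal.
pub fn retry_for(err: JobError) -> RetryDecision {
    match err {
        JobError::Network => RetryDecision::Suggest {
            reason: RetryReason::Network,
            backoff_hint: None,
        },
        JobError::Timeout => RetryDecision::Suggest {
            reason: RetryReason::Timeout,
            backoff_hint: None,
        },
        // Assumption: none of these benefit from an automatic retry.
        JobError::RenderFailed
        | JobError::ChallengeUnrecoverable
        | JobError::BudgetExhausted
        | JobError::Cancelled => RetryDecision::DoNotRetry,
    }
}
```

Because the `match` is exhaustive with no wildcard arm, adding a `JobError` variant later forces a compile error here until the mapping is extended, which pairs well with a `#[cfg(test)] mod tests` that asserts one case per variant.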

Labels

enhancement, needs-triage, rust
