runner: Extractor struct for links and asset classification #18

@filipeforattini

Description

Parent

#15

What to build

Introduce Extractor as a plain struct in src/runner/extract.rs. Move the link extraction and asset classification logic out of crawler.rs (and/or wrap the existing src/extract/ module) so that both sit behind this single seam. The interface is fn extract(&self, response: &FetchResponse) -> ExtractedContent, where ExtractedContent carries a Vec<DiscoveredUrl> plus the asset classifications.

No trait is needed: there is only one impl and the logic is pure. Crawler::process_job calls Extractor directly after a successful fetch.
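As a rough illustration of the seam described above, here is a minimal sketch using only the standard library. The type names Extractor, FetchResponse, ExtractedContent and DiscoveredUrl come from this issue, but every field, the AssetKind enum, and the classify helper are assumptions made for the example. The naive attribute scan stands in for whatever the existing src/extract/ module actually does, and relative URL resolution is deliberately left out.

```rust
/// Hypothetical classification; the issue does not name the real variants.
#[derive(Debug, Clone, PartialEq)]
pub enum AssetKind {
    Page,
    Image,
    Stylesheet,
    Script,
}

/// Field layout is assumed for illustration.
#[derive(Debug, Clone, PartialEq)]
pub struct DiscoveredUrl {
    pub url: String,
    pub kind: AssetKind,
}

/// Field layout is assumed; `url` would be the base for resolving
/// relative links, which this sketch does not attempt.
pub struct FetchResponse {
    pub url: String,
    pub body: String,
}

#[derive(Debug, Default, PartialEq)]
pub struct ExtractedContent {
    pub urls: Vec<DiscoveredUrl>,
}

/// Plain struct, no trait: one impl, no I/O, no shared state.
#[derive(Debug, Default, Clone, Copy)]
pub struct Extractor;

impl Extractor {
    pub fn extract(&self, response: &FetchResponse) -> ExtractedContent {
        let body = response.body.as_str();
        let mut found: Vec<(usize, DiscoveredUrl)> = Vec::new();
        // Naive scan for double-quoted href/src values. A real
        // implementation would use an HTML parser instead.
        for attr in ["href=\"", "src=\""] {
            let mut offset = 0;
            while let Some(pos) = body[offset..].find(attr) {
                let value_start = offset + pos + attr.len();
                let value_len = match body[value_start..].find('"') {
                    Some(len) => len,
                    None => break, // unterminated attribute: stop scanning
                };
                let raw = &body[value_start..value_start + value_len];
                if !raw.is_empty() {
                    let discovered = DiscoveredUrl {
                        url: raw.to_string(),
                        kind: classify(raw),
                    };
                    found.push((value_start, discovered));
                }
                offset = value_start + value_len + 1;
            }
        }
        // Restore document order so the output stays deterministic.
        found.sort_by_key(|(pos, _)| *pos);
        ExtractedContent {
            urls: found.into_iter().map(|(_, url)| url).collect(),
        }
    }
}

/// Hypothetical extension-based classifier; anything unknown is a page.
fn classify(url: &str) -> AssetKind {
    let path = url.split(|c| c == '?' || c == '#').next().unwrap_or(url);
    let last_segment = path.rsplit('/').next().unwrap_or(path);
    let ext = last_segment
        .rsplit_once('.')
        .map(|(_, ext)| ext.to_ascii_lowercase());
    match ext.as_deref() {
        Some("png") | Some("jpg") | Some("jpeg") | Some("gif") | Some("svg") => AssetKind::Image,
        Some("css") => AssetKind::Stylesheet,
        Some("js") => AssetKind::Script,
        _ => AssetKind::Page,
    }
}
```

Under these assumptions, Crawler::process_job would reduce to a single call such as Extractor.extract(&response) after a successful fetch. Sorting by byte offset keeps the discovered URLs in document order, which matters if the NDJSON output has to stay byte-for-byte stable.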

Acceptance criteria

  • Extractor struct in src/runner/extract.rs
  • Link extraction + asset classification flow through Extractor
  • Crawler::process_job calls Extractor instead of inline helpers
  • Inline helpers in crawler.rs removed if fully subsumed
  • #[cfg(test)] mod tests with HTML fixtures: relative URLs, base href, srcset, link rels
  • NDJSON regression test from #16 (runner: bootstrap module shells + NDJSON regression test harness) still passes byte-for-byte
  • cargo test --all-features green
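The fixture-based tests called for in the criteria above might look roughly like the following sketch, which covers only the srcset and link rel cases. parse_srcset, classify_rel and the LinkRel enum are hypothetical names invented for this example and are not part of the existing codebase. The srcset splitting is deliberately naive and would mishandle URLs that themselves contain commas.

```rust
/// Hypothetical helper: pull the URL out of each srcset candidate.
/// Each candidate is a URL optionally followed by a width or density
/// descriptor, and candidates are comma-separated.
pub fn parse_srcset(value: &str) -> Vec<String> {
    value
        .split(',')
        .filter_map(|candidate| candidate.split_whitespace().next())
        .map(str::to_string)
        .collect()
}

/// Hypothetical subset of link relations a crawler might care about.
#[derive(Debug, PartialEq)]
pub enum LinkRel {
    Stylesheet,
    Icon,
    Canonical,
    Other,
}

/// `rel` is a space-separated token list, matched case-insensitively.
pub fn classify_rel(rel: &str) -> LinkRel {
    let tokens: Vec<String> = rel
        .split_whitespace()
        .map(|token| token.to_ascii_lowercase())
        .collect();
    let has = |name: &str| tokens.iter().any(|token| token == name);
    if has("stylesheet") {
        LinkRel::Stylesheet
    } else if has("icon") {
        LinkRel::Icon
    } else if has("canonical") {
        LinkRel::Canonical
    } else {
        LinkRel::Other
    }
}

#[cfg(test)]
mod tests {
    use super::*;

    // Fixture: three candidates, the last one without a descriptor.
    const SRCSET_FIXTURE: &str = "img/a-320.png 320w, img/a-640.png 640w,\n  img/a.png";

    #[test]
    fn srcset_yields_every_candidate_url() {
        assert_eq!(
            parse_srcset(SRCSET_FIXTURE),
            vec!["img/a-320.png", "img/a-640.png", "img/a.png"]
        );
    }

    #[test]
    fn link_rels_are_classified_by_token() {
        assert_eq!(classify_rel("stylesheet"), LinkRel::Stylesheet);
        assert_eq!(classify_rel("Shortcut Icon"), LinkRel::Icon);
        assert_eq!(classify_rel("canonical"), LinkRel::Canonical);
        assert_eq!(classify_rel("nofollow"), LinkRel::Other);
    }
}
```

Relative URL and base href fixtures would follow the same pattern, but they depend on how the crawler resolves URLs, which this issue does not specify.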

Blocked by

    Labels

    enhancement, needs-triage, rust
