[Feature Request]: Support crawling publishers behind JavaScript bot-challenges #937

@MaxDall

Problem statement

A growing number of publishers gate their entire site behind a JavaScript bot-protection challenge (AWS WAF, Cloudflare, Akamai, Incapsula). Fundus's curl-based stack cannot crawl these sites at all, because solving the challenge requires executing JavaScript proof-of-work in a real browser. #936 covers detecting and reporting these cleanly; this request is the complementary half: actually crawling them when desired.

This isn't hypothetical. For JP.TheJapanNews (japannews.yomiuri.co.jp, AWS WAF, HTTP 202 challenge on every URL) the mechanism was verified end-to-end:

  1. A real browser (nodriver/Chrome) runs challenge.js and obtains an aws-waf-token cookie.
  2. Replaying that token through curl_cffi with impersonate="chrome" then returns clean 200s with real sitemap XML (sitemap.xml → 74 <loc> entries, sitemap-news.xml → 52 <loc> entries).

So the "solve once in a browser → reuse the token in the normal HTTP stack" pattern works. The blockers are token lifetime and the lack of any seam to inject one.
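The replay step (2) can be sketched roughly as follows. This is a minimal illustration, not fundus code: `waf_cookies`, `count_loc`, and `fetch_sitemap` are hypothetical helper names, the token is assumed to have been copied out of a browser session beforehand, and the 30-second timeout is an arbitrary choice. Only `curl_cffi.requests.get` with `cookies=` and `impersonate=` is the library's actual API.

```python
from __future__ import annotations

import xml.etree.ElementTree as ET


def waf_cookies(token: str) -> dict[str, str]:
    """Cookie mapping carrying a browser-minted AWS WAF token."""
    return {"aws-waf-token": token}


def count_loc(sitemap_xml: str | bytes) -> int:
    """Count <loc> elements in a sitemap, ignoring the XML namespace."""
    root = ET.fromstring(sitemap_xml)
    # Tags look like "{namespace}loc"; strip the namespace part before comparing.
    return sum(1 for el in root.iter() if el.tag.rsplit("}", 1)[-1] == "loc")


def fetch_sitemap(url: str, token: str) -> bytes:
    """Replay the token through curl_cffi (hypothetical helper, needs network)."""
    # Imported lazily so the helpers above work without curl_cffi installed.
    from curl_cffi import requests

    response = requests.get(
        url,
        cookies=waf_cookies(token),
        impersonate="chrome",
        timeout=30,  # arbitrary value chosen for this sketch
    )
    if response.status_code != 200:
        # The issue reports HTTP 202 for an unsolved challenge.
        raise RuntimeError(f"challenge not passed: HTTP {response.status_code}")
    return response.content
```

With a valid token, `count_loc(fetch_sitemap(url, token))` would give the number of `<loc>` entries for a sitemap URL; an expired or mismatched token should surface as the `RuntimeError` instead of silently returning the challenge page.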

Solution

Add an opt-in path to supply a bot-challenge token to the existing HTTP stack, keeping the core curl-based and JS-free.

Two layers, smallest-first:

  • A: Token injection seam (no new dependency). Let a publisher/source carry cookies (e.g. an aws-waf-token) and ensure they reach the request. A user obtains the token from their own browser and passes it; fundus uses it until it expires. This requires:

    • extending URLSource.fetch / WebSource to pass cookies, and
    • fixing get_with_interrupt, which currently drops request kwargs when impersonate is set (session.py:139); cookies must survive impersonation.
  • B: Optional browser-backed token provider (extra dependency). A small helper (e.g. built on nodriver) that mints and refreshes tokens automatically, feeding layer A. Gated behind an optional install (pip install fundus[browser]) so the core stays lightweight. Automatic refresh pushes against the current "no retry logic in the stack" design, so it needs a deliberate decision.
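To make the layer A fix concrete, here is a hedged sketch of the intended behaviour: caller-supplied request kwargs and cookies are kept whether or not `impersonate` is set. `build_request_kwargs` is a hypothetical helper invented for illustration; it does not reproduce the actual code in session.py, whose details are not shown in this issue.

```python
from __future__ import annotations

from typing import Any, Mapping, Optional


def build_request_kwargs(
    *,
    impersonate: Optional[str] = None,
    cookies: Optional[Mapping[str, str]] = None,
    **request_kwargs: Any,
) -> dict[str, Any]:
    """Assemble kwargs for one HTTP call without dropping anything.

    Hypothetical helper: the point is only that setting `impersonate`
    adds a key instead of replacing the caller's kwargs.
    """
    merged: dict[str, Any] = dict(request_kwargs)  # headers, timeout, ... always kept
    if cookies:
        merged["cookies"] = dict(cookies)  # e.g. {"aws-waf-token": "..."}
    if impersonate is not None:
        merged["impersonate"] = impersonate
    return merged
```

A call such as `build_request_kwargs(impersonate="chrome", cookies={"aws-waf-token": token}, timeout=10)` then yields a dict that still contains both `timeout` and `cookies`, which is the property the seam depends on.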

Additional Context

  • Companion to [Proposal]: Detect and report publishers gated behind JavaScript bot-challenges #936 (detection/reporting). This issue is "strategy A/B" from that proposal's open questions.
  • Token-based bypass is inherently brittle: tokens expire, the originating TLS fingerprint must match the impersonation profile, and sites can rotate their protection. Layer A alone may be enough for ad-hoc/research crawls without committing to a browser dependency.
  • Alternative for affected publishers that sidesteps this entirely: prefer the CC-NEWS archive where coverage exists. Where none of these is justified, deprecated=True remains the fallback (as just applied to TheJapanNews).
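Because tokens expire, whatever supplies them (a user pasting one in for layer A, or a browser helper for layer B) probably needs to track token age. The sketch below is one possible shape, under stated assumptions: `ChallengeToken` and `CachedTokenProvider` are hypothetical names, the `mint` callable stands in for whatever obtains a fresh token, and the time-to-live is a caller-supplied guess, since the real token lifetime is not stated in this issue.

```python
from __future__ import annotations

import time
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class ChallengeToken:
    value: str
    expires_at: float  # deadline on the provider's clock

    def is_expired(self, now: float) -> bool:
        return now >= self.expires_at


class CachedTokenProvider:
    """Return a cached token and re-mint it only once it has expired."""

    def __init__(
        self,
        mint: Callable[[], str],  # assumption: any callable returning a fresh token
        ttl_seconds: float,  # assumption: the caller estimates the lifetime
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._mint = mint
        self._ttl = ttl_seconds
        self._clock = clock
        self._token: Optional[ChallengeToken] = None

    def get(self) -> str:
        now = self._clock()
        if self._token is None or self._token.is_expired(now):
            # Mint lazily so no browser work happens until a token is needed.
            self._token = ChallengeToken(self._mint(), now + self._ttl)
        return self._token.value
```

Keeping the refresh inside the provider means the HTTP stack only ever asks for a cookie value, which may help keep retry-like behaviour out of the core.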
