Problem statement
A growing number of publishers gate their entire site behind a JavaScript bot-protection challenge (AWS WAF, Cloudflare, Akamai, Incapsula). These can't be crawled by fundus's curl-based stack at all, because solving the challenge requires executing JavaScript proof-of-work in a real browser. #936 covers detecting and reporting these cleanly; this request is the complementary half: actually crawling them when desired.
This isn't hypothetical. For JP.TheJapanNews (japannews.yomiuri.co.jp, AWS WAF, HTTP 202 challenge on every URL) the mechanism was verified end-to-end:
A real browser (nodriver/Chrome) runs challenge.js and obtains an aws-waf-token cookie.
Replaying that token through curl_cffi with impersonate="chrome" then returns clean HTTP 200 responses with real sitemap XML (sitemap.xml → 74 <loc> entries, sitemap-news.xml → 52).
So the "solve once in a browser → reuse the token in the normal HTTP stack" pattern works. The blockers are token lifetime and the lack of any seam to inject one.
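For illustration, the replay step could look roughly like the sketch below. It is not fundus code: it assumes the token was already copied out of a browser session and that curl_cffi is installed, and the helper names `count_loc` and `fetch_with_token` are hypothetical.

```python
import xml.etree.ElementTree as ET
from typing import Union


def count_loc(sitemap_xml: Union[str, bytes]) -> int:
    """Count <loc> elements in a sitemap, ignoring the XML namespace."""
    root = ET.fromstring(sitemap_xml)
    # Namespaced tags look like "{namespace}loc"; compare only the local name.
    return sum(1 for el in root.iter() if el.tag.rsplit("}", 1)[-1] == "loc")


def fetch_with_token(url: str, token: str) -> bytes:
    """Replay a browser-minted aws-waf-token through curl_cffi."""
    # Imported lazily so count_loc stays usable without curl_cffi installed.
    from curl_cffi import requests

    response = requests.get(
        url,
        cookies={"aws-waf-token": token},
        impersonate="chrome",
    )
    if response.status_code != 200:
        # The site was observed to answer HTTP 202 while the challenge is unsolved.
        raise RuntimeError(f"challenge not cleared: HTTP {response.status_code}")
    return response.content
```

With a valid token, `count_loc(fetch_with_token(sitemap_url, token))` gives the number of `<loc>` entries, where `sitemap_url` is whichever sitemap is being checked. The token only works while it is unexpired and while the impersonation profile matches the browser that minted it.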
Solution
Add an opt-in path to supply a bot-challenge token to the existing HTTP stack, keeping the core curl-based and JS-free.
Two layers, smallest-first:
A — Token injection seam (no new dependency). Let a publisher/source carry cookies (e.g. an aws-waf-token) and ensure they reach the request. This requires:
extending URLSource.fetch / WebSource to pass cookies, and
fixing get_with_interrupt, which currently drops request kwargs when impersonate is set (session.py:139) — cookies must survive impersonation.
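The shape of the second fix can be shown with a self-contained sketch. The actual code at session.py:139 is not reproduced here; `build_request_kwargs` is a hypothetical stand-in that only demonstrates the intended behaviour, namely merging the caller's kwargs (including cookies) with the impersonation setting instead of replacing them.

```python
from typing import Any, Dict, Optional


def build_request_kwargs(
    impersonate: Optional[str] = None,
    cookies: Optional[Dict[str, str]] = None,
    **request_kwargs: Any,
) -> Dict[str, Any]:
    """Merge caller kwargs with impersonation settings instead of dropping them."""
    # Start from a copy of the caller's kwargs so nothing is lost or mutated.
    merged: Dict[str, Any] = dict(request_kwargs)
    if cookies:
        merged["cookies"] = dict(cookies)
    if impersonate is not None:
        # Add the profile on top of the existing kwargs rather than
        # building a fresh dict that contains only `impersonate`.
        merged["impersonate"] = impersonate
    return merged
```

The merged dict would then be forwarded unchanged to the underlying request call, so a cookie such as aws-waf-token reaches the server whether or not impersonation is active.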
A user obtains the token from their own browser and passes it; fundus uses it until it expires.
B — Optional browser-backed token provider (extra dependency, opt-in extra). A small helper (e.g. nodriver) that mints and refreshes tokens automatically, feeding layer A. Gated behind an optional install (pip install fundus[browser]) so the core stays lightweight. This does push against the current "no retry logic in the stack" design, so it needs a deliberate decision.
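One possible shape for layer B is a small caching provider that layer A can query for the current token. The sketch below is an assumption about how this might be structured, not a proposed final API: `ChallengeToken` and `CachingTokenProvider` are hypothetical names, and the browser-backed minting step is passed in as a plain callable so that the example makes no claims about how nodriver would be driven.

```python
import time
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class ChallengeToken:
    value: str
    expires_at: float  # Unix timestamp after which the token is invalid


class CachingTokenProvider:
    """Return a cached token and re-mint it shortly before it expires."""

    def __init__(
        self,
        mint: Callable[[], ChallengeToken],
        clock: Callable[[], float] = time.time,
        safety_margin: float = 30.0,
    ) -> None:
        self._mint = mint  # e.g. a browser-backed solver (hypothetical)
        self._clock = clock
        self._safety_margin = safety_margin
        self._token: Optional[ChallengeToken] = None

    def get(self) -> str:
        """Return a usable token value, minting a new one only when needed."""
        now = self._clock()
        stale = (
            self._token is None
            or now >= self._token.expires_at - self._safety_margin
        )
        if stale:
            self._token = self._mint()
        return self._token.value
```

Keeping the refresh decision inside the provider means the HTTP stack itself never has to retry; it only asks for a token before each request, which may soften the tension with the "no retry logic in the stack" design.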
Additional Context
Token-based bypass is inherently brittle: tokens expire, the originating TLS fingerprint must match the impersonation profile, and sites can rotate their protection. Layer A alone may be enough for ad-hoc/research crawls without committing to a browser dependency.
An alternative for affected publishers that sidesteps this entirely is to prefer the CC-NEWS archive where coverage exists. Where none of these options is justified, deprecated=True remains the fallback (as just applied to TheJapanNews).
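Because tokens can expire mid-crawl, it may help to fail loudly when a response turns back into a challenge rather than parsing it as content. The sketch below is one way to express that check. The HTTP 202 status comes from the observation above; the `x-amzn-waf-action` header check is an assumption about how AWS WAF labels challenge responses, and all names here are hypothetical.

```python
from typing import Mapping


class ChallengeTokenExpired(Exception):
    """Raised when a response looks like a bot challenge instead of content."""


def looks_like_waf_challenge(status_code: int, headers: Mapping[str, str]) -> bool:
    """Heuristically detect an AWS WAF challenge response."""
    # Observed behaviour: challenged URLs answer with HTTP 202.
    if status_code != 202:
        return False
    # Assumption: challenge responses carry this header; names are
    # compared case-insensitively because HTTP header names are.
    lowered = {key.lower(): value for key, value in headers.items()}
    return lowered.get("x-amzn-waf-action", "").lower() == "challenge"


def ensure_not_challenged(status_code: int, headers: Mapping[str, str]) -> None:
    """Raise if the supplied token no longer clears the challenge."""
    if looks_like_waf_challenge(status_code, headers):
        raise ChallengeTokenExpired("token rejected or expired; obtain a new one")
```

A caller using layer A alone would treat `ChallengeTokenExpired` as the signal to supply a fresh token from the browser; with layer B it would be the trigger for the provider to mint a new one.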