Free, open-source, async-first web scraping toolkit for LLM agents.
No commercial APIs. No rate-limit bills. Self-hostable.
| Function | Description |
|---|---|
read_url |
Auto-degradation: fast → crawl4ai → browser |
read_fast |
httpx + trafilatura, BeautifulSoup fallback |
read_browser |
Playwright headless Chromium, reused browser instance |
read_crawl4ai |
Crawl4AI LLM-friendly Markdown output |
read_many |
Concurrent batch fetch with deduplication |
discover_links |
Extract, normalize, and deduplicate hrefs |
agent_read_url |
Single-URL fetch returning a slim LLM-ready dict |
agent_read_urls |
Batch fetch with summary stats for LLM context |
| FastAPI server | Optional HTTP API (/read, /read_many, /discover_links) |
Minimal (httpx + trafilatura, covers most use cases):
pip install web4agentWith optional extras:
# Playwright (JS-rendered pages)
pip install "web4agent[browser]"
playwright install chromium
# Crawl4AI strategy
pip install "web4agent[crawl4ai]"
# FastAPI server
pip install "web4agent[server]"
# Everything
pip install "web4agent[all]"
playwright install chromiumFrom source (development):
git clone https://github.com/lipiji/web4agent
cd web4agent
pip install -e ".[dev]"# Fetch a single page
web4agent read https://en.wikipedia.org/wiki/Web_scraping
# Batch fetch
web4agent many https://example.com https://python.org --concurrency 5
# Extract links
web4agent links https://docs.python.org/3/ --same-domain --max-links 30import asyncio
from web4agent import read_url, read_many, discover_links
async def main():
# Single URL — auto strategy (fast → crawl4ai → browser)
result = await read_url("https://en.wikipedia.org/wiki/Web_scraping")
print(result.title)
print(result.text[:500])
# Batch
results = await read_many(
["https://example.com", "https://python.org"],
concurrency=5,
strategy="fast",
)
for r in results:
print(r.url, "OK" if r.success else r.error)
# Links
links = await discover_links("https://docs.python.org/3/", same_domain=True)
print(links[:5])
asyncio.run(main())import asyncio
from web4agent import agent_read_url, agent_read_urls
async def main():
# Single — returns {"url", "title", "content", "success", "strategy_used", "error"}
r = await agent_read_url("https://example.com")
print(r["title"])
print(r["content"][:300])
# Batch — returns {"results", "total", "succeeded", "failed"}
summary = await agent_read_urls(
["https://example.com", "https://python.org"],
concurrency=5,
)
print(f"Fetched {summary['succeeded']}/{summary['total']}")
for item in summary["results"]:
print(item["url"], item["success"])
asyncio.run(main())Full working examples:
examples/example.py
| Strategy | How it works | Best for |
|---|---|---|
fast |
httpx + trafilatura (+ BS4 fallback) | Static pages, high concurrency |
crawl4ai |
Crawl4AI AsyncWebCrawler |
Docs, structured Markdown output |
browser |
Playwright headless Chromium | JS-heavy SPAs, lazy-loaded content |
auto |
Degrades: fast → crawl4ai → browser | Unknown pages |
Auto-degradation triggers when:
- HTTP status ≥ 400
- Extracted text is shorter than
MIN_TEXT_LENGTH(default 300 chars) - Page looks like a JS-only shell (empty
#root/#appdiv)
All read functions return a WebReadResult:
class WebReadResult(BaseModel):
url: str
final_url: str | None # after redirects
title: str | None
text: str | None # plain text
markdown: str | None # Markdown version
html: str | None # raw HTML
status_code: int | None
success: bool
strategy_used: str | None
attempts: list[FetchAttempt]
error: str | None
fetched_at: str # ISO-8601 UTC
elapsed_ms: int | None
metadata: dict # e.g. screenshot_b64 for browser readsSet via environment variables (or a .env file):
| Variable | Default | Description |
|---|---|---|
WRT_TIMEOUT |
20 |
HTTP timeout in seconds |
WRT_FAST_CONCURRENCY |
50 |
Max concurrent fast requests |
WRT_CRAWL4AI_CONCURRENCY |
10 |
Max concurrent crawl4ai requests |
WRT_BROWSER_CONCURRENCY |
3 |
Max simultaneous Playwright pages |
WRT_MIN_TEXT_LENGTH |
300 |
Min chars to consider a fetch successful |
WRT_AGENT_MAX_CONTENT_CHARS |
8000 |
Content truncation limit for agent output |
WRT_USER_AGENT |
Chrome 124 | User-Agent header string |
pip install -e ".[server]"
uvicorn web4agent.server:app --host 0.0.0.0 --port 8000| Method | Path | Body |
|---|---|---|
GET |
/health |
— |
POST |
/read |
{"url": "...", "strategy": "auto"} |
POST |
/read_many |
{"urls": [...], "concurrency": 10, "strategy": "auto"} |
POST |
/discover_links |
{"url": "...", "same_domain": true, "max_links": 100} |
pip install -e ".[dev]"
pytestrobots.txt— not enforced automatically; check it yourself before scraping.- Rate limiting — use the
concurrencyparameter; add delays for the same domain. - Terms of Service — always review a site's ToS before scraping.
- Intended for lawful, authorized use only.
MIT