web4agent

Free, open-source, async-first web scraping toolkit for LLM agents.
No commercial APIs. No rate-limit bills. Self-hostable.

Features

| Function | Description |
| --- | --- |
| `read_url` | Auto-degradation: fast → crawl4ai → browser |
| `read_fast` | httpx + trafilatura, BeautifulSoup fallback |
| `read_browser` | Playwright headless Chromium, reused browser instance |
| `read_crawl4ai` | Crawl4AI LLM-friendly Markdown output |
| `read_many` | Concurrent batch fetch with deduplication |
| `discover_links` | Extract, normalize, and deduplicate hrefs |
| `agent_read_url` | Single-URL fetch returning a slim LLM-ready dict |
| `agent_read_urls` | Batch fetch with summary stats for LLM context |
| FastAPI server | Optional HTTP API (`/read`, `/read_many`, `/discover_links`) |
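
To make "extract, normalize, and deduplicate hrefs" concrete, here is a minimal standard-library sketch of that kind of pipeline. It is an illustration only, not the code behind `discover_links`: the helper name `extract_links` and the exact normalization rules (resolve against the base URL, drop fragments, keep only http/https) are assumptions.

```python
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse


class _HrefCollector(HTMLParser):
    """Collect raw href values from <a> tags."""

    def __init__(self) -> None:
        super().__init__()
        self.hrefs: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def extract_links(html: str, base_url: str, same_domain: bool = False) -> list[str]:
    """Hypothetical helper: resolve, strip fragments, and deduplicate hrefs."""
    collector = _HrefCollector()
    collector.feed(html)
    base_host = urlparse(base_url).netloc
    seen: set[str] = set()
    links: list[str] = []
    for href in collector.hrefs:
        # Resolve relative links, then drop any "#fragment" part.
        absolute, _fragment = urldefrag(urljoin(base_url, href.strip()))
        parsed = urlparse(absolute)
        if parsed.scheme not in ("http", "https"):
            continue  # skip mailto:, javascript:, and similar schemes
        if same_domain and parsed.netloc != base_host:
            continue
        if absolute not in seen:  # keep first-seen order
            seen.add(absolute)
            links.append(absolute)
    return links
```

Keeping first-seen order makes the output stable, which helps when a link list is truncated to a maximum length.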

Installation

Minimal (httpx + trafilatura, covers most use cases):

pip install web4agent

With optional extras:

# Playwright (JS-rendered pages)
pip install "web4agent[browser]"
playwright install chromium

# Crawl4AI strategy
pip install "web4agent[crawl4ai]"

# FastAPI server
pip install "web4agent[server]"

# Everything
pip install "web4agent[all]"
playwright install chromium

From source (development):

git clone https://github.com/lipiji/web4agent
cd web4agent
pip install -e ".[dev]"
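
After installing, it can be useful to confirm which optional extras are actually importable. The sketch below is one way to do that with the standard library. The mapping from extra name to module name is an assumption based on the packages mentioned above (Playwright, Crawl4AI, FastAPI), and `available_extras` is a hypothetical helper, not part of web4agent.

```python
from importlib.util import find_spec

# Assumed mapping from extra name to the top-level module it provides.
EXTRA_MODULES = {
    "browser": "playwright",
    "crawl4ai": "crawl4ai",
    "server": "fastapi",
}


def available_extras(modules: dict[str, str] = EXTRA_MODULES) -> dict[str, bool]:
    """Report which optional dependencies are importable in this environment."""
    return {extra: find_spec(module) is not None for extra, module in modules.items()}
```

Calling `available_extras()` returns a dict such as `{"browser": ..., "crawl4ai": ..., "server": ...}` with a boolean per extra. Note that an importable `playwright` module does not guarantee the Chromium binary is installed.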

Quick Start

CLI

# Fetch a single page
web4agent read https://en.wikipedia.org/wiki/Web_scraping

# Batch fetch
web4agent many https://example.com https://python.org --concurrency 5

# Extract links
web4agent links https://docs.python.org/3/ --same-domain --max-links 30

Python

import asyncio
from web4agent import read_url, read_many, discover_links

async def main():
    # Single URL — auto strategy (fast → crawl4ai → browser)
    result = await read_url("https://en.wikipedia.org/wiki/Web_scraping")
    print(result.title)
    print(result.text[:500])

    # Batch
    results = await read_many(
        ["https://example.com", "https://python.org"],
        concurrency=5,
        strategy="fast",
    )
    for r in results:
        print(r.url, "OK" if r.success else r.error)

    # Links
    links = await discover_links("https://docs.python.org/3/", same_domain=True)
    print(links[:5])

asyncio.run(main())
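
For readers curious what a `concurrency` limit on a batch fetch typically means, the following sketch shows the common pattern of deduplicating URLs and bounding in-flight work with `asyncio.Semaphore`. It is not taken from `read_many`; `fetch_all` and `fake_fetch` are hypothetical names, and the fetcher is a stand-in so the example runs offline.

```python
import asyncio
from typing import Awaitable, Callable


async def fetch_all(
    urls: list[str],
    fetch: Callable[[str], Awaitable[str]],
    concurrency: int = 5,
) -> dict[str, str]:
    """Fetch each distinct URL once, with at most `concurrency` calls in flight."""
    unique = list(dict.fromkeys(urls))  # deduplicate, keep first-seen order
    semaphore = asyncio.Semaphore(concurrency)

    async def bounded(url: str) -> str:
        async with semaphore:
            return await fetch(url)

    bodies = await asyncio.gather(*(bounded(u) for u in unique))
    return dict(zip(unique, bodies))


async def fake_fetch(url: str) -> str:
    """Stand-in for a real fetcher so the sketch runs offline."""
    await asyncio.sleep(0)
    return f"body of {url}"
```

For example, `asyncio.run(fetch_all(["https://a.test", "https://a.test"], fake_fetch))` calls the fetcher only once.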

Agent interface (slim dicts for LLM context)

import asyncio
from web4agent import agent_read_url, agent_read_urls

async def main():
    # Single — returns {"url", "title", "content", "success", "strategy_used", "error"}
    r = await agent_read_url("https://example.com")
    print(r["title"])
    print(r["content"][:300])

    # Batch — returns {"results", "total", "succeeded", "failed"}
    summary = await agent_read_urls(
        ["https://example.com", "https://python.org"],
        concurrency=5,
    )
    print(f"Fetched {summary['succeeded']}/{summary['total']}")
    for item in summary["results"]:
        print(item["url"], item["success"])

asyncio.run(main())
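
Once a batch summary is available, it usually has to be turned into text for a prompt. The helper below is a hypothetical sketch that relies only on the dict keys listed in the comments above (`results`, `total`, `succeeded`, and per-item `url`, `title`, `content`, `success`, `error`). The name `format_for_prompt` and the per-page character cap are assumptions.

```python
def format_for_prompt(summary: dict, max_chars_per_page: int = 1000) -> str:
    """Render a batch summary as plain text for an LLM prompt (hypothetical helper)."""
    lines = [f"Fetched {summary['succeeded']}/{summary['total']} pages."]
    for item in summary["results"]:
        if item["success"]:
            title = item.get("title") or "(untitled)"
            # Cap each page so one long document cannot crowd out the others.
            content = (item.get("content") or "")[:max_chars_per_page]
            lines.append(f"## {title}\nSource: {item['url']}\n{content}")
        else:
            lines.append(f"## FAILED: {item['url']}\nError: {item.get('error')}")
    return "\n\n".join(lines)
```

Listing failed URLs explicitly lets the model know that information is missing rather than silently omitting it.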

Full working examples: examples/example.py

Strategies

| Strategy | How it works | Best for |
| --- | --- | --- |
| `fast` | httpx + trafilatura (+ BS4 fallback) | Static pages, high concurrency |
| `crawl4ai` | Crawl4AI `AsyncWebCrawler` | Docs, structured Markdown output |
| `browser` | Playwright headless Chromium | JS-heavy SPAs, lazy-loaded content |
| `auto` | Degrades: fast → crawl4ai → browser | Unknown pages |

Auto-degradation triggers when:

- HTTP status ≥ 400
- Extracted text is shorter than `MIN_TEXT_LENGTH` (set via `WRT_MIN_TEXT_LENGTH`; default 300 characters)
- Page looks like a JS-only shell (empty `#root` / `#app` div)
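
The three triggers can be sketched as a single predicate plus a loop over strategies. This is an illustrative approximation, not web4agent's internal logic: the names `needs_fallback` and `read_auto`, the regex used to detect an empty mount point, and the synchronous `(status, text, html)` strategy callables are all assumptions made to keep the example short.

```python
import re

MIN_TEXT_LENGTH = 300  # matches the documented default

# Assumed heuristic: an empty <div id="root"> or <div id="app"> suggests a JS-only shell.
_EMPTY_MOUNT = re.compile(
    r"""<div[^>]*\bid=["'](?:root|app)["'][^>]*>\s*</div>""", re.IGNORECASE
)


def needs_fallback(status_code: int, text: str, html: str,
                   min_text_length: int = MIN_TEXT_LENGTH) -> bool:
    """Return True if any of the three documented triggers applies."""
    if status_code >= 400:
        return True
    if len(text.strip()) < min_text_length:
        return True
    return bool(_EMPTY_MOUNT.search(html))


def read_auto(url: str, strategies: dict) -> tuple:
    """Try each strategy in order; return (name, text) of the first acceptable result."""
    for name in ("fast", "crawl4ai", "browser"):
        status, text, html = strategies[name](url)
        if not needs_fallback(status, text, html):
            return name, text
    return None, None  # every strategy tripped a trigger
```

Ordering the strategies from cheapest to most expensive means the browser is only launched when the lighter approaches fail.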

Result shape

All read functions return a WebReadResult:

class WebReadResult(BaseModel):
    url: str
    final_url: str | None       # after redirects
    title: str | None
    text: str | None            # plain text
    markdown: str | None        # Markdown version
    html: str | None            # raw HTML
    status_code: int | None
    success: bool
    strategy_used: str | None
    attempts: list[FetchAttempt]
    error: str | None
    fetched_at: str             # ISO-8601 UTC
    elapsed_ms: int | None
    metadata: dict              # e.g. screenshot_b64 for browser reads
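
A common way to consume a result like this is to check `success` first and then pick the richest content field available. The sketch below uses a stand-in object with the same attribute names rather than a real `WebReadResult`. The helper name `best_content` and the preference for Markdown over plain text are assumptions.

```python
from types import SimpleNamespace


def best_content(result) -> str:
    """Prefer Markdown, then plain text; raise if the fetch failed."""
    if not result.success:
        raise RuntimeError(f"fetch failed for {result.url}: {result.error}")
    return result.markdown or result.text or ""


# Stand-in object with the same attribute names as WebReadResult.
example = SimpleNamespace(
    url="https://example.com", success=True, error=None,
    markdown=None, text="Example Domain",
)
```

Because most content fields are optional (`str | None`), guarding each access avoids `TypeError` when a strategy does not populate a field.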

Configuration

Set via environment variables (or a .env file):

| Variable | Default | Description |
| --- | --- | --- |
| `WRT_TIMEOUT` | `20` | HTTP timeout in seconds |
| `WRT_FAST_CONCURRENCY` | `50` | Max concurrent fast requests |
| `WRT_CRAWL4AI_CONCURRENCY` | `10` | Max concurrent crawl4ai requests |
| `WRT_BROWSER_CONCURRENCY` | `3` | Max simultaneous Playwright pages |
| `WRT_MIN_TEXT_LENGTH` | `300` | Min chars to consider a fetch successful |
| `WRT_AGENT_MAX_CONTENT_CHARS` | `8000` | Content truncation limit for agent output |
| `WRT_USER_AGENT` | Chrome 124 | User-Agent header string |
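
As an illustration of how environment-driven settings like these are typically resolved, here is a small sketch that reads the `WRT_*` variables and falls back to the defaults from the table. It is not web4agent's settings code: the `Settings` dataclass and `load_settings` function are hypothetical, the user-agent variable is omitted because its full default string is not given here, and `.env` files are not read by this sketch.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class Settings:
    """Defaults copied from the configuration table."""
    timeout: float = 20.0
    fast_concurrency: int = 50
    crawl4ai_concurrency: int = 10
    browser_concurrency: int = 3
    min_text_length: int = 300
    agent_max_content_chars: int = 8000


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Build Settings from WRT_* variables, falling back to the defaults."""
    d = Settings()
    return Settings(
        timeout=float(env.get("WRT_TIMEOUT", d.timeout)),
        fast_concurrency=int(env.get("WRT_FAST_CONCURRENCY", d.fast_concurrency)),
        crawl4ai_concurrency=int(env.get("WRT_CRAWL4AI_CONCURRENCY", d.crawl4ai_concurrency)),
        browser_concurrency=int(env.get("WRT_BROWSER_CONCURRENCY", d.browser_concurrency)),
        min_text_length=int(env.get("WRT_MIN_TEXT_LENGTH", d.min_text_length)),
        agent_max_content_chars=int(
            env.get("WRT_AGENT_MAX_CONTENT_CHARS", d.agent_max_content_chars)
        ),
    )
```

Passing the environment as a parameter keeps the function easy to test, for example `load_settings({"WRT_TIMEOUT": "5"})`.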

FastAPI Server

# From a source checkout (or: pip install "web4agent[server]")
pip install -e ".[server]"
uvicorn web4agent.server:app --host 0.0.0.0 --port 8000

| Method | Path | Body |
| --- | --- | --- |
| `GET` | `/health` | (none) |
| `POST` | `/read` | `{"url": "...", "strategy": "auto"}` |
| `POST` | `/read_many` | `{"urls": [...], "concurrency": 10, "strategy": "auto"}` |
| `POST` | `/discover_links` | `{"url": "...", "same_domain": true, "max_links": 100}` |
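
A client for the `/read` endpoint can be written with the standard library alone. The sketch below builds the JSON body shown in the table above and posts it. The function names are hypothetical, the base URL assumes the `uvicorn` command shown earlier, and the response shape is not specified in this section, so the result is returned as a plain dict without further interpretation.

```python
import json
import urllib.request


def build_read_request(base_url: str, url: str,
                       strategy: str = "auto") -> urllib.request.Request:
    """Build a POST /read request carrying the documented JSON body."""
    body = json.dumps({"url": url, "strategy": strategy}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url.rstrip('/')}/read",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def read_via_server(base_url: str, url: str, strategy: str = "auto",
                    timeout: float = 30.0) -> dict:
    """Send the request and decode the JSON response (shape not specified here)."""
    request = build_read_request(base_url, url, strategy)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)
```

With the server running locally, `read_via_server("http://localhost:8000", "https://example.com")` would perform the call. Splitting request construction from sending makes the first half testable without a network.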

Running Tests

pip install -e ".[dev]"
pytest

Compliance

- robots.txt: not enforced automatically; check it yourself before scraping.
- Rate limiting: use the concurrency parameter; add delays for the same domain.
- Terms of Service: always review a site's ToS before scraping.
- Intended for lawful, authorized use only.
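
Since robots.txt checks and per-domain delays are left to the caller, here is a minimal sketch of both using only the standard library. Neither helper is part of web4agent: `allowed_by_robots` and `DomainThrottle` are hypothetical names, the robots.txt content is assumed to have been downloaded already, and the one-second default delay is an arbitrary choice.

```python
import asyncio
import time
from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser


def allowed_by_robots(robots_txt: str, user_agent: str, url: str) -> bool:
    """Check a URL against already-downloaded robots.txt content."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


class DomainThrottle:
    """Enforce a minimum delay between requests to the same host."""

    def __init__(self, min_delay: float = 1.0) -> None:
        self.min_delay = min_delay
        self._next_allowed: dict[str, float] = {}

    async def wait(self, url: str) -> None:
        host = urlparse(url).netloc
        now = time.monotonic()
        ready_at = max(now, self._next_allowed.get(host, now))
        # Reserve the following slot before sleeping so concurrent callers queue up.
        self._next_allowed[host] = ready_at + self.min_delay
        await asyncio.sleep(ready_at - now)
```

A caller would `await throttle.wait(url)` before each fetch and skip any URL for which `allowed_by_robots` returns `False`.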

License

MIT
