web4agent

Free, open-source, async-first web scraping toolkit for LLM agents.
No commercial APIs. No rate-limit bills. Self-hostable.


Features

| Function | Description |
| --- | --- |
| read_url | Auto-degradation: fast → crawl4ai → browser |
| read_fast | httpx + trafilatura, BeautifulSoup fallback |
| read_browser | Playwright headless Chromium, reused browser instance |
| read_crawl4ai | Crawl4AI LLM-friendly Markdown output |
| read_many | Concurrent batch fetch with deduplication |
| discover_links | Extract, normalize, and deduplicate hrefs |
| agent_read_url | Single-URL fetch returning a slim LLM-ready dict |
| agent_read_urls | Batch fetch with summary stats for LLM context |
| FastAPI server | Optional HTTP API (/read, /read_many, /discover_links) |
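
The "extract, normalize, and deduplicate" behaviour listed for discover_links can be pictured with the standard library alone. The sketch below is not web4agent's implementation: the helper name `normalize_links` and its rules (resolve relative hrefs, drop fragments, keep only http/https) are assumptions chosen for illustration.

```python
from urllib.parse import urldefrag, urljoin, urlsplit


def normalize_links(base_url, hrefs, same_domain=False, max_links=None):
    """Resolve hrefs against base_url, drop fragments, deduplicate in order."""
    base_host = urlsplit(base_url).netloc.lower()
    seen = set()
    links = []
    for href in hrefs:
        # Make the href absolute, then strip any "#fragment" part.
        absolute, _fragment = urldefrag(urljoin(base_url, href.strip()))
        parts = urlsplit(absolute)
        if parts.scheme not in ("http", "https"):
            continue  # skip mailto:, javascript:, and similar schemes
        if same_domain and parts.netloc.lower() != base_host:
            continue
        if absolute in seen:
            continue
        seen.add(absolute)
        links.append(absolute)
        if max_links is not None and len(links) >= max_links:
            break
    return links
```

Keeping insertion order (rather than returning a set) means the first links on the page stay first, which tends to matter when `max_links` truncates the list.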

Installation

Minimal (httpx + trafilatura, covers most use cases):

pip install web4agent

With optional extras:

# Playwright (JS-rendered pages)
pip install "web4agent[browser]"
playwright install chromium

# Crawl4AI strategy
pip install "web4agent[crawl4ai]"

# FastAPI server
pip install "web4agent[server]"

# Everything
pip install "web4agent[all]"
playwright install chromium

From source (development):

git clone https://github.com/lipiji/web4agent
cd web4agent
pip install -e ".[dev]"
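
Because the extras are optional, it can help to check which ones are present before choosing a strategy. The following is a minimal sketch, assuming each extra provides the top-level module named in `EXTRA_MODULES`; that mapping and the helper `installed_extras` are illustrative and not part of web4agent.

```python
from importlib.util import find_spec

# Assumed mapping from extra name to the top-level module it installs.
EXTRA_MODULES = {
    "browser": "playwright",
    "crawl4ai": "crawl4ai",
    "server": "fastapi",
}


def installed_extras():
    """Return {extra_name: bool} based on whether each module is importable."""
    return {
        extra: find_spec(module) is not None
        for extra, module in EXTRA_MODULES.items()
    }


if __name__ == "__main__":
    for extra, present in installed_extras().items():
        print(f"{extra}: {'installed' if present else 'missing'}")
```

This only checks the Python packages; the Chromium binary fetched by `playwright install chromium` is a separate step that the check does not cover.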

Quick Start

CLI

# Fetch a single page
web4agent read https://en.wikipedia.org/wiki/Web_scraping

# Batch fetch
web4agent many https://example.com https://python.org --concurrency 5

# Extract links
web4agent links https://docs.python.org/3/ --same-domain --max-links 30

Python

import asyncio
from web4agent import read_url, read_many, discover_links

async def main():
    # Single URL — auto strategy (fast → crawl4ai → browser)
    result = await read_url("https://en.wikipedia.org/wiki/Web_scraping")
    print(result.title)
    print(result.text[:500])

    # Batch
    results = await read_many(
        ["https://example.com", "https://python.org"],
        concurrency=5,
        strategy="fast",
    )
    for r in results:
        print(r.url, "OK" if r.success else r.error)

    # Links
    links = await discover_links("https://docs.python.org/3/", same_domain=True)
    print(links[:5])

asyncio.run(main())
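
To illustrate what a `concurrency` limit combined with deduplication means in practice, here is a rough sketch using `asyncio.Semaphore`. It is not how read_many is implemented; `fetch_all` and `fake_fetch` are hypothetical names, and the fake fetcher exists only so the example runs offline.

```python
import asyncio


async def fetch_all(urls, fetch, concurrency=5):
    """Run fetch(url) for each unique URL with at most `concurrency` in flight."""
    unique = list(dict.fromkeys(urls))  # deduplicate while preserving order
    semaphore = asyncio.Semaphore(concurrency)

    async def bounded(url):
        async with semaphore:
            try:
                return url, await fetch(url), None
            except Exception as exc:  # one failure should not abort the batch
                return url, None, str(exc)

    # gather() returns results in the same order as `unique`.
    return await asyncio.gather(*(bounded(url) for url in unique))


async def fake_fetch(url):
    # Stand-in for a real fetcher; fails for URLs containing "bad".
    await asyncio.sleep(0)
    if "bad" in url:
        raise ValueError("simulated failure")
    return f"content of {url}"
```

Each task returns a `(url, content, error)` tuple instead of raising, which mirrors the success/error reporting style of the results shown above.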

Agent interface (slim dicts for LLM context)

import asyncio
from web4agent import agent_read_url, agent_read_urls

async def main():
    # Single — returns {"url", "title", "content", "success", "strategy_used", "error"}
    r = await agent_read_url("https://example.com")
    print(r["title"])
    print(r["content"][:300])

    # Batch — returns {"results", "total", "succeeded", "failed"}
    summary = await agent_read_urls(
        ["https://example.com", "https://python.org"],
        concurrency=5,
    )
    print(f"Fetched {summary['succeeded']}/{summary['total']}")
    for item in summary["results"]:
        print(item["url"], item["success"])

asyncio.run(main())
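
The dict shapes named in the comments above can be approximated as follows. This is a sketch under assumptions: `Result` is a stand-in holding a subset of fields, `to_agent_dict` and `summarize` are hypothetical helpers, and preferring Markdown over plain text for `content` is a guess rather than documented behaviour. The 8000-character default matches WRT_AGENT_MAX_CONTENT_CHARS.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Result:
    """Stand-in carrying a subset of the fields a read result exposes."""
    url: str
    title: Optional[str] = None
    text: Optional[str] = None
    markdown: Optional[str] = None
    success: bool = False
    strategy_used: Optional[str] = None
    error: Optional[str] = None


def to_agent_dict(result, max_chars=8000):
    """Reduce a result to the six keys listed for agent_read_url."""
    content = result.markdown or result.text or ""  # assumed preference order
    return {
        "url": result.url,
        "title": result.title,
        "content": content[:max_chars],  # truncate to protect LLM context
        "success": result.success,
        "strategy_used": result.strategy_used,
        "error": result.error,
    }


def summarize(results, max_chars=8000):
    """Build the batch shape: results plus total/succeeded/failed counts."""
    items = [to_agent_dict(r, max_chars) for r in results]
    succeeded = sum(1 for item in items if item["success"])
    return {
        "results": items,
        "total": len(items),
        "succeeded": succeeded,
        "failed": len(items) - succeeded,
    }
```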

Full working examples: examples/example.py


Strategies

| Strategy | How it works | Best for |
| --- | --- | --- |
| fast | httpx + trafilatura (+ BS4 fallback) | Static pages, high concurrency |
| crawl4ai | Crawl4AI AsyncWebCrawler | Docs, structured Markdown output |
| browser | Playwright headless Chromium | JS-heavy SPAs, lazy-loaded content |
| auto | Degrades: fast → crawl4ai → browser | Unknown pages |
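
The auto strategy's fast → crawl4ai → browser chain is, in outline, a loop over fetchers that stops at the first acceptable result. The sketch below shows that pattern only; `read_with_fallback`, the `fake_*` fetchers, and the attempt labels are hypothetical and do not reflect the library's internals.

```python
import asyncio


async def read_with_fallback(url, strategies, is_good_enough):
    """Try each (name, fetcher) pair in order; return the first acceptable text."""
    attempts = []
    for name, fetch in strategies:
        try:
            text = await fetch(url)
        except Exception as exc:
            attempts.append((name, f"error: {exc}"))
            continue  # fall through to the next, heavier strategy
        if is_good_enough(text):
            attempts.append((name, "ok"))
            return {"strategy_used": name, "text": text, "attempts": attempts}
        attempts.append((name, "rejected"))
    return {"strategy_used": None, "text": None, "attempts": attempts}


# Offline stand-ins that simulate the three strategies.
async def fake_fast(url):
    return ""  # e.g. a JS-only shell with no extractable text


async def fake_crawl4ai(url):
    raise RuntimeError("not installed")


async def fake_browser(url):
    return "rendered page text"
```

Ordering the strategies from cheapest to most expensive keeps the common case fast and reserves the headless browser for pages that need it.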

Auto-degradation triggers when:

  • HTTP status ≥ 400
  • Extracted text is shorter than MIN_TEXT_LENGTH (default 300 chars)
  • Page looks like a JS-only shell (empty #root / #app div)
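
The three triggers above can be expressed as a single predicate. This is a minimal sketch, not the library's detection logic: `should_degrade` is a hypothetical name, the regular expression is one simple way to spot an empty #root / #app div, and treating a missing status code as a failure is an added assumption.

```python
import re

MIN_TEXT_LENGTH = 300  # documented default

# Matches an empty <div id="root"></div> or <div id="app"></div>.
_EMPTY_MOUNT = re.compile(
    r'<div[^>]*\bid=["\'](?:root|app)["\'][^>]*>\s*</div>',
    re.IGNORECASE,
)


def should_degrade(status_code, text, html, min_text_length=MIN_TEXT_LENGTH):
    """Return True when the next strategy in the chain should be tried."""
    # Assumption: no status code at all is treated like an HTTP error.
    if status_code is None or status_code >= 400:
        return True
    if len(text or "") < min_text_length:
        return True
    return bool(_EMPTY_MOUNT.search(html or ""))
```

For example, `should_degrade(200, "short", "")` is true because the text falls below the length threshold even though the request itself succeeded.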

Result shape

All read functions return a WebReadResult:

class WebReadResult(BaseModel):
    url: str
    final_url: str | None       # after redirects
    title: str | None
    text: str | None            # plain text
    markdown: str | None        # Markdown version
    html: str | None            # raw HTML
    status_code: int | None
    success: bool
    strategy_used: str | None
    attempts: list[FetchAttempt]
    error: str | None
    fetched_at: str             # ISO-8601 UTC
    elapsed_ms: int | None
    metadata: dict              # e.g. screenshot_b64 for browser reads

Configuration

Set via environment variables (or a .env file):

| Variable | Default | Description |
| --- | --- | --- |
| WRT_TIMEOUT | 20 | HTTP timeout in seconds |
| WRT_FAST_CONCURRENCY | 50 | Max concurrent fast requests |
| WRT_CRAWL4AI_CONCURRENCY | 10 | Max concurrent crawl4ai requests |
| WRT_BROWSER_CONCURRENCY | 3 | Max simultaneous Playwright pages |
| WRT_MIN_TEXT_LENGTH | 300 | Min chars to consider a fetch successful |
| WRT_AGENT_MAX_CONTENT_CHARS | 8000 | Content truncation limit for agent output |
| WRT_USER_AGENT | Chrome 124 | User-Agent header string |
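
For readers who want to see how environment overrides interact with the defaults in the table, here is a small sketch. It is an illustration only: `load_settings` is a hypothetical helper, it covers just the numeric variables, and parsing every value as an integer is an assumption (the library may accept other types, and its own settings loader may behave differently).

```python
import os

# Numeric defaults copied from the configuration table.
DEFAULTS = {
    "WRT_TIMEOUT": 20,
    "WRT_FAST_CONCURRENCY": 50,
    "WRT_CRAWL4AI_CONCURRENCY": 10,
    "WRT_BROWSER_CONCURRENCY": 3,
    "WRT_MIN_TEXT_LENGTH": 300,
    "WRT_AGENT_MAX_CONTENT_CHARS": 8000,
}


def load_settings(environ=None):
    """Overlay environment values on the defaults; unset or empty keeps the default."""
    environ = os.environ if environ is None else environ
    settings = {}
    for name, default in DEFAULTS.items():
        raw = environ.get(name)
        settings[name] = default if raw in (None, "") else int(raw)
    return settings
```

Passing a plain dict instead of relying on `os.environ` makes the overlay easy to test, for instance `load_settings({"WRT_TIMEOUT": "5"})`.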

FastAPI Server

pip install -e ".[server]"
uvicorn web4agent.server:app --host 0.0.0.0 --port 8000

| Method | Path | Body |
| --- | --- | --- |
| GET | /health | |
| POST | /read | {"url": "...", "strategy": "auto"} |
| POST | /read_many | {"urls": [...], "concurrency": 10, "strategy": "auto"} |
| POST | /discover_links | {"url": "...", "same_domain": true, "max_links": 100} |
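
A client for these endpoints needs nothing beyond the standard library. The sketch below builds a POST request for /read with the JSON body shown in the table; `build_post` is a hypothetical helper, the base URL assumes the uvicorn command above is running locally, and the response is assumed (not confirmed here) to be JSON.

```python
import json
from urllib.request import Request


def build_post(base_url, path, payload):
    """Create a JSON POST request for one of the server's endpoints."""
    body = json.dumps(payload).encode("utf-8")
    return Request(
        base_url.rstrip("/") + path,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


read_request = build_post(
    "http://localhost:8000",
    "/read",
    {"url": "https://example.com", "strategy": "auto"},
)

# To send it against a running server (assumes a JSON response):
# from urllib.request import urlopen
# with urlopen(read_request, timeout=30) as response:
#     result = json.load(response)
```

The same helper covers /read_many and /discover_links by changing the path and payload.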

Running Tests

pip install -e ".[dev]"
pytest

Compliance

  • robots.txt: not enforced automatically; check it yourself before scraping.
  • Rate limiting: cap parallelism with the concurrency parameter, and add your own delays between requests to the same domain.
  • Terms of Service: always review a site's ToS before scraping.
  • Intended for lawful, authorized use only.
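
Since robots.txt is left to the caller, one way to check it before fetching is Python's built-in `urllib.robotparser`. This is a sketch of that approach, not a web4agent feature: `build_robots_checker` and `SAMPLE_ROBOTS` are illustrative names, and the sample rules are made up so the example runs offline.

```python
from urllib.robotparser import RobotFileParser


def build_robots_checker(robots_txt, user_agent="*"):
    """Parse robots.txt text and return a function that tests URLs against it."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())

    def allowed(url):
        return parser.can_fetch(user_agent, url)

    return allowed


# Made-up rules for demonstration.
SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /private/
"""

allowed = build_robots_checker(SAMPLE_ROBOTS)
candidates = ["https://example.com/docs/", "https://example.com/private/page"]
permitted = [url for url in candidates if allowed(url)]
```

Against a live site, `RobotFileParser.set_url(...)` followed by `read()` downloads the real file instead of parsing a string; the filtered list can then be passed on for fetching.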

License

MIT
