Async web crawling framework for everyone.
Built on asyncio, aiohttp, and lxml/pyquery. Declare items and
parsers; gain handles the concurrency, retries, and persistence.
Install with pip:

    pip install gain

Linux users can opt into uvloop for an extra speed bump:

    pip install "gain[uvloop]"

Requires Python 3.10+.
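How gain wires in uvloop internally is not shown here. As a rough sketch of what "opting in" to uvloop usually means for an asyncio program, the snippet below uses uvloop's event loop when the package is importable and falls back to the standard asyncio loop otherwise. The helper names `new_event_loop`, `run`, and `main` are hypothetical and are not part of gain.

```python
import asyncio


def new_event_loop():
    """Return a uvloop event loop if uvloop is installed, else a stock asyncio loop."""
    try:
        import uvloop  # optional dependency, e.g. from `pip install "gain[uvloop]"`
    except ImportError:
        return asyncio.new_event_loop()
    return uvloop.new_event_loop()


async def main():
    # Placeholder coroutine standing in for a crawl.
    await asyncio.sleep(0)
    return "done"


def run(coro):
    """Run a coroutine to completion on whichever loop is available."""
    loop = new_event_loop()
    try:
        return loop.run_until_complete(coro)
    finally:
        loop.close()


result = run(main())
print(result)
```

The same script runs with or without uvloop installed; only the loop implementation changes.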
A minimal spider, saved as spider.py:

    import aiofiles

    from gain import Css, Item, Parser, Spider


    class Post(Item):
        title = Css(".entry-title")
        content = Css(".entry-content")

        async def save(self):
            # Append each post title to a local file without blocking the event loop.
            async with aiofiles.open("scrapinghub.txt", "a+") as f:
                await f.write(self.results["title"] + "\n")


    class MySpider(Spider):
        concurrency = 5
        headers = {"User-Agent": "Google Spider"}
        start_url = "https://blog.scrapinghub.com/"
        parsers = [
            # One argument: follow pagination links.
            Parser(r"https://blog.scrapinghub.com/page/\d+/"),
            # Two arguments: extract a Post from each matching page.
            Parser(r"https://blog.scrapinghub.com/\d{4}/\d{2}/\d{2}/[a-z0-9\-]+/", Post),
        ]


    MySpider.run()

Run it:

    python spider.py

The same pattern with XPath-based link selection and a proxy:

    from gain import Css, Item, Parser, Spider, XPathParser


    class Post(Item):
        title = Css(".breadcrumb_last")

        async def save(self):
            print(self.title)


    class MySpider(Spider):
        start_url = "https://mydramatime.com/europe-and-us-drama/"
        concurrency = 5
        headers = {"User-Agent": "Google Spider"}
        parsers = [
            # Follow category links.
            XPathParser('//span[@class="category-name"]/a/@href'),
            # Follow pagination links.
            XPathParser('//div[contains(@class, "pagination")]/ul/li/a[contains(@href, "page")]/@href'),
            # Extract a Post from each detail page.
            XPathParser('//div[@class="mini-left"]//div[contains(@class, "mini-title")]/a/@href', Post),
        ]
        proxy = "https://localhost:1234"


    MySpider.run()

The pipeline, end to end:

    ┌────────────┐    ┌────────────┐    ┌────────────┐    ┌────────────┐
    │ start_url  │ ─▶ │   Parser   │ ─▶ │    Item    │ ─▶ │   save()   │
    │            │    │  (follow)  │    │ (extract)  │    │ (persist)  │
    └────────────┘    └────────────┘    └────────────┘    └────────────┘
                            ▲                                      │
                            └──────────── new urls ────────────────┘

- Spider kicks off from start_url under a concurrency budget.
- Parsers either follow (one argument), discovering more URLs to queue, or extract (two arguments), instantiating an Item from each matching page.
- Items use Css / Xpath / Regex selectors to pull fields out of HTML.
- save() is your async hook to persist results: write a file, push to a queue, insert into a database.

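The flow in the list above can be sketched without gain at all. The toy crawler below is not gain's implementation; it only illustrates the follow/extract split, the concurrency budget, and the save() hook under simplifying assumptions. The in-memory `PAGES` site, the `example.test` URLs, and the `fetch`, `crawl`, `visit`, and `save` names are all hypothetical stand-ins.

```python
import asyncio
import re

# Hypothetical in-memory "site" (URL -> HTML) standing in for real HTTP fetches.
PAGES = {
    "https://example.test/": (
        '<a href="https://example.test/page/2/">next</a>'
        '<a href="https://example.test/post/a/">A</a>'
    ),
    "https://example.test/page/2/": (
        '<a href="https://example.test/post/b/">B</a>'
        '<a href="https://example.test/">home</a>'
    ),
    "https://example.test/post/a/": "<h1>Post A</h1>",
    "https://example.test/post/b/": "<h1>Post B</h1>",
}

FOLLOW = re.compile(r"https://example\.test/page/\d+/")     # follow rule
EXTRACT = re.compile(r"https://example\.test/post/[a-z]+/")  # extract rule
HREF = re.compile(r'href="([^"]+)"')
TITLE = re.compile(r"<h1>(.*?)</h1>")


async def fetch(url):
    await asyncio.sleep(0)  # yield control, as a real network call would
    return PAGES[url]


async def crawl(start_url, concurrency=2):
    budget = asyncio.Semaphore(concurrency)  # at most `concurrency` fetches in flight
    seen = {start_url}                       # never queue the same URL twice
    saved = []

    async def save(item):
        # Persistence hook: here we just collect items in memory.
        saved.append(item)

    async def visit(url, is_item):
        async with budget:
            html = await fetch(url)
        if is_item:
            # Extract rule: build an item from the page and persist it.
            await save({"url": url, "title": TITLE.search(html).group(1)})
            return
        # Follow rule: discover new URLs and schedule them.
        children = []
        for link in HREF.findall(html):
            link_is_item = bool(EXTRACT.fullmatch(link))
            if link in seen or not (link_is_item or FOLLOW.fullmatch(link)):
                continue
            seen.add(link)
            children.append(visit(link, link_is_item))
        await asyncio.gather(*children)

    await visit(start_url, False)
    return saved


items = asyncio.run(crawl("https://example.test/"))
titles = sorted(item["title"] for item in items)
print(titles)
```

The semaphore is held only while fetching, so a page that fans out to many links cannot deadlock the budget while it waits on its children.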
See the example/ directory for runnable scripts against
Scrapinghub, V2EX, and Sciencenet.
To set up a development checkout:

    git clone https://github.com/elliotgao2/gain.git
    cd gain
    uv sync              # install deps into .venv
    uv run pytest        # run tests
    uv run ruff check .  # lint

We use uv for packaging and ruff for lint + format. Install the pre-commit hooks:

    uv run pre-commit install

Pull requests are welcome. For non-trivial changes, please open an issue
first to discuss. Make sure pytest and ruff check pass before
submitting.

MIT © Elliot Gao