Scrapy is a web scraping framework used to extract structured data from websites. It is cross-platform and requires Python 3.10+. It is maintained by Zyte (formerly Scrapinghub) and many other contributors.
Install with:
```
pip install scrapy
```

And follow the documentation to learn how to use it.
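To get a quick feel for the framework before diving into the docs, here is a minimal, self-contained spider. It targets quotes.toscrape.com, Scrapy's public demo site, and is not part of this repository:

```python
import scrapy


class QuotesSpider(scrapy.Spider):
    """Collect quote text and authors from Scrapy's demo site."""

    name = "quotes"
    start_urls = ["https://quotes.toscrape.com"]

    def parse(self, response):
        # Each quote on the page sits in a <div class="quote"> block.
        for quote in response.css("div.quote"):
            yield {
                "text": quote.css("span.text::text").get(),
                "author": quote.css("small.author::text").get(),
            }
        # Follow the "Next" pagination link, if present.
        yield from response.follow_all(response.css("li.next a"), callback=self.parse)
```

Save it as `quotes_spider.py` and run `scrapy runspider quotes_spider.py -O quotes.json` to write the scraped items to a JSON file.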
If you wish to contribute, see Contributing.
You can build a container image that bundles Scrapy together with the
extras/link_contact_extractor.py helper script:
```
docker build -t scrapy-toolkit .
```

Once built, the image exposes the Scrapy command-line interface by default, so you can, for example, open an interactive shell against a site:
```
docker run --rm -it scrapy-toolkit shell https://inisheng.com --nolog
```

To execute the JSON link/contact extractor from the container, override the entry point and pass the target URL:
```
docker run --rm -it --entrypoint python scrapy-toolkit \
  extras/link_contact_extractor.py https://inisheng.com
```

Add `-s USER_AGENT="..."` to the Scrapy command if the target site requires a custom user agent.
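The command above treats `extras/link_contact_extractor.py` as a black box. Purely as an illustration of the kind of logic it performs, a stripped-down extractor that only collects anchor hrefs and e-mail addresses, relying on `parsel` (which ships with Scrapy), might look like this (the `extract` function and the output field names are hypothetical, not the script's actual interface):

```python
"""Illustrative sketch only -- the real logic lives in extras/link_contact_extractor.py."""
import json
import re
import sys
import urllib.request

from parsel import Selector  # bundled with Scrapy

EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")


def extract(url: str) -> dict:
    # Fetch the page and parse it with the same selector library Scrapy uses.
    html = urllib.request.urlopen(url, timeout=30).read().decode("utf-8", errors="replace")
    sel = Selector(text=html)
    links = sel.css("a::attr(href)").getall()
    emails = sorted(set(EMAIL_RE.findall(html)))
    return {"url": url, "links": links, "emails": emails}


if __name__ == "__main__":
    print(json.dumps(extract(sys.argv[1]), indent=2))
```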
> **Note:** The image ships the Scrapy framework itself and the helper scripts in
> `extras/`, but it does not include a sample Scrapy project. Commands such
> as `scrapy crawl <spider>` must be executed from within a Scrapy project
> directory (one that contains a `scrapy.cfg` file), for example by mounting
> your own project into the container and using `-w` to set the working
> directory.
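A mount-based invocation could look like the following, where `./myproject` and `myspider` are placeholders for your own project directory and spider name:

```
docker run --rm -it \
  -v "$(pwd)/myproject:/myproject" \
  -w /myproject \
  scrapy-toolkit crawl myspider
```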
The project also includes an HTTP API that can fan out concurrent requests,
allowing you to scan large batches of domains (100+ per minute on a typical
VPS). Launch the service directly on your machine with uvicorn:
```
uvicorn extras.link_contact_api:app --host 0.0.0.0 --port 8000
```

Or run it inside the Docker image and publish the port to your host:
```
docker run --rm -it -p 8000:8000 scrapy-toolkit \
  uvicorn extras.link_contact_api:app --host 0.0.0.0 --port 8000
```

Send a POST request with a list of URLs to `/scan` to trigger the crawl:
```
curl -X POST http://localhost:8000/scan \
  -H 'Content-Type: application/json' \
  -d '{"urls": ["https://example.com", "https://docs.scrapy.org"], "concurrency": 32}'
```

The response contains a summary describing how many domains were scanned successfully alongside the full per-domain breakdown.
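If you prefer to drive the endpoint from Python, the same request can be made with the third-party `requests` package (the response fields are whatever the API returns; the sketch below simply prints them):

```python
"""Call the /scan endpoint from Python instead of curl (requires the `requests` package)."""
import requests

resp = requests.post(
    "http://localhost:8000/scan",
    json={
        "urls": ["https://example.com", "https://docs.scrapy.org"],
        "concurrency": 32,  # number of domains fetched in parallel
    },
    timeout=300,  # large batches can take a while
)
resp.raise_for_status()
print(resp.json())  # summary plus the per-domain breakdown described above
```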
If you are working in an environment without outbound network access, you can still verify that the helper utilities behave as expected. The repository includes a small HTML fixture and a self-test script that exercises the link and contact extraction logic without making any HTTP requests:
```
python extras/link_contact_selftest.py --pretty
```

The script prints the expected and observed results in JSON format and exits with a non-zero status if a regression is detected. You can also supply your own HTML fixture:
```
python extras/link_contact_selftest.py path/to/page.html
```
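A fixture only needs to contain the elements the extractor looks for. An illustrative `page.html` could be as small as:

```html
<html>
  <body>
    <!-- Internal and external links plus a contact address -->
    <a href="https://example.com/about">About us</a>
    <a href="/contact">Contact</a>
    <a href="mailto:info@example.com">info@example.com</a>
  </body>
</html>
```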