Scrapy is a web scraping framework used to extract structured data from websites. It is cross-platform and requires Python 3.10+. It is maintained by Zyte (formerly Scrapinghub) and many other contributors.
Install with:
```
pip install scrapy
```

And follow the documentation to learn how to use it.
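To get a quick feel for the framework before diving into the docs, here is a minimal, self-contained spider. It targets quotes.toscrape.com, Scrapy's public demo site, and is not part of this repository:

```python
import scrapy


class QuotesSpider(scrapy.Spider):
    """Collect quote text and authors from Scrapy's demo site."""

    name = "quotes"
    start_urls = ["https://quotes.toscrape.com"]

    def parse(self, response):
        # Each quote on the page sits in a <div class="quote"> block.
        for quote in response.css("div.quote"):
            yield {
                "text": quote.css("span.text::text").get(),
                "author": quote.css("small.author::text").get(),
            }
        # Follow the "Next" pagination link, if present.
        yield from response.follow_all(response.css("li.next a"), callback=self.parse)
```

Save it as `quotes_spider.py` and run `scrapy runspider quotes_spider.py -O quotes.json` to write the scraped items to a JSON file.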
If you wish to contribute, see Contributing.
You can build a container image that bundles Scrapy together with the
extras/link_contact_extractor.py helper script:
```
docker build -t scrapy-toolkit .
```

Once built, the image exposes the Scrapy command-line interface by default, so you can, for example, open an interactive shell against a site:
```
docker run --rm -it scrapy-toolkit shell https://inisheng.com --nolog
```

To execute the JSON link/contact extractor from the container, override the entry point and pass the target URL:
```
docker run --rm -it --entrypoint python scrapy-toolkit \
  extras/link_contact_extractor.py https://inisheng.com
```

Add `-s USER_AGENT="..."` to the Scrapy command if the target site requires a custom user agent.
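The command above treats `extras/link_contact_extractor.py` as a black box. Purely as an illustration of the kind of logic it performs, a stripped-down extractor that only collects anchor hrefs and e-mail addresses, relying on `parsel` (which ships with Scrapy), might look like this (the `extract` function and the output field names are hypothetical, not the script's actual interface):

```python
"""Illustrative sketch only -- the real logic lives in extras/link_contact_extractor.py."""
import json
import re
import sys
import urllib.request

from parsel import Selector  # bundled with Scrapy

EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")


def extract(url: str) -> dict:
    # Fetch the page and parse it with the same selector library Scrapy uses.
    html = urllib.request.urlopen(url, timeout=30).read().decode("utf-8", errors="replace")
    sel = Selector(text=html)
    links = sel.css("a::attr(href)").getall()
    emails = sorted(set(EMAIL_RE.findall(html)))
    return {"url": url, "links": links, "emails": emails}


if __name__ == "__main__":
    print(json.dumps(extract(sys.argv[1]), indent=2))
```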
> **Note:** The image ships the Scrapy framework itself and the helper scripts in
> `extras/`, but it does not include a sample Scrapy project. Commands such
> as `scrapy crawl <spider>` must be executed from within a Scrapy project
> directory (one that contains a `scrapy.cfg` file), for example by mounting
> your own project into the container and using `-w` to set the working
> directory.
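A mount-based invocation could look like the following, where `./myproject` and `myspider` are placeholders for your own project directory and spider name:

```
docker run --rm -it \
  -v "$(pwd)/myproject:/myproject" \
  -w /myproject \
  scrapy-toolkit crawl myspider
```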
The project also includes an HTTP API that can fan out concurrent requests,
allowing you to scan large batches of domains (100+ per minute on a typical
VPS). Launch the service directly on your machine with uvicorn:
```
uvicorn extras.link_contact_api:app --host 0.0.0.0 --port 8000
```

Or run it inside the Docker image and publish the port to your host:
```
docker run --rm -it -p 8000:8000 scrapy-toolkit \
  uvicorn extras.link_contact_api:app --host 0.0.0.0 --port 8000
```

Send a POST request with a list of URLs to `/scan` to trigger the crawl:
```
curl -X POST http://localhost:8000/scan \
  -H 'Content-Type: application/json' \
  -d '{"urls": ["https://example.com", "https://docs.scrapy.org"], "concurrency": 32}'
```

The response contains a summary describing how many domains were scanned successfully alongside the full per-domain breakdown.
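If you prefer to drive the endpoint from Python, the same request can be made with the third-party `requests` package (the response fields are whatever the API returns; the sketch below simply prints them):

```python
"""Call the /scan endpoint from Python instead of curl (requires the `requests` package)."""
import requests

resp = requests.post(
    "http://localhost:8000/scan",
    json={
        "urls": ["https://example.com", "https://docs.scrapy.org"],
        "concurrency": 32,  # number of domains fetched in parallel
    },
    timeout=300,  # large batches can take a while
)
resp.raise_for_status()
print(resp.json())  # summary plus the per-domain breakdown described above
```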
If you are working in an environment without outbound network access, you can still verify that the helper utilities behave as expected. The repository includes a small HTML fixture and a self-test script that exercises the link and contact extraction logic without making any HTTP requests:
```
python extras/link_contact_selftest.py --pretty
```

The script prints the expected and observed results in JSON format and exits with a non-zero status if a regression is detected. You can also supply your own HTML fixture:
```
python extras/link_contact_selftest.py path/to/page.html
```
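A fixture only needs to contain the elements the extractor looks for. An illustrative `page.html` could be as small as:

```html
<html>
  <body>
    <!-- Internal and external links plus a contact address -->
    <a href="https://example.com/about">About us</a>
    <a href="/contact">Contact</a>
    <a href="mailto:info@example.com">info@example.com</a>
  </body>
</html>
```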