A fast, multithreaded and concurrent web scraping utility built in Rust for fetching raw HTML content and extracting all absolute links from a list of URLs. It's designed for simple, high-speed data collection and outputs clean JSON.
--Made out of boredom--
Future features might include a full distributed crawling framework, dynamic JavaScript rendering, and a personalized pizza delivery bot.
Seriously though...
- Heavy load detection: Using `rayon` only on heavy loads.
- Configuration Files: Support for managing inputs/settings via JSON, YAML, or Something Completely Different.
- `robots.txt` Respect: Automatic adherence to website exclusion rules.
- Targeted Data Extraction: Allow extraction of specific data using CSS selectors.
- GUI: Graphical User Interface using Tauri
- Structured Data Output: Output results in defined, structured formats (schemas).
- Anti-Bot Defenses: Implement rate limiting and proxy support for OpSec tooling.
- Session Management: Add cookie and header control for authenticated scraping.
- High-Speed Concurrency: Uses the `rayon` crate for multithreaded, parallel processing of multiple URLs, significantly improving scraping speed.
- Versatile Input: Supports loading URLs from four different file formats: JSON, CSV, XML, and TXT.
- Simple Output: Results are saved to a pretty-printed JSON file containing the full HTML and a list of all extracted absolute links for each URL.
- Flexible Input: Provide target URLs directly via command-line arguments (`--urls`) or load them from one or multiple files (`--file`).
- Custom User Agent: Easily set a standard User-Agent string (`Mozilla`, `Webkit`, or `Chrome`) to manage requests politely.
- Rust-Native Performance: Leveraging Rust for safety and execution speed.
SS_Crusty is a command-line tool designed for technical demonstration, learning, and analysis.
You use this tool at your own risk.
The user of this software is entirely responsible for adhering to all applicable local, national, and international laws, as well as to the Terms of Service (ToS) and `robots.txt` rules of any websites they scrape.
- Respect Website Policies: Always review a website's `robots.txt` file and its Terms of Service before scraping.
- Rate Limiting: This tool is concurrent; excessive use or rapid requests may overload a target site or result in your IP address being blocked. The user is responsible for implementing any necessary rate-limiting or delay mechanisms not built into the tool.
- Liability: The author of SS_Crusty is not responsible for any direct, indirect, or consequential damages resulting from the use of this software, including any legal action or bans resulting from misuse.
Pre-built binaries for Windows and Linux are available from the Releases page. For macOS, build from source as described below.
This project requires the Rust toolchain. If you don't have it, you can install it via [rustup](https://rustup.rs).
- Clone the repository:

        git clone https://github.com/Fairdose/ss_crusty.git
        cd ss_crusty

- Build the project:

        cargo build --release

- The executable will be located at `./target/release/ss_crusty`.
The scraper accepts URLs from either command-line arguments or a file path.
Use the built-in help for a quick reference: `ss_crusty --help`
| Argument | Description | Default |
|---|---|---|
| `--urls <URL>` | URLs to fetch. You can specify this argument multiple times to scrape several pages. | (Required if `--file` is not used) |
| `--file <PATH>` | Path to a file containing a list of URLs. The current implementation accepts CSV, XML, JSON, and TXT (with proper formatting). | |
| `--output <PATH>` | The path where the output JSON results will be saved. | `results.json` |
| `--user-agent <NAME>` | Optional user agent for HTTP requests. Supported values: `Mozilla`, `Webkit`, `Chrome`. | Default composite user agent |
| `--v`, `--vv`, `--vvv` | Verbosity level: `--v` = INFO, `--vv` = WARN, `--vvv` = DEBUG. | |
Fetch the HTML and links from a single page, saving the result to the default `results.json`:

    ss_crusty --urls "https://example.com"

Scrape two different pages and set the User-Agent to mimic Chrome:

    ss_crusty \
      --urls "https://example1.com" \
      --urls "https://example2.com" \
      --user-agent Chrome

Process URLs found in both `batch1.json` and `batch2.json`, and save the combined results.
`batch1.json`:

    { "urls": ["https://example.com/page1", "https://example.com/page2"] }

`batch2.json`:

    { "urls": ["https://example.com/a", "https://example.com/b"] }

You can process both files by passing the `--file` flag multiple times:

    ss_crusty \
      --file batch1.json \
      --file batch2.json \
      --output combined_results.json

The output is a single JSON object containing a `pages` array, where each element represents a scraped URL and its associated data.
Example results.json:
{
"pages": [
{
"url": "https://example.com",
"links": [
"https://www.iana.org/domains/example",
"https://another-absolute-link.com/page",
"...more links"
],
"html": "<!doctype html>\n<html>\n<head>...</head></html>" <--- HTML Content
},
{
...more page results
}
]
}
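
If you post-process the results in Rust, the schema shown above maps onto a couple of small `serde` structs. The sketch below assumes the `serde` (with the `derive` feature) and `serde_json` crates; the `Results` and `Page` type names mirror the example and are not taken from the project's source.

```rust
// Minimal sketch: deserializing ss_crusty's results file with serde.
// Type and field names mirror the documented example; they are assumptions,
// not the project's actual types.
use serde::Deserialize;

#[derive(Debug, Deserialize)]
struct Results {
    pages: Vec<Page>,
}

#[derive(Debug, Deserialize)]
struct Page {
    url: String,
    links: Vec<String>,
    html: String,
}

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Read the file produced by --output (results.json by default).
    let raw = std::fs::read_to_string("results.json")?;
    let results: Results = serde_json::from_str(&raw)?;
    for page in &results.pages {
        println!("{} -> {} links", page.url, page.links.len());
    }
    Ok(())
}
```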
The scraper automatically attempts to parse file content based on common conventions for the four supported formats below. Don't worry about duplicates: under the hood, ss_crusty uses HashSets when iterating over URLs, so duplicate entries are discarded (see the sketch after the format examples).
Files must contain a single JSON object with a key named `urls` that holds an array of URL strings.
File Content Example (input.json):
{
"urls": [
"https://example1.com",
"https://example2.com"
]
}

Files must have a column named `url` containing the target URLs.
File Content Example (input.csv):
url
https://example1.com
https://example2.com
Files must contain one URL per line.
File Content Example (input.txt):
https://example1.com
https://example2.com
Files must use a root element with URLs contained within nested tags.
File Content Example (input.xml):
<urls>
<url>https://example1.com</url>
<url>https://example2.com</url>
</urls>
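
As a rough illustration of how these formats can be collected into a single, deduplicated URL set, here is a minimal sketch. It assumes the `serde_json` crate for the JSON case, uses deliberately naive handling for CSV, TXT, and XML, and the `load_urls` helper is hypothetical rather than ss_crusty's actual parser.

```rust
// Illustrative only: reading the four documented input formats into a HashSet,
// which drops duplicate URLs automatically. Not ss_crusty's actual code.
use std::collections::HashSet;
use std::fs;
use std::path::Path;

fn load_urls(path: &Path) -> Result<HashSet<String>, Box<dyn std::error::Error>> {
    let content = fs::read_to_string(path)?;
    let ext = path
        .extension()
        .and_then(|e| e.to_str())
        .unwrap_or("")
        .to_ascii_lowercase();

    let urls: Vec<String> = match ext.as_str() {
        // JSON: { "urls": ["...", "..."] }
        "json" => {
            let value: serde_json::Value = serde_json::from_str(&content)?;
            match value.get("urls").and_then(|u| u.as_array()) {
                Some(arr) => arr
                    .iter()
                    .filter_map(|v| v.as_str().map(String::from))
                    .collect(),
                None => Vec::new(),
            }
        }
        // CSV: assumes a single `url` column; skip the header row.
        "csv" => content.lines().skip(1).map(|line| line.trim().to_string()).collect(),
        // XML: <urls><url>...</url></urls>; naive tag scan, no XML crate.
        "xml" => content
            .split("<url>")
            .skip(1)
            .filter_map(|chunk| chunk.split("</url>").next())
            .map(|url| url.trim().to_string())
            .collect(),
        // TXT (and anything else): one URL per line.
        _ => content.lines().map(|line| line.trim().to_string()).collect(),
    };

    // Collecting into a HashSet discards duplicates within a file.
    Ok(urls.into_iter().filter(|u| !u.is_empty()).collect())
}
```

Merging the sets returned for each `--file` argument then gives the combined, duplicate-free URL list.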
The core logic is handled by the `scrape` module (a rough sketch of the pipeline follows this list):

- The tool reads all URLs from the command-line arguments and all specified files, inferring the format to correctly parse URLs from each source.
- All URLs are merged, sorted, and deduplicated before processing.
- The `reqwest::blocking` client is used for fetching, configured with the specified user agent.
- Parallelism is achieved using `rayon`, allowing multiple requests to happen simultaneously.
- The `scraper` crate is used to parse the HTML and extract all `href` attributes from `<a>` tags, filtering to keep only absolute links (starting with `http://` or `https://`).
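
A minimal sketch of that pipeline, assuming the `reqwest` crate (with its `blocking` feature), `rayon`, and `scraper`; the `scrape_all` and `extract_absolute_links` functions and the `PageResult` struct are illustrative names, not the project's actual code:

```rust
// Illustrative sketch of the described pipeline: a blocking reqwest client,
// rayon for parallel fetching, and the scraper crate for link extraction.
use rayon::prelude::*;
use scraper::{Html, Selector};

struct PageResult {
    url: String,
    links: Vec<String>,
    html: String,
}

fn scrape_all(urls: &[String], user_agent: &str) -> Vec<PageResult> {
    let client = reqwest::blocking::Client::builder()
        .user_agent(user_agent)
        .build()
        .expect("failed to build HTTP client");

    urls.par_iter() // rayon: process URLs in parallel
        .filter_map(|url| {
            // Failed requests are simply skipped in this sketch.
            let html = client.get(url.as_str()).send().ok()?.text().ok()?;
            let links = extract_absolute_links(&html);
            Some(PageResult { url: url.clone(), links, html })
        })
        .collect()
}

fn extract_absolute_links(html: &str) -> Vec<String> {
    let document = Html::parse_document(html);
    let selector = Selector::parse("a[href]").expect("valid CSS selector");
    document
        .select(&selector)
        .filter_map(|a| a.value().attr("href"))
        .filter(|href| href.starts_with("http://") || href.starts_with("https://"))
        .map(|href| href.to_string())
        .collect()
}
```

Error handling and output serialization are left out here to keep the sketch focused on the fetch-and-extract flow.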
This project is licensed under the Fairdose Non-Commercial License. See the LICENSE file for full details and commercial restrictions.