A multi-agent, fully modular OSINT framework written in Rust — built to detect phishing sites, malicious domains, and suspicious web infrastructure at scale.
⚠️ Educational Use Only — RavenOSINT is built for security researchers, threat intelligence analysts, and educators. Only scan domains and URLs you own or have explicit written permission to analyse. The authors are not responsible for misuse.
RavenOSINT is a cross-platform OSINT (Open Source Intelligence) framework for automated phishing site detection and malicious domain analysis. It combines:
- Automated discovery — finds candidate URLs using search engines and seed lists
- Multi-agent validation — runs parallel checks for SSL health, availability, content signals, and redirect chains
- LLM verification — passes findings to DeepSeek for AI-powered classification
- Persistent storage — saves all results to SQLite or PostgreSQL for later review
- REST API + CLI — use it interactively, in scripts, or as a backend service
Built entirely in Rust for performance, memory safety, and true cross-platform support (Windows, Linux, macOS).
This tool was built alongside, and used in, a cybercrime research repository: a collection of threat intelligence notes, phishing campaign analysis, and infrastructure research. RavenOSINT is the automation layer behind that work.
RavenOSINT/
├── raven-core # Shared types, traits, config contracts
├── raven-bus # Async event bus (tokio broadcast)
├── raven-discovery # URL discovery — Serper, Exa, seed files
├── raven-scraper # HTTP engine — rate limiting, UA rotation, SSL checks
├── raven-agent # Validation agents — availability, SSL, content analysis
├── raven-llm # DeepSeek LLM integration for AI verdict
├── raven-storage # SQLite / PostgreSQL persistence layer
├── raven-api # Axum REST API with Swagger UI
└── raven-cli # Clap CLI — the main binary
Every component is a separate crate with a clean trait boundary. Swap out any piece without touching the rest.
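To make the idea of a trait boundary concrete, here is a minimal sketch of how a validation agent could sit behind a trait. The names `PageSnapshot`, `AgentReport`, `Agent`, `AvailabilityAgent`, and `run_agents` are hypothetical and are not taken from raven-core; the real traits are probably async and carry more data.

```rust
/// Hypothetical input handed to every agent (not the real raven-core type).
pub struct PageSnapshot {
    pub final_url: String,
    pub status_code: u16,
}

/// Hypothetical per-agent finding.
#[derive(Debug, PartialEq)]
pub struct AgentReport {
    pub agent: &'static str,
    pub suspicious: bool,
    pub note: String,
}

/// The boundary: any type implementing this trait can be plugged in.
pub trait Agent {
    fn name(&self) -> &'static str;
    fn inspect(&self, page: &PageSnapshot) -> AgentReport;
}

/// Example agent: treats 4xx and 5xx final responses as worth a closer look.
pub struct AvailabilityAgent;

impl Agent for AvailabilityAgent {
    fn name(&self) -> &'static str {
        "availability"
    }

    fn inspect(&self, page: &PageSnapshot) -> AgentReport {
        let failed = page.status_code >= 400;
        AgentReport {
            agent: self.name(),
            suspicious: failed,
            note: format!("{} returned HTTP {}", page.final_url, page.status_code),
        }
    }
}

/// Callers depend only on `dyn Agent`, so implementations can be swapped
/// without touching this function.
pub fn run_agents(agents: &[Box<dyn Agent>], page: &PageSnapshot) -> Vec<AgentReport> {
    agents.iter().map(|agent| agent.inspect(page)).collect()
}
```

Because `run_agents` only knows about the trait, adding or replacing an agent means changing the list passed in, not the function itself.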
- Rust 1.75 or later
- Git
git clone https://github.com/yourorg/raven-osint
cd raven-osint
cargo build --release
The binary will be at target/release/raven (or raven.exe on Windows).
Create a .env file in the project root:
# Required for Serper search (get a free key at serper.dev)
RAVEN__DISCOVERY__SERPER__API_KEY=your_serper_key_here
# Optional — Exa search (get a key at exa.ai)
RAVEN__DISCOVERY__EXA__API_KEY=your_exa_key_here
# Optional — DeepSeek LLM for AI-powered verdict
RAVEN__LLM__API_KEY=your_deepseek_key_here
Never commit your .env file. It is already in .gitignore.
# Scan a single suspicious URL
raven scan https://suspicious-site.com
# Discover phishing-related URLs and scan them all
raven discover "paypal login phishing" --limit 10 --validate
# Validate a list of URLs from a file
raven validate urls.txt
# Start the REST API server
cargo run -p raven-api
The scan command runs the full pipeline: scrape → agent checks → LLM verdict → save to DB.
raven scan <URL> [OPTIONS]
# Examples
raven scan https://example.com
raven scan https://suspicious-domain.net --output json
raven scan https://example.com --tags phishing,campaign-2024
Options:
| Flag | Description | Default |
|---|---|---|
| --output | table or json | table |
| --tags | Comma-separated tags for grouping results | none |
What the pipeline does:
- Fetches the URL, follows redirects, records final URL, status code, latency, headers
- Checks SSL validity and certificate details
- Runs availability agent — flags suspicious cross-domain redirects
- Runs content agent — scans for phishing keywords, missing security headers, suspicious JS
- Sends findings to DeepSeek LLM for a final verdict
- Saves everything to the database
- Returns a status (active, suspicious, malicious, down, or unknown) with a confidence score
The discover command queries search engines or reads seed files to build a list of candidate URLs. It can optionally feed them directly into the scan pipeline.
raven discover <QUERY> [OPTIONS]
# Basic search
raven discover "paypal phishing"
# Scope to a domain
raven discover "login" --site paypal.com
# Use Exa instead of Serper
raven discover "phishing kit" --provider exa --limit 20
# Discover and immediately scan everything found
raven discover "suspicious login page" --limit 10 --validate
# Output one URL per line (for scripting)
raven discover "query" --output urls
# Pipe discovered URLs into validate
raven discover "query" --limit 25 --output urls > found.txt
raven validate found.txt
# Filter by country and language
raven discover "phishing" --country us --lang en --limit 10
# Use a seed file (one URL/domain per line)
raven discover seeds.txt --provider seed_file
# Use VirusTotal for domain reputation and passive DNS
raven discover "suspicious-domain.com" --provider virustotal --limit 25
# Use Censys to find hosts with specific services
raven discover "services.port: 443" --provider censys --limit 20
raven discover "" --provider censys --site suspicious-domain.com
Options:
| Flag | Description | Default |
|---|---|---|
| --site | Restrict results to this domain | none |
| --provider | serper, exa, seed_file, virustotal, censys | serper |
| --limit | Max URLs to return | 25 |
| --country | ISO country code (e.g. us, de) | none |
| --lang | Language code (e.g. en, fr) | none |
| --include-subdomains | Include subdomains when --site is set | true |
| --validate | Also run full scan on every discovered URL | false |
| --tags | Tags to attach to validation jobs | none |
| --output | table, json, or urls | table |
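The interaction between --site and --include-subdomains can be pictured as a host filter. The sketch below is one plausible reading of those two flags, not RavenOSINT's actual implementation; `host_in_scope` is a hypothetical helper name.

```rust
/// Returns true when `host` falls inside the scope defined by `site`.
/// With `include_subdomains` set, `login.paypal.com` matches `paypal.com`;
/// without it, only the exact host matches. Comparison is case-insensitive.
pub fn host_in_scope(host: &str, site: &str, include_subdomains: bool) -> bool {
    let host = host.trim_end_matches('.').to_ascii_lowercase();
    let site = site.trim_end_matches('.').to_ascii_lowercase();
    if host == site {
        return true;
    }
    // Require a dot before the suffix so `notpaypal.com` does not match `paypal.com`.
    include_subdomains && host.ends_with(&format!(".{site}"))
}
```

Matching on the dot-prefixed suffix rather than a plain suffix matters for phishing work in particular, since lookalike domains often embed the brand name at the end of an unrelated host.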
The validate command reads a plain text file (one URL per line) and runs the full scan pipeline on each one.
raven validate <FILE> [OPTIONS]
# Examples
raven validate urls.txt
raven validate urls.txt --output json > results.json
File format:
# Lines starting with # are ignored
https://example.com
https://suspicious-site.net
https://another-domain.org
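The file format above is simple enough to parse in a few lines. This is a minimal sketch, assuming that blank lines are skipped and surrounding whitespace is trimmed (neither is stated above); `parse_url_list` is a hypothetical name, not a function from the RavenOSINT crates.

```rust
/// Parses the contents of a URL list: one URL per line, `#` lines ignored.
/// Skipping blank lines and trimming whitespace are assumptions.
pub fn parse_url_list(contents: &str) -> Vec<String> {
    contents
        .lines()
        .map(str::trim)
        .filter(|line| !line.is_empty() && !line.starts_with('#'))
        .map(str::to_owned)
        .collect()
}
```

The sketch does no URL validation; it only separates candidate lines from comments.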
Working with CSV files:
validate expects plain text, not CSV. Convert first:
# PowerShell: extract a "url" column from a CSV
Import-Csv targets.csv | Select-Object -ExpandProperty url | Out-File -Encoding utf8 urls.txt
raven validate urls.txt
# Bash: extract second column (adjust column number as needed)
cut -d',' -f2 targets.csv | tail -n +2 > urls.txt
raven validate urls.txt
Raven's CLI can also query the results database directly, with no external tool required.
# List all stored scan results in a colour-coded table
raven results list
# Page through results
raven results list --limit 50 --offset 0
# Filter by status
raven results list --status malicious
# Full detail for one result — shows agents, LLM reasoning, SSL info
raven results get <job_id>
# List all discovery jobs
raven discoveries list
# See all URLs from a specific discovery job
raven discoveries get <job_id>
# Export URLs from a job straight into validate
raven discoveries get <job_id> --output urls | raven validate -
The config show command prints the fully resolved configuration as JSON, including which API keys are loaded (values shown as empty strings if missing). Useful for debugging.
raven config show
The plugin list command shows all active discovery providers, scrapers, agents, and LLM backends.
raven plugin list
Start the server:
cargo run -p raven-api
# Listening on http://127.0.0.1:3000
Interactive Swagger UI: http://127.0.0.1:3000/docs
| Method | Path | Description |
|---|---|---|
| GET | /health | Health check |
| POST | /scan | Submit a URL for scanning |
| GET | /results | List all scan results |
| GET | /results/{job_id} | Get one scan result |
| POST | /discover | Start a discovery job |
| GET | /discoveries | List discovery results |
| GET | /discoveries/{job_id} | Get one discovery result |
curl -X POST http://127.0.0.1:3000/scan \
-H "Content-Type: application/json" \
-d '{
"url": "https://suspicious-site.com",
"tags": ["phishing", "manual-review"]
}'
curl -X POST http://127.0.0.1:3000/discover \
-H "Content-Type: application/json" \
-d '{
"query": "paypal login phishing",
"provider": "serper",
"limit": 10,
"validate": true
}'
# List all results
curl http://127.0.0.1:3000/results
# Get a specific job
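curl http://127.0.0.1:3000/results/70da917b-a309-45ef-bd53-84a62e67061b
All settings live in config/default.toml. Every value can be overridden with an environment variable using the pattern RAVEN__<SECTION>__<KEY>. Nested sections add further double-underscore segments, for example RAVEN__DISCOVERY__SERPER__API_KEY for api_key under [discovery.serper].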
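The naming pattern can be read as a mapping from an environment variable name to a dotted config path. The sketch below illustrates that mapping only; `env_to_config_path` is a hypothetical helper, and the actual override logic in RavenOSINT is presumably handled by its config loader.

```rust
/// Maps `RAVEN__DISCOVERY__SERPER__API_KEY` to `discovery.serper.api_key`.
/// Returns None when the name does not carry the `RAVEN__` prefix.
pub fn env_to_config_path(var_name: &str) -> Option<String> {
    let rest = var_name.strip_prefix("RAVEN__")?;
    // Double underscores separate sections; single underscores stay in the key.
    let segments: Vec<String> = rest
        .split("__")
        .map(|segment| segment.to_ascii_lowercase())
        .collect();
    if segments.iter().any(|segment| segment.is_empty()) {
        return None; // malformed name such as `RAVEN____KEY`
    }
    Some(segments.join("."))
}
```

Using a double underscore as the separator is what lets keys such as api_key keep their own single underscore.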
Switch between databases by editing config/default.toml.
# config/default.toml
[database]
url = "duckdb://raven.duckdb"
To use PostgreSQL or DuckDB, you must enable a feature flag at compile time.
# Example: Run a discovery with DuckDB enabled
cargo run -p raven-cli --features raven-storage/duckdb -- discover "apple inc" --site apple.com --limit 5
The rest of the configuration is shown below.
[scraper]
rate_rpm = 10 # requests per minute per domain
timeout_secs = 30
max_redirects = 10
[discovery]
default_provider = "serper"
default_limit = 25
validate_by_default = false # set true to auto-scan all discovered URLs
[discovery.serper]
enabled = true
base_url = "https://google.serper.dev/search"
api_key = "" # use RAVEN__DISCOVERY__SERPER__API_KEY env var
[discovery.exa]
enabled = true
base_url = "https://api.exa.ai/search"
api_key = "" # use RAVEN__DISCOVERY__EXA__API_KEY env var
[llm]
provider = "deepseek"
base_url = "https://api.deepseek.com/v1"
model = "deepseek-chat"
api_key = "" # use RAVEN__LLM__API_KEY env var
[api]
host = "127.0.0.1"
port = 3000
[logging]
level = "info" # off | error | warn | info | debug | trace
format = "pretty" # pretty | json
Every scan returns a result with these fields:
| Field | Description |
|---|---|
| status | active, suspicious, malicious, down, or unknown |
| confidence | 0.0 to 1.0: how confident the system is in the verdict |
| agent_reports | Individual findings from each agent |
| llm_verdict | AI reasoning and classification |
| scraper_output | Raw HTTP data: headers, body, SSL info, redirect chain |
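As a rough picture of how these fields could be modelled, here is a sketch of a status enum and a result record. The field types, the `ScanResult::new` constructor, and the clamping of confidence are assumptions for illustration; they are not copied from raven-core, and scraper_output is left out for brevity.

```rust
use std::str::FromStr;

/// The five verdict categories from the table above.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Status {
    Active,
    Suspicious,
    Malicious,
    Down,
    Unknown,
}

impl Status {
    /// Lowercase label as written in this README.
    pub fn as_str(self) -> &'static str {
        match self {
            Status::Active => "active",
            Status::Suspicious => "suspicious",
            Status::Malicious => "malicious",
            Status::Down => "down",
            Status::Unknown => "unknown",
        }
    }
}

impl FromStr for Status {
    type Err = String;

    /// Parses a label such as the value given to a status filter.
    fn from_str(s: &str) -> Result<Self, Self::Err> {
        match s.to_ascii_lowercase().as_str() {
            "active" => Ok(Status::Active),
            "suspicious" => Ok(Status::Suspicious),
            "malicious" => Ok(Status::Malicious),
            "down" => Ok(Status::Down),
            "unknown" => Ok(Status::Unknown),
            other => Err(format!("unknown status: {other}")),
        }
    }
}

/// Hypothetical shape of a scan result; field types are guesses.
pub struct ScanResult {
    pub status: Status,
    pub confidence: f32,
    pub agent_reports: Vec<String>,
    pub llm_verdict: Option<String>,
}

impl ScanResult {
    /// Builds a result, clamping confidence into the documented 0.0 to 1.0 range.
    pub fn new(status: Status, confidence: f32) -> Self {
        ScanResult {
            status,
            confidence: confidence.clamp(0.0, 1.0),
            agent_reports: Vec::new(),
            llm_verdict: None,
        }
    }
}
```

Making llm_verdict optional reflects that the DeepSeek key is described as optional in the setup section, so a result may exist without an LLM verdict.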
Agent signals that increase suspicion:
- Cross-domain redirect (e.g. site.com → site.net)
- Missing security headers (x-frame-options, content-security-policy)
- High latency suggesting evasive infrastructure
- Phishing keywords in page content
- Suspicious JavaScript patterns
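Two of the signals above are easy to sketch with the standard library alone. The functions below are illustrative and hypothetical (`missing_security_headers`, `is_cross_domain_redirect`); the cross-domain check in particular is deliberately naive, comparing whole hosts after dropping a leading `www.`, whereas a production check would more likely compare registrable domains.

```rust
/// Security headers named in the list above.
const EXPECTED_HEADERS: [&str; 2] = ["x-frame-options", "content-security-policy"];

/// Returns the expected security headers that are absent from a response.
/// Header names are compared case-insensitively.
pub fn missing_security_headers(headers: &[(&str, &str)]) -> Vec<&'static str> {
    EXPECTED_HEADERS
        .iter()
        .copied()
        .filter(|expected| {
            !headers
                .iter()
                .any(|(name, _)| name.eq_ignore_ascii_case(expected))
        })
        .collect()
}

/// Naive cross-domain check: the hosts differ after lowercasing and
/// dropping a leading `www.`. This is a simplification, not the real agent.
pub fn is_cross_domain_redirect(start_host: &str, final_host: &str) -> bool {
    fn normalize(host: &str) -> String {
        let lower = host.to_ascii_lowercase();
        lower.strip_prefix("www.").unwrap_or(&lower).to_string()
    }
    normalize(start_host) != normalize(final_host)
}
```

With this sketch, a redirect from site.com to site.net counts as cross-domain, while www.site.com to site.com does not.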
RavenOSINT is actively developed. Planned additions:
- VirusTotal — domain reputation, URL scan history, passive DNS
- Censys — internet-wide asset and certificate discovery
- Shodan — open port and service fingerprinting
- Bing Search — secondary search engine source
- CommonCrawl — bulk historical web data
- OpenAI GPT-4o — as an alternative to DeepSeek
- Ollama — fully local LLM inference, no API key required
- Anthropic Claude — via the Messages API
- WHOIS agent — registrar, registration date, registrant country
- DNS agent — MX, TXT, NS records, typosquatting detection
- Screenshot agent — headless browser capture for visual similarity scoring
- Threat feed agent — check against known phishing feed blocklists
- Web dashboard — visual results browser
- Webhook support — push results to Slack, Discord, or custom endpoints
- Batch API — submit hundreds of URLs in one request
- Export formats — STIX 2.1, CSV, MISP event format
# All tests
cargo test --workspace
# Specific crate
cargo test -p raven-scraper
cargo test -p raven-discovery
# With output (useful for debugging)
cargo test --workspace -- --nocapture
Pull requests are welcome. For major changes, open an issue first. Please make sure cargo test --workspace passes and cargo clippy has no warnings before submitting.
Licensed under the Apache License 2.0.
Built for threat intelligence research. Use responsibly.