Enrich a CSV of business leads with Facebook page data (emails, Instagram, follower counts, ad status) using DuckDuckGo search and Apify.
csv2jsonl.py < leads.csv | ddg_search.py | fb_scrape.py | jsonl2csv.py > enriched.csv
Each tool reads JSONL from stdin and writes JSONL to stdout. run.sh wraps the full pipeline with checkpointing and resume support.
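
The stdin-to-stdout contract can be sketched as a small filter skeleton. This is a minimal illustration, not the code in this repo: `read_jsonl`, `run_stage`, and `add_query` are hypothetical names, and the `query` field is an assumption made for the example.

```python
import json
import sys
from typing import Callable, Iterable, Iterator, TextIO


def read_jsonl(stream: Iterable[str]) -> Iterator[dict]:
    """Yield one dict per non-blank line of JSONL input."""
    for line in stream:
        line = line.strip()
        if line:
            yield json.loads(line)


def run_stage(transform: Callable[[dict], dict],
              stdin: Iterable[str] = sys.stdin,
              stdout: TextIO = sys.stdout) -> None:
    """Apply transform to each record and emit it as one JSON line."""
    for record in read_jsonl(stdin):
        stdout.write(json.dumps(transform(record), ensure_ascii=False) + "\n")
        stdout.flush()  # flush per record so the next stage sees output early


def add_query(record: dict) -> dict:
    """Hypothetical transform: build a search query from name, city, state."""
    parts = [record.get("name", ""), record.get("city", ""), record.get("state", "")]
    return {**record, "query": " ".join(p for p in parts if p)}
```

Calling `run_stage(add_query)` at the bottom of a script would turn it into one stage of a pipe. Because every stage copies the incoming record and only adds keys, fields set earlier in the pipeline survive to the final CSV.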
git clone <repo-url>
cd facebook-scraper

Requires Python 3.10+. macOS has shipped with Python 3 since Monterey. If you don't have it, install via Homebrew:

brew install python3

Then create the venv and install dependencies:
python3 -m venv .venv
source .venv/bin/activate
pip install requests lxml rich

Copy the example env file and fill in your values:

cp .env.example .env

Edit .env:
APIFY_TOKEN=your_apify_token_here
PROXY_URL=http://user:pass@proxy.example.com:8080
- APIFY_TOKEN (required) — get one at apify.com under Settings > Integrations > API Tokens
- PROXY_URL (optional) — a rotating proxy for DuckDuckGo searches. Without it, DDG may rate-limit you after a few dozen queries
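
As a rough sketch of how these two settings could be consumed, the snippet below parses a simple .env file and builds the `proxies` mapping that `requests` accepts. The helper names `load_env` and `resolve_settings` are hypothetical, and the rule that real environment variables override .env is an assumption; the scripts in this repo may load their configuration differently.

```python
import os
from pathlib import Path
from typing import Mapping, Optional


def load_env(path: str = ".env") -> dict:
    """Parse simple KEY=VALUE lines; skip blanks and # comments."""
    env = {}
    file = Path(path)
    if not file.exists():
        return env
    for raw in file.read_text(encoding="utf-8").splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")  # split at the first "=" only
        env[key.strip()] = value.strip().strip('"').strip("'")
    return env


def resolve_settings(file_env: Mapping[str, str],
                     environ: Optional[Mapping[str, str]] = None):
    """Return (token, proxies). Assumption: real env vars win over .env."""
    if environ is None:
        environ = os.environ
    merged = dict(file_env)
    for name in ("APIFY_TOKEN", "PROXY_URL"):
        if environ.get(name):
            merged[name] = environ[name]
    token = merged.get("APIFY_TOKEN", "")
    if not token:
        raise SystemExit("error: APIFY_TOKEN not set")
    proxy = merged.get("PROXY_URL")
    # requests takes a proxies mapping keyed by URL scheme
    proxies = {"http": proxy, "https": proxy} if proxy else None
    return token, proxies
```

The returned `proxies` value can be passed straight to `requests.get(url, proxies=proxies)`; passing `None` means no proxy is configured for that call.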
The input CSV needs at minimum the name, city, and state columns. Supported columns:
| CSV column | Used as | Required |
|---|---|---|
| name | Business name | yes |
| city | City | yes |
| us_state or state | State | yes |
| phone | Phone number | no |
| full_address | Address | no |
This matches the export format from Outscraper.
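
The column rules above could be applied along these lines. This is a sketch under assumptions rather than the actual csv2jsonl.py: the helper names are hypothetical, as are the choices to prefer us_state over state, to write the result into a `state` key, and to skip incomplete rows with a warning.

```python
import csv
import io
import json
import sys
from typing import Iterable, Iterator

REQUIRED = ("name", "city", "state")


def normalize_row(row: dict) -> dict:
    """Trim values, keep every input column, and settle on one state key."""
    record = {k: (v or "").strip() for k, v in row.items()
              if k is not None and not isinstance(v, list)}
    # Assumption: us_state takes precedence when both columns are present.
    record["state"] = record.get("us_state") or record.get("state") or ""
    return record


def csv_to_records(lines: Iterable[str]) -> Iterator[dict]:
    """Yield normalized records, skipping rows that lack a required field."""
    for number, row in enumerate(csv.DictReader(lines), start=2):
        record = normalize_row(row)
        missing = [f for f in REQUIRED if not record.get(f)]
        if missing:
            print(f"skipping CSV line {number}: missing {missing}", file=sys.stderr)
            continue
        yield record


def to_jsonl_line(record: dict) -> str:
    """Serialize one record as a single JSON line."""
    return json.dumps(record, ensure_ascii=False)
```

Keeping all input columns in the record, instead of only the mapped ones, is what lets the final CSV carry the original fields through unchanged.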
./run.sh leads.csv

Output goes to leads_enriched.csv.
If interrupted, run the same command again to resume where it left off. Use --fresh to start over:
./run.sh --fresh leads.csv

The enriched CSV includes all input fields plus:
| Field | Description |
|---|---|
| fb_url | Facebook page URL |
| fb_email | Email listed on the FB page |
| fb_instagram | Linked Instagram handle |
| fb_followers | Follower count |
| fb_likes | Like count |
| fb_intro | Page intro/about text |
| fb_creation_date | Page creation date |
| fb_ad_status | Whether the page runs ads |
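
One way the final JSONL-to-CSV step could lay out these columns is sketched below. It is an illustration only: `jsonl_to_csv` is a hypothetical function, and placing the fb_* columns after the input fields, with empty cells for leads that have no Facebook match, is an assumption about the output layout.

```python
import csv
import io
import json
from typing import Iterable

FB_FIELDS = ["fb_url", "fb_email", "fb_instagram", "fb_followers",
             "fb_likes", "fb_intro", "fb_creation_date", "fb_ad_status"]


def jsonl_to_csv(lines: Iterable[str]) -> str:
    """Render JSONL records as CSV text: input fields first, fb_* last."""
    records = [json.loads(line) for line in lines if line.strip()]
    header = []
    for record in records:
        for key in record:
            # Input fields keep the order in which they were first seen.
            if key not in header and key not in FB_FIELDS:
                header.append(key)
    header += FB_FIELDS
    buffer = io.StringIO()
    # restval fills in fb_* cells for records that were never enriched.
    writer = csv.DictWriter(buffer, fieldnames=header, restval="",
                            lineterminator="\n")
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()
```

Building the header from all records, not only the first, keeps the CSV rectangular even when some leads carry fields that others lack.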
Each script works standalone:
# Search DDG for a single business
./ddg_search.py "Joe's Coffee" "Tampa" "FL"
# Process a JSONL file through just the FB scraper
cat partial.jsonl | ./fb_scrape.py > scraped.jsonl
# Convert between formats
./csv2jsonl.py < leads.csv > leads.jsonl
cat results.jsonl | ./jsonl2csv.py > results.csv

- "error: APIFY_TOKEN not set": your .env file is missing or doesn't define APIFY_TOKEN. Check that .env exists in the project root.
- DDG returning no results: DuckDuckGo is likely rate-limiting you. Set PROXY_URL in .env to a rotating residential proxy, or wait and retry.
- Resuming gives wrong counts: the checkpoint file (leads.jsonl) may be stale. Run with --fresh to reset.
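
To make the resume and stale-checkpoint behaviour easier to reason about, here is a minimal sketch of one possible resume scheme. How run.sh actually tracks progress is not shown in this README, so everything here is an assumption: the helper names are hypothetical, and so is the idea of identifying a lead by its name, city, and state.

```python
import json
from pathlib import Path
from typing import Iterable, Iterator


def record_key(record: dict) -> tuple:
    """Assumed identity of a lead: case-insensitive name, city, state."""
    return tuple(str(record.get(f, "")).strip().lower()
                 for f in ("name", "city", "state"))


def load_done(checkpoint: Path) -> set:
    """Collect keys of records already written to the checkpoint file."""
    done = set()
    if checkpoint.exists():
        with checkpoint.open(encoding="utf-8") as handle:
            for line in handle:
                try:
                    done.add(record_key(json.loads(line)))
                except json.JSONDecodeError:
                    continue  # blank or half-written line from an interrupt
    return done


def pending(records: Iterable[dict], checkpoint: Path,
            fresh: bool = False) -> Iterator[dict]:
    """Yield only records not yet in the checkpoint; fresh discards it."""
    if fresh and checkpoint.exists():
        checkpoint.unlink()
    done = load_done(checkpoint)
    for record in records:
        if record_key(record) not in done:
            yield record
```

Under this scheme a checkpoint left over from a different input file would mark unrelated leads as done or leave extra rows behind, which is the kind of mismatch a fresh run clears.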