High-performance async scraper for FMCSA SAFER database with support for 100+ requests per second.
The system uses a 3-tier async producer-consumer pattern:
- Feeder: Reads CSV, filters already-scraped records, feeds job queue
- Scraper Workers (200-500 concurrent): Pull from job queue, fetch HTML via proxy, parse, push to write queue
- Database Writer (single worker): Batches records from write queue, bulk inserts using 1 DB connection
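The three tiers above can be sketched roughly as follows. This is a minimal illustration, not the code in src/orchestrator.ts: `AsyncQueue`, `runPipeline`, `ScrapedRecord`, and the `fetchHtml` callback are hypothetical names, the "bulk insert" is replaced by collecting batches in memory, and `null` is used as a stop signal.

```typescript
// Minimal unbounded async queue: pop() waits until an item is available.
class AsyncQueue<T> {
  private items: T[] = [];
  private waiters: Array<(item: T) => void> = [];

  push(item: T): void {
    const waiter = this.waiters.shift();
    if (waiter) waiter(item);
    else this.items.push(item);
  }

  pop(): Promise<T> {
    if (this.items.length > 0) return Promise.resolve(this.items.shift() as T);
    return new Promise<T>((resolve) => this.waiters.push(resolve));
  }
}

// Hypothetical record shape; the real scraper produces a richer type.
interface ScrapedRecord { dot: string; html: string }

async function runPipeline(
  dotNumbers: string[],
  workerCount: number,
  batchSize: number,
  fetchHtml: (dot: string) => Promise<string>, // stand-in for proxy fetch + parse
): Promise<ScrapedRecord[][]> {
  const jobs = new AsyncQueue<string | null>();          // null = stop signal
  const writes = new AsyncQueue<ScrapedRecord | null>(); // null = stop signal
  const batches: ScrapedRecord[][] = [];                 // stand-in for bulk inserts

  // Tier 1: the feeder pushes every job, then one stop signal per worker.
  const feeder = async (): Promise<void> => {
    for (const dot of dotNumbers) jobs.push(dot);
    for (let i = 0; i < workerCount; i++) jobs.push(null);
  };

  // Tier 2: each worker pulls jobs until it receives a stop signal.
  const worker = async (): Promise<void> => {
    for (;;) {
      const job = await jobs.pop();
      if (job === null) return;
      writes.push({ dot: job, html: await fetchHtml(job) });
    }
  };

  // Tier 3: a single writer groups records into batches of batchSize.
  const writer = async (): Promise<void> => {
    let batch: ScrapedRecord[] = [];
    for (;;) {
      const rec = await writes.pop();
      if (rec === null) break;
      batch.push(rec);
      if (batch.length >= batchSize) { batches.push(batch); batch = []; }
    }
    if (batch.length > 0) batches.push(batch); // flush the final partial batch
  };

  const writerDone = writer();
  const workers: Promise<void>[] = [];
  for (let i = 0; i < workerCount; i++) workers.push(worker());
  await Promise.all([feeder(), ...workers]);
  writes.push(null); // all workers are finished, so no more records will arrive
  await writerDone;
  return batches;
}
```

Keeping the writer as a single consumer means only one task ever touches the batch buffer, which is one plausible reason for funnelling all writes through a separate queue.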
- Install dependencies:
npm install
- Configure .env file:
DB_NAME=fmcsa_safer
DB_USER=your_user
DB_PASSWORD=your_password
DB_HOST=localhost
DB_PORT=5432
DB_AVAILABLE=True
PROXY_URL=your_proxy_url
PROXY_USER_BASE=your_proxy_user_base
PROXY_PASS=your_proxy_password
CONCURRENCY=200
BATCH_SIZE=1000
MAX_RETRIES=3
REQUEST_TIMEOUT=15
TEST_MODE=False
TEST_LIMIT=100
- Set up PostgreSQL database:
psql -U your_user -d fmcsa_safer -f src/schema.sql
Build and run:
npm run build
npm start
Or run directly with ts-node:
npm run dev
This will:
- Load USDOT numbers from dot_numbers.csv
- Check database for existing records (resume capability)
- Start 200 concurrent scraper workers (configurable via CONCURRENCY)
- Batch insert to database in chunks of 1000 (configurable via BATCH_SIZE)
- Display progress statistics every 10 seconds
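As an illustration of chunked batch inserts, the sketch below splits records into fixed-size chunks and builds one parameterized multi-row INSERT per chunk. It is not taken from src/database.ts: `chunk`, `buildBulkInsert`, and the table and column names in the usage note are hypothetical, and the `{ text, values }` shape is assumed only because it is the form a parameterized PostgreSQL query typically takes.

```typescript
interface BulkInsert { text: string; values: unknown[] }

// Split items into consecutive chunks of at most `size` elements.
function chunk<T>(items: T[], size: number): T[][] {
  if (size <= 0) throw new Error("size must be positive");
  const out: T[][] = [];
  for (let i = 0; i < items.length; i += size) out.push(items.slice(i, i + size));
  return out;
}

// Build one multi-row INSERT with $1, $2, ... placeholders.
// Table and column names are interpolated directly, so they must be trusted constants.
function buildBulkInsert(table: string, columns: string[], rows: unknown[][]): BulkInsert {
  const values: unknown[] = [];
  const tuples = rows.map((row) => {
    if (row.length !== columns.length) throw new Error("row/column length mismatch");
    const placeholders = row.map((v) => {
      values.push(v);
      return `$${values.length}`; // placeholder index is 1-based
    });
    return `(${placeholders.join(", ")})`;
  });
  return {
    text: `INSERT INTO ${table} (${columns.join(", ")}) VALUES ${tuples.join(", ")}`,
    values,
  };
}
```

For example, `chunk(records, 1000).map((rows) => buildBulkInsert("carriers", ["dot_number", "legal_name"], rows))` would yield one statement per batch. A single PostgreSQL statement accepts at most 65535 bind parameters, so the batch size times the column count has to stay under that limit.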
For testing without proxies and with lower concurrency:
Option 1: Command-line flag
npm run dev -- --test
# or
npm run dev -- -t
Option 2: Environment variable
Set in .env:
TEST_MODE=True
Test mode will:
- Skip proxy usage (direct connections to FMCSA)
- Use concurrency of 10 (instead of 200)
- Process only first N records (set via TEST_LIMIT, default 100)
- Still save to database (if configured)
- Display "TEST MODE ENABLED" banner when starting
Note: FMCSA blocks direct connections (returns 403). Test mode without proxy will show 403 errors.
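One way the test-mode rules above could be resolved from the command line and the environment is sketched below. `RunMode` and `resolveRunMode` are hypothetical names, and the assumption that either the flag or TEST_MODE=True is enough to enable test mode is ours, not confirmed by the source.

```typescript
interface RunMode {
  testMode: boolean;
  useProxy: boolean;
  concurrency: number;
  limit: number | null; // null means "process every record"
}

function resolveRunMode(argv: string[], env: Record<string, string | undefined>): RunMode {
  // Either the --test / -t flag or TEST_MODE=True turns test mode on.
  const flag = argv.indexOf("--test") !== -1 || argv.indexOf("-t") !== -1;
  const envFlag = (env["TEST_MODE"] ?? "").toLowerCase() === "true";
  const testMode = flag || envFlag;

  if (!testMode) {
    return {
      testMode: false,
      useProxy: true,
      concurrency: Number(env["CONCURRENCY"] ?? 200),
      limit: null,
    };
  }
  // Test mode: no proxy, concurrency fixed at 10, only the first TEST_LIMIT records.
  return {
    testMode: true,
    useProxy: false,
    concurrency: 10,
    limit: Number(env["TEST_LIMIT"] ?? 100),
  };
}
```

A caller would typically pass `process.argv.slice(2)` and `process.env`, then print the "TEST MODE ENABLED" banner when `testMode` is true.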
- CONCURRENCY: Number of concurrent scraper workers. Start with 200, increase if CPU allows.
- BATCH_SIZE: Database write batch size. 1000 is optimal for most cases.
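- REQUEST_TIMEOUT: HTTP request timeout. 15 seconds (15000 ms) is recommended; the sample .env above sets 15, so confirm whether src/config.ts reads this value as seconds or milliseconds.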
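A minimal sketch of how these settings might be read and validated is shown below. It does not reproduce src/config.ts: `ScraperConfig`, `intFromEnv`, and `loadScraperConfig` are hypothetical names, the defaults are copied from the sample .env, and the timeout is passed through as a plain number without assuming a unit.

```typescript
interface ScraperConfig {
  concurrency: number;
  batchSize: number;
  maxRetries: number;
  requestTimeout: number; // unit intentionally not assumed here
}

// Read a positive integer from the environment, falling back to a default when unset.
function intFromEnv(env: Record<string, string | undefined>, key: string, fallback: number): number {
  const raw = env[key];
  if (raw === undefined || raw.trim() === "") return fallback;
  const n = Number(raw);
  if (!Number.isInteger(n) || n <= 0) {
    throw new Error(`${key} must be a positive integer, got "${raw}"`);
  }
  return n;
}

function loadScraperConfig(env: Record<string, string | undefined>): ScraperConfig {
  return {
    concurrency: intFromEnv(env, "CONCURRENCY", 200),
    batchSize: intFromEnv(env, "BATCH_SIZE", 1000),
    maxRetries: intFromEnv(env, "MAX_RETRIES", 3),
    requestTimeout: intFromEnv(env, "REQUEST_TIMEOUT", 15),
  };
}
```

Failing fast on a malformed value, rather than silently falling back, makes a typo in .env visible before hundreds of workers start.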
The orchestrator prints progress every 10 seconds:
- Scraped: Number of successfully parsed records
- Failed: Number of failed requests/parses
- Saved: Number of records written to database
- Errors: Number of errors encountered
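The counters above could be tracked and printed along these lines. The output format, the requests-per-second figure, and the names `Stats`, `formatProgress`, and `startProgressReporter` are assumptions for illustration; the actual orchestrator output may differ.

```typescript
interface Stats { scraped: number; failed: number; saved: number; errors: number }

// Render one progress line; the rate is scraped records per elapsed second.
function formatProgress(stats: Stats, elapsedSeconds: number): string {
  const rate = elapsedSeconds > 0 ? stats.scraped / elapsedSeconds : 0;
  return (
    `Scraped: ${stats.scraped} | Failed: ${stats.failed} | ` +
    `Saved: ${stats.saved} | Errors: ${stats.errors} | ${rate.toFixed(1)} req/s`
  );
}

// Print a progress line on a fixed interval; returns a function that stops the timer.
function startProgressReporter(
  stats: Stats,
  intervalMs: number = 10000,
  log: (line: string) => void = console.log,
): () => void {
  const startedAt = Date.now();
  const timer = setInterval(() => {
    log(formatProgress(stats, (Date.now() - startedAt) / 1000));
  }, intervalMs);
  return () => clearInterval(timer);
}
```

Workers and the writer would mutate the shared `stats` object, and the stop function would be called once the pipeline drains.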
The system automatically checks the database for existing records at startup and skips them. You can safely stop and restart the scraper - it will continue from where it left off.
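The resume check amounts to a set difference between the CSV contents and the records already stored. A minimal sketch, assuming the existing USDOT numbers have already been loaded from the database into memory; `filterPending` is a hypothetical name, and the trimming and de-duplication of CSV values are additions for illustration:

```typescript
// Return the USDOT numbers that still need scraping, preserving CSV order.
function filterPending(allDots: string[], alreadyScraped: Iterable<string>): string[] {
  const done = new Set<string>(alreadyScraped); // O(1) membership checks
  const seen = new Set<string>();               // guards against duplicate CSV rows
  const pending: string[] = [];
  for (const raw of allDots) {
    const dot = raw.trim();
    if (dot === "" || done.has(dot) || seen.has(dot)) continue;
    seen.add(dot);
    pending.push(dot);
  }
  return pending;
}
```

Using a `Set` keeps the check linear in the size of the CSV, which matters when the input holds millions of USDOT numbers.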
- src/orchestrator.ts: Main async coordinator (use for production)
- src/config.ts: Configuration loader
- src/network.ts: Async HTTP client with proxy rotation
- src/parser.ts: HTML parsing logic (maps to carrier.types.ts)
- src/database.ts: Batch database operations
- src/types/carrier.types.ts: TypeScript type definitions
- src/schema.sql: PostgreSQL schema (flattened structure)
The scraper outputs data matching the exact Snapshot type from carrier.types.ts:
- All fields match the type definitions
- Arrays and nested objects are properly typed
- Only non-empty inspection summaries are included (per type comments)
- Only non-empty safety ratings are included (per type comments)
- Uses p-queue for concurrency control
- Proxy rotation: Each request gets a unique session ID to force IP rotation
- Database connections: Uses connection pooling (respects connection limits)
- Error handling: Failed batches are logged, network errors are retried automatically
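Per-request session IDs and automatic retries might look roughly like the sketch below. Several details are assumptions rather than facts about src/network.ts: the `base-session-id` username format depends on the proxy provider, the exponential backoff is one common retry policy that the source does not specify, and `newSessionId`, `proxyUserForSession`, and `withRetries` are hypothetical names.

```typescript
// Generate a short random session ID (not cryptographically secure).
function newSessionId(): string {
  return Math.random().toString(36).slice(2, 10);
}

// Assumed provider convention: a new session suffix on the username yields a new exit IP.
function proxyUserForSession(userBase: string, sessionId: string): string {
  return `${userBase}-session-${sessionId}`;
}

// Run `task` up to maxRetries + 1 times, waiting baseDelayMs * 2^attempt between tries.
async function withRetries<T>(
  task: (attempt: number) => Promise<T>,
  maxRetries: number,
  baseDelayMs: number = 500,
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    try {
      return await task(attempt);
    } catch (err) {
      lastError = err;
      if (attempt < maxRetries) {
        const delay = baseDelayMs * Math.pow(2, attempt);
        await new Promise<void>((resolve) => setTimeout(resolve, delay));
      }
    }
  }
  throw lastError; // every attempt failed
}
```

A request would then be wrapped as `withRetries(() => fetchWith(proxyUserForSession(base, newSessionId())), 3)`, where `fetchWith` stands for whatever HTTP call the client makes, so that each retry goes out through a fresh session.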