
JavDB Auto Spider


A Python + Rust automation system for extracting torrent links from javdb.com and automatically adding them to qBittorrent. Designed as an ingestion pipeline before scraping platforms like MDC-NG.

English | 简体中文

Features

  • Modular Spider — 14 specialized modules in packages/python/javdb_spider/ that fetch and filter entries with subtitle/today tags and extract magnet links in priority order
  • Rust Acceleration (optional) — PyO3 + maturin extension for 5-10x faster HTML parsing; falls back to pure Python automatically
  • Parallel Processing — Multi-threaded detail page fetching with one worker per proxy; auto-activates in pool mode with 2+ proxies
  • Torrent Classification — Priority-based categories: 字幕 (subtitle), hacked (UC无码破解 > UC > U无码破解 > U), no_subtitle
  • Dual Mode — Daily mode (default pages) and Ad Hoc mode (custom URLs for actors, tags, etc.)
  • qBittorrent Integration — Auto-upload torrents with categorization, file size filtering, and duplicate prevention
  • PikPak Bridge — Transfer old torrents from qBittorrent to PikPak cloud storage
  • History Tracking — SQLite/Cloudflare D1 dual storage with session-based rollback and pending-mode writes
  • Automated Pipeline — GitHub Actions workflows for daily ingestion, ad hoc scraping, file filtering, dedup, and more
  • Cross-Runner Coordination (optional) — Cloudflare Worker + Durable Objects for per-proxy throttling and login state sharing across concurrent runners
  • Re-download Detection — Automatically re-downloads when a significantly larger torrent becomes available for the same category
  • Email Notifications — Pipeline results with intelligent error detection (critical vs. non-critical)
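The priority-based classification above can be sketched in a few lines. This is an illustrative example only, not the project's actual code: the function name, category strings, and matching rules are hypothetical, and the real logic lives in packages/python/javdb_spider/. Note that the hacked tags are checked longest-first so that, e.g., a "UC无码破解" title is not mis-ranked as plain "UC".

```python
# Hypothetical sketch of priority-based torrent classification.
# Tags are ordered by priority; longer tags come first so substring
# matches (UC无码破解 vs. UC, U无码破解 vs. U) resolve correctly.
HACKED_RANK = ["UC无码破解", "UC", "U无码破解", "U"]

def classify(title: str) -> tuple[str, int]:
    """Return (category, priority) for a torrent title; lower number wins."""
    if "字幕" in title:
        return ("subtitle", 0)
    for rank, tag in enumerate(HACKED_RANK, start=1):
        if tag in title:
            return ("hacked", rank)
    return ("no_subtitle", len(HACKED_RANK) + 1)
```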

Quick Start

```bash
# Clone and install
git clone https://github.com/TongWu/JAVDB_AutoSpider_CICD.git
cd JAVDB_AutoSpider_CICD
pip install -r requirements.txt

# Configure
cp config.py.example config.py
# Edit config.py: set proxy, qBittorrent credentials, etc.

# Run
python3 -m apps.cli.spider              # Daily scraping
python3 -m apps.cli.spider --dry-run    # Preview without writing
python3 -m apps.cli.pipeline            # Full pipeline (spider + upload + notify)
```

For complete setup instructions, see the Local Setup Guide.

Architecture

```
apps/
├── cli/          Canonical CLI entrypoints (spider, pipeline, qb_uploader, etc.)
├── api/          FastAPI REST API
├── web/          Vite + Vue.js frontend
└── desktop/      Electron shell (MVP)

packages/
├── python/
│   ├── javdb_spider/        Spider package (14 modules)
│   ├── javdb_platform/      Platform services (db, proxy, logging)
│   ├── javdb_core/          Domain models and utilities
│   ├── javdb_ingestion/     Pipeline orchestration
│   ├── javdb_integrations/  External integrations (qB, PikPak, Rclone)
│   └── javdb_migrations/    Database migrations
└── rust/
    └── javdb_rust_core/     PyO3 Rust extension (optional)
```

Legacy paths (scripts/, pipeline.py, migration/, api/) are kept as compatibility wrappers.

Configuration

Copy config.py.example to config.py and configure:

```python
# Minimum required settings
PROXY_MODE = 'pool'                    # 'pool', 'single', or 'None'
PROXY_POOL = [{'name': 'Proxy-1', 'http': 'http://127.0.0.1:7890', 'https': 'http://127.0.0.1:7890'}]
QB_URL = 'https://192.168.1.100:8080'  # qBittorrent Web UI
QB_USERNAME = 'admin'
QB_PASSWORD = 'password'
```

For the full configuration reference (60+ options), see Configuration Guide.
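The "one worker per proxy" behaviour from the feature list could be wired up roughly as below. This is a minimal sketch under stated assumptions, not the project's implementation: the function names are hypothetical, the proxy dicts follow the PROXY_POOL shape shown above, and the real fetch logic lives in packages/python/javdb_spider/.

```python
# Illustrative sketch: multi-threaded detail fetching with one worker per proxy.
from concurrent.futures import ThreadPoolExecutor
from queue import Queue

PROXY_POOL = [
    {"name": "Proxy-1", "http": "http://127.0.0.1:7890", "https": "http://127.0.0.1:7890"},
    {"name": "Proxy-2", "http": "http://127.0.0.1:7891", "https": "http://127.0.0.1:7891"},
]

def fetch_detail(url: str, proxy: dict) -> str:
    # Placeholder for a real HTTP fetch routed through the given proxy.
    return f"{url} via {proxy['name']}"

def fetch_all(urls: list[str]) -> list[str]:
    proxies: Queue = Queue()
    for p in PROXY_POOL:
        proxies.put(p)

    def worker(url: str) -> str:
        proxy = proxies.get()      # borrow an idle proxy...
        try:
            return fetch_detail(url, proxy)
        finally:
            proxies.put(proxy)     # ...and return it for the next task

    # One thread per proxy, mirroring pool-mode parallelism (2+ proxies).
    with ThreadPoolExecutor(max_workers=len(PROXY_POOL)) as pool:
        return list(pool.map(worker, urls))
```

Borrowing proxies from a queue guarantees no two in-flight requests share a proxy, which is what makes per-proxy throttling tractable.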

Environment Variables

| Variable | Default | Description |
|---|---|---|
| `STORAGE_BACKEND` | `sqlite` | `sqlite`, `d1`, or `dual` |
| `WRITE_MODE` | `pending` | `pending` (default) or `audit` (legacy, sunset 2026-08-13) |
| `LOG_LEVEL` | `INFO` | `DEBUG`, `INFO`, `WARNING`, `ERROR` |
| `STRICT_DUAL_WRITE` | unset | Set `1` to fail on D1 write errors |
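For example, a CI/CD step might export these variables before invoking the pipeline (values shown are one possible combination, not required settings):

```shell
export STORAGE_BACKEND=dual      # mirror writes to SQLite and D1
export WRITE_MODE=pending        # commit history rows only on success
export LOG_LEVEL=DEBUG
export STRICT_DUAL_WRITE=1       # fail hard on D1 write errors
python3 -m apps.cli.pipeline
```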

Common Commands

```bash
# Spider
python3 -m apps.cli.spider                                    # Daily scraping
python3 -m apps.cli.spider --url "https://javdb.com/actors/EvkJ"  # Ad hoc mode
python3 -m apps.cli.spider --use-proxy --phase 1              # Force proxy, phase 1 only
python3 -m apps.cli.spider --ignore-release-date              # All entries, not just today

# Pipeline
python3 -m apps.cli.pipeline                                  # Full workflow
python3 -m apps.cli.pipeline --use-proxy                      # With proxy override

# Uploaders
python3 -m apps.cli.qb_uploader                               # Upload to qBittorrent
python3 -m apps.cli.qb_file_filter --min-size 100 --dry-run   # Filter small files

# Maintenance
python3 -m apps.cli.migration --help                           # Database migrations
python3 -m apps.cli.rollback --session-id 332                  # Rollback a session
python3 -m apps.cli.login                                      # Refresh JavDB session cookie
```

For the full CLI reference, see CLI Reference.

Deployment Options

| Method | Guide | Best For |
|---|---|---|
| Local | Local Setup | Development, manual runs |
| GitHub Actions | GH Actions Setup | Automated daily pipeline |
| Docker | Docker Deploy | Self-hosted server |
| Proxy Coordinator | Proxy Coordinator | Multi-runner coordination |

GitHub Actions Workflows

| Workflow | Trigger | Description |
|---|---|---|
| `DailyIngestion.yml` | Cron 12:00 UTC + manual | Daily scraping pipeline |
| `AdHocIngestion.yml` | Manual | Custom URL scraping |
| `QBFileFilter.yml` | Cron 16:00 UTC + manual | Filter small files (4h after daily) |
| `WeeklyDedup.yml` | Cron Sunday + manual | Rclone deduplication |
| `RollbackD1.yml` | Manual | Session rollback |
| `StaleSessionCleanup.yml` | Cron daily 02:00 UTC | Clean up stuck sessions (>48h) |
| `AuditArchive.yml` | Cron weekly Monday | Prune old audit rows |
| `Migration.yml` | Manual | Database migration runner |
| `TestIngestion.yml` | Manual | Dry-run test pipeline |

Storage Backend

The system supports three storage modes via STORAGE_BACKEND:

  • SQLite (default) — Local files in reports/ (history.db, reports.db, operations.db)
  • D1 — Cloudflare D1 for GitHub Actions environments
  • Dual — Writes mirror to both; reads from D1

Every pipeline run is tagged with a session ID and follows the lifecycle: in_progress → finalizing → committed / failed. Pending-mode writes only land in history tables at commit time; failed runs delete pending rows cleanly.
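The pending-mode lifecycle above can be sketched against SQLite. This is an illustrative model only: the table and column names are hypothetical, and the real schema is defined in the javdb_migrations package.

```python
import sqlite3

# Sketch of the session lifecycle: in_progress -> finalizing -> committed/failed.
db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE sessions (id INTEGER PRIMARY KEY, state TEXT);
    CREATE TABLE pending_history (session_id INTEGER, magnet TEXT);
    CREATE TABLE history (session_id INTEGER, magnet TEXT);
""")

def start_session() -> int:
    cur = db.execute("INSERT INTO sessions (state) VALUES ('in_progress')")
    return cur.lastrowid

def write_pending(session_id: int, magnet: str) -> None:
    # During the run, writes accumulate only in the pending table.
    db.execute("INSERT INTO pending_history VALUES (?, ?)", (session_id, magnet))

def commit_session(session_id: int) -> None:
    # Pending rows land in the history table only at commit time.
    db.execute("UPDATE sessions SET state='finalizing' WHERE id=?", (session_id,))
    db.execute("INSERT INTO history SELECT * FROM pending_history WHERE session_id=?",
               (session_id,))
    db.execute("DELETE FROM pending_history WHERE session_id=?", (session_id,))
    db.execute("UPDATE sessions SET state='committed' WHERE id=?", (session_id,))
    db.commit()

def fail_session(session_id: int) -> None:
    # Failed runs delete their pending rows cleanly; history stays untouched.
    db.execute("DELETE FROM pending_history WHERE session_id=?", (session_id,))
    db.execute("UPDATE sessions SET state='failed' WHERE id=?", (session_id,))
    db.commit()
```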

For rollback procedures, see D1 Rollback Guide.

Documentation

For Self-Hosters

For Developers

For Operators

Other Resources

Security

  • Never commit config.py (excluded in .gitignore)
  • Do not commit files under reports/
  • Use GitHub personal access tokens, not passwords
  • Store sensitive values in environment variables for CI/CD
  • Session cookies auto-expire; refresh via python3 -m apps.cli.login

Contributing

Contributions are welcome! Please feel free to submit issues or pull requests.

License

This project is for educational and personal use only. Please respect the terms of service of the websites you scrape.