
Claude Code Documentation Scraper

A Python tool that downloads Claude Code documentation from Anthropic's website, converts it to Markdown, and serves it locally. Features date-based archiving and idempotent downloads for crash recovery.

Features

  • Download HTML docs from https://docs.anthropic.com/en/docs/claude-code/
  • Convert to Markdown with clean content extraction
  • Local web server for offline browsing
  • Date-based archiving - automatic snapshots by date
  • Idempotent downloads - same-date re-runs skip completed work and cleanly restart interrupted ones
  • Historical versions preserved in archive/YYYYMMDD/

Quick Start

Prerequisites

Install uv - a fast Python package installer:

curl -LsSf https://astral.sh/uv/install.sh | sh

Setup

# Clone and setup
git clone <repo-url>
cd claude-code-docs
make setup

What it does: Creates a virtual environment using uv venv and installs dependencies via uv sync.

Download Documentation

# Download and convert to Markdown (recommended)
make run-full

What it does: Runs uv run python main.py --html --md to download the HTML and convert it to Markdown.

Makefile Commands

Run make help to see all available commands:

Core Commands

Command        Description                     Underlying Command
make setup     First-time setup                uv venv && uv sync
make run-full  Download & convert              uv run python main.py --html --md
make run-html  Download HTML only              uv run python main.py --html
make run-md    Convert to Markdown             uv run python main.py --md
make serve     Start web server (port 8000)    uv run python main.py --serve

Utility Commands

Command              Description
make info            Show download status and stats
make archive-status  Show current date and existing archives
make check           Verify dependencies
make clean           Remove downloads/ and archive/
make clean-all       Remove downloads/, archive/, and .venv/

How It Works

Idempotency & Archiving

The scraper uses downloads/meta.json to track download date and completion status:

{
  "download_date": "20251005",
  "status": "completed"
}

Same-date re-runs:

  • Completed (status: "completed"): Skips entirely - already done
  • Incomplete (status: "processing"): Cleans up and restarts from scratch
    • Detects crashed/interrupted downloads
    • Ensures consistency by removing partial data
    • Provides true crash recovery

Different-date re-runs (archiving):

  • Moves old download to archive/YYYYMMDD/
  • Starts fresh download with new date
  • Preserves historical snapshots
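
A minimal sketch of this decision logic, assuming the file layout described above (illustrative only, not the actual main.py code):

import json
import shutil
from datetime import date
from pathlib import Path

DOWNLOADS = Path("downloads")
ARCHIVE = Path("archive")

def prepare_run(today: str | None = None) -> bool:
    """Reconcile downloads/meta.json with today's date.

    Returns True if a fresh download should proceed, False if the run
    can be skipped because today's download already completed.
    """
    today = today or date.today().strftime("%Y%m%d")
    meta_path = DOWNLOADS / "meta.json"
    if not meta_path.exists():
        return True  # first run: nothing to reconcile

    meta = json.loads(meta_path.read_text())
    if meta.get("download_date") == today:
        if meta.get("status") == "completed":
            print(f"Download already completed for {today} - skipping (idempotent)")
            return False
        # Crashed or interrupted run: remove partial data and start over.
        print("WARNING: Previous download was incomplete")
        for name in ("html", "md", "db.yaml"):
            target = DOWNLOADS / name
            if target.is_dir():
                shutil.rmtree(target)
            elif target.exists():
                target.unlink()
        return True

    # Different date: move the previous snapshot aside, then start fresh.
    snapshot = ARCHIVE / meta["download_date"]
    print(f"Archiving previous download to {snapshot}/")
    snapshot.mkdir(parents=True, exist_ok=True)
    for name in ("db.yaml", "html", "md"):
        src = DOWNLOADS / name
        if src.exists():
            shutil.move(str(src), str(snapshot / name))
    return True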

Example Workflow

# Day 1 (Oct 5) - First run
make run-full
# Creates: meta.json {"download_date": "20251005", "status": "processing"}
# On success: Updates to {"status": "completed"}

# Day 1 - Re-run after successful completion
make run-full
# Output: "Download already completed for 20251005 - skipping (idempotent)"

# Day 1 - Simulate crash (edit meta.json: set status to "processing")
make run-full
# Output: "WARNING: Previous download was incomplete"
# Cleans up html/, md/, db.yaml and restarts fresh

# Day 2 (Oct 6)
make run-full
# Output: "Archiving previous download to archive/20251005/"
# Creates: archive/20251005/{db.yaml,html,md}
# Starts fresh in downloads/

Directory Structure

claude-code-docs/
├── main.py             # Scraper implementation
├── Makefile            # Primary management interface
├── pyproject.toml      # uv package config
├── downloads/          # Current download (tracked in git)
│   ├── meta.json       # Download date tracker
│   ├── db.yaml         # See note below
│   ├── html/           # HTML files
│   └── md/             # Markdown files
└── archive/            # Historical downloads (git-ignored)
    └── 20251005/       # Date-based snapshots
        ├── db.yaml
        ├── html/
        └── md/

Note: db.yaml is now stored at downloads/db.yaml (moved from downloads/html/db.yaml) for easier visibility. When archiving, it moves to archive/YYYYMMDD/db.yaml alongside the html/ and md/ folders.

Advanced Usage

Custom Port for Web Server

make serve-port PORT=8080
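
The Development section below describes the server as Flask-based. As a hypothetical sketch (the real main.py may differ), serving the converted Markdown on a configurable port could look like this:

import argparse
from pathlib import Path

from flask import Flask, send_from_directory

DOCS_ROOT = Path("downloads/md").resolve()  # serve the converted Markdown

app = Flask(__name__)

@app.route("/", defaults={"filename": "index.md"})  # "index.md" is an assumed default page
@app.route("/<path:filename>")
def serve_doc(filename):
    # send_from_directory refuses paths that escape DOCS_ROOT
    return send_from_directory(DOCS_ROOT, filename)

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--port", type=int, default=8000)
    args = parser.parse_args()
    app.run(port=args.port)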

Direct Python Usage

# Activate virtual environment first
source .venv/bin/activate

# Download from custom URL
python main.py --html --url https://docs.anthropic.com/en/docs/claude-code/quickstart

# Serve on custom port
python main.py --serve --port 8080

Check Current State

# Show current download info
make info

# Show archive history
make archive-status

Development

The scraper is implemented in main.py as a single-class application with these key components:

  • HTML Downloading - Recursive crawling with URL tracking
  • Content Extraction - Multi-strategy main content detection
  • Link Rewriting - Absolute to relative path conversion
  • Markdown Conversion - html2text with custom cleaning
  • Date Management - meta.json tracking and archiving logic
  • Local Serving - Flask-based web server
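
For a flavor of the extraction and conversion steps, here is a sketch using BeautifulSoup (an assumption; the README only names html2text) with an illustrative selector list, not the exact strategies in main.py:

from bs4 import BeautifulSoup
import html2text

def extract_main_content(html: str) -> str:
    """Try progressively broader selectors for the main content area."""
    soup = BeautifulSoup(html, "html.parser")
    for selector in ("main", "article", "div#content"):  # hypothetical strategy order
        node = soup.select_one(selector)
        if node is not None:
            return str(node)
    return str(soup.body or soup)  # last resort: the whole page

def to_markdown(fragment: str) -> str:
    """Convert an extracted HTML fragment to Markdown via html2text."""
    converter = html2text.HTML2Text()
    converter.body_width = 0  # don't hard-wrap long lines
    return converter.handle(fragment)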

See CLAUDE.md for detailed architecture documentation.

Automated Daily Releases

This repository uses GitHub Actions to automatically download and release Claude Code documentation daily.

How It Works

  1. Schedule: Runs daily at midnight UTC via cron (0 0 * * *)
  2. Process:
    • Sets up environment with uv
    • Runs make run-full to download and convert docs
    • Creates release with tag docs-YYYYMMDD
    • Commits downloads folder to repository
    • Pushes changes to main branch
  3. Assets: Each release includes:
    • claude-code-docs-YYYYMMDD.tar.gz - Complete download
    • html-YYYYMMDD.tar.gz - HTML files only
    • md-YYYYMMDD.tar.gz - Markdown files only
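
For reference, the YYYYMMDD stamp in the tag and asset names resolves like this (assuming the stamp is taken in UTC to match the schedule):

from datetime import datetime, timezone

stamp = datetime.now(timezone.utc).strftime("%Y%m%d")  # e.g. "20251005"
tag = f"docs-{stamp}"
assets = [
    f"claude-code-docs-{stamp}.tar.gz",  # complete download
    f"html-{stamp}.tar.gz",              # HTML files only
    f"md-{stamp}.tar.gz",                # Markdown files only
]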

Note: The downloads/ folder is tracked in git, so each day's documentation is both:

  • Released as downloadable archives
  • Committed to the repository for easy browsing on GitHub

Using Released Documentation

# Download latest release
wget https://github.com/[your-repo]/releases/latest/download/claude-code-docs-YYYYMMDD.tar.gz

# Extract
tar -xzf claude-code-docs-YYYYMMDD.tar.gz

# Browse
cd downloads/md

Manual Trigger

You can manually trigger the workflow from the GitHub Actions tab:

  1. Go to Actions → Daily Documentation Release
  2. Click "Run workflow"
  3. Wait for completion and check Releases

License

[Your License Here]
