
Claude Code Documentation Scraper

A Python tool that downloads Claude Code documentation from Anthropic's website, converts it to Markdown, and serves it locally. Features date-based archiving and idempotent downloads for crash recovery.

Features

  • Download HTML docs from https://docs.anthropic.com/en/docs/claude-code/
  • Convert to Markdown with clean content extraction
  • Local web server for offline browsing
  • Date-based archiving - automatic snapshots by date
  • Idempotent downloads - same-date re-runs skip completed work and cleanly restart interrupted ones
  • Historical versions preserved in archive/YYYYMMDD/

Quick Start

Prerequisites

Install uv - a fast Python package installer:

curl -LsSf https://astral.sh/uv/install.sh | sh

Setup

# Clone and setup
git clone <repo-url>
cd claude-code-docs
make setup

What it does: Creates a virtual environment using uv venv and installs dependencies via uv sync.

Download Documentation

# Download and convert to Markdown (recommended)
make run-full

What it does: Runs uv run python main.py --html --md to download the HTML and convert it to Markdown.

Makefile Commands

Run make help to see all available commands:

Core Commands

Command        Description                     Underlying Command
make setup     First-time setup                uv venv && uv sync
make run-full  Download & convert              uv run python main.py --html --md
make run-html  Download HTML only              uv run python main.py --html
make run-md    Convert to Markdown             uv run python main.py --md
make serve     Start web server (port 8000)    uv run python main.py --serve

Utility Commands

Command              Description
make info            Show download status and stats
make archive-status  Show current date and existing archives
make check           Verify dependencies
make clean           Remove downloads/ and archive/
make clean-all       Remove downloads/, archive/, and .venv/

How It Works

Idempotency & Archiving

The scraper uses downloads/meta.json to track download date and completion status:

{
  "download_date": "20251005",
  "status": "completed"
}

Same-date re-runs:

  • Completed (status: "completed"): Skips entirely - already done
  • Incomplete (status: "processing"): Cleans up and restarts from scratch
    • Detects crashed/interrupted downloads
    • Ensures consistency by removing partial data
    • Provides true crash recovery

Different-date re-runs (archiving):

  • Moves old download to archive/YYYYMMDD/
  • Starts fresh download with new date
  • Preserves historical snapshots
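
A minimal sketch of this decision logic, assuming the file layout described above (illustrative only, not the actual main.py code):

import json
import shutil
from datetime import date
from pathlib import Path

DOWNLOADS = Path("downloads")
ARCHIVE = Path("archive")

def prepare_run(today: str | None = None) -> bool:
    """Reconcile downloads/meta.json with today's date.

    Returns True if a fresh download should proceed, False if the run
    can be skipped because today's download already completed.
    """
    today = today or date.today().strftime("%Y%m%d")
    meta_path = DOWNLOADS / "meta.json"
    if not meta_path.exists():
        return True  # first run: nothing to reconcile

    meta = json.loads(meta_path.read_text())
    if meta.get("download_date") == today:
        if meta.get("status") == "completed":
            print(f"Download already completed for {today} - skipping (idempotent)")
            return False
        # Crashed or interrupted run: remove partial data and start over.
        print("WARNING: Previous download was incomplete")
        for name in ("html", "md", "db.yaml"):
            target = DOWNLOADS / name
            if target.is_dir():
                shutil.rmtree(target)
            elif target.exists():
                target.unlink()
        return True

    # Different date: move the previous snapshot aside, then start fresh.
    snapshot = ARCHIVE / meta["download_date"]
    print(f"Archiving previous download to {snapshot}/")
    snapshot.mkdir(parents=True, exist_ok=True)
    for name in ("db.yaml", "html", "md"):
        src = DOWNLOADS / name
        if src.exists():
            shutil.move(str(src), str(snapshot / name))
    return True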

Example Workflow

# Day 1 (Oct 5) - First run
make run-full
# Creates: meta.json {"download_date": "20251005", "status": "processing"}
# On success: Updates to {"status": "completed"}

# Day 1 - Re-run after successful completion
make run-full
# Output: "Download already completed for 20251005 - skipping (idempotent)"

# Day 1 - Simulate crash (edit meta.json: set status to "processing")
make run-full
# Output: "WARNING: Previous download was incomplete"
# Cleans up html/, md/, db.yaml and restarts fresh

# Day 2 (Oct 6)
make run-full
# Output: "Archiving previous download to archive/20251005/"
# Creates: archive/20251005/{db.yaml,html,md}
# Starts fresh in downloads/

Directory Structure

claude-code-docs/
├── main.py             # Scraper implementation
├── Makefile            # Primary management interface
├── pyproject.toml      # uv package config
├── downloads/          # Current download (tracked in git)
│   ├── meta.json       # Download date tracker
│   ├── db.yaml         # See note below
│   ├── html/           # HTML files
│   └── md/             # Markdown files
└── archive/            # Historical downloads (git-ignored)
    └── 20251005/       # Date-based snapshots
        ├── db.yaml
        ├── html/
        └── md/

Note: db.yaml is now stored at downloads/db.yaml (moved from downloads/html/db.yaml) for easier visibility. When archiving, it moves to archive/YYYYMMDD/db.yaml alongside the html/ and md/ folders.

Advanced Usage

Custom Port for Web Server

make serve-port PORT=8080
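
The Development section below describes the server as Flask-based. As a hypothetical sketch (the real main.py may differ), serving the converted Markdown on a configurable port could look like this:

import argparse
from pathlib import Path

from flask import Flask, send_from_directory

DOCS_ROOT = Path("downloads/md").resolve()  # serve the converted Markdown

app = Flask(__name__)

@app.route("/", defaults={"filename": "index.md"})  # "index.md" is an assumed default page
@app.route("/<path:filename>")
def serve_doc(filename):
    # send_from_directory refuses paths that escape DOCS_ROOT
    return send_from_directory(DOCS_ROOT, filename)

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--port", type=int, default=8000)
    args = parser.parse_args()
    app.run(port=args.port)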

Direct Python Usage

# Activate virtual environment first
source .venv/bin/activate

# Download from custom URL
python main.py --html --url https://docs.anthropic.com/en/docs/claude-code/quickstart

# Serve on custom port
python main.py --serve --port 8080

Check Current State

# Show current download info
make info

# Show archive history
make archive-status

Development

The scraper is implemented in main.py as a single-class application with these key components:

  • HTML Downloading - Recursive crawling with URL tracking
  • Content Extraction - Multi-strategy main content detection
  • Link Rewriting - Absolute to relative path conversion
  • Markdown Conversion - html2text with custom cleaning
  • Date Management - meta.json tracking and archiving logic
  • Local Serving - Flask-based web server
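
For a flavor of the extraction and conversion steps, here is a sketch using BeautifulSoup (an assumption; the README only names html2text) with an illustrative selector list, not the exact strategies in main.py:

from bs4 import BeautifulSoup
import html2text

def extract_main_content(html: str) -> str:
    """Try progressively broader selectors for the main content area."""
    soup = BeautifulSoup(html, "html.parser")
    for selector in ("main", "article", "div#content"):  # hypothetical strategy order
        node = soup.select_one(selector)
        if node is not None:
            return str(node)
    return str(soup.body or soup)  # last resort: the whole page

def to_markdown(fragment: str) -> str:
    """Convert an extracted HTML fragment to Markdown via html2text."""
    converter = html2text.HTML2Text()
    converter.body_width = 0  # don't hard-wrap long lines
    return converter.handle(fragment)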

See CLAUDE.md for detailed architecture documentation.

Automated Daily Releases

This repository uses GitHub Actions to automatically download and release Claude Code documentation daily.

How It Works

  1. Schedule: Runs daily at midnight UTC via cron (0 0 * * *)
  2. Process:
    • Sets up environment with uv
    • Runs make run-full to download and convert docs
    • Creates release with tag docs-YYYYMMDD
    • Commits downloads folder to repository
    • Pushes changes to main branch
  3. Assets: Each release includes:
    • claude-code-docs-YYYYMMDD.tar.gz - Complete download
    • html-YYYYMMDD.tar.gz - HTML files only
    • md-YYYYMMDD.tar.gz - Markdown files only
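
For reference, the YYYYMMDD stamp in the tag and asset names resolves like this (assuming the stamp is taken in UTC to match the schedule):

from datetime import datetime, timezone

stamp = datetime.now(timezone.utc).strftime("%Y%m%d")  # e.g. "20251005"
tag = f"docs-{stamp}"
assets = [
    f"claude-code-docs-{stamp}.tar.gz",  # complete download
    f"html-{stamp}.tar.gz",              # HTML files only
    f"md-{stamp}.tar.gz",                # Markdown files only
]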

Note: The downloads/ folder is tracked in git, so each day's documentation is both:

  • Released as downloadable archives
  • Committed to the repository for easy browsing on GitHub

Using Released Documentation

# Download latest release
wget https://github.com/[your-repo]/releases/latest/download/claude-code-docs-YYYYMMDD.tar.gz

# Extract
tar -xzf claude-code-docs-YYYYMMDD.tar.gz

# Browse
cd downloads/md

Manual Trigger

You can manually trigger the workflow from the GitHub Actions tab:

  1. Go to Actions → Daily Documentation Release
  2. Click "Run workflow"
  3. Wait for completion and check Releases

License

[Your License Here]
