A command-line tool for downloading Drupal site content with CAS authentication support. Enumerates content via /admin/content and converts pages to Markdown.
- CAS Authentication: Automatically detects CAS authentication and opens a browser for interactive login
- Content Enumeration: Paginates through `/admin/content` to find all site content
- Markdown Conversion: Converts HTML pages to clean Markdown format
- Section Trimming: Configurable removal of navigation, menus, and other boilerplate sections
- Progress Indicator: Visual progress bar during download
- Safety First: Filters out dangerous action links (edit, delete, replicate, etc.)
# Clone the repository
git clone https://github.com/YOUR_USERNAME/drudl.git
cd drudl
# Create virtual environment
python3 -m venv venv
source venv/bin/activate
# Install dependencies
pip install -r requirements.txt

- Python 3.8+
- Chrome browser (for Selenium-based authentication)
python drudl https://your-drupal-site.com -o output_directory

Remove specific sections from the downloaded Markdown:
# Using individual flags
python drudl https://example.com \
--trim-section "## Footer" \
--trim-section "## Main Menu" \
--trim-section "## Sidebar"
# Using a file
python drudl https://example.com --trim-file trim_sections.txt

Create a text file with one section name per line (matches any heading level):
# Comments start with hash
Footer
Main Menu
Navigation
Sidebar
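
The trim-file format above can be illustrated with a minimal sketch. This is not drudl's actual code: the function names `load_trim_names` and `trim_sections` are hypothetical, and case-insensitive matching is an assumption. It treats a section as the matching heading plus everything up to the next heading of the same or a higher level, and it does not special-case `#` lines inside code blocks.

```python
import re

# A Markdown ATX heading: one to six '#' characters, a space, then the title.
HEADING_RE = re.compile(r"^(#{1,6})\s+(.+?)\s*$")


def load_trim_names(text):
    """Parse trim-file text: one section name per line, '#' lines are comments."""
    names = []
    for line in text.splitlines():
        line = line.strip()
        if line and not line.startswith("#"):
            names.append(line.lower())  # assumption: matching ignores case
    return names


def trim_sections(markdown, names):
    """Drop each named section, whatever its heading level."""
    wanted = set(names)
    kept = []
    skip_level = None  # level of the heading currently being removed
    for line in markdown.splitlines():
        match = HEADING_RE.match(line)
        if match:
            level = len(match.group(1))
            title = match.group(2).lower()
            if skip_level is not None and level <= skip_level:
                skip_level = None  # the removed section has ended
            if skip_level is None and title in wanted:
                skip_level = level
        if skip_level is None:
            kept.append(line)
    return "\n".join(kept)
```

Subsections nested under a trimmed heading are removed along with it, which is usually what is wanted for boilerplate such as footers.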
| Option | Description |
|---|---|
| `url` | Base URL of the Drupal site (required) |
| `-o, --output` | Output directory (default: downloaded_site) |
| `--trim-section` | Markdown header to remove (can be used multiple times) |
| `--trim-file` | File containing headers to trim, one per line |
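
For readers who want to see how these options fit together, here is a sketch of an equivalent `argparse` definition. It mirrors only the documented options and default; the `build_parser` name, help strings, and metavars are assumptions and may differ from drudl's own parser.

```python
import argparse


def build_parser():
    """Build a parser matching the documented options (illustrative only)."""
    parser = argparse.ArgumentParser(
        prog="drudl",
        description="Download Drupal site content as Markdown.",
    )
    parser.add_argument("url", help="Base URL of the Drupal site")
    parser.add_argument(
        "-o", "--output",
        default="downloaded_site",
        help="Output directory",
    )
    parser.add_argument(
        "--trim-section",
        action="append",  # repeatable: each use adds one header to the list
        default=[],
        metavar="HEADER",
        help="Markdown header to remove",
    )
    parser.add_argument(
        "--trim-file",
        metavar="PATH",
        help="File containing headers to trim, one per line",
    )
    return parser
```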
- Connection Test: Verifies the site is accessible
- Authentication: If CAS authentication is detected, opens Chrome for interactive login
- Cookie Transfer: Transfers authenticated session cookies from browser to requests
- Enumeration: Paginates through `/admin/content` to collect all content URLs
- Download: Fetches each page, converts to Markdown, and saves to output directory
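
The enumeration step can be sketched roughly as follows. This is an illustration under stated assumptions, not drudl's implementation: `fetch_page` is a hypothetical callable that returns the HTML of one listing page (for example a GET of `/admin/content?page=N` with the authenticated session), and the loop stops at the first page that contributes no new links. A real run would also restrict the links to content rows rather than collecting every anchor.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def enumerate_content(base_url, fetch_page, max_pages=1000):
    """Walk listing pages 0, 1, 2, ... until a page adds no new URLs."""
    urls = []
    seen = set()
    for page in range(max_pages):
        collector = LinkCollector()
        collector.feed(fetch_page(page))
        added = 0
        for href in collector.hrefs:
            url = urljoin(base_url, href)  # resolve relative links
            if url not in seen:
                seen.add(url)
                urls.append(url)
                added += 1
        if added == 0:
            break  # past the last page, or nothing new was found
    return urls
```

The `max_pages` cap is a guard against a listing that never stops producing new links.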
The authenticated user must have permission to access /admin/content. This typically requires:
- "Administer content" permission, or
- "Access content overview" permission
- The script only makes GET requests (never POST/DELETE)
- Dangerous action URLs are filtered out (edit, delete, replicate, unpublish, etc.)
- Uses a standard browser User-Agent to avoid bot detection
- Session cookies are stored in memory only (not persisted to disk)
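
One way such a filter could work is to compare whole path segments against a blocklist. The sketch below is an assumption about the approach, not the script's actual rules: `DANGEROUS_SEGMENTS` contains only the four actions named above, and `is_safe_url` is a hypothetical helper.

```python
from urllib.parse import urlparse

# Only the actions named in this README; the real list is likely longer.
DANGEROUS_SEGMENTS = {"edit", "delete", "replicate", "unpublish"}


def is_safe_url(url):
    """Return False if any path segment names a state-changing action."""
    segments = [part.lower() for part in urlparse(url).path.split("/") if part]
    return not any(part in DANGEROUS_SEGMENTS for part in segments)
```

Matching whole segments rather than substrings keeps a page such as `/editorial-policy` downloadable while still rejecting `/node/12/edit`.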
source venv/bin/activate
python -m pytest test_drudl.py -v

The Docker version includes a noVNC desktop for interactive CAS authentication.
- Copy the example configuration:

      cp docker-compose.example.yml docker-compose.yml

- Edit `docker-compose.yml` to set your target URL in the `command` line
- Create output directory and run:

      mkdir -p output
      docker-compose up --build

- Open http://localhost:6080 in your browser to view the Chromium window
- Complete CAS authentication when prompted
The download will resume automatically if interrupted. State is saved in output/.drudl_state.json.
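
Resumable downloads of this kind are commonly built on a small JSON state file. The sketch below shows the general idea only: the actual layout of `.drudl_state.json` is not documented here, so the `{"completed": [...]}` structure and the helper names are hypothetical.

```python
import json
import os


def load_completed(state_path):
    """Return the set of URLs already downloaded, or an empty set."""
    try:
        with open(state_path, encoding="utf-8") as handle:
            return set(json.load(handle).get("completed", []))
    except FileNotFoundError:
        return set()


def save_completed(state_path, completed):
    """Write the state to a temp file, then swap it into place."""
    tmp_path = state_path + ".tmp"
    with open(tmp_path, "w", encoding="utf-8") as handle:
        json.dump({"completed": sorted(completed)}, handle)
    os.replace(tmp_path, state_path)  # avoids a half-written state file


def download_all(urls, state_path, download_one):
    """Download each URL once, recording progress after every success."""
    completed = load_completed(state_path)
    for url in urls:
        if url in completed:
            continue  # finished in an earlier run
        download_one(url)
        completed.add(url)
        save_completed(state_path, completed)
    return completed
```

Saving after every page means an interrupted run repeats at most the page that was in flight.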
# Build the image
docker build -t drudl .
# Run with noVNC (access browser at http://localhost:6080)
mkdir -p output
docker run -p 6080:6080 -v "$(pwd)/output:/app/output" drudl \
https://your-drupal-site.com -o /app/output
# With trim file
docker run -p 6080:6080 -v "$(pwd)/output:/app/output" \
-v "$(pwd)/trim_sections.txt:/app/trim_sections.txt" \
drudl https://your-drupal-site.com -o /app/output --trim-file /app/trim_sections.txt