This project crawls Microsoft Learn's AZ-104 Azure Administrator certification course content and prepares it for Vietnamese translation. It extracts structured learning materials from Microsoft's online training platform while preserving the original format and organization.
- Content Extraction: Systematically crawl all AZ-104 course materials from Microsoft Learn
- Translation Preparation: Structure content for Vietnamese localization with built-in translation placeholders
- Content Preservation: Maintain Microsoft's original course structure and formatting
- Offline Access: Enable offline study and translation work
- Hierarchical content organization (Learning Path → Module → Unit)
- Respectful web crawling with rate limiting and retry logic
- HTML content cleaning and standardization
- Bilingual content structure with translation placeholders
- Comprehensive course structure metadata in JSON format
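The rate limiting and retry behaviour listed above can be illustrated with a minimal sketch. This is not the code in az104_complete_crawler.py: the names `RateLimiter` and `fetch_with_retry`, the default delays, and the exponential backoff policy are all assumptions chosen for illustration.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class RateLimiter:
    """Enforce a minimum interval between consecutive requests (hypothetical helper)."""

    def __init__(self, min_interval: float,
                 clock: Callable[[], float] = time.monotonic,
                 sleep: Callable[[float], None] = time.sleep) -> None:
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last: Optional[float] = None  # time of the previous request

    def wait(self) -> None:
        """Block until at least min_interval seconds have passed since the last call."""
        if self._last is not None:
            remaining = self.min_interval - (self._clock() - self._last)
            if remaining > 0:
                self._sleep(remaining)
        self._last = self._clock()


def fetch_with_retry(fetch: Callable[[], T], limiter: RateLimiter,
                     max_attempts: int = 3, base_delay: float = 1.0,
                     sleep: Callable[[float], None] = time.sleep) -> T:
    """Call fetch() through the limiter, retrying with exponential backoff."""
    for attempt in range(1, max_attempts + 1):
        limiter.wait()
        try:
            return fetch()
        except Exception:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error
            # Assumed policy: wait base_delay, then 2x, then 4x, ...
            sleep(base_delay * 2 ** (attempt - 1))
    raise ValueError("max_attempts must be at least 1")
```

In a crawler, `fetch` would wrap whatever call loads a single page, so every request passes through the same limiter regardless of how many retries it needs.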
The crawler has successfully collected 260 units across 31 modules in 6 learning paths, covering the complete AZ-104 certification curriculum.
- Python 3.8+
- Node.js (for Playwright)
# Install dependencies
python setup_and_run.py
# Or manually:
pip install -r requirements.txt
playwright install chromium

# Run the main crawler
python az104_complete_crawler.py

# Clean and format the crawled content
python advanced_cleanup.py

az104/
├── az104_course_content/
│ ├── english_original/ # Crawled English content
│ ├── vietnamese_translation/ # Vietnamese translations
│ ├── assets/ # Images and resources
│ ├── course_structure.json # Course metadata
│ └── CRAWL_SUMMARY.md # Crawl statistics
├── *.py # Crawler and utility scripts
├── requirements.txt # Python dependencies
└── README.md # This file
- Prerequisites for Azure administrators (5 modules, 38 units)
- Manage identities and governance in Azure (6 modules, 53 units)
- Configure and manage virtual networks (8 modules, 60 units)
- Implement and manage storage in Azure (4 modules, 39 units)
- Deploy and manage Azure compute resources (5 modules, 48 units)
- Monitor and back up Azure resources (3 modules, 22 units)
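The per-path counts above can be cross-checked against the reported totals with a short sketch. The `LearningPath` dataclass and the `totals` helper are hypothetical names used only for this check. The actual schema of course_structure.json is not shown in this README, so nothing here should be read as its format.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class LearningPath:
    """One learning path with its module and unit counts (illustrative only)."""
    title: str
    modules: int
    units: int


# Counts copied from the learning path list in this README.
LEARNING_PATHS: List[LearningPath] = [
    LearningPath("Prerequisites for Azure administrators", 5, 38),
    LearningPath("Manage identities and governance in Azure", 6, 53),
    LearningPath("Configure and manage virtual networks", 8, 60),
    LearningPath("Implement and manage storage in Azure", 4, 39),
    LearningPath("Deploy and manage Azure compute resources", 5, 48),
    LearningPath("Monitor and back up Azure resources", 3, 22),
]


def totals(paths: List[LearningPath]) -> Tuple[int, int, int]:
    """Return (learning paths, modules, units) summed over all paths."""
    return (
        len(paths),
        sum(p.modules for p in paths),
        sum(p.units for p in paths),
    )


if __name__ == "__main__":
    print(totals(LEARNING_PATHS))  # (6, 31, 260)
```

The result matches the 6 learning paths, 31 modules, and 260 units reported for the crawl.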
- az104_complete_crawler.py: Main crawler for all course content
- advanced_cleanup.py: Advanced content cleaning and formatting
- setup_and_run.py: Automated setup and installation
- test_crawler.py: Test crawler functionality
- Content Review: Review English content in content/english/
- Translation: Add Vietnamese content to placeholder sections
- Quality Check: Ensure technical accuracy and readability
- Structure Preservation: Maintain original HTML structure and formatting
- Technical Terms: Use established Vietnamese Azure terminology
- Code Examples: Keep code in English, translate comments
- UI Elements: Translate interface elements consistently
- Links: Preserve original Microsoft Learn links
- Review the existing content structure
- Follow the established naming conventions
- Test any changes thoroughly
- Maintain the original Microsoft Learn formatting
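The guidelines on preserving HTML structure and original links can be checked mechanically. The sketch below is one possible approach using only the standard library, not a tool shipped with this project. The function names `outline` and `check_translation` are hypothetical, and the sample HTML in the usage note is invented for illustration.

```python
from html.parser import HTMLParser
from typing import List, Tuple


class _Outline(HTMLParser):
    """Collect the sequence of opening tags and every href value."""

    def __init__(self) -> None:
        super().__init__()
        self.tags: List[str] = []
        self.links: List[str] = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)
        for name, value in attrs:
            if name == "href" and value is not None:
                self.links.append(value)


def outline(html: str) -> Tuple[List[str], List[str]]:
    """Return (opening tags in document order, href values in document order)."""
    parser = _Outline()
    parser.feed(html)
    parser.close()
    return parser.tags, parser.links


def check_translation(original: str, translated: str) -> List[str]:
    """List the ways the translated HTML diverges from the original's markup."""
    problems: List[str] = []
    o_tags, o_links = outline(original)
    t_tags, t_links = outline(translated)
    if o_tags != t_tags:
        problems.append("tag structure differs")
    if o_links != t_links:
        problems.append("links differ")
    return problems
```

For example, comparing `<p>See <a href="https://learn.microsoft.com/">the docs</a>.</p>` with a translation that changes only the text between the tags yields an empty list, while a translation that swaps `<p>` for `<div>` or alters the `href` is reported. The check compares markup only, so it says nothing about translation quality.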
This project is for educational purposes. All original content belongs to Microsoft Corporation.
Note: This project respects Microsoft's terms of service and is intended for educational and translation purposes only.