Hands-on project for data collection & extraction (Week 2). Implements OCR (Tesseract), web scraping (arXiv), PDF text extraction, automatic speech recognition (Whisper), and dataset cleaning/deduplication.
Updated Oct 8, 2025 - Python
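The description names dataset cleaning/deduplication as a step but does not say how it is done. As a minimal sketch only, and not the project's actual code, the snippet below shows one common approach: normalize each text record, hash it, and keep the first occurrence. The function names `clean_text` and `deduplicate`, the NFKC normalization, and the case-insensitive comparison are all assumptions made for illustration.

```python
import hashlib
import re
import unicodedata


def clean_text(text: str) -> str:
    """Normalize Unicode and collapse runs of whitespace (assumed cleaning rules)."""
    text = unicodedata.normalize("NFKC", text)
    return re.sub(r"\s+", " ", text).strip()


def deduplicate(records: list[str]) -> list[str]:
    """Return cleaned records with empty entries and exact duplicates removed.

    Duplicates are detected case-insensitively via a SHA-256 digest of the
    cleaned text; the first occurrence of each record is kept.
    """
    seen: set[str] = set()
    unique: list[str] = []
    for record in records:
        cleaned = clean_text(record)
        if not cleaned:
            continue  # drop records that are empty after cleaning
        digest = hashlib.sha256(cleaned.lower().encode("utf-8")).hexdigest()
        if digest in seen:
            continue  # already kept an equivalent record
        seen.add(digest)
        unique.append(cleaned)
    return unique


if __name__ == "__main__":
    sample = ["Hello  World", "hello world", "", "Other\ntext"]
    print(deduplicate(sample))
```

This only catches duplicates that are identical after normalization; near-duplicate detection (for example, shingling with MinHash) would need a different technique.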