Prep for your next data engineering interview. Work through PySpark notebooks framed as real problems from Zephyr Coffee Co. (a fictional 200-store chain with messy data), review the theory docs before senior rounds, drill the quizzes the night before.
Built in the open. Contributions welcome - see below.
- pyspark/ - the PySpark module (start here)
- data_modeling/ - dimensional modeling, SCDs, star vs snowflake, grain
- ai_for_data_engineering/ - the DE work behind LLMs: RAG/agents (using LLMs) + pre-training & SFT data (building LLMs)
- company_interviews/ - company-wise DE interview patterns (Ola, Flipkart, Swiggy, PhonePe, Jio)
- ZEPHYR.md - the fictional company whose data runs through every notebook
- Roadmap - what's coming next
- Resources - thought leaders + resume examples
- Contributing
The PySpark module. READMEs inside guide you through it based on your level (beginner / intermediate / senior).
Hands-on notebooks (each framed as a Slack message from a Zephyr colleague asking you to solve a realistic problem):
- Syntax cheatsheet - 30-min flat reference
- Window functions - consecutive months, churn detection, top-N per group
- Joins - type mismatches, broadcast, skew detection, salting
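For a feel of what the joins notebook means by salting, here is a minimal pure-Python sketch of the idea. It needs no Spark and is not taken from the notebooks. Everything in it is an assumption made for illustration: the store IDs, the row counts, the partition and salt-bucket counts, and `toy_partition`, which is a deliberately simple stand-in for a real hash partitioner.

```python
from collections import Counter

NUM_PARTITIONS = 8   # assumed partition count for this toy example
SALT_BUCKETS = 8     # assumed number of salt values


def toy_partition(key: str) -> int:
    """Toy stand-in for a hash partitioner (not Spark's actual hash)."""
    return sum(ord(ch) for ch in key) % NUM_PARTITIONS


def build_rows() -> list[str]:
    """One hot store with 9,000 rows plus 100 quiet stores with 10 rows each."""
    hot = ["store_001"] * 9000
    rest = [f"store_{n:03d}" for n in range(2, 102) for _ in range(10)]
    return hot + rest


def salt_keys(keys: list[str]) -> list[str]:
    """Append a rotating salt so one logical key becomes several physical keys."""
    return [f"{key}#{i % SALT_BUCKETS}" for i, key in enumerate(keys)]


def partition_sizes(keys: list[str]) -> Counter:
    """Count how many rows land in each partition."""
    return Counter(toy_partition(key) for key in keys)


rows = build_rows()
unsalted = partition_sizes(rows)
salted = partition_sizes(salt_keys(rows))

print("largest partition, unsalted:", max(unsalted.values()))
print("largest partition, salted:  ", max(salted.values()))
```

Without the salt, every row for the hot store hashes to the same partition, so one task does most of the work. With the salt, the hot key is split across several partitions. In a real join the other side has to be replicated once per salt value so that every salted key still finds its match, which is the cost you trade for the even spread.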
Theory docs (10-min night-before-interview reviews):
- Shuffle & partitioning
- Memory management
- Catalyst & AQE
- Data skew playbook
- Spark UI & .explain() debugging
Self-check quizzes (collapsible Q&A, 🟢 basics → ⚡ senior judgment):
- Window functions · Joins · Memory · Catalyst/AQE · Skew · Spark UI
Phase 2 (next, no dates):
- Null handling & deduplication notebook (Zephyr's 2023 POS duplicate incident)
- Nested data notebook (exploding loyalty event structs)
- Structured streaming notebook
- Delta Lake notebook
- Quiz + theory coverage for each
Phase 3:
- SQL module (window functions in SQL, gaps-and-islands, SCDs, query optimization)
- Python for DE module (collections, generators, pandas to Spark, testing)
- System design scenarios for DE interviews
- DE interview question bank
The repo aims to be honest about what's built and what's not. No fake timelines.
- Sumit Mittal - Founder of BigDataBySumit
- Joe Reis - Co-author of Fundamentals of Data Engineering
- Zach Wilson - Data engineering specialist
- Shashank Mishra - Data engineer & educator
- Gowtham SB - Big data & cloud
- Manish Kumar - For questions and interview experience
- Darshil Parmar - For crisp DE videos
- Ansh Lamba - Best for Azure and Databricks
- Data Pathshala - Preparation of Data Engineering by Manish Kumar
- Data Engineer Handbook - DE concepts
- Manish's DE resume - well-structured, shows skills/projects/experience clearly
- My resume
- Jake Overleaf - Open it as a template, then edit it into your own resume in LaTeX
Typo fixes, clearer explanations, new quiz questions, Zephyr scenario ideas, and blog-link additions (with a one-line justification for why it beats what's already linked) are all welcome. Open an issue or a PR.
Please don't send: random link dumps, self-promotional content, or AI-generated filler. The curation is the point: every external link in this repo was added because it's genuinely the best free resource for that topic, not because it exists.