Data Engineer Prep

Prep for your next data engineering interview. Work through PySpark notebooks framed as real problems from Zephyr Coffee Co. (a fictional 200-store chain with messy data), review the theory docs before senior rounds, and drill the quizzes the night before.

Built in the open. Contributions welcome — see below.

📚 Navigate

pyspark/ — the PySpark module (start here)
data_modeling/ — dimensional modeling, SCDs, star vs snowflake, grain
ai_for_data_engineering/ — ⭐ the DE work behind LLMs: RAG/agents (using LLMs) + pre-training & SFT data (building LLMs)
company_interviews/ — company-wise DE interview patterns (Ola, Flipkart, Swiggy, PhonePe, Jio)
ZEPHYR.md — the fictional company whose data runs through every notebook
Roadmap — what's coming next
Resources — thought leaders + resume examples
Contributing

What's in here now

pyspark/

The PySpark module. READMEs inside guide you through it based on your level (beginner / intermediate / senior).

Hands-on notebooks (each framed as a Slack message from a Zephyr colleague asking you to solve a realistic problem):

Syntax cheatsheet — 30-min flat reference
Window functions — consecutive months, churn detection, top-N per group
Joins — type mismatches, broadcast, skew detection, salting
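
The salting idea named in the Joins notebook can be sketched in plain Python, so it runs without a Spark session. This is an illustration only, not the notebook's code: `salted_join`, `num_salts`, the store keys, and the city values are all hypothetical, and the small side is assumed to have one row per key.

```python
import random
from collections import Counter


def salted_join(big_rows, small_rows, num_salts, seed=0):
    """Inner-join (key, value) rows to (key, attr) rows using salted keys.

    Hypothetical helper for illustration; not part of any Spark API.
    Assumes small_rows has at most one row per key.
    """
    rng = random.Random(seed)
    # Large side: tag each row with a random salt so one hot key
    # is spread across up to num_salts distinct join keys.
    salted_big = [((key, rng.randrange(num_salts)), value) for key, value in big_rows]
    # Small side: replicate every row once per salt value so each
    # salted key on the large side still finds its match.
    salted_small = {
        (key, salt): attr for key, attr in small_rows for salt in range(num_salts)
    }
    joined = [
        (key, value, salted_small[(key, salt)])
        for (key, salt), value in salted_big
        if (key, salt) in salted_small
    ]
    # Rows per salted key: a rough stand-in for rows per shuffle partition.
    bucket_sizes = Counter(salted_key for salted_key, _ in salted_big)
    return joined, bucket_sizes


# Made-up data: store_1 is the hot key with 1,000 sales rows.
sales = [("store_1", i) for i in range(1000)] + [("store_2", 0)]
stores = [("store_1", "Pune"), ("store_2", "Delhi")]
joined, bucket_sizes = salted_join(sales, stores, num_salts=8)
```

The join result matches an unsalted join, but the 1,000 `store_1` rows no longer share a single key. In PySpark the same shape is usually a salt column on the large DataFrame plus the small DataFrame replicated across the salt range; the trade-off is a small side that grows by a factor of `num_salts`.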

Theory docs (10-min night-before-interview reviews):

Self-check quizzes (collapsible Q&A, 🟢 basics → ⚡ senior judgment):

Window functions · Joins · Memory · Catalyst/AQE · Skew · Spark UI

Roadmap

Phase 2 (next, no dates):

Null handling & deduplication notebook (Zephyr's 2023 POS duplicate incident)
Nested data notebook (exploding loyalty event structs)
Structured streaming notebook
Delta Lake notebook
Quiz + theory coverage for each

Phase 3:

SQL module (window functions in SQL, gaps-and-islands, SCDs, query optimization)
Python for DE module (collections, generators, pandas↔Spark, testing)
System design scenarios for DE interviews
DE interview question bank

The repo aims to be honest about what's built and what's not. No fake timelines.

Resources for Data Engineers

Thought leaders worth following

Sumit Mittal — Founder of BigDataBySumit
Joe Reis — Co-author of Fundamentals of Data Engineering
Zach Wilson — Data engineering specialist
Shashank Mishra — Data engineer & educator
Gowtham SB — Big data & cloud
Manish Kumar - Interview questions and interview experiences
Darshil Parmar - Crisp DE videos
Ansh Lamba - Azure and Databricks

Resources I love

Data Pathshala - Data engineering preparation by Manish Kumar
Data Engineer Handbook - DE concepts

Resume examples

Manish's DE resume — well-structured, shows skills/projects/experience clearly
My resume
Jake Overleaf - Open it as a template, then edit it into your own resume in LaTeX

Contributing

Typo fixes, clearer explanations, new quiz questions, Zephyr scenario ideas, and blog-link additions (with a one-line justification for why it beats what's already linked) are all welcome. Open an issue or a PR.

Please don't send: random link dumps, self-promotional content, or AI-generated filler. The curation is the point — every external link in this repo was added because it's genuinely the best free resource for that topic, not because it exists.
