
Noman654/dataengineer_prep


Data Engineer Prep

License: MIT

Prep for your next data engineering interview. Work through PySpark notebooks framed as real problems from Zephyr Coffee Co. (a fictional 200-store chain with messy data), review the theory docs before senior rounds, and drill the quizzes the night before.

Built in the open. Contributions welcome - see below.


📚 Navigate

  • pyspark/ - the PySpark module (start here)
  • data_modeling/ - dimensional modeling, SCDs, star vs snowflake, grain
  • ai_for_data_engineering/ - ⭐ the DE work behind LLMs: RAG/agents (using LLMs) + pre-training & SFT data (building LLMs)
  • company_interviews/ - company-wise DE interview patterns (Ola, Flipkart, Swiggy, PhonePe, Jio)
  • ZEPHYR.md - the fictional company whose data runs through every notebook
  • Roadmap β€” what's coming next
  • Resources β€” thought leaders + resume examples
  • Contributing

What's in here now

The PySpark module. READMEs inside guide you through it based on your level (beginner / intermediate / senior).

Hands-on notebooks (each framed as a Slack message from a Zephyr colleague asking you to solve a realistic problem):

  • Syntax cheatsheet - 30-min flat reference
  • Window functions - consecutive months, churn detection, top-N per group
  • Joins - type mismatches, broadcast, skew detection, salting
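
As a taste of the "consecutive months" problem from the window-functions notebook, here is a minimal sketch of the gaps-and-islands idea it usually comes down to. It is written in plain Python so it runs without a Spark session and is not taken from the notebooks; the function name `longest_streak` and the sample purchase data are hypothetical. In PySpark the same grouping key is typically built by subtracting a `row_number()` computed over a per-customer window.

```python
from itertools import groupby


def longest_streak(months):
    """Return the length of the longest run of consecutive months.

    `months` is an iterable of (year, month) tuples; duplicates are ignored.
    """
    # Map each month to one increasing integer so December -> January is +1.
    indices = sorted({year * 12 + (month - 1) for year, month in months})

    # Gaps-and-islands: within a run of consecutive months, (value - position)
    # stays constant, so that difference works as a group key.
    best = 0
    for _, run in groupby(enumerate(indices), key=lambda pair: pair[1] - pair[0]):
        best = max(best, sum(1 for _ in run))
    return best


# Hypothetical purchase months for one loyalty customer.
purchases = [(2023, 11), (2023, 12), (2024, 1), (2024, 3), (2024, 4)]
streak = longest_streak(purchases)
```

Here the customer bought in November, December and January, skipped February, then bought in March and April, so the two islands have lengths 3 and 2.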

Theory docs (10-min night-before-interview reviews):

Self-check quizzes (collapsible Q&A, 🟢 basics → ⚡ senior judgment):


Roadmap

Phase 2 (next, no dates):

  • Null handling & deduplication notebook (Zephyr's 2023 POS duplicate incident)
  • Nested data notebook (exploding loyalty event structs)
  • Structured streaming notebook
  • Delta Lake notebook
  • Quiz + theory coverage for each

Phase 3:

  • SQL module (window functions in SQL, gaps-and-islands, SCDs, query optimization)
  • Python for DE module (collections, generators, pandas↔Spark, testing)
  • System design scenarios for DE interviews
  • DE interview question bank

The repo aims to be honest about what's built and what's not. No fake timelines.


Resources for Data Engineers

Thought leaders worth following

  1. Sumit Mittal - Founder of BigDataBySumit
  2. Joe Reis - Co-author of Fundamentals of Data Engineering
  3. Zach Wilson - Data engineering specialist
  4. Shashank Mishra - Data engineer & educator
  5. Gowtham SB - Big data & cloud
  6. Manish Kumar - For questions and interview experience
  7. Darshil Parmar - For crisp DE videos
  8. Ansh Lamba - Best for Azure and Databricks

Resources I love

  1. Data Pathshala - Preparation of Data Engineering by Manish Kumar
  2. Data Engineer Handbook - DE concepts

Resume examples


Contributing

Typo fixes, clearer explanations, new quiz questions, Zephyr scenario ideas, and blog-link additions (with a one-line justification for why it beats what's already linked) are all welcome. Open an issue or a PR.

Please don't send: random link dumps, self-promotional content, or AI-generated filler. The curation is the point: every external link in this repo was added because it's genuinely the best free resource for that topic, not because it exists.
