Master Notes
Python Programming
1. Introduction to Python
What is Programming? 💻
o Programming is a way to communicate with and instruct a computer to
perform various tasks. Just as we use languages like Hindi or English to
communicate with each other, we use a programming language to
communicate with a computer.
What is Python? 🐍
o Python is a high-level programming language that is simple and easy to
understand, with a syntax that feels like reading simple English. This makes it
easy for beginners to learn.
o Features of Python:
Easy to understand, leading to less development time
Free and open source
A high-level language
Portable: Works on various operating systems like Linux, Windows,
and Mac.
Installation
o Python can be easily installed from the official website: python.org. You can
download the installer for your platform and run the setup, much like
installing a game.
Modules, Comments, and Pip
o Modules: A module is a file containing code written by someone else that can
be imported and used in your programs.
Built-in modules: These come pre-installed with Python (e.g., os,
abc).
External modules: These need to be installed using pip (e.g.,
tensorflow, flask).
o Pip: Pip is the package manager for Python, used to install external modules.
Example: pip install flask installs the Flask module.
o Comments: Comments are used to write explanatory notes in the code that the
programmer does not want to execute.
Single-line comments: Written using #.
Multi-line comments: Written using """ comment """.
Variables
o A variable is a name given to a memory location to store a value; think of it
as a container.
o Rules for Naming Variables:
Can contain alphabets, digits, and underscores.
Must start with an alphabet or an underscore.
Cannot start with a digit.
No whitespace is allowed.
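o Example (an illustrative sketch; the names themselves are arbitrary):
Python
my_age = 25        # valid: letters and an underscore
_count = 0         # valid: starts with an underscore
user2 = "Harry"    # valid: digit, but not at the start
# 2user = "x"      # invalid: cannot start with a digit
# my age = 30      # invalid: whitespace is not allowed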
Data Types
o Python automatically identifies the data type of a variable. The primary data
types are integers (int), floating-point numbers (float), strings (str), and
booleans (bool).
o Example:
a = 31
type(a) will return <class 'int'>.
o Typecasting: Converting one data type to another.
str(31) converts integer 31 to string "31".
int("32") converts string "32" to integer 32.
float(32) converts integer 32 to float 32.0.
input() function
o This function allows the user to provide input from the keyboard.
o Important: The input() function always returns the input as a string, even if
a number is entered.
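o Example (a small sketch combining input() and typecasting; the prompt text is arbitrary):
Python
age = input("Enter your age: ")   # always returns a string, e.g. "25"
age = int(age)                    # typecast to an integer before doing arithmetic
print(age + 1)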
4. Lists and Tuples
Lists 📝
o Lists are containers that can store a set of values of any data type.
o Example: friends = ["Apple", "Akash", 7, False].
o Lists can be indexed and sliced just like strings.
o List Methods:
list.sort(): Sorts the list.
list.reverse(): Reverses the list.
list.append(value): Adds an element to the end of the list.
list.insert(index, value): Inserts an element at a specific index.
list.pop(index): Removes and returns the element at a given index.
list.remove(value): Removes the first occurrence of a value from
the list.
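o Example (illustrative values):
Python
friends = ["Apple", "Akash", "Rohan"]
friends.append("Shubham")     # adds to the end
friends.insert(1, "Sam")      # ["Apple", "Sam", "Akash", "Rohan", "Shubham"]
friends.remove("Akash")       # removes the first occurrence of "Akash"
last = friends.pop()          # removes and returns "Shubham"
friends.sort()                # ["Apple", "Rohan", "Sam"]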
Tuples 📦
o A tuple is an immutable data type, meaning its elements cannot be changed
after it's defined.
o Syntax:
a = () (Empty tuple).
a = (1,) (Tuple with one element needs a comma).
a = (1, 7, 2) (Tuple with multiple elements).
o Tuple Methods:
tuple.count(value): Returns the number of times a value occurs.
tuple.index(value): Returns the index of the first occurrence of a
value.
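o Example (illustrative):
Python
a = (1, 7, 2, 7)
print(a.count(7))   # 2
print(a.index(2))   # 2 (index of the first occurrence of the value 2)
# a[0] = 5          # would raise a TypeError because tuples are immutable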
5. Dictionaries and Sets
Dictionaries 📖
o A dictionary is a collection of key-value pairs.
o Syntax: a = {"key": "value", "marks": 95, "list": [1, 2, 9]}.
o Properties:
Unordered: Items have no positional index (note: since Python 3.7, insertion order is preserved).
Mutable: You can change, add, or remove items.
Indexed: You can access items by their key.
No duplicate keys: Keys must be unique.
o Dictionary Methods:
dict.items(): Returns a view of the dictionary's (key, value) pairs.
dict.keys(): Returns a view of the dictionary's keys.
dict.update({"new_key": "new_value"}): Updates the dictionary
with new key-value pairs.
dict.get("key"): Returns the value for a given key.
Sets 🔢
o A set is a collection of non-repetitive, unique elements.
o Properties:
Unordered: Elements do not have a fixed order.
Unindexed: You cannot access elements by index.
No duplicates: Sets cannot contain duplicate values.
o Operations on Sets:
len(set): Returns the length of the set.
set.remove(value): Removes a specified element.
set.pop(): Removes an arbitrary element.
set.union(other_set): Returns a new set with all items from both
sets.
set.intersection(other_set): Returns a new set with items that
are in both sets.
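o Example (illustrative):
Python
s1 = {1, 2, 3, 3}            # duplicates are dropped -> {1, 2, 3}
s2 = {3, 4, 5}
print(len(s1))               # 3
print(s1.union(s2))          # {1, 2, 3, 4, 5}
print(s1.intersection(s2))   # {3}
s1.remove(2)                 # raises a KeyError if the value is not present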
6. Conditional Expressions
Python
if condition1:
    # Code to execute if condition1 is true
elif condition2:
    # Code to execute if condition2 is true
else:
    # Code to execute if all previous conditions are false
o The elif (else if) statement can be used multiple times.
o The else block is optional and executes only if all preceding if and elif
conditions fail.
Relational and Logical Operators
o Relational operators (==, >, <, !=, etc.) are used to evaluate conditions within
if statements.
o Logical operators (and, or, not) are used to combine conditional statements.
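o Example (a small sketch; the threshold values are arbitrary):
Python
age = 20
has_id = True
if age >= 18 and has_id:
    print("Entry allowed")
elif age >= 18:
    print("Please bring an ID")
else:
    print("Entry not allowed")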
7. Loops in Python
Loops are used to repeat a set of statements. There are two primary types of loops in
Python: while loops and for loops.
while loop
o The while loop executes a block of code as long as a specified condition is
true.
o The condition is checked first; if it's true, the loop body is executed. This
process continues until the condition becomes false.
o Syntax:
Python
while condition:
    # Body of the loop
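o Example (counts from 1 to 5):
Python
i = 1
while i <= 5:
    print(i)
    i += 1        # without this, the condition never becomes false (infinite loop)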
for loop
o A for loop is used to iterate over a sequence (like a list, tuple, or string).
o range() function: Generates a sequence of numbers. You can specify the
start, stop, and step size.
o Syntax:
Python
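for item in sequence:
    # Body of the loop (syntax sketch filled in here to mirror the while-loop block above;
    # "sequence" stands for any iterable such as a list, tuple, string, or range())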
o An optional else block can be used with a for loop; it executes when the loop
finishes without being interrupted by a break.
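o Example (illustrative; range(1, 6) produces the numbers 1 to 5):
Python
for i in range(1, 6):
    print(i)
else:
    print("Loop finished without a break")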
Loop Control Statements
o break: Exits the loop immediately when encountered.
o continue: Skips the current iteration and continues with the next one.
o pass: A null statement that does nothing. It's used as a placeholder where
code is syntactically required but you don't want any command or code to
execute.
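o Example (illustrative):
Python
for i in range(10):
    if i == 3:
        continue   # skip the rest of this iteration
    if i == 6:
        break      # exit the loop entirely
    print(i)       # prints 0, 1, 2, 4, 5

def todo():
    pass           # placeholder so the empty function is still valid syntax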
Functions
o A function is a reusable block of code that performs a specific task when it is called.
o Syntax:
Python
def function_name(arguments):
    # Function body
    return value
o Default parameter value: You can set a default value for an argument, which
is used if no argument is passed during the function call.
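o Example (a sketch; the function name and default value are arbitrary):
Python
def greet(name, greeting="Good morning"):
    print(greeting + ", " + name)

greet("Harry")                 # uses the default: Good morning, Harry
greet("Harry", "Good night")   # overrides the default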
Recursion
o Recursion is a technique where a function calls itself.
o It's often used when a problem can be broken down into smaller, similar sub-
problems, like calculating a factorial.
o A recursive function needs a base condition to stop the recursion and prevent
an infinite loop.
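o Example (factorial; the base condition stops the recursion):
Python
def factorial(n):
    if n <= 1:          # base condition
        return 1
    return n * factorial(n - 1)

print(factorial(5))     # 120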
File I/O
To store data permanently, we use files, as RAM is volatile. Python can read from and
write to files.
Types of Files:
1. Text files: .txt, .py, etc.
2. Binary files: .jpg, .dat, etc.
Opening a File
1. The open() function is used to open files. It takes the filename and the mode
as parameters.
2. Modes:
'r': Read mode
'w': Write mode (overwrites the file)
'a': Append mode (adds to the end of the file)
'+': Update mode (reading and writing)
You can also use 'rb' or 'rt' for binary or text mode respectively.
Reading and Writing Files
1. file.read(): Reads the contents of the file.
2. file.readline(): Reads one line at a time.
3. file.write("text"): Writes text to the file.
with statement
1. This is the recommended way to open a file as it automatically closes the file
after the block of code is executed.
2. Syntax:
Python
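with open("file.txt", "r") as f:    # syntax sketch filled in here; the file name is illustrative
    content = f.read()
# The file is closed automatically when the block ends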
Exception Handling
o Exceptions are errors that occur during program execution. Python has many
built-in exceptions.
o try...except block: Used to handle exceptions. The code that might cause
an error is placed in the try block, and the code to handle the exception is in
the except block.
o else clause: Executes if the try block was successful (no exceptions were
raised).
o finally clause: Executes regardless of whether an exception occurred or not.
o raise keyword: Used to raise custom exceptions.
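o Example (a small sketch; the error message text is arbitrary):
Python
try:
    n = int(input("Enter a number: "))
    if n < 0:
        raise ValueError("Negative numbers are not allowed")
except ValueError as e:
    print("Invalid input:", e)
else:
    print("You entered", n)
finally:
    print("This always runs")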
if __name__ == "__main__"
o This is a common construct in Python. __name__ is a special variable that is
set to "__main__" when the module is run directly. This allows you to write
code that only runs when the file is executed as a script, not when it's imported
as a module.
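o Example (illustrative module):
Python
def main():
    print("Running as a script")

if __name__ == "__main__":
    main()    # runs only when the file is executed directly, not when imported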
global keyword
o The global keyword is used to modify a variable outside of the current scope
(e.g., inside a function).
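o Example (illustrative):
Python
count = 0

def increment():
    global count   # without this, the assignment would create a new local variable
    count += 1

increment()
print(count)       # 1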
enumerate function
o This function adds a counter to an iterable (like a list) and returns it as an
enumerate object. It's useful for getting both the index and the item when
looping.
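o Example (illustrative list):
Python
items = ["apple", "mango", "banana"]
for index, item in enumerate(items):
    print(index, item)    # 0 apple / 1 mango / 2 banana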
List Comprehensions
o A concise and elegant way to create lists based on existing lists.
o Syntax: new_list = [expression for item in iterable if
condition].
o Example: list2 = [i for i in list1 if i > 8].
Virtual Environments
o A virtual environment is an isolated Python environment that has its own set
of installed packages. This is useful for managing dependencies for different
projects.
o The virtualenv package can be installed using pip install virtualenv.
o pip freeze is a command that lists all the installed packages and their
versions in the current environment. This can be used to create a
requirements.txt file to replicate the environment elsewhere.
Lambda Functions
o These are small, anonymous functions created using the lambda keyword.
o Syntax: lambda arguments: expression.
o Example: square = lambda x: x * x.
join() method
o Joins the elements of an iterable of strings into a single string, using the string it
is called on as the separator.
o Example: ", ".join(["apple", "mango"]) returns "apple, mango".
format() method
o Formats values inside a string into a desired output.
o Example: "{} is a good {}".format("Harry", "boy") returns "Harry
is a good boy".
map, filter, and reduce
o map(function, iterable): Applies a function to every item in an iterable and
returns an iterator of the results.
o filter(function, iterable): Returns an iterator containing only the items for
which the function returns True.
o reduce(function, iterable): Applies a rolling computation to sequential
pairs of elements. It's part of the functools module.
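o Example (illustrative; list() is used because map and filter return iterators):
Python
from functools import reduce

numbers = [1, 2, 3, 4, 5]
squares = list(map(lambda x: x * x, numbers))        # [1, 4, 9, 16, 25]
evens = list(filter(lambda x: x % 2 == 0, numbers))  # [2, 4]
total = reduce(lambda a, b: a + b, numbers)          # 15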
1. Introduction to SQL
What is SQL?
o SQL (Structured Query Language) is a non-procedural language used to
communicate with databases. It was developed at IBM in the 1970s and was
originally called SEQUEL (Structured English Query Language).
o It serves as a common interface language between users and relational databases.
o SQL is the standard language used to interact with any Relational Database
Management System (RDBMS) product like Oracle, SQL Server, or MySQL.
o SQL is not case-sensitive, meaning commands can be written in upper or
lower case.
o SQL statements should ideally end with a semicolon (;), but this is optional in
SQL Server.
Sub-Languages of SQL
o DDL (Data Definition Language): Used to define, modify, and drop database
objects.
Commands: CREATE, ALTER, DROP, TRUNCATE, sp_rename.
o DML (Data Manipulation Language): Used to manipulate the data within
tables.
Commands: INSERT, UPDATE, DELETE.
o DQL (Data Query Language): Used to retrieve data from the database.
Command: SELECT.
o TCL (Transaction Control Language): Used to manage transactions in the
database.
Commands: COMMIT, ROLLBACK, SAVEPOINT.
o DCL (Data Control Language): Used to control access to data in the
database.
Commands: GRANT, REVOKE.
5. SQL Functions
Aggregate Functions: Perform calculations on a set of rows and return a single value.
o SUM(): Calculates the sum of values.
o AVG(): Calculates the average of values.
o MAX(): Returns the maximum value.
o MIN(): Returns the minimum value.
o COUNT(): Counts the number of rows.
COUNT(*): Counts all rows, including duplicates and NULLs.
COUNT(column_name): Counts non-NULL values in a column.
COUNT(DISTINCT column_name): Counts only unique, non-NULL
values.
String Functions:
o LEN(): Returns the length of a string.
o LOWER() / UPPER(): Converts a string to lowercase or uppercase.
o LTRIM() / RTRIM(): Removes leading or trailing spaces.
o SUBSTRING(): Extracts a portion of a string.
o REPLACE(): Replaces occurrences of a specified string.
o CONCAT(): Joins two or more strings together.
Date and Time Functions:
o GETDATE(): Returns the current system date and time.
o DATEPART(): Returns a specified part of a date (e.g., year, month, day).
o DATENAME(): Returns the name of a specified part of a date (e.g., 'Friday',
'August').
o DATEADD(): Adds a time/date interval to a date.
o DATEDIFF(): Returns the difference between two dates.
The ISNULL() Function: Replaces NULL with a specified replacement value. If the
first expression is NULL, it returns the second; otherwise, it returns the first.
o Example: SELECT ISNULL(Commission, 0) FROM Employees;.
Python Libraries for Data Analysis
These are powerful, open-source Python libraries that are essential for data analysis and
machine learning.
NumPy:
o Stands for "Numerical Python." It's the fundamental package for numerical
computing in Python.
o Its main feature is the powerful N-dimensional array object, which is much
more efficient for storing and manipulating numerical data than standard
Python lists.
o It's a foundational library for many others, including Pandas and Matplotlib.
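o A minimal sketch (the values are illustrative):
Python
import numpy as np

prices = np.array([100.0, 250.5, 320.0])
discounted = prices * 0.9      # vectorised: applied to every element, no explicit loop
print(discounted.mean())       # aggregate functions operate on the whole array
print(discounted.shape)        # (3,)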
Pandas:
o Pandas is the go-to library for data manipulation and analysis.
o It introduces two main data structures:
DataFrame: A 2-dimensional table with labeled rows and columns,
similar to a spreadsheet or a SQL table. This is what you would use to
hold data like your sales or policy information.
Series: A 1-dimensional labeled array, which is essentially a single
column of a DataFrame.
o Pandas makes it easy to read data from various sources (like CSVs or SQL
databases), clean it, filter it, and perform complex operations.
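o A small sketch (the column names are illustrative; pd.read_csv would be the usual entry
point for a real file):
Python
import pandas as pd

df = pd.DataFrame({"region": ["North", "South", "North"],
                   "sales": [120, 90, 150]})
north = df[df["region"] == "North"]             # filter rows
print(df.groupby("region")["sales"].sum())      # aggregate per region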
Matplotlib & Seaborn:
o These are data visualization libraries.
o Matplotlib: A powerful and flexible library for creating static, animated, and
interactive visualizations. You can create all kinds of plots, like line charts, bar
charts, and histograms.
o Seaborn: Built on top of Matplotlib, Seaborn provides a higher-level interface
for drawing attractive and informative statistical graphics. It makes creating
complex plots like heatmaps and violin plots much easier.
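o A minimal sketch (random data used purely for illustration):
Python
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

data = np.random.randn(500)
sns.histplot(data, kde=True)       # Seaborn histogram with a density curve
plt.title("Distribution of values")
plt.show()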
Machine Learning
Machine Learning is a field of AI that gives computers the ability to learn from data without
being explicitly programmed.
Question 1: Tell me about yourself.
Answer: "In my current role at Reliance Nippon Life Insurance, I design and maintain ETL pipelines to
process large volumes of policy persistency and renewal data. I enjoy the entire data
lifecycle, from extracting and transforming raw data to building analytical solutions. For
instance, I've worked on projects like developing a Drowsiness Detection system using
TensorFlow and a Sentiment Analysis system with NLTK. I am passionate about using data
to solve complex problems and deliver actionable insights, and I'm eager to bring my skills in
data engineering and machine learning to a challenging new role."
Question 2: What are your strengths and weaknesses?
Answer: "As for a weakness, I would say that I sometimes get deeply engrossed in the technical details
of a single component of a project. While this ensures high-quality work, I've learned the
importance of stepping back to maintain a view of the overall project timeline and priorities.
I'm actively working on this by using project management techniques to better balance deep
focus with consistent progress across all tasks."
Question 3: Describe a time you had to collaborate with a cross-functional team. What
was your role?
Answer:"In my role at Reliance Nippon, I constantly collaborate with data analysts and
business teams to ensure our data pipelines meet their requirements. A specific example was
when we needed to enrich our policyholder datasets for a new analytical model.
Question 4: Tell me about a challenging data engineering problem you faced and how
you solved it.
Answer:"A significant challenge I faced was optimizing the performance of our daily ETL
jobs at Reliance Nippon, which were processing increasingly large volumes of insurance
data.
Situation: The daily data load was growing, and our existing PySpark jobs were
taking longer to run, threatening to miss the service-level agreement (SLA) for data
availability.
Task: My task was to analyze the pipeline, identify the bottlenecks, and implement
optimizations to improve performance.
Action: I started by profiling the Spark jobs and discovered that a few complex join
and aggregation operations were causing significant data shuffling across the cluster.
To address this, I refactored the SQL queries and PySpark code. I implemented data
partitioning on our Databricks tables based on date, which drastically reduced the
amount of data scanned in daily jobs. I also optimized join strategies and broadcasted
smaller tables to minimize data movement.
Result: These optimizations reduced the pipeline's runtime by over 40%, ensuring we
consistently met our SLAs and providing the analytics team with timely data."
Question 5: In your Customer Management System project, why did you choose
Streamlit for the UI? What were the alternatives?
Answer:"I chose Streamlit for the Customer Management System's UI primarily because of
its speed and simplicity for creating data-centric applications. My goal was to build a
functional and interactive UI quickly without getting bogged down in complex frontend
frameworks like React or Angular. Streamlit allows you to build a web app using only
Python, which was perfect since the backend was also in Python.
An alternative would have been Flask or Django. While they are more powerful and offer
more customization, they also have a steeper learning curve and require more boilerplate
code for a simple application like the CMS. Given the project's scope, Streamlit provided the
best balance of interactivity, ease of development, and performance, allowing me to deliver a
user-friendly interface efficiently."
Question 6: For your Drowsiness Detection System, what were the key challenges in
working with real-time video data, and how did you use OpenCV and TensorFlow to
overcome them?
Answer:"The biggest challenge with the Drowsiness Detection System was processing the
video feed in real-time with minimal latency. The system had to be fast enough to provide a
timely warning.
I used OpenCV to handle the initial, less computationally intensive tasks. It efficiently
captured the video stream, converted frames to grayscale for faster processing, and performed
facial detection. This isolated the region of interest (the face), reducing the amount of data the
more complex model had to process.
For the core drowsiness detection, I used a lightweight Convolutional Neural Network (CNN)
built with TensorFlow. Instead of running the model on the entire video frame, I ran it only
on the extracted eye regions. This significantly reduced the computational load. By
engineering the deep learning model to be efficient and using OpenCV for the heavy lifting
of video processing, I was able to build a system that met the real-time responsiveness
required for a safety-critical application."
Question 7: In your Amazon Review Sentiment Analysis project, you used NLTK. What
specific NLP pre-processing steps did you perform, and why were they important?
Answer:"In the Sentiment Analysis project, the pre-processing step was crucial for model
accuracy. Using the NLTK library in Python, I performed several key steps:
1. Tokenization: I first broke down the raw review text into individual words or
"tokens." This is the foundational step for any text analysis.
2. Lowercasing: I converted all text to lowercase to ensure that words like 'Good' and
'good' were treated as the same word.
3. Stop Word Removal: I removed common words like 'the', 'a', and 'is' that don't carry
much sentiment. This helps the model focus on the words that are actually meaningful
for determining sentiment.
4. Stemming/Lemmatization: I used lemmatization to reduce words to their root form
(e.g., 'running' becomes 'run'). This consolidates different forms of a word into a
single feature, which helps improve the model's performance by reducing the
dimensionality of the input.
These steps were vital because they cleaned and standardized the text data, transforming
noisy, unstructured reviews into a clean format that the machine learning model could learn
from effectively."
Question 8: Can you explain the difference between ETL and ELT? Where might you
use one over the other?
Answer:"ETL and ELT are two different approaches to data pipeline architecture. The key
difference lies in where the transformation happens.
In ETL (Extract, Transform, Load), the data is extracted from the source,
transformed in a separate staging area, and then the transformed data is loaded into
the destination data warehouse. This is a traditional approach that works well when
dealing with structured data and when the transformations are complex.
In ELT (Extract, Load, Transform), the raw data is extracted and loaded directly
into the destination data warehouse or data lake. The transformation then happens
inside the destination system using its processing power.
I would choose ELT in a modern cloud data warehouse environment like Azure Synapse or
Snowflake. These platforms are incredibly powerful and can handle transformations on
massive datasets efficiently. This approach is also more flexible, as it allows you to store the
raw data and apply different transformations later as business needs change. I used a more
traditional ETL approach at Reliance Nippon, where complex business logic was applied in
PySpark on Databricks before loading the final data into our data lake for analysis."
Question 9: What is the difference between a list and a tuple in Python, and when would
you use one over the other?
Answer:"Both lists and tuples in Python are used to store multiple items in a single variable.
The main difference between them is mutability.
Lists are mutable, which means their elements can be changed after the list is created
—you can add, remove, or modify items.
Tuples are immutable, meaning once a tuple is created, you cannot change its
elements.
I would use a list when I have a collection of items that might need to change over time, for
example, a list of tasks in a to-do application. I would use a tuple when I want to ensure that
the data cannot be changed. For example, if I'm storing a set of coordinates (x, y) or a
collection of constants, a tuple is a safer and more memory-efficient choice."
Question 10: How do you handle large datasets in Python that don't fit into memory?
Answer:"Handling datasets that are too large for memory is a common challenge in data
engineering. My approach depends on the task, but my primary tool for this is PySpark,
which is designed for distributed, out-of-core computation. With PySpark on a platform like
Azure Databricks, I can process terabytes of data across a cluster without ever needing to
load it all into a single machine's memory.
If I were limited to a single machine, I would use libraries like Dask, which parallelizes
computations and works with larger-than-memory datasets using a familiar pandas-like
syntax. Another approach is to process the data in chunks using pandas.read_csv's
chunksize parameter. This allows me to read and process a large file piece by piece, so the
entire file is never loaded into memory at once."
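A minimal sketch of the chunked approach (the file name large_file.csv and the amount
column are assumptions for illustration):
Python
import pandas as pd

total = 0
# Read the large file 100,000 rows at a time instead of loading it all at once
for chunk in pd.read_csv("large_file.csv", chunksize=100_000):
    total += chunk["amount"].sum()   # process each piece, then let it be discarded
print(total)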
Question 11: Explain the difference between DELETE and TRUNCATE in SQL.
Answer: "Both DELETE and TRUNCATE are used to remove rows from a table, but they work
very differently.
DELETE is a DML (Data Manipulation Language) command. It removes rows one
by one and logs each deletion. Because of this, you can use a WHERE clause to specify
which rows to delete, and the operation can be rolled back.
TRUNCATE is a DDL (Data Definition Language) command. It de-allocates the data
pages used by the table, which quickly removes all rows. It's much faster than DELETE
for large tables because it doesn't log individual row deletions, but you cannot use a
WHERE clause, and the operation cannot be easily rolled back. Also, TRUNCATE resets
any identity columns back to their starting value, while DELETE does not."
Question 12: How would you design a database schema for a ride-sharing app?
Answer:"To design a database for a ride-sharing app, I would start by identifying the core
entities and their relationships. I'd use a relational model, likely in MySQL, given my
experience. The main tables would be:
Users: This table would store information for both riders and drivers. It would have
columns like user_id (Primary Key), name, email, phone_number, and a user_type
(rider/driver) column.
Drivers: A table with a foreign key to user_id, containing driver-specific
information like license_plate, vehicle_model, and current_status
(available/on-trip).
Rides: This would be the central fact table. It would include ride_id (Primary Key),
rider_id (Foreign Key to Users), driver_id (Foreign Key to Drivers),
pickup_location, dropoff_location, start_time, end_time, fare, and status
(requested/completed/cancelled).
Ratings: This table would store feedback, with columns like rating_id, ride_id,
rated_by_user_id, rated_user_id, and rating_score.
I would establish clear foreign key relationships to ensure data integrity—for example, a ride
must be linked to a valid rider and driver. I would also add indexes on frequently queried
columns like user_id and ride_id to optimize performance."
Question 13: You mentioned Window Functions in SQL. Can you explain the difference
between RANK() and DENSE_RANK()?
Answer: "Yes. Both RANK() and DENSE_RANK() are window functions used to assign a rank
to rows within a partition of a result set, but they handle ties differently.
RANK(): If there's a tie in values, RANK() assigns the same rank to the tied rows and
then skips the next rank(s). For example, if two rows tie for 2nd place, they both get
rank 2, and the next row gets rank 4.
DENSE_RANK(): This function also assigns the same rank to tied rows, but it does not
skip any ranks. In the same scenario, if two rows tie for 2nd place, they both get rank
2, and the next row gets rank 3.
I would use DENSE_RANK() when I need a consecutive ranking without gaps, which is
common in "top-N" reports. I'd use RANK() when the specific number of items preceding a
row is important."
Question 14: What is the practical difference between "is" and "==" in Python?
Answer: "The difference is crucial and relates to how Python handles objects in memory.
== is the value equality operator. It checks if the values of two objects are the same.
For example, if a = [1, 2] and b = [1, 2], then a == b will be True because their
contents are identical.
is is the identity operator. It checks if two variables point to the exact same object in
memory. In the same example, a is b would be False because a and b are two
separate list objects in memory, even though they contain the same values.
In my code, I use == for almost all comparisons of values (numbers, strings, lists). I would
only use is to check for object identity, most commonly to check if a variable is None (e.g.,
if my_var is None:), which is a standard convention."
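A quick illustration:
Python
a = [1, 2]
b = [1, 2]
print(a == b)   # True  - same values
print(a is b)   # False - two distinct list objects in memory
c = a
print(a is c)   # True  - both names refer to the same object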
Question 15: Your resume mentions experience with various cloud technologies. What
are some of the key challenges unique to cloud computing for a Data Engineer?
Answer: "From my experience with Azure, some of the key challenges in cloud computing
for a Data Engineer are:
Cost Management: Cloud resources are powerful but can become expensive if not
managed properly. A significant challenge is designing data pipelines and storage
solutions that are not only performant but also cost-effective. This involves choosing
the right storage tiers, using serverless options where appropriate, and continuously
monitoring costs.
Security and Compliance: In the cloud, data is stored on third-party infrastructure,
so ensuring robust security is paramount. This includes managing access controls,
encrypting data both in transit and at rest, and ensuring the architecture complies with
data protection regulations like GDPR.
Performance Optimization and Latency: While the cloud offers immense
scalability, latency can be a challenge, especially when moving large amounts of data
between different services or regions. Optimizing data pipelines requires careful
architecture design to minimize data movement and choose the right tools for the job,
like using Azure Databricks for distributed processing to reduce processing time."
Question 16: What is data smoothing, and can you give an example of how you might
use it in Python?
Answer: "Data smoothing is a statistical technique used to remove noise or random
fluctuations from a dataset to reveal underlying trends or patterns. It's particularly useful for
time-series data, which can often be very noisy.
A common method for data smoothing is calculating a Rolling Average (or moving average).
In Python, I would use the Pandas library for this. For example, if I had daily policy sales
data that was very volatile, I could calculate a 7-day rolling average to smooth out the daily
spikes and dips. This would help in visualizing the overall sales trend more clearly. The code
would look something like this:
Python
import pandas as pd
# Assuming 'df' is a DataFrame with a 'sales' column
df['smoothed_sales'] = df['sales'].rolling(window=7).mean()
This creates a new column where each data point is the average of the previous 7 days' sales,
providing a much smoother line for analysis."
Question 17: Given a string, write a Python function to find its first recurring
character. For example, in "ABCA", the first recurring character is "A".
Answer: "Certainly. To solve this efficiently, I would iterate through the string and use a set
to keep track of the characters I have already seen. A set provides O(1) average time
complexity for lookups, making this approach very fast.
Python
def find_first_recurring_char(s):
    seen_chars = set()
    for char in s:
        if char in seen_chars:
            return char  # Found the first recurring character
        seen_chars.add(char)
    return None  # No recurring character found
Explanation: The function loops through each character. If the character is already in our
seen_chars set, we've found our first recurring character and return it immediately.
Otherwise, we add the character to the set and continue. If the loop completes without finding
any recurring characters, it returns None."
Question 18: You are given two tables: neighborhoods (id, name) and users (id, name,
neighborhood_id). Write a SQL query that returns all neighborhoods that have 0 users.
Answer: "To find neighborhoods with no users, I would use a LEFT JOIN from the
neighborhoods table to the users table. A LEFT JOIN will include all records from the left
table (neighborhoods), regardless of whether there's a match in the right table (users).
Where there is no match, the columns from the right table will be NULL.
SQL
SELECT n.name
FROM neighborhoods n
LEFT JOIN users u ON n.id = u.neighborhood_id
WHERE u.id IS NULL;
Explanation: This query joins the two tables on the neighborhood_id. For neighborhoods
that have no users, the u.id column (and all other columns from the users table) will be
NULL. The WHERE u.id IS NULL clause then filters the result to show only those
neighborhoods."
Question 19: When designing a data warehouse, what is the difference between a Star
Schema and a Snowflake Schema? What are the trade-offs?
Answer: "Both Star and Snowflake schemas are common design patterns in dimensional data
modeling, but they differ in how they handle dimension tables.
A Star Schema has a central fact table connected to multiple dimension tables.
Each dimension is represented by a single table. This design is simple, with fewer
joins required to get data, which generally leads to faster query performance.
A Snowflake Schema is an extension of the star schema where the dimension tables
are normalized into multiple related tables. This creates a branching, snowflake-like
structure. For example, a Product dimension might be broken down into
Product_Category and Product_Brand tables.
Trade-offs:
Performance: Star schemas are typically faster for querying because they require
fewer joins. Snowflake schemas can be slower due to the multiple joins needed to
bring the dimensional data together.
Data Integrity & Storage: Snowflake schemas reduce data redundancy and can save
storage space because of normalization. This also makes maintaining the dimensions
easier, as updates only need to happen in one place.
Simplicity: Star schemas are much simpler to understand and work with, both for
developers writing queries and for BI tools."
Question 20: How would you approach designing a new data pipeline from scratch?
Answer: "Before starting any design, my first step is always to clarify requirements by
asking key questions. I'd talk to stakeholders to understand:
Data Sources: Where is the data coming from? Is it from databases, APIs, or
streaming sources?
Data Characteristics: What is the data type (structured, unstructured), volume, and
velocity (batch or real-time)?
Business Goal: How will this data be used? Is it for BI reporting, a machine learning
model, or operational analytics?
Frequency: How often does the data need to be updated?
Once these requirements are clear, I would design the pipeline in stages:
1. Ingestion: I'd choose an ingestion method based on the source. For batch data from
databases, I might use a tool like Azure Data Factory. For real-time data, I'd consider
a streaming platform like Apache Kafka.
2. Storage: The raw data would be loaded into a scalable and cost-effective storage
layer, like Azure Data Lake Storage.
3. Processing & Transformation (ETL/ELT): This is the core of the pipeline. I'd use a
powerful processing framework like Azure Databricks with PySpark to clean,
transform, join, and aggregate the data into a usable format.
4. Serving Layer: The transformed data would be loaded into a data warehouse like
Azure Synapse Analytics or served in data marts for easy consumption by BI tools
like Power BI.
5. Orchestration & Monitoring: I would use a tool like Azure Data Factory or Apache
Airflow to schedule, automate, and monitor the entire pipeline, with built-in alerting
to triage any failures quickly."
Question 21: You've listed NumPy and Pandas as key skills. Can you explain the main
benefits of using a NumPy array over a standard Python list?
Answer: "The main benefits of using NumPy arrays over Python lists, especially in data
engineering and science, are performance and functionality.
Performance and Memory: NumPy arrays are more compact and consume less
memory than Python lists. They are implemented in C, which makes mathematical
operations on them significantly faster than performing equivalent operations on lists
using loops.
Functionality: NumPy is specifically designed for scientific and numerical
computing. It provides a vast library of high-level mathematical functions that can
operate on entire arrays without the need for explicit loops. This makes code cleaner
and more efficient for tasks involving numerical data, which is why it's a foundational
library for tools like Pandas."
Answer: "Lookups are significantly faster in a dictionary. This is because of the underlying
data structure they use.
Because of this, I always use a dictionary when I need to perform frequent lookups based on
a unique key."
Question 23: What do you consider to be the most common bottlenecks in an ETL
process, and how would you address them?
Answer: "When an ETL job fails or underperforms, I follow a structured approach:
1. First, I would try to determine if the error is replicable. A consistent error is much
easier to debug than an intermittent one.
2. If the error is replicable, I would immediately check the data delivery and trace the
source of the error. This often involves looking at recent changes in upstream data
sources or schemas.
3. If the error is non-replicable or intermittent, I would analyze the ETL logic, the
infrastructure logs, and the data schemas to identify potential race conditions or
resource contention issues.
4. In either case, clear documentation of the findings is essential for long-term reliability
and to improve future troubleshooting processes."
Question 25: Can you explain the concepts of Normalization and Denormalization in
database design? When would you use each?
Answer: "Normalization is the process of organizing data into multiple related tables to
minimize redundancy and avoid update anomalies. I would use it in an OLTP (Online
Transaction Processing) system, where data integrity and efficient writes are the priority.
Denormalization, on the other hand, is the process of intentionally adding redundant data to
one or more tables. This is done to improve query performance by reducing the need for
complex and costly joins between tables. I would use denormalization in a data warehouse
or an OLAP (Online Analytical Processing) system, where read performance for large
analytical queries is more important than eliminating data redundancy."
Question 26: Your skills include various database types. When should a team consider
using a graph database over a traditional relational database?
Answer: "A team should consider using a graph database when the data is highly
interconnected and the primary goal is to query the relationships between data points.
Graph databases, like Neo4j, excel in scenarios with intricate relationships, such as
social networks, fraud detection, or recommendation systems. They make it very fast
and intuitive to traverse complex connections, like finding "friends of friends."
Relational databases, like MySQL, are excellent for storing structured data but can
become slow and require complex recursive queries or multiple joins to handle deep
relationships.
So, if the core of the business problem involves understanding connections and paths between
entities, a graph database is often the superior choice. For more structured, transactional data,
a relational database is usually more appropriate."
Question 27: As a Data Engineer, how do you ensure you are building data solutions
that align with business requirements?
Answer: "I follow a few guiding practices:
Engage Stakeholders Early and Often: Before designing any pipeline, I make it a
priority to gather input from all stakeholders. I ask questions to understand not just
what data they need, but why they need it and how they plan to use it.
Translate Business Logic into Technical Specifications: I work closely with
business teams to translate their requirements into clear technical specifications for
the ETL pipelines. As I did at Reliance Nippon, this involves collaboration to ensure
the logic I implement in PySpark and SQL accurately reflects their needs for data
cleansing and enrichment.
Build for Future Needs: I aim to design systems that are not just for the immediate
need but are also scalable and flexible. This means avoiding poor naming
conventions, planning for future growth, and using unique identifiers like surrogate
keys to maintain consistency.
Communicate Clearly: I believe in conveying technical concepts and insights to
non-technical audiences in a simple, understandable way, often using visualizations to
support my points."
Question 28: How do you approach data quality and governance in your projects?
Answer: "Data quality and governance are not afterthoughts; they are integral to building
reliable data pipelines. My approach includes:
By integrating these practices into my workflow, I ensure that the data delivered is not just
available, but also accurate, consistent, and trustworthy."
Behavioral questions are designed to assess your soft skills, problem-solving approach, and
cultural fit. The key is not just what you did, but how you did it.
Always Use the STAR Method: This is the most effective framework for answering
behavioral questions.
o S - Situation: Briefly describe the context. What was the project or challenge?
o T - Task: What was your specific responsibility or goal?
o A - Action: Describe the concrete steps you took to address the situation. This
is where you should highlight your skills (technical and soft).
o R - Result: What was the outcome? Quantify your success whenever possible
(e.g., "reduced pipeline runtime by 40%," "improved data accuracy").
Be Specific and Concrete: Avoid vague statements. Instead of saying "I improved
the system," say "I optimized the PySpark job by implementing data partitioning,
which reduced the runtime."
Focus on Collaboration: Data engineering is a team sport. When describing projects,
always mention how you worked with data analysts, business teams, or other
engineers.
At the end of the interview, you will almost always be asked, "Do you have any questions for
us?" Having thoughtful questions prepared is critical. It shows your engagement and helps
you evaluate if the company is a good fit for you.
Use this checklist in the days leading up to your interview to ensure you are fully prepared.
[ ] Review Your Master Notes: Read through this entire document one or two more
times to refresh your memory on all key concepts.
[ ] Practice Your Answers Aloud: Go through the Q&A section and practice
answering the questions out loud. This helps you sound more natural and confident.
Pay special attention to your "Tell me about yourself" answer and your project
walkthroughs.
[ ] Research the Company and Role: Re-read the job description carefully. Research
the company's recent projects, products, and mission. This will help you tailor your
answers and ask more insightful questions.
[ ] Prepare Your Questions: Choose 3-4 thoughtful questions from the list above to
ask your interviewer.
[ ] Prepare Your Project Stories: Be ready to discuss the projects on your resume in
detail, using the STAR method. Think about the challenges, your specific
contributions, and the outcomes.
[ ] Check Your Tech (for virtual interviews): Ensure your internet connection,
camera, and microphone are working correctly. Find a quiet, well-lit space for the
interview.
[ ] Get a Good Night's Sleep: Being well-rested is one of the best things you can do
for your performance.
With this final section, your preparation guide is now complete. It covers the technical
knowledge, practical application, and strategic components needed to excel in your Data
Engineer interviews. Good luck!
Question 29: Can you walk me through your Zomato dashboard project? What was the
main objective?
Question 30: What data sources did you use for this project, and how did you prepare
the data for analysis?
Answer: "For this project, the data was sourced from six separate Excel files: food.xlsx,
menu.xlsx, orders.xlsx, orders_Type.xlsx, restaurant.xlsx, and users.xlsx.
My data preparation process involved several key steps within Power BI's Power Query
Editor:
Data Loading: I first loaded all six Excel files into Power BI.
Data Cleaning: I performed data cleaning to handle any inconsistencies or missing
values to ensure data quality.
Data Modeling: I then established relationships between the tables to create a
coherent data model. For example, I connected the orders table with the users and
restaurant tables using their respective IDs. This model was crucial for building
accurate visualizations."
Question 31: Your main dashboard shows several key performance indicators (KPIs)
like Sales Amount, Sales Quantity, and Orders. How did you calculate these, and why
did you choose them?
Answer: "The KPIs on the main dashboard provide a high-level 'at-a-glance' overview of the
business's health. I chose Sales Amount (986.6M), Sales Quantity (2.4M), Orders
(150.3K), and Overall Rating (21.0K) because they represent the core drivers of the
business—revenue, volume, and customer satisfaction.
To calculate these, I used DAX (Data Analysis Expressions) in Power BI. For example:
Sales Amount was calculated using a SUMX function to iterate over the orders table
and multiply the quantity by the price of each item.
Sales Quantity and Orders were calculated using SUM and COUNTROWS on the
respective columns and tables. These KPIs are essential for quickly assessing overall
performance before diving deeper into more specific analyses."
Question 32: Can you explain the insights available on the 'User Performance' page?
What business questions does this page answer?
Answer: "The 'User Performance' page is designed to answer key questions about the user
base, such as 'Who are our most active users?' and 'Where are we gaining or losing users?'
Gain vs. Lost Users: This visualization tracks user acquisition and churn over time,
which is critical for understanding growth.
Active Users & User Count: KPIs like Active Users (78K) provide a snapshot of
user engagement.
Users by Age: This bar chart helps identify the primary age demographics of our user
base, which is valuable for targeted marketing campaigns.
For instance, a business leader could use this page to see that while we are gaining 12K new
users, we've also lost 33K, indicating a potential issue with user retention that needs to be
addressed."
Question 33: On the 'City Performance' page, you have several bar charts. If a
stakeholder asked you which city to focus on for a new marketing campaign, how would
you use this dashboard to answer them?
Answer: "To answer that question, I would use the 'City Performance' page to look at a
combination of metrics. I wouldn't just look at 'Sale by city', which shows Tirupati as the top
city by sales. I would also analyze:
'Rating by city': To see if our top-selling cities also have high customer satisfaction.
A high-selling but low-rated city might need operational improvements rather than
marketing.
'User by city': To identify cities with a large or growing user base, which could be
prime targets for a campaign.
The main table: This table provides a detailed breakdown of Sales, Orders, and
Gained/Lost Users per city. For example, if Tirupati has high sales but is also losing a
lot of users, a retention-focused campaign might be more effective there. Conversely,
a city with a high number of gained users might be a great place for a growth-focused
marketing campaign.