401907116 AZEEM SATHAR MACHINE LEARNING ASSIGNMENT 1
Table of Contents
QUESTION 1: How can we determine which type of mean (arithmetic, geometric, or harmonic) is most
appropriate to use for a given set of numerical data, such as (3,6,9,12), and what are the resulting
mean values for each type?
QUESTION 2: Data does not always come clean; sometimes it is messy and error-prone. Data scientists
use some checks to clean it and improve its quality. Explain the functions of three basic checks and
provide examples for each.
QUESTION 3: How well can we predict the amount of rainfall (in millimetres) based on the temperature
(in degrees Celsius) using linear regression on a given dataset, such as the one below? What is the
equation of the regression line, and what is the predicted amount of rainfall for a temperature of 25
degrees Celsius?
QUESTION 4: What are some effective strategies for addressing the challenges and complexities
involved in the data collection and cleaning stages of the machine learning pipeline, and how can
these strategies be applied to improve the accuracy and reliability of machine learning models in
real-world applications?
QUESTION 5: What are the key stages in the machine learning project life cycle, and what are some
best practices for each stage to ensure the successful development and deployment of machine
learning models?
REFERENCES
QUESTION 1:
How can we determine which type of mean (arithmetic, geometric, or harmonic) is most
appropriate to use for a given set of numerical data, such as (3,6,9,12), and what are the
resulting mean values for each type?
1) Arithmetic Mean: The arithmetic mean provides a balanced representation of the
"average" value. It's suitable when you want to find a value that considers all data
points equally. Arithmetic Mean = (3 + 6 + 9 + 12) / 4 = 7.5
2) Geometric Mean: The geometric mean is useful when dealing with quantities that
are related by multiplicative factors, such as growth rates or ratios. Geometric Mean
= (3 * 6 * 9 * 12)^(1/4) = 1944^(1/4) ≈ 6.64
3) Harmonic Mean: The harmonic mean is relevant when dealing with rates,
proportions, or averages of rates. Harmonic Mean = 4 / ((1/3) + (1/6) + (1/9) + (1/12))
= 4 / (25/36) = 5.76
For positive data the three means always satisfy harmonic ≤ geometric ≤ arithmetic, as they
do here (5.76 < 6.64 < 7.5). The values (3,6,9,12) are evenly spaced and nothing indicates
that they are rates or growth factors, so the arithmetic mean is the most appropriate choice
for this dataset.
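The three means for the dataset (3, 6, 9, 12) given in the question can be checked with a minimal sketch using Python's standard `statistics` module. The variable names are illustrative only.

```python
from statistics import mean, geometric_mean, harmonic_mean

data = [3, 6, 9, 12]

arithmetic = mean(data)            # (3 + 6 + 9 + 12) / 4
geometric = geometric_mean(data)   # (3 * 6 * 9 * 12) ** (1 / 4)
harmonic = harmonic_mean(data)     # 4 / (1/3 + 1/6 + 1/9 + 1/12)

print(f"arithmetic={arithmetic:.2f}, geometric={geometric:.2f}, harmonic={harmonic:.2f}")
```

For positive values the harmonic mean never exceeds the geometric mean, which never exceeds the arithmetic mean, so the ordering of the three results is a quick sanity check.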
QUESTION 2:
Data does not always come clean; sometimes it is messy and error-prone. Data scientists use
some checks to clean it and improve its quality. Explain the functions of three basic checks
and provide examples for each.
Missing Values Check:
Missing values can significantly impact the analysis and modeling of data. This check involves
identifying and handling missing values in the dataset. Function: Detect and handle missing values to
prevent biased analyses and ensure accurate results. Example: Suppose you have a dataset of
customer records with columns for "Age," "Income," and "Gender." If some rows have missing values
in the "Income" column, you might either impute them (e.g., fill with the median income) or remove
the rows with missing income data.
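The two options in the example (impute with the median, or drop the rows) could be sketched as follows. The records, column names, and helper functions are hypothetical and chosen only to mirror the example above.

```python
from statistics import median

# Hypothetical customer records; None marks a missing "Income" value.
records = [
    {"Age": 34, "Income": 52000, "Gender": "F"},
    {"Age": 45, "Income": None, "Gender": "M"},
    {"Age": 29, "Income": 48000, "Gender": "F"},
    {"Age": 51, "Income": None, "Gender": "M"},
    {"Age": 38, "Income": 61000, "Gender": "F"},
]

def count_missing(rows, column):
    """Number of rows where the column has no value."""
    return sum(1 for row in rows if row[column] is None)

def impute_median(rows, column):
    """Return new rows with missing values replaced by the observed median."""
    observed = [row[column] for row in rows if row[column] is not None]
    fill = median(observed)
    return [{**row, column: fill if row[column] is None else row[column]}
            for row in rows]

def drop_missing(rows, column):
    """Return only the rows that have a value in the column."""
    return [row for row in rows if row[column] is not None]

imputed = impute_median(records, "Income")
complete = drop_missing(records, "Income")
print(count_missing(records, "Income"), len(imputed), len(complete))
```

Imputation keeps every row at the cost of inventing values, while dropping rows keeps only observed values at the cost of a smaller dataset.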
Outlier Detection:
Outliers are data points that deviate significantly from the rest of the data. They can distort
statistical analyses and model performance. Outlier detection helps identify and manage these
extreme values. Function: Identify and handle outliers to prevent skewed results and improve model
generalization. Example: In a dataset of housing prices, an outlier might be a property with an
unusually high price that doesn't align with the general price distribution. You could choose to
remove such extreme values or transform them to be more aligned with the rest of the data.
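One common way to flag such values is the interquartile range (IQR) rule, sketched below. The prices are made up for illustration, and the multiplier 1.5 is the conventional default rather than anything prescribed by the text.

```python
from statistics import quantiles

# Hypothetical housing prices; the last value is far above the rest.
prices = [210_000, 225_000, 198_000, 240_000, 232_000, 215_000, 1_950_000]

def iqr_bounds(values, k=1.5):
    """Lower and upper fences: Q1 - k*IQR and Q3 + k*IQR."""
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

low, high = iqr_bounds(prices)
outliers = [p for p in prices if p < low or p > high]
kept = [p for p in prices if low <= p <= high]
print(outliers)
```

Whether a flagged value is removed, capped, or kept remains a judgment call that depends on whether it is an error or a genuine extreme.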
Data Consistency Check:
A data consistency check identifies inconsistencies or errors in the data that arise from data
entry mistakes or data integration issues.
Function: Ensure data consistency by identifying and correcting discrepancies or conflicting
information. Example: Consider a dataset that includes a column for "Date of Birth." If a record's
birthdate implies an age of 150 years, it is almost certainly an error. Data consistency checks
identify such implausible entries and prompt investigation or correction.
(Analytics Vidhya, 2021)
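A consistency check on the "Date of Birth" example might look like the sketch below. The 120-year limit, the field names, and the sample rows are assumptions made for illustration.

```python
from datetime import date

MAX_PLAUSIBLE_AGE = 120  # assumed threshold, not taken from the text

def age_on(birth_date, today):
    """Age in completed years on the given day."""
    years = today.year - birth_date.year
    if (today.month, today.day) < (birth_date.month, birth_date.day):
        years -= 1
    return years

def implausible_birthdates(rows, today, max_age=MAX_PLAUSIBLE_AGE):
    """Return the ids of rows whose birthdate implies a negative or extreme age."""
    flagged = []
    for row in rows:
        age = age_on(row["date_of_birth"], today)
        if age < 0 or age > max_age:
            flagged.append(row["id"])
    return flagged

rows = [
    {"id": 1, "date_of_birth": date(1990, 5, 17)},
    {"id": 2, "date_of_birth": date(1873, 1, 1)},  # implies an age of 150
    {"id": 3, "date_of_birth": date(2090, 3, 2)},  # lies in the future
]
flagged = implausible_birthdates(rows, today=date(2023, 9, 10))
print(flagged)
```

The check only flags records; deciding whether to correct, drop, or investigate them is left to the analyst.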
QUESTION 3:
How well can we predict the amount of rainfall (in millimetres) based on the temperature (in
degrees Celsius) using linear regression on a given dataset, such as the one below? What is
the equation of the regression line, and what is the predicted amount of rainfall for a
temperature of 25 degrees Celsius?
1) We first need to calculate the regression line. The working uses the five (temperature,
rainfall) pairs that appear in the covariance calculation: (20, 50), (22, 60), (22, 70), (25, 80),
(28, 90). The temperature value 30 has no paired rainfall value in the working, so it is left out.
We start with the means of the temperature and rainfall data:
x̄ = (20 + 22 + 22 + 25 + 28) / 5 = 23.4
ȳ = (50 + 60 + 70 + 80 + 90) / 5 = 70
2) Then we calculate the covariance of x and y and the variances of x and y (dividing by n = 5;
the divisor cancels in the slope and in R²):
• Sxy = [(20 - 23.4) * (50 - 70) + (22 - 23.4) * (60 - 70) + (22 - 23.4) * (70 - 70) + (25 - 23.4)
* (80 - 70) + (28 - 23.4) * (90 - 70)] / 5 = 190 / 5 = 38
• Sxx = [(20 - 23.4)^2 + (22 - 23.4)^2 + (22 - 23.4)^2 + (25 - 23.4)^2 + (28 - 23.4)^2] / 5
= 39.2 / 5 = 7.84
• Syy = [(50 - 70)^2 + (60 - 70)^2 + (70 - 70)^2 + (80 - 70)^2 + (90 - 70)^2] / 5 = 1000 / 5 = 200
3) We can then calculate the slope of the regression line:
• b = Sxy / Sxx = 38 / 7.84 ≈ 4.847
4) And the y-intercept of the regression line:
• a = ȳ - b * x̄ = 70 - 4.847 * 23.4 ≈ -43.42
5) The equation of the regression line is y = 4.847x - 43.42
We can use this equation to predict the amount of rainfall for a temperature of 25 degrees
Celsius:
y = 4.847 * 25 - 43.42 ≈ 77.8
The predicted amount of rainfall for a temperature of 25 degrees Celsius is about 77.8 mm.
Finally, to determine how well we can predict the amount of rainfall based on temperature
using this linear regression model, we can calculate the coefficient of determination
(R-squared), which measures the proportion of the variance in the dependent variable (rainfall)
that is explained by the independent variable (temperature). The formula for R-squared is:
R² = (Sxy / √(Sxx * Syy))^2 = 38^2 / (7.84 * 200) = 1444 / 1568 ≈ 0.92
An R-squared value of 0.92 indicates that the linear regression model explains about 92% of the
variance in the rainfall data, which suggests that the model fits these data well. With only five
observations, however, the estimate should be treated with caution.
(WallStreetMojo, 2019)
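The least-squares calculation can be reproduced with a short sketch. It assumes the five (temperature, rainfall) pairs that appear in the covariance working, since the dataset table itself is not reproduced in the text. The function name `fit_line` is illustrative.

```python
# Assumed data: the five pairs used in the covariance calculation.
temps = [20, 22, 22, 25, 28]   # temperature in degrees Celsius
rain = [50, 60, 70, 80, 90]    # rainfall in millimetres

def fit_line(xs, ys):
    """Simple least-squares fit; returns slope, intercept and R-squared."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Sums of squares and cross-products; a common divisor would cancel below.
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    r_squared = sxy ** 2 / (sxx * syy)
    return slope, intercept, r_squared

slope, intercept, r_squared = fit_line(temps, rain)
predicted_at_25 = slope * 25 + intercept
print(f"y = {slope:.3f}x + ({intercept:.2f}), R^2 = {r_squared:.2f}")
print(f"predicted rainfall at 25 degrees: {predicted_at_25:.1f} mm")
```

Because R-squared is a ratio, it always lies between 0 and 1 for a simple linear fit, which makes it a useful check on hand calculations.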
QUESTION 4:
What are some effective strategies for addressing the challenges and complexities involved in
the data collection and cleaning stages of the machine learning pipeline, and how can these
strategies be applied to improve the accuracy and reliability of machine learning models in
real-world applications?
The following strategies address the challenges and complexities in these stages and improve model
performance in real-world applications:
1) Handling Missing Data or Null Values:
Missing data is among the most common data quality issues that data scientists encounter,
and it significantly affects both business analysis and statistical analysis. A value is
missing when no data point is stored for a particular column or feature. A dataset is often
built from several data sources; some sources may lack the observation altogether, and
people entering data may skip fields or fill them in incorrectly, which leads to corrupt
data. Different data sources may also mark missing values in different ways, which
complicates analysis and can significantly change the conclusions drawn from the data.
Depending on the extent of missing data, consider imputation with the mean or median, or
more advanced methods such as K-nearest neighbours or predictive modelling.
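Because sources can mark missing values in different ways, a useful first step is to map every marker to a single representation before counting or imputing. The sketch below assumes a hypothetical set of markers; a real project would take them from the documentation of each source.

```python
# Assumed markers for "no value"; adjust to the sources actually in use.
MISSING_MARKERS = {"", "na", "n/a", "null", "none", "-999"}

def normalise_missing(value):
    """Map any recognised missing-value marker to None; keep other values."""
    if value is None:
        return None
    if str(value).strip().lower() in MISSING_MARKERS:
        return None
    return value

def missing_fraction(values):
    """Share of entries that are missing after normalisation."""
    cleaned = [normalise_missing(v) for v in values]
    return sum(v is None for v in cleaned) / len(cleaned)

# Hypothetical "age" column merged from several sources.
raw_ages = [34, "N/A", 29, "", -999, None, 41, "null"]
cleaned_ages = [normalise_missing(v) for v in raw_ages]
print(cleaned_ages, missing_fraction(raw_ages))
```

The missing fraction then informs the choice between simple imputation, a model-based method, or dropping the column.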
2) Handling Duplicate Data
Data is collected either by scraping or by combining different data sources, so there is a
high chance of duplicate entries, often introduced by human error. Duplicate records
distort analytical outcomes: they make reports less reliable, and predictions based on
duplicated values can undermine business targets.
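A minimal sketch of duplicate removal is shown below. It assumes that two records are duplicates when the chosen key fields match after trimming whitespace and ignoring case. The field names and records are hypothetical.

```python
def deduplicate(rows, key_fields):
    """Keep the first occurrence of each record, compared on normalised key fields."""
    seen = set()
    unique = []
    for row in rows:
        key = tuple(str(row[field]).strip().lower() for field in key_fields)
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique

# Hypothetical orders merged from two sources; the second row repeats the first.
orders = [
    {"customer": "Ana Lee", "email": "ana@example.com", "amount": 120},
    {"customer": "ana lee ", "email": "ANA@example.com", "amount": 120},
    {"customer": "Raj Patel", "email": "raj@example.com", "amount": 80},
]
unique_orders = deduplicate(orders, ["customer", "email", "amount"])

total_before = sum(o["amount"] for o in orders)
total_after = sum(o["amount"] for o in unique_orders)
print(total_before, total_after)
```

The difference between the two totals shows how a single duplicate can inflate a reported figure.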
3) Detecting Outliers
Outliers are data entries whose values deviate significantly from the rest of the data.
They appear in most real-world datasets, and handling them is a standard data cleaning
step. Outliers also affect data accuracy and business analytics. Many machine learning
algorithms, linear regression in particular, are sensitive to outliers: left untreated,
they can distort the fitted model and lead to false conclusions.
Two common techniques for detecting outliers are:
• Normal distribution - Also known as the bell curve. Comparing a feature's distribution
with it highlights values that lie far out in the tails (for example, more than three
standard deviations from the mean).
• Box plots - A box plot summarises a feature's distribution by its quartiles and marks
points beyond the whiskers (typically 1.5 times the interquartile range) as potential
outliers.
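The normal-distribution approach is often applied as a z-score rule, sketched below under the assumption that the feature is roughly bell-shaped. The readings and the threshold of three standard deviations are illustrative choices.

```python
from statistics import mean, stdev

def z_scores(values):
    """Distance of each value from the mean, in sample standard deviations."""
    centre = mean(values)
    spread = stdev(values)
    return [(v - centre) / spread for v in values]

def z_score_outliers(values, threshold=3.0):
    """Values whose absolute z-score exceeds the threshold."""
    return [v for v, z in zip(values, z_scores(values)) if abs(z) > threshold]

# Hypothetical sensor readings with one extreme value.
readings = [50, 52, 49, 51, 48, 50, 53, 47, 51, 49, 50, 120]
flagged = z_score_outliers(readings)
print(flagged)
```

The rule needs a reasonable number of observations to work: in very small samples the extreme value inflates the standard deviation so much that no z-score can reach the threshold.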
4) Removal of Irrelevant Data
A dataset is usually assembled by scraping and combining data from various sources. Solving
a particular business problem may need only some of the stored features; features that
carry no information about the problem add no value to the analysis, so removing irrelevant
data is common practice in data cleaning. Judging the relevance of each feature is
therefore an important step in the data cleaning process.
Example of Irrelevant Data - Suppose we are clustering our customers based on their
purchase history; features such as Email Address, Mobile Number, and Nationality would be
irrelevant.
5) Regular Maintenance
Data drift and shifts can occur over time, affecting model performance. Regularly monitor
and update your data to ensure the model remains accurate and reliable in real-world
applications.
(www.linkedin.com, n.d.)
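Monitoring for drift can start with something as simple as comparing a recent batch of a feature against the training data. The sketch below is a rough heuristic, not a formal statistical test, and the threshold of two standard deviations is an assumed value.

```python
from statistics import mean, stdev

def mean_shift(reference, current):
    """Shift of the current mean, measured in reference standard deviations."""
    return abs(mean(current) - mean(reference)) / stdev(reference)

def has_drifted(reference, current, threshold=2.0):
    """Flag the feature when its mean has moved more than the threshold."""
    return mean_shift(reference, current) > threshold

# Hypothetical temperature feature: training data and two recent batches.
training_temps = [20, 22, 21, 23, 22, 21, 20, 22]
recent_similar = [21, 22, 20, 23]
recent_shifted = [29, 31, 30, 32]

print(has_drifted(training_temps, recent_similar))
print(has_drifted(training_temps, recent_shifted))
```

A flagged feature is a prompt to inspect the data and, if the change is genuine, to retrain or update the model.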
QUESTION 5:
What are the key stages in the machine learning project life cycle, and what are some best
practices for each stage to ensure the successful development and deployment of machine
learning models?
Key stages:
1. Problem Definition and Planning:
• Clearly define the problem you want to solve and the goals of your machine learning
project.
• Identify the relevant data sources, available resources, and potential challenges.
• Plan the project timeline, allocate resources, and set expectations.
2. Data Collection:
• Gather relevant data from various sources, considering data quality, quantity, and
relevance to the problem.
• Ensure that data collection methods align with project goals and ethical
considerations.
3. Data Preprocessing and Cleaning:
• Clean and preprocess the raw data to handle missing values, outliers, and
inconsistencies.
• Perform transformations such as normalization, encoding categorical variables, and
feature engineering.
4. Exploratory Data Analysis (EDA):
• Analyse and visualize the data to gain insights into patterns, relationships, and
potential biases.
• EDA helps guide feature selection and model design.
5. Feature Engineering:
• Create new features from existing data or domain knowledge to enhance model
performance.
• Select relevant features that contribute to the predictive power of the model.
6. Model Selection and Training:
• Choose appropriate algorithms and model architectures based on the problem type
(classification, regression, etc.).
• Split the data into training, validation, and test sets for model training and
evaluation.
• Train multiple models and tune hyperparameters to find the best-performing one.
7. Model Evaluation:
• Assess model performance using appropriate metrics (accuracy, precision, recall, F1-
score, etc.).
• Validate the model on the validation set and fine-tune parameters as needed.
8. Model Deployment:
• Deploy the trained model to a production environment for real-world use.
• Integrate the model into existing systems, APIs, or user interfaces.
9. Monitoring and Maintenance:
• Continuously monitor the deployed model's performance in the production
environment.
• Address issues like data drift, concept drift, and changing user behaviour.
• Update the model as needed to maintain accuracy and reliability.
10. Documentation:
• Document the entire project, including data sources, preprocessing steps, model
architecture, and deployment details.
• Clear documentation ensures reproducibility and ease of understanding for others.
11. Communication and Reporting:
• Share the results, insights, and outcomes of the project with stakeholders, including
non-technical audiences.
• Communicate the limitations and assumptions of the model.
12. Ethical Considerations:
• Address potential biases, fairness, and ethical concerns related to the data and
model predictions.
• Implement measures to mitigate harmful impacts and ensure responsible use.
(www.datacamp.com, 23 August 2023)
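Two of the stages above, splitting the data (stage 6) and computing evaluation metrics (stage 7), can be sketched in plain Python. The split fractions, the seed, and the function names are assumptions for illustration. In practice a library such as scikit-learn would normally be used.

```python
import random

def train_val_test_split(rows, val_frac=0.2, test_frac=0.2, seed=0):
    """Shuffle a copy of the rows and cut it into train, validation and test sets."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_frac)
    n_val = int(len(shuffled) * val_frac)
    test = shuffled[:n_test]
    val = shuffled[n_test:n_test + n_val]
    train = shuffled[n_test + n_val:]
    return train, val, test

def classification_metrics(y_true, y_pred, positive=1):
    """Accuracy, precision, recall and F1-score for a binary classifier."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    accuracy = sum(t == p for t, p in pairs) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

rows = list(range(10))
train, val, test = train_val_test_split(rows)

# Hypothetical labels and predictions for six validation examples.
metrics = classification_metrics([1, 0, 1, 1, 0, 1], [1, 0, 0, 1, 1, 1])
print(len(train), len(val), len(test), metrics)
```

Keeping the test set untouched until the final evaluation is what makes its score an honest estimate of real-world performance.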
REFERENCES
1. Analytics Vidhya. (2021). Dealing with Missing Values | Missing Values in a Data Science
Project. [online] Available at:
https://www.analyticsvidhya.com/blog/2021/10/guide-to-deal-with-missing-values/.
2. gocardless.com. (n.d.). How to Calculate a Regression Line. [online] Available at:
https://gocardless.com/guides/posts/how-to-calculate-a-regression-line/.
3. Stedman, C. (2022). What is data collection? - Definition from WhatIs.com. [online]
SearchCIO. Available at: https://www.techtarget.com/searchcio/definition/data-collection.
4. WallStreetMojo. (2019). Regression Formula | Step by Step Calculation (with Examples).
[online] Available at: https://www.wallstreetmojo.com/regression-formula/.
5. www.datacamp.com. (n.d.). The Machine Learning Life Cycle Explained. [online] Available at:
https://www.datacamp.com/blog/machine-learning-lifecycle-explained.
6. www.linkedin.com. (n.d.). Machine Learning Pipeline & Challenges. [online] Available at:
https://www.linkedin.com/pulse/machine-learning-pipeline-challenges-tahir-riaz
[Accessed 10 Sep. 2023].