Machine Learning Basics for Students

The document discusses strategies for improving machine learning models by addressing challenges in data collection and cleaning. Key strategies include:
1. Handling missing data or null values, a common issue that can distort analysis. Methods such as imputing the median value can address it.
2. Detecting and handling outliers, which are extreme or anomalous values that can skew results. Outliers can be removed or transformed.
3. Ensuring data consistency by identifying errors or discrepancies, such as implausible dates. Consistency checks find and correct such issues.
Applying these strategies can improve the accuracy and reliability of machine learning models used in real-world applications.

Uploaded by

Azeem Sathar

401907116 AZEEM SATHAR MACHINE LEARNING ASSIGNMENT 1

Table of Contents
QUESTION 1:
How can we determine which type of mean (arithmetic, geometric, or harmonic) is most appropriate
to use for a given set of numerical data, such as (3,6,9,12), and what are the resulting mean values
for each type?
QUESTION 2:
Data does not always come clean; sometimes it is messy and error-prone. Data scientists use some
checks to clean it and improve its quality. Explain the functions of three basic checks and provide
examples for each.
QUESTION 3:
How well can we predict the amount of rainfall (in millimetres) based on the temperature (in
degrees Celsius) using linear regression on a given dataset, such as the one below? What is the
equation of the regression line, and what is the predicted amount of rainfall for a temperature of 25
degrees Celsius?
QUESTION 4:
What are some effective strategies for addressing the challenges and complexities involved in the
data collection and cleaning stages of the machine learning pipeline, and how can these strategies
be applied to improve the accuracy and reliability of machine learning models in real-world
applications?
QUESTION 5:
What are the key stages in the machine learning project life cycle, and what are some best practices
for each stage to ensure the successful development and deployment of machine learning models?
REFERENCES

QUESTION 1:

How can we determine which type of mean (arithmetic, geometric, or harmonic) is most
appropriate to use for a given set of numerical data, such as (3,6,9,12), and what are the
resulting mean values for each type?

1) Arithmetic Mean: The arithmetic mean provides a balanced representation of the
"average" value. It's suitable when you want to find a value that considers all data
points equally. Arithmetic Mean = (3 + 6 + 9 + 12) / 4 = 7.5

2) Geometric Mean: The geometric mean is useful when dealing with quantities that
are related by multiplicative factors, such as growth rates or ratios. Geometric Mean
= (3 * 6 * 9 * 12)^(1/4) ≈ 6.64

3) Harmonic Mean: The harmonic mean is relevant when dealing with rates,
proportions, or averages of rates. Harmonic Mean = 4 / ((1/3) + (1/6) + (1/9) + (1/12))
= 5.76

Without further context, the values (3, 6, 9, 12) are plain quantities rather than growth
factors or rates, so the arithmetic mean (7.5) is the natural choice. For any set of positive
values the three means are ordered: arithmetic ≥ geometric ≥ harmonic.
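
The three means can be checked with a few lines of Python. This is a minimal sketch using the standard-library `statistics` module on the four values (3, 6, 9, 12) named in the question; the variable names are illustrative only.

```python
from statistics import mean, geometric_mean, harmonic_mean

# The four values given in the question.
data = [3, 6, 9, 12]

arithmetic = mean(data)            # (3 + 6 + 9 + 12) / 4
geometric = geometric_mean(data)   # (3 * 6 * 9 * 12) ** (1 / 4)
harmonic = harmonic_mean(data)     # 4 / (1/3 + 1/6 + 1/9 + 1/12)

print(f"arithmetic={arithmetic:.2f} geometric={geometric:.2f} harmonic={harmonic:.2f}")
# prints: arithmetic=7.50 geometric=6.64 harmonic=5.76
```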

QUESTION 2:
Data does not always come clean; sometimes it is messy and error-prone. Data scientists use
some checks to clean it and improve its quality. Explain the functions of three basic checks
and provide examples for each.

Missing Values Check:

Missing values can significantly impact the analysis and modeling of data. This check involves
identifying and handling missing values in the dataset.
Function: Detect and handle missing values to prevent biased analyses and ensure accurate results.
Example: Suppose you have a dataset of customer records with columns for "Age," "Income," and
"Gender." If some rows have missing values in the "Income" column, you might choose either to
impute the missing values (e.g., filling with the median income) or to remove the rows with
missing income data.
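
The two options in this example, imputing the median or dropping the rows, can be sketched as follows. The customer records, field names, and helper functions are hypothetical and exist only for illustration; `None` is assumed to mark a missing value.

```python
from statistics import median

# Hypothetical customer records; None marks a missing income.
customers = [
    {"age": 34, "income": 52000, "gender": "F"},
    {"age": 45, "income": None, "gender": "M"},
    {"age": 29, "income": 61000, "gender": "F"},
    {"age": 52, "income": 48000, "gender": "M"},
]

def impute_median(rows, column):
    """Return copies of rows with missing `column` values replaced by the observed median."""
    observed = [row[column] for row in rows if row[column] is not None]
    fill = median(observed)
    return [{**row, column: fill if row[column] is None else row[column]} for row in rows]

def drop_missing(rows, column):
    """Return only the rows that have a value in `column`."""
    return [row for row in rows if row[column] is not None]

imputed = impute_median(customers, "income")   # keeps all four rows
dropped = drop_missing(customers, "income")    # keeps the three complete rows
```

Imputation keeps every record at the cost of inventing a value; dropping keeps only observed values at the cost of losing records. Which is preferable depends on how much data is missing.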

Outlier Detection:

Outliers are data points that deviate significantly from the rest of the data. They can distort
statistical analyses and model performance. Outlier detection helps identify and manage these
extreme values.
Function: Identify and handle outliers to prevent skewed results and improve model generalization.
Example: In a dataset of housing prices, an outlier might be a property with an unusually high
price that doesn't align with the general price distribution. You could choose to remove such
extreme values or transform them to be more aligned with the rest of the data.
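
One common way to flag such a property is the interquartile-range (IQR) rule, which is also what a box plot draws. The sketch below is one possible approach, not the only one; the prices are made up, and the multiplier `k=1.5` is the conventional default rather than something the assignment specifies.

```python
from statistics import quantiles

def iqr_outliers(values, k=1.5):
    """Return the values lying more than k * IQR outside the first or third quartile."""
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]

# Hypothetical housing prices with one unusually expensive property.
prices = [210_000, 225_000, 198_000, 240_000, 232_000, 215_000, 1_950_000]

outliers = iqr_outliers(prices)
cleaned = [p for p in prices if p not in outliers]
```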

Data Consistency Check:

Data consistency involves identifying inconsistencies or errors in the data that may arise due to data
entry mistakes or data integration issues.

Function: Ensure data consistency by identifying and correcting discrepancies or conflicting
information.
Example: Consider a dataset that includes a column for "Date of Birth." If there's a record where
the birthdate indicates an age of 150 years, it's likely an error. Data consistency checks would
identify such implausible entries and prompt investigation or correction.

(Analytics Vidhya, 2021)
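
A consistency check of the kind described for "Date of Birth" might look like the sketch below. The record layout, the helper names, and the 120-year threshold are assumptions chosen for illustration; a real project would set the limit from domain knowledge.

```python
from datetime import date

MAX_PLAUSIBLE_AGE = 120  # assumed threshold, not taken from the assignment

def age_on(dob, today):
    """Age in whole years on the given day."""
    years = today.year - dob.year
    if (today.month, today.day) < (dob.month, dob.day):
        years -= 1
    return years

def implausible_birthdates(records, today, max_age=MAX_PLAUSIBLE_AGE):
    """Return records whose birthdate is in the future or implies an age above max_age."""
    flagged = []
    for rec in records:
        age = age_on(rec["date_of_birth"], today)
        if age < 0 or age > max_age:
            flagged.append(rec)
    return flagged

# Hypothetical records.
records = [
    {"id": 1, "date_of_birth": date(1990, 5, 17)},
    {"id": 2, "date_of_birth": date(1873, 1, 1)},   # implies an age of 150
    {"id": 3, "date_of_birth": date(2090, 1, 1)},   # lies in the future
]

flagged = implausible_birthdates(records, today=date(2023, 9, 10))
```

Flagged records are returned rather than deleted, so that they can be investigated or corrected as the text suggests.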

QUESTION 3:

How well can we predict the amount of rainfall (in millimetres) based on the temperature (in
degrees Celsius) using linear regression on a given dataset, such as the one below? What is
the equation of the regression line, and what is the predicted amount of rainfall for a
temperature of 25 degrees Celsius?

1) We first calculate the regression line from the dataset. The calculation uses the five
(temperature, rainfall) pairs (20, 50), (22, 60), (22, 70), (25, 80) and (28, 90).

We start with the means of the temperature (x) and rainfall (y) data:

x̄ = (20 + 22 + 22 + 25 + 28) / 5 = 23.4
ȳ = (50 + 60 + 70 + 80 + 90) / 5 = 70

2) Then we calculate the covariance of x and y and the variance of each. The divisor (5 here)
cancels in the slope and in R-squared, so dividing by n - 1 instead gives the same line.

• Sxy = [(20 - 23.4) * (50 - 70) + (22 - 23.4) * (60 - 70) + (22 - 23.4) * (70 - 70) + (25 - 23.4)
* (80 - 70) + (28 - 23.4) * (90 - 70)] / 5 = 190 / 5 = 38

• Sxx = [(20 - 23.4)^2 + (22 - 23.4)^2 + (22 - 23.4)^2 + (25 - 23.4)^2 + (28 - 23.4)^2] / 5
= 39.2 / 5 = 7.84

• Syy = [(50 - 70)^2 + (60 - 70)^2 + (70 - 70)^2 + (80 - 70)^2 + (90 - 70)^2] / 5 = 1000 / 5 = 200

3) We can then calculate the slope of the regression line:

• b = Sxy / Sxx = 38 / 7.84 ≈ 4.85

4) And the y-intercept of the regression line:

• a = ȳ - b * x̄ = 70 - 4.847 * 23.4 ≈ -43.42

5) The equation of the regression line is y = 4.85x - 43.42

We can use this equation to predict the amount of rainfall for a temperature of 25 degrees
Celsius:
y = 4.85 * 25 - 43.42 ≈ 77.8
The predicted amount of rainfall for a temperature of 25 degrees Celsius is about 77.8 mm.

Finally, to determine how well we can predict the amount of rainfall based on temperature
using this linear regression model, we can calculate the coefficient of determination
(R-squared), which measures the proportion of the variance in the dependent variable (rainfall)
that is explained by the independent variable (temperature). The formula for R-squared is:
R² = (Sxy / √(Sxx * Syy))^2 = 38^2 / (7.84 * 200) ≈ 0.92
An R-squared value of 0.92 indicates that the linear regression model explains about 92% of the
variance in the rainfall data, which suggests that the model is a good fit for the data and that
we can predict rainfall fairly accurately based on temperature using this model. With only five
data points, however, this estimate of fit should be treated as rough.
(WallStreetMojo, 2019)
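
The least-squares calculation can be reproduced in a short script. This is a sketch that assumes the data are the five (temperature, rainfall) pairs appearing in the covariance calculation above; the function and variable names are illustrative. It works with raw sums of squares and products, since any common divisor cancels in the slope and in R-squared.

```python
from math import fsum

# Assumed data: the five (temperature in °C, rainfall in mm) pairs from the covariance step.
temps = [20, 22, 22, 25, 28]
rain = [50, 60, 70, 80, 90]

def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept, plus R-squared."""
    n = len(xs)
    x_bar = fsum(xs) / n
    y_bar = fsum(ys) / n
    sxy = fsum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = fsum((x - x_bar) ** 2 for x in xs)
    syy = fsum((y - y_bar) ** 2 for y in ys)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    r_squared = sxy ** 2 / (sxx * syy)
    return slope, intercept, r_squared

slope, intercept, r_squared = fit_line(temps, rain)
predicted = slope * 25 + intercept  # predicted rainfall at 25 °C

print(f"y = {slope:.2f}x + ({intercept:.2f}), R^2 = {r_squared:.2f}, y(25) = {predicted:.1f} mm")
```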

QUESTION 4:
What are some effective strategies for addressing the challenges and complexities involved in
the data collection and cleaning stages of the machine learning pipeline, and how can these
strategies be applied to improve the accuracy and reliability of machine learning models in
real-world applications?

The following strategies address the challenges and complexities of these stages and improve model
performance in real-world applications:

1) Handling Missing Data or Null values:

Missing data is among the most common data quality issues that data scientists encounter,
and it can significantly affect business and statistical analysis. Missing values occur when
no value is stored for a particular column or feature in a record. A dataset is often built
from multiple data sources; some sources may lack the actual observations, or people may
leave fields blank when entering data, which leaves the dataset incomplete. Different data
sources may also mark missing values in different ways, which complicates analysis and can
significantly affect the conclusions drawn from the data. Depending on the extent of the
missing data, consider imputation techniques such as the mean or median, or more advanced
methods such as K-nearest neighbours or predictive modelling.
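
Because different sources mark missing values differently, a useful first step is to map every marker to a single representation before counting or imputing. The sketch below assumes a hand-picked set of markers and uses `None` as the common representation; both choices are illustrative, and the real list depends on the sources involved.

```python
# Assumed set of strings that sources use to mean "missing" (lower-cased, stripped).
MISSING_MARKERS = {"", "na", "n/a", "null", "none", "nan", "-", "?"}

def normalise_missing(value):
    """Map any recognised missing-value marker to None; leave other values unchanged."""
    if value is None:
        return None
    if isinstance(value, str) and value.strip().lower() in MISSING_MARKERS:
        return None
    return value

# Hypothetical income column merged from several sources.
raw_incomes = ["52000", "N/A", "", "61000", None, "null", " - "]

cleaned = [normalise_missing(v) for v in raw_incomes]
missing_share = cleaned.count(None) / len(cleaned)  # extent of missing data in this column
```

The resulting share of missing values can then guide the choice between dropping records and imputing.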
2) Handling Duplicate Data
Data is collected either by scraping or by combining different data sources, so there is a
high chance of duplicate entries, which can also arise from human error. Duplicate data
distorts analytical outcomes and leads to poor business analytics. It also leads to
incorrect reporting: reports become less reliable, and predictions based on duplicated
values can undermine the business target.
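
A simple way to remove duplicates is to build a comparison key from the fields that identify a record and keep only the first record for each key. The sketch below is one possible approach; the order records, the choice of key fields, and the case-insensitive matching are assumptions made for illustration.

```python
def drop_duplicates(rows, key_fields):
    """Keep the first row for each key; keys ignore case and surrounding whitespace."""
    seen = set()
    unique = []
    for row in rows:
        key = tuple(str(row[field]).strip().lower() for field in key_fields)
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique

# Hypothetical orders combined from two sources; the second row repeats the first.
orders = [
    {"customer": "Alice", "email": "alice@example.com", "amount": 120},
    {"customer": "alice ", "email": "Alice@Example.com", "amount": 120},
    {"customer": "Bob", "email": "bob@example.com", "amount": 75},
]

unique_orders = drop_duplicates(orders, key_fields=("customer", "email", "amount"))
```
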
3) Detecting Outliers
Outliers are data entries whose values deviate significantly from the rest of the data.
Outliers can be found in almost every real-world dataset, and dealing with them is one of
the core data cleaning techniques. Outliers also have a significant effect on data accuracy
and business analytics. Many machine learning algorithms, linear regression models in
particular, are sensitive to outliers: if outliers are not dealt with, the variance of the
model's estimates can turn out very high, which leads to false conclusions.
Two common techniques for detecting outliers are:

• Normal distribution - Also known as the bell curve, the normal distribution helps us
visualize a particular feature's distribution; values far out in the tails are candidate
outliers.

• Box plots - A box plot is a visualization technique that summarizes a feature's
distribution through its quartiles; points beyond the whiskers are candidate outliers.
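
For the normal-distribution approach, a common rule of thumb is to flag values more than about three standard deviations from the mean. The sketch below illustrates that rule under the assumption that the feature is roughly bell-shaped; the threshold of 3.0 and the sample readings are illustrative, not taken from the assignment.

```python
from statistics import mean, stdev

def zscore_outliers(values, threshold=3.0):
    """Return the values whose distance from the mean exceeds `threshold` standard deviations."""
    mu = mean(values)
    sigma = stdev(values)
    if sigma == 0:
        return []  # all values identical, nothing can be an outlier
    return [v for v in values if abs(v - mu) / sigma > threshold]

# Hypothetical readings: nineteen values near 50 and one extreme value.
readings = [48, 52, 50, 49, 51, 47, 53, 50, 49, 51,
            50, 52, 48, 50, 51, 49, 50, 52, 48, 500]

outliers = zscore_outliers(readings)
```

This rule needs a reasonable sample size: with the sample standard deviation, no value among n observations can be more than (n - 1) / √n standard deviations from the mean, so with ten or fewer points a threshold of 3 can never be exceeded.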

4) Removal of Irrelevant Data

Usually, while creating a dataset, we scrape data from various sources and combine
datasets; this is known as data collection. To solve a particular business problem we
might need various features from the stored data, but using all of the features might not
help. Analysing data that carries no weight for the business problem adds no value; hence,
removing irrelevant data is a common practice in data cleansing. Judging the relevance of
the data and extracting clean data is therefore an important step in the data cleaning
process.
Example of irrelevant data - Suppose we are clustering our customers based on their
purchase history; features such as Email Address, Mobile Number, and Nationality would be
useless.
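
Dropping the irrelevant features from this example before clustering can be sketched as below. The field names and the customer record are hypothetical; the point is only that the feature set passed on to the model is chosen explicitly.

```python
# Features judged irrelevant for clustering on purchase history (names are assumed).
IRRELEVANT = {"email", "mobile", "nationality"}

def select_features(rows, drop):
    """Return copies of rows without the fields listed in `drop`."""
    return [{k: v for k, v in row.items() if k not in drop} for row in rows]

# Hypothetical customer record.
customers = [
    {"email": "a@example.com", "mobile": "000-0000", "nationality": "X",
     "orders": 12, "total_spent": 840.0},
]

features = select_features(customers, IRRELEVANT)
```
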
5) Regular Maintenance
Data drift and shifts can occur over time, affecting model performance. Regularly monitor
and update your data to ensure the model remains accurate and reliable in real-world
applications.
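
A very simple drift monitor compares recent values of a feature with the values seen at training time. The sketch below is a crude heuristic rather than a standard test: it measures how far the recent mean has moved, in units of the training standard deviation, and the threshold of 2.0 and all data values are assumptions.

```python
from statistics import mean, stdev

def mean_shift(baseline, recent):
    """Shift of the recent mean from the baseline mean, in baseline standard deviations."""
    return abs(mean(recent) - mean(baseline)) / stdev(baseline)

def has_drifted(baseline, recent, threshold=2.0):
    """Flag drift when the mean shift exceeds the (assumed) threshold."""
    return mean_shift(baseline, recent) > threshold

# Hypothetical temperature feature: training data and two later batches.
training_temps = [20, 22, 21, 23, 22, 20, 21, 22]
stable_batch = [21, 22, 20, 23]
shifted_batch = [29, 31, 30, 32]

stable_flag = has_drifted(training_temps, stable_batch)
shifted_flag = has_drifted(training_temps, shifted_batch)
```

A flagged batch is a prompt to inspect the data and, if the change is real, to retrain or update the model.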
(www.linkedin.com, n.d.)

QUESTION 5:
What are the key stages in the machine learning project life cycle, and what are some best
practices for each stage to ensure the successful development and deployment of machine
learning models?

Key stages:
1. Problem Definition and Planning:
• Clearly define the problem you want to solve and the goals of your machine learning
project.
• Identify the relevant data sources, available resources, and potential challenges.
• Plan the project timeline, allocate resources, and set expectations.
2. Data Collection:
• Gather relevant data from various sources, considering data quality, quantity, and
relevance to the problem.
• Ensure that data collection methods align with project goals and ethical
considerations.
3. Data Preprocessing and Cleaning:
• Clean and preprocess the raw data to handle missing values, outliers, and
inconsistencies.
• Perform transformations such as normalization, encoding categorical variables, and
feature engineering.
4. Exploratory Data Analysis (EDA):
• Analyse and visualize the data to gain insights into patterns, relationships, and
potential biases.
• EDA helps guide feature selection and model design.
5. Feature Engineering:
• Create new features from existing data or domain knowledge to enhance model
performance.
• Select relevant features that contribute to the predictive power of the model.
6. Model Selection and Training:
• Choose appropriate algorithms and model architectures based on the problem type
(classification, regression, etc.).

• Split the data into training, validation, and test sets for model training and
evaluation.
• Train multiple models and tune hyperparameters to find the best-performing one.
7. Model Evaluation:
• Assess model performance using appropriate metrics (accuracy, precision, recall, F1-
score, etc.).
• Use the validation set to fine-tune parameters, and report final performance on the held-out test set.
8. Model Deployment:
• Deploy the trained model to a production environment for real-world use.
• Integrate the model into existing systems, APIs, or user interfaces.
9. Monitoring and Maintenance:
• Continuously monitor the deployed model's performance in the production
environment.
• Address issues like data drift, concept drift, and changing user behaviour.
• Update the model as needed to maintain accuracy and reliability.
10. Documentation:
• Document the entire project, including data sources, preprocessing steps, model
architecture, and deployment details.
• Clear documentation ensures reproducibility and ease of understanding for others.
11. Communication and Reporting:
• Share the results, insights, and outcomes of the project with stakeholders, including
non-technical audiences.
• Communicate the limitations and assumptions of the model.
12. Ethical Considerations:
• Address potential biases, fairness, and ethical concerns related to the data and
model predictions.
• Implement measures to mitigate harmful impacts and ensure responsible use.
(www.datacamp.com, 23 August 2023)
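
Two of the stages above, splitting the data (stage 6) and computing evaluation metrics (stage 7), can be sketched in plain Python. The split fractions, the random seed, and the example labels are assumptions for illustration; in practice a library such as scikit-learn would normally be used for both steps.

```python
import random

def train_val_test_split(rows, val_frac=0.15, test_frac=0.15, seed=42):
    """Shuffle a copy of rows and cut it into train, validation, and test parts."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_frac)
    n_val = int(len(shuffled) * val_frac)
    test = shuffled[:n_test]
    val = shuffled[n_test:n_test + n_val]
    train = shuffled[n_test + n_val:]
    return train, val, test

def precision_recall_f1(y_true, y_pred, positive=1):
    """Precision, recall, and F1-score for the class labelled `positive`."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Hypothetical usage: 100 row indices, and six true vs. predicted labels.
train, val, test = train_val_test_split(list(range(100)))
precision, recall, f1 = precision_recall_f1([1, 0, 1, 1, 0, 1], [1, 0, 0, 1, 1, 1])
```

Keeping the test set untouched until the end gives an estimate of performance on data the model has never influenced.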

REFERENCES

1. Analytics Vidhya. (2021). Dealing with Missing Values | Missing Values in a Data Science Project. [online] Available at: https://www.analyticsvidhya.com/blog/2021/10/guide-to-deal-with-missing-values/.

2. gocardless.com. (n.d.). How to Calculate a Regression Line. [online] Available at: https://gocardless.com/guides/posts/how-to-calculate-a-regression-line/.

3. Stedman, C. (2022). What is data collection? - Definition from WhatIs.com. [online] SearchCIO. Available at: https://www.techtarget.com/searchcio/definition/data-collection.

4. WallStreetMojo. (2019). Regression Formula | Step by Step Calculation (with Examples). [online] Available at: https://www.wallstreetmojo.com/regression-formula/.

5. www.datacamp.com. (n.d.). The Machine Learning Life Cycle Explained. [online] Available at: https://www.datacamp.com/blog/machine-learning-lifecycle-explained.

6. www.linkedin.com. (n.d.). Machine Learning Pipeline & Challenges. [online] Available at: https://www.linkedin.com/pulse/machine-learning-pipeline-challenges-tahir-riaz [Accessed 10 Sep. 2023].
