Meweek 3

The document provides an introduction to data science, focusing on linear regression analysis and exploratory data analysis. It explains the concepts of functional and statistical relationships between variables, the use of scatter plots, and the significance of regression models in estimating relationships. Additionally, it discusses the roles of data analysts, data scientists, and data engineers in the field of data science.

Introduction to Data Science

Dr. Irfan Yousuf

Department of Computer Science (New Campus)
UET, Lahore
(Week 3; January 29 – February 02, 2024)
Outline
• Linear Regression Analysis
• Exploratory Data Analysis
Relations Between Variables
• Functional Relation
• Statistical Relation
Functional Relation
• A functional relation between two variables is a perfect
relation where the value of the dependent variable is
uniquely determined by the value of the independent
variable.
• It is expressed by a mathematical formula.
Statistical Relation
• A statistical relation between two variables is a relation
where the value of the dependent variable is NOT
uniquely determined when the level of the independent
variable is specified.
• If the values of a variable Y increase or decrease when the
values of a variable X change, there is a statistical
relationship between Variable Y and Variable X.
• It is not an exact relation.
• Examples:
• The relation between age and income
• The relation between income and expenditures
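The contrast between the two kinds of relation can be sketched in a few lines of Python. This is only an illustration: the formula y = 2x and the size of the random scatter are made-up assumptions, not values from the slides.

```python
import random


def functional_relation(x):
    """Functional relation: y is fully determined by x (assumed formula y = 2x)."""
    return 2 * x


def statistical_relation(x, rng):
    """Statistical relation: y follows x on average, plus random scatter."""
    return 2 * x + rng.gauss(0, 5)  # the noise level 5 is an arbitrary assumption


rng = random.Random(42)  # fixed seed so the run is repeatable
for x in (10, 10, 10):
    # Same x each time: the functional y never changes, the statistical y does
    print(x, functional_relation(x), round(statistical_relation(x, rng), 2))
```

Repeating the same x always gives the same functional value, while the statistical value varies from draw to draw.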
Scatter Plot
• A scatter plot (aka scatter chart, scatter graph) uses dots to
represent values for two different numeric variables.
• The position of each dot on the horizontal and vertical axis
indicates values for an individual data point.
• Scatter plots are used to observe relationships between
variables.
Regression Analysis
• Regression analysis provides a method of estimating an
average relation (often linear) between two or more variables.
• Regressor: The variable that forms the basis of estimation or
prediction (aka predictor variable or independent variable).
• Regressand: The variable whose value depends on the
independent variable (aka response variable, predictand, or
dependent variable).
Regression Models
• Regression models describe the relationship between
variables by fitting a line to the observed data.
• Linear regression models use a straight line, while logistic
and nonlinear regression models use a curved line.
• Regression allows you to estimate how a dependent variable
changes as the independent variable(s) change.
Simple Linear Regression
- Best-fit Line
- Least Squares Method or Least Squares Regression
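For reference, the least squares method chooses the line Y = a + bX that minimizes the sum of squared vertical distances between the data points and the line. The standard closed-form solution is given here, where n (the number of observations) and the index i are notation introduced for this note:

```latex
b = \frac{n\sum_{i} x_i y_i - \sum_{i} x_i \sum_{i} y_i}
         {n\sum_{i} x_i^{2} - \left(\sum_{i} x_i\right)^{2}},
\qquad
a = \bar{y} - b\,\bar{x}
```

Here b is the slope and a is the intercept of the best-fit line.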
Simple Linear Regression: Example
x y
14.2 215
16.4 325
11.9 185
15.2 332
18.5 406
22.1 522
19.4 412
25.1 614
23.4 544
18.1 421
22.6 445
17.2 408
a = -159.4
b = 30.08
Residuals
- A residual is the difference between an observed value and
the value predicted by the regression line.
- It measures the vertical distance between the regression
line (predicted value) and the actual observed value. In other
words, residuals help us measure error, or how well our
regression line “fits” our data.
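A minimal sketch of the residual calculation, using the x, y example data and the coefficients a = -159.4 and b = 30.08 quoted for it. The variable names are chosen for this illustration only.

```python
x = [14.2, 16.4, 11.9, 15.2, 18.5, 22.1, 19.4, 25.1, 23.4, 18.1, 22.6, 17.2]
y = [215, 325, 185, 332, 406, 522, 412, 614, 544, 421, 445, 408]
a, b = -159.4, 30.08  # intercept and slope quoted for this example

# Predicted value on the regression line for every x
predicted = [a + b * xi for xi in x]

# Residual = observed value minus predicted value
residuals = [yi - pi for yi, pi in zip(y, predicted)]

for xi, yi, ri in zip(x, y, residuals):
    print(f"x = {xi:5.1f}  observed = {yi:4d}  residual = {ri:8.2f}")
```

A positive residual means the point lies above the line and a negative one means it lies below. Residuals with an unusually large magnitude are candidates for outliers.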
Outliers
Simple Linear Regression: Example
Income Savings
25 2
32 3
37 5
39 2
46 10
49 7
51 15
55 14
58 8
59 15
66 24
68 10
72 12
75 15
77 13
Income = 85, Saving = ?
Income = 42, Saving = ?
Simple Linear Regression: Example
Income (X) Savings (Y) X*Y X^2
25 2 50 625
32 3 96 1024
37 5 185 1369
39 2 78 1521
46 10 460 2116
49 7 343 2401
51 15 765 2601
55 14 770 3025
58 8 464 3364
59 15 885 3481
66 24 1584 4356
68 10 680 4624
72 12 864 5184
75 15 1125 5625
77 13 1001 5929
Total 809 155 9350 47245
a = -4.4502
b = 0.2741
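The column totals and the resulting coefficients can be reproduced with a short script. This is a sketch using the standard sum-based least squares formulas, and the variable names are illustrative.

```python
income = [25, 32, 37, 39, 46, 49, 51, 55, 58, 59, 66, 68, 72, 75, 77]
savings = [2, 3, 5, 2, 10, 7, 15, 14, 8, 15, 24, 10, 12, 15, 13]

n = len(income)
sum_x = sum(income)                                        # total of X
sum_y = sum(savings)                                       # total of Y
sum_xy = sum(xi * yi for xi, yi in zip(income, savings))   # total of X*Y
sum_x2 = sum(xi ** 2 for xi in income)                     # total of X^2

# Slope and intercept from the column totals
b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
a = sum_y / n - b * sum_x / n

print(sum_x, sum_y, sum_xy, sum_x2)  # 809 155 9350 47245
print(f"a = {a:.4f}, b = {b:.4f}")
```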
Simple Linear Regression: Example
a = -4.4502
b = 0.2741
Y = -4.4502 + 0.2741(X)
Income = 85, Saving = 18.84
Income = 42, Saving = 7.062
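Once a and b are known, prediction is a single evaluation of the line. The sketch below wraps it in a helper function. The name `predict_savings` is made up for this illustration.

```python
A, B = -4.4502, 0.2741  # intercept and slope of the fitted line


def predict_savings(income):
    """Predicted savings for a given income using Y = A + B * X."""
    return A + B * income


for income in (85, 42):
    print(f"Income = {income}, predicted saving = {predict_savings(income):.3f}")
```

Note that an income of 85 lies outside the observed range (25 to 77), so that prediction is an extrapolation and should be treated with more caution than the one for 42.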
Simple Linear Regression: Example
X Y Predicted Values
25 2 2.4023
32 3 4.321
37 5 5.6915
39 2 6.2397
46 10 8.1584
49 7 8.9807
51 15 9.5289
55 14 10.6253
58 8 11.4476
59 15 11.7217
66 24 13.6404
68 10 14.1886
72 12 15.285
75 15 16.1073
77 13 16.6555
R-squared (Goodness of Fit)
- After fitting a linear regression model, you need to
determine how well the model fits the data.
- R-squared is a goodness-of-fit measure for linear
regression models.
- This statistic indicates the percentage of the variance in
the dependent variable that the independent variable
can explain (or that is predictable from the independent
variable).
- R-squared measures the strength of the relationship
between the two variables on a convenient 0 – 100%
scale.
- It is also called Coefficient of Determination.
R-squared (Goodness of Fit)
R2 = 1 - SSRES / SSTOT
SSRES = sum of (yi - y-hat)^2 (residual sum of squares)
SSTOT = sum of (yi - y-bar)^2 (total sum of squares)
yi = Actual Value
y-hat = Predicted Value
y-bar = Actual Mean

x y Predicted (Y)
14.2 215 267.78
16.4 325 333.97
11.9 185 198.58
15.2 332 297.87
18.5 406 397.16
22.1 522 505.47
19.4 412 424.24
25.1 614 595.74
23.4 544 544.59
18.1 421 385.12
22.6 445 520.52
17.2 408 358.04
Y = 30.08X - 159.4
R2 = 0.9168
R-squared (Goodness of Fit)
X Y Yhat (Y - Yhat)^2 (Y - Yavg)^2
25 2 2.4023 0.1618453 69.3889
32 3 4.321 1.745041 53.7289
37 5 5.6915 0.4781723 28.4089
39 2 6.2397 17.975056 69.3889
46 10 8.1584 3.3914906 0.1089
49 7 8.9807 3.9231725 11.0889
51 15 9.5289 29.932935 21.8089
55 14 10.6253 11.3886 13.4689
58 8 11.4476 11.885946 5.4289
59 15 11.7217 10.747251 21.8089
66 24 13.6404 107.32131 186.869
68 10 14.1886 17.54437 0.1089
72 12 15.285 10.791225 2.7889
75 15 16.1073 1.2261133 21.8089
77 13 16.6555 13.36268 7.1289
Total 241.87521 513.334
SSRES = 241.87
SSTOT = 513.33
R2 = 0.5288
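The R-squared calculation for the income and savings data can be sketched as follows, using the fitted coefficients a = -4.4502 and b = 0.2741. Small differences in the last digits against a hand-computed table are expected, since the script uses the unrounded mean of Y.

```python
income = [25, 32, 37, 39, 46, 49, 51, 55, 58, 59, 66, 68, 72, 75, 77]
savings = [2, 3, 5, 2, 10, 7, 15, 14, 8, 15, 24, 10, 12, 15, 13]
a, b = -4.4502, 0.2741  # intercept and slope of the fitted line

mean_y = sum(savings) / len(savings)
predicted = [a + b * x for x in income]

# Residual sum of squares: actual versus predicted
ss_res = sum((y - p) ** 2 for y, p in zip(savings, predicted))
# Total sum of squares: actual versus mean
ss_tot = sum((y - mean_y) ** 2 for y in savings)

r_squared = 1 - ss_res / ss_tot
print(f"SSRES = {ss_res:.2f}, SSTOT = {ss_tot:.2f}, R2 = {r_squared:.4f}")
```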
Correlation
- Correlation describes the direction and strength of pairwise
relationships between two or more numeric variables.
- -1: Perfect negative correlation. The variables move in
opposite directions (i.e., when one variable increases, the
other variable decreases).
- 0: No correlation. The variables do not have a linear
relationship with each other.
- 1: Perfect positive correlation. The variables move in the
same direction (i.e., when one variable increases, the other
variable also increases).
Correlation: Example
r = 0.9575
How to Interpret Correlation?
- Strength: The greater the absolute value of the correlation
coefficient, the stronger the relationship.
- Direction: The sign of the correlation coefficient
represents the direction of the relationship.
How to Interpret Correlation?
- The correlation, denoted by r, measures the amount of
linear association between two variables.
- r is always between -1 and 1 inclusive.
- In simple linear regression, the R-squared value, denoted by
R2, is the square of the correlation. It measures the
proportion of variation in the dependent variable that can be
attributed to the independent variable.
- The R-squared value is always between 0 and 1 inclusive.
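The link between r and R-squared can be checked numerically. The sketch below computes the Pearson correlation coefficient by hand for the x, y example data. The function name `pearson_r` is chosen for this illustration.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


x = [14.2, 16.4, 11.9, 15.2, 18.5, 22.1, 19.4, 25.1, 23.4, 18.1, 22.6, 17.2]
y = [215, 325, 185, 332, 406, 522, 412, 614, 544, 421, 445, 408]

r = pearson_r(x, y)
print(f"r = {r:.4f}, r^2 = {r * r:.4f}")
```

Squaring r for this data set gives the same value as the R-squared of its simple linear regression.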
Correlation vs. Regression
- Correlation is a statistical measure that quantifies the
direction and strength of the relationship between two
numeric variables.
- Regression is a statistical technique that predicts the value
of the dependent variable Y based on the known value of
the independent variable X through an equation.
Drawbacks of Linear Regression
- Linear regression only looks at linear relationships between
dependent and independent variables.
- Linear regression looks at a relationship between the mean
of the dependent variable and the independent variables.
- Linear regression is sensitive to outliers
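The sensitivity to outliers can be illustrated by refitting the income and savings data after adding one extreme point. The extra point (80, 100) is hypothetical and is not part of the original data. The helper `fit_line` is a plain least squares fit written for this sketch.

```python
def fit_line(xs, ys):
    """Least squares intercept and slope for y = a + b * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b = sxy / sxx
    return mean_y - b * mean_x, b


income = [25, 32, 37, 39, 46, 49, 51, 55, 58, 59, 66, 68, 72, 75, 77]
savings = [2, 3, 5, 2, 10, 7, 15, 14, 8, 15, 24, 10, 12, 15, 13]

a, b = fit_line(income, savings)

# Hypothetical outlier: one person with income 80 and savings 100
a_out, b_out = fit_line(income + [80], savings + [100])

print(f"without outlier: a = {a:.4f}, b = {b:.4f}")
print(f"with outlier:    a = {a_out:.4f}, b = {b_out:.4f}")
```

A single added point is enough to change the slope noticeably, which is why outliers should be inspected before a fitted line is trusted.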
Exploratory Data Analysis
Data
• Data is a collection of facts, such as numbers, words,
measurements, observations or just descriptions of things.
• Qualitative data is descriptive information (it describes
something).
• Quantitative data is numerical information (numbers).
• Discrete data can only take certain values (like whole
numbers)
• Continuous data can take any value (within a range)
• Discrete data is counted, Continuous data is measured
Data Analysis
• Data analysis is the process of collecting, cleaning,
analyzing, interpreting, and visualizing data to discover
valuable insights or useful information.
Data Science
• Data science is the domain of study that deals with large
volumes of data using modern tools and techniques to find
unseen patterns, derive meaningful information, and make
business decisions.
• Data science uses complex (e.g., machine learning)
algorithms to build predictive models.
Data Analyst
• A data analyst makes sense out of existing data.
• Data analysts typically work with structured data to solve
business problems using tools like SQL, R or Python
programming languages, data visualization software, and
statistical analysis.
• Collaborating with organizational leaders to identify
informational needs
• Acquiring data from primary and secondary sources
• Cleaning and reorganizing data for analysis
• Analyzing data sets to spot trends and patterns that can be
translated into actionable insights
• Presenting findings in an easy-to-understand way to inform
data-driven decisions
Data Scientist
• A data scientist works on new ways of capturing and
analyzing data to be used by the analysts. A data scientist is
someone who creates programming code and combines it
with statistical knowledge to create insights from data.
• Gathering, cleaning, and processing raw data
• Designing predictive models and machine learning algorithms to
mine big data sets
• Developing tools and processes to monitor and analyze data
accuracy
• Building data visualization tools, dashboards, and reports
• Writing programs to automate data collection and processing
Data Engineer
• A data engineer specializes in preparing data for analytical
usage. Data Engineering involves the development of
platforms and architectures for data processing.
• Data Engineer is responsible for designing the format for
data scientists and analysts to work on.
• Build, test, and maintain dataset pipeline architectures
• Create new data validation methods and data analysis tools
• Combine raw information from different sources
• Explore ways to enhance data quality and reliability
Types of Data Analysis
• Descriptive analysis looks at past data and tells what
happened. This is often used when tracking Key Performance
Indicators (KPIs), revenue, sales leads, and more.
• Exploratory analysis is an approach of analyzing data to
summarize their main characteristics, often using statistical
graphics and other data visualization methods.
• Predictive analysis predicts what is likely to happen in the
future. In this type of research, trends are derived from past
data which are then used to form predictions about the future.
• Prescriptive analysis is the most advanced form of analysis,
as it combines all your data and analytics, then outputs a
model prescription: What action to take.
Exploratory Data Analysis
• Exploratory Data Analysis (EDA) is an approach for data
analysis that employs a variety of techniques (mostly
graphical) to
• maximize insight into a data set
• uncover underlying structure
• extract important variables
• detect outliers and anomalies
• test underlying assumptions
Graphics and Exploratory Data Analysis
• Quantitative (summary)
• Graphical (plots)
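A quantitative summary can be produced with Python's standard `statistics` module. The sketch below summarizes the savings values from the regression example. The choice of statistics is illustrative, not a fixed EDA recipe.

```python
import statistics

savings = [2, 3, 5, 2, 10, 7, 15, 14, 8, 15, 24, 10, 12, 15, 13]

mean = statistics.mean(savings)
median = statistics.median(savings)
stdev = statistics.stdev(savings)  # sample standard deviation

print(f"count  : {len(savings)}")
print(f"mean   : {mean:.2f}")
print(f"median : {median}")
print(f"stdev  : {stdev:.2f}")
print(f"min/max: {min(savings)} / {max(savings)}")
```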
Univariate Data
• Univariate data consist of observations on only one
variable.
• Univariate data analysis is straightforward because we are
dealing with only one variable.
• It has nothing to do with relationships between variables.
• The main purpose is to describe the data and find patterns
that are present.
Univariate Data
• Bar Chart
• Pie Chart
• Histogram / Frequency Distribution
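A frequency distribution is the table behind a histogram. The sketch below bins the savings values from the regression example and prints a text histogram. The bin width of 5 is an arbitrary choice for this illustration.

```python
from collections import Counter

savings = [2, 3, 5, 2, 10, 7, 15, 14, 8, 15, 24, 10, 12, 15, 13]
bin_width = 5  # assumed bin width, chosen for illustration

# Map each value to the lower edge of its bin, then count values per bin
counts = Counter((value // bin_width) * bin_width for value in savings)

for lower in sorted(counts):
    upper = lower + bin_width - 1
    print(f"{lower:2d}-{upper:2d} | {'#' * counts[lower]}")
```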
Bivariate Data
• Bivariate data involve two different variables; the analysis
investigates the relationship between those two variables and
its possible causes.
Bivariate Data
• Scatter Plot
• Line Plot
• Stacked Bar chart
Multivariate data
• Data that involve 3 or more variables are termed
multivariate data. They are similar to bivariate data but
contain more than one dependent variable.
• The “curse of dimensionality” is a troublesome issue in
information visualization.
• The effectiveness of retinal visual elements (e.g., color,
shape, size) deteriorates as the number of variables increases.
Multivariate data
• Enhanced Basic Plots
EDA using Python
• Matplotlib
• Seaborn
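As a minimal sketch of EDA plotting in Python, the following draws a scatter plot of the income and savings data with the fitted regression line, using Matplotlib only. It assumes Matplotlib is installed. The non-interactive "Agg" backend is an assumption made so the script also runs without a display.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend; remove to use a GUI window
import matplotlib.pyplot as plt

income = [25, 32, 37, 39, 46, 49, 51, 55, 58, 59, 66, 68, 72, 75, 77]
savings = [2, 3, 5, 2, 10, 7, 15, 14, 8, 15, 24, 10, 12, 15, 13]
a, b = -4.4502, 0.2741  # intercept and slope of the fitted line

fig, ax = plt.subplots()
ax.scatter(income, savings, label="observed")  # one dot per data point
ax.plot(income, [a + b * x for x in income], color="red", label="fitted line")
ax.set_xlabel("Income")
ax.set_ylabel("Savings")
ax.set_title("Income vs. Savings")
ax.legend()
```

To view the result, save it with `fig.savefig(...)` under a file name of your choice, or call `plt.show()` when an interactive backend is in use.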
Summary
• Linear Regression
• Exploratory Data Analysis
