
Fundamentals of Data Science

LECTURE 2: INTRODUCTION

Course Book
Data Science: Concepts and Practice
Authors: Vijay Kotu & Bala Deshpande
Publisher: Morgan Kaufmann
What is Data Science?

❑ Data science is a collection of techniques used to extract value from data.
❑ It has become an essential tool for any organization that collects, stores, and processes data as part of its operations.
❑ Data science techniques rely on finding useful patterns, connections, and relationships within data.
❑ Data science is also commonly referred to as knowledge discovery, machine learning, predictive analytics, and data mining.

What is Data Science?

❑ Data science enables businesses to process huge amounts of data to detect patterns.
❑ This in turn allows companies to increase efficiency, manage costs, identify new market opportunities, and boost their market advantage.
❑ Data science is the practice of mining large sets of raw data, both structured and unstructured, to identify patterns and extract actionable insights from them.
❑ It is an interdisciplinary field whose foundations include statistics, computer science, predictive analytics, machine learning algorithm development, and new technologies for gaining insights from big data.

AI, MACHINE LEARNING, AND DATA SCIENCE

Artificial intelligence, machine learning, and data science are all related to each other.

❑ Artificial intelligence is about giving machines the capability of mimicking human behavior, particularly cognitive functions. Examples include facial recognition, automated driving, and sorting mail based on postal code.
❑ A wide range of techniques fall under artificial intelligence: linguistics, natural language processing, decision science, bias, vision, robotics, planning, etc.
AI, MACHINE LEARNING, AND DATA SCIENCE

❑ Machine learning, which can be considered either a sub-field or one of the tools of artificial intelligence, provides machines with the capability of learning from experience.
❑ Experience for machines comes in the form of data.
❑ Data that is used to teach machines is called training data.
❑ Machine learning turns the traditional programming model upside down.
❑ A program, a set of instructions to a computer, transforms input signals into output signals using predetermined rules and relationships. Machine learning algorithms, by contrast, derive the rules and relationships from example inputs and outputs.
AI, MACHINE LEARNING, AND DATA SCIENCE

❑ Data science is the business application of machine learning, artificial intelligence, and other quantitative fields like statistics, visualization, and mathematics.
❑ It is an interdisciplinary field that extracts value from data.
Review Questions
• What is implied by Data Science?
• Compare between AI, ML, and DS.
WHAT IS DATA SCIENCE?

❑ Data science starts with data, which can range from a simple array of a few numeric observations to a complex matrix of millions of observations with thousands of variables.
❑ Data science utilizes certain specialized computational methods in order to discover meaningful and useful structures within a dataset.

KEY FEATURES AND MOTIVATIONS OF DATA SCIENCE

1- Extracting Meaningful Patterns
2- Building Representative Models
3- Combination of Statistics, Machine Learning, and Computing
4- Learning Algorithms
5- Associated Fields
What Can Data Science Be Used For?

Data science is used in almost all fields worldwide, including:
• healthcare
• marketing
• banking, finance, and policy work
• social intelligence
• political planning and economics
In short, almost all real-life fields with huge amounts of accumulated data.
Data Science vs Data Analytics

❑ Although the work of data scientists and data analysts is sometimes conflated, these fields are not the same.
❑ The term "data science analyst" really just means one or the other.
❑ In summary, a data scientist is more likely to look ahead, predicting or forecasting as they look at data.
❑ A data analyst is more likely to focus on answering specific questions, digging into existing data sets that have already been processed for insights.
Data Science vs Data Analytics

❑ Data scientist: comes in earlier in the game than a data analyst, exploring a massive data set, investigating its potential, identifying trends and insights, and visualizing them for others.
❑ Data analyst: sees data at a later stage; reports on what it tells them, makes prescriptions for better performance based on the analysis, and optimizes any data-related tools.
❑ Data scientist: more likely to tackle larger masses of both structured and unstructured data.
❑ Data analyst: likely to analyze a specific dataset of structured or numerical data using a given question or questions.

Data analytics has more to do with placing historical data in context and less to do with predictive modeling and machine learning. Data analysis isn't an open-minded search for the right question; it relies upon having the right questions in place from the start. Furthermore, unlike data scientists, data analysts typically do not create statistical models.

DATA SCIENCE CLASSIFICATION

Data science problems can be broadly categorized into supervised or unsupervised learning models.
• Supervised or directed data science tries to infer a function or relationship based on labeled training data and uses this function to map new unlabeled data.
• Supervised techniques predict the value of the output variables based on a set of input variables. To do this, a model is developed from a training dataset where the values of input and output are previously known.
• Supervised data science needs a sufficient number of labeled records to learn the model from the data.
DATA SCIENCE CLASSIFICATION

• Unsupervised or undirected data science uncovers hidden patterns in unlabeled data.
• In unsupervised data science, there are no output variables to predict.
• The objective of this class of data science techniques is to find patterns in data based on the relationships between the data points themselves.
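The supervised idea (learn a mapping from labeled training data, then apply it to new unlabeled points) can be sketched as a tiny one-nearest-neighbor classifier in Python; the training pairs and segment labels below are invented for illustration:

```python
# A minimal sketch of supervised learning: 1-nearest-neighbor classification.
# "Experience" is the labeled training data; prediction maps a new, unlabeled
# value to the label of its closest training point.

def predict_1nn(training, x):
    """Return the label of the training point closest to x.

    training: list of (value, label) pairs -- the labeled training data.
    x: a new, unlabeled value to classify.
    """
    _, nearest_label = min(training, key=lambda pair: abs(pair[0] - x))
    return nearest_label

# Hypothetical labeled data: incomes (in thousands) tagged with a segment.
train = [(20, "budget"), (25, "budget"), (80, "premium"), (95, "premium")]

print(predict_1nn(train, 30))   # closest training point is (25, "budget")
print(predict_1nn(train, 90))   # closest training point is (95, "premium")
```

An unsupervised method, by contrast, would receive only the incomes, with no labels, and group them by proximity alone.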
What are the Data Science Tasks?
Tasks, Algorithms, and Examples

Classification
Description: Predict whether a data point belongs to one of the predefined classes. The prediction is based on learning from a known data set.
Algorithms: Decision trees, neural networks, Bayesian models, induction rules, k-nearest neighbors.
Examples: Assigning voters into known buckets by political parties, e.g. soccer moms; bucketing new customers into one of the known customer groups.

Regression
Description: Predict the numeric target label of a data point. The prediction is based on learning from a known data set.
Algorithms: Linear regression, logistic regression.
Examples: Predicting the unemployment rate for next year; estimating insurance premiums.

Anomaly detection
Description: Predict whether a data point is an outlier compared to the other data points in the data set.
Algorithms: Distance-based, density-based, LOF.
Examples: Fraud detection in credit card transactions; network intrusion detection.

Time series
Description: Predict the value of the target variable for a future time frame based on historical values.
Algorithms: Exponential smoothing, ARIMA, regression.
Examples: Sales forecasting, production forecasting, virtually any growth phenomenon that needs to be extrapolated.

Clustering
Description: Identify natural clusters within the data set based on inherent properties within the data set.
Algorithms: k-means, density-based clustering (DBSCAN).
Examples: Finding customer segments in a company based on transaction, web, and customer call data.

Association analysis
Description: Identify relationships within an itemset based on transaction data.
Algorithms: FP-Growth, Apriori.
Examples: Finding cross-selling opportunities for a retailer based on transaction purchase history.
Types of Data
• Qualitative and quantitative
• Simple quantitative analysis
• Simple qualitative analysis

Quantitative and Qualitative
• Quantitative data – expressed as numbers
• Qualitative data – difficult to measure sensibly as numbers, e.g. counting the number of words used to express dissatisfaction
• Quantitative analysis – numerical methods to ascertain size, magnitude, amount (you have a solid number, a fact)
• Qualitative analysis – expresses the nature of elements and is represented as themes, patterns, stories

• Be careful how you manipulate data and numbers!
Quantitative and Qualitative: Examples

» Red, Blue, Yellow, Black: Qualitative Nominal
» Very Unhappy, Unhappy: Qualitative Ordinal
» 15, 123, 12, -10: Quantitative Discrete
» 14.2, 11.09, 1000: Quantitative Continuous
» Baby, Child, Teenager, Adult, Old: Qualitative Ordinal
» Alexandria, Cairo, Luxor: Qualitative Nominal
» East, West, North, South: Qualitative Nominal
» Student Age: Quantitative Discrete
» Product Weight: Quantitative Continuous
» Price: Quantitative Discrete
» Gender: Qualitative Nominal
» Phone Numbers: Qualitative Nominal
» Course Grades (A, B, C, …): Qualitative Ordinal
» User Feedback Paragraph: Qualitative Nominal
» Neutral, Happy, Very Happy: Qualitative Ordinal
Quantitative vs Qualitative

Quantitative:
» Objective variables for data gathering.
» Values that can be counted, such as age, weight, volume, and scale.
» Researchers aim to increase the sample size: the more data points, the more accurate.
» Definite, fixed, and measurable reality.

Qualitative:
» Subjective parameters for data gathering.
» Things that can be described using the five senses, such as color, smell, taste, touch or feeling, typology, and shapes.
» Researchers aim to get a variety of values to examine and understand; it is costly to have a large sample size.
» Dynamic and negotiable reality.
Simple Quantitative Analysis
• Averages
  – Mean: add up the values and divide by the number of data points
  – Median: the middle value of the data when ranked
  – Mode: the value that appears most often in the data
• Histogram
• Skewness
• Outliers
Skewness
Outliers
Qualitative Ordinal

• It has a rank or order.
• It establishes a relative rank.
• It has no standardized interval scale.
• The median and mode can be analyzed.
• It can't have a mean.
Lie Factor

The lie factor is a value describing the relation between the size of an effect shown in a graphic and the size of the effect shown in the data:

Lie Factor = (size of effect shown in graphic) / (size of effect shown in data)

where the size of an effect is (second value - first value) / first value.

An acceptable lie factor is between 0.95 and 1.05.
Lie Factor: Example

» Effect in graphic = (1550 - 600) / 600 = 1.58
» Effect in data = (11750 - 10800) / 10800 = 0.088
» Lie Factor = 1.58 / 0.088 ≈ 18
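As a quick check, the computation above can be reproduced in Python (a small sketch; the function name is ours):

```python
def lie_factor(graphic_start, graphic_end, data_start, data_end):
    """Ratio of the effect shown in the graphic to the effect in the data."""
    graphic_effect = (graphic_end - graphic_start) / graphic_start
    data_effect = (data_end - data_start) / data_start
    return graphic_effect / data_effect

# The example above: bars grow from 600 to 1550 while the underlying
# data only moves from 10800 to 11750.
print(round(lie_factor(600, 1550, 10800, 11750)))  # 18
```

A value of 18 means the graphic exaggerates the change eighteen-fold, far outside the acceptable 0.95 to 1.05 band.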
Data Cleansing

Reasons for Data Cleansing

The data to be analyzed may be:
◉ Incomplete: data is missing
◉ Noisy: data may contain errors or outlier values
◉ Inconsistent: data may contain discrepancies in the values
How Can Data be Cleaned?

◉ Filling in missing values
◉ Smoothing noisy data
◉ Identifying and removing outliers
◉ Resolving inconsistency
Typical Example

(Sample table not reproduced here.)

Typical Example (Incomplete Data)

(Sample table not reproduced here.)

Typical Example (Data Value Errors)

Reasons:
• The Zip code consists of five digits and cannot contain any letters
• Income must be a positive number
• Age must be a positive number

Typical Example (Outlier Values)

Reasons:
• Outliers are data values that deviate from the expected values of the rest of the data set
• The values 10000000 and -40000 look very divergent from the rest of the values

Typical Example (Ambiguity)

Reasons:
• "S" in Marital Status could refer to "Single" or "Separated"
• So, there is a kind of ambiguity in the data
Reasons for Incomplete Data

Relevant data may not be recorded because of:
◉ A misunderstanding by the data entry personnel
◉ Equipment failure
Relevant data may not be available because it is unknown or because providing it is optional.
Dealing with Incomplete Data

There are several ways to deal with missing data:
◉ Replace the missing value with some default value
◉ Replace the missing value with the field mean for fields that take numerical values, or the mode (if it exists) for fields that take categorical values
◉ Replace the missing value with a value generated at random from the observed field distribution
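The second strategy can be sketched in Python, assuming a field is represented as a list with None marking missing entries (the function name and mixed-type handling are our own illustration):

```python
from statistics import mean, multimode

def impute(values, default=None):
    """Fill None entries: numeric fields get the mean of the present values,
    categorical fields get the mode (falling back to `default` when no
    single mode exists)."""
    present = [v for v in values if v is not None]
    if all(isinstance(v, (int, float)) for v in present):
        fill = mean(present)
    else:
        modes = multimode(present)
        fill = modes[0] if len(modes) == 1 else default
    return [fill if v is None else v for v in values]

print(impute([21, 24, None, 12]))      # mean of 21, 24, 12 is 19
print(impute(["A", "A", None, "B"]))   # mode of the present values is "A"
```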
Mean, Median, and Mode

The mean for a population of size n is the sum of the values divided by n:

mean = (x1 + x2 + … + xn) / n

◉ Consider the following list of 9 numbers:
13, 15, 12, 17, 22, 11, 13, 19, 12

Mean = (13 + 15 + 12 + 17 + 22 + 11 + 13 + 19 + 12)/9 = 14.88889

Mean, Median, and Mode

The median is the middle value of the ordered list of numbers.
◉ Consider the following list of 9 numbers:
13, 15, 12, 17, 22, 11, 13, 19, 12

To compute the median, first order the numbers:
11, 12, 12, 13, 13, 15, 17, 19, 22

Hence, the median is 13.

Mean, Median, and Mode

◉ Consider the following list of 10 numbers:
13, 15, 12, 17, 22, 11, 13, 19, 12, 14

To compute the median, first order the numbers:
11, 12, 12, 13, 13, 14, 15, 17, 19, 22

With an even count, the median is the average of the two middle values: (13 + 14)/2 = 13.5.

Mean, Median, and Mode

The mode of a set of data is the value in the set that occurs most often.
◉ Consider the following list of numbers:
13, 15, 12, 17, 22, 11, 13, 19, 13

Number:     13  15  12  17  22  11  19
Occurrence:  3   1   1   1   1   1   1

Mode is 13.
Mean, Median, and Mode

◉ Consider the following list of numbers:
13, 15, 12, 17, 22, 11, 13, 19, 12

Number:     13  15  12  17  22  11  19
Occurrence:  2   1   2   1   1   1   1

Modes are 13 and 12 (bimodal).
Mean, Median, and Mode

◉ Consider the following list of numbers:
13, 15, 12, 17, 22, 11, 19

Each value occurs exactly once, so there is no mode.
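Python's standard statistics module computes all three measures directly; note that multimode returns a list, which covers the bimodal case above (and returns every value when all tie, i.e. when there is no mode). A small sketch using the first list from these slides:

```python
from statistics import mean, median, multimode

data = [13, 15, 12, 17, 22, 11, 13, 19, 12]

print(mean(data))       # about 14.89
print(median(data))     # 13
print(multimode(data))  # [13, 12] -- bimodal, in order of first appearance
```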
Handling Missing Values

(A sample table of fields with missing values, not reproduced here.)

Handling Missing Values (Using Default Values)

Defaults:
• Field1 default: 0
• Field2 default: N
• Field3 default: 240
• Field4 default: 50
Handling Missing Values (Using Means and Modes)

Use the mean for the numeric fields and the mode (if it exists) for the categorical fields.
If the mode doesn't exist, rely on either a default value or a random value.
Numeric fields: Field1, Field3, and Field4
Field1 mean = (21+24+22+12+11+16+16+17+18)/9 = 17.44
Field3 mean = 334.44
Field4 mean = 81.78
If a field doesn't accept decimal values, round the mean value.
Handling Missing Values (Using Means and Modes)

Field2 is categorical, hence we need to compute the mode from the existing values:

Category:   A  B  W  C
Occurrence: 3  1  2  2

Hence, the mode is A.

Handling Missing Values (Using Means and Modes)

Assumptions:
• Assume Field1 and Field4 don't accept decimal numbers; hence we round the mean
• Field3 accepts decimal numbers; hence we don't round the mean value
Handling Missing Values (Using Random Values)
Handling Outliers

(Sample table of a data set with possible outlier values, not reproduced here.)

• Outliers are data values that deviate from the expected values of the rest of the data set
• Outliers are extreme values that lie near the limits of the data range or go against the trend of the remaining data
• Normally, outliers need more investigation to make sure that they don't result from data-entry errors
Handling Outliers Using the Inter-Quartile Range

(Diagram: the sorted data items are split into four equal parts at Q1, Q2, and Q3; 25% of the items fall below Q1, 50% below Q2, and 75% below Q3.)

• A quartile is any of the three values that divide the sorted data set into four equal parts
• The first quartile (Q1) cuts off the lowest 25% of data
• The second quartile (Q2) cuts the data set in half (it is the median of the data set)
• The third quartile (Q3) cuts off the highest 25% of data
Computing Q1, Q2, and Q3

To compute Q1, Q2, and Q3, use the following method:
◉ Order the given data set in ascending order.
◉ Use the median to divide the ordered data set into two halves. This median is the second quartile (Q2). Exclude this median (if it is one of the data items) from any further computation.
◉ The first quartile (Q1) is the median of the lower half of the data.
◉ The third quartile (Q3) is the median of the upper half of the data.

Example #1 of Computing Q1, Q2, and Q3

Compute Q1, Q2, and Q3 for the following data set:
6, 47, 49, 15, 42, 41, 7, 39, 43, 40, 36

◉ Order the given data set in ascending order:
6, 7, 15, 36, 39, 40, 41, 42, 43, 47, 49

◉ Q2 = 40 (the median of the data set)
◉ Q1 is the median of the lower half (6, 7, 15, 36, 39): Q1 = 15
◉ Q3 is the median of the upper half (41, 42, 43, 47, 49): Q3 = 43
Example #2 of Computing Q1, Q2, and Q3

Compute Q1, Q2, and Q3 for the following data set:
39, 36, 7, 40, 41, 17

◉ Order the given data set in ascending order:
7, 17, 36, 39, 40, 41

◉ Q2 = (36 + 39)/2 = 37.5 (the median of the data set). Note that the number of data items is even, so the median is the average of the middle two items.
◉ The median is not a data item, so all items in the lower half (7, 17, 36) are used to compute Q1 and the rest (39, 40, 41) to compute Q3.
◉ Q1 = 17
◉ Q3 = 40
Detecting Outliers Using the Inter-Quartile Range

Compute the inter-quartile range (IQR) as follows:
IQR = Q3 - Q1
A data value is an outlier if:
◉ its value is <= (Q1 - 1.5*IQR), or
◉ its value is >= (Q3 + 1.5*IQR).
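The whole procedure (the quartile method described above plus the 1.5*IQR rule) can be sketched in Python. This follows the lecture's median-exclusion convention for odd-length data sets, which differs from some library defaults, and the function names are our own:

```python
def _median(xs):
    """Median of an already-sorted list."""
    mid = len(xs) // 2
    return xs[mid] if len(xs) % 2 else (xs[mid - 1] + xs[mid]) / 2

def quartiles(data):
    """Return (Q1, Q2, Q3). When the count is odd, the median is a data
    item and is excluded from both halves before computing Q1 and Q3."""
    s = sorted(data)
    n = len(s)
    half = n // 2
    if n % 2:
        q2 = s[half]
        lower, upper = s[:half], s[half + 1:]
    else:
        q2 = (s[half - 1] + s[half]) / 2
        lower, upper = s[:half], s[half:]
    return _median(lower), q2, _median(upper)

def iqr_outliers(data):
    """Values at or beyond 1.5*IQR from the quartiles, per the rule above."""
    q1, _, q3 = quartiles(data)
    iqr = q3 - q1
    lo, hi = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [x for x in data if x <= lo or x >= hi]

print(quartiles([6, 47, 49, 15, 42, 41, 7, 39, 43, 40, 36]))  # (15, 40, 43)
print(iqr_outliers([75000, 40000, 10000000, 50000, 99999, 75000]))  # [10000000]
```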
Example of Detecting Outliers Using the Inter-Quartile Range

Data set that might contain outliers:
75000, -40000, 10000000, 50000, 99999
Example of Detecting Outliers Using the Inter-Quartile Range

Data set: 75000, -40000, 10000000, 50000, 99999
Ordered data set: -40000, 50000, 75000, 99999, 10000000
Q2 = 75000
Q1 = (-40000 + 50000)/2 = 5000
Q3 = (99999 + 10000000)/2 = 5049999.5
IQR = Q3 - Q1 = 5049999.5 - 5000 = 5044999.5
Q1 - 1.5*IQR = 5000 - 1.5*5044999.5 = -7562499.25
Q3 + 1.5*IQR = 5049999.5 + 1.5*5044999.5 = 12617498.75
All data values fall within this range; hence there are no outliers in this example.
Example of Detecting Outliers Using the Inter-Quartile Range

Data set that might contain outliers:
75000, 40000, 10000000, 50000, 99999, 75000
Example of Detecting Outliers Using the Inter-Quartile Range

Data set: 75000, 40000, 10000000, 50000, 99999, 75000
Ordered data set: 40000, 50000, 75000, 75000, 99999, 10000000
Q2 = (75000 + 75000)/2 = 75000
Q1 = 50000
Q3 = 99999
IQR = Q3 - Q1 = 99999 - 50000 = 49999
Q1 - 1.5*IQR = 50000 - 1.5*49999 = -24998.5
Q3 + 1.5*IQR = 99999 + 1.5*49999 = 174997.5
Hence the data item 10000000 is an outlier and should be re-investigated for any data-entry errors.
Noisy Data

Noisy data are data that contain incorrect values.
Some reasons for noisy data:
◉ Data collection instruments may be faulty
◉ Human or computer errors may occur during data entry
◉ Transmission errors may occur
◉ Technology limitations, such as buffer size, may cause errors during data entry
Noisy Data

(Examples of noisy data shown in a sample table, not reproduced here.)
Smoothing Noisy Data

By smoothing noisy data we can correct the errors.
Smoothing noisy data is performed by:
◉ Validation and correction
◉ Standardization
Validation and Correction of Noisy Data

This step examines the data for data-entry errors and tries to correct them automatically as far as possible, according to the following guidelines:
◉ Spell checking based on dictionary lookup is useful for identifying and correcting misspellings.
Example: "Kairo" can be spell-checked and corrected to "Cairo".
◉ Dictionaries of geographic names and zip codes help to correct address data.
Example: Zip code 1243456 can be detected as an error since no zip code matches this value.
Validation and Correction of Noisy Data (Cont.)

◉ Check validation rules and make sure field values follow them; for example:
■ Age is not less than a certain amount, and age is a positive number.
Example: if a rule governing your data says that age must be between 20 and 60, then ages of 18, 15, and 68 are detected as errors.
■ Each categorical value belongs to a defined category.
Example: if all the categories you have are A, B, C, and D, then if categories W or N are found they will be declared as errors.
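A minimal validation sketch in Python, using the age bounds and category set from the examples above (the field names and return shape are our own illustration):

```python
def validate_record(record, min_age=20, max_age=60,
                    categories=frozenset({"A", "B", "C", "D"})):
    """Collect rule violations for one record. The age bounds and the
    allowed category set are the (assumed) rules from the example above."""
    errors = []
    if not (min_age <= record["age"] <= max_age):
        errors.append(f"age {record['age']} outside [{min_age}, {max_age}]")
    if record["category"] not in categories:
        errors.append(f"unknown category {record['category']!r}")
    return errors

print(validate_record({"age": 35, "category": "B"}))   # no violations
print(validate_record({"age": 18, "category": "W"}))   # two violations
```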
Validation and Correction of Noisy Data (Cont.)

◉ Check the fields that have ambiguous values and check for any possible data-entry errors.
Example: using the same category value to refer to different meanings. "S" in the "Marital Status" field could refer to "Single" or "Separated".
Standardization to Smooth Noisy Data

Data values should be consistent and have a uniform format. For example:
◉ Date and time entries should have a specific format:
Oct. 19, 2009 / 10/19/2009 / 19/10/2009
All dates must be written in the same format that has been agreed upon (e.g., Day/Month/Year).
◉ Names and other string data should be converted to either upper or lower case:
MOHAMED AHMED instead of Mohamed Ahmed
◉ Prefixes and suffixes should be removed from names:
Mohamed Ahmed instead of Mr. Mohamed Ahmed
Mohamed Ahmed instead of Mohamed Ahmed, Ph.D.
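Date standardization can be sketched with the standard datetime module: try the expected input formats in turn and re-emit one agreed format (here Day/Month/Year, as in the example). Note that purely numeric dates such as 05/06/2009 stay ambiguous, so the order in which formats are tried is itself a convention you must agree upon:

```python
from datetime import datetime

def standardize_date(text):
    """Parse a date written in any of several assumed input formats and
    re-emit it in the agreed Day/Month/Year format."""
    for fmt in ("%b. %d, %Y", "%m/%d/%Y", "%d/%m/%Y"):
        try:
            return datetime.strptime(text, fmt).strftime("%d/%m/%Y")
        except ValueError:
            continue  # this format didn't match; try the next one
    raise ValueError(f"unrecognized date format: {text!r}")

print(standardize_date("Oct. 19, 2009"))  # 19/10/2009
print(standardize_date("10/19/2009"))     # 19/10/2009
```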
Standardization to Smooth Noisy Data (Cont.)

◉ Abbreviations and encoding schemes should be consistently resolved by consulting special dictionaries or applying predefined conversion rules.
Example: "US" is the standard abbreviation of "United States".

Data Inconsistency

Data inconsistency means that different data items contain discrepancies in their values.
It can occur when data items depend on other data items and their values don't match; for example:
◉ Age and Birth-date: age can be computed from the birth-date, hence the value of Age must match the value computed from the birth-date
◉ City and Phone-area-code: each city has a certain area code
◉ Total-price, unit-price, and quantity: total-price can be computed from the unit-price and quantity
These dependencies can be utilized to detect errors and to substitute missing values or correct wrong values.
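The total-price dependency can be checked mechanically. A small Python sketch (the field names are illustrative, and the tolerance guards against floating-point rounding):

```python
def check_consistency(row):
    """Flag rows where dependent fields disagree: total_price must equal
    quantity * unit_price (field names assumed for illustration)."""
    problems = []
    expected = row["quantity"] * row["unit_price"]
    if abs(row["total_price"] - expected) > 1e-9:
        problems.append(f"total_price {row['total_price']} != {expected}")
    return problems

print(check_consistency({"quantity": 3, "unit_price": 20.0, "total_price": 60.0}))  # []
print(check_consistency({"quantity": 3, "unit_price": 20.0, "total_price": 75.0}))  # mismatch reported
```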
Data Inconsistency Example

(Example of inconsistent data shown in a sample table; the inconsistencies are marked in red.)
• Incorrect area code
• Total_Price doesn't equal (Quantity * Unit_Price)
