
Fundamentals of Data Science

LECTURE 2: INTRODUCTION

Course Book
Data Science: Concepts and Practice
Authors: Vijay Kotu & Bala Deshpande
Publisher: Morgan Kaufmann
What is Data Science?

❑ Data science is a collection of techniques used to extract value from data.
❑ It has become an essential tool for any organization that collects, stores, and processes data as part of its operations.
❑ Data science techniques rely on finding useful patterns, connections, and relationships within data.
❑ Data science is also commonly referred to as knowledge discovery, machine learning, predictive analytics, and data mining.

What is Data Science?

❑ Data science enables businesses to process huge amounts of data to detect patterns.
❑ This in turn allows companies to increase efficiency, manage costs, identify new market opportunities, and boost their market advantage.
❑ Data science is the practice of mining large sets of raw data, both structured and unstructured, to identify patterns and extract actionable insights from them.
❑ It is an interdisciplinary field whose foundations include statistics, computer science, predictive analytics, machine learning algorithm development, and new technologies for gaining insights from big data.

AI, MACHINE LEARNING, AND DATA SCIENCE

Artificial intelligence, machine learning, and data science are all related to each other.

❑ Artificial intelligence is about giving machines the capability of mimicking human behavior, particularly cognitive functions. Examples include facial recognition, automated driving, and sorting mail based on postal code.
❑ A wide range of techniques fall under artificial intelligence: linguistics, natural language processing, decision science, bias, vision, robotics, planning, etc.
AI, MACHINE LEARNING, AND DATA SCIENCE

❑ Machine learning, which can be considered either a sub-field or one of the tools of artificial intelligence, provides machines with the capability of learning from experience.
❑ Experience for machines comes in the form of data.
❑ Data that is used to teach machines is called training data.
❑ Machine learning turns the traditional programming model upside down.
❑ A program, a set of instructions to a computer, transforms input signals into output signals using predetermined rules and relationships. Machine learning algorithms, by contrast, derive the rules and relationships from example inputs and outputs.
AI, MACHINE LEARNING, AND DATA SCIENCE

❑ Data science is the business application of machine learning, artificial intelligence, and other quantitative fields like statistics, visualization, and mathematics.
❑ It is an interdisciplinary field that extracts value from data.
Review Questions
• What is implied by Data Science?
• Compare between AI, ML, and DS.
WHAT IS DATA SCIENCE?

❑ Data science starts with data, which can range from a simple array of a few numeric observations to a complex matrix of millions of observations with thousands of variables.
❑ Data science utilizes certain specialized computational methods in order to discover meaningful and useful structures within a dataset.

KEY FEATURES AND MOTIVATIONS OF DATA SCIENCE

1- Extracting Meaningful Patterns
2- Building Representative Models
3- Combination of Statistics, Machine Learning, and Computing
4- Learning Algorithms
5- Associated Fields
What Can Data Science Be Used For?

Data science is used in almost all fields worldwide, including:
• healthcare
• marketing
• banking, finance, and policy work
• social intelligence
• political planning and economics
In short, almost all real-life fields with huge amounts of accumulated data.
Data Science vs Data Analytics

❑ Although the work of data scientists and data analysts is sometimes conflated, these fields are not the same.
❑ The term "data science analyst" really just means one or the other.
❑ In summary, a data scientist is more likely to look ahead, predicting or forecasting as they look at data.
❑ A data analyst is more likely to focus on answering specific questions, digging into existing data sets that have already been processed for insights.
Data Science vs Data Analytics

❑ Data scientist: comes in earlier in the game than a data analyst, exploring a massive data set, investigating its potential, identifying trends and insights, and visualizing them for others.
❑ Data analyst: sees data at a later stage; reports on what it tells them, makes prescriptions for better performance based on the analysis, and optimizes any data-related tools.
❑ Data scientist: more likely to tackle larger masses of both structured and unstructured data.
❑ Data analyst: likely to analyze a specific dataset of structured or numerical data using a given question or questions.

Data analytics has more to do with placing historical data in context and less to do with predictive modeling and machine learning. Data analysis isn't an open-minded search for the right question; it relies upon having the right questions in place from the start. Furthermore, unlike data scientists, data analysts typically do not create statistical models.

DATA SCIENCE CLASSIFICATION

Data science problems can be broadly categorized into supervised or unsupervised learning models.
• Supervised or directed data science tries to infer a function or relationship based on labeled training data and uses this function to map new unlabeled data.
• Supervised techniques predict the value of the output variables based on a set of input variables. To do this, a model is developed from a training dataset where the values of input and output are previously known.
• Supervised data science needs a sufficient number of labeled records to learn the model from the data.
DATA SCIENCE CLASSIFICATION

• Unsupervised or undirected data science uncovers hidden patterns in unlabeled data.
• In unsupervised data science, there are no output variables to predict.
• The objective of this class of data science techniques is to find patterns in data based on the relationships between the data points themselves.
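The supervised idea (learn a mapping from labeled training data, then apply it to new unlabeled points) can be sketched as a tiny one-nearest-neighbor classifier in Python; the training pairs and segment labels below are invented for illustration:

```python
# A minimal sketch of supervised learning: 1-nearest-neighbor classification.
# "Experience" is the labeled training data; prediction maps a new, unlabeled
# value to the label of its closest training point.

def predict_1nn(training, x):
    """Return the label of the training point closest to x.

    training: list of (value, label) pairs -- the labeled training data.
    x: a new, unlabeled value to classify.
    """
    _, nearest_label = min(training, key=lambda pair: abs(pair[0] - x))
    return nearest_label

# Hypothetical labeled data: incomes (in thousands) tagged with a segment.
train = [(20, "budget"), (25, "budget"), (80, "premium"), (95, "premium")]

print(predict_1nn(train, 30))   # closest training point is (25, "budget")
print(predict_1nn(train, 90))   # closest training point is (95, "premium")
```

An unsupervised method, by contrast, would receive only the incomes, with no labels, and group them by proximity alone.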
What are the Data Science Tasks?
Tasks, Algorithms, and Examples

Classification
Description: Predict whether a data point belongs to one of the predefined classes. The prediction is based on learning from a known data set.
Algorithms: Decision trees, neural networks, Bayesian models, induction rules, k-nearest neighbors.
Examples: Assigning voters into known buckets by political parties, e.g. soccer moms; bucketing new customers into one of the known customer groups.

Regression
Description: Predict the numeric target label of a data point. The prediction is based on learning from a known data set.
Algorithms: Linear regression, logistic regression.
Examples: Predicting the unemployment rate for next year; estimating insurance premiums.

Anomaly detection
Description: Predict whether a data point is an outlier compared to the other data points in the data set.
Algorithms: Distance-based, density-based, LOF.
Examples: Fraud detection in credit card transactions; network intrusion detection.

Time series
Description: Predict the value of the target variable for a future time frame based on historical values.
Algorithms: Exponential smoothing, ARIMA, regression.
Examples: Sales forecasting, production forecasting, virtually any growth phenomenon that needs to be extrapolated.

Clustering
Description: Identify natural clusters within the data set based on inherent properties within the data set.
Algorithms: k-means, density-based clustering (DBSCAN).
Examples: Finding customer segments in a company based on transaction, web, and customer call data.

Association analysis
Description: Identify relationships within an itemset based on transaction data.
Algorithms: FP-Growth, Apriori.
Examples: Finding cross-selling opportunities for a retailer based on transaction purchase history.
Types of Data
• Qualitative and quantitative
• Simple quantitative analysis
• Simple qualitative analysis

Quantitative and Qualitative
• Quantitative data – expressed as numbers
• Qualitative data – difficult to measure sensibly as numbers, e.g. counting the number of words used to express dissatisfaction
• Quantitative analysis – numerical methods to ascertain size, magnitude, amount (you have a solid number, a fact)
• Qualitative analysis – expresses the nature of elements and is represented as themes, patterns, stories

• Be careful how you manipulate data and numbers!
Quantitative and Qualitative: Examples

» Red, Blue, Yellow, Black: Qualitative Nominal
» Very Unhappy, Unhappy: Qualitative Ordinal
» 15, 123, 12, -10: Quantitative Discrete
» 14.2, 11.09, 1000: Quantitative Continuous
» Baby, Child, Teenager, Adult, Old: Qualitative Ordinal
» Alexandria, Cairo, Luxor: Qualitative Nominal
» East, West, North, South: Qualitative Nominal
» Student Age: Quantitative Discrete
» Product Weight: Quantitative Continuous
» Price: Quantitative Discrete
» Gender: Qualitative Nominal
» Phone Numbers: Qualitative Nominal
» Course Grades (A, B, C, …): Qualitative Ordinal
» User Feedback Paragraph: Qualitative Nominal
» Neutral, Happy, Very Happy: Qualitative Ordinal
Quantitative vs Qualitative

Quantitative:
» Objective variables for data gathering.
» Values that can be counted, such as age, weight, volume, and scale.
» Researchers aim to increase the sample size: the more data points, the more accurate.
» Definite, fixed, and measurable reality.

Qualitative:
» Subjective parameters for data gathering.
» Things that can be described using the five senses, such as color, smell, taste, touch or feeling, typology, and shapes.
» Researchers aim to get a variety of values to examine and understand; it is costly to have a large sample size.
» Dynamic and negotiable reality.
Simple Quantitative Analysis
• Averages
  – Mean: add up the values and divide by the number of data points
  – Median: the middle value of the data when ranked
  – Mode: the value that appears most often in the data
• Histogram
• Skewness
• Outliers
Skewness
Outliers
Qualitative Ordinal

• It has a rank or order.
• It establishes a relative rank.
• It has no standardized interval scale.
• The median and mode can be analyzed.
• It can't have a mean.
Lie Factor

The lie factor is a value describing the relation between the size of an effect shown in a graphic and the size of the effect shown in the data:

Lie Factor = (size of effect shown in graphic) / (size of effect shown in data)

where the size of an effect is (second value - first value) / first value.

An acceptable lie factor is between 0.95 and 1.05.
Lie Factor: Example

» Effect in graphic = (1550 - 600) / 600 = 1.58
» Effect in data = (11750 - 10800) / 10800 = 0.088
» Lie Factor = 1.58 / 0.088 ≈ 18
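As a quick check, the computation above can be reproduced in Python (a small sketch; the function name is ours):

```python
def lie_factor(graphic_start, graphic_end, data_start, data_end):
    """Ratio of the effect shown in the graphic to the effect in the data."""
    graphic_effect = (graphic_end - graphic_start) / graphic_start
    data_effect = (data_end - data_start) / data_start
    return graphic_effect / data_effect

# The example above: bars grow from 600 to 1550 while the underlying
# data only moves from 10800 to 11750.
print(round(lie_factor(600, 1550, 10800, 11750)))  # 18
```

A value of 18 means the graphic exaggerates the change eighteen-fold, far outside the acceptable 0.95 to 1.05 band.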
Data Cleansing

Reasons for Data Cleansing

The data to be analyzed may be:
◉ Incomplete: data is missing
◉ Noisy: data may contain errors or outlier values
◉ Inconsistent: data may contain discrepancies in the values
How Can Data be Cleaned?

◉ Filling in missing values
◉ Smoothing noisy data
◉ Identifying and removing outliers
◉ Resolving inconsistency
Typical Example

(Sample table not reproduced here.)

Typical Example (Incomplete Data)

(Sample table not reproduced here.)

Typical Example (Data Value Errors)

Reasons:
• The Zip code consists of five digits and cannot contain any letters
• Income must be a positive number
• Age must be a positive number

Typical Example (Outlier Values)

Reasons:
• Outliers are data values that deviate from the expected values of the rest of the data set
• The values 10000000 and -40000 look very divergent from the rest of the values

Typical Example (Ambiguity)

Reasons:
• "S" in Marital Status could refer to "Single" or "Separated"
• So, there is a kind of ambiguity in the data
Reasons for Incomplete Data

Relevant data may not be recorded because of:
◉ A misunderstanding by the data entry personnel
◉ Equipment failure
Relevant data may not be available because it is unknown or because providing it is optional.
Dealing with Incomplete Data

There are several ways to deal with missing data:
◉ Replace the missing value with some default value
◉ Replace the missing value with the field mean for fields that take numerical values, or the mode (if it exists) for fields that take categorical values
◉ Replace the missing value with a value generated at random from the observed field distribution
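The second strategy can be sketched in Python, assuming a field is represented as a list with None marking missing entries (the function name and mixed-type handling are our own illustration):

```python
from statistics import mean, multimode

def impute(values, default=None):
    """Fill None entries: numeric fields get the mean of the present values,
    categorical fields get the mode (falling back to `default` when no
    single mode exists)."""
    present = [v for v in values if v is not None]
    if all(isinstance(v, (int, float)) for v in present):
        fill = mean(present)
    else:
        modes = multimode(present)
        fill = modes[0] if len(modes) == 1 else default
    return [fill if v is None else v for v in values]

print(impute([21, 24, None, 12]))      # mean of 21, 24, 12 is 19
print(impute(["A", "A", None, "B"]))   # mode of the present values is "A"
```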
Mean, Median, and Mode

The mean for a population of size n is the sum of the values divided by n:

mean = (x1 + x2 + … + xn) / n

◉ Consider the following list of 9 numbers:
13, 15, 12, 17, 22, 11, 13, 19, 12

Mean = (13 + 15 + 12 + 17 + 22 + 11 + 13 + 19 + 12)/9 = 14.88889

Mean, Median, and Mode

The median is the middle value of the ordered list of numbers.
◉ Consider the following list of 9 numbers:
13, 15, 12, 17, 22, 11, 13, 19, 12

To compute the median, first order the numbers:
11, 12, 12, 13, 13, 15, 17, 19, 22

Hence, the median is 13.

Mean, Median, and Mode

◉ Consider the following list of 10 numbers:
13, 15, 12, 17, 22, 11, 13, 19, 12, 14

To compute the median, first order the numbers:
11, 12, 12, 13, 13, 14, 15, 17, 19, 22

With an even count, the median is the average of the two middle values: (13 + 14)/2 = 13.5.

Mean, Median, and Mode

The mode of a set of data is the value in the set that occurs most often.
◉ Consider the following list of numbers:
13, 15, 12, 17, 22, 11, 13, 19, 13

Number:     13  15  12  17  22  11  19
Occurrence:  3   1   1   1   1   1   1

Mode is 13.
Mean, Median, and Mode

◉ Consider the following list of numbers:
13, 15, 12, 17, 22, 11, 13, 19, 12

Number:     13  15  12  17  22  11  19
Occurrence:  2   1   2   1   1   1   1

Modes are 13 and 12 (bimodal).
Mean, Median, and Mode

◉ Consider the following list of numbers:
13, 15, 12, 17, 22, 11, 19

Each value occurs exactly once, so there is no mode.
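Python's standard statistics module computes all three measures directly; note that multimode returns a list, which covers the bimodal case above (and returns every value when all tie, i.e. when there is no mode). A small sketch using the first list from these slides:

```python
from statistics import mean, median, multimode

data = [13, 15, 12, 17, 22, 11, 13, 19, 12]

print(mean(data))       # about 14.89
print(median(data))     # 13
print(multimode(data))  # [13, 12] -- bimodal, in order of first appearance
```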
Handling Missing Values

(A sample table of fields with missing values, not reproduced here.)

Handling Missing Values (Using Default Values)

Defaults:
• Field1 default: 0
• Field2 default: N
• Field3 default: 240
• Field4 default: 50
Handling Missing Values (Using Means and Modes)

Use the mean for the numeric fields and the mode (if it exists) for the categorical fields.
If the mode doesn't exist, rely on either a default value or a random value.
Numeric fields: Field1, Field3, and Field4
Field1 mean = (21+24+22+12+11+16+16+17+18)/9 = 17.44
Field3 mean = 334.44
Field4 mean = 81.78
If a field doesn't accept decimal values, round the mean value.
Handling Missing Values (Using Means and Modes)

Field2 is categorical, hence we need to compute the mode from the existing values:

Category:   A  B  W  C
Occurrence: 3  1  2  2

Hence, the mode is A.

Handling Missing Values (Using Means and Modes)

Assumptions:
• Assume Field1 and Field4 don't accept decimal numbers; hence we round the mean
• Field3 accepts decimal numbers; hence we don't round the mean value
Handling Missing Values (Using Random Values)
Handling Outliers

(Sample table of a data set with possible outlier values, not reproduced here.)

• Outliers are data values that deviate from the expected values of the rest of the data set
• Outliers are extreme values that lie near the limits of the data range or go against the trend of the remaining data
• Normally, outliers need more investigation to make sure that they don't result from data-entry errors
Handling Outliers Using the Inter-Quartile Range

(Diagram: the sorted data items are split into four equal parts at Q1, Q2, and Q3; 25% of the items fall below Q1, 50% below Q2, and 75% below Q3.)

• A quartile is any of the three values that divide the sorted data set into four equal parts
• The first quartile (Q1) cuts off the lowest 25% of data
• The second quartile (Q2) cuts the data set in half (it is the median of the data set)
• The third quartile (Q3) cuts off the highest 25% of data
Computing Q1, Q2, and Q3

To compute Q1, Q2, and Q3, use the following method:
◉ Order the given data set in ascending order.
◉ Use the median to divide the ordered data set into two halves. This median is the second quartile (Q2). Exclude this median (if it is one of the data items) from any further computation.
◉ The first quartile (Q1) is the median of the lower half of the data.
◉ The third quartile (Q3) is the median of the upper half of the data.

Example #1 of Computing Q1, Q2, and Q3

Compute Q1, Q2, and Q3 for the following data set:
6, 47, 49, 15, 42, 41, 7, 39, 43, 40, 36

◉ Order the given data set in ascending order:
6, 7, 15, 36, 39, 40, 41, 42, 43, 47, 49

◉ Q2 = 40 (the median of the data set)
◉ Q1 is the median of the lower half (6, 7, 15, 36, 39): Q1 = 15
◉ Q3 is the median of the upper half (41, 42, 43, 47, 49): Q3 = 43
Example #2 of Computing Q1, Q2, and Q3

Compute Q1, Q2, and Q3 for the following data set:
39, 36, 7, 40, 41, 17

◉ Order the given data set in ascending order:
7, 17, 36, 39, 40, 41

◉ Q2 = (36 + 39)/2 = 37.5 (the median of the data set). Note that the number of data items is even, so the median is the average of the middle two items.
◉ The median is not a data item, so all items in the lower half (7, 17, 36) are used to compute Q1 and the rest (39, 40, 41) to compute Q3.
◉ Q1 = 17
◉ Q3 = 40
Detecting Outliers Using the Inter-Quartile Range

Compute the inter-quartile range (IQR) as follows:
IQR = Q3 - Q1
A data value is an outlier if:
◉ its value is <= (Q1 - 1.5*IQR), or
◉ its value is >= (Q3 + 1.5*IQR).
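The whole procedure (the quartile method described above plus the 1.5*IQR rule) can be sketched in Python. This follows the lecture's median-exclusion convention for odd-length data sets, which differs from some library defaults, and the function names are our own:

```python
def _median(xs):
    """Median of an already-sorted list."""
    mid = len(xs) // 2
    return xs[mid] if len(xs) % 2 else (xs[mid - 1] + xs[mid]) / 2

def quartiles(data):
    """Return (Q1, Q2, Q3). When the count is odd, the median is a data
    item and is excluded from both halves before computing Q1 and Q3."""
    s = sorted(data)
    n = len(s)
    half = n // 2
    if n % 2:
        q2 = s[half]
        lower, upper = s[:half], s[half + 1:]
    else:
        q2 = (s[half - 1] + s[half]) / 2
        lower, upper = s[:half], s[half:]
    return _median(lower), q2, _median(upper)

def iqr_outliers(data):
    """Values at or beyond 1.5*IQR from the quartiles, per the rule above."""
    q1, _, q3 = quartiles(data)
    iqr = q3 - q1
    lo, hi = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [x for x in data if x <= lo or x >= hi]

print(quartiles([6, 47, 49, 15, 42, 41, 7, 39, 43, 40, 36]))  # (15, 40, 43)
print(iqr_outliers([75000, 40000, 10000000, 50000, 99999, 75000]))  # [10000000]
```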
Example of Detecting Outliers Using the Inter-Quartile Range

Data set that might contain outliers:
75000, -40000, 10000000, 50000, 99999
Example of Detecting Outliers Using the Inter-Quartile Range

Data set: 75000, -40000, 10000000, 50000, 99999
Ordered data set: -40000, 50000, 75000, 99999, 10000000
Q2 = 75000
Q1 = (-40000 + 50000)/2 = 5000
Q3 = (99999 + 10000000)/2 = 5049999.5
IQR = Q3 - Q1 = 5049999.5 - 5000 = 5044999.5
Q1 - 1.5*IQR = 5000 - 1.5*5044999.5 = -7562499.25
Q3 + 1.5*IQR = 5049999.5 + 1.5*5044999.5 = 12617498.75
All data values fall within this range; hence there are no outliers in this example.
Example of Detecting Outliers Using the Inter-Quartile Range

Data set that might contain outliers:
75000, 40000, 10000000, 50000, 99999, 75000
Example of Detecting Outliers Using the Inter-Quartile Range

Data set: 75000, 40000, 10000000, 50000, 99999, 75000
Ordered data set: 40000, 50000, 75000, 75000, 99999, 10000000
Q2 = (75000 + 75000)/2 = 75000
Q1 = 50000
Q3 = 99999
IQR = Q3 - Q1 = 99999 - 50000 = 49999
Q1 - 1.5*IQR = 50000 - 1.5*49999 = -24998.5
Q3 + 1.5*IQR = 99999 + 1.5*49999 = 174997.5
Hence the data item 10000000 is an outlier and should be re-investigated for any data-entry errors.
Noisy Data

Noisy data are data that contain incorrect values.
Some reasons for noisy data:
◉ Data collection instruments may be faulty
◉ Human or computer errors may occur during data entry
◉ Transmission errors may occur
◉ Technology limitations, such as buffer size, may cause errors during data entry
Noisy Data

(Examples of noisy data shown in a sample table, not reproduced here.)
Smoothing Noisy Data

By smoothing noisy data we can correct the errors.
Smoothing noisy data is performed by:
◉ Validation and correction
◉ Standardization
Validation and Correction of Noisy Data

This step examines the data for data-entry errors and tries to correct them automatically as far as possible, according to the following guidelines:
◉ Spell checking based on dictionary lookup is useful for identifying and correcting misspellings.
Example: "Kairo" can be spell-checked and corrected to "Cairo".
◉ Dictionaries of geographic names and zip codes help to correct address data.
Example: Zip code 1243456 can be detected as an error since no zip code matches this value.
Validation and Correction of Noisy Data (Cont.)

◉ Check validation rules and make sure field values follow them; for example:
■ Age is not less than a certain amount, and age is a positive number.
Example: if a rule governing your data says that age must be between 20 and 60, then ages of 18, 15, and 68 are detected as errors.
■ Each categorical value belongs to a defined category.
Example: if all the categories you have are A, B, C, and D, then if categories W or N are found they will be declared as errors.
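A minimal validation sketch in Python, using the age bounds and category set from the examples above (the field names and return shape are our own illustration):

```python
def validate_record(record, min_age=20, max_age=60,
                    categories=frozenset({"A", "B", "C", "D"})):
    """Collect rule violations for one record. The age bounds and the
    allowed category set are the (assumed) rules from the example above."""
    errors = []
    if not (min_age <= record["age"] <= max_age):
        errors.append(f"age {record['age']} outside [{min_age}, {max_age}]")
    if record["category"] not in categories:
        errors.append(f"unknown category {record['category']!r}")
    return errors

print(validate_record({"age": 35, "category": "B"}))   # no violations
print(validate_record({"age": 18, "category": "W"}))   # two violations
```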
Validation and Correction of Noisy Data (Cont.)

◉ Check the fields that have ambiguous values and check for any possible data-entry errors.
Example: using the same category value to refer to different meanings. "S" in the "Marital Status" field could refer to "Single" or "Separated".
Standardization to Smooth Noisy Data

Data values should be consistent and have a uniform format. For example:
◉ Date and time entries should have a specific format:
Oct. 19, 2009 / 10/19/2009 / 19/10/2009
All dates must be written in the same format that has been agreed upon (e.g., Day/Month/Year).
◉ Names and other string data should be converted to either upper or lower case:
MOHAMED AHMED instead of Mohamed Ahmed
◉ Prefixes and suffixes should be removed from names:
Mohamed Ahmed instead of Mr. Mohamed Ahmed
Mohamed Ahmed instead of Mohamed Ahmed, Ph.D.
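Date standardization can be sketched with the standard datetime module: try the expected input formats in turn and re-emit one agreed format (here Day/Month/Year, as in the example). Note that purely numeric dates such as 05/06/2009 stay ambiguous, so the order in which formats are tried is itself a convention you must agree upon:

```python
from datetime import datetime

def standardize_date(text):
    """Parse a date written in any of several assumed input formats and
    re-emit it in the agreed Day/Month/Year format."""
    for fmt in ("%b. %d, %Y", "%m/%d/%Y", "%d/%m/%Y"):
        try:
            return datetime.strptime(text, fmt).strftime("%d/%m/%Y")
        except ValueError:
            continue  # this format didn't match; try the next one
    raise ValueError(f"unrecognized date format: {text!r}")

print(standardize_date("Oct. 19, 2009"))  # 19/10/2009
print(standardize_date("10/19/2009"))     # 19/10/2009
```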
Standardization to Smooth Noisy Data (Cont.)

◉ Abbreviations and encoding schemes should be consistently resolved by consulting special dictionaries or applying predefined conversion rules.
Example: "US" is the standard abbreviation of "United States".

Data Inconsistency

Data inconsistency means that different data items contain discrepancies in their values.
It can occur when data items depend on other data items and their values don't match; for example:
◉ Age and Birth-date: age can be computed from the birth-date, hence the value of Age must match the value computed from the birth-date
◉ City and Phone-area-code: each city has a certain area code
◉ Total-price, unit-price, and quantity: total-price can be computed from the unit-price and quantity
These dependencies can be utilized to detect errors and to substitute missing values or correct wrong values.
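The total-price dependency can be checked mechanically. A small Python sketch (the field names are illustrative, and the tolerance guards against floating-point rounding):

```python
def check_consistency(row):
    """Flag rows where dependent fields disagree: total_price must equal
    quantity * unit_price (field names assumed for illustration)."""
    problems = []
    expected = row["quantity"] * row["unit_price"]
    if abs(row["total_price"] - expected) > 1e-9:
        problems.append(f"total_price {row['total_price']} != {expected}")
    return problems

print(check_consistency({"quantity": 3, "unit_price": 20.0, "total_price": 60.0}))  # []
print(check_consistency({"quantity": 3, "unit_price": 20.0, "total_price": 75.0}))  # mismatch reported
```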
Data Inconsistency Example

(Example of inconsistent data shown in a sample table; the inconsistencies are marked in red.)
• Incorrect area code
• Total_Price doesn't equal (Quantity * Unit_Price)
