Data Science with R
Unit IV: Nature of Data, Data Pre-processing
M. Narasimha Raju
Asst Professor, Dept. of Computer Science &
Engineering
Topics: Nature of Data & Pre-processing:
Introduction
Data Types
  Structured Data
  Unstructured Data
  Challenges with Unstructured Data
Data Collections
  Open Data
  Social Media Data
  Multimodal Data
Data Storage and Presentation
Data Pre-processing
  Data Cleaning
  Data Integration
  Data Transformation
  Data Reduction
  Data Discretization
Data Types
Structured data refers to highly organized information that can be seamlessly included in a database and readily searched via simple search operations.
Unstructured data is essentially the opposite: devoid of any underlying structure.
In structured data, different values – whether they are numbers or something else – are labeled, which is not the case with unstructured data.
Structured data
The data has defined fields or labels. This data includes:
numbers (age, income, num.vehicles),
text (housing.type),
Boolean type (is.employed), and
categorical data (sex, marital.stat).
What matters for us is that any data we see here – whether it is a number, a category, or a text – is labeled.
Unstructured data
Unstructured data is data without labels. Consider this sentence:
"It was found that a female with a height between 65 inches and 67 inches had an IQ of 125–130."
Here we have several data points: 65, 67, 125–130, female. However, they are not clearly labeled, so we could not easily process them. And certainly, if we were to create a systematic process (an algorithm, a program) to go through such data or observations, we would be in trouble, because that process would not be able to identify which of these numbers corresponds to which quantity.
Humans, by contrast, have no difficulty understanding a paragraph like this that contains unstructured data.
Unstructured data
The lack of structure makes compiling and organizing unstructured data a time- and energy-consuming task. It would be easy to derive insights from unstructured data if it could be instantly transformed into structured data.
Email is unstructured data. An individual may arrange their inbox in such a way that it aligns with their organizational preferences, but that does not mean the data is structured.
Spreadsheets, which are arranged in a relational database format and can be quickly scanned for information, are considered structured data.
Data Collections
Open data is the idea that some data should be freely available in a public domain, usable by anyone as they wish, without restrictions from copyright, patents, or other mechanisms of control.
Public. Favor openness to the extent permitted by law and subject to privacy, confidentiality, security, or other valid restrictions.
Local and federal governments, non-government organizations (NGOs), and academic communities all lead open data initiatives. For example, you can visit data repositories produced by the US Government or the City of Chicago.
Data Collections
Accessible. Open data are made available in convenient, modifiable, and open formats that can be retrieved, downloaded, indexed, and searched.
Described. Open data are described fully so that consumers of the data have sufficient information to understand their strengths, weaknesses, analytical limitations, and security requirements, as well as how to process them (e.g., documentation of data elements, data dictionaries, etc.).
Data Collections
Reusable. Open data are made available under an open license.
Complete. Open data are published in primary forms.
Timely. Open data are released frequently, to preserve the value of the data.
Managed Post-Release. Support is provided to assist with data use and to respond to complaints about adherence to these open data requirements.
Social Media Data
Social media has become a gold mine for collecting data to analyze for research or marketing purposes. This is facilitated by the Application Programming Interfaces (APIs) that social media companies provide to researchers and developers, such as the Facebook Graph API.
Researchers and developers collect and use this data to accomplish a variety of tasks: new socially impactful applications, research on human information behavior, monitoring the aftermath of natural calamities, etc.
Social Media Data
Sometimes datasets are released by the social media platform itself. Yelp, a popular crowd-sourced review platform for local businesses, has released datasets that have been used for research on a wide range of topics:
from automatic photo classification to natural language processing of review texts,
from sentiment analysis to graph mining, etc.
Multimodal Data
More and more devices are getting connected to the Internet, creating the emerging trend of an Internet of Things (IoT) world.
These devices generate and use much data, but not all of it is of "traditional" types (numbers, text).
When dealing with such contexts, we may need to collect and explore multimodal (different forms) and multimedia (different media) data such as images, music and other sounds, gestures, body posture, and the use of space.
Such data can still be categorized into the two types introduced earlier: structured data and unstructured data.
Multimodal Data
One example of this kind of application is a multimodal face dataset with brain imaging data sequences, which contains output from different sensors such as EEG, MEG, and fMRI (medical imaging techniques).
Data Storage and Presentation
CSV (Comma-Separated Values) format is the most common import and export format for spreadsheets and databases. There is no "CSV standard," so the format is operationally defined by the many applications that read and write it. An advantage of the CSV format is that it is more generic and useful when sharing with almost anyone.
Depression.csv
treat,before,after,diff
No Treatment,13,16,3
No Treatment,10,18,8
No Treatment,16,16,0
Placebo,16,13,-3
Placebo,14,12,-2
Placebo,19,12,-7
Seroxat (Paxil),17,15,-2
Seroxat (Paxil),14,19,5
Seroxat (Paxil),20,14,-6
Effexor,17,19,2
Effexor,20,12,-8
Effexor,13,10,-3
The first row mentions the variable names. The remaining rows each individually represent one data
point.
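As a sketch of how such a file is consumed programmatically (shown here with Python's standard library; in the R used elsewhere in this course, read.csv plays the same role), the header row supplies the labels for every data point:

```python
import csv
import io

# The same layout as Depression.csv above (three sample rows only)
raw = """treat,before,after,diff
No Treatment,13,16,3
Placebo,16,13,-3
Effexor,17,19,2
"""

# csv.DictReader takes the first row as the variable names, so each
# remaining row becomes one labeled data point
rows = list(csv.DictReader(io.StringIO(raw)))
print(rows[0]["treat"])      # -> No Treatment
print(int(rows[1]["diff"]))  # numeric fields arrive as strings; convert as needed
```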
TSV (Tab-Separated Values)
Files are used for raw data and can be imported into and exported from
spreadsheet software.
Tab-separated values files are essentially text files, and the raw data can
be viewed by text editors
Name<TAB>Age<TAB>Address
Ryan<TAB>33<TAB>1115 W Franklin
Paul<TAB>25<TAB>Big Farm Way
Jim<TAB>45<TAB>W Main St
Samantha<TAB>32<TAB>28 George St
where <TAB> denotes a TAB character.
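The same standard-library reader handles this layout once the delimiter is set to a tab (a Python sketch using the names and addresses from the example above):

```python
import csv
import io

# The example table above, with literal TAB characters between fields
raw = "Name\tAge\tAddress\nRyan\t33\t1115 W Franklin\nPaul\t25\tBig Farm Way\n"

rows = list(csv.DictReader(io.StringIO(raw), delimiter="\t"))
print(rows[0]["Address"])  # -> 1115 W Franklin
```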
XML (eXtensible Markup Language)
XML (eXtensible Markup Language) was designed to be both human- and machine-readable, and can thus be used to store and transport data. Since XML data is stored in plain text format, it provides a software- and hardware-independent way of storing data. Common uses include XML databases and converting existing data from relational and object-based storage to an XML model that can be shared with business partners.
XML is becoming quite important as we deal with multiple devices, platforms, and services relying on the same data.
<?xml version="1.0" encoding="UTF-8"?>
<bookstore>
  <book category="information science" cover="hardcover">
    <title lang="en">Social Information Seeking</title>
    <author>Chirag Shah</author>
    <year>2017</year>
    <price>62.58</price>
  </book>
</bookstore>
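Because XML is plain text, the bookstore document above can be parsed with any standard XML library; a minimal Python sketch:

```python
import xml.etree.ElementTree as ET

# The bookstore document from above, embedded as a string
doc = """<?xml version="1.0" encoding="UTF-8"?>
<bookstore>
  <book category="information science" cover="hardcover">
    <title lang="en">Social Information Seeking</title>
    <author>Chirag Shah</author>
    <year>2017</year>
    <price>62.58</price>
  </book>
</bookstore>"""

root = ET.fromstring(doc)      # parse the text into an element tree
book = root.find("book")       # the first <book> element
print(book.get("category"))    # attribute -> information science
print(book.findtext("title"))  # element text -> Social Information Seeking
```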
RSS (Really Simple Syndication)
RSS (Really Simple Syndication) is a format, defined using XML, that is used to share data between services. It facilitates the delivery of information from various sources on the Web. The format of RSS follows standard XML usage but, in addition, defines the names of specific tags.
Since RSS data is small and fast loading, it can easily be used with services such as mobile phones, personal digital assistants (PDAs), and smart watches.
RSS is useful for websites that are updated frequently, such as:
News sites – lists news with title, date, and descriptions.
Companies – lists news and new products.
Calendars – lists upcoming events and important days.
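Since RSS is XML with specific tag names (channel, item, title, etc.), the same parsing approach applies. The feed below is invented purely for illustration:

```python
import xml.etree.ElementTree as ET

# A minimal, made-up RSS 2.0 feed for a frequently updated news site
feed = """<rss version="2.0">
  <channel>
    <title>Example News</title>
    <item><title>Headline one</title><pubDate>Mon, 01 Jan 2024</pubDate></item>
    <item><title>Headline two</title><pubDate>Tue, 02 Jan 2024</pubDate></item>
  </channel>
</rss>"""

channel = ET.fromstring(feed).find("channel")
titles = [item.findtext("title") for item in channel.findall("item")]
print(titles)  # -> ['Headline one', 'Headline two']
```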
JSON (JavaScript Object Notation)
JSON (JavaScript Object Notation) is a lightweight data-interchange format. It is not only easy for humans to read and write, but also easy for machines to parse and generate.
JSON is built on two structures:
• A collection of name–value pairs. In various languages, this is realized as an object, record, structure, dictionary, hash table, keyed list, or associative array.
• An ordered list of values. In most languages, this is realized as an array, vector, list, or sequence.
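The two structures map directly onto most languages' built-in types; a Python sketch of a round trip through the standard json module (the record itself is invented):

```python
import json

# A name-value collection (object) that contains an ordered list (array)
record = {"name": "Ryan", "age": 33, "phones": ["555-0100", "555-0199"]}

text = json.dumps(record)     # serialize: native object -> JSON text
restored = json.loads(text)   # parse: JSON text -> native object
print(restored["phones"][0])  # -> 555-0100
```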
JSON (JavaScript Object Notation)
When exchanging data between a browser and a server, the data can be sent only as text. JSON is text, and we can convert any JavaScript object into JSON and send JSON to the server. We can also convert any JSON received from the server into JavaScript objects. This way we can work with the data as JavaScript objects, with no complicated parsing and translations.
Data Pre-processing
Data in the real world is often dirty; that is, it needs to be cleaned up before it can be used for a desired purpose.
Factors that indicate that data is not clean or ready to process:
Incomplete. Some attribute values are lacking, certain attributes of interest are lacking, or attributes contain only aggregate data.
Noisy. Data contains errors or outliers. For example, some of the data points in a dataset may contain extreme values that can severely affect the dataset's range.
Inconsistent. Data contains discrepancies in codes or names. For example, if the "Name" column for registration records of employees contains values other than alphabetical letters, or if records do not start with a capital letter, discrepancies are present.
Data Cleaning
Data may be "cleaned," or better organized, or scrubbed of potentially incorrect, incomplete, or duplicated information.
Data Munging
Sometimes data is not in a format that is easy to work with; it may be stored or presented in a way that is hard to process. Thus, we need to convert it to something more suitable for a computer to understand. The approaches to take are all about manipulating or wrangling (or munging) the data to turn it into something that is more convenient or desirable. This can be done manually, automatically, or, in many cases, semi-automatically.
Consider the following text recipe: "Add two diced tomatoes, three cloves of garlic, and a pinch of salt in the mix." This can be turned into a table with labeled columns such as ingredient and quantity.
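A sketch of how the recipe sentence could be munged into such a table automatically (Python; the regular expression and the word-to-number map are hand-written for this one sentence and purely illustrative):

```python
import re

recipe = ("Add two diced tomatoes, three cloves of garlic, "
          "and a pinch of salt in the mix.")

# Hypothetical word-to-quantity map covering only this sentence
quantities = {"two": 2, "three": 3, "a pinch of": "pinch"}

# Hand-crafted pattern: a quantity phrase, optional filler words,
# then the ingredient word
rows = []
for qty, ingredient in re.findall(
        r"(two|three|a pinch of)\s+(?:diced\s+)?(?:cloves of\s+)?(\w+)", recipe):
    rows.append((quantities[qty], ingredient))

print(rows)  # -> [(2, 'tomatoes'), (3, 'garlic'), ('pinch', 'salt')]
```

Real munging rules would be far broader; the point is only that unstructured text becomes labeled (quantity, ingredient) pairs.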
Handling Missing Data
Sometimes data may be in the right format, but some of the values are
missing.
Consider a table containing customer data in which some of the home
phone numbers are absent.
This could be due to the fact that some people do not have home phones
– instead they use their mobile phones as their primary or only phone.
Other times data may be missing due to problems with the process of
collecting data, or an equipment malfunction.
Handling Missing Data
Sometimes data collection was limited to a certain city or region, and so the area code for a phone number was not necessary to collect. Some data may also get lost due to system or human error while storing or transferring the data.
Strategies to combat missing data include ignoring that record, using a global constant to fill in all missing values, imputation, and inference-based solutions.
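Two of those strategies can be sketched in a few lines (Python; the ages are invented):

```python
from statistics import mean

# Customer ages; None marks a missing value
ages = [34, None, 29, None, 41]

# Strategy 1: ignore records with a missing value
complete = [a for a in ages if a is not None]

# Strategy 2: mean imputation -- fill each gap with the mean of observed values
fill = mean(complete)
imputed = [a if a is not None else fill for a in ages]
print(imputed)
```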
Smooth Noisy Data
There are times when the data is not missing, but is corrupted for some reason. Data corruption may be a result of faulty data collection instruments, data entry problems, or technology limitations.
For example, suppose a digital thermometer measures temperature to one decimal point, but the storage system ignores the decimal points. A 99.4°F temperature means you are fine, and 99.8°F means you have a fever; if our storage system represents both of them as 99°F, then it fails to differentiate between healthy and sick persons!
To smooth noisy data, you should first identify or remove outliers.
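One common heuristic for identifying outliers is to flag values far from the mean; the two-standard-deviation cutoff below is an arbitrary illustration choice (Python; the temperatures are invented):

```python
from statistics import mean, stdev

# Body temperatures in Fahrenheit; 104.9 is a suspicious extreme value
temps = [98.6, 98.4, 99.1, 98.7, 104.9, 98.5]

m, s = mean(temps), stdev(temps)
# Flag anything more than two (sample) standard deviations from the mean
outliers = [t for t in temps if abs(t - m) > 2 * s]
print(outliers)  # -> [104.9]
```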
Data Integration
To be as efficient and effective for various data analyses as possible, data from various sources commonly needs to be integrated. This involves several steps:
Combine data from multiple sources into a coherent storage place (e.g., a single file or a database).
Engage in schema integration, or the combining of metadata from different sources.
Detect and resolve data value conflicts. Reasons for such conflicts could be different representations of the same value.
Address redundant data in data integration. The same attribute may have different names in different databases.
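A toy sketch of schema integration and joining (Python; the two sources, their field names, and all values are invented). The same customer key is named cust_id in one source and id in the other:

```python
# Source 1: a hypothetical CRM export
crm = [{"cust_id": 1, "name": "Ryan"}, {"cust_id": 2, "name": "Paul"}]
# Source 2: a hypothetical billing system, where the key is named "id"
billing = [{"id": 1, "balance": 120.0}, {"id": 2, "balance": 80.0}]

# Resolve the naming conflict by indexing on the shared key, then join
by_id = {r["id"]: r for r in billing}
merged = [{"cust_id": c["cust_id"],
           "name": c["name"],
           "balance": by_id[c["cust_id"]]["balance"]} for c in crm]
print(merged[0])  # -> {'cust_id': 1, 'name': 'Ryan', 'balance': 120.0}
```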
Data Transformation
Data must be transformed so it is consistent and readable (by a system). Common transformations include:
Smoothing: remove noise from data.
Aggregation: summarization, data cube construction.
Generalization: concept hierarchy climbing.
Normalization: scale values to fall within a small, specified range.
  Min–max normalization.
  Z-score normalization.
  Normalization by decimal scaling.
Attribute or feature construction: new attributes constructed from the given ones.
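Min–max and z-score normalization can be sketched directly from their definitions (Python; the income values are invented):

```python
from statistics import mean, stdev

incomes = [30000, 45000, 60000, 90000]

# Min-max normalization: rescale into [0, 1]
lo, hi = min(incomes), max(incomes)
minmax = [(x - lo) / (hi - lo) for x in incomes]
print(minmax)  # -> [0.0, 0.25, 0.5, 1.0]

# Z-score normalization: zero mean, unit (sample) standard deviation
m, s = mean(incomes), stdev(incomes)
zscores = [(x - m) / s for x in incomes]
```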
Data Reduction
Data reduction is a key process in which a reduced representation of a dataset that produces the same or similar analytical results is obtained.
Data Cube Aggregation. The lowest level of a data cube is the aggregated data for an individual entity of interest. A data cube could be in two, three, or a higher dimension, and each dimension typically represents an attribute of interest. Now, consider that you are trying to make a decision using this multidimensional data. We can reduce the data to a more meaningful size and structure for the task at hand.
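Aggregating a tiny two-dimensional cube by collapsing one dimension can be sketched as follows (Python; the sales figures are invented):

```python
from collections import defaultdict

# Records with two dimensions (year, city) plus a measure (amount)
sales = [("2023", "Delhi", 10), ("2023", "Mumbai", 15),
         ("2024", "Delhi", 12), ("2024", "Mumbai", 18)]

# Aggregate away the city dimension: keep totals per year only
per_year = defaultdict(int)
for year, city, amount in sales:
    per_year[year] += amount
print(dict(per_year))  # -> {'2023': 25, '2024': 30}
```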
Data Reduction
While data cube aggregation reduces data with the task in mind, the dimensionality reduction method works with respect to the nature of the data. Here, a dimension or a column in your data spreadsheet is referred to as a "feature," and the goal of the process is to identify which features to remove or collapse into a combined feature. Strategies for reduction include sampling, clustering, principal component analysis, etc.
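Of those strategies, sampling is the simplest to sketch (Python; the 5% rate and the seed are arbitrary choices):

```python
import random

random.seed(42)  # fixed seed so the sketch is repeatable

data = list(range(1000))            # a "large" dataset of 1000 values
sample = random.sample(data, k=50)  # keep a 5% random subset, without replacement
print(len(sample))  # -> 50
```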
Data Discretization
When dealing with data collected from continuous processes, we need to convert the continuous values into more manageable parts. This mapping is called discretization.
Discretization essentially reduces the data; thus, this process could also be perceived as a means of data reduction, but it holds particular importance for numerical data.
Data Discretization
There are three types of attributes involved in discretization:
a. Nominal: values from an unordered set.
b. Ordinal: values from an ordered set.
c. Continuous: real numbers.
To achieve discretization, divide the range of continuous attributes into intervals. For instance, we could decide to split the range of temperature values into cold, moderate, and hot.
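The temperature example can be sketched as an interval mapping (Python; the cut points 50°F and 80°F are arbitrary illustration choices):

```python
def discretize(temp_f):
    """Map a continuous temperature (Fahrenheit) onto cold / moderate / hot."""
    if temp_f < 50:
        return "cold"
    elif temp_f < 80:
        return "moderate"
    return "hot"

readings = [32.5, 68.2, 95.1]
print([discretize(t) for t in readings])  # -> ['cold', 'moderate', 'hot']
```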
Thank You