Data Science with R
Unit IV: Nature of Data, Data Pre-processing
M. Narasimha Raju
Asst Professor, Dept. of Computer Science &
Engineering
Topics: Nature of Data & Pre-processing:
Introduction
Data Types
  Structured Data
  Unstructured Data
  Challenges with Unstructured Data
Data Collections
  Open Data
  Social Media Data
  Multimodal Data
Data Storage and Presentation
Data Pre-processing
  Data Cleaning
  Data Integration
  Data Transformation
  Data Reduction
  Data Discretization
Data Types
Structured data refers to highly organized information that can be seamlessly included in a database and readily searched via simple search operations.
Unstructured data is essentially the opposite: devoid of any underlying structure.
In structured data, different values – whether they are numbers or something else – are labeled, which is not the case with unstructured data.
Structured data
The data has defined fields or labels. This data includes:
numbers (age, income, num.vehicles),
text (housing.type),
Boolean type (is.employed), and
categorical data (sex, marital.stat).
What matters for us is that any data we see here – whether it is a number, a category, or a text – is labeled.
Unstructured data
Unstructured data is data without labels. Consider this sentence:
"It was found that a female with a height between 65 inches and 67 inches had an IQ of 125–130."
Here we have several data points: 65, 67, 125–130, female. However, they are not clearly labeled, so we could not easily process them. And certainly, if we were to create a systematic process (an algorithm, a program) to go through such data or observations, we would be in trouble, because that process would not be able to identify which of these numbers corresponds to which quantity.
Humans, by contrast, have no difficulty understanding a paragraph like this that contains unstructured data.
Unstructured data
The lack of structure makes compiling and organizing unstructured data a time- and energy-consuming task. It would be easy to derive insights from unstructured data if it could be instantly transformed into structured data.
Email is unstructured data. An individual may arrange their inbox in such a way that it aligns with their organizational preferences, but that does not mean the data is structured.
Spreadsheets, which are arranged in a relational database format and can be quickly scanned for information, are considered structured data.
Data Collections
Open data is the idea that some data should be freely available in a public domain, usable by anyone as they wish, without restrictions from copyright, patents, or other mechanisms of control.
Public. Favor openness to the extent permitted by law and subject to privacy, confidentiality, security, or other valid restrictions.
Local and federal governments, non-government organizations (NGOs), and academic communities all lead open data initiatives. For example, you can visit data repositories produced by the US Government or the City of Chicago.
Data Collections
Accessible. Open data are made available in convenient, modifiable, and open formats that can be retrieved, downloaded, indexed, and searched.
Described. Open data are described fully so that consumers of the data have sufficient information to understand their strengths, weaknesses, analytical limitations, and security requirements, as well as how to process them (e.g., documentation of data elements, data dictionaries, etc.).
Data Collections
Reusable. Open data are made available under an open license.
Complete. Open data are published in primary forms.
Timely. Open data are released frequently, to preserve the value of the data.
Managed Post-Release. Support is provided to assist with data use and to respond to complaints about adherence to these open data requirements.
Social Media Data
Social media has become a gold mine for collecting data to analyze for research or marketing purposes. This is facilitated by the Application Programming Interfaces (APIs) that social media companies provide to researchers and developers, such as the Facebook Graph API.
Researchers and developers collect and use this data to accomplish a variety of tasks: new socially impactful applications, research on human information behavior, monitoring the aftermath of natural calamities, etc.
Social Media Data
Sometimes datasets are released by the social media platform itself. Yelp, a popular crowd-sourced review platform for local businesses, has released datasets that have been used for research on a wide range of topics:
from automatic photo classification to natural language processing of review texts,
from sentiment analysis to graph mining, etc.
Multimodal Data
More and more devices are getting connected to the Internet, creating the emerging trend of an Internet of Things (IoT) world.
These devices generate and use much data, but not all of it is of "traditional" types (numbers, text).
When dealing with such contexts, we may need to collect and explore multimodal (different forms) and multimedia (different media) data such as images, music and other sounds, gestures, body posture, and the use of space.
Such data can still be categorized into the two types introduced earlier: structured data and unstructured data.
Multimodal Data
One example of this kind of application is a multimodal face dataset with brain imaging data sequences, which contains output from different sensors such as EEG, MEG, and fMRI (medical imaging techniques).
Data Storage and Presentation
CSV (Comma-Separated Values) format is the most common import and export format for spreadsheets and databases. There is no "CSV standard," so the format is operationally defined by the many applications that read and write it. An advantage of the CSV format is that it is more generic and useful when sharing with almost anyone.
Depression.csv
treat,before,after,diff
No Treatment,13,16,3
No Treatment,10,18,8
No Treatment,16,16,0
Placebo,16,13,-3
Placebo,14,12,-2
Placebo,19,12,-7
Seroxat (Paxil),17,15,-2
Seroxat (Paxil),14,19,5
Seroxat (Paxil),20,14,-6
Effexor,17,19,2
Effexor,20,12,-8
Effexor,13,10,-3
The first row mentions the variable names. The remaining rows each individually represent one data
point.
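As a sketch of how such a file is consumed programmatically (shown here with Python's standard library; in the R used elsewhere in this course, read.csv plays the same role), the header row supplies the labels for every data point:

```python
import csv
import io

# The same layout as Depression.csv above (three sample rows only)
raw = """treat,before,after,diff
No Treatment,13,16,3
Placebo,16,13,-3
Effexor,17,19,2
"""

# csv.DictReader takes the first row as the variable names, so each
# remaining row becomes one labeled data point
rows = list(csv.DictReader(io.StringIO(raw)))
print(rows[0]["treat"])      # -> No Treatment
print(int(rows[1]["diff"]))  # numeric fields arrive as strings; convert as needed
```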
TSV (Tab-Separated Values)
Files are used for raw data and can be imported into and exported from
spreadsheet software.
Tab-separated values files are essentially text files, and the raw data can
be viewed by text editors
Name<TAB>Age<TAB>Address
Ryan<TAB>33<TAB>1115 W Franklin
Paul<TAB>25<TAB>Big Farm Way
Jim<TAB>45<TAB>W Main St
Samantha<TAB>32<TAB>28 George St
where <TAB> denotes a TAB character.
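The same standard-library reader handles this layout once the delimiter is set to a tab (a Python sketch using the names and addresses from the example above):

```python
import csv
import io

# The example table above, with literal TAB characters between fields
raw = "Name\tAge\tAddress\nRyan\t33\t1115 W Franklin\nPaul\t25\tBig Farm Way\n"

rows = list(csv.DictReader(io.StringIO(raw), delimiter="\t"))
print(rows[0]["Address"])  # -> 1115 W Franklin
```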
XML (eXtensible Markup Language)
XML (eXtensible Markup Language) was designed to be both human- and machine-readable, and can thus be used to store and transport data. Since XML data is stored in plain text format, it provides a software- and hardware-independent way of storing data. Common uses include XML databases and converting existing data from relational and object-based storage to an XML model that can be shared with business partners.
XML is becoming quite important as we deal with multiple devices, platforms, and services relying on the same data.
<?xml version="1.0" encoding="UTF-8"?>
<bookstore>
  <book category="information science" cover="hardcover">
    <title lang="en">Social Information Seeking</title>
    <author>Chirag Shah</author>
    <year>2017</year>
    <price>62.58</price>
  </book>
</bookstore>
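Because XML is plain text, the bookstore document above can be parsed with any standard XML library; a minimal Python sketch:

```python
import xml.etree.ElementTree as ET

# The bookstore document from above, embedded as a string
doc = """<?xml version="1.0" encoding="UTF-8"?>
<bookstore>
  <book category="information science" cover="hardcover">
    <title lang="en">Social Information Seeking</title>
    <author>Chirag Shah</author>
    <year>2017</year>
    <price>62.58</price>
  </book>
</bookstore>"""

root = ET.fromstring(doc)      # parse the text into an element tree
book = root.find("book")       # the first <book> element
print(book.get("category"))    # attribute -> information science
print(book.findtext("title"))  # element text -> Social Information Seeking
```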
RSS (Really Simple Syndication)
RSS (Really Simple Syndication) is a format, defined using XML, that is used to share data between services. It facilitates the delivery of information from various sources on the Web. The format of RSS follows standard XML usage but, in addition, defines the names of specific tags.
Since RSS data is small and fast loading, it can easily be used with services such as mobile phones, personal digital assistants (PDAs), and smart watches.
RSS is useful for websites that are updated frequently, such as:
News sites – lists news with title, date, and descriptions.
Companies – lists news and new products.
Calendars – lists upcoming events and important days.
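Since RSS is XML with specific tag names (channel, item, title, etc.), the same parsing approach applies. The feed below is invented purely for illustration:

```python
import xml.etree.ElementTree as ET

# A minimal, made-up RSS 2.0 feed for a frequently updated news site
feed = """<rss version="2.0">
  <channel>
    <title>Example News</title>
    <item><title>Headline one</title><pubDate>Mon, 01 Jan 2024</pubDate></item>
    <item><title>Headline two</title><pubDate>Tue, 02 Jan 2024</pubDate></item>
  </channel>
</rss>"""

channel = ET.fromstring(feed).find("channel")
titles = [item.findtext("title") for item in channel.findall("item")]
print(titles)  # -> ['Headline one', 'Headline two']
```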
JSON (JavaScript Object Notation)
JSON (JavaScript Object Notation) is a lightweight data-interchange format. It is not only easy for humans to read and write, but also easy for machines to parse and generate.
JSON is built on two structures:
• A collection of name–value pairs. In various languages, this is realized as an object, record, structure, dictionary, hash table, keyed list, or associative array.
• An ordered list of values. In most languages, this is realized as an array, vector, list, or sequence.
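The two structures map directly onto most languages' built-in types; a Python sketch of a round trip through the standard json module (the record itself is invented):

```python
import json

# A name-value collection (object) that contains an ordered list (array)
record = {"name": "Ryan", "age": 33, "phones": ["555-0100", "555-0199"]}

text = json.dumps(record)     # serialize: native object -> JSON text
restored = json.loads(text)   # parse: JSON text -> native object
print(restored["phones"][0])  # -> 555-0100
```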
JSON (JavaScript Object Notation)
When exchanging data between a browser and a server, the data can be sent only as text. JSON is text, and we can convert any JavaScript object into JSON and send JSON to the server. We can also convert any JSON received from the server into JavaScript objects. This way we can work with the data as JavaScript objects, with no complicated parsing and translations.
Data Pre-processing
Data in the real world is often dirty; that is, it needs to be cleaned up before it can be used for a desired purpose.
Factors that indicate that data is not clean or ready to process:
Incomplete. Some attribute values are lacking, certain attributes of interest are lacking, or attributes contain only aggregate data.
Noisy. Data contains errors or outliers. For example, some of the data points in a dataset may contain extreme values that can severely affect the dataset's range.
Inconsistent. Data contains discrepancies in codes or names. For example, if the "Name" column for registration records of employees contains values other than alphabetical letters, or if records do not start with a capital letter, discrepancies are present.
Data Cleaning
Data may be "cleaned," or better organized, or scrubbed of potentially incorrect, incomplete, or duplicated information.
Data Munging
Sometimes data is not in a format that is easy to work with; it may be stored or presented in a way that is hard to process. Thus, we need to convert it to something more suitable for a computer to understand. The approaches to take are all about manipulating or wrangling (or munging) the data to turn it into something that is more convenient or desirable. This can be done manually, automatically, or, in many cases, semi-automatically.
Consider the following text recipe: "Add two diced tomatoes, three cloves of garlic, and a pinch of salt in the mix." This can be turned into a table with labeled columns such as ingredient and quantity.
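A sketch of how the recipe sentence could be munged into such a table automatically (Python; the regular expression and the word-to-number map are hand-written for this one sentence and purely illustrative):

```python
import re

recipe = ("Add two diced tomatoes, three cloves of garlic, "
          "and a pinch of salt in the mix.")

# Hypothetical word-to-quantity map covering only this sentence
quantities = {"two": 2, "three": 3, "a pinch of": "pinch"}

# Hand-crafted pattern: a quantity phrase, optional filler words,
# then the ingredient word
rows = []
for qty, ingredient in re.findall(
        r"(two|three|a pinch of)\s+(?:diced\s+)?(?:cloves of\s+)?(\w+)", recipe):
    rows.append((quantities[qty], ingredient))

print(rows)  # -> [(2, 'tomatoes'), (3, 'garlic'), ('pinch', 'salt')]
```

Real munging rules would be far broader; the point is only that unstructured text becomes labeled (quantity, ingredient) pairs.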
Handling Missing Data
Sometimes data may be in the right format, but some of the values are
missing.
Consider a table containing customer data in which some of the home
phone numbers are absent.
This could be due to the fact that some people do not have home phones
– instead they use their mobile phones as their primary or only phone.
Other times data may be missing due to problems with the process of
collecting data, or an equipment malfunction.
Handling Missing Data
Sometimes data collection was limited to a certain city or region, and so the area code for a phone number was not necessary to collect. Some data may also get lost due to system or human error while storing or transferring the data.
Strategies to combat missing data include ignoring that record, using a global constant to fill in all missing values, imputation, and inference-based solutions.
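Two of those strategies can be sketched in a few lines (Python; the ages are invented):

```python
from statistics import mean

# Customer ages; None marks a missing value
ages = [34, None, 29, None, 41]

# Strategy 1: ignore records with a missing value
complete = [a for a in ages if a is not None]

# Strategy 2: mean imputation -- fill each gap with the mean of observed values
fill = mean(complete)
imputed = [a if a is not None else fill for a in ages]
print(imputed)
```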
Smooth Noisy Data
There are times when the data is not missing, but is corrupted for some reason. Data corruption may be a result of faulty data collection instruments, data entry problems, or technology limitations.
For example, suppose a digital thermometer measures temperature to one decimal point, but the storage system ignores the decimal points. A 99.4°F temperature means you are fine, and 99.8°F means you have a fever; if our storage system represents both of them as 99°F, then it fails to differentiate between healthy and sick persons!
To smooth noisy data, you should first identify or remove outliers.
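One common heuristic for identifying outliers is to flag values far from the mean; the two-standard-deviation cutoff below is an arbitrary illustration choice (Python; the temperatures are invented):

```python
from statistics import mean, stdev

# Body temperatures in Fahrenheit; 104.9 is a suspicious extreme value
temps = [98.6, 98.4, 99.1, 98.7, 104.9, 98.5]

m, s = mean(temps), stdev(temps)
# Flag anything more than two (sample) standard deviations from the mean
outliers = [t for t in temps if abs(t - m) > 2 * s]
print(outliers)  # -> [104.9]
```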
Data Integration
To be as efficient and effective for various data analyses as possible, data from various sources commonly needs to be integrated. This involves several steps:
Combine data from multiple sources into a coherent storage place (e.g., a single file or a database).
Engage in schema integration, or the combining of metadata from different sources.
Detect and resolve data value conflicts. Reasons for such conflicts could be different representations of the same value.
Address redundant data in data integration. The same attribute may have different names in different databases.
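A toy sketch of schema integration and joining (Python; the two sources, their field names, and all values are invented). The same customer key is named cust_id in one source and id in the other:

```python
# Source 1: a hypothetical CRM export
crm = [{"cust_id": 1, "name": "Ryan"}, {"cust_id": 2, "name": "Paul"}]
# Source 2: a hypothetical billing system, where the key is named "id"
billing = [{"id": 1, "balance": 120.0}, {"id": 2, "balance": 80.0}]

# Resolve the naming conflict by indexing on the shared key, then join
by_id = {r["id"]: r for r in billing}
merged = [{"cust_id": c["cust_id"],
           "name": c["name"],
           "balance": by_id[c["cust_id"]]["balance"]} for c in crm]
print(merged[0])  # -> {'cust_id': 1, 'name': 'Ryan', 'balance': 120.0}
```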
Data Transformation
Data must be transformed so it is consistent and readable (by a system). Common transformations include:
Smoothing: remove noise from data.
Aggregation: summarization, data cube construction.
Generalization: concept hierarchy climbing.
Normalization: scale values to fall within a small, specified range.
  Min–max normalization.
  Z-score normalization.
  Normalization by decimal scaling.
Attribute or feature construction: new attributes constructed from the given ones.
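Min–max and z-score normalization can be sketched directly from their definitions (Python; the income values are invented):

```python
from statistics import mean, stdev

incomes = [30000, 45000, 60000, 90000]

# Min-max normalization: rescale into [0, 1]
lo, hi = min(incomes), max(incomes)
minmax = [(x - lo) / (hi - lo) for x in incomes]
print(minmax)  # -> [0.0, 0.25, 0.5, 1.0]

# Z-score normalization: zero mean, unit (sample) standard deviation
m, s = mean(incomes), stdev(incomes)
zscores = [(x - m) / s for x in incomes]
```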
Data Reduction
Data reduction is a key process in which a reduced representation of a dataset that produces the same or similar analytical results is obtained.
Data Cube Aggregation. The lowest level of a data cube is the aggregated data for an individual entity of interest. A data cube could be in two, three, or a higher dimension, and each dimension typically represents an attribute of interest. Now, consider that you are trying to make a decision using this multidimensional data. We can reduce the data to a more meaningful size and structure for the task at hand.
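Aggregating a tiny two-dimensional cube by collapsing one dimension can be sketched as follows (Python; the sales figures are invented):

```python
from collections import defaultdict

# Records with two dimensions (year, city) plus a measure (amount)
sales = [("2023", "Delhi", 10), ("2023", "Mumbai", 15),
         ("2024", "Delhi", 12), ("2024", "Mumbai", 18)]

# Aggregate away the city dimension: keep totals per year only
per_year = defaultdict(int)
for year, city, amount in sales:
    per_year[year] += amount
print(dict(per_year))  # -> {'2023': 25, '2024': 30}
```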
Data Reduction
While data cube aggregation reduces data with the task in mind, the dimensionality reduction method works with respect to the nature of the data. Here, a dimension or a column in your data spreadsheet is referred to as a "feature," and the goal of the process is to identify which features to remove or collapse into a combined feature. Strategies for reduction include sampling, clustering, principal component analysis, etc.
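Of those strategies, sampling is the simplest to sketch (Python; the 5% rate and the seed are arbitrary choices):

```python
import random

random.seed(42)  # fixed seed so the sketch is repeatable

data = list(range(1000))            # a "large" dataset of 1000 values
sample = random.sample(data, k=50)  # keep a 5% random subset, without replacement
print(len(sample))  # -> 50
```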
Data Discretization
When dealing with data collected from continuous processes, we need to convert the continuous values into more manageable parts. This mapping is called discretization.
Discretization essentially reduces the data; thus, this process could also be perceived as a means of data reduction, but it holds particular importance for numerical data.
Data Discretization
There are three types of attributes involved in discretization:
a. Nominal: values from an unordered set.
b. Ordinal: values from an ordered set.
c. Continuous: real numbers.
To achieve discretization, divide the range of continuous attributes into intervals. For instance, we could decide to split the range of temperature values into cold, moderate, and hot.
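The temperature example can be sketched as an interval mapping (Python; the cut points 50°F and 80°F are arbitrary illustration choices):

```python
def discretize(temp_f):
    """Map a continuous temperature (Fahrenheit) onto cold / moderate / hot."""
    if temp_f < 50:
        return "cold"
    elif temp_f < 80:
        return "moderate"
    return "hot"

readings = [32.5, 68.2, 95.1]
print([discretize(t) for t in readings])  # -> ['cold', 'moderate', 'hot']
```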
Thank You