Unit - IV Part-1

The document discusses the nature of data and data pre-processing, focusing on structured and unstructured data types, challenges, and various data collection methods. It outlines the importance of data cleaning, integration, transformation, reduction, and discretization in preparing data for analysis. Additionally, it highlights the significance of open data and social media data as valuable sources for research and analysis.

Data Science with R

Unit IV: Nature of Data, Data Pre-processing

M. Narasimha Raju
Asst Professor, Dept. of Computer Science &
Engineering
Topics: Nature of Data & Pre-processing

Nature of Data:
 Introduction
 Data Types
  Structured Data
  Unstructured Data
  Challenges with Unstructured Data
 Data Collections
  Open Data
  Social Media Data
  Multimodal Data
 Data Storage and Presentation

Data Pre-processing:
 Data Cleaning
 Data Integration
 Data Transformation
 Data Reduction
 Data Discretization
Data Types

Structured data refers to highly organized information that can be
seamlessly included in a database and readily searched via simple
search operations.

Unstructured data is essentially the opposite, devoid of any
underlying structure.

In structured data, different values – whether they are numbers or
something else – are labeled, which is not the case when it comes to
unstructured data.
Structured data

The data has defined fields or labels. This data includes
 numbers (age, income, num.vehicles),
 text (housing.type),
 Boolean type (is.employed), and
 categorical data (sex, marital.stat).

What matters for us is that any data we see here – whether it is a
number, a category, or a text – is labeled.
Unstructured data

Unstructured data is data without labels. For example:

"It was found that a female with a height between 65 inches and 67
inches had an IQ of 125–130."

Here we have several data points: 65, 67, 125–130, female. However,
they are not clearly labeled, so we would not be able to process them
easily. And certainly, if we were to create a systematic process (an
algorithm, a program) to go through such data or observations, we
would be in trouble, because that process would not be able to
identify which of these numbers corresponds to which quantity.

Humans, on the other hand, have no difficulty understanding a
paragraph like this that contains unstructured data.
Unstructured data

The lack of structure makes compiling and organizing unstructured
data a time- and energy-consuming task. It would be easy to derive
insights from unstructured data if it could be instantly transformed
into structured data.

Email is unstructured data. An individual may arrange their inbox in
such a way that it aligns with their organizational preferences, but
that does not mean the data is structured.

Spreadsheets, which are arranged in a relational database format and
can be quickly scanned for information, are considered structured
data.
Data Collections

Open data is the idea that some data should be freely available in
the public domain, usable by anyone as they wish, without
restrictions from copyright, patents, or other mechanisms of control.

Public.
 Agencies favor openness to the extent permitted by law and subject
 to privacy, confidentiality, security, or other valid restrictions.

Local and federal governments, non-government organizations (NGOs),
and academic communities all lead open data initiatives. For example,
you can visit data repositories produced by the US Government or the
City of Chicago.
Data Collections
Accessible.
Open data are made available in convenient, modifiable, and
open formats that can be retrieved, downloaded, indexed, and
searched.
Described.
Open data are described fully so that consumers of the data have
sufficient information to understand their strengths, weaknesses,
analytical limitations, and security requirements, as well as how
to process them documentation of data elements, data
dictionaries etc.

9
Data Collections

Reusable.
 Open data are made available under an open license.

Complete.
 Open data are published in primary forms.

Timely.
 Open data are released frequently enough to preserve their value.

Managed Post-Release.
 A point of contact assists with data use and responds to complaints
 about adherence to these open data requirements.
Social Media Data

Social media has become a gold mine for collecting data to analyze
for research or marketing purposes. This is facilitated by the
Application Programming Interfaces (APIs) that social media companies
provide to researchers and developers.

The Facebook Graph API, for example, lets one collect and use this
data to accomplish a variety of tasks: building new socially
impactful applications, research on human information behavior,
monitoring the aftermath of natural calamities, etc. A request to
this API is sketched below.
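A minimal sketch of such a request from R, assuming the httr package
and a valid access token; the API version in the URL, the fields
requested, and the FB_ACCESS_TOKEN environment variable are all
assumptions for illustration.

 library(httr)

 token <- Sys.getenv("FB_ACCESS_TOKEN")  # assumed to be set beforehand
 res <- GET("https://graph.facebook.com/v19.0/me",
            query = list(fields = "id,name", access_token = token))
 str(content(res, as = "parsed"))  # inspect the returned JSON as an R list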
Social Media Data

Datasets have often been released by the social media platform
itself. Yelp, a popular crowd-sourced review platform for local
businesses, has released datasets that have been used for research on
a wide range of topics –
 automatic photo classification and natural language processing of
 review texts,
 sentiment analysis and graph mining, etc.
Multimodal Data

More and more devices are getting connected to the Internet, creating
an emerging trend of the Internet of Things. These devices are
generating and using much data, but not all of it is of "traditional"
types (numbers, text).

When dealing with such contexts, we may need to collect and explore
multimodal (different forms) and multimedia (different media) data
such as images, music and other sounds, gestures, body posture, and
the use of space. Such data can still be categorized into the two
types discussed earlier: structured data and unstructured data.
Multimodal Data
Brain imaging data sequences dataset is kind of application is
a multimodal face dataset, which contains output from
different sensors such as EEG, MEG, and fMRI (medical
imaging techniques)

14
Data Storage and Presentation

CSV (Comma-Separated Values) format is the most common import and
export format for spreadsheets and databases. There is no "CSV
standard," so the format is operationally defined by the many
applications that read and write it. An advantage of the CSV format
is that it is generic and useful when sharing data with almost anyone.
Depression.csv

 treat,before,after,diff
 No Treatment,13,16,3
 No Treatment,10,18,8
 No Treatment,16,16,0
 Placebo,16,13,-3
 Placebo,14,12,-2
 Placebo,19,12,-7
 Seroxat (Paxil),17,15,-2
 Seroxat (Paxil),14,19,5
 Seroxat (Paxil),20,14,-6
 Effexor,17,19,2
 Effexor,20,12,-8
 Effexor,13,10,-3

The first row gives the variable names. Each remaining row represents
one data point.
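A minimal sketch of loading this file in R, assuming it is saved as
Depression.csv in the working directory:

 depression <- read.csv("Depression.csv")
 head(depression)       # columns: treat, before, after, diff
 mean(depression$diff)  # average before-to-after change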
TSV (Tab-Separated Values)

TSV files are used for raw data and can be imported into and exported
from spreadsheet software. Tab-separated values files are essentially
text files, and the raw data can be viewed in text editors.

 Name<TAB>Age<TAB>Address
 Ryan<TAB>33<TAB>1115 W Franklin
 Paul<TAB>25<TAB>Big Farm Way
 Jim<TAB>45<TAB>W Main St
 Samantha<TAB>32<TAB>28 George St

where <TAB> denotes a TAB character.
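A minimal sketch of reading such a file in R, assuming it is saved as
people.tsv (a hypothetical file name):

 people <- read.delim("people.tsv", sep = "\t",
                      stringsAsFactors = FALSE)
 str(people)  # Name and Address as character, Age as integer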
XML (eXtensible Markup Language)

XML (eXtensible Markup Language) was designed to be both human- and
machine-readable, and can thus be used to store and transport data.

Since XML data is stored in plain text format, it provides a
software- and hardware-independent way of storing data.

Organizations are building XML databases and converting existing data
from relational and object-based storage to an XML model that can be
shared with business partners.
XML is becoming quite important as we deal with multiple devices,
platforms, and services relying on the same data.

 <?xml version="1.0" encoding="UTF-8"?>
 <bookstore>
   <book category="information science" cover="hardcover">
     <title lang="en">Social Information Seeking</title>
     <author>Chirag Shah</author>
     <year>2017</year>
     <price>62.58</price>
   </book>
 </bookstore>
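A minimal sketch of parsing this document in R with the xml2 package
(an assumption; the file name bookstore.xml is hypothetical):

 library(xml2)

 doc  <- read_xml("bookstore.xml")
 book <- xml_find_first(doc, "//book")
 xml_attr(book, "category")                 # "information science"
 xml_text(xml_find_first(book, "./title"))  # "Social Information Seeking"
 as.numeric(xml_text(xml_find_first(book, "./price")))  # 62.58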
RSS (Really Simple Syndication)

RSS (Really Simple Syndication) is a format, built on XML, that is
used to share data between services. It facilitates the delivery of
information from various sources on the Web. The format of RSS
follows standard XML usage but in addition defines the names of
specific tags.

Since RSS data is small and fast loading, it can easily be used with
services such as mobile phones, personal digital assistants (PDAs),
and smart watches.

RSS is useful for websites that are updated frequently, such as:
 News sites – lists news with title, date, and descriptions.
 Companies – lists news and new products.
 Calendars – lists upcoming events and important days.
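A minimal, hypothetical RSS 2.0 feed illustrating the format (all
titles, links, and dates here are made up):

 <?xml version="1.0" encoding="UTF-8"?>
 <rss version="2.0">
   <channel>
     <title>Example News</title>
     <link>https://example.com/news</link>
     <description>Latest headlines from Example News</description>
     <item>
       <title>New Data Science Course Announced</title>
       <link>https://example.com/news/1</link>
       <pubDate>Mon, 01 Jan 2024 09:00:00 GMT</pubDate>
       <description>Details of the new course offering.</description>
     </item>
   </channel>
 </rss>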
JSON (JavaScript Object Notation)

JSON (JavaScript Object Notation) is a lightweight data-interchange
format. It is not only easy for humans to read and write, but also
easy for machines to parse and generate.

JSON is built on two structures:
• A collection of name–value pairs. In various languages, this is
  realized as an object, record, structure, dictionary, hash table,
  keyed list, or associative array.
• An ordered list of values. In most languages, this is realized as
  an array, vector, list, or sequence.
JSON (JavaScript Object Notation)

When exchanging data between a browser and a server, the data can be
sent only as text. JSON is text, and we can convert any JavaScript
object into JSON and send JSON to the server. We can also convert any
JSON received from the server into JavaScript objects. This way we
can work with the data as JavaScript objects, with no complicated
parsing and translations.
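A minimal, hypothetical JSON document illustrating both structures (an
object of name–value pairs containing an ordered array), and a sketch
of parsing it in R with the jsonlite package (an assumption):

 {
   "name": "Ryan",
   "age": 33,
   "languages": ["English", "Spanish"]
 }

 library(jsonlite)
 person <- fromJSON('{"name":"Ryan","age":33,"languages":["English","Spanish"]}')
 person$name       # "Ryan"
 person$languages  # character vector of length 2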
Data Pre-processing

Data in the real world is often dirty; that is, it needs to be
cleaned up before it can be used for a desired purpose. Several
factors indicate that data is not clean or ready to process:

 Incomplete. Some of the attribute values are lacking, certain
 attributes of interest are lacking, or attributes contain only
 aggregate data.

 Noisy. The data contains errors or outliers. For example, some of
 the data points in a dataset may contain extreme values that can
 severely affect the dataset's range.

 Inconsistent. The data contains discrepancies in codes or names.
 For example, if the "Name" column for registration records of
 employees contains values other than alphabetical letters, or if
 records do not start with a capital letter, discrepancies are
 present.
Data Pre-processing (overview figure)

Transformation & Reduction (overview figure)
Data Cleaning

Data may be "cleaned," or better organized, or scrubbed of
potentially incorrect, incomplete, or duplicated information.
Data Munging

Sometimes the data is not in a format that is easy to work with; it
may be stored or presented in a way that is hard to process. Thus, we
need to convert it to something more suitable for a computer to
understand. The approaches to take are all about manipulating or
wrangling (or munging) the data to turn it into something that is
more convenient or desirable. This can be done manually,
automatically, or, in many cases, semi-automatically.

Consider the following text recipe:

 "Add two diced tomatoes, three cloves of garlic, and a pinch of salt
 in the mix."

This can be turned into the table below:

 Ingredient  Quantity  Preparation
 tomato      2         diced
 garlic      3 cloves  -
 salt        1 pinch   -
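A minimal sketch of representing that table as an R data frame (the
column names are assumptions, chosen for illustration):

 recipe <- data.frame(
   ingredient  = c("tomato", "garlic", "salt"),
   quantity    = c("2", "3 cloves", "1 pinch"),
   preparation = c("diced", NA, NA)
 )
 recipe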
Handling Missing Data

Sometimes data may be in the right format, but some of the values are
missing. Consider a table containing customer data in which some of
the home phone numbers are absent. This could be due to the fact that
some people do not have home phones – instead they use their mobile
phones as their primary or only phone. Other times data may be
missing due to problems with the process of collecting data, or an
equipment malfunction.
Handling Missing Data

Perhaps the data collection was limited to a certain city or region,
and so the area code for a phone number was not considered necessary
to collect. Some data may also get lost due to system or human error
while storing or transferring the data.

Strategies to combat missing data include ignoring that record, using
a global constant to fill in all missing values, imputation, and
inference-based solutions, as sketched below.
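A minimal sketch of two of these strategies in R (the income vector is
made up for illustration):

 income <- c(52000, NA, 61000, 48000, NA)

 # Global constant: fill every missing value with 0.
 filled_const <- ifelse(is.na(income), 0, income)

 # Imputation: fill missing values with the mean of the observed ones.
 filled_mean <- ifelse(is.na(income), mean(income, na.rm = TRUE), income)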
Smooth Noisy Data

There are times when the data is not missing, but it is corrupted for
some reason. Data corruption may be a result of faulty data
collection instruments, data entry problems, or technology
limitations.

For example, suppose a digital thermometer measures temperature to
one decimal point, but the storage system ignores the decimal points.
A 99.4°F temperature means you are fine, and 99.8°F means you have a
fever; if our storage system represents both of them as 99°F, then it
fails to differentiate between healthy and sick persons!

To smooth noisy data, you should first identify or remove outliers.
Both steps are sketched below.
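A minimal sketch in R: flagging outliers with the standard 1.5 * IQR
rule, then smoothing by bin means (binning is one common smoothing
technique, named here as a stand-in; the temperature data are made up):

 temps <- c(98.6, 99.4, 99.8, 98.2, 104.9, 98.9)

 # Flag values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] as outliers.
 q <- quantile(temps, c(0.25, 0.75))
 spread  <- q[2] - q[1]
 outlier <- temps < q[1] - 1.5 * spread | temps > q[2] + 1.5 * spread

 # Smooth by bin means: split the ranked values into 3 bins and
 # replace each value with the mean of its bin.
 bins     <- cut(rank(temps, ties.method = "first"), 3, labels = FALSE)
 smoothed <- ave(temps, bins, FUN = mean)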
Data Integration

To be as efficient and effective for various data analyses as
possible, data from various sources commonly needs to be integrated.
This involves:

Combining data from multiple sources into a coherent storage place
(e.g., a single file or a database).
 This requires schema integration, or the combining of metadata from
 different sources.

Detecting and resolving data value conflicts.
 Reasons for such conflicts could be different representations of the
 same value in different sources.

Addressing redundant data.
 For example, the same attribute may have different names in
 different databases. A simple integration is sketched below.
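A minimal sketch of schema integration in R: two sources name the key
attribute differently, so we align the names before merging (all data
and names here are hypothetical):

 orders    <- data.frame(cust_id = c(1, 2), amount = c(250, 90))
 customers <- data.frame(customer_id = c(1, 2),
                         name = c("Ryan", "Paul"))

 # Resolve the naming conflict, then combine into one coherent table.
 names(customers)[names(customers) == "customer_id"] <- "cust_id"
 merge(orders, customers, by = "cust_id")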
Data Transformation

Data must be transformed so it is consistent and readable (by a
system). Common transformations include:

Smoothing: remove noise from data.
Aggregation: summarization, data cube construction.
Generalization: concept hierarchy climbing.
Normalization: scale values to fall within a small, specified range.
 Min–max normalization.
 Z-score normalization.
 Normalization by decimal scaling.
Attribute or feature construction: new attributes constructed from
the given ones.

The three normalization schemes are sketched below.
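A minimal sketch of the three normalization schemes in R (the vector x
is made up for illustration):

 x <- c(12000, 35000, 58000, 99000)

 # Min-max normalization: rescale to [0, 1].
 minmax <- (x - min(x)) / (max(x) - min(x))

 # Z-score normalization: center at 0 with unit standard deviation.
 zscore <- (x - mean(x)) / sd(x)

 # Decimal scaling: divide by the smallest power of 10 that brings
 # every absolute value below 1 (here, 10^5).
 decimal <- x / 10^ceiling(log10(max(abs(x))))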
Data Reduction

Data reduction is a key process in which a reduced representation of
a dataset that produces the same or similar analytical results is
obtained.

Data Cube Aggregation.
The lowest level of a data cube contains the aggregated data for an
individual entity of interest. A data cube could be in two, three, or
higher dimensions, and each dimension typically represents an
attribute of interest. Now, consider that you are trying to make a
decision using this multidimensional data. We can reduce the data to
a more meaningful size and structure for the task at hand, as
sketched below.
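A minimal sketch of rolling up one dimension of a small year-by-region
"cube" in R using aggregate() (the sales data are hypothetical):

 sales <- data.frame(
   year    = c(2022, 2022, 2023, 2023),
   region  = c("East", "West", "East", "West"),
   revenue = c(100, 120, 130, 150)
 )

 # Aggregate away the region dimension: totals per year.
 aggregate(revenue ~ year, data = sales, FUN = sum)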
Data Reduction

Whereas data cube aggregation reduces the data with the task in mind,
the dimensionality reduction method works with respect to the nature
of the data. Here, a dimension or a column in your data spreadsheet
is referred to as a "feature," and the goal of the process is to
identify which features to remove or collapse into a combined
feature. Strategies for reduction include sampling, clustering,
principal component analysis, etc., as sketched below.
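A minimal sketch of one such strategy, principal component analysis,
in R using the built-in iris data:

 # Compute principal components on the four numeric columns.
 pca <- prcomp(iris[, 1:4], scale. = TRUE)
 summary(pca)             # proportion of variance per component

 # Keep only the first two components as a reduced feature set.
 reduced <- pca$x[, 1:2]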
Data Discretization

When dealing with data collected from processes that are continuous,
we need to convert the continuous values into more manageable parts.
This mapping is called discretization. Discretization essentially
reduces the data, so it could also be perceived as a means of data
reduction, but it holds particular importance for numerical data.
Data Discretization

There are three types of attributes involved in discretization:
a. Nominal: values from an unordered set
b. Ordinal: values from an ordered set
c. Continuous: real numbers

To achieve discretization, divide the range of continuous attributes
into intervals. For instance, we could decide to split the range of
temperature values into cold, moderate, and hot, as sketched below.
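A minimal sketch of that split in R using cut() (the temperatures and
cut points are assumptions for illustration):

 temperature <- c(4, 18, 27, 35, 12)
 cut(temperature,
     breaks = c(-Inf, 10, 25, Inf),
     labels = c("cold", "moderate", "hot"))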
Data Science with R
Unit IV: Nature of Data, Data Pre-processing

Thank
You
M. Narasimha Raju
Asst Professor, Dept. of Computer Science &
Engineering
