MCS 226

Basics of Data Science

UNIT 1 INTRODUCTION TO DATA SCIENCE


1.0 Introduction
1.1 Objectives
1.2 Data Science - Definition
1.3 Types of Data
1.3.1 Statistical Data Types
1.3.2 Sampling
1.4 Basic Methods of Data Analysis
1.4.1 Descriptive Analysis
1.4.2 Exploratory Analysis
1.4.3 Inferential Analysis
1.4.4 Predictive Analysis
1.5 Common Misconceptions of Data Analysis
1.6 Applications of Data Science
1.7 Data Science Life cycle
1.8 Summary
1.9 Solutions/Answers

1.0 INTRODUCTION
The Internet and communication technologies have grown tremendously in the past decade, leading to the generation of large amounts of unstructured data. This unstructured data includes unformatted textual, graphical, video and audio data, which is being generated as a result of people's use of social media and mobile technologies. In addition, with the tremendous growth in the digital ecosystems of organisations, a large amount of semi-structured data, like XML data, is also being generated at a high rate. All such data is in addition to the large amount of data that results from organisational databases and data warehouses. This data may be processed in real time to support the decision-making processes of various organisations. The discipline of data science focuses on the processes of collection, integration and processing of large amounts of data to produce useful information for informed decision making.

This unit introduces you to the basic concepts of data science. It provides an introduction to the different types of data used in data science and to the different types of analysis that can be performed on them. Further, the unit introduces some of the common misconceptions of data analysis.

1.1 OBJECTIVES
At the end of this unit, you should be able to:
• define the term data science in the context of an organisation;
• explain different types of data;
• list and explain different types of analysis that can be performed on data;
• explain the common misconceptions of data analysis;
• define the concept of data dredging;
• list some of the applications of data science; and
• define the life cycle of data science.

1.2 DATA SCIENCE-DEFINITION
Data Science is a multi-disciplinary science with an objective to perform data analysis
to generate knowledge that can be used for decision making. This knowledge can be
in the form of similar patterns or predictive planning models, forecasting models etc.
A data science application collects data and information from multiple heterogenous
sources, cleans, integrates, processes and analyses this data using various tools and
presents information and knowledge in various visual forms.

As stated earlier data science is a multi-disciplinary science, as shown in Figure 1.

Programming
Visualization Modelling and
Machine Learning simulation,
Computing Mathematics
Big data
Database System Statistics
Data
Science

Domain
Knowledge

Figure 1: Data Science

What are the advantages of Data science in an organisation? The following are some
of the areas in which data science can be useful.
• It helps in making business decisions such as deciding the health of
companies with whom they plan to collaborate,
• It may help in making better predictions for the future such as making
strategic plans of the company based on present trends etc.
• It may identify similarities among various data patterns leading to
applications like fraud detection, targeted marketing etc.
In general, data science is a way forward for business decision making, especially in the present-day world, where data is being generated at the rate of zettabytes.

Data Science can be used in many organisations, some of the possible usage of data
science are as given below:

• It has great potential for finding the best dynamic route from a source to
destination. Such application may constantly monitor the traffic flow and
predict the best route based on collected data.
• It may bring down the logistic costs of an organization by suggesting the best time and route for transporting goods.
• It can minimize marketing expenses by identifying the similar group buying
patterns and performing selective advertising based on the data obtained.
• It can help in making public health policies, especially in the cases of
disasters.
• It can be useful in studying the environmental impact of various
developmental activities
• It can be very useful in savings of resources in smart cities

1.3 TYPES OF DATA

The type of data is one of the important aspects that determines the type of analysis to be performed on it. In data science, the following types of data are required to be processed:

1. Structured Data
2. Semi-Structured Data
3. Unstructured data
4. Data Streams

Structured Data

Since the start of the era of computing, the computer has been used as a data processing device. However, it was not until the 1960s that businesses started using computers for processing their data. One of the most popular languages of that era was the Common Business-Oriented Language (COBOL). COBOL had a data division, which was used to represent the structure of the data being processed. This was followed by the disruptive seminal design of relational technology by E.F. Codd, which led to the creation of relational database management systems (RDBMS). An RDBMS allows structured storage, retrieval and processing of the integrated data of an organisation, which can be securely shared among several applications. The RDBMS technology also supported secure transactions and, thus, became a major source of data generation. Figure 2 shows a sample structure of data that may be stored in a relational database system. One of the key characteristics of structured data is that it can be associated with a schema. In addition, each schema element may be related to a specific data type.

Customer (custID, custName, custPhone, custAddress, custCategory, custPAN, custAadhar)


Account (AccountNumber,custIDoffirstaccountholder,AccountType, AccountBalance)
JointHolders (AccountNumber, custID)
Transaction(transDate, transType, AccountNumber, Amountoftransaction)

Figure 2: A sample schema of structured data

Relational data is structured data, and a large amount of this structured data is being collected by various organisations as the backend of most applications. In the 1990s, the concept of a data warehouse was introduced. A data warehouse is a subject-oriented, time-variant aggregation of the data of an organisation that can be used for decision making. Data in a data warehouse is represented using dimension tables and fact tables. The dimension tables classify the data of the fact tables. You have already studied various schemas in the context of data warehouses in MCS221. The data of a data warehouse is also structured in nature and can be used for analytical data processing and data mining. In addition, many different types of database management systems have been developed, which mostly store structured data.

However, with the growth of communication and mobile technologies many
different applications became very popular leading to generation of very large
amount of semi-structured and unstructured data. These are discussed next.

Semi-structured Data

As the name suggests, semi-structured data has some structure in it. The structure of semi-structured data is due to the use of tags or key/value pairs. Common forms of semi-structured data are produced through XML, JSON objects, server logs, EDI data, etc. Examples of semi-structured data are shown in Figure 3.

XML example:
<Book>
  <title>Data Science and Big Data</title>
  <author>R Raman</author>
  <author>C V Shekhar</author>
  <yearofpublication>2020</yearofpublication>
</Book>

JSON example:
"Book": {
  "Title": "Data Science",
  "Price": 5000,
  "Year": 2020
}
Figure 3: Sample semi-structured data
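As an illustration, the following minimal Python sketch shows how such semi-structured data can be read programmatically. The XML and JSON fragments are illustrative versions of Figure 3, and the tag and key names are assumptions made only for this example.

```python
import json
import xml.etree.ElementTree as ET

# Illustrative XML fragment, similar to the left side of Figure 3
xml_text = """<Book>
  <title>Data Science and Big Data</title>
  <author>R Raman</author>
  <author>C V Shekhar</author>
  <yearofpublication>2020</yearofpublication>
</Book>"""

# Illustrative JSON fragment, similar to the right side of Figure 3
json_text = '{"Book": {"Title": "Data Science", "Price": 5000, "Year": 2020}}'

root = ET.fromstring(xml_text)                      # parse the XML tree
authors = [a.text for a in root.findall("author")]  # repeated tags become a list
print(root.find("title").text, authors)

book = json.loads(json_text)["Book"]                # parse the JSON object
print(book["Title"], book["Year"])
```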

Unstructured Data
The unstructured data does not follow any schema definition. For example, written text like the content of this Unit is unstructured. You may add certain headings or metadata to unstructured data. In fact, the growth of the Internet has resulted in the generation of zettabytes of unstructured data. Some types of unstructured data are listed below:
• Large written textual data such as email data, social media data, etc.
• Unprocessed audio and video data
• Image data and mobile data
• Unprocessed natural speech data
• Unprocessed geographical data
In general, this data requires huge storage space, newer processing methods
and faster processing capabilities.
Data Streams
A data stream is characterised by a sequence of data over a period of time. Such data may be structured, semi-structured or unstructured, but it gets generated repeatedly. For example, IoT devices like weather sensors generate a data stream of pressure, temperature, wind direction, wind speed, humidity, etc. for the place where they are installed. Such data is huge and, for many applications, is required to be processed in real time. In general, not all the data of a stream is required to be stored, and such data is usually processed for a specific duration of time.

1.3.1 Statistical Data Types

There are two distinct types of data that can be used in statistical analysis.
These are – Categorical data and Quantitative data

Categorical or qualitative Data:

Categorical data is used to define the category of data; for example, the occupation of a person may take values from the categories "Business", "Salaried", "Others", etc. Categorical data can be of two distinct measurement scales, called Nominal and Ordinal, which are given in Figure 4. If the categories are not related, then the categorical data is of the Nominal type; for example, the Business and Salaried categories have no relationship, therefore occupation is of the Nominal type. However, a categorical variable like age category, defining age in the categories "0 or more but less than 26", "26 or more but less than 46", "46 or more but less than 61" and "More than 61", has a specific ordering. For example, persons in the age category "More than 61" are older than persons in any other age category.

Quantitative Data:

Quantitative data is numeric data, which is measured on a numeric scale. Quantitative data is also of two basic types: discrete, which represents distinct numbers like 2, 3, 5, …; and continuous, which represents a continuous range of values of a given variable. For example, your height can be measured using a continuous scale.

Measurement scale of data

Data are raw facts; for example, student data may include the name, gender, age and height of a student. The name typically is identifying data that tries to distinctly identify data items, just like a primary key in a database. However, the name or any other identifying data may not be useful for performing data analysis in data science. Data such as Gender, Age and Height may be used to answer queries of the kind: Is there a difference in the height of boys and girls in the age range 10-15 years? One of the important questions is: how do you measure the data so that it is recorded consistently? Stanley Stevens, a psychologist, defined the following four characteristics of any scale of measurement:

• Every representation of the measure should be unique; this is referred to as identity of a value (IDV).
• The second characteristic is magnitude (M), which can be used to compare values; for example, a weight of 70.5 kg is more than 70.2 kg.
• The third characteristic is about equality of intervals (EI) used to represent the data; for example, the difference between 25 and 30 is 5 intervals, which is the same as the difference between 41 and 46, which is also 5 intervals.
• The final characteristic is about a defined minimum or zero value (MZV); for example, on the Kelvin scale, temperature has an absolute zero value, whereas intelligence quotient cannot be defined as zero.

Based on these characteristics four basic measurement scales are defined.


Figure 4 defines these measurements, their characteristics and examples.

Measurement Scale | IDV | M | EI | MZV | Example
Nominal | Yes | No | No | No | Gender: F - Female, M - Male
Ordinal | Yes | For rank ordering | No | No | A hypothetical income category: 1 - "0 or more but less than 26", 2 - "26 or more but less than 46", 3 - "46 or more but less than 61", 4 - "More than 61"
Interval | Yes | Yes | Yes | No | IQ, Temperature in Celsius
Ratio | Yes | Yes | Yes | Yes | Temperature in Kelvin, Age

Figure 4: Measurement Scales of Data

1.3.2 Sampling

In general, the size of the data to be processed today is quite large. This leads to the question of whether you should use the entire data or a representative sample of it. In several data science techniques, sample data is used to develop an exploratory model. Thus, even in data science, sampling is one of the ways to enhance the speed of exploratory data analysis. The population, in this case, is the entire set of data in which you may be interested. Figure 5 shows the relationship between population and sample. One of the questions asked in this context is: what should be the size of a good sample? You may find the answer in the literature; however, please note that a good sample is representative of its population.

(Diagram: the sample is shown as a subset drawn from within the population.)
Figure 5: Population and Sample
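The following minimal Python sketch illustrates the idea of drawing a simple random sample from a population. The population of heights is simulated purely for illustration, and the mean of 166 cm and standard deviation of 10 are assumed values.

```python
import random
import statistics

# Hypothetical population: heights (in cm) of 10,000 people
random.seed(1)
population = [random.gauss(166, 10) for _ in range(10_000)]

# Draw a simple random sample of size n = 100
sample = random.sample(population, k=100)

# The sample statistic (x-bar) estimates the population parameter (mu)
print("Population mean (mu):", round(statistics.mean(population), 2))
print("Sample mean (x-bar): ", round(statistics.mean(sample), 2))
```

You may observe that the sample mean is usually close to, but not exactly equal to, the population mean; this relationship between statistic and parameter is discussed in the table below and formalised in Unit 2.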

One of the key objectives of statistics, which uses sample data, is to determine the statistic of the sample and find the probability that the statistic computed for the sample estimates the parameters of the population with a specific level of accuracy. Please note that the terms used above are very important and are explained in the following table:

Term | Used for | Example
Statistic | A statistic is computed for the sample | Sample mean (x̄), sample standard deviation (s), sample size (n)
Parameter | Parameters are estimated from the sample and describe the population | Population mean (µ), population standard deviation (σ), population size (N)

Next, we discuss different kind of analysis that can be performed on data.

Check Your Progress 1:


1. Define the term data science.

2. Differentiate between structured, semi-structured, unstructured and


stream data.

3. What would be the measurement scale for the following? Give reason
in support of your answer.
Age, AgeCategory, Colour of eye, Weight of students of a class, Grade
of students, 5-point Likert scale

1.4 BASIC METHODS OF DATA ANALYSIS


The data for data science is obtained from several data sources. This data is first cleaned of errors and duplication, aggregated, and then presented in a form that can be analysed by various methods. In this section, we define some of the basic methods used for analysing data: descriptive analysis, exploratory analysis, inferential analysis and predictive analysis.

1.4.1 Descriptive Analysis

Descriptive analysis is used to present basic summaries about data; however, it makes
no attempt to interpret the data. These summaries may include different statistical
values and certain graphs. Different types of data are described using different ways.
The following example illustrates this concept:

Example 1: Consider the data given in the following Figure 6. Show the summary of
categorical data in this Figure.

Enrolment Number | Gender | Height (cm)
S20200001 | F | 155
S20200002 | F | 160
S20200003 | M | 179
S20200004 | F | 175
S20200005 | M | 173
S20200006 | M | 160
S20200007 | M | 180
S20200008 | F | 178
S20200009 | F | 167
S20200010 | M | 173

Figure 6: Sample Height Data

Please note that enrolment number variable need not be used in analysis, so no
summary data for enrolment number is to be found.

Descriptive of Categorical Data:

Gender is a categorical variable in Figure 6. The summary in this case would be in terms of a frequency table of the various categories. For example, for the given data, the frequency distribution would be:

Gender | Frequency | Proportion | Percentage
Female (F) | 5 | 0.5 | 50%
Male (M) | 5 | 0.5 | 50%

In addition, you can draw a bar chart or pie chart to describe the data of the Gender variable. The pie chart for such data is shown in Figure 7. Details of different charts are explained in Unit 4. In general, you draw a bar graph when the number of categories is large.

Figure 7: The Pie Chart
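A minimal sketch of how such a frequency table can be produced, assuming the pandas library is available; the Gender values are those of Figure 6, and the plotting line is only indicative.

```python
import pandas as pd

# Gender column from the sample data of Figure 6 (Example 1)
gender = pd.Series(["F", "F", "M", "F", "M", "M", "M", "F", "F", "M"])

freq = gender.value_counts()                 # frequency of each category
prop = gender.value_counts(normalize=True)   # proportion of each category

summary = pd.DataFrame({"Frequency": freq,
                        "Proportion": prop,
                        "Percentage": (prop * 100).round(1)})
print(summary)

# A pie or bar chart of the same frequencies, as in Figure 7 (needs matplotlib):
# summary["Frequency"].plot.pie(autopct="%.0f%%")
```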

Descriptive of Quantitative Data:

Height is a quantitative variable. Quantitative data is described in the following two ways:
1. Describing the central tendencies of the data
2. Describing the spread of the data.
Central tendencies of quantitative data: Mean and median are two basic measures that define the centre of the data, though in different ways. They are defined below with the help of an example.

Example 2: Find the mean and median of the following data:

Observation number (n = 11): 1   2   3   4   5   6   7   8   9  10  11
x:                           4  21  25  10  18   9   7  14  11  19  14

The mean can be computed as:

x̄ = (Σx) / n

For the given data:
x̄ = (4 + 21 + 25 + 10 + 18 + 9 + 7 + 14 + 11 + 19 + 14) / 11
Mean x̄ = 13.82

The median of the data is the middle value of the sorted data. First, the data is sorted and then the median is computed using the following formula:
If n is even, then
median = [value at the (n/2)th position + value at the ((n/2) + 1)th position] / 2
If n is odd, then
median = value at the ((n + 1)/2)th position
For this example, the sorted data is as follows:

Observation number (n = 11): 1   2   3   4   5   6   7   8   9  10  11
x (sorted):                  4   7   9  10  11  14  14  18  19  21  25

So, the median is:

median = value at the ((11 + 1)/2)th = 6th position = 14

You may please note that outliers, which are defined as values highly different from most other values, can impact the mean but not the median. For example, if one observation in the data of Example 2 is changed as follows:

Observation number (n = 11): 1   2   3   4   5   6   7   8   9  10  11
x (sorted):                  4   7   9  10  11  14  14  18  19  21 100

then the median will still remain 14; however, the mean will change to 20.64, which is quite different from the earlier mean. Thus, you should be careful about the presence of outliers during data analysis.
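The following short Python sketch reproduces the mean and median of Example 2 and shows the effect of the outlier discussed above; it uses only the standard statistics module.

```python
import statistics

data = [4, 21, 25, 10, 18, 9, 7, 14, 11, 19, 14]   # data of Example 2
print(round(statistics.mean(data), 2))   # 13.82
print(statistics.median(data))           # 14

# Replace the largest value with an outlier (100), as discussed above
with_outlier = [4, 7, 9, 10, 11, 14, 14, 18, 19, 21, 100]
print(round(statistics.mean(with_outlier), 2))   # 20.64 -> the mean shifts
print(statistics.median(with_outlier))           # 14    -> the median is unchanged
```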

Interestingly, the relationship between the mean and the median may be useful in determining the nature of the data. The following table describes these conditions:

Relationship between mean and median | Comments about the distribution of observations | A possible graph of the data distribution
Mean and median almost equal | The distribution of the data may be symmetric in nature | A single central peak; mean, median and mode coincide
Mean >> Median | The distribution may be right-skewed | A long tail to the right; mode < median < mean
Mean << Median | The distribution may be left-skewed | A long tail to the left; mean < median < mode

Figure 8: Mean and Median for possible data distributions

The concept of data distribution is explained in the next Unit.

Mode: Mode is defined as the most frequent value of a set of observation. For
example, in the data of example 2, the value 14, which occurs twice, is the mode. The
mode value need not be a mid-value rather it can be any value of the observations. It
just communicates the most frequently occurring value only. In a frequency graph,
mode is represented by peak of data. For example, in the graphs shown in Figure 8,
the value corresponding to the peaks is the mode.

Spread of Quantitative data: Another important aspect of defining the quantitative


data is its spread or variability of observed data. Some of the measures for spread of
data are given in the Figure 9.

Measure | Description | Example (refer to the data of Example 2)
Range | Minimum to maximum value. | 4 to 25
Variance | The sum of the squares of the differences between the observations and the sample mean, divided by (n − 1); the nth difference can be determined from the other (n − 1) computed differences, as the overall sum of the differences has to be zero. The formula of the variance for a sample is: s² = (1/(n − 1)) Σ (x − x̄)². However, in case you are determining the population variance, you can use the following formula: σ² = (1/N) Σ (x − µ)². | Try both formulas and then match the answer: 40.96 (sample variance)
Standard Deviation | The standard deviation is one of the most used measures for finding the spread or variability of data. For a sample it is computed as s = √[(1/(n − 1)) Σ (x − x̄)²], and for the population as σ = √[(1/N) Σ (x − µ)²]. | Try both formulas and then match the answer: 6.4 (sample standard deviation)
5-Point Summary and Interquartile Range (IQR) | For creating the 5-point summary, you first need to sort the data. The five-point summary consists of: the minimum value (Min); the 1st quartile, below which <= 25% of the values lie (Q1); the 2nd quartile, which is the median (M); the 3rd quartile, below which <= 75% of the values lie (Q3); and the maximum value (Max). The IQR is the difference between the 3rd and 1st quartile values. | Using the sorted data of Example 2: Min = 4; Q1 = (9+10)/2 = 9.5; M = 14; Q3 = (18+19)/2 = 18.5; Max = 25; IQR = 18.5 − 9.5 = 9

Figure 9: The Measures of Spread or Variability

The IQR can also be used to identify suspected outliers. In general, a suspected outlier can lie in the following two ranges:
Observations/values less than Q1 − 1.5 × IQR
Observations/values more than Q3 + 1.5 × IQR
For Example 2, the IQR is 9; therefore, the suspected outliers would be:
values < (9.5 − 1.5 × 9), i.e., values < −4, or values > (18.5 + 1.5 × 9), i.e., values > 32.
Thus, there is no outlier in the initial data of Example 2.
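A minimal Python sketch of the spread measures of Figure 9, using the standard statistics module; the 'inclusive' quantile method is used here because it matches the quartile convention (Q1 = 9.5, Q3 = 18.5) used in the text.

```python
import statistics

data = sorted([4, 21, 25, 10, 18, 9, 7, 14, 11, 19, 14])   # data of Example 2

print(round(statistics.variance(data), 2))   # sample variance ~ 40.96
print(round(statistics.stdev(data), 2))      # sample standard deviation ~ 6.4

# Quartiles and IQR; quantiles(..., n=4) returns Q1, median, Q3
q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
iqr = q3 - q1
print(q1, median, q3, iqr)                   # 9.5 14.0 18.5 9.0

# Suspected outliers lie outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
print([x for x in data if x < low or x > high])   # [] -> no outlier here
```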

For quantitative data, you may also draw various plots, such as a histogram, box plot, etc. These plots are explained in Unit 4 of this block.

Check Your Progress 2


1. Age category of student is a categorical data. What information would you
like to show for its descriptive analysis.

2. Age is a quantitative data; how will you describe its data?

3. How can you find that given data is left skewed?

4. What is IQR? Can it be used to find outliers?

1.4.2 Exploratory Analysis

Exploratory data analysis was suggested by John Tukey of Princeton University in the 1960s as a group of methods that can be used to learn about possible relationships amongst data. After you have obtained relevant data for analysis, instead of performing the final analysis, you may like to explore the data for possible relationships using exploratory data analysis. In general, graphs are some of the best ways to perform exploratory analysis. Some of the common methods that you can use during exploratory analysis are as follows:

1. As a first step, you may compute the descriptives of the various categorical and quantitative variables of your data. Such information is very useful in determining the suitability of the data for the purpose of analysis. This may also help you in data cleaning, modification and transformation of the data.
   a. For the categorical data, you may create frequency tables and bar charts to know the distribution of data among the different categories. A balanced distribution of data among categories is most desirable; however, such a distribution may not be possible in actual situations. Several methods have been suggested to deal with such situations; some of these will be discussed in later units.
   b. For the quantitative data, you may compute the mean, median, standard deviation, skewness and kurtosis. The kurtosis value relates to the peakedness of the data. In addition, you may also draw charts like histograms to look into the frequency distribution.
2. Next, after performing the univariate analysis, you may try to perform some bivariate analysis. Some of the basic statistics that you can compute for bivariate analysis include the following:
   a. Make two-way tables between categorical variables and make related stacked bar charts. You may also use chi-square tests to find any significant relationships.
   b. You may draw side-by-side box plots to check if the data of the various categories differ.
   c. You may draw scatterplots and check the correlation coefficient, if any, between two variables.
3. Finally, you may like to look into the possibilities of multivariate relationships amongst the data. You may use dimensionality reduction through techniques such as feature extraction or principal component analysis, you may perform clustering to identify a possible set of classes in the solution space, or you may use graphical tools, like bubble charts, to visualise the data.
It may be noted that exploratory data analysis helps in identifying possible relationships amongst data, but does not promise that a causal relationship exists amongst the variables. A causal relationship has to be ascertained through qualitative analysis. Let us explain exploratory data analysis with the help of an example.

Example 3: Consider the sample data of students given in Figure 6 about Gender and Height. Let us explore this data for an analytical question: Does Height depend on Gender?

You can perform exploratory analysis on this data by drawing side-by-side box plots of the male and female students' heights. This box plot is shown in Figure 10.

Figure 10: Exploratory Data Analysis

Please note that the box plot of Figure 10 shows that, on average, the height of the male students is more than that of the female students. Does this result apply, in general, to the population? To answer this question, you need to find the probability of occurrence of such sample data; therefore, inferential analysis may need to be performed.
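A minimal sketch of how the side-by-side box plot of Figure 10 can be drawn, assuming the matplotlib library is available; the two height lists are taken from Figure 6.

```python
import matplotlib.pyplot as plt

# Heights (cm) from Figure 6, split by Gender
female = [155, 160, 175, 178, 167]
male = [179, 173, 160, 180, 173]

# Side-by-side box plots, as in Figure 10
plt.boxplot([female, male], labels=["Female", "Male"])
plt.ylabel("Height (cm)")
plt.title("Height by Gender")
plt.show()
```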

1.4.3 Inferential Analysis


Inferential analysis is performed to answer the question: what is the probability that the results obtained from an analysis of a sample can be applied to the entire population? A detailed discussion of the various terms used in inferential analysis in the context of statistical analysis is given in Unit 2. You can perform many different types of statistical tests on data. Some well-known tools for data analysis are listed in Figure 11.

Test | Why performed?
Z-test or t-test (univariate analysis) | To determine whether the computed value of the mean of a sample is applicable to the population, and the related confidence interval.
Chi-square test (bivariate) | To test the relationship between two categorical variables or groups of data.
Two-sample t-test (bivariate) | To test the difference between the means of two variables or groups of data.
One-way ANOVA | To test the difference in means of more than two variables or groups of data.
F-test | To determine the equality of variances of two or more groups of data.
Correlation analysis | Determines the strength of the relationship between two variables.
Regression analysis | Examines the dependence of one variable on a set of independent variables.
Decision trees | Supervised learning used for classification.
Clustering | Unsupervised learning.

Figure 11: Some tools for data analysis

You may have read about many of these tests in the Data Warehousing and Data Mining and the Artificial Intelligence and Machine Learning courses. In addition, you may refer to the further readings for these tools. The following example explains the importance of inferential analysis.
Example 4: Figure 10 in Example 3 shows the box plot of the heights of male and female students. Can you infer from the box plot and the sample data (Figure 6) whether there is a difference in the heights of male and female students?
In order to infer whether there is a difference between the heights of the two groups (male and female students), a two-sample t-test was run on the data. The output of this t-test is shown in Figure 12.
t-Test (two tail): Assuming Unequal Variances

                 | Female | Male
Mean             | 167    | 173
Variance         | 94.5   | 63.5
Observations     | 5      | 5
Computed t-value | -1.07
p-value          | 0.32
Critical t-value | 2.30

Figure 12: The output of the two-sample t-test (two tail)

Figure 12 shows that the mean height of the female students is 167 cm, whereas for the male students it is 173 cm. The variance for the female students is 94.5, whereas for the male students it is 63.5. Each group has 5 observations. The computed t-value is -1.07 and the p-value is 0.32. As the p-value is greater than 0.05, you cannot conclude that the average height of male students is different from the average height of female students.
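A minimal sketch of how the t-test of Figure 12 can be reproduced, assuming the SciPy library is available; equal_var=False requests the unequal-variance (Welch) version of the two-sample t-test used above.

```python
from scipy import stats

female = [155, 160, 175, 178, 167]   # heights of female students (Figure 6)
male = [179, 173, 160, 180, 173]     # heights of male students (Figure 6)

# Two-sample, two-tailed t-test assuming unequal variances
t_stat, p_value = stats.ttest_ind(female, male, equal_var=False)
print(round(t_stat, 2), round(p_value, 2))   # roughly -1.07 and 0.32

# Since the p-value is greater than 0.05, the difference in mean height
# is not statistically significant for this sample.
```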

1.4.4 Predictive Analysis

The availability of large amounts of data, together with advanced algorithms for mining and analysing such data, has led the way to advanced predictive analysis. The predictive analysis of today uses tools from artificial intelligence, machine learning, data mining, data stream processing, data modelling, etc. to make predictions for the strategic planning and policies of organisations. Predictive analysis uses large amounts of data to identify potential risks and aid the decision-making process. It can be used in several data-intensive industries like electronic marketing, financial analysis, healthcare applications, etc. For example, in the healthcare industry, predictive analysis may be used to determine the public health infrastructure requirements of the future based on present health data.
Advancements in artificial intelligence, data modelling and machine learning have also led to prescriptive analysis. Prescriptive analysis aims to take predictions one step forward and suggest solutions to present and future issues.
A detailed discussion of these topics is beyond the scope of this Unit. You may refer to the further readings for more information on these.

1.5 COMMON MISCONCEPTIONS OF DATA ANALYSIS

In this section, we discuss three misconceptions that can affect the results of data science analysis. These misconceptions are explained with the help of examples.

Correlation is not Causation: Correlation analysis establishes a relationship between two variables. For example, consider three variables, namely the attendance of a student (attend), the marks obtained by a student (marks) and the weekly hours spent by a student on study (study). While analysing the data, you found that there is a strong correlation between the variables attend and marks. However, does it really mean that higher attendance causes students to obtain better marks? There is another possibility: that both study and marks, as well as study and attend, are correlated. A motivated student may be spending a higher number of hours studying at home, which may lead to better marks. Similarly, a motivated student who is putting a greater number of hours into his/her study may be attending school regularly. Thus, the correlation between study and marks and between study and attend results in an observed, but non-causal, correlation between attend and marks. This situation is shown in Figure 13.

(Diagram: Study causes Attend and Study causes Marks; the link between Attend and Marks is observed, but it is not a causal link.)

Figure 13: Correlation does not mean causation

Simpson's Paradox: Simpson's paradox is an interesting situation which sometimes leads to wrong interpretations. Consider two Universities, say University 1 (U1) and University 2 (U2), and their pass data:

University | Students Passed | Passed % | Students Failed | Failed % | Total
U1 | 4850 | 97% | 150 | 3% | 5000
U2 | 1960 | 98% | 40 | 2% | 2000

Figure 14: The Results of the Universities

As you may observe from the data above, University U2 is performing better as far as the passing percentage is concerned. However, during a detailed data inspection, it was noted that both the Universities were running a Basic Adult Literacy Programme, which in general has a slightly poorer result, and that the data of the literacy programme is to be compiled separately. Therefore, the data for the Universities would be:

General Programmes:
University | Students Passed | Passed % | Students Failed | Failed % | Total
U1 | 1480 | 98.7% | 20 | 1.3% | 1500
U2 | 1480 | 98.7% | 20 | 1.3% | 1500

Adult Literacy Programme:
University | Students Passed | Passed % | Students Failed | Failed % | Total
U1 | 3370 | 96.3% | 130 | 3.7% | 3500
U2 | 480 | 96% | 20 | 4% | 500

Figure 15: The result after a new grouping is added

You may observe that, with the additional grouping for the Adult Literacy Programme, the data shows that U1 is performing at least as well as U2: the pass rate for the General Programmes is the same, and the pass rate for the Adult Literacy Programme is better for U1. You must note the change in the percentages. This is Simpson's paradox.
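The following short Python sketch recomputes the pooled and per-programme pass rates from Figures 14 and 15, showing how the comparison reverses once the grouping is taken into account; the dictionary layout is an assumption made only for this illustration.

```python
# (passed, total) counts from Figures 14-15, by university and programme
data = {
    "U1": {"General": (1480, 1500), "Adult Literacy": (3370, 3500)},
    "U2": {"General": (1480, 1500), "Adult Literacy": (480, 500)},
}

for uni, groups in data.items():
    passed = sum(p for p, t in groups.values())
    total = sum(t for p, t in groups.values())
    # Overall (pooled) pass rate vs per-programme pass rates
    rates = {g: round(100 * p / t, 1) for g, (p, t) in groups.items()}
    print(uni, "overall:", round(100 * passed / total, 1), "% | per group:", rates)

# U2 looks better overall (98.0% vs 97.0%), yet U1 is at least as good as U2
# within every programme -- Simpson's paradox.
```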

Data Dredging: Data dredging, as the name suggests, is the extensive analysis of very large data sets. Such analysis results in the generation of a large number of data associations. Many of those associations may not be causal and, thus, require further exploration through other techniques. Therefore, it is essential that every data association found in a large data set be investigated further before reporting it as a conclusion of the study.

1.6 APPLICATIONS OF DATA SCIENCE
Data Science is useful in analysing large data sets to produce useful information that
can be used for business development and can help in decision making process. This
section highlights some of the applications of data science.

Applications using Similarity analysis


These applications use similarity analysis of data using various algorithms, resulting
into classification or clustering of data into several categories. Some of these
applications can be:
• Spam detection system: This system classifies the emails into spam and non-
spam categories. It analyses the IP addresses of mail origin, word patterns
used in mails, word frequency etc. to classify a mail as spam or not.
• Financial Fraud detection system: This is one of the major applications for
online financial services. Basic principle is once again to classify the
transactions as safe or unsafe transactions based on various parameters of the
transactions.
• Recommendation of Products: Several e-commerce companies have the data
of your buying patterns, information about your searches to their portal and
other information about your account with them. This information can be
clustered into classes of buyers, which can be used to recommend various
products for you.

Applications related to Web Searching


These applications primarily help you in finding content on the web more effectively. Some of the applications in this category are the search algorithms used by various search engines. These algorithms attempt to find good websites based on the search terms. They may use tools related to the semantics of your terms, indexing of important websites and terms, link analysis, etc. In addition, the predictive text feature of a browser is also an example of the use of data science in web searching.

Applications related to Healthcare System


Data science can be extremely useful for healthcare applications. Some of the applications may involve processing and analysing images for neonatal care or to detect possibilities of tumours, deformities, problems in organs, etc. In addition, there can be applications for establishing relationships of diseases with certain factors, creating recommendations for public health based on public health data, genomic analysis, and the creation and testing of new drugs. The possibility of using streaming data for monitoring patients is also a potential area for the use of data science in healthcare.

Applications related to Transport sector


These applications may investigate the possibilities of finding the best routes (air, road, etc.). For example, many e-commerce companies need to plan the most economical ways of providing logistic support from their warehouses to the customer. Finding the best dynamic route from a source to a destination under dynamic load on road networks is another example; such an application would be required to process streams of data.

In general, data science can be used for the benefit of society. It should be used
creatively to improve the effective resource utilization, which may lead to sustainable
development. The ultimate goal of data science applications should be to help us
protect our environment and human welfare.

1.7 DATA SCIENCE LIFE CYCLE
So far, we have discussed various aspects of data science in the previous sections. In this section, we discuss the life cycle of a data-science-based application. In general, the development of a data science application may involve the following stages:

Data Science Project Requirements Analysis Phase

The first and foremost step of a data science project is to identify the objectives of the project. This identification of objectives is coupled with a study of the benefits of the project, its resource requirements and its cost. In addition, you need to make a project plan, which includes the project deliverables and the associated time frame. The data that is to be used for the project is also decided. This phase is similar to the requirements study and project planning and scheduling phases.

Data collection and Preparation Phase


In this phase, first all the data sources are identified, followed by designing the
process of data collection. It may be noted that data collection may be a continuous
process. Once the data sources are identified then data is checked for duplication of
data, consistency of data, missing data, and availability timeline of data. In addition,
data may be integrated, aggregated or transformed to produce data for a defined set of
attributes, which are identified in the requirements phase.

Descriptive data analysis


Next, the data is analysed using univariate and bivariate analysis techniques. This generates descriptive information about the data. This phase can also be used to establish the suitability and validity of the data as per the requirements of the data analysis. This is a good time to review your project requirements vis-à-vis the collected data characteristics.

Data Modelling and Model Testing


Next, a number of data models based on the data are developed. All these data models are then tested for their validity with test data. The accuracy of the various models is compared and contrasted, and a final model is proposed for data analysis.

Model deployment and Refinement


The tested best model is used to address the data science problem, however, this
model must be constantly refined, as the decision making environment keeps
changing and new data sets and attributes may change with time. The refinement
process goes through all the previous steps again.

Thus, in general, a data science project follows a spiral of development. This is shown in Figure 16.

(Diagram: a cycle from the Data Science Project Requirements Analysis Phase to the Data Collection and Preparation Phase, then Descriptive Data Analysis, then Data Modelling and Model Testing, and finally Model Deployment and Refinement, which feeds back into requirements analysis.)

Figure 16: A sample Life cycle of a Data Science Project

Check Your Progress 3


1. What are the advantages of using box plots?

2. How is inferential analysis different from exploratory analysis?

3. What is Simpson’s paradox?

1.8 SUMMARY
This Unit introduces basic statistical and analytical concepts of data science. It first introduces you to the definition of data science. Data science as a discipline uses concepts from computing, mathematics and domain knowledge. The types of data for data science are defined in two different ways: first, on the basis of the structure and generation rate of the data, and next, in terms of the measurement scales that can be used to capture the data. In addition, the concept of sampling has been defined in this Unit.
This Unit also explains some of the basic methods used for analysis, which include descriptive, exploratory, inferential and predictive analysis. A few interesting misconceptions related to data science have also been explained with the help of examples. This Unit also introduces you to some of the applications of data science and the data science life cycle. Given the ever-advancing technology, you are advised to keep reading about newer data science applications.

1.9 SOLUTIONS/ANSWERS

Check Your Progress 1:

1. Data science integrates the principles of computer science, mathematics and domain knowledge to create mathematical models that show relationships amongst data attributes. In addition, data science uses data to perform predictive analysis.
2. Structured data has a defined dimensional structure clearly identified by attributes, for example, tables, data cubes, etc. Semi-structured data has some structure due to the use of tags; however, the structure may be flexible, for example, XML data. Unstructured data has no structure at all, like long texts. Data streams, on the other hand, may consist of structured, semi-structured or unstructured data that is being produced continuously.
3. Age is measured on a ratio scale. Age category would be categorical data on an ordinal scale, as there are differences among the categories but those differences cannot be defined quantitatively. Colour of eye is nominal data. The weight of the students of a class is on a ratio scale. Grade is also a measure on an ordinal scale. A 5-point Likert scale is also ordinal data.

Check Your Progress 2:

1. The descriptive analysis of categorical data may include the total number of observations, a frequency table and a bar or pie chart.
2. The descriptive analysis of age may include the mean, median, mode, skewness, kurtosis, standard deviation and a histogram or box plot.
3. For left-skewed data, the mean is substantially lower than the median and the mode.
4. The difference between the 3rd quartile and the 1st quartile is the interquartile range (IQR). In general, suspected outliers are at a distance of more than 1.5 times the IQR above the 3rd quartile or more than 1.5 times the IQR below the 1st quartile.

Check Your Progress 3:

1. Box plots show the 5-point summary of the data. A well-spread box plot is an indicator of normally distributed data. Side-by-side box plots can be used to compare the scale data values of two or more categories.
2. Inferential analysis also computes a p-value, which determines whether the results obtained by exploratory analysis are significant enough that they may be applicable to the population.
3. Simpson's paradox signifies that statistics computed on grouped data may sometimes produce results that are contrary to the same statistics applied to ungrouped data.

UNIT 2 PROBABILITY AND STATISTICS FOR DATA SCIENCE

2.0 Introduction
2.1 Objectives
2.2 Probability
2.2.1 Conditional Probability
2.2.2 Bayes Theorem
2.3 Random Variables and Basic Distributions
2.3.1 Binomial Distribution
2.3.2 Probability Distribution of Continuous Random Variable
2.3.3 The Normal Distribution
2.4 Sampling Distribution and the Central Limit Theorem
2.5 Statistical Hypothesis Testing
2.5.1 Estimation of Parameters of the Population
2.5.2 Significance Testing of Statistical Hypothesis
2.5.3 Example using Correlation and Regression
2.5.4 Types of Errors in Hypothesis Testing
2.6 Summary
2.7 Solution/Answers

2.0 INTRODUCTION

In the previous unit of this Block, you were introduced to the basic concepts of data science, which include the basic types of data, the basic methods of data analysis, and the applications and life cycle of data science. This Unit introduces you to the basic concepts of probability and statistics related to data science.
It introduces the concept of conditional probability and Bayes theorem. This is followed by a discussion of basic probability distributions, highlighting their significance and use. These distributions include the Binomial and Normal distributions, the two most used distributions for discrete and continuous variables respectively. The Unit also introduces you to the concept of the sampling distribution and the central limit theorem. Finally, this unit covers the concepts of statistical hypothesis testing with the help of an example of correlation. You may refer to the further readings for more details on these topics, if needed.

2.1 OBJECTIVES

After going through this unit, you should be able to:

• compute the conditional probability of events;
• use Bayes theorem in problem solving;
• explain the concept of a random variable;
• explain the characteristics of the binomial and normal distributions;
• describe the sampling distribution and the central limit theorem;
• state a statistical hypothesis; and
• perform significance testing.


2.2 PROBABILITY

Probability is a measure of the chance of occurrence of a specific event amongst a group of events, when the occurrence of events is observed over a large number of trials. For example, the possibility of getting 1 while rolling a fair die is 1/6. You may please note that you either need to observe this event by repeatedly rolling a die for a large number of trials to arrive at this probability, or you may determine the probability by finding the ratio of this outcome to the total number of possible outcomes, which should all be equally likely. Thus, the probability of an event (E) can be computed using the following formula:

P(E) = (Number of outcomes in the set of all possible outcomes that result in event E) / (Number of outcomes in the set of all possible outcomes)    (1)

In the equation above, the set of all possible outcomes is also called the sample space. In addition, it is expected that all the outcomes are equally likely to occur.

Consider that you decide to roll two fair dice together at the same time. Will the outcome of the first die affect the outcome of the second die? It will not, as both outcomes are independent of each other. In other words, two trials are independent if the outcome of the first trial does not affect the outcome of the second trial and vice-versa; otherwise, the trials are dependent.
How do we compute the probability for more than one event in a sample space? Let us explain this with the help of an example.

Example 1: A fair die having six equally likely outcomes is to be thrown, then:
(i) The sample space is: {1, 2, 3, 4, 5, 6}
(ii) An event A is: the die shows 2; then the outcome set of event A is {2}, and its probability is P(A) = 1/6
(iii) An event B is: the die shows an odd face; then event B is {1, 3, 5}, and the probability of event B is P(B) = 3/6 = 1/2
(iv) An event C is: the die shows an even face; then event C is {2, 4, 6}, and the probability of event C is P(C) = 3/6 = 1/2
(v) Events A and B are disjoint events, as no outcome is common between them. So are events B and C. But events A and C are not disjoint.
(vi) The intersection of events A and B is the null set {}, as they are disjoint events; therefore, the probability that events A and B both occur, viz. P(A∩B), is 0. However, the intersection of A and C is {2}; therefore, P(A∩C) = 1/6.
(vii) The union of events A and B is {1, 2, 3, 5}; therefore, the probability that event A or event B occurs, viz. P(A∪B), is 4/6 = 2/3. Whereas the union of events B and C is {1, 2, 3, 4, 5, 6}; therefore, P(B∪C) = 6/6 = 1.

Please note that the following formula can be derived from the above example.
Probability of occurrence of any of the two events A or B (also called union of
events) is:
𝑃(𝐴 ∪ 𝐵) = 𝑃(𝐴) + 𝑃(𝐵) − 𝑃(𝐴 ∩ 𝐵) (2)

For example 1, you may compute the probability of occurrence of event A or C
as:
𝑃(𝐴 ∪ 𝐶) = 𝑃(𝐴) + 𝑃(𝐶) − 𝑃(𝐴 ∩ 𝐶)
= 1/6 + 1/2 – 1/6 = 1/2.
In the case of disjoint events, since 𝑃(𝐴 ∩ 𝐵) is zero, therefore, the equation
(2) will reduce to:
𝑃(𝐴 ∪ 𝐵) = 𝑃(𝐴) + 𝑃(𝐵) (3)

Probability of events in independent trials:

This is explained with the help of the following example.
Example 2: A fair die is thrown twice. What is the probability of getting a 2 in the first throw and a 4 or 5 in the second throw?
The probability of getting a 2 in the first throw (say event X) is P(X) = 1/6.
The probability of getting {4, 5} in the second throw (say event Y) is P(Y) = 2/6.
Both these events are independent of each other; therefore, you need to use the formula for the intersection of independent events, which is:
P(X ∩ Y) = P(X) × P(Y)    (4)
Therefore, the probability P(X ∩ Y) = 1/6 × 2/6 = 1/18.
This rule is applicable even with more than two independent events. However, this rule will not apply if the events are not independent.
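A minimal Python sketch of Example 2: it computes the exact probability using equation (4) and then checks it by simulating a large number of pairs of throws; the number of simulated trials is an arbitrary choice for this illustration.

```python
import random
from fractions import Fraction

# Exact probability: P(2 on the first throw AND 4 or 5 on the second throw)
p_exact = Fraction(1, 6) * Fraction(2, 6)
print(p_exact)               # 1/18

# Estimate the same probability by simulating many pairs of throws
random.seed(0)
trials = 100_000
hits = sum(1 for _ in range(trials)
           if random.randint(1, 6) == 2 and random.randint(1, 6) in (4, 5))
print(hits / trials)         # close to 1/18 = 0.0555...
```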

2.2.1 Conditional Probability

Conditional probability is defined as the probability of occurrence of an event, given that another event has occurred. Conditional probability addresses the following question:

Given two events X and Y with probabilities of occurrence P(X) and P(Y) respectively, what would be the probability of occurrence of X if the other event Y has actually occurred?
Let us analyse the problem further. Since the event Y has already occurred, the sample space reduces to the sample space of event Y. In addition, the possible outcomes for the occurrence of X are the outcomes in the intersection of X and Y, as that is the only part of X which lies in the sample space of Y. Figure 1 shows this with the help of a Venn diagram.

(Venn diagram: two overlapping sets X and Y within the initial sample space; once Y has occurred, the sample space reduces to Y, and the possible outcomes of X are those in X ∩ Y.)

Figure 1: The conditional probability of event X given that event Y has occurred

You can compute the conditional probability using the following equation:
P(X/Y) = P(X ∩ Y) / P(Y)    (5)
where P(X/Y) is the conditional probability of occurrence of event X, given that event Y has occurred.
For example, in Example 1, what is the probability of occurrence of event A, if event C has occurred?
You may please note that P(A∩C) = 1/6 and P(C) = 1/2; therefore, the conditional probability P(A/C) would be:
P(A/C) = P(A ∩ C) / P(C) = (1/6) / (1/2) = 1/3
What would be the conditional probability of disjoint events? You may find the answer by computing P(A/B) for Example 1.

What would be the conditional probability for independent events?

The equation (5) of conditional probability can be used to derive a very interesting result, as follows:
You can rewrite equation (5) as:
P(X ∩ Y) = P(X/Y) × P(Y)    (5a)
Similarly, you can rewrite equation (5) for P(Y/X) as:
P(Y/X) = P(X ∩ Y) / P(X), or P(X ∩ Y) = P(Y/X) × P(X)    (5b)
Using equations (5a) and (5b), you can conclude the following:
P(X ∩ Y) = P(X/Y) × P(Y) = P(Y/X) × P(X)    (6)

Independent events are a special case for conditional probability. As the two events are independent of each other, the occurrence of any one of the events does not change the probability of occurrence of the other event. Therefore, for independent events X and Y:
P(X/Y) = P(X) and P(Y/X) = P(Y)    (7)
In fact, equation (7) can be used to check whether two events are independent.

2.2.2 Bayes Theorem

Bayes theorem is one of the important theorems dealing with conditional probability. Mathematically, Bayes theorem can be written using equation (6) as:
P(X/Y) × P(Y) = P(Y/X) × P(X)
or P(X/Y) = (P(Y/X) × P(X)) / P(Y)    (8)

Example 3: Assume that you have two bags, namely Bag A and Bag B. Bag A contains 5 green and 5 red balls, whereas Bag B contains 3 green and 7 red balls. Assume that you have drawn a red ball; what is the probability that this red ball was drawn from Bag B?

In this example,
Let the event X be "Drawing a Red Ball". The probability of drawing a red ball can be computed as follows:
You may select a bag and then draw a ball. Therefore, the probability will be computed as:
(Probability of selection of Bag A) × (Probability of drawing a red ball from Bag A) + (Probability of selection of Bag B) × (Probability of drawing a red ball from Bag B)
P(Red) = (1/2 × 5/10 + 1/2 × 7/10) = 3/5

Let the event Y be "Selection of Bag B from the two bags", assuming equally likely selection of the bags. Therefore, P(BagB) = 1/2.
In addition, if Bag B is selected, then the probability of drawing a red ball is P(Red/BagB) = 7/10, as Bag B has 3 green and 7 red balls.
As per the Bayes theorem:
P(BagB/Red) = (P(Red/BagB) × P(BagB)) / P(Red)
P(BagB/Red) = (7/10 × 1/2) / (3/5) = 7/12
Bayes theorem is a powerful tool to revise your estimate, provided a given event has occurred. Thus, you may be able to change your predictions.
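A minimal Python sketch of Example 3 using exact fractions: it computes P(Red) by total probability and then applies equation (8).

```python
from fractions import Fraction

# Example 3: two bags, equally likely to be chosen
p_bagA = Fraction(1, 2)
p_bagB = Fraction(1, 2)
p_red_given_bagA = Fraction(5, 10)   # Bag A: 5 green, 5 red
p_red_given_bagB = Fraction(7, 10)   # Bag B: 3 green, 7 red

# Total probability of drawing a red ball
p_red = p_bagA * p_red_given_bagA + p_bagB * p_red_given_bagB

# Bayes theorem: P(BagB | Red) = P(Red | BagB) * P(BagB) / P(Red)
p_bagB_given_red = p_red_given_bagB * p_bagB / p_red
print(p_red)             # 3/5
print(p_bagB_given_red)  # 7/12
```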

Check Your Progress 1

1. Is P(X/Y) = P(Y/X)?
2. How can you use probabilities to find whether two events are independent?
3. The MCA batches of University A and University B consist of 20 and 30 students respectively. University A has 10 students who have obtained more than 75% marks and University B has 20 such students. A recruitment agency selects one of these students who have more than 75% marks from the two Universities. What is the probability that the selected student is from University A?

2.3 RANDOM VARIABLES AND BASIC DISTRIBUTIONS

In statistics, in general, you perform random experiments to study particular characteristics of a problem situation. These random experiments, which are performed in an almost identical experimental setup and environment, determine the attributes or factors that may be related to the problem situation. The outcome of such an experiment can take different values based on probability, and is termed a random variable. This section discusses the concept of random variables.

Example 4: Consider that you want to study the random experiment of tossing a coin, say 3 tosses, and you select the outcome "Number of Heads" as your variable, say X. You may define the possible outcomes of the sample space for the tosses as:

Outcome:              HHH  HHT  HTH  HTT  THH  THT  TTH  TTT
Number of Heads (X):   3    2    2    1    2    1    1    0

Figure 2: Mapping of outcomes of the sample space to the random variable

Using the data of Figure 2, you can create the following frequency table, which
can also be converted to probability.
X | Frequency | Probability P(X)
0 | 1 | 1/8
1 | 3 | 3/8
2 | 3 | 3/8
3 | 1 | 1/8
Total | 8 | Sum of all P(X) = 1

Figure 3: The Frequency and Probability of the Random Variable X

Random variables are of two kinds:

• Discrete random variables
• Continuous random variables
Discrete random variables, as the name suggests, can take discrete values only. Figure 3 shows a discrete random variable X. A discrete random variable, as a convention, may be represented using a capital letter. The individual values are represented using lowercase letters; e.g., for the discrete variable X of Figure 3, the discrete values are x0, x1, x2 and x3. Please note that their values are 0, 1, 2 and 3 respectively. Similarly, to represent the individual probabilities, you may use the names p0, p1, p2 and p3. Please also note that the sum of all these probabilities is 1; e.g., in Figure 3, p0 + p1 + p2 + p3 = 1.
Probability Distribution of Discrete Random Variable
For the discrete random variable X, which is defined as the number of head in
three tosses of coin, the pair (xi,pi), for i=0 to 3, defines the probability
distribution of the random variable X. Similarly, you can define the probability
distribution for any discrete random variable. The probability distribution of a
discrete random variable has two basic properties:
• The pi should be greater than or equal to zero, but always less than or
equal to 1.
• The sum of all pi should be 1.
Figure 4 shows the probability distribution of X (the number of heads in three
tosses of coin) in graphical form.

(Bar chart: P(X=0) = 0.125, P(X=1) = 0.375, P(X=2) = 0.375, P(X=3) = 0.125, with the number of heads (X) on the horizontal axis and the probability on the vertical axis.)

Figure 4: Probability Distribution of the Discrete Random Variable X (Number of heads in 3 tosses of a coin)

Another important value defined for a probability distribution is the mean or expected value, which is computed for the random variable X using the following equation:
µ = Σ (xi × pi), summed over i = 0 to n    (9)
Thus, the mean or expected number of heads in three tosses would be:
µ = x0 × p0 + x1 × p1 + x2 × p2 + x3 × p3
µ = 0 × 1/8 + 1 × 3/8 + 2 × 3/8 + 3 × 1/8 = 12/8 = 1.5
Therefore, in a trial of 3 tosses of a coin, the mean number of heads is 1.5.

2.3.1 Binomial Distribution

The Binomial distribution is a discrete distribution, i.e. it describes the
probability distribution of a discrete random variable. It is based on an
experiment consisting of Bernoulli trials, which have the following
characteristics:
• A number of trials are conducted, say n.
• There can be only two possible outcomes of a trial – Success (say s) or
Failure (say f).
• Each trial is independent of all the other trials.
• The probability of the outcome Success (s), as well as Failure (f), is the
same in each and every independent trial.
For example, in the experiment of tossing a coin three times, the outcome
Success is getting a head in a trial. One possible outcome of this experiment is
THT, which is one of the outcomes of the sample space shown in Figure 2.
You may note that for n = 3 and the random variable X, which represents the
number of heads, a Success is getting a Head, while a Failure is getting a Tail.
Thus, THT is actually Failure, Success, Failure. The probability of such cases
can be computed as shown earlier. In general, in the Binomial distribution, the
probability of r successes is represented as:
P(X = r) or p_r = nCr × s^r × f^(n−r)        (10)
where s is the probability of success and f is the probability of failure in
each trial. The value of nCr is computed using the combination formula:
nCr = n! / (r! (n − r)!)        (11)
For the case of three tosses of the coin, where X represents the number of heads
in the three tosses, n = 3 and both s and f are 1/2, the probabilities as per the
Binomial distribution would be:
P(X = 0) or p0 = 3C0 × s^0 × f^(3−0) = 3!/(0!(3−0)!) × (1/2)^0 × (1/2)^3 = 1/8
P(X = 1) or p1 = 3C1 × s^1 × f^(3−1) = 3!/(1!(3−1)!) × (1/2)^1 × (1/2)^2 = 3/8
P(X = 2) or p2 = 3C2 × s^2 × f^(3−2) = 3!/(2!(3−2)!) × (1/2)^2 × (1/2)^1 = 3/8
P(X = 3) or p3 = 3C3 × s^3 × f^(3−3) = 3!/(3!(3−3)!) × (1/2)^3 × (1/2)^0 = 1/8
which is the same as in Figure 2 and Figure 3.
Finally, the mean and standard deviation of the Binomial distribution for n
trials, each having a probability of success s, can be defined using the
following formulas:

μ = n × s        (12a)
σ = √(n × s × (1 − s))        (12b)
Therefore, for the variable X, which represents the number of heads in three
tosses of a coin, the mean and standard deviation are:
μ = n × s = 3 × 1/2 = 1.5
σ = √(n × s × (1 − s)) = √(3 × 1/2 × (1 − 1/2)) = √3/2

The distribution of a discrete random variable, thus, allows you to compute the
probability of occurrence of a specific number of successes, as well as the mean
or expected value of a random probability experiment.
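As a further illustration, the same probabilities, mean and standard deviation
can be obtained with the binom object of SciPy (this is only a sketch, assuming
the SciPy library is available); it uses the three-tosses example with n = 3 and
s = 1/2:

from scipy.stats import binom

n, s = 3, 0.5
for r in range(n + 1):
    # probability of r successes, equation (10)
    print(r, binom.pmf(r, n, s))        # 0.125, 0.375, 0.375, 0.125

print(binom.mean(n, s))                 # 1.5, equation (12a)
print(binom.std(n, s))                  # about 0.866 = sqrt(3)/2, equation (12b)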

2.3.2 Probability Distribution of Continuous Random Variable

A continuous variable is measured using a scale or interval measure. For
example, the height of the students of a class can be measured on an interval
scale. You can also study the probability distribution of a continuous random
variable; however, it is quite different from the distribution of a discrete
variable. Figure 5 shows a sample histogram of the heights of 100 students of a
class. You may notice that it is essentially a grouped frequency distribution.

Figure 5: Histogram of Height of 100 students of a Class

The mean of the heights was 166 and the standard deviation was about 10. The
probability that a student's height lies in the interval 165 to 170 is 0.27.

In general, for a large data set, the distribution of a continuous random
variable is represented as a smooth curve, which has the following
characteristics:
• The probability in each interval lies between 0 and 1. To compute the
probability of an interval, you need to compute the area under the curve between
the starting and end points of that interval.
• The total area under the curve is 1.

2.3.3 The Normal Distribution


An interesting frequency distribution of continuous random variable is the
Normal Distribution, which was first demonstrated by a German Scientist C.F.
Gauss. Therefore, it is sometime also called the Gaussian distribution. The
Normal distribution has the following properties:
• The normal distribution can occur in many real life situations, such as height
distribution of people, marks of students, intelligence quotient of people etc.
• The curve looks like a bell shaped curve.
• The curve is symmetric about the mean value (μ). Therefore, about half of
the probability distribution curve lies towards the left of the mean and the
other half lies towards the right of the mean.
• If the standard deviation of the curve is σ, then about 68% of the data values
would be in the range (μ-σ) to (μ+σ) (Refer to Figure 6)
• About 95% of the data values would be in the range (μ-2σ) to (μ+2σ) (Refer
to Figure 6)
• About 99.7% of the data values would be in the range (μ-3σ) to (μ+3σ) (Refer
to Figure 6).
• The skewness and (excess) kurtosis of a normal distribution are zero.
• The probability density of the normal distribution is represented using a
mathematical equation with parameters μ and σ. You may refer to this equation
in the further readings.


Figure 6: Normal Distribution of Data

Computing probability using Normal Distribution:


The Normal distribution can be used to compute the z-score, which measures the
distance of a value x from the mean in terms of the standard deviation.
For a given continuous random variable X with value x, and a normal probability
distribution with parameters μ and σ, the z-score is computed as:
z = (x − μ) / σ        (13)
You can find the cumulative probability at a particular z-value using the
Normal distribution. For example, the shaded portion of Figure 7 shows the
cumulative probability at z = 1.3; the probability of the shaded portion at this
point is 0.9032.

Figure 7: Computing Probability using Normal Distribution

Standard Normal Distribution is a standardized form of the normal distribution,
which allows comparison of different normal curves. A standard normal curve has
a mean (μ) of zero and a standard deviation (σ) of 1. The z-score for the
standard normal distribution would be:
z = (x − 0) / 1 = x
Therefore, for the standard normal distribution, the z-score is the same as the
value of x. This also means that the interval z = ±2 contains about 95% of the
area under the standard normal curve.
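A small Python sketch using scipy.stats.norm (assuming SciPy is available) can
reproduce the cumulative probability of Figure 7 and the 95% interval mentioned
above:

from scipy.stats import norm

# cumulative probability up to z = 1.3 for the standard normal curve
print(norm.cdf(1.3))                        # about 0.9032, as in Figure 7

# area between z = -1.96 and z = +1.96
print(norm.cdf(1.96) - norm.cdf(-1.96))     # about 0.95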

In addition to the Normal distribution, a large number of other probability
distributions have been studied. Some of these are the Poisson distribution,
Uniform distribution, Chi-square distribution, etc. Each of these distributions
is represented by a characteristic equation involving a set of parameters. A
detailed discussion of these distributions is beyond the scope of this Unit. You
may refer to the Further Readings for more details on these distributions.

2.4 SAMPLING DISTRIBUTION AND THE CENTRAL LIMIT THEOREM

With the basic introduction given above, we next discuss an important aspect of
samples and populations called the sampling distribution. A typical statistical
experiment may be based on a specific sample of data collected by the
researcher. Such data is termed primary data. The question is – can the
statistical results obtained by you using the primary data be applied to the
population? If yes, what may be the accuracy of such an inference? To answer
this question, you must study the sampling distribution.
A sampling distribution is also a probability distribution; however, this
distribution shows the probability of choosing a specific sample from the
population. In other words, a sampling distribution is the probability
distribution of the means of the random samples of the population. The
probability in this distribution defines the likelihood of occurrence of the
specific mean of the sample collected by the researcher. The sampling
distribution determines whether the statistics of the sample fall close to the
population parameters or not. The following example explains the concept of
sampling distribution in the context of a categorical variable.

Example 5: Consider a small population of just 5 persons, who vote on the
question "Should Data Science be made a Core Course in Computer Science?
(Yes/No)". The following table shows the population:

P1    P2    P3    P4    P5    Population Parameter (proportion) (p)
Yes   Yes   No    No    No    0.4

Figure 8: A hypothetical population

Suppose you take a sample of size (n) = 3 and collect random samples. The
following are the possible sets of random samples:
Sample Sample Proportion (𝑝̂ )
P1, P2, P3 0.67
P1, P2, P4 0.67
P1, P2, P5 0.67
P1, P3, P4 0.33
P1, P3, P5 0.33
P1, P4, P5 0.33
P2, P3, P4 0.33
P2, P3, P5 0.33
P2, P4, P5 0.33
P3, P4, P5 0.00
Frequency of all the sample proportions is:
𝑝̂ Frequency
0 1
0.33 6
0.67 3

Figure 9: Sampling proportions

The mean of all these sample proportions = (0×1 + 0.33×6 + 0.67×3)/10
= 0.4 (ignoring round-off errors)


Figure 10: The Sampling Proportion Distribution

Please notice the shape of the distribution of the sample proportions; it looks
close to a Normal distribution curve. In fact, you can verify this by creating
an example with 100 data points and a sample size of 30.

Given a sample size n and a population proportion p of a particular category,
the sampling distribution for the given sample size fulfils the following:
mean proportion = p        (14a)
Standard Deviation = √(p × (1 − p) / n)        (14b)

Let us extend the sampling distribution to interval variables. The following
example explains different aspects of the sampling distribution:
Example 6: Consider a small population of the ages of just 5 persons. The
following table shows the population:

P1 P2 P3 P4 P5 Population mean (μ)


20 25 30 35 40 30

Figure 8: A hypothetical population

Suppose you take a sample of size (n) = 3 and collect random samples. The
following are the possible sets of random samples:
Sample Sample Mean (𝑥̅ )
P1, P2, P3 25
P1, P2, P4 26.67
P1, P2, P5 28.33
P1, P3, P4 28.33
P1, P3, P5 30
P1, P4, P5 31.67
P2, P3, P4 30
P2, P3, P5 31.67
P2, P4, P5 33.33
P3, P4, P5 35

Figure 11: Mean of Samples

The mean of all these sample means = 30, which is the same as the population
mean μ. The histogram of the data is shown in Figure 12.

Figure 12: Frequency distribution of sample means

Given a sample size n and population mean μ, then the sampling distribution for
the given sample size would fulfil the following:
Mean of sample means = μ        (15a)
Standard Deviation of Sample Means = σ / √n        (15b)
Therefore, the z-score computation for sampling distribution will be as per the
following equation:
Note: You can obtain this equation from equation (13); as this is a
distribution of means, x of equation (13) is x̄, and the standard deviation of
the sampling distribution is given by equation (15b).
z = (x̄ − μ) / (σ / √n)        (15c)

Please note that the histogram of the mean of samples is close to normal
distribution.

Such experiments led to the Central Limit Theorem, which proposes the following:
Central Limit Theorem: Assume that a sample of size n is drawn from a population
that has mean μ and standard deviation σ. The central limit theorem states that,
with an increase in n, the sampling distribution, i.e. the distribution of the
means of the samples, approaches closer to a normal distribution.

However, it may be noted that the central limit theorem is applicable only if you have
collected independent random samples, where the size of sample is sufficiently large,
yet it is less than 10% of the population. Therefore, the Example 5 and Example 6 are
not true representations for the theorem, rather are given to illustrate the concept.
Further, it may be noted that the central limit theorem does not put any constraint on
the distribution of population. Equation 15 is a result of central limit theorem.
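The behaviour described by the central limit theorem can also be observed
through simulation. The following NumPy sketch is only illustrative (the
uniform population, the sample size of 30 and the number of samples are
assumptions, not part of the examples above): it draws many random samples from
a non-normal population and examines the distribution of the sample means.

import numpy as np

rng = np.random.default_rng(0)
population = rng.uniform(0, 100, size=100_000)     # a non-normal population

n = 30                                             # sample size (< 10% of the population)
sample_means = [rng.choice(population, size=n, replace=False).mean()
                for _ in range(5_000)]

print(np.mean(sample_means))                       # close to the population mean
print(np.std(sample_means))                        # close to sigma/sqrt(n), equation (15b)
print(population.std() / np.sqrt(n))

A histogram of sample_means would look approximately bell shaped, even though
the population itself is uniform.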

Does the Size of sample have an impact on the accuracy of results?


Consider that a population size is 100,000 and you have collected a sample of
size n = 100, which is sufficiently large to fulfil the requirements of the
central limit theorem. Will there be any advantage in taking a higher sample
size, say n = 400? The next section addresses this issue in detail.

Check Your Progress 2


1. A fair dice is thrown 3 times. Compute the probability distribution of the
number of times an even number appears on the dice.

2. What would be the probability of getting different numbers of heads, if a
fair coin is tossed 4 times?

3. What would be the mean and standard deviation for the random variable of
Question 2.

4. What is the mean and standard deviation for standard normal distribution?

5. A country has a population of 1 billion, out of which 1% are students of
class 10th. A representative sample of 10000 students of class 10 were asked the
question "Is Mathematics difficult or easy?". Assuming that the population
proportion for this question was reported to be 0.36, what would be the
standard deviation of the sampling distribution?

6. Given a quantitative variable, what is the mean and standard deviation of


sampling distribution?


2.5 STATISTICAL HYPOTHESIS TESTING

In general, statistical analysis is mainly used in the following two situations:
S1. To determine if students of class 12 play some sport, a random sample
survey collected data from 1000 students. Of these, 405 students stated
that they play some sport. Using this information, can you infer that
students of class 12 give less importance to sports? Such a decision
would require you to estimate the population parameters.
S2. In order to study the effect of sports on the performance of class 12th
marks, a study was performed. It used random sampling and collected the
data of 1000 students, which included the percentage of marks obtained by
each student and the hours spent by the student on sports per week during
class 12th. This kind of decision can be made through hypothesis testing.

In this section, let us analyse both these situations.

2.5.1 Estimation of Parameters of the Population


One of the simplest ways to estimate a parameter value is point estimation.
The key characteristics of a point estimate are that it should be unbiased, such
as a mean or median that lies towards the centre of the data, and that it should
have as small a standard deviation as possible. For example, a point estimate
for situation S1 above would be that 40.5% of students play some sport. This
point estimate, however, may not be precise and may have some margin of error.
Therefore, a better estimation would be to define an interval that contains the
value of the parameter of the population. This interval, called the confidence
interval, includes the point estimate along with the possible margin of error.
The probability that the chosen confidence interval contains the population
parameter is normally chosen as 0.95. This probability is called the confidence
level. Thus, you can state with 95% confidence that the confidence interval
contains the parameter. Is the value of the confidence level, 0.95, arbitrary?
As you know, the sampling distribution for computing a proportion is
approximately normal if the sample size (n) is large. Therefore, to answer the
question asked above, you may study Figure 13, which shows the probability
distribution of the sampling distribution.


Figure 13: Confidence Level 95% for a confidence interval (non-shaded area).

Since you have selected a confidence level of 95%, you expect that the
proportion of the sample (p̂) lies in the interval (population proportion (p) −
2×(Standard Deviation)) to (population proportion (p) + 2×(Standard
Deviation)), as shown in Figure 13. The probability of occurrence of p̂ in this
interval is 95% (please refer to Figure 6). Therefore, the confidence level is
95%. In addition, note that you do not know the value of p; that is what you
are estimating. Therefore, you would be computing p̂. You may observe in Figure
13 that the value of p will then lie in the interval (p̂ − 2×(Standard
Deviation)) to (p̂ + 2×(Standard Deviation)). The standard deviation of the
sampling distribution can be computed using equation (14b). However, as you are
estimating the value of p, you cannot compute the exact value of the standard
deviation. Rather, you can compute the standard error, which estimates the
standard deviation using the sample proportion (p̂), by using the following
formula:
Standard Error (StErr) = √(p̂ × (1 − p̂) / n)
Therefore, the confidence interval is estimated as (p̂ − 2×StErr) to
(p̂ + 2×StErr). In general, for a specific confidence level, you can use a
specific z-score instead of 2. Therefore, the confidence interval, for large n,
is: (p̂ − z×StErr) to (p̂ + z×StErr).
In practice, you may use confidence levels of 90%, 95% and 99%. The z-scores
used for these confidence levels are 1.65, 1.96 (not 2) and 2.58 respectively.
Example 7: Consider the statement S1 of this section and estimate the
confidence interval for the given data.
For the sample, the proportion of class 12th students who play some sport is:
p̂ = 405/1000 = 0.405
The sample size (n) = 1000
StErr = √(p̂ × (1 − p̂) / n) = √(0.405 × (1 − 0.405) / 1000) = 0.016
Therefore, the confidence interval for the confidence level 95% would be:
(0.405 − 1.96 × 0.016) to (0.405 + 1.96 × 0.016)
0.374 to 0.436
Therefore, with a confidence of 95%, you can state that the proportion of
students of class 12th who play some sport is in the range 37.4% to 43.6%.
How can you reduce the size of this interval? You may observe that StErr is
inversely proportional to the square root of the sample size. Therefore, you
would have to increase the sample size to approximately 4 times to reduce the
standard error to approximately half.
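The interval of Example 7 can be reproduced with a few lines of Python (a
sketch using only the standard math module):

import math

p_hat, n = 405 / 1000, 1000
st_err = math.sqrt(p_hat * (1 - p_hat) / n)     # about 0.016

z = 1.96                                        # z-score for a 95% confidence level
lower, upper = p_hat - z * st_err, p_hat + z * st_err
print(lower, upper)   # close to 0.374 and 0.436; small differences come from rounding StErr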
Confidence Interval to estimate mean
You can find the confidence interval for estimating the mean in a similar
manner, as you have done in the case of proportions. However, in this case you
need to estimate the standard error of the estimated mean using a variation of
equation (15b), as follows:
Standard Error in Sample Mean = s / √n
where s is the standard deviation of the sample.

Example 8: The following table lists the height of a sample of 100 students of
class 12 in centimetres. Estimate the average height of students of class 12.
170 164 168 149 157 148 156 164 168 160
149 171 172 159 152 143 171 163 180 158
167 168 156 170 167 148 169 179 149 171
164 159 169 175 172 173 158 160 176 173

159 160 162 169 168 164 165 146 156 170
163 166 150 165 152 166 151 157 163 189
176 185 153 181 163 167 155 151 182 165
189 168 169 180 158 149 164 171 189 192
171 156 163 170 186 187 165 177 175 165
167 185 164 156 143 172 162 161 185 174

Figure 14: Random sample of height of students of class 12 in centimetres

The sample mean and sample standard deviation are computed and shown below:
Sample Mean (x̄) = 166; Standard Deviation of sample (s) = 11
Therefore, the confidence interval of the mean height of the students of class
12th can be computed as:
Mean height (x̄) = 166
The sample size (n) = 100
Standard Error in Sample Mean = 11 / √100 = 1.1
The confidence interval for the confidence level 95% would be:
(166 − 1.96 × 1.1) to (166 + 1.96 × 1.1)
163.8 to 168.2
Thus, with a confidence of 95%, you can state that the average height of class
12th students is between 163.8 and 168.2 centimetres.
You may please note that in Example 8, we have used the t-distribution for
means, as we have used the sample's standard deviation rather than the
population standard deviation. The t-distribution of means is slightly wider
(has heavier tails) than the z-distribution. The t-value is computed in the
context of the sampling distribution by the following equation:
t = (x̄ − μ) / (s / √n)        (16)
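A hedged sketch of the same computation in Python, using the t-distribution
from SciPy and the summary values of Example 8, is given below:

from scipy import stats

x_bar, s, n = 166, 11, 100
st_err = s / n ** 0.5                      # 1.1, the standard error of the sample mean

t_crit = stats.t.ppf(0.975, df=n - 1)      # about 1.98 for 99 degrees of freedom
print(x_bar - t_crit * st_err,
      x_bar + t_crit * st_err)             # roughly 163.8 to 168.2, as above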

2.5.2 Significance Testing of Statistical Hypothesis

In this section, we discuss how to test the statement S2 given in section 2.5.
A number of experimental studies are conducted in statistics with the objective
of inferring whether the data supports a hypothesis or not. Significance
testing may involve the following phases:
1. Testing pre-conditions on the data:
Prior to performing the test of significance, you should check the
pre-conditions of the test. Most statistical tests require random sampling, a
large amount of data for each possible category being tested, and a normal
distribution of the population.
2. Making the statistical hypotheses: You make statistical hypotheses about the
parameters of the population. There are two basic hypotheses in statistical
testing – the Null Hypothesis and the Alternative Hypothesis.
Null Hypothesis: The null hypothesis either defines a particular value for the
parameter or specifies that there is no difference or no change in the
specified parameters. It is represented as H0.
Alternative Hypothesis: The alternative hypothesis specifies the values of, or
the difference in, the parameter values. It is represented as either H1 or Ha.
We use the convention Ha.
For example, for the statement S2 of Section 2.5, the two hypotheses would be:
H0: There is no effect of hours of study on the marks percentage of class 12th.
Ha: The marks of class 12th improve with the hours of study of the student.

Please note that the hypothesis above is one-sided, as your assumption is that
the marks would increase with hours of study. The second one-sided hypothesis
may relate to a decrease in marks with hours of study. However, in most cases
the hypothesis will be two-sided, which just claims that one variable will
cause a difference in the second. For example, the two-sided hypothesis for
statement S2 would be that the hours of study of students make a difference
(either an increase or a decrease) in the marks of students of class 12th. In
general, one-sided tests are called one-tailed tests and two-sided tests are
called two-tailed tests.

In general, the alternative hypothesis relates to the research hypothesis.
Please also note that the alternative hypothesis given above is a one-way
hypothesis, as it only states the effect in terms of an increase in marks. In
general, you may have an alternative hypothesis which is two-way (increase or
decrease; less or more, etc.).

3. Perform the desired statistical analysis:
Next, you perform exploratory analysis and produce a number of charts to
explore the nature of the data. This is followed by performing a statistical
significance test such as chi-square, independent sample t-test, ANOVA,
non-parametric tests, etc., which is chosen on the basis of the size of the
sample and the type and characteristics of the data. These tests assume the
null hypothesis to be true. A test may generate parameter values based on the
sample and a probability called the p-value, which is evidence against the
null hypothesis. This is shown in Figure 15.

[Normal curve with two shaded tails, each of area (p-value) 0.025]

Figure 15: p-value of test statistics

4. Analysing the results:


In this step, you should analyse your results. As stated in Unit 1, you must not
just draw your conclusion based on statistics, but support it with analytical
reasoning.

Example 9: We demonstrate the problem of finding a relationship between study
hours and marks percentage (S2 of section 2.5), however, by using only sample
data of 10 students (it is hypothetical data, used just for illustration
purposes), which is given as follows:

Weekly Study Hours (wsh):  21  19   7  11  16  17   4   9   7  18
Marks Percentage (mp):     96  92  63  76  89  80  56  70  61  81

In order to find such a relationship, you may like to perform basic exploratory
analysis. In this case, let us make a scatter plot between the two variables,
taking wsh as the independent variable and mp as the dependent variable. This
scatter plot is shown in Figure 16.


Figure 16: Scatter plot of Weekly Study Hours vs. Marks Percentage.

The scatter plot of Figure 16 suggests that the two variables may be
associated. But how do you determine the strength of this association? In
statistics, you use correlation, which may be used to determine the strength of
the linear association between two quantitative variables. This is explained
next.

2.5.3 Example using Correlation and Regression

As stated, correlation is used to determine the strength of a linear
association. But how is the correlation measured?
Consider two quantitative variables x and y, and a set of n pairs of values of
these variables (for example, the wsh and mp values shown in Example 9). You
can compute a correlation coefficient, denoted by r, using the following
equation:
r_xy = [ Σ_(i=1 to n) ((x_i − x̄)/s_x) × ((y_i − ȳ)/s_y) ] / (n − 1)        (16)
where x̄ and ȳ are the sample means and s_x and s_y are the sample standard
deviations of x and y.
The following are the characteristics of the correlation coefficient (r):
• The value of r lies between +1 and -1.
• A positive value of r means that value of y increases with increase in
value of x and the value of y decreases with decrease in value of x.
• A negative value of r means that value of y increases with decrease in
value of x and the value of y decreases with increase in value of x.
• If the value of r is closer to +1 or -1, then it indicates that association is
a strong linear association.
• Simple scaling of one of the variables does not change the correlation.
• Correlation does not specify the dependent and independent variables.
• Please remember that correlation does not imply causation. You have to
establish causation with reasoning.

The data of Example 9 shows a positive correlation. It can be computed as
follows:
Mean of wsh = 12.9; Standard Deviation of wsh (Sample) = 5.98980616
Mean of mp = 76.4; Standard Deviation of mp (Sample) = 13.7210301
r_wsh,mp = 8.6194403 / (10 − 1) = 0.95771559
Therefore, the data shows a strong positive correlation.
18
Basics of Data Science
You may also use any statistical tool to find the correlation. We used
MS-Excel, which gave the following correlation output:
Weekly Study Hours (wsh) Marks Percentage (mp)


Weekly Study Hours (wsh) 1
Marks Percentage (mp) 0.957715593 1
Figure 17: The Correlation coefficient
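If you prefer Python to a spreadsheet, the same coefficient can be computed
with NumPy; the sketch below simply re-enters the data of Example 9:

import numpy as np

wsh = [21, 19, 7, 11, 16, 17, 4, 9, 7, 18]      # Weekly Study Hours
mp  = [96, 92, 63, 76, 89, 80, 56, 70, 61, 81]  # Marks Percentage

print(np.corrcoef(wsh, mp)[0, 1])               # about 0.9577, matching Figure 17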
As the linear correlation between the wsh and mp variables is strong, you may
like to find a line, called the linear regression line, that describes this
association. The accuracy of the regression line, in general, is better for a
higher correlation between the variables.
Single Linear Regression:
A single linear regression predicts a response or dependent variable (say y)
using one explanatory or independent variable (say x). The equation of single
linear regression can be defined as follows:
y_predicted = a + b × x        (17)
Here, y_predicted is the predicted value of the response variable (y), x is the
explanatory variable, a is the intercept with respect to y, and b is called the
slope of the regression line. In general, when you fit a linear regression line
to a set of data, there will be a certain difference between y_predicted and
the observed value of the data (say y_observed). This difference between the
observed value and the predicted value, that is (y_observed − y_predicted), is
called the residual. One of the most used methods of finding the regression
line is the method of least squares, which minimises the sum of squares of
these residuals. The following equation can be used for computing a residual:
Residual = y_observed − y_predicted        (18)
The objective of the least squares method in regression is to minimise the sum
of squares of the residuals of all the n observed values. This sum is given in
the following equation:
SumOfResidualSquares = Σ_(i=1 to n) (y_observed,i − y_predicted,i)²        (19)

Another important issue with a regression model is to determine the predictive
power of the model, which is computed using the square of the correlation (r²).
The value of r² can be computed as follows:
• In case you are not using regression, you can predict the value of y using
the mean. In such a case, the difference between the predicted value and the
observed value is given by the following equation:
ErrorUsingMean = y_observed − ȳ        (20)
• The total sum of squares of this error can be computed using the following
equation:
TotalSumOfSquares = Σ_(i=1 to n) (y_observed,i − ȳ)²        (21)
The use of the regression line reduces the error in the prediction of the value
of y. Equation (19) represents this squared error. Thus, the use of regression
helps in reducing the error. The proportion r² is actually the predictive power
of the regression and is represented using the following equation:
r² = [ Σ_(i=1 to n)(y_observed,i − ȳ)² − Σ_(i=1 to n)(y_observed,i − y_predicted,i)² ] / Σ_(i=1 to n)(y_observed,i − ȳ)²        (22)
As stated earlier, r2 can also be computed by squaring the value of r.

On performing regression analysis on the observed data of Example 9, the
statistics shown in Figure 18 are generated.

Regression Statistics
Multiple R 0.9577
R Square 0.9172
Adjusted R Square 0.9069
Standard Error 4.1872
Observations 10.0000

ANOVA
df SS F Significance F
Regression 1.0000 1554.1361 88.6407 0.0000
Residual 8.0000 140.2639
Total 9.0000 1694.4000

Coefficients Standard Error t Stat P-value


Intercept 48.0991 3.2847 14.6435 0.0000
Weekly Study Hours (wsh) 2.1939 0.2330 9.4149 0.0000
Figure 18: A Selected Regression output

The regression analysis results, as shown above are discussed below:


• Assumptions for the regression model:
o Data sample is collected using random sampling.
o For every value of x, the value of y in the population
§ is normally distributed
§ has same standard deviation
o The mean value of y in the population follows the regression equation
(17)
• Various Null hypothesis related to regression are:
o For the analysis of variance (ANOVA) output in the regression:
§ H0A: All the coefficients of model are zero, therefore, the
model cannot predict the value of y.
o For the Intercept:
§ H0I: Intercept =0.
o For the wsh:
§ H0wsh: wsh =0.
• The Significance F in the ANOVA output is 0; therefore, you can reject the
Null hypothesis H0A and conclude that this model can predict the value of y.
Please note that the high F value supports this observation.
• The p-values related to the intercept and wsh are almost 0; therefore, you
can reject the Null hypotheses H0I and H0wsh.
• The regression line has the equation:
mp_predicted = 48.0991 + 2.1939 × wsh
• You can compute the sum of squares (SS) using Equation (19) and
Equation (21).
• The degree of freedom in the context of statistics is the number of data
items required to compute the desired statistics.

• The term "Multiple R" in the Regression Statistics defines the correlation
between the dependent variable (say y) and the set of independent or
explanatory variables in the regression model. Thus, Multiple R is similar to
the correlation coefficient (r), except that it is used when multiple
regression is performed. Most software expresses the result as Multiple R,
instead of r, in the regression output. Similarly, R Square is used in multiple
regression instead of r². The proposed model has a large r² and, therefore, can
be considered for deployment.
You can go through further readings for more details on all the terms discussed
above.
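The main numbers of Figure 18 can also be reproduced with
scipy.stats.linregress; the following sketch (an illustration, not part of the
original MS-Excel output) fits the single linear regression on the data of
Example 9:

from scipy import stats

wsh = [21, 19, 7, 11, 16, 17, 4, 9, 7, 18]      # Weekly Study Hours
mp  = [96, 92, 63, 76, 89, 80, 56, 70, 61, 81]  # Marks Percentage

result = stats.linregress(wsh, mp)
print(result.intercept)         # about 48.10 (Intercept in Figure 18)
print(result.slope)             # about 2.19 (coefficient of wsh)
print(result.rvalue ** 2)       # about 0.917 (R Square)
print(result.pvalue)            # p-value of the slope, close to 0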
Figure 19 shows the regression line for the data of Example 9. You may observe
that the residuals are the vertical differences between the Marks Percentage
and the predicted marks percentage. These residuals are shown in Figure 20.

[Line fit plot of Marks Percentage (mp) against Weekly Study Hours (wsh),
showing the observed values, the predicted values and the fitted line
y = 2.1939x + 48.099]

Figure 19: The Regression Line


Figure 20: The Residual Plot


2.5.4 Types of Errors in Hypothesis Testing

In sections 2.5.1 and 2.5.2, we have discussed testing the Null hypothesis. You
either reject the Null hypothesis and accept the alternative hypothesis based
on the computed probability or p-value, or you fail to reject the Null
hypothesis. The decisions in such hypothesis testing would be:
• You reject the Null hypothesis at a confidence level of 95% when the p-value
lies in the shaded portion, that is p-value < 0.05 for a two-tailed hypothesis
(that is, both the shaded portions in Figure 15, each of area 0.025). Please
note that in the case of a one-tailed test, you would consider only one shaded
area of Figure 15; therefore, you would be considering p-value < 0.05 in only
one of the two shaded areas.
• You fail to reject the Null hypothesis at a confidence level of 95% when the
p-value > 0.05.

The two decisions stated above could be incorrect, as you are considering a
confidence level of 95%. The following table shows this situation.
Final Decision
The Actual H0 is Rejected, that is, you You fail to reject H0, as you do not
Scenario have accepted the have enough evidence to accept
Alternative hypothesis the Alternative hypothesis
H0 is True This is called a TYPE-I error You have arrived at a correct
decision
H0 is False You have arrived at a correct This is called a TYPE-II error
decision

For example, assume that a medicine is tested for a disease and this medicine
is NOT a cure for the disease. You would make the following hypotheses:
H0: The medicine has no effect on the disease.
Ha: The medicine improves the condition of the patient.
However, if the data is such that, for a confidence level of 95%, the p-value
is computed to be less than 0.05, then you will reject the null hypothesis,
which is a Type-I error. The chance of a Type-I error at this confidence level
is 5%. This error would mean that the medicine gets approval even though it has
no effect on curing the disease.

Now assume instead that a medicine is tested for a disease and this medicine IS
a cure for the disease. The hypotheses still remain the same as above. However,
if the data is such that, for a confidence level of 95%, the p-value is
computed to be more than 0.05, then you will not be able to reject the null
hypothesis, which is a Type-II error. This error would mean that a medicine
which can cure the disease will not be accepted.

Check Your Progress 3


1. A random sample of 100 students was collected to find their opinion about
whether practical sessions in teaching should be increased. About 53 students
voted for increasing the practical sessions. What would be the confidence
interval of the population proportion of students who would favour increasing
the practical sessions? Use confidence levels of 90%, 95% and 99%.

2. The Weight of 20 students, in Kilograms, is given in the following table
65 75 55 60 50 59 62 70 61 57
62 71 63 69 55 51 56 67 68 60
Find the estimated weight of the student population.

3. A class of 10 students was given a validated test before and after
completing a training course. The marks of the students in those tests are
given below:
Marks before Training (mbt): 56 78 87 76 56 60 59 70 61 71
Marks after training (mat):  55 79 88 90 87 75 66 75 66 78
With a confidence level of 95%, can you say that the training course was useful?

2.6 SUMMARY
This Unit introduces you to the basic probability and statistics related to
data science. The unit first introduces the concept of conditional probability,
which defines the probability of an event given that a specific event has
occurred. This is followed by a discussion of Bayes theorem, which is very
useful in finding conditional probabilities. Thereafter, the unit explains the
concepts of discrete and continuous random variables. In addition, the Binomial
distribution and the Normal distribution are also explained. Further, the unit
explains the concepts of the sampling distribution and the central limit
theorem, which form the basis of statistical analysis. The Unit also explains
the use of confidence levels and intervals for estimating the parameters of the
population. Further, the unit explains the process of significance testing by
taking an example related to correlation and regression. Finally, the Unit
explains the concept of errors in hypothesis testing. You may refer to the
further readings for more details on these concepts.

2.7 SOLUTION/ANSWERS

☞ Check Your Progress – 1


1. Is P(Y/X) = P(X/Y)? No. Please check in Example 3: the probability
P(Red/BagB) is 7/10, whereas P(BagB/Red) is 7/12.
2. Consider two independent events A and B; first compute P(A) and P(B).
The probability that any one of these events occurs is computed using
equation (2), which is:
P(A ∪ B) = P(A) + P(B) − P(A ∩ B)
The probability of occurrence of both the events is computed using
equation (4), which is:
P(A ∩ B) = P(A) × P(B)
3. Let Event UniA be "A student is selected from University A". Assuming that
any of the Universities can be selected with equal probability, P(UniA) = 1/2.
Let Event StDis be "A student who has obtained more than 75% marks is
selected". This probability is
P(StDis) = 1/2 × 10/20 + 1/2 × 20/30 = 7/12
In addition, P(StDis/UniA) = 10/20 = 1/2
Therefore,
P(UniA/StDis) = [P(StDis/UniA) × P(UniA)] / P(StDis) = (1/2 × 1/2) / (7/12) = 3/7
Check Your Progress 2
1. As the probability of getting an even number (E) or an odd number (O) is
equal in each throw of the dice, the following eight outcomes are possible:
Outcomes:                                    EEE EEO EOE EOO OEE OEO OOE OOO
Number of times an even number appears (X):   3   2   2   1   2   1   1   0
Therefore, the probability distribution would be:
X Frequency Probability P(X)
0 1 1/8
1 3 3/8
2 3 3/8
3 1 1/8
Total 8 Sum of all P(X) = 1

2. This can be determined by using the Binomial distribution with X = 0, 1, 2, 3
and 4, as follows (s and f are both 1/2):
P(X = 0) or p0 = 4C0 × s^0 × f^(4−0) = 4!/(0!(4−0)!) × (1/2)^0 × (1/2)^4 = 1/16
P(X = 1) or p1 = 4C1 × s^1 × f^(4−1) = 4!/(1!(4−1)!) × (1/2)^1 × (1/2)^3 = 4/16
P(X = 2) or p2 = 4C2 × s^2 × f^(4−2) = 4!/(2!(4−2)!) × (1/2)^2 × (1/2)^2 = 6/16
P(X = 3) or p3 = 4C3 × s^3 × f^(4−3) = 4!/(3!(4−3)!) × (1/2)^3 × (1/2)^1 = 4/16
P(X = 4) or p4 = 4C4 × s^4 × f^(4−4) = 4!/(4!(4−4)!) × (1/2)^4 × (1/2)^0 = 1/16

3. The number of tosses (n) = 4 and s = 1/2, therefore,
μ = n × s = 4 × 1/2 = 2
σ = √(n × s × (1 − s)) = √(4 × 1/2 × (1 − 1/2)) = 1

4. Mean = 0 and Standard deviation = 1.

5. Standard deviation of the sampling distribution =
√(p × (1 − p) / n) = √(0.36 × (1 − 0.36) / 10000) = (0.6 × 0.8) / 100 = 0.0048
The large size of the sample results in high accuracy of results.
6. Mean of sample means = μ
Standard Deviation of Sample Means = σ / √n

Check Your Progress 3


1. The value of the sample proportion p̂ = 53/100 = 0.53
Therefore, StErr = √(p̂ × (1 − p̂) / n) = √(0.53 × (1 − 0.53) / 100) = 0.05
The confidence interval for 90%:
(0.53 ± 1.65 × 0.05), which is 0.4475 to 0.6125
The confidence interval for 95%:
(0.53 ± 1.96 × 0.05), which is 0.432 to 0.628
The confidence interval for 99%:
(0.53 ± 2.58 × 0.05), which is 0.401 to 0.659
2. Sample Mean (x̄) = 61.8; Standard Deviation of sample (s) = 6.787
Sample size (n) = 20
Standard Error in Sample Mean = 6.787 / √20 = 1.52
The confidence interval for the confidence level 95% would be:
(61.8 ± 1.96 × 1.52) = 58.8 to 64.8

3. Analysis: This kind of problem requires us to find out if there is a
significant difference in the means of the test results before and after the
training course. In addition, the sample size is 10 and the same group of
persons is tested twice; therefore, a paired sample t-test may be used to find
the difference of the means. You can follow all the steps of hypothesis testing
for this example.
1. Testing pre-conditions on the data:
• The students who were tested through this training course were
randomly selected.
• The population test scores, in general, are normally distributed.
• The sample size is small; therefore, a robust test may be used.
2. The Hypotheses
H0: mean(mbt) = mean(mat)
H1: mean(mbt) < mean(mat)
3. The results of the analysis are given below
(Please note H1 is one sided hypothesis, as you are trying to find if
training was useful for the students)
t-Test: Paired Two Sample for Means
Marks before Training (mbt) Marks after training (mat)
Mean 67.4 75.9
Variance 112.9333333 124.1
Observations 10 10
df 9
t Stat -2.832459252
P(T<=t) one-tail 0.009821702
t Critical one-tail 1.833112933

4. Analysis of results: The one tail p-value suggests that you reject the
null hypothesis. The difference in the means of the two results is
significant enough to determine that the scores of the student have
improved after the training.
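For reference, a short SciPy sketch of the same paired sample t-test is given
below; it should reproduce the t statistic and the one-tailed p-value of the
table above.

from scipy import stats

mbt = [56, 78, 87, 76, 56, 60, 59, 70, 61, 71]   # marks before training
mat = [55, 79, 88, 90, 87, 75, 66, 75, 66, 78]   # marks after training

t_stat, p_two_tail = stats.ttest_rel(mbt, mat)
print(t_stat)              # about -2.83
print(p_two_tail / 2)      # one-tailed p-value, about 0.0098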

UNIT 3 DATA PREPARATION FOR ANALYSIS
3.0 Introduction
3.1 Objectives
3.2 Need for Data Preparation
3.3 Data preprocessing
3.3.1 Data Cleaning
3.3.2 Data Integration
3.3.3 Data Reduction
3.3.4 Data Transformation
3.4 Selection and Data Extraction
3.5 Data Curation
3.5.1 Steps of Data Curation
3.5.2 Importance of Data Curation
3.6 Data Integration
3.6.1 Data Integration Techniques
3.6.2 Data Integration Approaches
3.7 Knowledge Discovery
3.8 Summary
3.9 Solutions/Answers
3.10 Further Readings

3.0 INTRODUCTION

In the previous unit of this Block, you were introduced to the basic concepts
of conditional probability, Bayes Theorem and probability distributions,
including the Binomial and Normal distributions. That unit also introduced you
to the concepts of the sampling distribution, the central limit theorem and
statistical hypothesis testing. This Unit introduces you to the process of data
preparation for data analysis. Data preparation is one of the most important
processes, as it leads to good quality data, which will result in accurate
results of the data analysis. This unit covers data selection, cleaning,
curation, integration, and knowledge discovery from the stated data. In
addition, this unit gives you an overview of data quality and of how the
preparation of data for analysis is done. You may refer to the further readings
for more details on these topics.

3.1 OBJECTIVES

After finishing this unit, you will be able to:


• Describe the meaning of "data quality"
• Explain basic techniques for data preprocessing
• Use the technique of data selection and extraction
• Define data curation and data integration
• Describe the knowledge discovery.

3.2 NEED FOR DATA PREPARATION

In the present time, data is one of the key resources for a business. Data is
processed to create information; information is integrated to create knowledge.
Since knowledge is power, it has evolved into a modern currency, which is
valued and traded between parties. Everyone wants to discuss the knowledge
and benefits they can gain from data. Data is one of the most significant
resources available to marketers, agencies, publishers, media firms, and others
today for a reason. But only high-quality data is useful. We can determine a data
set's reliability and suitability for decision-making by looking at its quality.
This quality is frequently gauged in degrees, determined by the usefulness of
the data for the intended purpose and by its completeness, accuracy,
timeliness, consistency, validity, and uniqueness. In simpler
terms, data quality refers to how accurate and helpful the data are for the task at
hand. Further, data quality also refers to the actions that apply the necessary
quality management procedures and methodologies to make sure the data is
useful and actionable for the data consumers. A wide range of elements,
including accuracy, completeness, consistency, timeliness, uniqueness, and
validity, influence data quality. Figure 1 shows the basic factors of data quality.

Figure 1: Factors of Data Quality

These factors are explained below:

• Accuracy - The data must be true and reflect events that actually take
place in the real world. Accuracy measures determine how closely the values
agree with verified, correct information sources.
• Completeness - The degree to which the data is complete determines
how well it can provide the necessary values.
• Consistency - Data consistency is the homogeneity of the data across
applications, networks, and when it comes from several sources. For
example, identical datasets should not conflict if they are stored in
different locations.

• Timeliness - Data that is timely is readily available whenever it is
needed. The timeliness factor also entails keeping the data accurate; to
make sure it is always available and accessible and updated in real-time.
• Uniqueness - Uniqueness is defined as the lack of duplicate or redundant
data across all datasets. The collection should contain zero duplicate
records.
• Validity - Data must be obtained in compliance with the firm's defined
business policies and guidelines. The data should adhere to the
appropriate, recognized formats, and all dataset values should be within
the defined range.

Consider yourself a manager at a company, say XYZ Pvt Ltd, who has been tasked
with researching the sales statistics for a specific organization, say ABC. You
immediately get to work on this project by carefully going through the ABC
company's database and data warehouse for the parameters or dimensions (such
as the product, price, and units sold), which may be used in your study. However,
your enthusiasm suffers a major problem when you see that several of the
attributes for different tuples do not have any recorded values. You want to
incorporate the information in your study on whether each item purchased was
marked down, but you find that this data has not been recorded. According to
users of this database system, the data recorded for some transactions
contained mistakes, such as strange values and anomalies.

The three characteristics of data quality—accuracy, completeness, and


consistency—are highlighted in the paragraph above. Large databases and data
warehouses used in the real world frequently contain inaccurate, incomplete, and
inconsistent data. What may be the causes of such erroneous data in the
databases? There may be problems with the data collection tools, which may
result in mistakes during the data-entering process. Personal biases, for example,
when users do not want to submit personal information, they may purposefully
enter inaccurate data values for required fields (for example, by selecting the
birthdate field's presented default value of "January 1"). Disguised missing data
is what we call this. There may also be data transfer errors. The use of
synchronized data transit and consumption may be constrained by technological
limitations, such as a short buffer capacity. Unreliable data may result from
differences in naming conventions, data codes, or input field types (e.g., date).
In addition, cleansing of data may be needed to remove duplicate tuples.

Incomplete data can be caused by a variety of circumstances. Certain properties,


such as customer information for sales transaction data, might not always be
accessible. It is likely that some data was omitted since it was not thought to be
important at the time of input. A misinterpretation or malfunctioning technology
could prevent the recording of essential data. For example, the data, which did
not match the previously stored data, was eliminated. Furthermore, it is likely
that the data's past alterations or histories were not documented. In particular,
for tuples with missing values for some properties, it could be required to infer
missing data.

3.3 DATA PREPROCESSING

Preprocessing is the process of taking raw data and turning it into information
that may be used. Data cleaning, data integration, data reduction and data
transformation, and data discretization are the main phases of data preprocessing
(see Figure 2).

[Diagram: the main phases of data pre-processing – data cleaning, data
integration, data transformation, and data reduction]

Figure 2: Data pre-processing

3.3.1 Data Cleaning

Data cleaning is an essential step in data pre-processing. It is also referred
to as scrubbing, and it is crucial for the construction of a good model,
although it is frequently overlooked. Real-world data typically exhibit
incompleteness, noise, and inconsistency. In addition to addressing
discrepancies, this task entails filling in missing values, smoothing out noisy
data, and eliminating outliers. Data cleansing decreases errors and enhances
data quality. Although it might be a time-consuming and laborious operation, it
is necessary to fix data inaccuracies and delete bad entries.
a. Missing Values
Consider that you need to study the customer and sales data of the ABC
Company. As noted earlier, numerous tuples lack recorded values for a
number of attributes, including customer income. The following techniques
can be used to fill in the values that are missing for this attribute (a
short code sketch after this list illustrates two of them).
i. Ignore the tuple: Typically, this is carried out in the
absence of a class label (assuming the task involves
classification). This method is particularly detrimental
when each attribute has a significantly different percentage
of missing values. By disregarding the remaining
characteristics in the tuple, we avoid using their values.
ii. Manually enter the omitted value: In general, this
strategy is time-consuming and might not be practical for
huge data sets with a substantial number of missing values.
iii. Fill up the blank with a global constant: A single
constant, such as "Unknown" or “−∞”, should be used to
replace all missing attribute values. If missing data are
replaced with, say, "Unknown," the analysis algorithm can
mistakenly think that they collectively comprise valid data.
So, despite being simple, this strategy is not perfect.
iv. To fill in the missing value, use a measure of the
attribute's central tendency (such as the mean or
median): The median should be used for skewed data
distributions, while the mean can be used for normal
(symmetric) data distributions. Assume, for instance, that
the ABC company’s customer income data distribution is
symmetric and that the mean income is INR 50,000/-. Use
this value to fill in the income value that is missing.
v. For all samples that belong to the same class as the
specified tuple, use the mean or median: For instance, if
we were to categorize customers based on their credit risk,
the mean income value of customers who belonged to the
same credit risk category as the given tuple might be used
to fill in the missing value. If the data distribution is skewed
for the relevant class, it is best to utilize the median value.
vi. Fill in the blank with the value that is most likely to be
there: This result can be reached using regression,
inference-based techniques using a Bayesian
formalization, or decision tree induction. As an example,
using the other characteristics of your data's customers, you
may create a decision tree to forecast the income's missing
numbers.
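The following minimal pandas sketch (the column names and values are
illustrative, not taken from the ABC data) shows strategy (iv), filling
with the attribute mean, and strategy (v), filling with the mean of the
same class:

import pandas as pd

df = pd.DataFrame({
    "credit_risk": ["low", "low", "high", "high"],
    "income":      [50000, None, 30000, None],
})

# (iv) fill the missing income with the overall mean of the attribute
df["income_mean_filled"] = df["income"].fillna(df["income"].mean())

# (v) fill the missing income with the mean of the samples of the same class
df["income_class_filled"] = df.groupby("credit_risk")["income"].transform(
    lambda s: s.fillna(s.mean()))

print(df)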
b. Noisy Data
Noise is the variance or random error in a measured variable. It is
possible to recognize outliers, which might be noise, by employing data
visualization tools and basic statistical description techniques (such as
scatter plots and boxplots). Given a numeric attribute such as price, how
can the data be "smoothed" out to reduce the noise? The following are some
of the data-smoothing strategies.
i. Binning: Binning techniques smooth sorted data values
by looking at their "neighbourhood", i.e. the nearby values.
The sorted values are divided into a number of "buckets" or bins.
Binning techniques carry out local smoothing, since they consult
the neighbourhood of values (a short code sketch of binning is
given after this list).
When smoothing by bin means, each value in the bin is
changed to the bin's mean value. As an illustration,
suppose a bin contains three numbers 4, 8 and 15. The
average of these three numbers in the bin is 9.

Consequently, the value nine replaces each of the bin's
original values.
Similarly, smoothing by bin medians, which substitutes
the bin median for each bin value, can be used. Bin
boundaries, i.e. the minimum and maximum values in a specific
bin, can also be used in place of the bin values. This type of
smoothing is called smoothing by bin boundaries; in this method,
the nearest boundary value is used to replace each bin value.
In general, the smoothing effect increases with increasing bin
width. As an alternative, bins may have identical widths, i.e.
constant interval ranges of values.
ii. Regression: Regression is a method for adjusting the data values to a
function and may also be used to smooth out the data. Finding the "best"
line to fit two traits (or variables) is the goal of linear regression, which
enables one attribute to predict the other. As an extension of linear
regression, multiple linear regression involves more than two features
and fits the data to a multidimensional surface.
iii. Outlier analysis: Clustering, for instance, the grouping of comparable
values into "clusters," can be used to identify outliers. It makes sense to
classify values that are outliers as being outside the set of clusters.
iv. Data discretization, a data transformation and data reduction technique,
is an extensively used data smoothing technique. The number of distinct
values for each property is decreased, for instance, using the binning
approaches previously discussed. This functions as a form of data
reduction for logic-based data analysis methods like decision trees,
which repeatedly carry out value comparisons on sorted data. Concept
hierarchies are a data discretization technique that can also be applied to
smooth out the data. The quantity of data values that the analysis process
must process is decreased by a concept hierarchy. For example, the price
variable, which represents the price value of commodities, may be
discretized into “lowly priced”, “moderately priced”, and “expensive”
categories.
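As an illustration of the binning idea mentioned above, the following pandas
sketch (the price values are hypothetical, extending the 4, 8, 15 example)
performs equal-frequency binning and smoothing by bin means:

import pandas as pd

prices = pd.Series([4, 8, 15, 21, 21, 24, 25, 28, 34])

bins = pd.qcut(prices, q=3)                         # three equal-frequency bins
smoothed = prices.groupby(bins).transform("mean")   # smoothing by bin means
print(smoothed.tolist())                            # the first bin {4, 8, 15} becomes 9.0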

Steps of Data Cleaning


The following are the various steps of the data cleaning process.
1. Remove duplicate or irrelevant observations- Duplicate data may be
produced when data sets from various sources are combined, scraped, or data is
obtained from clients or other departments.
2. Fix structural errors - When measuring or transferring data, you may come
across structural mistakes such as unusual naming practices, typographical
errors, or wrong capitalization. Such inconsistencies may lead to mislabelled
categories or classes. For instance, "N/A" and "Not Applicable", which might
both be present in a given document, may create two different categories, even
though they should be treated as the same category of missing values.
3. Managing unwanted outliers - Outliers might cause problems in certain
models. Decision tree models, for instance, are more robust to outliers than
linear regression models. In general, we should not eliminate outliers unless
there is a compelling reason to do so. Sometimes removing them can improve
performance, but not always. Therefore, an outlier should be eliminated only
for a good cause, such as a suspicious measurement that is unlikely to be
present in the real data.
4. Handling missing data - Missing data is a deceptively difficult issue in
machine learning. We cannot just ignore or remove the missing observations;
they must be treated carefully, since they can indicate a serious problem.
Data gaps resemble missing puzzle pieces: dropping the observation is
equivalent to pretending that the puzzle slot is not there, while filling it in
blindly is like forcing in a piece from another puzzle. Furthermore, we need to
be aware of how we record missing data. Instead of just filling it in with the
mean, you can flag the value as missing and fill it with a constant,
effectively letting the model account for the missingness.
5. Validate and QA-You should be able to respond to these inquiries as part of
fundamental validation following the data cleansing process, for example:
o Does the data make sense?
o Does the data abide by the regulations that apply to its particular field?
o Does it support or refute your hypothesis? Does it offer any new
information?
o Can you see patterns in the data that will support your analysis?
o Is there a problem with the data quality?

Methods of Data Cleaning


The following are some of the methods of data cleaning.
1. Ignore the tuples: This approach is not particularly practical because it
can only be used when a tuple has multiple characteristics and missing
values.
2. Fill in the missing value: This strategy is not always practical or very
effective, and the process could take a long time. The missing value has to
be supplied in some way: the most common method is to enter it by hand, but
other options include using the attribute mean or the value with the
highest probability.
3. Binning method: This strategy is fairly easy to comprehend. The values
nearby are used to smooth the sorted data. The information is
subsequently split into a number of equal-sized parts. The various
techniques are then used to finish the task.
4. Regression: With the use of the regression function, the data is
smoothed out. Regression may be multivariate or linear. Multiple
regressions have more independent variables than linear regressions,
which only have one.
5. Clustering: The group is the primary target of this approach. Data are
clustered together in a cluster. Then, with the aid of clustering, the
outliers are found. After that, the comparable values are grouped into a
"group" or "cluster".

3.3.2 Data Integration

Data from many sources, such as files, data cubes, databases (both relational and
non-relational), etc., must be combined during this procedure. Both
homogeneous and heterogeneous data sources are possible. Structured,
unstructured, or semi-structured data can be found in the sources. Redundancies
and inconsistencies can be reduced and avoided with careful integration.

a. Entity Identification Problem: Data integration, which gathers data from


several sources into coherent data storage, like data warehousing, will likely
be required for your data analysis project. Several examples of these sources
include various databases, data cubes, and flat files.
During data integration, there are a number of things to consider. Integration
of schemas and object matching might be challenging. How is it possible to
match comparable real-world things across different data sources? The entity
identification problem describes this. How can a computer or data analyst be
sure that a client's ID in one database and their customer number in another
database refer to the same attribute? Examples of metadata for each attribute
include the name, definition, data type, permissible range of values, and null
rules for handling empty, zero, or null values. Such metadata can be used to
avoid errors during the integration of the schema. The data may also be
transformed with the aid of metadata. For example, in two different instances
of data of an organization, the code for pay data might be "H" for high
income and "S" for small income in one instance of the database. The same
pay code in another instance of a database maybe 1 and 2.
When comparing attributes from one database to another during integration,
the data structure must be properly considered. This is done to make sure
that any referential constraints and functional dependencies on attributes
present in the source system are also present in the target system. For
instance, a discount might be applied to the entire order by one system, but
only to certain items by another. If this is not found before integration, things
in the target system can be incorrectly dismissed.
b. Redundancy and Correlation Analysis: Another crucial problem in data
integration is redundancy. If an attribute (like annual income, for example)
can be "derived" from another attribute or group of data, it may be redundant.
Inconsistent attributes or dimension names can also bring redundancies in
the final data set.
Correlation analysis can identify some redundancies. Based on the available
data, such analysis can quantify the strength of relationships between two
attributes. We employ the chi-square test (χ2) for finding relationships between
nominal data. Numeric attributes can be analyzed using the correlation
coefficient and covariance, which measure how one attribute's values vary with
those of another (a short code sketch illustrating both follows this list).
c. Tuple Duplication: Duplication should be identified at the tuple level in
addition to being caught between attributes (e.g., when, for a particular
unique data entry case, there are two or more identical tuples). Additional
data redundant sources include – the use of denormalized tables, which are
frequently used to increase performance by reducing joins; faulty data entry
or updating some (not all) redundant data occurrences, etc. Inconsistencies
frequently appear between different duplicates. For example, there may be
inconsistency as the same purchaser's name may appear with multiple
addresses within the purchase order database. This might happen if a
database for purchase orders has attributes for the buyer's name and address
rather than a foreign key to this data.

d. Data Value Conflict Detection and Resolution: Data value conflicts must
be found and resolved as part of data integration. As an illustration,
attribute values from many sources may vary for the same real-world thing.
Variations in representation, scale, or encoding may be the cause of this. In
one system, a weight attribute might be maintained in British imperial
units, while in another, metric units. For a hotel chain, the cost of rooms in
several cities could include various currencies, services (such as a
complimentary breakfast) and taxes. Similarly, every university may have
its own curriculum and grading system. When sharing information among
them, one university might use the quarter system, provide three database
systems courses, and grade students from A+ to F, while another would use
the semester system, provide two database systems courses, and grade
students from 1 to 10. Information interchange between two such
universities is challenging because it is challenging to establish accurate
course-to-grade transformation procedures between the two universities.
An attribute in one system might be recorded at a lower abstraction level
than the "identical" attribute in another since the abstraction level of
attributes might also differ. As an illustration, an attribute with the same
name in one database may relate to the total sales of one branch of a
company, however, the same result in another database can refer to the
company's overall regional shop sales.
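The correlation analysis mentioned in point (b) can be sketched as follows; the libraries used (pandas, scipy) and the toy attributes are assumptions made only for illustration.

```python
# A hedged sketch (hypothetical data): a chi-square test for two nominal
# attributes and correlation/covariance for two numeric attributes.
import pandas as pd
from scipy.stats import chi2_contingency

df = pd.DataFrame({
    "gender":    ["M", "F", "F", "M", "F", "M", "F", "M"],
    "preferred": ["fiction", "fiction", "non-fiction", "non-fiction",
                  "fiction", "fiction", "non-fiction", "fiction"],
    "age":       [23, 31, 45, 36, 29, 52, 41, 27],
    "income":    [30, 42, 61, 50, 39, 70, 58, 33],
})

# Chi-square test of independence for the two nominal attributes
contingency = pd.crosstab(df["gender"], df["preferred"])
chi2, p_value, dof, expected = chi2_contingency(contingency)
print("chi-square:", round(chi2, 3), "p-value:", round(p_value, 3))

# Correlation coefficient and covariance for the two numeric attributes
print("correlation:", df["age"].corr(df["income"]))
print("covariance:", df["age"].cov(df["income"]))
```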

3.3.3 Data Reduction

In this phase, data is trimmed. The number of records, attributes, or dimensions


can be reduced. When reducing data, one should keep in mind that the outcomes
from the reduced data should be identical to those from the original data.
Consider that you have chosen some data for analysis from ABC Company’s data
warehouse. The data set will probably be enormous! Large-scale complex data
analysis and mining can be time-consuming, rendering such a study impractical
or unfeasible. Techniques for data reduction can be applied to create a condensed
version of the data set that is considerably smaller while meticulously retaining
the integrity of the original data. In other words, mining the smaller data set
should yield more useful results while effectively yielding the same analytical
outcomes. This section begins with an overview of data reduction tactics and
then delves deeper into specific procedures. Data compression, dimensionality
reduction, and numerosity reduction are all methods of data reduction.
a. Dimensionality reduction refers to the process of lowering the number of
random variables or qualities. Principal components analysis and wavelet
transformations are techniques used to reduce data dimensions by
transforming or rescaling the original data. By identifying and eliminating
duplicated, weakly relevant, or irrelevant features or dimensions, attribute
subset selection is a technique for dimensionality reduction.
b. Numerosity reduction strategies substitute different, more compact forms
of data representation for the original data volume. Both parametric and non-
parametric approaches are available. In parametric techniques, a model is
employed to estimate the data, which frequently necessitates the
maintenance of only the data parameters rather than the actual data. (Outliers
may also be stored.) Examples include log-linear models and regression.
Nonparametric methods include the use of histograms, clustering, sampling,
and data cube aggregation to store condensed versions of the data.
c. Transformations are used in data compression to create a condensed or
"compressed" version of the original data. Lossless data compression is used
when the original data can be recovered from the compressed data without
any information being lost. Alternatively, lossy data reduction is employed
when we can only precisely retrieve a fraction of the original data. There are
a number of lossless string compression algorithms; however, they typically
permit only a small amount of data manipulation. Techniques for reducing
numerosity and dimensions can also be categorized as data compression
methods.
There are other additional structures for coordinating data reduction techniques.
The time saved by analysis on a smaller data set should not be "erased" or
outweighed by the computational effort required for data reduction.
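As an illustration of the tactics above, the sketch below applies principal components analysis for dimensionality reduction and random sampling as a simple numerosity-reduction step; the synthetic data and the 90% variance threshold are assumptions made for the example.

```python
# A minimal sketch (synthetic data) of dimensionality and numerosity reduction.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(42)
X = rng.normal(size=(1000, 10))          # 1000 records, 10 attributes

# Dimensionality reduction: keep enough components for 90% of the variance
pca = PCA(n_components=0.90)
X_reduced = pca.fit_transform(X)
print("reduced shape:", X_reduced.shape)

# Numerosity reduction: keep a 10% random sample of the records
sample_idx = rng.choice(len(X_reduced), size=len(X_reduced) // 10, replace=False)
X_sample = X_reduced[sample_idx]
print("sample shape:", X_sample.shape)
```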
Data Discretization: - It is regarded as a component of data reduction. The
notional qualities take the place of the numerical ones. By converting values to
interval or concept labels, data discretization alters numerical data. These
techniques enable data analysis at various levels of granularity by automatically
generating concept hierarchies for the data. Binning, histogram analysis, decision
tree analysis, cluster analysis, and correlation analysis are examples of
discretization techniques. Concept hierarchies for nominal data may be produced
based on the definitions of the schema and the distinct attribute values for every
attribute.
3.3.4 Data Transformation

This procedure is used to change the data into formats that are suited for the
analytical process. Data transformation involves transforming or consolidating
the data into analysis-ready formats. The following are some data transformation
strategies:
a. Smoothing, which attempts to reduce data noise. Binning, regression, and
grouping are some of the methods.
b. Attribute construction (or feature construction), wherein, in order to aid
the analysis process, additional attributes are constructed and added from the
set of attributes provided.
c. Aggregation, where data is subjected to aggregation or summary procedures
to calculate monthly and yearly totals; for instance, the daily sales data may
be combined to produce monthly or yearly sales. This process is often used
to build a data cube for data analysis at different levels of abstraction.
d. Normalization, where the attribute data is resized to fit a narrower range:
−1.0 to 1.0; or 0.0 to 1.0.
e. Discretization, where interval labels replace the raw values of a numeric
attribute (e.g., age) (e.g., 0–10, 11–20, etc.) or conceptual labels (e.g., youth,
adult, senior). A concept hierarchy for the number attribute can then be
created by recursively organizing the labels into higher-level concepts. To
meet the demands of different users, more than one concept hierarchy might
be built for the same characteristic.
f. Concept hierarchy creation using nominal data allows for the
extrapolation of higher-level concepts like a street to concepts like a city or
country. At the schema definition level, numerous hierarchies for nominal
qualities can be automatically created and are implicit in the database
structure.
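Points (d) and (e) above can be sketched briefly; the age values, bin edges, and labels below are hypothetical.

```python
# A hedged sketch of min-max normalization to [0.0, 1.0] (point d) and
# discretization into conceptual labels (point e); all values are hypothetical.
import pandas as pd

ages = pd.Series([18, 25, 33, 41, 58, 64], name="age")

age_min, age_max = ages.min(), ages.max()
ages_scaled = (ages - age_min) / (age_max - age_min)   # now in [0.0, 1.0]

labels = pd.cut(ages, bins=[0, 30, 55, 120], labels=["youth", "adult", "senior"])
print(pd.DataFrame({"age": ages, "scaled": ages_scaled, "label": labels}))
```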

Check Your Progress 1:


1. What is meant by data preprocessing?

2. Why is preprocessing important?

3. What are the 5 characteristics of data processing?

4. What are the 5 major steps of data preprocessing?

5. What is data cleaning?

6. What is the importance of data cleaning?

7. What are the main steps of Data Cleaning?

3.4 SELECTION AND DATA EXTRACTION

The process of choosing the best data source, data type, and collection tools is
known as data selection. Prior to starting the actual data collection procedure,
data selection is conducted. This concept makes a distinction between selective
data reporting (excluding data that is not supportive of a study premise) and
active/interactive data selection (using obtained data for monitoring
activities/events or conducting secondary data analysis). Data integrity may be
impacted by how acceptable data are selected for a research project.

The main goal of data selection is to choose the proper data type, source, and
tool that enables researchers to effectively solve research issues. This decision
typically depends on the discipline and is influenced by the research that has
already been done, the body of that research, and the availability of the data
sources.

Integrity issues may arise, when decisions about which "appropriate" data to
collect, are primarily centred on cost and convenience considerations rather than
the data's ability to successfully address research concerns. Cost and
convenience are unquestionably important variables to consider while making a
decision. However, researchers should consider how much these factors can
skew the results of their study.

Data Selection Issues


When choosing data, researchers should be conscious of a few things, including:
• Researchers can appropriately respond to the stated research questions when
the right type and appropriate data sources are used.
• Appropriate methods for obtaining a representative sample are used.
• The appropriate tools for gathering data are used. It is difficult to separate
the choice of data type and source from the tools used to get the data. The
type/source of data and the procedures used to collect it should be
compatible.

Types and Sources of Data: Different data sources and types can be displayed in
a variety of ways. There are two main categories of data:
• Quantitative data are expressed as numerical measurements at the interval
and ratio levels.
• Qualitative data can take the form of text, images, music, and video.

Although preferences within scientific disciplines differ as to which type of data


is preferred, some researchers employ information from quantitative and
qualitative sources to comprehend a certain event more thoroughly.
Researchers get data from people that may be qualitative (such as by studying
child-rearing techniques) or quantitative (biochemical recording markers and
anthropometric measurements). Field notes, journals, laboratory notes,
specimens, and firsthand observations of people, animals, and plants can all be
used as data sources. Data type and source interactions happen frequently.
Choosing the right data is discipline-specific and primarily influenced by the
investigation's purpose, the body of prior research, and the availability of data
sources. The following list of questions will help you choose the right data type
and sources:
• What is the research question?
• What is the investigation's field of study? (This establishes the guidelines for
any investigation. Selected data should not go beyond what is necessary for
the investigation.)
• According to the literature (prior research), what kind of information should
be gathered?
• Which form of data—qualitative, quantitative, or a combination of both—
should be considered?
Data extraction is the process of gathering or obtaining many forms of data
from numerous sources, many of which may be erratically organized or wholly
unstructured. Consolidating, processing and refining data enable information to
be kept in a centralized area so that it can be altered. These venues could be
local, online, or a combination of the two.
Data Extraction and ETL
To put the importance of data extraction into perspective, it is helpful to quickly
assess the ETL process as a whole.
1. Extraction: One or more sources or systems are used to collect the data.
Relevant data is located, identified, and then prepared for processing or
transformation during the extraction phase. One can finally analyse them for
business knowledge by combining various data types through extraction.
2. Transformation: It can be further refined after the data has been effectively
extracted. Data is cleansed, sorted, and structured as part of the
transformation process. Duplicate entries will be removed, missing values
will be filled in or removed, and audits will be performed, for example, in
order to offer data that is reliable, consistent, and usable.
3. Loading: The high-quality, converted data is subsequently sent to a single,
centralized target location for storage and analysis.
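A minimal ETL sketch under stated assumptions is given below: the CSV content, the column names, and the SQLite target 'warehouse.db' are all hypothetical, and io.StringIO stands in for a real source file.

```python
# A minimal, illustrative ETL sketch (extract, transform, load).
import io
import sqlite3
import pandas as pd

csv_source = io.StringIO(
    "order_id,order_date,amount\n"
    "1,2023-01-05,250\n"
    "1,2023-01-05,250\n"        # duplicate row
    "2,2023-01-03,400\n"
    ",2023-01-09,120\n"         # missing key value
)

# 1. Extraction: collect the data from the source
raw = pd.read_csv(csv_source)

# 2. Transformation: cleanse, sort, and structure the data
clean = (
    raw.drop_duplicates()
       .dropna(subset=["order_id"])
       .sort_values("order_date")
)

# 3. Loading: store the transformed data in a central target
with sqlite3.connect("warehouse.db") as conn:
    clean.to_sql("sales", conn, if_exists="replace", index=False)
```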

Data extraction tools


The tools listed below can be used for tasks other than a simple extraction. They
can be grouped into the following categories:
1. ScrapeStorm: One data extraction tool you may consider is ScrapeStorm.
It is software that uses AI to scrape the web or gather data. It is
compatible with Windows, Mac, or Linux operating systems and has a
simple and straightforward visual operation. This program automatically
detects objects like emails, numbers, lists, forms, links, photos, and
prices. When transferring extracted data to Excel, MongoDB, CSV,
HTML, TXT, MySQL, SQL Server, PostgreSQL, Google Sheets, or
WordPress, it can make use of a variety of export strategies.
2. Altair Monarch: Monarch is desktop-based, self-service, and does not
require any programming. It can link to various data sources, including
big data, cloud-based data, and both organized and unstructured data. It
connects to, cleans, and processes data with high speed and no errors. It
employs more than 80 built-in data preparation functions. Less time is
wasted on making data legible so that more time can be spent on creating
higher-level knowledge.
3. Klippa: The processing of contracts, invoices, receipts, and passports can
be done using the cloud with Klippa. For the majority of documents, the
conversion time may be between one and five seconds. The data
classification and manipulation may be done online, round-the-clock, and
supports a variety of file types, including PDF, JPG, and PNG. It can also
convert between JSON, PDF/A, XLSX, CSV, and XML. Additionally,
the software handles file sharing, custom branding, payment processing,
cost management, and invoicing management.
4. NodeXL: NodeXL Basic is a free, open-source add-in extension for
Microsoft Excel 2007, 2010, 2013, and 2016. Since the software is an add-on, data
integration is not performed; instead, it focuses on social network
analytics. Advanced network analytics, text and sentiment analysis, and
robust report generating are extra capabilities included with NodeXL Pro.

Check Your Progress 2:


1. What is the data selection process??
2. What is Data Extraction and define the term ETL?
3. What are the challenges of data extraction?

3.5 DATA CURATION

Data curation is creating, organizing and managing data sets so that people
looking for information can access and use them. It comprises compiling,
arranging, indexing, and categorizing data for users inside of a company, a
group, or the general public. To support business decisions, academic needs,
scientific research, and other initiatives, data can be curated. Data curation is a
step in the larger data management process that helps prepare data sets for usage
in business intelligence (BI) and analytics applications. In other cases, the
curation process might be fed with ready-made data for ongoing management
and maintenance. In organizations without particular data curator employment,
data stewards, data engineers, database administrators, data scientists, or
business users may fill that role.

3.5.1 Steps of data curation

There are numerous tasks involved in curating data sets, which can be divided
into the following main steps.

• The data that will be required for the proposed analytics applications
should be determined.
• Map the data sets and note the metadata that goes with them.
• Collect the data sets.
• The data should be ingested into a system, a data lake, a data
warehouse, etc.
• Cleanse the data to remove abnormalities, inconsistencies, and
mistakes, including missing values, duplicate records, and spelling
mistakes.
• Model, organize, and transform the data to prepare it for specific
analytics applications.
• To make the data sets accessible to users, create searchable indexes
of them.
• Maintain and manage the data in compliance with the requirements
of continuous analytics and the laws governing data privacy and
security.
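The indexing and cataloguing steps can be illustrated with a very small, in-memory sketch; the data set names, columns, and the register_dataset helper below are hypothetical and are not part of any particular curation tool.

```python
# A simple illustrative sketch of recording metadata for a curated data set
# so that it can later be indexed and searched; all names are hypothetical.
from datetime import date

catalog = []   # a minimal, in-memory "searchable index" of curated data sets

def register_dataset(name, source, description, columns, steward):
    """Add one curated data set's metadata to the catalog."""
    catalog.append({
        "name": name,
        "source": source,
        "description": description,
        "columns": columns,
        "steward": steward,
        "curated_on": date.today().isoformat(),
    })

register_dataset(
    name="monthly_sales_clean",
    source="sales.csv",
    description="De-duplicated monthly sales with missing regions filled",
    columns=["order_id", "order_date", "region", "amount"],
    steward="data-curation-team",
)

# A trivial keyword search over the catalog
print([d for d in catalog if "sales" in d["name"]])
```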
3.5.2 Importance of Data Curation
The following are the reasons for performing data curation.
1. Helps to organize pre-existing data for a corporation: Businesses produce
a large amount of data on a regular basis, however, this data can
occasionally be lacking. When a customer clicks on a website, adds
something to their cart, or completes a transaction, an online clothes
retailer might record that information. Data curators assist businesses in
better understanding vast amounts of information by assembling prior
data into data sets.
2. Connects professionals in different departments: When a company
engages in data curation, it often brings together people from several
departments who might not typically collaborate. Data curators might
collaborate with stakeholders, system designers, data scientists, and data
analysts to collect and transfer information.
3. Maintains high data quality: High-quality data typically uses organizational techniques that make it
simple to grasp and have fewer errors. Curators may make sure that a
company's research and information continue to be of the highest caliber
because the data curation process entails cleansing the data. Removing
unnecessary information makes research more concise, which may
facilitate better data set structure.
4. Makes data easy to understand: Data curators make sure there are no
errors and utilize proper formatting. This makes it simpler for specialists
who are not knowledgeable about a research issue to comprehend a data
set.
5. Allows for higher cost and time efficiency: A business may spend more
time and money organizing and distributing data if it does not regularly
employ data curation. Because prior data is already organized and
distributed, businesses that routinely do data curation may be able to save
time, effort, and money. Businesses can reduce the time it takes to obtain
and process data by using data curators, who handle the data.

Check Your Progress 3:

1. What is the Importance of Data Curation?

2. Explain Data Curation.

3. What are the goals of data curation?

4. What are the benefits of data curation?

3.6 DATA INTEGRATION

Data integration creates coherent data storage by combining data from several
sources. Smooth data integration is facilitated by the resolution of semantic
heterogeneity, metadata, correlation analysis, tuple duplicate identification, and
data conflict detection. It is a tactic that combines data from several sources so
that consumers may access it in a single, consistent view that displays their
status. Systems can communicate using flat files, data cubes, or numerous
databases. Data integration is crucial because it maintains data accuracy while
providing a consistent view of dispersed data. It helps the analysis tools extract
valuable information, which in turn helps the executive and management make
tactical choices that will benefit the company.

3.6.1 Data Integration Techniques


Manual Integration-When integrating data, this technique avoids employing
automation. The data analyst gathers, purifies, and integrates the data to create
actionable data. A small business with a modest amount of data can use this
approach. Nevertheless, the extensive, complex, and ongoing integration will
take a lot of time. It takes time because every step of the process must be
performed manually.
Middleware Integration-Data from many sources are combined, normalized, and
stored in the final data set using middleware software. This method is employed
when an organization has to integrate data from historical systems into modern
systems. Software called middleware serves as a translator between antiquated
and modern systems. You could bring an adapter that enables the connection of
two systems with various interfaces. It only works with specific systems.
Application-based integration- To extract, transform, and load data from various
sources, it uses software applications. Although this strategy saves time and
effort, it is a little more difficult because creating such an application requires
technical knowledge.

Uniform Access Integration- This method integrates information from a wider
range of sources. In this instance, however, the data is left in its initial place and
is not moved. To put it simply, this technique produces a unified view of the
combined data. The integrated data does not need to be saved separately because
the end user only sees the integrated view.

3.6.2 Data Integration Approaches


There are two basic data integration approaches. These are –
Tight Coupling- It combines data from many sources into a single physical
location using ETL (Extraction, Transformation, and Loading) tools.
Loose Coupling- With loose coupling, the data remains in the actual source
databases, which is the most efficient place to store it. This method offers an
interface that receives a user query, converts it into a format that the source
databases can understand, and transmits the query directly to the source
databases to get the answer.
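A tight-coupling style integration step can be sketched with pandas; the two sources, the key names, and the pay-code recoding below are hypothetical and echo the entity identification and data value conflict examples discussed earlier.

```python
# A hedged sketch: records from two hypothetical sources are matched on the
# customer entity and combined into a single, consistent view.
import pandas as pd

# Source 1 uses "cust_id", source 2 uses "customer_number" for the same entity
crm = pd.DataFrame({"cust_id": [101, 102, 103],
                    "name": ["Asha", "Ravi", "Meena"]})
billing = pd.DataFrame({"customer_number": [101, 102, 104],
                        "pay_code": ["H", "S", "H"]})

# Resolve the entity identification problem by mapping both keys to one name
billing = billing.rename(columns={"customer_number": "cust_id"})

# Resolve a data value conflict: recode "H"/"S" to a common representation
billing["pay_code"] = billing["pay_code"].map({"H": "high income", "S": "small income"})

integrated = crm.merge(billing, on="cust_id", how="outer")
print(integrated)
```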

Check Your Progress 4:


1. What is meant by data integration?
2. What is an example of data integration?
3. What is the purpose of data integration?

3.7 KNOWLEDGE DISCOVERY


Knowledge discovery in databases (KDD) is the process of obtaining pertinent
knowledge from a body of data. This well-known knowledge discovery
method includes several processes, such as data preparation and selection, data
cleansing, incorporating prior knowledge about the data sets, and interpreting
precise answers from the observed results.
Marketing, fraud detection, telecommunications, and manufacturing are some of
the key KDD application areas. In the last ten years, the KDD process has
reached its pinnacle. Inductive learning, Bayesian statistics, semantic query
optimization, knowledge acquisition for expert systems, and information theory
are just a few of the numerous discovery-related methodologies it now houses.
Extraction of high-level knowledge from low-level data is the ultimate objective.
Due to the accessibility and quantity of data available today, knowledge
discovery is a challenge of astounding importance and necessity. Given how
swiftly the topic has expanded, it is not surprising that professionals and experts
today have access to a variety of treatments.

Steps of Knowledge Discovery


1. Developing an understanding of the application domain: Knowledge
discovery starts with this preliminary step. It establishes the framework for
selecting the best course of action for a variety of options, such as
transformation, algorithms, representation, etc. The individuals in charge of a
KDD project need to be aware of the end users' goals as well as the environment
in which the knowledge discovery process will take place.

2. Selecting and producing the data set that will be used for discovery -Once
the objectives have been specified, the data that will be used for the knowledge
discovery process should be identified. Determining what data is accessible,
obtaining essential information, and then combining all the data for knowledge
discovery into one set are the factors that will be considered for the procedure.
Knowledge discovery is important since it extracts knowledge and insight from
the given data. This provides the framework for building the models.
3. Preprocessing and cleansing – This step helps in increasing the data
reliability. It comprises data cleaning, like handling the missing quantities and
removing noise or outliers. In this situation, it might make use of sophisticated
statistical methods or an analysis algorithm. For instance, the goal of the Data
Mining supervised approach may change if it is determined that a certain
attribute is unreliable or has a sizable amount of missing data. After developing
a prediction model for these features, missing data can be forecasted. A variety
of factors affect how much attention is paid to this level. However, breaking
down the components is important and frequently useful for enterprise data
frameworks.
4. Data Transformation-This phase entails creating and getting ready the
necessary data for knowledge discovery. Here, techniques of attribute
transformation (such as discretization of numerical attributes and functional
transformation) and dimension reduction (such as feature selection, feature
extraction, record sampling etc.) are employed. This step, which is frequently
very project-specific, can be important for the success of the KDD project.
Proper transformation results in proper analysis and proper conclusions.
5. Prediction and description- The decisions to use classification, regression,
clustering, or any other method can now be made. Mostly, this uses the KDD
objectives and the decisions made in the earlier phases. A forecast and a
description are two of the main objectives of knowledge discovery. The
visualization aspects are included in descriptive knowledge discovery. Inductive
learning, which generalizes a sufficient number of prepared models to produce
a model either explicitly or implicitly, is used by the majority of knowledge
discovery techniques. The fundamental premise of the inductive technique is
that the prepared model holds true for the examples that follow.
6. Deciding on knowledge discovery algorithm -We now choose the strategies
after determining the technique. In this step, a specific technique must be chosen
to be applied while looking for patterns with numerous inducers. If precision
and understandability are compared, the former is improved by neural networks,
while decision trees improve the latter. There are numerous ways that each meta-
learning system could be successful. The goal of meta-learning is to explain why
a data analysis algorithm is successful or unsuccessful in solving a particular
problem. As a result, this methodology seeks to comprehend the circumstances
in which a data analysis algorithm is most effective. Every algorithm has
parameters and learning techniques, including tenfold cross-validation or a
different division for training and testing.
7. Utilizing the Data Analysis Algorithm-Finally, the data analysis algorithm
is put into practice. The approach might need to be applied several times before
producing a suitable outcome at this point. For instance, by rotating the
algorithms, you can alter variables like the bare minimum of instances in a single
decision tree leaf.
8. Evaluation-In this stage, the patterns, principles, and dependability of the
results of the knowledge discovery process are assessed and interpreted in light
of the objective outlined in the preceding step. Here, we take into account the
preprocessing steps and how they impact the final results. As an illustration, add
a feature in step 4 and then proceed. The primary considerations in this step are
the understanding and utility of the induced model. In this stage, the identified
knowledge is also documented for later use.
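A compressed, illustrative walk-through of several of these steps is sketched below using pandas and scikit-learn; the customer attributes, the choice of k-means with two clusters, and the silhouette score as the evaluation measure are all assumptions made for the example.

```python
# A hedged sketch of selection, cleansing, transformation, mining, evaluation.
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Selection: two attributes chosen for the analysis task (hypothetical data)
data = pd.DataFrame({
    "annual_spend": [120, 150, 900, 950, 130, 980, 140, 910, np.nan],
    "visits":       [4,   5,   20,  22,  3,   25,  6,   21,  5],
})

# Preprocessing and cleansing: handle the missing value
data["annual_spend"] = data["annual_spend"].fillna(data["annual_spend"].median())

# Transformation: rescale the attributes
X = StandardScaler().fit_transform(data)

# Data analysis / mining: cluster the customers into two groups
model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

# Evaluation: a simple quality measure for the discovered pattern
print("silhouette score:", round(silhouette_score(X, model.labels_), 3))
```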
Check Your Progress 5:
1. What is Knowledge Discovery?
2. What are the Steps involved in Knowledge Discovery?
3. What are knowledge discovery tools?
4. Explain the process of KDD.

3.8 SUMMARY

Despite the development of several methods for preparing data, the intricacy of
the issue and the vast amount of inconsistent or unclean data mean that this field
of study is still very active. This unit gives a general overview of data pre-
processing and describes how to turn raw data into usable information. The
preprocessing of the raw data included data integration, data reduction,
transformation, and discretization. In this unit, we have discussed five different
data-cleaning techniques that can make data more reliable and produce high-
quality results. Building, organizing, and maintaining data sets is known as data
curation. A data curator usually determines the necessary data sets and makes
sure they are gathered, cleaned up, and changed as necessary. The curator is also
in charge of providing users with access to the data sets and information related
to them, such as their metadata and lineage documentation. The primary goal of
the data curator is to make sure users have access to the appropriate data for
analysis and decision-making. Data integration is the procedure of fusing
information from diverse sources into a single, coherent data store. The unit also
introduced knowledge discovery techniques and procedures.

3.9 SOLUTIONS/ANSWERS

Check Your Progress 1:


1. As a part of data preparation, data preprocessing refers to any type of
processing done on raw data to get it ready for a data processing
technique. It has long been regarded as a crucial first stage in the data
mining process.

2. It raises the reliability and accuracy of the data. Preprocessing data can
increase the correctness and quality of a dataset, making it more
dependable by removing missing or inconsistent data values brought by
human or computer mistakes. It ensures consistency in data.

3. Data quality is characterized by five characteristics: correctness,
completeness, reliability, relevance, and timeliness.

4. The five major steps of data preprocessing are:


• Data quality assessment
• Data cleaning
• Data transformation
• Data reduction

5. The practice of correcting or deleting inaccurate, damaged, improperly


formatted, duplicate, or incomplete data from a dataset is known as data
cleaning. There are numerous ways for data to be duplicated or
incorrectly categorized when merging multiple data sources.

6. Data cleansing, sometimes referred to as data cleaning or scrubbing, is


the process of locating and eliminating mistakes, duplication, and
irrelevant data from a raw dataset. Data cleansing, which is a step in the
preparation of data, ensures that the cleaned data is used to create
accurate, tenable visualizations, models, and business choices.

7. Step 1: Remove irrelevant data; Step 2: Deduplicate your data; Step 3:


Fix structural errors; Step 4: Deal with missing data; Step 5: Filter out
data outliers; Step 6: Validate your data.

Check Your Progress 2:


1. The process of retrieving data from the database that are pertinent to
the analysis activity is known as data selection. Sometimes the data
selection process comes before data transformation and consolidation.
2. Data extraction is the process of gathering or obtaining many forms of
data from numerous sources, many of which may be erratically
organized or wholly unstructured. The process of extracting,
transforming, and loading data is called ETL. Thus, ETL integrates
information from several data sources into a single consistent data
store, which can be a data warehouse or data analytics system.
3. The cost and time involved in extracting data, as well as the accuracy
of the data, are obstacles. The correctness of the data depends on the
quality of the data source, which can be an expensive and time-
consuming procedure.

Check Your Progress 3:


1. It entails gathering, organizing, indexing, and cataloguing information
for users within an organization, a group, or the wider public. Data
curation can help with academic needs, commercial decisions,
scientific research, and other endeavors.
2. The process of producing, arranging, and managing data sets so that
users who are looking for information can access and use them is
known as data curation. Data must be gathered, organized, indexed, and
catalogued for users within an organization, group, or the broader
public.
3. By gathering pertinent data into organized, searchable data assets, data
curation's overarching goal is to speed up the process of extracting
insights from raw data.
4. The benefits of data curation are:
o Easily discover and use data.
o Ensure data quality.
o Maintain metadata linked with data.
o Ensure compliance through data lineage and classification.

Check Your Progress 4:


1. Data integration is used to bring together data from several sources to
give people a single perspective. Making data more readily available,
easier to consume, and easier to use by systems and users is the
foundation of data integration.
2. In the case of customer data integration, information about each
customer is extracted from several business systems, such as sales,
accounts, and marketing, and combined into a single picture of the
client for use in customer service, reporting, and analysis.
3. Data integration combines data collected from various platforms to
increase its value for your company. It enables your staff to collaborate
more effectively and provide more for your clients. You cannot access
the data collected in different systems without data integration.

Check Your Progress 5:


1. Knowledge discovery is the labour-intensive process of extracting
implicit, previously unknown, and potentially useful information from
databases.
2. Steps of Knowledge Discovery:
• Developing an understanding of the application domain
• Selecting and producing the data set that will be used for the discovery
• Preprocessing and cleansing
• Data Transformation
• Prediction and description
• Deciding on a data analysis algorithm
• Utilizing the data analysis algorithm
• Evaluation
3. The process can benefit from a variety of qualitative and quantitative
methods and techniques, such as knowledge surveys, questionnaires,
one-on-one and group interviews, focus groups, network analysis, and
observation. It can be used to locate communities and specialists.
4. Knowledge Discovery from Data, often known as KDD, is another
commonly used phrase that is treated as a synonym for data mining.
Others see data mining as just a crucial stage in the knowledge
discovery process when intelligent techniques are used to extract data
patterns. The steps involved in knowledge discovery from data are as
follows:
• Data cleaning (to remove noise or irrelevant data).
• Data integration (where multiple data sources may be combined).
• Data selection (where data relevant to the analysis task are retrieved
from the database).
• Data transformation (where data are consolidated into forms
appropriate for mining by performing summary or aggregation
functions, for example).
• Data mining (an important process where intelligent methods are
applied in order to extract data patterns).
• Pattern evaluation (to identify the fascinating patterns representing
knowledge based on some interestingness measures).
• Knowledge presentation (where knowledge representation and
visualization techniques are used to present the mined knowledge to
the user).

3.10 FURTHER READINGS

References

Data Preprocessing in Data Mining - GeeksforGeeks. (2019, March 12). GeeksforGeeks;


GeeksforGeeks. https://www.geeksforgeeks.org/data-preprocessing-in-data-mining/

Data Cleaning in Data Mining - Javatpoint. (n.d.). Www.Javatpoint.Com. Retrieved February 11,
2023, from https://www.javatpoint.com/data-cleaning-in-data-mining

Data Integration in Data Mining - GeeksforGeeks. (2019, June 27). GeeksforGeeks;


GeeksforGeeks. https://www.geeksforgeeks.org/data-integration-in-data-mining/

Dowd, R., Recker, R.R., Heaney, R.P. (2000). Study subjects and ordinary patients. Osteoporos
Int. 11(6): 533-6.

Fourcroy, J.L. (1994). Women and the development of drugs: why can’t a woman be more like a
man? Ann N Y Acad Sci, 736:174-95.

Goehring, C., Perrier, A., Morabia, A. (2004). Spectrum Bias: a quantitative and graphical analysis
of the variability of medical diagnostic test performance. Statistics in Medicine, 23(1):125-35.

Gurwitz,J.H., Col. N.F., Avorn, J. (1992). The exclusion of the elderly and women from clinical
trials in acute myocardial infarction. JAMA, 268(11): 1417-22.

Hartt, J., Waller, G. (2002). Child abuse, dissociation, and core beliefs in bulimic disorders. Child
Abuse Negl. 26(9): 923-38.

Kahn, K.S, Khan, S.F, Nwosu, C.R, Arnott, N, Chien, P.F.(1999). Misleading authors’ inferences
in obstetric diagnostic test literature. American Journal of Obstetrics and Gynaecology., 181(1`),
112-5.

KDD Process in Data Mining - GeeksforGeeks. (2018, June 11). GeeksforGeeks;


GeeksforGeeks. https://www.geeksforgeeks.org/kdd-process-in-data-mining/

Maynard, C., Selker, H.P., Beshansky, J.R.., Griffith, J.L., Schmid, C.H., Califf, R.M., D’Agostino,
R.B., Laks, M.M., Lee, K.L., Wagner, G.S., et al. (1995). The exclusions of women from clinical
trials of thrombolytic therapy: implications for developing the thrombolytic predictive instrument
database. Med Decis Making (Medical Decision making: an international journal of the Society
for Medical Decision Making), 15(1): 38-43.

Pratt, M. K. (2022, January 31). What is Data Curation? - Definition from


SearchBusinessAnalytics. Business Analytics; TechTarget.
https://www.techtarget.com/searchbusinessanalytics/definition/data-curation

Robinson, D., Woerner, M.G., Pollack, S., Lerner, G. (1996). Subject selection bias in clinical:
data from a multicenter schizophrenia treatment center. Journal of Clinical Psychopharmacology,
16(2): 170-6.

Sharpe, N. (2002). Clinical trials and the real world: selection bias and generalisability of trial
results. Cardiovascular Drugs and Therapy, 16(1): 75-7.

Walter, S.D., Irwig, L., Glasziou, P.P. (1999). Meta-analysis of diagnostic tests with imperfect
reference standards. J Clin Epidemiol., 52(10): 943-51.

What is Data Extraction? Definition and Examples | Talend. (n.d.). Talend - A Leader in Data
Integration & Data Integrity. Retrieved February 11, 2023, from
https://www.talend.com/resources/data-extraction-defined/

Whitney, C.W., Lind, B.K., Wahl, P.W. (1998). Quality assurance and quality control in
longitudinal studies. Epidemiologic Reviews, 20(1): 71-80.

UNIT 4: DATA VISUALIZATION AND
INTERPRETATION
Structure
4.0 Introduction
4.1 Objectives
4.2 Different types of plots
4.3 Histograms
4.4 Box plots
4.5 Scatter plots
4.6 Heat map
4.7 Bubble chart
4.8 Bar chart
4.9 Distribution plot
4.10 Pair plot
4.11 Line graph
4.12 Pie chart
4.13 Doughnut chart
4.14 Area chart
4.15 Summary
4.16 Answers
4.17 References

4.0 INTRODUCTION
The previous units of this course cover different aspects of data analysis, including the
basics of data science, basic statistical concepts related to data science and data
pre-processing. This unit explains the different types of plots used for data visualization
and interpretation. It describes how each plot is constructed and discusses the use cases
associated with the different visualization plots. The unit will help you appreciate the
real-world need for a workforce trained in visualization techniques and will help you
design, develop, and interpret visual representations of data. It also defines the best
practices associated with the construction of different types of plots.

4.1 OBJECTIVES
After going through this unit, you will be able to:
• Explain the key characteristics of various types of plots for data visualization;
• Explain how to design and create data visualizations;
• Summarize and present the data in meaningful ways;
• Define appropriate methods for collecting, analysing, and interpreting the
numerical information.

4.2 DIFFERENT TYPES OF PLOTS


As more and more data become available to us today, there are more varieties of charts
and graphs than ever before. In fact, the amount of data that we produce, acquire, copy,
and use is expected to nearly double by 2025. Data visualisation is therefore crucial and
serves as a powerful tool for organisations. One can benefit from graphs and charts in
the following ways:
• Encouraging the group to act proactively.
• Showcasing progress toward the goal to the stakeholders
• Displaying core values of a company or an organization to the audience.

Moreover, data visualisation can bring heterogeneous teams together around new
objectives and foster the trust among the team members. Let us discuss about various
graphs and charts that can be utilized in expression of various aspects of businesses.

4.3 HISTOGRAMS
A histogram visualises the distribution of data across distinct groups with continuous
classes. It is represented with a set of rectangular bars with widths equal to the class
intervals and areas proportional to the frequencies in the respective classes. A histogram
may hence be defined as a graphic of a frequency distribution that is grouped and has
continuous classes. It provides an estimate of the distribution of values, their extremes,
and the presence of any gaps or out-of-the-ordinary numbers. They are useful in
providing a basic understanding of the probability distribution.

Constructing a Histogram: To construct a histogram, the data is grouped into specific


class intervals, or “bins” and plotted along the x-axis. These represent the range of the
data. Then, the rectangles are constructed with their bases along the intervals for each
class. The height of these rectangles is measured along the y-axis representing the
frequency for each class interval. It's important to remember that in these
representations, every rectangle is next to another because the base spans the spaces
between class boundaries.
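A histogram of this kind can be drawn with matplotlib as sketched below; the exam scores and the class intervals used are hypothetical.

```python
# A minimal sketch of a histogram with explicit class intervals (bins).
import matplotlib.pyplot as plt

scores = [35, 42, 47, 51, 55, 58, 61, 64, 66, 68, 71, 73, 75, 78, 82, 88, 93]

plt.hist(scores, bins=[30, 40, 50, 60, 70, 80, 90, 100], edgecolor="black")
plt.xlabel("Score (class intervals)")
plt.ylabel("Frequency")
plt.title("Distribution of exam scores")
plt.show()
```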

Use Cases: When it is necessary to illustrate or compare the distribution of specific


numerical data across several ranges of intervals, histograms can be employed. They
can aid in visualising the key meanings and patterns associated with a lot of data. They
may help a business or organization in decision-making process. Some of the use cases
of histograms include-

• Distribution of salaries in an organisation


• Distribution of height in one batch of students of a class, student performance
on an exam,
• Customers by company size, or the frequency of a product problem.
Best Practices

• Analyse various data groups: The best data groupings can be found by
creating a variety of histograms.
• Break down compartments using colour: The same chart can display a
second set of categories by colouring the bars that represent each category.
Types of Histogram
Normal distribution: In a normal distribution, the probability that points will occur on
each side of the mean is the same. This means that points on either side of the mean
could occur.

Example: Consider the following bins, which show the frequency of housefly wing
lengths measured in tenths of a millimetre.

Bin Frequency Bin Frequency


36-38 2 46-48 19
38-40 4 48-50 15
40-42 10 50-52 10
42-44 15 52-54 4
44-46 19 54-56 2

Bimodal Distribution: This distribution has two peaks. In the case of a bimodal
distribution, the data must be segmented before being analysed as normal distributions
in their own right.
Example:

Variable Frequency
0 2
1 6
2 4
3 2
4 4
5 6
6 4

(Figure: histogram of the above data titled "Bimodal Distribution", with frequency on the y-axis and the variable on the x-axis, showing two peaks.)

Right-skewed distribution: A distribution that is skewed to the right is sometimes


referred to as a positively skewed distribution. A right-skewed distribution is one that
has a greater percentage of data values on the left and a lesser percentage on the right.
Whenever the data have a range boundary on the left side of the histogram, a right-
skewed distribution is frequently the result.
Example:
Left-skewed distribution: A distribution that is skewed to the left is sometimes
referred to as a negatively skewed distribution. A distribution that is left-skewed will
have a greater proportion of data values on the right side of the distribution and a lesser
proportion of data values on the left. When the data have a range limit on the right side
of the histogram, a right-skewed distribution commonly results. An alternative name
for this is a right-tailed distribution.
Example:

A random distribution: A random distribution is characterised by the absence of any


clear pattern and the presence of several peaks. When constructing a histogram using a
random distribution, it is possible that several distinct data attributes will be blended
into one. As a result, the data ought to be partitioned and investigated independently.
Example:

Edge Peak Distribution: When there is an additional peak at the edge of the
distribution that does not belong there, this type of distribution is called an edge peak
distribution. Unless you are very positive that your data set has the expected number of
outliers, this almost always indicates that you have plotted (or collected) your data
incorrectly (i.e. a few extreme views on a survey).
Comb Distribution: Because the distribution seems to resemble a comb, with
alternating high and low peaks, this type of distribution is given the name "comb
distribution." Rounding off an object might result in it having a comb-like form. For
instance, if you are measuring the height of the water to the nearest 10 centimetres but
your class width for the histogram is only 5 centimetres, you may end up with a comb-
like appearance.

Example
Histogram for the population data of a group of 86 people:

Age Group (in years) Population Size


20-25 23
26-30 18
31-35 15
36-40 6
41-45 11
46-50 13
TOTAL 86
(Figure: histogram of the above population data, with the age-group bins on the x-axis and the population size (frequency) on the y-axis.)

Check Your Progress 1


1. What is the difference between a Bar Graph and a Histogram?
……………………………………………………………………………………

……………………………………………………………………………………

2. Draw a Histogram for the following data:

Class Interval Frequency


0 − 10 35
10 − 20 70
20 − 30 20
30 − 40 40
40 − 50 50

3. Why is histogram used?


……………………………………………………………………………………

……………………………………………………………………………………
4. What do histograms show?
………………………………………………………………………………………
………………………………………………………………………………………

4.4 BOX PLOTS


When displaying data distributions using the five essential summary statistics of
minimum, first quartile, median, third quartile, and maximum, box-and-whisker plots,
also known as boxplots, are widely employed. It is a visual depiction of data that aids
in determining how widely distributed or how much the data values change. These
boxplots make it simple to compare the distributions since it makes the centre, spread,
and overall range understandable. They are utilised for data analysis wherein the
graphical representations are used to determine the following:
1. Shape of Distribution
2. Central Value
3. Variability of Data
Constructing a Boxplot: The two components of the graphic are described by their
names: the box, which shows the median of the data along with the first and third
quartiles (the 25th and 75th percentiles), and the whiskers, which show the spread of the
remaining data. The difference between the third and first quartiles is called the
interquartile range (IQR). The whiskers can also display the maximum and minimum
points in the data, and points lying beyond 1.5 × IQR from the box can be identified as
suspected outliers.
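A boxplot can be produced with matplotlib as in the sketch below; the marks for the three sections are hypothetical.

```python
# A short sketch of side-by-side boxplots; outliers beyond 1.5 x IQR are
# drawn as individual points by default.
import matplotlib.pyplot as plt

section_a = [59, 96, 78, 96, 65, 78, 68, 96, 85, 93, 94, 67]
section_b = [65, 73, 57, 79, 55, 65, 61, 98, 63, 88, 66, 59]
section_c = [82, 66, 81, 73, 94, 56, 85, 56, 85, 68, 94, 86]

plt.boxplot([section_a, section_b, section_c], labels=["A", "B", "C"])
plt.ylabel("Marks")
plt.title("Marks distribution by section")
plt.show()
```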

Use Cases: A boxplot is frequently used to demonstrate whether a distribution is


skewed and whether the data set contains any potential outliers, or odd observations.
Boxplots are also very useful for comparing or involving big data sets. Examples of box
plots include plotting the:
• Gas efficiency of vehicles
• Time spent reading across readers
Best Practices
• Cover the points within the box: This aids the viewer in concentrating on the
outliers.
• Box plot comparisons between categorical dimensions: Box plots are
excellent for quickly comparing dataset distributions.
Example

Subject Section A Section B Section C


English 59 65 82
Math 96 73 66
Science 78 57 81
Economics 96 79 73
English 65 55 94
Math 78 65 56
Science 68 61 85
Economics 96 98 56
English 85 63 85
Math 93 88 68
Science 94 66 94
Economics 67 59 86
English 82 66 96
Math 64 79 63
Science 55 90 97
Economics 73 89 95
English 89 66 75
Math 57 81 73
Science 67 92 88
Economics 78 65 69
The boxplots clearly show that Section B has performed poorly in English, whereas
Section C has performed poorly in Math. Section A has a mostly balanced performance,
but its students' marks are the most dispersed.
Check Your Progress 2
1. How to correctly interpret a boxplot?
……………………………………………………………………………………

……………………………………………………………………………………
2. What are the most important parts of a box plot?
……………………………………………………………………………………

……………………………………………………………………………………

3. What are the uses of a box plot?


………………………………………………………………………………………………

………………………………………………………………………………

4. How do you describe the distribution of a box plot?


……………………………………………………………………………………...
……………………………………………………………………………………...

4.5 SCATTER PLOTS

Scatter plot is the most commonly used chart when observing the relationship between
two quantitative variables. It works particularly well for quickly identifying possible
correlations between different data points. The relationship between multiple variables
can be efficiently studied using scatter plots, which show whether one variable is a good
predictor of another or whether they normally fluctuate independently. Multiple distinct
data points are shown on a single graph in a scatter plot. Following that, the chart can
be enhanced with analytics like trend lines or cluster analysis. It is especially useful for
quickly identifying potential correlations between data points.

Constructing a Scatter Plot: Scatter plots are mathematical diagrams or plots that rely
on Cartesian coordinates. Each observation is drawn as a point (usually a circle) whose
horizontal position is given by one variable and whose vertical position is given by the
other. A single colour is enough to show the relationship between these two variables,
but the colour of the circles can be used to encode a third, categorical variable, and the
circle size can encode a further numerical variable.
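A scatter plot can be drawn with matplotlib as sketched below; the temperature and sales observations are illustrative values of the kind used in the example later in this section.

```python
# A minimal sketch of a scatter plot relating temperature to ice-cream sales.
import matplotlib.pyplot as plt

temperature = [17, 18, 22, 29, 27, 28, 31, 23, 24, 25, 35, 36, 38, 41, 42]
sales = [1750, 1603, 1500, 2718, 2667, 3422, 3681, 2734, 2575, 2869,
         3057, 3846, 3500, 3496, 3984]

plt.scatter(temperature, sales)
plt.xlabel("Temperature (deg C)")
plt.ylabel("Sale of ice-cream (Rs)")
plt.title("Temperature vs ice-cream sales")
plt.show()
```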

Use Cases: Scatter charts are great in scenarios where you want to display both
distribution and the relationship between two variables.
• Display the relationship between time-on-platform (How Much Time Do
People Spend on Social Media) and churn (the number of people who stopped
being customers during a set period of time).
• Display the relationship between salary and years spent at company
Best Practices
• Analyze clusters to find segments: Based on your chosen variables, cluster
analysis divides up the data points into discrete parts.
• Employ highlight actions: You can rapidly identify which points in your
scatter plots share characteristics by adding a highlight action, all the while
keeping an eye on the rest of the dataset.
• Mark customization: Customize individual marks to add a simple visual cue to your
graph that makes it easy to distinguish between various point groups.

Example

Temperature (in deg C) Sale of Ice-Cream


17 ₹ 1,750.00
18 ₹ 1,603.00
22 ₹ 1,500.00
29 ₹ 2,718.00
27 ₹ 2,667.00
28 ₹ 3,422.00
31 ₹ 3,681.00
23 ₹ 2,734.00
24 ₹ 2,575.00
25 ₹ 2,869.00
35 ₹ 3,057.00
36 ₹ 3,846.00
38 ₹ 3,500.00
41 ₹ 3,496.00
42 ₹ 3,984.00
29 ₹ 4,109.00
39 ₹ 5,336.00
35 ₹ 5,197.00
42 ₹ 5,426.00
45 ₹ 5,365.00
(Figure: scatter plot titled "Relationship between the Temperature and the Sale of Ice-Cream", with temperature in degree C on the x-axis and the sale of ice-cream in ₹ on the y-axis.)

Please note that a linear trendline has been fitted to the scatter plot, indicating a positive
change in the sales of ice-cream with an increase in temperature.
Check Your Progress 3

1. What are the characteristics of a scatter plot?


……………………………………………………………………………………

……………………………………………………………………………………

2. What components make up a scatter plot?

………………………………………………………………………………………
………………………………………………………………………………………

3. What is the purpose of a scatter plot?

……………………………………………………………………………………

……………………………………………………………………………………
4. What are the 3 types of correlations that can be inferred from scatter plots?
……………………………………………………………………………………

……………………………………………………………………………………

4.6 HEAT MAP

Heatmaps are two-dimensional graphics that show data trends through colour
shading. They are an example of part to whole chart in which values are represented
using colours. A basic heat map offers a quick visual representation of the data. A
user can comprehend complex data sets with the help of more intricate heat maps.
Heat maps can be presented in a variety of ways, but they all have one thing in
common: they all make use of colour to convey correlations between data
values. Heat maps are more frequently utilised to present a more comprehensive
view of massive amounts of data. It is especially helpful because colours are simpler
to understand and identify than plain numbers.

Heat maps are highly flexible and effective at highlighting trends. Heatmaps are
naturally self-explanatory, in contrast to other data visualisations that require
interpretation. The greater the quantity/volume, the deeper the colour (the higher
the value, the tighter the dispersion, etc.). Heat Maps dramatically improve the
ability of existing data visualisations to quickly convey important data insights.

Use Cases: Heat Maps are primarily used to better show the enormous amounts of
data contained inside a dataset and help guide users to the parts of data
visualisations that matter most.
• Average monthly temperatures across the years
• Departments with the highest amount of attrition over time.
• Traffic across a website or a product page.
• Population density/spread in a geographical location.
Best Practices
• Select the proper colour scheme: This style of chart relies heavily on
colour, therefore it's important to pick a colour scheme that complements
the data.
• Specify a legend: As a related point, a heatmap must typically contain a
legend describing how the colours correspond to numerical values.

Example
Region-wise monthly sale of a SKU (stock-keeping unit)
MONTH
ZONE JAN FEB MAR APR MAY JUN JUL AUG SEP OCT NOV DEC
NORTH 75 84 61 95 77 82 74 92 58 90 54 83
SOUTH 50 67 89 61 91 77 80 72 82 78 58 63
EAST 62 50 83 95 83 89 72 96 96 81 86 82
WEST 69 73 59 73 57 61 58 60 97 55 81 92

The distribution of sales is shown in the sample heatmap above, broken down by
zone and spanning a 12-month period. Like in a typical data table, each cell displays
a numeric count, but the count is also accompanied by a colour, with higher counts
denoting deeper hues.
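
A minimal Python sketch of this heatmap is given below, assuming the pandas, seaborn
and matplotlib libraries are available; the variable names are illustrative only.

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

months = ["JAN", "FEB", "MAR", "APR", "MAY", "JUN",
          "JUL", "AUG", "SEP", "OCT", "NOV", "DEC"]
data = {
    "NORTH": [75, 84, 61, 95, 77, 82, 74, 92, 58, 90, 54, 83],
    "SOUTH": [50, 67, 89, 61, 91, 77, 80, 72, 82, 78, 58, 63],
    "EAST":  [62, 50, 83, 95, 83, 89, 72, 96, 96, 81, 86, 82],
    "WEST":  [69, 73, 59, 73, 57, 61, 58, 60, 97, 55, 81, 92],
}
sales = pd.DataFrame(data, index=months).T          # rows = zones, columns = months

sns.heatmap(sales, annot=True, fmt="d", cmap="Blues")  # annotate each cell with its count
plt.title("Region-wise monthly sale of a SKU")
plt.show()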

Check Your Progress 4

1. What type of input is needed for a heat map?


……………………………………………………………………………………

……………………………………………………………………………………
2. What kind of information does a heat map display?
……………………………………………………………………………………

……………………………………………………………………………………
3. What can be seen in heatmap?
……………………………………………………………………………………

……………………………………………………………………………………

4.7 BUBBLE CHART

Bubble diagrams are used to show the relationships between different variables. They
are frequently used to represent data points in three dimensions, specifically when the
bubble size, y-axis, and x-axis are all present. Using location and size, bubble charts
demonstrate relationships between data points. However, bubble charts have a restricted
data size capability since too many bubbles can make the chart difficult to read.
Although technically not a separate type of visualisation, bubbles can be used to show
the relationship between three or more measurements in scatter plots or maps by adding
complexity. By altering the size and colour of circles, large amounts of data are
presented concurrently in visually pleasing charts.

Constructing a Bubble Chart: For each observation of a pair of numerical variables


(A, B), a bubble or disc is drawn and placed in a Cartesian coordinate system
horizontally according to the value of variable A and vertically according to the value
of variable B. The area of the bubble serves as a representation for a third numerical
variable (C). Using various colours in various bubbles, you may even add a fourth
dataset (D: numerical or categorical).
By using location and proportions, bubble charts are frequently used to compare and
illustrate the relationships between circles that have been classified. Bubble Charts'
overall image can be utilised to look for patterns and relationships.

Use Cases: Usually, the positioning and ratios of the size of the bubbles/circles on this
chart are used to compare and show correlations between variables. Additionally, it is
utilised to spot trends and patterns in data.
• AdWords’ analysis: CPC vs Conversions vs share of total conversions
• Relationship between life expectancy, GDP per capita and population size
Best Practices:
• Add colour: A bubble chart can gain extra depth by using colour.
• Set bubble size in appropriate proportion.
• Overlay bubbles on maps: From bubbles, a viewer can immediately determine
the relative concentration of data. These are used as an overlay to provide the
viewer with context for geographically-related data.
Example

Item Code Units Sold Sales (in Rs.) Profit %


PC001 325 ₹ 14,687.00 22%
PC002 1130 ₹ 16,019.00 18%
PC003 645 ₹ 16,100.00 25%
PC004 832 ₹ 12,356.00 9%
PC005 1200 ₹ 21,500.00 32%
PC006 925 ₹ 16,669.00 21%
PC007 528 ₹ 13,493.00 13%
PC008 750 ₹ 18,534.00 14%
PC009 432 ₹ 13,768.00 6%
PC0010 903 ₹ 22,043.00 11%

The three variables in this example are sales, profits, and the number of units sold.
Therefore, all three variables and their relationship can be displayed simultaneously
using a bubble chart.
[Bubble chart: Sales and Profit versus the Quantity sold — number of units sold on the
x-axis, sales (in INR) on the y-axis, profit % represented by the bubble size]
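
A minimal Python sketch of this bubble chart is given below, assuming matplotlib is
available; the scaling factor used for the bubble sizes is an arbitrary choice made only
for readability.

import matplotlib.pyplot as plt

units_sold = [325, 1130, 645, 832, 1200, 925, 528, 750, 432, 903]
sales      = [14687, 16019, 16100, 12356, 21500, 16669, 13493, 18534, 13768, 22043]
profit_pct = [22, 18, 25, 9, 32, 21, 13, 14, 6, 11]

sizes = [p * 40 for p in profit_pct]          # scale profit % to a readable bubble area
plt.scatter(units_sold, sales, s=sizes, alpha=0.5)
plt.xlabel("Number of units sold")
plt.ylabel("Sales (in INR)")
plt.title("Sales and Profit versus the Quantity sold")
plt.show()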

Check Your Progress 5

1. What is bubble chart?


……………………………………………………………………………………

……………………………………………………………………………………
2. What is a bubble chart used for?
……………………………………………………………………………………

……………………………………………………………………………………
3. What is the difference between scatter plot and bubble chart?
……………………………………………………………………………………

……………………………………………………………………………………
4. What is bubble size in bubble chart?
……………………………………………………………………………….

……………………………………………………………………………….

4.8 BAR CHART

A bar chart is a graphical depiction of numerical data that uses rectangles (or
bars) with equal widths and varied heights. In the field of statistics, bar charts
are one of the methods for handling data.

Constructing a Bar Chart: The x-axis corresponds to the horizontal line, and
the y-axis corresponds to the vertical line. The y-axis represents frequency in
this graph. Write the names of the data items whose values are to be noted along
the x-axis that is horizontal.
Along the horizontal axis, choose the uniform width of bars and the uniform
gap between the bars. Pick an appropriate scale to go along the y-axis that runs
vertically so that you can figure out how high the bars should be based on the
values that are presented. Determine the heights of the bars using the scale you
selected, then draw the bars using that information.

Types of Bar chart: Bar Charts are mainly classified into two types:
Horizontal Bar Charts: Horizontal bar charts are the type of graph that are
used when the data being analysed is to be depicted on paper in the form of
horizontal bars with their respective measures. When using a chart of this type,
the categories of the data are indicated on the y-axis.

Example:

Vertical Bar Charts: A vertical bar chart displays vertical bars on graph (chart)
paper. These rectangular bars in a vertical orientation represent the
measurement of the data. The quantities of the variables that are written along
the x-axis are represented by these rectangular bars.

Example:
We can further divide bar charts into two basic categories:

Grouped Bar Charts: The grouped bar graph is also referred to as the clustered
bar graph (graph). It is valuable for at least two separate types of data. The
horizontal (or vertical) bars in this are categorised according to their position.
If, for instance, the bar chart is used to show three groups, each of which has
numerous variables (such as one group having four data values), then different
colours will be used to indicate each value. When there is a close relationship
between two sets of data, each group's colour coding will be the same.

Example:

Stacked Bar Charts: The composite bar chart is also referred to as the stacked
bar chart. It illustrates how the overall bar chart has been broken down into its
component pieces. We utilise bars of varying colours and clear labelling to
determine which category each item belongs to. As a result, in a chart with
stacked bars, each parameter is represented by a single rectangular bar. Multiple
segments, each of a different colour, are displayed within the same bar. The
various components of each separate label are represented by the various
segments of the bar. It is possible to draw it in either the vertical or horizontal
plane.

Example:

Use cases: Bar charts are typically employed to display quantitative data. The
following is a list of some of the applications of the bar chart-
• In order to clearly illustrate the relationships between various variables,
bar charts are typically utilised. When presented in a pictorial format,
the parameters can be more quickly and easily envisioned by the user.
• Bar charts are the quickest and easiest way to display extensive
amounts of data while saving time. It is utilised for studying trends over
extended amounts of time.

Best Practices:
• Use a common zero valued baseline
• Maintain rectangular forms for your bars
• Consider the ordering of category level and use colour wisely.

Example:

Region Sales

East 6,123

West 2,053
South 4,181

North 3,316

[Horizontal bar chart: Sales By Region — East 6,123; South 4,181; North 3,316; West 2,053]
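
A minimal Python sketch of the horizontal bar chart above is given below, assuming
matplotlib is available; barh() draws horizontal bars, while bar() would draw vertical ones.

import matplotlib.pyplot as plt

regions = ["East", "West", "South", "North"]
sales = [6123, 2053, 4181, 3316]

plt.barh(regions, sales)          # horizontal bars; use plt.bar() for a vertical bar chart
plt.xlabel("Sales")
plt.title("Sales By Region")
plt.show()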

Check your progress 6:

1. When should we use bar chart?


……………………………………………………………………………
……………………………………………………………………………
2. What are the different types of bar chart?
……………………………………………………………………………
……………………………………………………………………………
3. Draw a vertical bar chart.
……………………………………………………………………………
……………………………………………………………………………
4. Draw a horizontal bar chart.

Use the following data to answer the question 3 and 4:

4.9 DISTRIBUTION PLOT

Distribution charts are used to visually assess the distribution of sample data by
contrasting the actual distribution of the data with the theoretical values expected from
a particular distribution. In addition to more traditional hypothesis tests, distribution
plots can therefore be used to establish whether the sample data follow a particular
distribution. The distribution plot is useful for analysing the relationship between the
range of a set of numerical data and its distribution; the values of the data are
represented as points (or bars) along an axis.

Constructing a Distribution Plot: You must utilise one or two dimensions, together
with one measure, in a distribution plot. You will get a single line visualisation if you
only use one dimension. If you use two dimensions, each value of the outer, second
dimension will produce a separate line.

Use Cases: Distribution of a data set shows the frequency of occurrence of each
possible outcome of a repeatable event observed many times. For instance:
• Height of a population.
• Income distribution in an economy
• Test scores listed by percentile.

Best Practices:
• It is advisable to have equal class widths.
• The class intervals should be mutually exclusive and non-overlapping.
• Open-ended classes at the lower and upper limits (e.g., <10, >100) should be
avoided.
Example

Sales Amount No. of Clients


1-1000 23
1001-2000 19
2001-3000 22
3001-4000 19
4001-5000 27
5001-6000 25
6001-7000 17
7001-8000 26
8001-9000 23
9001-10000 12
Grand Total 213

[Column chart: Sales Amount Distribution — number of clients in each sales-amount class
interval (1-1000 through 9001-10000)]
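Since the sales amounts in this example are already grouped into class intervals, the
distribution can be sketched in Python as adjacent bars, as shown below. This is a minimal
sketch assuming matplotlib is available; for raw, unbinned observations a histogram (for
example seaborn's histplot) would normally be used instead.

import matplotlib.pyplot as plt

bins = ["1-1000", "1001-2000", "2001-3000", "3001-4000", "4001-5000",
        "5001-6000", "6001-7000", "7001-8000", "8001-9000", "9001-10000"]
clients = [23, 19, 22, 19, 27, 25, 17, 26, 23, 12]

plt.bar(bins, clients, width=1.0, edgecolor="black")  # width=1.0 keeps the bars adjacent
plt.xticks(rotation=45, ha="right")
plt.xlabel("Sales Amount")
plt.ylabel("No. of Clients")
plt.title("Sales Amount Distribution")
plt.tight_layout()
plt.show()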

Check your progress 7:

Q.1 What is the distribution plot?


…………………………………………………………………………………………
…………………………………………………………………………………………

Q.2 When should we use distribution plot?


…………………………………………………………………………………………
…………………………………………………………………………………………

Q.3 What do distribution graphs show?


…………………………………………………………………………………………
…………………………………………………………………………………………..

4.10 PAIR PLOT

The pairs plot is an extension of two fundamental figures, the histogram and the scatter
plot. The histograms along the diagonal let us see the distribution of each single
variable, while the scatter plots on the upper and lower triangles show the relationship
(or lack thereof) between each pair of variables.

A pair plot can be utilised to gain an understanding of the optimum collection of


characteristics to describe a relationship between two variables or to create clusters
that are the most distinct from one another. Additionally, it is helpful to construct
some straightforward classification models by drawing some straightforward lines or
making linear separations in our data set.
Constructing a Pair Plot: If you have m attributes in your dataset, it creates a figure
with m x m subplots. Each attribute's univariate histograms (distributions) make up the
main-diagonal subplots. For a non-diagonal subplot, assume a position (i, j). The
dataset's samples are all plotted using a coordinate system with the characteristics i and
j as the axes. In other words, it projects the dataset on these two attributes only. This is
particularly interesting to visually inspect how the samples are spread with respect to
these two attributes ONLY. The "shape" of the spread can give you valuable insight on
the relation between the two attributes.

Use Cases: A pairs plot allows us to see both distribution of single variables and
relationships between two variables. It helps to identify the most distinct clusters or the
optimum combination of attributes to describe the relationship between two variables.
• By creating some straightforward linear separations or basic lines in our data
set, it also helps to create some straightforward classification models.
• Analysing socio-economic data of a population.

Best Practices:
• Use a different colour palette.
• For each colour level, use a different marker.
Example:
calories protein fat sodium fiber rating
70 4 1 130 10 68.40297
120 3 5 15 2 33.98368
70 4 1 260 9 59.42551
50 4 0 140 14 93.70491
110 2 2 180 1.5 29.50954
110 2 0 125 1 33.17409
130 3 2 210 2 37.03856
90 2 1 200 4 49.12025
90 3 0 210 5 53.31381
120 1 2 220 0 18.04285
110 6 2 290 2 50.765
120 1 3 210 0 19.82357
110 3 2 140 2 40.40021
110 1 1 180 0 22.73645
110 2 0 280 0 41.44502
100 2 0 290 1 45.86332
110 1 0 90 1 35.78279
110 1 1 180 0 22.39651
110 3 3 140 4 40.44877
110 2 0 220 1 46.89564
100 2 1 140 2 36.1762
100 2 0 190 1 44.33086
110 2 1 125 1 32.20758
110 1 0 200 1 31.43597
100 3 0 0 3 58.34514
120 3 2 160 5 40.91705
120 3 0 240 5 41.01549
110 1 1 135 0 28.02577
100 2 0 45 0 35.25244
110 1 1 280 0 23.80404
100 3 1 140 3 52.0769
110 3 0 170 3 53.37101
120 3 3 75 3 45.81172
120 1 2 220 1 21.87129
110 3 1 250 1.5 31.07222
110 1 0 180 0 28.74241
110 2 1 170 1 36.52368
140 3 1 170 2 36.47151
110 2 1 260 0 39.24111
100 4 2 150 2 45.32807
110 2 1 180 0 26.73452
100 4 1 0 0 54.85092
150 4 3 95 3 37.13686
150 4 3 150 3 34.13977
160 3 2 150 3 30.31335
100 2 1 220 2 40.10597
120 2 1 190 0 29.92429
140 3 2 220 3 40.69232
90 3 0 170 3 59.64284
130 3 2 170 1.5 30.45084
120 3 1 200 6 37.84059
100 3 0 320 1 41.50354
50 1 0 0 0 60.75611
50 2 0 0 1 63.00565
100 4 1 135 2 49.51187
100 5 2 0 2.7 50.82839
120 3 1 210 5 39.2592
100 3 2 140 2.5 39.7034
90 2 0 0 2 55.33314
110 1 0 240 0 41.99893
110 2 0 290 0 40.56016
80 2 0 0 3 68.23589
90 3 0 0 4 74.47295
90 3 0 0 3 72.80179
110 2 1 70 1 31.23005
110 6 0 230 1 53.13132
90 2 0 15 3 59.36399
110 2 1 200 0 38.83975
140 3 1 190 4 28.59279
100 3 1 200 3 46.65884
110 2 1 250 0 39.10617
110 1 1 140 0 27.7533
100 3 1 230 3 49.78745
100 3 1 200 3 51.59219
110 2 1 200 1 36.18756
The pair plot of this data can be interpreted as follows:

Along the boxes of the diagonal, the variable names (and their univariate distributions)
are displayed. Each of the remaining boxes shows a scatter plot of one pairwise
combination of variables; for instance, the box in the top-right corner of the matrix
shows a scatter plot of the values of rating against sodium. From this single
visualisation we can therefore see the association between every pair of variables in
the dataset: for instance, calories and rating appear to have a negative relationship,
whereas protein and fat appear to be unrelated.
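
A minimal Python sketch for producing such a pair plot is given below. It assumes that the
table above has been saved to a hypothetical file named cereals.csv with the same column
headers, and that pandas, seaborn and matplotlib are available.

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# columns assumed: calories, protein, fat, sodium, fiber, rating
cereals = pd.read_csv("cereals.csv")

# m x m grid: histograms of each variable on the diagonal,
# pairwise scatter plots in the off-diagonal cells
sns.pairplot(cereals)
plt.show()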

Check your progress 8:

1. Why pair plot is used?


……………………………………………………………………………
……………………………………………………………………………

2. How do you read a pairs plot?


……………………………………………………………………………
……………………………………………………………………………

3. What does a pair plot show?


……………………………………………………………………………
……………………………………………………………………………

4.11 LINE GRAPH

A graph that depicts change over time by means of points and lines is known as a line
graph, line chart, or line plot. It is a graph that shows a line connecting a lot of points
or a line that shows how the points relate to one another. The graph is represented by
the line or curve that connects successive data points to show quantitative data between
two variables that are changing. The values of these two variables are compared along
a vertical axis and a horizontal axis in linear graphs.

One of the most significant uses of line graphs is tracking changes over both short and
extended time periods. It is also used to compare the changes that have taken place for
diverse groups over the course of the same time period. It is strongly advised to use a
line graph rather than a bar graph when working with data that only has slight
fluctuations.

As an illustration, the finance department of a company would want to visualise how


its current cash balance has changed over time. If so, they will plot the points over the
horizontal and vertical axis using a line graph. It typically refers to the time period that
the data span.

Following are the types of line graphs:

1. Simple Line Graph: Only a single line is plotted on the graph.

Example:

Time (hr) Distance (km)


0.5 180
1 360
1.5 540
2 720
2.5 900
3 1080

2. Multiple Line Graph: The same set of axes is used to plot several lines. An
excellent way to compare similar objects over the same time period is via a
multiple line graph.

Example:

Time(hr) Rahul dist.(km) Mahesh dist. (km)


0.5 180 200
1 360 400
1.5 540 600
2 720 800
2.5 900 1000
3 1080 1200
3. Compound Line Graph: Whenever one piece of information may be broken
down into two or more distinct pieces of data. A compound line graph is the
name given to this particular kind of line graph. To illustrate each component
that makes up the whole, lines are drawn. The line at the top displays the total,
while the line below displays a portion of the total. The size of each component
can be determined by the distance that separates every pair of lines.

Example:

Time Cars Buses Bikes


1-2pm 37 45 42
2-3pm 44 34 26
3-4pm 23 39 27
4-5pm 29 41 48

Constructing a line graph: When we have finished creating the data tables, we will
then use those tables to build the linear graphs. These graphs are constructed by plotting
a succession of points, which are then connected together with straight lines to offer a
straightforward method for analysing data gathered over a period of time. It provides a
very good visual format of the outcome data that was gathered over the course of time.

Use cases: Tracking changes over both short and long time periods is an important
application of line graphs. Additionally, it is utilised to compare changes over the same
time period for various groups. Anytime there are little changes, using a line graph
rather than a bar graph is always preferable.

• Straight line graphs can be used to explain potential future contract


markets and business prospects.
• To determine the precise strength of medications, a straight-line graph
is employed in both medicine and pharmacy.
• The government uses straight line graphs for both research and
budgetary planning.

• Chemistry and biology both use linear graphs.

• To determine whether our body weight is acceptable for our height,


straight line graphs are employed.
Best Practices

• Only connecting adjacent values along an interval scale should be done with
lines.
• In order to provide correct insights, intervals should be of comparable size.
• Select a baseline that makes sense for your set of data; a zero baseline might
not adequately capture changes in the data.
• Line graphs are only helpful for comparing data sets if the axes have the same
scales.

Example:

Sales 2011 2012 2013 2014 2015 2016 2017 2018


North 12000 13000 12500 14500 17300 16000 18200 22000
South 9000 9000 9000 9500 9500 9500 10000 9000
West 28000 27500 24000 25000 24500 24750 28000 29000
East 18000 8000 7000 22000 13000 14500 16500 17000
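
A minimal Python sketch of a multiple line graph for this table is given below, assuming
matplotlib is available; one line is drawn per region.

import matplotlib.pyplot as plt

years = [2011, 2012, 2013, 2014, 2015, 2016, 2017, 2018]
north = [12000, 13000, 12500, 14500, 17300, 16000, 18200, 22000]
south = [9000, 9000, 9000, 9500, 9500, 9500, 10000, 9000]
west  = [28000, 27500, 24000, 25000, 24500, 24750, 28000, 29000]
east  = [18000, 8000, 7000, 22000, 13000, 14500, 16500, 17000]

for label, series in [("North", north), ("South", south), ("West", west), ("East", east)]:
    plt.plot(years, series, marker="o", label=label)   # one line per region

plt.xlabel("Year")
plt.ylabel("Sales")
plt.title("Region-wise Sales, 2011-2018")
plt.legend()
plt.show()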

Check your progress 9:

Q.1 What is the line graph?


…………………………………………………………………………………………
…………………………………………………………………………………………

Q.2 Where can we use line graph?


…………………………………………………………………………………………
…………………………………………………………………………………………

Q.3 Draw a line chart from the following information:


4.12 PIE CHART

A pie chart, often referred to as a circle chart, is a style of graph that can be used to
summarise a collection of nominal data or to show the many values of a single variable
(e.g. percentage distribution). Such a chart resembles a circle that has been divided into a
number of segments (slices). Each segment corresponds to a specific category. The overall
size of the circle is divided among the segments in the same proportion as the category's
share of the whole data set.

A pie chart often depicts the individual components that make up the whole. In order to
bring attention to a particular piece of information that is significant, the illustration
may, on occasion, show a portion of the pie chart that is cut away from the rest of the
diagram. This type of chart is known as an exploded pie chart.
Types of a Pie chart: There are mainly two types of pie charts: the 2D pie chart and the
3D pie chart. These can be further classified into the following categories:

1. Simple Pie Chart: The most fundamental kind of pie chart is referred to simply as
a pie chart and is known as a simple pie chart. It is an illustration that depicts a pie
chart in its most basic form.

Example:

Pets Owners (%)


Cats 38
Dogs 41
Birds 16
Reptiles 3
Small Mammals 2

[Simple pie chart: Owners (%) — Cats, Dogs, Birds, Reptiles, Small Mammals]

2. Exploded Pie Chart: In an exploded pie chart, one or more slices are pulled away from
the rest of the pie rather than being kept joined together. It is common practice to do
this in order to draw attention to a certain section or slice of a pie chart.

Example:

Pets Owners (%)


Cats 38
Dogs 41
Birds 16
Reptiles 3
Small Mammals 2
[Exploded pie chart: Owners (%) — Cats, Dogs, Birds, Reptiles, Small Mammals, with one
slice pulled away from the rest]

3. Pie of Pie: The pie of pie method is a straightforward approach that enables more
categories to be represented on a pie chart without producing an overcrowded and
difficult-to-read graph. A pie chart that is generated from an already existing pie chart
is referred to as a "pie of pie".

Example:

Pets Owners (%)


Cats 38
Dogs 41
Birds 16
Reptiles 3
Small Mammals 2

4. Bar of Pie: A bar of pie is an additional straightforward method for showing


additional categories on a pie chart while minimising space consumption on the pie
chart itself. The expansion that was developed from the already existing pie chart
was a bar graph rather than a pie of pie, despite the fact that both serve comparable
objectives.

Example:

Pets Owners (%)


Cats 38
Dogs 41
Birds 16
Reptiles 3
Small Mammals 2
Constructing a Pie chart: “The total value of the pie is always 100%”
To work out with the percentage for a pie chart, follow the steps given below:

• Categorize the data


• Calculate the total
• Divide the categories
• Convert into percentages
• Finally, calculate the degrees

Therefore, the pie chart formula is given as (Given Data/Total value of Data) × 360°

Use cases: If you want your audience to get a general idea of the part-to-whole
relationship in your data, and comparing the exact sizes of the slices is not as critical to
you, then you should use pie charts. And indicate that a certain portion of the whole is
disproportionately small or large.
• Voting preference by age group
• Market share of cloud providers

Best Practices

• Fewer pie wedges are preferred: The observer may struggle to interpret the chart's
significance if there are too many proportions to compare. Similar to this, keep the
overall number of pie charts on dashboards to a minimum.
• Overlay pies on maps: Pie charts can be used to further deconstruct geographic
tendencies in your data and produce an engaging display.
Example

COMPANY MARKET SHARE


Company A 24%
Company B 13%
Company C 8%
Company D 33%
Company E 22%

[Pie chart: MARKET SHARE — Company A 24%, Company B 13%, Company C 8%, Company D 33%,
Company E 22%]
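
A minimal Python sketch of this pie chart is given below, assuming matplotlib is available;
matplotlib normalises the values internally, which corresponds to the
(Given Data/Total value of Data) × 360° formula above.

import matplotlib.pyplot as plt

companies = ["Company A", "Company B", "Company C", "Company D", "Company E"]
share = [24, 13, 8, 33, 22]

# each slice angle = (value / total) * 360 degrees; autopct prints the percentage label
plt.pie(share, labels=companies, autopct="%1.0f%%", startangle=90)
plt.title("MARKET SHARE")
plt.show()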

Check your progress 10:

Q1. What is the pie chart?


…………………………………………………………………………………………
…………………………………………………………………………………………

Q2. What are the different type of pie charts?


…………………………………………………………………………………………
…………………………………………………………………………………………

Q.3 Draw a pie chart from the following information:

4.13 DOUGHNUT CHART

A doughnut chart is a user-friendly alternative to the pie chart that is often much
simpler to read. Like pie charts, these charts express a 'part-to-whole' relationship,
in which all of the parts together represent one hundred percent. They are used to
present survey questions or data with a limited number of categories for comparison.

In comparison to pie charts, they provide more condensed and straightforward
representations. In addition, the centre hole can be used to display relevant
supplementary information. Each arc (segment) indicates a proportional value associated
with a different piece of data.

Constructing a Doughnut chart: A doughnut chart, like a pie chart, illustrates the
relationship of individual components to the whole, but unlike a pie chart, it can display
more than one data series at the same time. A ring is added to a doughnut chart for each
data series that is plotted within the chart itself. The beginning of the first data series
can be seen near the middle of the chart. A specific kind of pie chart called a doughnut
chart is used to show the percentages of categorical data. The amount of data that falls
into each category is indicated by the size of that segment of the donut. The creation of
a donut chart involves the use of a string field and a number, count of features, or
rate/ratio field.

There are two types of doughnut chart one is normal doughnut chart and another is
exploded doughnut chart. Exploding doughnut charts, much like exploded pie charts,
highlight the contribution of each value to a total while emphasising individual values.
However, unlike exploded pie charts, exploded doughnut charts can include more than
one data series.

Use cases: Doughnut charts are good to use when comparing sets of data. By using the
size of each component to reflect the percentage of each category, they are used to
display the proportions of categorical data. A string field and a count of features,
number, rate/ratio, or field are used to make a doughnut chart.
• Android OS market share
• Monthly sales by channel

Best Practices

• Stick to five slices or less because thinner and long-tail slices become unreadable
and uncomparable.
• Use this chart to display one point in time with the help of the filter legend.
• Well-formatted and informative labels are essential because the information
conveyed by circular shapes alone is not enough and is imprecise.
• It is a good practice to sort the slices to make it more clear for comparison.
Example:

Project Status
Completed 30%
Work in progress 25%
Incomplete 45%
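
Matplotlib has no separate doughnut chart type, so the minimal Python sketch below
(assuming matplotlib is available) draws a pie chart whose wedges are given a reduced
width, which leaves a hole in the centre.

import matplotlib.pyplot as plt

status = ["Completed", "Work in progress", "Incomplete"]
share = [30, 25, 45]

plt.pie(share, labels=status, autopct="%1.0f%%",
        wedgeprops={"width": 0.4})        # width < 1 turns the pie into a ring (doughnut)
plt.title("Project Status")
plt.show()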

Check your progress 11:

Q1. What is the doughnut chart?


…………………………………………………………………………………
…………………………………………………………………………………

Q.2 What distinguishes a doughnut chart from a pie chart?


…………………………………………………………………………………………
…………………………………………………………………………………………

Q.3 Draw a doughnut chart from the following information:

Product 2020 2021


x 40 50
y 30 60
z 60 70

4.14 AREA CHART

An area chart, a hybrid of a line and bar chart, shows the relationship between the
numerical values of one or more groups and the development of a second variable, most
often the passage of time. The inclusion of shade between the lines and a baseline,
similar to a bar chart's baseline, distinguishes a line chart from an area chart. An area
chart has this as its defining feature.

Types of Area Chart:

Overlapping area chart: An overlapping area chart results if we wish to look at how
the values of the various groups compare to one another. The conventional line chart
serves as the foundation for an overlapping area chart. One point is plotted for each
group at each of the horizontal values, and the height of the point indicates the group's
value on the vertical axis variable.
All of the points for a group are connected from left to right by a line. A zero baseline
is supplemented by shading that is added by the area chart between each line. Because
the shading for different groups will typically overlap to some degree, the shading itself
incorporates a degree of transparency to ensure that the lines delineating each group
may be seen clearly at all times.

The shading brings attention to the group that has the highest value by highlighting that
group's pure hue. Take care when one series is always higher than the other, as the plot
could then be confused with the stacked area chart, which is the other form of area chart.
In circumstances like these, the most prudent course of action is to stick to the
traditional line chart.

Months (2016)   Web     Android   IOS

June            0       -         -
July            70k     -         -
Aug             55k     80k       -
Sep             60k     165k      80k
Oct             70k     165k      295k
Nov             80k     200k      290k
Dec             40k     125k      155k

Stacked area chart: The stacked area chart is what is often meant to be conveyed when
the phrase "area chart" is used in general conversation. When creating the chart of
overlapping areas, each line was tinted based on its vertical value all the way down to
a shared baseline. Plotting lines one at a time creates the stacked area chart, which uses
the height of the most recent group of lines as a moving baseline. Therefore, the total
that is obtained by adding up all of the groups' values will correspond to the height of
the line that is entirely piled on top.

When you need to keep track of both the total value and the breakdown of that total by
groups, you should make use of a stacked area chart. This type of chart will allow you
to do both at the same time. By contrasting the heights of the individual curve segments,
we are able to obtain a sense of how the contributions made by the various subgroups
stack up against one another and the overall sum.

Example:
Year    Printers    Projectors    White Boards
2017    32          45            28
2018    47          43            40
2019    40          39            43
2020    37          40            41
2021    39          49            39
[Stacked area chart: Printers, Projectors and White Boards stacked by year (2017-2021)]
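
A minimal Python sketch of this stacked area chart is given below, assuming matplotlib is
available; the y-axis label is an assumption, since the source table does not name the
measure.

import matplotlib.pyplot as plt

years = [2017, 2018, 2019, 2020, 2021]
printers = [32, 47, 40, 37, 39]
projectors = [45, 43, 39, 40, 49]
white_boards = [28, 40, 43, 41, 39]

# stackplot piles each series on top of the previous one, so the top edge shows the total
plt.stackplot(years, printers, projectors, white_boards,
              labels=["Printers", "Projectors", "White Boards"], alpha=0.7)
plt.xlabel("Year")
plt.ylabel("Units sold")          # assumed unit; not stated in the source table
plt.title("Stacked Area chart")
plt.legend(loc="upper left")
plt.show()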

Use Cases: In most cases, many lines are drawn on an area chart in order to create a
comparison between different groups (also known as series) or to illustrate how a whole
is broken down into its component pieces. This results in two distinct forms of area
charts, one for each possible application of the chart.
• Magnitude of a single quantitative variable's trend - An increase in a public
company's revenue reserves, programme enrollment from a qualified subgroup by
year, and trends in mortality rates over time by primary causes of death are just a
few examples.
• Comparison of the contributions made by different category members (or
group)- the variation in staff sizes among departments, or support tickets opened
for various problems.
• Birth and death rates over time for a region, the magnitudes of cost vs. revenue for
a business, the magnitudes of export vs. import over time for a country

Best Practices:

• To appropriately portray the proportionate difference in the data, start the y-axis at
0.
• To boost readability, choose translucent, contrasting colours.
• Keep highly variable data at the top of the chart and low variable data at the bottom
during stacking.
• If you need to show how each value over time contributes to a total, use a stacked
area chart.
• However, it is recommended to utilise 100% stacked area charts if you need to
demonstrate a part to whole relationship in a situation where the cumulative total is
unimportant.

Example:
The stacked area chart above belongs to a tele-service offered by various television-based
applications. In this data, different types of subscribers use the services provided by
the tele-applications in different months.

Check your progress 12:

Q1. What is area chart?


…………………………………………………………………………………………
…………………………………………………………………………………………

Q2. What are types of area charts?


…………………………………………………………………………………………
…………………………………………………………………………………………

Q3. Draw an area chart from the following information:

Product A Product B Product C


2017 2000 600 75
2018 2200 450 85
2019 2100 500 125
2020 3000 750 123

4.15 SUMMARY
This Unit introduces you to some of the basic charts that are used in data science. The
Unit defines the characteristics of Histograms, which are very popular in univariate
frequency analysis of quantitative variables. It then discusses the importance and
various terms used in the box plots, which are very useful while comparing quantitative
variable over some qualitative characteristic. Scatter plots are used to visualise the
relationships between two quantitative variables. The Unit also discusses about the heat
map, which are excellent visual tools for comparing values. In case three variables are
to be compared then you may use bubble charts. The unit also highlights the importance
of bar charts, distribution plots, pair plots and line graphs. In addition, it highlights the
importance of Pie chart, doughnut charts and area charts for visualising different kinds
of data. Beyond these, there are many other kinds of charts that are used in different
analytical tools; you may read about them in the references.
4.16 ANSWERS

Check Your Progress 1


i. A bar graph is a pictorial representation using vertical and
horizontal bars in a graph. The length of bars are proportional
to the measure of data. It is also called bar chart. A histogram
is also a pictorial representation of data using rectangular bars,
that are adjacent to each other. It is used to represent grouped
frequency distribution with continuous classes.

ii.

iii. It is used to summarise continuous or discrete data that is


measured on an interval scale. It is frequently used to
conveniently depict the key characteristics of the data
distribution.

iv. A histogram is a graphic depiction of data points arranged into


user-specified ranges. The histogram, which resembles a bar
graph in appearance, reduces a data series into an intuitive
visual by collecting numerous data points and organising them
into logical ranges or bins.

Check Your Progress 2


1. Follow these instructions to interpret a boxplot. :
Step 1: Evaluate the major characteristics. Look at the distribution's
centre and spread. Examine the potential impact of the sample size on
the boxplot's visual appeal.
Step 2: Search for signs of anomalous or out-of-the-ordinary data.
Skewed data suggest that data may not be normal. Other situations in
your data may be indicated by outliers.
Step 3: Evaluate and compare groups. Evaluate and compare the centre
and spread of groups if your boxplot contains them.
2. A boxplot is a common method of showing data distribution based on a five-number
   summary: the minimum, the first quartile (Q1), the median, the third quartile (Q3),
   and the maximum. It also tells you about the values of your outliers.
3. Box plots are generally used for 3 purposes -
• Finding outliers in the data
• Finding the dispersion of data from a median
• Finding the range of data

4. The box plot distribution will reveal the degree to which the data are clustered, how
skewed they are, and also how symmetrical they are.
• Positively Skewed: The box plot is positively skewed if the distance from the me-
dian to the maximum is greater than the distance from the median to the mini-
mum.
• Negatively skewed: Box plots are said to be negatively skewed if the distance
from the median to the minimum is higher than the distance from the median to
the maximum.
• Symmetric: When the median of a box plot is equally spaced from both the maxi-
mum and minimum values, the box plot is said to be symmetric.

Check Your Progress 3


1.

• The most practical method for displaying bivariate (2-variable) data is a scatter plot.
• A scatter plot can show the direction of a relationship between two variables when
there is an association or interaction between them (positive or negative).
• The linearity or nonlinearity of an association or relationship can be ascertained
using a scatter plot.
• A scatter plot reveals anomalies, questionably measured data, or incorrectly plotted
data visually.

2.

• The Title- A brief description of what is in your graph is provided in the title.
• The Legend- The meaning of each point is explained in the legend.
• The Source- The source explains how you obtained the data for your graph.
• Y-Axis.
• The Data.
• X-Axis.

3. A scatter plot is composed of a horizontal axis containing the measured values of one
variable (independent variable) and a vertical axis representing the measurements of the
other variable (dependent variable). The purpose of the scatter plot is to display what
happens to one variable when another variable is changed.
4.

• Positive Correlation.
• Negative Correlation.
• No Correlation (None)

Check Your Progress 4


1. Three main types of input exist to plot a heatmap: wide format, correlation
matrix, and long format.
Wide format: The wide format (or the untidy format) is a matrix where
each row is an individual, and each column is an observation. In this case,
the heatmap makes a visual representation of the matrix: each square of the
heatmap represents a cell. The color of the cell changes according to its
value.
Correlation matrix: Suppose you measured several variables for n
individuals. A common task is to check if some variables are correlated.
You can easily calculate the correlation between each pair of variables, and
plot this as a heatmap. This lets you discover which variable is related to
the other.
Long format: In the tidy or long format, each line represents an
observation. You have 3 columns: individual, variable name, and value (x,
y and z). You can plot a heatmap from this kind of data.

2. A heat map is a two-dimensional visualisation of data in which colours


stand in for values. A straightforward heat map offers a quick visual
representation of the data. The user can comprehend complex data sets with
the help of more intricate heat maps.

3. Using one variable on each axis, heatmaps are used to display relationships
between two variables. You can determine if there are any trends in the
values for one or both variables by monitoring how cell colours vary across
each axis.

Check Your Progress 5

1. A bubble chart is a variant of a scatter chart in which the data points


are swapped out for bubbles, with the size of the bubbles serving as a
representation of an additional dimension of the data. A bubble chart's
horizontal and vertical axes are both value axes.

2. To identify whether at least three numerical variables are connected or


exhibit a pattern, bubble charts are utilised. They could be applied in
specific situations to compare categorical data or demonstrate trends
across time.

3. In scatter charts, one numeric field is displayed on the x-axis and


another on the y-axis, making it simple to see the correlation between
the two values for each item in the chart. A third numerical field in a
bubble chart regulates the size of the data points.

4. Bubble size encodes a third numerical variable: each value is mapped to the area
of its bubble, usually scaled between a minimum and a maximum marker size (for
example 5 pt and 20 pt) so that very small values remain visible. To construct a
chart that displays many dimensions, combine bubble size with colour by value.

Check Your Progress 6


Answer 1:
In the process of statistics development, bar charts are typically employed to
display the data. The following is a list of some of the applications of the bar
chart:
To clearly illustrate the relationships between various variables, bar charts are
typically utilised. When presented in a pictorial format, the parameters can be
more quickly and easily envisioned by the user.
Bar charts are the quickest and easiest way to display extensive amounts of data
while also saving time.
The method of data representation that is most commonly utilised. As a result,
it is utilised in a variety of different sectors.
When studying trends over extended amounts of time, it is helpful to have this
information.

Answer 2:
Charts are primarily divided into two categories:

Horizontal Bar Charts:

Vertical Bar Charts

We can further divide into two types:

Grouped Bar Charts


Stacked Bar Charts
Answer 3:

Answer4:

Check Your Progress 7:


1. For visually assessing the distribution of sample data, you can draw
distribution charts. Using these charts, you can contrast the actual
distribution of the data with the theoretical values expected from a
certain distribution.
2. The distribution plot is useful for analysing the relationship
between the range of a set of numerical data and its distribution.
You are only allowed to use one or two dimensions and one
measure when creating a distribution graphic.
3. These graphs show - how the data is distributed; how the data is
composed; how values relate to one another.

Check Your Progress 8:


1. We can visualise pairwise relationships between variables in a dataset
using pair plots. By condensing a lot of data into a single figure, this gives the
data a pleasant visual representation and aids in our understanding of the data.

2. For variables a, b, c and d, the first row shows a scatter plot of a against b, one
of a against c, and finally one of a against d. The second row shows b against a
(symmetric to the first row), followed by b against c, b against d, and so on. No sums,
mean squares or other calculations are performed in a pairs plot: whatever you see in
the plot is present as-is in your data frame.

3. Pair plots are used to determine the most distinct clusters or the best
combination of features to describe a connection between two variables. By
creating some straightforward linear separations or basic lines in our data set,
it also helps to create some straightforward classification models.

Check Your Progress 9:


1. A graph that depicts change over time by means of points and lines is known
as a line graph, line chart, or line plot. It is a chart that depicts a line uniting
numerous points or a line that illustrates the relation between the points. The
line or curve used to depict quantitative data between two changing variables
in the graph combines a sequence of succeeding data points to create a
representation of the graph.

2. Tracking changes over a short as well as a long period of time is one of the
most important applications of line graphs. Additionally, it is utilised to
compare the modifications that have occurred for various groups throughout
the course of the same period of time. When dealing with data that has only
minor variations, using a line graph rather than a bar graph is strongly
recommended. For instance, the finance team at a corporation may wish to
chart the evolution of the cash balance that the company now possesses
throughout the course of time.

3.

Check Your Progress 10:

1. A pie chart, often referred to as a circle chart, is a style of graph that can be used to
summarise a collection of nominal data or to show the many values of a single variable.
(e.g. percentage distribution).

2. There are mainly two types of pie charts one is 2D pie chart and another is 3D pie
chart. This can be further classified into flowing categories:

1. Simple Pie Chart

2. Exploded Pie Chart

3. Pie of Pie
4. Bar of Pie

3.

Check Your Progress 11:

1. Pie charts have been superseded by a more user-friendly alternative called a doughnut
chart, which makes reading pie charts much simpler. It is recognised that these charts
express the relationship of 'part-to-whole,' which is when all of the parts represent one
hundred percent when collected together. In comparison to pie charts, they provide for
more condensed and straightforward representations.
2. A donut chart is similar to a pie chart, with the exception
that the centre is cut off. When you want to display
particular dimensions, you use arc segments rather than
slices. Just like a pie chart, this form of chart can assist you
in comparing certain categories or dimensions to the
greater overall; nevertheless, it has a few advantages over
its pie chart counterpart.
3. [Doughnut chart: Product Sales — x: 40, y: 30, z: 60]

Check Your Progress 12:

1. An area chart shows how the numerical values of one or more groups change in
proportion to the development of a second variable, most frequently the passage of time.
It combines the features of a line chart and a bar chart. A line chart can be differentiated
from an area chart by the addition of shading between the lines and a baseline, just like
in a bar chart. This is the defining characteristic of an area chart.

2. Overlapping area chart and Stacked area chart

3. [Area chart: Product A, Product B and Product C across 2017-2020]

4.17 REFERENCES

• Useful Ways to Visualize Your Data (With Examples) (PDF)
• Data Visualization Cheat Sheet (PDF)
• Which chart or graph is right for you? (PDF)
• https://www.excel-easy.com/examples/frequency-distribution.html
• https://analyticswithlohr.com/2020/09/15/556/
• https://www.fusioncharts.com/line-charts
• https://evolytics.com/blog/tableau-201-make-stacked-area-chart/
• https://chartio.com/learn/charts/area-chart-complete-guide/
• https://www.lifewire.com/exploding-pie-charts-in-excel-3123549
• https://www.formpl.us/resources/graph-chart/line/
• https://www150.statcan.gc.ca/n1/edu/power-pouvoir/ch9/bargraph-
diagrammeabarres/5214818-eng.htm
• https://sixsigmamania.com/?p=475
• https://study.com/academy/lesson/measures-of-dispersion-variability-and-
skewness.html
UNIT 5 BIG DATA ARCHITECTURE

Structure
5.1 Introduction
5.2 Objectives
5.3 Big Data and Characteristics
5.4 Big data Applications
5.5 Structured vs semi-structured and unstructured data
5.6 Big Data vs data warehouse
5.7 Distributed file system
5.8 HDFS and Map Reduce
5.9 Apache Hadoop 1 and 2 (YARN)
5.10 Summary
5.11 Solutions/Answers
5.12 Further Readings

5.1 INTRODUCTION

In this modern era of Information and knowledge, terabytes of data are produced by a wide
variety of sources every day. These sources include social media, the data of companies
including production data, customer data, financials etc.; the interactions of users, the data
created by sensors and data produced by electronic devices such as mobile phones and
automobiles, amongst others. This voluminous increase in data relates to the field of Big
Data. This concept of "Big Data" applies to the rapidly expanding quantity of data that is being
collected, and it is described primarily by having the following characteristics: volume,
velocity, veracity, and variety.

The process of deriving useful information and insights from massive amounts of data is
referred to as "big data analytics". It requires the application of technologically advanced tools
and procedures. "Big data architecture" refers to the pattern or design that describes how this
data is received from many sources, processed for further ingestion, assessed, and eventually
made available to the end-user.

The building blocks of big data analytics are found in the big data architecture. In most cases,
the architecture components of big data analytics consist of four logical levels or layers, as
discussed below:

• Big Data Source Layer: A big data environment may manage both batch processing and
real-time processing of big data sources, like data warehouses, relational database
management systems, SaaS applications, and Internet of Things devices. This layer is
referred to as the big data sources layer.
• The Management and Storage Layer: it is responsible for receiving data from the source,
converting that data into a format that the data analytics tool can understand, and storing
the data in accordance with the format in which it was received.
• Analysis Layer: This layer is where the business intelligence that was acquired from the
big data storage layer is analysed.
• The Consumption Layer: it is responsible for collecting the findings from the Big Data
Analysis Layer and delivering them to the Appropriate Output Layer, which is also
referred to as the Business Intelligence Layer.

The architecture components of big data analytics typically include four logical layers or levels
(discussed above), which perform fundamental processes as given below :
• Connecting to Data Sources: Connectors and adapters can connect to a wide range of
storage systems, protocols, and networks. They can also connect to any type of data
format.
• Data Governance: includes rules for privacy and security that work from the time data is
taken in until it is processed, analysed, stored, and deleted.
• Systems Management: Modern big data architectures are usually built on large-scale,
highly scalable clusters that are spread out over a wide area. These architectures must be
constantly monitored using central management consoles.
• Protecting Quality of Service: The Quality of Service framework helps define data
quality, compliance policies, and how often and how much data should be taken in.

In this Unit, we will discuss Big Data its Characteristics and Applications. We will also discuss
Big Data architecture and related technologies such as Map Reduce, HDFS, Apache Hadoop and
Apache YARN.

5.2 OBJECTIVES

After going through this unit, you will be able to:


• explain Big Data and its Characteristics
• compare the characteristics of Structured Semi-Structured and Unstructured data
• differentiate Big Data and Data warehouse
• describe the Distributed file system
• explain HDFS and Map Reduce
• differentiate between Apache Hadoop 1 and 2 (YARN).
5.3 BIG DATA AND CHARACTERISTICS

To define the concept of Big Data, we first revise the fundamentals of data and thereafter we will
extend our discussion to the advanced concepts of Big Data. What exactly does "Data" mean?
Data can be numbers or characters or symbols or any other form such as an image or signal,
which can be processed by a computer or may be stored in the storage media or sent in the form
of electrical impulses.

Now, we define - What is Big Data? The term "big data" refers to a collection of information that
is not only extremely large in quantity but also growing at an exponential rate over the course of
time. Because of the vast amount and the high level of complexity, none of the normal methods
that are used for managing data can effectively store or deal with such data. Simply speaking,
"big data" refers to information that is kept in exceedingly vast amounts.

It is vital to have a thorough understanding of the characteristics of big data in order to get a
handle on how the concept of big data works and how it can be used. The characteristics of big
data are:

1. Volume: The amount of information that you currently have access to is referred to as its
volume. We measure the quantity of data that we have in our possession using the units
of Gigabytes (GB), Terabytes (TB), Petabytes (PB), Exabytes (EB), Zettabytes (ZB), and
Yottabytes (YB). According to the trends observed in the industry, the quantity of
data is expected to continue growing exponentially.

2. Velocity: The speed at which data is generated and therefore should be processed, is
referred to as the "velocity" of that process. When it comes to the efficiency of any
operation involving vast amounts of data, a high velocity is a fundamental requirement.
This characteristic can be broken down into its component parts, which include the rate
of change, activity bursts, and the linking of incoming data sets.

3. Variety: The term "variety" relates to the numerous forms that big data might take. When
we talk about diversity, we mean the various kinds of big data that are available. This is a
major problem for the big data sector since it has an impact on the efficiency of analysis.
It is critical that you organize your data in order to effectively manage its diversity. You
get a wide range of data from a variety of sources.

4. Veracity: The veracity of data relates to its accuracy. Lack of veracity can cause
significant harm to the precision of your findings; it is one of the Big Data characteristics
that is considered to be of the utmost importance.
5. Value: The benefits that your organization gains from utilizing the data are referred to as
"value." Are the results of big data analysis in line with your company's goals? Is it
helping your organization expand in any way? In big data, this is a vital component.

6. Validity: This characteristic of Big Data relates to how valid and pertinent are the facts
that are going to be used for the purpose that they were designed for?

7. Volatility: Volatility relates to the life span of big data. Some data items may be valid for
a very short duration like the sentiments of people.

8. Variability: The field of big data is continually changing. In some cases, the information
you obtained from one source may not be the same as what you found today. This
phenomenon is known as data variability, and it has an impact on the homogeneity of
your data.
9. Visualization: You may be able to express the insights that were generated by big data by
using visual representations such as charts and graphs. The insights that big data
specialists have to provide are increasingly being presented to audiences who are not
technically oriented.

Some examples of big data are given below:

• Big Data can be illustrated by looking at the Stock Exchange, which creates
approximately one terabyte of new trade data every single day.
• Social Media: According to various reports, more than 500 Gigabytes of fresh data are
added to the databases of the social media website each and every day. The uploading of
photos and videos, sending and receiving of messages, posting of comments, and other
activities are the primary sources of this data.
• In just thirty minutes of flying time, a single jet engine is capable of producing more than
ten Gigabytes of data. Because there are many thousands of flights each day, the amount
of data generated can reach many Peta bytes.

A big data system is primarily an application of big data in an organization for decision support.
The following four components are required for a Big Data system to function properly:

1. Ingestion (collecting and preparing the data)


2. Storage (storing the data)
3. Analysis (analyzing the data)
4. Consumption (presenting and sharing the insights)
You need a component that is responsible for collecting the data and another that is responsible
for storing it. In order for your big data ecosystem to be complete, you require a system that
analyses and reports the results of big data analysis.

Big data may be the future of worldwide governance and businesses. It presents businesses with
a great number of opportunities for improvement. The following are some of the more important
ones:

1. Improved Decision Making


2. Data-driven Customer Service
3. Efficiency Optimization
4. Real-time Decision Making

Based on these opportunities, this field of Big Data has enormous applications in a variety of
fields, and they are discussed in the subsequent section i.e. section 5.4.

☞ Check Your Progress 1


1. What does "Big Data" mean?
………………………………………………………………………………………………………
………………………………………………………………………………………………………
2. List the components required by a Big Data system to function properly
………………………………………………………………………………………………………
………………………………………………………………………………………………………
3. Discuss the essential characteristics of Big Data.
………………………………………………………………………………………………………
………………………………………………………………………………………………………

5.4 BIG DATA APPLICATIONS

It is generally stated that Big Data is one of the most valuable and effective fuel that can power the vast
IT companies of the 21st century. Big data has applications in virtually every industry. Big data enables
businesses to make better use of the vast quantities of data they generate and collect from a variety of
sources. There are many different applications for big data, which is why it is currently one of the skills
that is most in demand. The following are some examples of important applications of big data:

• One of the industries that makes the most use of big data technology is the travel and tourism
sector. It has made it possible for us to anticipate the need for travel facilities in a variety of
locations, thereby improving business through dynamic pricing and a number of other factors.
• The social media sites produce a significant amount of data. Big data allows marketers to make
greater use of the information provided by social media platforms, which results in improved
promotional activities. It enables them to construct accurate client profiles, locate their target
audience, and comprehend the criteria that are important to them.

• Big data technology is utilized widely within the financial and banking sectors. The use of big
data analytics can assist financial institutions in better comprehending the behaviours of their
clients on the basis of the inputs obtained from their investment behaviours, shopping habits,
reasons for investing, and their personal or financial histories.

• The field of healthcare has already witnessed significant changes brought on by Big Data
application implementation. Individual patients can now receive individualised medical care that
is tailored to their specific needs. This is possible due to the application of predictive analytics by
medical professionals and personnel in the healthcare industry.

• The use of big data in recommendation systems is another popular application of big data. Big
data allows businesses to recognize patterns of client behavior in order to provide better and more
individualized services to those customers.

• The telecommunications and multimedia business is one of the most important producers and
consumers of big data, creating and managing vast volumes of new data every day.

• In addition, governments and the military make heavy use of big data technology. Consider the
quantity of data a government generates in its records; in the military, a standard fighter jet has to
handle petabytes of data while it is in the air.

• Companies are able to do predictive analysis because of the capabilities provided by big data
technologies. It lets businesses to make more accurate predictions of the outcomes of processes
and events, which in turn helps them reduce risk.

• Companies are able to develop insights that are more accurate because of big data. The big data is
collected by them from a variety of sources, which allows them the capacity to use relevant data
to generate insights that can be put into action. If a corporation has more accurate information, it
will be able to make decisions that are more profitable and reduce risks.

5.5 STRUCTURED, SEMI-STRUCTURED AND UNSTRUCTURED DATA

The terms "structured," "unstructured," and "semi-structured" data are frequently brought up whenever
we are having a discussion about data or analytics. These are the three varieties of data that are becoming
increasingly important for many kinds of commercial applications i.e. Structured, Semi-Structured and
Unstructured data. Structured data has been around for quite some time, and even today, conventional
systems and reporting continue to rely on this type of data. In spite of this, there has been a rapid increase
in the production of unstructured and semi-structured data sources over the course of the last several
years. As a consequence of this, an increasing number of companies are aiming to include all three kinds
of data in their business intelligence and analytics systems in order to take these systems to the next level.

In this section of the unit, we discuss the classification of data used for understanding and implementing
Big Data, i.e. structured, semi-structured and unstructured data.

Structured Data: Structured data is a term that refers to information that has been reformatted and
reorganised in accordance with a data model that has been decided in advance. After mapping the raw
data into the predesigned fields, the data can then be extracted and read using SQL in a straightforward
manner. Relational databases, which are characterised by their organisation of data into tables comprising
of rows and columns, and the query language supported by them (SQL) provide the clearest example
possible of structured data.

The relational model reduces the quantity of duplicated information. On the other hand, structured data is
more interdependent and less flexible than unstructured data. Both humans and machines generate this
kind of data.

Machine-generated structured data includes, for example, data from point-of-sale (POS) terminals, such
as quantities and barcodes, and statistics from blogs. In a similar vein, anybody who works with data has
probably used spreadsheets at least once in their life; spreadsheets are a classic form of human-generated
structured data. Because of the way it is organised, structured data is simpler to examine than
semi-structured and unstructured data: it can be analysed easily because it corresponds to data models that
have already been defined. For example, you can organise structured data such as customer names in
alphabetical order, store telephone numbers in the appropriate format, and hold social security numbers in
the correct format.

Unstructured data: Information that is displayed in its most unprocessed form is referred to as
"unstructured data”. It is really difficult to work with this data because it has a minimal structure, and the
formatting is also very confusing. Unstructured data management can collect data from a variety of
sources, such as postings on social media platforms, conversations, satellite imagery, data from the
Internet of Things (IoT) sensor devices, emails, and presentations, and organise it in a storage system in a
logical and specified manner.

Semi-Structured Data: There is a third kind of data that falls between structured and unstructured data
called semi-structured data or partially structured data. Your data sets might not always be structured or
unstructured. One variety of such data is known as semi-structured data, and it differs from unstructured
data in that it possesses both consistent and definite qualities. It does not restrict itself to a fixed structure
like those that are required for relational databases. Although organizational qualities such as metadata or
semantics tags are utilised with semi-structured data in order to make it more manageable, there is still a
certain amount of unpredictability and inconsistency present in the data.
Use of delimited files is one illustration of a data format that is only semi-structured. It has elements that
are capable of separating the hierarchies of the data into their own distinct structures. In a similar manner,
digital images have certain structural qualities that make them semi-structured, but the image itself does
not have a pre-defined structure. If a picture is shot with a Smartphone, for instance, it will include certain
structured properties like a device ID, a geotag, and a date and time stamp. After they have been saved,
pictures can be organized further by affixing tags to them, such as "pet" or "dog," for example.

Due to the presence of one or more characteristics that allow for classification, unstructured data may on
occasion be categorized as semi-structured data instead. To summarise, organisations need to analyse all
three kinds of data in order to stay ahead of the competition and make the most of the knowledge they
have.
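To make the distinction concrete, the short Python sketch below holds the same kind of customer information in the three forms discussed above. The field names and values are invented purely for illustration.

import json

# Structured data: rows that conform to a pre-defined schema, as in a
# relational table with fixed columns (customer_id, name, phone).
structured_rows = [
    ("C001", "Asha", "9876543210"),
    ("C002", "Ravi", "9123456780"),
]

# Semi-structured data: self-describing records such as JSON or XML.
# Tags give some organisation, but records need not share identical fields.
semi_structured = json.loads(
    '{"customer_id": "C001", "name": "Asha",'
    ' "tags": ["premium"], "geotag": {"lat": 28.6, "lon": 77.2}}'
)

# Unstructured data: free text (or images, audio, video) with no
# pre-defined data model at all.
unstructured = ("Asha called today and said the delivery was late "
                "but the support staff were helpful.")

# Structured data can be queried directly by column position or name ...
names = [row[1] for row in structured_rows]

# ... semi-structured data is navigated through its tags ...
latitude = semi_structured["geotag"]["lat"]

# ... while unstructured data must first be processed, e.g. by text mining.
mentions_delay = "late" in unstructured.lower()

print(names, latitude, mentions_delay)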

☞ Check Your Progress 2


1. Give Five applications of Big Data.
………………………………………………………………………………………………………
………………………………………………………………………………………………………
2. Compare Structured, Semi-Structured and Unstructured Data.
………………………………………………………………………………………………………
………………………………………………………………………………………………………
3. Discuss the role of Structured, Semi-Structured and Unstructured Data in decision-
making.
………………………………………………………………………………………………………
………………………………………………………………………………………………………

5.6 BIG DATA VS DATA WAREHOUSE

What should you buy for implementing an analytics system for your organization - a big data solution or
a data warehouse? A data warehouse and a big data solution are quite similar in many respects. Both have
the capacity to store a significant amount of information. When reporting, either option is acceptable to
utilize. But the question that needs to be answered is whether they can truly be utilised as a substitute for
each other. In order to understand this, you need to have a conceptual understanding of both i.e., Big data
and Data Warehouse.

The form of big data found on platforms such as Hadoop and Cloudera is the type most people have in
mind. The following is a working definition of a big data solution (more precise definitions may be found
on the websites of Cloudera or Hortonworks):

• A type of technology that is able to store extremely vast volumes of data.


• A technology that is capable of storing the data in devices that are not pricey.
• A type of data storage technology in which the data is kept in an unstructured manner.
Data warehouses have been in widespread use for the past two decades, whereas big data applications
have seen significant growth over the past decade. The first thought of someone who is not technically
very deep into these technologies may therefore be that big data will eventually replace traditional data
warehousing. The similarities between the two are another argument in favour of this simplistic way of
thinking. Following are some of the similarities between the two:

• Both of these have a significant amount of information.


• Both may be utilised in the reporting process.
• Both are controlled by devices that save information electronically.

But despite this, "Big Data" and "Data Warehouse" are not the same thing at all. Why? To understand
this, we need to recapitulate the basic concepts of a data warehouse.

A data warehouse is a subject-oriented, non-volatile, integrated, and time-variant collection of data that is
generated for the aim of the management making decisions with such data. A data warehouse provides a
centralized, integrated, granular, and historical point of reference for all of the company's data in one
convenient location.

So why do individuals demand a solution that involves big data? People are looking for a solution for big
data because there is a significant amount of data in many organizations. And in such companies, that
data – if it is accessed properly – can include a great deal of important information that can lead to better
decisions, which in turn can lead to more revenue, more profitability, and more customers. And the vast
majority of businesses want to do this.

What are the reasons that people require a data warehouse? In order for people to make decisions based
on accurate information, a data warehouse is required. You need data that is trustworthy, verifiable, and
easily accessible to everyone if you want to have a thorough understanding of what is occurring within
your company.

So, in short we can understand the data Warehouse involves a process of collecting data from a number of
sources, processing it, and storing it in a repository where it can be analyzed and used for reporting
reasons. The procedure described above results in the establishment of a data repository that is known as
the data warehouse.

Comparison: Big Data Vs Data Warehouse


Therefore, what do we find when we compare a data warehouse with a big data solution? A big data
solution is a form of technology, whereas data warehousing is a type of architectural framework, and the
two could not be more different from one another. To put it another way, the technology is, loosely
speaking, any mechanism that is capable of storing and organising huge amounts of data, whereas a data
warehouse is a facility that stores and organises data with the aim of guaranteeing a company's data
credibility and integrity. Data warehouses are becoming increasingly common. When data is taken from a
data warehouse, the person taking the data knows that other individuals are using the same data for
various purposes. The availability of a data warehouse thus ensures a foundation that enables the
reconcilability of data.
S.No. Big Data vs Data Warehouse

1. Big Data: refers to data that is stored in an extremely large format and can be utilised by many
   technologies.
   Data Warehouse: refers to the accumulation of historical data from a variety of business processes
   within an organisation.

2. Big Data: a type of technology that can store and manage extensive amounts of data.
   Data Warehouse: an architecture that is utilised in the process of organising data.

3. Big Data: accepts structured, unstructured or semi-structured data as input.
   Data Warehouse: only structured data can be used as input.

4. Big Data: a distributed file system is used for processing big data.
   Data Warehouse: processing operations in the data warehouse are not carried out using a distributed
   file system.

5. Big Data: big data may not use SQL queries when retrieving data from databases.
   Data Warehouse: structured query language (SQL) is used to get data from relational databases in the
   data warehouse.

6. Big Data: Apache Hadoop is capable of managing massive amounts of data when handled
   appropriately.
   Data Warehouse: it is difficult for a data warehouse to manage an extremely large volume of
   unstructured data.

7. Big Data: the modifications that occur as a result of adding new data are saved in the form of a file,
   which is represented by a table.
   Data Warehouse: changes in the data that occur as a result of adding new data do not have an
   immediate and direct effect on the data warehouse.

8. Big Data: compared to data warehouses, big data does not require the same level of management
   effort.
   Data Warehouse: because the data is compiled from a variety of business divisions, the data warehouse
   calls for more rigorous management strategies.

Big Data uses Distributed File System. So, we extend our discussion on Distributed file Systems in
section 5.7 of this unit.

☞ Check Your Progress 3


1. Give similarities between Big Data and Data warehouse.
………………………………………………………………………………………………………
………………………………………………………………………………………………………
2. Compare Big Data and Data warehouse.
………………………………………………………………………………………………………
………………………………………………………………………………………………………

5.7 DISTRIBUTED FILE SYSTEM

A Distributed File System (DFS) is a file system that is distributed across a large number of file servers
may be at several locations (It's possible that these servers are located in various parts of the world). It
allows programmers to access or store isolated files the same way they do local files, allowing
programmers to access files from any network or computer.

The primary objective of the Distributed File System (DFS), is to facilitate the sharing of users' data and
resources between physically separate systems through the utilization of a Common File System. A setup
on Distributed File System is defined as a group of workstations and mainframes that are linked together
over a Local Area Network (LAN). Within the framework of the operating system, a DFS is carried out.
A namespace is generated by DFS, and the clients are not exposed to the inner workings of this
generation process.

The DFS comprises two distinct components, given below:

• Location transparency: achieved through the namespace component, which allows clients to refer
  to files without knowing where they are physically stored.

• Redundancy: established through the use of a file replication component.

Together, these components make it more likely that data will remain accessible in the event of a
breakdown or a severe load. They accomplish this by allowing data stored in multiple locations to be
logically grouped together and shared under a single folder that is referred to as the "DFS root".

It is possible to use the namespace component of Distributed File System (DFS) without using the file
replication component, and it is perfectly possible to use the file replication component of Distributed File
System (DFS) between servers without using the namespace component. It is not essential to make use of
both of the DFS components at the same time.

In order to gain more clarity on the Distributed File System (DFS), we need to study the Features of DFS:

1. Transparency:
When discussing distributed systems, the term "transparency" refers to the process of hiding
information about the separation of components from both the user and the application
programmer. This is done in order to protect the integrity of the system. Because of this, it seems
as though the whole system is a single thing, rather than a collection of distinct components
working together. It is classified into the following types:
a) Structure Transparency –
There is no reason for the client to be aware of the number of file servers and storage
devices and their geographical locations. But it is always recommended to provide a
number of different file servers to improve performance, adaptability, and dependability.
b) Access Transparency –
The access process for locally stored files and remotely stored files should be identical. It
should be possible for the file system to automatically locate the file that is being
accessed and send that information to the client's side.
c) Naming transparency –
There should be no indication in the name of the file as to where the file is located in any
way, shape, or form. Once a name has been assigned to the file, it should not have any
changes made to it while it is being moved from one node to another.
d) Replication transparency –
If a file is replicated on multiple nodes, both the copies of the file and their locations
should be hidden from clients as requests move from one node to another.

2. User Mobility: It will automatically transfer the user's home directory to the node where the user
logs in.

3. Performance: Performance is measured by the average time taken to satisfy client requests,
which includes CPU time, the time taken to access secondary storage, and the network access
time. Ideally, a Distributed File System should offer performance comparable to that of a
centralized file system (CFS), based on requirements.

4. Simplicity and ease of use: A file system should have a simple user interface and a small
number of commands.

5. High availability: If a link fails, a node fails, or a storage drive crashes, a Distributed File
System should still be able to keep working. A distributed file system (DFS) that is both reliable
and flexible should have different file servers that control different storage devices that are also
separate.

6. Scalability: Adding new machines to the network or joining two networks together is a common
way to make the network bigger, so the distributed system will always grow over time. So, a good
distributed file system (DFS) should be built so that it can grow quickly as the number of nodes
and users grows. As the number of nodes and users grows, the service should not deteriorate too
much.

7. High reliability: A good distributed file system (DFS) should make it as unlikely as possible that
data will be lost. That is, users should not feel like they have to make backup copies of their files
because the system is not reliable. Instead, a file system should make copies of important files in
case the originals get lost. Stable storage is used by many file systems to make them very reliable.
8. Data integrity: A file system is often used by more than one person at a time. The file system
must make sure that the data saved in a shared file stays the same. That is, a concurrency control
method must be used to keep track of all the different users' requests to access the same file at the
same time. Atomic transactions are a type of high-level concurrency management that a file
system often offers to users to keep their data safe.

9. Security: A distributed file system should be safe so that its users can trust that their data will be
kept private. Security measures must be put in place to protect the information in the file system
from unwanted and unauthorized access.

10. Heterogeneity: Because distributed systems are so big, there is no way to avoid heterogeneity.
Users of heterogeneous distributed systems can choose to use different kinds of computers for
different tasks.

The working of DFS can be put into practice in one of two different ways:

• Standalone DFS namespace – permits only DFS roots that are located on the local computer and
  does not use Active Directory. A standalone DFS can be accessed only on the computer on which
  it was created. It offers no fault tolerance and cannot be linked to any other DFS. Standalone DFS
  roots are not very common because of this limited advantage.

• Domain-based DFS namespace – stores the DFS configuration in Active Directory and creates a
  DFS namespace root that is accessible at \\<domain name>\<DFS root>.

Advantages of a Distributed File System (DFS):

a) DFS permits many active users to access or store data simultaneously.

b) DFS allows remote sharing of data.

c) DFS facilitates file searching, provides faster access to data, and improves network utilisation.

d) DFS improves scalability and the ability to exchange data.

e) DFS maintains data transparency and replication; even if a server or disk fails, the Distributed File
System keeps the data available.

Disadvantages of a Distributed File System (DFS):

a) Because all the nodes and connections in a Distributed File System need to be secured, security is
harder to guarantee.

b) Some messages and data may be lost while moving from one node to another in the network.

c) Connecting a Distributed File System to a database is comparatively difficult.

d) Compared to a single-user system, a Distributed File System makes database management more
complex.

e) If every node in the network attempts to transfer data at the same time, the network may become
overloaded.

After understanding the Distributed File System (DFS), now it is time to understand Hadoop and HDFS
(Hadoop Distributed File System). The discussion for HDFS (Hadoop Distributed File System) and
MapReduce is given in the subsequent section, and the details of MapReduce are explicitly available in
Unit-6.

☞ Check Your Progress 4

1. Describe the term "Distributed File System" in the context of Big Data.
………………………………………………………………………………………………………
………………………………………………………………………………………………………
2. What is the primary objective of Distributed File System?
………………………………………………………………………………………………………
………………………………………………………………………………………………………
3. Explain the components of Distributed File System.
………………………………………………………………………………………………………
………………………………………………………………………………………………………
4. Discuss the features of Distributed File System
………………………………………………………………………………………………………
………………………………………………………………………………………………………
5. Discuss the number of ways through which the working of Distributed File System can
be put into practice
………………………………………………………………………………………………………
………………………………………………………………………………………………………
6. Give advantages and disadvantages of Distributed File Systems.
………………………………………………………………………………………………………
………………………………………………………………………………………………………

5.8 HDFS AND MAPREDUCE

Data is being produced at lightning speed in the modern period by a wide variety of sources, such as
corporations, scientific experiments, e-mails, blogs, and other online platforms. Data-intensive
applications and storage clusters need to be implemented in order to analyse and manage this massive
amount of data and to extract information that is of use to the users. Such applications are required to
fulfil a number of characteristics: fault tolerance, parallel processing, data distribution, load balancing,
scalability, and highly available operation. The MapReduce programming approach was developed by
Google specifically for the purpose of resolving problems of this kind. Apache Hadoop, often simply
called Hadoop, is an open-source software project that implements the MapReduce technology.

Hadoop is a collection of freely available software services that can be used in conjunction with one
another. It offers a software framework for storing a huge amount of data across a variety of locations and
for processing that data by utilising the MapReduce programming style. Hadoop's essential components
are the Hadoop Distributed File System (HDFS) and the MapReduce programming paradigm: MapReduce
is the programming style used to operate on the data, and HDFS is the file system used to store it. The
combination of HDFS and MapReduce creates an architecture that, in addition to being scalable and
fault-tolerant, conceals all of the complexity associated with the analysis of big data.

The Hadoop Distributed File System, abbreviated as HDFS, is a self-healing, distributed file system that
offers dependable, scalable, and fault-tolerant data storage on commodity hardware. It was developed as
part of the Hadoop project and collaborates closely with MapReduce by distributing storage and
processing across huge clusters, combining storage resources that can scale up or down depending on the
requests and queries being processed. HDFS can store data in any format, including text, photos and
videos, and automatically optimises it for high-bandwidth streaming. The most significant benefit of
HDFS is its ability to tolerate failures: data moves quickly between the nodes, and Hadoop can continue
to offer service even in the event that individual nodes fail, which reduces the possibility of a catastrophic
failure.

The information presented in this section can be used for understanding the development of large-scale
distributed applications that can make use of the computing capacity of several nodes in order to finish
tasks that are data and computation intensive.

Let us discuss the Hadoop Architecture which includes the following components:

• the MapReduce programming framework and


• the Hadoop distributed file system (HDFS)

Understanding the Hadoop architecture requires introducing the various daemons involved in its working.
Apache Hadoop has five daemons. Three of them – the NameNode, the DataNode and the Secondary
NameNode – belong to HDFS and are responsible for efficiently managing distributed storage. The other
two – the JobTracker and the TaskTracker – are utilised by the MapReduce engine and are responsible for
job tracking and job execution respectively. Each of these daemons runs in its own JVM.

Firstly, we will discuss the HDFS, a distributed file system that is comprised of three nodes i.e. the
NameNode, the DataNode, and the Secondary NameNode respectively.
1) NameNode : A single NameNode daemon operates on the master node. NameNode is responsible
for storing and managing the metadata that is connected with the file system. This metadata is
stored in a file that is known as fsimage. When a client makes a request to read from or write to a
file, the metadata is held in a cache that is located within the main memory so that the client may
access it more rapidly. The I/O tasks are completed by the slave DataNode daemons, which are
directed in their actions by the NameNode.

The NameNode manages and directs how files are divided up into blocks, selects which slave
nodes should store these blocks, and monitors the overall health and fitness of the distributed
file system. The operations carried out by the NameNode make intensive use of memory and
input/output (I/O).

2) DataNode : A DataNode daemon is present on each slave node, which is a component of the
Hadoop cluster. DataNodes are the primary storage parts of HDFS. They are responsible for
storing data blocks and catering to requests to read or write files that are stored on HDFS. These
are under the authority of NameNode. Blocks that are kept in DataNodes are replicated in
accordance with the configuration in order to guarantee both high availability and reliability.
These duplicated blocks are dispersed around the cluster so that computation may take place more
quickly.

3) Secondary NameNode : A backup for the NameNode is not provided by the Secondary
NameNode. The job of the Secondary NameNode is to read the file system at regular intervals,
log any changes that have occurred, and then apply those changes to the fsimage file. This assists
in updating NameNode so that it can start up more quickly the next time, as shown in Figure 3.

The HDFS layer is responsible for daemons that store data and information, such as NameNode and
DataNode. The MapReduce layer is in charge of JobTracker and TaskTracker, which are seen in Figure 1
and are responsible for keeping track of and actually executing jobs.

[Figure 1 shows the Hadoop daemons: under HDFS, one NameNode with several DataNodes; under MapReduce, one JobTracker with several TaskTrackers.]

Figure 1: Hadoop Daemons
The master/slave architecture is utilised by HDFS, with the NameNode daemon and secondary
NameNode both operating on the master node and the DataNode daemon running on each and every slave
node, as depicted in Figure 2. The HDFS storage layer consists of three different daemons.

Figure 2: HDFS Architecture

Figure 3: Task of the Secondary NameNode

A Hadoop cluster consists of several slave nodes, in addition to a single master node. NameNode is the
master daemon for the HDFS storage layer, while JobTracker is the master daemon for the MapReduce
processing layer. Both of these daemons are executed by the master node. The remaining machines will
be responsible for running the "slave" daemons, which include DataNode for the HDFS layer and
TaskTracker for the MapReduce layer.

A master node can at times also take on the role of a slave, since it is capable of running both the master
daemons and the slave daemons. The daemons operating on the master node are responsible for
coordinating and administering the slave daemons running on all of the other nodes, while the slave
daemons carry out the tasks essential for the processing and storing of data.

Now we will discuss the MapReduce concept and the daemons involved with MapReduce i.e. the
JobTracker and TaskTracker components that are utilized by the MapReduce engine.

MapReduce can refer to either a programming methodology or a software framework. Both are utilised in
Apache Hadoop. Hadoop MapReduce is a programming framework that is made available for creating
applications that can process and analyse massive data sets in parallel on large multi-node clusters of
commodity hardware in a manner that is scalable, reliable, and fault tolerant. The processing and analysis
of data consist of two distinct stages known as the Map phase and the Reduce phase.

Figure 4 : Map and Reduce Phase

During a MapReduce task, the input data is often segmented and partitioned into pieces before being
broken up and processed in parallel by the "Map phase" and then by the "Reduce phase." The data that is
generated by the Map phase is arranged and organised by the Hadoop architecture.

The result of the Map phase is sorted by the Hadoop framework and then sent as input to the Reduce
phase, which runs reduce tasks in parallel (see Figure 4). These input and output files are saved in the file
system; by default, HDFS is the source of input datasets for the MapReduce framework. Map and reduce
tasks need not proceed in a strictly sequential way: reduce tasks can begin copying (shuffling) map
outputs as soon as any map task finishes, without waiting for all map tasks to complete. MapReduce
operates on key-value pairs as its data structure. In theory, a MapReduce
job will accept a data set as an input in the form of a key-value pair, and it will only produce output in the
form of a key-value pair after processing the data set through MapReduce stages. As can be seen in
Figure 5, the output of the Map phase, which is referred to as the intermediate results, is sent on to the
Reduce phase as an input.

(K1, V1) → Map → (K2, V2) → Reduce → (K3, V3)

Fig 5: MapReduce key-value pairs
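As a concrete illustration of this (K1, V1) → (K2, V2) → (K3, V3) flow, the minimal Python sketch below counts hits per URL in a few hypothetical web-server log lines: the map step turns each (offset, line) pair into (url, 1) pairs, the shuffle step groups the pairs by key, and the reduce step turns each group into a final (url, total) pair. This is only an in-memory imitation of what the Hadoop framework does across a cluster.

# Hypothetical log lines. In HDFS, the map input key K1 is typically the
# byte offset of a line and V1 is the line text itself.
log_lines = [
    "10.0.0.1 GET /index.html",
    "10.0.0.2 GET /about.html",
    "10.0.0.3 GET /index.html",
]

def map_fn(offset, line):
    """(K1, V1) -> list of (K2, V2): emit (url, 1) for every request."""
    url = line.split()[2]
    return [(url, 1)]

def reduce_fn(url, counts):
    """(K2, list of V2) -> (K3, V3): total number of hits per URL."""
    return (url, sum(counts))

# The framework's shuffle step groups intermediate values by key.
grouped = {}
for offset, line in enumerate(log_lines):
    for key, value in map_fn(offset, line):
        grouped.setdefault(key, []).append(value)

results = [reduce_fn(url, counts) for url, counts in grouped.items()]
print(results)   # [('/index.html', 2), ('/about.html', 1)]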

On the same lines as HDFS, MapReduce also makes use of a master/slave architecture. As illustrated in
Figure 6, the JobTracker daemon resides on the master node, while the TaskTracker daemon resides on
each of the slave nodes.

The MapReduce processing layer consists of two different daemons i.e. JobTracker and TaskTracker,
their role are discussed below:
Fig 6: JobTracker and TaskTracker

4) JobTracker: The JobTracker service is responsible for monitoring MapReduce tasks that are carried out
on slave nodes and is hosted on the master node. The job is sent to the JobTracker by the user through
their interaction with the Master node. The next thing that happens is that JobTracker queries NameNode
to find out the precise location of the data in HDFS that needs to be processed. JobTracker searches for
TaskTracker on slave nodes and then sends the jobs to be processed to TaskTracker on those nodes. The
TaskTracker will occasionally send a heartbeat message back to the JobTracker to verify that the
TaskTracker on a specific slave node is still functioning and working on the task that has been assigned to
it. If the heartbeat message is not received within the allotted amount of time, the TaskTracker running on
that particular slave node is deemed to be inoperable, and the work that was assigned to it is moved to
another TaskTracker to be scheduled. The combination of JobTracker and TaskTracker is referred to as
the MapReduce engine. If there is a problem with the JobTracker component of the Hadoop MapReduce
service, all active jobs will be halted until the problem is resolved.

5) TaskTracker: On each slave node that makes up a cluster, a TaskTracker daemon is executed. It works
to complete MapReduce tasks after accepting jobs from the JobTracker. The capabilities of a node
determine the total amount of "task slots" that are available in each TaskTracker. Through the use of the
heartbeat protocol, JobTracker is able to determine the number of "task slots" that are accessible in
TaskTracker on a slave node. It is the responsibility of JobTracker to assign suitable work to the relevant
TaskTrackers, and the number of open task slots will determine how many jobs can be assigned. On every
slave node, TaskTracker is the master controller of how each MapReduce action is carried out. Even
though there is only one TaskTracker for each slave node, each TaskTracker has the ability to start
multiple JVMs so that MapReduce tasks can be completed simultaneously. JobTracker, the master node,
receives a "heartbeat" message regularly from each slave node's TaskTracker. This message confirms to
JobTracker that TaskTracker is still functioning.

In short, it can be said that, when it comes to processing massive and unstructured data volumes, the
Hadoop MapReduce computing paradigm and HDFS are becoming increasingly popular choices. While
masking the complexity of deploying, configuring, and executing the software components in the public
or private cloud, Hadoop makes it possible to interface with the MapReduce programming model. Users
are able to establish clusters of commodity servers with the help of Hadoop. MapReduce has been
modelled as an independent platform-as-a-service layer that cloud providers can utilise to meet a variety
of different requirements. Users are also given the ability to comprehend the data processing and analysis
processes.

5.9 APACHE HADOOP 1 AND 2 (YARN)

Apache Hadoop 1.x suffered from a number of architectural flaws, the most notable of which was a
decline in the overall performance of the system. The cause of the problem was the excessive strain
placed on the MapReduce component of Hadoop 1, which was responsible for both resource management
and application management. In Hadoop 2, resource management is handled by a new component called
YARN (Yet Another Resource Negotiator), while MapReduce remains in charge of application
management. With the release of Hadoop 2, YARN has added two additional daemons:

• Resource Manager
• Node Manager
These two new Hadoop 2 daemons have replaced the JobTracker and TaskTracker in Hadoop 1.

In addition, there is only one NameNode in a Hadoop 1.x cluster, which means that it serves as a single
point of failure for the entire system. The Hadoop 2.x architecture, on the other hand, incorporates both
active and passive (standby) NameNodes. If the active NameNode is unable to complete the tasks
assigned to it, the passive NameNode steps in and assumes responsibility. This element is directly
responsible for the high availability of Hadoop 2.x, which is one of the most important characteristics of
the new version.
Data processing is problematic in Hadoop 1.x; Hadoop 2.x's YARN, however, provides centralised
resource management that allows resources to be shared, making it possible for multiple applications
within Hadoop to execute concurrently on a common pool of resources.

The comparison between Apache Hadoop-1 and Hadoop-2 is consolidated below:

Components
Apache Hadoop 1: has MapReduce.
Apache Hadoop 2: has YARN (Yet Another Resource Negotiator) and MapReduce version 2.

Daemons
Apache Hadoop 1: NameNode, DataNode, Secondary NameNode, JobTracker, TaskTracker.
Apache Hadoop 2: NameNode, DataNode, Secondary NameNode, Resource Manager, Node Manager.

Working
Apache Hadoop 1: HDFS and MapReduce are the two components of Hadoop 1. HDFS is responsible for
data storage, while MapReduce, which sits above HDFS, manages resources and also performs data
processing; as a result, system performance is heavily impacted.
Apache Hadoop 2: HDFS is again used for storage, while YARN, a resource management system, is
layered over HDFS to perform its duties. YARN distributes the available resources and ensures that
everything continues to function normally; it effectively shares the load of MapReduce, hence the system
performance improves.

Limitation
Apache Hadoop 1: Hadoop 1 uses a master-slave architecture in which one master rules over a large
number of slaves. If the master node suffers a catastrophic failure, the cluster is wiped out regardless of
the quality of the slave nodes; rebuilding that cluster requires copying system files, image files and so on
onto another machine, which takes an excessive amount of time and is something enterprises simply
cannot accept.
Apache Hadoop 2: Hadoop 2 utilises the same master-slave architecture, but it is made up of a number of
masters (active NameNodes and standby NameNodes) and a number of slaves. If the active master node
crashes, the standby master node takes control of the network. A wide variety of active-standby node
combinations can be created, so the issue of a single point of failure is resolved in Hadoop 2.

Ecosystem (common to both)
Oozie is a simplified workflow scheduler; it determines the specific times at which jobs will run based on
the dependencies between them. Pig, Hive and Mahout are examples of data processing tools that sit
above Hadoop and execute their jobs there. Sqoop is an application that imports and exports structured
data; using it, data can be moved directly between a SQL database and HDFS. Flume is a tool used to
import and export streaming data as well as unstructured data.

Support
Apache Hadoop 1: no Microsoft Windows support is available for Hadoop 1.x.
Apache Hadoop 2: Microsoft Windows support is available for Hadoop 2.x.
☞ Check Your Progress 5
1. What is Hadoop? Discuss the components of Hadoop.
………………………………………………………………………………………………………
………………………………………………………………………………………………………
2. Discuss the role of HDFS and MapReduce in Hadoop Architecture
………………………………………………………………………………………………………
………………………………………………………………………………………………………
3. Discuss the relevance of various nodes in HDFS architecture
………………………………………………………………………………………………………
………………………………………………………………………………………………………
4. Explain, how master/slave process works in HDFS architecture
………………………………………………………………………………………………………
………………………………………………………………………………………………………
5. What do you understand by the Map phase and reduce phase in MapReduce architecture?
………………………………………………………………………………………………………
………………………………………………………………………………………………………
6. List and explain the various daemons involved in the functioning of MapReduce.
………………………………………………………………………………………………………
………………………………………………………………………………………………………
7. Differentiate between Apache Hadoop-1 and Hadoop-2.
………………………………………………………………………………………………………
………………………………………………………………………………………………………

5.10 SUMMARY

This unit covered the concepts necessary for understanding Big Data and its characteristics, and also
approached the concept of Big Data from the applications point of view. The unit then compared
structured, semi-structured and unstructured data, and presented a comparison of Big Data and the Data
Warehouse. It also covered the important concept of distributed file systems used in Big Data. Finally,
the unit concluded with a comparative presentation of MapReduce and HDFS, along with Apache
Hadoop 1 and 2 (YARN).

5.11 SOLUTIONS/ANSWERS

☞ Check Your Progress 1


1. What exactly does "Big Data" mean?
Refer to section 5.3
2. List the components required by Big Data system to function properly
Refer to section 5.3
3. Discuss the essential characteristics of Big Data
Refer to section 5.3
☞ Check Your Progress 2
1. Give Five applications of Big Data
Refer to section 5.4
2. Compare Structured, Semi-Structured and Unstructured Data
Refer to section 5.5
3. Discuss the role of Structured, Semi-Structured and Unstructured Data in decision
making
Refer to section 5.5

☞ Check Your Progress 3


1. Give similarities between Big Data and Data warehouse
Refer to section 5.6
2. Compare Big Data and Data warehouse
Refer to section 5.6
☞ Check Your Progress 4
1. Describe the term "Distributed File System" in context of Big Data.
Refer to section 5.7
2. What is the primary objective of Distributed File System
Refer to section 5.7
3. Explain the components of Distributed File System
Refer to section 5.7
4. Discuss the features of Distributed File System
Refer to section 5.7
5. Discuss the number of ways through which the working of Distributed File System can
be put into practice
Refer to section 5.7
6. Give advantages and disadvantages of Distributed File System
Refer to section 5.7
☞ Check Your Progress 5
1. What is Hadoop? Discuss the components of Hadoop.
Refer to section 5.8
2. Discuss the role of HDFS and MapReduce in Hadoop Architecture
Refer to section 5.8
3. Discuss the relevance of various nodes in HDFS architecture
Refer to section 5.8
4. Explain, how master/slave process works in HDFS architecture
Refer to section 5.8
5. What do you understand by the Map phase and reduce phase in MapReduce architecture
Refer to section 5.8
6. List and explain the various daemons involved in the functioning of MapReduce
Refer to section 5.8
7. Differentiate between Apache Hadoop-1 and Hadoop-2
Refer to section 5.9

5.12 FURTHER READINGS

• C. Lam, "Introducing Hadoop", in Hadoop in Action, MANNING, 2011.

• D. Borthakur, "The Hadoop distributed file system: architecture and design," Hadoop
Project Website [online]. Available: http://hadoop.apache.org/core/docs/current/hdfs_design.pdf
UNIT 6 PROGRAMMING USING MAPREDUCE

Structure
6.0 Introduction
6.1 Objectives
6.2 Map Reduce Operations
6.3 Loading data into HDFS
6.3.1 Installing Hadoop
6.3.2 Loading Data
6.4 Executing the MapReduce phases
6.4.1 Executing the Map phase
6.4.2 Shuffling and sorting
6.4.3 Reduce phase execution
6.4.4 Node Failure and MapReduce
6.5 Algorithms using MapReduce
6.5.1 Word counting
6.5.2 Matrix-Vector Multiplication
6.6 Summary
6.7 Answers
6.8 References and further readings

6.0 INTRODUCTION

In Unit 5 of this Block, you have gone through the concepts of HDFS and have been
introduced to the MapReduce programming paradigm. HDFS is a distributed file
system that provides reliable storage of large data files across cluster machines.
This unit discusses the MapReduce programming concepts: how to leverage cluster
machines to perform a particular task by dividing the work effectively across them
in a reliable and fault-tolerant way. Such tasks include big jobs, for example
building the index of a search engine from crawled web pages. This unit illustrates
the three stages in MapReduce, viz. the Map phase, the Shuffle phase and the Reduce
phase. It also discusses map and reduce operations and how to load and store data
in HDFS. Lastly, this unit provides a few classical examples, such as counting words
in documents and matrix-vector multiplication.

6.1 OBJECTIVES

After completing this unit, you will be able to:


• define MapReduce;
• explain the main characteristics of MapReduce;
• discuss the various phases of MapReduce - Mapper, Shuffle and
Reduce;
• illustrate the MapReduce operations and loading of data into HDFS;
• solve various classical algorithms such as word count, matrix-vector
multiplication using MapReduce.
6.2 MAP REDUCE OPERATIONS

Advances in Information Technology in the last two decades have led to


tremendous growth in data. Technologies like Database management Systems,
Data Warehouse, World Wide Web (WWW), Internet of Things (IOT), Cloud
computing etc. resulted in tremendous growth of size of data. This data is
heterogenous in nature and is growing at a brisk rate. Processing of this Big data
for generating useful actionable knowledge, for timely decision making,
required advancement in storage and processing technologies. The two major
initial applications of such Big data that led to development newer processing
paradigms were:

• Computation of PageRank: this application required collecting huge amounts of
information about various web pages and performing multiplication of very
high-dimensional matrices.
• Social networking friend networks, which generate huge graphs of people and their
connections.

These applications required a fault-tolerant parallel or distributed computation
environment. The options were either to develop parallel computers or to use a
large cluster of computers. Owing to the availability of huge computer clusters and
cheap communication devices, newer storage and processing technologies evolved.
A distributed and fault-tolerant storage technology was developed, termed the
distributed file system (DFS), with the Hadoop Distributed File System (HDFS)
being its first open-source implementation. Unit 5 has covered most aspects of DFS
and HDFS.

Prior to 2004, huge amounts of data were typically stored on a single server. If a
program ran a query that involved data stored on multiple servers, there was no
means of logically integrating the data for analysis, and doing so would require
massive computation time and effort. Furthermore, there was a threat of data loss
and of missing backups, all of which limited scalability. To address this, Google
introduced the MapReduce model in December 2004 (Apache Hadoop later provided
its open-source implementation), which led to a significant reduction in analysis
time. It allows various queries to run simultaneously on several server machines and
logically integrates the search results, thus facilitating real-time analysis. Other
advantages of MapReduce include i) fault tolerance and ii) scalability.
MapReduce is a programming model that can process as well as analyse huge data
sets across machine clusters. The mapper transforms the input into intermediate
key-value pairs, the framework sorts and groups these pairs by key, and the reducer
aggregates the grouped values while pruning unnecessary or bad data.
Hadoop MapReduce
Hadoop MapReduce is the processing unit of Hadoop, using which we can process the big data stored
in HDFS. It allows us to perform parallel and distributed computing on huge data sets.
MapReduce is used in indexing and searching of data, classification of data,
recommendation of data, and analysis of data.
There are two functions in MapReduce i.e., one is the Map function and the other is
Reduce function. The architecture of MapReduce is shown in Figure 1. You may
observe there are a number of map and reduce functions in Figure 1. All the map
functions are performed in parallel to each other, similarly all reduce functions are also
performed in parallel to each other. These activities are coordinated by the MapReduce
architecture, which also deals with the failure of nodes performing these operations. The
MapReduce architecture as shown in Figure 1, can be summarised as:

• The input to map() functions are a sequence of records or documents etc.,


which are stored on a DFS node.
• A number of map functions create a sequence of <key, value> pairs from the
input data based on the coding of map function (Please refer to Figure 2).
• The output of map functions are collected and divided, as input to Reduce
function by a master controller.
• Finally, the code of reduce function combines the input received by it to
produce final output.

Advantages of Hadoop MapReduce are:

1. Parallel processing: data is processed in parallel, which makes processing
fast.
2. Data locality: the map function is generally performed locally on the
DFS node where the data is stored. Processing the data locally is very
cost-effective.
3. Fault tolerance: the MapReduce system makes sure that the processing is
performed in a fault-tolerant manner.

Figure 1: MapReduce architecture

Thus, in MapReduce programming, an entire task is divided into map tasks and
reduce tasks. Map takes a <key, value> pair as input and produces a list of
<key, value> pairs as output. Reduce takes a key together with the list of values
grouped for that key after shuffling, and produces the final <key, value> pairs,
as shown in Figure 2.
Figure 2: Key-value pair in MapReduce
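The following minimal Python simulation (plain Python, not Hadoop code) mirrors this flow for the word-count task on a small hypothetical input: each map call emits <word, 1> pairs, the shuffle-and-sort step groups the pairs by key, and each reduce call collapses one key and its list of values into a final <word, count> pair.

from collections import defaultdict

# Three hypothetical input splits, each handled by one map task.
splits = ["deer bear river", "car car river", "deer car bear"]

def mapper(split):
    """Map: one input split -> list of <key, value> pairs, here <word, 1>."""
    return [(word, 1) for word in split.split()]

def reducer(key, values):
    """Reduce: <key, list of values> -> final <key, value> pair."""
    return (key, sum(values))

# Map phase: each split is mapped independently (on Hadoop, in parallel).
intermediate = []
for split in splits:
    intermediate.extend(mapper(split))

# Shuffle and sort phase: sort by key and group all values of the same key.
groups = defaultdict(list)
for key, value in sorted(intermediate):
    groups[key].append(value)

# Reduce phase: one reduce call per distinct key.
output = [reducer(key, values) for key, values in groups.items()]
print(output)   # [('bear', 2), ('car', 3), ('deer', 2), ('river', 2)]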

Next, we discuss the data loading into HDFS system, which is the starting point
of working with MapReduce architecture.

6.3 LOADING DATA INTO HDFS

In order to load data into HDFS, you first need to install Hadoop MapReduce on your
system. The minimum configuration for an HDFS installation is:
1) Intel Core 2 Duo/Quad/Hex/Octa or higher 64-bit processor PC or laptop
(minimum operating frequency of 2.5 GHz)
2) Hard disk capacity of 1-4 TB
3) 64-512 GB RAM
4) 10 Gigabit Ethernet or bonded Gigabit Ethernet
5) OS: Ubuntu/any Linux distribution/Windows
In the programs explained below, we have used Ubuntu as the operating system.
6.3.1 Installing Hadoop
The step-by-step installation of Hadoop MapReduce is as follows:
Step 1: Install Java version using the following command:
sudo apt install openjdk-8-jdk

After this, the following packages get installed.


You may check whether java version is installed or not by using the command:
cd /usr/lib/jvm

It will show the following output screen:

Step 2: For installing the ssh to securely connect to remote server/system and
transfer the data in encrypted form, you may use the following command:
sudo apt-get install ssh

Step 3: For installing pdsh, a parallel shell tool used to run commands across
multiple nodes in a cluster, you may use the following command:

sudo apt-get install pdsh

Step 4: Download the Hadoop release file (hadoop-3.3.3 in our case) directly from
"hadoop.apache.org", or download it in the terminal using wget

Step 5: Extract the downloaded archive into the folder where you want to keep the
files (the later steps assume ~/Downloads/hadoop-3.3.3), for example by using the
command:

tar -xzf hadoop-3.3.3.tar.gz -C ~/Downloads/

Step 6: open bashrc file using the command:


sudo nano ~/.bashrc
It will show the following output:

Step 7: open path of in terminal using the command:


cd ~/Downloads/hadoop-3.3.3/etc/hadoop

You will get the following display of the directory:

Step 8: Now, open hadoop-env.sh and change the Java path in it using the command:

sudo nano hadoop-env.sh

and set the path for JAVA_HOME

Step 9: Edit all the XML configuration files according to our requirements and paste
the given configuration into each.
sudo nano core-site.xml
Now edit hdfs-site.xml
sudo nano hdfs-site.xml

---------------------------------------------------------------------------

---------------------------------------------------------------------------

Now edit mapred-site.xml


sudo nano mapred-site.xml

---------------------------------------------------------------------------
---------------------------------------------------------------------------

Now edit yarn-site.xml


sudo nano yarn-site.xml

---------------------------------------------------------------------------

--------------------------------------------------------------------------
Step 10: Open localhost

Step 11: Format the default file system


~/Downloads/hadoop-3.3.3/bin/hdfs namenode -format (the Hadoop files are in the
Downloads folder)

export PDSH_RCMD_TYPE=ssh

Step 12: Start NameNode daemon and DataNode daemon


~/Downloads/hadoop-3.3.3/sbin/start-dfs.sh

Step 13: Check whether Hadoop has been installed correctly (for example, by running the jps command and verifying that the Hadoop daemons are listed)

::::::::::::::::: SUCCESSFULLY INSTALLED HADOOP ::::::::::::::::::::::::::

6.3.2 Loading data

In order to use the MapReduce feature of Hadoop, you first need to load the data into
HDFS. We will show the steps for the word count operation on an input file named
input.txt, stored on the desktop, as shown in Figure 3.

Figure 3: Example for word count operation "input.txt" file

Step 1: Format the default configured hadoop HDFS server


hadoop namenode -format

Step 2: Start the DFS server


start-dfs.sh

Step 3: List the file in DFS server by using ls


$HADOOP_HOME/bin/hadoop fs -ls

Step 4: Now, to insert the data, we have to create an input directory


$HADOOP_HOME/bin/hdfs dfs -mkdir /user/input

Step 5: Now to put the input text file in hdfs server


$HADOOP_HOME/bin/hdfs dfs -put ~/Desktop/input.txt /user/input

Step 6: We can verify the text file


$HADOOP_HOME/bin/hdfs dfs -ls /user/input

Step 7: Read the data stored in dfs server


$HADOOP_HOME/bin/hdfs dfs -cat /user/input/input.txt
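If you prefer to drive the same steps from a script instead of typing each command, the small Python sketch below simply wraps the hdfs dfs commands shown above using subprocess. It assumes the Hadoop binaries are on your PATH and reuses the paths from the steps above (~/Desktop/input.txt and /user/input); adjust them to your own setup.

import subprocess
from pathlib import Path

def hdfs(*args):
    """Run an 'hdfs dfs' sub-command and raise an error if it fails."""
    subprocess.run(["hdfs", "dfs", *args], check=True)

local_file = Path.home() / "Desktop" / "input.txt"

# Create the input directory in HDFS (-p avoids an error if it already exists).
hdfs("-mkdir", "-p", "/user/input")

# Copy the local file into HDFS (-f overwrites an existing copy) and verify.
hdfs("-put", "-f", str(local_file), "/user/input")
hdfs("-ls", "/user/input")
hdfs("-cat", "/user/input/input.txt")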

Check Your Progress 1:

1. What are the different operations of MapReduce?

2. What are the advantages of MapReduce?
3. List and perform the steps of installation of MapReduce on a single node.
4. Load the data "MapReduce is a programming paradigm for Big Data analysis.
MapReduce should be learned for faster analysis" into HDFS.

6.4 EXECUTING THE MAPREDUCE PHASES

In this section we discuss the process of execution of MapReduce phases with


the help of the example given in Figure 3.

6.4.1 Execution of Map Phase

For the same example as in Figure 3, we will now use the MapReduce feature of
Hadoop.
First, move to the Hadoop folder in the terminal.
cd ~/Downloads/hadoop-3.3.3/

We will move to MapReduce file

cd share/hadoop/mapreduce/

A JAR file is a Java archive that packages the classes implementing different
MapReduce operations. Several such JAR files are available in the Hadoop folder.

When we open one of these JAR files, we can see the different classes (operations)
it contains.

6.4.2: Shuffling and Sorting

For the example in Figure 3, you can see the map, shuffle and reduce
operations in Figure 4.
Figure 4 Map, Shuffle and Reduce operations for "input.txt" file

The Map and Reduce phases are executed as parallel processes. In the map phase, the input is split among the mapper nodes, where each chunk is processed and mapped to keys, forming (key, value) tuples. These tuples are passed to nodes where sorting and shuffling of the tuples takes place, i.e., sorting and grouping the tuples based on keys so that all tuples with the same key are sent to the same node.

Shuffling is the process of moving data from the mappers to the reducers; it is the mechanism by which the system sorts the map output and feeds it as input to the reducers. Therefore, without the MapReduce shuffle phase, the reducers would not have any input from the map phase. As shuffling can begin even before the map phase is complete, it speeds up job completion and saves time.

The MapReduce framework automatically sorts the keys produced by the mapper, i.e., all intermediate key-value pairs created by the mapper are sorted by key (and not by value) before the reduce phase is started. The values associated with a key, however, reach the reducer in arbitrary order; they are not sorted before being passed on.
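To make this sort-and-group behaviour concrete, the following small Python snippet (not Hadoop code, only a toy simulation written for this unit) takes key-value pairs such as a mapper would emit for a word count job and groups them by key, which is exactly the form in which the shuffle and sort step hands data to the reducers. The sample pairs used here are assumed purely for illustration.

from itertools import groupby
from operator import itemgetter

# Key-value pairs as a mapper would emit them: (word, 1) for every word seen.
mapped = [("deer", 1), ("bear", 1), ("river", 1),
          ("car", 1), ("car", 1), ("river", 1), ("deer", 1)]

# Shuffle and sort: order the pairs by key so that equal keys become adjacent.
mapped.sort(key=itemgetter(0))

# Grouping: each key now appears once, together with the list of its values.
for word, pairs in groupby(mapped, key=itemgetter(0)):
    values = [count for _, count in pairs]
    print(word, values)   # e.g. car [1, 1] -- this is what a reducer receives

In a real Hadoop job, this grouping is performed by the framework itself across the network, not by the programmer.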

Hadoop sorting enables a reducer to quickly determine when a new reduce task should begin, which saves time for the reducer. The reducer initiates a new reduce task whenever a key in the sorted input data differs from the previous one. Each reduce task accepts key-value pairs as input and outputs key-value pairs. When you specify zero reducers (using the command setNumReduceTasks(0)), Hadoop MapReduce does not perform any sorting or shuffle operations; the MapReduce job then terminates at the map phase, which, being free of any sorting, becomes even faster. The secondary sorting approach can be used to sort the values supplied to each reducer, in either ascending or descending order. As seen in the word counting example of Figure 4, to find the accurate number of occurrences of each word, we need splitting and sorting of the data. Figure 5 gives the pseudocode for the word count problem in MapReduce. Initially, the mapper produces a key-value pair for every word: every word works as the key, and an integer works as the value (the frequency). Then, the reducer sums up all the counts associated with each word and produces the desired key-value pair.
Figure 5: Pseudocode for word count problem

6.4.3 Reduce Phase Execution

Hadoop's Reducer takes the intermediate key-value pair output from the Mapper
and processes each one separately to produce the output. The final result, which
is saved in HDFS, is the reducer's output. Typically, aggregation or summation-
type computations are done in the Hadoop Reducer. The Reducer processes the
mapper's output. It creates a new set of output after processing the data. Finally,
this output data is stored in HDFS.

A Reducer function is called on each intermediate key-value pair that the Hadoop
Reducer receives as input from the mapper. This data in the form of (key, value)
can be aggregated, filtered, and combined in a variety of ways for a variety of
processes. Reducer produces the result after processing the intermediate values
for a certain key produced by the map function. All the values for a given key are handled by exactly one reducer. Since reducers are independent of one another, they operate in parallel. The number of reducers is chosen by the user and is one by default. The reduce task collects the key-value pairs after shuffle and sort, and its output is written to the file system by the OutputCollector.collect() method. The user can specify the number of reducers for the job with Job.setNumReduceTasks(int). Figure 6(a) and Figure 6(b) show the code of the map function and the reduce function using Python. The code includes comments and is easy to follow.
Figure 6(a): Map function for word count problem using Python
Figure 6(b): Reduce function for word count problem using Python
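Since the code of Figures 6(a) and 6(b) is not reproduced in this text, the following is a minimal sketch of what such map and reduce functions look like when written as Hadoop Streaming scripts in Python. The file names mapper.py and reducer.py are our own assumptions; Hadoop Streaming supplies input lines on standard input and expects tab-separated key-value pairs on standard output, with the reducer receiving its input already sorted by key.

#!/usr/bin/env python3
# mapper.py -- emit the pair (word, 1) for every word read from standard input
import sys

for line in sys.stdin:
    for word in line.strip().split():
        print(f"{word}\t1")

#!/usr/bin/env python3
# reducer.py -- sum the counts of each word; the input arrives sorted by key
import sys

current_word, current_count = None, 0
for line in sys.stdin:
    word, count = line.rstrip("\n").split("\t", 1)
    if word == current_word:
        current_count += int(count)
    else:
        if current_word is not None:
            print(f"{current_word}\t{current_count}")
        current_word, current_count = word, int(count)
if current_word is not None:
    print(f"{current_word}\t{current_count}")

Such scripts are typically submitted with the hadoop-streaming jar shipped with Hadoop (under share/hadoop/tools/lib), passing mapper.py and reducer.py as the -mapper and -reducer options; the exact invocation depends on your installation.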

6.4.4 Node Failure and MapReduce

One of the major advantages of MapReduce is that it tolerates failures of the nodes that are performing the distributed computation. The worst scenario is a failure
of the compute node where the Master program is running, as the master node
coordinates the execution of the MapReduce job. This requires restarting the
entire MapReduce job. However, you may please note that only this one node
has the power to shut down the entire job; all other failures will be handled by
the Master node and the MapReduce task will finally finish.
For example, assume a compute node where a Map worker is located
malfunctions. In such a case, the Master node, which frequently pings the
Worker processes, will notice this failure. Once this worker, which was running
the mapper process, is detected, then all the Map tasks that were assigned to this
worker node, even if they were finished, will need to be redone. Rerunning
finished Map tasks is necessary because the output of those activities, which was
intended for the Reduce tasks, is now unavailable to the Reduce tasks because it
is located at that failed compute node. Each of these Map activities will be
rescheduled by the master on another Worker. Additionally, the Master must
inform every Reduce task that the input location from that Map job has changed.
However, all this activity is done without the intervention of any external person.
The system handles the entire process.
It is easier to handle a failure at a Reduce worker node. The Master merely
changes the status of any Reduce jobs that are currently running to idle. These
will eventually be moved to another Reduce worker.

Check Your Progress 2:


1. Explain the features and functions of the MapReduce paradigm.
2. How do you execute a MapReduce program for the word count problem?
3. What is the need of the splitting and sorting phases?

6.5 ALGORITHMS USING MAPREDUCE – WORD COUNTING, MATRIX-VECTOR MULTIPLICATION

In this section, we discuss two simple problems that can be solved using MapReduce. First, we discuss the word count problem and then matrix-vector multiplication.

6.5.1 Word Count Problem


As discussed earlier, word counting is an interesting problem to be addressed by MapReduce programming. The algorithms for the map and reduce functions for this problem are given in Figure 5 and Figure 6. In addition, you can use MapReduce on Hadoop in the following way to address this problem: apply MapReduce on the input file named “input.txt” and use the wordcount class to count the number of words in the text file. Once the data has been loaded into the DFS server (as in Section 6.3.2), apply the following steps:

Step 1: Run the MapReduce wordcount example (located in $HADOOP_HOME/share/hadoop/mapreduce) to get the result:

hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.3.3.jar wordcount /user/input /user/output

Step 2: Read the output data stored in the DFS server:

$HADOOP_HOME/bin/hdfs dfs -cat /user/output/*

6.5.2 Matrix-Vector Multiplication

Matrix-vector and matrix-matrix multiplication of big matrices are used in several advanced computational algorithms, including PageRank computation. In this section, we discuss how these operations can be performed using MapReduce.
Assume a large matrix M is multiplied with a large vector v. Each of the vector v and the matrix M is kept in a DFS file. Further, assume that you can find the row-column coordinates of a matrix element either from its file position or from explicitly saved coordinates. For example, you may save a matrix element using the triplet (i, j, mij), where i and j are the row and column indices of the matrix respectively and mij is the value of the element. Similarly, you may assume that element vj's position in the vector v can be determined in the same manner. In order to multiply M with v, you need to compute the products mij × vj, which are then summed over an entire row. Thus, the ith output element is the sum of mij × vj over all columns j. To compute these elements, you need to write the Map and Reduce functions. You may please note that both M and v are needed for the multiplication. Thus, the Map method can be written to be applied to the elements of M. Since v is smaller than M, it can be read into main memory at the compute node running a Map task. A portion of the matrix M will be processed by each Map task. The Map function generates from each matrix element mij the key-value pair (i, mij × vj). As a result, the same key i is assigned to every term of the sum that composes component i of the matrix-vector product. The Reduce method then adds up all the values associated with a given key i.
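The following is a minimal, framework-independent Python sketch of such Map and Reduce functions, with a tiny driver loop that simulates the grouping step. It assumes the vector v fits in memory as a Python list and that matrix elements arrive as (i, j, mij) triplets; the names map_fn and reduce_fn and the sample data are our own and are used only for illustration.

# The vector v is assumed to fit in the main memory of the map node.
v = [2.0, 1.0, 3.0]

def map_fn(triplet):
    """Map: for a matrix element (i, j, mij) emit the key-value pair (i, mij * v[j])."""
    i, j, mij = triplet
    return (i, mij * v[j])

def reduce_fn(key, values):
    """Reduce: sum all partial products that share the same row index i."""
    return (key, sum(values))

# A small 2 x 3 matrix M stored as (i, j, mij) triplets (zero elements omitted).
M = [(0, 0, 1.0), (0, 2, 4.0), (1, 0, 5.0), (1, 1, 6.0), (1, 2, 7.0)]

# Simulate the MapReduce run: map, group by key, then reduce.
pairs = [map_fn(t) for t in M]
grouped = {}
for i, value in pairs:
    grouped.setdefault(i, []).append(value)
result = [reduce_fn(i, vals) for i, vals in sorted(grouped.items())]
print(result)   # [(0, 14.0), (1, 37.0)] -- the product vector x = Mv

In an actual Hadoop job, the grouping and summation would be carried out by the framework's shuffle and reduce machinery rather than by the driver loop shown here.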

However, it may happen that the vector v is so big that it cannot fit in main memory as a whole. It is not strictly necessary for v to fit in the main memory of a compute node, but if it does not, moving pieces of the vector into main memory will result in a significant number of disc requests as you multiply vector components by matrix elements. As an alternative, we may divide the vector into
an equal number of horizontal stripes of the same height and the matrix into
vertical stripes of equal width to perform this multiplication. You may refer to
further readings for details on these aspects of matrix-vector multiplication.

The following is an example of a simple matrix multiplication process using the Hadoop platform. Please note that the size of the matrices in this case is very small. For this example, the Windows operating system is used. The input and output formats of the process are as follows:

1. Input: a file with one matrix element per line in the format (row, column, value)

2. Output: a file with the elements of the product matrix in the same (row, column, value) format

Step 1: Create the input directory for the matrix files:

hadoop fs -mkdir /input_matrix

and now load the input matrix:

hadoop fs -put G:\Hadoop_Experiments\MatrixMultiply_M.txt /input_matrix

Check whether the file has been uploaded or not.

Step 2: Similarly, put the file of the second input matrix into /input_matrix and check that it has been loaded.

Then run the matrix multiplication jar file to multiply the given input matrices:

hadoop jar G:\Hadoop_Experiments\matrixMultiply.jar com.mapreduce.wc.MatrixMultiply /input_matrix/* /output_matrix/

The output can be viewed with:

hadoop dfs -cat /output_matrix/*

For the vector multiplication, we will use the same operations as matrix
multiplication. The vectors will be stored in matrix form as an input.
Check Your Progress 3

1. Write a program using MapReduce to find the occurrences of the term “MapReduce” in the data of Check Your Progress 1, Question 4.
2. Write map and reduce functions for vector and matrix multiplication, listing the various cases.

6.6 SUMMARY

In this unit, we learnt the main characteristics and functionalities of MapReduce. We also discussed the different phases of MapReduce in detail. Starting with the installation, we covered the main program snippets for executing the different phases with the help of classical examples such as word count and matrix multiplication.

6.7 SOLUTIONS/ANSWERS

Check Your Progress 1

1. There are mainly three operations of MapReduce:


• Map Stage
• Shuffle Stage
• Reduce Stage
In the map stage, the input data is processed: the input file is read from HDFS and the mappers process the data in chunks, producing key-value pairs. In the shuffle stage, these pairs are sorted and grouped by key. In the reduce stage, the data received from the mappers is aggregated to produce the final output.

2. There are many advantages of MapReduce as follows:


i. Scalability: Hadoop's ability to store and distribute big data sets over
numerous servers allows us to do any operation in parallel. Hadoop
is a very scalable platform as a result.
ii. Flexibility: Hadoop gives organisations the ability to process
structured or unstructured data in a variety of ways and to work with
diverse data kinds.
iii. Fast processing: Hadoop's HDFS is a crucial component that
essentially implements a mapping system for finding data inside a
cluster. Hadoop MapReduce expeditiously handles massive
amounts of unstructured or semi-structured data.
iv. Parallel Processing: Hadoop splits the jobs so that they can be
completed in parallel. The software can run more quickly owing
to the parallel processing, which makes it simpler for each
process to handle its individual job.

3. Step 1: Before installing Java, first log in as the root user, set the machine name as master and map the hostname to the machine's IP address:
# hostnamectl set-hostname master
# nano /etc/hosts
192.168.1.41 master.hadoop.lan
Step 2: Now download Java and install it on your PC:
# sudo apt install default-jre

Check whether Java is installed and verify its version on your PC (java -version).

Step 3: Now create a dedicated hadoop user for the Hadoop framework in Ubuntu:

# useradd -d /opt/hadoop hadoop
# passwd hadoop
Step 4: Now install Apache Hadoop. After downloading the Hadoop archive, extract it and move the extracted files to /home/hadoop/. Verify with:

# ls -al /home/hadoop/

Step 5: Now log in as the hadoop user and configure the Hadoop and Java environment variables on the system by editing the .bashrc file:

su - hadoop
nano .bashrc

Append the environment variable lines (## JAVA env variables, etc.) to this file.

Step 6: Now check the status of the environment variables.

Step 7: Now configure SSH key based authentication for hadoop.
Step 8: Now configure Hadoop: edit the core-site.xml file.

Now edit hdfs-site.xml.

Step 9: Because we have specified /opt/volume/ as our Hadoop file system storage, we need to create the two directories (datanode and namenode) from the root account and grant all permissions on them to the hadoop account by executing the commands below.

$ su root
Step 10: Now edit mapred-site.xml
$ nano etc/hadoop/mapred-site.xml

Step 11: Now edit yarn-site.xml


Step 12: Now, finally, set the JAVA_HOME variable for the Hadoop environment by editing hadoop-env.sh.
Step 13: Now format the Hadoop NameNode with the following command (hadoop namenode -format):
Step 14: Start the DFS server and test the hadoop cluster
$ start-dfs.sh
$ start-yarn.sh

Step 15: Now make file or folder in HDFS storage

Step 16: To retrieve data from HDFS to our local file system use the below
command:
Step 17: To stop the Hadoop services:

$ stop-yarn.sh
$ stop-dfs.sh

4. Step 1: Format the default configured hadoop HDFS server

hadoop namenode -format

Step 2: Start the DFS server:

start-dfs.sh

Step 3: List the file in DFS server by using ls


$HADOOP_HOME/bin/hadoop fs -ls
Step 4: Now, to insert the data, we have to create an input directory:

$HADOOP_HOME/bin/hdfs dfs -mkdir /user/input

Step 5: Now to put the input text file in hdfs server


$HADOOP_HOME/bin/hdfs dfs -put ~/Desktop/input.txt
/user/input

Step 6: We can verify the text file


$HADOOP_HOME/bin/hdfs dfs -ls /user/input

Step 7: Read the data stored in dfs server


$HADOOP_HOME/bin/hdfs dfs -cat /user/input/input.txt

Check Your Progress 2

1. MapReduce is a processing paradigm on Hadoop using which we can process the big data stored in HDFS. It allows huge amounts of data to be processed in a parallel and distributed manner.
MapReduce is used in indexing and searching of data, classification of data, recommendation systems, and analysis of data.
There are two functions in MapReduce, i.e., one is the Map function and the other is the Reduce function.
Advantages of Hadoop MapReduce are:
1. Parallel processing: Data is processed in parallel, which makes processing fast.
2. Data Locality: Processing the data locally, where it is stored, is very cost effective.

2. Apply MapReduce on the input file named “input.txt” and use the wordcount class to count the number of words in the text file and produce the output.

Step 1: Run the MapReduce wordcount example to get the result:

cd $HADOOP_HOME/share/hadoop/mapreduce
hadoop jar hadoop-mapreduce-examples-3.3.3.jar wordcount /input /output

Step 2: Read the output data stored in the DFS server:

$HADOOP_HOME/bin/hdfs dfs -cat /user/output/

3. There are two functions in MapReduce, i.e., the Map function and the Reduce function, and the Map and Reduce phases are executed as parallel processes. In the map phase, the input is split among the mapper nodes, where each chunk is processed and mapped to keys, forming (key, value) tuples; splitting is therefore needed to distribute the work among the mappers. These tuples are then passed to the reducer nodes after sorting and shuffling, i.e., sorting and grouping the tuples based on keys so that all tuples with the same key are sent to the same node; without this sorting, a reducer could not collect and aggregate all the values of a key correctly.

Check Your Progress 3

1. Make a directory and load the data into HDFS:

hdfs dfs -mkdir -p /user/hadoop/input

Put the input.txt file (containing the given text) into HDFS:

hdfs dfs -put input.txt /user/hadoop/input/

cd $HADOOP_HOME

hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.6.0.jar wordcount input output

Show the result:

hdfs dfs -ls /user/hadoop/output
hdfs dfs -cat /user/hadoop/output/part-r-00000

The count reported against the word “MapReduce” in this output gives the number of its occurrences.


2. The explanation of matrix-vector multiplication using MapReduce is given in Section 6.5.2. Please refer to that example and write the map and reduce functions accordingly.
6.8 REFERENCES AND FURTHER READINGS

[1] https://kontext.tech/article/448/install-hadoop-330-on-linux
[2] https://kontext.tech/article/447/install-hadoop-330-on-windows-10-step-by-
step-guide
[3] https://intl.cloud.tencent.com/document/product/436/10867
[4] https://bigdatapath.wordpress.com/2018/02/13/introduction-to-hadoop/
[5] https://hadoop.apache.org/
[6] https://blog.csdn.net/qq_30242609/category_6519905.html
[7] https://www.tutorialspoint.com/hadoop/index.htm
[8] https://halvadeforspark.readthedocs.io/en/latest/
[9] Tilley, Scott, and Krissada Dechokul. "Testing iOS Apps with HadoopUnit: Rapid
Distributed GUI Testing." Synthesis Lectures on Software Engineering 2.2 (2014): 1-
103.
[10] Dechokul, Krissada. Distributed GUI testing of iOS applications with HadoopUnit.
Diss. 2014.
[11] https://commandstech.com/category/hadoop/
[12] Frampton, Michael. Big Data made easy: A working guide to the complete
Hadoop toolset. Apress, 2014.
[13] http://infolab.stanford.edu/~ullman/mmds/ch2n.pdf

UNIT 7 OTHER BIG DATA ARCHITECTURES
AND TOOLS

Structure
7.0 Introduction
7.1 Objectives
7.2 Apache SPARK Framework
7.3 HIVE
7.3.1 Working of HIVE Queries
7.3.2 Installation of HIVE
7.3.3 Writing Queries in HIVE
7.4 HBase
7.4.1 HBase Installation
7.4.2 Working with HBase
7.5 Other Tools
7.6 Summary
7.7 Answers
7.8 References and further readings

7.0 INTRODUCTION

In Unit 5 and Unit 6 of this Block, you have gone through the concepts of Hadoop and MapReduce programming. In addition, you have gone through the various phases of a MapReduce program. This unit introduces you to other popular big data architectures and tools. These architectures are beneficial to both ETL developers and analytics professionals.
This Unit introduces you to the basic software stack of the SPARK architecture, one of the most popular architectures for handling large data. In addition, this Unit introduces two important tools – HIVE, which is a data warehouse system, and HBase, which is an important database system. HBase is an open-source NoSQL database that uses HDFS and Apache Hadoop to function; virtually unlimited data can be stored on this extendable storage. HIVE, built on HDFS, is a SQL engine that uses MapReduce.

7.1 OBJECTIVES

After completing this unit, you will be able to:


• list the fundamentals of big data processing using the Spark ecosystem;
• explain the main components of the Spark framework;
• define the fundamental components of Hive;
• follow the installation guidelines for these tools: Spark, Hive and HBase; and
• illustrate query processing in Hive and HBase.
7.2 APACHE SPARK FRAMEWORK

SPARK preserves the scalability and fault tolerance of Hadoop MapReduce while being designed for quick iterative processing, such as machine learning, and for interactive data analysis. As we have already discussed in Unit 6, the MapReduce programming model, which is the foundation of the Hadoop framework, allows for scalable, adaptable, fault-resilient, and cost-friendly solutions. At the same time, it becomes imperative to reduce the turnaround time between queries and execution. The Apache Software Foundation released SPARK to quicken the Hadoop computing processes. SPARK has its own cluster management, hence it is not dependent on Hadoop and is not a revised form of Hadoop; Hadoop is merely one way of deploying SPARK. SPARK's in-memory cluster computing, which accelerates application processing, is its key feature. Numerous tasks, including batch processing, iterative algorithms, interactive queries, and streaming, can be handled by SPARK. Along with accommodating each job in its own system, it lessens the administrative strain of managing various tools.

SPARK is a data processing framework which can swiftly conduct operations on huge datasets and distribute operations across several computers, either on its own or in conjunction with other distributed computing tools. These characteristics are essential in big data and machine learning computing, both of which call for immense computing resources to be mobilised in order to process the enormous data stored in data warehouses. Spark provides a user-friendly Application Programming Interface (API) that abstracts away a lot of the tedious programming activities involved in big data processing and distributed computing, relieving developers of much of the coding effort associated with these activities. The Spark framework has been constructed over the Hadoop MapReduce platform and extends the MapReduce framework to be utilised for other computations, including dynamic queries and stream processing, more effectively.

In 2009, the UC Berkeley-based AMPLab launched Apache Spark as a research initiative with an emphasis on data-intensive application areas; it was open sourced in 2010 under a BSD licence. The partnership involved students, researchers, and professors. The main features of the Apache Spark framework include:

i) Fast processing: In a Hadoop cluster, Spark makes it possible for applications to execute almost 100 times quicker in memory and 10 times quicker on disk by reducing the number of read/write operations.
ii) Multi-language support: Python, Scala, and Java-based built-in APIs are available with Spark.
iii) Advanced Analytics support: Spark offers more than just a "Map phase" and a "Reduce phase". It also supports graph methods, streaming data, machine learning (ML), and SQL queries.

Figure 1 depicts the Spark data lake and shows how Apache Spark can work in conjunction with the Hadoop components and how information flows with Apache Spark. The Hadoop Distributed File System (HDFS) allows us to form a
cluster of computing machines and utilize the combined capacity to store the
data, thus allowing to store huge data volumes. Further, MapReduce allows to
use combined power of the clusters and process it to store enormous data stored
in HDFS. The advent of HDFS and MapReduce allows horizontal scalability and
low capital cost as compared to data warehouses. Gradually, cloud infrastructure
became more economical and got wider adoption. Large amounts of organised,
semi-structured, and unstructured data can be stored, processed, and secured
using a data lake, a centralised repository. Without regard to size restrictions, it
can process any type of data and store it in its native format. A data lake has four
key capabilities: i) Ingest: allows data collection and ingestion, ii) Store:
responsible for data storage and management, iii) Process: leads to
transformation and data processing, and iv) Consume: ensures data access and
retrieval. In store capability of data lake, it could be an HDFS or cloud store such
as in Amazon S3, Azure Blob, Google cloud storage etc. The cloud storage
allows scalability and high availability access at an extremely low cost in almost
no time to procure. The notion of the data lake recommends to bring data into
data lake in raw format, i.e. ingest data into data lake and preserve an unmodified
immutable copy of data. The ingest block of data lake is about identifying,
implementing and managing the right tools to bring data from the source system
to the data lake. There is no single ingestion tool; multiple tools may be used, such as HVR, Informatica, Talend, etc. The next layer is the process layer
where all computation takes place such as initial data quality check,
transforming and preparing data, correlating, aggregating, and analysing and
applying machine learning models. The processing layer is further broken into two parts, which helps to manage it better: i) data processing and ii) orchestration. Data processing is the core development framework that allows one to design and develop distributed computing applications; Apache Spark is part of data processing. The
orchestration framework is responsible for the formation of the clusters,
managing resources, scaling up or down etc. There are three main competing
orchestration tools such as Hadoop Yarn, Kubernetes and Apache Mesos. The
last and most critical capability of the data lake is the consumption of its data for real-life usage. The data lake is a repository of raw and processed data. Consumption requirements may come from data analysts and data scientists, from applications or dashboards that draw insights from the data, from JDBC/ODBC connectors, from a REST interface, and so on.

Figure 1: SPARK data lake

Apache Spark deployment can happen in three modes:


i) Standalone: In a standalone deployment, Spark occupies the place on top of HDFS (the Hadoop Distributed File System) and space is allocated for HDFS explicitly. Here, Spark and MapReduce coexist to handle all Spark jobs on the cluster.
ii) Hadoop Yarn: Hadoop yarn deployment means that spark runs on
Yarn without the need for any prior installation or root access.
Integrating Spark within the Hadoop stack is beneficial. It enables
running of other components on top of stack.
iii) Spark in MapReduce: Along with standalone deployment, Spark is
used in MapReduce to start spark jobs. Without requiring
administrative access, a user can launch Spark and use its shell using
Spark in MapReduce.
The main components of Apache Spark (as shown in Figure 2) are as follows:
i) Spark Core: All other functionality for the Spark platform is built
upon the universal execution engine known as Spark Core. It offers
in-memory computation (i.e. running complex calculations entirely
in computer memory as in RAM) and refers to datasets in external
storage systems. It oversees memory management, fault recovery, job
scheduling, job distribution, and job monitoring, as well as
communication with storage systems. APIs that are created for Java,
Scala, Python, and R make Spark Core accessible. These APIs hide the complexity of distributed processing behind straightforward, high-level operators.
ii) Spark SQL: On top of Spark Core, the Spark SQL component adds a
new data abstraction called SchemaRDD that supports both
structured and semi-structured data. Row objects with a schema
describing the data types of each column in the row are combined to
form SchemaRDDs. A table in a conventional relational database is
analogous to a SchemaRDD. Resilient Distributed Datasets (RDD)
and schema information are combined to generate SchemaRDD. The
logical arrangement of structured data is described by a schema. For
quick access to data during computation and fault tolerance, the
RDDs store data in memory. A distributed query engine called Spark
SQL offers interactive query processing up to 100 times quicker than MapReduce. It scales to thousands of nodes and features a cost-based optimizer, columnar storage and code generation for fast queries. For data queries, business analysts can utilise either traditional SQL or the Hive Query Language, while APIs are accessible in Scala, Java, Python, and R for developers (a small PySpark illustration is given after Figure 2).
iii) Spark Streaming: It is used for the analysis of data streams. This
component makes use of the fast scheduling technique available in
the Spark Core. It processes RDD (Resilient Distributed Datasets)
modifications on the mini-batches of data that it ingests. Spark
Streaming works towards real-time streaming analytics. Blocks of
data that arrive in certain time intervals are taken by Spark Streaming
and packaged as RDDs. A Spark Streaming job can receive data from
numerous external services. These comprise TCP/IP socket
connections and filesystems in addition to other distributed systems
like Kafka, Flume, Twitter, and Amazon Kinesis. Receivers are
capable of establishing a connection to the source, reading the data,
and sending it to Spark Streaming. After splitting the incoming data
into mini-batch RDDs—one mini-batch RDD for each time period—
Spark Streaming allows the Spark application to process the data.
The outcomes of computations can be stored in relational databases,
filesystems, or other distributed systems.
iv) MLlib (Machine Learning Library): It is a distributed framework for machine learning processing. According to benchmarks performed by the MLlib developers using Alternating Least Squares (ALS) implementations, Spark MLlib is about nine times faster than the Hadoop disk-based version of Apache Mahout. In
Hadoop clusters, the Mahout library serves as the primary machine
learning platform. Mahout uses MapReduce to carry out
classification, clustering, and recommendation. Hadoop MapReduce
divides tasks into parallel ones, some of which may be too big for
machine learning algorithms. These Hadoop applications experience
input-output performance difficulties as a result of this approach.
Data scientists can use R or Python to train machine learning models
on any Hadoop data source, save the models and then import the
models into a Java- or Scala-based pipeline for production purposes.
Spark can be used to perform machine learning tasks faster. It may
be noted that Spark is very useful for quick, interactive computing
that operates in memory. Thus, you can perform regression,
classification, collaborative filtering, clustering, and pattern mining
using Spark.
v) GraphX: It is a distributed graph-processing framework. It offers a
graph computing API that uses the Pregel abstraction API to model
user-defined graphs. Since, in a graph data structure, the properties of vertices depend heavily on those of their neighbours, graphs naturally exhibit recursion.
As a result, many significant graph algorithms repeatedly compute
each vertex's characteristics until a fixed-point condition is met.
These iterative algorithms have been expressed using a variety of
graph-parallel abstractions. Additionally, it offers a runtime that is
optimised for this abstraction. In order to let the users build
interactively and alter the structure of the graphs, GraphX offers
ETL, exploratory analysis, and iterative graph computation. It comes
with a variety of distributed Graph algorithms and a very versatile
API.

Figure 2: Apache Spark Ecosystem
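As a small illustration of the Spark SQL component described above, the following PySpark sketch creates a DataFrame, registers it as a temporary view and queries it with ordinary SQL. It assumes that PySpark is installed locally; the table name, column names and data values are made up purely for the example.

from pyspark.sql import SparkSession

# Start (or reuse) a local Spark session.
spark = SparkSession.builder.appName("SparkSQLExample").getOrCreate()

# A tiny structured dataset: (name, age) rows with named columns.
people = spark.createDataFrame(
    [("Asha", 34), ("Ravi", 29), ("Meena", 41)], ["name", "age"])

# Register the DataFrame as a temporary view so that it can be queried with SQL.
people.createOrReplaceTempView("people")

# Run an ordinary SQL query through the Spark SQL engine.
spark.sql("SELECT name, age FROM people WHERE age > 30").show()

spark.stop()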

The main data structure in the Spark framework is the Resilient Distributed Dataset (RDD). It is a collection of immutable objects. Each dataset in an RDD is split into logical partitions that can be computed on several cluster nodes. Formally, an RDD is a read-only, partitioned collection of records. RDDs can be created through deterministic operations either on data in stable storage (disk) or on other RDDs. An RDD is a fault-tolerant collection of elements that can be operated on in parallel. Normally, text files, SQL databases, NoSQL stores, Amazon S3
buckets, and many other sources can all be used to construct RDDs. This RDD
idea serves as the foundation for a significant portion of the Spark Core API,
enabling not only the standard map and reduce functionality but also built-in
support for merging data sets, and operations such as filter, sample, and
aggregate data. By combining a driver core process, which divides a Spark
application into various jobs and distributes them across numerous executor
processes that carry out the job, Spark works in a distributed manner. These
executor processes can be increased up or down depending on the demands of
the application. Spark uses the RDD paradigm to expedite and optimise
MapReduce operations. RDD transformations are lazy operations; they are not started until you invoke an action on the Spark RDD. RDDs are immutable, therefore any transformation on them produces a new RDD while maintaining the original's integrity. There are two types of transformations: i) Narrow: these result from functions such as map() and filter() and work on data within a single partition, i.e., there is no data movement amongst partitions in narrow transformations; and ii) Wide: these result from functions such as groupByKey() and reduceByKey(), i.e., they work on data from multiple partitions and involve data movement amongst them. Let us consider the RDD transformations in the word count example:

The set of RDD transformations in the word count example is as follows.

Here, we first create an RDD by reading an input text file, namely “test.txt”. Next, the flatMap() transformation flattens the RDD and
creates a new RDD. It first divides each record in an RDD by space before
flattening it. Each entry in the resulting RDD only contains one word. Any
complex actions, such as the addition of a column or the updating of a column,
are applied using the map() transformation, and the output of these
transformations always has the same amount of records as the input. In this
example, we are creating a new column and assigning a value of 1 to each word.
The RDD produces a PairRDDFunction that has key-value pairs with the keys
being words of type String and the values being 1 of type Int. The records in an
RDD can be filtered using the filter() transformation. In the example, we are
filtering the terms that begin with "a". reduceByKey() merges the values for each key using the specified function; here, it adds up the counts of a word by applying the sum function "+" to the values. The resulting RDD contains the unique words together with their counts. RDD elements are sorted on keys using the sortByKey()
transformation. In this example, we first use the map transformation to change RDD[(String,Int)] to RDD[(Int,String)] before using sortByKey, which then sorts on the integer count. Finally, foreach with a println statement outputs to the console all the words in the RDD and their counts (as key-value pairs). An equivalent sequence of transformations written in PySpark is sketched below.
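The following is a minimal PySpark sketch of the same sequence of transformations (flatMap, map, filter, reduceByKey, sortByKey). The input file name test.txt and the local master URL are assumptions made for the example, and collect() is used here to print the results on the driver.

from pyspark import SparkContext

sc = SparkContext("local[*]", "RDDWordCount")

rdd = sc.textFile("test.txt")                                # create the initial RDD from the text file
words = rdd.flatMap(lambda line: line.split(" "))            # flatten each line into individual words
pairs = words.map(lambda word: (word, 1))                    # build (word, 1) key-value pairs
filtered = pairs.filter(lambda kv: kv[0].startswith("a"))    # one way to realise the filtering step on words beginning with "a"
counts = filtered.reduceByKey(lambda a, b: a + b)            # sum the counts for each word
ordered = counts.map(lambda kv: (kv[1], kv[0])).sortByKey()  # swap to (count, word) and sort by the count
for count, word in ordered.collect():                        # bring the results to the driver and print them
    print(word, count)

sc.stop()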

Installation Steps for Apache Spark framework (on Ubuntu):


Step 1: Check whether JAVA is installed or not, if not you need to install it.

Step 2: Check whether SCALA is installed or not since Spark is written in Scala,
although elementary knowledge in Scala is enough to run Spark. Other
supported languages in Spark are Java, R, Python.

In case it is not installed, you need to download it and extract the tar file.

We need to move the SCALA software files to the /usr/ directory as follows:

We need to set the environment path variable:

Last, check if it is installed successfully:

Step 3: Download Apache Spark


Step 4: Install Spark
We need to extract the requisite tar file

We need to move SPARK software files to /usr/ directory as follows:

We need to set the environment path variable:

Last, check if SPARK is installed successfully:

It will be installed successfully.

Check Your Progress 1:

1. What are the main features of Apache Spark Framework?


2. What are the components of Apache Spark Framework?
3. List and perform the steps of installation of Apache Spark.

Next, we discuss HIVE, a data warehouse system built on this architecture.

7.3 HIVE
A Hadoop utility for processing structured data is called Hive. It sits on top of Hadoop to summarise big data and simplifies querying and analysis. Initially created by Facebook, Hive was later taken up and further developed as an open-source project under the name Apache Hive by the Apache Software Foundation. It is utilised by various businesses. Hive is not a relational database: it does not support “row-level” modifications of data, as is the case in SQL-based database management systems, and it is not meant for real-time queries. It stores the schema in a database and puts processed data into HDFS. It offers a querying language called HiveQL or HQL that is similar to SQL. It is dependable, fast, extensible, and scalable.
Figure 3: HIVE Architecture

The main components in Hive architecture (as shown in Figure 3) can be


categorized into:
i) Hive Client: The Hive client allows connection to Hive data with the help of database drivers like ODBC, JDBC, etc. Hive enables applications built in many programming languages, including C++, Ruby, Java, Python, etc., to be used as Hive clients. Thus, creating a Hive client application in any language of one's preference is quite simple. These clients can be of three types:
a) Thrift Clients: Since Apache Thrift is the foundation of the Hive
server, it may fulfil requests from thrift clients. Thrift is a
communication protocol used by applications created in many
programming languages. As a result, client programmes send
requests or hive queries to the Thrift server, which is located in the
hive services layer.
b) JDBC Clients: Java applications can be connected with Hive by utilising the JDBC driver. Thrift is used by the JDBC driver to interact with the Hive Server.
c) ODBC Clients: ODBC-based applications can be connected with Hive using the Hive ODBC driver. It communicates with the Hive Server using Thrift, just like the JDBC driver does.

ii) Hive services: Hive provides a range of services for various purposes.
The following are some of the most useful services:
a. Beeline: It is a command shell that HiveServer2 supports,
allowing users to send commands and queries to the system. It is
JDBC Client based on SQLLINE Command Line Interface
(CLI). SQLLINE CLI is a Java Console-based utility for running
SQL queries and connecting to relational databases.
b. Hive Server 2: After the popularity of HiveServer1, HiveServer2 was launched, which helps the clients to run queries. It enables numerous clients to submit queries to Hive and retrieve the query results.
c. Hive Driver: The user submits a Hive query using HiveQL
statements via the command shell, which are then received by the
Hive driver. The query is then sent to the compiler where session
handles for the query that are created.
d. Hive Compiler: The query is parsed by the Hive compiler. The metadata of the database, which is stored in the metastore, is used to perform semantic analysis and data-type validation on the parsed queries, after which the compiler produces an execution plan. This execution plan is a DAG (Directed Acyclic Graph), with each stage being a map or reduce job, an HDFS action, or a metadata operation.
e. Optimizer: To increase productivity and scalability, the optimizer
separates the work and performs transformation actions on the
execution plan so as to optimise the query execution time.
f. Execution engine: Following the optimization and compilation
processes, the execution engine uses Hadoop to carry out the
execution plan generated by the compiler as per the dependency
order amongst the execution plan.
g. Metastore: The metadata information on the columns and column
types in tables and partitions is kept in a central location called
the metastore. Additionally, it stores data storage information for
HDFS files as well as serializer and deserializer information
needed for read or write operations. Typically, this metastore is a
relational database. A Thrift interface is made available by
Metastore for querying and modifying Hive metadata.
h. HCatalog: It refers to the storage and table management layer of
Hadoop. It is constructed above metastore and makes Hive
metastore's tabular data accessible to other data processing tools.
And
i. WebHCat: HCatalog’s REST API is referred to as WebHCat. A
Hadoop table storage management tool called HCatalog allows
other Hadoop applications to access the tabular data of the Hive
metastore. A web service architecture known as REST API is
used to create online services that communicate via the HTTP
protocol. WebHCat is an HTTP interface for working with Hive
metadata.
iii) Processing and Resource Management: Internally, the de facto engine for Hive's query execution is the MapReduce framework. MapReduce is used to create map and reduce functions that process enormous amounts of data concurrently on vast clusters of commodity hardware. Data is divided into chunks and processed by map and reduce tasks as part of a MapReduce job.
iv) Distributed Storage: Since Hadoop is the foundation of Hive, the
distributed storage is handled by the Hadoop Distributed File System.

7.3.1 Working of HIVE Queries

There are various steps involved in the execution of Hive queries as follows:
Step 1: executeQuery Command: Hive UI either command line or web UI sends
the query which is to be executed to the JDBC/ODBC driver.
Step 2: getPlan Command: After accepting query expression and creating a
handle for the session to execute, the driver instructs the compiler to produce an
execution plan for the query.
Step 3: getMetadata Command: Compiler contacts the metastore with a
metadata request. The hive metadata contains the table’s information such as
schema and location and also the information of partitions.
Step 4: sendMetadata Command: The metastore transmits the metadata to the
compiler. These metadata are used by the compiler to type-check and analyse
the semantics of the query expressions. The execution plan (in the form of a
Directed Acyclic graph) is then produced by the compiler. This plan can use
MapReduce programming. Therefore, the map and reduce jobs would include
the map and the reduce operator trees.
Step 5: sendPlan Command: Next compiler communicates with the driver by
sending the created execution plan.
Step 6: executePlan Command: The driver transmits the plan to the execution
engine for execution after obtaining it from the compiler.
Step 7: submit job to MapReduce: The necessary map and reduce job worker
nodes get these Directed Acyclic Graphs (DAG) stages after being sent by the
execution engine. Each task, whether mapper or reducer, reads the rows from
HDFS files using the deserializer. These are then handed over through the
associated operator tree. As soon as the output is ready, the serializer writes it to
the HDFS temporary file. The last temporary file is then transferred to the
location of the table for Data Manipulation Language (DML) operations.
Step 8-10: sendResult Command: The execution engine now receives the
temporary files contents directly from HDFS as a retrieve request from the
driver. Results are then sent to the Hive UI by the driver.

7.3.2 Installation of HIVE

The step-by-step installation of Hive (on Ubuntu or any other Linux platform)
is as follows:

Step 1: Check whether JAVA is installed or not, if not you need to install it.

Step 2: Check whether HADOOP is installed or not, if not you need to install it.

Step 3: You need to download Hive and verify the download:

Step 4: Extract the Hive archive

Step 5: Copy the files to /usr/ directory

Step 6: Set up environment for Hive by adding the following to ~/.bashrc file:

To execute ~/.bashrc file

Step 7: Configure Hive and edit hive-env.sh file


Hive installation is complete. To configure metastore, we need an external
database server such as Apache Derby.
Step 8: Download Derby:

Verify download by:

After successful download, extract the archive:

Then, copy files to /usr/ directory

Add these lines to ~/.bashrc file:

To execute ~/.bashrc file

Create directory to store data in Metastore:

Step 9: Configure Metastore by editing hive-site.xml:

Add these lines:

Add these lines in file jpox.properties:


Step 10: Verify the Hive installation:
The /tmp folder and a unique Hive folder must be created in HDFS before
executing Hive. The /user/hive/warehouse folder is used here. For these
recently formed directories, you must specify write permission as indicated
below:

Hive is successfully installed.

7.3.3 Writing Queries in HIVE

In order to write Hive queries, you must install the Hive software on your system. Hive queries are written using the query language supported by Hive (HiveQL). The different data types supported by Hive are given below.

Hive Data Types: Hive has 4 main data types:


1. Column Types: Integer, String, Union, Timestamp
2. Literals: Floating, Decimal
3. Null Values: All missing values as NULL
4. Complex Types: These include Arrays, Maps and Structs. The syntax of these types is as follows:
Arrays: ARRAY<data_type>
Maps: MAP<primitive_type, data_type>
Struct: STRUCT<col_name : data_type, ...>

Hive Query Commands: The following are some of the basic commands for creating and dropping databases; creating, dropping and altering tables; creating partitions, etc. This section also lists some of the commands to query these databases, including the join command.
1. Create Database

2. Drop database:

Cascade helps to drop tables before dropping the database.

3. Creating Table:

4. Altering Table:

5. Dropping Table:

6. Add Partition:
7. Operators Used:
Relational Operators: =, !=, <, <=, >, >=, Is NULL, Is Not NULL,
LIKE (to compare strings)
Arithmetic Operators: +, -, *, /, %, & (Bitwise AND), | (Bitwise OR), ^
(Bitwise XOR), ~(Bitwise NOT)
Logical Operators: &&, || , !
Complex Operators: A[n] (nth element of Array A), M[key] (returns
value of the key in map M), S.x (return x field of struct S)
8. Functions: round(), floor(), ceil(), rand(), concat(string M, string N),
upper(), lower(), to_date(), cast()
Aggregate functions: count(), sum(), avg(), min(), max()
9. Views:

10. Select Where Clause:


It is used to retrieve a particular set of data from a table. The WHERE clause specifies a condition, and only the rows for which the condition is met are returned.

11. Select Order By Clause: To get information from a single column and
sort the result set in either ascending or descending order, use the
ORDER BY clause.

12. Select Group By Clause: A result set can be grouped using a specific
collection column by utilising the GROUP BY clause. It is used to
search a collection of records.
13. Join: The JOIN clause is used to combine specified fields from two
tables using values that are shared by both. It is employed to merge data
from two or more database tables.

Thus, most of the commands, as can be seen, are very close to SQL syntax. If you know SQL, you will be able to write Hive queries too.
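Because the exact HiveQL statements are not reproduced in this text, the following is a minimal sketch that submits a few representative HiveQL statements from Python through the third-party PyHive client. It assumes that HiveServer2 is running on localhost:10000, that the pyhive package is installed, and that the database, table and column names (demo_db, employee, etc.) are made up for the example.

from pyhive import hive

# Connect to a running HiveServer2 instance.
conn = hive.Connection(host="localhost", port=10000, username="hadoop")
cur = conn.cursor()

# DDL: create a database and a table (names are illustrative only).
cur.execute("CREATE DATABASE IF NOT EXISTS demo_db")
cur.execute(
    "CREATE TABLE IF NOT EXISTS demo_db.employee "
    "(id INT, name STRING, salary FLOAT, dept STRING) "
    "ROW FORMAT DELIMITED FIELDS TERMINATED BY ','"
)

# A query using the WHERE and GROUP BY clauses described above.
cur.execute(
    "SELECT dept, COUNT(*), AVG(salary) "
    "FROM demo_db.employee "
    "WHERE salary > 20000 "
    "GROUP BY dept"
)
for row in cur.fetchall():
    print(row)

conn.close()

The HiveQL text inside the strings is exactly what one would type at the Hive prompt; the Python wrapper only submits it to HiveServer2.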

Check Your Progress 2:

1. What are the main components of HIVE architecture?


2. What are the main data types in HiveQL?
3. What are the different join types in HiveQL?

7.4 HBASE
RDBMS has been the answer to issues of data maintenance and storage since the 1970s. After big data became prevalent, businesses began to realise the
advantages of processing large data and began choosing solutions like Hadoop.
Hadoop processes huge data using MapReduce and stores it in distributed file
systems. Hadoop excels at processing and storing vast amounts of data in
arbitrary, semi-structured, and even unstructured formats. Hadoop is only
capable of batch processing, and only sequential access is possible to data. That
implies that even for the most straightforward tasks, one must explore the entire
dataset. When a large dataset is handled, it produces another equally large data
collection, which should also be processed in a timely manner. At this stage, a fresh approach is needed to access any item of data in one go, i.e., at random. On top of Hadoop, the distributed column-oriented data store HBase was created. It is a horizontally scalable open-source project. Similar to Google's
Bigtable, HBase is a data model created to offer speedy random access to
enormous amounts of structured data. It makes use of the fault tolerance of the Hadoop File System (HDFS). It offers real-time read or write operations to access data
randomly from the Hadoop File System. Data can be directly stored in HDFS or
indirectly using HBase. Using HBase, data consumers randomly read from and
access the data stored in HDFS. HBase is built on top of the Hadoop file system and offers read and write access over it. In a nutshell, as compared to HDFS, HBase enables faster lookup and low-latency random access due to its internal storage in hash tables. The tables in the column-oriented database HBase are sorted by row
(as depicted in Figure 4). The column families constituted of key-value pairs are
defined by the table structure. A table is a grouping of rows. A row is made up
of different column families. A collection of columns is called a column family.
Each column has a set of key-value pairs.

Figure 4: Database schema of an HBase table

As compared to an RDBMS, HBase is schema-less, is meant for wide tables, stores denormalized data, and cannot handle transactions. Other features of HBase include:
• HBase scales linearly.
• It supports automatic failure.
• It offers reliable reading and writing.
• It is both a source and a destination of Hadoop integration.
• It features a simple Java client API.
• Data replication between clusters is offered.

Figure 5: HBase components having HBase master and several region servers.

Tables in HBase are divided into regions and are handled by region servers.
Regions are vertically organised into "Stores" by column families. In HDFS,
stores are saved as files (as shown in Figure 5). The master server assigns the regions to region servers with Apache ZooKeeper's assistance. It also handles load balancing of the regions across the region servers: it unloads the busy servers, transfers regions to less busy servers, and negotiates load balancing to maintain the cluster's state. Regions are nothing but tables that are split up and spread across the region servers. The region servers deal with client interactions and data-related tasks; all read and write requests are handled by the respective regions. Regions are split when they grow beyond the configured region size thresholds. Each store includes a memory store (MemStore) and HFiles. The MemStore is similar to a cache memory: everything that is written to HBase is first saved here. The data is later flushed and stored as blocks in HFiles, and the MemStore is then cleared. All modifications to the data in HBase's file-based storage are
tracked by the Write Ahead Log (WAL). The WAL makes sure that the data
changes can be replayed in the event that a Region Server fails or becomes
unavailable before the MemStore is cleared. An open-source project called
Zookeeper offers functions including naming, maintaining configuration data, distributed synchronisation, etc. Different region servers are represented by ephemeral nodes in Zookeeper. Master servers use these nodes to discover available servers, and the nodes are also used to track server outages and network partitions. Clients communicate with the region servers via Zookeeper. In pseudo-distributed and standalone modes, HBase itself manages Zookeeper. HBase operates in standalone mode by default. For the purpose of small-
scale testing, standalone mode and pseudo-distributed mode are also offered.
Distributed mode is suited for a production setting. HBase daemon instances
execute in distributed mode across a number of server machines in the cluster.
The HBase architecture in shown in Figure 6.

Figure 6: HBase architecture

7.4.1 HBase Installation

The step-by-step installation of HBase is as follows:

We should configure SSH (Secure Shell) in the Linux environment before installing Hadoop and HBase.

Step 1: Check whether JAVA is installed or not, if not you need to install it.

Step 2: Check whether HADOOP is installed or not, if not you need to install it.

There are three installation options for HBase: standalone, pseudo-distributed,


and fully distributed.
Standalone mode:
Step 3: We need to download HBase and then extract tar file:
Using super user mode move HBase to /usr/local

Step 4: We need to configure HBase

Replace variable JAVA Home Path with current path:

Edit hbase-site.xml

Next run HBase start script

Distributed Mode:
Step 5: Edit hbase-site.xml

Next, mention the path where HBase is to be run:

Next run HBase start script

Next, check HBase directory in HDFS:


Step 6: We can start and stop the master:

Step 7: We can start and stop the region servers:

Step 8: We then start HBaseShell

Step 9: Start HBase Web interface:

After you have completed the installation of HBase, you can use commands to
run HBase. Next section discusses these commands.

7.4.2 Working with HBase

In order to interact with HBase, first you should use the shell commands as given
below:
HBase shell commands: One can communicate with HBase by utilising the shell that is included with it. The Hadoop File System is used by HBase to store its data. HBase contains a master server and region servers. Regions are used for the data storage (tables); these regions are split and stored in the respective region servers. The region servers are managed by the master server, and HDFS is used underneath for all of these functions. The following commands are supported in the HBase shell.
a) Generic Commands:
status: It gives information about HBase's status, such as how many
servers are there.
version: It gives the HBase current version that is in use.
table_help: It contains instructions for table related commands.
whoami: It will provide details of the user.
b) Data definition language:
create: To create table
list: Lists each table in the HBase database.
Disable: Turns a table into disable mode.
is_disabled: to check if table is disabled
enable: to enable the table
is_enabled: to check if table is enabled
describe: Gives a table's description.
Alter: To alter table
exists : To check if table exists
drop: To drop table
drop_all: Drops all table
Java Admin API: Java offers an Admin API for programmers to
implement DDL functionality.
c) Data manipulation language:
Put: Puts a cell value in a specific table at a specific column in a specific
row using the put command.
Get: Retrieves a row's or cell's contents.
Delete: Removes a table's cell value.
deleteall: Removes every cell in a specified row.
Scan: Scans the table and outputs the data.
Count: counts the rows in a table and returns that number.
Truncate: A table is disabled, dropped, and then recreated.
Java client API: Underlying all of the aforementioned commands, Java offers a client API under the org.apache.hadoop.hbase.client package that enables programmers to perform DML operations programmatically.

HBase Programming using Shell


1. Create table:

2. List table:

This command when used in HBase prompt, gives us the list of all the
tables in HBase.
3. Disable table

4. Enable table

5. Describe and alter table

The syntax to alter a column family's maximum number of cells is


provided below.

6. Exists table:
7. Drop table:

8. Exit shell:

9. Insert data:

10. Read data:

11. Delete data:

12. Scan table:

13. Count and truncate table:
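Since the shell syntax for the commands listed above is not reproduced in this text, the following Python sketch performs equivalent operations (create table, put, get, scan, delete, disable and drop) through the third-party happybase client. It assumes that the happybase package is installed, that the HBase Thrift server has been started (hbase thrift start, default port 9090), and that the table name emp, the column family personal and the sample values are made up for the example.

import happybase

# Connect to the HBase Thrift server running on localhost.
connection = happybase.Connection("localhost")

# Equivalent of the shell command: create 'emp', 'personal'
connection.create_table("emp", {"personal": dict()})
table = connection.table("emp")

# Equivalent of: put 'emp', '1', 'personal:name', 'raju'  (plus a second column)
table.put(b"1", {b"personal:name": b"raju", b"personal:city": b"hyderabad"})

# Equivalent of: get 'emp', '1'
print(table.row(b"1"))

# Equivalent of: scan 'emp'
for row_key, data in table.scan():
    print(row_key, data)

# Delete one cell, then the whole row.
table.delete(b"1", columns=[b"personal:city"])
table.delete(b"1")

# Equivalent of: disable 'emp' followed by drop 'emp'
connection.disable_table("emp")
connection.delete_table("emp")
connection.close()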

7.5 OTHER TOOLS

A wide variety of Big Data tools and technologies are available on the market
today. They improve time management and cost effectiveness for tasks involving
data analysis. Some of these include Atlas.ti, HPCC, Apache Storm, Cassandra,
StatsIQ, CouchDB, Pentaho, FLink, Cloudera, OpenRefine, RapidMiner etc.
Atlas.ti allows all available platforms to be accessed from one place; it can be utilised for qualitative data analysis and mixed-methods research. High Performance Computing Cluster (HPCC) Systems provides services using a single platform, architecture, and data processing programming language. Storm is a free, open-source, distributed and fault-tolerant computing system for big data.
Today, a lot of people utilise the Apache Cassandra database to manage massive
amounts of data effectively. The statistical tool Stats iQ by Qualtrics is simple to
use. JSON documents that can be browsed online or queried using JavaScript
are used by CouchDB to store data. Big data technologies are available from
Pentaho to extract, prepare, and combine data. It provides analytics and
visualisations that transform how any business is run. Apache Flink is an open-source data analytics tool for large-scale data stream processing. The most
efficient, user-friendly, and highly secure big data platform today is Cloudera.
Anyone may access any data from any setting using a single, scalable platform.
OpenRefine is a large data analytics tool that aids in working with unclean data,
cleaning it up, and converting it between different formats. RapidMiner is
utilised for model deployment, machine learning, and data preparation. It
provides a range of products to set up predictive analysis and create new data
mining methods.
Check Your Progress 3

1. What are the main features of HBase?


2. Explain WAL in HBase.

7.6 SUMMARY

In this unit, we have learnt about various tools and cutting-edge big data technologies used by data analytics professionals worldwide. In particular, we studied in detail the usage, installation, components and working of three main big data tools: Spark, Hive and HBase. Furthermore, we also discussed how query processing is done in Hive and HBase.

7.9 SOLUTIONS/ANSWERS

Check Your Progress 1

1. The main features of Apache Spark are as follows:

i) Fast processing: In a Hadoop cluster, Spark makes it possible for applications to execute up to 100 times faster in memory and 10 times faster on disk by reducing the number of read/write operations.

ii) Multi-language support: Python, Scala, or Java-based built-in APIs are


available with Spark. Consequently, you can create applications in a variety of
languages. 80 high-level operators are provided by Spark for interactive
querying.

iii) Advance Analytics support: Spark offers more than just "Map" and
"Reduce." Additionally, it supports graph methods, streaming data, machine
learning (ML), and SQL queries.

2. The main components of Apache Spark framework are Spark core, Spark SQL
for interactive queries, Spark Streaming for real time streaming analytics,
Machine Learning Library, and GraphX for graph processing.

3. Installation steps for the Apache Spark framework:

Step 1: Check whether Java is installed; if not, install it.

Step 2: Check whether Scala is installed. If it is not, download it, extract the tar file, move the Scala software files to the /usr/ directory, set the environment path variable and, finally, check whether Scala has been installed successfully.

Step 3: Download Apache Spark.

Step 4: Install Spark: extract the requisite tar file, move the Spark software files to the /usr/ directory, set the environment path variable and, finally, check whether Spark has been installed successfully.


Check Your Progress 2

1. The main components in Hive architecture can be categorized into:


i) Hive Client: Using JDBC, ODBC, and Thrift drivers, Hive enables
applications built in any language, including Python, Java, C++,
Ruby, and more. Thus, creating a hive client application in a language
of one's choice is quite simple. Hive clients can be of three types:
a) Thrift Clients, b) JDBC Clients c) ODBC Clients

ii) Hive services: Hive offers a number of services, like the Hive
server2, Beeline, etc., to handle all queries. Hive provides a range of
services, including: a) Beeline, b) Hive Server 2, c) Hive Driver, d)
Hive Compiler, e) Optimizer, f) Execution engine, g) Metastore, h)
HCatalog i) WebHCat.
iii) Processing and Resource Management: Internally, the de facto
engine for Hive's query execution is the MapReduce framework. A
software framework called MapReduce is used to create programmes
that process enormous amounts of data concurrently on vast clusters
of commodity hardware. Data is divided into pieces and processed
by map-reduce tasks as part of a map-reduce job.
iv) Distributed Storage: Since Hadoop is the foundation of Hive, the
distributed storage is handled by the Hadoop Distributed File System.
2. Hive has 4 main data types:
1. Column Types: Integer, String, Union, Timestamp
2. Literals: Floating, Decimal
3. Null Values: All missing values as NULL
4. Complex Types: Arrays, Maps, Struct
3. There are 4 main types:
1. JOIN: This is very similar to the SQL JOIN; it combines records from two tables based on a join condition.
2. FULL OUTER JOIN: The records from the left and right outer tables are
combined in a FULL OUTER JOIN.
3. LEFT OUTER JOIN: All rows from the left table are retrieved using the
LEFT OUTER JOIN even if there are no matches in the right table.
4. RIGHT OUTER JOIN: In this case as well, even if there are no matches in
the left table, all the rows from the right table are retrieved.

Check Your Progress 3


1. The main features of HBase includes:
• HBase scales linearly.
• It supports automatic failure.
• It offers reliable reading and writing.
• It integrates with Hadoop, both as a source and as a destination.
• It features a simple Java client API.
• Data replication between clusters is offered.

2. All modifications to the data in HBase's file-based storage are tracked


by the Write Ahead Log (WAL). The WAL makes sure that the data
changes can be replayed in the event that a Region Server fails or
becomes unavailable before the MemStore is cleared.
7.10 REFERENCES AND FURTHER READINGS

[1] https://www.infoworld.com/article/3236869/what-is-apache-spark-the-big-data-
platform-that-crushed-hadoop.html
[2] https://aws.amazon.com/big-data/what-is-spark/
[3] https://spark.apache.org/
[4] https://www.tutorialspoint.com/hive/hive_views_and_indexes.htm
[5] https://hive.apache.org/downloads.html
[6] https://data-flair.training/blogs/apache-hive-architecture/
[7] https://halvadeforspark.readthedocs.io/en/latest/
[8] https://www.tutorialspoint.com/hbase/index.htm
[9] Capriolo, Edward, Dean Wampler, and Jason Rutherglen. Programming Hive:
Data warehouse and query language for Hadoop. " O'Reilly Media, Inc.", 2012.
[10] Du, Dayong. Apache Hive Essentials. Packt Publishing Ltd, 2015.
[11] Karau, H., Konwinski, A., Wendell, P., & Zaharia, M. (2015). Learning spark:
lightning-fast big data analysis. " O'Reilly Media, Inc.".
[12] George, Lars. HBase: the definitive guide: random access to your planet-size
data. " O'Reilly Media, Inc.", 2011.

UNIT 8 NoSQL DATABASE
Structure

8.0 Introduction
8.1 Objectives
8.2 Introduction to NoSQL
      8.2.1 What is NoSQL
      8.2.2 Brief History of NoSQL Databases
      8.2.3 NoSQL Database Features
      8.2.4 Difference between RDBMS and NoSQL
8.3 Types of NoSQL Databases
      8.3.1 Column Based
      8.3.2 Graph Based
      8.3.3 Key-Value Pair Based
      8.3.4 Document Based
8.4 Summary
8.5 Solutions/Answers
8.6 Further Readings

8.0 INTRODUCTION

In the previous Units of this Block, you have gone through various large data
architectural frameworks, such as Hadoop, SPARK and other similar technologies.
However, these technologies are not a replacement for large-scale database systems.
NoSQL databases arose because databases at the time were not able to support the rapid
development of scalable web-based applications.

NoSQL databases have changed the manner in which data is stored and used, despite
the fact that relational databases are still commonly employed. Most applications come
with features like Google-style search, for instance. The growth of data, online surfing,
mobile use, and analytics have drastically altered the requirements of contemporary
databases. These additional requirements have spurred the expansion of NoSQL
databases, which now include a range of types such as key-value, document, column,
and graph.

In this Unit, we will discuss the many kinds of NoSQL databases, including those that
are built on columns, graphs, key-value pairs, and documents respectively.

8.1 OBJECTIVES

After going through this unit, you should be able to:


• define what is NoSQL;
• differentiate between NoSQL and SQL;
• explain the basic features of Column based NoSQL Database;
• explain the Graph-based NoSQL Database;
• explain the Key-value pair based NoSQL Database and
• explain the Document based NoSQL Database.
8.2 INTRODUCTION TO NoSQL

Databases are a crucial part of many technological and practical systems. The
phrase "NoSQL database" is frequently used to describe any non-relational
database. NoSQL is sometimes referred to as "non SQL," but it is also referred
to as "not only SQL." In either case, the majority of people agree that a NoSQL
database is a type of database that stores data in a format that is different from
relational tables.

Whenever you want to use the data, it must first be saved in a particular structure
and then converted into a usable format. On the other hand, there are some
circumstances in which the data are not always presented in a structured style,
which means that their schemas are not always rigorous. This unit provides an
in-depth look into NoSQL and the features that make it unique.

8.2.1 What is NoSQL?

NoSQL is a way to build databases that can accommodate many different kinds
of information, such as key-value pairs, multimedia files, documents, columnar
data, graphs, external files, and more. In order to facilitate the development of
cutting-edge applications, NoSQL was designed to work with a variety of
different data models and schemas.

The great functionality, ease of development, and performance at scale offered


by NoSQL have helped make it a popular name. NoSQL is sometimes referred
to as a non-relational database due to the numerous data handling features it
offers. Due to the fact that it does not adhere to the guidelines established by
Relational Database Management Systems (RDBMS), you cannot query your
data using conventional SQL commands. We can think of such well-known
examples as MongoDB, Neo4J, HyperGraphDB, etc.

8.2.2 Brief history of NoSQL Databases

In the late 2000s, as the price of storage began to plummet, NoSQL databases began to gain popularity. It was no longer necessary to develop a sophisticated, difficult-to-manage data model just to prevent data duplication. Because developers' time was quickly becoming more expensive than data storage, NoSQL databases were designed with developer efficiency in mind.

Table 1: History of Databases

Year            Database Solutions                                Company / Database Technology
1970-2000       Mainly RDBMS related                              Oracle, IBM DB2, SQL Server, MySQL
2000-2005       DotCom boom – new scale solutions,                Google, Facebook, IBM, Amazon
                start of NoSQL development, whitepapers
2005-2010       New open source & mainstream databases            Cassandra, Riak, Apache HBase, Neo4j,
                                                                  MongoDB, CouchDB, Redis
2010 onwards    Adoption of Cloud                                 DBaaS (Database as a Service)

As storage costs reduced significantly, the quantity of data that applications were
required to store and query grew. This data came in all forms — structured, semi-structured, and unstructured — and all sizes, making it practically difficult to define the schema in advance. NoSQL databases give programmers a great deal of
freedom by enabling them to store enormous amounts of unstructured data.

In addition, the Agile Manifesto was gaining momentum, and software


developers were reconsidering their approach to software development. They
were beginning to understand the need of being able to quickly adjust to ever-
evolving requirements. They required the flexibility to make rapid iterations and
adjustments to all parts of their software stack, including the underlying
database. They were able to achieve this flexibility because of NoSQL
databases.

The use of the public cloud as a platform for storing and serving up data and
applications was another trend that arose, as cloud computing gained popularity.
To make their applications more robust, to expand out rather than up, and to
strategically position their data across geographies, they needed the option to
store data across various servers and locations. These features are offered by
some NoSQL databases like MongoDB.

Figure 1: Cost per MB of Data over Time (Log Scale)


(Adapted from https://www.mongodb.com)

8.2.3 NoSQL database features

Every NoSQL database comes with its own set of one-of-a-kind capabilities.
The following are general characteristics shared by several NoSQL databases:
• Schema flexibility
• Horizontal scaling
• Quick responses to queries as a result of the data model
• Ease of use for software developers

8.2.4 Difference between RDBMS and NoSQL

The differences and similarities between the two DBMSs are as follows:
• For the most part, NoSQL databases fall under the category of non-
relational or distributed databases, while SQL databases are classified as
Relational Database Management Systems (RDBMS).
• Databases that use the Structured Query Language (SQL) are table-
oriented, while NoSQL databases use either document-oriented or key-
value pairs or wide-column stores, or graph databases.
• Unlike NoSQL databases, which have dynamic or flexible schema to
manage unstructured data, SQL databases have a strict or static schema.
• Structured data is stored using SQL, whereas both structured and
unstructured data can be stored using NoSQL.
• SQL databases are thought to be scalable in a vertical direction, whereas
NoSQL databases are thought to be scalable in a horizontal direction.
• Increasing the computing capability of your hardware is the first step in
the scaling process for SQL databases. In contrast, NoSQL databases
scale by distributing the load over multiple servers.
• MySQL, Oracle, PostgreSQL, and Microsoft SQL Server are all
examples of SQL databases. BigTable, MongoDB, Redis, Cassandra,
RavenDb, Hbase, CouchDB, and Neo4j are a few examples of NoSQL
databases.
SQL databases scale vertically, which means that an excessive load is managed by increasing the CPU, SSD, RAM, GPU, etc. of a single server. For NoSQL databases, the ability to scale horizontally is one of their defining characteristics: adding more servers makes the task of managing a growing demand more manageable.

☞ Check Your Progress 1


1) What is NoSQL?
……………………………………………………………………………
……………………………………………………………………………
…………….………………………………………………………………
2) What are the features of NoSQL databases?
……………………………………………………………………………
……………………………………………………………………………
………..………………………….…………..…………………………….

3) Differentiate between NoSQL and SQL.
……………………………………………………………………………
……………………………………………………………………………
……………………………………………………………………………

8.3 TYPES OF NOSQL DATABASES

In this section, we will discuss the many classifications of NoSQL databases.


There are typically four types of NoSQL databases:
1) Column-based: Instead of accumulating data in rows, this method
organizes it all together into columns, which makes it easier to query
large datasets.
2) Graph-based: These are systems that are utilized for the storage of
information regarding networks, such as social relationships.
3) Key-value pair based: This is the simplest sort of database, in which each
item of your database is saved in the form of an attribute name (also
known as a "key") coupled with the value.
4) Document-based: Made up of sets of key-value pairs that are kept in
documents.

8.3.1 Column based


A column store, in contrast to a relational database, is arranged as a set of
columns, rather than rows. This allows you to read only the columns you need
for analysis, saving memory space that would otherwise be taken up by
irrelevant information. Because columns are frequently of the same kind, they
are able to take advantage of more efficient compression, which makes data
reading even quicker. The value of a specific column can be quickly aggregated
using columnar databases.
Although columnar databases are excellent for analytics, the way they write data makes it challenging for them to remain consistent, because writing a single record requires several write operations on disk, one for each column. This problem does not arise with relational databases, where the data of a row is written to disk contiguously.

How Does a Column Database Work?


A columnar database is a type of database management system (DBMS) that
allows data to be stored in columns rather than rows. It is accountable for
reducing the amount of time needed to return a certain query. Additionally, it
is accountable for the significant enhancement of the disk I/O performance.
Both data analytics and data warehousing benefit from it. Additionally, the
primary goal of a Columnar Database is to read and write data in an efficient
manner. Column-store databases include Cassandra, Cosmos DB, Bigtable, and HBase, to name a few.
Columnar Database Vs Row Database:
When processing big data analytics and data warehousing, there are a number
of different techniques that can be used, including columnar databases and row
databases. But each takes a different approach.

For instance:
• Row Database: “Customer 1: Name, Address, Location". (The fields for each
new record are stored in a long row).
• Columnar Database: “Customer 1: Name, Address, Location”. (Each field
has its own set of columns). Refer Table 2 for relational database example.

Table 2: Relational database: an example


ID Number First Name Last Name Amount

A01234 Sima Kaur 4000

B03249 Tapan Rao 5000

C02345 Srikant Peter 1000

In a Columnar DBMS, the data will be stored in the following format:


A01234, B03249, C02345; Sima, Tapan, Srikant; Kaur, Rao, Peter; 4000,
5000, 1000.

In a Row-oriented DBMS, the data will be stored in the following format:


A01234, Sima, Kaur, 4000; B03249, Tapan, Rao, 5000; C02345, Srikant,
Peter, 1000.
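The difference between the two layouts can be sketched with ordinary Python data structures, using the data of Table 2. This is only an in-memory illustration of the idea, not an actual columnar storage engine:

```python
# Row-oriented layout: one record per customer
rows = [
    {"ID": "A01234", "First": "Sima",    "Last": "Kaur",  "Amount": 4000},
    {"ID": "B03249", "First": "Tapan",   "Last": "Rao",   "Amount": 5000},
    {"ID": "C02345", "First": "Srikant", "Last": "Peter", "Amount": 1000},
]

# Column-oriented layout: one list per column
columns = {
    "ID":     ["A01234", "B03249", "C02345"],
    "First":  ["Sima", "Tapan", "Srikant"],
    "Last":   ["Kaur", "Rao", "Peter"],
    "Amount": [4000, 5000, 1000],
}

# Aggregating Amount in the row layout touches every record ...
total_row_layout = sum(r["Amount"] for r in rows)

# ... while in the column layout only the Amount column needs to be read
total_col_layout = sum(columns["Amount"])

print(total_row_layout, total_col_layout)   # 10000 10000
```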

Columnar databases: advantages

The use of columnar databases has various advantages:

• Column stores are highly effective in compression, making them storage


efficient. This implies that you can conserve disk space while storing
enormous amounts of data in a single column.
• Aggregation queries are fairly quick with column-store databases
because the majority of the data is kept in a column, which is beneficial
for projects that need to execute a lot of queries quickly.
• Load times are also quite good; a table with a billion rows can be loaded
in a matter of seconds. This suggests that you can load and query
practically instantly.
• A great deal of versatility, because columns do not have to resemble one another. Adding new or different columns does not affect the existing data; however, inserting an entirely new record requires a write to every column.
• Overall, column-store databases are excellent for analytics and reporting
due to their quick query response times and capacity to store massive
volumes of data without incurring significant costs.

Column databases: Disadvantages

While there are many benefits to adopting column-oriented databases, there are
also a few drawbacks to keep in mind.

• It takes a lot of time and effort to create an efficient indexing schema.


• Incremental data loading is undesirable and is to be avoided, if at all
possible, even though this might not be a problem for some users.
• This applies to all forms of NoSQL databases, not just those with
columns. Web applications frequently have security flaws, and the
absence of security features in NoSQL databases does not help. If
security is your top goal, you should either consider using relational
databases or, if it's possible, use a clearly specified schema.
• Due to the way data is stored, Online Transaction Processing (OLTP)
applications are incompatible with columnar databases.

Are column databases always NoSQL?

Before we conclude, we should note that column-store databases are not always
NoSQL-only. It is frequently argued that column-store belongs firmly in the
NoSQL camp because it differs so much from relational database approaches.
The debate between NoSQL and SQL is generally quite nuanced, therefore this
is not usually the case. They are essentially the same as SQL techniques when it
comes to column-store databases. For instance, keyspaces function as schema,
so schema management is still necessary. A NoSQL data store's keyspace
contains all column families. The concept is comparable to relational database
management systems' schema. There is typically only one keyspace per
program. Another illustration is the fact that the metadata occasionally
resembles a conventional relational DBMS perfectly. Ironically, column-store
databases frequently adhere to ACID and SQL standards. However, NoSQL
databases are often either document-store or key-store, neither of which are
column-store. Therefore, it is difficult to claim that column-store is a pure
NoSQL system.

8.3.2 Graph based


Although SQL is an excellent RDBMS approach that has been used for many years to manage massive amounts of data, the initial hardware hurdles that once limited how much data it could feasibly handle are no longer the deciding factor. As a result, NoSQL has rapidly emerged as a dominant form of contemporary database management, and many of the largest websites we rely on today are powered by NoSQL, like Twitter's use of FlockDB and Amazon's DynamoDB.
A database that stores data using graph structures is known as a graph database.
It represents and stores data using nodes, edges, and attributes rather than tables
or documents. Relationships between the nodes are represented by the edges.
This makes data retrieval simpler and, in many circumstances, requires only one operation. Additionally, it works very well as a database for fast, threaded data structures like those used on Twitter.

How does a Graph Database Work?


Graphs, which are not relational databases, rely heavily on the idea of multi-
relational data "pathways" for their functionality. However, the structure of
graph databases is typically simple. They are largely made up of two elements:
• The Node: This represents the actual data itself. It may be the number of
people who watched a video on YouTube; it could be the number of
people who read a tweet; or it could even be fundamental information
like people's names, addresses, and other such details.
• The Edge: This captures the actual connection between two nodes. It is interesting to note that edges can also carry their own data, such as the type of connection between two nodes; edges may also be directed, describing the direction in which the relationship flows.
Graph databases are mostly utilized for studying relationships. For instance,
businesses might extract client information from social media using a graph
database. For example, some organization might use a graph database to extract
data about relationships between Person, Restaurant, and City, as shown in
Figure 2.

Figure 2. Different Nodes and Edges in Graph Database.


(Adapted from https://www.kdnuggets.com/)

When do we need a Graph Database?


1) It resolves issues with many-to-many relationships. For example, many-
to-many relationships include friends of friends.
2) When connections among data pieces are more significant. For example,
there is a profile with some unique information, but the main selling
point is the relationship between these different profiles, which is how
you get connected inside a network.
3) Low latency with big amounts of data. The relational database's data sets
will grow significantly as you add more relationships, and when you
query it, its complexity will increase and it will take longer than usual.
However, graph databases are specifically created for this purpose, and
one can easily query relationships.

Now, let’s look at a more specific illustration to explain a group of people's


complicated relationships. For example, five friends share a social network. These friends are Biney, Bhawna, Chaitaya, Manish, and Mohit. Their personal
data may be kept in a graph database that resembles this, as shown in Figure 3
and Table 3:

Figure 3. Example-Five friends sharing social network.

Table 3: Relational database: an example


Id Firstname Lastname Email Mobile
1001 Biney Dayal binnya@example.com 8645212321
1002 Bhawna Rao bhawanrao@example.com 9645212323
1003 Chaitaya Robert chaitayarob@example.com 7645212356
1004 Manish Kumar mkumar@example.com 9955212320
1005 Mohit Jain mjain@example.com 9945212329
This means we will need yet another table to keep track of user relationships.
Our friendship table (refer Table 4) will resemble the following:

Table 4: Friendship Table


user_id friend_id
1001 1002
1001 1003
1001 1004
1001 1005
1002 1001
1002 1003
1002 1004
1002 1005
1003 1001
1003 1002
1003 1004
1003 1005
1004 1001
1004 1002
1004 1003
1004 1005
1005 1001
1005 1002
1005 1003
1005 1004
We won't go too deeply into the theory of the database's primary key and foreign key. Instead, presume that the friendship table stores the ids of both friends. Let's say that every member of our social network gets access to a feature that lets them view the personal information of the other users who are friends with them. This means that if Chaitaya were to ask for information, it would be regarding
Biney, Bhawna, Manish and Mohit. We shall address this issue in a conventional
(relational database) manner. First, we need to locate Chaitaya's user id in the
database's Users table (refer Table 5).

Table 5: Chaitaya’s Record


Id Firstname Lastname Email Mobile
1003 Chaitaya Robert chaitayarob@example.com 7645212356
We would now search the friendship table (refer Table 6) for all tuples with the user_id 1003. The resulting relation would look like this:

Table 6: Friendship Table for user id 3


user_id friend_id
1003 1001
1003 1002
1003 1004
1003 1005
Let us now examine the time required for this relational database strategy. Each lookup will take close to log(N) time, where N is the number of tuples in the friendship table, since the database keeps the entries in sequential order based on their ids. So, in general, the time complexity for M queries is M*log(N). Had we used a graph database strategy instead, the overall time complexity would have been O(N), for the simple reason that once Chaitaya has been located in the database, all the rest of her friends may be found with a single click, as shown in Figure 4.

Figure 4. Accessing others data with a single click.
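The contrast between the two strategies can be sketched with plain Python data structures. Only a few rows of Table 4 are reproduced, and the in-memory dictionary is merely an illustration of the adjacency idea, not a real graph engine:

```python
from bisect import bisect_left

# Relational-style lookup: friendships kept as a sorted list of (user_id, friend_id) tuples
friendship = sorted([(1001, 1002), (1001, 1003), (1001, 1004), (1001, 1005),
                     (1003, 1001), (1003, 1002), (1003, 1004), (1003, 1005)])

def friends_relational(user_id):
    # Binary search for the first tuple of this user, then scan its block: about log(N) per query
    i = bisect_left(friendship, (user_id, 0))
    result = []
    while i < len(friendship) and friendship[i][0] == user_id:
        result.append(friendship[i][1])
        i += 1
    return result

# Graph-style lookup: each node keeps direct references (edges) to its neighbours
graph = {1001: [1002, 1003, 1004, 1005], 1003: [1001, 1002, 1004, 1005]}

def friends_graph(user_id):
    # Once the node is located, its friends are reached directly, without searching all tuples
    return graph.get(user_id, [])

print(friends_relational(1003))   # [1001, 1002, 1004, 1005]
print(friends_graph(1003))        # [1001, 1002, 1004, 1005]
```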


Graph Database Examples
Although graph databases are not as widely used as other NoSQL databases,
there are a handful that have become de facto standards when discussing
NoSQL:
Neo4j is an open-source graph database developed in Java and is considered to be one of the best graph databases. In addition, it comes with its own query language, known as Cypher, which is comparable to the declarative SQL language but is designed to work with graphs. Besides Java, it supports a number of other popular programming languages, including Python, .NET, JavaScript, and a few others. Neo4j excels in applications such as the administration of data centers and the identification of fraudulent activity.
RedisGraph is a graph module that is integrated into Redis, which is a key-
value NoSQL database. RedisGraph was developed to have its data saved in
RAM for the same reason that Redis itself is constructed on in-memory data
structures. As a result, a graph database with excellent speed and quick searching
and indexing is created. RedisGraph also makes use of Cypher, which is ideal if
you're a programmer or data scientist looking for greater database flexibility.
Applications that require blazing-fast performance are the main uses.
OrientDB: It is interesting to note that OrientDB supports graph, document store, key-value store, and object-based data formats. Having stated that, the graph model, which uses direct links between records, is used to hold all of the relationships. Like Neo4j, OrientDB is open-source and developed in Java, although it does not use Cypher. OrientDB is designed to be used in situations where multiple data models are necessary, and as a result, it is optimized for data consistency as well as for minimizing data complexity.

8.3.3 Key-value pair based


Key-value stores are perhaps the most widely used of the four major NoSQL
database formats because of their simplicity and quick performance. Let us
examine key-value stores' operation and application in more detail. With some
of the most well-known platforms and services depending on them to deliver
material to users with lightning speed, NoSQL has grown in significance in our
daily lives. Of course, NoSQL includes a range of database types, but key-value
store is unquestionably the most used.
Because of its extreme simplicity, this kind of data model is built to execute
incredibly quickly when compared to relational databases. Furthermore, because
key-value stores adhere to the scalable NoSQL design philosophy, they are
flexible and simple to set up.
How Does a Key-Value Work?
In reality, key-value storage is quite simple. A value is saved with a key that
specifies its location, and a value can be pretty much any piece of data or
information. In reality, this design idea may be found in almost every
programming language as an array or map object, refer Figure 5. The fact that it
is persistently kept in a database management system makes a difference in this
case.

Figure 5. Example Key-Value database.


Popularity of key-value stores is due to the fact that information is stored as a
single large piece of data instead of as discrete data. As a result, indexing the
database is not really necessary to improve its performance. Instead, because of
the way it is set up, it operates more quickly on its own. Similar to that, it mostly
uses the get, put, and delete commands rather than having a language of its own.
Of course, this has the drawback that the data you receive in response to a request
is not screened. Under certain conditions, this lack of data management may be
problematic, but generally speaking, the trade-off is worthwhile. Because key-
value stores are both quick and reliable, the vast majority of programmers find
ways to get around any filtering or control problems that may arise.
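The get/put/delete interface described above can be mimicked in a few lines of Python. The class below is purely illustrative and is not the API of any particular key-value database:

```python
class KeyValueStore:
    """A toy in-memory key-value store exposing get, put and delete."""

    def __init__(self):
        self._data = {}

    def put(self, key, value):
        # The value is stored as one opaque blob; the store does not inspect it.
        self._data[key] = value

    def get(self, key, default=None):
        return self._data.get(key, default)

    def delete(self, key):
        self._data.pop(key, None)


store = KeyValueStore()
store.put("session:1001", {"user": "Sima", "cart": ["book-1", "book-2"]})
print(store.get("session:1001"))   # the whole blob comes back in one request
store.delete("session:1001")
```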
Benefits of Key-Value
Key-value data models, one of the more well-liked types of NoSQL data models,
provide many advantages when it comes to creating a database:
Scalability: Key-value stores, like NoSQL in general, are infinitely scalable in
a horizontal fashion, which is one of its main advantages over relational
databases. This can be a huge advantage for sophisticated and larger databases
compared to relational databases, where expansion is vertical and finite, as
shown in Figure 6.

Figure 6. Horizontal and Vertical Scalability.


More specifically, partitioning and replication are used to manage this scaling. In exchange, key-value stores typically relax ACID guarantees in favour of low-overhead server calls.
No/Simpler Querying: With key-value stores, querying is really not possible
except in very particular circumstances when it comes to querying keys, and
even then, it is not always practicable. Because there is just one request to read
and one request to write, key-value makes it easier to manage situations like
sessions, user profiles, shopping carts, and so on (due to the blob-like nature of
how the data is stored). Similar to this, concurrency problems are simpler to
manage because only one key needs to be resolved.
Mobility: Because key-value stores lack a query language, it is simple to move
them from one system to another without modifying the architecture or the code.
Thus, switching operating systems is less disruptive than switching relational
databases.
When to Use Key-Value
Key-value stores excel where traditional relational databases struggle: handling a very large number of read/write operations. Due to its scalability, key-value can readily scale to thousands of users per second. Additionally, it can easily withstand lost storage or data because of the built-in redundancy.
As a result, key-value excels in the following instances:
• Profiles and user preferences
• Large-scale user session management
• Product suggestions (such as in eCommerce platforms)

• Delivery of personalized ads to users based on their data profiles
• Cache data for infrequently updated data
There are numerous other circumstances where key-value works nicely. For
instance, because of its scalability, it frequently finds usage in big data research.
Similar to how it works for web applications, key-value is effective for
organizing player sessions in MMOG (massively multiplayer online game) and
other online games.

Key-Value Database Examples


Some key-value database models save information to a solid-state drive (SSD), while others use random-access memory (RAM). We depend on key-value stores on a daily basis, since some of the most popular and frequently used databases are key-value stores.
Amazon DynamoDB is most likely the most widely used key-value database. In point of fact, Amazon's research on Dynamo was an impetus for the rise in popularity of NoSQL.
Aerospike is a free and open-source database that was designed specifically for
use with in-memory data storage.
Berkeley DB: Another free and open-source database, Berkeley DB is a high-
performance framework for storing databases, despite the fact that it has a very
simple interface.
Couchbase: Text searches and querying in a SQL-like format are both possible
with Couchbase, which is an interesting feature.
Memcached not only saves cached data in RAM, which helps websites load
more quickly, but it is also free and open source.
Riak was designed specifically for use in the app development process, and it
plays well with other databases and app platforms.
Redis: A database that serves as both a memory cache and a message broker.

8.3.4 Document based


A non-relational database that stores data as structured documents is known as a
document database (also known as a NoSQL document store). Instead of using
standard rows and columns, such databases typically store data in JSON format. An XML or JSON file, or a PDF, are all examples of documents. NoSQL
is everywhere nowadays; just look at Twitter and its use of FlockDB or Amazon
and their use of DynamoDB. Figure 7 shows the difference between the
Relational and Document Store model.
Figure 7: Relational Vs Document Store Model.
In spite of the fact that there are a great deal of data models, each of which
contains hundreds of databases, the one we are going to investigate today is
called Document-store. One of the most common database models now in use,
document-store functions in a manner that is somewhat similar to that of the key-
value model in the sense that documents are saved together with particular keys
that access the information. Figure 8 (a) shows the document that holds
information about a book. This file is a JSON representation of a book's
metadata, which includes the book's BookID, Title, Author, and Year and Figure
8 (b) shows the same metadata for Key value database.

(a) A Document:

{
  "BookID": "978-1449396091",
  "Title": "DBMS",
  "Author": "Raghu Ramakrishnan",
  "Year": "2022"
}

(b) Key-Value:

Key        Value
BookID     978-1449396091
Title      DBMS
Author     Raghu Ramakrishnan
Year       2022

Figure 8: Example of Document and Key-value database
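A minimal sketch of the document idea in Python is shown below; an in-memory list of dictionaries stands in for a collection, and the additional BookID values are made up for illustration. This is not the API of any particular document database:

```python
# A "collection" of book documents; each document may have a different shape
books = [
    {"BookID": "978-1449396091", "Title": "DBMS",
     "Author": "Raghu Ramakrishnan", "Year": "2022"},
    {"BookID": "978-0000000001", "Title": "NoSQL Notes", "Tags": ["NoSQL", "Big Data"]},
]

# Create: add a new document (no fixed schema has to be declared first)
books.append({"BookID": "978-0000000002", "Title": "Data Science Notes"})

# Read: fetch a document by its key field
def find_by_id(book_id):
    return next((doc for doc in books if doc.get("BookID") == book_id), None)

print(find_by_id("978-1449396091")["Title"])   # DBMS

# Update and delete operate on whole documents
doc = find_by_id("978-1449396091")
doc["Year"] = "2023"
books.remove(find_by_id("978-0000000002"))
```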
When to use a document database?
• When your application requires data that is not structured in a table
format.
• When your application requires a large number of modest continuous
reads and writes and all you require is quick in-memory access.
• When your application requires CRUD (Create, Read, Update, Delete)
functionality.
• These are often adaptable and perform well when your application has to
run across a broad range of access patterns and data kinds.

How does a Document Database Work?
Document databases work under the assumption that any kind of information can be stored in a document. This suggests that you shouldn't have to worry about the database being unable to interpret any combination of data types. Naturally, in practice, most document databases continue to use some sort of schema with a predetermined structure and file format.
Document stores do not have the same constraints and limitations as SQL databases, which are both tabular and relational. This implies that using the information at hand is significantly simpler and that running queries can also be much simpler. You can still execute the same types of operations in a document store that you can in a SQL database, including removing, adding, and querying.
Each document requires a key of some kind, as was previously mentioned, and this key is given to it through a unique ID. This unique ID is used to access the document directly, rather than obtaining the data column by column.
Document databases often have a lower level of security than SQL databases.
As a result, you really need to think about database security, and utilizing Static
Application Security Testing (SAST) is one approach to do so. SAST, examines
the source code directly to hunt for flaws. Another option is to use DAST, a
dynamic version that can aid in preventing NoSQL injections.

Document database advantages


One major benefit of document-store is that all of the data is stored in a single
location, rather than being spread out over many interconnected databases. As a result, for workloads that do not require relational operations, a document store can perform better than a SQL database.
• Schema-less: Because there are no constraints on the format and
structure of data storage, they are particularly effective at keeping huge
quantities of existing data.
• Faster creation of document and maintenance: The creation of a
document is a fairly straightforward process, and apart from that, the
upkeep requirements are virtually nonexistent.
• Open formats: It offers a relatively easy construction process that makes
use of XML, JSON, and other formats.
• Built-in versioning: As documents grow in size, they may also grow in complexity; built-in versioning keeps track of such changes and makes conflicts less likely.
More precisely, document stores are excellent for the following applications
because schema can be changed without any downtime or because you could not
know future user needs:
• eCommerce giants (Like Amazon)
• Blogging platforms (such as Blogger, Tumblr)
• CMS (Content management systems) (Like WordPress, windows
registry)
• Analytical platforms (such as Tableau, Oracle server)
Document databases' drawbacks
• Weak Atomicity: Multi-document ACID transactions are not supported.
We will need to perform two different queries, one for each collection,
in order to handle a change in the document data model involving two
collections. This is where the atomicity criteria are violated.
• Consistency Check Limitations: A database performance issue may
arise from searching for documents and collections that aren't linked to
an author collection.
• Security: In today's world, many online apps do not have enough
security, which in turn leads to the disclosure of critical data. Thus, web
app vulnerabilities become a cause for concern.
Document databases examples
• One of the best-known NoSQL database engines is MongoDB, which stores data in a JSON-like format and has its own query language.
• A search engine built on the document-store data architecture is
Elasticsearch. Database searching and indexing may be accomplished
using this straightforward and easy-to-learn tool.
• CouchDB: It is used by Ubuntu as well as by the social networking site Facebook. It utilizes JavaScript and is developed in the Erlang programming language.
• BaseX is a simple, open-source, XML-based DBM that makes use of
Java.

☞ Check Your Progress 2


1) How Does a Column Database Work? Discuss.
……………………………………………………………………………
……………………………………………………………………………
…………….………………………………………………………………
2) What are the different Graph Database Examples?
……………………………………………………………………………
……………………………………………………………………………
………..………………………….…………..…………………………….

3) Explain document based NoSQL database.


……………………………………………………………………………
……………………………………………………………………………
………..………………………….…………..…………………………….

8.4 SUMMARY

This unit covered the fundamentals of NoSQL as well as the many kinds of
NoSQL databases, such as those based on columns, graphs, key-value pairs, and
documents. Numerous businesses now use NoSQL. It is difficult to pick the best
database platform. NoSQL databases are used by many businesses because of
their ability to handle mission-critical applications while decreasing risk, data
spread, and total cost of ownership.

Despite their incredible capability, column-store databases do have their own set
of problems. Due to the fact that columns require numerous writes to the disk,
for instance, the way the data is written results in a certain lack of consistency.
Graph databases can be used to serve content in high-performance scenarios while producing threads that are simple for the typical user to comprehend, beyond merely expressing information in a graphical and effective way (such as in the case of Twitter). The simplicity of a key-value store is what makes it so
brilliant. Although this has potential drawbacks, particularly when dealing with
more complicated issues like financial transactions, it was designed specifically
to fill in relational databases' inadequacies. We may create a pipeline that is even
more effective by combining relational and non-relational technologies, whether
we are working with users or data analysis. Document-store data models are
quite popular and regularly used due to their versatility. It helps analytics by
making it easy for firms to store multiple sorts of data for later use.

8.5 SOLUTIONS/ANSWERS

Check Your Progress 1


1) NoSQL is a way to build databases that can accommodate many different
kinds of information, such as key-value pairs, multimedia files, documents,
columnar data, graphs, external files, and more. In order to facilitate the
development of cutting-edge applications, NoSQL was designed to work
with a variety of different data models and schemas.
2)
• Schema flexibility
• Horizontal scaling
• Quick responses to queries as a result of the data model
• Ease of use for software developers
3) It is different in the following ways:
• For the most part, NoSQL databases fall under the category of non-
relational or distributed databases, while SQL databases are classified
as Relational Database Management Systems (RDBMS).
• Databases that use the Structured Query Language (SQL) are table-
oriented, while NoSQL databases use either document-oriented or key-
value pairs or wide-column stores, or graph databases.
• Unlike NoSQL databases, which have dynamic or flexible schema to
manage unstructured data, SQL databases have a strict, preset or static
schema.
• Structured data is stored using SQL, whereas both structured and
unstructured data can be stored using NoSQL.
• SQL databases are thought to be scalable in a vertical direction,
whereas NoSQL databases are thought to be scalable in a horizontal
direction.
• Increasing the computing capability of your hardware is the first step in
the scaling process for SQL databases. In contrast, NoSQL databases
scale by distributing the load over multiple servers.
• MySQL, SQLite, Oracle SQL, PostgreSQL, and Microsoft SQL Server
are all examples of SQL databases. BigTable, MongoDB, Redis,
Cassandra, RavenDb, Hbase, CouchDB, and Neo4j are a few examples
of NoSQL databases.

Check Your Progress 2


1) A columnar database is a type of database management system (DBMS)
that allows data to be stored in columns rather than rows. It is
accountable for reducing the amount of time needed to return a certain
query. Additionally, it is accountable for the significant enhancement of
the disk I/O performance. Both data analytics and data warehousing
benefit from it. Additionally, the primary goal of a Columnar Database
is to read and write data in an efficient manner. Column-store databases
include Cassandra, Cosmos DB, Bigtable, and HBase, to name a few. Also,
refer 8.3.1.

2) Graph Database Examples:

• Neo4j is an open-source graph database developed in Java. It is considered to be one of the best graph databases in
the world. In addition to that, it comes with its own language known as
Cypher, which is comparable to the declarative SQL language but is
designed to work with graphs. In addition to Java, it supports a number
of other popular programming languages, including Python, .NET,
JavaScript, and a few others. Neo4j excels in applications such as the
administration of data centres and the identification of fraudulent
activity.
• RedisGraph is a graph module that is integrated into Redis, which is a
key-value NoSQL database. RedisGraph was developed to have its data
saved in RAM for the same reason that Redis itself is constructed on in-
memory data structures. As a result, a graph database with excellent
speed and quick searching and indexing is created. RedisGraph also
makes use of Cypher, which is ideal if you're a programmer or data
scientist looking for greater database flexibility. Applications that
require blazing-fast performance are the main uses.
• OrientDB: It's interesting to note that OrientDB supports graph,
document store, key-value store, and object-based data formats. Having
stated that, the graph model, which uses direct links between records,
is used to hold all of the relationships.

3) It is generally agreed that document stores, which are a sort of NoSQL


database, are the most advanced of the available options. They use JSON
as their data storage format, which is different from the more traditional
rows and columns layout. Most of the day-to-day activities that we carry
out on the internet are supported by NoSQL databases. NoSQL is
everywhere nowadays; just look at Twitter and its use of FlockDB or
Amazon and their use of DynamoDB. Also, refer 8.3.4.

8.6 FURTHER READINGS

1) Next Generation Databases: NoSQL and Big Data 1st ed. Edition, G. Harrison,
Apress, December 26, 2015.
2) Shashank Tiwari, Professional NoSQL, 1st Edition, Wrox, September 2011.
3) https://www.kdnuggets.com/
UNIT 9 MINING BIG DATA

Structure


9.0 Introduction
9.1 Objectives
9.2 Finding Similar Items
9.3 Finding Similar Sets
9.3.1 Jaccard Similarity of Sets
9.3.2 Documents Similarity
9.3.3 Collaborative Filtering and Set Similarity
9.4 Finding Similar Documents
9.4.1 Shingles
9.4.2 Minhashing
9.4.3 Locality Sensitive Hashing
9.5 Distance Measures
9.5.1 Euclidean Distance
9.5.2 Jaccard Distance
9.5.3 Cosine Distance
9.5.4 Edit Distance
9.5.5 Hamming Distance
9.6 Introduction to Other Techniques
9.7 Summary
9.8 Solutions/Answers
9.9 References/Further Readings

9.0 INTRODUCTION

In the previous Block, you have gone through the concepts of big data and Big data
handling frameworks. These concepts include the distributed file system, MapReduce
and other similar architectures. This Block focuses on some of the techniques that
can be used to find useful information from big data.
This Unit focuses on the issue of finding similar item sets in big data. The Unit also
defines the measures for finding the distances between two data objects. Some of these measures include Jaccard distance, Hamming distance, etc. In addition, the Unit also discusses finding similarities among documents using shingles. Finally,
this unit introduces you to some of the other techniques that can be used for analysing
Big data.

9.1 OBJECTIVES
After going through this Unit, you will be able to:
• Explain different techniques for finding similar items
• Explain the process of the collaborative filtering process
• Use shingling to find similar documents
• Explain various techniques to measure the distance between two objects
• Define supervised and unsupervised learning

9.2 FINDING SIMILAR ITEMS

Finding similar items for a small set of documents may be a simple problem,
but how do you find a set of similar items when the number of items is
extremely large? This section defines the basic issues of finding similar items
and their applications.

Problem Definition:
Given an extremely large collection of item sets, which may have millions or
billions of sets, how to find the set of similar item sets without comparing all
the possible combinations of item sets, using the notion that similar item sets
may have many common sub-sets.
Purpose of Finding Similar Items:
1. You may like to identify web pages that use similar words. This information can be used to classify web pages.
2. You may like to find the purchases and feedback of customers on
similar items to classify them into similar groups, leading to making
recommendations for purchases for these customers.
3. Another interesting use of similar item set is in entity resolution, where
you need to find out if it is the same person across different
applications like e-Commerce web site, social media, searches etc.
4. Two or more web pages amongst millions of web pages may be
identical. These web pages may be plagiarized or a mirror of a simple
website.
Why is Finding Similar Items a Problem?
One of the simplest ways to find similar items would be to compare two documents or web pages and determine whether they are identical by comparing the sequences of characters/words used in those documents/web pages. However, to find duplicates among just 10 documents/web pages, you need to compare 10C2 = 45 pairs. In general, for n documents/web pages, you may need to compare n×(n-1)/2 pairs of documents/web pages. For about 10^6 documents/web pages, you need to make about 5×10^11 comparisons. Thus, the question is how to find identical documents/web pages amongst billions of documents/web pages without checking all the combinations.
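The quadratic growth in the number of pairs is easy to verify:

```python
n = 1_000_000                 # number of documents/web pages
pairs = n * (n - 1) // 2      # nC2 = n(n-1)/2
print(pairs)                  # 499999500000, i.e. roughly 5 x 10^11 comparisons
```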

Next, we first discuss a technique to find the similarity of sets, followed by


techniques that can be used to find similar item sets efficiently.

9.3 FINDING SIMILAR SETS


In this section, we discuss one of the measures of finding similarity among the
sets and then discuss how this similarity measure can be used for finding the
textual similarity of documents; and collaborative filtering, which can be used
for finding a similar group of customers.

9.3.1 Jaccard Similarity of Sets

Jaccard similarity is defined in the context of sets. Consider two sets – set A
and set B, then the Jaccard similarity JSA,B is defined as a ratio of the
cardinality of the set A∩B and cardinality of the set A∪B. The value of JSA,B
can vary from 0 to 1. The following equation represents the Jaccard similarity:
JSA,B = |A ∩ B| / |A ∪ B|
For example, consider the sets A={a, b, c, d, e} and set B={c, d, e, f, g}, then
Jaccard similarity of these two sets would be:

JSA,B = |{c, d, e}| / |{a, b, c, d, e, f, g}| = 3/7
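The same computation can be written in Python using the example sets given above:

```python
def jaccard_similarity(a, b):
    """Jaccard similarity of two sets: |A intersection B| / |A union B|."""
    return len(a & b) / len(a | b)

A = {"a", "b", "c", "d", "e"}
B = {"c", "d", "e", "f", "g"}
print(jaccard_similarity(A, B))   # 0.428..., i.e. 3/7
```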

9.3.2 Documents Similarity

Finding document similarity is an interesting domain of use of Jaccard


similarity. It can be used to identify a collection of almost similar documents,
news items, web pages or reports. This is also referred to as the character-
level similarity of documents. Another kind of document similarity also looks
at documents having similar meanings. This kind of similarity requires you to
identify similar words and sometimes the meaning of the words. This is also a
very useful similarity, but it can be solved using different types of techniques.
How can you identify, if two documents are identical? This can be done
simply by character-by-character matching. However, this algorithm may be
of little use as many documents may be mostly similar though not exactly.
The following cases specify this situation:
Plagiarism Checking: The plagiarized text may differ in small
segments, as some portions of the document may have been changed
or even the ordering of some of the sentences may be changed.
Mirror Websites: Even the mirror websites also make changes, as per
the local advertisements and requirements.
The versions of Course Material: Even the versions of older course
material and newer course materials, assignments etc. available on
different websites may be slightly different.
Similar News Reports: The news reports about a news item may
consist of similar text, which may be appended with supplementary
information by each newspaper.

9.3.3 Collaborative Filtering and Set Similarity


In the past decade, e-commerce has gained acceptance by customers. Many of
you, who have purchased and have liked various items on these e-commerce
websites, get online recommendations for the purchase of certain items. You
may have observed that many times those recommendations are very close to
purchases that you would like to do in near future. This is achieved by the
process of collaborative filtering, which tries to group you based on your
purchases and likes with other users making similar purchases and likes and
then recommending you the products that another person in the group has
purchased and liked.
Similarity of the sets is used to address this problem. Two customers can be in
a similar customer group if they purchase and like similar products. This
similarity is determined by the Jaccard set similarity. However, please note
that the number of customers, as well as the products that can be purchased on
an e-commerce website, are very large. For example, two separate customers
may be interested in purchasing books. They may purchase the same books
available on Data Science and may rate them similarly. Please note they may
also purchase many other books, which may be bought by only one of them.
How are the problems of collaborative filtering and document similarity different? The prime difference is in the value of Jaccard similarity. For document similarity, you are looking at a JSA,B in the range of 50% or above, whereas in the case of collaborative filtering, you may be interested even if the value of JSA,B is as low as 10-25%.
In general, collaborative filtering uses binary data, such as like/dislike or
seen/not seen or purchased/not purchased etc. This data can easily be
translated into a set such as a set of likes of items; a set of purchased items;
etc. Thus, finding Jaccard's similarity is straightforward. However, in many
situations, a 5-, 7- or 11-point Likert scale may be used to rate the articles. An
interesting way to deal with such data may be to create a bag instead of a set
having as many instances of an item, as the rating. For example, consider that
on a 5-point scale, a customer gave a rating of 2 to product A and a rating of 3
for product B and another customer gave a rating of 1 to product A and a
rating of 4 for product B, then the related data would be represented as:
BagCust1 = {a, a, b, b, b}
BagCust2 = {a, b, b, b, b}
JSCust1,Cust2 = |{a, b, b, b}| / |{a, a, b, b, b, b}| = 4/6 = 2/3
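The same value can be computed in Python by representing the bags with collections.Counter, where the intersection takes the minimum count of each item and the union takes the maximum count, matching the computation above:

```python
from collections import Counter

def bag_jaccard(bag1, bag2):
    """Jaccard similarity of two bags (multisets)."""
    inter = bag1 & bag2          # minimum count of each item
    union = bag1 | bag2          # maximum count of each item
    return sum(inter.values()) / sum(union.values())

cust1 = Counter({"a": 2, "b": 3})   # rating 2 for product A, 3 for product B
cust2 = Counter({"a": 1, "b": 4})   # rating 1 for product A, 4 for product B
print(bag_jaccard(cust1, cust2))    # 0.666..., i.e. 4/6 = 2/3
```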
Thus, the set similarity is an important consideration for finding the similarity
between two sets. In the next section, we discuss the method to convert a
document into subsets of very small strings, which can be used to compute the
similarity of two documents.

Check Your Progress 1:


Question 1: What is the major problem in finding similar items?
Question 2: Define Jaccard Similarity with the help of an example.
Question 3: What is collaborative filtering? How can it be addressed using set
similarity?

9.4 FINDING SIMILAR DOCUMENTS


In the last section, we discussed the Jaccard similarity in the context of sets.
We also explained the issues of document similarity. The document similarity
can be checked using the following process:
Step 1: Create sets of items from the document (shingles are used).
Step 2: Perform minhashing to create document signatures that can be tested for similarity. This reduces the number of shingles to be tested for checking the similarity of two documents.
Step 3: Perform Locality Sensitive Hashing, which produces only those pairs of documents that should be checked for similarity, rather than all the possible pairs.
In this section, we discuss these three important concepts of document
similarity analysis.

9.4.1 Shingles

In order to define the term shingle, let us first try to answer the question: How
to represent a document as a set of items so that you can find
lexicographically similar documents?
One way would be to identify words in the document. However, the
identification of words itself is a time-consuming problem and would be more
useful if you are trying to find the semantics of sentences. One of the simplest
and efficient ways would be to divide the document into smaller substrings of
characters, say of size 3 to 7. The advantage of this division is that for almost
common sentences many of these substrings would match despite small
changes in those sentences or changes in the ordering of sentences.

A k-shingle is defined as any substring of the document of size k. A document
has many shingles, which may occur at least once. For example, consider a
document that consists of the following string:
“bit-by-bit”
Assume the value of k=3 and treat every character, including hyphens and blank characters, as part of the substrings. The following are the possible substrings of size 3 of this document:
“bit”, “it-”, “t-b”, “-by”, “by-”, “y-b”, “-bi”, “bit”
Please note that out of these substrings “bit” is occurring twice. Therefore, the
3-shingles for this document would be:
{“bit”, “it-”, “t-b”, “-by”, “by-”, “y-b”, “-bi”}
One of the issues while making shingles is how to deal with white spaces. A
possible solution, which is used commonly, would be to replace all the
continuous white space characters with a single blank space.
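A small Python function illustrates the construction of k-shingles, collapsing runs of white space to a single blank as suggested above:

```python
import re

def k_shingles(text, k):
    """Return the set of all k-character substrings of the document."""
    text = re.sub(r"\s+", " ", text)          # collapse runs of white space
    return {text[i:i + k] for i in range(len(text) - k + 1)}

print(k_shingles("bit-by-bit", 3))
# {'bit', 'it-', 't-b', '-by', 'by-', 'y-b', '-bi'}  -- "bit" appears only once in the set
```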
So, how do you use shingles to find similar documents?
You may convert the documents into shingles and check if the two documents
share the same shingles.
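As a quick illustration, the following small Python sketch (added here for illustration; the function name is an assumption) extracts the set of k-shingles of a string and reproduces the 3-shingles of “bit-by-bit” listed above.

def k_shingles(text, k):
    # Return the set of all substrings of length k (the k-shingles) of the text
    return {text[i:i + k] for i in range(len(text) - k + 1)}

print(sorted(k_shingles("bit-by-bit", 3)))
# ['-bi', '-by', 'bit', 'by-', 'it-', 't-b', 'y-b'] -- 'bit' occurs twice in the string but only once in the set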
An interesting issue here would be to determine the size of the shingle. If the
size of the shingle is small then most of the documents would have those
shingles, even if they are not identical. For example, if you keep a shingle size
of k=1, then almost all the characters would be shingles and these shingles
would be common in most of the documents, as most of them will have
almost all the alphabets. On the other hand, very large shingles may fail to
detect documents that are in fact similar. The ideal size of a shingle is k=5 to 9 for small to
large documents.
How many different shingles are possible for a character set? Assume that a
typical language has n characters and the size of shingles is k, then the
possible number of shingles is n^k. Thus, the number of possible shingles in a
document may be very large.
Consider that you are using 9-shingles for finding similar research articles
amongst very large size articles. The possible character set would require 26
alphabets of the English language and one character for the space character.
Therefore, the maximum possible set of shingles would have (26+1)^9 = 27^9
9-shingles, with each shingle being 9 bytes long (assuming 1 alphabet = 1 byte).
This is a very large set of data.
One of the ways of reducing the data would be to use hash functions instead
of shingles. For example, assume that a hash function maps a 9-byte substring
to an integral hash bucket number. Assuming that the size of the integer is 4
bytes, then the hash function maps these 27^9 possible 9-shingles to
2^(4×8) − 1 = 2^32 − 1 possible buckets. You may now use the bucket number as the shingle itself,
thus reducing the size of the shingle from 9 bytes to 4 bytes.
For certain applications, such as finding similar news articles, shingles are
created based on the words. These shingles are found to be more effective for
finding the similarity of news articles.

9.4.2 Minhashing

As you can observe, the set of shingles of a document, in general, is large.


Even if we reduce this set by hashing the shingles to smaller bucket numbers, as
discussed in the previous section, the size is substantial. In fact, in general, the
total size of the set of shingles in bytes is larger than the size of the document.
Thus, it will not be possible to accommodate the shingles in the memory of a
computer, so as to find the similarity of the documents. Is it possible to reduce
this stored size of shingles, yet not compromise too much on the estimates of
Jaccard similarity?
The method used to do so is called minhashing. The idea here is to convert the
set of shingles to a set of signatures of documents using a large number of
permutations of rows of the matrix. In order to explain this process, let us use
a visual representation of shingles using a matrix. Please note the word visual
representation, as it is being used only for explanation. The actual
representation would be closer to the representation of a sparse matrix.
Consider a universal set that has 5 elements or shingles. The four
document sets include the following shingles:
Document set 1 = {2, 3, 5}
Document set 2 = {3, 4}
Document set 3 = {1, 4}
Document set 4 = {1, 2, 4}
These documents can be represented using the following matrix of Shingles
and the documents:
Shingle/Documents Set 1 Set 2 Set 3 Set 4
1 0 0 1 1
2 1 0 0 1
3 1 1 0 0
4 0 1 1 1
5 1 0 0 0
Figure 1: Matrix Representation of Documents and associated shingles
You can compute the Jaccard similarity of these documents or sets as follows:
JS(Set i, Set j) =
(The number of rows having 1s in the columns of BOTH the sets)
/ (The number of rows having 1s in at least one of the two columns of the sets)
Please note that, in the equation given above, the rows having 0s in both
columns are not counted, as such rows show that the shingle is present in
neither of the two documents.
Thus, for computing the similarity of set 1 and set 2 (see Figure 1), you may
observe that row 1 has values 0 in the columns of set 1 and set 2 both,
therefore, would not be counted. You may also observe that only row 3 has a
value of 1 in columns of set 1 and set 2 both. In all the other rows only one of
the two columns is 1. This indicates that only shingle 3 is present in both
documents. Thus, the similarity would be:
JS(Set1, Set2) = 1/4;  JS(Set1, Set3) = 0/5 = 0;  JS(Set1, Set4) = 1/5
JS(Set2, Set3) = 1/3;  JS(Set2, Set4) = 1/4
JS(Set3, Set4) = 2/3
You can verify the Jaccard similarity from the set definitions also. Thus, using
the matrix representation, you may be able to compute the similarity of two
documents or sets.
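A small sketch of this computation is given below, assuming the characteristic matrix of Figure 1 is stored column-wise as Python lists (an illustrative representation, not prescribed by the text).

def jaccard_from_columns(col_a, col_b):
    # Jaccard similarity of two 0/1 columns of a shingle/document matrix
    both = sum(1 for x, y in zip(col_a, col_b) if x == 1 and y == 1)
    at_least_one = sum(1 for x, y in zip(col_a, col_b) if x == 1 or y == 1)
    return both / at_least_one

# Columns of Figure 1 (rows are shingles 1 to 5)
set1 = [0, 1, 1, 0, 1]
set2 = [0, 0, 1, 1, 0]
set3 = [1, 0, 0, 1, 0]
set4 = [1, 1, 0, 1, 0]
print(jaccard_from_columns(set1, set2))   # 0.25 (= 1/4)
print(jaccard_from_columns(set3, set4))   # 0.666... (= 2/3)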
After going through the representation and its Jaccard similarity, let us next
define the term minhashing. You can create a large number of orderings
(permutations) of the rows of the shingle/document matrix given in Figure 1,
each generating a new sequence of rows or shingles (for our example). For a
given ordering, the first row at which a set's column has a 1 is called the
minhash value of that set. For example, consider the following three orderings of the matrix:

ORDER        Shingle/Documents   Set 1   Set 2   Set 3   Set 4
1st          2                   1       0       0       1
2nd          4                   0       1       1       1
3rd          3                   1       1       0       0
4th          5                   1       0       0       0
5th          1                   0       0       1       1
MH1(Set n)   -                   1st     2nd     2nd     1st

ORDER        Shingle/Documents   Set 1   Set 2   Set 3   Set 4
1st          5                   1       0       0       0
2nd          1                   0       0       1       1
3rd          3                   1       1       0       0
4th          2                   1       0       0       1
5th          4                   0       1       1       1
MH2(Set n)   -                   1st     3rd     2nd     2nd

ORDER        Shingle/Documents   Set 1   Set 2   Set 3   Set 4
1st          1                   0       0       1       1
2nd          2                   1       0       0       1
3rd          4                   0       1       1       1
4th          3                   1       1       0       0
5th          5                   1       0       0       0
MH3(Set n)   -                   2nd     3rd     1st     1st
Figure 2: Computation of Minhash Signatures
You may collect these minhash values, called signatures, into a signature
matrix. Thus, the signature matrix of the minhashing performed above is given in
Figure 3.

Set 1 Set 2 Set 3 Set 4


1st 2nd 2nd 1st
1st 3rd 2nd 2nd
2nd 3rd 1st 1st
Figure 3: Minhash Signature Matrix
Likewise, you can create a large number of minhashed values. But how can
you compute the Jaccard similarity using these minhashed values?
In order to answer this question, you may think about the probability of
finding the same minhash signatures in two columns. This would happen
when both the columns have value 1 in the row, which has been selected as
the first non-zero row for both columns. Further, the minhash signatures will
be different if you have selected a row, which has only one column having a
value 1 and the other having a value 0. Thus, going by this logic, if you select
a very large number of orderings of the row, the probability of a similar
minhash value of two columns would be the same as the Jaccard similarity.
Thus, you can reduce the problem of finding the similarity of documents to
that of finding the similarity of minhash signatures, thereby reducing the
complexity of the problem.
For example, you can compute the approximate similarity using Figure 3, as
follows:
AJS(Set1, Set2) = 0/3;  AJS(Set1, Set3) = 0/3;  AJS(Set1, Set4) = 1/3
AJS(Set2, Set3) = 1/3;  AJS(Set2, Set4) = 0/3
AJS(Set3, Set4) = 2/3

Thus, you may observe that the Jaccard Similarity can be computed
approximately using the signature matrix, which can be used to determine the
similarity between two documents.
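The fraction of minhash values on which two columns agree can be computed directly from the signature matrix of Figure 3; the short sketch below (illustrative names and encoding, not from the source) estimates the similarity of two signature columns.

def estimated_similarity(sig_a, sig_b):
    # Fraction of minhash signature components on which two columns agree
    return sum(1 for x, y in zip(sig_a, sig_b) if x == y) / len(sig_a)

# Signature matrix of Figure 3, one column per set (1 = 1st, 2 = 2nd, 3 = 3rd)
signatures = {
    "Set 1": [1, 1, 2],
    "Set 2": [2, 3, 3],
    "Set 3": [2, 2, 1],
    "Set 4": [1, 2, 1],
}
print(estimated_similarity(signatures["Set 3"], signatures["Set 4"]))   # 0.666... (= 2/3)
print(estimated_similarity(signatures["Set 1"], signatures["Set 4"]))   # 0.333... (= 1/3)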
Now, consider the case when you are computing the signatures for 1 million
rows of data. The ordering of these itself is time-consuming and representing
an ordering will require large storage space. Thus, for real data, the presented
algorithm may not be practical. Therefore, you would like to simulate the
orderings using simple hash functions rather than actual orderings. You can
decide the number of buckets, say 100, and hash the row numbers of the matrix
(see Figure 1) into these buckets. You should select several different hash
functions; each hash function maps a row number to a bucket number and thus
simulates a different ordering of the rows. Please remember that hashing may
result in collisions, but by using many hash functions, the effect of collisions
can be minimized. For each column, the smallest bucket number to which any of
its rows having a value of 1 is mapped becomes its minhash signature value for
that hash function. The following example explains the process with the help of
two hash functions. Consider the documents and shingles given in Figure 1,
and assume two hash functions given below:
h1(x) = (x + 1) mod 5
h2(x) = (2x + 3) mod 5
We assume 5 buckets, numbered 0 to 4. The following signatures can be
created by using these hashed functions:
Initial Values:
Set 1 Set 2 Set 3 Set 4
h1 ∞ ∞ ∞ ∞
h2 ∞ ∞ ∞ ∞
Figure 4: Initial hashed values
Now, apply the hash function on the first row of Figure 1. Since Set 1 and Set
2 have values 0, therefore, the value in the above table will not change.
However, Set 3 and Set 4 would be mapped as follows:
For Set 3 and Set 4:
h1(1) = (1 + 1) mod 5 = 2
h2(1) = (2 × 1 + 3) mod 5 = 0

Figure 4 will change now to:


Set 1 Set 2 Set 3 Set 4
h1 ∞ ∞ 2 2
h2 ∞ ∞ 0 0
Figure 5: Hashed signature after hashing the first row

Applying the hash function on the second row of Figure 1:


For Set 1 and Set 4:
h1(2) = (2 + 1) mod 5 = 3
h2(2) = (2 × 2 + 3) mod 5 = 2

Since Set 4 already has lower values than these, no change will take place in
Set 4. Figure 5 will be modified as:

     Set 1   Set 2   Set 3   Set 4
h1   3       ∞       2       2
h2   2       ∞       0       0
Figure 6: Hashed signature after hashing the second row

Applying the hash function on the third row of Figure 1:


For Set 1 and Set 2:
h1(3) = (3 + 1) mod 5 = 4
h2(3) = (2 × 3 + 3) mod 5 = 4
Since Set 1 already has lower values than these, no change will take place in
Set 1. Figure 6 will be modified as:

     Set 1   Set 2   Set 3   Set 4
h1   3       4       2       2
h2   2       4       0       0
Figure 7: Hashed signature after hashing the third row

Applying the hash function on the fourth row of Figure 1:


For Set 2, Set 3 and Set 4:
h1(4) = (4 + 1) mod 5 = 0
h2(4) = (2 × 4 + 3) mod 5 = 1
This will result in changes in the h1 values of Set 2, Set 3 and Set 4 and h2
value of Set 2 only, as h2 values of Set 3 and Set 4 are already lower. Figure 7
will be modified as:

     Set 1   Set 2   Set 3   Set 4
h1   3       0       0       0
h2   2       1       0       0
Figure 8: Hashed signature after hashing the fourth row

Applying the hash function on the fifth row of Figure 1:


For Set 1 only:
h1(5) = (5 + 1) mod 5 = 1
h2(5) = (2 × 5 + 3) mod 5 = 3
This will result in changes in the h1 values of Set 1 only. Figure 8 will be
modified as:

     Set 1   Set 2   Set 3   Set 4
h1   1       0       0       0
h2   2       1       0       0
Figure 9: Hashed signature after hashing the fifth row

The signature so computed will give almost the same similarity measure as
earlier. Thus, we are able to simplify the process of finding the similarity
between two documents. However, still one of the problems remains, i.e.,
there are a very large number of documents between which the similarity is to
be checked. The next section explains the process of minimizing the pairs that
should be checked for similarities.
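The row-by-row procedure of Figures 4 to 9 can be written compactly as a small Python sketch, assuming the matrix of Figure 1 and the two hash functions h1 and h2 given above; the variable names are illustrative.

import math

# Shingle/document matrix of Figure 1: keys are shingles 1..5, values are the columns Set 1..Set 4
rows = {1: [0, 0, 1, 1], 2: [1, 0, 0, 1], 3: [1, 1, 0, 0], 4: [0, 1, 1, 1], 5: [1, 0, 0, 0]}
hash_funcs = [lambda x: (x + 1) % 5, lambda x: (2 * x + 3) % 5]

# Signature matrix initialised to infinity: one row per hash function, one column per set
signature = [[math.inf] * 4 for _ in hash_funcs]

for shingle, cols in rows.items():
    hashes = [h(shingle) for h in hash_funcs]
    for c, bit in enumerate(cols):
        if bit == 1:                      # only rows in which the column has a 1 contribute
            for i, hv in enumerate(hashes):
                signature[i][c] = min(signature[i][c], hv)

print(signature)   # [[1, 0, 0, 0], [2, 1, 0, 0]] -- the same values as Figure 9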

9.4.3 Locality Sensitive Hashing

The signature matrix as shown in Figure 9 can also be very large, as there may
be many millions of documents or sets, which are to be checked for similarity.
In addition, the number of minhash functions that can be used for these
documents may be in the hundreds to thousands. Therefore, you would like to
compare only those pairs of documents that have some chance of similarity.
Other pairs of documents will be considered non-similar, though there may be
small false negatives in this group. Locality-sensitive hashing is a technique to
find possible pairs of documents that should be checked for similarity. It takes
the signature matrix, as shown in Figure 9, as input and produces the list of
possible pairs, which should be checked for similarity. The locality-sensitive
hashing uses hash functions to do so.

The basic principle of the Locality-sensitive hashing technique is as follows:


Step 1: Divide the overall set of the signature matrix into horizontal
portions of sets, let us call them horizontal bands. This will
reduce the effective size of data that is to be processed at a
time.
Step 2: Use a separate hash function to hash each column of a horizontal
band into a hash bucket. It may be noted that if two sets are
similar, then there is a very high probability that at least one of
their horizontal bands will be hashed by some hash function to
the same bucket.
Step 3: Perform the similarity checking on the pairs of sets collected in
each bucket. You may please note that the probability of
hashing dissimilar sets to the same bucket is low.

Thus, the number of pairs that are to be checked for similarity can be greatly
reduced using locality-sensitive hashing.
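A minimal sketch of this banding idea is shown below, assuming the signatures are available as integer columns; the band size and the use of Python's built-in hash function are illustrative choices, not part of the original description.

from collections import defaultdict
from itertools import combinations

def lsh_candidate_pairs(signatures, rows_per_band):
    # Return pairs of document ids whose signatures agree in at least one band
    candidates = set()
    n_rows = len(next(iter(signatures.values())))
    for start in range(0, n_rows, rows_per_band):
        buckets = defaultdict(list)
        for doc_id, sig in signatures.items():
            band = tuple(sig[start:start + rows_per_band])
            buckets[hash(band)].append(doc_id)       # hash the whole band to a bucket
        for docs in buckets.values():
            candidates.update(combinations(sorted(docs), 2))
    return candidates

signatures = {"d1": [1, 1, 2, 5, 0, 3], "d2": [1, 1, 2, 4, 0, 3], "d3": [7, 8, 9, 9, 8, 7]}
print(lsh_candidate_pairs(signatures, rows_per_band=3))   # {('d1', 'd2')} -- d3 never becomes a candidate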

You may please note that this method is an approximate method, as there
would be a certain number of false positives as well as false negatives.
However, given the size of the data, a small probability of error is
acceptable in the results.

Check Your Progress 2:


Question 1: What is a shingle? What are its uses? List all the 5-shingle of a
document, which contain the text “This is a test”
Question 2: Consider the following Matrix representation of 2 documents and their
associated shingles. What is the Jaccard Similarity of the documents?

Shingle/Documents   Set 1   Set 2
1                   0       0
2                   1       1
3                   1       1
4                   1       1
5                   1       0
Question 3: Consider any three orderings of the sets to compute the minhash
signature of the documents.
Question 4: Compute the signature matrix from minhash signatures and compute the
similarity using the signature matrix.
Question 5: What is Locality sensitive hashing?
9.5 Distance Measures


In the previous sections, we used the Jaccard similarity to find similarity
measures. In this section, we define some of the important distance measures
that are used in data science. This list is not exhaustive, you may find more
such measures from further readings.
The term distance is defined with reference to space, which is defined as a set
of points.
A distance measure, say distance(x, y), is the distance between two points in
space. Some of the basic characteristics of this distance are:
1. Distance is always positive. It can be zero, only if x and y are the same
points.
2. The distance measured between two points is symmetrical, i.e.
distance(x, y) is the same as distance(y, x).
3. The distance between two points would be the distance of the shortest
path between two points. This can be represented with the help of a
third point, say t, by the following equation:
𝑑𝑖𝑠𝑡𝑎𝑛𝑐𝑒(𝑥, 𝑦) ≤ 𝑑𝑖𝑠𝑡𝑎𝑛𝑐𝑒(𝑥, 𝑡) + 𝑑𝑖𝑠𝑡𝑎𝑛𝑐𝑒(𝑡, 𝑦)

Let us discuss some of the basic distance measures in the following sub-
sections.

9.5.1 Euclidean Distance


In Euclidean space, a point can be represented as an n-dimensional vector. For
example, a point x can be represented as an n-dimensional vector as x(x1,x2,
x3,…xn). The distance of two n-dimensional points x and y in Euclidean n-
dimensional space, where x is represented as x(x1,x2, x3,…xn) and y is
represented as y(y1,y2, y3,…yn), can be computed using the following equation:
distance(x, y) = [ Σ_{i=1}^{n} |y_i − x_i|^m ]^(1/m)

In the case of m = 1, you get the formula:


Manhattan distance(x, y) = Σ_{i=1}^{n} |y_i − x_i|

This is called the Manhattan distance.

In case of m =2, we get the formula:


distance(x, y) = √( Σ_{i=1}^{n} (y_i − x_i)^2 )

You can verify that these measures satisfy all the properties of the distance
measure.
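The three formulas above can be expressed as a single function; the following Python sketch (illustrative names) computes the Minkowski distance of order m, which gives the Manhattan distance for m = 1 and the Euclidean distance for m = 2.

def minkowski_distance(x, y, m):
    # Minkowski distance of order m between two n-dimensional points
    return sum(abs(yi - xi) ** m for xi, yi in zip(x, y)) ** (1 / m)

x, y = (1, 2, 3), (4, 6, 3)
print(minkowski_distance(x, y, 1))   # Manhattan distance: 3 + 4 + 0 = 7.0
print(minkowski_distance(x, y, 2))   # Euclidean distance: sqrt(9 + 16) = 5.0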

9.5.2 Jaccard Distance
As discussed earlier, you compute the Jaccard similarity of two sets, namely
set A and set B, using the following formula:
JS(A, B) = |A ∩ B| / |A ∪ B|
The Jaccard distance between these two sets is computed as:

JaccardDistance(A, B) = 1 − JS(A, B) = 1 − |A ∩ B| / |A ∪ B|

Does this distance measure fulfil all the criteria of a distance measure?
The value of Jaccard Similarity varies from 0 to 1, therefore, the value of
Jaccard distance will be from 1 to 0. The value 0 of Jaccard distance means
that the value of 𝐴 ∩ 𝐵 will be the same as 𝐴 ∪ 𝐵, which can occur only if set
A and set B are identical. In addition, as set intersection and union both are
symmetrical operations, therefore the Jaccard distance is also symmetrical.
You can also test the third property using certain values for three sets.

9.5.3 Cosine Distance


The Cosine distance is computed for those spaces where points are
represented as the direction of vector components. In such cases, the cosine
distance is defined as the angle between the two points. You can use the dot
product and magnitude of the vectors to compute the cosine distance between
two vectors. For example, consider the two vectors A[1,0,1] and B[0,1,0], the
cosine distance between the two vectors can be computed as:
The dot product of the two vectors A.B = 1×0+0×1+1×0=0
𝐴. 𝐵 = |𝐴| × |𝐵| 𝑐𝑜𝑠𝜃
0 = √2 × 1 𝑐𝑜𝑠𝜃
𝑐𝑜𝑠𝜃 = 0 𝑜𝑟 𝜃 = 90°

You may check if this measure also fulfils the criteria of the measures. It may
be noted that the range of cosine distance is evaluated between 0 and 180
degrees only. You may go through further readings for more details.
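The worked example above can be verified with a short sketch, assuming the vectors are plain Python lists; math.acos returns radians, which are converted to degrees.

import math

def cosine_distance_degrees(a, b):
    # Angle (in degrees) between two vectors, computed from the dot product and magnitudes
    dot = sum(ai * bi for ai, bi in zip(a, b))
    mag_a = math.sqrt(sum(ai * ai for ai in a))
    mag_b = math.sqrt(sum(bi * bi for bi in b))
    return math.degrees(math.acos(dot / (mag_a * mag_b)))

print(cosine_distance_degrees([1, 0, 1], [0, 1, 0]))   # 90.0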

9.5.4 Edit Distance


The edit distance may be used to determine the distance between two strings.
It can be defined as follows:
Consider two strings, string A consisting of n characters a1, a2,…an and a
string B of m characters b1, b2,…bm, then edit distance between the two
strings is the minimum number operations that are required to make strings
identical. These operations are either the addition of a single character or the
deletion of a single character.
For example, the edit distance between the strings: A= “abcdef” and B=
“acdfg”, would be 3, as you can obtain B from A; by deleting b, deleting e and
inserting g at the end in string A.

However, finding these operations may not be easy to code, therefore, edit
distance can be computed using the following method:
Step 1: Consider two strings - string A consisting of n characters a1, a2,…an
and a string B of m characters b1, b2,…bm. Find a sub-sequence, which
is the longest and has the same character sequence in the two strings.

Use the deletion of character operation to do so. Assume the size of
this sub-sequence is ls
Step 2: Compute the edit distance using the formula:
𝑒𝑑𝑖𝑡𝑑𝑖𝑠𝑡𝑎𝑛𝑐𝑒(𝐴, 𝐵) = 𝑛 + 𝑚 − 2 × 𝑙𝑠

For example, in the strings A= “abcdef” and B= “acdfg”, the longest common
sub-sequence is “acdf”, which is 4 characters long, therefore, the edit distance
between the two strings is:
𝑒𝑑𝑖𝑡𝑑𝑖𝑠𝑡𝑎𝑛𝑐𝑒(𝐴, 𝐵) = 6 + 5 − 2 × 4 = 3

You can verify that edit distance is a proper distance measure.
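The two-step method above (find the longest common sub-sequence, then apply n + m - 2 * ls) can be coded directly; a minimal dynamic-programming sketch with illustrative names is given below.

def lcs_length(a, b):
    # Length of the longest common sub-sequence of strings a and b (dynamic programming)
    table = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    return table[len(a)][len(b)]

def edit_distance(a, b):
    # Insert/delete edit distance: n + m - 2 * ls, where ls is the LCS length
    return len(a) + len(b) - 2 * lcs_length(a, b)

print(edit_distance("abcdef", "acdfg"))   # 3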

9.5.5 Hamming Distance


Hamming distance is one of the most used distances in digital design and error
detection in digital systems, e.g. while mapping of min-terms into Karnaugh’s
map or single error correcting code. The Hamming distance is defined as the
number of components in which two vectors differ, especially when these
vectors are of type Boolean.

For example, consider the following two Boolean vectors:

Vector A 1 0 1 1 0 0 1 1
Vector B 1 1 0 1 0 0 0 0
Difference N Y Y N N N Y Y

The Hamming distance between the vectors A and B is 4.


You may verify that Hamming distance also satisfies the properties of a
distance measure.
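A one-line sketch of the Hamming distance for two equal-length Boolean vectors, reproducing the value 4 obtained above, is:

def hamming_distance(a, b):
    # Number of positions at which two equal-length vectors differ
    return sum(1 for x, y in zip(a, b) if x != y)

vector_a = [1, 0, 1, 1, 0, 0, 1, 1]
vector_b = [1, 1, 0, 1, 0, 0, 0, 0]
print(hamming_distance(vector_a, vector_b))   # 4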

You may use any of these distance measures based on the type of data being
used or the type of problem.

9.6 INTRODUCTION TO OTHER TECHNIQUES


Big data analytics are the processes that are used to analyse Big data, which
include structured and unstructured data, to produce useful information for
organisational decision-making. In this section, we introduce some of the
techniques that are useful for extracting useful information from Big data.
You may refer to the latest research articles from various journals for more
details on such techniques.

Textual Analysis:
Textual data is produced by a large number of sources of Big data, e.g. email,
blogs, websites, etc. Some of the important types of analytics that you may
perform on the textual data are:
1. Structured Information Extraction from the unstructured textual
data: The purpose here is to generate information from the data
that can be stored for a longer duration and can be reprocessed
easily if needed. For example, the Government may find the list of
medicines that are being prescribed by doctors in various cities by
analysing the prescriptions given by the doctors to different
patients. Such analysis would essentially require recognition of
entities, such as doctor, patient, disease, medicine etc. and then
relationships among these entities, such as among doctor, disease
and medicine.
2. Meaningful summarization of single or multiple documents: The
basic objective here is to identify and report the key aspects of a
large group of documents, e.g. financial websites may be used to
generate data that can be summarised to produce information for
stock analysis of various companies. In general, the summarization
techniques either extract some important portions of the original
documents based on the frequency of words, location of words
etc.; or semantic abstraction-based summaries, which use Artificial
Intelligence to generate meaningful summaries.
3. Question-Answering: In the present time, these techniques are
being used to create automated query-answering systems. Inputs to
such systems are the questions asked in the natural languages.
These inputs are processed to determine the type of question,
keywords or semantics of the questions and possible focus of the
question. Next, based on the type of the Question-Answering
system, which can be either an Information Retrieval based or a
knowledge-based system, either the related information is retrieved
or information is generated. Finally, the best possible answer is
sent to the person who has asked the question. Some examples of
Question-Answering systems are – Siri, Alexa, Google Assistant,
Cortana, IBM Watson, etc.
4. Sentiment Analysis: Such analysis is used to determine the
opinion or perception of feedback about a product or service.
Sentiment analysis at the document level determines if the given
feedback is positive feedback or negative feedback. However,
techniques have been developed to perform sentiment analysis at
the sentence level or aspect level.

Audio-Video Analysis:

The purpose of audio or video analysis is to extract useful information from


speech data or video data. Most of the calls to the customer call centres are
recorded for performance enhancement. One of the common approaches to
dealing with such data is to transcribe the speech data using automated speech
recognition software. Such software converts the speech into text, which is
checked in a dictionary to ascertain if such a word exists or not. Even
phonetics can be used for converting speech to text.
Video analysis is performed on video or closed-circuit television (CCTV)
data. The objective of such analysis is to check for security breaches,
which may be detected by observing changes in the CCTV data, if any. In addition, it can be
used to perform indexing of videos, so as to find relevant information at a
later time.

Social Media Data Analysis:

This is one of the largest chunks of present-day data. Such data can be part of
social networks, micro-blogging, wiki-based content and many other related
web applications. The challenge here is to obtain structured information from
noisy, user-oriented, unstructured data available in a large number of diverse

and dispersed web pages to produce information like finding a connected
group of people, community detection, social influence, link predictions etc.
leading to applications like recommender systems.

Predictive Analysis:

The purpose of such analysis is to predict some of the patterns of the future
based on the present and past data. In general, predictive analysis uses some
statistical techniques like moving averages, regression and machine learning.
However, in the case of Big data, as the size of the data is very large and it has
low veracity, you may have to develop newer techniques to perform predictive
analysis.

Supervised and Unsupervised Learning:

Some of the Big data problems can also be addressed using machine learning
algorithms, which create models by learning from the enormous data. These
models are then used to make future predictions. In general, these algorithms
can be classified into two main categories – Supervised Learning algorithms
and unsupervised learning algorithms.

Supervised Learning:
Supervised learning uses already available knowledge to generate
models, one of the most common examples of supervised learning is
spam detection in which the word patterns and anomalies of already
classified emails (spam or not-spam) are used to develop a model,
which is used to detect if a newly arrived email is spam or not.
Supervised learning problems can further be categorised into
classification problems and regression problems.

Unsupervised Learning:
Unsupervised learning generates its own set of classes and models, one
of the most common examples of unsupervised learning is creating a
new categorisation of customers based on their feedback on different
types of products. This new categorisation may be used to market
different types of products to these customers. Such types of problems
are also called clustering problems.

You can obtain more details on supervised and unsupervised learning in the
AI and Machine learning course (MCS224).

Check Your Progress 3:


1. What is the purpose of a distance measure? What are the characteristics of
distance measures?
2. What is the Cosine distance measure? How is it different to Hamming’s
distance?
3. Explain the purpose of text analysis.

9.7 SUMMARY
This unit introduces you to basic techniques for finding similar items. The Unit first
explains the methods of finding similarity between two sets. In this context, the
concept of Jaccard similarity, document similarity and collaborative similarity are
discussed. This is followed by a detailed discussion on one of the important set
similarity applications – document similarity. For finding document similarity of a
very large number of documents, a document is first converted to a set of shingles,
next the minhashing function is used to compute the signatures of document sets
against these shingles. Finally, locality-sensitive hashing is used to find the sets that
must be compared to find similar documents. The locality-sensitive hashing greatly
reduces the number of documents that should be compared to find similar documents.
The Unit then describes several common distance measures that may be used in
different types of Big data analysis. Some of the distance measures defined are
Euclidean distance, Jaccard distance, Cosine distance, Edit distance and Hamming
distance. Further, the unit introduces you to some of the techniques that are used to
perform analysis of big data.

9.8 SOLUTIONS/ANSWERS

Check Your Progress 1:


1. For finding similar items, e.g. web pages, the first problem is the
representation of web pages into the sets, which can be compared. The
major problem, in finding similar items, is the size of the data. For
example, if you find duplicate web pages, you may have to compare
billions of pairs of web pages.
2. You can define Jaccard's similarity between two sets, viz. Set A and
Set B as:
JS(A, B) = |A ∩ B| / |A ∪ B|
Consider two sets A={a, b, c, d, e, f, g} and B={a, c, e, g, i}, then
Jaccard similarity between the two sets is:
JS(A, B) = |{a, c, e, g}| / |{a, b, c, d, e, f, g, i}| = 4/8 = 1/2
3. Collaborative filtering is the process of grouping objects based on
certain criteria and providing possible suggestions to those objects
based on the activities of other objects in the group. The similarity is
the basis on which these groups are constructed.
Check Your Progress 2:
1. A shingle is a small string of size 3-9 characters, which can be part of a
document. Shingles are used to convert a document into sets, which
can be checked for similarity. The document containing the text “This
is a test” has the following set of 5-shingles (_ is used to
represent a space character):
{This_, his_i, is_is, s_is_, _is_a, is_a_, s_a_t, _a_te, a_tes, _test}
2. Jaccard Similarity of the document is:
JS(Set1, Set2) = |{Row2, Row3, Row4}| / |{Row2, Row3, Row4, Row5}| = 3/4

3. The minhash signatures for the following three orderings are shown:

ORDER        Shingle/Documents   Set 1   Set 2
1st          1                   0       0
2nd          4                   1       1
3rd          3                   1       1
4th          5                   1       0
5th          2                   1       1
MH1(Set n)   -                   2nd     2nd

ORDER        Shingle/Documents   Set 1   Set 2
1st          4                   1       1
2nd          1                   0       0
3rd          2                   1       1
4th          5                   1       0
5th          3                   1       1
MH2(Set n)   -                   1st     1st

ORDER        Shingle/Documents   Set 1   Set 2
1st          5                   1       0
2nd          4                   1       1
3rd          3                   1       1
4th          2                   1       1
5th          1                   0       0
MH3(Set n)   -                   1st     2nd

4. The minhash signature matrix is:


Set 1   Set 2
2nd     2nd
1st     1st
1st     2nd
Similarity using signatures = 2/3

5. The locality sensitive hashing first divides the signature matrix into
several horizontal bands and hashes each column using a hash function
to hash buckets. The similarity is checked only for the documents that
hash into the same bucket. It may be noted that the chances of similar
document sets may get to the same hashed bucket by at least one hash
function is high, whereas the documents that are not similar have low
chances of hashing into the same bucket.

Check Your Progress 3:


1. Distance measures are used to find differences between two entities
based on some criteria. They are used in problems that measure
similarity, which is complementary to distance. A distance measure
should be positive, symmetric and should satisfy the triangle inequality
(that is, the distance between two points is less than or equal to the sum of the
distance from the first point to a third point and the distance from the
third point to the second point).
2. Cosine distance measures the angular distance from a specific
reference point. It is in general in the range from 0 to 180 degrees.
Hamming distance is normally used as the binary distance between
two Boolean vectors.
3. Some of the common uses of text analysis are the extraction of
structured information from unstructured text, meaningful
summarization of documents, question-answering systems and
sentiment analysis.

9.9 REFERENCES/FURTHER READINGS

1. Leskovec J., Rajaraman A., Ullman J. D., Mining of Massive Datasets,
3rd Edition, available at http://www.mmds.org/
2. Gandomi A., Haider M., Beyond the hype: Big data concepts,
methods, and analytics, International Journal of Information
Management, Volume 35, Issue 2, 2015, Pages 137-144, ISSN 0268-4012.
UNIT 10 MINING DATA STREAMS

10.0 Introduction
10.1 Objectives
10.2 Data Streams
10.2.1 Model for Data Stream Processing
10.3 Data Stream Management
10.3.1 Queries of Data Stream
10.3.2 Examples of Data Stream and Queries
10.3.3 Issues and Challenges of Data Stream
10.4 Data Sampling in Data Streams
10.4.1 Example of Representation Sample
10.5 Filtering of Data Streams
10.5.1 Bloom Filter
10.6 Algorithm to Count Different Elements in Stream
10.7 Summary
10.8 Answers
10.9 References/Further Readings

10.0 INTRODUCTION

In the previous Unit of this Block, you have gone through the concepts relating to
mining of Big data, where different distance measures and techniques were discussed
to uncover hidden patterns in Big data. However, there are certain applications, like
satellite communication data, sensor data etc. which produce a continuous stream of
data. This data stream can be regular or irregular, homogenous or heterogeneous, partly
stored or completely stored etc. Such data streams are processed using various
techniques.
This unit explains the characteristics and models of data stream processing. It also
identifies the challenges of stream processing and introduces you to various techniques
for processing of data streams. You may refer to further readings of this Unit for more
details on data streams and data stream processing.

10.1 OBJECTIVES
After going through this Unit, you would be able to:
• Define the characteristics of data streams
• Discuss models of data stream processing
• Explain the uses of data streams
• Illustrate the example of data stream queries
• Explain the role of Bloom filter in data stream
• Explain an algorithm related to data stream processing.

10.2 DATA STREAMS

Usually, the data resides in a database or a distributed file system from where users can
access the same data repeatedly, as it is available to the users whenever they need it.
But there are some applications where the data does not reside in a database, or, if it
does, the database is so large that the users cannot query it fast enough to answer
questions about it. One such example is the data being received from weather satellites.
Answering queries about this sort of data requires clever approximation techniques and
methods for compressing data in a way that allows us to answer the queries, we need to
answer.
Thus, mining a data stream is the process of extracting knowledge in real time from a large
amount of volatile data, which comes in an infinite stream. The data is volatile because
it is continuously changing and evolving over time. The system does not store all of the data
in a database due to the limited amount of resources. How would you analyse such a data
stream? The following section presents the models for data stream processing.

10.2.1 Model For Data Stream Processing


As stated in the previous section, an infinite amount of data arrives continuously in a
data stream. Assume D is the data stream which is the sequence of transactions and can
be defined as:
D = (T1, T2,…., Ti, Ti+1, …., Tj)
Where: T1: is the 1st transaction, T2: is the 2nd transaction, Ti: is the ith
transaction and Tj: jth transaction.
There are three different models for data stream processing, namely, Landmark, Sliding
Windows and Damped, as shown in Figure 1 and discussed as follows:

Figure 1: Models for data stream processing: (a) Landmark window, (b) Sliding window (window positions at times t1, t2 and t3), and (c) Damped window, in which the weight of older elements keeps decreasing.

(a) Landmark model:


This model finds the frequent items in the entire data stream from a specific
time (known as the landmark) till the present. In other words, the model finds
frequent items starting from Ti to the current time Tt, i.e. over the window W[i, t], where
i represents the landmark time. However, if i=1, then the model finds the
frequent items over the entire data stream. In this type of model, all time-points
after the starting time are treated equally. However, this model is not suitable for
finding items in the most recent part of the data stream. An example of the landmark model
is a stock monitoring system, which observes and reports on the global stock
market.

(b) Sliding Windows model:


This model stores recent data in a sliding window of a certain range and
discards old data items. The size of the sliding window may vary according to
the type of application. Suppose the size of the sliding window is w and the
current time is t; the model then works on the data in the sliding window W[t-w+1, t].
The window keeps moving with the current time, and the model does not store
the data that arrived before time t-w+1. In other words, only the part of
the data stream that falls in the range of the sliding window is retrieved at a
particular time point (a small sketch of this model is given after this list).
However, the data will not be processed if the arrival rate of data is higher than
the processing rate; in this case, many of the data points will be dropped.

(c) Damped model:


This model is also called the Time-Fading model, as it assigns more weight to
the recent transactions in the data stream, and this weight keeps decreasing with
age. In other words, older transactions have less weight as compared to
newer transactions in the data stream. This model is mostly used in those
applications where new data has more impact on the mining results in comparison
to old data, and the impact of old data decreases with time.
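A minimal Python sketch of the sliding-window model mentioned in item (b) above is given below; it assumes a fixed window of the w most recent elements kept in memory, and the use of collections.deque is an illustrative choice.

from collections import deque

class SlidingWindow:
    # Keeps only the w most recent stream elements; older elements are discarded
    def __init__(self, w):
        self.window = deque(maxlen=w)   # the deque drops the oldest element automatically

    def add(self, element):
        self.window.append(element)

    def current(self):
        return list(self.window)

sw = SlidingWindow(w=5)
for element in "abcdefabcd":            # a toy stream of transactions
    sw.add(element)
print(sw.current())   # ['f', 'a', 'b', 'c', 'd'] -- only the 5 most recent elements remain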
You may use any of these models for data stream processing, but how are data streams
managed? The next section discusses the data stream management system.

10.3 DATA STREAM MANAGEMENT


The fundamental difference between Data Base Management System (DBMS) and Data
Stream Management System (DSMS) is who controls how data enters into the system.
(a) DBMS:
In a DBMS, the staff associated with the management of the database usually
insert data into the system using a bulk loader or even explicit SQL INSERT
commands. The staff can decide how much data to be loaded into the system.
The staff can also decide when and how fast to load the data into the system.
For example, the records of students stored in schools or colleges or universities.

(b) DSMS:
In a DSMS, the management cannot control the rate of input. For example, the
search queries that arrive at Google search engine are generated by random
people around the globe, who search for information at their respective pace.
Google staff literally has no control over the rate of arrival of queries. They
have to design and architect their system in such a way that it can easily deal
with the varying data rate.
Data Stream Management System (DSMS) extracts knowledge from multiple data
streams by eliminating undesirable elements, as shown in Figure 2. DSMS is important
where the input data rate is controlled externally. For example, Google queries.

Figure 2: A simple outline of Data Stream Management System

The sources of data streams include Internet traffic, online transactions, satellite data,
sensors data, live event data, real-time surveillance systems, etc.
Figure 3 shows the detailed view of a data stream management system. The components
of this system are described as follows:

Figure 3: A detailed view of Data Stream Management System

• Processor: The processor is a software that executes the queries on the


data stream. There can be multiple processors working together. The
processor may store some standing queries and also allows ad-hoc
queries (refer to section 10.3.1) to be issued by the users.

• Streams Entering: There can be several streams entering in the system.


Conventionally, we will assume that the elements at the right end of the stream
have arrived more recently. The time goes backward to the left i.e. the further
left the earlier the element entered the system.

• Output: The system makes output in response to the standing queries and the
ad-hoc queries (refer to section 10.3.1).

• Archival Storage: There is a massive archival storage, and we cannot assume
that the archival storage is architected like a database system or that
appropriate indices or other tools are available to efficiently answer queries
from that data. We only know that if we had to reconstruct the history of the
streams from it, it could take a long time.

• Limited Working Storage: It might be a main memory, or a flash storage, or


even magnetic disk. But we assume that it holds important parts of the input
streams in a way that supports fast execution of query.

10.3.1 Queries Of Data Stream

Queries on data streams can be asked in two modes:

(a) Ad-hoc queries:


This is similar to the way we query a database system in which we make a query
once and expect an answer to the query based on the current state of the system.
For example, what is the maximum value seen so far in data stream, D, from
its beginning to the exact time the query is asked?
This question can be answered by keeping a single value- the maximum- and
updating it (if necessary), every time a new stream element arrives.

(b) Standing queries:


In this type of query, the users write the query once. However, as compared to
ad-hoc queries, the difference is that here the users expect the system to keep
the answer available at all times, perhaps outputting a new value each time the
answer changes.
For example, report each new maximum value ever seen in data stream, D.
This question can be answered by keeping one value- the maximum (MAX)-
and each new element is compared with the MAX. If it is larger, then we output
the value and update the MAX to be that value.
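The following tiny sketch (an added illustration, with assumed names) contrasts the two modes for the maximum-value example: the standing query reports every new maximum as soon as it appears, whereas an ad-hoc query would simply read the single stored value when asked.

def standing_max_query(stream):
    # Standing query: output a new value every time the running maximum changes
    maximum = None
    for element in stream:
        if maximum is None or element > maximum:
            maximum = element
            print("new maximum:", maximum)   # reported immediately, without being asked
    return maximum

standing_max_query([3, 1, 4, 1, 5, 9, 2, 6])   # reports 3, 4, 5 and 9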

10.3.2 Examples Of Data Stream

The following are the examples of data stream sources:

(a) Mining query streams:


The first example is the query stream and a search engine like Google. For
instance, Google Trends wants to find out which search queries are much more
frequent today than yesterday. These queries represent issues of rising public
interest. Answering such a standing query requires looking back for at most two
days in the query stream that is quite a lot, perhaps billions of queries. But it is
little compared with the stream of all Google queries ever issued.

(b) Mining click streams:


Click streams are another source of a very rapid input. A site like Yahoo has
many millions of users each day and the average user probably clicks a dozen
times or more. A question worth answering is: which URLs are getting clicked
on a lot more in the past one hour than normal? Interestingly, while some of
these events reflect breaking news stories, many also represent a broken link.
For instance, when people cannot get the page they want, they often click on it
several times before giving up. So, sites mine their click streams to detect
broken links.

(c) IP packets can be monitored at a switch:


The Internet Protocol (IP) packets can be monitored at a switch. The elements
of the stream are IP packets typically and the switch can store a lot of
information about the packets including the response speed of different network
links and the points of origin and destination of the packets. This information
could be used to advise the switch and the best routing for a packet to detect a
Denial of Service (DoS) attack.

10.3.3 Issues And Challenges Of Data Stream

The issues and challenges of data stream are discussed as follows:

• Input tuples: The input elements are the tuples of a very simple kind such as
bits or integers. There are one or more input ports at which data arrives. The
arrival rates of input tuples are very high on these input ports.

• Arrival rate: The arrival rate of data is fast enough that it is not feasible for the
system to store all the arriving data and at the same time make it instantaneously
available for any query that we might want to perform on the data.

• Critical calculations: The algorithms for data stream are general methods that
use a limited amount of storage (perhaps only main memory) and still enables
to answer important queries about the content of the stream. However, it
becomes difficult to perform critical calculations about the data stream with
such a limited amount of memory.
Check Your Progress 1:
1. Define the data stream processing. Which model of data stream processing is
useful in finding stock market trends? Justify.

2. Differentiate between DBMS and DSMS. Why all the data of data streams is
not stored?

3. How are Standing queries different to ad-hoc queries?

4. What is use of mining click streams?

10.4 DATA SAMPLING IN DATA STREAMS

If the data on the stream is arriving too rapidly, we may not want to or need to look at
every stream element. Perhaps we can work with just a small sample of the values in
the stream.
While taking samples, you need to consider the following two issues:
• First, the sample should be unbiased
• Second, the sampling process must preserve the answer to the query or queries
that you may like to ask about the data

Thus, you require a method that preserves the answer to the queries. The following
example explains the concepts stated above:
Google has a stream of search queries that are generated at all the times. You might
want to know what fraction of search queries received over a period (say last month)
were unique? This query can also be written as: What fraction of search queries have
only one occurrence in the entire month?
One of the sampling techniques that can be used for data streams to answer the queries
may be to randomly select 1/10th of the stream. For example, if you want to know what
fraction of the search queries are a single word queries; you can compute that fraction
using the 1/10th sample of the data stream. The computed fraction would be very close
to the actual fraction of single word queries over the entire data stream. Thus, over the
month, you would be testing 1/10th of the data stream as sample. If those queries are
selected at random, the deviation from the true answer will be extremely small.
However, the query – to find the fraction of unique queries, cannot be answered
correctly from a random sample of the stream.
10.4.1 The Representative Sample
In this section, we discuss about the need of a representative sample with the help of an
example. The query that needs to be answered is – “to find the fraction of unique search
queries”:
• We know that the length of a sample is 10% of the length of the whole stream
(as mentioned above in section 10.4). The problem is that the probability of a
given query appearing to be unique in the sample gets distorted because of the
sampling.

• Suppose a query is unique in the data stream. It has only a 1/10th of chance of
being selected for the sample. It implies that fraction of truly unique queries
that may get selected into the sample is the same as for the entire stream. If we
could only count the truly unique queries in the sample, we would get the right
answer, as we are trying to find proportions.

• However, suppose a search query appears exactly twice in the whole stream.
The chance that the first occurrence will be selected for the sample is 10% and
the chance that the second occurrence will not be selected is 90%. Multiply
those and we have a 9% chance of this query occurrence being unique in the
sample.

• Moreover, the first occurrence could not be selected but the second is selected,
this also has 9% chance. Thus, a query that really occurs twice but may be
selected as unique in the sample with the probability of a total of 18%.

• Similarly, a query that appears in the stream three times has a 24.3% chance
of appearing as a unique query in the sample.

• In fact, any query no matter how many times it appears in the original stream,
has at least a small chance of being selected as unique query in the sample.

So, when you count the number of unique queries in the sample, it will be an
overestimate of the true fraction of unique queries.
In the example given above, the problem was that we performed sampling on the basis
of the position in the stream, rather than the value of the stream element. In other words,
we assumed that we flipped a 10-sided coin every time a new element arrived in the
stream. The consequences are that when a query occurs at several positions in the stream,
we decide independently whether to add or not to add the query into the sample.
However, this is not the sampling which we are interested in, for answering the query
about finding the unique queries.
We want to pick 1/10thof the search queries, not 1/10th of the instances of search queries
in the stream. We can make a random decision when we see a search query for the first
time.
If we kept a table/list of our decision for each search query we have ever seen, then each
time a query appears, we can look it up in the table. If the query is found in the table/list,
then we simply repeat the earlier decision, i.e. either add it to the sample or do not add it.
But if the query is not found in the table, then we flip the ten-sided coin to decide what
to do with it and record the query and the outcome in the table.
However, it will be hard to manage and lookup such a table with each stream element.
Fortunately, there is a much simpler way to get the same effect without storing anything
in the list by using a Hash function.
• Select a hash function, which maps the search queries into 10 buckets i.e. 0 to
9.
• Apply the hash function when a search query arrives, if it is mapped to bucket
0, then add it to the sample, otherwise if it maps to any of the other 9 buckets,
then do not add it to the sample.
The advantage of this approach is that all occurrences of the same query would be
mapped to the same bucket, as the same hash function is applied. As a result, you do
not need to know whether the search query that just arrived has been seen before or not.
Therefore, the fraction of unique queries in the sample is the same as for the stream as
a whole. The result of sampling this way is that 1/10th of the queries are selected for the
sample.
If selected, then the query appears in the sample exactly as many times as it does in the
entire data stream. Thus, the fraction of unique queries in the sample should be exactly
as it is in the data stream.
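A minimal sketch of this hashing-based sampling is shown below, assuming the queries arrive as strings; hashlib.md5 is used here only as a convenient, stable hash function, and 10 buckets with one accepted bucket give a 1/10th sample.

import hashlib

def in_sample(query, num_buckets=10, accepted_buckets=frozenset({0})):
    # Keep a query only if its hash bucket is in the accepted set; all occurrences of the
    # same query land in the same bucket, so either all of them are sampled or none are.
    bucket = int(hashlib.md5(query.encode()).hexdigest(), 16) % num_buckets
    return bucket in accepted_buckets

stream = ["weather delhi", "cricket score", "weather delhi", "train timings"]
sample = [q for q in stream if in_sample(q)]
print(sample)   # roughly 1/10th of the distinct queries are kept, each with all its occurrences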
What if the total sample size is limited?
Suppose you want your sample not to be a fixed fraction of the total stream, but a fixed
number of samples from the stream.
In that case, the solution would be to perform hashing to a large number of
buckets, such that the resulting sample just stays within the size limit.
As more stream elements are added, your sample gets too large. In that case, you can
pick one of the buckets that you have included in the sample and delete all the stream
elements from the sample that hash to that bucket. Organizing the sample by bucket can
make this decision process efficient.
With this different way of sampling, the stream unique query problem will be addressed
as:
Ø You still want a 10% sample for the search queries, however, eventually even
the 10% sample will become too large. Hence, you want the ability to throw
out some fraction of sampled elements. You may have to do it repeatedly. If
any-one occurrence of the query is thrown out, then all other occurrences of the
same query are thrown out.

Ø Perform hashing to 100 buckets for our example, but for a real data stream you
may require a million buckets or even more, as long as you want 10% sample.

Ø You could choose any 10 buckets out of 100 buckets for the sample. For the
present example, let us choose bucket 0 to bucket 9.

Ø If the sample size gets too big, then you pick one of the buckets to remove the
samples, say bucket 9. It means that you delete those elements from the sample
that hash to bucket 9 while retaining those that hash to bucket 0 to bucket 8.
You just returned bucket 9 to the available space. Now, your sample is 9% of
the stream.

Ø In future, you add only those new stream elements to sample that hash to bucket
0 through bucket 8.

Ø Sooner or later, even the 9% sample will exceed our space bound. So, you
remove those elements from the sample that hash to bucket 8, then to bucket 7,
and so on.
Sampling Key-Value Pairs

The idea, which has been explained with the help of an example given above, is really
an instance of a general idea. A data stream can be any form of key-value pairs. You
can choose sample by picking a random key set of a desired size and take all key value
pairs whose key falls into the accepted set regardless of the associated value.
In our example, the search query itself was the key with no associated value. In general,
we select our sample by hashing keys only, the associated value is not part of the
argument of the hash function.
You can select an appropriate number of buckets for acceptance and add to sample each
key-value pair whose key hashes to one of the accepting buckets.
Example: Salary Ranges
Ø Assume that a data stream elements are tuples with three components - ID for
some employee, department that employee works for and the salary of that
employee.
StreamData=tuples(EmpID, Department, Salary)

Ø For each department, there is a salary range, which is the difference between
the maximum and minimum salaries and is computed from the salaries of all
the employees of that department.
Query: What is the average salary range within a given department?
Assuming that you want to use a 10% sample of those stream tuples to estimate the
average salary range. Picking 10% of the tuples at random would not work for a given
department, as you are likely to be missing one or both of the employees with the
minimum or maximum salary in that department. This will result in computation of a
lower difference between the MAX and MIN salaries in the sample for a department.
Key= Department
Value= (EmpID, Salary)
The right way to sample is to treat only the department component of tuples as the key
and the other two components: employee ID and salary as part of the value. In general,
both the key and value parts can consist of many components.
If you sample this way, you would be sampling a subset of the departments. But for
each department in the sample, you get all its employee salary data, and you can
compute the true salary range for that department.

When you compute the average of the ranges, you might be off a little because you are
sampling the ranges for some departments rather than averaging the ranges for all the
departments. But that error is just random noise introduced by the sampling process and
not a bias in one direction or another.
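A sketch of key-based sampling for the salary example follows, where only the department (the key) is hashed and the (EmpID, Salary) value is kept whole for every sampled department; the tuple layout and the hash choice are assumptions made for illustration.

import hashlib
from collections import defaultdict

def sample_department(dept, num_buckets=10):
    # A department is in the 10% sample only if its hash falls into bucket 0
    return int(hashlib.md5(dept.encode()).hexdigest(), 16) % num_buckets == 0

ranges = defaultdict(lambda: [float("inf"), float("-inf")])   # department -> [min salary, max salary]
stream = [("E1", "Sales", 30000), ("E2", "Sales", 90000), ("E3", "HR", 45000)]

for emp_id, dept, salary in stream:
    if sample_department(dept):                 # hash the key only; keep the whole value
        lo, hi = ranges[dept]
        ranges[dept] = [min(lo, salary), max(hi, salary)]

salary_ranges = {d: hi - lo for d, (lo, hi) in ranges.items()}
print(salary_ranges)   # the true MAX - MIN salary range for every department that fell into the sample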
Check Your Progress 2:
1. What are the different ways of sampling data stream?

2. Explain any one example of sampling data.

3. What is the purpose of sampling using <Key, Value> pair? Why did you choose
department as the key in the example?

10.5 FILTERING OF DATA STREAMS

In the previous section, you were answering queries using the recent window
items. What would you do if you want to extract information from the entire data
stream? You may have to use the filtering of data stream for this. In this section,
we discuss about one data steam filter called Bloom filter.
10.5.1 Bloom filter
Bloom filters enable us to select only those items in the stream that are on some list or
a set of items. In general, Bloom filter is used for cases where the number of items on
the list or set is so large, that you cannot do a comparison of each stream element with
element of the list or check the set membership.
Need of Bloom filters:

Let us explain the need of Bloom filter with the help of an example application of a
search engine.
• Web crawler performs many crawling tasks and uses different processors to
crawl pages.

• The crawler maintains a list of all URLs in the database that it has already found.
Its goal is to explore the web pages in each of these URLs to find the additional
URLs that are linked to these web pages.

• It assigns these URLs to any of a number of parallel tasks, these tasks stream
back the URLs they find in the links they discover on a page.

• However, it is not expected to have the same URL to get into the list twice.
Because in that case you will be wasting your time in crawling the page twice.
So, each time a URL comes back to the central controller, it needs to determine
whether it has seen that URL before and discard the second report so that you
could create an index, say a hash table, to make it efficient to look up the URL
and see whether it is already among those web pages that has been in the index.

• But the number of URLs are extremely large and such an index will not fit in
the main memory. It may be required to be stored in the secondary memory
(hard disk), requiring disk access every time a URL is to be checked. This
would be very time consuming process.
Therefore, you need a Bloom filter.
• When a URL arrives in a stream, pass it through a Bloom filter. This filter will
determine if the URL has already been visited or not.

• If the filter says it has not been visited earlier, the URL will be added to the list
of URLs that needs to be crawled. And eventually it will be assigned to some
crawling task.

• But the Bloom filter can have false positives, which would result in
marking some of the URLs as already visited, while in fact they were not.
• The good news is that if a Bloom filter says that the URL has never been seen,
then that is true, i.e., there are no false negatives.
Working of Bloom filter:

• Bloom filter itself is a large array of bits, perhaps several times as many bits as
there are possible elements in the stream.

• The array is manipulated through a collection of hash functions. The number of


hash functions can be one, although several hash functions are better. In some
situations, even a few dozen hash functions may be a good choice.

• Each hash function maps a stream element to one of the positions in the array.

• During initialization all the bits of the array are initialized to 0.

• Now, when a stream element, say x, arrives, we compute the value of hi(x) for
each hash function hi that are to be used for Bloom filtering.

• A hash function maps a stream element to an index value on Bloom filter array.
In case, this index value is 0, then it is changed to 1.
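A minimal Bloom-filter sketch following these steps is given below, with an 11-bit array and two simple illustrative hash functions (these are not the odd/even-bit hash functions used in the worked example that follows).

class BloomFilter:
    # A bit array of size n manipulated through a collection of hash functions
    def __init__(self, n, hash_funcs):
        self.bits = [0] * n
        self.hash_funcs = hash_funcs

    def add(self, x):
        for h in self.hash_funcs:
            self.bits[h(x)] = 1            # set the bit that each hash function points to

    def might_contain(self, x):
        # True may be a false positive; False is always correct (no false negatives)
        return all(self.bits[h(x)] == 1 for h in self.hash_funcs)

h1 = lambda x: (x + 1) % 11
h2 = lambda x: (2 * x + 3) % 11
bf = BloomFilter(11, [h1, h2])
for element in (25, 159, 585):
    bf.add(element)
print(bf.might_contain(25), bf.might_contain(118))   # True False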
The following example explains the working of Bloom filter in details.
Example of Bloom filters:

For example, the size of an array that is going to be used for Bloom filter is of 11 bits,
i.e., N=11. Also, assume that the stream that is to be filtered consists of only unsigned
integers of 12 bits, i.e., Stream elements=unsigned integers. Further, for the purpose of
this example, let us use only two hash functions h1 and h2, as given below:
Ø The first hash function h1 maps an integer x to a hash value h1(x) as follows:
o Write the binary value of integer x. Select the bits at the odd bit
positions, counting from the right most (least significant) bit as position 1.
o Extract these odd bits of x into another binary number, say xodd.
o Take modulo as: (xodd modulo 11) to map x into hash value h1(x).
Ø h2(x) is computed in exactly the same manner except that it collects the even bit
positions of the binary representation to create xeven. The modulo is also
computed using xeven modulo 11.
Next, you will initialize all the array values of the Bloom filter to zero. Assuming that
the set of valid stream elements is {25, 159, 585}, you train the Bloom filter as:
Initial Bloom filter contents=00000000000, which is represented as:
Bloom Filter Array Index 0 1 2 3 4 5 6 7 8 9 10
Bloom Filter 0 0 0 0 0 0 0 0 0 0 0

Bit Position 12 11 10 9 8 7 6 5 4 3 2 1
Binary Equ 2048 1024 512 256 128 64 32 16 8 4 2 1
x = 25 0 0 0 0 0 0 0 1 1 0 0 1
xodd 0 0 0 1 0 1
h1(x) 000101 = 5 in decimal; h1(x) = 5 mod 11 = 5
x 0 0 0 0 0 0 0 1 1 0 0 1
xeven 0 0 0 0 1 0
h2(x) 000010 = 2 in decimal; h2(x) = 2 mod 11 = 2

Bloom filter contents after inserting hash values of 25

Bloom Filter Array Index 0 1 2 3 4 5 6 7 8 9 10


Bloom Filter 0 0 1 0 0 1 0 0 0 0 0

Stream element= 159 and Bloom filter contents=00100100000


Bit Position 12 11 10 9 8 7 6 5 4 3 2 1
Binary Equ 2048 1024 512 256 128 64 32 16 8 4 2 1
x = 159 0 0 0 0 1 0 0 1 1 1 1 1
xodd 0 0 0 1 1 1
h1(x) 000111 = 7 in decimal; h1(x) = 7 mod 11 = 7
x 0 0 0 0 1 0 0 1 1 1 1 1
xeven 0 0 1 0 1 1
h2(x) 001011 = 11 in decimal; h2(x) = 11 mod 11 = 0

Bloom filter contents after inserting hash values of 159

Bloom Filter Array Index 0 1 2 3 4 5 6 7 8 9 10


Bloom Filter 1 0 1 0 0 1 0 1 0 0 0

Stream element= 585 and Bloom filter contents=10100101000


Bit Position 12 11 10 9 8 7 6 5 4 3 2 1
Binary Equ 2048 1024 512 256 128 64 32 16 8 4 2 1
x = 585 0 0 1 0 0 1 0 0 1 0 0 1
xodd 0 0 1 0 0 1
h1(x) 001001 = 9 in decimal; h1(x) = 9 mod 11 = 9
x 0 0 1 0 0 1 0 0 1 0 0 1
xeven 0 1 0 0 1 0
h2(x) 010010 = 18 in decimal; h2(x) = 18 mod 11 = 7

Bloom filter contents after inserting hash values of 585

Bloom Filter Array Index 0 1 2 3 4 5 6 7 8 9 10


Bloom Filter 1 0 1 0 0 1 0 1 0 1 0

How to test membership to the set using Bloom filters:

Assume that you are using the Bloom filter to test the membership of a 12-bit unsigned
integer in the set {25, 159, 585}. The Bloom filter for this set is 10100101010 (as shown
above). Find whether the stream element is a member of the set or not.
Lookup element y = 118
Bit Position 12 11 10 9 8 7 6 5 4 3 2 1
Binary Equ 2048 1024 512 256 128 64 32 16 8 4 2 1
y = 118 0 0 0 0 0 1 1 1 0 1 1 0
yodd 0 0 1 1 1 0
h1(y) 001110 = 14 in decimal; h1(y) = 14 mod 11 = 3
y 0 0 0 0 0 1 1 1 0 1 1 0
yeven 0 0 0 1 0 1
h2(y) 000101 = 5 in decimal; h2(y) = 5 mod 11 = 5

Bloom Filter Array Index 0 1 2 3 4 5 6 7 8 9 10
Bloom Filter 1 0 1 0 0 1 0 1 0 1 0
Checking for 118: h1(y) = 3 and h2(y) = 5, so the bits at index 3 and index 5 must both
be 1. The bit at index 5 is 1, but the bit at index 3 is 0, which is a mismatch (marked M).

Since, there is a mismatch represented by M, therefore, y is not a member of the set and
it can be filtered out.
However, there can be false positives, when you use Bloom filter. For example:
Lookup element y = 115
Bit Position 12 11 10 9 8 7 6 5 4 3 2 1
Binary Equ 2048 1024 512 256 128 64 32 16 8 4 2 1
y= 115 0 0 0 0 0 1 1 1 0 0 1 1
yodd 0 0 1 1 0 1
h1(y) 001101 = 13 in decimal; h1(y) = 13 mod 11 = 2
y 0 0 0 0 0 1 1 1 0 0 1 1
yeven 0 0 0 1 0 1
h2(y) 000101 = 5 in decimal; h2(y) = 5 mod 11 = 5

Bloom Filter Array Index 0 1 2 3 4 5 6 7 8 9 10
Bloom Filter 1 0 1 0 0 1 0 1 0 1 0
Checking for 115: h1(y) = 2 and h2(y) = 5, and the bits at both index 2 and index 5 are 1,
so there is no mismatch.

Since, there is No mismatch, therefore, y is a member of the set, and it cannot be filtered
out. However, you may notice this is a false positive.

10.6 ALGORITHM TO COUNT DIFFERENT ELEMENTS IN STREAM

In stream processing, sometimes you can accept an approximate solution instead of an exact
one. One such algorithm, which counts the different elements in a stream in a single pass,
was given by Flajolet and Martin. This algorithm is discussed below:
Steps of the algorithm:
1. Pick a hash function h that maps each of the n elements of the data stream to at
least log2(n) bits.
2. For each stream element a, let r(a) be the number of trailing 0’s in h(a).
3. Record R = the maximum r(a) seen.
4. Estimated number of different elements = 2^R.
Example:
Consider a good uniform distribution of numbers, as shown in Table 1; it has eight different
elements in the stream.
Probability that the right-most set bit is at position 0 = 1/2
At position 1 = 1/2 * 1/2 = 1/4
At position 2 = 1/2 * 1/2 * 1/2 = 1/8
...
At position n = 1/2^(n+1)

Table 1: An Example of Flajolet-Martin Algorithm

Number   Binary Representation   Position of the rightmost set bit
  0            000                         -
  1            001                         0
  2            010                         1
  3            011                         0
  4            100                         2
  5            101                         0
  6            110                         1
  7            111                         0
(The numbers 0 to 7 form a uniform distribution.)

(Assuming that the index value of least significant bit is 0)


It implies that the probability of the right-most set bit drops by a factor of 1/2 with every
position from the LSB to the MSB as shown in Figure 4.

Figure 4: Probability in Flajolet-Martin Algorithm [4]

Keep a record of these positions of the right-most set bit, say ρ, for each element in the
stream. We expect the fraction of elements with ρ = 0 to be about 0.5, with ρ = 1 to be
about 0.25, and so on. Also, consider that m is the number of distinct elements in the stream.
Ø This probability will come to (nearly) 0 when the bit position b is greater than log m
Ø This probability will be non-zero when b <= log m

Therefore, if we find the right-most unset bit position b such that the probability = 0,
we can say that the number of unique elements will approximately be 2^b. This forms the
core intuition behind the Flajolet-Martin algorithm.
A detailed discussion on this algorithm can be referred from the further reading.
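
The following is a minimal Python sketch of the Flajolet-Martin estimate described above. The
hash function h(x) = (3x + 1) mod 2^16 and the small stream below are illustrative choices
only; in practice a better hash function would be used, and several hash functions are
combined to improve the estimate.

def trailing_zeros(v):
    # r(a) in the text: the number of trailing 0 bits in the binary representation of v
    if v == 0:
        return 0  # conservative choice for the (unlikely) case h(x) == 0
    count = 0
    while v & 1 == 0:
        v >>= 1
        count += 1
    return count

def flajolet_martin(stream, a=3, b=1, L=16):
    h = lambda x: (a * x + b) % (2 ** L)   # maps each element to an L-bit value
    R = 0                                  # maximum r(a) seen so far
    for item in stream:
        R = max(R, trailing_zeros(h(item)))
    return 2 ** R                          # estimated number of distinct elements

stream = [1, 3, 2, 1, 2, 3, 4, 3, 1, 2, 3, 1]   # 4 distinct elements
print(flajolet_martin(stream))                  # prints 4 for this stream and hash

For this particular stream and hash function the estimate happens to equal the true count;
in general the estimate is only a power of two close to the true number of distinct elements.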
Check Your Progress 3:
1. Explain how Bloom filter can be used to identify emails that are not from a
selected group of addresses.

2. Explain the Flajolet-Martin algorithm.

10.7 SUMMARY

This unit introduces the concept of mining data streams. Data streams can be processed
using different models. Three such models (landmark, sliding windows, and damped
model) for data stream processing are introduced in this unit. Further, data stream
management system (DSMS) is explained. In addition, different types of queries of data
stream namely, ad-hoc and standing queries are discussed followed by examples of
these queries. The issues and challenges of data streams and data sampling in data
streams, with examples of representative samples and sampling key-value pairs, are
discussed in this unit. This unit also explains bloom filter with its need, working, and
related examples. In the end, this unit shows the algorithm to approximately count
different elements in stream with example.

10.8 ANSWERS

Check Your Progress 1:


1. The data stream is a process to extract knowledge in real time from a large
amount of volatile data, which comes in an infinite stream of data. The data is
volatile because it is continuously changing and evolving over time. The system
does not store the data in the database due to limited amount of resources. The
landmark model of data stream processing is useful in finding stock market
trends because it finds the frequently used items in entire data stream from a
specific time till present and all time-points are treated equally after the starting
time.
2. In a DBMS, the staff associated with the management of the database usually
insert data into the system. In a DSMS, the management cannot control the rate
of input. All data of data streams is not stored due to limited amount of
resources.
3. In standing queries, the users expect the system to report the answer available
at all time perhaps outputting a new value each time the answer changes. In ad-
hoc queries, the users make a query once and expect an answer to the query
based on the current state of the system, which is available only at current time.
4. The use of mining click streams is to know which URLs are getting clicked-on,
a lot more, in the past one hour than normal.
Check Your Progress 2:
1. The representative samples and sampling key-value pairs are different ways of
sampling data stream.
2. Suppose that you want a 10% sample for the search queries. You could choose
any 10 buckets out of 100 buckets for the sample. For the present example, let
us choose bucket 0 to bucket 9. If the sample size gets too big, then you pick
one of the buckets to remove the samples, say bucket 9. It means that you delete
those elements from the sample that hash to bucket 9 while retaining those that
hash to bucket 0 to bucket 8. You just returned bucket 9 to the available space.
Now, our sample is 9% of the stream. In future, you add only those new stream
elements to sample that hash to bucket 0 through bucket 8. Sooner or later, even
the 9% sample will exceed our space bound. So, you remove those elements
from the sample that hash to bucket 8, then to bucket 7, and so on.

3. The purpose of sampling using <Key, Value> pair is that you can choose a
sample by picking a random key set of a desired size and take all key value
pairs whose key falls into the accepted set regardless of the associated value.
The department is chosen as a key in the example because department is a
constant used to define the data set.

Check Your Progress 3:


1. When an email arrives in the stream, it will pass through a Bloom filter. Bloom
filter is a large array of bits which is equal to the possible elements in the stream.
The array is manipulated through a collection of hash functions. Each hash
function maps a stream element to one of the positions in the array. During
initialization all the bits of the array are initialized to 0.
First, you will initialize all the arrays to zero. You will then select each email
address from the allowable/selected group of address and apply every hash
function on these email addresses to compute hash values. A hash function
maps a stream element to an index value on Bloom filter array. In case this
index value is 0, then it is changed to 1. Now, your filter is set.
Now, when an email from a mail id, say rakesh@gmail.com, arrives, you
compute the value of hi(x) for each hash function hi that is to be used for
Bloom filtering. If all these hash values are 1 in the filter, it means this email id
is one of the selected/allowed email ids. Please note there can be false positives
though.

2. The Flajolet-Martin algorithm is used to count the different elements in a stream
in a single pass. The steps of the algorithm are:
A. Pick a hash function h that maps each of the n elements of the data stream
to at least log2(n) bits.
B. For each stream element a, let r(a) be the number of trailing 0’s in h(a).
C. Record R = the maximum r(a) seen.
D. Estimated number of different elements = 2^R.

10.9 REFERENCES/FURTHER READINGS

References
[1] Mansalis, Stratos, et al. "An evaluation of data stream clustering algorithms."
Statistical Analysis and Data Mining: The ASA Data Science Journal 11.4 (2018): 167-
187
[2] Albert C. “Introduction to Stream Mining.”
https://towardsdatascience.com/introduction-to-stream-mining-8b79dd64e460
[3] “Mining Data Streams.” http://infolab.stanford.edu/~ullman/mmds/ch4.pdf
[4] Bhyani, A., “Approximate Count-Distinct using Flajolet Martin Algorithm.”
https://arpitbhayani.me/blogs/flajolet-martin

Big Data Analysis
UNIT 11 LINK ANALYSIS

Structure Page No.

11.0 Introduction to Link Analysis


11.1 Objectives
11.2 Link Analysis
11.3 Page Ranking
11.4 Different Mechanisms of Finding PageRank
11.4.1 Different Mechanisms of Finding PageRank
11.4.2 Web Structure and Associated Issues
11.5 Use of PageRank in Search Engines
11.5.1 PageRank Computation using MapReduce
11.6 Topic Sensitive PageRank
11.7 Link Spam
11.8 Hubs and Authorities
11.9 Summary
11.10 Solutions/Answers
11.11 References/Further Readings

11.0 INTRODUCTION

In the previous units of this block, you have gone through the concept of measuring
distances and different algorithms for handling data streams. In this unit, we discuss
link analysis, which is used for computing the PageRank. The PageRank algorithm uses
graphs to represent the web and computes the rank based on the probability of moving to
different links. Since the size of the web is enormous, the computation requires operations
using very large matrices. Therefore, the PageRank algorithm uses the MapReduce
programming paradigm over a distributed file system. This unit discusses several of these
algorithms.

11.1 OBJECTIVES
After going through this Unit, you will be able to:
• Define link analysis
• Use graphs to perform link analysis
• Explain computation of PageRank
• Discuss different techniques for computation of PageRank
• Use MapReduce to compute PageRank

11.2 INTRODUCTION TO LINK ANALYSIS

Link analysis is a data analysis technique used in network theory to analyze the links
of the web graph. A graph consists of a set of nodes and a set of edges, i.e., connections
between nodes. Graphs are used everywhere, for example, in social
media networks such as Facebook, Twitter, etc.

Purpose of Link Analysis

The purpose of link analysis is to create connections in a set of data that can actually
be represented as networks or the networks of information. For example, in Internet,
computers or routers communicate with each other which can be represented as a
dynamic network of nodes which represent the computer and routers. The edges of the
network are physical links between these machines. Another example is the web which
can be represented as a graph.

Representation of the World Wide Web (WWW) as a graph


WWW can be represented as a directed graph. In this graph, nodes correspond to web
pages; therefore, every web page is a node in the graph, and the directed edges between
these nodes correspond to hyperlinks. These hyperlink relationships can
be used to create a network.

11.3 PAGE RANKING


PageRank is an algorithm that was designed for the initial implementation of the
Google search engine. The PageRank is defined as an approach for computing the
importance of the web pages in a big web graph.
PageRank – Basic Concept: A page or a node in a graph is as important as the
number of links it has. However, not all in-links are equal: links coming from
important pages are worth more. The importance of a page depends on the importance
of the other pages that point to it. Thus, the importance of a given page depends on the
importance of the pages pointing to it, and its importance then gets passed on further
through the graph.
Concept of PageRank scores of a graph
Figure 1 shows a graph with a set of directed edges, where the size of a node is
proportional to its PageRank scores. The PageRank scores are normalized so that they
sum to 100. The page rank scores are summarized as follows:
• The node B has a very high PageRank score because it has lot of other pages
pointing to it.
• The node C also has a relatively high PageRank score, even though it receives
only a single incoming link. The reason behind this is that the very important
node B is pointing to it.
• The very small purple colored link nodes have relatively small PageRank score,
as it is pointed by a web page that has a low PageRank.
• The PageRank score of node D is higher than the PageRank of node A because
D points to A.
• The node E has kind of intermediate PageRank score and E points to B and so
on.


Figure 1: PageRank Scores [1] (the size of each node is proportional to its PageRank
score; the scores are normalized so that they sum to 100)

The PageRank, as shown in Figure 1, are kind of intuitive, and they correspond to our
intuitive notion of how important a node in our graph is. However, how the PageRank
can be computed? Next sections answer this question.

11.4 DIFFERENT MECHANISMS OF FINDING PAGERANK

In the last section, we discussed the basic concepts that can be used to compute the
PageRank, without giving details of how you can actually compute it. This section
provides details of the PageRank computation process.
11.4.1 Different Mechanisms of Finding PageRank
In this section, we discuss two important techniques that are employed to compute the
PageRank, viz. Simple recursive formulation and the flow model.
The Simple Recursive Formulation technique will be used to find PageRank Scores in
which:
• Each link is considered as a vote and the importance of a given vote is
proportional to the importance of the source web page that is casting this vote
or that is creating a link to the destination.
• Suppose that there is page j with an importance rj.
• The page j has n outgoing links.
• This importance rj of page j basically gets split on all its outgoing links evenly.
• Each link gets rj divided by n votes i.e., rj/n votes.
This is how the importance gets divided from j to the pages it points to. And in a similar
way, you can define the importance of a node as the sum of the votes that it receives
on its in-links.
Figure 2 shows a simple graph that shows the votes related to a page labeled as j. You
can calculate the score of node j as a sum of votes received from ri and rk, using the
following formula:

rj = ri/2 + rk/3

This is so because page i has 2 out-links and page k has 3 out-links. The score of page
j further gets propagated outside of j along its three outgoing links, so each of
these links gets the importance of node j divided by 3.

Figure 2: A Graph to calculate score of pages [2] (pages i and k point to page j,
contributing ri/2 and rk/3 respectively; j has three outgoing links, each carrying rj/3)

This is basically the mechanism by which every node collects the importance of the pages
that point to it and then propagates it to its neighbours.
The Flow Model: This model describes how the votes flow through the network, which is
why it is called the flow formulation or the flow model of PageRank. To understand
the concept, let us use a web graph of the WWW that contains only three web pages named
a, b, and c (please refer to Figure 3).

• a has a self-link and then it points to b


• b points to a and c
• c points backwards to b only.
The vote of an important page has more weight than the normal page. So a page is
more important if it is pointed to by other important pages.
Formula to compute PageRank:
r_j = Σ_{i→j} r_i / d_i

where d_i is the out-degree of node i. You may recall that the out-degree of a node
is the number of links that go out of it. Thus, the importance score of page j is simply
the sum of the contributions of all the other pages i that point to it, where the
contribution of a page i is its importance divided by its out-degree.
This means that for every node in the network, we obtain a separate equation based on
the number of links. For example, the importance of node a in the network is simply
the importance of a divided by 2 plus the importance of b divided by 2. Because a has
2 outgoing links and then similarly node b has 2 outgoing links. Thus, the three
equations, one each for individual node, for the Figure 3, would be:

ra= ra /2 + rb/ 2
rb= ra /2 + rc
rc= rb /2

Figure 3: A very simple web graph [2] (page a links to itself and to b, each link carrying
a/2; page b links to a and c, each link carrying b/2; page c links only to b, carrying c)

The problem is that these three equations do not have a unique solution. So, there is a
need to add an additional constraint to make the solution unique. Let us add a constraint
- The total of all the PageRank scores should be one, which is given by the following
equation:
ra + rb + rc = 1
You can solve these equations and find the solution to the PageRank scores, which is:
ra= 2/5; rb = 2/5; rc = 1/5

Problem: This approach works well for small graphs, but it would not work for a graph
which has a billion web pages, as this would result in a system of billions of equations
that we would be required to solve for computing the PageRank. So, you need a different
formulation.

11.4.2 Web Structure and Associated Issues

In a web structure, the original idea was to manually categorize all the pages of the
WWW into a set of groups. However, this model is not scalable, as web has been
growing far too quickly.
Another way to organize the web and to find things on the web is to search the web.
There is a rich literature, particularly in the field of information retrieval, that covers
the problem of how do you find a document in a large set of documents. You can think
of every web page as a document, the whole web is one giant corpus of documents,
and your goal would be to find relevant document based on a given query from this
huge set.
The traditional information retrieval field was interested in finding these documents in
relatively small collection of trusted documents. For example, finding a newspaper
collection. However, now the web is very different. The web is huge and full of
untrusted documents, random things, spam, unrelated things, and so on.
Issues: The associated issues are:

• Which web pages on the web should you trust?


• Which web pages are legitimate and which are fake or irrelevant?
So, the solution is PageRank in which trustworthy webpages will be linked to each
other.
But the other problem that happens on the web is that sometimes queries can be rather
ambiguous. For example, the users may ask; What is the best answer to a query word
“newspaper”? However, if a user wants to identify all the good newspapers on the web,
then PageRank algorithm need to again look at the structure of the web graph in order
to identify the set of pages or a set of good newspapers that are linking to each other.
And again, get the result out of the structure of the web graph.
The way to address these challenges is to basically realize that the web as a graph has
very rich structure. So, there is a need to rank nodes of this big graph. Basically, we
would like to compute a score or an important notch of every node in this web graph.
The idea is that some nodes will collect lots of links. So they will have high importance
and some other nodes will have a small number of links or links from untrusted sources,
so they will have low importance.
There are several approaches to compute the importance of the nodes in a graph.
Broadly these approaches are called as link analysis because we are analyzing the links
of the web graph to compute an important score of a node in a graph.
Check Your Progress 1:
Question 1: What is the basic principle of flow model in PageRank?
Question 2: How can you represent webpages and their links?
Question 3: What are the issues associated with the web structure?

11.5 USE OF PAGERANK IN SEARCH ENGINES

So far, you know that the importance of page j in a web graph is the sum of the
importance of page i that point to it divided by the out-degree of the ith node. This can
be represented using the following equation, as given earlier:
r_j = Σ_{i→j} r_i / d_i

Where rᵢ is the score of the node i, and dᵢ is its out-degree.


This concept has also been used in Google formulation of the PageRank algorithm.
Consider a graph, as given in Figure 4, consisting of 5 nodes. The rank of each
of these nodes can be represented using the equations given below.
ra = rd/2 + re/2   (node a has in-links from nodes d and e, both of which have two out-links)
rb = ra/3 + re/2   (node b has in-links from node a, which has three out-links, and node e, which has two out-links)
rc = ra/3 + rb/2   (node c has in-links from node a, which has three out-links, and node b, which has two out-links)
rd = ra/3 + rb/2 + rc   (node d has in-links from nodes a, b and c, which have three, two and one out-links respectively)
re = rd/2   (node e has in-links from node d, which has two out-links)

Figure 4: Directed graph with 5 nodes (a, b, c, d and e, with edges a→b, a→c, a→d;
b→c, b→d; c→d; d→a, d→e; e→a, e→b)

The graph in Figure 4 can be written as the following matrix, where the ith row
represents the ith node. The matrix represents the equations given above.
M = [  0     0     0    1/2   1/2 ]
    [ 1/3    0     0     0    1/2 ]
    [ 1/3   1/2    0     0     0  ]
    [ 1/3   1/2    1     0     0  ]
    [  0     0     0    1/2    0  ]

(rows and columns correspond to nodes a, b, c, d, e in order)

In order to compute the PageRank, a recursive formulation is used, which is represented
using the following matrix equation:

r^(t+1) = M r^(t)          (1)

where r^(t) is the PageRank vector at the t-th iteration, and the matrix M is the link
matrix, as shown above.
The following example explains this concept.

Example: Consider the graph of Figure 3 and compute the PageRank using the matrix M,
assuming the starting value of PageRank as:

r^(0) = [ra, rb, rc]^T = [1/3, 1/3, 1/3]^T

The matrix M for this graph is:

M = [ 1/2   1/2    0 ]
    [ 1/2    0     1 ]
    [  0    1/2    0 ]

Application of equation (1) gives:

r^(1) = M r^(0) = [1/3, 1/2, 1/6]^T

On applying the equation again, the value would be:

r^(2) = M r^(1) = [5/12, 1/3, 1/4]^T

On repeated application, you will get:

r^(3) = [3/8, 11/24, 1/6]^T
r^(4) = [20/48, 17/48, 11/48]^T
r^(5) = [37/96, 42/96, 17/96]^T
r^(6) = [79/192, 71/192, 42/192]^T

You may observe that the PageRank of a ≈ 0.41, the PageRank of b ≈ 0.37 and the
PageRank of c ≈ 0.22, which are converging to 0.4, 0.4 and 0.2 respectively.
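
The iteration above can be reproduced in a few lines of Python. The following is a minimal
sketch of the power iteration for the three-page graph of Figure 3, using plain lists
(teleportation is introduced later in this section):

# Power iteration r(t+1) = M r(t) for the three-page graph of Figure 3
M = [[0.5, 0.5, 0.0],   # row a: in-links from a (1/2) and b (1/2)
     [0.5, 0.0, 1.0],   # row b: in-links from a (1/2) and c (1)
     [0.0, 0.5, 0.0]]   # row c: in-link from b (1/2)

r = [1/3, 1/3, 1/3]      # initial PageRank vector r(0)

for _ in range(50):
    r = [sum(M[j][i] * r[i] for i in range(3)) for j in range(3)]

print([round(x, 3) for x in r])   # approximately [0.4, 0.4, 0.2]

After about fifty iterations the vector settles at approximately [0.4, 0.4, 0.2], matching
the solution obtained earlier from the flow equations.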

Next, you will get answers to the following two questions:


Question 1: Does the equation (1) converge?

Question 2: Does the equation (1) converge to what we want?


Let us answer these questions one by one.
So, here is the first question: Does the equation (1) converge?
We multiply the matrix M with r to compute r iteratively. Starting from an initial
approximation of the PageRank, which can be initialized randomly, the algorithm
will update it until convergence.
Imagine a very simple graph as shown in Figure 5 where:
• There are two nodes and node a points to node b and node b points back
to node a.
• If we initialize vector r, or r at time 0, to have two values: value 1 on
the node a and value 0 on node b.
• The value of the matrix M for this graph would be:

  M = [ 0  1 ]
      [ 1  0 ]

• Now, when we repeatedly multiply M with r, the following is the outcome:

  r^(1) = M r^(0) = [0, 1]^T
  r^(2) = M r^(1) = [1, 0]^T

  and likewise r^(3) = [0, 1]^T, r^(4) = [1, 0]^T, and so on.

• So, in the next time step, the values will flip, and when we multiply again, they
will flip back.
So, what we see here is that the iteration will never converge, because the score of 1
keeps getting passed back and forth between a and b. It seems that the PageRank
computation will never converge. This problem is called the spider trap problem.

Figure 5: Spider Trap Problem [2] (node a points to node b, and node b points back to node a)

So, here is the second question: Does the equation (1) converge to what we
want?

Let us consider a very simple graph as shown in Figure 6 where:


• There are 2 nodes: node a and node b and one edge.
• Here, a starts with score 1 and b starts with score 0.
• The matrix M in this case would be:

  M = [ 0  0 ]
      [ 1  0 ]

  r^(1) = M r^(0) = [0, 1]^T
  r^(2) = M r^(1) = [0, 0]^T

• On the first multiplication with matrix M, the scores get flipped, i.e., [1, 0]^T
  becomes [0, 1]^T.
• But in the second multiplication, the score of 1 gets lost: node b is not able to pass
the score to anyone else, so the score leaks out of the system.
• And we converge to the vector of zeros, which is a problem.

Figure 6: Dead end Problem [2]

The spider trap and dead end problem in PageRank:

In dead end problem, the dead ends are those web pages that have no outgoing links
as shown in Figure 7. Such pages cause importance to “leak out”. The idea behind is
that, whenever a web page receives its PageRank score, then there is no way for a web
page to pass this PageRank score to anyone else because it has no out-links. Thus,
PageRank score will “leak out” from the system. In the end, the PageRank scores of
all the web pages will be zero. So this is called the Dead End problem.

Figure 7: Dead end and Spider trap Problem [2] (the figure marks a node with no out-links
as a dead end and a small group of pages with no links leaving the group as a spider trap)

In the spider trap problem, out-links from webpages can form a small group as
shown in Figure 7. Basically, the random walker will get trapped in a single part of the
web graph and then the random walker will get indefinitely stuck in that part. At the
end, those pages in that part of the graph will get very high weight, and every other
page will get very low weight. So, this is called the problem of spider traps.
Solution to spider trap problem: Random Teleports

A random walk is a random process that describes a path consisting of a succession of
random steps. Figure 8(a) shows a graph, with node c being a spider trap; a random walker
that reaches c gets stuck in an infinite loop because there is no other way out.
The way Google solved the problem of spider traps is to say that at each step, the random
walker has two choices: with some probability β, the random walker follows an outgoing
link, and with the remaining probability 1−β, it teleports to a random page.
So, the way we can think of this now is that we have a random walker that whenever a
random walker arrives to a new page, flips a coin and if the coin says yes, the random
walker will pick another link at random and walk that link and, if the coin says no then
the walker will randomly teleport basically jump to some other random page on the
web.
So this means that the random walker will be able to jump out or teleport out from a
spider trap within only a few time steps.
After a few time steps, the coin will say yes, let us teleport, and the random walker will
be able to jump out of the trap.
Figure 8(a) shows a graph, with node c being the spider trap, from which the random walker
will teleport out within a few time steps.

Figure 8(a): Random teleports approach [2] (teleport edges allow the walker to escape the
spider trap at node c)

Solution to dead end problem: Always Teleport

Figure 8(b) shows that if a node has no outgoing links, then when we reach that node, we
teleport with probability 1. This basically means that whenever you reach node c, you
always jump out of it, i.e., you teleport to a random web page.
So, in the stochastic matrix M (a stochastic matrix is a square matrix whose columns are
probability vectors), the column of node c will have the value 1/3 in all its entries.
Basically, whenever a random surfer comes to a dead end, it teleports out and with
probability 1/3 lands on any node in the graph. So this is again the way to use random
jumps or random teleports to solve the problem of dead ends.

The transition matrix before (left) and after (right) adding the teleport links out of the
dead end c is:

        a     b     c                    a     b     c
  a [  1/2   1/2    0  ]           a [  1/2   1/2   1/3 ]
  b [  1/2    0     0  ]           b [  1/2    0    1/3 ]
  c [   0    1/2    0  ]           c [   0    1/2   1/3 ]

Figure 8(b): Always teleport approach [2]

So, the PageRank equation discussed in Section 11.5 is re-arranged into a different
equation:

r_j = Σ_{i→j} β M_ij r_i + [(1−β)/N]_N

where M is a sparse matrix (with no dead ends). In every step, the random surfer can
either follow a link randomly with probability β or jump to some random page with
probability 1−β. The β is a constant and its value generally lies in the range 0.8 to 0.9.
[(1−β)/N]_N is a vector with all N entries equal to (1−β)/N.

So, in each iteration of the PageRank equation, we need to compute the product of the
matrix M with the old rank vector:

r^new = β M · r^old

and then add the constant value (1−β)/N to each entry in r^new.

Therefore, the complete algorithm of PageRank is discussed as follows:

Input: a directed graph G (which may have spider traps and dead ends) and the parameter β
Output: PageRank vector r

Set: r_j^(0) = 1/N, t = 1
Do:
    ∀j: r'_j^(t) = Σ_{i→j} β r_i^(t−1) / d_i
        r'_j^(t) = 0 if the in-degree of j is 0
    Now re-insert the leaked PageRank:
    ∀j: r_j^(t) = r'_j^(t) + (1 − S)/N, where S = Σ_j r'_j^(t)
    t = t + 1
while Σ_j | r_j^(t) − r_j^(t−1) | > ε

In the above algorithm, a directed graph G is given as an input along with its parameter
β. The graph may have spider traps and dead ends. The algorithm will give a
new Page rank vector r. If the graph does not have any dead-end then the
amount of leaked PageRank is 1-β. On the other hand, if there are dead-ends,
the amount of leaked PageRank may be larger. This initial equation assumes
that matrix M has no dead ends. It can be either preprocessed to remove all dead
ends, or it has to explicitly follow random teleport links with probability 1.0
from dead-ends. If M has dead-ends, then Σ_j r'_j^(t) < 1 and you also have to
renormalize r' so that it sums to 1. The computation of the new PageRank is done
repeatedly until the algorithm converges. The convergence can be checked by
measuring the difference between the old and new page rank value. The
algorithm has to explicitly account for dead ends by computing S.
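
A minimal Python sketch of this algorithm is given below. The adjacency list, the value of β
and the tolerance ε are illustrative; the small graph deliberately contains a dead end
(node c), so the re-insertion of the leaked PageRank matters:

def pagerank(out_links, beta=0.85, eps=1e-6):
    nodes = list(out_links)
    N = len(nodes)
    r = {j: 1.0 / N for j in nodes}              # r_j(0) = 1/N
    while True:
        r_new = {j: 0.0 for j in nodes}
        for i in nodes:                          # distribute beta * r_i / d_i along out-links
            for j in out_links[i]:
                r_new[j] += beta * r[i] / len(out_links[i])
        S = sum(r_new.values())                  # re-insert the leaked PageRank
        for j in nodes:
            r_new[j] += (1.0 - S) / N
        if sum(abs(r_new[j] - r[j]) for j in nodes) < eps:
            return r_new
        r = r_new

graph = {'a': ['a', 'b'], 'b': ['a', 'c'], 'c': []}   # c is a dead end
print(pagerank(graph))

Because the leaked mass (1 − S) is spread over all N nodes, the resulting ranks always sum
to 1, even though node c has no out-links.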
11.5.1 PageRank Computation Using MapReduce

The MapReduce approach solves the problem by performing parallel processing using
cluster and grid computing. This approach scales up to very large link-graphs. Figure 9
shows the traditional versus the MapReduce approach.

(Figure 9 contrasts the two approaches: 1. moving the data to the processing unit
(traditional approach) versus 2. moving the processing unit to the data (MapReduce
approach), with a master node coordinating several slave nodes in each case.)

Figure 9: Traditional vs. MapReduce approach [3]

There are 4 steps in the MapReduce approach, as shown in Figure 10, which are explained as
follows:
Step 1-Input Split: The input data is raw and is further divided into small chunks
called input splits. Each chunk will be an input of a single map and the data input will
make a key-value pair(key1, Value1).
Step 2- Mapping: A node which is given with a map function takes the input and
produces a set of (key2, value2) pairs, shown as (K2, V2) in Figure 10. One of the nodes
in the cluster is called as a “Master node” which is responsible for assigning the work
to the worker nodes called as “Slave nodes”. The master node will ensure that the slave
nodes perform the work allocated to them. The master node saves the information
(location and size) of all intermediate files produced by each map task.
Step 3- Shuffling: The output of the mapping function is being clustered by keys and
reallocated in a manner that all data with the same key are positioned on the same node.
The output of this step will be (K2, list(V2)).
Step 4- Reducing: The nodes will now process their respective groups of output data by
aggregating the values from the shuffle phase. The ultimate output will be of the
form (list(K3, V3)).


(Figure 10 illustrates the four phases on a small word-count style example: the input lines
"Big Data", "Structured Data" and "Semi Structured Data" are split, mapped to (word, 1)
pairs as (K2, V2), shuffled so that equal keys come together as (K2, list(V2)), and reduced
to the final counts Big 1, Data 3, Structured 2, Semi 1.)

Figure 10: Example of MapReduce approach [3]

Figure 11 shows the pseudocode of the MapReduce approach with map and reduce
functions.

Figure 11: Pseudocode of MapReduce approach [3]

The MapReduce Program for PageRank computation is discussed as follows:


import findspark
findspark.init()
findspark.find()

import pyspark
from pyspark import SparkContext, SparkConf
from pyspark.sql import SparkSession

conf = pyspark.SparkConf().setAppName('appName').setMaster('local')
sc = pyspark.SparkContext(conf=conf)
spark = SparkSession(sc)

# Adjacency list
links = sc.textFile('links.txt')
links.collect()

# Key/value pairs: (page, list of pages it links to)
links = links.map(lambda x: (x.split(' ')[0], x.split(' ')[1:]))
print(links.collect())

# Find node count
N = links.count()
print(N)

# Create and initialize the ranks
ranks = links.map(lambda node: (node[0], 1.0 / N))
print(ranks.collect())

ITERATIONS = 20
for i in range(ITERATIONS):
    # Join graph info with rank info and propagate to all neighbors
    # the rank scores (rank / number of neighbors),
    # and add up the ranks from all in-coming edges
    ranks = links.join(ranks) \
        .flatMap(lambda x: [(dest, float(x[1][1]) / len(x[1][0])) for dest in x[1][0]]) \
        .reduceByKey(lambda x, y: x + y)

print(ranks.sortByKey().collect())

The program receives a document as page input (web pages or XML data). The
page is parsed using a regular expression method to extract the page title and
its outgoing links. While computing the PageRank using the MapReduce program,
at first the web graph is split into some partitions and each partition is stored as
an adjacency file. Each Map task will process one partition and compute
the partial rank scores for some pages. The Reduce task will then merge all the
partial scores and produce the global rank values for all the web pages. Initially, the
page identifier and its outgoing links are extracted as a key-value pair. Then the node
count is found and the ranks for each page are initialized. After that, in each
iteration, the graph information is joined with the rank information and propagated
to all neighbours as rank scores (rank/number of neighbours). Further, the ranks
from all in-coming edges are added up to generate the final score.
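
The program above assumes that the file links.txt contains one line per page, with the page
identifier followed by the identifiers of the pages it links to, separated by spaces. The
following snippet writes such a file for the three-page graph of Figure 3 (the file name and
contents are illustrative only):

# A possible adjacency-list file for the graph of Figure 3: a -> a, b; b -> a, c; c -> b
sample = "a a b\nb a c\nc b\n"
with open('links.txt', 'w') as f:
    f.write(sample)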
Check Your Progress 2:
1 Explain the spider trap and dead-end problem in PageRank. What are the solutions
for the spider trap and dead-end problem?
2. Why MapReduce paradigm is suitable for computation of PageRank?
3. Given the following graph, compute the matrix of PageRank Computation.


11.6 TOPIC SENSITIVE PAGERANK

Topic sensitive PageRank is also known as Personalized PageRank.


Earlier, the initial goal was to identify the important pages on the web. Now you do
not necessarily want to find pages that have generically high PageRank score, but you
would like to say: Which are the web pages that are popular within a given topic or
within a given domain?
So, in order to identify this, your goal will be following:
• The web page should not be assessed only on the basis of the overall popularity,
but also how closely they are to a particular set of topics, e.g., “sports” or
“history”.
• This is interesting because if you think of the web search the way PageRank
was initially thought of was that somebody will come and ask a web search
query, you will go and identify all the web pages that are relevant towards that
web search query.
• But now you need to decide how to rank or present all these web pages to the
user, you would basically take pages, simply sort them by their PageRank
score and show the pages that have the highest PageRank score first to the user.
• Now of course, if you would have the personalized PageRank or topic specific
PageRank for measuring the importance of a page, you could basically show
to the user a given ranking depending on what the user wants, however, in
particular, there can be many queries that are ambiguous.
• For example, a query Trojan could have very different relevant pages
depending on what is the topic you are interested in, in a sense that Trojans
could mean a sports team. It can also mean something different, if you are
interested in history, or if it could mean something very different if you are
interested in Internet security.
• In any case, the idea would be that you would like to compute different
importance scores of different web pages based on their relation to a given
topic.
The way to achieve this is by changing the random teleportation part. In the original
PageRank formulation, the random walker can land at any page with equal probability.
In the personalized PageRank, the random walker can teleport only to a topic-specific
set of relevant pages.
So, whenever a random walker decides to jump, they do not jump to any page on the
web, but they only jump to a small subset of pages and this subset of pages is called
the teleport set. So, the idea here is, in some sense, that we are biasing the random walk
i.e., when the walker teleports, they can only teleport into a small set of pages.

• Consider S as the teleport set.
• The set S contains only pages that are relevant to a given topic.
• This allows us to measure the relevance of all the other web pages on the web
with regard to this given set S.
• So, for every set S, for every teleport set, you will now be able to compute a
different PageRank vector R that is specific to those data points.
To achieve this, you need to change the teleportation part of the PageRank formulation:
A_ij = β M_ij + (1−β)/|S|    if i ∈ S
A_ij = β M_ij                otherwise

where A is the stochastic matrix, i is the entry, S is the teleport set, and (1−β)/|S| is
the random jump probability.

• If the entry i is not in the teleport set S then basically nothing happens.
• But if our entry i is in the teleport set, then add the teleport edges.
• A is still stochastic.
The idea here is basically that you have a lot of freedom in how to choose the teleport set
S. For example, the teleport set S may contain just a single node; this is called a random
walk with restarts.
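
A minimal Python sketch of this modification is shown below. It differs from the earlier
PageRank sketch only in where the leaked (teleport) mass is re-inserted: over the teleport
set S instead of over all pages. The graph and the teleport set are illustrative only.

def topic_sensitive_pagerank(out_links, S, beta=0.85, iterations=50):
    nodes = list(out_links)
    r = {j: 1.0 / len(nodes) for j in nodes}
    for _ in range(iterations):
        r_new = {j: 0.0 for j in nodes}
        for i in nodes:                              # follow links with probability beta
            for j in out_links[i]:
                r_new[j] += beta * r[i] / len(out_links[i])
        leaked = 1.0 - sum(r_new.values())           # teleport mass goes only to the set S
        for j in S:
            r_new[j] += leaked / len(S)
        r = r_new
    return r

graph = {'a': ['b', 'c'], 'b': ['a'], 'c': ['a', 'b'], 'd': ['c']}
print(topic_sensitive_pagerank(graph, S={'a'}))      # a random walk with restarts at 'a'

With S containing a single node, as in the last line, this is exactly the random walk with
restarts mentioned above.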

11.7 LINK SPAM


Link spamming is any deliberate activity to boost the score of the web page position
in the search engine result page. So, link spam in some sense are web pages that are
the result of spamming. Search engine optimization is the process of improving the
quality and quantity of website traffic to a website or a web page from search engines.
Earlier, search engines operated the following way:
• Search engine would crawl the web to collect the web pages.
• It would then index the web pages by the words they contained and then
whenever somebody searched, it will basically look at the given web page, see
how much does it mention the words in the search query, and then it will rank
the search results based on the number of times they mentioned a given search
query term.
• So, the initial ranking was an attempt to order pages by matching a search
query by importance, which means that basically the search engines would
consider the number of times a query words appeared on the web page. In
addition, it may also use the prominence of the word position, i.e., whether
that word was in the title or in the header of a web page.
So now, this allows spammers to nicely spam and exploit this idea. Consider the
following scenario:
• Imagine that a spammer has just started a business and he has made a web site.
He wants that search engine should give preference to his web pages, as it is
in his commercial interest. The spammer wants to exploit the search engines
and make sure that people go and visit his web pages.

• So, spammer wants his web pages to appear as high on the web search ranking
as possible, regardless of whether that page is really relevant to a given topic
or not, as driving traffic to a website is good.
The spammer did this by using a technique called link spam. The following example
illustrates the link spam:
• There is a T-shirt seller, she/he creates a web page and wants that her/his web
page should appear, when any web searcher searches the word “theatre”, as a
person in theatre may be interested in her/his T-shirts.
• So this online seller can insert the word “theatre” 1000 times in the webpage.
How can this be done? In the early days of the web, you would have a web
page which, at the top, had the legitimate text, and at the bottom of the web page the
seller would insert a huge list of the word "theatre". This list would be written in the
same text colour as the background colour of the page.
• So, these words would not bother the user who comes to the web page. But the
web search engine would see these words and would think that this particular
webpage is all about theatre.
• When, of course, somebody will run a query for the word theatre, this web
page would appear to be all about theatre and this page get it ranked high. This
and similar techniques to this one are called term spam.
Spam farms were developed to concentrate PageRank on a single page. For example, a
T-shirt seller creates a fake set of web pages that all link to his own webpage, and the
anchor text on all these web pages says that the target page is about movies. This is
known as spam farming.
Google came up with a solution to combat link spam. According to Google; Rather
than believing what the page says about itself, let us believe what other people say
about the page. This means to look at the words in the anchor text and its surrounding
text. The anchor text in a web page contains words that appear underlined to the user
to represent the link. The idea is that you are able to find web pages, even for the
queries/words that the webpage itself does not even mention, but other web pages on
the web may mention that word when referring to the target page.
The next section discusses a similar technique that overcomes the problem of link spam.

11.8 HUBS AND AUTHORITIES

The basic idea in hubs and authority model is that every web page is assigned two
scores, not just one, as is the case of PageRank algorithm. One type of score is called
a hub score and the other one will be called as an authority score.
An authority value is computed as the sum of the scaled hub values that point
to a particular web page. A hub value is the sum of the scaled authority values
of the pages it points to.
The basic features of this algorithm are:
• The hubs and authority based algorithm is named HITS algorithm, which is
Hypertext Induced Topic Selection.
• It is used to measure the importance of pages or documents in a similar way to
what we did with PageRank computation.
• For example, we need to find a set of good newspaper webpages. The idea is
that we do not just want to find good newspapers, but in some sense, we want
to find specialists (persons) who link in a coordinated way to good newspapers.

So, the idea is similar to PageRank such that we will count links as votes.
• For every page, you will have to compute its hub and its authority score.
• Each page will basically have a quality score of its own as an expert. We call
this as a hub score
• Each page also has a quality score of it as a content provider, so we call it an
authority score.
• So, the idea would be that the hub score is simply the sum of all the votes of
the authorities that are pointed to. The authority score will be the sum of the
experts votes that the page is receiving.
• We will then apply the principle of repeated improvement to compute the
steady state score.
Hence, the way we will think about this is that generally pages on the web fall into two
classes, hubs and authorities as shown in Figure 12.
Class of Authorities: These are the pages which contain useful information or content.
So in our case of newspaper example, these are newspaper webpages.

Figure 12: Hubs and Authorities [4]

Class of Hubs: these are pages that link to good authorities or list good things on the
web. In the newspaper example, such a page would list my favourite newspapers and
link to them.
The HITS algorithm, thus, minimizes the problem due to link spam and users would
be able to get good results, as high-ranking pages.
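
A minimal Python sketch of this repeated improvement is shown below. The tiny graph (three
hub-like pages p1–p3 pointing to three authority-like pages a1–a3) is illustrative only, and
the scores are simply normalized by their sum in each round:

# Hub and authority scores by repeated improvement (HITS-style iteration)
links = {'p1': ['a1', 'a2'], 'p2': ['a1', 'a3'], 'p3': ['a2'],
         'a1': [], 'a2': [], 'a3': []}
nodes = list(links)

hub = {n: 1.0 for n in nodes}
auth = {n: 1.0 for n in nodes}

for _ in range(20):
    # authority score: sum of the hub scores of the pages pointing to the page
    auth = {n: sum(hub[m] for m in nodes if n in links[m]) for n in nodes}
    # hub score: sum of the authority scores of the pages the page points to
    hub = {n: sum(auth[m] for m in links[n]) for n in nodes}
    # normalize so that the scores do not grow without bound
    a_total = sum(auth.values()) or 1.0
    h_total = sum(hub.values()) or 1.0
    auth = {n: v / a_total for n, v in auth.items()}
    hub = {n: v / h_total for n, v in hub.items()}

print({n: round(v, 2) for n, v in auth.items()})   # a1 and a2 get the highest authority
print({n: round(v, 2) for n, v in hub.items()})    # p1 is the best hub

In this sketch, p1 ends up as the best hub because it points to the two best authorities,
and a1 and a2 get the highest authority scores because they are pointed to by the best hubs.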
Check Your Progress 3:
1. Explain the mechanism of link spam.
2. What is term spam?
3. What is a Hub score and Authority score in HITS algorithm?


11.9 SUMMARY

This unit introduces the concept of link analysis along with its purpose to create
connections in a set of data. Further, PageRank algorithm is discussed with the
calculation of page scores of the graph. Different mechanisms of finding PageRank
such as Simple Recursive Formulation and the flow model with the associated
problems are also presented in this Unit. The web structure with its associated issues
are also included, which shows various ways to address these challenges. The use of
PageRank in search engine page is also discussed with spider trap and dead end
problem and their solutions. Further, this Unit shows rank computation using map-
reduce approach and topic sensitive PageRank. The Unit also introduces the link spam
and how Google came up with a solution to combat link spam. Lastly, hubs and
authorities algorithm is explained, which can be used to compute PageRank.

11.10 SOLUTIONS/ANSWERS

Check Your Progress 1:


1. The flow model is the flow formulation of PageRank that shows how the
votes (or links) flow through the network. The weight of a vote is
computed by dividing the vote of a node by the number of out-links it has.
A higher vote of a node means that there is a higher probability that
a user may reach that web page.
2. Graphs are used to represent webpages. A Node represents a webpage
and directed link between the nodes represents hyperlink from a web
page to other web page.
3. The issues associated with the web structure are to find out which web
pages users can trust and which of the web pages are
illegitimate or fake.

Check Your Progress 2:


1. In dead end problem, the dead ends are those web pages that have no outgoing links.
In the spider trap problem, out-links from webpages can form a small group where the
random walker will get trapped. The solution to spider trap problem is “Random
Teleports”. The solution to dead end problem is “Always Teleport”.
2. MapReduce paradigm is suitable for computation of PageRank because it solves the
problems by performing parallel processing using cluster and grid computing. This
approach is scaled up to very large link-graphs.
3.


Matrix=

Check Your Progress 3:


1. Link spamming is any deliberate activity to boost the score of the web page
position in the search engine result page, by creating links from bogus webpages.
2. Term spam is a technique in which a web page is stuffed with repeated terms (e.g., the
word "theatre" hidden in a T-shirt seller's page) so that the page ranks high for queries
containing those terms, even though the page is not really about them.
3. The hub score is simply the sum of all the votes of the authorities that are pointed
to. The authority score will be the sum of the experts votes that the page is
receiving.

11.11 REFERENCES/FURTHER READINGS

[1] Liu D. “Big Data Technology.” https://davideliu.com/2020/02/27/analysis-of-


parallel-version-of-pagerank-algorithm/
[2] “Link Mining.” https://www.inf.unibz.it/~mkacimi/PageRank.pdf
[3] Taylan K. “MapReduce Programming Model.”
https://taylankabbani96.medium.com/mapreduce-programming-model-a7534aca599b
[4] Leskovec J. “Link analysis: Page Rank and Similar Ideas.”
http://snap.stanford.edu/class/cs246-2012/slides/10-hits.pdf

Big Data Analysis
UNIT 12 WEB AND SOCIAL NETWORK
ANALYSIS

Structure Page No.


12.0 Introduction
12.1 Objectives
12.2 Web Analytics
12.3 Advertising on the Web
12.3.1 The Issues
12.3.2 The algorithms
12.4 Recommendation Systems
12.4.1 The Long Tail
12.4.2 The Model
12.4.3 Content-Based Recommendations
12.5 Mining Social Networks
12.5.1 Social Networks as Graphs
12.5.2 Varieties of Social Networks
12.5.3 Distance Measure of Social Network Graphs
12.5.4 Clustering of Social Network Graphs
12.6 Summary
12.7 Solutions/Answers
12.8 References/Further Readings

12.0 INTRODUCTION

In the previous units of this block, you have gone through various methods of
analysing big data. This unit introduces three different types of problems that may be
addressed in data science. The first problem relates to the issue of advertising on the
web. Advertising on the web is most popular with search queries. A typical
advertisement-related problem is: which advertisements are to be shown as a result of
a search word? This unit defines this problem in detail and shows an example of an
algorithm that can address this problem. Next, this unit discusses the concept of a
recommender system. A recommender system is needed, as there is an abundance of
choices of products and services. Therefore, a customer may need certain
suggestions. A recommender system attempts to give recommendations to customers,
based on their past choices. This unit only introduces content-based
recommendations, though many other types of recommender systems techniques are
available. Finally, this unit introduces you to concepts of the process of finding social
media communities. You may please note that each of these problems is complex and
is a hard problem. Therefore, you are advised to refer to further readings and the
latest research articles to learn more about these systems.

12.1 OBJECTIVES
After going through this Unit, you will be able to:
• Define the issues relating to advertisements on the web
• Explain the process of solving the AdWords problem.
• Define the term long tail in the context of recommender systems.
• Explain the model of the recommendation system and use the utility matrix
• Implement a model for content-based recommendations
• Define the use of graphs for social network
• Define clustering in the context of social network graphs.

12.2 WEB ANALYTICS

Web analytics is the process of collecting, analysing and reporting data


collected from a website with the objective of measuring the effectiveness of
various web pages in achieving the website's purpose. A webpage consists of
a large number of sections or divisions or fragments. Each of these fragments
has a certain purpose, for example, a website may consist of an advertisement
section, multiple information sections, a navigation section etc. Web analytics
may also be used to measure the traffic on the website, especially after an
event, such as for an eCommerce website after a sale offer is announced or for
a university after the announcement of admission.
What kind of data is collected for web analytics? In general, for web analytics
the following information may be collected by different websites:
• The user location and identification of the user, such as IP address, if
permitted.
• The time spent by users on the website and the pages or sections
accessed by the user during her/his stay at the website.
• The number of users, who are accessing the websites at different
times. The number of simultaneous users.
• Information about the kind of browser and possibly the machine used
by a user to access the website.
• If the user has clicked on any external link given on the website.
• Is any advertisement on the website clicked?
In addition, based on the objectives of the web application related information
may be collected.

Web analytics is performed to address the basic question: Is the website able
to fulfil the objectives with which it was created? Some of these objectives
may be:
Is the website informative for users? This would require analysis of the
number of users visiting, the time spent by them on information pages,
whether users are coming back to the website, etc.
How can the cloud services used for hosting the website be optimised? This
question may be answered by analysing the traffic data and the number of
simultaneous users.
Thus, web analytics may be very useful for enhancing the efficiency and
effectiveness of a website. In the subsequent sections, we discuss some of the
specialized applications of website-related information.

12.3 ADVERTISING ON THE WEB

Advertising is one of the major sources of revenue for many professions like
Television, newspapers etc. The popularity of WWW has also led to the
placement of advertisements on web pages. Interestingly one of the most
popular places to put advertisements is the output of a search engine. This
section describes some of the basic ways of dealing with advertising on the
WWW and some of the algorithms that can be used to find the outcome of
using the advertisements on the WWW.

12.3.1 The Issues

In order to define the issues related to web advertisements, first let us discuss
different ways of placing advertisements on the Web.

Historically, the advertisements were placed on the websites as banners and


were charged for the number of impressions or webpage visits consisting of
that banner. This model was similar to advertisements of magazines that used
to charge based on their circulation. However, on web pages, defining the rate
of advertisements in this way is very inefficient. Most advertisers define the
performance of a webpage advertisement, as the ratio of the number of times
an advertisement is clicked from a webpage to the number of impressions of
the advertisement on that webpage. As the performance of banner-based
advertisements was very poor, the targeted impression of advertisements
based on demography started resulting in a slightly better click ratio.
However, even this kind of advertisement also had poor performance. Thus, a
newer form of advertisement is being practised on the website, which was first
designed by an old company named Overture in 2000 and later redesigned by
Google in 2002. It is a performance-based advertisement, which is called
AdWords and is defined as follows:
1. The advertisements were placed as the result of a search term
2. The advertisers are asked to bid for placing an advertisement on the
result of search terms or keywords.
3. The advertisements are displayed (which advertisement?) when that
search engine displays the result of the specific search term.
4. The advertisers are charged if a displayed advertisement is clicked on.

You may need to design an algorithm to find out, which advertisement will be
displayed as a result of a search query. This is called the AdWords problem.
The AdWords problem can be defined as follows:

Given:
• A sequence or stream of queries, which are arriving at a search engine,
regularly. The typical nature of queries is that only the present query is
known, and which query will come next cannot be predicted. Let us
say, the sequence of the keyword queries is q1, q2, q3, …
• On each type of query, several advertisers have put their bids. Let us
say that m bids are placed on each type of query, say b1, b2, …, bm.
• The probability of clicking on an advertisement shown for a query, say
p1, p2, …
• The Budget stipulated by an advertiser, which may be allocated for
every day, for n advertisers, say B1, B2, B3, …Bn.
• The maximum number of advertisements that are to be displayed for a
given search query, say t, where t < m.

The objective of the Problem:


Select the sub-set of advertisements to be displayed on the arrival of a
query, which maximises the profit of the search engine.

The Output:
• The size of the selected sub-set of advertisement should be equal to t

• The advertiser of the selected advertisement has made a bid for that type of query.
• In case the ith advertisement is selected, then the remaining budget of that advertiser should be at least the bid amount, i.e. Bi >= bi.

12.3.2 The Algorithms

The category of algorithms used to address the AdWords problem is known as online algorithms. An online algorithm works on a partially available data set and makes committed decisions at a given instant of time. You may compare such algorithms with offline algorithms, which are the algorithms that we normally write; offline algorithms process the complete data set. The AdWords problem is a typical case for an online algorithm, as on the arrival of a query a decision is to be made about showing a few advertisements. This decision cannot be changed later, as the advertisements are shown along with the display of the results.

Greedy Algorithm:
One of the simplest algorithms to solve this problem is the greedy algorithm. The important considerations here are to show the advertisements which have (1) a high bid value and (2) a high chance of getting clicked, as the payment is made only if the advertisement is clicked. This is subject to the constraint that the advertiser's budget has not been exhausted. For example, consider a query q1 that has the following bids:

Advertiser | Bid (bi) in INR | Budget of Bidder (Bi) per day | Probability of clicking that advertisement (pi) | Probable Revenue | Rank of advertisement selection, if Bi >= bi
X | 80 | 160 | 0.01 | 0.8 | 3
Y | 60 | 60 | 0.02 | 1.2 | 1
Z | 30 | 30 | 0.03 | 0.9 | 2
Figure 1: A sample advertiser's Bid
Thus, assuming that only one advertisement is to be displayed in the results, the greedy algorithm will display the advertisement of advertiser Y, as it is ranked 1. However, please notice that the advertisement of Y will keep being displayed only until it is clicked; once it is clicked, the budget of advertiser Y becomes zero for that day. Please also note that we are assuming there would be many queries of the same type as query q1.
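The greedy selection can be sketched in a few lines of R. The following is only an illustrative sketch: the data is taken from Figure 1, while the column names and the way of breaking ties are assumptions made for the example.

# Bids for query q1, as in Figure 1
ads <- data.frame(advertiser = c("X", "Y", "Z"),
                  bid = c(80, 60, 30),
                  budget = c(160, 60, 30),
                  p_click = c(0.01, 0.02, 0.03))
# Probable revenue of each advertisement
ads$prob_revenue <- ads$bid * ads$p_click
# Greedy choice: highest probable revenue among advertisers whose
# remaining budget still covers their bid
eligible <- ads[ads$budget >= ads$bid, ]
eligible[which.max(eligible$prob_revenue), "advertiser"]   # "Y", ranked 1 in Figure 1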

One of the questions here is how to compute the probability of clicking on an advertisement. This information can be estimated only after some experimentation, and for a new advertisement this value would not be known. A discussion on this is beyond the scope of this unit; you may refer to the further readings for more details on this issue.

The greedy algorithm, as discussed here, may be a useful way of increasing revenue. However, it may not be able to provide the optimal possible revenue. In fact, it has been shown that in the worst case a greedy algorithm would be able to produce only ½ of the optimal revenue. This can be explained with the help of the following worst case.

Consider there are two types of queries Q1 and Q2 having the following
bidders:

Query Type | Advertiser | Bid (bi) in INR | Budget of Bidder (Bi) per day | Probability of clicking that advertisement (pi) | Probable Revenue
Q1 | X | 80 | 160 | 0.01 | 0.8
Q2 | X | 80 | 160 | 0.01 | 0.8
Q2 | Y | 80 | 160 | 0.01 | 0.8
Figure 2: A sample bid for the worst case

Please note that query Q1 has only one bidder X and query Q2 has two
bidders X and Y. You may also note that all other parameters are the same for
the queries. Now assume the following sequence of queries occurs:
Q2, Q2, Q1, Q1
Now, assume that the advertisement of X is selected randomly for the
impression on query Q2 and it also gets clicked, then the allocation of the
advertisements would be:
X, X, -, -
This occurs because the advertising budget of X is exhausted after the second click, and only X has bid for query Q1. This is a typical problem of an online algorithm, where information about future queries is not known. You may observe that if this sequence were known beforehand, an optimal selection would have been:
Y, Y, X, X
which would maximise the revenue. In this worst case, you may observe that the greedy algorithm has earned only ½ of the revenue earned by the optimal algorithm.

Is there a better algorithm than this Greedy algorithm? Fortunately, a better


algorithm was developed for this problem. This is called a Balanced
algorithm. In this algorithm, in case of a tie, the advertisement of the
advertiser with a higher budget is displayed. For example, advertisement
selection for the query sequence given above would be:
X, Y, X, -
Please note that once X's advertisement is displayed and clicked, X has a lower remaining budget than Y, which results in the display of Y's advertisement for the second instance of Q2. You may observe that in this case the balanced algorithm earns about ¾ of the optimal revenue. You may go through the further readings for more details on these algorithms.
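A small R sketch of the balanced tie-breaking rule is given below. It is only illustrative: it assumes the data of Figure 2, that exactly one advertisement is shown per query, and that every shown advertisement is clicked; the function name select_balanced is hypothetical.

# Advertisers bidding on Q2 (Figure 2); only X bids on Q1
budget <- c(X = 160, Y = 160); bid <- 80
select_balanced <- function(bidders) {
  ok <- bidders[budget[bidders] >= bid]     # bidders whose budget still covers the bid
  if (length(ok) == 0) return(NA)
  ok[which.max(budget[ok])]                 # tie broken in favour of the larger budget
}
for (q in c("Q2", "Q2", "Q1", "Q1")) {
  b <- if (q == "Q1") "X" else c("X", "Y")
  a <- select_balanced(b)
  if (!is.na(a)) budget[a] <- budget[a] - bid   # the displayed advertisement is clicked
  cat(q, "->", ifelse(is.na(a), "-", a), "\n")
}
# Prints X, Y, X, - : revenue 240, i.e. about 3/4 of the optimal 320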

Check Your Progress 1:


Question 1: What is the AdWords problem?
Question 2: What is an online algorithm?
Question 3: Differentiate between greedy and balanced algorithms of AdWords
problem.


12.4 RECOMMENDATION SYSTEMS

In the last section, we discussed the AdWords problem. In this section, we


focus on another important domain of problems called recommendation
systems.

Let us first try to answer the question: What is the need for recommender
systems?

Consider you have to buy some products from a very large list of products, as
shown in Figure 3. For example, you have to buy a TV and you are not sure
about the size, type, brand etc and a very large choice of sizes, types and
brands exists. In such situations, you would like to consider impartial advice.
This may be the reason the recommender systems were created.

[Figure: a buyer facing a very large set of items, unsure which one to buy]

Figure 3: The need for a Recommendation system


In case you are looking for an item from a very large set of items, say
thousands, and your requirements are not very well specified then rather than
searching for that kind of item, you may like suggestions about the item from
some known person. The recommender system does this job, as it
recommends the product, which may be needed by you, based on the
information known to the system about you.

12.4.1 The Long Tail

Why did recommendation systems become popular? This can be attributed to the web-based marketing of products. In direct marketing, the shelf space of the store is very important. However, as marketing moves to the web, a very large number of similar products can be sold, leading to a much wider choice for the customer. Figure 4 illustrates an interesting phenomenon called the long tail of products. In this era of the digital market, a large number of products are on sale. Some of these products sell at a very high rate, but most of the products have a low demand. These low-demand products are also very large in number, can be of very good quality, and are termed long-tail products (see Figure 4).
In the era of the direct market, only those products that were selling the most
were finding a place on the shelf and were being stocked. However, in this
digital market even the products, which have low demand can be sold, as they
need not be stocked at all locations, but can be made available on demand.
Thus, long-tail products are now available to general users. For example, suppose you want to buy only albums of classical music; earlier, the stores which sold them were very difficult to find, whereas now you may be able to order them easily. In addition, suppose you order some such albums through the digital market; then, based on your choices, it is possible to suggest to you some other classical albums that you may find interesting. This is the advantage of the recommender system.

[Figure: sales of product (units per time period) plotted against products in the order of sales, showing a few common products with very high sales followed by a long tail of low-demand products]

Figure 4: Product sale and long tail [1]

Some of the most common applications of the recommender systems are for
recommending books, music albums, movies, published articles etc. The
usefulness of the recommender system can be gauged from the fact that there are a large number of cases that highlight the success of recommender system applications. For example, a recommender system for books found that
several purchasers of a book, say a book named A, are also purchasing not a
very popular but highly rated book B. The system started showing this book,
as a recommendation to the purchasers of Book A. It turns out that after a
while even book B became one of the best sellers. Thus, recommendation
system applications have great potential, as they can inform purchasers about
the availability of some good items, which were not known to them due to the
very large number of items.

12.4.2 The Model

There are three basic types of recommender systems:

1. Recommendations by the Editors/Critics/Staff: This is the simplest kind of recommendation system and has been in existence for a long time. In such systems, recommendations of editors or critics or staff are recorded and reported. One of the major weaknesses of such systems is that they do not take input from the actual customers.

2. Aggregate recommendations: Such information is aggregated from various user or customer activities, for example, the most popular video, the most purchased product or the highly rated services. However, these recommendations may not be suitable to your requirements, as they are generic.
3. User-specific Recommendations: These recommendations are
specifically being made for a specific user based on his/her past web
activities such as online purchases, searches, social media interactions,
viewing of video, listening to audio etc.
In this section, we will discuss the model for user-specific recommendations.
The specific problem of user-specific recommendations can be stated as:

Given:
• A set of users or customers, say S who has given some rating to some
of the products/services.
• A set of products or services, say P, being sold or provided by the web
application.
• A set of ratings for every pair of customers and products, these ratings
may use a star rating scale of 0 to 5 or 0 to 10 or just 0 (dislike),
1(like). It may be noted that the values of 0 to 5 or 0 to 10 are ordered
in terms of liking a product with 5 or 10, as the case may be, being the
highest level of liking.

These ratings may be represented by using a model or a function as:


𝑓: 𝑆 × 𝑃 → 𝑅
This mapping can be presented using a utility matrix, which is a sparse
matrix (see Figure 5).

The objective of the Problem:


To find the expected rating that may be given by a customer s to a
product p, which s/he has not rated, based on her/his other ratings.

The Output:
• Customer-specific recommendations based on the computation of
expected ratings.

P1 P2 P3 P4 P5
C1 3 1 2
C2 4 1
C3 1 5 3

Figure 5: A sample Utility Matrix (3 customers and 5 products on a 0 to 5-star rating)

The Key Problems of Making a Recommender system: In order to make a


recommender system you need to address the following three problems:
1. Gathering the ratings into a utility matrix: One of the simplest ways of gathering the ratings would be to directly ask the customer to rate a product; however, very few people spend time rating a product. Therefore, in addition to this direct rating, most e-commerce websites rate their products implicitly. But how may this implicit rating work? Consider a case where a customer returns a purchased item, which means he rates the item lowly; on the other hand, a purchased item may get a good implicit rating. An item which is purchased again by a customer should be given a high implicit rating. A utility matrix may contain both direct and implicit ratings given by a customer.
2. Computing the unknown ratings for every customer. The focus here is
to find the high unknown ratings only, as you would like to
recommend a product to a customer only if he is expected to rate it
highly. The key problem in finding the unknown ratings is that the
utility matrix is sparse. In addition, a new product has no ratings when
added to the list. There are three different types of approaches to computing the unknown ratings – the content-based approach, collaborative filtering and the latent factor-based approach. We will
only discuss the content-based approach in the next section. You may
refer to the further readings for the other methods.
3. Evaluating the performance of the recommender system: One way of
computing the performance of a recommender system is based on the
fact that a recommendation has resulted in a successful purchase or
not. You may refer to the further readings for more details on this.

12.4.3 Content-Based Recommendations

The content-based recommender system is designed to recommend


products/services to a customer based on his/her earlier highly rated
recommendations. Some of the commonest examples of a content-based
recommendation system are the recommendation of movies, research articles,
friends etc. For example, if you highly recommend a movie then a content-
based recommendation system is likely to recommend movies of similar
actors or genres etc. Similarly, if you highly rate a research article the content-
based recommendation system will recommend research articles of similar
content or nature. How is such a recommendation made? Figure 6 defines this
process:

[Figure: the customer likes certain products/services; a profile of the liked products/services is made; from this a customer profile is made; products/services are then recommended on the basis of the customer profile]
Figure 6: Content-based Recommender System

The profile of products or services is defined with the help of a set of features.
For example, the features of a movie can be the actors, genre, director, etc.,

whereas a research article's features may be authors, title, metadata of research
articles, etc. The accumulation of these features makes the customer profile.
But how do we represent the product profile? One way to represent the
product profile would be to use a set of vectors. For example, to represent a
movie, we may use a feature vector consisting of the names of the actors,
names of the director and types of genres. You may notice that this vector is
going to be sparse. For research articles, you may use the term frequency and
inverse of document frequency, which were defined in an earlier unit. The
following examples show how the utility matrix and feature matrix can be
used to build a customer profile.

Consider you are making a recommendation system of movies and the only
feature you are considering is the genre of the movies. Also, assume that
movies have just two genres – X and Y. A person rates 5 such movies and the
only ratings here are likes or no ratings, then the following may be a portion
of the utility matrix for the customer, let us say Customer Z. Please note that
in Figure 8, 0 means no ratings and 1 means like.

M1 M2 M3 M4 M5 M6 M7
Cz 1 1 1 1 1 0 0
Figure 8: A Sample Utility matrix
Further, assuming that among the movies liked by customer Z, movies M1, M2 and M3 are of genre X and M4 and M5 are of genre Y, the product profile for the movies is given in Figure 9. Please note that in Figure 9, 1
means the movie is of that genre, whereas 0 means that the movie is not of
that genre.

M1 M2 M3 M4 M5
X 1 1 1 0 0
Y 0 0 0 1 1
Figure 9: A sample product feature matrix
This product matrix now can be used to produce the customer Z profile as:
Feature Genre X profile = sum of the row of X/Movies rated
= 3/5
Feature Genre Y profile = 2/5
You may please note that the method followed here is just finding the average
of all the genres. Though for an actual study, you may use a different
aggregation method.

However, in general, customers rate the movies on a 5-point scale of, say, 1 to 5, with 1 and 2 being negative ratings, 3 being a neutral rating and 4 and 5 being positive ratings. In such a case, the utility matrix may be as shown in Figure 10.

M1 M2 M3 M4 M5 M6 M7
Cz 1 1 2 4 2 NR NR
Figure 10: Utility matrix on a 5-point rating scale (1-5)
NR in Figure 10 means not rated. Now, considering the same product feature matrix as in Figure 9, you may like to compute the profile of customer Z. However, let us use a slightly different method here.

In general, on a 5-point scale, each customer has their own way of assessing these ratings. Therefore, it may be a good idea to normalise the ratings of each customer. For that, first find the average rating of the customer. This customer has rated 5 movies with an average rating of 10/5 = 2. Next, treat this average rating as the neutral rating; subtracting it from each of the other ratings makes the ratings as follows:

M1 M2 M3 M4 M5
Normalize ratings of Cz -1 -1 0 2 0
Figure 11: Normalized Utility matrix
In this case, you may use the following method to create the customer Z
profile:
Feature Genre X profile
= sum of normalised rating of genre X/Movies rated of genre X
= Normalised rating of (M1+M2+M3)/3
= (-1-1+0)/3 = -2/3
Feature Genre Y profile
= sum of normalised rating of genre Y/Movies rated of genre Y
= Normalised rating of (M4+M5)/2
= (0+2)/2=1
Thus, customer Z has a positive profile for genre Y.

There can be other methods of normalization and aggregation. You may refer
to the latest research on this topic for better algorithms.
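The computation described above can be reproduced in R with a small sketch. The ratings and the genre matrix are those of Figures 9 and 10; the variable names are chosen only for illustration.

ratings <- c(M1 = 1, M2 = 1, M3 = 2, M4 = 4, M5 = 2)   # rated movies of customer Z (Figure 10)
genre <- rbind(X = c(1, 1, 1, 0, 0),                    # Figure 9: 1 means the movie is of that genre
               Y = c(0, 0, 0, 1, 1))
colnames(genre) <- names(ratings)
norm_ratings <- ratings - mean(ratings)                 # subtract the average rating (2)
# profile of each genre = average normalised rating of the movies of that genre
profile <- apply(genre, 1, function(g) sum(norm_ratings[g == 1]) / sum(g))
profile    # X = -2/3, Y = 1, as computed above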

You have the product/service profiles, as well as customer profiles available


to you. Now, the next question is: how will you recommend a product or a
service to a customer?
Please notice that both the product/service profiles and the customer profiles are high-dimensional vectors; therefore, you may compare two vectors using the cosine measure. The vector dot product allows you to find the angle, say x, between two vectors, and the cosine similarity is defined as cos(x) = (a.b)/(|a||b|). The smaller the angle, i.e. the closer the cosine is to 1, the more similar the two vectors are. You should recommend to the customer those products or services whose profiles are highly similar to the customer profile. More details on this can be studied from further readings.
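A minimal R sketch of the cosine similarity computation is given below; the function name cosine_sim and the example movie profile are illustrative assumptions, while the customer profile is the one computed above.

# Cosine similarity between two numeric vectors a and b
cosine_sim <- function(a, b) sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))
customer_profile <- c(X = -2/3, Y = 1)       # profile of customer Z computed above
movie_profile   <- c(X = 0,    Y = 1)        # a hypothetical new movie of genre Y only
cosine_sim(customer_profile, movie_profile)  # about 0.83, close to 1, so the movie may be recommended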

Check Your Progress 2:


Question 1: What is a recommender system?

Question 2: Define the term long tail.

Question 3: Define the process of content-based recommendations.

12.5 MINING SOCIAL NETWORKS


In the previous section, we discussed the recommender system. Social media
has an immense amount of data about people's interactions among themselves.
This data can be processed to produce useful information about certain
products, services and communities. Social media data is obtained from the
social media networks such as Facebook, Instagram, Twitter, LinkedIn, and
many more. In general, on social media people form connections with one
another, for example, friends on Facebook. One of the key questions that may be asked relating to social media is: how to find communities, i.e. sub-sets of related people, from amongst a very large number of people. For example, on social media you may be part of a group of your school classmates as one community, a job-related group as another, and so on. One of the characteristics of a community is that the people of one community may know each other and possibly share the same interests, whereas they may not know the people of other communities. In this section, we discuss the mining of social media networks in more detail.

12.5.1 Social Networks as Graphs

A social media network consists of a large number of people, their connections and their interactions, along with other details. How can you represent the people and their connections?
Social media networks can best be represented with the help of a graph where
a node represents a person and links represent the relationship of the person
with others. In general, these links can be undirected, however, in certain
cases, the links can be one-directional or directed. Figure 12 represents a
typical social media network graph.

[Figure: a social media network graph showing overlapping communities such as College friends, Family and Working with]
Figure 12: A Social Media Network Graph

Please observe that in Figure 12, you may be connected to several communities and there may be overlap between certain communities. This is the typical nature of social media network graphs.

12.5.2 Varieties of Social Networks

There are many different kinds of social networks. Most of these networks are
used for sharing personal or official information. Some of the basic categories
of these networks are:

1. Traditional social networks: Mainly used for sharing information
among communities of friends or people who come together for a
specific purpose. These networks are primarily analysed for getting
information about groups. An interesting type of social network in this
category is the collaborative research network. Such a network
includes links among authors who have co-authored a research paper.
In addition, one of the communities of this network is the editors of the
research work. These networks can be used to find researchers who
share common research areas.
2. Social review or discussion or blogging networks: Such networks create a node for each user and make a link if two users review/discuss/blog the same topic/article. These networks can be used to identify communities which have similar thinking and inclinations.
3. Image or video sharing networks: These networks may allow people to
follow a person hosting a video. A follower may be linked to a host in
such networks. The meta tags of videos and images and comments of
the people watching the videos may be used to create communities that
share common interests.

In addition, there can be a large number of information networks that can be


represented as graphs. Next, we discuss some of the issues relating to finding
communities on a social network graph.

12.5.3 Distance Measure of Social Network Graphs

As discussed in the previous section, social media networks can be


represented using graphs. For finding the communities in this social media
graph, one may use clustering algorithms. One of the clustering requirements
is finding the distance between two nodes. Thus, we must define a method to
find the distance between two nodes in a social media network graph. You
may recollect that in a social network graph, a node represents a person or
entity and a link represents a connection between two entities. However, the
problem here is that as the link represents the friend or connection, it does not
have any weight assigned to it. In such a situation, the following distance
measure may be considered for social network graphs:

The distance between two nodes A and B, dis(A, B), is:

dis(A, B) = 1, if there is a link from A to B
dis(A, B) = ∞, if there is no direct link from A to B

The distance measure, as suggested above has one basic issue, which is as
follows:

Consider three nodes A, B and C, where only two of the node pairs, (A, B) and (B, C), are connected; then:
dis(A, B) = 1; dis(B, C) = 1; dis(A, C) = ∞
However, this violates the triangle inequality required of a true distance measure:
dis(A, B) + dis(B, C) >= dis(A, C)

Thus, the definition of distance, when there is no direct link may need to be
redefined, if any traditional clustering methods are to be considered.

12.5.4 Clustering of Social Network Graphs

The social media network consists of a very large number of nodes and links.
As stated earlier, one of the interesting problems here is to identify a set of
communities in this graph. What are the characteristics of a community or
cluster in a graph?
A community or cluster in a graph is a subset of the graph having a large number of links within the subset but fewer links to other clusters. You may observe that such clusters exist in Figure 12; although the clusters of “college friends” and “working with” have a few common nodes, in general the previous statement holds. One additional feature of the social media network graph is that a cluster can be further broken into sub-clusters. For example, in Figure 12, the “college friends” cluster may consist of two sub-clusters: undergraduate college friends and postgraduate college friends.
You may please note that such clustering is very similar to the hierarchical or k-means clustering used in machine learning, with the difference that here you want to find clusters in graphs and not in large datasets of points.

Given:
A social network undirected graph, say G (V, E), where V is the set of
nodes, which represent an entity such as a person, and E is the set of
edges, which represent the connections, such as friends. The links are not
assigned any weight (see Figure 13)

The objective of the Problem:


To find good clusters, which maximise the links within a cluster and
minimise the links between clusters.

The Process:
Find a minimal set of edges which, when removed, splits the graph into clusters. For example, removing the edges {C, E} and {D, F} in Figure 13 divides the graph into two clusters, {A, B, C, D} and {E, F, G, H}.

[Figure: an undirected graph on the nodes A, B, C, D, E, F, G and H in which removing the edges {C, E} and {D, F} separates the clusters {A, B, C, D} and {E, F, G, H}]

Figure 13: Social Network Graph G (V, E)

Several algorithms have been developed for efficient clustering and the
creation of communities in social network graphs. You can refer to the further
readings for a detailed discussion on this topic.
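As an illustration, the igraph package of R provides ready-made community detection functions. The sketch below applies edge-betweenness clustering (which, like the process described above, repeatedly removes edges) to a graph of the nodes of Figure 13; the exact edge list is not given in the figure, so the one used here is only an assumption for demonstration.

install.packages("igraph")     # if not already installed
library(igraph)
# Edge list assumed for illustration; the bridging edges are C-E and D-F
g <- graph_from_literal(A-B, A-C, B-D, C-D, C-E, D-F, E-F, E-G, F-H, G-H)
communities <- cluster_edge_betweenness(g)
membership(communities)        # expected to separate {A, B, C, D} from {E, F, G, H}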

Check Your Progress 3:


1. Why are graphs used to represent social media networks?
2. What are the different types of social media networks?

3. Define the objective of clustering in social network graphs.

12.6 SUMMARY
This unit introduces you to some of the basic problems that can be addressed by data
science. The Unit first introduces you to the concept of web analytics, which is based
on a collection of information about a website to determine if the objectives of the
website are met. One of the interesting aspects of doing business through the web is
web advertising. This unit explains one of the most important problems of
advertisement through web AdWords problem. It also proposes a greedy solution to
the problem. The unit also proposes a better algorithm, named a balanced algorithm,
than the greedy solution to the AdWords problem. Next, the Unit discussed the
recommender system, which results from long-tail products. The recommender
system aims at providing suggestions to a person based on his/her ranking of past
purchases. These recommendations are, in general, such that recommended products
may be liked by the person to whom the recommendations are made. In the context
of recommender systems, this unit discusses the concept of the long-tail, utility
matrix. In addition, the unit discusses two algorithms that may be used for making
content-based recommendations. Finally, the unit discusses the social media network,
which is represented as graphs. The unit also introduces the process of clustering for
the social media network. You may please go through further readings and research
papers for more detail on these topics.

12.7 SOLUTIONS/ANSWERS

Check Your Progress 1:


1. The AdWords problem is basically to select a set of advertisements to be displayed along with the result of a search query. The decision to choose the advertisements is taken based on data such as the bids and budgets of the advertisers and the probability of a user clicking on a given advertisement, so as to maximise the profit of the search engine.
2. An online algorithm is one, which makes a committed decision based on
currently available data. It may be noted that such an algorithm does not have
access to the complete set of data, as is the case of an offline algorithm.
Online algorithms are useful to address the AdWords problem, as the
complete sequence of the query stream is not known at the time of making
decisions about the selection of an advertisement to display.
3. The decision to select an advertisement to display in the Greedy algorithm is
based on the probable revenue of the advertisement. In case, two
advertisements have the same probable revenue, then any one of them is
selected at random. Whereas in the case of the balanced algorithm, the
selection is based on probable revenue and remaining budget.

Check Your Progress 2:


1. In the present scenario of online marketing, a large number of choices
of products are available. A recommender system is a system, which
based on previous ratings or purchases made by a customer, suggests
newer products or items that are expected to be liked by this customer.
2. The term long tail is coined for online marketing, where a large
number of products, which may be of good quality, have not been
purchased as frequently as popular products. A company may have a
very large list of such products, therefore, despite fewer sales, overall
long-tail products can generate substantial revenue.

3. The content-based recommendations collect the customer ratings of
various products and services. It also makes the profile of products
based on various attributes. The customer ratings and product profiles
are used to make a profile of the customer as per the attributes, which
are then used to make specific recommendations of the products,
which are expected to be rated highly by the customer.

Check Your Progress 3:


1. Most social media networks represent a set of entities and their
relationships. An entity can be represented with the help of a node,
which can also store the activities performed by that entity. The
relationships can best be represented with the help of an edge between
two nodes. Social media networks have a large number of complex
relationships. The graphs are the most natural way of representing the
nodes and complex relationships. In addition, this allows the use of
graph-based algorithms for analytics.
2. Different types of social media networks include traditional social
networks like Facebook, Twitter etc; social review of discussion or
blogging-based networks and many image and video sharing networks.
3. One of the analyses of social networks is to find social communities, which have a large number of connections within themselves but fewer connections with other communities. This can best be performed with the help of finding clusters in graphs.

12.8 REFERENCES/FURTHER READINGS

1. Leskovec J., Rajaraman A., Ullman J., Mining of Massive Datasets, 3rd Edition, available on the website http://www.mmds.org/
2. Gandomi A., Haider M., Beyond the hype: Big data concepts, methods, and analytics, International Journal of Information Management, Volume 35, Issue 2, 2015, Pages 137-144, ISSN 0268-4012.

UNIT 13 BASICS OF R PROGRAMMING
Structure
13.0 Introduction
13.1 Objectives
13.2 Environment of R
13.3 Data types, Variables, Operators, Factors
13.4 Decision Making, Loops, Functions
13.5 Data Structures in R
13.5.1 Strings and Vectors
13.5.2 Lists
13.5.3 Matrices, Arrays and Frames
13.6 Summary
13.7 Answers

13.0 INTRODUCTION
This unit covers the fundamental concepts of R programming. The unit familiarises you with the environment of R and covers the details of the global environment. It further discusses the various data types; every variable in R has an associated data type, which reserves memory space and stores values. The unit discusses the data objects known as factors and the types of operators used in R programming. The unit also explains the important elements of decision making, the general form of a typical decision-making structure, and loops and functions. R's basic data structures, including vectors, strings, lists, data frames, matrices and arrays, are also discussed.

13.1 OBJECTIVES
After going through this Unit, you will be able to:
• explain about the environment of R, the global environment and their
elements;
• explain and distinguish between the data types and assign them to
variables;
• explain about the different types of operators and the factors;
• explain the basics of decision making, the structure and the types of
loops;
• explain about the function- their components and the types;
• explain the data structures including vector, strings, lists, frames,
matrices, and arrays.

13.2 ENVIRONMENT OF R
The R programming language has been designed for statistical analysis of data. It also has very good support for graphical representation of data. It has a vast set of commands. In this block, we will cover some of the essential components of R programming which would be useful to you for the purpose of data analysis. We will not be covering all aspects of this programming language; therefore, you may refer to the further readings for more details.
The discussion on R programming will be in the context of R-Studio, which is
an open-source software. You may try various commands listed in this unit to
facilitate your learning. The first important concept of R is its environment,
which is discussed next.
An environment can be thought of as a virtual space having a collection of objects (variables, functions, etc.). An environment is created when you first start the R interpreter.
The top-level environment present at the R command prompt is the global environment, known as R_GlobalEnv; it can also be referred to as .GlobalEnv. You can use the ls() command to know what variables/functions are defined in the working environment. You can even check it in the Environment section of RStudio.

Figure 13.1: Environment with a variable in RStudio

Figure 13.2: Variables in the Global Environment


In Figure 13.2, the variables a, b and f are in R_GlobalEnv. Notice that x (as an argument to the function) is not in the global environment. When you define a function, a new environment is created. In Figure 13.1, the function f created a new environment inside the global environment.
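For instance, you may try the following commands in the R console. The names a, b, f and x mirror those referred to above; the values assigned are only illustrative.

a <- 5                      # a variable in the global environment
b <- "hello"
f <- function(x) {          # x exists only inside the environment of f
  x * 2
}
ls()                        # lists "a", "b" and "f" defined in R_GlobalEnv
environment()               # <environment: R_GlobalEnv>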

13.3 DATA TYPES, VARIABLES, OPERATORS,


FACTORS

Every variable in R has an associated data type. The data type determines the memory reserved for the variable, which is needed for storing its values. Given below is a list of basic data types available in R programming:
DATA TYPE | Allowable Values
Integer | Values from the set of integers, Z
Numeric | Values from the set of real numbers, R
Complex | Values from the set of complex numbers, C
Logical | Only allowable values are TRUE and FALSE
Character | Possible values are “x”, “@”, “1”, etc.
Table 1: Basic Data Types

Numeric Datatype:
Decimal values are known as numeric in R, and numeric is the default datatype for any number in R.

Whenever a number is stored in R, it gets converted into the decimal type, i.e. a “double” value. So, even if you enter a normal integer value, for example 10, the R interpreter will store it as a double, i.e. 10.00. You can confirm this by checking the type of the variable, as given below:
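(An illustrative session; the variable name z is the one referred to in the text below.)

z <- 10          # looks like an integer, but is stored as numeric (double)
class(z)         # "numeric"
is.integer(z)    # FALSE
is.numeric(z)    # TRUE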

The is.integer() function returning FALSE confirms that the variable z is stored as a double, i.e. the decimal type.

Integer Datatype:
R supports the integer data type. You can create an integer by suffixing “L” to a value, to denote that the particular variable is an integer, or convert a value to an integer by passing it to the as.integer() function.
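For example (illustrative values only):

x <- 5L              # the suffix L creates an integer
class(x)             # "integer"
y <- as.integer(3.7) # converts the value to an integer
y                    # 3 - the fractional part is dropped
is.integer(y)        # TRUE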


Logical Datatype:
R has a logical datatype which takes the value either TRUE or FALSE. A logical value usually results from comparing two variables in a condition.

Complex Datatype:
The complex data type is also supported in R. This datatype includes values from the set of complex numbers.

Character Datatype:
R supports the character datatype, which includes letters, digits and special characters. The value of a character type needs to be enclosed within single or double inverted commas.
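Small illustrative examples of the logical, complex and character datatypes (values chosen only for demonstration):

p <- 10; q <- 20
r <- p > q        # a comparison produces a logical value
r                 # FALSE
class(r)          # "logical"
cmp <- 3 + 2i     # a complex value
class(cmp)        # "complex"
ch <- "Data Science"
class(ch)         # "character"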


VARIABLES:

A variable, as discussed in the previous section, allocates a memory space and stores the values, which can then be manipulated. A valid variable name consists of letters, numbers and the dot or underscore characters.

Variable Name | Valid? | Reason
var_name1. | Valid | Contains letters, numbers, dot and underscore
1var_name | Invalid | Starts with a number
Var_name@ | Invalid | Has a special character (@); only dot and underscore are allowed
.var_name, var.name | Valid | Can start with a dot, provided the dot is followed by an alphabet
_var_name | Invalid | Should not start with an underscore
.2var_name | Invalid | The starting dot is followed by a number and hence invalid

Variables Assignment: Variables can be assigned in multiple ways –


• Assignment (=): var1 = “Hello”
• Left (<-): var2 <- “, “
• Right (->): “How are you” -> var3
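For example, the three assignment forms listed above can be combined as follows (the variable names mirror the list above):

var1 = "Hello"
var2 <- ", "
"How are you" -> var3
paste(var1, var2, var3, sep = "")   # "Hello, How are you"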

OPERATORS:
As is the case with other programming languages, R supports assignment, arithmetic, relational and logical operators. The logical operators of R include element-by-element operations. In addition, several other operators are supported by R, as explained in this section.

Arithmetic Operators:
• Addition (+): The values at the corresponding positions in the vectors are added. Please note the difference with C programming, as you are adding a complete vector using a single operator.
• Subtraction (-): The values at the corresponding positions are subtracted. Once again, please note that a single operator performs the task of subtracting the elements of two vectors.
• Multiplication (*): The values at the corresponding positions are multiplied.
• Division (/): The values at the corresponding positions are divided.
• Power (^): The elements of the first vector are raised to the exponent (power) of the corresponding elements of the second.
• Modulo (%%): The remainder after dividing the corresponding elements is returned.
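An illustrative example of element-wise vector arithmetic (the vectors are chosen arbitrarily):

v1 <- c(2, 4, 6)
v2 <- c(1, 2, 3)
v1 + v2    # 3 6 9
v1 - v2    # 1 2 3
v1 * v2    # 2 8 18
v1 / v2    # 2 2 2
v1 ^ v2    # 2 16 216
v1 %% v2   # 0 0 0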

Logical Operators:

• Element-wise Logical AND Operator (&): If both the corresponding operands are TRUE, then this operator returns the Boolean value TRUE for that element. Please note the difference with C programming, in which & is a bitwise AND operator, whereas in R it is an element-wise AND operator.
• Element-wise Logical OR Operator (|): If either of the corresponding operands is TRUE, then this operator returns the Boolean value TRUE for that element.
• Not Operator (!): This is a unary operator that is used to negate the operand.
• Logical AND Operator (&&): If the first elements of both the operands are TRUE, then this operator returns the Boolean value TRUE.
• Logical OR Operator (||): If either of the first elements of the operands is TRUE, then this operator returns the Boolean value TRUE.
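For example (illustrative logical vectors):

a <- c(TRUE, FALSE, TRUE)
b <- c(TRUE, TRUE, FALSE)
a & b          # TRUE FALSE FALSE  (element-wise AND)
a | b          # TRUE TRUE TRUE    (element-wise OR)
!a             # FALSE TRUE FALSE
a[1] && b[1]   # TRUE  (a single comparison)
a[1] || b[2]   # TRUE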

Relational Operators:

The relational operators can take scalar or vector operands. In the case of vector operands, the comparison is done element by element and a vector of TRUE/FALSE values is returned.
• Less than (<): Returns TRUE at each position where the element of the first operand is less than the corresponding element of the second operand.
• Less than or Equal to (<=): Returns TRUE at each position where the element of the first operand is less than or equal to the corresponding element of the second operand.
• Greater than (>): Returns TRUE at each position where the element of the first operand is greater than the corresponding element of the second operand.
• Greater than or Equal to (>=): Returns TRUE at each position where the element of the first operand is greater than or equal to the corresponding element of the second operand.
• Not equal to (!=): Returns TRUE at each position where the element of the first operand is not equal to the corresponding element of the second operand.
• Equal to (==): Returns TRUE at each position where the element of the first operand is equal to the corresponding element of the second operand.
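For example, comparing two vectors element by element (values are illustrative):

x <- c(2, 5, 9)
y <- c(3, 5, 7)
x < y     # TRUE FALSE FALSE
x <= y    # TRUE TRUE FALSE
x > y     # FALSE FALSE TRUE
x >= y    # FALSE TRUE TRUE
x == y    # FALSE TRUE FALSE
x != y    # TRUE FALSE TRUE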

Assignment Operators:
• Left Assignment (<- or <<- or =): Used for assigning value to a vector.
• Right Assignment (-> or ->>): Used for assigning value to a vector.


Miscellaneous Operators:

• %in% operator: It determines whether a data element is contained in a list or vector and returns the Boolean value TRUE if the element is found to exist.
• Colon (:) operator: It generates the sequence of numbers from the value before the colon to the value after the colon.
• %*% operator: It performs matrix multiplication, for example, multiplying a matrix with its transpose.
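Illustrative examples of these operators (the matrix values are arbitrary):

3 %in% c(1, 2, 3, 4)    # TRUE
1:5                     # 1 2 3 4 5
M <- matrix(1:4, nrow = 2)
M %*% t(M)              # matrix multiplication of M with its transpose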

FACTORS:
Factors are the data objects used for categorising data and storing it as levels. They can store both strings and integers. Factors are useful for columns that have a limited number of unique values, also known as categorical variables. They are useful in data analysis for statistical modelling. For example, a categorical variable for employment type (Unemployed, Self-Employed, Salaried, Others) can be represented using a factor. More details on factors can be obtained from the further readings.
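For example, the employment-type variable mentioned above can be stored as a factor (the data values are illustrative):

emp <- c("Salaried", "Unemployed", "Salaried", "Self-Employed", "Others")
emp_factor <- factor(emp)
emp_factor            # the data along with its levels
levels(emp_factor)    # "Others" "Salaried" "Self-Employed" "Unemployed"
table(emp_factor)     # count of values in each category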
Check Your Progress 1
1. What are various Operators in R?
……………………………………………………………………………

……………………………………………………………………………
2. What does %*% operator do?
…………………………………………………………………………….

………………………………………………………………………………
3. Is .5Var a valid variable name? Give reason in support of your answer.
………………………………………………………………………………

………………………………………………………………………………

13.4 DECISION MAKING, LOOPS, FUNCTIONS


Decision making requires the programmer to specify one or more conditions
which will be evaluated or tested by the program, along with the statements to
be executed if the condition is determined to be true, and optional statements
to be executed if the condition is determined to be false.
Given below is the general form of a typical decision making structure found
in most of the programming languages–

If Condition
condition
is true If
condition
is false

Conditional code

The format of if statement in R is as follows:


if (conditional statement, may include relational and logical operator) {
R statements to be executed, if the conditional statement is true
}
else {
R statements to be executed, if the conditional statement is FALSE
}
You may use else if instead of else
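For example (an illustrative check on a marks value):

marks <- 72
if (marks >= 50) {
  print("Pass")
} else {
  print("Fail")
}
# prints "Pass"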
LOOPS:
A loop is used in a situation where we need to execute a block of code several times. The statements inside the loop body are executed sequentially in each repetition.

[Figure: flowchart of a loop – while the condition is true, the conditional code is executed repeatedly; when the condition becomes false, the loop is exited]

Loop Type and Description:


• Repeat loop: Executes a sequence of statements repeatedly until a break statement terminates it.
• While loop: Repeats a block of statements while the given condition is true; the condition is tested before executing the loop body.
• For loop: Executes the loop body once for each element of a given sequence or vector.
The syntax and simple examples of these loops are given below.
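The following is an illustrative sketch; the variable names and values are arbitrary.

# while loop: the condition is tested before each iteration
i <- 1
while (i <= 3) {
  print(i)        # prints 1 2 3
  i <- i + 1
}

# for loop: iterates over the elements of a vector
for (colour in c("red", "green", "blue")) {
  print(colour)
}

# repeat loop: runs until break is executed
j <- 1
repeat {
  j <- j + 1
  if (j > 3) break
}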

Loop Control Statements:


• Break statement: Terminates the loop and executes the statements immediately following the loop.

FUNCTIONS:

A function refers to a set of instructions required to execute a command to achieve a task in R. There are several built-in functions available in R. Further, users may create their own functions based on their requirements.
Definition:
A function can be defined as:

function_name <- function(arg_1, arg_2, ...) {


Function body
}

Function Components
• Function Name: The actual name of the function.
• Arguments: Passed when the function is invoked. They are optional.
• Function Body: The statements that define the logic of the function.
• Return value: The value of the last expression evaluated in the function body.

Built-in functions: Built-in functions are functions that are already written and are accessible simply by calling the function name. Some examples are seq(), mean(), min(), max(), sqrt(), paste() and many more.
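For example, a user-defined function and a few built-in function calls (names and values are illustrative):

square <- function(x) {
  x * x            # the last expression is the return value
}
square(6)          # 36
seq(1, 10, by = 2) # built-in: 1 3 5 7 9
mean(c(2, 4, 6))   # built-in: 4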

13.5 DATA STRUCTURES IN R

R’s basic data structures include Vector, Strings, Lists, Frames, Matrices and
Arrays.

13.5.1 Strings and Vectors


Vectors:
A vector is a one-dimensional array of data elements that have the same data type. Vectors are the most basic data structure in R and support the logical, integer, double, complex and character datatypes.
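For example (illustrative values):

v <- c(10, 20, 30, 40)   # a numeric vector created with c()
v[2]                     # 20 - indexing starts at 1
length(v)                # 4
class(v)                 # "numeric"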
Strings:
Any value written within a pair of single quotes or double quotes in R is treated as a string. Internally, R stores every string within double quotes, even when you create it with single quotes.

Rules Applied in String Construction


• The quotes at the beginning and end of a string should be either both double quotes or both single quotes. They cannot be mixed.
• Double quotes can be inserted into a string starting and ending with single quotes.
• A single quote can be inserted into a string starting and ending with double quotes.
• Double quotes cannot be inserted into a string starting and ending with double quotes.
• A single quote cannot be inserted into a string starting and ending with single quotes.
Length of String: The length of strings tells the number of characters in a string.
The inbuilt function nchar() or function str_length() of the stringr package can
be used to get the length of the string.
String Manipulations:
• Substring: Accessing different portions of a string. The two inbuilt functions available for this are substr() and substring(), which extract substrings.
• Case Conversion: The characters of a string can be converted to upper or lower case by using toupper() or tolower().
• Concatenation: Strings in R can be combined using the paste() function, which can concatenate any number of strings together. Its general form is paste(..., sep = " ", collapse = NULL), where ... represents the vectors or values to be combined, sep is the separator symbol used to separate the elements, and collapse is an optional string used to combine the results into a single string.
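Illustrative examples of these string functions (the string value is arbitrary):

s <- "Data Science with R"
nchar(s)                   # 19
substr(s, 1, 4)            # "Data"
toupper(s)                 # "DATA SCIENCE WITH R"
tolower(s)                 # "data science with r"
paste("Hello", "World", sep = " ")        # "Hello World"
paste(c("a", "b", "c"), collapse = "-")   # "a-b-c"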

13.5.2 Lists
Lists are the objects in R that can contain different types of objects within themselves, such as numbers, strings, vectors, or even another list, a matrix or a function, as their elements. A list is created by calling the list() function.
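For example (contents chosen only for illustration):

my_list <- list(name = "Asha", marks = c(70, 85, 90), passed = TRUE)
my_list$name          # "Asha"
my_list[[2]]          # 70 85 90
length(my_list)       # 3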

13.5.3 Matrices, Arrays and Frames


Matrices are R objects in which the elements are arranged in a two-dimensional layout. They contain elements of the same type. The basic syntax for creating a matrix in R is:
matrix(data, nrow, ncol, byrow, dimnames), where data is the input vector, nrow is the number of rows, ncol is the number of columns, byrow is a logical value specifying whether the matrix should be filled row-wise, and dimnames is the list of names assigned to the rows and columns.


Accessing the elements of the matrix: Elements of a matrix can be accessed


by specifying the row and column number.

Matrix Manipulations:
Mathematical operations can be performed on matrices, such as addition, subtraction, multiplication and division. You may please note that matrix division is not defined mathematically; in R, each element of one matrix is simply divided by the corresponding element of the other matrix.
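For example (the values are illustrative):

A <- matrix(1:6, nrow = 2, ncol = 3, byrow = TRUE,
            dimnames = list(c("r1", "r2"), c("c1", "c2", "c3")))
A["r2", "c3"]    # 6 - the element at row 2, column 3
B <- matrix(rep(2, 6), nrow = 2)
A + B            # element-wise addition
A * B            # element-wise multiplication (not matrix multiplication)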


Arrays:
An array is a data object in R that can store multi-dimensional data of the same data type. It is created using the array() function, which accepts vectors as input. An array is created using the values passed in the dim parameter.
For instance, if an array is created with dimensions (2, 3, 5), then R creates 5 rectangular matrices comprising 2 rows and 3 columns each. The data elements in the array are all of the same data type.


Accessing Array Elements:
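A small illustrative example of creating an array and accessing its elements (the values are arbitrary; here dimensions (2, 3, 2) are used to keep the output short):

arr <- array(1:12, dim = c(2, 3, 2))
arr[1, 2, 1]   # the element in row 1, column 2 of the first matrix: 3
arr[, , 2]     # the complete second 2x3 matrix
dim(arr)       # 2 3 2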

Dataframe:
A data frame represents a table, i.e. a two-dimensional structure. It can be interpreted as a matrix in which each column can be of a different data type.
The characteristics of a data frame are given as follows:
• The names of the columns should not be left blank.
• The row names should be unique.
• The data frame can contain elements of numeric, factor or character data types.
• Each column should contain the same number of data items.

Statistical summary of the dataframe can be fetched using summary() function.

Specific data can be extracted from the data frame by specifying the column name.

The data frame can be expanded by adding an additional column.

To add more rows permanently to an existing data frame, we need to bring in


the new rows in the same structure as the existing data frame and use
the rbind() function.
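A small illustrative example of creating and manipulating a data frame (the column names and values are hypothetical):

students <- data.frame(name = c("Asha", "Ravi", "Meena"),
                       programme = c("MCA", "MSc", "MCA"),
                       marks = c(72, 65, 80))
summary(students)                           # statistical summary of each column
students$marks                              # extract a specific column: 72 65 80
students$grade <- c("A", "B", "A")          # add a new column
new_row <- data.frame(name = "Vikram", programme = "MSc",
                      marks = 70, grade = "B")
students <- rbind(students, new_row)        # add a new row with the same structure
nrow(students)                              # 4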

Check Your Progress 2


1. Why is the matrix data structure not used that often?
…………………………………………………………………………………
……………………………………………………………………………

2. What are the different data structures in R? Briefly explain about them.
…………………………………………………………………………………
…………………………………………………………………………………

3. What is the function used for adding datasets in R?


…………………………………………………………………………………
…………………………………………………………………………………

13.6 SUMMARY

The unit introduces you to the basics of R programming. It explains about the
environment of R, a virtual space having collection of objects and how a new
environment can be created within the global environment. The unit also
explains the various types of data associated with variables, which allocate memory space and store the values that can be manipulated. It also gives the details of the five types of operators in R programming. It also explains factors, which are the data objects used for organising and storing data as levels. The concept of decision making, which requires the programmer to specify one or more conditions to be evaluated or tested by the program, has also been discussed in detail. The concept of loops and their types has also been defined in this unit. The unit gives the details of functions in R; a function is a set of instructions required to execute a command to achieve a task in R. There are several built-in functions available in R, and users may also create their own functions based on their requirements. The concepts of matrices, arrays, data frames, etc. have also been

13.7 ANSWERS
Check Your Progress 1

1. The various operators in R are the arithmetic, relational, logical, assignment and miscellaneous operators. All of these are briefly explained in section 13.3.

2. The %*% operator performs matrix multiplication, for example, multiplying a matrix with its transpose.

3. .5Var is an Invalid variable name as the dot is followed by a number

Check Your Progress 2


1. Matrices are not used that often as they can contain elements of only one data type, for example numeric, character or logical values.
2. Various Data Structures in R:

Data Structure Description

Vector A vector is a one-dimensional array


of data elements that have same
data type. These data elements in a
vector are referred to as
components.
List Lists are the R objects which
contain elements of different types
like- numbers, strings, vectors or
another list inside it.

Matrix A matrix is a two-dimensional data
structure. Matrices are used to bind
vectors from the same length. All
the elements of a matrix must have
the same data type, i.e. (numeric,
logical, character, complex).
Dataframe A dataframe is more generic than a
matrix, i.e. different columns can
have different data types (numeric,
logical etc). It combines features of
matrices and lists like a rectangular
list.

3. rbind() is the function used to add datasets (additional rows) to a data frame in R.

13.8 REFERENCES AND FURTHER READINGS

1. De Vries, A., & Meys, J. (2015). R for Dummies. John Wiley & Sons.
2. Peng, R. D. (2016). R programming for data science (pp. 86-181). Victoria, BC, Canada:
Leanpub.
3. Schmuller, J. (2017). Statistical Analysis with R For Dummies. John Wiley & Sons.
4. Field, A., Miles, J., & Field, Z. (2012). Discovering statistics using R. Sage publications.
5. Lander, J. P. (2014). R for everyone: Advanced analytics and graphics. Pearson
Education.
6. Lantz, B. (2019). Machine learning with R: expert techniques for predictive modeling.
Packt publishing ltd.
7. Heumann, C., & Schomaker, M. (2016). Introduction to statistics and data analysis.
Springer International Publishing Switzerland.
8. Davies, T. M. (2016). The book of R: a first course in programming and statistics. No
Starch Press.
9. https://www.tutorialspoint.com/r/index.html

UNIT 14 DATA INTERFACING AND
VISUALISATION IN R
Structure
14.1 Introduction
14.2 Objectives
14.3 Reading Data From Files
14.3.1 CSV Files
14.3.2 Excel Files
14.3.3 Binary Files
14.3.4 XML Files
14.3.5 JSON Files
14.3.6 Interfacing with Databases
14.3.7 Web Data
14.4 Data Cleaning and Pre-processing
14.5 Visualizations in R
14.5.1 Bar Charts
14.5.2 Box Plots
14.5.3 Histograms
14.5.4 Line Graphs
14.5.5 Scatterplots
14.6 Summary
14.7 Answers
14.8. References and Further Readings

14.1 INTRODUCTION
In the previous unit, you have learnt about basic concepts of R programming.
This unit explains how to read and analyse data in R from various file types, including CSV, Excel, binary, XML and JSON. It also discusses how to extract and work with data in R from databases and from the web. The unit also explains in detail data cleaning and pre-processing in R. In the later sections, the
unit explores the concept of visualisations in R. Various types of graphs and
charts, including - bar charts, box plots, histograms, line graphs and scatterplots,
are discussed.

14.2 OBJECTIVES
After going through this Unit, you will be able to:
• explain the various file types and their interface that can be processed for
data analysis in R;
• read, write and analyse data in R from different type of files including-
CSV, Excel, binary, XML and JSON;
• extract and use data from databases and web for analysis in R;
• explain the steps involved in data cleaning and pre-processing using R;
• Visualise the data using various types of graphs and charts using R and
explain their usage.

14.3 READING DATA FROM FILES
In R, you can read data from files outside of the R environment. One may also write data to files that the operating system can store and access later. R can read from and write to a wide range of file formats, including CSV, Excel, binary and XML files.
14.3.1 CSV Files
Input as CSV File:
A CSV file is a text file in which the column values are separated by commas. For example, you can create data containing the name, programme and phone number of students. By copying and pasting this data into Windows Notepad, you can create the CSV file. Using Notepad's Save As option, save the file as input.csv.
Reading a CSV File:
Function used to read a CSV file: read.csv()

Figure 14.1: Reading data from a CSV file

Analysing the CSV File:


The read.csv() function returns a data frame as its default output. You can use three simple functions to: (1) verify whether the data read from the CSV file is in data frame format or not; (2) find the number of columns in the data; and (3) find the number of rows in the data.

Figure 14.2: Checking read data

Writing into a CSV File:


The write.csv() function of R can be used to generate a CSV file from a data frame. For example, the following function call will generate an output.csv file. Please note that output.csv will be created in the current working directory.
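An illustrative sketch of reading and writing a CSV file is given below. The file name input.csv follows the example above; the contents of the file are an assumption.

data <- read.csv("input.csv")     # read the CSV file into a data frame
is.data.frame(data)               # TRUE - the result is a data frame
ncol(data)                        # number of columns
nrow(data)                        # number of rows
write.csv(data, "output.csv")     # write the data frame to output.csv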

14.3.2 Excel Files

Microsoft Excel is the most extensively used spreadsheet tool, and it uses the .xls or .xlsx file extension to store data. Using various Excel-specific packages, R can read directly from these files; XLConnect, xlsx and gdata are a few examples of such packages. The xlsx package also allows R to write to an Excel file.

Install xlsx Package

• Command to install ‘xlsx’ package: install.packages("xlsx")


• To Load the library into R workspace: library("xlsx")

Reading the Excel File


The read.xlsx() function is used to read the input.xlsx file, as illustrated below.
In the R environment, the result is saved as a data frame.

Figure 14.3: Reading data from an Excel file


Writing the Excel File
For writing to a new Excel file, you use the write.xlsx() function, as shown below:
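An illustrative sketch using the xlsx package; the file names and sheet index are assumptions.

library("xlsx")
excel_data <- read.xlsx("input.xlsx", sheetIndex = 1)   # read the first sheet as a data frame
print(excel_data)
write.xlsx(excel_data, "output.xlsx")                   # write the data frame to a new Excel file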

14.3.3 Binary Files


A binary file is one that solely contains data in the form of bits and bytes (0's and 1's). When you try to read a binary file, the sequence of bits is translated as bytes or characters, which include numerous non-printable characters, and is therefore not human readable. Any text editor that tries to open a binary file will display characters like Ø and ð, some printable characters, and many other unreadable characters.
R has two functions writeBin() and readBin() to create and read binary
files.

Syntax:
writeBin(object, con)
readBin(con, what, n )

where,
• con is the connection object used to read or write the binary file.
• object is the data to be written to the binary file.
• what specifies the mode of the data to be read, such as character,
integer, etc.
• n specifies the maximum number of elements to read from the binary file.

Writing the Binary File (you should read the comments for an
explanation of each command):


Figure 14.4: An example of writing data to a binary file

Reading the Binary File (you should read the comments for an
explanation of each command):

Figure 14.5: An example of reading data from a binary file
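The figures above are only illustrative; a minimal self-contained sketch of
writing and reading a binary file is given below:

con <- file("myfile.bin", "wb")                  # open a connection in binary write (wb) mode
writeBin(as.integer(c(10, 20, 30, 40)), con)     # object to be written
close(con)
con <- file("myfile.bin", "rb")                  # open the connection in binary read (rb) mode
values <- readBin(con, what = integer(), n = 4)  # read up to 4 integer values
close(con)
print(values)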

14.3.4 XML Files

XML is an acronym for “extensible markup language”. It is a file format
that allows both the file format and the data to be shared over the
internet, intranets and elsewhere, as standard ASCII text. XML uses
markup tags that describe the meaning of the data stored in the file. This
is similar to the markup tags used in HTML, where the markup tags
describe the structure of the page instead.
The "XML" package in R can be used to read an xml file. The following
command can be used to install this package:

install.packages("XML")

Reading XML File


R reads the xml file using the function xmlParse(). In R, it is
saved as a list.


Figure 14.6: An example of reading data from an XML file

XML to Data Frame


In order to manage the data appropriately in huge files, the data
in the xml file can be read as a data frame. The data frame should
then be processed for data analysis.

Figure 14.7: Converting the read data to a data frame
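A minimal sketch, assuming a file input.xml in the working directory, is:

library(XML)
xmldoc <- xmlParse(file = "input.xml")   # parse the XML file
print(xmldoc)
xmldf <- xmlToDataFrame("input.xml")     # read the XML data directly as a data frame
print(xmldf)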

14.3.5 JSON Files

The data in a JSON file is stored as text in a human-readable
format. JSON is an abbreviation of JavaScript Object Notation. The
rjson package in R can read JSON files.

Install rjson Package


To install the rjson package, type the following command in the
R console: install.packages("rjson")

Read the JSON File


R reads the JSON file using the function fromJSON(). In R, it is
saved as a list.

Figure 14.8: An example of reading data from a JSON file

Convert JSON to a Data Frame


Using the as.data.frame() function, you can turn the retrieved
data above into an R data frame for further study.


Figure 14.9: Converting read data to a data frame
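A minimal sketch, assuming a file input.json in the working directory, is:

library(rjson)
result <- fromJSON(file = "input.json")  # the JSON content is returned as a list
print(result)
jsondf <- as.data.frame(result)          # convert the list to a data frame for analysis
print(jsondf)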

14.3.6 Databases

Data is stored in a normalised way in relational database systems. As a


result, you will require quite advanced and complex SQL queries to
perform statistical computing. However, R can readily connect to various
relational databases, such as MySQL, Oracle, and SQL Server, and
retrieve records as a data frame. Once the data is in the R environment,
it becomes a standard R data set that can be modified and analysed with
all of R's sophisticated packages and functions.

RMySQL Package
R has a package called "RMySQL" that provides native connectivity to a
MySQL database. The following command will
install this package in the R environment.

install.packages("RMySQL")

Connecting R to MySQL

Figure14.10: Connecting to MySQL database

Querying the Tables


Using the dbSendQuery() function, you can query the database
tables. The query is run in MySQL, and the results are retrieved with the
R fetch() function. Finally, the result is saved in R as a data frame.

Figure 14.11: Querying the MySQL table
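A minimal sketch of connecting and querying is given below; the database name,
user, password, host and table used here are only placeholders and must be
replaced with your own values:

library(RMySQL)
mysqlconnection <- dbConnect(MySQL(), user = "root", password = "mypassword",
                             dbname = "sakila", host = "localhost")
dbListTables(mysqlconnection)            # list the tables available in the database
result <- dbSendQuery(mysqlconnection, "SELECT * FROM actor LIMIT 5")
dataframe <- fetch(result, n = 5)        # fetch the result set as a data frame
dbClearResult(result)
print(dataframe)
dbDisconnect(mysqlconnection)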

Updating Rows in the Tables

Figure 14.12: Updating rows in MySQL table

Inserting Data into the Tables


Figure 14.13: Inserting data in a MySQL table

Creating Tables in MySQL

The dbWriteTable() function of the RMySQL package can be used to create
tables in MySQL. It takes
a data frame as input and overwrites the table if it already exists.

Figure 14.14: Creating a table in MySQL

Dropping Tables in MySQL

Figure 14.15: Dropping a table in MySQL

14.3.7 Web Data

Many websites make data available for users to consume. The World
Health Organization (WHO), for example, provides reports on health and
medical information in CSV, txt, and XML formats. You can
programmatically extract certain data from such websites using R
applications. "RCurl," "XML," and "stringr" are some R packages that
are used to scrape data from the web. They are used to connect to URLs,
detect required file links, and download the files to the local
environment.

Install R Packages

For processing the URLs and links to the files, the following packages
are necessary.
install.packages("RCurl")
install.packages("XML")
install.packages("stringr")
install.packages("plyr")

14.4 DATA CLEANING AND PRE-PROCESSING


Data cleaning is the process of identifying, correcting and removing incorrect
raw data. The clean data is then fed to the models to draw logical
conclusions.
If the data is poorly prepared, unreliable results can undermine the assumptions
and insights.
Packages like tidyverse can make complex data manipulations easier.
The following checklist for cleaning and preparing data is mainly
considered among the best practices (a short R sketch illustrating some of
these checks is given after the list).
• Familiarisation with the dataset: get good domain knowledge so
that you are aware of which variable represents what.
• Check for structural errors: you may check for mislabelled variables,
faulty data types, duplicated (non-unique) values and string
inconsistencies or typing errors.
• Check for data irregularities: you may check for invalid values and
outliers.
• Decide on how to deal with missing values: either delete the observations
if they do not provide any meaningful insight into the data, or impute
the missing values with some logical value, like the mean or median, based
on the observations.
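The sketch below illustrates some of these checks on a hypothetical data frame
df; the column name age is used only as an example:

str(df)                           # familiarisation: variable names and data types
sum(duplicated(df))               # structural errors: count duplicated rows
df <- df[!duplicated(df), ]       # drop the duplicated rows
summary(df)                       # data irregularities: spot invalid values and outliers
colSums(is.na(df))                # count missing values per column
df <- na.omit(df)                 # option 1: delete observations with missing values
# option 2: impute a numeric column with its mean instead of deleting rows
# df$age[is.na(df$age)] <- mean(df$age, na.rm = TRUE)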
Check your Progress 1

1. What is the package used to use JSON Files in R?


………………………………………………………………………….
…………………………………………………………….
2. What are wb and rb mode while dealing with binary files?
………………………………………………………………………….
…………………………………………………………….
3. Mention any 2 checklist points used for cleaning/ preparing data?
………………………………………………………………………….
…………………………………………………………….

14.5 VISUALIZATION IN R
In the previous sections, we have discussed obtaining input from different
types of data sources. This section explains various types of graphs that can be drawn
using R. It may please be noted that only selected types of graphs have been
presented here.

14.5.1 Bar Charts

A bar chart depicts data as rectangular bars whose length is proportional
to the variable's value. The function barplot() in R is used to make bar
charts. In a bar chart, R can create both vertical and horizontal bars. Each
of the bars in a bar chart can be coloured differently.

Syntax:
barplot(H,xlab,ylab,main, names.arg,col)

where,
• In a bar chart, H is a vector or matrix containing numeric values.
• The x axis label is xlab.
• The y axis label is ylab.
• The title of the bar chart is main.
• names.arg is a list of names that appear beneath each bar.
• col is used to color the graph's bars.


Figure 14.16: Creating a Bar chart

Figure 14.17: A Bar Chart of data of Figure 14.16

Bar Chart Labels, Title and Colors

More parameters can be added to the bar chart to increase its capabilities.
The title is added using the main parameter. Colors are added to the bars
using the col parameter. To express the meaning of each bar, names.arg
is a vector with the same number of values as the input vector.

Figure 14.18: Function for plotting Bar chart with labels and
colours


Figure 14.19: A Bar Chart with labels and colors
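A minimal sketch of such a bar chart, using illustrative values (not necessarily
those of the figures), is:

H <- c(7, 12, 28, 3, 41)                    # values of the bars
M <- c("Mar", "Apr", "May", "Jun", "Jul")   # labels beneath the bars
barplot(H, names.arg = M, xlab = "Month", ylab = "Revenue",
        col = "blue", main = "Revenue chart", border = "red")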

14.5.2 Box Plots


In order to determine how evenly the data is distributed in a dataset, you
can use a boxplot, which is a very effective tool. The dataset is split using
quartiles. This graph depicts the data set's minimum, first quartile,
median, third quartile and maximum. Drawing boxplots for each data set
allows you to compare the distribution of data across data sets.
Syntax:
boxplot(x, data, notch, varwidth, names, main)
The parameters of the functions are as follows:
• Parameter x either can specify a formula or it can specify a vector.
• Parameter data is used to specify the data frame that contains the
data required to be plotted.
• Parameter notch represents a logical value. In case, you want to
draw a notch in the box plot, you may set its value to TRUE.
• Parameter varwidth is also logical. It can be set to TRUE, if you
want to make the box's width proportional to the sample size.
• Parameter names can be used to specify the group labels that will
be printed beneath each boxplot.
• main is used to give the graph a title.

Creating the Boxplot

Figure 14.20: Coding for Box plot


Figure 14.21: A Box plot of Figure 14.20
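A minimal sketch, using the built-in mtcars data set rather than the data of the
figures, is:

boxplot(mpg ~ cyl, data = mtcars,
        xlab = "Number of cylinders", ylab = "Miles per gallon",
        main = "Mileage data", notch = TRUE, varwidth = TRUE,
        col = c("green", "yellow", "purple"))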

14.5.3 Histograms

The frequency of values of a variable bucketed into ranges is represented
by a histogram. The difference between a histogram and a bar chart is
that a histogram groups the numbers into continuous ranges. The height
of each bar in a histogram represents the number of items present in that
range.

The hist() function in R is used to produce a histogram. This function


accepts a vector as an input and plots histograms using additional
parameters.

Syntax:
hist(v,main,xlab,xlim,ylim,breaks,col,border)

where,
• The parameter v is a vector that contains the numeric values for
which histogram is to be drawn.
• The title of the chart is shown by the main.
• The colour of the bars is controlled by col.
• Each bar's border colour is controlled by the border parameter.
• The xlab command is used to describe the x-axis.
• The x-axis range is specified using the xlim parameter.
• The y-axis range is specified with the ylim parameter.
• The term "breaks" refers to the breadth of each bar.

Figure 14.22: Creating a Histogram


Figure 14.23: Histogram of data used in Figure 14.22
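A minimal sketch with an illustrative numeric vector is:

v <- c(9, 13, 21, 8, 36, 22, 12, 41, 31, 33, 19)
hist(v, xlab = "Weight", col = "yellow", border = "blue",
     xlim = c(0, 50), ylim = c(0, 5), breaks = 5, main = "Histogram of weight")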

14.5.4 Line Graphs

A graph that uses line segments to connect a set of points is known as the
line graph. These points are sorted according to the value of one of their
coordinates (typically the x-coordinate). Line charts are commonly used
to identify data trends.
The line graph was created using R's plot() function.

Syntax:
plot(v,type,col,xlab,ylab)

where,
• The numeric values are stored in v, which is a vector.
• type takes the values "p", "l" and "o". The value "p" is used to draw only
points, "l" is used to draw only lines, and "o" is used to draw both
points and lines.
• xlab specifies the label for the x axis.
• ylab specifies the label for the y axis.
• main is used to specify the title of the chart.
• col is used to specify the color of the points and/or the lines.

Figure 14.24: Function to draw a line chart


Figure 14.25: A Line Chart for data of Figure 14.24

Multiple Lines in Line Chart & Axis Details

Figure 14.26: Function for Line Chart with multiple lines

Figure 14.27: A Line Chart with multiple lines for data of Figure
14.26
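A minimal sketch of a line chart with two lines, using illustrative values, is:

v <- c(7, 12, 28, 3, 41)
t <- c(14, 7, 6, 19, 3)
plot(v, type = "o", col = "red", xlab = "Month", ylab = "Rain fall",
     main = "Rain fall chart")
lines(t, type = "o", col = "blue")   # add the second series to the same chart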

14.5.5 Scatterplots

Scatterplots are diagrams that display a large number of points shown in
the Cartesian plane. The values of two variables are represented by each
point.
One variable is chosen on the horizontal axis, while another is chosen on
the vertical axis. To create a simple scatterplot, use the plot() method.

Syntax:
plot(x, y, main, xlab, ylab, xlim, ylim, axes)

The parameters of the plot functions are as follows:


• parameter x is the data values for the x-axis.
• parameter y is the data values for the y-axis.
• parameter main is used for the title of the graph.
• parameters xlab and ylab are used to specify the labels for the x-
axis and y-axis respectively.
• parameters xlim and ylim are used to define the limits of the values of x
and y respectively.
• axes specifies whether the plot should include both axes.

Figure 14.28: Plot function to draw Scatter Plot

Figure 14.29: Scatter plot for the data of Figure 14.28
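A minimal sketch, using the built-in mtcars data set, is:

plot(x = mtcars$wt, y = mtcars$mpg,
     xlab = "Weight", ylab = "Mileage",
     xlim = c(1.5, 6), ylim = c(10, 35),
     main = "Weight vs Mileage")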

Scatterplot Matrices
The scatterplot matrix is used when there are more than two variables
and you want to identify the correlation between one variable and the
others. To make scatterplot matrices, we use pairs() function.

Syntax:
pairs(formula, data)
where,
• The formula represents a set of variables that are utilised in pairs.
• The data set from which the variables will be derived is referred
to as data.


Figure 14.30: Function for Scatterplot matrix

Figure 14.31: A Scatterplot matrix
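A minimal sketch of a scatterplot matrix for selected mtcars variables is:

pairs(~ wt + mpg + disp + cyl, data = mtcars, main = "Scatterplot Matrix")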

Check your Progress 2

1. What is scatter plot?


2. When will you use a histogram and when will you use a bar chart in R?
3. What type of chart would you consider when trying to demonstrate a
“relationship“ between variables/parameters?

14.6 SUMMARY
In this unit you have gone though various file types that can be processed for
data analysis in R and further discussed their interfaces. R can read and write a
variety of file types outside the R environment, including CSV, Excel, binary,
XML and JSON. Further, R can readily connect to various relational databases,
such as MySQL, Oracle, and SQL Server, and retrieve records as a data frame
that can be modified and analysed with all of R's sophisticated packages and
functions. The data can also be programmatically extracted from websites using
R applications. "RCurl," "XML," and "stringr" are some R packages that are
used to scrape data from the web. The unit also explains the concept of data
cleaning and pre-processing which is the process of identifying, correcting and
removing incorrect raw data, familiarization with the dataset, checking data for
structural errors and data irregularities and deciding on how to deal with missing
values are the steps involved in cleaning and preparing data which is mainly
considered among the best practices. The unit finally explores the concept of
visualisations in R. There are various types of graphs and charts including- bar
charts, box plots, histograms, line graphs and scatterplots that can be used to
visualise the data effectively. The unit explained the usage and syntax for each
of the illustration with graphics.

14.7 ANSWERS
Check your progress 1
1. install.packages(“rjson”)
library(rjson)
2. rb mode opens the file in the binary format for reading and wb mode
opens the file in the binary format for writing.
3. The checklist points used for cleaning/ preparing data:
Check for data irregularities: You may check for invalid values and
outliers.
Decide on how to deal with missing values: Either delete the observations if
they do not provide any meaningful insight into the data, or impute
the missing values with some logical value like the mean or median, based on the
observations.
Check your progress 2
1. A scatter plot is a chart used to plot a correlation between two or more
variables at the same time
2. We use a histogram to plot the distribution of a continuous variable,
while we can use a bar chart to plot the distribution of a categorical
variable.
3. When you are trying to show “relationship” between two variables, you
will use a scatter plot or chart. When you are trying to show
“relationship” between three variables, you will have to use a bubble
chart.

14.8 REFERENCES AND FURTHER READINGS


1. De Vries, A., & Meys, J. (2015). R for Dummies. John Wiley & Sons.
2. Peng, R. D. (2016). R programming for data science (pp. 86-181). Victoria, BC, Canada:
Leanpub.
3. Schmuller, J. (2017). Statistical Analysis with R For Dummies. John Wiley & Sons.
4. Field, A., Miles, J., & Field, Z. (2012). Discovering statistics using R. Sage publications.
5. Lander, J. P. (2014). R for everyone: Advanced analytics and graphics. Pearson
Education.
6. Lantz, B. (2019). Machine learning with R: expert techniques for predictive modeling.
Packt publishing ltd.
7. Heumann, C., & Schomaker, M. (2016). Introduction to statistics and data analysis.
Springer International Publishing Switzerland.
8. Davies, T. M. (2016). The book of R: a first course in programming and statistics. No
Starch Press.
9. https://www.tutorialspoint.com/r/index.html

UNIT 15 DATA ANALYSIS AND R
Structure
15.1 Introduction
15.2 Objectives
15.3 Chi Square Test
15.4 Linear Regression
15.5 Multiple Regression
15.6 Logistic Regression
15.7 Time Series Analysis
15.8 Summary
15.9 Answers
15.10 References and Further Readings

15.1 INTRODUCTION
This unit deals with the concept of data analysis and how to leverage it by using
R programming. The unit discusses various tests and techniques to operate on
data in R and how to draw insights from it. The unit covers the Chi-Square Test,
its significance and its application in R with the help of an example. The unit
also familiarises the reader with the concept of Regression Analysis and its types,
including Simple Linear and Multiple Linear Regression, followed by
Logistic Regression. This is further substantiated with examples in R that explain
the steps, functions and syntax to use correctly. It also explains how to interpret
the output and visualise the data. Subsequently, the unit explains the concept of
Time Series Analysis and how to run it in R. It also discusses the
Stationary Time Series, extraction of trend, seasonality and error, and how to
create lags of a time series in R.

15.2 OBJECTIVES
After going through this Unit, you will be able to:-
• Run tests and techniques on data and interpret the results using R;
• explain the correlation between two variables in a dataset by running
Chi-Square Test in R;
• explain the concept of Regression Analysis and distinguish between their
types- simple Linear and Multiple Linear;
• build relationship models in R to plot and interpret the data and further
use it to predict the unknown variable values;
• explain the concept of Logistic Regression and its application on R;
• explain about the Time Series Analysis and the special case of Stationary
Time Series;
• explain about extraction of trend, seasonality, and error and how to create
lags of a time series in R.

15.3 CHI-SQUARE TEST


The Chi-Square test is a statistical tool for determining if two categorical
variables are significantly correlated. Both variables should come from the same
population and be categorical in nature, such as top/bottom, True/False or
Black/White.
Syntax of a chi-square test: chisq.test(data)
EXAMPLE:
Let’s consider R’s built in “MASS” library that contains Cars93 dataset that
represents the sales of different models of car.

Figure 15.1: Description of sample data set


As you can see, we have various variables that can be considered as categorical
variables. Let's consider "AirBags" and "Type" for our model. You want to check
if there is a correlation between these two categorical variables. The chi-square
test is a good indicator for such information. To perform the chi-square test, you
may perform the following steps:
• First, you need to extract this data from the dataset (see Figure 15.2).
• Next, create the table of the data(See Figure 15.2) and
• Perform chi square test on the table (See Figure 15.2)

Figure 15.2: Chi-square testing
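A minimal sketch of these steps (the exact code of the figure may differ) is:

library(MASS)                                   # provides the Cars93 data set
cartable <- table(Cars93$AirBags, Cars93$Type)  # contingency table of the two variables
print(cartable)
print(chisq.test(cartable))                     # perform the chi-square test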


The result shows a p-value of 0.0002723, which is less than 0.05 and indicates a
strong correlation. In addition, the value of chi-square is also high. Thus, the
type of car is strongly related to the number of airbags.

The chi-square test is one of the most useful tests for finding relationships
between categorical variables.
How can you find the relationships between two scale or numeric variables using
R? One such technique, which helps in establishing a model-based relationship
is regression, which is discussed next.

15.4 LINEAR REGRESSION


Regression analysis is a common statistical technique for establishing a
relationship model between two variables. One of these variables is known as a
predictor variable, and its value is derived via experimentation. The response
variable, whose value is generated from the predictor variable, is the other
variable.
A regression model that employs a straight line to explain the relationship
between variables is known as linear regression. In Linear Regression these
two variables are related through an equation, where exponent (power) of both
these variables is one. It searches for the value of the regression coefficient(s)
that minimises the total error of the model to find the line of best fit through
your data.
The general equation for a linear regression is –
y = a + b × x
In the equation given above:
• y is called the response/dependent variable, whereas x is an
independent/predictor variable.
• The a and b values are the coefficients used in the equation,
which are to be predicted.
The objective of the regression model is to determine the values of these two
constants.
There are two main types of linear regression:

• Simple Linear Regression: This kind of regression uses only one


independent variable, as shown in the equation above.
• Multiple Linear Regression: However, if you add more independent
variables, like y = a + b × x1 + c × x2 + ⋯, then it is called multiple
regression.

Steps for Establishing a Linear Regression:

A basic example of regression is guessing a person's weight based on his/her


height. To do so, you need to know the correlation between a person's height
and weight.

The steps to establishing a relationship are as follows:

1. Carry out an experiment in which you collect a sample of observed


height and weight values.
2. Create a relationship model using the lm() functions in R.
3. Find the coefficients from the model you constructed and use them to
create a mathematical equation.
4. To find out the average error in prediction, get a summary of the
relationship model. Also known as residuals, as shown in Figure 15.3
5. The predict() function in R can be used to predict the weight of
new person. A sample regression line and residual are shown in Figure
15.3

residual

Figure 15.3: An example of regression mode and residual

Input Data
Below is the sample data with the observations between weight and height,
which is experimentally collected and is input in the Figure 15.4

Figure 15.4: Sample data for linear regression


lm() function create the relation model between the variable i.e. predictor and
response.
Syntax: lm(formula, data),where
formula: presenting the relation between x and y.
data: data on which the formula needs to be applied.
Figure 15.5 shows the use of this function.

Figure 15.5: Use of lm function in linear regression
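A minimal sketch of steps 1 to 4, with illustrative height/weight observations
(which may differ from the figures), is:

x <- c(151, 174, 138, 186, 128, 136, 179, 163, 152, 131)   # heights (cm)
y <- c(63, 81, 56, 91, 47, 57, 76, 72, 62, 48)             # weights (kg)
relation <- lm(y ~ x)       # build the relationship model
print(relation)             # coefficients a and b
print(summary(relation))    # residuals, p-values and goodness of fit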

Summary of the relationship:

Figure 15.6: Results of regression


The results of regression as presented by R includes the following:
1. Five-point summary (Minimum, First Quartile, Median, Third Quartile,
and Maximum). This shows the spread of the residual. You may
observe that about 50% of residuals are in the range -1.713 to +1.725,
which shows a good model fit.
2. The t value and Pr values for the intercept (that is b in the equation y =
ax +b) and x (that is a in the equation y = ax +b).
3. The F-statistic is very high with a low p-value, indicating that the model
as a whole is statistically significant.

Predict function:
The predict() function is used to predict the weight of a new person.

Syntax: predict(object, newdata), where

object is the model already created using the lm() function.
newdata is a data frame containing the new value(s) of the predictor variable.

Figure 15.7: The Predict function
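A minimal sketch, using the relation model from the sketch above (the name is
only illustrative), predicts the weight of a new person with height 170 cm:

a <- data.frame(x = 170)         # newdata: the column name must match the predictor
result <- predict(relation, a)
print(result)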

Plot for Visualization: Finally, you may plot these values by setting the plot
title and axis titles (see Figure 15.8). The linear regression line is shown in
Figure 15.3.


Figure 15.8: Making a chart of linear regression

Linear regression has one response variable and one predictor variable;
however, in many practical cases there can be more than one predictor
variable. This is the case of multiple regression, which is discussed next.

15.5 MULTIPLE REGRESSION


The relationship between two or more independent variables and a single
dependent variable is estimated using multiple linear regression. When you need
to know the following, you can utilize multiple linear regression.
• The degree to which two or more independent variables and one
dependent variable are related (e.g. how baking soda,
baking temperature, and amount of flour added affect the taste of cake).
• The dependent variable's value at a given value of the independent
variables (e.g. the taste of cake for different amount of baking soda,
baking temperature, and flour).
The general equation for multiple linear regression is –
y = a + b1X1 + b2X2 +...bnXn
where,
• y is response variable.

• a, b1, b2...bn are coefficients.


• X1, X2, ...Xn are predictor variables.
The lm() function in R is used to generate the regression model. Using the input
data, the model calculates the coefficient values. Using these coefficients, you
can then predict the value of the response variable for a given collection of
predictor variables.
lm() Function:
The relationship model between the predictor and the response variable is
created using this function.
Syntax: The basic syntax for the lm() function in multiple regression is –
lm(y ~ x1+x2+x3..., data),
where the relationship between the response variable and the predictor variables
is represented by the formula, and data is the data frame on which the formula
will be applied.

INPUT Data

Let’s take the R inbuilt data set “mtcars”, which gives a comparison between
various car models based on the mileage per gallon (“mpg”), cylinder
displacement (“disp”), horse power (“hp”), weight of the car (“wt”) and more.
The aim is to establish the relationship of mpg (response variable) with the
predictor variables (disp, hp, wt). The head function, as used in Figure 15.9,
shows the first few rows of the dataset.

Figure 15.9: Sample data for Multiple regression


Creating Relationship model & getting the coefficients

Figure 15.10: The Regression model


Please note that the input is the name of a variable, which was created in
Figure 15.9.

Figure 15.11: Display of various output parameters
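A minimal sketch of the model shown in the figures (the exact variable names in
the figures may differ) is:

input <- mtcars[, c("mpg", "disp", "hp", "wt")]
model <- lm(mpg ~ disp + hp + wt, data = input)
print(model)            # intercept and coefficients
print(summary(model))   # residuals, p-values and goodness of fit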


The results of regression as presented by R includes the following:


1. Five-point summary (Minimum, First Quartile, Median, Third Quartile,
and Maximum). This shows the spread of the residual. You may
observe that about 50% of residuals are in the range -1.640 to +1.061,
which shows a good model fit.
2. Low p values mean the model is statistically significant.

Creating Equation for Regression Model: Based on the intercept & coefficient
values one can create the mathematical equation as follows:
Y = a + b × x_disp + c × x_hp + d × x_wt
or
Y = 37.15 − 0.000937 × x_disp − 0.0311 × x_hp − 3.8008 × x_wt

The same equation will be applied in predicting new values.

Check your Progress 1

1. What is linear regression?


…………………………………………………………………………..
…………………………………………………………………………..
2. What does chi-square test answers?
………………………………………………………………………….
………………………………………………………………………….
3. Difference between linear and multiple regression?
………………………………………………………………………….
………………………………………………………………………….

15.6 LOGISTIC REGRESSION

In R Programming, logistic regression is a classification algorithm for


determining the probability of event success and failure. When the dependent
variable is binary (0/1, True/False, Yes/No), logistic regression is utilised. In a
binomial distribution, the logit function is utilised as a link function.
Binomial logistic regression is another name for logistic regression. It is based
on the sigmoid function, with probability as the output and input ranging from
-∞ to +∞. The sigmoid function is given below:
g(z) = 1 / (1 + e^(-z)), where z = a + b × x

The general equation for logistic regression is –

g(z) = 1 / (1 + e^(-(a + b1 × x1 + b2 × x2 + b3 × x3 + ...)))

where y (= g(z)) is the response variable and the xi are the predictors.
The a and bi are coefficients.
The glm() function is used to construct the regression model.
Syntax:
glm (formula, data,family)

• formula is the symbol expressing the relationship between the variables.
• data is the data set containing the values of these variables.
• family is an R object that specifies the model's details. For logistic
regression, its value is binomial.
Input Data: Let’s take the R inbuilt data set “mtcars”, which provides details
of various car models & engine specifications. The transmission mode of the
car i.e. whether the car is manual or automatic is described by the column am
having a binary value as 0 or 1. You can create the model between columns
“am” (Outcome/ dependent/ response variable) and three others – hp, wt and
cyl (predictor variables).This model is aimed at determining, if car would have
manual or automatic transmission, given the horse power (hp), weight (wt) and
number of cylinders (cyl) in the car.

Figure 15.12: The sample data set for logistic regression

Figure 15.13: The logistic regression model
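A minimal sketch of the model shown in the figures is:

input <- mtcars[, c("am", "cyl", "hp", "wt")]
am.data <- glm(am ~ cyl + hp + wt, data = input, family = binomial)
print(summary(am.data))   # coefficients, p-values, null and residual deviance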


The null deviance demonstrates how well a model with an intercept term can
predict the dependent variable, whereas the residual deviance represents how
well a model with n predictor variables can predict the dependent variable.
Deviance is measure of goodness of fit of a model.
In the summary, as the p-values for the variables "cyl" and "hp" are greater
than 0.05, we consider them insignificant in contributing to the
value of the variable "am". Only weight (wt), whose p-value is below 0.05,
impacts the "am" value in this regression model.

15.7 TIME SERIES ANALYSIS



A Time Series is any metric that is measured at regular intervals. It entails


deriving hidden insights from time-based data (years, days, hours, minutes) in
order to make informed decisions. When you have serially associated data, time
series models are particularly beneficial. Weather data, stock prices, industry
projections, and so on are just a few examples.

A time series is represented as follows:

A data point, say Yt, at a specific time t (indicated by the subscript t) is defined
as either the sum or the product of the following three components:
Seasonality (St), Trend (Tt) and Error (et) (also known as White Noise).

Input: Import the data set and then use ts() function.
The steps to use the function are given below. However, it is pertinent to note
here that the input values used in this case should ideally be a numeric vector
belonging to the “numeric” or “integer” class.
The following function will generate a quarterly data series starting in 1959:
ts(inputData, frequency = 4, start = c(1959, 2))  # frequency 4 => QuarterlyData
The following function will generate a monthly data series starting in 1990:
ts(1:10, frequency = 12, start = 1990)  # freq 12 => MonthlyData
The following function will generate a yearly data series from 2009 to 2014:
ts(inputData, start = c(2009), end = c(2014), frequency = 1)  # YearlyData

In case you want to use an Additive Time Series, you use the following:
Yt = St + Tt + et
However, for a Multiplicative Time Series, you may use:
Yt = St × Tt × et

A multiplicative time series can be converted into an additive time series by
applying the log function to the time series, as represented below:
additiveTS = log(multiplicativeTS)

15.7.1 Stationary Time Series


A time series is considered “stationary” if the following criteria are satisfied:

1. The mean value of the time series remains constant over time, which implies
that the trend component is nullified.
2. The variance does not increase over time.
3. Seasonality has only a minor impact.

This means it has no trend or seasonal characteristics, making it appear to be


random white noise regardless of the time span viewed.

Steps to convert a time series to stationary

Differencing, in which each data point is replaced by its difference from the one
before it, is a frequent technique for making a time series stationary. To make
a stationary series out of most time series patterns, 1 or 2 rounds of differencing
are required.

15.7.2 Extraction of trend, seasonality and error

Using decompose() and forecast::stl(), the time series is separated into its
seasonality, trend, and error components. You may use the following set of
commands to do so.

timeSeriesData = EuStockMarkets[, 1]
resultofDecompose = decompose(timeSeriesData, type = "mult")
plot(resultofDecompose)
resultsofStl = stl(timeSeriesData, s.window = "periodic")

15.7.3 Creating lags of a time-series


A lag of a time series is generated when the time base is shifted by a given
number of periods. The state of a time series a few periods ago
may still have an effect on its current state. Hence, in time series
models, the lags of a time series are typically used as explanatory variables.
lagTimeSeries = lag(timeSeriesData, 3)  # shifting to 3 periods earlier
library(DataCombine)
mydf = as.data.frame(timeSeriesData)
mydf = slide(mydf, "x", NewVar = "xLag1", slideBy = -1)   # create a lag1 variable
mydf = slide(mydf, "x", NewVar = "xLead1", slideBy = 1)   # create a lead1 variable

Check your Progress 2

1. What is logistic regression?


…………………………………………………………………………..
2. What are the uses of Time-Series analysis?
………………………………………………………………………….
3. Differentiate between linear regression and logistic regression?
………………………………………………………………………….

15.8 SUMMARY

This unit introduces the concept of data analysis and examine its application
using R programming. It explains about the Chi-Square Test that is used to
determine if two categorical variables are significantly correlated and further
study its application on R. The unit explains the Regression Analysis, which is
a common statistical technique for establishing a relationship model between
two variables- a predictor variable and the response variable. It further explains
the various models in Regression Analysis, including Linear and Logistic
Regression Analysis. In Linear Regression the two variables are related through
an equation of degree one, which employs a straight line to explain the
relationship between variables. It is categorised into two types- Simple Linear
Regression which uses only one independent variable and Multiple Linear
Regression which uses two or more independent variables. Once familiar with
the Regression, the unit proceeds to explain about the logistic regression, which
is a classification algorithm for determining the probability of event success and
failure. It is also known as Binomial logistic regression and is based on the sigmoid
function, with probability as the output and input ranging from -∞ to +∞ . At the end,
the unit introduces the concept of time series analysis and help understand its
application and usage on R. It also discusses the special case of Stationary Time Series
and how to make a time series stationary. This section further explains how to extract
the trend, seasonality and error in a time series in R and the creating lags of a time
series.


15.9 ANSWERS
Check your Progress 1
1. A regression model that employs a straight line to explain the relationship
between variables is known as linear regression. In Linear Regression these
two variables are related through an equation, where exponent (power) of
both these variables is one. It searches for the value of the regression
coefficient(s) that minimises the total error of the model to find the line of
best fit through your data.
2. The Chi-square test of independence determines whether there is a
statistically significant relationship between categorical variables. It’s a
hypothesis test that answers the question—do the values of one categorical
variable depend on the value of other categorical variables?
3. Linear regression considers 2 variables whereas multiple regression consists
of 2 or more variables.

Check your Progress 2

1. Logistic regression is an example of supervised learning. It is used to


calculate or predict the probability of a binary (yes/no) event occurring.
2. Time series analysis is used to identify the fluctuation in economics and
business. It helps in the evaluation of current achievements. Time series is
used in pattern recognition, signal processing, weather forecasting and
earthquake prediction.
3. The problems pertaining to regression are solved using linear regression;
however, the problems pertaining to classification are solved using the
logistic regression. The linear regression yields a continuous result,
whereas logistic regression yields discrete results.

15.10 REFERENCES AND FURTHER READINGS

1. De Vries, A., & Meys, J. (2015). R for Dummies. John Wiley & Sons.
2. Peng, R. D. (2016). R programming for data science (pp. 86-181). Victoria, BC, Canada:
Leanpub.
3. Schmuller, J. (2017). Statistical Analysis with R For Dummies. John Wiley & Sons.
4. Field, A., Miles, J., & Field, Z. (2012). Discovering statistics using R. Sage publications.
5. Lander, J. P. (2014). R for everyone: Advanced analytics and graphics. Pearson
Education.
6. Lantz, B. (2019). Machine learning with R: expert techniques for predictive modeling.
Packt publishing ltd.
7. Heumann, C., & Schomaker, M. (2016). Introduction to statistics and data analysis.
Springer International Publishing Switzerland.
8. Davies, T. M. (2016). The book of R: a first course in programming and statistics. No
Starch Press.
9. https://www.tutorialspoint.com/r/index.html
10. https://data-flair.training/blogs/chi-square-test-in-r/
11. http://r-statistics.co/Time-Series-Analysis-With-R.html
12. http://r-statistics.co/Logistic-Regression-With-R.html

UNIT 16 ADVANCE ANALYSIS USING R
Structure
16.0 Introduction
16.1 Objectives
16.2 Decision Trees
16.3 Random Forest
16.4 Classification
16.5 Clustering
16.6 Association rules
16.7 Summary
16.8 Answers
16.9 References and Further Readings
16.0 INTRODUCTION
This unit explores concepts pertaining to an advanced level of data analysis and
their application in R. The unit explains the theory and the working of the
decision tree model and how to run it in R. It further discusses its various types,
which fall under two categories based on the target variable. The unit also
explores the concept of Random Forest and discusses its application in R. In the
subsequent sections, the unit explains the details of the classification algorithm
and its features and types. It further explains the unsupervised learning
technique of clustering and its application in R programming. It also discusses
the two types of clustering in R programming, including the concept and
algorithm of K-Means Clustering. The unit concludes by drawing insights on the
theory of Association Rules and their application in R.

16.1 OBJECTIVES
After going through this Unit, you will be able to:
• explain the concept of Decision Tree- including its types and application in
R;
• explain the working of Decision Tree and the factors to consider when
choosing a tree in R;
• explain the concept of Random Forest and its application on R;
• explain the concept of Classification algorithm and its features including-
classifier, characteristics, binary classification, multi-class classification and
multi-label classification;
• explain the types of classifications including- linear classifier and its types,
support vector machine and decision tree;
• explain the concept of Clustering and its application in R;
• explain the methods of Clustering and their types including the K-Means
clustering and its application in R;
• explain the concept and the theory behind the Association Rule Mining and
its further application in R Language.


16.2 DECISION TREES


A decision tree is a graph that represents decisions and their results in a tree
format. Graph nodes represent events or selections, and graph edges represent
decision rules or conditions. It is primarily used in machine learning and data
mining applications that use R. Examples of the use of decision trees include
predicting an email as spam or non-spam, predicting cancerous tumours, or
predicting creditworthiness based on credit risk factors. Models are
typically built using observational data, also known as training data. A set of
validation data is then used to validate and improve the model. R has packages
used to build and visualize decision trees. For a new set of predictors, this model is used
to reach decisions about the categories of data (yes / no, spam / non-spam).
Installing R Package: The package “party” is used for decision trees. It has a
function ctree() which is used to create and analyse decision trees. Figure 16.1
shows the output when you install the package using the install command.

Figure 16.1: Installation of package party used for decision trees

Syntax: ctree(formula, data), where formula describes the predictor and response
variables and data is the name of the dataset used. The following are some
common types of decision tree algorithms:
• Decision Stump: It is used to generate decision trees with only one split
and is therefore also known as one-level decision tree. In most cases,
known for its low predictive performance due to its simplicity.
• M5: It is known for its exact classification accuracy, and the ability to
work well with small noisy datasets.

• ID3 (Iterative Dichotomiser 3): One of the core decision tree algorithms;
it selects the best attribute for classifying the given records using a
top-down, greedy search approach through the specified
dataset.
• C4.5: This type of decision tree, known as a statistical classifier, is
derived from its parent ID3. It generates decisions based on a bunch of
predictors.
• C5.0: As a successor to C4.5, there are two models, the base tree and the
rule-based model, whose nodes can only predict category targets.
• CHAID: This algorithm is extended as a chi-square automatic
interaction detector and basically examines the merged variables and
justifies the result of the dependent variable by building a predictive
model.
• MARS: Extended as a multivariate adaptive regression spline, this
algorithm builds a set of piecewise linear models used to model
anomalies and interactions between variables. They are known for their
ability to process numerical data more efficiently.
• Conditional inference tree: This is a type of decision tree that
recursively separates response variables using the conditional inference
framework. It is known for its flexibility and strong fundamentals.
• CART: Expanded as a classification and regression tree, the value of the
target variable is predicted if it is continuous. Otherwise, if it is a
category, the required class is identified.
There are many types of decision tree, but all of them fall under two main
categories based on the target variable.
• Categorical variable decision tree: the target variable has a definite
set of values and belongs to a group.
• Continuous variable decision tree: the target variable can take values
from a wide, continuous range.
Input Data:
We use the “readingSkills” dataset that comes with the party package to build a
decision tree. For each person it records the age, shoe size, score (raw score on a
reading test) representing the person's reading ability, and whether the person is
a native speaker or not. Figure 16.2 shows this data.

Figure 16.2: Sample data for decision tree

Let’s use the ctree() function on the above data set to create the decision tree
and its graph.

Figure 16.3: Making decision tree
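A minimal sketch of what the figure likely contains is:

library(party)
output.tree <- ctree(nativeSpeaker ~ age + shoeSize + score, data = readingSkills)
plot(output.tree)   # draws a tree similar to the one in Figure 16.4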


Output: Each ellipse in the diagram represents a node of the decision tree. It shows
the name of the variable and the calculated p-value. The links are marked
with the cut-off values on which the decision is taken. From the decision
tree of Figure 16.4, we can conclude that people whose reading skill is
less than 38.306 and whose age is more than 6 are not native speakers. The black
rectangles indicate native speakers and the grey ones indicate those who
are not native speakers. A reading score greater than 38.306
gives a probability of 0.6 or more for “yes” (native
speaker), with the remaining probability for “no” (not a native speaker).
People whose age is less than 6 and whose reading skill is greater than 30.766
are native speakers, and those with reading skill less than or equal to 30.766 are
not native speakers.

Figure 16.4: The Decision Tree for the example


Working of Decision Tree:

• Partitioning: Refers to the process of partitioning a dataset into subsets.
The decision to make a strategic split has a significant impact on the
accuracy of the tree. Many algorithms are used in the tree to divide a
node into sub-nodes. As a result, the overall clarity of the node with
respect to the target variable is improved. To do this, various algorithms
such as chi-square and Gini coefficient are used and the most efficient
algorithm is selected.
• Pruning: This refers to the process of turning a branch node into a leaf
node and shortening the branches of a tree. The essence behind this idea
is that complex classification trees fit the training data well but do
not classify new values convincingly; simpler trees are therefore
preferred to avoid overfitting.
• Tree selection: The main goal of this process is to select the smallest
tree that fits your data for the reasons described in the pruning section.
Important factors to consider when choosing a tree in R
• Entropy: Mainly used to determine the uniformity of a particular
sample. If the sample is perfectly uniform, the entropy is 0, and if it is
evenly distributed, the entropy is 1. The higher the entropy, the harder it
is to draw conclusions from this information.
• Information Gain: Statistical property that measures how well the
training samples are separated based on the target classification. The
main idea behind building a decision tree is to find the attributes that
provide the minimum entropy and maximum information gain. It is
basically a measure of the decrease in total entropy and is calculated by
taking the difference between the undivided entropy of the dataset and
the average entropy after it is divided, based on the specified attribute
value.

16.3 RANDOM FOREST

Random forests are a set of decision trees that are used in supervised learning
algorithms for classification and regression analysis, but primarily for
classification. This classification algorithm is non-linear. To achieve more
accurate predictions and forecasts, Random Forest creates and combines
numerous decision trees together. For each tree, the observations that were not
used to build that tree (the "out-of-bag" samples) are used to estimate its
prediction error. This method is termed the out-of-bag (OOB) percent error
estimation.
The “Random Forest” is named ‘random’ since the predictors are chosen at
random during training. It is termed a ‘forest’ because a Random Forest makes
decisions based on the findings of several trees. Since multiple uncorrelated
trees (models) that operate as a committee generally perform better than the
individual constituent models, random forests are considered to be better
than single decision trees.
Random forest attempts to develop a model using samples from observations
and random beginning variables (columns).

The random forest algorithm is:
• Draw size n bootstrap random samples (randomly select n samples from
the training data).
• Build a decision tree from the bootstrap sample. Randomly select
features on each tree node.
• Split the node using a feature (variable) that provides the best split
according to the objective function. One such example is to maximise
the information gain.
• Repeat the first two steps “k” number of times, where k represents the
number of trees that you will create from subset of the sample.
• Aggregate the predictions from each tree for new data points and assign
a class label by majority vote, i.e., assign each new data point to the
class selected by the largest number of trees.
Install R package

Syntax:

formula is the formula which describes the variables, i.e., predictor and
response.
data is the name of the dataset used.
Input Data: We again use the “readingSkills” dataset of the party package. For
each person it records the age, shoe size, score (raw score on a reading test)
representing the person's reading ability, and whether the person is a native
speaker or not.

Output:

Figure 16.5: Sample data for random forest

You can now create the random forest by applying the syntax given above and
print the results

Figure 16.6: Sequence of commands using R for random forest
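A minimal sketch is given below; it assumes the randomForest package (the
package used in the figures is not shown) and the readingSkills data of the
party package:

library(party)           # provides the readingSkills data
library(randomForest)
output.forest <- randomForest(nativeSpeaker ~ age + shoeSize + score,
                              data = readingSkills, importance = TRUE)
print(output.forest)              # OOB error estimate and confusion matrix
print(importance(output.forest))  # MeanDecreaseAccuracy and MeanDecreaseGini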


The output of a random forest is in the form of a confusion matrix. Therefore,
before showing the output, let us discuss the confusion matrix.
Confusion matrix:
A confusion matrix is a performance measurement for machine learning
classification problems where output can be two or more classes.

Figure 16.7: The Confusion Matrix

Confusion Matrix Example:

Figure 16.8: Example of Confusion Matrix

Please note the following for the confusion matrix of Figure 16.8 (a small R
sketch computing some of these quantities is given after this list):

1. Total number of observations = (100 + 50 + 150 + 9700) = 10000


2. Total number of positive cases (Actual) = (100 + 50) = 150
3. Total number of negative cases (Actual) = (150 + 9700) = 9850
4. TP = 100, i.e.100 out of 150 positive cases are correctly predicted as
positive.
5. FN = 50, i.e. 50 out of 150 positive cases are incorrectly predicted as
negative.
6. FP = 150, i.e.150 out of 9850 negative cases are incorrectly predicted as
positive.
7. TN = 9700, i.e.9700 out of 9850 negative cases are correctly predicted
as negative.
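The sketch below simply recomputes these quantities (plus the overall accuracy)
in R:

TP <- 100; FN <- 50; FP <- 150; TN <- 9700
total <- TP + FN + FP + TN      # 10000 observations
actual_pos <- TP + FN           # 150 actual positive cases
actual_neg <- FP + TN           # 9850 actual negative cases
accuracy <- (TP + TN) / total   # proportion of correct predictions = 0.98
print(c(total = total, positives = actual_pos, negatives = actual_neg, accuracy = accuracy))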

Output:

Figure 16.9: Output of Random Forest


Here, “no” means not a native speaker and “yes” means a native speaker.
MeanDecreaseAccuracy expresses how much accuracy the model loses when
each variable is excluded, and MeanDecreaseGini tells how much each variable
contributes to the homogeneity (purity) of the nodes.
From the Random Forest above, we can conclude that shoe size and score are
important factors in determining if someone is a native speaker; their
MeanDecreaseGini values indicate their contribution to node purity, and hence
these two independent variables turn out to be important. Also, the model error
is only 1%. This means that you can predict with 99% accuracy.

Check Your Progress 1


1. Can Random Forest Algorithm be used both for Continuous and Categorical
Target Variables?
…………………………………………………………………………………

…………………………………………………………………………………

2. What is Out-of-Bag Error?

…………………………………………………………………………………

…………………………………………………………………………………

3. What does random refer to in ‘Random Forest’?


…………………………………………………………………………………

…………………………………………………………………………………

16.4 CLASSIFICATION
The idea of a classification algorithm is very simple. Predict the target class by
analysing the training dataset. Use the training dataset to get better boundary
conditions that you can use to determine each target class. Once the constraints
are determined, the next task is to predict the target class. This entire process is
called classification.
The classification algorithm has some important points.
• Classifier: This is an algorithm that assigns input data to a specific
category.
• Classification model: The classification model attempts to
draw some conclusions from the input values given for training. This
inference predicts the class label / category of new data.
• Characteristic: This is an individually measurable property of the
observed event.
• Binary classification: This is a classification task with two possible
outcomes. For example, a gender classification with only two possible
outcomes i.e. Men and women.
• Multi-class classification: This is a classification task where
classification is done in three or more classes. Here is an example of a
multiclass classification: A classifier can recognise a digit only as one
of the digit classes say 0 or 1 or 2…or 9.

• Multi-label classification: This is a classification task where each


sample is assigned a set of target labels. Here is an example of a multi-
label classification: A news article that can be classified by a multi-label
classifier as people, places, and sports at the same time.
In R, classification algorithms can be broadly divided into the following types.
Linear classifier
In machine learning, the main task of statistical classification is to use the
properties of an object to find the class to which the object belongs. This task is
solved by determining the classification based on the value of the linear
combination of features. R has two linear classification algorithms:
• Logistic regression
• Naive Bayes classifier
Support vector machine
Support vector machines are supervised learning algorithms that analyse the data
used for classification and regression analysis. In SVM, each data element is
represented as a value for each attribute, that is, a point in n-dimensional space
with a value at a particular coordinate. The least squares support vector machine
is the most used classification algorithm in R.
Decision tree
The decision tree is a supervised learning algorithm used for classification and
regression tasks. In R, the decision tree classifier is implemented using R
machine learning packages such as caret. The Random Forest algorithm is the most used
decision tree algorithm in R.

16.5 CLUSTERING
Clustering in the R programming language is an unsupervised learning
technique that divides a dataset into multiple groups, called clusters, based on
the similarity of their members. After segmenting the data, multiple data clusters
are generated. All objects in a cluster have common properties. Clustering is
used in data mining and analysis to find similar datasets.
Clustering application in the R programming language
• Marketing: In R programming, clustering is useful for marketing. This
helps identify market patterns and, therefore, find potential buyers. By
identifying customer interests through clustering and displaying the
same products of interest, you can increase your chances of buying a
product.
• Internet: Users browse many websites based on their interests.
Browsing history can be aggregated and clustered, and a user profile is
generated based on the results of the clustering.
• Games: You can also use clustering algorithms to display games based
on your interests.
• Medicine: In the medical field, every day there are new inventions of
medicines and treatments. Occasionally, new species are discovered by
researchers and scientists. Those categories can be easily found by using
a clustering algorithm based on their similarity.
Clustering method
There are two types of clustering in R programming.
• Hard clustering: With this type of clustering, each data point is assigned
to exactly one cluster, i.e., it either belongs to a cluster entirely or not at
all. The algorithm used for hard clustering is k-means clustering.
• Soft clustering: Soft clustering assigns to each data point a probability or
possibility of belonging to each cluster rather than placing it in a single
cluster. All data points have a certain probability of being present in all
clusters. The algorithm used for soft clustering is fuzzy clustering or soft
k-means.
K-Means clustering in the R programming language
K-Means is an iterative hard clustering technique that uses an unsupervised
learning algorithm. The total number of clusters is predefined by the user, and
the data points are clustered based on the similarity of each data point. This
algorithm also detects the center of gravity of the cluster.
Algorithm:
1. Specify the number of clusters (k). Let's look at an example with k = 2 and
5 data points.
2. Randomly assign each data point to a cluster. Assume the yellow and blue
colors in Figure 16.10 show two clusters with their respective randomly
assigned data points.
3. Calculate the centroid of each cluster.
4. Re-assign each data point to the nearest cluster centroid; for example, a blue
data point is assigned to the yellow cluster because it is close to the centre
of gravity of the yellow cluster.
5. Re-compute the cluster centroids and repeat until the assignments no longer
change.

Figure 16.10: Clustering

Syntax: kmeans(x, centers, nstart)


where,
• x represents numeric matrix or data frame object
• centers represents the K value or distinct cluster centers
• nstart represents number of random sets to be chosen

Input Data and loading the necessary packages in R: For the clustering, we are
using the iris data set. The dataset contains 3 classes of 50 instances each,
where each class refers to a type of iris plant.

Fitting the K means clustering model
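A minimal sketch that should roughly reproduce the results quoted below
(clustering on the four numeric columns of iris) is:

set.seed(240)   # for reproducible cluster assignments
kmeans.re <- kmeans(iris[, -5], centers = 3, nstart = 20)
print(kmeans.re)                               # cluster sizes and between_SS / total_SS
print(table(iris$Species, kmeans.re$cluster))  # confusion matrix against the actual species
plot(iris[, c("Sepal.Length", "Sepal.Width")], col = kmeans.re$cluster,
     main = "K-means with 3 clusters")
points(kmeans.re$centers[, c("Sepal.Length", "Sepal.Width")],
       col = 1:3, pch = 8, cex = 2)            # mark the cluster centres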


Result:

Three clusters of sizes 50, 62 and 38 respectively are formed. The ratio of the
between-cluster sum of squares to the total sum of squares is 88.4%.
Confusion Matrix:
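A sketch of how such a matrix can be produced, continuing with the
kmeans_model object defined in the sketch above:

# Cross-tabulate the true species labels against the assigned cluster numbers
cm <- table(iris$Species, kmeans_model$cluster)
print(cm)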

The confusion matrix suggests that all 50 setosa observations are correctly
placed in their own cluster. Of the 62 observations in the versicolor cluster, 48
are correctly grouped versicolor and 14 are virginica that have been incorrectly
grouped with them. Of the 38 observations in the virginica cluster, 36 are
correctly grouped virginica and 2 are versicolor that have been incorrectly
grouped with them.
Model Evaluation and Visualization
Code:
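The original listing is not reproduced; one possible version, reusing the variable
names from the earlier sketch, plots each observation colored by its assigned
cluster:

# Plot sepal length against sepal width, coloring points by cluster
plot(iris_features[, c("Sepal.Length", "Sepal.Width")],
     col = kmeans_model$cluster,
     main = "K-means clustering of iris with 3 clusters")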

Output:

The plot shows the 3 clusters in 3 different colors.


Plotting Cluster Centres
Code:
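Again, a possible version under the same naming assumptions as above:

# The fitted centers are stored in kmeans_model$centers;
# overlay them on the previous plot as large cross-like symbols,
# using the same color coding as the clusters
points(kmeans_model$centers[, c("Sepal.Length", "Sepal.Width")],
       col = 1:3, pch = 8, cex = 3)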

Output:


The plot shows the centers of the clusters, which are marked with a cross sign in
the same color as the corresponding cluster.
Visualizing Clusters
Code:
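A possible version using clusplot() from the cluster package (the exact
arguments are an assumption, not the original listing):

library(cluster)

# Draw the clusters on the sepal length / sepal width plane,
# with shaded and colored ellipses around each cluster
clusplot(iris_features[, c("Sepal.Length", "Sepal.Width")],
         kmeans_model$cluster,
         lines = 0, shade = TRUE, color = TRUE,
         labels = 2, plotchar = FALSE, span = TRUE,
         main = "Clusters of iris observations",
         xlab = "Sepal.Length", ylab = "Sepal.Width")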

Output:


The plot shows the 3 clusters formed, plotted against sepal length and sepal width.
Check Your Progress 2
1. What is the difference between Classification and Clustering?
………………………………………………………………………………

………………………………………………………………………………
2. Can decision trees be used for performing clustering?
………………………………………………………………………………

………………………………………………………………………………
3. What is the minimum no. of variables/ features required to perform
clustering?
………………………………………………………………………………

………………………………………………………………………………

16.6 ASSOCIATION RULES

Association Rule Mining in the R language is an unsupervised, non-linear
algorithm used to discover how items are associated with one another. Frequent
itemset mining shows which items appear together in a transaction. Its major
usage is in retail, grocery stores and online platforms, i.e. businesses having a
large transactional database. In the same way, online social media and
e-commerce websites predict what you may buy next using recommendation
engines; the recommendations you see for related items while you check out an
order are the result of association rule mining applied to past user data. There
are three common ways to measure association:
• Support
• Confidence
• Lift

Theory
In association rule mining, Support, Confidence, and Lift measure association.

[E1] Buy Product A => [E2] Buy Product B


Support (Rule) = P(E1 and E2) = Probability of buying both the products A
and B.
Confidence (Rule) = P(E2|E1) = Probability of buying product B given that
product A has already been bought.
Interpreting Support & Confidence of a Rule:
Computer => Antivirus software [support = 2%, confidence = 60%]
Computer: Antecedent & Antivirus Software: Consequent

Support: 2% of all the transactions under analysis show that the computer and
antivirus software are purchased together.
Confidence: 60% of the customers who purchased a computer also bought the
software.

Lift: A measure of association. The occurrence of itemset A is independent of
the occurrence of itemset B if P(A and B) = P(A).P(B); otherwise the two
itemsets are dependent and correlated. Lift is defined as:

Lift = P(A and B) / (P(A).P(B))

If Lift < 1, buying A and buying B are negatively associated.
If Lift = 1, buying A and buying B are not associated (they are independent).
If Lift > 1, buying A and buying B are positively associated.
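As a small worked illustration with made-up numbers: if 2% of transactions
contain both A and B, 5% contain A and 10% contain B, then
Lift = 0.02 / (0.05 x 0.10) = 4, i.e. a strong positive association. The same
arithmetic in R:

# Hypothetical probabilities, chosen only for illustration
p_a_and_b <- 0.02   # P(A and B): both items bought together
p_a <- 0.05         # P(A)
p_b <- 0.10         # P(B)

support <- p_a_and_b              # support of the rule A => B
confidence <- p_a_and_b / p_a     # P(B | A) = 0.4
lift <- p_a_and_b / (p_a * p_b)   # 4, i.e. a positive association

c(support = support, confidence = confidence, lift = lift)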

Packages in R:
Installing Packages
install.packages("arules")
install.packages("arulesViz")

Syntax:
associa_rules = apriori(data = dataset, parameter = list(support = x,
confidence = y))

Installing the relevant packages and the dataset

The first 2 transactions and the items involved in each transaction can be
observed below.
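The original code listing is not reproduced here; a minimal sketch, assuming the
Groceries transaction dataset that ships with the arules package (the dataset
actually used in the source may differ):

library(arules)
library(arulesViz)

# Load the example transaction data bundled with arules
data("Groceries")

# Inspect the first 2 transactions and the items they contain
inspect(head(Groceries, 2))

# Mine association rules with minimum support and confidence thresholds
rules <- apriori(data = Groceries,
                 parameter = list(support = 0.01, confidence = 0.5))
summary(rules)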


The algorithm generated 15 rules with the given constraints.


minval is the minimum value of support that an itemset must satisfy in order to
be part of a rule.
smax is the maximum support value.
arem is an additional rule evaluation parameter; here we have supplied support
and confidence as the constraints, but several other evaluation measures can be
specified through the arem parameter.

The top 3 rules sorted by confidence are shown below:
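A sketch of how these rules can be extracted, continuing from the hypothetical
rules object above:

# Sort the mined rules by confidence and display the top 3
inspect(head(sort(rules, by = "confidence"), 3))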

16.7 SUMMARY

This unit explores the concepts of advanced data analysis and their application
in R. It explains the Decision Tree model and its use in R; a decision tree
represents decisions and their outcomes in a tree format. The various types of
decision trees fall under 2 categories based on the target variable, i.e.
categorical and continuous. Partitioning, pruning and tree selection are the steps
involved in the working of a decision tree, and other factors to consider when
choosing a tree in R are entropy and information gain. The unit further explains
in detail the concept of Random Forest and its application in R. A random
forest is a set of decision trees used in supervised learning for classification and
regression analysis, but primarily for classification. It is a non-linear
classification algorithm that takes samples of observations and random initial
variables (columns) and attempts to build a model. In the subsequent sections,
the unit explains the details of classification algorithms, their features and
types. In R, the main types of classification algorithms are the linear classifier,
the support vector machine and the decision tree; linear classification
algorithms can further be of 2 types, namely logistic regression and the Naive
Bayes classifier. The unit then explains clustering and its application in R
programming. Clustering in the R programming language is an unsupervised
learning technique that divides a dataset into multiple groups (clusters) on the
basis of similarity. It is used in data mining and analysis to find similar data
points and has applications in marketing, the internet, gaming, medicine, etc.
There are two types of clustering in R programming, namely hard clustering
and soft clustering. The unit also discusses K-Means clustering in the R
programming language, which is an iterative hard clustering technique based
on an unsupervised learning algorithm. Finally, the unit discusses the theory of
association rules and their application in R. Association rule mining is an
unsupervised, non-linear algorithm used to discover how items are associated
with one another, and it has major usage in retail, grocery stores and online
platforms, i.e. businesses with large transactional databases. In association rule
mining, support, confidence and lift measure the association.

16.8 ANSWERS
Check Your Progress 1

1. Yes, Random Forest can be used for both continuous as well as categor-
ical target (dependent) variables. In a random forest, i.e. a combination
of decision trees, a classification model is built when the dependent
variable is categorical, and a regression model is built when the
dependent variable is numeric or continuous. However, random forest is
mainly used for classification problems.

2. The Out-of-Bag (OOB) error acts as the validation or test data. Random
forests do not require a separate test dataset to validate the results; the
error is estimated internally during the execution of the algorithm in the
following way: because the forest is built on training data, each tree is
tested on roughly 1/3 (about 36.8%) of the samples that were not used to
build that tree (similar to a validation dataset). This is known as the
out-of-bag error estimate, and it is simply an internal error estimate for
the random forest being built.

3. Random forest is one of the most popular and widely used machine
learning algorithms for classification problems. It can also be used for
regression problem statements, but it works particularly well with
classification models. Its strong predictive performance has made it a
powerful tool for modern data scientists. The best part of the algorithm
is that it involves very few assumptions, so preparing the data is not too
difficult and it saves time.

Check Your Progress 2


1. Classification involves placing data into pre-defined categories, whereas in
clustering the set of categories into which the data is to be grouped is not known
beforehand.

2. True, decision trees can also be used to form clusters in the data, but
clustering often generates natural clusters and is not dependent on any objective
function.

3. At least a single variable is required to perform clustering analysis.
Clustering analysis with a single variable can be visualized with the help of a
histogram.

16.9 REFERENCES AND FURTHER READINGS

1. De Vries, A., & Meys, J. (2015). R for Dummies. John Wiley & Sons.
2. Peng, R. D. (2016). R programming for data science (pp. 86-181). Victoria, BC, Canada:
Leanpub.
3. Schmuller, J. (2017). Statistical Analysis with R For Dummies. John Wiley & Sons.
4. Field, A., Miles, J., & Field, Z. (2012). Discovering statistics using R. Sage publications.
5. Lander, J. P. (2014). R for everyone: Advanced analytics and graphics. Pearson
Education.
6. Lantz, B. (2019). Machine learning with R: expert techniques for predictive modeling.
Packt publishing ltd.
7. Heumann, C., & Schomaker, M. (2016). Introduction to statistics and data analysis.
Springer International Publishing Switzerland.
8. Davies, T. M. (2016). The book of R: a first course in programming and statistics. No
Starch Press.
9. https://www.tutorialspoint.com/r/index.html
10. https://www.guru99.com/r-decision-trees.html
11. https://codingwithfun.com/p/r-language-random-forest-algorithm/
12. https://towardsdatascience.com/association-rule-mining-in-r-ddf2d044ae50
13. https://www.geeksforgeeks.org/k-means-clustering-in-r-programming/
