MCS 226
1.0 INTRODUCTION
The Internet and communication technologies have grown tremendously in the past decade,
leading to the generation of large amounts of unstructured data. This unstructured data
includes unformatted textual, graphical, video and audio data, which is being generated
as a result of people's use of social media and mobile technologies. In addition, with
the tremendous growth in the digital ecosystem of organisations, a large amount of
semi-structured data, such as XML data, is also being generated at a rapid rate.
All such data is in addition to the large amount of data that results from organisational
databases and data warehouses. This data may be processed in real time to support the
decision-making processes of various organisations. The discipline of data science focuses
on the processes of collection, integration and processing of large amounts of data to
produce information that supports informed decision making.
This unit introduces you to the basic concepts of data science. It provides an
introduction to the different types of data used in data science and points to the
different types of analysis that can be performed using data science. Further, the unit
also introduces some of the common mistakes made in data science.
1.1 OBJECTIVES
At the end of this unit you should be able to:
• define the term data science in the context of an organisation;
• explain different types of data;
• list and explain different types of analysis that can be performed on data;
• explain the common misconceptions about data size;
• define the concept of data dredging;
• list some of the applications of data science;
• define the life cycle of data science.
1.2 DATA SCIENCE-DEFINITION
Data Science is a multi-disciplinary science with the objective of performing data analysis
to generate knowledge that can be used for decision making. This knowledge can be
in the form of identified patterns, predictive models, forecasting models, etc.
A data science application collects data and information from multiple heterogeneous
sources; cleans, integrates, processes and analyses this data using various tools; and
presents information and knowledge in various visual forms.
Figure 1: Data Science as a multi-disciplinary field, combining computing (programming, visualization, machine learning, big data, database systems), mathematics (modelling and simulation, statistics) and domain knowledge
What are the advantages of data science in an organisation? The following are some
of the areas in which data science can be useful:
• It helps in making business decisions, such as assessing the health of
companies with whom an organisation plans to collaborate.
• It may help in making better predictions for the future, such as making
strategic plans for the company based on present trends.
• It may identify similarities among various data patterns, leading to
applications like fraud detection, targeted marketing, etc.
In general, data science is a way forward for business decision making, especially in
the present-day world, where data is being generated at the rate of zettabytes.
Data science can be used in many organisations; some of its possible uses are given below:
• It has great potential for finding the best dynamic route from a source to a
destination. Such an application may constantly monitor the traffic flow and
predict the best route based on the collected data.
• It may bring down the logistics costs of an organisation by suggesting the best
time and route for transporting goods.
• It can minimise marketing expenses by identifying groups with similar buying
patterns and performing selective advertising based on the data obtained.
• It can help in making public health policies, especially in the case of
disasters.
• It can be useful in studying the environmental impact of various
developmental activities.
• It can be very useful in saving resources in smart cities.
1.3 TYPES OF DATA
The type of data is one of the important aspects that determines the kind of
analysis to be performed on it. In data science, the following different types of
data are required to be processed:
1. Structured Data
2. Semi-Structured Data
3. Unstructured Data
4. Data Streams
Structured Data
Since the start of the era of computing, the computer has been used as a data
processing device. However, it was not until the 1960s that businesses started
using computers for processing their data. One of the most popular languages of
that era was the Common Business-Oriented Language (COBOL). COBOL had a
data division, which was used to represent the structure of the data being processed.
This was followed by a disruptive, seminal technology design by E.F. Codd,
which led to the creation of relational database management systems
(RDBMS). An RDBMS allows structured storage, retrieval and processing of
integrated data of an organisation that can be securely shared among several
applications. RDBMS technology also supported secure transactions and thus
became a major source of data generation. Figure 2 shows the sample structure
of data that may be stored in a relational database system. One of the key
characteristics of structured data is that it can be associated with a schema. In
addition, each schema element may be related to a specific data type.
Relational data is structured data, and a large amount of this structured data is
being collected by various organisations as the backend of most applications. In
the 1990s, the concept of a data warehouse was introduced. A data warehouse is a
subject-oriented, integrated, time-variant and non-volatile collection of data of an
organisation that can be used for decision making. Data in a data warehouse is
represented using dimension tables and fact tables; the dimension tables classify
the data of the fact tables. You have already studied various schemas in the context
of data warehouses in MCS221. The data of a data warehouse is also structured in
nature and can be used for analytical data processing and data mining. In addition,
many different types of database management systems have been developed,
which mostly store structured data.
However, with the growth of communication and mobile technologies, many
different applications became very popular, leading to the generation of very large
amounts of semi-structured and unstructured data. These are discussed next.
Semi-structured Data
As the name suggests, semi-structured data has some structure in it. The structure of
semi-structured data is due to the use of tags or key/value pairs. Common forms of
semi-structured data are produced through XML, JSON objects, server logs, EDI data,
etc. An example of semi-structured data is shown in Figure 3.
XML:
<Book>
  <title>Data Science and Big Data</title>
  <author>R Raman</author>
  <author>C V Shekhar</author>
  <yearofpublication>2020</yearofpublication>
</Book>
JSON:
"Book": {
  "Title": "Data Science",
  "Price": 5000,
  "Year": 2020
}
Figure 3: Sample semi-structured data
Unstructured Data
Unstructured data does not follow any schema definition. For example, written
text, like the content of this unit, is unstructured. You may add certain headings
or metadata to unstructured data. In fact, the growth of the Internet has resulted
in the generation of zettabytes of unstructured data. Some examples of
unstructured data are listed below:
• Large written textual data, such as email data, social media data, etc.
• Unprocessed audio and video data
• Image data and mobile data
• Unprocessed natural speech data
• Unprocessed geographical data
In general, this data requires huge storage space, newer processing methods
and faster processing capabilities.
Data Streams
A data stream is characterised by a sequence of data over a period of time.
Such data may be structured, semi-structured or unstructured, but it gets
generated repeatedly. For example, an IoT device like a weather sensor will
generate a data stream of pressure, temperature, wind direction, wind speed,
humidity, etc. for the particular place where it is installed. Such data is huge
and, for many applications, is required to be processed in real time. In general,
not all the data of a stream is required to be stored, and such data is usually
processed only for a specific duration of time.
1.3.1 Statistical Data Types
There are two distinct types of data that can be used in statistical analysis:
categorical data and quantitative data.
Categorical data is used to define the category of data; for example, the occupation
of a person may take values from the categories “Business”, “Salaried”, “Others”,
etc. Categorical data can be of two distinct measurement scales, called
Nominal and Ordinal, which are given in Figure 4. If the categories are not
related, then the categorical data is of the Nominal type; for example, the
Business and Salaried categories have no relationship, therefore occupation is
of the Nominal type. However, a categorical variable like age category, defining
age in the categories “0 or more but less than 26”, “26 or more but less than 46”,
“46 or more but less than 61” and “More than 61”, has a specific relationship. For
example, persons in the age category “More than 61” are older than persons in any
other age category.
Quantitative Data:
Quantitative data is numeric data, which can be measured on different scales.
Quantitative data is also of two basic types: discrete, which represents distinct
numbers like 2, 3, 5, …; or continuous, which represents continuous values of a
given variable. For example, your height can be measured using a continuous
scale.
Data are raw facts; for example, student data may include the name, gender, age
and height of a student. The name typically is identifying data that tries to
distinctly identify two data items, just like a primary key in a database.
However, the name or any other identifying data may not be useful for
performing data analysis in data science. Data such as Gender, Age and Height
may be used to answer queries of the kind: Is there a difference in the height of
boys and girls in the age range 10-15 years? One of the important questions is:
how do you measure the data so that it is recorded consistently? Stanley
Stevens, a psychologist, defined four scales on which data can be measured:
nominal, ordinal, interval and ratio. For example, weight can be measured on a
ratio scale, as it has an absolute zero value, whereas the intelligence quotient cannot
be defined as zero and is therefore measured on an interval scale.
1.3.2 Sampling
In general, the size of the data that is to be processed today is quite large. This
leads you to the question of whether you would use the entire data or some
representative sample of it. In several data science techniques, sample data is
used to develop an exploratory model. Thus, even in data science, sampling is
one of the ways to enhance the speed of exploratory data analysis. The population,
in this case, may be the entire set of data that you may be interested in. Figure 5
shows the relationship between population and sample. One of the questions
asked in this context is: what should be the size of a good sample? You may have
to find the answer in the literature. However, you may please note that a good
sample is representative of its population.
Figure 5: Population and sample
One of the key objectives of statistics, which uses sample data, is to determine
the statistic of the sample and find the probability that the statistic computed
for the sample would estimate the parameters of the population with a specific
percentage of accuracy. Please note that the terms stated above are very important
and are explained in the following table:
3. What would be the measurement scale for the following? Give reason
in support of your answer.
Age, AgeCategory, Colour of eye, Weight of students of a class, Grade
of students, 5-point Likert scale
Descriptive analysis is used to present basic summaries about data; however, it makes
no attempt to interpret the data. These summaries may include different statistical
values and certain graphs. Different types of data are described in different ways.
The following example illustrates this concept:
Example 1: Consider the data given in the following Figure 6. Show the summary of
categorical data in this Figure.
Enrolment Number   Gender   Height (cm)
S20200005          M        173
S20200006          M        160
S20200007          M        180
S20200008          F        178
S20200009          F        167
S20200010          M        173
Figure 6: Sample Height Data
Please note that the enrolment number variable need not be used in the analysis, so no
summary data for the enrolment number is to be found.
In addition, you can draw a bar chart or pie chart to describe the data of the Gender
variable. The pie chart for this data is shown in Figure 7. Details of different
charts are explained in Unit 4. In general, you draw a bar graph when the number
of categories is large.
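To make the idea concrete, here is a minimal sketch in Python (using the pandas library, which is assumed to be available); the Gender values are hypothetical stand-ins in the spirit of Figure 6, not the full data set.

import pandas as pd

# Hypothetical Gender values, similar in spirit to the Gender column of Figure 6
gender = pd.Series(["M", "M", "M", "F", "F", "M"], name="Gender")

# Frequency counts and relative frequencies: the summary behind a bar or pie chart
print(gender.value_counts())
print(gender.value_counts(normalize=True))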
The median of the data is the middle value of the sorted data. First, the data is sorted,
and then the median is computed using the following formula:
If n is even, then
    median = [Value at the (n/2)th position + Value at the ((n/2) + 1)th position] / 2
If n is odd, then
    median = Value at the ((n + 1)/2)th position
For this example, the sorted data is as follows:
You may please note that outliers, which are defined as values highly different from
most other values, can impact the mean but not the median. For example, if one
observation in the data of Example 2 is changed to a much larger value, then the
median will still remain 14; however, the mean will change to 20.64, which is
quite different from the earlier mean. Thus, you should be careful about the presence
of outliers while performing data analysis.
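The effect of an outlier on the mean and the median can be checked with a few lines of Python; the observations below are hypothetical stand-ins (the full data of Example 2 is not reproduced here).

import statistics

# Hypothetical observations; the second list replaces one value with an extreme outlier
values = [4, 9, 10, 14, 14, 18, 19, 25]
values_with_outlier = [4, 9, 10, 14, 14, 18, 19, 250]

print(statistics.mean(values), statistics.median(values))                              # 14.125, 14
print(statistics.mean(values_with_outlier), statistics.median(values_with_outlier))    # 42.25, 14
# The mean shifts markedly because of the outlier, while the median does not move.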
Interestingly, the mean and median may be useful in determining the nature of the data. The
following table describes these conditions:
Mean >> Median : The distribution may be right-skewed
Mean << Median : The distribution may be left-skewed
Mean ≈ Median : The distribution may be nearly symmetric
Mode: The mode is defined as the most frequent value of a set of observations. For
example, in the data of Example 2, the value 14, which occurs twice, is the mode. The
mode value need not be a mid-value; rather, it can be any value among the observations. It
just communicates the most frequently occurring value. In a frequency graph, the
mode is represented by the peak of the data. For example, in the graphs shown in Figure 8,
the value corresponding to the peak is the mode.
Standard Deviation: The standard deviation is one of the most used measures for finding
the spread or variability of data. It can be computed as:
For a sample:
    s = sqrt( (1/(n - 1)) × Σ (x - x̄)² ), where the sum is over all n observations
For the population:
    σ = sqrt( (1/N) × Σ (x - μ)² ), where the sum is over all N observations
(Try both formulas on the data of Example 2 and match the answer: 6.4)
5-Point Summary and Interquartile Range (IQR): For creating the 5-point summary, you
first need to sort the data. The five-point summary is defined as follows (values
computed for the sorted data of Example 2):
    Minimum Value (Min):                    Min = 4
    1st Quartile, <= 25% of values (Q1):    Q1 = (9 + 10)/2 = 9.5
    2nd Quartile, the median (M):           M = 14
    3rd Quartile, <= 75% of values (Q3):    Q3 = (18 + 19)/2 = 18.5
    Maximum Value (Max):                    Max = 25
    IQR, the difference between the 3rd and 1st quartile values:  IQR = 18.5 - 9.5 = 9
Figure 9: The Measures of Spread or Variability
The IQR can also be used to identify suspected outliers. In general, a suspected outlier
can exist in the following two ranges:
Observations/values less than Q1 – 1.5 × IQR
Observations/values more than Q3 + 1.5 × IQR
For Example 2, IQR is 9 and 1.5 × IQR = 13.5; therefore, suspected outliers would be:
values < (9.5 – 13.5), i.e. values < –4, or values > (18.5 + 13.5), i.e. values > 32.
Thus, there is no outlier in the initial data of Example 2.
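A minimal Python sketch of the 5-point summary and the IQR rule is given below; it uses numpy (assumed available) and hypothetical sorted observations, and numpy's default quartile interpolation may differ slightly from the hand computation shown in Figure 9.

import numpy as np

# Hypothetical sorted observations (not the actual data of Example 2)
data = np.array([4, 9, 10, 14, 14, 18, 19, 25])

q1, median, q3 = np.percentile(data, [25, 50, 75])
iqr = q3 - q1

# Suspected outliers lie below Q1 - 1.5*IQR or above Q3 + 1.5*IQR
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
print(data.min(), q1, median, q3, data.max())
print(iqr, data[(data < lower) | (data > upper)])   # IQR and any suspected outliers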
For quantitative data, you may draw various plots, such as the histogram, box plot, etc.
These plots are explained in Unit 4 of this block.
1. As a first step, you may perform descriptive analysis of the various categorical and
quantitative variables of your data. Such information is very useful in
determining the suitability of the data for the purpose of analysis. This may also
help you in data cleaning, modification and transformation of the data.
a. For the qualitative (categorical) data, you may create frequency tables and bar
charts to know the distribution of data among different categories. A
balanced distribution of data among categories is most desirable.
However, such a distribution may not be possible in actual situations.
Several methods have been suggested to deal with such situations.
Some of those will be discussed in later units.
b. For the quantitative data, you may compute the mean, median,
standard deviation, skewness and kurtosis. The kurtosis value relates
to the peakedness of the data. In addition, you may also
draw charts like the histogram to look into the frequency distribution.
2. Next, after performing the univariate analysis, you may try to perform some
bivariate analysis. Some of the basic statistics that you can use for bivariate
analysis include the following:
a. Make two-way tables between categorical variables and related
stacked bar charts. You may also use chi-square testing to find any
significant relationships.
b. You may draw side-by-side box plots to check if the data of various
categories differ.
c. You may draw a scatterplot and check the correlation coefficient, if any,
between two quantitative variables.
3. Finally, you may like to look into the possibilities of multivariate
relationships among the data. You may use dimensionality reduction through
techniques such as feature extraction or principal component analysis, you may
perform clustering to identify possible sets of classes in the solution space, or
you may use graphical tools, like bubble charts, to visualise the data. A brief
illustration of some of these steps is sketched after this list.
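The sketch below is a minimal illustration of these exploratory steps in Python using pandas (assumed available); the data frame and variable names are hypothetical.

import pandas as pd

# Hypothetical data with one categorical and two quantitative variables
df = pd.DataFrame({
    "Gender": ["M", "F", "M", "F", "M", "F", "M", "F"],
    "Height": [173, 160, 180, 158, 175, 162, 170, 165],
    "Weight": [68, 55, 75, 52, 70, 58, 66, 60],
})

# Step 1: univariate summaries
print(df["Gender"].value_counts())               # frequency table of the categorical variable
print(df[["Height", "Weight"]].describe())       # mean, std, quartiles of quantitative variables

# Step 2: bivariate analysis
print(df.groupby("Gender")["Height"].describe()) # compare a quantitative variable across categories
print(df["Height"].corr(df["Weight"]))           # correlation coefficient between two quantitative variables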
It may be noted that exploratory data analysis helps in identifying possible
relationships among data, but it does not promise that a causal relationship exists
among the variables. Causal relationships have to be ascertained through qualitative
analysis. Let us explain exploratory data analysis with the help of an example.
Example 3: Consider the sample data of students given in Figure 6 about Gender and
Height. Let us explore this data for an analytical question: Does Height depend on
Gender?
You can perform exploratory analysis on this data by drawing a side-by-side box
plot of the heights of male and female students. This box plot is shown in Figure 10.
Please note that the box plot of Figure 10 shows that, on average, the height of male
students is more than that of female students. Does this result apply, in general, to the
population? For answering this question, you need to find the probability of
occurrence of such sample data; therefore, inferential analysis may need to be performed.
You may have read about many of these tests in the Data Warehousing and Data Mining
and the Artificial Intelligence and Machine Learning courses. In addition, you may refer to
the further readings for these tools. The following example explains the importance of
inferential analysis.
Example 4: Figure 10 in Example 3 shows the box plot of the heights of male and female
students. Can you infer from the box plot and the sample data (Figure 6) whether there is a
difference in the height of male and female students?
In order to infer whether there is a difference between the heights of the two groups (male
and female students), a two-sample t-test was run on the data. The output of this t-test is
shown in Figure 12.
t-Test (two tail): Assuming Unequal Variances
Female Male
Mean 167 173
Variance 94.5 63.5
Observations 5 5
Computed t-value -1.07
p-value 0.32
Critical t-value 2.30
Figure 12 shows that the mean height of the female students is 167 cm, whereas for
the male students it is 173 cm. The variance for the female students is 94.5, whereas for
the male students it is 63.5. Each group has 5 observations. The
computed t-value is -1.07 and the p-value is 0.32. As the p-value is greater than 0.05,
you cannot conclude that the average male student height is different from the average
female student height.
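The same kind of two-sample t-test can be run in Python with scipy (assumed available); the two height lists below are hypothetical stand-ins, since the full data of Figure 6 is not reproduced here.

from scipy import stats

# Hypothetical heights (cm) of two groups of 5 students each
female = [178, 167, 155, 170, 165]
male = [173, 160, 180, 173, 179]

# Welch's two-sample t-test (unequal variances), as in Figure 12
t_stat, p_value = stats.ttest_ind(female, male, equal_var=False)
print(t_stat, p_value)   # if p_value > 0.05, you fail to reject the null hypothesis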
The availability of large amounts of data and of advanced algorithms for mining and
analysing large data sets has led the way to advanced predictive analysis. The predictive
analysis of today uses tools from artificial intelligence, machine learning, data
mining, data stream processing, data modelling, etc. to make predictions for the strategic
planning and policies of organisations. Predictive analysis uses large amounts of data
to identify potential risks and aid the decision-making process. It can be used in
several data-intensive industries like electronic marketing, financial analysis,
healthcare applications, etc. For example, in the healthcare industry, predictive
analysis may be used to determine the public health infrastructure
requirements for the future based on present health data.
Advancements in artificial intelligence, data modelling and machine learning have also led
to prescriptive analysis. Prescriptive analysis aims to take predictions one step
forward and suggest solutions to present and future issues.
A detailed discussion on these topics is beyond the scope of this Unit. You may refer
to further readings for more information on these.
Figure 13: Correlation does not mean causation (Study causes both Attendance and Marks; the association observed between Attendance and Marks is not a causal one)
Data Dredging: Data dredging, as the name suggests, is the extensive analysis of very
large data sets. Such analysis results in the generation of a large number of data
associations. Many of those associations may not be causal and thus require further
exploration through other techniques. Therefore, it is essential that every data
association found in a large data set is investigated further before reporting it as a
conclusion of the study.
1.6 APPLICATIONS OF DATA SCIENCE
Data science is useful in analysing large data sets to produce useful information that
can be used for business development and can help in the decision-making process. This
section highlights some of the applications of data science.
In general, data science can be used for the benefit of society. It should be used
creatively to improve the effective resource utilization, which may lead to sustainable
development. The ultimate goal of data science applications should be to help us
protect our environment and human welfare.
1.7 DATA SCIENCE LIFE CYCLE
So far, we have discussed about various aspects of data science in the previous
sections. In this section, we discuss about the life cycle of a data science based
application. In general, a data science application development may involve the
following stages:
Thus, in general, data science project follows a spiral of development. This is shown
in Figure 16.
Figure 16: The spiral life cycle of a data science project (phases include requirements analysis, data collection and preparation, and model deployment and refinement)
1.8 SUMMARY
This unit introduces basic statistical and analytical concepts of data science. The unit
first introduces you to the definition of data science. Data science as a discipline
uses concepts from computing, mathematics and domain knowledge. The types of
data for data science are defined in two different ways: first, on the basis
of the structure and generation rate of the data, and next in terms of the scales that can be
used to measure the data. In addition, the concept of sampling has been defined in this
unit.
This unit also explains some of the basic methods used for analysis, which include
descriptive, exploratory, inferential and predictive analysis. A few interesting misconceptions
related to data science have also been explained with the help of examples. This unit
also introduces you to some of the applications of data science and the data science life
cycle. In this era of ever-advancing technology, it is suggested that you keep reading about
newer data science applications.
1.9 SOLUTIONS/ANSWERS
1. Box plots show the 5-point summary of data. A well-spread box plot is an
indicator of normally distributed data. Side-by-side box plots can be
used to compare the scale data values of two or more categories.
2. Inferential analysis also computes the p-value, which determines whether the
results obtained by exploratory analysis are significant enough that the
results may be applicable to the population.
3. Simpson's paradox signifies that statistics computed on grouped data may
sometimes produce results that are contrary to the same statistics applied to
ungrouped data.
UNIT 2 PROBABILITY AND STATISTICS FOR
DATA SCIENCE
2.0 Introduction
2.1 Objectives
2.2 Probability
2.2.1 Conditional Probability
2.2.2 Bayes Theorem
2.3 Random Variables and Basic Distributions
2.3.1 Binomial Distribution
2.3.2 Probability Distribution of Continuous Random Variable
2.3.3 The Normal Distribution
2.4 Sampling Distribution and the Central Limit Theorem
2.5 Statistical Hypothesis Testing
2.5.1 Estimation of Parameters of the Population
2.5.2 Significance Testing of Statistical Hypothesis
2.5.3 Example using Correlation and Regression
2.5.4 Types of Errors in Hypothesis Testing
2.6 Summary
2.7 Solution/Answers
2.0 INTRODUCTION
In the previous unit of this block, you were introduced to the basic concepts of
data science, which include the basic types of data, basic methods of data
analysis, and the applications and life cycle of data science. This unit introduces
you to the basic concepts of probability and statistics related to data
science.
It introduces the concept of conditional probability and Bayes theorem. This is
followed by a discussion of basic probability distributions, highlighting their
significance and use. These distributions include the Binomial and Normal
distributions, the two most used distributions for discrete and continuous
variables respectively. The unit also introduces you to the concepts of the sampling
distribution and the central limit theorem. Finally, this unit covers the concepts of
statistical hypothesis testing with the help of an example on correlation. You
may refer to the further readings for more details on these topics, if needed.
2.1 OBJECTIVES
2.2 PROBABILITY
The probability of an event E is defined as:
    P(E) = (Number of outcomes favourable to E) / (Total number of possible outcomes)    (1)
In the equation above, the set of all possible outcomes is also called the sample
space. In addition, it is expected that all the outcomes are equally likely to occur.
Consider that you decide to roll two fair dice together at the same time. Will
the outcome of the first die affect the outcome of the second die? It will not, as
both outcomes are independent of each other. In other words, two trials
are independent if the outcome of the first trial does not affect the outcome of
the second trial and vice versa; else the trials are dependent trials.
How do we compute the probability of more than one event in a sample space? Let
us explain this with the help of an example.
Example 1: A fair die having six equally likely outcomes is to be thrown, then:
(i) What is the sample space? {1, 2, 3, 4, 5, 6}
(ii) An event A is "die shows 2"; then the outcome of event A is {2}, and its
probability is P(A) = 1/6.
(iii) An event B is "die shows an odd face"; then event B is {1, 3, 5}, and the
probability of event B is P(B) = 3/6 = 1/2.
(iv) An event C is "die shows an even face"; then event C is {2, 4, 6}, and the
probability of event C is P(C) = 3/6 = 1/2.
(v) Events A and B are disjoint events, as no outcome is common between them.
So are events B and C. But events A and C are not disjoint.
(vi) The intersection of events A and B is the null set {}, as they are disjoint
events; therefore, the probability that events A and B both occur,
viz. P(A ∩ B), is 0. However, the intersection of A and C is {2}; therefore,
P(A ∩ C) = 1/6.
(vii) The union of events A and B is {1, 2, 3, 5}; therefore, the
probability that event A or event B occurs, viz. P(A ∪ B), is 4/6 = 2/3.
Whereas the union of events B and C is {1, 2, 3, 4, 5, 6}; therefore,
P(B ∪ C) = 6/6 = 1.
Please note that the following formula can be derived from the above example.
Probability of occurrence of any of the two events A or B (also called union of
events) is:
𝑃(𝐴 ∪ 𝐵) = 𝑃(𝐴) + 𝑃(𝐵) − 𝑃(𝐴 ∩ 𝐵) (2)
For example 1, you may compute the probability of occurrence of event A or C
as:
𝑃(𝐴 ∪ 𝐶) = 𝑃(𝐴) + 𝑃(𝐶) − 𝑃(𝐴 ∩ 𝐶)
= 1/6 + 1/2 – 1/6 = 1/2.
In the case of disjoint events, since 𝑃(𝐴 ∩ 𝐵) is zero, therefore, the equation
(2) will reduce to:
𝑃(𝐴 ∪ 𝐵) = 𝑃(𝐴) + 𝑃(𝐵) (3)
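A short Python sketch can verify these computations for Example 1 exactly, using sets and fractions from the standard library.

from fractions import Fraction

# Sample space and events of Example 1 (a fair die)
S = {1, 2, 3, 4, 5, 6}
A = {2}           # die shows 2
B = {1, 3, 5}     # die shows an odd face
C = {2, 4, 6}     # die shows an even face

def prob(event):
    # Classical probability: favourable outcomes / total outcomes
    return Fraction(len(event), len(S))

print(prob(A & B), prob(A & C))           # 0 and 1/6
print(prob(A | C))                        # union computed directly: 1/2
print(prob(A) + prob(C) - prob(A & C))    # equation (2): also 1/2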
Given two events X and Y with probabilities of occurrence P(X) and P(Y)
respectively, what would be the probability of occurrence of X if the other event Y
has actually occurred?
Let us analyse the problem further. Since the event Y has already occurred,
therefore, sample space reduces to the sample space of event Y. In addition, the
possible outcomes for occurrence of X could be the occurrences at the
intersection of X and Y, as that is the only space of X, which is part of sample
space of Y. Figure 1 shows this with the help of a Venn diagram.
Figure 1: The conditional probability of event X given that event Y has occurred (the initial sample space reduces to the sample space of Y, and the possible outcomes of X lie in X ∩ Y)
You can compute the conditional probability using the following equation:
    P(X/Y) = P(X ∩ Y) / P(Y)    (5)
where P(X/Y) is the conditional probability of occurrence of event X, if event Y has occurred.
For example, in Example 1, what is the probability of occurrence of event A, if
event C has occurred?
You may please note that P(A ∩ C) = 1/6 and P(C) = 1/2; therefore, the conditional
probability P(A/C) would be:
    P(A/C) = P(A ∩ C) / P(C) = (1/6) / (1/2) = 1/3
What would be the conditional probability of disjoint events? You may find the
answer by computing P(A/B) for Example 1.
Independent events are a special case for conditional probability. As the two
events are independent of each other, the occurrence of any one of the events does
not change the probability of occurrence of the other event. Therefore, for
independent events X and Y:
𝑃(𝑋/𝑌) = 𝑃(𝑋) 𝑎𝑛𝑑 𝑃(𝑌/𝑋) = 𝑃(𝑌) (7)
In fact, the equation (7) can be used to determine the independent events
Bayes theorem is one of the important theorems that deal with conditional
probability. Mathematically, Bayes theorem can be written using equation (6) as:
    P(X/Y) × P(Y) = P(Y/X) × P(X)
or
    P(X/Y) = [P(Y/X) × P(X)] / P(Y)    (8)
Example 3: Assume that you have two bags, namely Bag A and Bag B. Bag A
contains 5 green and 5 red balls, whereas Bag B contains 3 green and 7 red
balls. Assume that you have drawn a red ball; what is the probability that this
red ball was drawn from Bag B?
In this example,
Let the event X be “Drawing a Red Ball”. The probability of drawing a red ball
can be computed as follows;
You may select a bag and then draw a ball. Therefore, the probability
will be computed as:
(Probability of selection of Bag A) × (Probability of selection of a red ball
in Bag A) + (Probability of selection of Bag B) × (Probability of
selection of a red ball in Bag B)
P(Red) = (1/2 × 5/10 + 1/2 × 7/10) = 3/5
Let the event Y be "Selection of Bag B from the two bags", assuming equally
likely selection of bags. Therefore, P(BagB) = 1/2.
In addition, if Bag B is selected, then the probability of drawing a red ball is
P(Red/BagB) = 7/10, as Bag B has already been selected and it has 3 green and
7 red balls.
As per the Bayes Theorem:
    P(BagB/Red) = [P(Red/BagB) × P(BagB)] / P(Red)
    P(BagB/Red) = ( (7/10) × (1/2) ) / (3/5) = 7/12
Bayes theorem is a powerful tool to revise your estimate provided a given
event has occurred. Thus, you may be able to change your predictions.
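The computation of Example 3 can be checked with a few lines of Python using exact fractions from the standard library.

from fractions import Fraction

# Example 3: two bags, each selected with probability 1/2
p_bagB = Fraction(1, 2)
p_red_given_A = Fraction(5, 10)    # Bag A: 5 green, 5 red
p_red_given_B = Fraction(7, 10)    # Bag B: 3 green, 7 red

# Total probability of drawing a red ball
p_red = Fraction(1, 2) * p_red_given_A + p_bagB * p_red_given_B   # 3/5

# Bayes theorem, equation (8)
p_bagB_given_red = (p_red_given_B * p_bagB) / p_red
print(p_red, p_bagB_given_red)                                    # 3/5 and 7/12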
1. Is P(X/Y) = P(Y/X)?
2. How can you use probabilities to find whether two events are independent?
3. The MCA batches of University A and University B consist of 20 and
30 students respectively. University A has 10 students who have
obtained more than 75% marks, and University B has 20 such students.
A recruitment agency selects one of the students who have more than
75% marks from the two universities. What is the probability that the
selected student is from University A?
Using the data of Figure 2, you can create the following frequency table, which
can also be converted to probability.
X Frequency Probability P(X)
0 1 1/8
1 3 3/8
2 3 3/8
3 1 1/8
Total 8 Sum of all P(X) = 1
Figure 3: The Frequency and Probability of Random Variable X
(Bar chart of the probability distribution: Probability vs. Number of Heads (X))
Another important value defined for a probability distribution is the mean or
expected value, which is computed for a random variable X using the following
equation (9):
    μ = Σ (x_i × p_i), summed over i = 0 to n    (9)
Thus, the mean or expected number of heads in three trials would be:
    μ = x_0 × p_0 + x_1 × p_1 + x_2 × p_2 + x_3 × p_3
    μ = 0 × (1/8) + 1 × (3/8) + 2 × (3/8) + 3 × (1/8) = 12/8 = 1.5
Therefore, in a trial of 3 tosses of a coin, the mean number of heads is 1.5.
    μ = n × s    (12a)
    σ = sqrt( n × s × (1 − s) )    (12b)
Therefore, for the variable X, which represents the number of heads in three tosses
of a coin, the mean and standard deviation are:
    μ = n × s = 3 × (1/2) = 1.5
    σ = sqrt( n × s × (1 − s) ) = sqrt( 3 × (1/2) × (1 − 1/2) ) = √3 / 2
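As a cross-check, the binomial probabilities, mean and standard deviation for this example can be obtained with scipy (assumed available).

from scipy.stats import binom

# Number of heads in n = 3 tosses of a fair coin, with probability of success s = 1/2
n, s = 3, 0.5

print([binom.pmf(k, n, s) for k in range(n + 1)])   # 1/8, 3/8, 3/8, 1/8
print(binom.mean(n, s), binom.std(n, s))            # 1.5 and sqrt(3)/2 ≈ 0.866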
Figure 5: Histogram of the Height of 100 students of a class (frequency of 'Height' in 5 cm intervals from 140 to 195 cm)
The mean of the heights was 166 and the standard deviation was about 10. The
probability that a student's height is in the interval 165 to 170 is 0.27.
Figure 7: Computing probability using the Normal distribution (the area under the curve up to μ + 1.3σ is 0.9032)
With the basic introduction as above, next we discuss one of the important
aspects of samples and populations, called the sampling distribution. A typical
statistical experiment may be based on a specific sample of data collected by the
researcher. Such data is termed primary data. The
question is: can the statistical results obtained by you using the primary data
be applied to the population? If yes, what may be the accuracy of such an
application? To answer this question, you must study the sampling distribution.
The sampling distribution is also a probability distribution; however, this
distribution shows the probability of choosing a specific sample from the
population. In other words, a sampling distribution is the probability distribution
of the means of the random samples of the population. The probability in this
distribution defines the likelihood of the occurrence of the specific mean of the
sample collected by the researcher. The sampling distribution determines whether
the statistic of the sample falls close to the population parameters or not. The
following example explains the concept of the sampling distribution in the context
of a categorical variable.
Example 5: Consider a small population of just 5 persons, who vote on the question
"Should Data Science be made the Core Course in Computer Science? (Yes/No)". The
following table shows the population:
Suppose you take a sample of size (n) = 3 and collect random samples. The
following are the possible sets of random samples:
Sample Sample Proportion (𝑝̂ )
P1, P2, P3 0.67
P1, P2, P4 0.67
P1, P2, P5 0.67
P1, P3, P4 0.33
P1, P3, P5 0.33
P1, P4, P5 0.33
P2, P3, P4 0.33
P2, P3, P5 0.33
P2, P4, P5 0.33
P3, P4, P5 0.00
Frequency of all the sample proportions is:
𝑝̂ Frequency
0 1
0.33 6
0.67 3
(Bar chart: Frequency vs. Sample Proportion — frequencies 1, 6 and 3 for proportions 0, 0.33 and 0.67 respectively)
Please notice the nature of the distribution of the sample proportions; it looks close
to a Normal distribution curve. In fact, you can verify this by creating an
example with 100 data points and a sample size of 30.
    mean proportion = p    (14a)
    Standard Deviation = sqrt( p × (1 − p) / n )    (14b)
Suppose you take a sample of size (n) = 3 and collect random samples. The
following are the possible sets of random samples:
Sample Sample Mean (𝑥̅ )
P1, P2, P3 25
P1, P2, P4 26.67
P1, P2, P5 28.33
P1, P3, P4 28.33
P1, P3, P5 30
P1, P4, P5 31.67
P2, P3, P4 30
P2, P3, P5 31.67
P2, P4, P5 33.33
P3, P4, P5 35
The mean of all these sample means is 30, which is the same as the population mean μ.
The histogram of the data is shown in Figure 12.
Figure 12: Histogram of the sample means (Frequency vs. Mean Value)
Given a sample size n and population mean μ, the sampling distribution for
the given sample size fulfils the following:
    Mean of sample means = μ    (15a)
    Standard Deviation of sample means = σ / √n    (15b)
Therefore, the z-score computation for sampling distribution will be as per the
following equation:
Note: You can obtain this equation from equation (13), as this is a
distribution of means, therefore, x of equation (13) is 𝑥̅ , and standard
deviation of sampling distribution is given by equation (15b).
    z = (x̄ − μ) / (σ / √n)    (15c)
Please note that the histogram of the mean of samples is close to normal
distribution.
Such experimentation led to the Central Limit Theorem, which proposes the following:
Central Limit Theorem: Assume that a sample of size n is drawn from a population that
has mean μ and standard deviation σ. The central limit theorem states that, with the
increase in n, the sampling distribution, i.e. the distribution of the means of the samples,
approaches closer to the normal distribution.
However, it may be noted that the central limit theorem is applicable only if you have
collected independent random samples, where the size of sample is sufficiently large,
yet it is less than 10% of the population. Therefore, the Example 5 and Example 6 are
not true representations for the theorem, rather are given to illustrate the concept.
Further, it may be noted that the central limit theorem does not put any constraint on
the distribution of population. Equation 15 is a result of central limit theorem.
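The behaviour described by the central limit theorem can be observed with a small numpy simulation (a sketch under the stated conditions: independent random samples whose size is large but below 10% of the population).

import numpy as np

rng = np.random.default_rng(0)

# A clearly non-normal population (exponential) with mean 1 and standard deviation 1
population = rng.exponential(scale=1.0, size=100_000)

# Draw many random samples of size n (well below 10% of the population) and record their means
n, repeats = 50, 5000
sample_means = np.array([rng.choice(population, size=n, replace=False).mean()
                         for _ in range(repeats)])

# The sample means are centred on the population mean, with spread close to sigma / sqrt(n)
print(population.mean(), sample_means.mean())
print(population.std() / np.sqrt(n), sample_means.std())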
3. What would be the mean and standard deviation for the random variable of
Question 2?
4. What are the mean and standard deviation of the standard normal distribution?
5. A country has a population of 1 billion, out of which 1% are students of
class 10. A representative sample of 10000 students of class 10 were asked the
question "Is Mathematics difficult or easy?". Assuming that the population
proportion for this question was reported to be 0.36, what would be the
standard deviation of the sampling distribution?
Figure 13: Confidence level of 95% for a confidence interval (the non-shaded area between p − 2 StDev and p + 2 StDev)
Since you have selected a confidence level of 95%, you are expecting that the
proportion of the sample (p̂) can be in the interval (population proportion (p) −
2 × (Standard Deviation)) to (population proportion (p) + 2 × (Standard
Deviation)), as shown in Figure 13. The probability of occurrence of p̂ in this
interval is 95% (please refer to Figure 6). Therefore, the confidence level is 95%.
In addition, note that you do not know the value of p; that is what you are
estimating, and therefore you compute p̂. You may observe in Figure
13 that the value of p will be in the interval (p̂ − 2 × (Standard Deviation)) to (p̂
+ 2 × (Standard Deviation)). The standard deviation of the sampling distribution
can be computed using equation (14b). However, as you are estimating the value
of p, you cannot compute the exact value of the standard deviation.
Rather, you can compute standard error, which is
computed by estimating the standard deviation using the sample proportion (𝑝̂ ),
by using the following formula:
    Standard Error (StErr) = sqrt( p̂ × (1 − p̂) / n )
Therefore, the confidence interval is estimated as (p̂ – 2 × StErr) to (p̂ + 2 × StErr).
In general, for a specific confidence level, you can use a specific z-score
instead of 2. Therefore, the confidence interval, for large n, is: (p̂ – z × StErr) to
(p̂ + z × StErr).
In practice, you may use confidence levels of 90%, 95% or 99%. The z-scores
used for these confidence levels are 1.65, 1.96 (not 2) and 2.58 respectively.
Example 7: Consider the statement S1 of this section and estimate the
confidence interval for the given data.
For the sample the probability that class 12th students play some sport is:
𝑝̂ = 405/1000=0.405
The sample size (n) = 1000
    StErr = sqrt( p̂ × (1 − p̂) / n ) = sqrt( 0.405 × (1 − 0.405) / 1000 ) = 0.016
Therefore, the confidence interval for the confidence level of 95% would be:
    (0.405 – 1.96 × 0.016) to (0.405 + 1.96 × 0.016)
    0.374 to 0.436
Therefore, with a confidence of 95%, you can state that the proportion of students of
class 12 who play some sport is in the range 37.4% to 43.6%.
How can you reduce the size of this interval? You may observe that
StErr is inversely proportional to the square root of the sample size. Therefore,
you may have to increase the sample size to approximately 4 times its present size
to reduce the standard error to approximately half.
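The interval of Example 7 can be reproduced with a short Python computation using only the standard library; the small difference from the text comes from the text rounding StErr to 0.016.

import math

# Example 7: 405 out of 1000 class 12 students play some sport
p_hat, n = 405 / 1000, 1000

st_err = math.sqrt(p_hat * (1 - p_hat) / n)      # standard error of the sample proportion
z = 1.96                                         # z-score for a 95% confidence level
print(p_hat - z * st_err, p_hat + z * st_err)    # about 0.375 to 0.435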
Confidence Interval to estimate mean
You can find the confidence interval for estimating the mean in a similar manner,
as you have done for the case of proportions. However, in this case you need to
estimate the standard error of the estimated mean using a variation of equation
(15b), as follows:
    Standard Error in Sample Mean = s / √n
where s is the standard deviation of the sample.
Example 8: The following table lists the height of a sample of 100 students of
class 12 in centimetres. Estimate the average height of students of class 12.
170 164 168 149 157 148 156 164 168 160
149 171 172 159 152 143 171 163 180 158
167 168 156 170 167 148 169 179 149 171
164 159 169 175 172 173 158 160 176 173
159 160 162 169 168 164 165 146 156 170
163 166 150 165 152 166 151 157 163 189
176 185 153 181 163 167 155 151 182 165
189 168 169 180 158 149 164 171 189 192
171 156 163 170 186 187 165 177 175 165
167 185 164 156 143 172 162 161 185 174
The sample mean and sample standard deviation are computed and shown below:
Sample Mean (x̄) = 166; Standard Deviation of the sample (s) = 11
Therefore, the confidence interval for the mean height of the students of class 12 can
be computed as:
Mean height (x̄) = 166
The sample size (n) = 100
    Standard Error in Sample Mean = 11 / √100 = 1.1
The Confidence Interval for the confidence level 95% would be:
(166 – 1.96 × 1.1) to (166 + 1.96 × 1.1)
163.8 to 168.2
Thus, with a confidence of 95%, you can state that the average height of class 12
students is between 163.8 and 168.2 centimetres.
You may please note that in Example 8, we have used the t-distribution for means,
as we have used the sample's standard deviation rather than the population standard
deviation. The t-distribution of means is slightly more restrictive than the z-
distribution. The t-value is computed in the context of the sampling distribution by
the following equation:
    t = (x̄ − μ) / (s / √n)    (16)
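For Example 8, the t-based interval mentioned above can be computed with scipy (assumed available); with 99 degrees of freedom the t critical value is close to the z value of 1.96, so the interval barely changes.

from scipy import stats
import math

# Example 8: n = 100 heights with sample mean 166 cm and sample standard deviation 11 cm
x_bar, s, n = 166, 11, 100
st_err = s / math.sqrt(n)                     # 1.1

t_crit = stats.t.ppf(0.975, df=n - 1)         # about 1.98 for df = 99
print(x_bar - t_crit * st_err, x_bar + t_crit * st_err)   # roughly 163.8 to 168.2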
In this section, we discuss how to test the statement S2, given in Section 2.5. A
number of experimental studies are conducted in statistics with the objective of
inferring whether the data support a hypothesis or not. Significance testing may
involve the following phases:
1. Testing pre-conditions on the data:
Prior to performing a test of significance, you should check the pre-conditions
of the test. Most statistical tests require random sampling, a large size of
data for each possible category being tested, and a normal distribution of the
population.
2. Making the statistical hypotheses: You make statistical hypotheses about the
parameters of the population. There are two basic hypotheses in statistical
testing: the Null Hypothesis and the Alternative Hypothesis.
Null Hypothesis: The null hypothesis either defines a particular value for the
parameter or specifies that there is no difference or no change in the specified
parameters. It is represented as H0.
Alternative Hypothesis: The alternative hypothesis specifies the values of, or the
difference in, the parameter values. It is represented as either H1 or Ha. We use the
convention Ha.
For example, for the statement S2 of Section 2.5, the two hypotheses would be:
H0: There is no effect of hours of study on the marks percentage of class 12 students.
Ha: The marks of class 12 students improve with the hours of study of the student.
Please note that the hypothesis above is one sided, as your assumption is that the
marks would increase with hours of study. The second one-sided hypothesis may
relate to a decrease in marks with hours of study. However, in most cases the
hypothesis will be two sided, which just claims that one variable will cause a
difference in the second. For example, the two-sided hypothesis for statement S2
would be that the hours of study of students make a difference (they may either
increase or decrease the marks) to the marks of students of class 12. In general,
one-sided tests are called one-tailed tests and two-sided tests are called two-tailed tests.
In order to find such a relationship, you may like to perform basic exploratory
analysis. In this case, let us make a scatter plot between the two variables, taking
wsh as the independent variable and mp as the dependent variable. This scatter plot
is shown in Figure 16.
Figure 16: Scatter plot of Weekly Study Hours vs. Marks Percentage.
The scatter plot of Figure 16 suggests that the two variables may be associated.
But how do we determine the strength of this association? In statistics, you use
correlation, which may be used to determine the strength of the linear association
between two quantitative variables. This is explained next.
On performing regression analysis on the observed data of Example 9, the
statistics shown in Figure 18 are generated.
Regression Statistics
Multiple R 0.9577
R Square 0.9172
Adjusted R Square 0.9069
Standard Error 4.1872
Observations 10.0000
ANOVA
df SS F Significance F
Regression 1.0000 1554.1361 88.6407 0.0000
Residual 8.0000 140.2639
Total 9.0000 1694.4000
• The term “Multiple R” in the Regression Statistics defines the correlation
between the dependent variable (say y) and the set of independent or
explanatory variables in the regression model. Thus, Multiple R is similar
to the correlation coefficient (r), except that it is used when multiple
regression is used. Most software expresses the results in terms of
Multiple R, instead of r, to represent the regression output. Similarly, R
Square is used in multiple regression, instead of r². The proposed model
has a large R Square and can therefore be considered for deployment.
You can go through further readings for more details on all the terms discussed
above.
Figure 19 shows the regression line for the data of Example 9. You may observe
that a residual is the vertical difference between the marks percentage and the
predicted marks percentage. These residuals are shown in Figure 20.
Figure 19: Regression line of Marks Percentage (mp) vs. Weekly Study Hours (wsh)
Figure 20: Residuals vs. Weekly Study Hours (wsh)
In Section 2.5.1 and Section 2.5.2, we have discussed testing the null
hypothesis. You either reject the null hypothesis and accept the alternative
hypothesis based on the computed probability or p-value, or you fail to reject
the null hypothesis. The decisions in such hypothesis testing would be:
• You reject the null hypothesis at a confidence level of 95% based on the p-
value when it lies in the shaded portion, that is, p-value < 0.05 for a two-tailed
hypothesis (that is, both the shaded portions in Figure 15, each an area of
probability 0.025). Please note that in the case of a one-tailed test, you would
consider only one shaded area of Figure 15; therefore, you would be
considering p-value < 0.05 in only one of the two shaded areas.
• You fail to reject the null hypothesis at a confidence level of 95% when the p-
value > 0.05.
The two decisions as stated above could be incorrect, as you are considering a
confidence level of 95%. The following figure shows this situation.
The Actual Scenario | Final Decision: H0 is rejected, that is, you have accepted the alternative hypothesis | Final Decision: You fail to reject H0, as you do not have enough evidence to accept the alternative hypothesis
H0 is True  | This is called a TYPE-I error          | You have arrived at a correct decision
H0 is False | You have arrived at a correct decision | This is called a TYPE-II error
For example, assume that a medicine is tested for a disease and this medicine is
NOT a cure for the disease. You would make the following hypotheses:
H0: The medicine has no effect on the disease.
Ha: The medicine improves the condition of the patient.
However, if the data is such that, at a confidence level of 95%, the p-value is
computed to be less than 0.05, then you will reject the null hypothesis, which is a
Type-I error. The chance of a Type-I error at this confidence level is 5%.
This error would mean that the medicine will get approval, even though it has
no effect on curing the disease.
Now assume instead that a medicine is tested for a disease and this medicine
is a cure for the disease. The hypotheses still remain the same as above. However,
if the data is such that, at a confidence level of 95%, the p-value is computed
to be more than 0.05, then you will not be able to reject the null hypothesis,
which is a Type-II error. This error would mean that a medicine which can cure
the disease will not be accepted.
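A small simulation sketch (using numpy and scipy, both assumed available) illustrates the meaning of a Type-I error: when H0 is actually true, a test at the 0.05 level still rejects it in about 5% of repeated experiments.

import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Both groups are drawn from the same population, so H0 (no difference) is true
trials, rejections = 2000, 0
for _ in range(trials):
    a = rng.normal(loc=50, scale=10, size=30)
    b = rng.normal(loc=50, scale=10, size=30)
    _, p = stats.ttest_ind(a, b)
    if p < 0.05:
        rejections += 1        # H0 wrongly rejected: a Type-I error

print(rejections / trials)     # close to 0.05, the chosen significance level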
2. The Weight of 20 students, in Kilograms, is given in the following table
65 75 55 60 50 59 62 70 61 57
62 71 63 69 55 51 56 67 68 60
Find the estimated weight of the student population.
3. A class of 10 students was given a validated test before and after completing
a training course. The marks of the students in those tests are given below:
Marks before Training (mbt): 56 78 87 76 56 60 59 70 61 71
Marks after Training (mat):  55 79 88 90 87 75 66 75 66 78
With a confidence level of 95%, can you say that the training course was useful?
2.6 SUMMARY
This unit introduces you to the basic probability and statistics related to data
science. The unit first introduces the concept of conditional probability, which
defines the probability of an event given that a specific event has occurred. This is
followed by a discussion of Bayes theorem, which is very useful in finding
conditional probabilities. Thereafter, the unit explains the concepts of discrete
and continuous random variables. In addition, the Binomial distribution and the
Normal distribution were also explained. Further, the unit explained the
concept of the sampling distribution and the central limit theorem, which form the
basis of statistical analysis. The unit also explains the use of confidence
levels and intervals for estimating the parameters of the population. Further, the
unit explains the process of significance testing by taking an example related to
correlation and regression. Finally, the unit explains the concept of errors in
hypothesis testing. You may refer to the further readings for more details on these
concepts.
2.7 SOLUTION/ANSWERS
    P(UniA/StDis) = [P(StDis/UniA) × P(UniA)] / P(StDis) = ( (1/2) × (1/2) ) / (7/12) = 3/7
Check Your Progress 2
1. As the probability of getting an even number (E) or an odd number (O) is equal
for each of the three dice, the following eight outcomes are possible:
Outcomes:                                    EEE  EEO  EOE  EOO  OEE  OEO  OOE  OOO
Number of times an even number appears (X):   3    2    2    1    2    1    1    0
Therefore, the probability distribution would be:
X Frequency Probability P(X)
0 1 1/8
1 3 3/8
2 3 3/8
3 1 1/8
Total 8 Sum of all P(X) = 1
2. This can be determined by using the Binomial distribution with X=0, 1, 2, 3 and
4, as follows (s and f both are 1/2):
    P(X = 0) or p0 = 4C0 × s^0 × f^(4-0) = [4! / (0!(4-0)!)] × (1/2)^0 × (1/2)^4 = 1/16
    P(X = 1) or p1 = 4C1 × s^1 × f^(4-1) = [4! / (1!(4-1)!)] × (1/2)^1 × (1/2)^3 = 4/16
    P(X = 2) or p2 = 4C2 × s^2 × f^(4-2) = [4! / (2!(4-2)!)] × (1/2)^2 × (1/2)^2 = 6/16
    P(X = 3) or p3 = 4C3 × s^3 × f^(4-3) = [4! / (3!(4-3)!)] × (1/2)^3 × (1/2)^1 = 4/16
    P(X = 4) or p4 = 4C4 × s^4 × f^(4-4) = [4! / (4!(4-4)!)] × (1/2)^4 × (1/2)^0 = 1/16
4. Analysis of results: The one-tailed p-value suggests that you reject the
null hypothesis. The difference in the means of the two results is
significant enough to conclude that the scores of the students have
improved after the training.
UNIT 3 DATA PREPARATION FOR ANALYSIS
3.0 Introduction
3.1 Objectives
3.2 Need for Data Preparation
3.3 Data preprocessing
3.3.1 Data Cleaning
3.3.2 Data Integration
3.3.3 Data Reduction
3.3.4 Data Transformation
3.4 Selection and Data Extraction
3.5 Data Curation
3.5.1 Steps of Data Curation
3.5.2 Importance of Data Curation
3.6 Data Integration
3.6.1 Data Integration Techniques
3.6.2 Data Integration Approaches
3.7 Knowledge Discovery
3.8 Summary
3.9 Solutions/Answers
3.10 Further Readings
3.0 INTRODUCTION
In the previous unit of this block, you were introduced to the basic concepts of
conditional probability, Bayes theorem and probability distributions, including
the Binomial and Normal distributions. That unit also introduced you to the concepts
of the sampling distribution, the central limit theorem and statistical hypothesis
testing. This unit introduces you to the process of data preparation for data
analysis. Data preparation is one of the most important processes, as it leads to
good quality data, which will result in accurate results of the data analysis. This
unit covers data selection, cleaning, curation, integration, and knowledge
discovery from the stated data. In addition, this unit gives you an overview of
data quality and how data preparation for analysis is done. You may refer to the further
readings for more details on these topics.
3.1 OBJECTIVES
3.2 NEED FOR DATA PREPARATION
In the present time, data is one of the key resources for a business. Data is
processed to create information; information is integrated to create knowledge.
Since knowledge is power, it has evolved into a modern currency, which is
valued and traded between parties. Everyone wants to discuss the knowledge
and benefits they can gain from data. Data is one of the most significant
resources available to marketers, agencies, publishers, media firms, and others
today for a reason. But only high-quality data is useful. We can determine a data
set's reliability and suitability for decision-making by looking at its quality.
Degrees are frequently used to gauge this quality. The usefulness of the data for
the intended purpose and its completeness, accuracy, timeliness, consistency,
validity, and uniqueness are used to determine the data's quality. In simpler
terms, data quality refers to how accurate and helpful the data are for the task at
hand. Further, data quality also refers to the actions that apply the necessary
quality management procedures and methodologies to make sure the data is
useful and actionable for the data consumers. A wide range of elements,
including accuracy, completeness, consistency, timeliness, uniqueness, and
validity, influence data quality. Figure 1 shows the basic factors of data quality.
Figure 1: Factors of data quality (completeness, accuracy, timeliness, consistency, validity, uniqueness)
• Accuracy - The data must be true and reflect events that actually take
place in the real world. Accuracy measures determine how closely the
figures agree with verified, correct information sources.
• Completeness - The degree to which the data is complete determines
how well it can provide the necessary values.
• Consistency - Data consistency is the homogeneity of the data across
applications, networks, and when it comes from several sources. For
example, identical datasets should not conflict if they are stored in
different locations.
• Timeliness - Data that is timely is readily available whenever it is
needed. The timeliness factor also entails keeping the data current, making
sure it is always available, accessible and updated in real time.
• Uniqueness - Uniqueness is defined as the lack of duplicate or redundant
data across all datasets. The collection should contain zero duplicate
records.
• Validity - Data must be obtained in compliance with the firm's defined
business policies and guidelines. The data should adhere to the
appropriate, recognized formats, and all dataset values should be within
the defined range.
Consider yourself a manager at a company, say XYZ Pvt Ltd, who has been tasked
with researching the sales statistics for a specific organization, say ABC. You
immediately get to work on this project by carefully going through the ABC
company's database and data warehouse for the parameters or dimensions (such
as the product, price, and units sold), which may be used in your study. However,
your enthusiasm suffers a major problem when you see that several of the
attributes for different tuples do not have any recorded values. You want to
incorporate the information in your study on whether each item purchased was
marked down, but you find that this data has not been recorded. According to users
of this database system, the data recorded for some transactions were containing
mistakes, such as strange numbers and anomalies.
3.3 DATA PREPROCESSING
Preprocessing is the process of taking raw data and turning it into information
that may be used. Data cleaning, data integration, data transformation (including
data discretization), and data reduction are the main phases of data preprocessing
(see Figure 2).
Figure 2: The main phases of data preprocessing - data cleaning, data integration, data transformation, and data reduction.
a. Missing Values
Consider you need to study customer and sales data for ABC
Company. As you pointed out, numerous tuples lack recorded
values for a number of characteristics, including customer
income. The following techniques can be used to add the values
that are lacking for this attribute.
i. Ignore the tuple: Typically, this is carried out in the
absence of a class label (assuming the task involves
classification). This method is particularly poor when the
percentage of missing values per attribute varies considerably.
By ignoring the tuple, we do not make use of the values of its
remaining attributes.
ii. Manually enter the omitted value: In general, this
strategy is time-consuming and might not be practical for
huge data sets with a substantial number of missing values.
iii. Fill up the blank with a global constant: A single
constant, such as "Unknown" or “−∞”, should be used to
replace all missing attribute values. If missing data are
replaced with, say, "Unknown," the analysis algorithm can
mistakenly think that they collectively comprise valid data.
So, despite being simple, this strategy is not perfect.
iv. To fill in the missing value, use a measure of the
attribute's central tendency (such as the mean or
median): The median should be used for skewed data
distributions, while the mean can be used for normal
(symmetric) data distributions. Assume, for instance, that
the ABC company’s customer income data distribution is
symmetric and that the mean income is INR 50,000/-. Use
this value to fill in the income value that is missing.
v. For all samples that belong to the same class as the
specified tuple, use the mean or median: For instance, if
we were to categorize customers based on their credit risk,
the mean income value of customers who belonged to the
same credit risk category as the given tuple might be used
to fill in the missing value. If the data distribution is skewed
for the relevant class, it is best to utilize the median value.
vi. Fill in the blank with the value that is most likely to be
there: This result can be reached using regression,
inference-based techniques using a Bayesian
formalization, or decision tree induction. As an example,
using the other characteristics of your data's customers, you
may create a decision tree to forecast the income's missing
numbers.
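As an illustration of strategies (iv) and (v) above, the following minimal Python sketch (the pandas library is assumed, and the customer data with its income and credit_risk columns is hypothetical) fills missing income values first with the overall mean and then with the median of the customer's own credit-risk class:

    import pandas as pd
    import numpy as np

    # Hypothetical customer data with some income values missing
    df = pd.DataFrame({
        "credit_risk": ["low", "low", "high", "high", "low"],
        "income": [52000, np.nan, 31000, np.nan, 48000],
    })

    # (iv) Fill with the overall mean (suitable for symmetric distributions)
    df["income_mean_filled"] = df["income"].fillna(df["income"].mean())

    # (v) Fill with the median of the same credit-risk class (better for skewed data)
    df["income_class_filled"] = df.groupby("credit_risk")["income"].transform(
        lambda s: s.fillna(s.median())
    )
    print(df)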
b. Noisy Data
Noise is the variance or random error in a measured variable. It
is possible to recognize outliers, which might be noise,
employing tools for data visualization and basic statistical
description techniques (such as scatter plots and boxplots). How
can the data be "smoothed" out to reduce noise given a numeric
property, like price, for example? The following are some of the
data-smoothing strategies.
i. Binning: Binning techniques smooth-sorted data values
by looking at their "neighbourhood" or nearby values.
The values that have been sorted are divided into various
"buckets" or bins. Binding techniques carry out local
smoothing since they look at the values' surroundings.
When smoothing by bin means, each value in the bin is
changed to the bin's mean value. As an illustration,
suppose a bin contains three numbers 4, 8 and 15. The
average of these three numbers in the bin is 9.
Consequently, the value nine replaces each of the bin's original values.
Similarly, smoothing by bin medians, which substitutes
the bin median for each bin value, can be used. Bin
boundaries often referred to as minimum and maximum
values in a specific bin can also be used in place of bin
values. This type of smoothing is called smoothing by
bin boundaries. In this method, the nearest boundary
value is used to replace each bin value. In general, the
smoothing effect increases with the bin width. As an
alternative, equal-width bins may be used, in which each bin
covers a constant interval range of values. A minimal code
sketch of smoothing by bin means and by bin boundaries is
given after this list.
ii. Regression: Regression is a method for adjusting the data values to a
function and may also be used to smooth out the data. Finding the "best"
line to fit two traits (or variables) is the goal of linear regression, which
enables one attribute to predict the other. As an extension of linear
regression, multiple linear regression involves more than two features
and fits the data to a multidimensional surface.
iii. Outlier analysis: Clustering, for instance, the grouping of comparable
values into "clusters," can be used to identify outliers. It makes sense to
classify values that are outliers as being outside the set of clusters.
iv. Data discretization, a data transformation and data reduction technique,
is an extensively used data smoothing technique. The number of distinct
values for each property is decreased, for instance, using the binning
approaches previously discussed. This functions as a form of data
reduction for logic-based data analysis methods like decision trees,
which repeatedly carry out value comparisons on sorted data. Concept
hierarchies are a data discretization technique that can also be applied to
smooth out the data. The quantity of data values that the analysis process
must process is decreased by a concept hierarchy. For example, the price
variable, which represents the price value of commodities, may be
discretized into “lowly priced”, “moderately priced”, and “expensive”
categories.
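The following minimal sketch (Python with numpy is assumed; the price values are hypothetical) illustrates smoothing by bin means and by bin boundaries using equal-frequency bins of three values each, as described under binning above:

    import numpy as np

    # Sorted price values to be smoothed
    prices = np.array([4, 8, 15, 21, 21, 24, 25, 28, 34])

    # Partition the sorted values into equal-frequency bins of three values each
    bins = prices.reshape(-1, 3)

    # Smoothing by bin means: every value is replaced by its bin's mean
    smoothed_by_means = np.repeat(bins.mean(axis=1), 3)

    # Smoothing by bin boundaries: each value is replaced by the nearest bin boundary
    low, high = bins[:, [0]], bins[:, [-1]]
    smoothed_by_bounds = np.where(bins - low <= high - bins, low, high).ravel()

    print(smoothed_by_means)   # [ 9.  9.  9. 22. 22. 22. 29. 29. 29.]
    print(smoothed_by_bounds)  # [ 4  4 15 21 21 24 25 25 34]

Note how the first bin {4, 8, 15} is replaced by its mean 9, matching the worked example in the text.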
3. Filter unwanted outliers - Outliers should not be removed unless there is a
good cause, such as suspicious measurements that are unlikely to be present in
the real data.
4. Handling missing data - Missing data is a deceptively difficult issue in
machine learning. We cannot simply ignore or remove the missing observations;
they must be treated carefully, since they can indicate a serious problem. Data
gaps resemble missing puzzle pieces: dropping an observation is equivalent to
denying that the puzzle slot exists, while blindly imputing a value is like
forcing in a piece from another puzzle. We also need to be aware of how we
report missing data. Instead of just filling a gap with the mean, you can flag
the value as missing and then fill it, which effectively lets the algorithm
account for the missingness; a minimal sketch of this flag-and-fill approach
appears after this list.
5. Validate and QA-You should be able to respond to these inquiries as part of
fundamental validation following the data cleansing process, for example:
o Does the data make sense?
o Does the data abide by the regulations that apply to its particular field?
o Does it support or refute your hypothesis? Does it offer any new
information?
o Can you see patterns in the data that will support your analysis?
o Is there a problem with the data quality?
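A minimal sketch of the flag-and-fill idea mentioned in step 4 (pandas assumed; the discount column is hypothetical): a boolean indicator records which values were missing before a constant is filled in, so that the analysis algorithm can still account for the missingness.

    import pandas as pd
    import numpy as np

    df = pd.DataFrame({"discount": [5.0, np.nan, 0.0, np.nan, 2.5]})

    # Flag: keep an explicit indicator of which observations were missing
    df["discount_was_missing"] = df["discount"].isna()

    # Fill: substitute a constant (0 here) so downstream models receive no gaps
    df["discount"] = df["discount"].fillna(0)
    print(df)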
Data from many sources, such as files, data cubes, databases (both relational and
non-relational), etc., must be combined during this procedure. Both
homogeneous and heterogeneous data sources are possible. Structured,
unstructured, or semi-structured data can be found in the sources. Redundancies
and inconsistencies can be reduced and avoided with careful integration.
d. Data Value Conflict Detection and Resolution: Data value conflicts must
be found and resolved as part of data integration. As an illustration,
attribute values from many sources may vary for the same real-world thing.
Variations in representation, scale, or encoding may be the cause of this. In
one system, a weight attribute might be maintained in British imperial
units, while in another, metric units. For a hotel chain, the cost of rooms in
several cities could include various currencies, services (such as a
complimentary breakfast) and taxes. Similarly, every university may have
its own curriculum and grading system. When sharing information among
them, one university might use the quarter system, provide three database
systems courses, and grade students from A+ to F, while another would use
the semester system, provide two database systems courses, and grade
students from 1 to 10. Information interchange between two such
universities is difficult because accurate course-to-grade transformation
rules are hard to establish between them.
An attribute in one system might be recorded at a lower abstraction level
than the "identical" attribute in another since the abstraction level of
attributes might also differ. As an illustration, an attribute with the same
name in one database may refer to the total sales of a single branch of a
company, whereas the same attribute in another database may refer to the
company's total sales across all of its regional shops.
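As a minimal illustration of resolving such a representation conflict during integration (the two sources and their column names are hypothetical; pandas assumed), the weights can be converted to a common unit before the tables are combined:

    import pandas as pd

    source_metric = pd.DataFrame({"item": ["A", "B"], "weight_kg": [2.0, 5.5]})
    source_imperial = pd.DataFrame({"item": ["C", "D"], "weight_lb": [4.4, 11.0]})

    # Resolve the scale conflict: convert pounds to kilograms (1 lb = 0.4536 kg)
    source_imperial["weight_kg"] = source_imperial["weight_lb"] * 0.4536

    unified = pd.concat(
        [source_metric, source_imperial[["item", "weight_kg"]]], ignore_index=True
    )
    print(unified)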
This procedure is used to change the data into formats that are suited for the
analytical process. Data transformation involves transforming or consolidating
the data into analysis-ready formats. The following are some data transformation
strategies:
a. Smoothing, which attempts to reduce noise in the data. Binning, regression, and
clustering are some of the methods used.
b. Attribute construction (or feature construction), wherein, in order to aid
the analysis process, additional attributes are constructed and added from the
set of attributes provided.
c. Aggregation, where data is subjected to aggregation or summary procedures
to calculate monthly and yearly totals; for instance, the daily sales data may
be combined to produce monthly or yearly sales. This process is often used
to build a data cube for data analysis at different levels of abstraction.
d. Normalization, where the attribute data is resized to fit a narrower range:
−1.0 to 1.0; or 0.0 to 1.0.
e. Discretization, where the raw values of a numeric attribute (e.g., age) are
replaced by interval labels (e.g., 0–10, 11–20, etc.) or conceptual labels (e.g.,
youth, adult, senior). A concept hierarchy for the numeric attribute can then be
created by recursively organizing the labels into higher-level concepts. To
meet the demands of different users, more than one concept hierarchy may
be built for the same attribute. A minimal sketch of normalization and
discretization is given after this list.
f. Concept hierarchy generation for nominal data, where attributes such as
street can be generalized to higher-level concepts such as city or country.
Numerous hierarchies for nominal attributes are implicit in the database
schema and can be automatically defined at the schema definition level.
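The following minimal sketch (pandas assumed; the age and price values are hypothetical) illustrates min-max normalization to the 0.0 to 1.0 range and discretization of a numeric attribute into conceptual labels:

    import pandas as pd

    df = pd.DataFrame({"age": [12, 25, 47, 68], "price": [40, 120, 260, 900]})

    # Normalization: rescale price to the 0.0 - 1.0 range (min-max normalization)
    df["price_norm"] = (df["price"] - df["price"].min()) / (
        df["price"].max() - df["price"].min()
    )

    # Discretization: replace raw ages with conceptual labels
    df["age_group"] = pd.cut(
        df["age"], bins=[0, 17, 59, 120], labels=["youth", "adult", "senior"]
    )
    print(df)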
2. Why is preprocessing important?
The process of choosing the best data source, data type, and collection tools is
known as data selection. Prior to starting the actual data collection procedure,
data selection is conducted. This concept makes a distinction between selective
data reporting (excluding data that is not supportive of a study premise) and
active/interactive data selection (using obtained data for monitoring
activities/events or conducting secondary data analysis). Data integrity may be
impacted by how acceptable data are selected for a research project.
The main goal of data selection is to choose the proper data type, source, and
tool that enables researchers to effectively solve research issues. This decision
typically depends on the discipline and is influenced by the research that has
already been done, the body of that research, and the availability of the data
sources.
Integrity issues may arise, when decisions about which "appropriate" data to
collect, are primarily centred on cost and convenience considerations rather than
the data's ability to successfully address research concerns. Cost and
convenience are unquestionably important variables to consider while making a
decision. However, researchers should consider how much these factors can
skew the results of their study.
Types and Sources of Data: Different data sources and types can be displayed in
a variety of ways. There are two main categories of data:
• Quantitative data are expressed as numerical measurements at the interval
and ratio levels.
• Qualitative data can take the form of text, images, music, and video.
Data curation is creating, organizing and managing data sets so that people
looking for information can access and use them. It comprises compiling,
arranging, indexing, and categorizing data for users inside of a company, a
group, or the general public. To support business decisions, academic needs,
scientific research, and other initiatives, data can be curated. Data curation is a
step in the larger data management process that helps prepare data sets for usage
in business intelligence (BI) and analytics applications. In other cases, the
curation process might be fed with ready-made data for ongoing management
and maintenance. In organizations that do not employ a dedicated data curator,
data stewards, data engineers, database administrators, data scientists, or
business users may fill that role.
There are numerous tasks involved in curating data sets, which can be divided
into the following main steps.
• Determine the data that will be required for the proposed analytics
applications.
• Map the data sets and note the metadata that goes with them.
• Collect the data sets.
• Ingest the data into a target system, such as a data lake or a data
warehouse.
• Cleanse the data to remove abnormalities, inconsistencies, and
mistakes, including missing values, duplicate records, and spelling
mistakes.
• Model, organize, and transform the data to prepare it for specific
analytics applications.
• To make the data sets accessible to users, create searchable indexes
of them.
• Maintain and manage the data in compliance with the requirements
of continuous analytics and the laws governing data privacy and
security.
3.5.2 Importance of Data Curation
The following are the reasons for performing data curation.
1. Helps to organize pre-existing data for a corporation: Businesses produce
a large amount of data on a regular basis, however, this data can
occasionally be lacking. When a customer clicks on a website, adds
something to their cart, or completes a transaction, an online clothes
retailer might record that information. Data curators assist businesses in
better understanding vast amounts of information by assembling prior
data into data sets.
2. Connects professionals in different departments: When a company
engages in data curation, it often brings together people from several
departments who might not typically collaborate. Data curators might
collaborate with stakeholders, system designers, data scientists, and data
analysts to collect and transfer information.
3. Maintains data quality: High-quality data typically uses organizational
techniques that make it simple to grasp and contains fewer errors. Curators may make sure that a
company's research and information continue to be of the highest caliber
because the data curation process entails cleansing the data. Removing
unnecessary information makes research more concise, which may
facilitate better data set structure.
4. Makes data easy to understand: Data curators make sure there are no
errors and utilize proper formatting. This makes it simpler for specialists
who are not knowledgeable about a research issue to comprehend a data
set.
5. Allows for higher cost and time efficiency: A business may spend more
time and money organizing and distributing data if it does not regularly
employ data curation. Because prior data is already organized and
distributed, businesses that routinely do data curation may be able to save
time, effort, and money. Businesses can reduce the time it takes to obtain
and process data by using data curators, who handle the data.
Data integration creates coherent data storage by combining data from several
sources. Smooth data integration is facilitated by the resolution of semantic
heterogeneity, metadata, correlation analysis, tuple duplicate identification, and
data conflict detection. It is a tactic that combines data from several sources so
that consumers may access it in a single, consistent view that displays their
status. Systems can communicate using flat files, data cubes, or numerous
databases. Data integration is crucial because it maintains data accuracy while
providing a consistent view of dispersed data. It helps the analysis tools extract
valuable information, which in turn helps the executive and management make
tactical choices that will benefit the company.
Uniform Access Integration - This method integrates information from a wider
range of sources. In this instance, however, the data is left in its initial place and
is not moved. To put it simply, this technique produces a unified view of the
combined data. The integrated data does not need to be saved separately because
the end user only sees the integrated view.
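A minimal sketch of producing such a single, consistent view from two hypothetical sources (a customer file and a sales table; pandas assumed):

    import pandas as pd

    customers = pd.DataFrame({"cust_id": [1, 2, 3], "city": ["Delhi", "Pune", "Agra"]})
    sales = pd.DataFrame({"cust_id": [1, 1, 3], "amount": [250, 100, 400]})

    # Integrate the two sources into one coherent view keyed on cust_id
    integrated = customers.merge(sales, on="cust_id", how="left")
    print(integrated)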
2. Selecting and producing the data set that will be used for discovery -Once
the objectives have been specified, the data that will be used for the knowledge
discovery process should be identified. Determining what data is accessible,
obtaining essential information, and then combining all the data for knowledge
discovery into one set are the factors that will be considered for the procedure.
Knowledge discovery is important since it extracts knowledge and insight from
the given data. This provides the framework for building the models.
3. Preprocessing and cleansing – This step helps in increasing the data
reliability. It comprises data cleaning, like handling the missing quantities and
removing noise or outliers. In this situation, it might make use of sophisticated
statistical methods or an analysis algorithm. For instance, the goal of the Data
Mining supervised approach may change if it is determined that a certain
attribute is unreliable or has a sizable amount of missing data. After developing
a prediction model for these features, missing data can be forecasted. A variety
of factors affect how much attention is paid to this level. However, breaking
down the components is important and frequently useful for enterprise data
frameworks.
4. Data Transformation-This phase entails creating and getting ready the
necessary data for knowledge discovery. Here, techniques of attribute
transformation (such as discretization of numerical attributes and functional
transformation) and dimension reduction (such as feature selection, feature
extraction, record sampling etc.) are employed. This step, which is frequently
very project-specific, can be important for the success of the KDD project.
Proper transformation results in proper analysis and proper conclusions.
5. Prediction and description- The decisions to use classification, regression,
clustering, or any other method can now be made. Mostly, this uses the KDD
objectives and the decisions made in the earlier phases. A forecast and a
description are two of the main objectives of knowledge discovery. The
visualization aspects are included in descriptive knowledge discovery. Inductive
learning, which generalizes a sufficient number of prepared models to produce
a model either explicitly or implicitly, is used by the majority of knowledge
discovery techniques. The fundamental premise of the inductive technique is
that the prepared model holds true for the examples that follow.
6. Deciding on knowledge discovery algorithm -We now choose the strategies
after determining the technique. In this step, a specific technique must be chosen
to be applied while looking for patterns with numerous inducers. If precision
and understandability are compared, the former is improved by neural networks,
while decision trees improve the latter. There are numerous ways in which each
strategy can be successful, and meta-learning can help choose among them. The
goal of meta-learning is to explain why a data analysis algorithm is successful
or unsuccessful in solving a particular problem; this methodology therefore
seeks to understand the circumstances in which a data analysis algorithm is most
effective. Every algorithm also has parameters and evaluation strategies, such
as tenfold cross-validation or a separate division of the data into training and
testing sets. A minimal sketch of such a cross-validation check is given below.
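The following minimal sketch (scikit-learn assumed; the data set is synthetic and the decision tree is chosen purely for illustration) estimates an algorithm's accuracy with tenfold cross-validation before committing to it:

    from sklearn.datasets import make_classification
    from sklearn.model_selection import cross_val_score
    from sklearn.tree import DecisionTreeClassifier

    X, y = make_classification(n_samples=300, n_features=8, random_state=0)

    # Tenfold cross-validation: average accuracy over 10 train/test splits
    scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=10)
    print(scores.mean())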
7. Utilizing the Data Analysis Algorithm-Finally, the data analysis algorithm
is put into practice. The approach might need to be applied several times before
producing a suitable outcome at this point. For instance, by rotating the
algorithms, you can alter variables like the bare minimum of instances in a single
decision tree leaf.
8. Evaluation - In this stage, the patterns, principles, and dependability of the
results of the knowledge discovery process are assessed and interpreted in light
of the objective outlined in the preceding step. Here, we take into account the
preprocessing steps and how they impact the final results. As an illustration, add
a feature in step 4 and then proceed. The primary considerations in this step are
the understanding and utility of the induced model. In this stage, the identified
knowledge is also documented for later use.
Check Your Progress 5:
1. What is Knowledge Discovery?
2. What are the Steps involved in Knowledge Discovery?
3. What are knowledge discovery tools?
4. Explain the process of KDD.
3.8 SUMMARY
Despite the development of several methods for preparing data, the intricacy of
the issue and the vast amount of inconsistent or unclean data mean that this field
of study is still very active. This unit gives a general overview of data pre-
processing and describes how to turn raw data into usable information. The
preprocessing of the raw data included data integration, data reduction,
transformation, and discretization. In this unit, we have discussed five different
data-cleaning techniques that can make data more reliable and produce high-
quality results. Building, organizing, and maintaining data sets is known as data
curation. A data curator usually determines the necessary data sets and makes
sure they are gathered, cleaned up, and changed as necessary. The curator is also
in charge of providing users with access to the data sets and information related
to them, such as their metadata and lineage documentation. The primary goal of
the data curator is to make sure users have access to the appropriate data for
analysis and decision-making. Data integration is the procedure of fusing
information from diverse sources into a single, coherent data store. The unit also
introduced knowledge discovery techniques and procedures.
3.9 SOLUTIONS/ANSWERS
2. It raises the reliability and accuracy of the data. Preprocessing data can
increase the correctness and quality of a dataset, making it more
dependable by removing missing or inconsistent data values brought by
human or computer mistakes. It ensures consistency in data.
3. Data quality is characterized by five characteristics: correctness,
completeness, reliability, relevance, and timeliness.
UNIT 4: DATA VISUALIZATION AND
INTERPRETATION
Structure
4.0 Introduction
4.1 Objectives
4.2 Different types of plots
4.3 Histograms
4.4 Box plots
4.5 Scatter plots
4.6 Heat map
4.7 Bubble chart
4.8 Bar chart
4.9 Distribution plot
4.10 Pair plot
4.11 Line graph
4.12 Pie chart
4.13 Doughnut chart
4.14 Area chart
4.15 Summary
4.16 Answers
4.17 References
4.0 INTRODUCTION
The previous units of this course cover different aspects of data analysis,
including the basics of data science, basic statistical concepts related to data science and
data pre-processing. This unit explains the different types of plots used for data visualization
and interpretation, discusses how these plots are constructed, and describes the use cases
associated with each of them. This unit will help you to appreciate the real-world
need for a workforce trained in visualization techniques and will help you to design,
develop, and interpret visual representations of data. The unit also defines the best
practices associated with the construction of different types of plots.
4.1 OBJECTIVES
After going through this unit, you will be able to:
• Explain the key characteristics of various types of plots for data visualization;
• Explain how to design and create data visualizations;
• Summarize and present the data in meaningful ways;
• Define appropriate methods for collecting, analysing, and interpreting the
numerical information.
Moreover, data visualisation can bring heterogeneous teams together around new
objectives and foster trust among the team members. Let us discuss various
graphs and charts that can be used to express various aspects of a business.
4.3 HISTOGRAMS
A histogram visualises the distribution of data across distinct groups with continuous
classes. It is represented by a set of rectangular bars whose widths equal the class
intervals and whose areas are proportional to the frequencies in the respective classes. A histogram
may hence be defined as a graphic of a frequency distribution that is grouped and has
continuous classes. It provides an estimate of the distribution of values, their extremes,
and the presence of any gaps or out-of-the-ordinary numbers. They are useful in
providing a basic understanding of the probability distribution.
Best Practices:
• Analyse various data groups: The best data groupings can be found by
creating a variety of histograms.
• Break down compartments using colour: The same chart can display a
second set of categories by colouring the bars that represent each category.
Types of Histogram
Normal distribution: In a normal distribution, points are equally likely to occur on
either side of the mean, giving the histogram a symmetric, bell-shaped appearance.
Example: Consider bins showing the frequency of housefly wing lengths, measured in
tenths of a millimetre.
Bimodal Distribution: This distribution has two peaks. In the case of a bimodal
distribution, the data must be segmented before being analysed as normal distributions
in their own right.
Example:
Variable Frequency
0 2
1 6
2 4
3 2
4 4
5 6
6 4
(A bar chart of the above frequencies has two peaks, one at variable value 1 and one at value 5, illustrating a bimodal distribution.)
Edge Peak Distribution: When there is an additional peak at the edge of the
distribution that does not belong there, this type of distribution is called an edge peak
distribution. Unless you are very positive that your data set has the expected number of
outliers, this almost always indicates that you have plotted (or collected) your data
incorrectly (i.e. a few extreme views on a survey).
Comb Distribution: Because the distribution seems to resemble a comb, with
alternating high and low peaks, this type of distribution is given the name "comb
distribution." Rounding off an object might result in it having a comb-like form. For
instance, if you are measuring the height of the water to the nearest 10 centimetres but
your class width for the histogram is only 5 centimetres, you may end up with a comb-
like appearance.
Example
Histogram for the population data of a group of 86 people:
Age Group (Bins):   20-25   26-30   31-35   36-40   41-45   46-50
Population Size:    23      18      15      6       11      13
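A minimal matplotlib sketch (Python is used here purely for illustration) that draws the grouped age data above as a histogram-style bar chart; for raw, unbinned values, plt.hist() would compute the class frequencies automatically:

    import matplotlib.pyplot as plt

    bins = ["20-25", "26-30", "31-35", "36-40", "41-45", "46-50"]
    population = [23, 18, 15, 6, 11, 13]

    # The data is already grouped, so draw touching bars (width=1) as a histogram
    plt.bar(bins, population, width=1.0, edgecolor="black")
    plt.xlabel("Age Group (Bins)")
    plt.ylabel("Population Size")
    plt.title("Population data of a group of 86 people")
    plt.show()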
……………………………………………………………………………………
……………………………………………………………………………………
4. What do histograms show?
………………………………………………………………………………………
………………………………………………………………………………………
……………………………………………………………………………………
2. What are the most important parts of a box plot?
……………………………………………………………………………………
……………………………………………………………………………………
………………………………………………………………………………
A scatter plot is the most commonly used chart when observing the relationship between
two quantitative variables. It works particularly well for quickly identifying possible
correlations between different data points. The relationship between multiple variables
can be efficiently studied using scatter plots, which show whether one variable is a good
predictor of another or whether they normally fluctuate independently. Multiple distinct
data points are shown on a single graph in a scatter plot. Following that, the chart can
be enhanced with analytics like trend lines or cluster analysis. It is especially useful for
quickly identifying potential correlations between data points.
Constructing a Scatter Plot: Scatter plots are mathematical diagrams or plots that rely
on Cartesian coordinates: each point is placed at the position given by its values on the
two variables. The colour of the points can indicate the categories being compared, and
the point size can encode the numerical volume of the data. With a single colour, the
graph represents two values for two variables of a data set, but a second colour can also
be used to include a third variable.
Use Cases: Scatter charts are great in scenarios where you want to display both
distribution and the relationship between two variables.
• Display the relationship between time-on-platform (How Much Time Do
People Spend on Social Media) and churn (the number of people who stopped
being customers during a set period of time).
• Display the relationship between salary and years spent at company
Best Practices
• Analyze clusters to find segments: Based on your chosen variables, cluster
analysis divides up the data points into discrete parts.
• Employ highlight actions: You can rapidly identify which points in your
scatter plots share characteristics by adding a highlight action, all the while
keeping an eye on the rest of the dataset.
• Customize marks: using individual markers adds a simple visual hint to your
graph that makes it easy to distinguish between various point groups.
Example
(Scatter plot example: sale of ice-cream in INR plotted against temperature in degree C, for temperatures between 0 and 50.)
Please note that a linear trendline has been fitted to the scatter plot, indicating a positive
change in the sales of ice-cream with an increase in temperature.
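A minimal matplotlib sketch of such a scatter plot with a fitted linear trendline (the temperature and sales figures are hypothetical):

    import numpy as np
    import matplotlib.pyplot as plt

    temperature = np.array([12, 18, 22, 27, 31, 36, 41])
    sales = np.array([600, 900, 1400, 1900, 2400, 3100, 3800])

    plt.scatter(temperature, sales)

    # Fit and draw a linear trendline (degree-1 polynomial)
    slope, intercept = np.polyfit(temperature, sales, 1)
    plt.plot(temperature, slope * temperature + intercept, color="red")

    plt.xlabel("Temperature in degree C")
    plt.ylabel("Sale of ice-cream (INR)")
    plt.show()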
Check Your Progress 3
……………………………………………………………………………………
………………………………………………………………………………………
………………………………………………………………………………………
……………………………………………………………………………………
……………………………………………………………………………………
4. What are the 3 types of correlations that can be inferred from scatter plots?
……………………………………………………………………………………
……………………………………………………………………………………
Heatmaps are two-dimensional graphics that show data trends through colour
shading. They are an example of part to whole chart in which values are represented
using colours. A basic heat map offers a quick visual representation of the data. A
user can comprehend complex data sets with the help of more intricate heat maps.
Heat maps can be presented in a variety of ways, but they all have one thing in
common: they all make use of colour to convey correlations between data
values. Heat maps are more frequently utilised to present a more comprehensive
view of massive amounts of data. It is especially helpful because colours are simpler
to understand and identify than plain numbers.
Heat maps are highly flexible and effective at highlighting trends. Heatmaps are
naturally self-explanatory, in contrast to other data visualisations that require
interpretation. The greater the quantity/volume, the deeper the colour (the higher
the value, the tighter the dispersion, etc.). Heat Maps dramatically improve the
ability of existing data visualisations to quickly convey important data insights.
Use Cases: Heat Maps are primarily used to better show the enormous amounts of
data contained inside a dataset and help guide users to the parts of data
visualisations that matter most.
• Average monthly temperatures across the years
• Departments with the highest amount of attrition over time.
• Traffic across a website or a product page.
• Population density/spread in a geographical location.
Best Practices
• Select the proper colour scheme: This style of chart relies heavily on
colour, therefore it's important to pick a colour scheme that complements
the data.
• Specify a legend: As a related point, a heatmap must typically contain a
legend describing how the colours correspond to numerical values.
Example
Region-wise monthly sale of a SKU (stock-keeping unit)
MONTH
ZONE JAN FEB MAR APR MAY JUN JUL AUG SEP OCT NOV DEC
NORTH 75 84 61 95 77 82 74 92 58 90 54 83
SOUTH 50 67 89 61 91 77 80 72 82 78 58 63
EAST 62 50 83 95 83 89 72 96 96 81 86 82
WEST 69 73 59 73 57 61 58 60 97 55 81 92
The distribution of sales is shown in the sample heatmap above, broken down by
zone and spanning a 12-month period. Like in a typical data table, each cell displays
a numeric count, but the count is also accompanied by a colour, with higher counts
denoting deeper hues.
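A minimal sketch (the seaborn library is assumed; only the first three months are shown to keep the sketch short) that renders the zone-wise sales above as a heat map:

    import pandas as pd
    import seaborn as sns
    import matplotlib.pyplot as plt

    sales = pd.DataFrame(
        {"JAN": [75, 50, 62, 69], "FEB": [84, 67, 50, 73], "MAR": [61, 89, 83, 59]},
        index=["NORTH", "SOUTH", "EAST", "WEST"],
    )

    # Colour encodes the count; annot=True also prints the numeric value in each cell
    sns.heatmap(sales, annot=True, cmap="Blues")
    plt.title("Region-wise monthly sale of a SKU")
    plt.show()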
……………………………………………………………………………………
2. What kind of information does a heat map display?
……………………………………………………………………………………
……………………………………………………………………………………
3. What can be seen in heatmap?
……………………………………………………………………………………
……………………………………………………………………………………
Bubble diagrams are used to show the relationships between different variables. They
are frequently used to represent data points in three dimensions, specifically when the
bubble size, y-axis, and x-axis are all present. Using location and size, bubble charts
demonstrate relationships between data points. However, bubble charts have a restricted
data size capability since too many bubbles can make the chart difficult to read.
Although technically not a separate type of visualisation, bubbles can be used to show
the relationship between three or more measurements in scatter plots or maps by adding
complexity. By altering the size and colour of circles, large amounts of data are
presented concurrently in visually pleasing charts.
Use Cases: Usually, the positioning and ratios of the size of the bubbles/circles on this
chart are used to compare and show correlations between variables. Additionally, it is
utilised to spot trends and patterns in data.
• AdWords’ analysis: CPC vs Conversions vs share of total conversions
• Relationship between life expectancy, GDP per capita and population size
Best Practices:
• Add colour: A bubble chart can gain extra depth by using colour.
• Set bubble size in appropriate proportion.
• Overlay bubbles on maps: From bubbles, a viewer can immediately determine
the relative concentration of data. These are used as an overlay to provide the
viewer with context for geographically-related data.
Example
The three variables in this example are sales, profits, and the number of units sold.
Therefore, all three variables and their relationship can be displayed simultaneously
using a bubble chart.
Sales and Profit versus the Quantity sold
(Bubble chart example: sales in INR on the vertical axis against the number of units sold on the horizontal axis, with bubble size representing profit.)
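A minimal matplotlib sketch of such a bubble chart (the sales, units-sold and profit values are hypothetical; the marker size s carries the third variable):

    import matplotlib.pyplot as plt

    units_sold = [200, 450, 700, 950, 1300]
    sales_inr = [5000, 9000, 14000, 21000, 28000]
    profit = [600, 1500, 2500, 4000, 6000]

    # Bubble size is proportional to profit (scaled so the circles stay readable)
    plt.scatter(units_sold, sales_inr, s=[p / 10 for p in profit], alpha=0.5)
    plt.xlabel("Number of units sold")
    plt.ylabel("Sales (in INR)")
    plt.title("Sales and Profit versus the Quantity sold")
    plt.show()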
……………………………………………………………………………………
2. What is a bubble chart used for?
……………………………………………………………………………………
……………………………………………………………………………………
3. What is the difference between scatter plot and bubble chart?
……………………………………………………………………………………
……………………………………………………………………………………
4. What is bubble size in bubble chart?
……………………………………………………………………………….
……………………………………………………………………………….
A bar chart is a graphical depiction of numerical data that uses rectangles (or
bars) with equal widths and varied heights. In the field of statistics, bar charts
are one of the methods for handling data.
Constructing a Bar Chart: The x-axis corresponds to the horizontal line, and
the y-axis corresponds to the vertical line. The y-axis represents frequency in
this graph. Write the names of the data items whose values are to be noted along
the x-axis that is horizontal.
Along the horizontal axis, choose the uniform width of bars and the uniform
gap between the bars. Pick an appropriate scale to go along the y-axis that runs
vertically so that you can figure out how high the bars should be based on the
values that are presented. Determine the heights of the bars using the scale you
selected, then draw the bars using that information.
Types of Bar chart: Bar Charts are mainly classified into two types:
Horizontal Bar Charts: Horizontal bar charts are the type of graph that are
used when the data being analysed is to be depicted on paper in the form of
horizontal bars with their respective measures. When using a chart of this type,
the categories of the data are indicated on the y-axis.
Example:
Vertical Bar Charts: A vertical bar chart displays vertical bars on graph (chart)
paper. These rectangular bars in a vertical orientation represent the
measurement of the data. The quantities of the variables that are written along
the x-axis are represented by these rectangular bars.
Example:
We can further divide bar charts into two basic categories:
Grouped Bar Charts: The grouped bar graph is also referred to as the clustered
bar graph (graph). It is valuable for at least two separate types of data. The
horizontal (or vertical) bars in this are categorised according to their position.
If, for instance, the bar chart is used to show three groups, each of which has
numerous variables (such as one group having four data values), then different
colours will be used to indicate each value. When there is a close relationship
between two sets of data, each group's colour coding will be the same.
Example:
Stacked Bar Charts: The composite bar chart is also referred to as the stacked
bar chart. It illustrates how the overall bar chart has been broken down into its
component pieces. We utilise bars of varying colours and clear labelling to
determine which category each item belongs to. As a result, in a chart with
stacked bars, each parameter is represented by a single rectangular bar. Multiple
segments, each of a different colour, are displayed within the same bar. The
various components of each separate label are represented by the various
segments of the bar. It is possible to draw it in either the vertical or horizontal
plane.
Example:
Use cases: Bar charts are typically employed to display quantitative data. The
following is a list of some of the applications of the bar chart-
• In order to clearly illustrate the relationships between various variables,
bar charts are typically utilised. When presented in a pictorial format,
the parameters can be more quickly and easily envisioned by the user.
• Bar charts are the quickest and easiest way to display extensive
amounts of data while saving time. It is utilised for studying trends over
extended amounts of time.
Best Practices:
• Use a common zero valued baseline
• Maintain rectangular forms for your bars
• Consider the ordering of category level and use colour wisely.
Example:
Region Sales
East 6,123
West 2,053
South 4,181
North 3,316
(Bar chart example: Sales By Region, drawn from the data in the table above.)
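A minimal matplotlib sketch that draws the sales-by-region data above as a horizontal bar chart:

    import matplotlib.pyplot as plt

    regions = ["East", "West", "South", "North"]
    sales = [6123, 2053, 4181, 3316]

    plt.barh(regions, sales)
    plt.xlabel("Sales")
    plt.title("Sales By Region")
    plt.show()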
Constructing a Distribution Plot: You must utilise one or two dimensions, together
with one measure, in a distribution plot. You will get a single line visualisation if you
only use one dimension. If you use two dimensions, each value of the outer, second
dimension will produce a separate line.
Use Cases: Distribution of a data set shows the frequency of occurrence of each
possible outcome of a repeatable event observed many times. For instance:
• Height of a population.
• Income distribution in an economy
• Test scores listed by percentile.
Best Practices:
• It is advisable to have equal class widths.
• The class intervals should be mutually exclusive and non-overlapping.
• Open-ended classes at the lower and upper limits (e.g., <10, >100) should be
avoided.
Example
(Distribution plot example: frequencies of values grouped into equal-width class intervals.)
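A minimal sketch of a distribution plot (seaborn assumed; the test scores are synthetic) showing class frequencies together with a smoothed density curve:

    import numpy as np
    import seaborn as sns
    import matplotlib.pyplot as plt

    rng = np.random.default_rng(0)
    scores = rng.normal(loc=65, scale=12, size=500)   # synthetic test scores

    # Equal-width, non-overlapping classes; kde=True adds a smoothed density estimate
    sns.histplot(scores, bins=10, kde=True)
    plt.xlabel("Test score")
    plt.show()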
The pairs plot is an extension of two fundamental figures, the histogram and the scatter
plot. The histograms along the diagonal show the distribution of each single variable,
while the scatter plots on the upper and lower triangles show the relationship (or lack
thereof) between each pair of variables.
Use Cases: A pairs plot allows us to see both distribution of single variables and
relationships between two variables. It helps to identify the most distinct clusters or the
optimum combination of attributes to describe the relationship between two variables.
• It also helps in building simple classification models by revealing
straightforward linear separations in the data set.
• Analysing socio-economic data of a population.
Best Practices:
• Use a different colour palette.
• For each colour level, use a different marker.
Example:
calories protein fat sodium fiber rating
70 4 1 130 10 68.40297
120 3 5 15 2 33.98368
70 4 1 260 9 59.42551
50 4 0 140 14 93.70491
110 2 2 180 1.5 29.50954
110 2 0 125 1 33.17409
130 3 2 210 2 37.03856
90 2 1 200 4 49.12025
90 3 0 210 5 53.31381
120 1 2 220 0 18.04285
110 6 2 290 2 50.765
120 1 3 210 0 19.82357
110 3 2 140 2 40.40021
110 1 1 180 0 22.73645
110 2 0 280 0 41.44502
100 2 0 290 1 45.86332
110 1 0 90 1 35.78279
110 1 1 180 0 22.39651
110 3 3 140 4 40.44877
110 2 0 220 1 46.89564
100 2 1 140 2 36.1762
100 2 0 190 1 44.33086
110 2 1 125 1 32.20758
110 1 0 200 1 31.43597
100 3 0 0 3 58.34514
120 3 2 160 5 40.91705
120 3 0 240 5 41.01549
110 1 1 135 0 28.02577
100 2 0 45 0 35.25244
110 1 1 280 0 23.80404
100 3 1 140 3 52.0769
110 3 0 170 3 53.37101
120 3 3 75 3 45.81172
120 1 2 220 1 21.87129
110 3 1 250 1.5 31.07222
110 1 0 180 0 28.74241
110 2 1 170 1 36.52368
140 3 1 170 2 36.47151
110 2 1 260 0 39.24111
100 4 2 150 2 45.32807
110 2 1 180 0 26.73452
100 4 1 0 0 54.85092
150 4 3 95 3 37.13686
150 4 3 150 3 34.13977
160 3 2 150 3 30.31335
100 2 1 220 2 40.10597
120 2 1 190 0 29.92429
140 3 2 220 3 40.69232
90 3 0 170 3 59.64284
130 3 2 170 1.5 30.45084
120 3 1 200 6 37.84059
100 3 0 320 1 41.50354
50 1 0 0 0 60.75611
50 2 0 0 1 63.00565
100 4 1 135 2 49.51187
100 5 2 0 2.7 50.82839
120 3 1 210 5 39.2592
100 3 2 140 2.5 39.7034
90 2 0 0 2 55.33314
110 1 0 240 0 41.99893
110 2 0 290 0 40.56016
80 2 0 0 3 68.23589
90 3 0 0 4 74.47295
90 3 0 0 3 72.80179
110 2 1 70 1 31.23005
110 6 0 230 1 53.13132
90 2 0 15 3 59.36399
110 2 1 200 0 38.83975
140 3 1 190 4 28.59279
100 3 1 200 3 46.65884
110 2 1 250 0 39.10617
110 1 1 140 0 27.7533
100 3 1 230 3 49.78745
100 3 1 200 3 51.59219
110 2 1 200 1 36.18756
The pair plot can be interpreted as follows:
Along the boxes of the diagonal, the variable names are displayed.
A scatterplot of the correlation between each pairwise combination of variables is shown
in each of the remaining boxes. For instance, the box in the top right corner of the matrix
shows a scatterplot of the values for rating and sodium, while the boxes along the
diagonal show the distribution of each individual variable. From this single visualisation
we can see the association between each pair of variables in our dataset: for instance,
calories and rating appear to have a negative relationship, whereas protein and fat
appear to be unrelated.
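A minimal sketch (seaborn assumed) that produces such a pairs plot, assuming the nutrition table above has been saved to a hypothetical file named cereal.csv:

    import pandas as pd
    import seaborn as sns
    import matplotlib.pyplot as plt

    # Columns: calories, protein, fat, sodium, fiber, rating (as in the table above)
    df = pd.read_csv("cereal.csv")

    # Histograms on the diagonal, pairwise scatter plots in the other boxes
    sns.pairplot(df[["calories", "protein", "fat", "sodium", "fiber", "rating"]])
    plt.show()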
A graph that depicts change over time by means of points and lines is known as a line
graph, line chart, or line plot. It is a graph that shows a line connecting a lot of points
or a line that shows how the points relate to one another. The graph is represented by
the line or curve that connects successive data points to show quantitative data between
two variables that are changing. The values of these two variables are compared along
a vertical axis and a horizontal axis in linear graphs.
One of the most significant uses of line graphs is tracking changes over both short and
extended time periods. It is also used to compare the changes that have taken place for
diverse groups over the course of the same time period. It is strongly advised to use a
line graph rather than a bar graph when working with data that only has slight
fluctuations.
Example:
2. Multiple Line Graph: The same set of axes is used to plot several lines. An
excellent way to compare similar objects over the same time period is via a
multiple line graph.
Example:
Example:
Constructing a line graph: When we have finished creating the data tables, we will
then use those tables to build the linear graphs. These graphs are constructed by plotting
a succession of points, which are then connected together with straight lines to offer a
straightforward method for analysing data gathered over a period of time. It provides a
very good visual format of the outcome data that was gathered over the course of time.
Use cases: Tracking changes over both short and long time periods is an important
application of line graphs. Additionally, it is utilised to compare changes over the same
time period for various groups. Anytime there are little changes, using a line graph
rather than a bar graph is always preferable.
Best Practices:
• Lines should only be used to connect adjacent values along an interval scale.
• In order to provide correct insights, intervals should be of comparable size.
• Select a baseline that makes sense for your set of data; a zero baseline might
not adequately capture changes in the data.
• Line graphs are only helpful for comparing data sets if the axes have the same
scales.
Example:
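A minimal matplotlib sketch of a multiple line graph (the monthly sales of the two products are hypothetical), with both lines plotted on the same set of axes:

    import matplotlib.pyplot as plt

    months = ["Jan", "Feb", "Mar", "Apr", "May", "Jun"]
    product_a = [40, 42, 45, 50, 48, 55]
    product_b = [30, 33, 31, 36, 40, 43]

    plt.plot(months, product_a, marker="o", label="Product A")
    plt.plot(months, product_b, marker="o", label="Product B")
    plt.ylabel("Units sold")
    plt.legend()
    plt.show()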
A pie chart, often referred to as a circle chart, is a style of graph that can be used to
summarise a collection of nominal data or to show the many values of a single variable
(e.g. percentage distribution). Such a chart resembles a circle that has been divided into
a number of equal halves. Each segment corresponds to a specific category. The overall
size of the circle is divided among the segments in the same proportion as the category's
share of the whole data set.
A pie chart often depicts the individual components that make up the whole. In order to
bring attention to a particular piece of information that is significant, the illustration
may, on occasion, show a portion of the pie chart that is cut away from the rest of the
diagram. This type of chart is known as an exploded pie chart.
Types of a Pie chart: There are mainly two types of pie charts, the 2D pie chart and
the 3D pie chart. These can be further classified into the following categories:
1. Simple Pie Chart: The most fundamental kind of pie chart is referred to simply as
a pie chart and is known as a simple pie chart. It is an illustration that depicts a pie
chart in its most basic form.
Example:
(Pie chart example: percentage distribution of owners.)
2. Exploded Pie Chart: In an exploded pie chart, one or more slices are pulled away
from the rest of the pie rather than being kept joined to it. This is commonly done in
order to draw attention to a certain section or slice of the pie chart.
Example:
3. Pie of Pie: The pie of pie method is a straightforward approach that enables more
categories to be represented on a pie chart without producing an overcrowded and
difficult-to-read graph. A pie chart that is generated from an already existing pie chart
is referred to as a "pie of pie".
Example:
Example:
The angle of each slice of a pie chart is given by the formula (Given Data/Total Value of Data) × 360°.
Use cases: Use pie charts if you want your audience to get a general idea of the
part-to-whole relationship in your data and comparing the exact sizes of the slices is
not critical, or if you want to indicate that a certain portion of the whole is
disproportionately small or large.
• Voting preference by age group
• Market share of cloud providers
Best Practices
• Fewer pie wedges are preferred: The observer may struggle to interpret the chart's
significance if there are too many proportions to compare. Similar to this, keep the
overall number of pie charts on dashboards to a minimum.
• Overlay pies on maps: Pie charts can be used to further deconstruct geographic
tendencies in your data and produce an engaging display.
Example
(Pie chart example: market share of five companies, A to E, with slices of 33%, 24%, 22%, 13%, and 8%.)
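A minimal matplotlib sketch of such a pie chart (the pairing of companies to shares below is illustrative only; the explode argument pulls the first slice out, producing an exploded pie chart):

    import matplotlib.pyplot as plt

    labels = ["Company A", "Company B", "Company C", "Company D", "Company E"]
    share = [24, 22, 33, 13, 8]        # illustrative pairing of shares to labels

    # explode pulls the first slice away from the centre (exploded pie chart)
    plt.pie(share, labels=labels, autopct="%1.0f%%", explode=[0.1, 0, 0, 0, 0])
    plt.title("Market Share")
    plt.show()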
The doughnut chart is a more user-friendly alternative to the pie chart and is often
simpler to read. Like pie charts, these charts express a part-to-whole relationship, in
which all of the parts together represent one hundred percent. A doughnut chart presents
survey questions or data with a limited number of categories for making comparisons.
In comparison to pie charts, they provide for more condensed and straightforward
representations. In addition, the centre hole can be used to assist in the display of
relevant information. You might use them in segments, where each arc would indicate
a proportional value associated with a different piece of data.
Constructing a Doughnut chart: A doughnut chart, like a pie chart, illustrates the
relationship of individual components to the whole, but unlike a pie chart, it can display
more than one data series at the same time. A ring is added to a doughnut chart for each
data series that is plotted within the chart itself. The beginning of the first data series
can be seen near the middle of the chart. A specific kind of pie chart called a doughnut
chart is used to show the percentages of categorical data. The amount of data that falls
into each category is indicated by the size of that segment of the donut. The creation of
a donut chart involves the use of a string field and a number, count of features, or
rate/ratio field.
There are two types of doughnut chart one is normal doughnut chart and another is
exploded doughnut chart. Exploding doughnut charts, much like exploded pie charts,
highlight the contribution of each value to a total while emphasising individual values.
However, unlike exploded pie charts, exploded doughnut charts can include more than
one data series.
Use cases: Doughnut charts are good to use when comparing sets of data. By using the
size of each component to reflect the percentage of each category, they are used to
display the proportions of categorical data. A string field and a count of features,
number, rate/ratio, or field are used to make a doughnut chart.
• Android OS market share
• Monthly sales by channel
Best Practices
• Stick to five slices or less because thinner and long-tail slices become unreadable
and uncomparable.
• Use this chart to display one point in time with the help of the filter legend.
• Well-formatted and informative labels are essential because the information
conveyed by circular shapes alone is not enough and is imprecise.
• It is a good practice to sort the slices to make it more clear for comparison.
Example:
Project Status
Completed 30%
Work in progress 25%
Incomplete 45%
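A minimal matplotlib sketch that renders the project-status data above as a doughnut chart; giving the wedges a reduced width leaves the characteristic hole in the centre:

    import matplotlib.pyplot as plt

    status = ["Completed", "Work in progress", "Incomplete"]
    share = [30, 25, 45]

    # width < 1 turns the pie into a ring, i.e. a doughnut chart
    plt.pie(share, labels=status, autopct="%1.0f%%", wedgeprops={"width": 0.4})
    plt.title("Project Status")
    plt.show()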
An area chart, a hybrid of a line and bar chart, shows the relationship between the
numerical values of one or more groups and the development of a second variable, most
often the passage of time. The inclusion of shade between the lines and a baseline,
similar to a bar chart's baseline, distinguishes a line chart from an area chart. An area
chart has this as its defining feature.
Overlapping area chart: An overlapping area chart results if we wish to look at how
the values of the various groups compare to one another. The conventional line chart
serves as the foundation for an overlapping area chart. One point is plotted for each
group at each of the horizontal values, and the height of the point indicates the group's
value on the vertical axis variable.
All of the points for a group are connected from left to right by a line. A zero baseline
is supplemented by shading that is added by the area chart between each line. Because
the shading for different groups will typically overlap to some degree, the shading itself
incorporates a degree of transparency to ensure that the lines delineating each group
may be seen clearly at all times.
The shading brings attention to the group that has the highest value by showing that
group's pure hue. Take care when one series is not always higher than the other, as the
plot could then be confused with the stacked area chart, which is the other form of area
chart. In such circumstances, the most prudent course of action is to stick to the
traditional line chart.
Stacked area chart: The stacked area chart is what is often meant to be conveyed when
the phrase "area chart" is used in general conversation. When creating the chart of
overlapping areas, each line was tinted based on its vertical value all the way down to
a shared baseline. Plotting lines one at a time creates the stacked area chart, which uses
the height of the most recent group of lines as a moving baseline. Therefore, the total
that is obtained by adding up all of the groups' values will correspond to the height of
the line that is entirely piled on top.
When you need to keep track of both the total value and the breakdown of that total by
groups, you should make use of a stacked area chart. This type of chart will allow you
to do both at the same time. By contrasting the heights of the individual curve segments,
we are able to obtain a sense of how the contributions made by the various subgroups
stack up against one another and the overall sum.
Example:
Year    Printers    Projectors    White Boards
2017    32          45            28
2018    47          43            40
2019    40          39            43
2020    37          40            41
2021    39          49            39
(Stacked area chart drawn from the data in the table above.)
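A minimal matplotlib sketch that draws the table above as a stacked area chart; stackplot piles each series on top of the previous one, so the top line shows the overall total:

    import matplotlib.pyplot as plt

    years = [2017, 2018, 2019, 2020, 2021]
    printers = [32, 47, 40, 37, 39]
    projectors = [45, 43, 39, 40, 49]
    white_boards = [28, 40, 43, 41, 39]

    plt.stackplot(years, printers, projectors, white_boards,
                  labels=["Printers", "Projectors", "White Boards"], alpha=0.7)
    plt.legend(loc="upper left")
    plt.title("Stacked Area chart")
    plt.show()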
Use Cases: In most cases, many lines are drawn on an area chart in order to create a
comparison between different groups (also known as series) or to illustrate how a whole
is broken down into its component pieces. This results in two distinct forms of area
charts, one for each possible application of the chart.
• Magnitude of a single quantitative variable's trend - An increase in a public
company's revenue reserves, programme enrollment from a qualified subgroup by
year, and trends in mortality rates over time by primary causes of death are just a
few examples.
• Comparison of the contributions made by different category members (or
group)- the variation in staff sizes among departments, or support tickets opened
for various problems.
• Birth and death rates over time for a region, the magnitudes of cost vs. revenue for
a business, the magnitudes of export vs. import over time for a country
Best Practices:
• To appropriately portray the proportionate difference in the data, start the y-axis at
0.
• To boost readability, choose translucent, contrasting colours.
• When stacking, keep highly variable data at the top of the chart and data with low variability at the bottom.
• If you need to show how each value over time contributes to a total, use a stacked
area chart.
• However, it is recommended to utilise 100% stacked area charts if you need to
demonstrate a part to whole relationship in a situation where the cumulative total is
unimportant.
Example:
A stacked area chart can, for instance, show the tele-services offered by various television-based applications, where the data records the different types of subscribers using those services in different months.
4.15 SUMMARY
This Unit introduces you to some of the basic charts that are used in data science. The Unit defines the characteristics of histograms, which are very popular in univariate frequency analysis of quantitative variables. It then discusses the importance of, and the various terms used in, box plots, which are very useful when comparing a quantitative variable over some qualitative characteristic. Scatter plots are used to visualise the relationship between two quantitative variables. The Unit also discusses the heat map, which is an excellent visual tool for comparing values. In case three variables are to be compared, you may use bubble charts. The unit also highlights the importance of bar charts, distribution plots, pair plots and line graphs. In addition, it highlights the importance of the pie chart, doughnut charts and area charts for visualising different kinds of data. There are many other kinds of charts used in different analytical tools; you may read about them in the references.
4.16 ANSWERS
ii.
4. The box plot distribution will reveal the degree to which the data are clustered, how
skewed they are, and also how symmetrical they are.
• Positively skewed: The box plot is positively skewed if the distance from the median to the maximum is greater than the distance from the median to the minimum.
• Negatively skewed: The box plot is said to be negatively skewed if the distance from the median to the minimum is greater than the distance from the median to the maximum.
• Symmetric: When the median of a box plot is equally spaced from both the maximum and minimum values, the box plot is said to be symmetric.
• The most practical method for displaying bivariate (2-variable) data is a scatter plot.
• A scatter plot can show the direction of a relationship between two variables when
there is an association or interaction between them (positive or negative).
• The linearity or nonlinearity of an association or relationship can be ascertained
using a scatter plot.
• A scatter plot reveals anomalies, questionably measured data, or incorrectly plotted
data visually.
2.
• The Title- A brief description of what is in your graph is provided in the title.
• The Legend- The meaning of each point is explained in the legend.
• The Source- The source explains how you obtained the data for your graph.
• Y-Axis.
• The Data.
• X-Axis.
3. A scatter plot is composed of a horizontal axis containing the measured values of one
variable (independent variable) and a vertical axis representing the measurements of the
other variable (dependent variable). The purpose of the scatter plot is to display what
happens to one variable when another variable is changed.
4.
• Positive Correlation.
• Negative Correlation.
• No Correlation (None)
3. Using one variable on each axis, heatmaps are used to display relationships
between two variables. You can determine if there are any trends in the
values for one or both variables by monitoring how cell colours vary across
each axis.
4. Any bubbles between 0 and 5 pts on this scale will appear at 5 pt, and
all the bubbles on your chart will be between 5 and 20 pts. To construct
a chart that displays many dimensions, combine bubble size with
colour by value.
Answer 2:
Charts are primarily divided into two categories:
Answer 4:
2. The first row shows a scatter plot of a and b, one of a and c, and finally one of a and d. The second row shows b and a (symmetric to the first row), followed by b and c, b and d, and so on. In a pairs plot, no sums, mean squares, or other calculations are performed; whatever you discover in your pairs plot is present in your data frame.
3. Pair plots are used to determine the most distinct clusters or the best
combination of features to describe a connection between two variables. By
creating some straightforward linear separations or basic lines in our data set,
it also helps to create some straightforward classification models.
2. Tracking changes over a short as well as a long period of time is one of the
most important applications of line graphs. Additionally, it is utilised to
compare the modifications that have occurred for various groups throughout
the course of the same period of time. When dealing with data that has only
minor variations, using a line graph rather than a bar graph is strongly
recommended. For instance, the finance team at a corporation may wish to
chart the evolution of the cash balance that the company now possesses
throughout the course of time.
3.
1. A pie chart, often referred to as a circle chart, is a style of graph that can be used to
summarise a collection of nominal data or to show the many values of a single variable.
(e.g. percentage distribution).
2. There are mainly two types of pie charts: the 2D pie chart and the 3D pie chart. These can be further classified into the following categories:
3. Pie of Pie
4. Bar of Pie
3.
1. The doughnut chart is a more user-friendly alternative to the pie chart and is much simpler to read. These charts express a 'part-to-whole' relationship, in which all of the parts together represent one hundred percent. In comparison to pie charts, they provide more condensed and straightforward representations.
2. A donut chart is similar to a pie chart, with the exception
that the centre is cut off. When you want to display
particular dimensions, you use arc segments rather than
slices. Just like a pie chart, this form of chart can assist you
in comparing certain categories or dimensions to the
greater overall; nevertheless, it has a few advantages over
its pie chart counterpart.
3.
(Doughnut chart titled "Product Sales" showing three products x, y and z with values 60, 30 and 40.)
1. An area chart shows how the numerical values of one or more groups change in
proportion to the development of a second variable, most frequently the passage of time.
It combines the features of a line chart and a bar chart. A line chart can be differentiated
from an area chart by the addition of shading between the lines and a baseline, just like
in a bar chart. This is the defining characteristic of an area chart.
3.
(Area chart for the years 2017 to 2020, with values on the y-axis ranging from 0 to 3500.)
4.17 REFERENCES
Structure
5.1 Introduction
5.2 Objectives
5.3 Big Data and Characteristics
5.4 Big data Applications
5.5 Structured vs semi-structured and unstructured data
5.6 Big Data vs data warehouse
5.7 Distributed file system
5.8 HDFS and Map Reduce
5.9 Apache Hadoop 1 and 2 (YARN)
5.10 Summary
5.11 Solutions/Answers
5.12 Further Readings
5.1 INTRODUCTION
In this modern era of Information and knowledge, terabytes of data are produced by a wide
variety of sources every day. These sources include social media, the data of companies
including production data, customer data, financials etc.; the interactions of users, the data
created by sensors and data produced by electronic devices such as mobile phones and
automobiles, amongst others. This voluminous increase in data relates to the field of Big
Data. This concept of "Big Data" applies to the rapidly expanding quantity of data that is being
collected, and it is described primarily by having the following characteristics: volume,
velocity, veracity, and variety.
The process of deriving useful information and insights from massive amounts of data is
referred to as "big data analytics". It requires the application of technologically advanced tools
and procedures. "Big data architecture" refers to the pattern or design that describes how this
data is received from many sources, processed for further ingestion, assessed, and eventually
made available to the end-user.
The building blocks of big data analytics are found in the big data architecture. In most cases,
the architecture components of big data analytics consist of four logical levels or layers, as
discussed below:
• Big Data Source Layer: A big data environment may manage both batch processing and
real-time processing of big data sources, like data warehouses, relational database
management systems, SaaS applications, and Internet of Things devices. This layer is
referred to as the big data sources layer.
• The Management and Storage Layer: it is responsible for receiving data from the source,
converting that data into a format that the data analytics tool can understand, and storing
the data in accordance with the format in which it was received.
• Analysis Layer: This layer is where the business intelligence that was acquired from the
big data storage layer is analysed.
• The Consumption Layer: it is responsible for collecting the findings from the Big Data Analysis Layer and delivering them to the appropriate output layer, which is also referred to as the Business Intelligence Layer.
The architecture components of big data analytics typically include four logical layers or levels
(discussed above), which perform fundamental processes as given below :
• Connecting to Data Sources: Connectors and adapters can connect to a wide range of
storage systems, protocols, and networks. They can also connect to any type of data
format.
• Data Governance: includes rules for privacy and security that work from the time data is
taken in until it is processed, analysed, stored, and deleted.
• Systems Management: Modern big data architectures are usually built on large-scale,
highly scalable clusters that are spread out over a wide area. These architectures must be
constantly monitored using central management consoles.
• Protecting Quality of Service: The Quality of Service framework helps define data
quality, compliance policies, and how often and how much data should be taken in.
In this Unit, we will discuss Big Data, its characteristics and its applications. We will also discuss Big Data architecture and related technologies such as MapReduce, HDFS, Apache Hadoop and Apache YARN.
5.2 OBJECTIVES
To define the concept of Big Data, we first revise the fundamentals of data and thereafter we will
extend our discussion to the advanced concepts of Big Data. What exactly does "Data" mean?
Data can be numbers or characters or symbols or any other form such as an image or signal,
which can be processed by a computer or may be stored in the storage media or sent in the form
of electrical impulses.
Now, we define - What is Big Data? The term "big data" refers to a collection of information that
is not only extremely large in quantity but also growing at an exponential rate over the course of
time. Because of the vast amount and the high level of complexity, none of the normal methods
that are used for managing data can effectively store or deal with such data. Simply speaking,
"big data" refers to information that is kept in exceedingly vast amounts.
It is vital to have a thorough understanding of the characteristics of big data in order to get a
handle on how the concept of big data works and how it can be used. The characteristics of big
data are:
1. Volume: The amount of information that you currently have access to is referred to as its
volume. We measure the quantity of data that we have in our possession using the units
of Gigabytes (GB), Terabytes (TB), Petabytes (PB), Exabytes (EB), Zettabytes (ZB), and
Yottabytes (YB). According to the patterns that have been seen in the industry, the quantity of data is expected to keep increasing exponentially.
2. Velocity: The speed at which data is generated and therefore should be processed, is
referred to as the "velocity" of that process. When it comes to the efficiency of any
operation involving vast amounts of data, a high velocity is a fundamental requirement.
This characteristic can be broken down into its component parts, which include the rate
of change, activity bursts, and the linking of incoming data sets.
3. Variety: The term "variety" relates to the numerous forms that big data might take. When
we talk about diversity, we mean the various kinds of big data that are available. This is a
major problem for the big data sector since it has an impact on the efficiency of analysis.
It is critical that you organize your data in order to effectively manage its diversity. You
get a wide range of data from a variety of sources.
4. Veracity: The veracity of data relates to its accuracy. Lack of veracity can cause
significant harm to the precision of your findings; it is one of the Big Data characteristics
that is considered to be of the utmost importance.
5. Value: The benefits that your organization gains from utilizing the data are referred to as
"value." Are the results of big data analysis in line with your company's goals? Is it
helping your organization expand in any way? In big data, this is a vital component.
6. Validity: This characteristic of Big Data relates to how valid and pertinent the data is for the purpose for which it is going to be used.
7. Volatility: Volatility relates to the life span of big data. Some data items may be valid for
a very short duration like the sentiments of people.
8. Variability: The field of big data is continually changing. In some cases, the information you obtained from a source earlier may not be the same as what you find today. This phenomenon is known as data variability, and it has an impact on the homogeneity of your data.
9. Visualization: You may be able to express the insights that were generated by big data by
using visual representations such as charts and graphs. The insights that big data
specialists have to provide are increasingly being presented to audiences who are not
technically oriented.
• Big Data can be illustrated by looking at the Stock Exchange, which creates
approximately one terabyte of new trade data every single day.
• Social Media: According to various reports, more than 500 Gigabytes of fresh data are
added to the databases of social media websites each and every day. The uploading of
photos and videos, sending and receiving of messages, posting of comments, and other
activities are the primary sources of this data.
• In just thirty minutes of flying time, a single jet engine is capable of producing more than
ten Gigabytes of data. Because there are many thousands of flights each day, the amount
of data generated can reach many Peta bytes.
A big data system is primarily an application of big data in an organization for decision support.
The following four components are required for a Big Data system to function properly:
Big data may be the future of worldwide governance and businesses. It presents businesses with
a great number of opportunities for improvement. The following are some of the more important
ones:
Based on these opportunities, this field of Big Data has enormous applications in a variety of
fields, and they are discussed in the subsequent section i.e. section 5.4.
It is generally stated that Big Data is one of the most valuable and effective fuels that can power the vast
IT companies of the 21st century. Big data has applications in virtually every industry. Big data enables
businesses to make better use of the vast quantities of data they generate and collect from a variety of
sources. There are many different applications for big data, which is why it is currently one of the skills
that is most in demand. The following are some examples of important applications of big data:
• One of the industries that makes the most use of big data technology is the travel and tourism
sector. It has made it possible for us to anticipate the need for travel facilities in a variety of
locations, thereby improving business through dynamic pricing and a number of other factors.
• The social media sites produce a significant amount of data. Big data allows marketers to make
greater use of the information provided by social media platforms, which results in improved
promotional activities. It enables them to construct accurate client profiles, locate their target
audience, and comprehend the criteria that are important to them.
• Big data technology is utilized widely within the financial and banking sectors. The use of big
data analytics can assist financial institutions in better comprehending the behaviours of their
clients on the basis of the inputs obtained from their investment behaviours, shopping habits,
reasons for investing, and their personal or financial histories.
• The field of healthcare has already witnessed significant changes brought on by Big Data
application implementation. Individual patients can now receive individualised medical care that
is tailored to their specific needs. This is possible due to the application of predictive analytics by
medical professionals and personnel in the healthcare industry.
• The use of big data in recommendation systems is another popular application of big data. Big
data allows businesses to recognize patterns of client behavior in order to provide better and more
individualized services to those customers.
• One of the most important industries that produce and use big data is the telecommunications and
multimedia business. Every day, Zettabytes of new data are created and managed.
• In addition, the Government and the military make heavy use of big data technology. You may
think of the quantity of data that the Government generates on its records; and in the military, a
standard fighter jet plane needs to handle Petabytes of data while it is in the air.
• Companies are able to do predictive analysis because of the capabilities provided by big data
technologies. It lets businesses make more accurate predictions of the outcomes of processes
and events, which in turn helps them reduce risk.
• Companies are able to develop insights that are more accurate because of big data. The big data is
collected by them from a variety of sources, which allows them the capacity to use relevant data
to generate insights that can be put into action. If a corporation has more accurate information, it
will be able to make decisions that are more profitable and reduce risks.
The terms "structured," "unstructured," and "semi-structured" data are frequently brought up whenever
we are having a discussion about data or analytics. These are the three varieties of data that are becoming
increasingly important for many kinds of commercial applications i.e. Structured, Semi-Structured and
Unstructured data. Structured data has been around for quite some time, and even today, conventional
systems and reporting continue to rely on this type of data. In spite of this, there has been a rapid increase
in the production of unstructured and semi-structured data sources over the course of the last several
years. As a consequence of this, an increasing number of companies are aiming to include all three kinds
of data in their business intelligence and analytics systems in order to take these systems to the next level.
In this section of the unit, we are going to discuss the data classification used for understanding and implementing Big Data, i.e. structured, semi-structured and unstructured data.
Structured Data: Structured data is a term that refers to information that has been reformatted and
reorganised in accordance with a data model that has been decided in advance. After mapping the raw
data into the predesigned fields, the data can then be extracted and read using SQL in a straightforward
manner. Relational databases, which are characterised by their organisation of data into tables comprising rows and columns, and the query language supported by them (SQL), provide the clearest example
possible of structured data.
This relational model of the data format decreases the quantity of information that is duplicated. On the other hand, structured data is more interdependent and less flexible than unstructured data.
Both humans and machines are responsible for the generation of this kind of data.
There are several examples of structured data generated by machines, such as data from point-of-sale (POS) terminals, including quantities and barcodes, and statistics from blogs. In a similar vein, anybody who works with data has probably used spreadsheets at least once in their life; spreadsheets are a classic form of structured data generated by humans. Because of the way it is organised, structured data is simpler to examine than semi-structured and unstructured data.
Structured data can be analysed easily because it corresponds to data models that have already been
defined. For example, you can organise structured data such as names of customers in alphabetical order,
organise telephone numbers in the appropriate format, organise social security numbers in the correct
format etc.
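As a small illustration of how structured data maps onto a predefined schema and can be read back with SQL, the following Python sketch uses the built-in sqlite3 module; the table and column names are purely illustrative.

import sqlite3

con = sqlite3.connect(":memory:")                       # an in-memory relational database
con.execute("CREATE TABLE customers (name TEXT, phone TEXT)")
con.executemany("INSERT INTO customers VALUES (?, ?)",
                [("Asha", "98765 43210"), ("Ravi", "91234 56780")])

# Structured data can be queried and ordered directly with SQL
for name, phone in con.execute("SELECT name, phone FROM customers ORDER BY name"):
    print(name, phone)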
Unstructured data: Information that is displayed in its most unprocessed form is referred to as
"unstructured data”. It is really difficult to work with this data because it has a minimal structure, and the
formatting is also very confusing. Unstructured data management can collect data from a variety of
sources, such as postings on social media platforms, conversations, satellite imagery, data from the
Internet of Things (IoT) sensor devices, emails, and presentations, and organise it in a storage system in a
logical and specified manner.
Semi-Structured Data: There is a third kind of data that falls between structured and unstructured data
called semi-structured data or partially structured data. Your data sets might not always be structured or
unstructured. One variety of such data is known as semi-structured data, and it differs from unstructured
data in that it possesses both consistent and definite qualities. It does not restrict itself to a fixed structure
like those that are required for relational databases. Although organizational qualities such as metadata or
semantics tags are utilised with semi-structured data in order to make it more manageable, there is still a
certain amount of unpredictability and inconsistency present in the data.
Use of delimited files is one illustration of a data format that is only semi-structured. It has elements that
are capable of separating the hierarchies of the data into their own distinct structures. In a similar manner,
digital images have certain structural qualities that make them semi-structured, but the image itself does
not have a pre-defined structure. If a picture is shot with a Smartphone, for instance, it will include certain
structured properties like a device ID, a geotag, and a date and time stamp. After they have been saved,
pictures can be organized further by affixing tags to them, such as "pet" or "dog," for example.
Due to the presence of one or more characteristics that allow for classification, unstructured data may on
occasion be categorized as semi-structured data instead. To summarise, organisations need to analyse all
three kinds of data in order to stay ahead of the competition and make the most of the knowledge they
have.
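To make the contrast concrete, the short Python sketch below treats a smartphone photo's metadata, like the example above, as a semi-structured JSON record: it carries tags and fields, but different records may carry different fields. The field names are illustrative.

import json

photo = json.loads("""
{
  "device_id": "SM-123",
  "geotag": [28.61, 77.21],
  "timestamp": "2021-05-04T10:30:00",
  "tags": ["pet", "dog"]
}
""")

# Fields can be present or absent; .get() copes with records that lack a field
print(photo["device_id"], photo.get("tags", []))
print(photo.get("resolution", "not recorded"))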
What should you buy for implementing an analytics system for your organization - a big data solution or
a data warehouse? A data warehouse and a big data solution are quite similar in many respects. Both have
the capacity to store a significant amount of information. When reporting, either option is acceptable to
utilize. But the question that needs to be answered is whether they can truly be utilised as a substitute for
each other. In order to understand this, you need to have a conceptual understanding of both i.e., Big data
and Data Warehouse.
The form of big data that can be found in Hadoop, Cloudera, and other similar big data platforms is the
type that is understood by most people. The following are the working definitions of big data solutions
(You may find an accurate functioning definition in the websites of Cloudera or HortonWorks.):
But despite this, "Big Data" and "Data Warehouse" are not the same thing at all. Why? In order to
understand this, we need to recapitulate the basic concepts of a data warehouse.
A data warehouse is a subject-oriented, non-volatile, integrated, and time-variant collection of data that is
generated for the aim of the management making decisions with such data. A data warehouse provides a
centralized, integrated, granular, and historical point of reference for all of the company's data in one
convenient location.
So why do individuals demand a solution that involves big data? People are looking for a solution for big
data because there is a significant amount of data in many organizations. And in such companies, that
data – if it is accessed properly – can include a great deal of important information that can lead to better
decisions, which in turn can lead to more revenue, more profitability, and more customers. And the vast
majority of businesses want to do this.
What are the reasons that people require a data warehouse? In order for people to make decisions based
on accurate information, a data warehouse is required. You need data that is trustworthy, verifiable, and
easily accessible to everyone if you want to have a thorough understanding of what is occurring within
your company.
So, in short, a data warehouse involves the process of collecting data from a number of sources, processing it, and storing it in a repository where it can be analysed and used for reporting purposes. This process results in the establishment of a data repository known as the data warehouse.
The following points compare Big Data with a data warehouse:
1. Big Data refers to data that is stored in an extremely large format and that can be utilised by many technologies, whereas the term "data warehouse" refers to the accumulation of historical data from a variety of business processes within an organisation.
2. The term "big data" refers to a type of technology that can store and manage extensive amounts of data, whereas a data warehouse is a structure that is utilised in the process of data organisation.
3. Big Data accepts structured, unstructured or semi-structured data as input, whereas the only type of data that can be used as input to a data warehouse is structured data.
4. A distributed file system is used for processing big data, whereas processing operations in the data warehouse are not carried out using a distributed file system.
5. When it comes to retrieving data from databases, big data may not use SQL queries, whereas in the data warehouse we use structured query language (SQL) to get data from relational databases.
7. In big data, the modifications that occur as a result of adding new data are saved in the form of a file, which is represented by a table, whereas changes that occur as a result of adding new data do not have an immediate and direct effect on the data warehouse.
8. Compared to data warehouses, big data does not require the same level of effective management strategies, whereas the data warehouse calls for more effective management strategies because its data is compiled from a variety of business divisions.
Big Data uses Distributed File System. So, we extend our discussion on Distributed file Systems in
section 5.7 of this unit.
A Distributed File System (DFS) is a file system that is distributed across a large number of file servers, which may be at several locations (it is possible that these servers are located in various parts of the world). It allows programs to access or store remote files in the same way as local files, so that files can be accessed from any network or computer.
The primary objective of the Distributed File System (DFS) is to facilitate the sharing of users' data and resources between physically separate systems through the use of a common file system. A Distributed File System setup is a group of workstations and mainframes linked together over a Local Area Network (LAN). A DFS is implemented within the framework of the operating system. The DFS generates a namespace, and the clients are not exposed to the inner workings of this generation process.
It is possible to use the namespace component of Distributed File System (DFS) without using the file
replication component, and it is perfectly possible to use the file replication component of Distributed File
System (DFS) between servers without using the namespace component. It is not essential to make use of
both of the DFS components at the same time.
In order to gain more clarity on the Distributed File System (DFS), we need to study the Features of DFS:
1. Transparency:
When discussing distributed systems, the term "transparency" refers to the process of hiding
information about the separation of components from both the user and the application
programmer. This is done in order to protect the integrity of the system. Because of this, it seems
as though the whole system is a single thing, rather than a collection of distinct components
working together. It is classified into the following types:
a) Structure Transparency –
There is no reason for the client to be aware of the number of file servers and storage
devices and their geographical locations. But it is always recommended to provide a
number of different file servers to improve performance, adaptability, and dependability.
b) Access Transparency –
The access process for locally stored files and remotely stored files should be identical. It
should be possible for the file system to automatically locate the file that is being
accessed and send that information to the client's side.
c) Naming transparency –
There should be no indication in the name of the file as to where the file is located in any
way, shape, or form. Once a name has been assigned to the file, it should not have any
changes made to it while it is being moved from one node to another.
d) Replication transparency –
If a file is copied on multiple nodes, the locations of the copies of the file as well as the
copies themselves should be hidden when moving from one node to another.
2. User Mobility: It will automatically transfer the user's home directory to the node where the user
logs in.
3. Performance: Performance is measured by the average amount of time it takes to satisfy a client's request. This time includes CPU time, the time it takes to access secondary storage, and network access time. Ideally, depending on the requirements, a Distributed File System should perform as well as a centralized file system (CFS).
4. Simplicity and ease of use: A file system should have a simple user interface and a small number of commands.
5. High availability: If a link fails, a node fails, or a storage drive crashes, a Distributed File
System should still be able to keep working. A distributed file system (DFS) that is both reliable
and flexible should have different file servers that control different storage devices that are also
separate.
6. Scalability: Adding new machines to the network or joining two networks together is a common
way to make the network bigger, so the distributed system will always grow over time. So, a good
distributed file system (DFS) should be built so that it can grow quickly as the number of nodes
and users grows. As the number of nodes and users grows, the service should not deteriorate too
much.
7. High reliability: A good distributed file system (DFS) should make it as unlikely as possible that
data will be lost. That is, users should not feel like they have to make backup copies of their files
because the system is not reliable. Instead, a file system should make copies of important files in
case the originals get lost. Stable storage is used by many file systems to make them very reliable.
8. Data integrity: A file system is often used by more than one person at a time. The file system
must make sure that the data saved in a shared file stays the same. That is, a concurrency control
method must be used to keep track of all the different users' requests to access the same file at the
same time. Atomic transactions are a type of high-level concurrency management that a file
system often offers to users to keep their data safe.
9. Security: A distributed file system should be safe so that its users can trust that their data will be
kept private. Security measures must be put in place to protect the information in the file system
from unwanted and unauthorized access.
10. Heterogeneity: Because distributed systems are so big, there is no way to avoid heterogeneity.
Users of heterogeneous distributed systems can choose to use different kinds of computers for
different tasks.
The working of DFS can be put into practice in one of two different ways:
• Standalone DFS namespace: permits only DFS roots that are located on the local computer and does not make use of Active Directory. A standalone DFS can only be accessed on the computer on which it was initially created. It offers no fault tolerance and cannot be linked to any other DFS. Standalone DFS roots are not very common because of their limited advantage.
• Domain-based DFS namespace — This creates the DFS namespace root, which can be accessed at \\<domainname>\<dfsroot>, and stores the DFS configuration in Active Directory.
Advantages of a Distributed File System include the following:
c) DFS facilitates file searching, speeds up access to data, and improves network performance.
d) DFS improves the scalability of data and the ability to exchange data.
e) DFS maintains data transparency and replication; even if a server or disc fails, the Distributed File System keeps the data available.
Disadvantages of a Distributed File System include the following:
b) During the process of moving data from one node to another in the network, there is a chance that some of the messages and data will be lost.
c) While using a distributed file system, difficulty may be encountered when attempting to connect to a database.
d) When compared to a single-user system, a Distributed File System makes database management more difficult.
e) If every node in the network attempts to transfer data at the same time, there is a possibility that the network will become overloaded.
After understanding the Distributed File System (DFS), now it is time to understand Hadoop and HDFS
(Hadoop Distributed File System). The discussion for HDFS (Hadoop Distributed File System) and
MapReduce is given in the subsequent section, and the details of MapReduce are explicitly available in
Unit-6.
1. Describe the term "Distributed File System" in the context of Big Data.
………………………………………………………………………………………………………
………………………………………………………………………………………………………
2. What is the primary objective of Distributed File System?
………………………………………………………………………………………………………
………………………………………………………………………………………………………
3. Explain the components of Distributed File System.
………………………………………………………………………………………………………
………………………………………………………………………………………………………
4. Discuss the features of Distributed File System
………………………………………………………………………………………………………
………………………………………………………………………………………………………
5. Discuss the number of ways through which the working of Distributed File System can
be put into practice
………………………………………………………………………………………………………
………………………………………………………………………………………………………
6. Give advantages and disadvantages of Distributed File Systems.
………………………………………………………………………………………………………
………………………………………………………………………………………………………
Data is being produced at lightning speed in the modern period by a wide variety of sources, such as corporations, scientific data, e-mails, blogs, and other online platforms. To analyse and manage this massive amount of data, and to extract information that is of use to the users, it is necessary to implement data-intensive applications and storage clusters. Such applications must provide a number of capabilities, such as fault tolerance, parallel processing, data distribution, load balancing, scalability, and highly available operation. The MapReduce programming approach was developed by Google specifically to resolve problems of this kind. Apache Hadoop, commonly known simply as Hadoop, is an open-source software project that implements the MapReduce technology.
Hadoop is a collection of freely available software services that can be used in conjunction with one another. It offers a software framework for storing huge amounts of data across many machines and for processing that data using the MapReduce programming style. Hadoop's essential components are the Hadoop Distributed File System (HDFS) and the MapReduce programming paradigm: MapReduce is the programming model used to process data, and HDFS is the file system used to store it. Together, HDFS and MapReduce create an architecture that, in addition to being scalable and fault-tolerant, conceals much of the complexity associated with the analysis of big data.
Hadoop Distributed File System, abbreviated as HDFS, is a self-healing, distributed file system that offers dependable, scalable, and fault-tolerant data storage on commodity hardware. It works closely with MapReduce by distributing storage and processing across large clusters, combining storage resources that can scale up or down depending on the requests and queries being processed. HDFS is not format-specific; it can store data in any format, including text, images, videos, and so on, and it automatically optimises the data for high-bandwidth streaming. Fault tolerance is the most significant benefit of HDFS: by ensuring rapid data movement between the nodes, Hadoop can continue to offer service even if individual nodes fail, which reduces the possibility of a catastrophic failure.
The information presented in this section can be used for understanding the development of large-scale
distributed applications that can make use of the computing capacity of several nodes in order to finish
tasks that are data and computation intensive.
Let us discuss the Hadoop architecture, which includes the following components. An understanding of the Hadoop architecture requires an introduction to the various daemons involved in its working. Apache Hadoop involves five daemons: three of them, namely the NameNode, the DataNode and the Secondary NameNode, relate to HDFS and efficiently manage distributed storage, while the JobTracker and TaskTracker are used by the MapReduce engine and are responsible for job tracking and job execution respectively. Each of these daemons runs in its own JVM.
Firstly, we will discuss HDFS, a distributed file system comprising three daemons: the NameNode, the DataNode and the Secondary NameNode.
1) NameNode : A single NameNode daemon operates on the master node. NameNode is responsible
for storing and managing the metadata that is connected with the file system. This metadata is
stored in a file that is known as fsimage. When a client makes a request to read from or write to a
file, the metadata is held in a cache that is located within the main memory so that the client may
access it more rapidly. The I/O tasks are completed by the slave DataNode daemons, which are
directed in their actions by the NameNode.
The NameNode is responsible for managing and directing how files are divided into blocks, selecting which slave nodes should store these blocks, and monitoring the overall health and fitness of the distributed file system. Memory and input/output (I/O) are both put to intensive use in the operations carried out by the NameNode.
2) DataNode : A DataNode daemon is present on each slave node, which is a component of the
Hadoop cluster. DataNodes are the primary storage parts of HDFS. They are responsible for
storing data blocks and catering to requests to read or write files that are stored on HDFS. These
are under the authority of NameNode. Blocks that are kept in DataNodes are replicated in
accordance with the configuration in order to guarantee both high availability and reliability.
These duplicated blocks are dispersed around the cluster so that computation may take place more
quickly.
3) Secondary NameNode : A backup for the NameNode is not provided by the Secondary
NameNode. The job of the Secondary NameNode is to read the file system at regular intervals,
log any changes that have occurred, and then apply those changes to the fsimage file. This assists
in updating NameNode so that it can start up more quickly the next time, as shown in Figure 3.
The HDFS layer is responsible for daemons that store data and information, such as NameNode and
DataNode. The MapReduce layer is in charge of JobTracker and TaskTracker, which are seen in Figure 1
and are responsible for keeping track of and actually executing jobs.
(Figure 1: Hadoop daemons. NameNode and DataNode belong to the HDFS layer; JobTracker and TaskTracker belong to the MapReduce layer.)
The master/slave architecture is utilised by HDFS, with the NameNode daemon and secondary
NameNode both operating on the master node and the DataNode daemon running on each and every slave
node, as depicted in Figure 2. The HDFS storage layer consists of three different daemons.
A Hadoop cluster consists of several slave nodes, in addition to a single master node. NameNode is the
master daemon for the HDFS storage layer, while JobTracker is the master daemon for the MapReduce
processing layer. Both of these daemons are executed by the master node. The remaining machines will
be responsible for running the "slave" daemons, which include DataNode for the HDFS layer and
TaskTracker for the MapReduce layer.
A master node may at times also take on the role of a slave, so the master node is capable of running both the master daemons and the slave daemons. The daemons operating on the master node are responsible for coordinating and administering the slave daemons running on all of the other nodes, while the slave daemons carry out the tasks essential for the processing and storage of data.
Now we will discuss the MapReduce concept and the daemons involved with MapReduce i.e. the
JobTracker and TaskTracker components that are utilized by the MapReduce engine.
MapReduce can refer to either a programming methodology or a software framework. Both are utilised in
Apache Hadoop. Hadoop MapReduce is a programming framework that is made available for creating
applications that can process and analyse massive data sets in parallel on large multi-node clusters of
commodity hardware in a manner that is scalable, reliable, and fault tolerant. The processing and analysis
of data consist of two distinct stages known as the Map phase and the Reduce phase.
During a MapReduce job, the input data is typically partitioned into pieces that are processed in parallel, first by the Map phase and then by the Reduce phase. The data generated by the Map phase is sorted and organised by the Hadoop framework.
The result of the Map phase is sorted by the Hadoop framework, and this information is then sent as input
to the Reduce phase in order to begin parallel reduce jobs (see Figure 4). These input and output files are
saved in the system's file directory. The HDFS file system is the source of input datasets for the
MapReduce framework by default. It is not required that tasks involving Map and Reduce proceed in a
sequential way. This means that reduce jobs can begin as soon as any of the Map tasks finishes the work
that has been given to them. It is also not required that all Map jobs be finished before any reduction tasks
begin their work. MapReduce operates on key-value pairs as its data structure. In theory, a MapReduce
job will accept a data set as an input in the form of a key-value pair, and it will only produce output in the
form of a key-value pair after processing the data set through MapReduce stages. As can be seen in
Figure 5, the output of the Map phase, which is referred to as the intermediate results, is sent on to the
Reduce phase as an input.
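The key-value flow described above can be simulated in a few lines of ordinary Python. This is only a single-machine sketch of the idea (Hadoop itself distributes each phase across many nodes); the documents and words used here are illustrative.

from collections import defaultdict

documents = ["big data needs big storage", "map reduce processes big data"]

# Map phase: emit a (word, 1) pair for every word in every input split
mapped = [(word, 1) for doc in documents for word in doc.split()]

# Shuffle/sort: group all values that share the same key
grouped = defaultdict(list)
for key, value in mapped:
    grouped[key].append(value)

# Reduce phase: aggregate the list of values for each key
counts = {key: sum(values) for key, values in grouped.items()}
print(counts)          # e.g. {'big': 3, 'data': 2, ...}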
On the same lines as HDFS, MapReduce also makes use of a master/slave architecture. As illustrated in
Figure 6, the JobTracker daemon resides on the master node, while the TaskTracker daemon resides on
each of the slave nodes.
The MapReduce processing layer consists of two different daemons i.e. JobTracker and TaskTracker,
their role are discussed below:
Fig 6: JobTracker and TaskTracker
4) JobTracker: The JobTracker service is responsible for monitoring MapReduce tasks that are carried out
on slave nodes and is hosted on the master node. The job is sent to the JobTracker by the user through
their interaction with the Master node. The next thing that happens is that JobTracker queries NameNode
to find out the precise location of the data in HDFS that needs to be processed. JobTracker searches for
TaskTracker on slave nodes and then sends the jobs to be processed to TaskTracker on those nodes. The
TaskTracker will occasionally send a heartbeat message back to the JobTracker to verify that the
TaskTracker on a specific slave node is still functioning and working on the task that has been assigned to
it. If the heartbeat message is not received within the allotted amount of time, the TaskTracker running on
that particular slave node is deemed to be inoperable, and the work that was assigned to it is moved to
another TaskTracker to be scheduled. The combination of JobTracker and TaskTracker is referred to as
the MapReduce engine. If there is a problem with the JobTracker component of the Hadoop MapReduce
service, all active jobs will be halted until the problem is resolved.
5)TaskTracker: On each slave node that makes up a cluster, a TaskTracker daemon is executed. It works
to complete MapReduce tasks after accepting jobs from the JobTracker. The capabilities of a node
determine the total amount of "task slots" that are available in each TaskTracker. Through the use of the
heartbeat protocol, JobTracker is able to determine the number of "task slots" that are accessible in
TaskTracker on a slave node. It is the responsibility of JobTracker to assign suitable work to the relevant
TaskTrackers, and the number of open task slots will determine how many jobs can be assigned. On every
slave node, TaskTracker is the master controller of how each MapReduce action is carried out. Even
though there is only one TaskTracker for each slave node, each TaskTracker has the ability to start
multiple JVMs so that MapReduce tasks can be completed simultaneously. JobTracker, the master node,
receives a "heartbeat" message regularly from each slave node's TaskTracker. This message confirms to
JobTracker that TaskTracker is still functioning.
In short, it can be said that, when it comes to processing massive and unstructured data volumes, the
Hadoop MapReduce computing paradigm and HDFS are becoming increasingly popular choices. While
masking the complexity of deploying, configuring, and executing the software components in the public
or private cloud, Hadoop makes it possible to interface with the MapReduce programming model. Users
are able to establish clusters of commodity servers with the help of Hadoop. MapReduce has been
modelled as an independent platform-as-a-service layer that cloud providers can utilise to meet a variety
of different requirements. Users are also given the ability to comprehend the data processing and analysis
processes.
Apache Hadoop 1.x suffered from a number of architectural flaws, the most notable of which was a
decline in the overall performance of the system. The cause of the problem was the excessive strain placed on the MapReduce component of Hadoop 1, which was responsible for both application management and resource management. In Hadoop 2, resource management is handled by a new component known as YARN (Yet Another Resource Negotiator), while MapReduce remains in charge of application management. With the release of Hadoop 2, YARN has added two additional daemons. These are:
• Resource Manager
• Node Manager
These two new Hadoop 2 daemons have replaced the JobTracker and TaskTracker in Hadoop 1.
In addition, there is only one NameNode in a Hadoop 1.x cluster, which implies that it serves as a single point of failure for the entire system. The Hadoop 2.x architecture, on the other hand, incorporates both active and passive NameNodes. In the event that the active NameNode is unable to complete the tasks assigned to it, the passive NameNode steps in and assumes responsibility. This element is directly responsible for the high availability of Hadoop 2.x, which is one of the most important characteristics of the new version.
The processing of data is problematic in Hadoop 1.x; Hadoop 2.x's YARN, however, provides centralised resource management. This makes it possible for multiple applications within Hadoop to execute concurrently while sharing a common pool of resources.
Limitation:
• Hadoop 1: The architecture of Hadoop 1 is a Master-Slave architecture in which one master rules over a large number of slaves. In the event that the master node experienced a catastrophic failure, the cluster would be wiped out regardless of the quality of the slave nodes. Again, in order to recreate that cluster, you have to copy system files, image files, and so on onto another machine, which takes an excessive amount of time and is something that enterprises just cannot accept.
• Hadoop 2: Hadoop 2 utilises the same Master-Slave architecture as its predecessor. However, it is made up of a number of different masters (active NameNodes and standby NameNodes) and a number of different slaves. In the event that the active master node suffers a crash, the standby master node takes control of the cluster. A wide variety of different active-standby node combinations can be created. As a result, the issue of having a single point of failure is resolved with Hadoop 2.
Ecosystem:
• Data processing tools such as Pig, Hive, and Mahout sit above Hadoop and execute their jobs there.
• Sqoop is an application that can import and export structured data. Using a SQL database, it is possible to import and export data into HDFS directly.
• Flume is a tool that is used to import and export streaming data as well as unstructured data.
Support:
• Hadoop 1.x: No support for Microsoft Windows is available.
• Hadoop 2.x: Support for Microsoft Windows is available.
☞ Check Your Progress 5
1. What is Hadoop? Discuss the components of Hadoop.
………………………………………………………………………………………………………
………………………………………………………………………………………………………
2. Discuss the role of HDFS and MapReduce in Hadoop Architecture
………………………………………………………………………………………………………
………………………………………………………………………………………………………
3. Discuss the relevance of various nodes in HDFS architecture
………………………………………………………………………………………………………
………………………………………………………………………………………………………
4. Explain, how master/slave process works in HDFS architecture
………………………………………………………………………………………………………
………………………………………………………………………………………………………
5. What do you understand by the Map phase and reduce phase in MapReduce architecture?
………………………………………………………………………………………………………
………………………………………………………………………………………………………
6. List and explain the various daemons involved in the functioning of MapReduce.
………………………………………………………………………………………………………
………………………………………………………………………………………………………
7. Differentiate between Apache Hadoop-1 and Hadoop-2.
………………………………………………………………………………………………………
………………………………………………………………………………………………………
5.10 SUMMARY
The unit covers the concepts necessary for understanding Big Data and its characteristics; the concept of Big Data is also covered from the applications point of view. The unit further compares structured, semi-structured and unstructured data, and presents a comparison of Big Data and the data warehouse. It also covers the important concepts of distributed file systems used in Big Data. Finally, the unit concludes with a comparative presentation of MapReduce and HDFS, along with Apache Hadoop 1 and 2 (YARN).
5.11 SOLUTIONS/ANSWERS
5.12 FURTHER READINGS
• D. Borthakur, "The Hadoop Distributed File System: Architecture and Design," Hadoop Project Website [online]. Available: http://hadoop.apache.org/core/docs/current/hdfs_design.pdf
UNIT 6 PROGRAMMING USING MAPREDUCE
Structure
6.0 Introduction
6.1 Objectives
6.2 Map Reduce Operations
6.3 Loading data into HDFS
6.3.1 Installing Hadoop
6.3.2 Loading Data
6.4 Executing the MapReduce phases
6.4.1 Executing the Map phase
6.4.2 Shuffling and sorting
6.4.3 Reduce phase execution
6.4.4 Node Failure and MapReduce
6.5 Algorithms using MapReduce
6.5.1 Word counting
6.5.2 Matrix-Vector Multiplication
6.6 Summary
6.7 Answers
6.8 References and further readings
6.0 INTRODUCTION
In the Unit 5 of this Block, you have gone through the concepts of HDFS and have been
introduced to the map-reduce programming paradigm. HDFS is a distributed file
system, which can help in reliable storage of large amount of data files across
cluster machines. This unit discusses the map reduce programming concepts
pertaining to how to leverage cluster machines to perform a particular task by
dividing the tasks effectively across them in a reliable and fault tolerant way.
These tasks may include big jobs, such as building the index of a search engine from crawled webpages. This Unit illustrates the three stages in MapReduce, viz. the Map Phase, the Shuffle Phase and the Reduce Phase. It also discusses map and reduce operations and how we can load and store data in HDFS. Lastly, this unit provides a few classical examples, such as word count in documents, matrix-vector multiplication, etc.
6.1 OBJECTIVES
Prior to 2004, huge amounts of data were typically stored on single servers. If a program ran a query involving data stored on multiple servers, there was no easy means of logically integrating the data for analysis, and such queries required a massive amount of computation time and effort. Furthermore, there were threats of data loss and problems of backup, which reduced scalability. To cater to this, Google introduced the MapReduce model in December 2004 (later implemented in the open-source Apache Hadoop project), which led to a significant reduction in analysis time. It allows various queries to run simultaneously on several server machines and logically integrates the results, thus facilitating near real-time analysis. Other advantages of MapReduce include i) fault tolerance and ii) scalability.
MapReduce is a programming model that can process as well as analyse huge data sets across machine clusters. The mapper filters and sorts the input data, while the reducer summarises it by aggregating the values for each key, pruning unnecessary or bad data along the way.
Hadoop MapReduce
Hadoop MapReduce is the processing unit of Hadoop, with which we can process the big data stored in HDFS. It allows parallel and distributed computing on huge data sets. MapReduce is used in indexing and searching of data, classification of data, recommendation systems, and analysis of data.
There are two functions in MapReduce i.e., one is the Map function and the other is
Reduce function. The architecture of MapReduce is shown in Figure 1. You may
observe there are a number of map and reduce functions in Figure 1. All the map
functions are performed in parallel to each other, similarly all reduce functions are also
performed in parallel to each other. These activities are coordinated by the MapReduce
framework, which also deals with the failure of the nodes performing these operations.
The working of the MapReduce architecture shown in Figure 1 can be summarised as follows.
Thus, in MapReduce programming, an entire task is divided into a map task
and a reduce task. Map takes a <key, value> pair as input and produces a list of
<key, value> pairs as output. Reduce takes as input a key together with the shuffled
list of values for that key, and its final output is again a <key, value> pair, as
shown in Figure 2.
Figure 2: Key-value pair in MapReduce
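The same flow can be sketched in a few lines of Python. The sketch below is only an
illustration of the model (the function names and the tiny driver are ours, not part of
Hadoop): map emits a list of <key, value> pairs for each input record, the pairs are
grouped by key, and reduce turns each key and its list of values into a final result.

from collections import defaultdict

def map_fn(key, value):
    # Emit a list of <key, value> pairs for one input record.
    return [(word, 1) for word in value.split()]

def reduce_fn(key, values):
    # Combine the list of values collected for one key.
    return (key, sum(values))

def run_mapreduce(records):
    grouped = defaultdict(list)              # shuffle: group map output by key
    for key, value in records:
        for k2, v2 in map_fn(key, value):
            grouped[k2].append(v2)
    return [reduce_fn(k, vs) for k, vs in sorted(grouped.items())]

print(run_mapreduce([(0, "big data is big"), (1, "data science")]))
# [('big', 2), ('data', 2), ('is', 1), ('science', 1)]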
Next, we discuss the data loading into HDFS system, which is the starting point
of working with MapReduce architecture.
In order to load data into HDFS, you first need to install Hadoop MapReduce on your
system. The minimum configuration for a Hadoop/HDFS installation is:
1) Intel Core 2 Duo/Quad/hex/Octa or higher end 64 bit processor PC or Laptop
(Minimum operating frequency of 2.5GHz)
2) Hard Disk capacity of 1- 4TB.
3) 64-512 GB RAM
4) 10 Gigabit Ethernet or Bonded Gigabit Ethernet
5) OS: Ubuntu/any Linux distribution/Windows
In the programs explained below, we have used Ubuntu as the operating system.
6.3.1 Installing Hadoop
The step-by-step installation of Hadoop MapReduce is as follows:
Step 1: Install Java version using the following command:
sudo apt install openjdk-8-jdk
Step 2: For installing the ssh to securely connect to remote server/system and
transfer the data in encrypted form, you may use the following command:
sudo apt-get install ssh
Step 3: For installing pdsh, the parallel shell tool used to run commands across
multiple nodes in a cluster, you may use the following command:
sudo apt-get install pdsh
Step 5: Extract the downloaded file and save the extracted files to the desired
location using the command:
Step 9: Edit the XML configuration files according to our requirement and paste the
given configuration.
sudo nano core-site.xml
Now edit hdfs-site.xml:
sudo nano hdfs-site.xml
Step 10: Open localhost
export PDSH_RCMD_TYPE=ssh
6.3.2 Loading Data
In order to use the MapReduce feature of Hadoop, you need to load the data into
HDFS. We will show the steps for the word count operation on an input file named
input.txt, as shown in Figure 3.
For the same example as in Figure 3, we will now use the MapReduce feature of
Hadoop. First, move to the Hadoop folder in the terminal:
cd ~/Downloads/hadoop-3.3.3/
cd share/hadoop/mapreduce/
On opening one of the example jar files, you will find the different classes that it
contains.
For the example in Figure 3, you can see the map, shuffle and reduce
operations in Figure 4.
Figure 4 Map, Shuffle and Reduce operations for "input.txt" file
The Map and Reduce phases are executed as parallel processes. In the Map phase, the
input is split among the mapper nodes; each mapper processes its chunk and maps it to
keys, forming (key, value) tuples. These tuples are then passed to the nodes where
sorting and shuffling of tuples takes place, i.e. the tuples are sorted and grouped on
their keys so that all tuples with the same key are sent to the same node.
Shuffling is the process of moving data from the mappers to the reducers: the system
sorts the map output and feeds it as input to the reducers. Without the shuffle phase,
the reducers would not receive any input from the map phase. As shuffling can begin
even before the map phase is complete, it speeds up job completion and saves time.
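The shuffle-and-sort step can be simulated in ordinary Python (for illustration only,
not Hadoop code) by sorting the intermediate pairs on their key and grouping them, so
that each reduce call receives one key together with all of its values:

from itertools import groupby
from operator import itemgetter

# Intermediate <key, value> pairs as they might come out of several mappers.
intermediate = [("bus", 1), ("car", 1), ("bus", 1), ("train", 1), ("car", 1)]

# Shuffle and sort: order by key, then group all values of the same key together.
intermediate.sort(key=itemgetter(0))
for key, group in groupby(intermediate, key=itemgetter(0)):
    print(key, [v for _, v in group])   # e.g. bus [1, 1] becomes the input of one reduce call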
Hadoop's Reducer takes the intermediate key-value pairs produced by the Mapper and
processes each key separately to produce the output. The final result, which is the
reducer's output, is saved in HDFS. Typically, aggregation or summation-type
computations are done in the Hadoop Reducer.
The Reducer function is called once for each intermediate key and its list of values
received from the mapper. This data in the form of (key, value) can be aggregated,
filtered, and combined in a variety of ways for a variety of purposes. The reducer
produces its result after processing the intermediate values for a certain key
produced by the map function. Each key is processed by exactly one reducer, and since
reducers are independent of one another, they operate in parallel. The number of
reducers is chosen by the user and is one by default. The reduce task collects the
key-value pairs after shuffle and sort, and its output is written to the file system
by the OutputCollector.collect() method. The user can specify the number of reducers
for the job with Job.setNumReduceTasks(int). Figure 6(a) and Figure 6(b) show the code
of the map function and reduce function using Python. The code includes comments and
is easy to follow.
Figure 6(a): Map function for word count problem using Python
Figure 6(b): Reduce function for word count problem using Python
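The figures are not reproduced here; the sketch below is written in the same spirit,
assuming the Hadoop Streaming style in which the mapper and reducer read lines from
standard input and write tab-separated <key, value> lines to standard output. The
file names mapper.py and reducer.py are illustrative.

# mapper.py -- emit <word, 1> for every word read from standard input
import sys

for line in sys.stdin:
    for word in line.strip().split():
        print(word + "\t1")

# reducer.py -- the input arrives sorted by key, so counts of a word are adjacent
import sys

current_word, current_count = None, 0
for line in sys.stdin:
    word, count = line.rstrip("\n").split("\t", 1)
    if word == current_word:
        current_count += int(count)
    else:
        if current_word is not None:
            print(current_word + "\t" + str(current_count))
        current_word, current_count = word, int(count)
if current_word is not None:
    print(current_word + "\t" + str(current_count))

The pipeline can be tested locally with:
cat input.txt | python3 mapper.py | sort | python3 reducer.py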
One of the major advantages of MapReduce is that it tolerates failures of the nodes
that are performing the distributed computation. The worst scenario is a failure of
the compute node where the Master program is running, as the master node coordinates
the execution of the MapReduce job; this requires restarting the entire MapReduce
job. However, note that only this one node has the power to bring down the entire
job; all other failures are handled by the Master node, and the MapReduce task will
eventually finish.
For example, assume a compute node where a Map worker is located
malfunctions. In such a case, the Master node, which frequently pings the
Worker processes, will notice this failure. Once the failure of this worker, which was
running the mapper tasks, is detected, all the Map tasks that were assigned to this
worker node, even those that had already finished, will need to be redone. Rerunning
finished Map tasks is necessary because their output, which was intended for the
Reduce tasks, is now unavailable to them, as it is located on the failed compute node.
Each of these Map tasks will be
rescheduled by the master on another Worker. Additionally, the Master must
inform every Reduce task that the input location from that Map job has changed.
However, all this activity is done without the intervention of any external person.
The system handles the entire process.
It is easier to handle a failure at a Reduce worker node. The Master merely
changes the status of any Reduce jobs that are currently running to idle. These
will eventually be moved to another Reduce worker.
In this section, we discuss two simple problems that can be solved using MapReduce.
First, we discuss the word count problem and then matrix-vector multiplication. The
example JAR files used earlier for word counting are available under:
$HADOOP_HOME/share/hadoop/mapreduce
Matrix-vector and matrix-matrix multiplication of big matrices are used in several
advanced computational algorithms, including PageRank computation. In this section,
we discuss how these operations can be performed using MapReduce.
Assume a large matrix M is multiplied with a large vector (v). Each of the vector
v and the matrix M is kept in a DFS file. Further, assume that you can find the
row-column coordinates of a matrix element either from its file position or from
explicitly saved coordinates. For example, you may save a matrix element using the
triplet (i, j, mij), where i and j are the row and column indices of the matrix
respectively and mij is the value of the element. Similarly, you may assume that the
position of element vj in the vector v can be determined in a similar manner.
In order to multiply M with v, you need to compute the products mij × vj of the
matrix elements, which are then summed over each row. Thus, the ith output element is
the sum of mij × vj over all the columns j. In order to compute these products, you
need to write the Map and Reduce functions. Please note that both M and v are needed
for the multiplication. Thus, the Map method can be created to be applied to the
elements of M. Since v is smaller than M, it can be read into the main memory of the
compute node running a Map task. A portion of the matrix M will be processed by each
Map task. It generates a key-value pair from each matrix element mij, given as
(i, mij × vj). As a result, the same key i will be assigned to every term of the sum
that composes component i of the matrix-vector product. The Reduce method will add up
all the values associated with a given key i.
However, it may happen that the vector v is so big that it cannot fit in main memory
as a whole. It is not necessary for v to fit in main memory at a compute node, but if
it does not, moving pieces of the vector into main memory will result in a significant
number of disk accesses as you multiply vector components by matrix elements. As an
alternative, we may divide the vector into
an equal number of horizontal stripes of the same height and the matrix into
vertical stripes of equal width to perform this multiplication. You may refer to
further readings for details on these aspects of matrix-vector multiplication.
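The following Python sketch illustrates the idea in memory (it is not a Hadoop job,
just the logic of the Map and Reduce functions described above): the vector v is
assumed small enough to be held in memory, map emits (i, mij × vj) for every stored
matrix element, and reduce sums the values of each key i.

from collections import defaultdict

# Sparse matrix M stored as (i, j, mij) triplets, and a small vector v kept in memory.
M = [(0, 0, 1.0), (0, 1, 2.0), (1, 0, 3.0), (1, 2, 4.0)]
v = [1.0, 2.0, 3.0]

def map_fn(i, j, m_ij):
    return (i, m_ij * v[j])          # key is the row index i

def reduce_fn(i, values):
    return (i, sum(values))          # i-th component of the product M.v

grouped = defaultdict(list)
for i, j, m_ij in M:
    key, value = map_fn(i, j, m_ij)
    grouped[key].append(value)

print([reduce_fn(i, vals) for i, vals in sorted(grouped.items())])
# [(0, 5.0), (1, 15.0)]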
Then we will run the matrix multiplication jar file to multiply the given input
matrices:
hadoop jar G:\Hadoop_Experiments\matrixMultiply.jar com.mapreduce.wc.MatrixMultiply /input_matrix/* /output_matrix/
For the vector multiplication, we will use the same operations as matrix
multiplication. The vectors will be stored in matrix form as an input.
Check Your Progress 3
6.6 SUMMARY
In this unit, we learnt the main characteristics and functionalities of MapReduce.
We also discussed the different phases of MapReduce in detail. Starting with the
installation, we covered the main program snippets for executing the different phases
with the help of classical examples such as word count and matrix multiplication.
6.7 SOLUTIONS/ANSWERS
3. Step 1: Before installing Java, first log in as the root user, set the machine
name to master, and map the local host IP to it:
# hostnamectl set-hostname master
# nano /etc/hosts
192.168.1.41    master.hadoop.lan
Step 2: Now download Java and install it on your PC:
# sudo apt install default-jre
Extract the downloaded file as follows, and then move the extracted files to
/home/hadoop/:
# ls -al /home/hadoop/
Step 5: Now log in and configure the Hadoop and Java environment variables on the
system by editing the ~/.bashrc file.
Step 7: Now configure SSH key-based authentication for Hadoop.
Step 8: Now configure the Hadoop core-site.xml. Edit the core-site.xml file:
$ su root
Step 10: Now edit mapred-site.xml:
$ nano etc/hadoop/mapred-site.xml
Step 16: To retrieve data from HDFS to our local file system, use the below command:
Step 17: To stop the Hadoop services:
$ stop-yarn.sh
$ stop-dfs.sh
3. There are two functions in MapReduce, i.e., one is the Map function and the other
is the Reduce function. Processing of the Map and Reduce phases is done as parallel
processes. In the Map phase, the input is split among the mapper nodes, where each
chunk is identified and mapped to a key, forming (key, value) tuples. These tuples
are passed to the reducer nodes, where sorting and shuffling of tuples takes place,
i.e. tuples are sorted and grouped on their keys so that all tuples with the same key
are sent to the same node.
cd $HADOOP_HOME
6.8 REFERENCES AND FURTHER READINGS
[1] https://kontext.tech/article/448/install-hadoop-330-on-linux
[2] https://kontext.tech/article/447/install-hadoop-330-on-windows-10-step-by-
step-guide
[3] https://intl.cloud.tencent.com/document/product/436/10867
[4] https://bigdatapath.wordpress.com/2018/02/13/introduction-to-hadoop/
[5] https://hadoop.apache.org/
[6] https://blog.csdn.net/qq_30242609/category_6519905.html
[7] https://www.tutorialspoint.com/hadoop/index.htm
[8] https://halvadeforspark.readthedocs.io/en/latest/
[9] Tilley, Scott, and Krissada Dechokul. "Testing iOS Apps with HadoopUnit: Rapid
Distributed GUI Testing." Synthesis Lectures on Software Engineering 2.2 (2014): 1-103.
[10] Dechokul, Krissada. Distributed GUI testing of iOS applications with HadoopUnit.
Diss. 2014.
[11] https://commandstech.com/category/hadoop/
[12] Frampton, Michael. Big Data made easy: A working guide to the complete
Hadoop toolset. Apress, 2014.
[13] http://infolab.stanford.edu/~ullman/mmds/ch2n.pdf
UNIT 7 OTHER BIG DATA ARCHITECTURES
AND TOOLS
Structure
7.0 Introduction
7.1 Objectives
7.2 Apache SPARK Framework
7.3 HIVE
7.3.1 Working of HIVE Queries
7.3.2 Installation of HIVE
7.3.3 Writing Queries in HIVE
7.4 HBase
7.4.1 HBase Installation
7.4.2 Working with HBase
7.5 Other Tools
7.6 Summary
7.7 Answers
7.8 References and further readings
7.0 INTRODUCTION
In Units 5 and 6 of this Block, you have gone through the concepts of Hadoop and
MapReduce programming, including the various phases of a MapReduce program. This
unit introduces you to other popular big data tools and architectures. These
architectures are beneficial to both ETL developers and analytics professionals.
This Unit introduces you to the basic software stack of the SPARK architecture, one
of the most popular architectures for handling large data. In addition, this Unit
introduces two important tools – HIVE, which is a data warehouse system, and HBase,
which is an important database system. HBase is an open-source NoSQL database that
uses HDFS and Apache Hadoop to function; virtually unlimited data can be stored on
this extendable storage. HIVE, built on HDFS, is a SQL-like engine that uses
MapReduce.
7.1 OBJECTIVES
SPARK was designed for quick iterative processing, such as machine learning and
interactive data analysis, while preserving the scalability and fault tolerance of
Hadoop MapReduce. As we have already discussed in Unit 6, the MapReduce programming
model, which is the foundation of the Hadoop framework, allows for scalable,
adaptable, fault-resilient, and cost-friendly solutions. However, for iterative and
interactive workloads it becomes imperative to reduce the turnaround time between
queries and execution. The Apache Software Foundation released SPARK to speed up the
Hadoop computing processes. SPARK has its own cluster management, hence it is not
dependent on Hadoop and is not a revised form of Hadoop; Hadoop is merely one way of
deploying SPARK. SPARK's key feature is its in-memory cluster computing, which
accelerates application processing. Numerous workloads, including batch processing,
iterative algorithms, interactive queries, and streaming, can be handled by SPARK.
Along with accommodating all of these workloads in a single system, it lessens the
administrative burden of managing several separate tools.
Figure 1 depicts the Spark data lake and shows how Apache Spark can work in
conjunction with Hadoop components and how information flows with Apache Spark.
Hadoop Distributed File Storage (HDFS) allows us to form a cluster of computing
machines and utilise their combined capacity to store data, thus allowing huge data
volumes to be stored. Further, MapReduce allows the combined power of the cluster to
be used to process the enormous data stored in HDFS. The advent of HDFS and MapReduce
enabled horizontal scalability at a low capital cost as compared to data warehouses.
Gradually, cloud infrastructure became more economical and gained wider adoption.
Large amounts of structured, semi-structured, and unstructured data can be stored,
processed, and secured using a data lake, a centralised repository. Without regard to
size restrictions, it can process any type of data and store it in its native format.
A data lake has four key capabilities: i) Ingest: allows data collection and
ingestion, ii) Store: responsible for data storage and management, iii) Process:
performs transformation and data processing, and iv) Consume: ensures data access and
retrieval.
The store capability of a data lake could be an HDFS cluster or a cloud store such as
Amazon S3, Azure Blob, or Google Cloud Storage. Cloud storage offers scalability and
highly available access at an extremely low cost and with almost no procurement time.
The notion of the data lake recommends bringing data into the lake in raw format,
i.e. ingesting data into the data lake and preserving an unmodified, immutable copy
of the data. The ingest block of the data lake is about identifying, implementing and
managing the right tools to bring data from the source systems to the data lake.
There is no single ingestion tool; there could be multiple tools such as HVR,
Informatica, Talend etc.
The next layer is the process layer, where all computation takes place, such as
initial data quality checks, transforming and preparing data, correlating,
aggregating, analysing and applying machine learning models. The processing layer is
further broken into two parts, which helps to manage it better: i) data processing
and ii) orchestration. The processing part is the core development framework that
allows distributed computing applications to be designed and developed; Apache Spark
is part of data processing. The orchestration framework is responsible for the
formation of the clusters, managing resources, scaling up or down etc. There are
three main competing orchestration tools: Hadoop YARN, Kubernetes and Apache Mesos.
The last and most critical capability of the data lake is to consume the data from
the lake for real-life usage. The data lake is a repository of raw and processed
data. The consumption requirements could come from data analysts or data scientists,
from applications or dashboards that take insights from the data, from JDBC/ODBC
connectors, or from a REST interface etc.
Step 2: Check whether SCALA is installed or not, since Spark is written in Scala,
although elementary knowledge of Scala is enough to run Spark. Other supported
languages in Spark are Java, R and Python.
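Once the installation is complete, a small job can be submitted from Python through
PySpark. The sketch below (assuming PySpark is installed with pip install pyspark and
that a local input.txt exists) counts words in a text file; it is only a minimal
illustration of Spark's in-memory processing.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("WordCount").getOrCreate()
lines = spark.sparkContext.textFile("input.txt")      # read the file as an RDD of lines
counts = (lines.flatMap(lambda line: line.split())    # split every line into words
               .map(lambda word: (word, 1))           # emit <word, 1> pairs
               .reduceByKey(lambda a, b: a + b))      # sum the counts of each word
for word, count in counts.collect():
    print(word, count)
spark.stop()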
7.3 HIVE
A Hadoop utility for processing structured data is called Hive. It sits on top of
Hadoop to summarise big data, and it simplifies querying and analysis. Initially
created by Facebook, Hive was later taken up and further developed as an open-source
project under the name Apache Hive by the Apache Software Foundation, and it is now
used by many businesses. Hive is not a relational database: it does not support
row-level modifications of data, as is the case in SQL-based database management
systems, and it is not designed for real-time queries. It stores processed data in
HDFS and keeps the schema in a database. It offers a querying language called HiveQL
or HQL that is similar to SQL. It is dependable, quick, expandable, and scalable.
Figure 3: HIVE Architecture
ii) Hive services: Hive provides a range of services for various purposes.
The following are some of the most useful services:
a. Beeline: It is a command shell that HiveServer2 supports,
allowing users to send commands and queries to the system. It is
JDBC Client based on SQLLINE Command Line Interface
(CLI). SQLLINE CLI is a Java Console-based utility for running
SQL queries and connecting to relational databases.
b. Hive Server 2: Following HiveServer1, HiveServer2 was launched, which
helps the clients to run queries. It enables numerous clients to submit
queries to Hive and retrieve the results.
c. Hive Driver: The user submits a Hive query as HiveQL statements via the
command shell; these are received by the Hive driver. The driver creates a
session handle for the query and then sends it to the compiler.
d. Hive Compiler: The query is parsed by the Hive compiler. The metadata of
the database, which is stored in the metastore, is used to perform semantic
analysis and data-type validation on the parsed query, after which the
compiler produces an execution plan. The execution plan generated by the
compiler is a DAG (Directed Acyclic Graph), with each stage being a map or
reduce job, an HDFS action, or a metadata operation.
e. Optimizer: To increase productivity and scalability, the optimizer
separates the work and performs transformation actions on the
execution plan so as to optimise the query execution time.
f. Execution engine: Following the optimisation and compilation steps, the
execution engine uses Hadoop to carry out the execution plan generated by
the compiler, as per the dependency order among its stages.
g. Metastore: The metadata information on the columns and column
types in tables and partitions is kept in a central location called
the metastore. Additionally, it stores data storage information for
HDFS files as well as serializer and deserializer information
needed for read or write operations. Typically, this metastore is a
relational database. A Thrift interface is made available by
Metastore for querying and modifying Hive metadata.
h. HCatalog: It refers to the storage and table management layer of
Hadoop. It is constructed above metastore and makes Hive
metastore's tabular data accessible to other data processing tools.
And
i. WebHCat: HCatalog’s REST API is referred to as WebHCat. A
Hadoop table storage management tool called HCatalog allows
other Hadoop applications to access the tabular data of the Hive
metastore. A web service architecture known as REST API is
used to create online services that communicate via the HTTP
protocol. WebHCat is an HTTP interface for working with Hive
metadata.
iii) Processing and Resource Management: Internally, the de facto engine for
Hive's query execution is the MapReduce framework. MapReduce is used to
create map and reduce functions that process enormous amounts of data
concurrently on large clusters of commodity hardware. Data is divided into
pieces and processed by map and reduce tasks as part of a MapReduce job.
iv) Distributed Storage: Since Hadoop is the foundation of Hive, the
distributed storage is handled by the Hadoop Distributed File System.
There are various steps involved in the execution of Hive queries as follows:
Step 1: executeQuery Command: Hive UI either command line or web UI sends
the query which is to be executed to the JDBC/ODBC driver.
Step 2: getPlan Command: After accepting query expression and creating a
handle for the session to execute, the driver instructs the compiler to produce an
execution plan for the query.
Step 3: getMetadata Command: Compiler contacts the metastore with a
metadata request. The hive metadata contains the table’s information such as
schema and location and also the information of partitions.
Step 4: sendMetadata Command: The metastore transmits the metadata to the
compiler. These metadata are used by the compiler to type-check and analyse
the semantics of the query expressions. The execution plan (in the form of a
Directed Acyclic graph) is then produced by the compiler. This plan can use
MapReduce programming. Therefore, the map and reduce jobs would include
the map and the reduce operator trees.
Step 5: sendPlan Command: Next compiler communicates with the driver by
sending the created execution plan.
Step 6: executePlan Command: The driver transmits the plan to the execution
engine for execution after obtaining it from the compiler.
Step 7: submit job to MapReduce: The necessary map and reduce job worker
nodes get these Directed Acyclic Graphs (DAG) stages after being sent by the
execution engine. Each task, whether mapper or reducer, reads the rows from
HDFS files using the deserializer. These are then handed over through the
associated operator tree. As soon as the output is ready, the serializer writes it to
the HDFS temporary file. The last temporary file is then transferred to the
location of the table for Data Manipulation Language (DML) operations.
Step 8-10: sendResult Command: On a retrieve request from the driver, the execution
engine reads the contents of the temporary files directly from HDFS. The results are
then sent to the Hive UI by the driver.
The step-by-step installation of Hive (on Ubuntu or any other Linux platform)
is as follows:
Step 1: Check whether JAVA is installed or not, if not you need to install it.
Step 2: Check whether HADOOP is installed or not, if not you need to install it.
Step 6: Set up environment for Hive by adding the following to ~/.bashrc file:
In order to write HIVE queries, you must install the HIVE software on your system.
HIVE queries are written in HiveQL, the query language supported by Hive. The
different data types supported by Hive include column types (such as integers and
strings), literals, NULL values, and complex types such as Arrays, Maps and Struct.
Hive Query Commands: The following are some of the basic commands to create and drop
databases; create, drop and alter tables; and create partitions. This section also
lists some of the commands used to query these databases, including the JOIN command.
1. Create Database
2. Drop database:
3. Creating Table:
4. Altering Table:
5. Dropping Table:
6. Add Partition:
7. Operators Used:
Relational Operators: =, !=, <, <=, >, >=, Is NULL, Is Not NULL,
LIKE (to compare strings)
Arithmetic Operators: +, -, *, /, %, & (Bitwise AND), | (Bitwise OR), ^
(Bitwise XOR), ~(Bitwise NOT)
Logical Operators: &&, || , !
Complex Operators: A[n] (nth element of Array A), M[key] (returns
value of the key in map M), S.x (return x field of struct S)
8. Functions: round(), floor(), ceil(), rand(), concat(string M, string N),
upper(), lower(), to_date(), cast()
Aggregate functions: count(), sum(), avg(), min(), max()
9. Views:
11. Select Order By Clause: To get information from a single column and
sort the result set in either ascending or descending order, use the
ORDER BY clause.
12. Select Group By Clause: A result set can be grouped using a specific
collection column by utilising the GROUP BY clause. It is used to
search a collection of records.
13. Join: The JOIN clause is used to combine specified fields from two
tables using values that are shared by both. It is employed to merge data
from two or more database tables.
Thus, as can be seen, most of the commands are very close to SQL syntax. If you know
SQL, you will be able to write Hive queries too.
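Such queries can also be submitted programmatically. The sketch below uses the PyHive
client library (an assumption made here for illustration; it is not part of Hive) to
connect to a HiveServer2 instance presumed to be running on localhost and to execute
a few representative HiveQL statements. The host, port, database and table names are
illustrative.

from pyhive import hive        # assumes: pip install "pyhive[hive]"

conn = hive.Connection(host="localhost", port=10000, username="hadoop")
cur = conn.cursor()

cur.execute("CREATE DATABASE IF NOT EXISTS salesdb")
cur.execute("CREATE TABLE IF NOT EXISTS salesdb.orders "
            "(id INT, item STRING, amount DOUBLE)")
cur.execute("SELECT item, SUM(amount) FROM salesdb.orders GROUP BY item")
for item, total in cur.fetchall():
    print(item, total)
conn.close()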
7.4 HBASE
RDBMSs have been the answer to issues of data maintenance and storage since the
1970s. After big data became prevalent, businesses began to realise the advantages of
processing large data and began choosing solutions like Hadoop. Hadoop processes huge
data using MapReduce and stores it in a distributed file system. Hadoop excels at
processing and storing vast amounts of data in arbitrary, semi-structured, and even
unstructured formats. However, Hadoop is only capable of batch processing, and data
can only be accessed sequentially. This implies that even for the most
straightforward tasks, one must scan the entire dataset. Moreover, processing a large
dataset often produces another equally large dataset, which should also be processed
in a timely manner. At this stage, a fresh approach is needed that can access any
item of data in one go, i.e. at random.
On top of Hadoop, the distributed column-based data store HBase was created. It is a
horizontally scalable open-source project. Similar to Google's Bigtable, HBase is a
data model created to offer speedy random access to enormous amounts of structured
data. It makes use of the fault tolerance of the Hadoop Distributed File System
(HDFS) and offers real-time read and write operations that access data randomly from
the Hadoop file system. Data can be stored in HDFS either directly or indirectly
through HBase, and data consumers use HBase to randomly read and access the data
stored in HDFS. In a nutshell, as compared to plain HDFS, HBase enables faster lookup
and low-latency random access due to its internal use of hash-table-like storage. The
tables in the column-oriented database HBase are sorted by row (as depicted in
Figure 4). The table structure defines column families, which are constituted of
key-value pairs. A table is a grouping of rows, a row is made up of different column
families, a column family is a collection of columns, and each column has a set of
key-value pairs.
Figure 5: HBase components having HBase master and several region servers.
Tables in HBase are divided into regions and are handled by region servers.
Regions are vertically organised into "Stores" by column families. In HDFS,
stores are saved as files (as shown in Figure 5). The master server will assign the
regions to the region servers with Apache ZooKeeper's assistance. The master handles
the load balancing of the regions across the region servers: it unloads the busy
servers, transfers their regions to less busy servers, and negotiates load balancing
to maintain the cluster's state. Regions are nothing but tables that have been
divided and distributed across the region servers. The region servers deal with
client interactions and data-related tasks; all read and write requests are handled
by the respective regions. The region server follows region size thresholds to
determine the size of the regions. The store includes HFiles and a memory store.
Similar to a cache memory
there is memstore. Everything that is entered into the HBase is initially saved
here. The data is then transported, stored as blocks in Hfiles, and the memstore
is then cleared. All modifications to the data in HBase's file-based storage are
tracked by the Write Ahead Log (WAL). The WAL makes sure that the data
changes can be replayed in the event that a Region Server fails or becomes
unavailable before the MemStore is cleared. An open-source project called ZooKeeper
offers functions including naming, maintaining configuration data, providing
distributed synchronisation, etc. Different region servers are represented by
ephemeral nodes in ZooKeeper. The master server uses these nodes to discover
available servers; the nodes also track server outages and network partitions.
Clients use ZooKeeper to locate and communicate with the region servers. HBase
manages ZooKeeper itself in standalone and pseudo-distributed modes. HBase operates
in standalone mode by default; standalone and pseudo-distributed modes are intended
for small-scale testing, while distributed mode is suited for a production setting,
in which HBase daemon instances execute across a number of server machines in the
cluster. The HBase architecture is shown in Figure 6.
Step 1: Check whether JAVA is installed or not, if not you need to install it.
Step 2: Check whether HADOOP is installed or not, if not you need to install it.
Edit hbase-site.xml
Distributed Mode:
Step 5: Edit hbase-site.xml
After you have completed the installation of HBase, you can use commands to
run HBase. Next section discusses these commands.
In order to interact with HBase, first you should use the shell commands as given
below:
HBase shell commands: One can communicate with HBase by utilising the shell that is
included with it. HBase uses the Hadoop File System to store its data. It contains a
master server and region servers. Regions will be used for the data storage (tables).
These regions are split and stored in the respective region servers. The region
servers are managed by the master server, and HDFS is used for all of these
functions. The following commands are supported in the HBase shell.
a) Generic Commands:
status: It gives information about HBase's status, such as how many
servers are there.
version: It gives the HBase current version that is in use.
table_help: It contains instructions for table related commands.
whoami: It will provide details of the user.
b) Data definition language:
create: To create table
list: Lists each table in the HBase database.
Disable: Turns a table into disable mode.
is_disabled: to check if table is disabled
enable: to enable the table
is_enabled: to check if table is enabled
describe: Gives a table's description.
Alter: To alter table
exists : To check if table exists
drop: To drop table
drop_all: Drops all tables
Java Admin API: Java offers an Admin API for programmers to
implement DDL functionality.
c) Data manipulation language:
Put: Puts a cell value in a specific table at a specific column in a specific
row using the put command.
Get: Retrieves a row's or cell's contents.
Delete: Removes a table's cell value.
deleteall: Removes every cell in a specified row.
Scan: Scans the table and outputs the data.
Count: counts the rows in a table and returns that number.
Truncate: A table is disabled, dropped, and then recreated.
Java client API: In addition to the aforementioned shell commands, Java
provides a client API under the org.apache.hadoop.hbase.client package that
enables programmers to perform DML functionality.
2. List table:
This command when used in HBase prompt, gives us the list of all the
tables in HBase.
3. Disable table
4. Enable table
6. Exists table:
7. Drop table:
8. Exit shell:
9. Insert data:
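The same kinds of DDL and DML operations can be issued from Python using the
HappyBase client library (an assumption made for illustration; it requires the HBase
Thrift server to be running). The table and column-family names below are
illustrative.

import happybase               # assumes: pip install happybase

connection = happybase.Connection("localhost")        # connect to the HBase Thrift server
connection.create_table("books", {"info": dict()})    # create a table with one column family
table = connection.table("books")

table.put(b"row1", {b"info:title": b"DBMS",            # put: insert cell values
                    b"info:author": b"Raghu Ramakrishnan"})
print(table.row(b"row1"))                              # get: read one row
for key, data in table.scan():                         # scan: iterate over the table
    print(key, data)
connection.close()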
7.5 OTHER TOOLS
A wide variety of Big Data tools and technologies are available on the market today.
They improve time management and cost effectiveness for tasks involving data
analysis. Some of these include Atlas.ti, HPCC, Apache Storm, Cassandra, Stats iQ,
CouchDB, Pentaho, Flink, Cloudera, OpenRefine, RapidMiner etc. Atlas.ti allows all
available platforms to be accessed from one place; it can be utilised for qualitative
data analysis and mixed-methods research. High Performance Computing Cluster (HPCC)
Systems provides services using a single platform, architecture, and data processing
programming language. Apache Storm is a free, open-source, distributed and
fault-tolerant system for real-time computation over large data. Today, many
organisations use the Apache Cassandra database to manage massive amounts of data
effectively. The statistical tool Stats iQ by Qualtrics is simple to use. CouchDB
stores data in JSON documents that can be browsed online or queried using JavaScript.
Pentaho provides big data technologies to extract, prepare, and combine data, and
offers analytics and visualisations that transform how a business is run. Apache
Flink is an open-source data analytics tool for large-scale data stream processing.
Cloudera is an efficient, user-friendly, and highly secure big data platform; anyone
may access any data from any environment using a single, scalable platform.
OpenRefine is a big data analytics tool that aids in working with unclean data,
cleaning it up, and converting it between different formats. RapidMiner is utilised
for data preparation, machine learning, and model deployment; it provides a range of
products to set up predictive analysis and create new data mining methods.
Check Your Progress 3
7.6 SUMMARY
In this unit, we have learnt about various tools and cutting-edge big data
technologies used by data analytics practitioners worldwide. In particular, we
studied in detail the usage, installation, components and working of three main big
data tools: Spark, Hive and HBase. Furthermore, we also discussed how to do query
processing in Hive and HBase.
7.7 SOLUTIONS/ANSWERS
iii) Advance Analytics support: Spark offers more than just "Map" and
"Reduce." Additionally, it supports graph methods, streaming data, machine
learning (ML), and SQL queries.
2. The main components of Apache Spark framework are Spark core, Spark SQL
for interactive queries, Spark Streaming for real time streaming analytics,
Machine Learning Library, and GraphX for graph processing.
ii) Hive services: Hive offers a number of services, like the Hive
server2, Beeline, etc., to handle all queries. Hive provides a range of
services, including: a) Beeline, b) Hive Server 2, c) Hive Driver, d)
Hive Compiler, e) Optimizer, f) Execution engine, g) Metastore, h)
HCatalog i) WebHCat.
iii) Processing and Resource Management: Internally, the de facto
engine for Hive's query execution is the MapReduce framework. A
software framework called MapReduce is used to create programmes
that process enormous amounts of data concurrently on vast clusters
of commodity hardware. Data is divided into pieces and processed
by map-reduce tasks as part of a map-reduce job.
iv) Distributed Storage: Since Hadoop is the foundation of Hive, the
distributed storage is handled by the Hadoop Distributed File System.
2. Hive has 4 main data types:
1. Column Types: Integer, String, Union, Timestamp
2. Literals: Floating, Decimal
3. Null Values: All missing values as NULL
4. Complex Types: Arrays, Maps, Struct
3. There are 4 main types:
1. JOIN: Hive's JOIN is very similar to SQL's inner join; it combines records that
have matching values in both tables.
2. FULL OUTER JOIN: The records from the left and right outer tables are
combined in a FULL OUTER JOIN.
3. LEFT OUTER JOIN: All rows from the left table are retrieved using the
LEFT OUTER JOIN even if there are no matches in the right table.
4. RIGHT OUTER JOIN: In this case as well, even if there are no matches in
the left table, all the rows from the right table are retrieved.
7.8 REFERENCES AND FURTHER READINGS
[1] https://www.infoworld.com/article/3236869/what-is-apache-spark-the-big-data-platform-that-crushed-hadoop.html
[2] https://aws.amazon.com/big-data/what-is-spark/
[3] https://spark.apache.org/
[4] https://www.tutorialspoint.com/hive/hive_views_and_indexes.htm
[5] https://hive.apache.org/downloads.html
[6] https://data-flair.training/blogs/apache-hive-architecture/
[7] https://halvadeforspark.readthedocs.io/en/latest/
[8] https://www.tutorialspoint.com/hbase/index.htm
[9] Capriolo, Edward, Dean Wampler, and Jason Rutherglen. Programming Hive:
Data warehouse and query language for Hadoop. " O'Reilly Media, Inc.", 2012.
[10] Du, Dayong. Apache Hive Essentials. Packt Publishing Ltd, 2015.
[11] Karau, H., Konwinski, A., Wendell, P., & Zaharia, M. (2015). Learning spark:
lightning-fast big data analysis. " O'Reilly Media, Inc.".
[12] George, Lars. HBase: the definitive guide: random access to your planet-size
data. " O'Reilly Media, Inc.", 2011.
UNIT 8 NoSQL DATABASE
Structure
8.0 Introduction
8.1 Objectives
8.2 Introduction to NoSQL
8.2.1 What is NoSQL
8.2.2 Brief History of NoSQL Databases
8.2.3 NoSQL Database Features
8.2.4 Differences between RDBMS and NoSQL
8.3 Types of NoSQL Databases
8.3.1 Column based
8.3.2 Graph based
8.3.3 Key-value pair based
8.3.4 Document based
8.4 Summary
8.5 Solutions/Answers
8.6 Further Readings
8.0 INTRODUCTION
In the previous Units of this Block, you have gone through various large data
architectural frameworks, such as Hadoop, SPARK and other similar technologies.
However, these technologies are not a replacement for large-scale database systems.
NoSQL databases arose because databases at the time were not able to support the rapid
development of scalable web-based applications.
NoSQL databases have changed the manner in which data is stored and used, despite
the fact that relational databases are still commonly employed. Most applications come
with features like Google-style search, for instance. The growth of data, online surfing,
mobile use, and analytics have drastically altered the requirements of contemporary
databases. These additional requirements have spurred the expansion of NoSQL
databases, which now include a range of types such as key-value, document, column,
and graph.
In this Unit, we will discuss the many kinds of NoSQL databases, including those that
are built on columns, graphs, key-value pairs, and documents respectively.
8.1 OBJECTIVES
Databases are a crucial part of many technological and practical systems. The
phrase "NoSQL database" is frequently used to describe any non-relational
database. NoSQL is sometimes referred to as "non SQL," but it is also referred
to as "not only SQL." In either case, the majority of people agree that a NoSQL
database is a type of database that stores data in a format that is different from
relational tables.
Whenever you want to use the data, it must first be saved in a particular structure
and then converted into a usable format. On the other hand, there are some
circumstances in which the data are not always presented in a structured style,
which means that their schemas are not always rigorous. This unit provides an
in-depth look into NoSQL and the features that make it unique.
NoSQL is a way to build databases that can accommodate many different kinds
of information, such as key-value pairs, multimedia files, documents, columnar
data, graphs, external files, and more. In order to facilitate the development of
cutting-edge applications, NoSQL was designed to work with a variety of
different data models and schemas.
In the late 2000s, as the price of storage began to plummet, NoSQL databases began to
gain popularity. It was no longer necessary to develop a sophisticated,
difficult-to-manage data model simply to prevent data duplication. Because
developers' time was quickly becoming more expensive than data storage, NoSQL
databases were designed with developer efficiency in mind.
From 2010 onwards, cloud DBaaS (Database as a Service) offerings were increasingly
adopted.
As storage costs reduced significantly, the quantity of data that applications were
required to store and query grew. This data came in all forms— structured, semi-
structured, and unstructured — and sizes making it practically difficult to define
the schema in advance. NoSQL databases give programmers a great deal of
freedom by enabling them to store enormous amounts of unstructured data.
The use of the public cloud as a platform for storing and serving data and
applications was another trend that arose as cloud computing gained popularity. To
make their applications more robust, to scale out rather than up, and to
strategically position their data across geographies, organisations needed the option
to store data across various servers and locations. Some NoSQL databases, such as
MongoDB, offer these features.
Every NoSQL database comes with its own set of one-of-a-kind capabilities.
The following are general characteristics shared by several NoSQL databases:
• Schema flexibility
• Horizontal scaling
• Quick responses to queries as a result of the data model
• Ease of use for software developers
The differences and similarities between the two DBMSs are as follows:
• For the most part, NoSQL databases fall under the category of non-
relational or distributed databases, while SQL databases are classified as
Relational Database Management Systems (RDBMS).
• Databases that use the Structured Query Language (SQL) are table-
oriented, while NoSQL databases use either document-oriented or key-
value pairs or wide-column stores, or graph databases.
• Unlike NoSQL databases, which have dynamic or flexible schema to
manage unstructured data, SQL databases have a strict or static schema.
• Structured data is stored using SQL, whereas both structured and
unstructured data can be stored using NoSQL.
• SQL databases are thought to be scalable in a vertical direction, whereas
NoSQL databases are thought to be scalable in a horizontal direction.
• Increasing the computing capability of your hardware is the first step in
the scaling process for SQL databases. In contrast, NoSQL databases
scale by distributing the load over multiple servers.
• MySQL, Oracle, PostgreSQL, and Microsoft SQL Server are all
examples of SQL databases. BigTable, MongoDB, Redis, Cassandra,
RavenDb, Hbase, CouchDB, and Neo4j are a few examples of NoSQL
databases.
SQL databases scale vertically: an increased load is managed by adding more CPU, RAM,
SSD, GPU, etc. to a single server. NoSQL databases, on the other hand, are
characterised by their ability to scale horizontally: adding more servers makes the
increased demand more manageable.
3) Differentiate between the NoSQL and SQL.
……………………………………………………………………………
……………………………………………………………………………
……………………………………………………………………………
For instance:
• Row database: "Customer 1: Name, Address, Location" – the fields of each new record
are stored together in one long row.
• Columnar database: "Customer 1: Name, Address, Location" – each field is stored in
its own set of columns. Refer to Table 2 for a relational database example.
Column databases: Disadvantages
While there are many benefits to adopting column-oriented databases, there are
also a few drawbacks to keep in mind.
Before we conclude, we should note that column-store databases are not always
NoSQL-only. It is frequently argued that column-stores belong firmly in the NoSQL
camp because they differ so much from relational database approaches. However, the
debate between NoSQL and SQL is quite nuanced, and this is not usually the case:
column-store databases are, in many respects, essentially the same as SQL techniques.
For instance, keyspaces function as schemas, so schema management is still necessary.
A keyspace in a NoSQL data store contains all the column families; the concept is
comparable to a schema in a relational database management system, and there is
typically only one keyspace per program. Another illustration is the fact that the
metadata occasionally resembles that of a conventional relational DBMS perfectly.
Ironically, column-store databases frequently adhere to ACID and SQL standards.
However, NoSQL databases are often either document stores or key-value stores,
neither of which are column-stores. Therefore, it is difficult to claim that
column-store is a pure NoSQL system.
Figure 3: Example – five friends in a social network.
• Delivery of personalized ads to users based on their data profiles
• Cache data for infrequently updated data
There are numerous other circumstances where key-value works nicely. For
instance, because of its scalability, it frequently finds usage in big data research.
Similar to how it works for web applications, key-value is effective for
organizing player sessions in MMOG (massively multiplayer online game) and
other online games.
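As a small illustration of the get/put style of access that all key-value stores
share, the sketch below uses the redis-py client against a Redis server assumed to be
running on localhost (both the library and the server are assumptions made for this
example):

import redis                   # assumes: pip install redis, Redis server on localhost:6379

store = redis.Redis(host="localhost", port=6379, decode_responses=True)
store.set("session:42:user", "alice")          # put a value under a key
store.set("BookID:978-1449396091", "DBMS")     # keys are opaque strings chosen by the application
print(store.get("session:42:user"))            # look up by key -> 'alice'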
(a) Document representation:
{
  "BookID": "978-1449396091",
  "Title": "DBMS",
  "Author": "Raghu Ramakrishnan",
  "Year": "2022"
}
(b) Key-value representation:
BookID → 978-1449396091
Title → DBMS
Author → Raghu Ramakrishnan
Year → 2022
Figure 8: Example of Document and Key-value database
When to use a document database?
• When your application requires data that is not structured in a table
format.
• When your application requires a large number of modest continuous
reads and writes and all you require is quick in-memory access.
• When your application requires CRUD (Create, Read, Update, Delete)
functionality.
• These are often adaptable and perform well when your application has to
run across a broad range of access patterns and data kinds.
How does a Document Database Work?
It appears that document databases work under the assumption that any kind of
information can be stored in a document. This suggests that you shouldn't have
to worry about the database being unable to interpret any combination of data
types. Naturally, in practice, most document databases continue to use some sort
of schema with a predetermined structure and file format.
Document stores do not have the foibles and limitations of SQL databases, which are
both tabular and relational. This implies that using the information at hand is
significantly simpler, and running queries may also be much simpler. Ironically, you
can execute the same types of operations in a document store that you can in a SQL
database, including removing, adding, and querying data.
Each document requires a key of some kind, as was previously mentioned, and this key
is given to it through a unique ID. The unique ID is used to access the document
directly, rather than the data being retrieved column by column.
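For example, the document of Figure 8(a) could be stored and retrieved from Python
with the PyMongo client (the library, the local MongoDB server, and the database and
collection names are all assumptions made for illustration):

from pymongo import MongoClient    # assumes: pip install pymongo, MongoDB on localhost

client = MongoClient("mongodb://localhost:27017")
books = client["library"]["books"]             # database "library", collection "books"

books.insert_one({"BookID": "978-1449396091",
                  "Title": "DBMS",
                  "Author": "Raghu Ramakrishnan",
                  "Year": "2022"})
doc = books.find_one({"Author": "Raghu Ramakrishnan"})   # query on any field
print(doc["Title"])                            # a unique _id key is added automatically
client.close()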
Document databases often have a lower level of security than SQL databases. As a
result, you really need to think about database security; utilising Static
Application Security Testing (SAST), which examines the source code directly to hunt
for flaws, is one approach. Another option is to use DAST, a dynamic testing approach
that can aid in preventing NoSQL injections.
8.4 SUMMARY
This unit covered the fundamentals of NoSQL as well as the many kinds of NoSQL
databases, such as those based on columns, graphs, key-value pairs, and documents.
Numerous businesses now use NoSQL. It is difficult to pick the best
database platform. NoSQL databases are used by many businesses because of
their ability to handle mission-critical applications while decreasing risk, data
spread, and total cost of ownership.
Despite their incredible capability, column-store databases do have their own set
of problems. Due to the fact that columns require numerous writes to the disk,
for instance, the way the data is written results in a certain lack of consistency.
Graph databases can be used, beyond merely expressing information in a graphical and
effective way (such as in the case of Twitter), to serve content in high-performance
scenarios while producing threads that are simple for the typical user to comprehend.
brilliant. Although this has potential drawbacks, particularly when dealing with
more complicated issues like financial transactions, it was designed specifically
to fill in relational databases' inadequacies. We may create a pipeline that is even
more effective by combining relational and non-relational technologies, whether
we are working with users or data analysis. Document-store data models are
quite popular and regularly used due to their versatility. It helps analytics by
making it easy for firms to store multiple sorts of data for later use.
8.5 SOLUTIONS/ANSWERS
8.6 FURTHER READINGS
1) Next Generation Databases: NoSQL and Big Data 1st ed. Edition, G. Harrison,
Apress, December 26, 2015.
2) Shashank Tiwari, Professional NoSQL, 1st Edition, Wrox, September 2011.
3) https://www.kdnuggets.com/
UNIT 9 MINING BIG DATA
9.0 INTRODUCTION
In the previous Block, you have gone through the concepts of big data and Big data
handling frameworks. These concepts include the distributed file system, MapReduce
and other similar architectures. This Block focuses on some of the techniques that
can be used to find useful information from big data.
This Unit focuses on the issue of finding similar item sets in big data. The Unit also
defines the measures for finding the distances between two data objects. Some of
these techniques include Jaccard distance, Hamming’s distance etc. In addition, the
Unit also discusses finding similarities among the documents using shingles. Finally,
this unit introduces you to some of the other techniques that can be used for analysing
Big data.
9.1 OBJECTIVES
After going through this Unit, you will be able to:
• Explain different techniques for finding similar items
• Explain the process of the collaborative filtering process
• Use shingling to find similar documents
• Explain various techniques to measure the distance between two objects
• Define supervised and unsupervised learning
Finding similar items for a small set of documents may be a simple problem,
but how do you find a set of similar items when the number of items is
extremely large? This section defines the basic issues of finding similar items
and their applications.
Problem Definition:
Given an extremely large collection of item sets, which may have millions or
billions of sets, how to find the set of similar item sets without comparing all
the possible combinations of item sets, using the notion that similar item sets
may have many common sub-sets.
Purpose of Finding Similar Items:
1. You may like to classify web pages that are using similar words. This
information can be used to classify web pages.
2. You may like to find the purchases and feedback of customers on
similar items to classify them into similar groups, leading to making
recommendations for purchases for these customers.
3. Another interesting use of similar item set is in entity resolution, where
you need to find out if it is the same person across different
applications like e-Commerce web site, social media, searches etc.
4. Two or more web pages amongst millions of web pages may be
identical. These web pages may be plagiarised copies or mirrors of the same
website.
Why Finding Similar Items is a problem?
One of the simplest ways to find similar items would be to compare two
documents or web pages and determine if they are identical or not by
comparing sequences of characters/words used in those documents/web
pages. However, if you need to find duplicates among 10 documents/web pages, you need
to compare 10C2 = 45 pairs. In general, for n documents/webpages, you may need to
compare n×(n−1)/2 pairs of documents/web pages. For about 10^6 documents/web pages,
you would need to make about 5×10^11 comparisons. Thus, the question is how to find
these identical documents/web pages amongst billions of documents/web pages without
checking all the combinations.
Jaccard similarity is defined in the context of sets. Consider two sets – set A and
set B; then the Jaccard similarity JS(A, B) is defined as the ratio of the
cardinality of the set A∩B to the cardinality of the set A∪B. The value of JS(A, B)
can vary from 0 to 1. The following equation represents the Jaccard similarity:

JS(A, B) = |A ∩ B| / |A ∪ B|

For example, consider the set A = {a, b, c, d, e} and the set B = {c, d, e, f, g}.
The Jaccard similarity of these two sets is:

JS(A, B) = |{c, d, e}| / |{a, b, c, d, e, f, g}| = 3/7
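The same computation can be written directly in Python (a small illustrative sketch):

def jaccard_similarity(a, b):
    # |A intersection B| / |A union B|
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

A = {"a", "b", "c", "d", "e"}
B = {"c", "d", "e", "f", "g"}
print(jaccard_similarity(A, B))   # 3/7 = 0.42857...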
9.4.1 Shingles
In order to define the term shingle, let us first try to answer the question: How
to represent a document as a set of items so that you can find
lexicographically similar documents?
One way would be to identify the words in the document. However, the identification
of words is itself a time-consuming problem and would be more useful if you were
trying to find the semantics of sentences. One of the simplest and most efficient
ways is to divide the document into smaller substrings of characters, say of size 3
to 7. The advantage of this division is that for largely similar sentences many of
these substrings would still match, despite small changes in those sentences or
changes in the ordering of sentences.
A k-shingle is defined as any substring of the document of length k. A document has
many shingles, each of which occurs in it at least once. For example, consider a
document that consists of the following string:
“bit-by-bit”
Assume the value of k = 3, with a blank character also counted as part of the
substrings. The following are the possible substrings of size 3 of this document:
“bit”, “it-”, “t-b”, “-by”, “by-”, “y-b”, “-bi”, “bit”
Please note that out of these substrings “bit” is occurring twice. Therefore, the
3-shingles for this document would be:
{“bit”, “it-”, “t-b”, “-by”, “by-”, “y-b”, “-bi”}
One of the issues while making shingles is how to deal with white spaces. A
possible solution, which is used commonly, would be to replace all the
continuous white space characters with a single blank space.
So, how do you use shingles to find similar documents?
You may convert the documents into shingles and check if the two documents
share the same shingles.
An interesting issue here is determining the size of the shingle. If the size of the
shingle is small, then most documents would contain the same shingles, even if they
are not identical. For example, if you keep a shingle size of k=1, then almost every
character would be a shingle, and these shingles would be common to most of the
documents, as most of them will contain almost all the letters of the alphabet. On
the other hand, very large shingles may not be able to detect similar items. The
ideal size of a shingle is about k = 5 to 9 for small to large documents.
How many different shingles are possible for a character set? Assume that a
typical language has n characters and the size of shingles is k, then the
possible number of shingles is nk. Thus, the number of possible shingles in a
document may be very large.
Consider that you are using 9-shingles for finding similar research articles
amongst very large size articles. The possible character set would require 26
alphabets of the English language and one character for the space character.
Therefore, the maximum possible set of shingles would have (26+1)9 = 279 9-
shingles with each shingle being 9 bytes long (assuming 1 alphabet = 1 byte).
This is a very large set of data.
One of the ways of reducing this data is to use hash values instead of the
shingles themselves. For example, assume that a hash function maps a 9-byte
substring to an integer bucket number. Assuming that the size of an integer is 4
bytes, the hash function maps the 27^9 possible 9-shingles to 2^(4×8) − 1 = 2^32 − 1
possible buckets. You may now use the bucket number in place of the shingle itself,
thus reducing the size of a shingle from 9 bytes to 4 bytes.
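As a rough sketch of this idea (our own example, using the CRC32 checksum from Python's standard library only because it produces a 32-bit, i.e. 4-byte, value; the actual hash function is not prescribed by this unit):

import zlib

def shingle_to_bucket(shingle):
    """Map a 9-character shingle to a 4-byte bucket number using CRC32.
    zlib.crc32 returns an unsigned 32-bit integer, so it fits in 4 bytes."""
    return zlib.crc32(shingle.encode('utf-8'))

doc_shingles = {'bit-by-bi', 'it-by-bit'}             # two example 9-shingles
buckets = {shingle_to_bucket(s) for s in doc_shingles}
print(buckets)   # the shingle set represented by 4-byte bucket numbers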
For certain applications, such as finding similar news articles, shingles are
created based on the words. These shingles are found to be more effective for
finding the similarity of news articles.
9.4.2 Minhashing
JSSet2,Set3 = 1/3 ;  JSSet2,Set4 = 1/4 ;  JSSet3,Set4 = 2/3
You can verify the Jaccard similarity from the set definitions also. Thus, using
the matrix representation, you may be able to compute the similarity of two
documents or sets.
After going through the representation and its Jaccard similarity, let us next
define the term minhashing. You can create a large number of orderings
(permutations) of the rows of the shingle/document matrix given in Figure 1, each
of which generates a new sequence of rows (shingles, in our example). For each
ordering, the first shingle (in that order) for which a set has the value 1 is
called the minhash value of that set. For example, consider the following three
orderings of the matrix; the first ordering and the resulting minhash values MH1
are shown below:
ORDER        Shingle/Document    Set 1   Set 2   Set 3   Set 4
1st                 2              1       0       0       1
2nd                 4              0       1       1       1
3rd                 3              1       1       0       0
4th                 5              1       0       0       0
5th                 1              0       0       1       1
MH1(Set n)          -             1st     2nd     2nd     1st
AJSSet2,Set3 = 1/3 ;  AJSSet2,Set4 = 0/3 ;  AJSSet3,Set4 = 2/3
Thus, you may observe that the Jaccard Similarity can be computed
approximately using the signature matrix, which can be used to determine the
similarity between two documents.
Now, consider the case when you are computing the signatures for 1 million
rows of data. Generating the orderings is itself time-consuming, and representing
an ordering requires a large amount of storage. Thus, for real data, the algorithm
presented above may not be practical. Therefore, you would like to simulate the
orderings using simple hash functions rather than actual orderings. You can
decide the number of buckets, say 100, and hash the rows of the matrix (see
Figure 1) into these buckets. You should select several different hash functions;
each hash function maps the rows (for the columns having the value 1) to buckets
in a different order, which ensures that the hash functions create different
orderings in the buckets. Please remember that hashing may result in collisions,
but by using many hash functions, the effect of collisions can be minimised. For
each column, you select the smallest bucket number to which any of its rows with
value 1 is mapped; this is its minhash signature value. The following example
explains the process with the help of two hash functions. Consider the documents
and shingles given in Figure 1, and assume the two hash functions given below:
h1(x) = (x + 1) mod 5
h2(x) = (2x + 3) mod 5
We assume 5 buckets, numbered 0 to 4. The following signatures can be
created by using these hash functions:
Initial Values:
Set 1 Set 2 Set 3 Set 4
h1 ∞ ∞ ∞ ∞
h2 ∞ ∞ ∞ ∞
Figure 4: Initial hashed values
Now, apply the hash function on the first row of Figure 1. Since Set 1 and Set
2 have values 0, therefore, the value in the above table will not change.
However, Set 3 and Set 4 would be mapped as follows:
For Set 3 and Set 4:
h1(1) = (1 + 1) mod 5 = 2
h2(1) = (2 × 1 + 3) mod 5 = 0
Since Set 4 already has lower values than these, no change will take place in
Set 4. Figure 5 will be modified as:
The signature so computed will give almost the same similarity measure as
earlier. Thus, we are able to simplify the process of finding the similarity
between two documents. However, still one of the problems remains, i.e.,
there are a very large number of documents between which the similarity is to
be checked. The next section explains the process of minimizing the pairs that
should be checked for similarities.
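The hash-based minhash computation of this example can be sketched in Python as follows (our own illustrative code; the shingle/document matrix is reconstructed from the ordering table shown earlier, and the hash functions are applied to the shingle (row) number, as in the worked example):

INF = float('inf')

# Shingle/document matrix of Figure 1: rows are shingles 1..5,
# columns are Set 1..Set 4 (1 means the shingle occurs in that set).
matrix = {
    1: [0, 0, 1, 1],
    2: [1, 0, 0, 1],
    3: [1, 1, 0, 0],
    4: [0, 1, 1, 1],
    5: [1, 0, 0, 0],
}

# The two hash functions of the example, applied to the shingle (row) number.
hash_funcs = [lambda x: (x + 1) % 5,
              lambda x: (2 * x + 3) % 5]

# Signature matrix: one row per hash function, one column per set,
# initialised to infinity.
signature = [[INF] * 4 for _ in hash_funcs]

for row, columns in matrix.items():
    hashes = [h(row) for h in hash_funcs]
    for col, value in enumerate(columns):
        if value == 1:
            for i, hv in enumerate(hashes):
                # keep the smallest bucket number seen so far for this column
                signature[i][col] = min(signature[i][col], hv)

print(signature)   # [[1, 0, 0, 0], [2, 1, 0, 0]]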
9.4.3 Locality Sensitive Hashing
The signature matrix as shown in Figure 9 can also be very large, as there may
be many millions of documents or sets, which are to be checked for similarity.
In addition, the number of minhash functions that can be used for these
documents may be in the hundreds to thousands. Therefore, you would like to
compare only those pairs of documents that have some chance of similarity.
Other pairs of documents will be considered non-similar, though there may be
small false negatives in this group. Locality-sensitive hashing is a technique to
find possible pairs of documents that should be checked for similarity. It takes
the signature matrix, as shown in Figure 9, as input and produces the list of
possible pairs, which should be checked for similarity. The locality-sensitive
hashing uses hash functions to do so.
Thus, the number of pairs that need to be checked for similarity can be greatly
reduced using locality-sensitive hashing.
You may please note that this method is an approximate method, as there will be
a certain number of false positives as well as false negatives. However, given
the size of the data, a small probability of error is acceptable in the results.
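A common way of implementing locality-sensitive hashing for minhash signatures is the banding technique (also summarised in the answers at the end of this unit): the signature matrix is divided into horizontal bands, each band of a column is hashed to a bucket, and only columns that land in the same bucket for at least one band become candidate pairs. The following sketch is our own illustration of that idea (the number of bands, the rows per band and the use of Python's built-in hash are arbitrary choices):

from collections import defaultdict
from itertools import combinations

def lsh_candidate_pairs(signatures, bands, rows_per_band):
    """signatures: dict mapping a document id to its list of minhash values.
    Returns the set of candidate pairs that should be checked for similarity."""
    candidates = set()
    for b in range(bands):
        buckets = defaultdict(list)
        for doc_id, sig in signatures.items():
            band = tuple(sig[b * rows_per_band:(b + 1) * rows_per_band])
            buckets[hash(band)].append(doc_id)     # hash this band to a bucket
        for docs in buckets.values():
            # every pair of documents sharing a bucket in some band is a candidate
            candidates.update(combinations(sorted(docs), 2))
    return candidates

sigs = {'doc1': [1, 0, 2, 1], 'doc2': [1, 0, 3, 4], 'doc3': [0, 2, 3, 4]}
print(lsh_candidate_pairs(sigs, bands=2, rows_per_band=2))
# {('doc1', 'doc2'), ('doc2', 'doc3')} - doc1/doc2 agree on band 0, doc2/doc3 on band 1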
Let us discuss some of the basic distance measures in the following sub-
sections.
You can verify that these measures satisfy all the properties of the distance
measure.
9.5.2 Jaccard Distance
As discussed earlier, you compute the Jaccard similarity of two sets, namely
set A and set B, using the following formula:
JSA,B = |A ∩ B| / |A ∪ B|
The Jaccard distance between these two sets is computed as:
JaccardDistance(A, B) = 1 − JSA,B = 1 − |A ∩ B| / |A ∪ B|
Does this distance measure fulfil all the criteria of a distance measure?
The value of Jaccard similarity varies from 0 to 1; therefore, the value of
Jaccard distance also varies from 0 to 1. A Jaccard distance of 0 means that
A ∩ B is the same as A ∪ B, which can occur only if set A and set B are identical.
In addition, as set intersection and union are both symmetric operations, the
Jaccard distance is also symmetric. You can also test the third property (the
triangle inequality) using suitable values for three sets.
You may check whether this measure also fulfils the criteria of a distance
measure. It may be noted that the cosine distance is evaluated in the range of 0
to 180 degrees only. You may go through the further readings for more details.
However, finding these operations may not be easy to code, therefore, edit
distance can be computed using the following method:
Step 1: Consider two strings - string A consisting of n characters a1, a2,…an
and a string B of m characters b1, b2,…bm. Find a sub-sequence, which
is the longest and has the same character sequence in the two strings.
Use the deletion of character operation to do so. Assume the size of
this sub-sequence is ls.
Step 2: Compute the edit distance using the formula:
𝑒𝑑𝑖𝑡𝑑𝑖𝑠𝑡𝑎𝑛𝑐𝑒(𝐴, 𝐵) = 𝑛 + 𝑚 − 2 × 𝑙𝑠
For example, in the strings A= “abcdef” and B= “acdfg”, the longest common
sub-sequence is “acdf”, which is 4 characters long, therefore, the edit distance
between the two strings is:
𝑒𝑑𝑖𝑡𝑑𝑖𝑠𝑡𝑎𝑛𝑐𝑒(𝐴, 𝐵) = 6 + 5 − 2 × 4 = 3
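The method above can be implemented with a standard longest-common-subsequence (LCS) computation. The following Python sketch (our own code) reproduces the example:

def lcs_length(a, b):
    """Length of the longest common sub-sequence of strings a and b,
    computed with the classic dynamic-programming table."""
    n, m = len(a), len(b)
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if a[i - 1] == b[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    return dp[n][m]

def edit_distance(a, b):
    """Edit distance as defined above: n + m - 2 * ls."""
    ls = lcs_length(a, b)
    return len(a) + len(b) - 2 * ls

print(edit_distance("abcdef", "acdfg"))   # 3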
As another example, consider the following two bit-vectors. The Hamming distance
between them is the number of positions in which they differ (marked Y below),
which is 4 in this case:

Vector A       1   0   1   1   0   0   1   1
Vector B       1   1   0   1   0   0   0   0
Difference     N   Y   Y   N   N   N   Y   Y
You may use any of these distance measures based on the type of data being
used or the type of problem.
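For instance, the cosine and Hamming distances mentioned above can be computed with the following minimal Python sketch (our own code; the formulas are the standard ones, as the detailed derivations are left to the further readings):

import math

def cosine_distance(x, y):
    """Cosine distance in degrees: the angle between vectors x and y (0 to 180)."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return math.degrees(math.acos(dot / (norm_x * norm_y)))

def hamming_distance(x, y):
    """Hamming distance: the number of positions in which two vectors differ."""
    return sum(1 for a, b in zip(x, y) if a != b)

A = [1, 0, 1, 1, 0, 0, 1, 1]
B = [1, 1, 0, 1, 0, 0, 0, 0]
print(hamming_distance(A, B))            # 4
print(round(cosine_distance(A, B), 2))   # the angle between A and B, in degrees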
Textual Analysis:
Textual data is produced by a large number of sources of Big data, e.g. email,
blogs, websites, etc. Some of the important types of analytics that you may
perform on the textual data are:
1. Structured Information Extraction from the unstructured textual
data: The purpose here is to generate information from the data
that can be stored for a longer duration and can be reprocessed
easily if needed. For example, the Government may find the list of
medicines that are being prescribed by doctors in various cities by
analysing the prescriptions given by the doctors to different
patients. Such analysis would essentially require recognition of
entities, such as doctor, patient, disease, medicine etc. and then
relationships among these entities, such as among doctor, disease
and medicine.
2. Meaningful summarization of single or multiple documents: The
basic objective here is to identify and report the key aspects of a
large group of documents, e.g. financial websites may be used to
generate data that can be summarised to produce information for
stock analysis of various companies. In general, the summarization
techniques either extract some important portions of the original
documents based on the frequency of words, location of words
etc.; or semantic abstraction-based summaries, which use Artificial
Intelligence to generate meaningful summaries.
3. Question-Answering: In the present time, these techniques are
being used to create automated query-answering systems. Inputs to
such systems are the questions asked in the natural languages.
These inputs are processed to determine the type of question,
keywords or semantics of the questions and possible focus of the
question. Next, based on the type of the Question-Answering
system, which can be either an Information Retrieval based or a
knowledge-based system, either the related information is retrieved
or information is generated. Finally, the best possible answer is
sent to the person who has asked the question. Some examples of
Question-Answering systems are Siri, Alexa, Google Assistant,
Cortana, IBM Watson, etc.
4. Sentiment Analysis: Such analysis is used to determine the
opinion or perception of feedback about a product or service.
Sentiment analysis at the document level determines if the given
feedback is positive feedback or negative feedback. However,
techniques have been developed to perform sentiment analysis at
the sentence level or aspect level.
Audio-Video Analysis:
This is one of the largest chunks of present-day data. Such data can be part of
social networks, micro-blogging, wiki-based content and many other related
web applications. The challenge here is to obtain structured information from
noisy, user-oriented, unstructured data available in a large number of diverse
and dispersed web pages to produce information like finding a connected
group of people, community detection, social influence, link predictions etc.
leading to applications like recommender systems.
Predictive Analysis:
The purpose of such analysis is to predict some of the patterns of the future
based on the present and past data. In general, predictive analysis uses some
statistical techniques like moving averages, regression and machine learning.
However, in the case of Big data, as the size of the data is very large and it has
low veracity, you may have to develop newer techniques to perform predictive
analysis.
Some of the Big data problems can also be addressed using machine learning
algorithms, which create models by learning from the enormous data. These
models are then used to make future predictions. In general, these algorithms
can be classified into two main categories – Supervised Learning algorithms
and unsupervised learning algorithms.
Supervised Learning:
Supervised learning uses already available knowledge to generate models.
One of the most common examples of supervised learning is spam detection,
in which the word patterns and anomalies of already classified emails
(spam or not-spam) are used to develop a model, which is then used to
detect whether a newly arrived email is spam or not. Supervised learning
problems can further be categorised into classification problems and
regression problems.
Unsupervised Learning:
Unsupervised learning generates its own set of classes and models. One of
the most common examples of unsupervised learning is creating a new
categorisation of customers based on their feedback on different types of
products. This new categorisation may then be used to market different
types of products to these customers. Such problems are also called
clustering problems.
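As a small illustration of the two categories (our own sketch using the scikit-learn library, which is not prescribed by this unit; the tiny data sets are made up for the example):

from sklearn.linear_model import LogisticRegression
from sklearn.cluster import KMeans

# Supervised learning (classification): features with known labels
# (e.g. 1 = spam, 0 = not-spam) are used to fit a model.
X_train = [[0.1, 3], [0.9, 15], [0.2, 4], [0.8, 12]]
y_train = [0, 1, 0, 1]
classifier = LogisticRegression().fit(X_train, y_train)
print(classifier.predict([[0.85, 14]]))   # predicted label for a new email

# Unsupervised learning (clustering): no labels are given; the algorithm
# groups similar customers (here represented by two feedback scores).
X_customers = [[1, 2], [1, 3], [8, 9], [9, 8]]
clusters = KMeans(n_clusters=2, n_init=10).fit_predict(X_customers)
print(clusters)                           # cluster id assigned to each customer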
You can obtain more details on supervised and unsupervised learning in the
AI and Machine learning course (MCS224).
9.7 SUMMARY
This unit introduces you to basic techniques for finding similar items. The Unit first
explains the methods of finding similarity between two sets. In this context, the
concept of Jaccard similarity, document similarity and collaborative similarity are
discussed. This is followed by a detailed discussion on one of the important set
similarity applications – document similarity. For finding document similarity of a
very large number of documents, a document is first converted to a set of shingles,
next the minhashing function is used to compute the signatures of document sets
against these shingles. Finally, locality-sensitive hashing is used to find the sets that
must be compared to find similar documents. The locality-sensitive hashing greatly
reduces the number of documents that should be compared to find similar documents.
The Unit then describes several common distance measures that may be used in
different types of Big data analysis. Some of the distance measures defined are
Euclidean distance, Jaccard distance, Cosine distance, Edit distance and Hamming
distance. Further, the unit introduces you to some of the techniques that are used to
perform analysis of big data.
9.8 SOLUTIONS/ANSWERS
3. The minhash signatures for the following three orderings are shown:
5. The locality sensitive hashing first divides the signature matrix into
several horizontal bands and hashes each column using a hash function
to hash buckets. The similarity is checked only for the documents that
hash into the same bucket. It may be noted that the chance that similar
document sets are hashed to the same bucket by at least one band (hash
function) is high, whereas documents that are not similar have a low
chance of hashing into the same bucket.
9.9 REFERENCES/FURTHER READINGS
UNIT 10 MINING DATA STREAMS
10.0 Introduction
10.1 Objectives
10.2 Data Streams
10.2.1 Model for Data Stream Processing
10.3 Data Stream Management
10.3.1 Queries of Data Stream
10.3.2 Examples of Data Stream and Queries
10.3.3 Issues and Challenges of Data Stream
10.4 Data Sampling in Data Streams
10.4.1 Example of Representation Sample
10.5 Filtering of Data Streams
10.5.1 Bloom Filter
10.6 Algorithm to Count Different Elements in Stream
10.7 Summary
10.8 Answers
10.9 References/Further Readings
10.0 INTRODUCTION
In the previous Unit of this Block, you have gone through the concepts relating to
mining of Big data, where different distance measures and techniques were discussed
to uncover hidden pattern in Big data. However, there are certain applications, like
satellite communication data, sensor data etc. which produce a continuous stream of
data. This data stream can be regular or irregular, homogenous or heterogeneous, partly
stored or completely stored etc. Such data streams are processed using various
techniques.
This unit explains the characteristics and models of data stream processing. It also
identifies the challenges of stream processing and introduces you to various techniques
for processing of data streams. You may refer to further readings of this Unit for more
details on data streams and data stream processing.
10.1 OBJECTIVES
After going through this Unit, you would be able to:
• Define the characteristics of data streams
• Discuss models of data stream processing
• Explain the uses of data streams
• Illustrate the example of data stream queries
• Explain the role of Bloom filter in data stream
• Explain an algorithm related to data stream processing.
Usually, the data resides in a database or a distributed file system from where users can
access the same data repeatedly, as it is available to the users whenever they need it.
But there are some applications where the data does not reside in a database, or if it
does the database is so large that the users cannot query it fast enough to answer
questions about it. One such example is the data being received from weather satellites.
Answering queries about this sort of data requires clever approximation techniques and
methods for compressing the data in a way that still allows us to answer the queries we
need to answer.
Thus, mining data stream is a process to extract knowledge in real time from a large
amount of volatile data, which comes in an infinite stream. The data is volatile because
it is continuously changing and evolving over time. The system does not store the data
in the database due to the limited amount of resources. How would you analyse this data
stream? The following section presents the models for data stream processing.
[Figure 1: Models for data stream processing over the stream a b c d e f a b c d b c a b c –
(a) the landmark window, (b) sliding windows at t1, t2 and t3, and (c) the damped window
with decreasing weights]
(b) DSMS:
In a DSMS, the management cannot control the rate of input. For example, the
search queries that arrive at Google search engine are generated by random
people around the globe, who search for information at their respective pace.
Google staff literally has no control over the rate of arrival of queries. They
have to design and architect their system in such a way that it can easily deal
with the varying data rate.
Data Stream Management System (DSMS) extracts knowledge from multiple data
streams by eliminating undesirable elements, as shown in Figure 2. DSMS is important
where the input data rate is controlled externally. For example, Google queries.
Figure 2: A simple outline of Data Stream Management System
The sources of data streams include Internet traffic, online transactions, satellite data,
sensors data, live event data, real-time surveillance systems, etc.
Figure 3 shows the detailed view of a data stream management system. The components
of this system are described as follows:
• Output: The system makes output in response to the standing queries and the
ad-hoc queries (refer to section 10.3.1).
• Archival Storage: There is a massive archival store, but we cannot assume that
it is organised like a database system, nor that appropriate indices or other
tools exist to efficiently answer queries from that data. We only know that if
we had to reconstruct the history of the streams from it, it could take a very
long time.
• Input tuples: The input elements are the tuples of a very simple kind such as
bits or integers. There are one or more input ports at which data arrives. The
arrival rates of input tuples are very high on these input ports.
• Arrival rate: The arrival rate of data is fast enough that it is not feasible for the
system to store all the arriving data and at the same time make it instantaneously
available for any query that we might want to perform on the data.
• Critical calculations: The algorithms for data streams are general methods that
use a limited amount of storage (perhaps only main memory) and still enable us
to answer important queries about the content of the stream. However, it
becomes difficult to perform critical calculations about the data stream with
such a limited amount of memory.
Check Your Progress 1:
1. Define the data stream processing. Which model of data stream processing is
useful in finding stock market trends? Justify.
2. Differentiate between DBMS and DSMS. Why all the data of data streams is
not stored?
If the data on the stream is arriving too rapidly, we may not want to or need to look at
every stream element. Perhaps we can work with just a small sample of the values in
the stream.
While taking samples, you need to consider the following two issues:
• First, the sample should be unbiased
• Second, the sampling process must preserve the answer to the query or queries
that you may like to ask about the data
Thus, you require a method that preserves the answer to the queries. The following
example explains the concepts stated above:
Google has a stream of search queries that are generated at all the times. You might
want to know what fraction of search queries received over a period (say last month)
were unique? This query can also be written as: What fraction of search queries have
only one occurrence in the entire month?
One of the sampling techniques that can be used for data streams to answer the queries
may be to randomly select 1/10th of the stream. For example, if you want to know what
fraction of the search queries are single-word queries; you can compute that fraction
using the 1/10th sample of the data stream. The computed fraction would be very close
to the actual fraction of single word queries over the entire data stream. Thus, over the
month, you would be testing 1/10th of the data stream as sample. If those queries are
selected at random, the deviation from the true answer will be extremely small.
However, the query – to find the fraction of unique queries, cannot be answered
correctly from a random sample of the stream.
10.4.1 The Representative Sample
In this section, we discuss about the need of a representative sample with the help of an
example. The query that needs to be answered is – “to find the fraction of unique search
queries”:
• We know that the length of a sample is 10% of the length of the whole stream
(as mentioned above in section 10.4). The problem is that the probability of a
given query appearing to be unique in the sample gets distorted because of the
sampling.
• Suppose a query is unique in the data stream. It has only a 1/10th of chance of
being selected for the sample. It implies that fraction of truly unique queries
that may get selected into the sample is the same as for the entire stream. If we
could only count the truly unique queries in the sample, we would get the right
answer, as we are trying to find proportions.
• However, suppose a search query appears exactly twice in the whole stream.
The chance that the first occurrence will be selected for the sample is 10% and
the chance that the second occurrence will not be selected is 90%. Multiply
those and we have a 9% chance of this query occurrence being unique in the
sample.
• Moreover, the case where the first occurrence is not selected but the second one
is selected also has a 9% chance. Thus, a query that really occurs twice may
appear as unique in the sample with a total probability of 18%.
• Similarly, a query that appears in the stream three times has a 24.3% chance
of appearing as a unique query in the sample.
• In fact, any query, no matter how many times it appears in the original stream,
has at least a small chance of appearing as a unique query in the sample.
So, when you count the number of unique queries in the sample, it will be an
overestimate of the true fraction of unique queries.
In the example given above, the problem was that we performed sampling on the basis
of the position in the stream, rather than the value of the stream element. In other words,
we assumed that we flipped a 10-sided coin every time a new element arrived in the
stream. The consequences are that when a query occurs at several positions in the stream,
we decide independently whether to add or not to add the query into the sample.
However, this is not the sampling which we are interested in, for answering the query
about finding the unique queries.
We want to pick 1/10thof the search queries, not 1/10th of the instances of search queries
in the stream. We can make a random decision when we see a search query for the first
time.
If we kept a table/list of our decision for each search query we have ever seen, then each
time a query appears, we could look it up in the table. If the query is found in the
table/list, then we simply follow the decision recorded earlier, i.e., either add it to the
sample or do not add it. But if the query is not found in the table, then we flip the
ten-sided coin to decide what to do with it and record the query and the outcome in the table.
However, it will be hard to manage and lookup such a table with each stream element.
Fortunately, there is a much simpler way to get the same effect without storing anything
in the list by using a Hash function.
• Select a hash function, which maps the search queries into 10 buckets i.e. 0 to
9.
• Apply the hash function when a search query arrives, if it is mapped to bucket
0, then add it to the sample, otherwise if it maps to any of the other 9 buckets,
then do not add it to the sample.
The advantage of this approach is that all occurrences of the same query would be
mapped to the same bucket, as the same hash function is applied. As a result, you do
not need to know whether the search query that just arrived has been seen before or not.
Therefore, the fraction of unique queries in the sample is the same as for the stream as
a whole. The result of sampling this way is that 1/10th of the queries are selected for the
sample.
If selected, then the query appears in the sample exactly as many times as it does in the
entire data stream. Thus, the fraction of unique queries in the sample should be exactly
as it is in the data stream.
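The hash-based sampling of search queries can be sketched as follows (our own illustrative Python code; MD5 is used only because it is a readily available, stable hash so that every occurrence of the same query maps to the same bucket):

import hashlib

def in_sample(query, buckets=10, accepted_bucket=0):
    """Decide whether a search query belongs to the 1/10th sample.
    All occurrences of the same query hash to the same bucket, so either
    every occurrence of a query is sampled or none of them is."""
    h = int(hashlib.md5(query.encode('utf-8')).hexdigest(), 16)
    return h % buckets == accepted_bucket

stream = ["weather today", "python tutorial", "weather today", "cheap flights"]
sample = [q for q in stream if in_sample(q)]
print(sample)   # about 1/10th of the distinct queries, with all their occurrences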
What if the total sample size is limited?
Suppose you want your sample not to be a fixed fraction of the total stream, but a fixed
number of samples from the stream.
In that case, the solution would be to perform hashing to a large number of
buckets, such that the resulting sample just stays within the size limit.
As more stream elements are added, your sample gets too large. In that case, you can
pick one of the buckets that you have included in the sample and delete all the stream
elements from the sample that hash to that bucket. Organizing the sample by bucket can
make this decision process efficient.
With this different way of sampling, the stream unique query problem will be addressed
as:
Ø You still want a 10% sample for the search queries, however, eventually even
the 10% sample will become too large. Hence, you want the ability to throw
out some fraction of sampled elements. You may have to do it repeatedly. If
any one occurrence of the query is thrown out, then all other occurrences of the
same query are thrown out.
Ø Perform hashing to 100 buckets for our example, but for a real data stream you
may require a million buckets or even more, as long as you want 10% sample.
Ø You could choose any 10 buckets out of 100 buckets for the sample. For the
present example, let us choose bucket 0 to bucket 9.
Ø If the sample size gets too big, then you pick one of the buckets to remove the
samples, say bucket 9. It means that you delete those elements from the sample
that hash to bucket 9 while retaining those that hash to bucket 0 to bucket 8.
You just returned bucket 9 to the available space. Now, your sample is 9% of
the stream.
Ø In future, you add only those new stream elements to sample that hash to bucket
0 through bucket 8.
Ø Sooner or later, even the 9% sample will exceed our space bound. So, you
remove those elements from the sample that hash to bucket 8, then to bucket 7,
and so on.
Sampling Key-Value Pairs
The idea, which has been explained with the help of an example given above, is really
an instance of a general idea. A data stream can be in any form of key-value pairs. You
can choose a sample by picking a random set of keys of the desired size and taking all
key-value pairs whose key falls into the accepted set, regardless of the associated value.
In our example, the search query itself was the key with no associated value. In general,
we select our sample by hashing keys only, the associated value is not part of the
argument of the hash function.
You can select an appropriate number of buckets for acceptance and add to sample each
key-value pair whose key hashes to one of the accepting buckets.
Example: Salary Ranges
Ø Assume that the data stream elements are tuples with three components – the ID
of an employee, the department that the employee works for, and the salary of
that employee:
StreamData = tuples(EmpID, Department, Salary)
Ø For each department, there is a salary range, which is the difference between
the maximum and minimum salaries and is computed from the salaries of all
the employees of that department.
Query: What is the average salary range within a given department?
Assuming that you want to use a 10% sample of those stream tuples to estimate the
average salary range. Picking 10% of the tuples at random would not work for a given
department, as you are likely to be missing one or both of the employees with the
minimum or maximum salary in that department. This will result in computation of a
lower difference between the MAX and MIN salaries in the sample for a department.
Key= Department
Value= (EmpID, Salary)
The right way to sample is to treat only the department component of tuples as the key
and the other two components: employee ID and salary as part of the value. In general,
both the key and value parts can consist of many components.
If you sample this way, you would be sampling a subset of the departments. But for
each department in the sample, you get all its employee salary data, and you can
compute the true salary range for that department.
When you compute the average of the ranges, you might be off a little because you are
sampling the ranges for some departments rather than averaging the ranges for all the
departments. But that error is just random noise introduced by the sampling process and
not a bias in one direction or another.
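The department-keyed sampling can be sketched as follows (our own illustrative Python code, again hashing only the key):

import hashlib
from collections import defaultdict

def accept_key(key, buckets=10, accepted_bucket=0):
    """Hash only the key (the department); either all of a department's
    tuples enter the sample or none of them do."""
    h = int(hashlib.md5(key.encode('utf-8')).hexdigest(), 16)
    return h % buckets == accepted_bucket

stream = [("E1", "Sales", 30000), ("E2", "Sales", 90000),
          ("E3", "HR", 40000), ("E4", "Sales", 55000)]

salaries = defaultdict(list)
for emp_id, dept, salary in stream:
    if accept_key(dept):              # key = Department, value = (EmpID, Salary)
        salaries[dept].append(salary)

# true salary range for every department that falls in the sample
ranges = {d: max(s) - min(s) for d, s in salaries.items()}
print(ranges)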
Check Your Progress 2:
1. What are the different ways of sampling data stream?
3. What is the purpose of sampling using <Key, Value> pair? Why did you choose
department as the key in the example?
In the previous sections, you were answering queries using samples and recent window
items. What would you do if you want to extract information from the entire data
stream? You may have to use filtering of the data stream for this. In this section,
we discuss one data stream filter, called the Bloom filter.
10.5.1 Bloom filter
Bloom filters enable us to select only those items in the stream that are on some list or
in some set of items. In general, a Bloom filter is used for cases where the number of
items in the list or set is so large that you cannot compare each stream element with
every element of the list to check set membership.
Need of Bloom filters:
Let us explain the need of Bloom filter with the help of an example application of a
search engine.
• Web crawler performs many crawling tasks and uses different processors to
crawl pages.
• The crawler maintains a list of all URLs in the database that it has already found.
Its goal is to explore the web pages in each of these URLs to find the additional
URLs that are linked to these web pages.
• It assigns these URLs to any of a number of parallel tasks, these tasks stream
back the URLs they find in the links they discover on a page.
• However, the same URL is not expected to get into the list twice, because in
that case you would be wasting your time crawling the page twice. So, each time
a URL comes back to the central controller, it needs to determine whether it has
seen that URL before and, if so, discard the second report. You could create an
index, say a hash table, to make it efficient to look up a URL and see whether
it is already among the web pages that have been indexed.
• But the number of URLs is extremely large, and such an index will not fit in
the main memory. It may have to be stored in secondary memory (hard disk),
requiring a disk access every time a URL is to be checked. This would be a very
time-consuming process.
Therefore, you need a Bloom filter.
• When a URL arrives in a stream, pass it through a Bloom filter. This filter will
determine if the URL has already been visited or not.
• If the filter says it has not been visited earlier, the URL will be added to the list
of URLs that needs to be crawled. And eventually it will be assigned to some
crawling task.
• But the Bloom filter can have false positives, which would result in marking
some of the URLs as already visited while, in fact, they were not.
• The good news is that if the Bloom filter says that a URL has never been seen,
then that is true, i.e., there are no false negatives.
Working of Bloom filter:
• Bloom filter itself is a large array of bits, perhaps several times as many bits as
there are possible elements in the stream.
• Each hash function maps a stream element to one of the positions in the array.
• Now, when a stream element, say x, arrives, we compute the value of hi(x) for
each hash function hi that are to be used for Bloom filtering.
• A hash function maps a stream element to an index value on Bloom filter array.
In case, this index value is 0, then it is changed to 1.
The following example explains the working of the Bloom filter in detail.
Example of Bloom filters:
For example, the size of an array that is going to be used for Bloom filter is of 11 bits,
i.e., N=11. Also, assume that the stream that is to be filtered consists of only unsigned
integers of 12 bits, i.e., Stream elements=unsigned integers. Further, for the purpose of
this example, let us use only two hash functions h1 and h2, as given below:
Ø The first hash function h1 maps an integer x to a hash value h1(x) as follows:
o Write the binary value of integer x. Select the bits at the odd bit
positions, counting positions from the right-most (least significant) bit as position 1.
o Extract these odd bits of x to another binary, say xodd
o Take modulo as: (xodd modulo 11) to map x into hash value h1(x).
Ø h2(x) is computed in exactly the same manner except that it collects the even bit
positions of the binary representation to create xeven. The modulo is also
computed using xeven modulo 11.
Next, you will initialize all the array values of the Bloom filter to zero. Assuming that
the set of valid stream elements is {25, 159, 585}, you train the Bloom filter as:
Initial Bloom filter contents=00000000000, which is representation as:
Bloom Filter Array Index 0 1 2 3 4 5 6 7 8 9 10
Bloom Filter 0 0 0 0 0 0 0 0 0 0 0
Bit Position 12 11 10 9 8 7 6 5 4 3 2 1
Binary Equ 2048 1024 512 256 128 64 32 16 8 4 2 1
x = 25 0 0 0 0 0 0 0 1 1 0 0 1
xodd 0 0 0 1 0 1
h1(x) 000101 = 5 in decimal; h1(x) = 5 mod 11 = 5
x 0 0 0 0 0 0 0 1 1 0 0 1
xeven 0 0 0 0 1 0
h2(x) 000010 = 2 in decimal; h2(x) = 2 mod 11 = 2
Assume that you are using the Bloom filter to test the membership of a 12-bit unsigned
integer in the set {25, 159, 585}. The Bloom filter for this set is 10100101010 (as shown
above). Let us find whether a given stream element is a member of the set or not.
Lookup element y = 118
Bit Position 12 11 10 9 8 7 6 5 4 3 2 1
Binary Equ 2048 1024 512 256 128 64 32 16 8 4 2 1
y = 118 0 0 0 0 0 1 1 1 0 1 1 0
yodd 0 0 1 1 1 0
h1(y) 001110 = 14 in decimal; h1(y) = 14 mod 11 = 3
y 0 0 0 0 0 1 1 1 0 1 1 0
yeven 0 0 0 1 0 1
h2(y) 000101 = 5 in decimal; h2(y) = 5 mod 11 = 5
Since bit position 3 of the Bloom filter is 0, there is a mismatch; therefore, y is not a
member of the set and it can be filtered out.
However, there can be false positives, when you use Bloom filter. For example:
Lookup element y = 115
Bit Position 12 11 10 9 8 7 6 5 4 3 2 1
Binary Equ 2048 1024 512 256 128 64 32 16 8 4 2 1
y= 115 0 0 0 0 0 1 1 1 0 0 1 1
yodd 0 0 1 1 0 1
h1(y) 001101 = 13 in decimal; h1(y) = 13 mod 11 = 2
y 0 0 0 0 0 1 1 1 0 0 1 1
yeven 0 0 0 1 0 1
h2(y) 000101 = 5 in decimal; h2(y) = 5 mod 11 = 5
Since bit positions 2 and 5 are both set in the Bloom filter, there is no mismatch;
therefore, y is reported as a member of the set and cannot be filtered out. However,
you may notice that this is a false positive, as 115 is not actually in the set.
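The whole example can be reproduced with the following Python sketch (our own code implementing the two odd/even-bit hash functions and the 11-bit filter exactly as described above):

N = 11   # number of bits in the Bloom filter array

def select_bits(x, start):
    """Collect every second bit of the 12-bit value x, starting at bit
    position `start` (position 1 = least significant), as an integer."""
    value, weight = 0, 1
    for pos in range(start, 13, 2):          # positions start, start+2, ..., <= 12
        if x & (1 << (pos - 1)):
            value += weight
        weight *= 2
    return value

def h1(x):                                   # odd bit positions, modulo 11
    return select_bits(x, 1) % N

def h2(x):                                   # even bit positions, modulo 11
    return select_bits(x, 2) % N

# Train the filter on the set {25, 159, 585}
bloom = [0] * N
for element in (25, 159, 585):
    bloom[h1(element)] = 1
    bloom[h2(element)] = 1
print(''.join(map(str, bloom)))              # 10100101010

def maybe_member(y):
    """True means 'possibly in the set' (false positives are possible)."""
    return bloom[h1(y)] == 1 and bloom[h2(y)] == 1

print(maybe_member(118))   # False - correctly filtered out
print(maybe_member(115))   # True  - a false positive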
In stream processing, instead of an exact solution, you can sometimes accept an
approximate solution. One such algorithm, which counts the number of different elements
in a stream in a single pass, was given by Flajolet and Martin. This algorithm is
discussed below:
Steps of the algorithm:
1. Pick a hash function h that maps each of the n elements of the data stream to at
least log2(n) bits.
2. For each stream element a, let r(a) be the number of trailing 0’s in h(a).
3. Record R = the maximum r(a) seen.
4. Estimated number of different elements = 2^R.
Example:
Given a good uniform distribution of numbers as shown in Table 1. It has eight different
elements in the stream.
Probability that the right-most set bit is at position 0 = 1/2
At position 1 = 1/2 × 1/2 = 1/4
At position 2 = 1/2 × 1/2 × 1/2 = 1/8
…
At position n = 1/2^(n+1)
We keep a record of the position of the right-most set bit, say ρ, of the hash value of
each element in the stream. We expect the fraction of elements with ρ = 0 to be 0.5, with
ρ = 1 to be 0.25, and so on. Also consider that m is the number of distinct elements in
the stream.
Ø This probability comes close to 0 when the bit position b is greater than log m
Ø This probability is non-zero when b <= log m
Therefore, if we find the right-most unset bit position b such that the probability = 0,
we can say that the number of unique elements will approximately be 2b. This forms the
core intuition behind the Flajolet Martin algorithm.
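A minimal sketch of the Flajolet–Martin estimate is given below (our own illustrative Python code; a stable hash is used in place of the carefully chosen hash families discussed in the further readings):

import hashlib

def trailing_zeros(n):
    """Number of trailing 0 bits in n (defined as 0 for n == 0)."""
    if n == 0:
        return 0
    count = 0
    while n & 1 == 0:
        n >>= 1
        count += 1
    return count

def flajolet_martin(stream):
    """Estimate the number of distinct elements as 2**R, where R is the
    maximum number of trailing zeros seen among the hash values."""
    R = 0
    for element in stream:
        h = int(hashlib.md5(str(element).encode('utf-8')).hexdigest(), 16)
        R = max(R, trailing_zeros(h))
    return 2 ** R

stream = [1, 3, 2, 1, 2, 3, 4, 3, 1, 2, 3, 1]   # 4 distinct elements
print(flajolet_martin(stream))                   # a rough power-of-two estimate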
A detailed discussion on this algorithm can be referred from the further reading.
Check Your Progress 3:
1. Explain how Bloom filter can be used to identify emails that are not from a
selected group of addresses.
10.7 SUMMARY
This unit introduces the concept of mining data streams. Data streams can be processed
using different models. Three such models (landmark, sliding windows, and damped
model) for data stream processing are introduced in this unit. Further, data stream
management system (DSMS) is explained. In addition, different types of queries of data
stream namely, ad-hoc and standing queries are discussed followed by examples of
these queries. The issues and challenges of Data streams and data sampling in data
streams with examples of representation sample and sampling Key-Value pairs are
discussed in this unit. This unit also explains bloom filter with its need, working, and
related examples. In the end, this unit shows the algorithm to approximately count
different elements in stream with example.
10.8 ANSWERS
3. The purpose of sampling using <Key, Value> pair is that you can choose a
sample by picking a random key set of a desired size and take all key value
pairs whose key falls into the accepted set regardless of the associated value.
The department is chosen as the key in the example because the salary range is
computed per department; hashing on the department key ensures that either all the
tuples of a department are included in the sample or none are, so the true range can
be computed for every sampled department.
10.9 REFERENCES/FURTHER READINGS
[1] Mansalis, Stratos, et al. "An evaluation of data stream clustering algorithms."
Statistical Analysis and Data Mining: The ASA Data Science Journal 11.4 (2018): 167-
187
[2] Albert C. “Introduction to Stream Mining.”
https://towardsdatascience.com/introduction-to-stream-mining-8b79dd64e460
[3] “Mining Data Streams.” http://infolab.stanford.edu/~ullman/mmds/ch4.pdf
[4] Bhyani, A., “Approximate Count-Distinct using Flajolet Martin Algorithm.”
https://arpitbhayani.me/blogs/flajolet-martin
UNIT 11 LINK ANALYSIS
11.0 INTRODUCTION
In the previous Units of this Block, you have gone through the concepts of measuring
distances and different algorithms for handling data streams. In this unit, we will
discuss link analysis, which is used for computing the PageRank. The PageRank algorithm
uses a graph to represent the web and computes the rank based on the probability of
moving to different links. Since the size of the web is enormous, the computation
requires operations on very large matrices. Therefore, the PageRank algorithm uses the
MapReduce programming paradigm over a distributed file system. This unit discusses
several of these algorithms.
11.1 OBJECTIVES
After going through this Unit, you will be able to:
• Define link analysis
• Use graphs to perform link analysis
• Explain computation of PageRank
• Discuss different techniques for computation of PageRank
• Use MapReduce to compute PageRank
Link analysis is a data analysis technique used in network theory to analyse the links
of the web graph. A graph consists of a set of nodes and a set of edges, i.e.,
connections between nodes. Graphs are used everywhere, for example, in social media
networks such as Facebook, Twitter, etc.
Purpose of Link Analysis
The purpose of link analysis is to discover connections in a set of data that can be
represented as a network of information. For example, on the Internet, computers and
routers communicate with each other, which can be represented as a dynamic network
whose nodes are the computers and routers. The edges of the network are the physical
links between these machines. Another example is the web, which can also be
represented as a graph.
[Figure 1: An example web graph with the PageRank scores of its nodes – A = 40%,
B = 36%, and smaller scores (10%, 5%, 3% and 2%) for the remaining nodes C, D, E, …]
The PageRank scores shown in Figure 1 are fairly intuitive, and they correspond to our
notion of how important a node in the graph is. However, how can the PageRank be
computed? The next sections answer this question.
In the last section, we discussed the basic concepts that can be used to compute the
PageRank, without giving details of how you can actually compute it. This section
provides details of the PageRank computation process.
11.4.1 Different Mechanisms of Finding PageRank
In this section, we discuss two important techniques that are employed to compute the
PageRank, viz. Simple recursive formulation and the flow model.
The Simple Recursive Formulation technique will be used to find PageRank Scores in
which:
• Each link is considered as a vote and the importance of a given vote is
proportional to the importance of the source web page that is casting this vote
or that is creating a link to the destination.
• Suppose that there is page j with an importance rj.
• The page j has n outgoing links.
• This importance rj of page j basically gets split on all its outgoing links evenly.
• Each link gets rj divided by n votes i.e., rj/n votes.
This is how the importance gets divided from j to the pages it points to. And in a similar
way, you can define the importance of a node as the sum of the votes that it receives
on its in-links.
Figure 2 shows a simple graph of the votes related to a page labelled j. You can
calculate the score of node j as the sum of the votes received from nodes i and k,
using the following formula:

rj = ri/2 + rk/3

This is so because page i has 2 out-links and page k has 3 out-links. The score of
page j is further propagated out of j along its three outgoing links, so each of these
links carries the importance of node j divided by 3.
[Figure 2: Node j receives the votes ri/2 from node i and rk/3 from node k, and passes
rj/3 along each of its three outgoing links]
This is basically the mechanism by which every node collects the importance of the
pages that point to it and then propagates it to its neighbours.
The Flow Model: This model is basically the vote flow through the network. So that
is why this is called the flow formulation or a flow model of PageRank. To understand
the concept, let us use a web graph of WWW that contains only three web pages named
a, b, and c (Please refer to Figure 3).
In this model, the importance of a page j is given by:

rj = Σi→j ri / di

where di is the out-degree of node i. You may recall that the out-degree of a node is
the number of links that go out of it. Thus, the importance score of page j is simply
the sum, over all the pages i that point to it, of the importance of page i divided by
its out-degree.
This means that for every node in the network, we obtain a separate equation based on
its incoming links. For example, the importance of node a in the network is simply
the importance of a divided by 2 plus the importance of b divided by 2, because a has
2 outgoing links and, similarly, node b has 2 outgoing links. Thus, the three
equations, one for each node of Figure 3, would be:
ra = ra/2 + rb/2
rb = ra/2 + rc
rc = rb/2
Figure 3: A very simple web graph of three pages a, b and c, with flows a/2, a/2, b/2,
b/2 and c along its edges [2]
The problem is that these three equations do not have a unique solution. So, there is a
need to add an additional constraint to make the solution unique. Let us add a constraint
- The total of all the PageRank scores should be one, which is given by the following
equation:
ra + rb + rc = 1
You can solve these equations and find the solution to the PageRank scores, which is:
ra= 2/5; rb = 2/5; rc = 1/5
Problem: This approach works well for small graphs, but it would not work for a graph
with a billion web pages, as it would result in a system of billions of equations that
we would be required to solve for computing the PageRank. So, you need a different
formulation.
11.4.2 Web Structure and Associated Issues
In a web structure, the original idea was to manually categorize all the pages of the
WWW into a set of groups. However, this model is not scalable, as web has been
growing far too quickly.
Another way to organize the web and to find things on the web is to search the web.
There is a rich literature, particularly in the field of information retrieval, that covers
the problem of how do you find a document in a large set of documents. You can think
of every web page as a document, the whole web is one giant corpus of documents,
and your goal would be to find relevant document based on a given query from this
huge set.
The traditional information retrieval field was interested in finding these documents in
relatively small collection of trusted documents. For example, finding a newspaper
collection. However, now the web is very different. The web is huge and full of
untrusted documents, random things, spam, unrelated things, and so on.
Issues: The associated issues are:
So far, you know that the importance of page j in a web graph is the sum of the
importance of page i that point to it divided by the out-degree of the ith node. This can
be represented using the following equation, as given earlier:
rj = Σi→j ri / di
[Figure 4: A web graph with five nodes a, b, c, d and e]
The graph in Figure 4 can be written as the following matrix M, where the ith row
represents the ith node. The matrix represents the equations given above.
        0     0     0    1/2   1/2
       1/3    0     0     0    1/2
M =    1/3   1/2    0     0     0
       1/3   1/2    1     0     0
        0     0     0    1/2    0
r^(t+1) = M · r^t        (1)

where r^t is the PageRank vector at the t-th iteration, and the matrix M is the link
matrix, as shown above.
The following example explains this concept.

Example: Consider the graph of Figure 3 and compute the PageRank using the matrix M,
assuming the starting value of the PageRank vector as:

r^0 = [ra, rb, rc]^T = [1/3, 1/3, 1/3]^T

The matrix M for Figure 3 is:

        1/2   1/2    0
M =     1/2    0     1
         0    1/2    0
Application of equation (1) will be:

        1/2  1/2   0       1/3        1/3
r^1 =   1/2   0    1   ×   1/3   =    1/2
         0   1/2   0       1/3        1/6

On applying the equation again, the value would be:

        1/2  1/2   0       1/3        5/12
r^2 =   1/2   0    1   ×   1/2   =    1/3
         0   1/2   0       1/6        1/4

On repeated application, you will get:

r^3 = [3/8, 11/24, 1/6]^T
r^4 = [20/48, 17/48, 11/48]^T
r^5 = [37/96, 42/96, 17/96]^T
r^6 = [79/192, 71/192, 42/192]^T
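These iterations can be reproduced with a few lines of Python (our own sketch using NumPy); the iteration moves towards the solution ra = 2/5, rb = 2/5, rc = 1/5 obtained earlier:

import numpy as np

# Column-stochastic link matrix M of Figure 3 (columns: a, b, c)
M = np.array([[0.5, 0.5, 0.0],
              [0.5, 0.0, 1.0],
              [0.0, 0.5, 0.0]])

r = np.array([1/3, 1/3, 1/3])      # initial PageRank vector r^0

for t in range(6):                 # apply r^(t+1) = M r^t repeatedly
    r = M @ r
    print(t + 1, r)                # after 6 steps: approx. [79/192, 71/192, 42/192]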
To see whether this iteration always converges, consider a simple two-node graph in
which node a links only to b and node b links only to a (see Figure 5), so that the
first column of M is [0, 1]^T and the second column is [1, 0]^T. Starting with
r^0 = [1, 0]^T:

r^1 = M r^0 = [0, 1]^T
r^2 = M r^1 = [1, 0]^T

Likewise, r^3 = [0, 1]^T and r^4 = [1, 0]^T.
• So, in the next time step, the values will flip. And now when we multiply
again, the value will flip again.
So, what we see here is that the computation will never converge, because a score of 1
keeps getting passed back and forth between a and b, while a score of 0 is passed in
the other direction. This problem is called the spider trap problem.
Figure 5: The Spider Trap Problem – two nodes a and b that link only to each other [2]
So, here is the second question: does equation (1) converge to what we want? Consider
now a two-node graph in which a links to b but b has no outgoing links (a dead end),
so that M = [0 0; 1 0], and again start with r^0 = [1, 0]^T.

• In the first multiplication with matrix M, the scores get flipped, i.e., r^1 = [0, 1]^T.
• But in the second multiplication, the score of 1 gets lost:

  r^2 = [0 0; 1 0] [0, 1]^T = [0, 0]^T

  Thus, a and b are not able to pass the score to anyone else, so the score gets lost.
• And we converge to the vector of zeros, which is a problem.
In the dead end problem, the dead ends are those web pages that have no outgoing links,
as shown in Figure 7. Such pages cause importance to “leak out”. The idea is that
whenever such a web page receives its PageRank score, there is no way for it to pass
this score to any other page, because it has no out-links. Thus, the PageRank score
“leaks out” of the system and, in the end, the PageRank scores of all the web pages
become zero. This is called the Dead End problem.
Figure 7: Dead end and Spider trap Problem [2]
In the spider trap problem, out-links from webpages can form a small group as
shown in Figure 7. Basically, the random walker will get trapped in a single part of the
web graph and then the random walker will get indefinitely stuck in that part. At the
end, those pages in that part of the graph will get very high weight, and every other
page will get very low weight. So, this is called the problem of spider traps.
Solution to the spider trap problem: Random Teleports
A random walk is a random process that describes a path consisting of a succession of
random steps. Figure 8(a) shows a graph with node c being a spider trap, so a random
walker that reaches c basically gets stuck in an infinite loop, because there is no
other way out.
The way Google solved the problem of spider traps is to say that, at each step, the
random walker has two choices: with some probability β, the random walker follows an
outgoing link, and with probability 1 − β it teleports (jumps) to a random page on the web.
So, the way we can think of this is that whenever the random walker arrives at a new
page, it flips a coin; if the coin says yes, the walker picks an outgoing link at
random and follows it, and if the coin says no, the walker teleports, i.e., jumps to
some other random page on the web. This means that the random walker will be able to
jump out (teleport out) of a spider trap within only a few time steps: after a few
steps the coin will say “teleport” and the random walker will jump out of the trap.
Figure 8(a) shows the graph with node c being the spider trap; the random walker will
teleport out of the spider trap within a few time steps.
Figure 8(a): Random teleports approach – a graph of nodes a, b and c [2]
Figure 8(b) shows that if a node has no outgoing links, then when we reach that node,
we teleport with probability 1. This basically means that whenever you reach node c,
you always jump out of it, i.e., you teleport to a random web page. So, in the
stochastic matrix M (a stochastic matrix is a square matrix whose columns are
probability vectors), the column corresponding to node c will have the value 1/3 in
all its entries. Basically, whenever a random surfer comes to a dead end, it teleports
out and, with probability 1/3, lands on any node of the graph. This is again a way to
use random jumps or random teleports to solve the problem of dead ends.
[Figure 8(b): Handling a dead end with teleports – node c has no outgoing links, so
its column of M is replaced by teleport probabilities]

Before (c is a dead end):          After adding teleports from c:

       a     b     c                      a     b     c
  a   1/2   1/2    0                a    1/2   1/2   1/3
  b   1/2    0     0                b    1/2    0    1/3
  c    0    1/2    0                c     0    1/2   1/3
So, the PageRank equation discussed in section 11.5 is re-arranged into a different
equation:

rj = Σi→j β (ri / di) + (1 − β)/N,   i.e., in matrix form,   r = β M · r + [(1 − β)/N]N

where M is a sparse matrix (with no dead ends). In every step, the random surfer either
follows a link at random with probability β or jumps to some random page with
probability 1 − β. The value of β is constant and generally lies in the range 0.8 to
0.9. Here [(1 − β)/N]N denotes a vector with all N entries equal to (1 − β)/N.
So, in each iteration of the PageRank equation, we need to compute the product of the
matrix M with the old rank vector:

rnew = β M · rold

and then add the constant value (1 − β)/N to each entry in rnew.
Do:
    ∀j:  r'j(t) = Σi→j β · ri(t−1) / di
         r'j(t) = 0   if the in-degree of j is 0
In the above algorithm, a directed graph G is given as input along with the parameter
β. The graph may have spider traps and dead ends. The algorithm produces a new
PageRank vector r. If the graph does not have any dead end, then the amount of leaked
PageRank is 1 − β; on the other hand, if there are dead ends, the amount of leaked
PageRank may be larger. The equation above assumes that the matrix M has no dead ends:
either the graph is preprocessed to remove all dead ends, or random teleport links are
explicitly followed with probability 1.0 from the dead ends. If M has dead ends, then
Σj r'j(t) < 1, and you also have to renormalise r' so that it sums to 1. The computation
of the new PageRank vector is repeated until the algorithm converges. Convergence can
be checked by measuring the difference between the old and new PageRank vectors. The
algorithm has to explicitly account for dead ends by computing S = Σj r'j(t), the total
PageRank remaining after the update.
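A compact sketch of this computation is given below (our own NumPy code; the leaked PageRank is re-distributed uniformly, which is one common way of implementing the re-normalisation described above):

import numpy as np

def pagerank(M, beta=0.85, eps=1e-8, max_iter=100):
    """Power iteration with random teleports. M should be column-stochastic,
    except that dead-end columns may be all zeros."""
    N = M.shape[0]
    r = np.full(N, 1.0 / N)
    for _ in range(max_iter):
        r_new = beta * (M @ r)          # follow links with probability beta
        S = r_new.sum()                 # total rank retained after the step
        r_new += (1.0 - S) / N          # re-insert the leaked PageRank uniformly
        if np.abs(r_new - r).sum() < eps:
            return r_new
        r = r_new
    return r

# The three-page graph of Figure 3
M = np.array([[0.5, 0.5, 0.0],
              [0.5, 0.0, 1.0],
              [0.0, 0.5, 0.0]])
print(pagerank(M))   # with beta = 0.85 the scores are close to [0.4, 0.4, 0.2]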
11.5.1 PageRank Computation Using MapReduce
[Figure 9: The MapReduce approach – a Master node distributes data and work among
several Slave nodes, each of which processes its share of the data]
There are 4 steps in MapReduce approach as shown in Figure 9 and are explained as
follows:
Step 1-Input Split: The input data is raw and is further divided into small chunks
called input splits. Each chunk will be an input of a single map and the data input will
make a key-value pair(key1, Value1).
Step 2 - Mapping: A node which is given a map function takes the input and produces a
set of (key2, value2) pairs, shown as (K2, V2) in Figure 10. One of the nodes in the
cluster, called the “Master node”, is responsible for assigning the work to the worker
nodes, called “Slave nodes”. The master node ensures that the slave nodes perform the
work allocated to them. The master node saves the information (location and size) of
all intermediate files produced by each map task.
Step 3- Shuffling: The output of the mapping function is being clustered by keys and
reallocated in a manner that all data with the same key are positioned on the same node.
The output of this step will be (K2, list(V2)).
Step 4 - Reducing: The nodes now process their respective groups of output data by
aggregating the values produced in the shuffle phase. The final output will be of the
form list(K3, V3).
Figure 11 shows the pseudocode of the MapReduce approach with map and reduce
functions.
# Adjacency list
links = sc.textFile('links.txt')
links.collect()

# Key/value pairs: (page, [list of pages it links to])
links = links.map(lambda x: (x.split(' ')[0], x.split(' ')[1:]))
print(links.collect())

# Count the pages and initialise the rank of each page to 1/N
N = links.count()
ranks = links.map(lambda x: (x[0], 1.0 / N))

ITERATIONS = 20
for i in range(ITERATIONS):
    # Join graph info with rank info and propagate to all neighbors
    # rank scores (rank/number of neighbors),
    # and add up ranks from all in-coming edges
    ranks = links.join(ranks) \
                 .flatMap(lambda x: [(dest, float(x[1][1]) / len(x[1][0]))
                                     for dest in x[1][0]]) \
                 .reduceByKey(lambda x, y: x + y)

print(ranks.sortByKey().collect())
The program receives a document as page input (web pages or XML data). The page is
parsed using a regular expression to extract the page title and its outgoing links.
While computing the PageRank using a MapReduce program, the web graph is first split
into some partitions, and each partition is stored as an adjacency list file. Each Map
task processes one partition and computes the partial rank scores for some pages. The
Reduce task then merges all the partial scores and produces the global rank values for
all the web pages. Initially, the page identifier and its outgoing links are extracted
as key-value pairs. Then the nodes are counted and the rank of each page is initialised.
After that, in each iteration, the graph information is joined with the rank
information, each page propagates its rank to all its neighbours (rank/number of
neighbours), and the ranks from all incoming edges are added up to generate the final
score.
Check Your Progress 2:
1 Explain the spider trap and dead-end problem in PageRank. What are the solutions
for the spider trap and dead-end problem?
2. Why is the MapReduce paradigm suitable for the computation of PageRank?
3. Given the following graph, compute the matrix used for PageRank computation.
• Consider S as the teleport set.
• The set S contains only pages that are relevant to a given topic.
• This allows us to measure the relevance of all the other web pages on the web
with regard to this given set S.
• So, for every set S, for every teleport set, you will now be able to compute a
different PageRank vector R that is specific to those data points.
To achieve this, you need to change the teleportation part of the PageRank formulation:
Aij = β Mij + (1 − β)/|S|    if i ∈ S
Aij = β Mij                  otherwise
• If the entry i is not in the teleport set S then basically nothing happens.
• But if our entry i is in the teleport set, then add the teleport edges.
• A is still stochastic.
The idea here is basically that you have lot of freedom in how to set the teleport set S.
For example, when teleport set S is just a single node. This is called a random walk
with restarts.
• So, a spammer wants his web pages to appear as high in the web search ranking
as possible, regardless of whether that page is really relevant to a given topic
or not, as driving traffic to a website is good.
The spammer did this by using a technique called link spam. The following example
illustrates the link spam:
• There is a T-shirt seller, she/he creates a web page and wants that her/his web
page should appear, when any web searcher searches the word “theatre”, as a
person in theatre may be interested in her/his T-shirts.
• So this online seller can insert the word “theatre” 1000 times in the webpage.
How could this be done? In the early days of the web, a page would have the
legitimate text at the top, and at the bottom the seller would insert a huge
list of the word “theatre”. This list would use the same text colour for these
words as the colour of the background.
• So, these words would not bother the user who comes to the web page. But the
web search engine would see these words and would think that this particular
webpage is all about theatre.
• When, of course, somebody runs a query for the word theatre, this web
page would appear to be all about theatre and would get ranked high. This
and similar techniques are called term spam.
Spam farms were developed to concentrate PageRank on a single page. For example, the
T-shirt seller creates a fake set of web pages that all link to his own webpage, and
the anchor text on all these pages claims that the target page is about the spammed topic.
This is known as spam farming.
Google came up with a solution to combat such spam. According to Google, rather
than believing what a page says about itself, we should believe what other people say
about the page. This means looking at the words in the anchor text and its surrounding
text. The anchor text of a link contains the words that appear underlined to the user
to represent the link. The idea is that you are able to find web pages even for
queries/words that the webpage itself does not mention, but which other web pages on
the web mention when referring to the target page.
The next section discusses a related technique that also helps overcome the problem of link spam.
The basic idea of the hubs and authorities model is that every web page is assigned two
scores, not just one as in the PageRank algorithm. One score is called the hub score
and the other is called the authority score.
An authority value is computed as the sum of the scaled hub values that point
to a particular web page. A hub value is the sum of the scaled authority values
of the pages it points to.
The basic features of this algorithm are:
• The hubs and authority based algorithm is named HITS algorithm, which is
Hypertext Induced Topic Selection.
• It is used to measure the importance of pages or documents in a similar way to
what we did with PageRank computation.
• For example, we need to find a set of good newspaper webpages. The idea is
that we do not just want to find good newspapers, but in some sense, we want
to find specialists (persons) who link in a coordinated way to good newspapers.
• So, the idea is similar to PageRank, in that we count links as votes.
• For every page, you will have to compute both its hub score and its authority score.
• Each page has a quality score of its own as an expert; we call this the hub score.
• Each page also has a quality score as a content provider; we call this the authority score.
• The hub score of a page is the sum of the votes of all the authorities it points to,
while the authority score is the sum of the votes of the experts (hubs) pointing to the page.
• We then apply the principle of repeated improvement to compute the steady-state scores.
Hence, the way we will think about this is that generally pages on the web fall into two
classes, hubs and authorities as shown in Figure 12.
Class of Authorities: These are the pages which contain useful information or content.
So in our case of newspaper example, these are newspaper webpages.
Class of Hubs: These are pages that link to good authorities, or list good things on the web.
In the newspaper example, a hub would be a page listing someone's favourite newspapers,
with links to them.
The HITS algorithm, thus, minimizes the problem due to link spam and users would
be able to get good results, as high-ranking pages.
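The repeated-improvement computation of hub and authority scores can be sketched in a few lines of R. The link matrix L below is a small hypothetical example (L[i, j] = 1 if page i links to page j) and is not taken from the unit; the scores are scaled by their maximum in each step to keep them bounded.

# Hypothetical adjacency matrix: L[i, j] = 1 if page i links to page j
L <- matrix(c(0, 1, 1, 0,
              0, 0, 1, 0,
              1, 0, 0, 1,
              0, 0, 1, 0), nrow = 4, byrow = TRUE)
h <- rep(1, nrow(L))            # initial hub scores
a <- rep(1, nrow(L))            # initial authority scores
for (step in 1:20) {            # repeated improvement
  a <- t(L) %*% h               # authority = sum of hub scores of pages pointing to it
  a <- a / max(a)               # scale to keep the values bounded
  h <- L %*% a                  # hub = sum of authority scores of pages it points to
  h <- h / max(h)
}
print(data.frame(hub = as.vector(h), authority = as.vector(a)))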
Check Your Progress 3:
1. Explain the mechanism of link spam.
2. What is term spam?
3. What is a Hub score and Authority score in HITS algorithm?
11.9 SUMMARY
This unit introduces the concept of link analysis along with its purpose to create
connections in a set of data. Further, PageRank algorithm is discussed with the
calculation of page scores of the graph. Different mechanisms of finding PageRank
such as Simple Recursive Formulation and the flow model with the associated
problems are also presented in this Unit. The web structure with its associated issues
are also included, which shows various ways to address these challenges. The use of
PageRank in search engine page is also discussed with spider trap and dead end
problem and their solutions. Further, this Unit shows rank computation using map-
reduce approach and topic-sensitive PageRank. The Unit also introduces link spam
and how Google came up with a solution to combat it. Lastly, the hubs and
authorities (HITS) algorithm is explained, which assigns every page a hub score and an authority score.
11.10 SOLUTIONS/ANSWERS
UNIT 12 WEB AND SOCIAL NETWORK
ANALYSIS
12.0 INTRODUCTION
In the previous units of this block, you have gone through various methods of
analysing big data. This unit introduces three different types of problems that may be
addressed in data science. The first problem relates to the issue of advertising on the
web. Advertisement on the web is most commonly associated with search queries. A typical
advertisement-related problem is: which advertisements are to be shown as a result of
a search word? This unit defines this problem in detail and shows an example of an
algorithm that can address this problem. Next, this unit discusses the concept of a
recommender system. A recommender system is needed, as there is an abundance of
choices of products and services. Therefore, a customer may need certain
suggestions. A recommender system attempts to give recommendations to customers,
based on their past choices. This unit only introduces content-based
recommendations, though many other types of recommender systems techniques are
available. Finally, this unit introduces you to concepts of the process of finding social
media communities. You may please note that each of these problems is complex and
is a hard problem. Therefore, you are advised to refer to further readings and the
latest research articles to learn more about these systems.
12.1 OBJECTIVES
After going through this Unit, you will be able to:
• Define the issues relating to advertisements on the web
• Explain the process of solving the AdWords problem.
• Define the term long tail in the context of recommender systems.
• Explain the model of the recommendation system and use the utility matrix
• Implement a model for content-based recommendations
• Define the use of graphs for social network
• Define clustering in the context of social network graphs.
12.2 WEB ANALYTICS
Web analytics is performed to address the basic question: Is the website able
to fulfil the objectives with which it was created? Some of these objectives
may be:
Is the website informative for users? This would require analysis of the
number of users visiting, the time spent by them on information pages,
whether users are returning to the website, etc.
How can the cloud services used for hosting the website be optimised? This
question may be answered by analysing the traffic and the number of
simultaneous users, etc.
Thus, web analytics may be very useful for enhancing the efficiency and
effectiveness of a website. In the subsequent sections, we discuss some of the
specialized applications of website-related information.
Advertising is one of the major sources of revenue for many professions like
Television, newspapers etc. The popularity of WWW has also led to the
placement of advertisements on web pages. Interestingly one of the most
popular places to put advertisements is the output of a search engine. This
section describes some of the basic ways of dealing with advertising on the
WWW and some of the algorithms that can be used to find the outcome of
using the advertisements on the WWW.
12.3.1 The issues
In order to define the issues related to web advertisements, first let us discuss
different ways of placing advertisements on the Web.
You may need to design an algorithm to find out which advertisement will be
displayed as a result of a search query. This is called the AdWords problem.
The AdWords problem can be defined as follows:
Given:
• A sequence or stream of queries, which are arriving at a search engine,
regularly. The typical nature of queries is that only the present query is
known, and which query will come next cannot be predicted. Let us
say, the sequence of the keyword queries is q1, q2, q3, …
• On each type of query, several advertisers have put their bids. Let us
say that m bids are placed on each type of query, say b1, b2, …, bm.
• The probability of clicking on an advertisement shown for a query, say
p1, p2, …
• The Budget stipulated by an advertiser, which may be allocated for
every day, for n advertisers, say B1, B2, B3, …Bn.
• The maximum number of advertisements that are to be displayed for a
given search query, say t, where t < m.
The Output:
• The size of the selected sub-set of advertisements should be equal to t.
• The advertiser of the selected advertisement has made a bid for that
type of query.
• In case the ith advertisement is selected, then the budget left should be
at least the bid amount, Bi >= bi.
Greedy Algorithm:
One of the simplest algorithms to solve this problem is the greedy algorithm,
where the important considerations are to show the advertisements which have
(1) a high bid value and (2) a high chance of getting clicked, as the payment is
made only if the advertisement is clicked. This is subject to the constraint
that the advertiser's budget has not been exhausted.
For example, consider a query q1 that has the following bids:
One of the questions here is how to compute the probability of clicking the
advertisement, this information can be computed only after certain
experimentation and for a new advertisement this value would not be known.
A discussion on this is beyond the scope of this unit, you may refer to further
reading for more details on this issue.
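As an illustration of the greedy selection described above, the following R sketch picks, for a single query, the advertiser with the highest expected revenue (bid multiplied by an assumed click probability) among those whose remaining budget still covers their bid. All advertiser names and numbers here are hypothetical.

bids   <- c(A = 2.0, B = 1.5, C = 1.0)   # bids placed on this query type
clickP <- c(A = 0.1, B = 0.3, C = 0.4)   # assumed click probabilities
budget <- c(A = 4.0, B = 3.0, C = 1.0)   # remaining daily budgets

greedy_select <- function(bids, clickP, budget) {
  eligible <- names(bids)[budget >= bids]        # budget must still cover the bid
  if (length(eligible) == 0) return(NA)
  expected <- bids[eligible] * clickP[eligible]  # expected revenue per impression
  names(which.max(expected))                     # advertiser with the highest expected value
}

greedy_select(bids, clickP, budget)   # returns "B" for these hypothetical numbers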
In the worst case, however, the greedy algorithm may only be able to produce ½ of the
optimal revenue. This can be explained with the help of the following worst case.
Consider there are two types of queries Q1 and Q2 having the following
bidders:
Please note that query Q1 has only one bidder X and query Q2 has two
bidders X and Y. You may also note that all other parameters are the same for
the queries. Now assume the following sequence of queries occurs:
Q2, Q2, Q1, Q1
Now, assume that the advertisement of X is selected randomly for the
impression on query Q2 and it also gets clicked, then the allocation of the
advertisements would be:
X, X, -, -
This occurs because the advertisement budget of X is exhausted after the 2nd click and
only X has bid for query Q1. This is a typical problem of an online algorithm,
where information about future queries is not known. You may observe
that if this sequence is known before then an optimal selection would have
been:
Y, Y, X, X
This sequence would optimise the revenue. In this worst case, you may observe
that the greedy algorithm has earned only ½ of the revenue earned by the optimal
algorithm.
Let us first try to answer the question: What is the need for recommender
systems?
Consider you have to buy some products from a very large list of products, as
shown in Figure 3. For example, you have to buy a TV and you are not sure
about the size, type, brand etc and a very large choice of sizes, types and
brands exists. In such situations, you would like to consider impartial advice.
This may be the reason the recommender systems were created.
(Figure 3: A buyer wondering which item to buy from a very large set of items; a long-tail curve of sales of products, in units per time period, shows a few common products with high sales and a large set of items with low sales.)
Some of the most common applications of the recommender systems are for
recommending books, music albums, movies, published articles etc. The
usefulness of the recommender system can be gauged from the fact that there
are a large number of cases that highlight the success of recommender
system applications. For example, a recommender system for books found that
several purchasers of a book, say a book named A, are also purchasing not a
very popular but highly rated book B. The system started showing this book,
as a recommendation to the purchasers of Book A. It turns out that after a
while even book B became one of the best sellers. Thus, recommendation
system applications have great potential, as they can inform purchasers about
the availability of some good items, which were not known to them due to the
very large number of items.
7
Link Analysis 2. Aggregates recommendations: Such information is aggregated from
the various user or customer activities, for example, the most popular
video or the most purchased product or the highly rated services etc.
However, these recommendations may not be suitable to your
requirements, as they are generic.
3. User-specific Recommendations: These recommendations are
specifically being made for a specific user based on his/her past web
activities such as online purchases, searches, social media interactions,
viewing of video, listening to audio etc.
In this section, we will discuss the model for user-specific recommendations.
The specific problem of user-specific recommendations can be stated as:
Given:
• A set of users or customers, say S, who have given ratings to some
of the products/services.
• A set of products or services, say P, being sold or provided by the web
application.
• A set of ratings for every pair of customers and products, these ratings
may use a star rating scale of 0 to 5 or 0 to 10 or just 0 (dislike),
1(like). It may be noted that the values of 0 to 5 or 0 to 10 are ordered
in terms of liking a product with 5 or 10, as the case may be, being the
highest level of liking.
The Output:
• Customer-specific recommendations based on the computation of
expected ratings.
A sample utility matrix of customers (C1–C3) and products (P1–P5) is shown below; blank cells indicate products that a customer has not rated.
P1 P2 P3 P4 P5
C1 3 1 2
C2 4 1
C3 1 5 3
(Figure: Content-based recommendation process — the customer likes some products/services; a profile of the liked products/services is made; from this a customer profile is made; products/services are then recommended based on the customer profile.)
The profile of products or services is defined with the help of a set of features.
For example, the features of a movie can be the actors, genre, director, etc.,
whereas a research article's features may be authors, title, metadata of research
articles, etc. The accumulation of these features makes the customer profile.
But how do we represent the product profile? One way to represent the
product profile would be to use a set of vectors. For example, to represent a
movie, we may use a feature vector consisting of the names of the actors,
names of the director and types of genres. You may notice that this vector is
going to be sparse. For research articles, you may use the term frequency and
inverse of document frequency, which were defined in an earlier unit. The
following examples show how the utility matrix and feature matrix can be
used to build a customer profile.
Consider you are making a recommendation system of movies and the only
feature you are considering is the genre of the movies. Also, assume that
movies have just two genres – X and Y. A person rates 5 such movies and the
only ratings here are likes or no ratings, then the following may be a portion
of the utility matrix for the customer, let us say Customer Z. Please note that
in Figure 8, 0 means no ratings and 1 means like.
M1 M2 M3 M4 M5 M6 M7
Cz 1 1 1 1 1 0 0
Figure 8: A Sample Utility matrix
Further, assuming that among the movies liked by customer Z, movies M1,
M2 and M3 are of genre X and M4 and M5 are of genre Y, then the article
profile for the movies is given in Figure 9. Please note that in Figure 9, 1
means the movie is of that genre, whereas 0 means that the movie is not of
that genre.
M1 M2 M3 M4 M5
X 1 1 1 0 0
Y 0 0 0 1 1
Figure 9: A sample product feature matrix
This product matrix now can be used to produce the customer Z profile as:
Feature Genre X profile = sum of the row of X/Movies rated
= 3/5
Feature Genre Y profile = 2/5
You may please note that the method followed here is just finding the average
of all the genres. Though for an actual study, you may use a different
aggregation method.
However, in general, the customers rate the movies on a 5-point scale of say,
1 to 5. With 1 and 2 being negative ratings, 3 being neutral rating and 4 and 5
being positive ratings. In such as case the utility matrix may be as shown in
Figure 10.
M1 M2 M3 M4 M5 M6 M7
Cz 1 1 2 4 2 NR NR
Figure 10: Utility matrix on a 5-point rating scale (1-5)
NR in Figure 10 means not rated. Now considering the same product feature
matrix as Figure 9. You may like to compute the profile of customer Z.
However, let us use a slightly different method here.
In general, on a 5-point scale, each customer has their own way of assessing the
ratings. Therefore, it may be a good idea to normalize the ratings of each customer.
For that, first find the average rating of the customer. This customer has rated
5 movies with an average rating of 10/5 = 2. Next, treat this average rating as the
neutral rating; subtracting it from each rating gives:
M1 M2 M3 M4 M5
Normalize ratings of Cz -1 -1 0 2 0
Figure 11: Normalized Utility matrix
In this case, you may use the following method to create the customer Z
profile:
Feature Genre X profile
= sum of normalised rating of genre X/Movies rated of genre X
= Normalised rating of (M1+M2+M3)/3
= (-1-1+0)/3 = -2/3
Feature Genre Y profile
= sum of normalised rating of genre Y/Movies rated of genre Y
= Normalised rating of (M4+M5)/2
= (0+2)/2=1
Thus, customer Z has a positive profile for genre Y.
There can be other methods of normalization and aggregation. You may refer
to the latest research on this topic for better algorithms.
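The normalized-profile calculation of the example above can be reproduced with a short R sketch (the variable names are our own):

ratings <- c(M1 = 1, M2 = 1, M3 = 2, M4 = 4, M5 = 2)      # ratings given by customer Z
genre   <- c(M1 = "X", M2 = "X", M3 = "X", M4 = "Y", M5 = "Y")

normalized <- ratings - mean(ratings)       # subtract the customer's average rating (2)
profile <- tapply(normalized, genre, mean)  # average normalized rating per genre
print(profile)                              # X = -2/3, Y = 1, as computed above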
(Figure 12: A social network graph showing overlapping communities such as family, college friends, and people one works with.)
There are many different kinds of social networks. Most of these networks are
used for sharing personal or official information. Some of the basic categories
of these networks are:
1. Traditional social networks: Mainly used for sharing information
among communities of friends or people who come together for a
specific purpose. These networks are primarily analysed for getting
information about groups. An interesting type of social network in this
category is the collaborative research network. Such a network
includes links among authors who have co-authored a research paper.
In addition, one of the communities of this network is the editors of the
research work. These networks can be used to find researchers who
share common research areas.
2. Social review or discussion or blogging networks: Such networks
create a node for each user and a link between two users if they
review/discuss/blog about the same topic/article. These networks can be
used to identify communities which have similar thinking and inclinations.
3. Image or video sharing networks: These networks may allow people to
follow a person hosting a video. A follower may be linked to a host in
such networks. The meta tags of videos and images and comments of
the people watching the videos may be used to create communities that
share common interests.
The distance measure, as suggested above, has one basic issue, which is as
follows:
Consider three nodes A, B and C, where only the node pairs (A, B) and
(B, C) are connected; then:
dis(A, B) = 1; dis(B, C) = 1; dis(A, C) = ∞
However, this violates the rule that a true distance measure must satisfy:
dis(A, B) + dis(B, C) >= dis(A, C)
Thus, the definition of distance when there is no direct link may need to be
redefined if any traditional clustering methods are to be used.
12.5.4 Clustering of Social Network Graphs
The social media network consists of a very large number of nodes and links.
As stated earlier, one of the interesting problems here is to identify a set of
communities in this graph. What are the characteristics of a community or
cluster in a graph?
A community or cluster in a graph is a subset of the graph having a large
number of links within the subset but fewer links to other clusters. You may
observe that such clusters exist in Figure 12; although the clusters of
“college friends” and “working with” have a few common nodes, in general the
previous statement holds. One additional feature of a social media network
graph is that a cluster can be further broken into sub-clusters. For example,
in Figure 12, the “college friends” cluster may consist of two sub-clusters:
undergraduate college friends and postgraduate college friends.
You may please note that such clustering is very similar to the hierarchical
or k-means clustering used in machine learning, with the difference that here
you want to find clusters in graphs and not in large datasets of points.
Given:
A social network undirected graph, say G (V, E), where V is the set of
nodes, which represent an entity such as a person, and E is the set of
edges, which represent the connections, such as friends. The links are not
assigned any weight (see Figure 13)
The Process:
Find a minimal set of edges which, when removed, creates the clusters. For
example, in Figure 13 these are the edges {C, E} and {D, F}. Removing them
divides the graph of Figure 13 into two clusters, {A, B, C, D} and {E, F, G, H}.
(Figure 13: A sample social network graph with nodes A, B, C, D, E, F, G and H.)
Several algorithms have been developed for efficient clustering and the
creation of communities in social network graphs. You can refer to the further
readings for a detailed discussion on this topic.
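As a small illustration of the idea, the following R sketch uses the igraph package (not covered in this unit) on a graph consistent with Figure 13; the edges within each cluster are assumed for illustration, since only the bridging edges {C, E} and {D, F} are given in the text. Removing the bridging edges leaves two connected components.

library(igraph)

# Hypothetical edge list: two dense groups {A, B, C, D} and {E, F, G, H}
# joined by the edges C-E and D-F
edges <- c("A","B", "A","C", "B","D", "C","D",
           "E","F", "E","G", "F","H", "G","H",
           "C","E", "D","F")
g <- graph(edges, directed = FALSE)

g2 <- delete_edges(g, c("C|E", "D|F"))   # remove the bridging edges
components(g2)$membership               # two clusters: {A, B, C, D} and {E, F, G, H}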
3. Define the objective of clustering in social network graphs.
12.6 SUMMARY
This unit introduces you to some of the basic problems that can be addressed by data
science. The Unit first introduces you to the concept of web analytics, which is based
on a collection of information about a website to determine if the objectives of the
website are met. One of the interesting aspects of doing business through the web is
web advertising. This unit explains one of the most important problems of
advertisement through the web, the AdWords problem. It also proposes a greedy solution to
the problem, and a better algorithm than the greedy solution, named the balance
algorithm. Next, the Unit discussed the
recommender system, which is motivated by the long tail of products. The recommender
system aims at providing suggestions to a person based on his/her ranking of past
purchases. These recommendations are, in general, such that recommended products
may be liked by the person to whom the recommendations are made. In the context
of recommender systems, this unit discusses the concept of the long-tail, utility
matrix. In addition, the unit discusses two algorithms that may be used for making
content-based recommendations. Finally, the unit discusses the social media network,
which is represented as graphs. The unit also introduces the process of clustering for
the social media network. You may please go through further readings and research
papers for more detail on these topics.
12.7 SOLUTIONS/ANSWERS
3. The content-based recommendations collect the customer ratings of
various products and services. It also makes the profile of products
based on various attributes. The customer ratings and product profiles
are used to make a profile of the customer as per the attributes, which
are then used to make specific recommendations of the products,
which are expected to be rated highly by the customer.
UNIT 13 BASICS OF R PROGRAMMING
Structure
13.0 Introduction
13.1 Objectives
13.2 Environment of R
13.3 Data types, Variables, Operators, Factors
13.4 Decision Making, Loops, Functions
13.5 Data Structures in R
13.5.1 Strings and Vectors
13.5.2 Lists
13.5.3 Matrices, Arrays and Frames
13.6 Summary
13.7 Answers
13.0 INTRODUCTION
This unit covers the fundamental concepts of R programming. The unit
familiarises you with the environment of R and covers the details of the global
environment. It further discusses the various data types that are associated with
every variable to reserve memory space and store values in R. The unit discusses
factors, a type of data object, and the types of operators used in R programming.
The unit also explains the important elements of decision making, the general
form of typical decision-making structures, and loops and functions. R's basic
data structures, including vectors, strings, lists, frames, matrices and arrays,
are also discussed.
13.1 OBJECTIVES
After going through this Unit, you will be able to:
• explain about the environment of R, the global environment and their
elements;
• explain and distinguish between the data types and assign them to
variables;
• explain about the different types of operators and the factors;
• explain the basics of decision making, the structure and the types of
loops;
• explain about the function- their components and the types;
• explain the data structures including vector, strings, lists, frames,
matrices, and arrays.
13.2 ENVIRONMENT OF R
R Programming language has been designed for statistical analysis of data. It
also has a very good support for graphical representation of data. It has a vast
set of commands. In this Block, we will cover some of the essential component
of R programming, which would be useful for you for the purpose of data
analysis. We will not be covering all aspects of this programming language;
therefore, you may refer to the further readings for more details.
The discussion on R programming will be in the context of R-Studio, which is
an open-source software. You may try various commands listed in this unit to
facilitate your learning. The first important concept of R is its environment,
which is discussed next.
An environment can be thought of as a virtual space holding a collection of objects
(variables, functions, etc.). An environment is created when you first start the R
interpreter.
The top-level environment present at the R command prompt is the global
environment, known as R_GlobalEnv; it can also be referred to as .GlobalEnv. You
can use the ls() command to see which variables and functions are defined in the
working environment. You can also check this in the Environment pane of R Studio.
Every variable in R has an associated data type, which determines the memory
reserved for storing its values. Given below is a list of basic data types
available in R programming:
Data Type      Allowable Values
Integer        Values from the set of integers, Z
Numeric        Values from the set of real numbers, R
Complex        Values from the set of complex numbers, C
Logical        Only allowable values are TRUE and FALSE
Character      Possible values are “x”, “@”, “1”, etc.
Table 1: Basic Data Types
Numeric Datatype:
Decimal values are known as numeric in R; numeric is the default datatype for any
number in R.
Integer Datatype:
R supports the integer data type. You can create an integer by suffixing “L” to a
number, and you can convert a value to an integer by passing it to the
as.integer() function.
Logical Datatype:
R has a logical datatype whose values are either TRUE or FALSE. It is
usually used while comparing two variables in a condition.
Complex Datatype:
Complex data types are also supported in R. This datatype covers the set of
complex numbers.
Character Datatype:
R supports the character datatype, which includes alphabets and special characters.
The value of a character type must be enclosed within single or double quotes.
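The following short R session illustrates these basic data types:

x <- 42.5             # numeric (the default for numbers)
y <- 7L               # integer, created by suffixing L
z <- as.integer(3.9)  # conversion to integer (the value becomes 3)
b <- TRUE             # logical
cm <- 2 + 3i          # complex
s <- "hello"          # character, inside quotes
class(x); class(y); class(z); class(b); class(cm); class(s)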
VARIABLES:
OPERATORS:
As the case with other programming languages, R also supports assignment,
arithmetic, relational and logical operators. The logical operators of R include
element by element operations. In addition, several other operators are supported
by R, as explained in this section.
Arithmetic Operators:
• Addition (+): The value at the corresponding positions in the vectors are
added. Please note the difference with C programming, as you are adding
a complete vector using a single operator.
• Subtraction (-): The value at the corresponding positions are subtracted.
Once again please note that single operator performs the task of subtract-
ing elements of two vectors.
• Multiplication (*): The value at the corresponding positions are multi-
plied.
• Division (/): The value at the corresponding positions are divided.
• Power (^): The first vector is raised to the exponent (power) of the sec-
ond.
• Modulo (%%): The remainder after dividing the two will be returned.
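A short example of these vectorised arithmetic operators:

v1 <- c(2, 4, 6)
v2 <- c(1, 2, 3)
v1 + v2    # 3 6 9  (element-wise addition)
v1 - v2    # 1 2 3
v1 * v2    # 2 8 18
v1 / v2    # 2 2 2
v1 ^ v2    # 2 16 216
v1 %% v2   # 0 0 0  (remainders)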
Logical Operators:
Relational Operators:
The relational operators can take scalar or vector operands. In case of vector
operands comparison is done element by element and a vector of TRUE/FALSE
values is returned.
• Less than (<): For each position, returns TRUE if the element of the first
operand is less than the corresponding element of the second operand, and
FALSE otherwise.
• Less than or Equal to (<=): For each position, returns TRUE if the element of
the first operand is less than or equal to the corresponding element of the
second operand.
• Greater than (>): For each position, returns TRUE if the element of the first
operand is greater than the corresponding element of the second operand.
• Greater than or Equal to (>=): For each position, returns TRUE if the element
of the first operand is greater than or equal to the corresponding element of
the second operand.
• Not equal to (!=): For each position, returns TRUE if the elements of the two
operands are not equal.
• Equal to (==): For each position, returns TRUE if the elements of the two
operands are equal.
Assignment Operators:
• Left Assignment (<-, <<- or =): Used for assigning a value to a variable.
• Right Assignment (-> or ->>): Used for assigning a value to a variable.
Miscellaneous Operators:
FACTORS:
Factors are data objects used for categorizing data and storing it as levels. They
can store both strings and integers. Factors are useful for columns that have a
limited number of unique values, also known as categorical variables, and they are
useful in data analysis for statistical modelling. For example, a categorical
variable for employment type (Unemployed, Self-Employed, Salaried, Others) can be
represented using a factor. More details on factors can be obtained from the
further readings.
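A small example of creating a factor for the employment-type variable mentioned above:

employment <- c("Salaried", "Unemployed", "Salaried", "Self-Employed", "Others")
emp_factor <- factor(employment)
levels(emp_factor)   # the distinct categories (levels)
table(emp_factor)    # number of observations per level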
Check Your Progress 1
1. What are various Operators in R?
……………………………………………………………………………
……………………………………………………………………………
2. What does %*% operator do?
…………………………………………………………………………….
………………………………………………………………………………
3. Is .5Var a valid variable name? Give reason in support of your answer.
………………………………………………………………………………
………………………………………………………………………………
(Flowcharts: in an if statement, the conditional code is executed when the condition is true and skipped when it is false; in an if-else statement, one block of conditional code is executed when the condition is true and the other when it is false.)
Example:
• For loop: Repeats a block of statements once for each element of a vector or
sequence.
Syntax:
Example:
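A minimal sketch: a for loop that prints the squares of the numbers 1 to 5.

for (i in 1:5) {
  print(i ^ 2)   # prints 1, 4, 9, 16, 25
}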
FUNCTIONS:
Function Components
• Function Name: Actual name of the function.
• Arguments: Passed when the function is invoked. They are optional.
• Function Body: statements that define the logic of the function.
• Return value: last expression of the function to be executed.
Built-in functions: Built-in functions are functions that are already written and
are accessible just by calling the function name. Some examples are seq(), mean(),
min(), max(), sqrt(), paste() and many more.
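The following sketch shows a user-defined function with the components listed above (the function name and arguments are our own):

# Function name: rectangle_area; arguments: len and width (width has a default)
rectangle_area <- function(len, width = 1) {
  area <- len * width   # function body
  return(area)          # return value
}
rectangle_area(5, 3)    # 15
rectangle_area(5)       # 5, using the default width
mean(c(2, 4, 6))        # an example of a built-in function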
R’s basic data structures include Vector, Strings, Lists, Frames, Matrices and
Arrays.
13.5.2 Lists
Lists are R objects that can contain different types of objects as elements —
numbers, strings, vectors, and even another list, a matrix or a function. A list
is created by calling the list() function.
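For example, a list holding elements of different types may be created as follows (the names are illustrative):

my_list <- list(name = "Asha",
                marks = c(78, 85, 91),           # a numeric vector
                passed = TRUE,
                address = list(city = "Delhi"))  # a nested list
my_list$marks    # access an element by name
my_list[[1]]     # access an element by position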
Matrix Manipulations:
Mathematical operations such as addition, subtraction, multiplication and division
can be performed on matrices. You may please note that matrix division is not
defined mathematically, but in R each element of one matrix is divided by the
corresponding element of the other matrix.
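A short illustration of element-wise matrix arithmetic in R:

A <- matrix(1:4, nrow = 2)            # 2 x 2 matrix, filled column-wise
B <- matrix(c(2, 2, 2, 2), nrow = 2)
A + B     # element-wise addition
A * B     # element-wise multiplication
A / B     # element-wise division
A %*% B   # true matrix multiplication, shown for comparison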
Arrays:
An array is a data object in R that can store multi-dimensional data of the same
data type. It is created using the array() function, which can accept vectors as
input; the array is shaped according to the values passed in the dim parameter.
For instance, if an array is created with dimensions (2, 3, 5), R creates 5
rectangular matrices of 2 rows and 3 columns each. All the data elements of the
array are of the same data type.
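For instance, the following creates the (2, 3, 5) array described above:

arr <- array(1:30, dim = c(2, 3, 5))  # 5 matrices of 2 rows and 3 columns
dim(arr)       # 2 3 5
arr[1, 2, 3]   # element in row 1, column 2 of the 3rd matrix
arr[, , 1]     # the first 2 x 3 matrix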
Dataframe:
A data frame represents a table, a two-dimensional structure similar to an array.
It can be interpreted as a matrix in which each column can be of a different data
type. The characteristics of a data frame are given as follows:
• The names of the columns should not be left blank
• The row names should be unique.
• The data frame can contain elements with numeric, factor or character data types.
• Each column should contain the same number of data items.
Extracting specific data from data frame by specifying the column name.
Expanding the data frame by adding an additional column.
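A small sketch of creating a data frame, extracting a column and adding a new column (the column names are illustrative):

students <- data.frame(name = c("Asha", "Ravi", "Meena"),
                       programme = c("MCA", "BCA", "MCA"),
                       marks = c(78, 64, 91))
students$marks                    # extract a column by specifying its name
students[students$marks > 70, ]   # extract rows satisfying a condition
students$grade <- ifelse(students$marks >= 75, "A", "B")  # add a new column
print(students)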
2. What are the different data structures in R? Briefly explain about them.
…………………………………………………………………………………
…………………………………………………………………………………
13.6 SUMMARY
The unit introduces you to the basics of R programming. It explains about the
environment of R, a virtual space having collection of objects and how a new
environment can be created within the global environment. The unit also
explains about the various types of data associated with the variables that
allocates a memory space and stores the values that can be manipulated. It also
gives the details of the five types of operators in R programming. It also explains
about factors that are the data objects used for organizing and storing the data as
levels. The concept of decision making, which requires the programmer to specify
one or more conditions to be evaluated or tested by the program, has also been
discussed in detail. The concept of loops and their types has also been defined in
this unit. The unit gives the details of functions in R, a function being a set of
instructions required to execute a command and achieve a task in R. There are
several built-in functions available in R; further, users may create functions as per their
requirements. The concept of matrices, arrays, dataframes etc have also been
discussed in detail.
13.7 ANSWERS
Check Your Progress 1
Matrix: A matrix is a two-dimensional data structure. Matrices are used to bind
vectors of the same length. All the elements of a matrix must have the same data
type (numeric, logical, character or complex).
Dataframe: A dataframe is more generic than a matrix, i.e. different columns can
have different data types (numeric, logical, etc.). It combines features of
matrices and lists, like a rectangular list.
1. De Vries, A., & Meys, J. (2015). R for Dummies. John Wiley & Sons.
2. Peng, R. D. (2016). R programming for data science (pp. 86-181). Victoria, BC, Canada:
Leanpub.
3. Schmuller, J. (2017). Statistical Analysis with R For Dummies. John Wiley & Sons.
4. Field, A., Miles, J., & Field, Z. (2012). Discovering statistics using R. Sage publications.
5. Lander, J. P. (2014). R for everyone: Advanced analytics and graphics. Pearson
Education.
6. Lantz, B. (2019). Machine learning with R: expert techniques for predictive modeling.
Packt publishing ltd.
7. Heumann, C., & Schomaker, M. (2016). Introduction to statistics and data analysis.
Springer International Publishing Switzerland.
8. Davies, T. M. (2016). The book of R: a first course in programming and statistics. No
Starch Press.
9. https://www.tutorialspoint.com/r/index.html
UNIT 14 DATA INTERFACING AND
VISUALISATION IN R
Structure
14.1 Introduction
14.2 Objectives
14.3 Reading Data From Files
14.3.1 CSV Files
14.3.2 Excel Files
14.3.3 Binary Files
14.3.4 XML Files
14.3.5 JSON Files
14.3.6 Interfacing with Databases
14.3.7 Web Data
14.4 Data Cleaning and Pre-processing
14.5 Visualizations in R
14.5.1 Bar Charts
14.5.2 Box Plots
14.5.3 Histograms
14.5.4 Line Graphs
14.5.5 Scatterplots
14.6 Summary
14.7 Answers
14.8. References and Further Readings
14.1 INTRODUCTION
In the previous unit, you have learnt about basic concepts of R programming.
This unit explains how to read and analyse data in R from various file types
including- CSV, Excel, binary, XML, JSON, etc. It also discusses how to extract
and work on data in R from databases and also web data. The unit also explains
in detail about data cleaning and pre-processing in R. In the later sections, the
unit explores the concept of visualisations in R. Various types of graphs and
charts, including - bar charts, box plots, histograms, line graphs and scatterplots,
are discussed.
14.2 OBJECTIVES
After going through this Unit, you will be able to:
• explain the various file types and their interface that can be processed for
data analysis in R;
• read, write and analyse data in R from different type of files including-
CSV, Excel, binary, XML and JSON;
• extract and use data from databases and web for analysis in R;
• explain the steps involved in data cleaning and pre-processing using R;
• Visualise the data using various types of graphs and charts using R and
explain their usage.
14.3 READING DATA FROM FILES
In R, you can read data from files outside the R environment. You can also write
data to files that the operating system stores and can later access. R can read
from and write to a wide range of file formats, including CSV, Excel, binary and
XML files.
14.3.1 CSV Files
Input as CSV File:
A CSV file is a text file in which column values are separated by commas.
For example, you can create data with the name, programme and phone number of
students. By copying and pasting this data into a text editor such as Windows
Notepad and using its Save As option, you can save the data as input.csv.
Reading a CSV File:
Function used to read a CSV file: read.csv()
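Assuming the file input.csv described above exists in the working directory, it can be read as follows:

data <- read.csv("input.csv")  # reads the file into a data frame
print(data)
print(ncol(data))              # number of columns
print(nrow(data))              # number of rows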
14.3.2 Excel Files
Microsoft Excel is the most extensively used spreadsheet tool, and it uses the .xls
or .xlsx file extension to store data. Using various Excel-specific packages, R can
read directly from these files; XLConnect, xlsx and gdata are a few examples of
such packages. The xlsx package also allows R to write to an Excel file.
Install xlsx Package
14.3.3 Binary Files
Syntax:
writeBin(object, con)
readBin(con, what, n )
where,
• con is the connection object used to read or write the binary file.
• object is the data to be written to the binary file.
• what is the mode representing the type of values to be read, such as
character, integer, etc.
• n is the number of values to read from the binary file.
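A minimal sketch of writing a numeric vector to a binary file and reading it back (the file name is our own):

values <- c(10, 20, 30, 40)
con <- file("values.bin", "wb")         # open a connection in binary write mode
writeBin(values, con)
close(con)
con <- file("values.bin", "rb")         # open the same file in binary read mode
readBin(con, what = "numeric", n = 4)   # 10 20 30 40
close(con)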
install.packages("XML")
14.3.6 Databases
RMySQL Package
The "RMySQL" package allows R to connect to a MySQL database
natively. The following command will install this package in the
R environment.
install.packages("RMySQL")
Connecting R to MySQL
14.3.7 Web Data
Many websites make data available for users to consume. The World
Health Organization (WHO), for example, provides reports on health and
medical information in CSV, txt, and XML formats. You can
programmatically extract certain data from such websites using R
applications. "RCurl," "XML," and "stringr" are some R packages that
are used to scrape data from the web. They are used to connect to URLs,
detect required file links, and download the files to the local
environment.
Install R Packages
For processing the URLs and links to the files, the following packages
are necessary.
install.packages("RCurl")
install.packages("XML")
install.packages("stringr")
install.packages("plyr")
14.5 VISUALIZATION IN R
In the previous section, we have discussed about obtaining input from different
types of data. This section explains various types of graphs that can be drawn
using R. It may please be noted that only selected types of graphs have been
presented here.
14.5.1 Bar Charts
Syntax:
barplot(H,xlab,ylab,main, names.arg,col)
where,
• H is a vector or matrix containing the numeric values used in the bar chart.
• xlab is the label of the x-axis.
• ylab is the label of the y-axis.
• main is the title of the bar chart.
• names.arg is a vector of names that appear beneath each bar.
• col is used to colour the bars of the graph.
More parameters can be added to the bar chart to increase its capabilities.
The title is added using the main parameter, and colours are added to the bars
using the col parameter. To express the meaning of each bar, names.arg
is a vector with the same number of values as the input vector.
Figure 14.18: Function for plotting Bar chart with labels and
colours
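A sketch of such a bar chart with labels and colours (the data values are hypothetical):

sales <- c(12, 19, 7, 15)
barplot(sales,
        names.arg = c("Q1", "Q2", "Q3", "Q4"),  # label beneath each bar
        xlab = "Quarter", ylab = "Units sold",
        main = "Quarterly Sales",
        col = "steelblue")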
14.5.3 Histograms
Syntax:
hist(v,main,xlab,xlim,ylim,breaks,col,border)
where,
• v is a vector containing the numeric values for which the histogram is to be drawn.
• main is the title of the chart.
• col controls the colour of the bars.
• border controls the border colour of each bar.
• xlab describes the x-axis.
• xlim specifies the range of the x-axis.
• ylim specifies the range of the y-axis.
• breaks controls the width of each bar (the bins).
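For example, a histogram of some hypothetical values:

v <- c(9, 13, 21, 8, 36, 22, 12, 41, 31, 33, 19)
hist(v,
     main = "Distribution of v", xlab = "Value",
     col = "green", border = "black",
     xlim = c(0, 50), ylim = c(0, 5),
     breaks = 5)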
14.5.4 Line Graphs
A graph that uses line segments to connect a set of points is known as a line
graph. These points are ordered by the value of one of their coordinates
(typically the x-coordinate). Line charts are commonly used to identify trends
in data.
A line graph is created using R's plot() function.
Syntax:
plot(v,type,col,xlab,ylab)
where,
• v is a vector that stores the numeric values.
• type takes the values "p", "l" or "o": "p" draws only points, "l" draws only
lines, and "o" draws both points and lines.
• xlab specifies the label for the x-axis.
• ylab specifies the label for the y-axis.
• main is used to specify the title of the chart.
• col is used to specify the colour of the points and/or the lines.
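A sketch of a simple line chart using these parameters (the data values are hypothetical):

v <- c(7, 12, 28, 3, 41)
plot(v, type = "o", col = "red",
     xlab = "Month", ylab = "Rainfall",
     main = "Rainfall chart")
# A second line can be added to the same chart with lines()
lines(c(14, 7, 6, 19, 3), type = "o", col = "blue")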
Figure 14.27: A Line Chart with multiple lines for data of Figure
14.26
14.5.5 Scatterplots
Syntax:
plot(x, y, main, xlab, ylab, xlim, ylim, axes)
Scatterplot Matrices
The scatterplot matrix is used when there are more than two variables
and you want to identify the correlation between one variable and the
others. To make scatterplot matrices, we use pairs() function.
Syntax:
pairs(formula, data)
where,
• The formula represents a set of variables that are utilised in pairs.
• The data set from which the variables will be derived is referred
to as data.
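For instance, a scatterplot and a scatterplot matrix, using the built-in mtcars dataset as example data:

# Scatterplot of weight against mileage
plot(mtcars$wt, mtcars$mpg,
     main = "Weight vs Mileage", xlab = "Weight", ylab = "Miles per gallon")

# Scatterplot matrix for several variables
pairs(~ mpg + disp + hp + wt, data = mtcars,
      main = "Scatterplot matrix of mtcars")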
14.6 SUMMARY
In this unit you have gone though various file types that can be processed for
data analysis in R and further discussed their interfaces. R can read and write a
variety of file types outside the R environment, including CSV, Excel, binary,
XML and JSON. Further, R can readily connect to various relational databases,
such as MySQL, Oracle, and SQL Server, and retrieve records as a data frame
that can be modified and analysed with all of R's sophisticated packages and
functions. The data can also be programmatically extracted from websites using
R applications. "RCurl," "XML," and "stringr" are some R packages that are
used to scrape data from the web. The unit also explains the concept of data
cleaning and pre-processing, which is the process of identifying, correcting and
removing incorrect raw data; familiarization with the dataset, checking the data
for structural errors and irregularities, and deciding how to deal with missing
values are the steps involved in cleaning and preparing data, and following them
is considered good practice. The unit finally explores the concept of
visualisations in R. There are various types of graphs and charts including- bar
charts, box plots, histograms, line graphs and scatterplots that can be used to
visualise the data effectively. The unit explained the usage and syntax for each
of the illustration with graphics.
14.7 ANSWERS
Check your progress 1
1. install.packages("rjson")
library(rjson)
2. rb mode opens the file in the binary format for reading and wb mode
opens the file in the binary format for writing.
3. The checklist points used for cleaning/ preparing data:
Check for data irregularities: You may check for the invalid values and
outliers.
Decide how to deal with missing values: either delete the observations if
they do not provide any meaningful insight into the data, or impute the missing
data with logical values such as the mean or median of the observations.
Check your progress 2
1. A scatter plot is a chart used to plot the correlation between two or more
variables at the same time.
2. We use a histogram to plot the distribution of a continuous variable,
while we can use a bar chart to plot the distribution of a categorical
variable.
3. When you are trying to show “relationship” between two variables, you
will use a scatter plot or chart. When you are trying to show
“relationship” between three variables, you will have to use a bubble
chart.
UNIT 15 DATA ANALYSIS AND R
Structure
15.1 Introduction
15.2 Objectives
15.3 Chi Square Test
15.4 Linear Regression
15.5 Multiple Regression
15.6 Logistic Regression
15.7 Time Series Analysis
15.8 Summary
15.9 Answers
15.10 References and Further Readings
15.1 INTRODUCTION
This unit deals with the concept of data analysis and how to leverage it by using
R programming. The unit discusses various tests and techniques to operate on
data in R and how to draw insights from it. The unit covers the Chi-Square Test,
its significance and the application in R with the help of an example. The unit
also familiarises with the concept of Regression Analysis and its types
including- Simple Linear and Multiple Linear Regression and afterwards,
Logistic Regression. It is further substantiated with examples in R that explain
the steps, functions and syntax to use correctly. It also explains how to interpret
the output and visualise the data. Subsequently, the unit explains the concept of
Time Series Analysis and how to run it on R. It also discusses about the
Stationary Time Series, extraction of trend, seasonality, and error and how to
create lags of a time series in R.
15.2 OBJECTIVES
After going through this Unit, you will be able to:-
• Run tests and techniques on data and interpret the results using R;
• explain the correlation between two variables in a dataset by running
Chi-Square Test in R;
• explain the concept of Regression Analysis and distinguish between their
types- simple Linear and Multiple Linear;
• build relationship models in R to plot and interpret the data and further
use it to predict the unknown variable values;
• explain the concept of Logistic Regression and its application on R;
• explain about the Time Series Analysis and the special case of Stationary
Time Series;
• explain about extraction of trend, seasonality, and error and how to create
lags of a time series in R.
15.3 CHI SQUARE TEST
The Chi-Square test is used to determine whether two categorical variables are
significantly correlated. The two variables should be selected from the same
population and be categorical in nature, such as – top/bottom, True/False,
Black/White.
Syntax of a chi-square test: chisq.test(data)
EXAMPLE:
Let us consider R's “MASS” library, which contains the Cars93 dataset with data
on different models of cars on sale in 1993.
The chi-square test is one of the most useful tests for finding relationships between
categorical variables.
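A minimal sketch of such a test on the Cars93 data; the choice of the two categorical columns, AirBags and Type, is ours:

library(MASS)   # provides the Cars93 dataset

tbl <- table(Cars93$AirBags, Cars93$Type)  # contingency table of two categorical variables
print(tbl)
chisq.test(tbl)  # a small p-value suggests the two variables are not independent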
How can you find the relationships between two scale or numeric variables using
R? One such technique, which helps in establishing a model-based relationship
is regression, which is discussed next.
15.4 LINEAR REGRESSION
Input Data
Below is the sample data with observations of weight and height, which were
experimentally collected and are input in Figure 15.4.
Summary of the relationship:
Predict function: The predict() function is used to predict the weight of a new
person.
Plot for Visualization: Finally, you may plot these values by setting the plot
title and axis titles (see Figure 15.8). The linear regression line is shown in
Figure 15.3.
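A short sketch of the height-weight regression described above, using hypothetical observations:

height <- c(151, 174, 138, 186, 128, 136, 179, 163, 152, 131)
weight <- c(63, 81, 56, 91, 47, 57, 76, 72, 62, 48)

model <- lm(weight ~ height)   # fit weight as a linear function of height
summary(model)                 # coefficients, residuals, R-squared

new_person <- data.frame(height = 170)
predict(model, new_person)     # predicted weight of a new person

plot(height, weight, main = "Height vs Weight")
abline(model, col = "blue")    # add the regression line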
Linear regression has one response variable and one predictor variable; however,
in many practical cases there can be more than one predictor variable. This is
the case of multiple regression, which is discussed next.
15.5 MULTIPLE REGRESSION
Input Data
Let us take the R in-built dataset “mtcars”, which gives a comparison between
various car models based on mileage per gallon (“mpg”), cylinder displacement
(“disp”), horse power (“hp”), weight of the car (“wt”) and more. The aim is to
establish the relationship of mpg (the response variable) with the predictor
variables disp, hp and wt. The head function, as used in Figure 15.9, shows the
first few rows of the dataset.
Creating Equation for Regression Model: Based on the intercept & coefficient
values one can create the mathematical equation as follows:
Y = a + b × x_disp + c × x_hp + d × x_wt
or
Y = 37.15 − 0.000937 × x_disp − 0.0311 × x_hp − 3.8008 × x_wt
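The intercept and coefficients used in the equation above can be obtained with the following sketch:

input <- mtcars[, c("mpg", "disp", "hp", "wt")]
model <- lm(mpg ~ disp + hp + wt, data = input)
coef(model)      # intercept a and coefficients b, c and d of the equation
summary(model)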
15.7 TIME SERIES ANALYSIS
Input: Import the data set and then use the ts() function. The steps to use the
function are given below. It is pertinent to note that the input values used in
this case should ideally be a numeric vector belonging to the “numeric” or
“integer” class.
The following functions will generate quarterly data series from 1959:
ts(inputData, frequency =4, start = c(1959,2)) #frequency 4 => QuarterlyData
The following function will generate monthly data series from 1990
ts(1:10, frequency =12, start = 1990) #freq 12 => MonthlyData
The following function will generate yearly data series from 2009 to 2014.
ts(inputData, start=c(2009), end=c(2014), frequency=1) # YearlyData
In case you want to use an Additive Time Series, you use the following:
Y_t = S_t + T_t + e_t
However, for a Multiplicative Time Series, you may use:
Y_t = S_t × T_t × e_t
A multiplicative time series can be converted into an additive time series by
applying the log function to the time series, as represented below:
additiveTS = log(multiplicativeTS)
A time series is considered stationary when:
1. The mean value of the time series remains constant over a period of time
(hence, the trend component is removed), and the variance does not increase
over time.
2. Seasonality has only a minor impact.
timeSeriesData = EuStockMarkets[, 1]                           # the first series of the built-in dataset
resultofDecompose = decompose(timeSeriesData, type = "mult")   # multiplicative decomposition
plot(resultofDecompose)                                        # plot trend, seasonal and random components
resultsofSt1 = stl(timeSeriesData, s.window = "periodic")      # decomposition using loess
15.8 SUMMARY
This unit introduces the concept of data analysis and examine its application
using R programming. It explains about the Chi-Square Test that is used to
determine if two categorical variables are significantly correlated and further
study its application on R. The unit explains the Regression Analysis, which is
a common statistical technique for establishing a relationship model between
two variables- a predictor variable and the response variable. It further explains
the various models in Regression Analysis, including Linear and Logistic
Regression Analysis. In Linear Regression the two variables are related through
an equation of degree one, which employs a straight line to explain the
relationship between the variables. It is categorised into two types: Simple
Linear Regression, which uses only one independent variable, and Multiple Linear
Regression, which uses two or more independent variables. Once familiar with
the Regression, the unit proceeds to explain about the logistic regression, which
is a classification algorithm for determining the probability of event success and
failure. It is also known as binomial logistic regression and is based on the sigmoid
function, with probability as the output and input ranging from -∞ to +∞. At the end,
the unit introduces the concept of time series analysis and helps you understand its
application and usage in R. It also discusses the special case of a stationary time
series and how to make a time series stationary. This section further explains how to
extract the trend, seasonality and error of a time series in R, and how to create lags
of a time series.
15.9 ANSWERS
Check your Progress 1
1. A regression model that employs a straight line to explain the relationship
between variables is known as linear regression. In Linear Regression these
two variables are related through an equation, where exponent (power) of
both these variables is one. It searches for the value of the regression
coefficient(s) that minimises the total error of the model to find the line of
best fit through your data.
2. The Chi-square test of independence determines whether there is a
statistically significant relationship between categorical variables. It’s a
hypothesis test that answers the question—do the values of one categorical
variable depend on the value of other categorical variables?
3. Simple linear regression uses one predictor variable, whereas multiple regression
uses two or more predictor variables.
1. De Vries, A., & Meys, J. (2015). R for Dummies. John Wiley & Sons.
2. Peng, R. D. (2016). R programming for data science (pp. 86-181). Victoria, BC, Canada:
Leanpub.
3. Schmuller, J. (2017). Statistical Analysis with R For Dummies. John Wiley & Sons.
4. Field, A., Miles, J., & Field, Z. (2012). Discovering statistics using R. Sage publications.
5. Lander, J. P. (2014). R for everyone: Advanced analytics and graphics. Pearson
Education.
6. Lantz, B. (2019). Machine learning with R: expert techniques for predictive modeling.
Packt publishing ltd.
7. Heumann, C., & Schomaker, M. (2016). Introduction to statistics and data analysis.
Springer International Publishing Switzerland.
8. Davies, T. M. (2016). The book of R: a first course in programming and statistics. No
Starch Press.
9. https://www.tutorialspoint.com/r/index.html
10. https://data-flair.training/blogs/chi-square-test-in-r/
11. http://r-statistics.co/Time-Series-Analysis-With-R.html
12. http://r-statistics.co/Logistic-Regression-With-R.html
UNIT 16 ADVANCE ANALYSIS USING R
Structure
16.0 Introduction
16.1 Objectives
16.2 Decision Trees
16.3 Random Forest
16.4 Classification
16.5 Clustering
16.6 Association rules
16.7 Summary
16.8 Answers
16.9 References and Further Readings
16.0 INTRODUCTION
This unit explores the concepts pertaining to advance level of data analysis and
their application in R. The unit explains the theory and the working of the
decision tree model and how to run it on R. It further discusses its various types
that may fall under the 2 categories based on the target variables. The unit also
explores the concept of Random Forest and discusses its application on R. In the
subsequent sections, the unit explains the details of classification algorithm and
its features and types. It further explains the unsupervised learning technique-
Clustering and its application in R programming. It further discusses the 2 types
of clustering in R programming including the concept and algorithm of K-Means
Clustering. The unit concludes by drawing insights on the theory of the
Association Rules and its application in R.
16.1 OBJECTIVES
After going through this Unit, you will be able to:
• explain the concept of a Decision Tree, including its types and application in R;
• explain the working of a Decision Tree and the factors to consider when choosing a tree in R;
• explain the concept of Random Forest and its application in R;
• explain the concept of classification algorithms and their features, including classifier, characteristics, binary classification, multi-class classification and multi-label classification;
• explain the types of classification, including the linear classifier and its types, the support vector machine and the decision tree;
• explain the concept of Clustering and its application in R;
• explain the methods of Clustering and their types, including K-Means clustering and its application in R;
• explain the concept and theory behind Association Rule Mining and its application in R.
16.2 DECISION TREES
A decision tree represents decisions and their possible results in a tree format. In R, a decision tree can be created with the ctree() function.
Syntax: ctree(formula, data), where formula describes the predictor and response variables and data is the name of the dataset used. The following are the different types of decision trees that can be created:
• Decision Stump: Used to generate a decision tree with only one split, and therefore also known as a one-level decision tree. In most cases it is known for its low predictive performance, due to its simplicity.
• M5: Known for its exact classification accuracy and its ability to work well with small, noisy datasets.
• ID3 (Iterative Dichotomiser 3): One of the core decision tree algorithms; it builds the tree through a top-down, greedy search of the given dataset, selecting at each node the attribute that best classifies the records.
• C4.5: This type of decision tree, known as a statistical classifier, is derived from its parent algorithm ID3. It makes decisions based on a bundle of predictors.
• C5.0: A successor to C4.5, it can produce two kinds of models, a basic tree and a rule-based model, and its nodes can only predict categorical targets.
• CHAID: Expanded as Chi-square Automatic Interaction Detector, this algorithm examines merged (combined) variables and explains the outcome of the dependent variable by building a predictive model.
• MARS: Expanded as Multivariate Adaptive Regression Splines, this algorithm builds a set of piecewise linear models used to model anomalies and interactions between variables. It is known for its ability to process numerical data efficiently.
• Conditional inference tree: A type of decision tree that recursively partitions the response variable using a conditional inference framework. It is known for its flexibility and strong statistical foundations.
• CART: Expanded as Classification And Regression Tree. If the target variable is continuous, its value is predicted; otherwise, if it is categorical, the required class is identified.
There are many types of decision trees, but all of them fall under two main categories based on the target variable.
• Categorical variable decision tree: the target variable has a definite (finite) set of values, each belonging to a class.
• Continuous variable decision tree: the target variable can take values from a wide, continuous range.
Input Data:
Use R's built-in dataset "readingSkills" to build a decision tree. The dataset gives a person's reading-skills score together with the person's age, shoe size, raw score on a reading test, and whether or not the person is a native speaker. Figure 16.2 shows this data.
Let us use the ctree() function on the above data set to create the decision tree and its graph.
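The code listing and the resulting tree plot are not reproduced in this text. The following is a minimal sketch of how the tree can be built, assuming the readingSkills data and the ctree() function come from the party package:
# Load the party package, which provides ctree() and the readingSkills data
# (run install.packages("party") first if it is not already installed).
library(party)
# Inspect the first few rows of the input data (age, shoeSize, score, nativeSpeaker).
print(head(readingSkills))
# Build the decision tree: nativeSpeaker is the response, the rest are predictors.
output.tree <- ctree(nativeSpeaker ~ age + shoeSize + score,
                     data = readingSkills)
# Plot the fitted tree.
plot(output.tree)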
The working of a decision tree in R involves the following steps.
• Partitioning: Refers to the process of splitting a dataset into subsets. The decision of where to make a strategic split has a significant impact on the accuracy of the tree. Many algorithms can be used to divide a node into sub-nodes so that the purity of the resulting nodes with respect to the target variable increases; candidate splits are evaluated with measures such as the chi-square statistic and the Gini index, and the most effective split is selected.
• Pruning: Refers to the process of turning branch nodes into leaf nodes, thereby shortening the branches of the tree. The idea is that a very complex classification tree fits the training data well but does a poor job of classifying new values; pruning to a simpler tree therefore helps avoid overfitting.
• Tree selection: The main goal of this step is to select the smallest tree that fits the data, for the reasons described under pruning.
Important factors to consider when choosing a tree in R
• Entropy: Mainly used to measure the homogeneity of a sample. If the sample is completely homogeneous (all observations belong to one class), the entropy is 0; if the classes are equally divided, the entropy is 1. The higher the entropy, the harder it is to draw conclusions from the information.
• Information Gain: A statistical property that measures how well the training samples are separated with respect to the target classification. The main idea behind building a decision tree is to find the attribute that gives the minimum entropy and the maximum information gain. Information gain is essentially a measure of the decrease in total entropy and is calculated as the difference between the entropy of the undivided dataset and the weighted average entropy after the dataset is split on the specified attribute value (a small numerical illustration is given below).
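As a concrete illustration (not part of the original unit), the following short R sketch computes the entropy of a class-label vector and the information gain of a candidate split; entropy() and info_gain() are just illustrative helper functions.
# Entropy (log base 2) of a vector of class labels.
entropy <- function(labels) {
  p <- prop.table(table(labels))   # class proportions
  p <- p[p > 0]                    # drop empty classes (0 * log 0 is treated as 0)
  -sum(p * log2(p))
}
# Information gain obtained by splitting 'labels' according to 'groups'.
info_gain <- function(labels, groups) {
  n <- length(labels)
  child_entropy <- sum(sapply(split(labels, groups), function(part) {
    (length(part) / n) * entropy(part)
  }))
  entropy(labels) - child_entropy
}
entropy(c("yes", "yes", "yes", "yes"))                          # 0: perfectly homogeneous
entropy(c("yes", "yes", "no", "no"))                            # 1: equally divided
info_gain(c("yes", "yes", "no", "no"), c("A", "A", "B", "B"))   # 1: a perfect split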
16.3 RANDOM FOREST
Random forests are a set of decision trees that are used in supervised learning algorithms for classification and regression analysis, but primarily for classification. The resulting classification algorithm is non-linear. To achieve more accurate predictions and forecasts, Random Forest creates and combines numerous decision trees. Each individual tree is built on a bootstrap sample of the observations; the observations left out of a tree's sample are used to estimate that tree's prediction error. This method is termed out-of-bag (OOB) error estimation.
The "Random Forest" is named 'random' since the predictors are chosen at random during training. It is termed a 'forest' because a Random Forest makes decisions based on the findings of several trees. Since a committee of multiple, largely uncorrelated trees (models) generally performs better than any individual tree, random forests are considered to be better than single decision trees.
Random forest attempts to develop a model using samples of the observations and randomly chosen starting variables (columns).
The random forest algorithm is:
• Draw a bootstrap sample of size n (randomly select n observations from the training data, with replacement).
• Build a decision tree from the bootstrap sample, randomly selecting a subset of features at each tree node.
• Split the node using the feature (variable) that provides the best split according to the objective function, for example by maximising the information gain.
• Repeat the first two steps k times, where k is the number of trees you will create from subsets of the sample.
• Aggregate the predictions of the trees for each new data point and assign the class label by majority vote, i.e. assign the new data point to the class selected by the largest number of trees.
Install the R package for random forests.
Syntax: the forest is built from a formula and a dataset, where
formula is the formula which describes the predictor and response variables, and
data is the name of the dataset used.
Input Data: Use R's built-in dataset "readingSkills" to build the random forest. The dataset gives a person's reading-skills score together with the person's age, shoe size, raw score on a reading test, and whether or not the person is a native speaker.
Output:
You can now create the random forest by applying the syntax given above and print the results.
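The original listing and its printed output are not reproduced here. The following is a minimal sketch of how such a forest could be fitted, assuming the randomForest package is used and the readingSkills data come from the party package:
# Load the data (from the party package) and a random forest implementation.
library(party)          # provides the readingSkills data set
library(randomForest)   # provides randomForest()
# Fit a random forest with nativeSpeaker as the response and the
# remaining columns as predictors.
set.seed(100)
output.forest <- randomForest(nativeSpeaker ~ age + shoeSize + score,
                              data = readingSkills)
# Print the model: the output includes the number of trees, the
# out-of-bag (OOB) error estimate and the confusion matrix.
print(output.forest)
# Importance of each predictor (mean decrease in Gini index).
print(importance(output.forest, type = 2))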
Please note the following for the confusion matrix of Figure 16.8
Output:
Check Your Progress 1
1. Can Random Forest be used for both categorical and continuous target (dependent) variables?
…………………………………………………………………………………
2. What is Out-of-Bag Error?
…………………………………………………………………………………
…………………………………………………………………………………
…………………………………………………………………………………
16.4 CLASSIFICATION
The idea of a classification algorithm is simple: predict the target class by analysing the training dataset. The training dataset is used to obtain boundary conditions that can be used to determine each target class; once these boundaries are determined, the next task is to predict the target class of new data. This entire process is called classification.
Some important terms related to classification algorithms are given below.
• Classifier: An algorithm that assigns input data to a specific category.
• Classification model: The model attempts to draw conclusions from the input values supplied for training; this inference is used to predict the class label/category of new data.
• Characteristic (feature): An individually measurable property of the event being observed.
• Binary classification: A classification task with two possible outcomes, for example a gender classification with only the two outcomes male and female.
• Multi-class classification: A classification task in which classification is done into three or more classes, for example a classifier that recognises a digit as exactly one of the digit classes 0, 1, 2, …, 9.
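As one small example (not part of the original unit), a binary classifier can be fitted with logistic regression, a linear classifier, using base R's glm() function on the built-in mtcars data:
# Binary classification with logistic regression (a linear classifier).
# Predict the transmission type (am: 0 = automatic, 1 = manual) of the
# built-in mtcars data from weight (wt) and horsepower (hp).
model <- glm(am ~ wt + hp, data = mtcars, family = binomial)
# Predicted probabilities for the training data.
prob <- predict(model, type = "response")
# Convert probabilities to class labels using a 0.5 threshold.
pred <- ifelse(prob > 0.5, 1, 0)
# Confusion matrix: predicted versus actual classes.
table(Predicted = pred, Actual = mtcars$am)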
16.5 CLUSTERING
Clustering in the R programming language is an unsupervised learning technique that divides a dataset into multiple groups, called clusters, based on similarity. After segmenting the data, multiple clusters are obtained, and all objects within a cluster share common properties. Clustering is used in data mining and analysis to find similar sets of data.
Clustering application in the R programming language
• Marketing: In R programming, clustering is useful for marketing. It helps identify market patterns and thereby find potential buyers. By identifying customer interests through clustering and then showing customers products matching those interests, the chances of a sale can be increased.
• Internet: Users browse many websites based on their interests.
Browsing history can be aggregated and clustered, and a user profile is
generated based on the results of the clustering.
• Games: You can also use clustering algorithms to display games based
on your interests.
• Medicine: In the medical field, every day there are new inventions of
medicines and treatments. Occasionally, new species are discovered by
researchers and scientists. Those categories can be easily found by using
a clustering algorithm based on their similarity.
Clustering method
There are two types of clustering in R programming.
• Hard clustering: Each data point is assigned to exactly one cluster; it either belongs to a cluster completely or not at all. The algorithm used for hard clustering is k-means clustering.
• Soft clustering: Instead of placing each data point into exactly one cluster, soft clustering assigns each data point a probability (degree of membership) of belonging to each cluster, so every data point has some probability of being present in every cluster. The algorithm used for soft clustering is fuzzy clustering, also called soft k-means.
K-Means clustering in the R programming language
K-Means is an iterative hard clustering technique that uses an unsupervised learning algorithm. The total number of clusters (k) is predefined by the user, and data points are assigned to clusters based on their similarity. The algorithm also computes the centroid (centre) of each cluster.
Algorithm (illustrated with k = 2 and 5 data points):
• Specify the number of clusters (k).
• Randomly assign each data point to a cluster; assume the yellow and blue colours show the two clusters with their respective randomly assigned data points.
• Calculate the centroid of each cluster.
• Re-assign each data point to the nearest cluster centroid; for example, a blue data point is re-assigned to the yellow cluster if it is closer to the centroid of the yellow cluster.
• Re-calculate the cluster centroids and repeat the last two steps until the assignments no longer change.
Input Data: load the necessary packages in R. For the clustering we use the iris data set, which contains 3 classes of 50 instances each, where each class refers to a type of iris plant.
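The original code listing is not reproduced here. The following minimal sketch, which applies base R's kmeans() to the four numeric columns of iris, produces the kind of result described below (the exact cluster sizes can vary with the random initialisation):
set.seed(240)                     # for reproducible cluster assignments
# Cluster the four numeric measurements (drop the Species column).
kmeans.result <- kmeans(iris[, -5], centers = 3, nstart = 20)
# Cluster sizes and the ratio of between-cluster to total sum of squares.
print(kmeans.result$size)
print(kmeans.result$betweenss / kmeans.result$totss)
# Confusion matrix: clusters versus the true species.
table(Cluster = kmeans.result$cluster, Species = iris$Species)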
Result:
Three clusters are formed, of sizes 50, 62 and 38 respectively. The ratio of the between-cluster sum of squares to the total sum of squares is 88.4%.
Confusion Matrix:
The confusion matrix suggests that all 50 setosa are correctly grouped into their own cluster. Of the 50 versicolor, 48 are grouped correctly while 2 fall into the virginica cluster. Of the 50 virginica, 36 are grouped correctly while 14 fall into the versicolor cluster.
Model Evaluation and Visualization
Code:
Output:
Output:
The plot shows the centres of the clusters, marked with a cross sign in the same colour as the corresponding cluster.
Visualizing Clusters
Code:
Output:
The plot shows the 3 clusters formed, plotted against sepal length and sepal width.
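The plotting code itself is not reproduced in this text. A minimal sketch that produces this kind of visualisation (points coloured by cluster, centres marked as crosses), assuming the kmeans.result object from the sketch above, is:
# Scatter plot of sepal length versus sepal width, coloured by cluster.
plot(iris$Sepal.Length, iris$Sepal.Width,
     col = kmeans.result$cluster,
     xlab = "Sepal Length", ylab = "Sepal Width",
     main = "K-Means clusters of the iris data")
# Mark the cluster centres with cross signs in the matching colours.
points(kmeans.result$centers[, c("Sepal.Length", "Sepal.Width")],
       col = 1:3, pch = 4, cex = 2, lwd = 2)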
Check Your Progress 2
1. What is the difference between Classification and Clustering?
………………………………………………………………………………
………………………………………………………………………………
2. Can decision trees be used for performing clustering?
………………………………………………………………………………
………………………………………………………………………………
3. What is the minimum no. of variables/ features required to perform
clustering?
………………………………………………………………………………
………………………………………………………………………………
16.6 ASSOCIATION RULES
Association rule mining is an unsupervised, non-linear technique used to discover how items are associated with each other. It is widely used in retail, grocery stores and online platforms, i.e. wherever a large transactional database is available.
Theory
In association rule mining, support, confidence and lift measure the strength of an association. For example, consider the rule "computer => antivirus software":
Support: 2% of all the transactions under analysis show that a computer and antivirus software are purchased together.
Confidence: 60% of the customers who purchased a computer also bought the software.
Lift: the lift of a rule A => B is defined as
Lift(A => B) = P(A and B) / (P(A) . P(B))
If the lift equals 1, i.e. P(A and B) = P(A) . P(B), then A and B are independent; a lift greater than 1 indicates that A and B occur together more often than expected by chance.
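As a small illustrative example (with made-up numbers, not from the original unit), the three measures can be computed directly from a 0/1 purchase table in R:
# Toy transaction data: 1 = item purchased in that transaction (hypothetical values).
computer  <- c(1, 1, 1, 0, 1, 0, 1, 0, 1, 1)
antivirus <- c(1, 0, 1, 0, 0, 0, 1, 0, 0, 1)
n <- length(computer)
support    <- sum(computer & antivirus) / n              # P(A and B)
confidence <- sum(computer & antivirus) / sum(computer)  # P(B | A)
lift       <- support / ((sum(computer) / n) * (sum(antivirus) / n))
print(c(support = support, confidence = confidence, lift = lift))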
Packages in R:
Installing Packages
install.packages("arules")
install.packages("arulesViz")
Syntax:
associa_rules = apriori(data = dataset, parameter = list(support = x,
confidence = y))
The first 2 transactions and the items involved in each transaction can be observed below.
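The dataset, the transaction listing and the resulting rules are not reproduced here. As a minimal sketch, assuming the Groceries transaction data that ships with the arules package is used as a stand-in for the unit's own market-basket dataset, the workflow looks like this:
library(arules)
library(arulesViz)
# Load a built-in transactions object (stand-in for the original dataset).
data("Groceries")
# Inspect the first 2 transactions and the items involved in each.
inspect(Groceries[1:2])
# Mine association rules with minimum support and confidence thresholds.
associa_rules <- apriori(data = Groceries,
                         parameter = list(support = 0.004, confidence = 0.2))
# Show the top 5 rules sorted by lift.
inspect(head(sort(associa_rules, by = "lift"), 5))
# Visualise the rules (scatter plot of support vs. confidence, shaded by lift).
plot(associa_rules, method = "scatterplot")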
16.7 SUMMARY
This unit explores the concepts of advanced data analysis and their application in R. It explains the decision tree model and its application in R; a decision tree represents decisions and their results in a tree format. The various types of decision trees fall under two categories based on the target variable, i.e. categorical and continuous. Partitioning, pruning and tree selection are the steps involved in the working of a decision tree, and the other factors to consider when choosing a tree in R are entropy and information gain.
The unit further explains in detail the concept of Random Forest and its application in R. Random forests are a set of decision trees that are used in supervised learning algorithms for classification and regression analysis, but primarily for classification. A random forest is a non-linear classification algorithm; it takes samples of the observations and random initial variables (columns) and attempts to build a model. In the subsequent sections, the unit explains the details of classification algorithms and their features and types. In R, the types of classification algorithms are the linear classifier, the support vector machine and the decision tree. Linear classification algorithms can further be of two types: logistic regression and the Naive Bayes classifier. The unit further explains Clustering and its application in R programming. Clustering in the R programming language is an unsupervised learning technique that divides a dataset into multiple groups (clusters) based on similarity. It is used in data mining and analysis to find similar datasets and has applications in marketing, the internet, gaming, medicine, etc.
There are two types of clustering in R programming, namely hard clustering and soft clustering. The unit also discusses K-Means clustering in the R programming language, which is an iterative hard clustering technique based on an unsupervised learning algorithm. Finally, the unit discusses the theory of association rules and their application in R. Association rule mining is an unsupervised, non-linear technique for discovering how items are associated with each other, and has major usage in retail, grocery stores and online platforms, i.e. those having a large transactional database. In association rule mining, support, confidence and lift measure the association.
16.8 ANSWERS
Check Your Progress 1
1. Yes, Random Forest can be used for both continuous as well as categorical target (dependent) variables. In a random forest, i.e. a combination of decision trees, a classification model is built when the dependent variable is categorical and a regression model when it is numeric or continuous. However, random forest is mainly used for classification problems.
3. Random forest is one of the most popular and widely used machine learning algorithms for classification problems. It can also be used for regression problem statements, but it works particularly well with classification models. The improvements it brings to predictive models have made it a powerful tool for modern data scientists. The best part of the algorithm is that it involves very few assumptions, so data preparation is not too difficult and saves time.
Check Your Progress 2
2. True, decision trees can also be used to form clusters in the data, but clustering often generates natural clusters and is not dependent on any objective function.
16.9 REFERENCES AND FURTHER READINGS
1. De Vries, A., & Meys, J. (2015). R for Dummies. John Wiley & Sons.
2. Peng, R. D. (2016). R programming for data science (pp. 86-181). Victoria, BC, Canada:
Leanpub.
3. Schmuller, J. (2017). Statistical Analysis with R For Dummies. John Wiley & Sons.
4. Field, A., Miles, J., & Field, Z. (2012). Discovering statistics using R. Sage Publications.
5. Lander, J. P. (2014). R for everyone: Advanced analytics and graphics. Pearson
Education.
6. Lantz, B. (2019). Machine learning with R: expert techniques for predictive modeling. Packt Publishing Ltd.
7. Heumann, C., & Schomaker, M. (2016). Introduction to statistics and data analysis.
Springer International Publishing Switzerland.
8. Davies, T. M. (2016). The book of R: a first course in programming and statistics. No
Starch Press.
9. https://www.tutorialspoint.com/r/index.html
10. https://www.guru99.com/r-decision-trees.html
11. https://codingwithfun.com/p/r-language-random-forest-algorithm/
12. https://towardsdatascience.com/association-rule-mining-in-r-ddf2d044ae50
13. https://www.geeksforgeeks.org/k-means-clustering-in-r-programming/