DS4015 BDA UNIT I KVL Notes
Master of Computer Applications (Anna University)
DS4015 BIG DATA ANALYTICS
UNIT - I INTRODUCTION TO BIG DATA
Introduction to Big Data Platform – Challenges of Conventional Systems -
Intelligent data analysis –Nature of Data - Analytic Processes and Tools -
Analysis Vs Reporting - Modern Data Analytic Tools- Statistical Concepts:
Sampling Distributions - Re-Sampling - Statistical Inference - Prediction Error.
UNIT - II SEARCH METHODS AND VISUALIZATION
Search by simulated Annealing – Stochastic, Adaptive search by Evaluation –
Evaluation Strategies –Genetic Algorithm – Genetic Programming – Visualization
– Classification of Visual Data Analysis Techniques – Data Types – Visualization
Techniques – Interaction techniques – Specific Visual data analysis Techniques
UNIT - III MINING DATA STREAMS
Introduction To Streams Concepts – Stream Data Model and Architecture -
Stream Computing - Sampling Data in a Stream – Filtering Streams – Counting
Distinct Elements in a Stream – Estimating Moments – Counting Oneness in a
Window – Decaying Window - Real time Analytics Platform(RTAP)
Applications - Case Studies - Real Time Sentiment Analysis, Stock Market
Predictions
UNIT - IV FRAMEWORKS
MapReduce – Hadoop, Hive, MapR – Sharding – NoSQL Databases - S3 -
Hadoop Distributed File Systems – Case Study- Preventing Private Information
Inference Attacks on Social Networks- Grand Challenge: Applying Regulatory
Science and Big Data to Improve Medical Device Innovation
UNIT - V R LANGUAGE
Overview, Programming structures: Control statements -Operators -Functions -
Environment and scope issues -Recursion -Replacement functions, R data
structures: Vectors -Matrices and arrays - Lists -Data frames -Classes,
Input/output, String manipulations
REFERENCES:
1. Michael Berthold, David J. Hand, Intelligent Data Analysis, Springer, 2007.
2. Anand Rajaraman and Jeffrey David Ullman, Mining of Massive Datasets, Cambridge University Press, 3rd edition, 2020.
3. Norman Matloff, The Art of R Programming: A Tour of Statistical Software Design, No Starch Press, USA, 2011.
4. Bill Franks, Taming the Big Data Tidal Wave: Finding Opportunities in Huge Data Streams with Advanced Analytics, John Wiley & Sons, 2012.
5. Glenn J. Myatt, Making Sense of Data, John Wiley & Sons, 2007.
UNIT - I INTRODUCTION TO BIG DATA
Introduction to Big Data Platform – Challenges of Conventional Systems - Intelligent data
analysis –Nature of Data - Analytic Processes and Tools - Analysis Vs Reporting - Modern
Data Analytic Tools- Statistical Concepts: Sampling Distributions - Re-Sampling - Statistical
Inference - Prediction Error.
INTRODUCTION TO BIG DATA
Big Data
Types of Big Data
Characteristics of Big Data
Growth of Big Data
Sources of Big Data
Risks in Big Data
Big Data
o Big Data is a term used to describe a collection of data that is huge in size and yet
growing exponentially with time.
o A collection of data sets so large and complex that it becomes difficult to process
using on-hand database management tools or traditional data processing applications.
Examples of Big Data generation include
– stock exchanges,
– social media sites,
– jet engines, etc.
Types Of Big Data:
Structured
Unstructured
Semi-structured
Structured Data
o Any data that can be stored, accessed and processed in a fixed format is
termed 'structured' data.
o Data stored in a relational database management system is one example of
structured data.
An 'Employee' table in a database is an example of structured data.
Unstructured Data
o Any data whose form or structure is unknown is classified as unstructured data.
o Being huge in size, unstructured data poses multiple challenges in terms of
processing it to derive value from it.
o An example of unstructured data is a heterogeneous data source containing a combination
of simple text files, images, videos, etc.
Example of Unstructured data
The output returned by 'Google Search'
Semi-structured Data
o Semi-structured data can contain both structured and unstructured forms of data.
o Semi-structured data appears structured in form, but it is not defined by a fixed schema,
e.g. a table definition in a relational DBMS.
An example of semi-structured data is personal data represented in an XML file:
<rec>
<name>Prashant Rao</name>
<sex>Male</sex>
<age>35</age>
</rec>
<rec>
<name>Seema R.</name>
<sex>Female</sex>
<age>41</age>
</rec>
<rec>
<name>Satish Mane</name>
<sex>Male</sex>
<age>29</age>
</rec>
<rec>
<name>Subrato Roy</name>
<sex>Male</sex>
<age>26</age>
</rec>
<rec>
<name>Jeremiah J.</name>
<sex>Male</sex>
<age>35</age>
</rec>
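A minimal sketch (not part of the original notes) of how such semi-structured records could be read in Python with the standard xml.etree.ElementTree module; the <people> root element is added here only so the fragment forms a well-formed document:

import xml.etree.ElementTree as ET

# Wrap the <rec> fragments in a single root so the XML is well formed.
xml_data = """
<people>
  <rec><name>Prashant Rao</name><sex>Male</sex><age>35</age></rec>
  <rec><name>Seema R.</name><sex>Female</sex><age>41</age></rec>
  <rec><name>Satish Mane</name><sex>Male</sex><age>29</age></rec>
</people>
"""

root = ET.fromstring(xml_data)
for rec in root.findall("rec"):
    # Each record carries the same fields, but no schema enforces this.
    print(rec.findtext("name"), rec.findtext("sex"), int(rec.findtext("age")))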
Characteristics of Big Data OR the 3Vs of Big Data
The three characteristics (3Vs) of Big Data are:
Volume - Data quantity
Velocity - Data Speed
Variety - Data Types
Growth of Big Data:
Storing Big Data
• Analyzing your data characteristics
– Selecting data sources for analysis
– Eliminating redundant data
– Establishing the role of NoSQL
• Overview of Big Data stores
– Data models: key value, graph, document, column-family
– Hadoop Distributed File System (HDFS)
– Hbase
– Hive
Processing Big Data
• Integrating disparate data stores
– Mapping data to the programming framework
– Connecting and extracting data from storage
– Transforming data for processing
– Subdividing data in preparation for Hadoop MapReduce
• Employing Hadoop MapReduce (a minimal mapper/reducer sketch follows this list)
– Creating the components of Hadoop MapReduce jobs
– Distributing data processing across server farms
– Executing Hadoop MapReduce jobs
– Monitoring the progress of job flows
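A minimal word-count sketch of the two components of a Hadoop MapReduce job, written as Hadoop Streaming scripts in Python; mapper.py and reducer.py are hypothetical file names, and the exact path of the streaming jar used to submit the job depends on the Hadoop distribution:

# mapper.py - emits one (word, 1) pair per word read from stdin
import sys

for line in sys.stdin:
    for word in line.strip().split():
        print(f"{word}\t1")

# reducer.py - sums the counts for each word (input arrives sorted by key)
import sys

current, total = None, 0
for line in sys.stdin:
    word, count = line.rstrip("\n").split("\t")
    if word != current:
        if current is not None:
            print(f"{current}\t{total}")
        current, total = word, 0
    total += int(count)
if current is not None:
    print(f"{current}\t{total}")

The job would be submitted with something like hadoop jar hadoop-streaming.jar -input /in -output /out -mapper mapper.py -reducer reducer.py; the mappers run in parallel across the cluster and each reducer receives a word's counts grouped together.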
Factors driving the growth of Big Data
– Increase in storage capacities
– Increase in processing power
– Availability of data (of different data types)
– Every day we create 2.5 quintillion bytes of data; 90% of the data in the
world today has been created in the last two years alone
Huge storage needs in real-time applications
– Facebook generates 10 TB of data daily
– Twitter generates 7 TB of data daily
– IBM claims 90% of today's stored data was generated in just the last two years.
How Is Big Data Different?
1) Often automatically generated by a machine (e.g. a sensor embedded in an engine)
2) Typically an entirely new source of data (e.g. use of the internet)
3) Not designed to be friendly (e.g. text streams)
4) Individual records may not have much value
– Need to focus on the important part
Sources of Big Data
• Users
• Application
• Systems
• Sensors
Risks in Big Data
• Organizations can be overwhelmed by the volume of data
– Need the right people solving the right problems
• Costs can escalate too fast
– It isn't necessary to capture 100% of the data
• Many sources of big data raise privacy concerns
– Self-regulation
– Legal regulation
INTRODUCTION TO BIG DATA PLATFORM
Big Data Platform
Features of Big Data Platform
List of Big Data Platform
Big Data Platform
A Big Data platform is an integrated IT solution for Big Data management that combines
several software systems, software tools and hardware to provide an easy-to-use system
of tools to enterprises.
Features of Big Data Platform
1. It should support linear scale-out
2. It should have capability for rapid deployment
3. It should support a variety of data formats
4. The platform should provide data analysis and reporting tools
5. It should provide real-time data analysis software
6. It should have tools for searching through large data sets
List of BigData Platforms
a. Hadoop
b. Cloudera
c. Amazon Web Services
d. Hortonworks
e. MapR
f. IBM Open Platform
g. Microsoft HDInsight
h. Intel Distribution for Apache Hadoop
i. Datastax Enterprise Analytics
j. Teradata Enterprise Access for Hadoop
k. Pivotal HD
CHALLENGES OF CONVENTIONAL SYSTEMS
Conventional System
Comparison of Big Data with Conventional Data
Challenges of Conventional System.
Challenges of Big Data
Conventional System
o A conventional system is a traditional, centralized data management system (typically
a relational database) designed to store and process structured data of limited volume.
o Big data is a huge amount of data which is beyond the processing capacity of
conventional database systems to manage and analyze within a specific time
interval.
Comparison of Big Data with Conventional Data
List of challenges of Conventional Systems:
The following challenges have dominated in the case of conventional systems in
real-time scenarios:
1. Uncertainty of the data management landscape
2. The Big Data talent gap that exists in the industry
3. Getting data into the big data platform
4. Need for synchronization across data sources
5. Getting important insights through the use of Big Data analytics
Big Data Challenges
– The challenges include capture, curation, storage, search, sharing, transfer,
analysis, and visualization.
Challenges of Big Data:
1. Dealing with outliers
2. Addressing data quality
3. Understanding the data
4. Visualization helps organizations perform analyses
5. Meeting the need for speed
6. Degree of granularity increases.
7. Displaying meaningful results.
INTELLIGENT DATA ANALYSIS
Intelligent Data Analysis
Benefits of Intelligent Data Analysis
Intelligent Data Analysis – Knowledge Acquisition
Evaluation of Intelligent Data Analysis Results
Intelligent Data Analysis (IDA)
– is used for extracting useful information from large quantities of online data and for
extracting desirable knowledge or interesting patterns from existing databases;
– is an interdisciplinary study concerned with the effective analysis of data.
Goal: The goal of intelligent data analysis is to extract useful knowledge; the process
demands a combination of extraction, analysis, conversion, classification, organization,
reasoning, and so on.
Uses / Benefits of IDA:
• Data Engineering
• Database mining techniques, tools and applications
• Use of domain knowledge in data analysis
• Big Data applications
• Evolutionary algorithms
• Machine Learning(ML)
• Neural nets
• Fuzzy logic
• Statistical pattern recognition
• Knowledge Filtering and
• Post-processing
Intelligent Data Analysis :Knowledge Acquisition
The process of eliciting, analyzing, transforming, classifying, organizing and integrating
knowledge and representing that knowledge in a form that can be used in a computer
system. Knowledge in a domain can be expressed as a number of rules
A Rule : A formal way of specifying a recommendation, directive, or strategy, expressed
as "IF premise THEN conclusion" or "IF condition THEN action".
Evaluation of IDA results:
• Absolute & relative accuracy
• Sensitivity & specificity
• False positive & false negative
• Error rate
• Reliability of rule
NATURE OF DATA
Data
Properties of Data
Types of Data
Data Conversion
Data Selection
Data:
Data is a set of values of qualitative or quantitative variables; restated, pieces of data
are individual pieces of information.
Data is measured, collected and reported, and analyzed, whereupon it can be
visualized using graphs or images
Data is nothing but facts and statistics stored or free flowing over a network,
generally it's raw and unprocessed.
When data are processed, organized, structured or presented in a given context so as to
make them useful, they are called Information.
3 Actions on Data:
Capture
Transform
Store
Properties of Data
Clarity
Accuracy
Essence
Aggregation
Compression
Refinement
TYPES OF DATA:
1. Nominal scales:
Measure categories and have the following characteristics:
• Order: The order of the responses or observations does not matter.
• Distance: Nominal scales do not hold distance. The distance between a 1 and a 2 is
not the same as a 2 and 3.
• True Zero: There is no true or real zero. In a nominal scale, zero is
uninterpretable.
• Appropriate statistics for nominal scales: mode, count, frequencies
• Displays: histograms or bar charts
2. Ordinal Scales:
At the risk of providing a tautological definition, ordinal scales measure, well, order. So,
our characteristics for ordinal scales are:
Order: The order of the responses or observations matters.
Distance: Ordinal scales do not hold distance. The distance between first and
second is unknown as is the distance between first and third along with all
observations.
True Zero: There is no true or real zero. An item, observation, or category
cannot finish in zeroth place.
Appropriate statistics for ordinal scales: count, frequencies, mode
Displays: histograms or bar charts
3 .Interval Scales:
Interval scales provide insight into the variability of the observations or data. Classic
interval scales are Likert scales (e.g., 1 - strongly agree and 9 - strongly disagree) and
Semantic Differential scales (e.g., 1 - dark and 9 - light).
Order: The order of the responses or observations does matter.
Distance: Interval scales do offer distance.
True Zero: There is no true zero with interval scales; the zero point is arbitrary.
Appropriate statistics for interval scales: count, frequencies, mode, median,
mean, standard deviation (and variance), skewness, and kurtosis.
Displays: histograms or bar charts, line charts, and scatter plots.
4. Ratio Scales:
Ratio scales are like interval scales, but with a true zero. They have the following
characteristics:
– Order: The order of the responses or observations matters.
– Distance: Ratio scales do have an interpretable distance.
– True Zero: There is a true zero.
– Appropriate statistics for ratio scales: count, frequencies, mode, median, mean,
standard deviation (and variance), skewness, and kurtosis.
– Displays: histograms or bar charts, line charts, and scatter plots.
The table below summarizes the characteristics of all four types of scales.
Scale      Order   Distance   True Zero
Nominal    No      No         No
Ordinal    Yes     No         No
Interval   Yes     Yes        No
Ratio      Yes     Yes        Yes
Data Conversion
We can convert or transform our data from ratio to interval to ordinal to nominal; however,
we cannot convert or transform our data from nominal to ordinal to interval to ratio.
Scaled data can be measured in exact amounts.
For example: 60 degrees, 12.5 feet, 80 miles per hour.
Scaled data can be measured with equal intervals.
For example: between 0 and 1 inch is 1 inch; between 13 and 14 inches is also 1 inch.
Ordinal or ranked data provides comparative amounts.
Example:
1st Place: 19.6 feet, 2nd Place: 18.2 feet, 3rd Place: 12.4 feet
(the intervals between places are not equal)
Data Selection
Example – Average Driving Speed
a) Scaled
b) Ordinal
Scaled – Speed: speed can be measured in exact amounts with equal intervals.
Example :
60 degrees 12.5 feet 80 Miles per hour
Ordinal or ranked data provides comparative amounts.
For example, 1st Place 2nd Place 3rd Place
ANALYTIC PROCESS AND TOOLS:
There are 6 analytic processes:
1. Deployment
2. Business Understanding
3. Data Exploration
4. Data Preparation
5. Data Modeling
6. Data Evaluation
Step 1: Deployment:
– Plan the deployment, monitoring and maintenance; we need to produce a
final report and review the project.
– In this phase, we deploy the results of the analysis; this is also known as reviewing the project.
Step 2: Business Understanding:
– The very first step consists of business understanding.
– Whenever any requirement occurs, we first need to determine the business
objective, assess the situation, determine data mining goals and then
produce the project plan as per the requirement.
– Business objectives are defined in this phase.
Step 3: Data Exploration :
This step consists of data understanding.
– For the further process, we need to gather initial data, describe and explore the data,
and verify data quality to ensure it contains the data we require.
– Data collected from the various sources is described in terms of its
application and the need for the project in this phase.
– This is also known as data exploration.
– This is necessary to verify the quality of data collected.
Step 4: Data Preparation:
– We need to select data as per the need, clean it, construct it to get useful
information, and then integrate it all.
– Finally, we need to format the data to get the appropriate data.
– Data is selected, cleaned, and integrated into the format finalized for the
analysis in this phase.
Step 5: Data Modeling:
– Select a modeling technique, generate a test design, build a model and assess the
model built.
– The data model is built to analyze relationships between the various selected
objects in the data.
– Test cases are built for assessing the model, and the model is tested and
implemented on the data in this phase.
Step 6: Data Evaluation
– The results of the modeling are evaluated against the business objectives defined
earlier, and the findings are reviewed before they are put to use.
Where is processing hosted?
– Distributed servers / cloud (e.g. Amazon EC2)
Where is data stored?
– Distributed storage (e.g. Amazon S3)
What is the programming model?
– Distributed processing (e.g. MapReduce)
How is data stored & indexed?
– High-performance schema-free databases (e.g. MongoDB)
What operations are performed on data?
– Analytic / semantic processing
Analytical Tools
– Big data tools for HPC and supercomputing
– MPI
– Big data tools on clouds
– MapReduce model
– Iterative MapReduce model
– DAG model
– Graph model
– Collective model
– Other BDA tools
– SAS
– R
– Hadoop
ANALYSIS AND REPORTING
Analysis
Reporting
Differences between Analysis and Reporting
Analysis
The process of exploring data and reports in order to extract meaningful insights, which
can be used to better understand and improve business performance.
Reporting
Reporting is the process of organizing data into informational summaries, in order to
monitor how different areas of a business are performing.
Differences between Analysis and Reporting
Reporting translates raw data into information.
Analysis transforms data and information into insights.
Reporting shows you what is happening,
while analysis focuses on explaining why it is happening and what you can do
about it.
MODERN ANALYTIC TOOLS:
Modern Analytic Tools: Current analytic tools concentrate on three classes:
1. Batch processing tools
2. Stream Processing tools and
3. Interactive Analysis tools.
1. Batch processing system :
Batch Processing System involves :
– collecting a series of processing jobs and carrying them out periodically as a
group (or batch) of jobs.
– It allows a large volume of jobs to be processed at the same time.
– An organization can schedule batch processing for a time when there is little
activity on its computer systems.
– One of the most famous and powerful batch process-based Big Data tools is
Apache Hadoop.
– It provides infrastructures and platforms for other specific Big Data
applications.
2. Stream Processing tools :
Stream processing – analyzing (and predicting from) data as and when it arrives.
– The key strength of stream processing is that it can provide insights faster,
often within milliseconds to seconds.
– It helps in understanding the hidden patterns in millions of data records in real time.
– It translates into processing of data from single or multiple sources in real or near-
real time, applying the desired business logic and emitting the processed
information to the sink.
– Stream processing serves multiple purposes in today's business arena.
Real time data streaming tools are:
a) Storm
Storm is a stream processing engine without batch support,
a true real-time processing framework,
taking in a stream as an entire 'event' instead of a series of small batches. Apache
Storm is a distributed real-time computation system.
Its applications are designed as directed acyclic graphs.
b) Apache Flink:
Apache Flink is an open-source platform,
a streaming dataflow engine that provides fault-tolerant communication
and distributed computation over data streams.
Flink is a top-level Apache project and a scalable data analytics
framework that is fully compatible with Hadoop.
Flink can execute both stream processing and batch processing easily.
Flink was designed as an alternative to MapReduce.
c) Kinesis
Kinesis is an out-of-the-box streaming data tool.
Kinesis comprises shards, which Kafka calls partitions.
For organizations that take advantage of real-time or near real-time access to large
stores of data,
Amazon Kinesis is great.
Kinesis Streams solves a variety of streaming data problems.
One common use is the real-time aggregation of data, which is followed by
loading the aggregated data into a data warehouse.
Data is put into Kinesis streams.
This ensures durability and elasticity.
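A conceptual sketch (independent of Storm, Flink and Kinesis) of the stream-processing idea described above: each event is processed as it arrives and a running aggregate is updated immediately, rather than waiting for a full batch:

from collections import deque

def rolling_average(stream, window_size=5):
    # Yield the average of the most recent `window_size` readings as each event arrives.
    window = deque(maxlen=window_size)
    for value in stream:
        window.append(value)
        yield sum(window) / len(window)

# Example: sensor readings arriving one at a time
readings = [21.0, 21.5, 22.0, 35.0, 22.1, 21.9]
for avg in rolling_average(readings):
    print(round(avg, 2))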
3. Interactive Analysis - Big Data Tools
Interactive analysis presents the data in an interactive environment,
allowing users to undertake their own analysis of information.
Users are directly connected to the computer and hence can interact with it in
real time.
The data can be reviewed, compared and analyzed in tabular or graphic
format, or both at the same time.
IA (Interactive Analysis) - Big Data Tools:
a) Google's Dremel:
Dremel is an interactive analysis system proposed by Google in 2010.
It is scalable for processing nested data.
Dremel provides a very fast SQL-like interface to the data by using a different
technique than MapReduce.
b) Apache Drill:
Drill is an Apache open-source SQL query engine for Big Data
exploration.
It is similar to Google's Dremel.
Other major Tools:
a) AWS b) BigData c ) Cassandra d) Data Warehousing e) DevOps f) HBase
g) Hive h)MongoDB i) NiFi j) Tableau k) Talend l) ZooKeeper.
Categories of Modern Analytic Tools
a) Big data tools for HPC and supercomputing
• MPI (Message Passing Interface, 1992)
– Provides standardized function interfaces for communication
between parallel processes.
• Collective communication operations
– Broadcast, Scatter, Gather, Reduce, Allgather, Allreduce,
Reduce-scatter.
• Popular implementations
– MPICH (2001)
– OpenMPI (2004)
A small sketch of an MPI collective operation is given below.
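An illustrative sketch of an MPI collective operation using the mpi4py binding (assumed to be installed); each process contributes its rank and Allreduce leaves the global sum on every process. It would typically be launched with something like mpiexec -n 4 python allreduce_demo.py:

from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

# Each process holds one value; Allreduce sums the values across all processes.
local = np.array([rank], dtype="i")
total = np.empty(1, dtype="i")
comm.Allreduce(local, total, op=MPI.SUM)

print(f"rank {rank}: global sum = {total[0]}")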
b) Big data tools on clouds
MapReduce model
Iterative MapReduce model
DAG model
Graph model
Collective model
STATISTICAL CONCEPTS
Fundamental Statistics
Elements in Statistics.
Types of Statistics
Statistics Vs Statistical Analysis
Basic Statistical Operations
Application of Statistical Concepts
Fundamental Statistics
Statistics is the methodology for collecting, analyzing, interpreting and drawing conclusions
from information.
Statistics is the methodology which scientists and mathematicians have developed for
interpreting and drawing conclusions from collected data.
Statistics provides methods for:
1. Design: Planning and carrying out research studies.
2. Description: Summarizing and exploring data.
3. Inference: Making predictions and generalizing about phenomena represented by the data.
Elements in Statistics
1. Experimental unit
• Object upon which we collect data
2. Population
• All items of interest
3. Variable
• Characteristic of an individual experimental unit
4. Sample
• Subset of the units of a population
• P in Population & Parameter
• S in Sample & Statistic
5. Statistical Inference
• Estimate or prediction or generalization about a population based on information contained
in a sample
6. Measure of Reliability
• Statement (usually qualified) about the degree of uncertainty associated with a statistical
inference
Examples of statistical problems
o Agricultural problem: Is a new grain seed or fertilizer more productive?
o Medical problem: What is the right dosage of a drug for treatment?
o Political science: How accurate are the Gallup and other opinion polls?
o Economics: What will be the unemployment rate next year?
o Technical problem: How to improve the quality of a product?
Types or Branches of Statistics:
The study of statistics has two major branches: descriptive statistics and inferential
statistics.
Descriptive statistics: –
– Methods of organizing, summarizing, and presenting data in an informative way.
– Involves: Collecting Data
Presenting Data
Characterizing Data
– Purpose: Describe Data
Inferential statistics: –
– The methods used to determine something about a population on the basis of a sample:
– Population –The entire set of individuals or objects of interest or the measurements obtained
from all individuals or objects of interest
– Sample – A portion, or part, of the population of interest
Statistics Vs Statistical Analysis
• Statistics :- The science of
– collecting,
– organizing,
– presenting,
– analyzing, and
– interpreting data
to assist in making more effective decisions.
• Statistical analysis: used to
– manipulate,
– summarize, and
– investigate data,
so that useful decision-making information results.
Basic Statistical Operations
Mean: A measure of central tendency for quantitative data, i.e. the long-term average
value.
Median: A measure of central tendency for quantitative data, i.e. the half-way point.
Mode: The most frequently occurring value (discrete), or where the probability density
function peaks (continuous).
Minimum: The smallest value.
Maximum: The largest value.
Interquartile range: Can be thought of as the middle 50% of the (quantitative) data;
used as a measure of spread.
Variance: Used as a measure of spread; may be thought of as the moment of inertia.
Standard deviation : A measure of spread, the square root of the variance.
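A short sketch (with illustrative data) computing the operations listed above using Python's standard statistics module and numpy:

import statistics
import numpy as np

data = [12, 15, 15, 18, 21, 24, 30]

print("mean     :", statistics.mean(data))
print("median   :", statistics.median(data))
print("mode     :", statistics.mode(data))
print("min, max :", min(data), max(data))

q1, q3 = np.percentile(data, [25, 75])
print("IQR      :", q3 - q1)                     # middle 50% of the data
print("variance :", statistics.variance(data))   # sample variance
print("std dev  :", statistics.stdev(data))      # square root of the variance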
Application of Statistical Concepts and Areas
Statistical Concepts :
• Finance – correlation and regression, index numbers, time series analysis
• Marketing – hypothesis testing, chi-square tests, nonparametric statistics
• Personnel – hypothesis testing, chi-square tests, nonparametric tests
• Operating management – hypothesis testing, estimation, analysis of variance, time series
analysis
Application Areas :
• Economics
– Forecasting
– Demographics
• Sports
– Individual & Team Performance
• Engineering
– Construction
– Materials
• Business
– Consumer Preferences
– Financial Trends
Sampling Distribution
Sample
Types of Samples
Examples of Sampling Distribution
Errors on Sampling Distribution.
Sample
A sample is “a smaller (but hopefully representative) collection of units from a population
used to determine truths about that population”
Types of Samples
1. Stratified Samples
2. Cluster Samples
3. Systematic Samples
4. Convenience Sample
1. Stratified Samples
A stratified sample has members from each segment of a population. This ensures that each
segment from the population is represented.
2. Cluster Samples :
A cluster sample has all members from randomly selected segments of a population. This is
used when the population falls into naturally occurring subgroups
3. Systematic Samples:
A systematic sample is a sample in which each member of the population is assigned a
number. A starting number is randomly selected and sample members are selected at regular
intervals.
4. Convenience Samples: A convenience sample consists only of available members of the
population.
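A brief sketch of two of the sampling schemes described above (systematic and stratified), assuming the population is simply a numbered list of members:

import random

population = list(range(1, 101))          # members numbered 1..100

# Systematic sample: random starting point, then every k-th member.
k = 10
start = random.randint(0, k - 1)
systematic = population[start::k]

# Stratified sample: draw the same number of members from each segment (stratum).
strata = {"segment A": population[:50], "segment B": population[50:]}
stratified = [m for members in strata.values() for m in random.sample(members, 5)]

print(systematic)
print(stratified)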
Example:
You are doing a study to determine the number of years of education each teacher at your college
has.
Identify the sampling technique used if you select the samples listed.
Examples of Sampling Distribution
1) Your sample says that a candidate gets support from 47%.
2) Inferential statistics allow you to say that
– (a) the candidate gets support from 47% of the population
– (b) with a margin of error of +/- 4%
– This means that the support in the population is likely somewhere between 43% and
51%.
Errors on Sampling Distribution
• Margin of error is taken directly from a sampling distribution.
• For a proportion it is computed approximately as z × sqrt(p(1 − p)/n), where z is the
critical value for the chosen confidence level (e.g. 1.96 for 95%).
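A small sketch computing the approximate 95% margin of error for the 47% support figure above; the sample size n = 600 is an assumed value used only for illustration:

import math

p = 0.47      # sample proportion
n = 600       # assumed sample size
z = 1.96      # critical value for 95% confidence

margin = z * math.sqrt(p * (1 - p) / n)
print(f"margin of error: +/- {margin:.3f}")   # roughly +/- 0.04, i.e. 4 percentage points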
Re-Sampling
Re-Sampling
Re-Sampling in Statistics
Need for Re-Sampling
Re-Sampling Methods
Re-Sampling
• Re-sampling is:
– the method that consists of drawing repeated samples from the original data
samples.
• The method of re-sampling is
– a nonparametric method of statistical inference.
• The method of resampling uses:
– experimental methods, rather than analytical methods, to generate the unique
sampling distribution.
Re-Sampling in statistics
• In statistics, re-sampling is any of a variety of methods for doing one of the following:
– Estimating the precision of sample statistics (medians, variances, percentiles)
– by using subsets of available data (jackknifing) or drawing randomly with
replacement from a set of data points (bootstrapping)
Need for Re-Sampling
• Re-sampling involves:
– the selection of randomized cases with replacement from the original data sample,
• in such a manner that each sample drawn has a number of cases that is similar
to the original data sample.
• Due to replacement:
– the samples drawn by the method of re-sampling may contain
repeated cases.
• Re-sampling generates a unique sampling distribution on the basis of the actual data.
• The method of re-sampling uses
– experimental methods, rather than analytical methods, to generate the unique
sampling distribution.
• The method of re-sampling yields
– unbiased estimates as it is based on the unbiased samples of all the possible results
of the data studied by the researcher.
Re-Sampling Methods
– processes of repeatedly drawing samples from a data set and refitting a given model
on each sample with the goal of learning more about the fitted model.
• Re-sampling methods can be expensive since they require repeatedly performing the same
statistical methods on N different subsets of the data.
• Re-sampling methods refit a model of interest to samples formed from the training set,
– in order to obtain additional information about the fitted model.
• For example, they provide estimates of test-set prediction error, and the standard deviation
and bias of our parameter estimates.
There are four major re-sampling methods available and are:
1. Permutation
2. Bootstrap
3. Jackknife
4. Cross validation
1. Permutation:
The term permutation refers to a mathematical calculation of the number of ways a
particular set can be arranged.
Permutation Re-sampling Processes:
Step 1: Collect Data from Control & Treatment Groups
Step 2: Merge samples to form a pseudo population
Step 3: Sample without replacement from the pseudo population to simulate the control and
treatment groups
Step 4: Compute the target statistic for each resample
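A compact sketch of the permutation procedure above, using the difference of group means as the target statistic; the control and treatment values are made-up illustration data:

import numpy as np

rng = np.random.default_rng(0)
control = np.array([5.1, 4.8, 5.5, 5.0, 4.9])
treatment = np.array([5.9, 6.1, 5.7, 6.3, 5.8])

observed = treatment.mean() - control.mean()
pooled = np.concatenate([control, treatment])        # step 2: pseudo population

diffs = []
for _ in range(10_000):
    perm = rng.permutation(pooled)                   # step 3: relabel without replacement
    diffs.append(perm[:5].mean() - perm[5:].mean())  # step 4: target statistic

p_value = np.mean(np.abs(diffs) >= abs(observed))
print("observed difference:", observed, " p-value:", p_value)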
2. Bootstrap :
• The bootstrap is
– a widely applicable tool that
– can be used to quantify the uncertainty associated with a given estimator or
statistical learning approach, including those for which it is difficult to obtain a measure of
variability.
• The bootstrap generates:
– distinct data sets by repeatedly sampling observations from the original data set.
– These generated data sets can be used to estimate variability in lieu of sampling
independent data sets from the full population.
Bootstrap Types
a) Parametric Bootstrap
b) Non-parametric Bootstrap
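A non-parametric bootstrap sketch (illustrative data): the original observations are resampled with replacement many times, and the spread of the resampled statistic is used as an estimate of its standard error:

import numpy as np

rng = np.random.default_rng(42)
sample = np.array([23, 27, 31, 29, 25, 35, 28, 30])

boot_means = []
for _ in range(5000):
    resample = rng.choice(sample, size=len(sample), replace=True)
    boot_means.append(resample.mean())

print("bootstrap estimate of the mean :", np.mean(boot_means))
print("bootstrap standard error       :", np.std(boot_means, ddof=1))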
3. Jackknife Method:
The jackknife method was introduced by Quenouille (1949) to estimate the bias of an
estimator.
The method was later shown to be useful in reducing the bias as well as in estimating the
variance of an estimator.
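A jackknife sketch on the same illustrative data: the statistic is recomputed with one observation left out at a time, and the leave-one-out values are combined into a variance estimate:

import numpy as np

sample = np.array([23, 27, 31, 29, 25, 35, 28, 30])
n = len(sample)

# Leave-one-out estimates of the mean
loo_means = np.array([np.delete(sample, i).mean() for i in range(n)])

# Jackknife variance estimate: (n - 1)/n times the sum of squared deviations
jack_var = (n - 1) / n * np.sum((loo_means - loo_means.mean()) ** 2)
print("jackknife standard error:", np.sqrt(jack_var))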
A comparison of the Bootstrap & Jackknife
Bootstrap
– Yields slightly different results when repeated on the same data (when estimating the
standard error)
– Not bound to theoretical distributions
Jackknife
– Less general technique
– Explores sample variation differently
– Yields the same result each time
– Similar data requirement
4. Cross validation:
Cross-validation is a technique used to protect against overfitting in a predictive
model, particularly in a case where the amount of data may be limited.
In cross-validation, you make a fixed number of folds (or partitions) of the data, run
the analysis on each fold, and then average the overall error estimate.
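A short k-fold cross-validation sketch using scikit-learn (assumed to be available); the linear model and synthetic data are only illustrative:

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.3, size=100)

model = LinearRegression()
# 5 folds: fit on 4 folds, score on the held-out fold, then average the error.
scores = cross_val_score(model, X, y, cv=5, scoring="neg_mean_squared_error")
print("average MSE across folds:", -scores.mean())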
Statistical Inference
Inference
Statistical Inference
Types of Statistical Inference
Inference:
Use a random sample to learn something about a larger population.
There are two ways to make an inference.
Statistical Inference:
The process of making guesses about the truth from a sample.
Statistical inference is the process through which inferences about a population are
made based on certain statistics calculated from a sample of data drawn from that
population.
Types of Statistical Inference
The two most common types of statistical inference are:
– Confidence intervals and
– Tests of significance.
Confidence Intervals
A confidence interval is a range of values within which the population parameter (e.g.
the mean μ) is expected to lie. For a 95% confidence interval, there is a 95% probability
that the interval contains μ; this probability is the level of confidence.
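A small sketch computing a 95% confidence interval for a population mean from a sample, using the t distribution in scipy.stats; the data values are illustrative:

import numpy as np
from scipy import stats

sample = np.array([12.1, 11.8, 12.6, 12.0, 11.9, 12.4, 12.2, 12.3])

mean = sample.mean()
sem = stats.sem(sample)   # standard error of the mean
low, high = stats.t.interval(0.95, len(sample) - 1, loc=mean, scale=sem)
print(f"95% CI for the mean: ({low:.2f}, {high:.2f})")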
Test of Significance ( Hypothesis testing):
A statistical method that uses: – sample data to evaluate a hypothesis about a
population parameter.
• A hypothesis is an assumption about the population parameter.
– A parameter is a Population mean or proportion
– The parameter must be identified before analysis.
Hypothesis Testing
• Is also called significance testing
• Tests a claim about a parameter using evidence (data in a sample)
• The technique is introduced by considering a one-sample z test
• The procedure is broken into four steps
• Each element of the procedure must be understood
Hypothesis Testing Steps
A. Null and alternative hypotheses
B. Test statistic
C. P-value and interpretation
D. Significance level (optional)
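A sketch of the four steps for a one-sample z test in Python; the hypothesized mean, the known population standard deviation, and the sample values are assumed numbers used only for illustration:

import math
from scipy.stats import norm

# A. Hypotheses: H0: mu = 50   vs   H1: mu != 50
mu0, sigma = 50, 8          # hypothesized mean and known population standard deviation
sample = [54, 49, 57, 52, 61, 48, 55, 53, 58, 51]
n, xbar = len(sample), sum(sample) / len(sample)

# B. Test statistic
z = (xbar - mu0) / (sigma / math.sqrt(n))

# C. Two-sided p-value
p_value = 2 * (1 - norm.cdf(abs(z)))

# D. Decision at significance level 0.05
print(f"z = {z:.2f}, p = {p_value:.4f}, reject H0: {p_value < 0.05}")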
Prediction Error
Error in Predictive Analysis
Prediction Error in Statistics
Prediction Error in Regression
Prediction Error
o A prediction error is the failure of some expected event to occur.
o Errors are an inescapable element of predictive analytics that should also be
quantified and presented along with any model, often in the form of a confidence
interval that indicates how accurate its predictions are expected to be.
o When predictions fail, humans can use metacognitive functions, examining prior
predictions and failures, for example to see whether there are correlations and trends,
such as consistently being unable to foresee outcomes accurately in particular situations.
o Applying that type of knowledge can inform decisions and improve the quality of
future predictions.
Error in Predictive Analysis
– Analysis of prediction errors from similar or previous models can help determine
confidence intervals.
Prediction Error in Statistics
1. Standard Error of the Estimate
The standard error of the estimate is a measure of the accuracy of predictions.
Recall that the regression line is the line that minimizes the sum of squared deviations
of prediction (also called the sum of squares error).
2. Mean squared prediction error
– In statistics the mean squared prediction error or mean squared error of the predictions of a
smoothing or curve fitting procedure is the expected value of the squared difference between
the fitted values implied by the predictive function and the values of the (unobservable)
function g.
– The MSE is a measure of the quality of an estimator; it is always non-negative, and values
closer to zero are better.
– Root-Mean-Square error or Root-Mean-Square Deviation (RMSE or RMSD)
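A brief sketch computing the mean squared error and RMSE of a set of predictions against observed values; the numbers are illustrative:

import numpy as np

observed = np.array([3.0, 4.5, 6.1, 8.0, 9.8])
predicted = np.array([2.8, 4.9, 5.9, 8.4, 9.5])

errors = observed - predicted
mse = np.mean(errors ** 2)     # always non-negative; closer to zero is better
rmse = np.sqrt(mse)

print("MSE :", mse)
print("RMSE:", rmse)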
Prediction Error in Regression
Regressions can differ in their accuracy of prediction.
The standard error of the estimate is a measure of the accuracy of predictions.
Recall that the regression line is the line that minimizes the sum of squared deviations of
prediction (also called the sum of squares error).