DS4015 BDA UNIT I KVL Notes
Master of Computer Applications (Anna University)
DS4015 BIG DATA ANALYTICS
UNIT - I INTRODUCTION TO BIG DATA
Introduction to Big Data Platform – Challenges of Conventional Systems -
Intelligent data analysis –Nature of Data - Analytic Processes and Tools -
Analysis Vs Reporting - Modern Data Analytic Tools- Statistical Concepts:
Sampling Distributions - Re-Sampling - Statistical Inference - Prediction Error.
UNIT - II SEARCH METHODS AND VISUALIZATION
Search by simulated Annealing – Stochastic, Adaptive search by Evaluation –
Evaluation Strategies –Genetic Algorithm – Genetic Programming – Visualization
– Classification of Visual Data Analysis Techniques – Data Types – Visualization
Techniques – Interaction techniques – Specific Visual data analysis Techniques
UNIT - III MINING DATA STREAMS
Introduction To Streams Concepts – Stream Data Model and Architecture -
Stream Computing - Sampling Data in a Stream – Filtering Streams – Counting
Distinct Elements in a Stream – Estimating Moments – Counting Oneness in a
Window – Decaying Window - Real time Analytics Platform(RTAP)
Applications - Case Studies - Real Time Sentiment Analysis, Stock Market
Predictions
UNIT - IV FRAMEWORKS
MapReduce – Hadoop, Hive, MapR – Sharding – NoSQL Databases - S3 -
Hadoop Distributed File Systems – Case Study- Preventing Private Information
Inference Attacks on Social Networks- Grand Challenge: Applying Regulatory
Science and Big Data to Improve Medical Device Innovation
UNIT - V R LANGUAGE
Overview, Programming structures: Control statements -Operators -Functions -
Environment and scope issues -Recursion -Replacement functions, R data
structures: Vectors -Matrices and arrays - Lists -Data frames -Classes,
Input/output, String manipulations
REFERENCES:
1. Michael Berthold, David J. Hand, Intelligent Data Analysis, Springer, 2007.
2. Anand Rajaraman and Jeffrey David Ullman, Mining of Massive Datasets, Cambridge University Press, 3rd edition, 2020.
3. Norman Matloff, The Art of R Programming: A Tour of Statistical Software Design, No Starch Press, USA, 2011.
4. Bill Franks, Taming the Big Data Tidal Wave: Finding Opportunities in Huge Data Streams with Advanced Analytics, John Wiley & Sons, 2012.
5. Glenn J. Myatt, Making Sense of Data, John Wiley & Sons, 2007.
UNIT - I INTRODUCTION TO BIG DATA
Introduction to Big Data Platform – Challenges of Conventional Systems - Intelligent data
analysis –Nature of Data - Analytic Processes and Tools - Analysis Vs Reporting - Modern
Data Analytic Tools- Statistical Concepts: Sampling Distributions - Re-Sampling - Statistical
Inference - Prediction Error.
INTRODUCTION TO BIG DATA
Big Data
Types of Big Data
Characteristics of Big Data
Growth of Big Data
Sources of Big Data
Risks in Big Data
Big Data
o Big Data is a term used to describe a collection of data that is huge in size and yet
growing exponentially with time.
o A collection of data sets so large and complex that it becomes difficult to process
using on-hand database management tools or traditional data processing applications.
Examples of Big Data generation include
– stock exchanges,
– social media sites,
– jet engines, etc.
Types Of Big Data:
Structured
Unstructured
Semi-structured
Structured Data
o Any data that can be stored, accessed and processed in a fixed format is
termed 'structured' data.
o Data stored in a relational database management system is one example of
structured data.
An 'Employee' table in a database is an example of structured data.
Unstructured Data
o Any data whose form or structure is unknown is classified as unstructured data.
o Being huge in size, unstructured data poses multiple challenges in terms of
processing it to derive value from it.
o An example of unstructured data is a heterogeneous data source containing a combination
of simple text files, images, videos, etc.
Example of Unstructured data
The output returned by 'Google Search'
Semi-structured Data
o Semi-structured data can contain both structured and unstructured forms of data.
o Semi-structured data appears structured in form, but it is not defined by a fixed schema,
e.g. a table definition in a relational DBMS.
An example of semi-structured data is personal data represented in an XML file:
<rec>
<name>Prashant Rao</name>
<sex>Male</sex>
<age>35</age>
</rec>
<rec>
<name>Seema R.</name>
<sex>Female</sex>
<age>41</age>
</rec>
<rec>
<name>Satish Mane</name>
<sex>Male</sex>
<age>29</age>
</rec>
<rec>
<name>Subrato Roy</name>
<sex>Male</sex>
<age>26</age>
</rec>
<rec>
<name>Jeremiah J.</name>
<sex>Male</sex>
<age>35</age>
</rec>
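A minimal sketch (not part of the original notes) of how such semi-structured records could be read in Python with the standard xml.etree.ElementTree module; the <people> root element is added here only so the fragment forms a well-formed document:

import xml.etree.ElementTree as ET

# Wrap the <rec> fragments in a single root so the XML is well formed.
xml_data = """
<people>
  <rec><name>Prashant Rao</name><sex>Male</sex><age>35</age></rec>
  <rec><name>Seema R.</name><sex>Female</sex><age>41</age></rec>
  <rec><name>Satish Mane</name><sex>Male</sex><age>29</age></rec>
</people>
"""

root = ET.fromstring(xml_data)
for rec in root.findall("rec"):
    # Each record carries the same fields, but no schema enforces this.
    print(rec.findtext("name"), rec.findtext("sex"), int(rec.findtext("age")))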
Characteristics of Big Data OR the 3Vs of Big Data
The three characteristics (3Vs) of Big Data are:
Volume - Data quantity
Velocity - Data Speed
Variety - Data Types
Growth of Big Data:
Storing Big Data
• Analyzing your data characteristics
– Selecting data sources for analysis
– Eliminating redundant data
– Establishing the role of NoSQL
• Overview of Big Data stores
– Data models: key value, graph, document, column-family
– Hadoop Distributed File System (HDFS)
– Hbase
– Hive
Processing Big Data
• Integrating disparate data stores
– Mapping data to the programming framework
– Connecting and extracting data from storage
– Transforming data for processing
– Subdividing data in preparation for Hadoop MapReduce
• Employing Hadoop MapReduce (a minimal mapper/reducer sketch follows this list)
– Creating the components of Hadoop MapReduce jobs
– Distributing data processing across server farms
– Executing Hadoop MapReduce jobs
– Monitoring the progress of job flows
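A minimal word-count sketch of the two components of a Hadoop MapReduce job, written as Hadoop Streaming scripts in Python; mapper.py and reducer.py are hypothetical file names, and the exact path of the streaming jar used to submit the job depends on the Hadoop distribution:

# mapper.py - emits one (word, 1) pair per word read from stdin
import sys

for line in sys.stdin:
    for word in line.strip().split():
        print(f"{word}\t1")

# reducer.py - sums the counts for each word (input arrives sorted by key)
import sys

current, total = None, 0
for line in sys.stdin:
    word, count = line.rstrip("\n").split("\t")
    if word != current:
        if current is not None:
            print(f"{current}\t{total}")
        current, total = word, 0
    total += int(count)
if current is not None:
    print(f"{current}\t{total}")

The job would be submitted with something like hadoop jar hadoop-streaming.jar -input /in -output /out -mapper mapper.py -reducer reducer.py; the mappers run in parallel across the cluster and each reducer receives a word's counts grouped together.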
Factors driving the growth of Big Data
– Increase in storage capacities
– Increase in processing power
– Availability of data (of different data types)
– Every day we create 2.5 quintillion bytes of data; 90% of the data in the
world today has been created in the last two years alone
Huge storage needs in real-time applications
– Facebook generates 10 TB of data daily
– Twitter generates 7 TB of data daily
– IBM claims 90% of today's stored data was generated in just the last two years.
How Is Big Data Different?
1) Often automatically generated by a machine (e.g. a sensor embedded in an engine)
2) Typically an entirely new source of data (e.g. use of the internet)
3) Not designed to be friendly (e.g. text streams)
4) Individual records may not have much value
– Need to focus on the important part
Sources of Big Data
• Users
• Application
• Systems
• Sensors
Risks in Big Data
• Organizations can be overwhelmed by the volume of data
– Need the right people solving the right problems
• Costs can escalate too fast
– It isn't necessary to capture 100% of the data
• Many sources of big data raise privacy concerns
– Self-regulation
– Legal regulation
INTRODUCTION TO BIG DATA PLATFORM
Big Data Platform
Features of Big Data Platform
List of Big Data Platform
Big Data Platform
A Big Data platform is an integrated IT solution for Big Data management that combines
several software systems, software tools and hardware to provide an easy-to-use system
of tools to enterprises.
Features of Big Data Platform
1. It should support linear scale-out
2. It should have capability for rapid deployment
3. It should support a variety of data formats
4. The platform should provide data analysis and reporting tools
5. It should provide real-time data analysis software
6. It should have tools for searching through large data sets
List of BigData Platforms
a. Hadoop
b. Cloudera
c. Amazon Web Services
d. Hortonworks
e. MapR
f. IBM Open Platform
g. Microsoft HDInsight
h. Intel Distribution for Apache Hadoop
i. Datastax Enterprise Analytics
j. Teradata Enterprise Access for Hadoop
k. Pivotal HD
CHALLENGES OF CONVENTIONAL SYSTEMS
Conventional System
Comparison of Big Data with Conventional Data
Challenges of Conventional System.
Challenges of Big Data
Conventional System
o A conventional system is a traditional, centralized data management system (typically
a relational database) designed to store and process structured data of limited volume.
o Big data is a huge amount of data which is beyond the processing capacity of
conventional database systems to manage and analyze within a specific time
interval.
Comparison of Big Data with Conventional Data
List of challenges of Conventional Systems:
The following challenges have dominated in the case of conventional systems in
real-time scenarios:
1. Uncertainty of the data management landscape
2. The Big Data talent gap that exists in the industry
3. Getting data into the big data platform
4. Need for synchronization across data sources
5. Getting important insights through the use of Big Data analytics
Big Data Challenges
– The challenges include capture, curation, storage, search, sharing, transfer,
analysis, and visualization.
Challenges of Big Data:
1. Dealing with outliers
2. Addressing data quality
3. Understanding the data
4. Visualization helps organizations perform analyses
5. Meeting the need for speed
6. Degree of granularity increases.
7. Displaying meaningful results.
INTELLIGENT DATA ANALYSIS
Intelligent Data Analysis
Benefits of Intelligent Data Analysis
Intelligent Data Analysis – Knowledge Acquisition
Evaluation of Intelligent Data Analysis Results
Intelligent Data Analysis (IDA)
– is used for extracting useful information from large quantities of online data and for
extracting desirable knowledge or interesting patterns from existing databases;
– is an interdisciplinary study concerned with the effective analysis of data.
Goal: The goal of intelligent data analysis is to extract useful knowledge; the process
demands a combination of extraction, analysis, conversion, classification, organization,
reasoning, and so on.
Uses / Benefits of IDA:
• Data Engineering
• Database mining techniques, tools and applications
• Use of domain knowledge in data analysis
• Big Data applications
• Evolutionary algorithms
• Machine Learning(ML)
• Neural nets
• Fuzzy logic
• Statistical pattern recognition
• Knowledge Filtering and
• Post-processing
Intelligent Data Analysis :Knowledge Acquisition
The process of eliciting, analyzing, transforming, classifying, organizing and integrating
knowledge and representing that knowledge in a form that can be used in a computer
system. Knowledge in a domain can be expressed as a number of rules
A Rule : A formal way of specifying a recommendation, directive, or strategy, expressed
as "IF premise THEN conclusion" or "IF condition THEN action".
Evaluation of IDA results:
• Absolute & relative accuracy
• Sensitivity & specificity
• False positive & false negative
• Error rate
• Reliability of rule
NATURE OF DATA
Data
Properties of Data
Types of Data
Data Conversion
Data Selection
Data:
Data is a set of values of qualitative or quantitative variables; restated, pieces of data
are individual pieces of information.
Data is measured, collected and reported, and analyzed, whereupon it can be
visualized using graphs or images
Data is nothing but facts and statistics stored or free flowing over a network,
generally it's raw and unprocessed.
When data are processed, organized, structured or presented in a given context so as to
make them useful, they are called Information.
3 Actions on Data:
Capture
Transform
Store
Properties of Data
Clarity
Accuracy
Essence
Aggregation
Compression
Refinement
TYPES OF DATA:
1. Nominal scales:
Measure categories and have the following characteristics:
• Order: The order of the responses or observations does not matter.
• Distance: Nominal scales do not hold distance. The distance between a 1 and a 2 is
not the same as a 2 and 3.
• True Zero: There is no true or real zero. In a nominal scale, zero is
uninterpretable.
• Appropriate statistics for nominal scales: mode, count, frequencies
• Displays: histograms or bar charts
2. Ordinal Scales:
At the risk of providing a tautological definition, ordinal scales measure, well, order. So,
our characteristics for ordinal scales are:
Order: The order of the responses or observations matters.
Distance: Ordinal scales do not hold distance. The distance between first and
second is unknown as is the distance between first and third along with all
observations.
True Zero: There is no true or real zero. An item, observation, or category
cannot finish in zeroth place.
Appropriate statistics for ordinal scales: count, frequencies, mode
Displays: histograms or bar charts
3 .Interval Scales:
Interval scales provide insight into the variability of the observations or data. Classic
interval scales are Likert scales (e.g., 1 - strongly agree and 9 - strongly disagree) and
Semantic Differential scales (e.g., 1 - dark and 9 - light).
Order: The order of the responses or observations does matter.
Distance: Interval scales do offer distance.
True Zero: There is no true zero with interval scales; the zero point is arbitrary.
Appropriate statistics for interval scales: count, frequencies, mode, median,
mean, standard deviation (and variance), skewness, and kurtosis.
Displays: histograms or bar charts, line charts, and scatter plots.
4. Ratio Scales:
Ratio scales are like interval scales, but with a true zero. They have the following
characteristics:
– Order: The order of the responses or observations matters.
– Distance: Ratio scales do have an interpretable distance.
– True Zero: There is a true zero.
– Appropriate statistics for ratio scales: count, frequencies, mode, median, mean,
standard deviation (and variance), skewness, and kurtosis.
– Displays: histograms or bar charts, line charts, and scatter plots.
The table below summarizes the characteristics of all four types of scales.
Scale      Order   Distance   True Zero
Nominal    No      No         No
Ordinal    Yes     No         No
Interval   Yes     Yes        No
Ratio      Yes     Yes        Yes
Data Conversion
We can convert or transform our data from ratio to interval to ordinal to nominal; however,
we cannot convert or transform our data from nominal to ordinal to interval to ratio.
Scaled data can be measured in exact amounts.
For example: 60 degrees, 12.5 feet, 80 miles per hour.
Scaled data can be measured with equal intervals.
For example: between 0 and 1 inch is 1 inch; between 13 and 14 inches is also 1 inch.
Ordinal or ranked data provides comparative amounts.
Example:
1st Place: 19.6 feet, 2nd Place: 18.2 feet, 3rd Place: 12.4 feet
(the intervals between places are not equal)
Data Selection
Example – Average Driving Speed
a) Scaled
b) Ordinal
Scaled – Speed: speed can be measured in exact amounts with equal intervals.
Example :
60 degrees 12.5 feet 80 Miles per hour
Ordinal or ranked data provides comparative amounts.
For example, 1st Place 2nd Place 3rd Place
ANALYTIC PROCESS AND TOOLS:
There are 6 analytic processes:
1. Deployment
2. Business Understanding
3. Data Exploration
4. Data Preparation
5. Data Modeling
6. Data Evaluation
Step 1: Deployment:
– Plan the deployment, monitoring and maintenance; we need to produce a
final report and review the project.
– In this phase, we deploy the results of the analysis; this is also known as reviewing the project.
Step 2: Business Understanding:
– The very first step consists of business understanding.
– Whenever any requirement occurs, we first need to determine the business
objective, assess the situation, determine data mining goals and then
produce the project plan as per the requirement.
– Business objectives are defined in this phase.
Step 3: Data Exploration :
This step consists of data understanding.
– For the further process, we need to gather initial data, describe and explore the data,
and verify data quality to ensure it contains the data we require.
– Data collected from the various sources is described in terms of its
application and the need for the project in this phase.
– This is also known as data exploration.
– This is necessary to verify the quality of data collected.
Step 4: Data Preparation:
– We need to select data as per the need, clean it, construct it to get useful
information, and then integrate it all.
– Finally, we need to format the data to get the appropriate data.
– Data is selected, cleaned, and integrated into the format finalized for the
analysis in this phase.
Step 5: Data Modeling:
– Select a modeling technique, generate a test design, build a model and assess the
model built.
– The data model is built to analyze relationships between the various selected
objects in the data.
– Test cases are built for assessing the model, and the model is tested and
implemented on the data in this phase.
Step 6: Data Evaluation
– The results of the modeling are evaluated against the business objectives defined
earlier, and the findings are reviewed before they are put to use.
Where is processing hosted?
– Distributed servers / cloud (e.g. Amazon EC2)
Where is data stored?
– Distributed storage (e.g. Amazon S3)
What is the programming model?
– Distributed processing (e.g. MapReduce)
How is data stored & indexed?
– High-performance schema-free databases (e.g. MongoDB)
What operations are performed on data?
– Analytic / semantic processing
Analytical Tools
– Big data tools for HPC and supercomputing
– MPI
– Big data tools on clouds
– MapReduce model
– Iterative MapReduce model
– DAG model
– Graph model
– Collective model
– Other BDA tools
– SAS
– R
– Hadoop
ANALYSIS AND REPORTING
Analysis
Reporting
Differences between Analysis and Reporting
Analysis
The process of exploring data and reports in order to extract meaningful insights, which
can be used to better understand and improve business performance.
Reporting
Reporting is the process of organizing data into informational summaries, in order to
monitor how different areas of a business are performing.
Differences between Analysis and Reporting
Reporting translates raw data into information.
Analysis transforms data and information into insights.
Reporting shows you what is happening,
while analysis focuses on explaining why it is happening and what you can do
about it.
MODERN ANALYTIC TOOLS:
Modern Analytic Tools: Current analytic tools concentrate on three classes:
1. Batch processing tools
2. Stream Processing tools and
3. Interactive Analysis tools.
1. Batch processing system :
Batch Processing System involves :
– collecting a series of processing jobs and carrying them out periodically as a
group (or batch) of jobs.
– It allows a large volume of jobs to be processed at the same time.
– An organization can schedule batch processing for a time when there is little
activity on its computer systems.
– One of the most famous and powerful batch process-based Big Data tools is
Apache Hadoop.
– It provides infrastructures and platforms for other specific Big Data
applications.
2. Stream Processing tools :
Stream processing – analyzing (and predicting from) data as and when it arrives.
– The key strength of stream processing is that it can provide insights faster,
often within milliseconds to seconds.
– It helps in understanding the hidden patterns in millions of data records in real time.
– It translates into processing of data from single or multiple sources in real or near-
real time, applying the desired business logic and emitting the processed
information to the sink.
– Stream processing serves multiple purposes in today's business arena.
Real time data streaming tools are:
a) Storm
Storm is a stream processing engine without batch support,
a true real-time processing framework,
taking in a stream as an entire 'event' instead of a series of small batches. Apache
Storm is a distributed real-time computation system.
Its applications are designed as directed acyclic graphs.
b) Apache Flink:
Apache Flink is an open-source platform,
a streaming dataflow engine that provides fault-tolerant communication
and distributed computation over data streams.
Flink is a top-level Apache project and a scalable data analytics
framework that is fully compatible with Hadoop.
Flink can execute both stream processing and batch processing easily.
Flink was designed as an alternative to MapReduce.
c) Kinesis
Kinesis is an out-of-the-box streaming data tool.
Kinesis comprises shards, which Kafka calls partitions.
For organizations that take advantage of real-time or near real-time access to large
stores of data,
Amazon Kinesis is great.
Kinesis Streams solves a variety of streaming data problems.
One common use is the real-time aggregation of data, which is followed by
loading the aggregated data into a data warehouse.
Data is put into Kinesis streams.
This ensures durability and elasticity.
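A conceptual sketch (independent of Storm, Flink and Kinesis) of the stream-processing idea described above: each event is processed as it arrives and a running aggregate is updated immediately, rather than waiting for a full batch:

from collections import deque

def rolling_average(stream, window_size=5):
    # Yield the average of the most recent `window_size` readings as each event arrives.
    window = deque(maxlen=window_size)
    for value in stream:
        window.append(value)
        yield sum(window) / len(window)

# Example: sensor readings arriving one at a time
readings = [21.0, 21.5, 22.0, 35.0, 22.1, 21.9]
for avg in rolling_average(readings):
    print(round(avg, 2))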
3. Interactive Analysis - Big Data Tools
Interactive analysis presents the data in an interactive environment,
allowing users to undertake their own analysis of information.
Users are directly connected to the computer and hence can interact with it in
real time.
The data can be reviewed, compared and analyzed in tabular or graphic
format, or both at the same time.
IA (Interactive Analysis) - Big Data Tools:
a) Google's Dremel:
Dremel is an interactive analysis system proposed by Google in 2010.
It is scalable for processing nested data.
Dremel provides a very fast SQL-like interface to the data by using a different
technique than MapReduce.
b) Apache Drill:
Drill is an Apache open-source SQL query engine for Big Data
exploration.
It is similar to Google's Dremel.
Other major Tools:
a) AWS b) BigData c ) Cassandra d) Data Warehousing e) DevOps f) HBase
g) Hive h)MongoDB i) NiFi j) Tableau k) Talend l) ZooKeeper.
Categories of Modern Analytic Tools
a) Big data tools for HPC and supercomputing
• MPI (Message Passing Interface, 1992)
– Provides standardized function interfaces for communication
between parallel processes.
• Collective communication operations
– Broadcast, Scatter, Gather, Reduce, Allgather, Allreduce,
Reduce-scatter.
• Popular implementations
– MPICH (2001)
– OpenMPI (2004)
A small sketch of an MPI collective operation is given below.
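An illustrative sketch of an MPI collective operation using the mpi4py binding (assumed to be installed); each process contributes its rank and Allreduce leaves the global sum on every process. It would typically be launched with something like mpiexec -n 4 python allreduce_demo.py:

from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

# Each process holds one value; Allreduce sums the values across all processes.
local = np.array([rank], dtype="i")
total = np.empty(1, dtype="i")
comm.Allreduce(local, total, op=MPI.SUM)

print(f"rank {rank}: global sum = {total[0]}")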
b) Big data tools on clouds
MapReduce model
Iterative MapReduce model
DAG model
Graph model
Collective model
STATISTICAL CONCEPTS
Fundamental Statistics
Elements in Statistics.
Types of Statistics
Statistics Vs Statistical Analysis
Basic Statistical Operations
Application of Statistical Concepts
Fundamental Statistics
Statistics is the methodology for collecting, analyzing, interpreting and drawing conclusions
from information.
Statistics is the methodology which scientists and mathematicians have developed for
interpreting and drawing conclusions from collected data.
Statistics provides methods for:
1. Design: Planning and carrying out research studies.
2. Description: Summarizing and exploring data.
3. Inference: Making predictions and generalizing about phenomena represented by the data.
Elements in Statistics
1. Experimental unit
• Object upon which we collect data
2. Population
• All items of interest
3. Variable
• Characteristic of an individual experimental unit
4. Sample
• Subset of the units of a population
• P in Population & Parameter
• S in Sample & Statistic
5. Statistical Inference
• Estimate or prediction or generalization about a population based on information contained
in a sample
6. Measure of Reliability
• Statement (usually qualified) about the degree of uncertainty associated with a statistical
inference
Examples of statistical problems
o Agricultural problem: Is a new grain seed or fertilizer more productive?
o Medical problem: What is the right dosage of a drug for treatment?
o Political science: How accurate are the Gallup and other opinion polls?
o Economics: What will be the unemployment rate next year?
o Technical problem: How to improve the quality of a product?
Types or Branches of Statistics:
The study of statistics has two major branches: descriptive statistics and inferential
statistics.
Descriptive statistics: –
– Methods of organizing, summarizing, and presenting data in an informative way.
– Involves: Collecting Data
Presenting Data
Characterizing Data
– Purpose: Describe Data
Inferential statistics: –
– The methods used to determine something about a population on the basis of a sample:
– Population –The entire set of individuals or objects of interest or the measurements obtained
from all individuals or objects of interest
– Sample – A portion, or part, of the population of interest
Statistics Vs Statistical Analysis
• Statistics :- The science of
– collecting,
– organizing,
– presenting,
– analyzing, and
– interpreting data
to assist in making more effective decisions.
• Statistical analysis: used to
– manipulate,
– summarize, and
– investigate data,
so that useful decision-making information results.
Basic Statistical Operations
Mean: A measure of central tendency for quantitative data, i.e. the long-term average
value.
Median: A measure of central tendency for quantitative data, i.e. the half-way point.
Mode: The most frequently occurring value (discrete), or where the probability density
function peaks (continuous).
Minimum: The smallest value.
Maximum: The largest value.
Interquartile range: Can be thought of as the middle 50% of the (quantitative) data;
used as a measure of spread.
Variance: Used as a measure of spread; may be thought of as the moment of inertia.
Standard deviation : A measure of spread, the square root of the variance.
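A short sketch (with illustrative data) computing the operations listed above using Python's standard statistics module and numpy:

import statistics
import numpy as np

data = [12, 15, 15, 18, 21, 24, 30]

print("mean     :", statistics.mean(data))
print("median   :", statistics.median(data))
print("mode     :", statistics.mode(data))
print("min, max :", min(data), max(data))

q1, q3 = np.percentile(data, [25, 75])
print("IQR      :", q3 - q1)                     # middle 50% of the data
print("variance :", statistics.variance(data))   # sample variance
print("std dev  :", statistics.stdev(data))      # square root of the variance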
Application of Statistical Concepts and Areas
Statistical Concepts :
• Finance – correlation and regression, index numbers, time series analysis
• Marketing – hypothesis testing, chi-square tests, nonparametric statistics
• Personnel – hypothesis testing, chi-square tests, nonparametric tests
• Operating management – hypothesis testing, estimation, analysis of variance, time series
analysis
Application Areas :
• Economics
– Forecasting
– Demographics
• Sports
– Individual & Team Performance
• Engineering
– Construction
– Materials
• Business
– Consumer Preferences
– Financial Trends
Sampling Distribution
Sample
Types of Samples
Examples of Sampling Distribution
Errors on Sampling Distribution.
Sample
A sample is “a smaller (but hopefully representative) collection of units from a population
used to determine truths about that population”
Types of Samples
1. Stratified Samples
2. Cluster Samples
3. Systematic Samples
4. Convenience Sample
1. Stratified Samples
A stratified sample has members from each segment of a population. This ensures that each
segment from the population is represented.
2. Cluster Samples :
A cluster sample has all members from randomly selected segments of a population. This is
used when the population falls into naturally occurring subgroups
3. Systematic Samples:
A systematic sample is a sample in which each member of the population is assigned a
number. A starting number is randomly selected and sample members are selected at regular
intervals.
4. Convenience Samples: A convenience sample consists only of available members of the
population.
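A brief sketch of two of the sampling schemes described above (systematic and stratified), assuming the population is simply a numbered list of members:

import random

population = list(range(1, 101))          # members numbered 1..100

# Systematic sample: random starting point, then every k-th member.
k = 10
start = random.randint(0, k - 1)
systematic = population[start::k]

# Stratified sample: draw the same number of members from each segment (stratum).
strata = {"segment A": population[:50], "segment B": population[50:]}
stratified = [m for members in strata.values() for m in random.sample(members, 5)]

print(systematic)
print(stratified)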
Example:
You are doing a study to determine the number of years of education each teacher at your college
has.
Identify the sampling technique used if you select the samples listed.
Examples of Sampling Distribution
1) Your sample says that a candidate gets support from 47%.
2) Inferential statistics allow you to say that
– (a) the candidate gets support from 47% of the population
– (b) with a margin of error of +/- 4%
– This means that the support in the population is likely somewhere between 43% and
51%.
Errors on Sampling Distribution
• Margin of error is taken directly from a sampling distribution.
• For a proportion it is computed approximately as z × sqrt(p(1 − p)/n), where z is the
critical value for the chosen confidence level (e.g. 1.96 for 95%).
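A small sketch computing the approximate 95% margin of error for the 47% support figure above; the sample size n = 600 is an assumed value used only for illustration:

import math

p = 0.47      # sample proportion
n = 600       # assumed sample size
z = 1.96      # critical value for 95% confidence

margin = z * math.sqrt(p * (1 - p) / n)
print(f"margin of error: +/- {margin:.3f}")   # roughly +/- 0.04, i.e. 4 percentage points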
Re-Sampling
Re-Sampling
Re-Sampling in Statistics
Need for Re-Sampling
Re-Sampling Methods
Re-Sampling
• Re-sampling is:
– the method that consists of drawing repeated samples from the original data
samples.
• The method of re-sampling is
– a nonparametric method of statistical inference.
• The method of resampling uses:
– experimental methods, rather than analytical methods, to generate the unique
sampling distribution.
Re-Sampling in statistics
• In statistics, re-sampling is any of a variety of methods for doing one of the following:
– Estimating the precision of sample statistics (medians, variances, percentiles)
– by using subsets of available data (jackknifing) or drawing randomly with
replacement from a set of data points (bootstrapping)
Need for Re-Sampling
• Re-sampling involves:
– the selection of randomized cases with replacement from the original data sample,
• in such a manner that each sample drawn has a number of cases that is similar
to the original data sample.
• Due to replacement:
– the samples drawn by the method of re-sampling may contain
repeated cases.
• Re-sampling generates a unique sampling distribution on the basis of the actual data.
• The method of re-sampling uses
– experimental methods, rather than analytical methods, to generate the unique
sampling distribution.
• The method of re-sampling yields
– unbiased estimates as it is based on the unbiased samples of all the possible results
of the data studied by the researcher.
Re-Sampling Methods
– processes of repeatedly drawing samples from a data set and refitting a given model
on each sample with the goal of learning more about the fitted model.
• Re-sampling methods can be expensive since they require repeatedly performing the same
statistical methods on N different subsets of the data.
• Re-sampling methods refit a model of interest to samples formed from the training set,
– in order to obtain additional information about the fitted model.
• For example, they provide estimates of test-set prediction error, and the standard deviation
and bias of our parameter estimates.
There are four major re-sampling methods available and are:
1. Permutation
2. Bootstrap
3. Jackknife
4. Cross validation
1. Permutation:
The term permutation refers to a mathematical calculation of the number of ways a
particular set can be arranged.
Permutation Re-sampling Processes:
Step 1: Collect Data from Control & Treatment Groups
Step 2: Merge samples to form a pseudo population
Step 3: Sample without replacement from the pseudo population to simulate the control and
treatment groups
Step 4: Compute the target statistic for each resample
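A compact sketch of the permutation procedure above, using the difference of group means as the target statistic; the control and treatment values are made-up illustration data:

import numpy as np

rng = np.random.default_rng(0)
control = np.array([5.1, 4.8, 5.5, 5.0, 4.9])
treatment = np.array([5.9, 6.1, 5.7, 6.3, 5.8])

observed = treatment.mean() - control.mean()
pooled = np.concatenate([control, treatment])        # step 2: pseudo population

diffs = []
for _ in range(10_000):
    perm = rng.permutation(pooled)                   # step 3: relabel without replacement
    diffs.append(perm[:5].mean() - perm[5:].mean())  # step 4: target statistic

p_value = np.mean(np.abs(diffs) >= abs(observed))
print("observed difference:", observed, " p-value:", p_value)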
2. Bootstrap :
• The bootstrap is
– a widely applicable tool that
– can be used to quantify the uncertainty associated with a given estimator or
statistical learning approach, including those for which it is difficult to obtain a measure of
variability.
• The bootstrap generates:
– distinct data sets by repeatedly sampling observations from the original data set.
– These generated data sets can be used to estimate variability in lieu of sampling
independent data sets from the full population.
Bootstrap Types
a) Parametric Bootstrap
b) Non-parametric Bootstrap
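A non-parametric bootstrap sketch (illustrative data): the original observations are resampled with replacement many times, and the spread of the resampled statistic is used as an estimate of its standard error:

import numpy as np

rng = np.random.default_rng(42)
sample = np.array([23, 27, 31, 29, 25, 35, 28, 30])

boot_means = []
for _ in range(5000):
    resample = rng.choice(sample, size=len(sample), replace=True)
    boot_means.append(resample.mean())

print("bootstrap estimate of the mean :", np.mean(boot_means))
print("bootstrap standard error       :", np.std(boot_means, ddof=1))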
3. Jackknife Method:
The jackknife method was introduced by Quenouille (1949) to estimate the bias of an
estimator.
The method was later shown to be useful in reducing the bias as well as in estimating the
variance of an estimator.
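A jackknife sketch on the same illustrative data: the statistic is recomputed with one observation left out at a time, and the leave-one-out values are combined into a variance estimate:

import numpy as np

sample = np.array([23, 27, 31, 29, 25, 35, 28, 30])
n = len(sample)

# Leave-one-out estimates of the mean
loo_means = np.array([np.delete(sample, i).mean() for i in range(n)])

# Jackknife variance estimate: (n - 1)/n times the sum of squared deviations
jack_var = (n - 1) / n * np.sum((loo_means - loo_means.mean()) ** 2)
print("jackknife standard error:", np.sqrt(jack_var))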
A comparison of the Bootstrap & Jackknife
Bootstrap
– Yields slightly different results when repeated on the same data (when estimating the
standard error)
– Not bound to theoretical distributions
Jackknife
– Less general technique
– Explores sample variation differently
– Yields the same result each time
– Similar data requirement
4. Cross validation:
Cross-validation is a technique used to protect against overfitting in a predictive
model, particularly in a case where the amount of data may be limited.
In cross-validation, you make a fixed number of folds (or partitions) of the data, run
the analysis on each fold, and then average the overall error estimate.
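A short k-fold cross-validation sketch using scikit-learn (assumed to be available); the linear model and synthetic data are only illustrative:

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.3, size=100)

model = LinearRegression()
# 5 folds: fit on 4 folds, score on the held-out fold, then average the error.
scores = cross_val_score(model, X, y, cv=5, scoring="neg_mean_squared_error")
print("average MSE across folds:", -scores.mean())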
Statistical Inference
Inference
Statistical Inference
Types of Statistical Inference
Inference:
Use a random sample to learn something about a larger population.
There are two ways to make an inference.
Statistical Inference:
The process of making guesses about the truth from a sample.
Statistical inference is the process through which inferences about a population are
made based on certain statistics calculated from a sample of data drawn from that
population.
Types of Statistical Inference
The two most common types of statistical inference are:
– Confidence intervals and
– Tests of significance.
Confidence Intervals
A confidence interval is a range of values within which the population parameter (e.g.
the mean μ) is expected to lie. For a 95% confidence interval, there is a 95% probability
that the interval contains μ; this probability is the level of confidence.
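A small sketch computing a 95% confidence interval for a population mean from a sample, using the t distribution in scipy.stats; the data values are illustrative:

import numpy as np
from scipy import stats

sample = np.array([12.1, 11.8, 12.6, 12.0, 11.9, 12.4, 12.2, 12.3])

mean = sample.mean()
sem = stats.sem(sample)   # standard error of the mean
low, high = stats.t.interval(0.95, len(sample) - 1, loc=mean, scale=sem)
print(f"95% CI for the mean: ({low:.2f}, {high:.2f})")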
Test of Significance ( Hypothesis testing):
A statistical method that uses: – sample data to evaluate a hypothesis about a
population parameter.
• A hypothesis is an assumption about the population parameter.
– A parameter is a Population mean or proportion
– The parameter must be identified before analysis.
Hypothesis Testing
• Is also called significance testing
• Tests a claim about a parameter using evidence (data in a sample)
• The technique is introduced by considering a one-sample z test
• The procedure is broken into four steps
• Each element of the procedure must be understood
Hypothesis Testing Steps
A. Null and alternative hypotheses
B. Test statistic
C. P-value and interpretation
D. Significance level (optional)
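A sketch of the four steps for a one-sample z test in Python; the hypothesized mean, the known population standard deviation, and the sample values are assumed numbers used only for illustration:

import math
from scipy.stats import norm

# A. Hypotheses: H0: mu = 50   vs   H1: mu != 50
mu0, sigma = 50, 8          # hypothesized mean and known population standard deviation
sample = [54, 49, 57, 52, 61, 48, 55, 53, 58, 51]
n, xbar = len(sample), sum(sample) / len(sample)

# B. Test statistic
z = (xbar - mu0) / (sigma / math.sqrt(n))

# C. Two-sided p-value
p_value = 2 * (1 - norm.cdf(abs(z)))

# D. Decision at significance level 0.05
print(f"z = {z:.2f}, p = {p_value:.4f}, reject H0: {p_value < 0.05}")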
Prediction Error
Error in Predictive Analysis
Prediction Error in Statistics
Prediction Error in Regression
Prediction Error
o A prediction error is the failure of some expected event to occur.
o Errors are an inescapable element of predictive analytics that should also be
quantified and presented along with any model, often in the form of a confidence
interval that indicates how accurate its predictions are expected to be.
o When predictions fail, humans can use metacognitive functions, examining prior
predictions and failures, for example to see whether there are correlations and trends,
such as consistently being unable to foresee outcomes accurately in particular situations.
o Applying that type of knowledge can inform decisions and improve the quality of
future predictions.
Error in Predictive Analysis
– Analysis of prediction errors from similar or previous models can help determine
confidence intervals.
Prediction Error in Statistics
1. Standard Error of the Estimate
The standard error of the estimate is a measure of the accuracy of predictions.
Recall that the regression line is the line that minimizes the sum of squared deviations
of prediction (also called the sum of squares error).
2. Mean squared prediction error
– In statistics the mean squared prediction error or mean squared error of the predictions of a
smoothing or curve fitting procedure is the expected value of the squared difference between
the fitted values implied by the predictive function and the values of the (unobservable)
function g.
– The MSE is a measure of the quality of an estimator; it is always non-negative, and values
closer to zero are better.
– Root-Mean-Square error or Root-Mean-Square Deviation (RMSE or RMSD)
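A brief sketch computing the mean squared error and RMSE of a set of predictions against observed values; the numbers are illustrative:

import numpy as np

observed = np.array([3.0, 4.5, 6.1, 8.0, 9.8])
predicted = np.array([2.8, 4.9, 5.9, 8.4, 9.5])

errors = observed - predicted
mse = np.mean(errors ** 2)     # always non-negative; closer to zero is better
rmse = np.sqrt(mse)

print("MSE :", mse)
print("RMSE:", rmse)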
Prediction Error in Regression
Regressions can differ in their accuracy of prediction.
The standard error of the estimate is a measure of the accuracy of predictions.
Recall that the regression line is the line that minimizes the sum of squared deviations of
prediction (also called the sum of squares error).