IT18702 – BIG DATA ANALYTICS
MS.MEENAKSHI.P,AP/INT
OBJECTIVES & OUTCOMES
OBJECTIVES
To understand the concept of big data.
To learn practical data analytics with R and Hadoop.
To learn about big data frameworks.
OUTCOMES:
Upon completion of the course, students will be able to
Work with big data tools and their analysis techniques
Design efficient algorithms for mining the data from large volumes
Design an efficient recommendation system
Design the tools for visualization
Learn NoSQL databases and management.
Reference Books
• TEXT BOOKS:
– 1. Vignesh Prajapati, “Big Data Analytics with R and Hadoop”, Packt
Publishing, 2014.
– 2. Stephan Kudyba, “Big Data, Mining, and Analytics: Components of
Strategic Decision Making”, First Edition, CRC Press, 2014.
AGENDA
• Big Data
• Sources Of Big Data
• 5 V’s
• Data Analytics
• Applications of Big data Analytics
• Difference Between Data Science, Big Data, and Data Analytics
• Example
BIG DATA
• Massive or huge amount of data that
– cannot be stored,
– processed, and analyzed using traditional methods (RDBMS)
– within the given time frame
Introduction to Big Data Analytics
• Big data analytics is the field that analyzes and extracts information from the big data involved in a business or data domain so that proper conclusions can be drawn. These conclusions can be used to predict the future or to forecast the business.
SOURCES OF BIG DATA
• Social Networking Sites – Facebook, Twitter, YouTube, etc.
• E-commerce Sites – Amazon, Flipkart, etc.
• Weather Station
• Telecom Company
• Airlines
• Share Market
5 V’s of Big Data
Variety of Big data
• Variety is the idea that data comes from different sources, machines,
people, processes, both internal and external to organizations.
• Attributes include the degree of structure and complexity and drivers
are mobile technologies, social media, wearable technologies, geo
technologies, video, and many, many more.
Categories of Big Data
Structured Data
• Structured data is data whose elements are addressable for effective
analysis.
• It has been organized into a formatted repository, typically a database. It concerns all data that can be stored in a SQL database in tables with rows and columns.
• Example: Relational data.
Unstructured Data
• Unstructured data is data that is not organized in a predefined manner or does not have a predefined data model; thus it is not a good fit for a mainstream relational database.
• Example: Word, PDF, Text, Media logs.
Semi structured data
• Semi-structured data is information that does not reside in a
relational database but that has some organizational properties that
make it easier to analyze.
• With some processing, it can be stored in a relational database.
• Example: XML data.
Volume
Volume is the amount of data generated, measured for example in exabytes, zettabytes, or yottabytes. Drivers of volume are the increase in data sources, higher-resolution sensors, and scalable infrastructure.
Veracity
Veracity is the quality and origin of data. Attributes include
consistency, completeness, integrity, and ambiguity. Drivers include
cost, and the need for traceability.
Velocity
• Velocity is the idea that data is being generated extremely fast, a process
that never stops. Attributes include near or real-time streaming and local and
cloud-based technologies that can process information very quickly.
• Every 60 seconds, hours of footage are uploaded to YouTube. This amount of
data is generated every minute. So think about how much accumulates over
hours, days, and years.
Value
• The emerging V is value. This V refers to our ability and need to turn
data into value. Value isn't just profit. It may be medical or social
benefits, or customer, employee, or personal satisfaction. The main reason people invest time in understanding Big Data is to derive value from it.
Why Big data Analytics?
• NEED FOR DATA ANALYTICS
– Making organizations smarter and more efficient
– Optimizing business operations by analyzing customer behavior
– Cost reduction
– Next-generation products (self-driving cars, sensor yoga mats)
What is DATA ANALYTICS?
• Examines large and different types of data to uncover hidden
patterns, correlations and insights
Types of DATA ANALYTICS
• Descriptive analytics – what has happened or is happening now, based on incoming data
– Google Analytics tool, Netflix
• Predictive analytics – what might happen in the future; possible outcomes
– Airlines
• Prescriptive analytics – analyzes the data and recommends one or more actions to be taken
– Automatic car driving, health care
• Diagnostic analytics – why did it happen
– Social media
Applications of big data analytics
DATA ANALYTICS LIFE CYCLE
• Big Data analysis differs from traditional data analysis primarily due to the volume, velocity and variety characteristics of the data being processed.
• To address the distinct requirements for performing analysis on Big Data, a step-by-step methodology is needed to organize the activities and tasks involved with acquiring, processing, analyzing and repurposing data.
DATA ANALYTICS LIFE CYCLE
• From a Big Data adoption and planning perspective, it is important that in addition to the
lifecycle, consideration be made for issues of training, education, tooling and staffing of a
data analytics team.
Business User – understands the domain area
Project Sponsor – provides requirements
Project Manager – ensures meeting objectives
Business Intelligence Analyst – provides business domain expertise based
on deep understanding of the data
Database Administrator ( DBA) – creates DB environment
Data Engineer – provides technical skills, assists data management and
extraction, supports analytic sandbox
Data Scientist – provides analytic techniques and modeling
The data analytic lifecycle is designed for Big Data problems and data science
projects.
The cycle is iterative to represent a real project
Work can return to earlier phases as new information is uncovered
Phase 1: Discovery
• The team should perform five main activities during this step of the discovery phase:
• Identify data sources: Make a list of data sources the team may need to test the initial
hypotheses outlined in this phase.
• Make an inventory of the datasets currently available and those that can be purchased
or otherwise acquired for the tests the team wants to perform.
• Capture aggregate data sources: This is for previewing the data and providing high-level
understanding.
• It enables the team to gain a quick overview of the data and perform further
exploration on specific areas.
• Review the raw data: Begin understanding the interdependencies among the data
attributes.
• Become familiar with the content of the data, its quality, and its limitations.
Phase 1: Discovery (cont.)
Evaluate the data structures and tools needed: The data type and structure
dictate which tools the team can use to analyze the data.
Scope the sort of data infrastructure needed for this type of problem: In
addition to the tools needed, the data influences the kind of infrastructure that's
required, such as disk storage and network capacity.
Unlike many traditional stage-gate processes, in which the team can advance only
when specific criteria are met, the Data Analytics Lifecycle is intended to
accommodate more ambiguity.
For each phase of the process, it is recommended to pass certain checkpoints
as a way of gauging whether the team is ready to move to the next phase of the
Data Analytics Lifecycle.
Phase 2: Data preparation
It requires the presence of an analytic sandbox (workspace), in which the team can work with data and perform analytics for the duration of the project.
The team needs to execute extract, load, and transform (ELT) or extract, transform, and load (ETL) processes to get data into the sandbox.
In ETL, users perform processes to extract data from a datastore, perform data
transformations, and load the data back into the datastore.
The ELT and ETL are sometimes abbreviated as ETLT. Data should be transformed
in the ETLT process so the team can work with it and analyze it.
Phase 2: Data preparation
Rules for Analytics Sandbox
When developing the analytic sandbox, collect all kinds of data there, as
team members need access to high volumes and varieties of data for a
Big Data analytics project.
This can include everything from summary-level aggregated data and structured data to raw data feeds and unstructured text data from call logs or web logs, depending on the kind of analysis the team plans to undertake.
A good rule is to plan for the sandbox to be at least 5–10 times the size of
the original datasets, partly because copies of the data may be created
that serve as specific tables or data stores for specific kinds of analysis in the
project.
Phase 2: Data preparation
Performing ETLT
As part of the ETLT step, it is advisable to make an inventory of the data and compare
the data currently available with datasets the team needs.
Performing this sort of gap analysis provides a framework for understanding
which datasets the team can take advantage of today and where the team needs to
initiate projects for data collection or access to new datasets currently unavailable.
A component of this subphase involves extracting data from the available
sources and determining data connections for raw data, online transaction
processing (OLTP) databases, online analytical processing (OLAP) cubes,
or other data feeds.
Data conditioning refers to the process of cleaning data, normalizing datasets, and performing
transformations on the data.
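Data conditioning is easiest to see in code. The short Python sketch below illustrates cleaning, normalizing, and transforming a raw feed with pandas before it is used in the sandbox; the records and column names are invented stand-ins, not part of the course material.

# Illustrative sketch of data conditioning with pandas; the raw records and
# the column names below are invented stand-ins for a real sandbox feed.
import pandas as pd

raw = pd.DataFrame({
    "customer_id":  [101, 101, 102, None, 104],
    "session_time": [30.0, 30.0, 45.0, 12.0, 220.0],   # seconds
})

# Cleaning: drop exact duplicates and rows missing a key attribute.
clean = raw.drop_duplicates().dropna(subset=["customer_id"])

# Normalizing: standardize a numeric column to zero mean and unit variance.
clean["session_time_z"] = (
    clean["session_time"] - clean["session_time"].mean()
) / clean["session_time"].std()

# Transforming: derive an analysis-ready attribute for later modeling.
clean["is_long_session"] = clean["session_time"] > clean["session_time"].median()
print(clean)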
Common Tools for the Data Preparation Phase
Several tools are commonly used for this phase:
Hadoop can perform massively parallel ingest and custom analysis for web traffic analysis, GPS
location analytics, and combining of massive unstructured data feeds from multiple sources.
Alpine Miner provides a graphical user interface (GUI) for creating analytic workflows,
including data manipulations and a series of analytic events such as staged data-mining
techniques (for example, first select the top 100 customers, and then run descriptive statistics
and clustering).
OpenRefine (formerly called Google Refine) is “a free, open source, powerful tool for working with messy data.” It is a GUI-based tool for performing data transformations, and it is one of the most robust free tools currently available.
Similar to OpenRefine, Data Wrangler is an interactive tool for data cleaning and transformation.
Wrangler was developed at Stanford University and can be used to perform many transformations
on a given dataset.
Phase 3: Model Planning
Phase 3 is model planning, where the team determines the methods, techniques, and workflow it intends to follow for the subsequent model building phase.
The team explores the data to learn about the relationships between variables and subsequently selects key variables and the most suitable models.
During this phase the team refers to the hypotheses developed in Phase 1, when they first became acquainted with the data and began understanding the business problems or domain area.
Common Tools for the Model Planning Phase
Here are several of the more common ones:
R has a complete set of modeling capabilities and provides a good environment for
building interpretive models with high-quality code. In addition, it has the ability to interface
with databases via an ODBC connection and execute statistical tests.
SQL Analysis services can perform in-database analytics of common data mining functions, involving aggregations, and basic predictive models.
SAS/ACCESS provides integration between SAS and the analytics sandbox via multiple data connectors such as ODBC, JDBC, and OLE DB. SAS itself is generally used on file extracts, but with SAS/ACCESS, users can connect to relational databases (such as Oracle or Teradata).
Phase 4: Model Building
In this phase the data science team needs to develop datasets for training, testing, and production purposes. These datasets enable the data scientist to develop the analytical model and train it ("training data"), while holding aside some of the data ("hold-out data" or "test data") for testing the model (see the sketch after the tool list below).
In addition, in this phase the team builds and executes models based on the work done in the model planning phase.
The team also considers whether its existing tools will be sufficient for running the models, or if it will need a more robust environment for executing models and workflows (for example, fast hardware and parallel processing, if applicable).
Free or open source tools: R and PL/R, Octave, WEKA, Python
Commercial tools: MATLAB, STATISTICA.
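A minimal Python sketch of splitting data into training and hold-out (test) sets follows; the synthetic data and the choice of logistic regression are illustrative assumptions, not the course's prescribed method.

# Sketch of building training and hold-out (test) datasets for Phase 4.
# Synthetic data stands in for a real analytic data set.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Stand-in for the analytic data set (1,000 rows, 20 candidate variables).
X, y = make_classification(n_samples=1000, n_features=20, random_state=42)

# Hold aside 30% of the data as test ("hold-out") data for evaluating the model.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# Train on the training data only, then check performance on the hold-out data.
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("Hold-out accuracy:", model.score(X_test, y_test))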
Phase 5: Communicate Results
In Phase 5, after executing the model, the team needs to compare the outcomes of the modeling to the criteria established for success and failure.
The team considers how best to articulate the findings and outcomes to the various team members and stakeholders, taking into account caveats, assumptions, and any limitations of the results.
The team should identify key findings, quantify the business value, and
develop a narrative to summarize and convey findings to stakeholders.
Phase 6: Operationalize
In the final phase (Phase 6, Operationalize), the team communicates the benefits of the project more broadly and sets up a pilot project to deploy the work in a controlled way before broadening the work to a full enterprise or ecosystem of users.
This approach enables the team to learn about the performance and related constraints of the model in a production environment on a small scale and make adjustments before a full deployment.
The team delivers final reports, briefings, code, and technical documents.
In addition, the team may run a pilot project to implement the models in a
production environment.
Common Tools for the Model Building Phase
Free or Open Source tools:
R and PL/R – R was described earlier in the model planning phase, and PL/R is a procedural language for PostgreSQL with R. Using this approach means that R commands can be executed in-database.
Octave , a free software programming language for computational modeling, has
some of the functionality of Matlab. Because it is freely available, Octave is used in
major universities when teaching machine learning.
WEKA is a free data mining software package with an analytic workbench. The
functions created in WEKA can be executed within Java code.
Python is a programming language that provides toolkits for machine learning
and analysis, such as scikit-learn, numpy, scipy, pandas, and related data
visualization using matplotlib.
SQL in-database implementations, such as MADlib, provide an alternative to in-memory desktop analytical tools.
MADlib provides an open-source machine learning library of algorithms that can be executed in-database, for PostgreSQL or Greenplum.
Key outputs for each of the main stakeholders
Key outputs for each of the main stakeholders of an analytics project and what they usually expect
at the conclusion of a project.
Business User typically tries to determine the benefits and implications of the findings to the
business.
Project Sponsor typically asks questions related to the business impact of the project, the risks
and return on investment (ROI), and the way the project can be evangelized within the
organization (and beyond).
Project Manager needs to determine if the project was completed on time and within budget and
how well the goals were met.
Business Intelligence Analyst needs to know if the reports and dashboards he manages
will be impacted and need to change.
Data Engineer and Database Administrator (DBA) typically need to share their code from
the analytics project and create a technical document on how to implement it.
Data Scientist needs to share the code and explain the model to her peers, managers, and other
stakeholders.
• CHALLENGES & LIMITATIONS OF BIG DATA ANALYTICS
The analytics platform must support functions for processing data
Open source – advantages and disadvantages
Platform evaluation – availability, continuity, ease of use, scalability, privacy, quality
The lag between data collection and processing has to be addressed
Dynamic availability of algorithms and models, necessary at large scale
Key issues – ownership, governance
Continuous acquisition and cleaning
Appliance-driven approach (mobile, wireless)
Need for synchronization across disparate data sources
Acute shortage of professionals who understand big data analysis
Getting meaningful insights through the use of big data analytics
Getting voluminous data into the big data platform
Uncertainty of data management landscape
Data storage and quality
Security and privacy of data
EVOLUTION OF ANALYTIC SCALABILITY
• The world of big data requires new levels of scalability.
• As the amount of data organizations process continues to increase,
the same old methods for handling data just won’t work anymore.
• Analytic professionals had to pull all their data together into a
separate analytics environment to do analysis.
• Analysts do what is called “data preparation.”
• In this process, they pull data from various sources and merge it
all together to create the variables required for an analysis.
• RDBMS and enterprise data warehouses, massively parallel processing (MPP) databases, cloud architectures, and MapReduce are all powerful tools to aid in attacking big data.
• Analytical massively parallel processing (MPP) databases are databases that are
optimized for analytical workloads:
– aggregating and processing large datasets.
– MPP databases tend to be columnar, so rather than storing each row in a table as an object (a feature of transactional databases), MPP databases generally store each column as an object.
– This architecture allows complex analytical queries to be processed much more
quickly and efficiently.
• Massively parallel processing as a term refers to the fact that tables loaded into
these databases are distributed across each node in a cluster,
– when a query is issued, every node works simultaneously to process the data that resides
on it.
• Advantages:
– 1.Performance
– 2.Scalability and Concurrency
• Disadvantages
– 1.Complexity
– 2.Distribution of data
– 3.Downtime
– 4.Lack of Elasticity
• A cloud database is a database that typically runs
– on a cloud computing platform, and
– access to the database is provided as-a-service.
• Database services take care of scalability and high availability of the database.
• A database service built and accessed through a cloud platform
• Enables enterprise users to host databases without buying dedicated hardware
• Can be managed by the user or offered as a service and managed by a provider
• Can support relational databases (including MySQL and PostgreSQL) and NoSQL databases (including MongoDB and Apache CouchDB)
• Accessed through a web interface or vendor-provided API
• ADVANTAGES:
– Ease of access
– Scalability
– Disaster recovery
• DISADVANTAGES
– Control options
– Security
– Maintenance
GRID
• Instead of having a single high-end server (or maybe a few of
them), a large number of lower-cost machines are put in place.
• As opposed to having one server managing its CPU and resources across jobs, jobs are parceled out individually to the different machines to be processed in parallel.
• Each machine may only be able to handle a fraction of the work of the original server and can potentially handle only one job at a time.
Advantages
- can solve larger, more complex problems in a shorter time
– Easier to collaborate with other organizations
– Make better use of existing hardware
Disadvantages
– Grid software and standards are still evolving
– Learning curve to get started
– Non-interactive job submission
EVOLUTION OF ANALYTIC PROCESSES
• A sandbox is ideal for data exploration, analytic development, and prototyping.
– It should not be used for ongoing or production processes.
• There are several types of sandbox environments,
– including internal, external, and hybrid sandboxes.
– Each can be augmented with a mapreduce environment to help handle big data sources.
• An Analytic Data Set (ADS) is a set of data, at the level of analysis (for example customer, location, product, or supplier), that is pulled together in order to create an analysis or model.
– It is data in the format required for the specific analysis at hand.
• An ADS is generated by transforming, aggregating, and combining data. It is going to mimic a denormalized, or flat file, structure.
• There are two primary kinds of analytic data sets.
A development ADS is the data set used to build an analytic process.
• It will have all the candidate variables that may be needed to solve a problem and
will be very wide.
• A development ADS might have hundreds or even thousands of variables or metrics
within it.
• A production analytic data set, however, is what is needed for
scoring and deployment.
• It’s going to contain only the specific metrics that were actually in
the final solution.
• Scoring can be embedded via SQL, user-defined function, embedded
process, or PMML.
• Model and score management procedures will need to be in place to
truly scale the use of models by an organization.
• The four primary components of a model and score management
system are
– analytic data set inputs,
– model definitions,
– model validation and reporting,
– model scoring output.
EVOLUTION OF ANALYTIC TOOLS AND METHODS
• Ensemble methods leverage the concept of the wisdom of the crowd.
Combining estimates from many approaches can lead to a better answer than
the individual approaches alone.
• Commodity models aim for a good-enough model quickly and in a mostly
automated fashion
– allow the expansion of modeling to lower-value problems, as well as problems where
too many models are needed to manually intervene on them all.
– User interfaces should be used as productivity enhancers for analytic professionals
• Text analysis has become a very important topic in the era of big
data, and methods for addressing text data are advancing rapidly
and being applied widely.
• A huge challenge in text analysis is the fact that words alone don’t
tell the entire story. Emphasis, tone, and inflection all come into
play yet are not captured in text
• R is an open-source analytic tool for STATISTICAL COMPUTING that has
experienced increased adoption in recent years.
• An advantage of R is the speed with which new algorithms are added to the
software. A disadvantage of R is its current lack of enterprise-level scalability.
• R is a descendant of the original S, an early language for statistical analysis that was developed decades ago.
• The name R appears to derive both from the software being a successor to S and from the fact that the names of its original authors, Robert Gentleman and Ross Ihaka, begin with R.
• It is far easier to see a pattern than it is to explain it or pull it out of
a bunch of spreadsheet data.
• Visualization tools allow database connections, intertwined and
interactive graphics, and more visualization options than traditional
charting tools.
CONCEPTUAL ARCHITECTURE OF BIG DATA ANALYTICS
• MIDDLEWARE –
• Service-Oriented Architecture (SOA)- software design where
services are provided to the other components by application
components, through a communication protocol over a network.
• Independent of vendors and other technologies.
• Implemented with web services, which makes the “functional
building blocks accessible over standard internet protocols.”
• SOAP (Simple Object Access Protocol)
DATA WAREHOUSE:
• Also called an enterprise data warehouse
• Central repositories of integrated data from one or more disparate
sources
• Stores current and historical data
• Used for creating analytical reports
EXTRACT, TRANSFORM, LOAD (ETL)
• Used to blend data from multiple sources.
• Build a data warehouse.
• During this process, data is taken (extracted) from a source system, converted
(transformed) into a format that can be analyzed, and stored (loaded) into a
data warehouse or other system.
• Extract, Load, Transform (ELT) is an alternate but related approach designed
to push processing down to the database for improved performance.
• ETL and ELT are both important parts of an organization’s broader data
integration strategy.
• Advanced ETL tools can load and convert structured and unstructured data into Hadoop.
• These tools read and write multiple files in parallel from and to Hadoop, simplifying how data is merged into a common transformation process.
• Some ETL tools also support integration across transactional systems, operational data stores, BI platforms, master data management hubs and the cloud.
HADOOP
• Open-source software framework that serves twin roles as a data organizer and an analytics tool.
• Framework used for storing and processing big data in a distributed manner on large clusters of commodity hardware.
• Belongs to the class of NoSQL technologies.
• Hadoop is licensed under the Apache v2 license.
• Hadoop was developed based on the paper written by Google on the MapReduce system.
• Hadoop is written in the Java programming language.
• Hadoop was developed by Doug Cutting and Michael J. Cafarella.
TWO MAJOR MODULES IN HADOOP
1. HADOOP DISTRIBUTED FILE SYSTEM (HDFS)
– Storage for the Hadoop cluster.
– HDFS breaks data into smaller parts and
• redistributes the parts among the different servers (nodes) engaged in the cluster.
• Only a small chunk of the entire data set resides on each server/node, and it is conceivable each chunk is duplicated on other servers/nodes.
2. MAPREDUCE
• The primary component of an algorithm maps the broken-up tasks (e.g., calculations) to the various locations in the distributed file system
• and consolidates the individual results (the reduce step) that are computed at the individual nodes of the file system.
• In summary, the data mining algorithm performs computations at the server/node level and simultaneously across the overall distributed system to aggregate the individual outputs.
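The map and reduce steps can be illustrated with a toy, single-process Python word count; a real Hadoop job distributes exactly these two steps across the cluster's nodes.

# Toy, single-process illustration of the MapReduce idea: a map step emits
# (key, value) pairs and a reduce step consolidates the values per key.
from collections import defaultdict

documents = ["big data needs big tools", "data tools for big data"]

# Map: each "node" turns its chunk of input into (word, 1) pairs.
mapped = []
for doc in documents:
    for word in doc.split():
        mapped.append((word, 1))

# Shuffle: group intermediate pairs by key.
grouped = defaultdict(list)
for word, count in mapped:
    grouped[word].append(count)

# Reduce: consolidate the values for each key into a final result.
word_counts = {word: sum(counts) for word, counts in grouped.items()}
print(word_counts)   # {'big': 3, 'data': 3, ...}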
• Hadoop is neither a programming language nor a service; it is a platform or framework which solves Big Data problems.
• It encompasses a number of services for
– ingesting,
– storing
– analyzing huge data sets
– along with tools for configuration management.
HIVE
• Hive is a Hadoop support architecture that leverages SQL with the Hadoop platform.
• It permits SQL programmers to develop Hive Query Language (HQL) statements akin to typical SQL statements.
• HQL is limited in the commands it recognizes. Ultimately, HQL statements are decomposed by the Hive service into MapReduce tasks and executed across a Hadoop cluster of servers/nodes.
• Because Hive depends on Hadoop and MapReduce executions, queries may have processing lag times of up to several minutes.
• Hive may not be suitable for big data analytics applications that need rapid response times.
Mahout
• Mahout – an Apache project
• Its goal is to provide free implementations of distributed and scalable machine learning algorithms
– classification,
– clustering,
– collaborative filtering
• Companies such as Adobe, Facebook, LinkedIn, Foursquare, Twitter, and Yahoo use Mahout internally.
PIG – SIMPLIFIES MAPREDUCE
• Originally developed by Yahoo
• Assimilates all types of data
• Two key modules – the language itself, called Pig Latin, and the runtime environment in which Pig Latin is executed.
• Advantage – lets developers focus more on the big data analytics and less on developing the mapper and reducer code
HBASE
• HBase is a column-oriented database management system that sits on top of HDFS.
• It is a NoSQL store modeled on Google's Bigtable.
• Applications are developed using Java, much like MapReduce applications.
• A master node manages the cluster in HBase, and region servers store parts of the tables and execute the tasks on the big data.
• HBase is built for low-latency operations.
• HBase is used extensively for random read and write operations.
• HBase stores a large amount of data in terms of tables.
• Strictly consistent read and write operations.
• Automatic and configurable sharding of tables.
ZOOKEEPER-coordination
• Apache project – enables highly reliable distributed coordination.
• Centralized service for maintaining
– configuration information, naming, distributed synchronization, and group services.
– With these services, applications can embed ZooKeeper rather than duplicating them or constructing them all over again.
• Interfacing with ZooKeeper currently happens via Java or C interfaces.
OOZIE-SCHEDULER
• Oozie is a workflow scheduler system to manage Apache Hadoop jobs.
• Oozie workflow jobs are Directed Acyclic Graphs (DAGs) of actions.
• Oozie coordinator jobs are recurrent Oozie workflow jobs triggered by time (frequency) and data availability.
• Oozie is a scalable, reliable and extensible system.
Solr and LUCENE- SEARCH
• Solr (pronounced "solar") is an open-source enterprise-search platform, written in Java.
• Features include full-text search, hit highlighting, faceted search, real-time indexing, dynamic clustering, database integration, NoSQL features and rich document (e.g., Word, PDF) handling.
• Solr runs as a standalone full-text search server.
• It uses the Lucene Java search library at its core for full-text indexing and search,
• and has REST-like HTTP/XML and JSON APIs that make it usable from most popular programming languages.
• It can be tailored to many types of applications without Java coding, and it has a plugin architecture to support more advanced customization.
KAFKA
• A highly available, high-throughput, distributed message broker.
• Handles real-time data feeds.
• Developed by LinkedIn and open sourced in January 2011.
• Used by Yahoo, Twitter, and Spotify, among others.
• Kafka stores basic metadata in ZooKeeper, such as information about topics (message groups), brokers (list of Kafka cluster instances), messages' consumers (queue readers) and so on.
• To start a single-node Kafka broker, a single-node ZooKeeper instance and a Kafka broker ID are needed.
STORM
• A distributed real-time computation system.
• It is often compared with Apache Hadoop.
• Storm was initially developed by BackType, which was then acquired by Twitter.
• Open sourced by Twitter in September 2011.
• Like Kafka, Storm is also used by Yahoo, Twitter, Spotify and many others.
• Also requires ZooKeeper.
• Two components:
– Storm Nimbus – similar to the master node (JobTracker) in Hadoop
– Storm Supervisor – worker node
• A Storm topology consists of streams handled by two components: "spouts" and "bolts".
– Spout – a source of streams.
– Bolt – a stream data processing entity, which possibly emits new streams.
• For example,
– Spout – connected to Twitter, outputs a stream of tweets
– Bolt – consumes this stream and outputs trending topics
SPARK
• Apache Spark is an open-source, distributed processing system used for big data workloads.
• It utilizes in-memory caching and optimized query execution for fast queries against data of any size.
• Simply put, Spark is a fast and general engine for large-scale data processing:
• running distributed SQL, creating data pipelines, ingesting data into a database, running machine learning algorithms, working with graphs or data streams, and much more.
• GraphX – a component to manipulate graph databases and perform graph computations. GraphX unifies the ETL (extract, transform, and load) process, exploratory analysis, and iterative graph computation within a single system.
• MLlib (Machine Learning Library) – a rich library with a wide array of machine learning algorithms: classification, regression, clustering, and collaborative filtering. It also includes other tools for constructing, evaluating, and tuning ML pipelines.
• Spark Streaming – this component allows Spark to process real-time streaming data.
• Spark SQL – Apache Spark's module for working with structured data.
• Apache Spark Core – Spark Core is the underlying general execution engine for the Spark platform that all other functionality is built upon. It provides in-memory computing and referencing of datasets in external storage systems.
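A hedged PySpark sketch of Spark Core and Spark SQL follows; the tiny in-memory sales records and column names are invented for illustration, and in practice the data would be read from HDFS, S3, or a database.

# Sketch of the Spark DataFrame API (Spark Core + Spark SQL) with PySpark.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("SalesSummary").getOrCreate()

# Create a distributed DataFrame (normally read from HDFS, S3, a database, etc.).
sales = spark.createDataFrame(
    [("North", 100.0), ("South", 150.0), ("North", 120.0), ("South", 130.0)],
    ["region", "amount"],
)

# Run a distributed aggregation; Spark plans and executes it across the cluster.
summary = (
    sales.groupBy("region")
         .agg(F.sum("amount").alias("total_amount"),
              F.count("*").alias("orders"))
)
summary.show()
spark.stop()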
YARN
• YARN stands for "Yet Another Resource Negotiator".
• It was introduced in Hadoop 2.0 by Yahoo to remove the bottleneck on the JobTracker (which handled both processing and resource management functions) present in Hadoop 1.0.
• It has now evolved into a large-scale distributed operating system used for big data processing.
• The basic idea behind YARN is to relieve MapReduce by taking over the responsibility of resource management and job scheduling.
• YARN gives Hadoop the ability to run non-MapReduce jobs within the Hadoop framework.
• Resource Manager – two components
– Scheduler – responsible for allocating resources
– Application Manager – accepts job submissions
• Node Manager
– Manages the nodes
• Application Master
– Coordinates an application's execution in the cluster and also manages faults.
– Its task is to negotiate resources from the Resource Manager and work with the Node Manager to execute and monitor the component tasks.
• Container
– A collection of physical resources such as RAM, CPU cores, and disks on a single node.
– Managed by a container launch context; this record contains
• a map of environment variables,
• dependencies stored in remotely accessible storage,
• security tokens, payload for Node Manager services, and the command necessary to create the process.
– It grants an application the right to use a specific amount of resources (memory, CPU, etc.) on a specific host.
• Sqoop − “SQL to Hadoop and Hadoop to SQL”
• Sqoop is a tool designed to transfer data between Hadoop and
relational database servers.
• It is used to import data from relational databases such as MySQL,
Oracle to Hadoop HDFS, and export from Hadoop file system to
relational databases.
• It is provided by the Apache Software Foundation.
REPORTING AND ANALYTICS
• The end result of an analytic endeavor is to extract/generate information to
provide a resource to enhance the decision-making process.
• 1. Spreadsheet applications (also facilitated by vendor software packages)
– A. Data/variable calculations, sorting, formatting, organizing
– B. Distribution analysis and statistics (max, min, average, median,
percentages, etc.)
– C. Correlation calculation between variables
– D. Linear and goal programming (optimization)
– E. Pivot tables
• 2. Business intelligence
– A. Query and report creating
– B. Online analytic processing
– C. Dashboards
• 3. Multivariate analysis (also part of business intelligence)
– A. Regression (hypothesis approach)
– B. Data mining applications (data-driven information creation)
– Neural networks
– Clustering
– Segmentation classification
– Real-time mining
• 4. Analysis of unstructured data
• A. Text mining
• 5. Six sigma
• 6. Visualization
OLAP
• OLAP stands for Online Analytical Processing.
• OLAP performs multidimensional analysis of business data:
• complex calculations, trend analysis, and sophisticated data modeling.
• OLAP does not use a two-dimensional, row-by-column format like a worksheet; instead it uses multidimensional database structures known as cubes.
• Cubes hold consolidated information.
• The data and formulas are stored in an optimized multidimensional database,
• while views of the data are created on demand.
• Roll up – performs aggregation
• Drill down – reverse operation of roll up
• Slice - selects one particular dimension from a given cube and
provides a new sub-cube
• Dice – selects two or more dimensions from a given cube and provides a new sub-cube.
• Pivot (rotation) – rotates the data axes in view in order to provide an alternative presentation of the data.
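These OLAP operations can be mimicked on a small in-memory dataset with pandas, as in the sketch below; the sales figures are made up for illustration.

# Sketch of OLAP-style operations on an in-memory "cube" using pandas.
import pandas as pd

cube = pd.DataFrame({
    "year":    [2022, 2022, 2022, 2023, 2023, 2023],
    "region":  ["North", "South", "North", "South", "North", "South"],
    "product": ["A", "A", "B", "B", "A", "B"],
    "sales":   [100, 150, 120, 130, 170, 160],
})

# Roll up: aggregate sales up to the year level.
rollup = cube.groupby("year")["sales"].sum()

# Drill down: break the yearly totals back down by region.
drilldown = cube.groupby(["year", "region"])["sales"].sum()

# Slice: fix one dimension (region == "North") to get a sub-cube.
slice_north = cube[cube["region"] == "North"]

# Dice: fix two or more dimensions to get a smaller sub-cube.
dice = cube[(cube["region"] == "North") & (cube["product"] == "A")]

print(rollup, drilldown, slice_north, dice, sep="\n\n")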
[Figure: OLAP operations – roll-up, drill-down, slice, and dice]
DATA MINING
• Data mining (knowledge discovery from data)
– Extraction of interesting (non-trivial, implicit, previously unknown and
potentially useful)
– Patterns or knowledge from huge amount of data
– Data mining: a misnomer?
• Alternative names
– Knowledge discovery (mining) in databases (KDD),
– Knowledge extraction,
– Data/pattern analysis, data archeology, data dredging,
– Information harvesting, business intelligence, etc.
• Involving methods at the intersection of machine learning, statistics,
and database systems.[1]
• Data mining is an interdisciplinary subfield of computer science and
statistics
• Automatic discovery of patterns
• Prediction of likely outcomes
• Creation of actionable information
• Focus on large data sets and databases
• Data mining can answer questions that cannot be addressed
through simple query and reporting techniques
• Two major sources of valuable information:
• The first refers to descriptive information, or the identification of why things may be occurring in a business process:
– the identification of recurring patterns between variables.
• The second refers to predictive information: the patterns that have been identified are often embedded in an equation or algorithmic function, often referred to as the model,
– which is used to perform "what if" analysis or estimate future expected results based on inputs.
PIVOT TABLE
• A pivot table is a table of statistics that summarizes the data of a more extensive table
• from a database, spreadsheet, or business intelligence program.
• It shows sums, averages, or other statistics, which the pivot table groups together in a meaningful way.
• Excel pivot tables include the feature to directly query an online analytical processing (OLAP) server for retrieving data instead of getting the data from an Excel spreadsheet.
• In this configuration a pivot table is a simple client of an OLAP server.
• Excel's pivot table not only allows connecting to Microsoft's Analysis Services, but to any XML for Analysis (XMLA)-compliant source.
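A minimal pandas sketch of building a pivot table follows; the region/quarter/sales data is invented for illustration.

# Sketch of a pivot table built with pandas.
import pandas as pd

df = pd.DataFrame({
    "region":  ["North", "North", "South", "South"],
    "quarter": ["Q1", "Q2", "Q1", "Q2"],
    "sales":   [100, 120, 150, 130],
})

# Summarize sales with regions as rows, quarters as columns, plus grand totals.
pivot = pd.pivot_table(df, values="sales", index="region",
                       columns="quarter", aggfunc="sum", margins=True)
print(pivot)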
DASHBOARDS
• A data dashboard is an information management tool that visually tracks, analyzes and displays key performance indicators (KPIs), metrics and key data points to monitor the health of a business, department or specific process.
• Dashboards are a data visualization tool that allow all users to understand
the analytics that matter to their business, department or project.
• A dashboard is a business intelligence tool used to display data
visualizations in a way that is immediately understood.
• Answer important questions about your business.
• By monitoring multiple KPIs and metrics on one central dashboard, users
can make adjustments to their business practices
• Help decision makers, executives and senior leaders, establish targets, set goals
and understand what and why something happened with the same information
they can use to implement appropriate changes.
• An analytical dashboard does this based on insights from data collected over a period of time determined by the user (e.g., last month, quarter or year).
• Benefits of dashboards:
– Total visibility into your business
– Big time savings
– Improved results
– Reduced stress
– Increased productivity
– Increased profits: as discussed, your dashboard shows you exactly which areas of your business are performing poorly.
Statistics
“Statistics is the methodology which scientists and mathematicians have developed for interpreting and drawing conclusions from collected data.”
– It is both a science and a tool.
The scientific discipline of statistics brings sophisticated techniques and
models to bear on the issues in big data analytics.
Statistical analysis may be used to:
• Present key findings revealed by a dataset.
• Summarize information.
• Calculate measures of cohesiveness, relevance, or diversity in data.
• Make future predictions based on previously recorded data.
• Test experimental predictions.
• Consists of a body of methods for collecting and analyzing data.
• It provides methods for-
– Design- planning and carrying out research studies
– Description- summarizing and exploring data
– Inference- making predictions & generalizing about phenomena
represented by data.
TYPES OF STATISTICS
• 2 major types of statistics
• Descriptive statistics- It consists of methods for organizing and summarizing
information.
– Includes- graphs, charts, tables & calculation of averages, percentiles
• Inferential statistics – It consists of methods for drawing, and measuring the reliability of, conclusions about a population based on information obtained from a sample.
– Includes- point estimation, interval estimation, hypothesis testing.
• Both are interrelated. Necessary to use methods of descriptive statistics to organize and
summarize the information obtained before methods of inferential statistics can be used.
• Basic concepts in statistics.
• Population- It is the collection of all individuals or items under
consideration in a statistical study
• Sample- It is the part of the population from which information is collected.
• Population always represents the target of an investigation.
• We learn about population by sampling from the collection.
• Parameters- used to summarize the features of the population under
investigation.
• Statistic – it describes a characteristic of the sample, which can then be used to make inferences about unknown parameters.
VARIABLES & TYPES
• Variable- a characteristic that varies from one person or thing to another.
• Types- Qualitative/ Quantitative, Discrete/ Continuous, Dependent/
Independent
• Qualitative data – variables which yield non-numerical data.
– Eg- sex, marital status, eye colour
• Quantitative data- the variables that yield numerical data
– Eg- height, weight, number of siblings.
• Discrete variable- the variable has only a countable number of distinct possible
values.
– Eg- number of car accidents, number of children
• Continuous variable – the variable has divisible units.
– Eg- weight, length, temperature.
• Independent variable- variable is not dependent on other variable.
– Eg- age, sex.
• Dependent variable- depends on the independent variable.
– Eg- weight of a newborn, stress
• Variables can also be described according to the scale on which they are defined.
• Nominal scale- the categories are merely names. They do not have a natural
order.
• Eg- male/female, yes/no
• Ordinal scale- the categories can be put in order. But the difference between the
two may not be same as other two.
– Eg- mild/ Moderate/ Severe
• Interval scale – offers labels, order, as well as a specific interval between each of its variable options.
– Eg- temperature, time
• Ratio scale – the variable has an absolute zero, and differences between variables are comparable.
– Eg- stress using PSS, insomnia using ISI
• Nominal & Ordinal scales are used to describe Qualitative data.
• Interval & Ratio scales are used to describe Quantitative data.
• Qualitative data-
– Frequency- number of observations falling into particular class/ category of
the qualitative variable.
– Frequency distribution- table listing all classes & their frequencies.
– Graphical representation- Pie chart, Bar graph.
– Nominal data best displayed by pie chart
– Ordinal data best displayed by bar graph
• Quantitative data-
– Can be presented by a frequency distribution.
– If the discrete variable has a lot of different values, or if the data is a continuous
variable then data can be grouped into classes/ categories.
– Class interval- covers the range between maximum & minimum values.
– Class limits- end points of class interval.
– Class frequency- number of observations in the data that belong to
each class interval.
– Usually presented as a Histogram or a Bar graph.
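As a quick illustration, the Python sketch below builds a frequency distribution for a qualitative variable and class-interval frequencies for a quantitative one; the example values and class limits are assumptions for illustration only.

# Sketch of frequency distributions with pandas.
import pandas as pd

# Qualitative variable: frequency of each category (suited to a pie or bar chart).
eye_colour = pd.Series(["brown", "blue", "brown", "green", "brown", "blue"])
print(eye_colour.value_counts())            # frequency distribution

# Quantitative variable: group values into class intervals, then count them.
weights = pd.Series([52, 61, 58, 70, 66, 73, 49, 55, 68, 77])
classes = pd.cut(weights, bins=[45, 55, 65, 75, 85])   # class limits
print(classes.value_counts().sort_index())  # class frequencies for a histogram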
BASIC TERMINOLOGY
• Population: A population consists of the totality of the observation, with which we are
concerned.
• Sample: A sample is a subset of a population.
A parameter is an unknown numerical summary of the population.
A statistic is a known numerical summary of the sample which can be used to make inferences about parameters.
A statistic describes a sample,
while a parameter describes the population from which the sample was
taken.
Why sample?
Lower cost
More accuracy of results
High speed of data collection
Availability of Population elements.
Less field time
When it's impossible to study the whole population
•The sample must be:
1. representative of the population;
2. appropriately sized (the larger the better);
3. unbiased;
4. random (selections occur by chance);
• Merits of Sampling
Size of population
Fund required for the study
Facilities
Time
TYPES OF SAMPLING
•Probability sample – a method of sampling that uses random selection so that all units/cases in the population have an equal probability of being chosen.
•Non-probability sample – does not involve random selection, and its methods are not based on the rationale of probability.
Probabilistic (Random samples)
Simple random sample
Systematic random sample
Stratified random sample
Cluster sample
Types of Non-Probabilistic sampling
Convenience samples (ease of access)
sample is selected from elements of a population that are easily accessible
Purposive sample (Judgmental Sampling)
you choose who you think should be in the study
Quota Sampling
Snowball Sampling (friend of a friend, etc.)
SIMPLE RANDOM SAMPLING
• Applicable when population is small, homogeneous & readily available
• All subsets of the frame are given an equal probability. Each element of the
frame thus has an equal probability of selection. A table of random number or
lottery system is used to determine which units are to be selected.
•Advantage
Easy method to use
No need of prior information of population
Equal and independent chance of selection to every element
•Disadvantages
If the sampling frame is large, this method is impracticable.
May not represent subgroups of the population proportionately.
Systematic Sampling
– Similar to a simple random sample, but no table of random numbers is used – elements are selected directly from the sampling frame at a fixed interval determined by the ratio between population size and sample size.
•ADVANTAGES:
Sample easy to select
Suitable sampling frame can be identified easily
Sample evenly spread over entire reference population
Cost effective
•DISADVANTAGES:
The sample may be biased if a hidden periodicity in the population coincides with that of the selection.
Each element does not get an equal chance of selection;
all elements between two selected elements are ignored.
STRATIFIED SAMPLING
• The population is divided into two or more groups called strata,
according to some criterion, such as geographic location, grade level,
age, or income, and subsamples are randomly selected from each
strata.
•Stratified random sampling can be classified into
a. Proportionate stratified sampling
– It involves drawing a sample from each stratum in proportion to the latter's share in the total population (see the sketch after this slide).
b. Disproportionate stratified sampling
– Proportionate representation is not given to the strata; it necessarily involves giving over-representation to some strata and under-representation to others.
•Advantage :
Enhancement of representativeness to each sample
Higher statistical efficiency
Easy to carry out
Disadvantage:
Classification error
Time consuming and expensive
Prior knowledge of composition and of distribution of population
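A short Python sketch of proportionate stratified sampling follows; the population, the "grade" stratum column, and the 10% sampling fraction are illustrative assumptions.

# Sketch of proportionate stratified sampling with pandas.
import pandas as pd

population = pd.DataFrame({
    "student_id": range(1, 101),
    "grade":      ["UG"] * 70 + ["PG"] * 30,   # the stratifying variable
})

# Draw 10% from each stratum so every stratum keeps its share of the population.
stratified_sample = (
    population.groupby("grade", group_keys=False)
              .sample(frac=0.10, random_state=1)
)
print(stratified_sample["grade"].value_counts())   # 7 UG rows and 3 PG rows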
CLUSTER SAMPLING
Cluster sampling is an example of 'two-stage sampling' .
First stage a sample of areas is chosen;
Second stage a sample of respondents within those areas is selected.
Population divided into clusters of homogeneous units, usually
based on geographical contiguity.
Sampling units are groups rather than individuals.
A sample of such clusters is then selected.
All units from the selected clusters are studied.
The population is divided into subgroups (clusters) like families. A
simple random sample is taken of the subgroups and then all members of
the cluster selected are surveyed
Convenience Sampling
•Advantage: A sample selected for ease of access, immediately
known population group and good response rate.
•Disadvantage:
– cannot generalize findings (do not know what population group the sample
is representative of) so cannot move beyond describing the sample.
Problems of reliability
Do respondents represent the target population
Results are not generalizable
Use results that are easy to get
JUDGEMENTAL SAMPLING
- The researcher chooses the sample based on who they think
would be appropriate for the study. This is used primarily when
there is a limited number of people that have expertise in the
area being researched
Selected based on an experienced individual's belief
Advantages
• Based on the experienced person's judgment
Disadvantages
• Cannot measure the representativeness of the sample
QUOTA SAMPLING
The population is first segmented into mutually exclusive sub-groups, just as in stratified sampling.
Then judgment is used to select subjects or units from each segment based on a specified proportion.
In quota sampling the selection of the sample is non-random.
For example, interviewers might be tempted to interview those who look most helpful. The problem is that these samples may be biased because not everyone gets a chance of selection. This non-random element is its greatest weakness, and quota versus probability sampling has been a matter of controversy for many years.
Quota sampling
• Based on prespecified quotas regarding demographics, attitudes, behaviors, etc.
Advantages
• Contains specific subgroups in the proportions desired
• May reduce bias
• Easy to manage, quick
Disadvantages
• Dependent on subjective decisions
• Not possible to generalize
• Only reflects the population in terms of the quota; possibility of bias in selection; no standard error
SNOWBALL SAMPLING
Useful when a population is hidden or difficult to gain access to. The contact with an initial group is used to make contact with others.
• Respondents identify additional people to be included in the study
• The defined target market is small and unique
• Compiling a list of sampling units is very difficult
Advantages
• Identifying small, hard-to-reach, uniquely defined target populations
• Useful in qualitative research
• Access to difficult-to-reach populations (other methods may not yield any results)
Disadvantages
• Bias can be present
• Limited generalizability
• Not representative of the population; will result in a biased sample as it is self-selecting
Test of significance
• The test which is done for testing the research hypothesis against the null hypothesis.
Why is it done?
To assist administrators and clinicians in making decisions:
Is the difference real, or
has it happened by chance?
Null Hypothesis
• 1st step in testing any hypothesis.
• Set up such that it conveys a meaning that there exists
no difference between the different samples.
• Eg: Null Hypothesis – The mean pulse rate among the two
groups are same (or) there is no significant difference between
their pulse rates.
• By using various tests of significance we either:
–Reject the Null Hypothesis
(or)
–Accept the Null Hypothesis
• Rejecting null hypothesis → difference
is significant.
• Accepting null hypothesis → difference is
not significant.
Level of significance and confidence
• Significance means the percentage risk of rejecting a null hypothesis when it is true, and it is denoted by α. It is generally taken as 1%, 5%, or 10%.
• (1 − α) is the confidence level, the probability of accepting the null hypothesis when it is true.
Level of Significance – “P” Value
• p-value is a function of the observed sample results
(a statistic) that is used for testing a statistical hypothesis.
• It is the probability of obtaining results at least as extreme as those observed, assuming the null hypothesis is true. We can accept or reject the null hypothesis based on the P value.
• P = 0.05 implies,
– We may go wrong 5 out of 100 times by rejecting null
hypothesis.
– Or, We can attribute significance with 95%
confidence.
[Figure: 5% significance level and 95% confidence level – the total acceptance region (confidence level 1 − α = 95%) lies between two rejection regions, each with significance level α/2 = 0.025 (2.5%).]
Power of the test:
The probability of a Type II error is β, and 1 − β is called the power of the test:
the probability of rejecting a false H0, i.e. taking the correct decision.
Degrees of freedom (d.f.):
• The number of independent observations used in a statistic.
Various tests of significance
1. Parametric – the data is, or is assumed to be, normally distributed.
2. Non-parametric – the data is not normally distributed.
Parametric tests:
For Qualitative data:
1. Z test
2. Chi-square test (X2)
For Quantitative data (see the selection chart below):
– Unpaired 't' test
– Paired 't' test
Selection of the test
• Based on:
1. Type of data
2. Size of sample
3. Number of samples
Test selection chart:
– Qualitative data, large sample (n > 30), 1 or 2 samples → Z test
– Qualitative data, more than 2 samples → Chi-square test
– Quantitative data (small or large sample), 1 sample → One-sample t-test
– Quantitative data (small or large sample), 2 samples → Unpaired t-test / Paired t-test
– Quantitative data (small or large sample), more than 2 samples → ANOVA
• General procedure in testing a hypothesis
1. Set up a null hypothesis (HO).
2. Define alternative hypothesis (HA).
3. Calculate the test statistic (Z, X2, t etc.).
4. Find out the corresponding probability level (P Value) for
the calculated test statistic from relevant tables.
5. Accept or reject the Null hypothesis depending on P
value.
P > 0.05 → H0 accepted
P < 0.05 → H0 rejected
TEST OF HYPOTHESIS CONCERNING A NORMAL POPULATION, INFINITE, LARGE SAMPLES, WITH KNOWN POPULATION VARIANCE (Z TEST)
If the size of sample exceeds 30 it should be regarded as a
large sample.
a) Testing a hypothesis about the population mean μ:
Z = (x̄ − μ) / S.E.(x̄) = (x̄ − μ) / (σp / √n)
where x̄ = sample mean, σp = population standard deviation, n = sample size, and the population size is infinite.
ILLUSTRATION 2: ONE TAILED (UPPER TAILED)
An insurance company is reviewing its current policy rates.
When originally setting the rates they believed that the
average claim amount will be maximum Rs180000. They are
concerned that the true mean is actually higher than this,
because they could potentially lose a lot of money. They
randomly select 40 claims, and calculate a sample mean of
Rs195000. Assuming that the standard deviation of claims is
Rs50000 and set α= .05, test to see if the insurance company
should be concerned or not.
SOLUTION
Step 1: Set the null and alternative hypotheses
H0 : μ≤ 180000
H1 : μ > 180000
Step 2: Calculate the test statistic
z = (x̄ − μ) / (σ/√n) = (195000 − 180000) / (50000/√40) = 1.897
Step 3: Set the rejection region: z > 1.65
Step 4: Conclude
We can see that 1.897 > 1.65, so the test statistic is in the rejection region. Therefore we reject the null hypothesis. The insurance company should be concerned about their current policies.
ILLUSTRATION: ONE TAILED (LOWER TAILED)
Trying to encourage people to stop driving to campus, the
university claims that on average it takes at least 30
minutes to find a parking space on campus. I don’t think it
takes so long to find a spot. In fact I have a sample of the
last five times I drove to campus, and I calculated x = 20.
Assuming that the time it takes to find a parking spot is
normal, and that σ = 6 minutes, then perform a hypothesis
test with level α= 0.10 to see if my claim is correct.
SOLUTION
Step 1: Set the null and alternative hypotheses
H0 : μ ≥ 30
H1 : μ < 30
Step 2: Calculate the test statistic
Z = (x̄ − μ) / (σ/√n) = (20 − 30) / (6/√5) = −3.727
STEP 3: Set the rejection region: z < −1.28
STEP 4: CONCLUDE
We can see that −3.727 < −1.28 (the absolute value is higher than the critical value), so the test statistic is in the rejection region. Therefore we reject the null hypothesis in favor of the alternative. We can conclude that the mean is significantly less than 30; there is evidence that the mean time to find a parking space is less than 30 minutes.
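The two worked one-sample Z tests above can be reproduced with a few lines of Python; the sketch below uses scipy only to look up the normal critical values.

# Sketch reproducing the upper-tailed and lower-tailed one-sample Z tests.
from math import sqrt
from scipy.stats import norm

def one_sample_z(xbar, mu0, sigma, n):
    # Z = (sample mean - hypothesized mean) / (sigma / sqrt(n))
    return (xbar - mu0) / (sigma / sqrt(n))

# Upper-tailed insurance example: H0: mu <= 180000, alpha = 0.05.
z1 = one_sample_z(195000, 180000, 50000, 40)
print(round(z1, 3), z1 > norm.ppf(0.95))   # 1.897, True -> reject H0

# Lower-tailed parking example: H0: mu >= 30, alpha = 0.10.
z2 = one_sample_z(20, 30, 6, 5)
print(round(z2, 3), z2 < norm.ppf(0.10))   # -3.727, True -> reject H0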
EXERCISE
REJECTION REGION AND CONCLUSION
TESTING HYPOTHESIS ABOUT DIFFERENCE BETWEEN TWO POPULATION MEANS
We assume that the populations are normally
distributed.
The null hypothesis is H0 : μ1 = μ2
i.e. H0 : μ1 -μ2 = 0
Z = (x̄1 − x̄2) / √(σ1²/n1 + σ2²/n2)
If σ1² and σ2² are not known, then s1² and s2² can be used.
ILLUSTRATION 1
A test given to two groups of boys and girls gave
the following information:
Gender Mean score S.D. Sample Size
Girls 75 10 50
Boys 70 12 100
Is the difference in the mean scores of boys and girls
statistically significant? Test at 1% level.
Z = 2.695, table value Z = 2.58; since 2.695 > 2.58, the difference is statistically significant at the 1% level.
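The Z value for this illustration can be verified with the short Python sketch below; scipy is used only for the 1% two-tailed critical value.

# Sketch verifying the two-sample Z value for the girls vs. boys example.
from math import sqrt
from scipy.stats import norm

x1, s1, n1 = 75, 10, 50      # girls: mean, s.d., sample size
x2, s2, n2 = 70, 12, 100     # boys: mean, s.d., sample size

# Sample variances stand in for the unknown population variances.
z = (x1 - x2) / sqrt(s1**2 / n1 + s2**2 / n2)
critical = norm.ppf(0.995)   # two-tailed critical value at the 1% level

print(round(z, 3), round(critical, 2))   # about 2.70 and 2.58
print(abs(z) > critical)                 # True -> difference is significant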
ILLUSTRATION 2
Suppose you are working as a purchase manager
for a company. The following information has been
supplied to you by two manufacturers of electric
bulbs:
Company A Company B
Mean life ( in hours) 1,300 1,288
Standard deviation( in hours) 82 93
Sample size 100 100
Which brand of bulbs are you going to purchase if you desire to take
a risk of 5%
SOLUTION HINT
Take the null hypothesis that there is no significant difference in the quality of the two brands of bulbs, i.e. H0 : μ1 = μ2.
Z = 0.968; since |Z| < 1.96, the difference is not significant at the 5% level, so either brand may be purchased.
Z- test proportions
• let’s say you’re testing two flu drugs A and B. Drug A works on 41
people out of a sample of 195. Drug B works on 351 people in a
sample of 605. Are the two drugs comparable? Use a 5% alpha level.
• Step 1: Find the two proportions:
• P1 = 41/195 = 0.21 (that’s 21%)
• P2 = 351/605 = 0.58 (that’s 58%).
• Set these numbers aside for a moment.
• Step 2: Find the overall sample proportion. The numerator is the
total number of “positive” results for the two samples and the
denominator is the total number of people in the two samples.
• p = (41 + 351) / (195 + 605) = 0.49.
• Step 3: Insert the numbers from Step 1 and Step 2 into the test
statistic formula:
  z = (p1 − p2) / √( p(1 − p)(1/n1 + 1/n2) )
Z-test – proportions
• Solving the formula, we get |Z| = 8.99.
• We need to find out if the z-score falls into the “rejection region”:
the two-tailed critical value at α = 0.05 is 1.96. Since 8.99 > 1.96, we
reject the null hypothesis – the two drugs differ significantly.
Z test – One sample Proportion
A survey claims that 9 out of 10 doctors recommend aspirin for their patients with
headaches. To test this claim, a random sample of 100 doctors is obtained. Of these 100
doctors, 82 indicate that they recommend aspirin. Is this claim accurate? Use alpha = 0.05
STEP 1:
H0: p = 0.90
Ha: p ≠ 0.90
Here alpha = 0.05. Using an alpha of 0.05 with a two-tailed test, the
critical values are ±1.96.
STEP 2: p0 = 0.90, p̂ = 0.82, n = 100
z = (p̂ − p0) / √( p0(1 − p0)/n ) = (0.82 − 0.90) / √(0.9 × 0.1/100) = −2.67
Since |−2.67| > 1.96, we reject H0: the sample does not support the claim
that 9 out of 10 doctors recommend aspirin.
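Both proportion tests can be run with R's prop.test(); with the continuity correction turned off, the chi-square statistic it reports is the square of the z statistic used above. A minimal sketch with the figures from these examples:

# One-sample proportion test: 82 of 100 doctors vs. the claimed 90%
prop.test(82, 100, p = 0.90, correct = FALSE)

# Two-sample proportion test: drug A (41/195) vs. drug B (351/605)
prop.test(c(41, 351), c(195, 605), correct = FALSE)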
• Categorical variables fall into a particular category of those
variables that can be divided into finite categories.
• These categories are generally names or labels.
• These variables are also called qualitative variables as they depict
the quality or characteristics of that particular variable.
• “Movie Genre” in a list of movies could contain the categories
“Action”, “Fantasy”, “Comedy”, “Romance”, etc.
• A Chi-Square test is a test of statistical significance for categorical
variables.
• The chi-square test comes with a few assumptions of its own:
– The χ² test assumes that the data for the study are obtained through random
selection, i.e. they are randomly picked from the population
– The categories are mutually exclusive, i.e. each subject fits in only one
category. For example, the number of people who lunched in a restaurant on
Monday cannot also be counted in the Tuesday category
– The data should be in the form of frequencies or counts of a particular
category and not in percentages
– The data should not consist of paired samples or groups or we can say the
observations should be independent of each other
• There are two types of chi-square tests. Both use the chi-square
statistic and distribution for different purposes:
• A chi-square goodness of fit test determines if sample data
matches a population.
• A chi-square test for independence compares two variables in a
contingency table to see if they are related.
– In a more general sense, it tests to see whether distributions of
categorical variables differ from one another.
Chi-Square Goodness of Fit Test
• Non-parametric test.
• Use it to find how the observed value of a given event is significantly different from the
expected value.
• In this case, we have categorical data for one independent variable, and we want to
check whether the distribution of the data is similar or different from that of the
expected distribution.
– Consider an example where a research scholar is interested in the relationship
between the placement of students in the statistics department of a reputed university and their
C.G.P.A.
– In this case, the independent variable is C.G.P.A with the categories 9-10, 8-9, 7-8, 6-7, and below
6.
– The statistical question here is: whether or not the observed frequencies of placed students are
equally distributed for different C.G.P.A categories (so that our theoretical frequency distribution
contains the same number of students in each of the C.G.P.A categories).
C.G.P.A category                        10-9   9-8   8-7   7-6   Below 6   Total
Observed frequency of placed students    30    35    20    10       5       100
Expected frequency of placed students    20    20    20    20      20       100
• Step 1: Subtract each expected frequency from the related observed frequency. For example, for
the C.G.P.A category 10-9, it will be “30-20 = 10”. Apply similar operation for all the categories
• Step 2: Square each value obtained in step 1, i.e. (O-E)2. For example: for the C.G.P.A category 10-9,
the value obtained in step 1 is 10. It becomes 100 on squaring. Apply similar operation for all the
categories
• Step 3: Divide all the values obtained in step 2 by the related expected frequencies i.e. (O-E)2/E. For
example: for the C.G.P.A category 10-9, the value obtained in step 2 is 100. On dividing it with the
related expected frequency which is 20, it becomes 5. Apply similar operation for all the categories
• Step 4: Add all the values obtained in step 3 to get the chi-square value. In this case, the chi-square
value comes out to be 32.5
• Step 5: Once we have calculated the chi-square value, the next task is to compare it with the critical
chi-square value.
The critical value depends on the degrees of freedom (number of categories – 1 = 4)
and the level of significance; at α = 0.05 and df = 4 the critical chi-square value
is 9.488, which the computed value of 32.5 clearly exceeds.
• Therefore, we can say that the observed frequencies are
significantly different from the expected frequencies.
• In other words, C.G.P.A is related to the number of placements
that occur in the department of statistics.
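A minimal R sketch of this goodness-of-fit test with chisq.test(), using the observed frequencies above and equal expected proportions:

# Chi-square goodness of fit: placements across five C.G.P.A categories
observed <- c(30, 35, 20, 10, 5)
chisq.test(observed, p = rep(1/5, 5))   # X-squared = 32.5, df = 4, p < 0.001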
Steps involved for test of independence
Determine The Hypothesis:
Ho : The two variables are independent
Ha : The two variables are associated
Calculate Expected frequency
Calculate test statistic
χ² = Σ (O − E)² / E
Determine Degrees of Freedom
df = (R-1)(C-1)
EXAMPLE
Suppose a researcher is interested in voting preferences
on gun control issues.
A questionnaire was developed and sent to a random
sample of 90 voters.
The researcher also collects information about the political
party membership of the sample of 90 respondents.
BIVARIATE FREQUENCY TABLE OR CONTINGENCY TABLE
                          Favor   Neutral   Oppose   Row total (f row)
Democrat                    10       10        30         50
Republican                  15       15        10         40
Column total (f column)     25       25        40       n = 90
DETERMINE THE HYPOTHESIS
• Ho : There is no difference between Democrats and Republicans
in their opinion on the gun control issue.
• Ha : There is an association between
responses to the gun control survey and
the party membership in the population.
CALCULATING TEST STATISTICS
(Expected frequency fe = row total × column total / n; e.g. for the
Republican–Favor cell, 40 × 25 / 90 = 11.1)

             Favor               Neutral             Oppose               f row
Democrat     fo = 10, fe = 13.9  fo = 10, fe = 13.9  fo = 30, fe = 22.2    50
Republican   fo = 15, fe = 11.1  fo = 15, fe = 11.1  fo = 10, fe = 17.8    40
f column     25                  25                  40                  n = 90
CALCULATING TEST STATISTICS
(10 13.89)2 (10 13.89)2 (30 22.2)2
2
13.89 13.89 22.2
(15 11.11)2 (15 11.11)2 (10 17.8)2
11.11 11.11 17.8
=
11.03
DETERMINE DEGREES OF FREEDOM
df = (R − 1)(C − 1) = (2 − 1)(3 − 1) = 2
COMPARE COMPUTED TEST STATISTIC
AGAINST TABLE VALUE
α = 0.05
df = 2
Critical tabled value = 5.991
Test statistic, 11.03, exceeds critical
value
Null hypothesis is rejected
Democrats & Republicans differ
significantly in their opinions on gun
control issues
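A minimal R sketch of this test of independence, entering the 2 × 3 contingency table above:

# Chi-square test of independence: party membership vs. opinion on gun control
votes <- matrix(c(10, 10, 30,
                  15, 15, 10),
                nrow = 2, byrow = TRUE,
                dimnames = list(c("Democrat", "Republican"),
                                c("Favor", "Neutral", "Oppose")))
chisq.test(votes)   # X-squared ≈ 11.03, df = 2, p ≈ 0.004: reject independence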
• A t-test is a type of inferential statistic used to determine if
there is a significant difference between the means of two
groups, which may be related in certain features
Several types of t-tests exist for different situations, but
they all use a test statistic that follows a t-distribution
under the null hypothesis
1-sample t test
2 sample t test
paired t test
• The data should follow a continuous or ordinal scale (the IQ test scores of
students, for example)
• The observations in the data should be randomly selected
• The data should resemble a bell-shaped curve when we plot it, i.e., it should be
normally distributed.
• A large sample size helps the data approach a normal distribution
(although the t-test is particularly useful for small samples, whose
distributions may not be normal)
• Variances among the groups should be equal (for independent two-sample t-test)
T value and P value
• The larger the t score, the more difference there is between groups. The smaller the t score, the more
similarity there is between groups
• A t score of 3 means that the groups are three times as different from each other as they are within each
other. When you run a t test, the bigger the t-value, the more likely it is that the results are repeatable.
• A large t-score tells you that the groups are different.
• A small t-score tells you that the groups are similar.
• T-values and p-values
• How big is “big enough”? Every t-value has a p-value to go with it. A p-value is the probability that the
results from your sample data occurred by chance. P-values range from 0% to 100% and are usually
written as a decimal.
• For example, a p-value of 5% is 0.05. Low p-values are good; they indicate that your data are unlikely to
have occurred by chance. For example, a p-value of 0.01 means there is only a 1% probability that the
results of the experiment happened by chance. In most cases, a p-value of 0.05 (5%) is taken as the
cut-off for statistical significance.
One sample t test
• Step 1: The null hypothesis is that there is no difference in sales, and the
alternate hypothesis is:
• H0: μ = $100.
• H1: μ > $100.
• Step 2:
• Identify the following pieces of information you’ll need to calculate the test
statistic. The question should give you these items:
• The sample mean(x̄). This is given in the question as $130.
• The population mean(μ). Given as $100 (from past data).
• The sample standard deviation(s) = $15.
• Number of observations(n) = 25.
• Find the t-table value. You need two values to find this:
– The alpha level: given as 5% in the question.
– The degrees of freedom, which is the number of items in the sample (n) minus 1: 25
– 1 = 24.
• Look up 24 degrees of freedom in the left column and 0.05 in the top row. The
intersection is 1.711.This is your one-tailed critical t-value.
• What this critical value means is that we would expect most values to fall under 1.711. If
our calculated t-value falls within this range, the null hypothesis is likely true.
• The calculated t-value, t = (x̄ − μ)/(s/√n) = (130 − 100)/(15/√25) = 10, does not
fall within this range, so we can reject the null hypothesis: the value of 10 falls
into the rejection region.
• In other words, it’s highly likely that the mean sale is greater. The sales training was
probably a success.
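R's t.test() expects the raw observations, which this example does not list, so a minimal sketch from the summary statistics (hypothetical variable names) is:

# One-sample t-test from summary statistics (sales training example above)
xbar <- 130; mu0 <- 100; s <- 15; n <- 25
t_stat <- (xbar - mu0) / (s / sqrt(n))         # = 10
t_crit <- qt(0.95, df = n - 1)                 # one-tailed critical value, about 1.711
p_val  <- 1 - pt(t_stat, df = n - 1)           # upper-tail p-value
c(t = t_stat, critical = t_crit, p = p_val)    # reject H0 since t > t_crit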
PAIRED T TEST
• A paired t test (also called a correlated pairs t-test, a paired
samples t test or dependent samples t test) is where you run a
t test on dependent samples.
• Dependent samples are essentially connected — they are tests
on the same person or thing. For example:
– Knee MRI costs at two different hospitals,
– Two tests on the same person before and after training,
– Two blood pressure measurements on the same person using
different equipment.
• Subtract 1 from the sample size to get the degrees of freedom.
We have 11 items, so 11-1 = 10.
• If you don’t have a specified alpha level, use 0.05 (5%). For this
example problem, with df = 10, the t-value is 2.228.
• Step 8: Compare your t-table value (2.228) to your calculated
t-value (−2.74).
• The absolute value of the calculated t-value (2.74) is greater than
the table value at an alpha level of 0.05, so we can reject the null
hypothesis that there is no difference between the means.
To test the null hypothesis that the true mean difference is zero, the
procedure is as follows:
1. Calculate the difference (di = yi − xi) between the two observations on each pair.
2. Calculate the mean difference, d̄.
3. Calculate the standard error of the mean difference: S.E. = S.D./√n, where S.D. is
the standard deviation of the differences.
4. Calculate the t-statistic, t = d̄ / S.E. Under the null hypothesis, this statistic
follows a t-distribution with n − 1 degrees of freedom.
5. Use tables of the t-distribution to compare your value of t to the t(n−1)
distribution. This gives the p-value for the paired t-test.
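A minimal R sketch of a paired t-test; the before/after vectors below are hypothetical, since the 11 raw observations of the example are not listed on the slide:

# Paired t-test on hypothetical before/after measurements (n = 11)
before <- c(120, 122, 143, 100, 109, 112, 118, 135, 121, 128, 133)
after  <- c(122, 120, 141, 109, 109, 116, 124, 136, 127, 130, 140)
t.test(after, before, paired = TRUE)   # t follows a t-distribution with n - 1 = 10 df

# Equivalently, a one-sample t-test on the differences
t.test(after - before, mu = 0)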
The independent samples t test (also called the unpaired samples
t test) is the most common form of the T test.
It helps you to compare the means of two sets of data.
For example, you could run a t test to see if the average test
scores of males and females are different; the test answers the
question, “Could these differences have occurred by random chance?”
You should use this test when:
You do not know the population mean or standard deviation.
You have two independent, separate samples.
X and X² columns for a single sample (used to compute the sum of
squares, the variance, and the standard deviation):

 X :  3   1   5   6   3   5   5   5   4   6   3   3     ΣX  = 49
 X²:  9   1  25  36   9  25  25  25  16  36   9   9     ΣX² = 225

Data for the independent samples t test:

 Group 1: 5, 8, 7, 8, 7      sum = 35, sum of squares = 251
 Group 2: 3, 5, 2, 3         sum = 13, sum of squares = 47
• To evaluate the results, you compare the computed t to the critical
value of t. The critical value of t (obtained from the Student's t Table
) is 2.365 (alpha = 0.05 and df = N1 + N2 - 2 = 7). Because the
computed value of t (4.52) exceeds the critical value (2.365), we
reject the null hypothesis and conclude that the two populations
from which the samples are drawn do have different means.
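A minimal R sketch of this independent samples t test, using the two groups listed above (var.equal = TRUE gives the pooled-variance test used in the hand calculation):

# Independent (unpaired) two-sample t-test with pooled variance
group1 <- c(5, 8, 7, 8, 7)
group2 <- c(3, 5, 2, 3)
t.test(group1, group2, var.equal = TRUE)   # t ≈ 4.5, df = 7, consistent with the hand calculation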
• The F statistic (also known as the F value) is used in ANOVA and regression
analysis to identify whether the means of two populations are significantly
different or not.
• In other words, the F statistic is a ratio of two variances (variance is a
measure of dispersion; it tells how far the data are spread from the mean).
The F statistic accounts for the corresponding degrees of freedom when
estimating the population variances.
• The F statistic indicates whether a group of variables is jointly
statistically significant.
• F statistics are based on the ratio of mean squares: the F statistic is the
ratio of the mean square for treatment (between groups) to the mean square
for error (within groups).
• An F-test is any statistical test in which the test statistic has an
F-distribution under the null hypothesis.
• It is most often used when comparing statistical models that have
been fitted to a data set, in order to identify the model that best fits
the population from which the data were sampled.
• The name was coined by George W. Snedecor, in honour of Sir
Ronald A. Fisher.
• Fisher initially developed the statistic as the variance ratio in the
1920s
If calculated F value is greater than the appropriate value of the F critical value (found in a
table or provided in software), then the null hypothesis can be rejected.
• A botanical research team wants to study the growth of plants with the usage of urea.
Team conducted 8 tests with a variance of 600 during initial state and after 6 months 6
tests were conducted with a variance of 400. The purpose of the experiment is to know
is there any improvement in plant growth after 6 months at 95% confidence level.
• Degrees of freedom
– ϑ1=8-1 =7 (highest variance in numerator)
– ϑ2 = 6-1= 5
• Statistical hypothesis:
– Null hypothesis H0: σ1² ≤ σ2²
– Alternative hypothesis H1: σ1² > σ2²
• Since the team wants to see the improvement it is a one-tail (right) test
• Level of significance α= 0.05
• Compute the critical F from table = 4.88
• Reject the null hypothesis if the calculated F value is greater than or
equal to 4.88
• Calculate the F value: F = s1²/s2² = 600/400 = 1.5
• Fcalc < Fcritical, hence we fail to reject the null hypothesis
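R's var.test() needs the raw measurements, so a minimal sketch of this F comparison from the summary variances is:

# One-tailed F-test from summary variances (plant growth example above)
s1_sq <- 600; n1 <- 8      # initial state (larger variance in the numerator)
s2_sq <- 400; n2 <- 6      # after 6 months

F_stat <- s1_sq / s2_sq                          # = 1.5
F_crit <- qf(0.95, df1 = n1 - 1, df2 = n2 - 1)   # about 4.88
F_stat >= F_crit                                 # FALSE: fail to reject H0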
• An insurance company sells health insurance and motor insurance
policies. Premiums are paid by customers for these policies. The
CEO of the insurance company wonders whether the premiums paid in one
segment (health insurance or motor insurance) are more variable than in
the other. He finds the following data for premiums paid: sample variances
of 200 (from 11 policies) and 50 (from 51 policies) for the two segments.
• Conduct a two-tailed F-test with a level of significance of 10%.
• Solution:
• Step 1: Null Hypothesis H0: σ1² = σ2²
• Alternate Hypothesis Ha: σ1² ≠ σ2²
• Step 2: F statistic = F value = s1²/s2² = 200/50 = 4
• Step 3: df1 = n1 – 1 = 11-1 =10
• df2 = n2 – 1 = 51-1 = 50
• Step 4: Since it is a two-tailed test, alpha level = 0.10/2 = 0.050. The F value from the F
Table with degrees of freedom as 10 and 50 is 2.026.
• Step 5: Since F statistic (4) is more than the table value obtained (2.026), we reject the
null hypothesis.
• The bank has a Head Office in Delhi and a branch at Mumbai. There
are long customer queues at one office, while customer queues are
short at the other office. The Operations Manager of the bank
wonders if the customers at one branch are more variable than the
number of customers at another branch. A research study of
customers is carried out by him.
• The variance of Delhi Head Office customers is 31, and that for the
Mumbai branch is 20. The sample size for Delhi Head Office is 11,
and that for the Mumbai branch is 21. Carry out a two-tailed F-test
with a level of significance of 10%.
• Solution:
• Step 1: Null Hypothesis H0: σ1² = σ2²
• Alternate Hypothesis Ha: σ1² ≠ σ2²
• Step 2: F statistic = F value = s1²/s2² = 31/20 = 1.55
• Step 3: df1 = n1 – 1 = 11-1 = 10
• df2 = n2 – 1 = 21-1 = 20
• Step 4: Since it is a two-tailed test, alpha level = 0.10/2 = 0.05. The F value from
the F Table with degrees of freedom as 10 and 20 is 2.348.
• Step 5: Since F statistic (1.55) is lesser than the table value obtained (2.348), we
cannot reject the null hypothesis.
• A toy manufacturer is planning to place a bulk order for batteries for
the toys. The quality team collected 21 samples from supplier A,
and the variance is 36 hours, and also collected 16 samples from
supplier B with a variance of 28. At 95% confidence level, determine
is there a difference in variance between two suppliers?
• Degrees of freedom ϑ1=21-1 =20 (highest variance in numerator)
• ϑ2 = 16-1= 15
• Statistical hypothesis:
– Null hypothesis H0: σ1² = σ2²
– Alternative hypothesis H1: σ1² ≠ σ2²
• Since team wants to see is there a difference between two suppliers, it is a two –tailed test
• Level of significance α= 0.05
• α/2= 0.025
• Critical value for the right tail F(0.025,20,15) =2.7559
• Critical value for left tail: Since it is a left tail, we must switch the degrees of freedom, then
take a reciprocal of final answer
• Reciprocal of F(0.025,15,20) = 1/F(0.025,15,20) = 1/2.57=0.389
• Calculate the F value: F = s1²/s2² = 36/28 = 1.286
• Compare Fcalc to Fcritical.
• In hypothesis testing, a critical value is a point on the test
distribution that is compared to the test statistic to determine whether
to reject the null hypothesis.
• Since the calculated F value (1.286) lies between 0.389 and 2.756, it
does not lie in the rejection region. Hence we fail to reject the null
hypothesis at the 95% confidence level: there is no evidence of a
difference in variance between the two suppliers.
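A minimal sketch of the two-tailed critical region for this example using qf(); note that the lower critical value returned by qf() matches the reciprocal calculation described above:

# Two-tailed F-test critical region (battery supplier example above)
df1 <- 20; df2 <- 15
upper <- qf(0.975, df1, df2)      # about 2.756
lower <- qf(0.025, df1, df2)      # about 0.389, the same as 1 / qf(0.975, df2, df1)

F_stat <- 36 / 28                 # about 1.286
F_stat < lower | F_stat > upper   # FALSE: fail to reject H0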
RESAMPLING
• Sampling is an active process of gathering observations intent on
estimating a population variable.
• Resampling is a methodology of economically using a data sample
to improve the accuracy and quantify the uncertainty of a population
parameter.
• Some resampling methods, in fact, make use of nested resampling
(one resampling loop inside another).
• Resampling is the method that consists of drawing repeated
samples from the original data samples.
• The method of Resampling is a nonparametric method of statistical
inference.
• In other words, the method of resampling does not involve the
use of generic distribution tables (for example, normal
distribution tables) in order to compute approximate probability
(p) values.
• There is no specific sample size requirement.
• Resampling methods are
– Very easy to use,
– Requiring little mathematical knowledge
– Easy to understand and
– Easy to implement compared to specialized statistical methods that may
require deep technical skill in order to select and interpret.
A downside of the methods is that they can be computationally very
expensive, requiring tens, hundreds, or even thousands of resamples in order
to develop a robust estimate of the population parameter.
• 1.Bootstrapping
• 2. Cross Validation
• Cross-validation is a resampling procedure used to evaluate machine
learning models on a limited data sample.
• The procedure has a single parameter called k that refers to the
number of groups that a given data sample is to be split into. As
such, the procedure is often called k-fold cross-validation. When a
specific value for k is chosen, it may be used in place of k in the
reference to the model, such as k=10 becoming 10-fold cross-
validation
K fold validation
Procedure – K-fold
1. Shuffle the dataset randomly.
2. Split the dataset into k groups.
3. For each unique group:
   a. Take the group as a hold-out or test data set
   b. Take the remaining groups as a training data set
   c. Fit a model on the training set and evaluate it on the test set
   d. Retain the evaluation score and discard the model
4. Summarize the skill of the model using the sample of model
evaluation scores (a minimal sketch of this procedure follows below).
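A minimal base-R sketch of the k-fold procedure on hypothetical data; a simple linear model and mean squared error are used purely for illustration:

# k-fold cross-validation from scratch (k = 5)
set.seed(42)
dat   <- data.frame(x = runif(100))
dat$y <- 3 * dat$x + rnorm(100, sd = 0.2)        # hypothetical data

k      <- 5
folds  <- sample(rep(1:k, length.out = nrow(dat)))   # steps 1-2: shuffle and split into k groups
scores <- numeric(k)
for (i in 1:k) {
  test  <- dat[folds == i, ]                     # step 3a: hold-out fold
  train <- dat[folds != i, ]                     # step 3b: remaining folds
  fit   <- lm(y ~ x, data = train)               # step 3c: fit on the training set
  pred  <- predict(fit, newdata = test)
  scores[i] <- mean((test$y - pred)^2)           # step 3d: retain the evaluation score (MSE)
}
mean(scores)                                     # step 4: summarize model skill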
Variations on Cross Validation
• Train/Test Split: Taken to one extreme, k may be set to 2 (not 1) such that a
single train/test split is created to evaluate the model.
• LOOCV: Taken to another extreme, k may be set to the total number of
observations in the dataset, so that each observation is given a chance to be
held out of the dataset. This is called leave-one-out cross-validation, or LOOCV
for short.
• Stratified: The splitting of data into folds may be governed by criteria such as
ensuring that each fold has the same proportion of observations with a given
categorical value, such as the class outcome value. This is called stratified cross-
validation.
• Repeated: This is where the k-fold cross-validation procedure is repeated n
times, where importantly, the data sample is shuffled prior to each repetition,
which results in a different split of the sample.
BOOT STRAPPING
• Bootstrapping resamples the original dataset with replacement
many thousands of times to create simulated datasets. This process
involves drawing random samples from the original dataset. Here’s
how it works:
• The bootstrap method has an equal probability of randomly
drawing each original data point for inclusion in the resampled
datasets.
• The procedure can select a data point more than once for a
resampled dataset. This property is the “with replacement” aspect
of the process.
• The procedure creates resampled datasets that are the same size as
the original dataset.
• The process ends with your simulated datasets having many
different combinations of the values that exist in the original
dataset.
• Each simulated dataset has its own set of sample statistics, such as
the mean, median, and standard deviation.
• The bootstrap method can be used to estimate a quantity of a
population. This is done by repeatedly taking small samples,
calculating the statistic, and taking the average of the calculated
statistics. We can summarize this procedure as follows:
– Choose a number of bootstrap samples to perform
– Choose a sample size
– For each bootstrap sample
• Draw a sample with replacement with the chosen size
• Calculate the statistic on the sample
– Calculate the mean of the calculated sample statistics.
Calculate Confidence Interval
• If we were interested in a confidence interval of 95% (i.e. α = 0.05), we
would select the value at the 2.5th percentile as the lower bound
and the 97.5th percentile as the upper bound on the statistic of interest.
• For example, if we calculated 1,000 statistics from 1,000 bootstrap samples,
then the lower bound would be the 25th value and the upper bound would be
the 975th value, assuming the list of statistics was ordered.
• In this, we are calculating a non-parametric confidence interval that does not
make any assumption about the functional form of the distribution of the
statistic.
• This confidence interval is often called the empirical confidence interval.
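A minimal R sketch of the bootstrap procedure and the empirical (percentile) confidence interval described above, on a hypothetical sample:

# Bootstrap estimate of the mean with a 95% empirical confidence interval
set.seed(7)
x <- rexp(50, rate = 1/10)                  # hypothetical sample of 50 observations

B <- 1000                                   # number of bootstrap samples
boot_means <- replicate(B, mean(sample(x, size = length(x), replace = TRUE)))

mean(boot_means)                            # bootstrap estimate of the population mean
quantile(boot_means, c(0.025, 0.975))       # lower and upper bounds of the 95% CI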
PREDICTION ERROR
• A prediction error is the failure of some expected event to occur.
• When predictions fail, applying that type of knowledge can inform
decisions and improve the quality of future predictions.
• Predictive analytics software processes new and historical data to
forecast activity, behavior and trends.
• The programs apply statistical analysis techniques,
analytical queries and machine learning algorithms to data sets to
create predictive models that quantify the likelihood of a particular
event happening.
• Errors are an inescapable element of predictive analytics that
should also be quantified and presented along with any model,
often in the form of a confidence interval that indicates how
accurate its predictions are expected to be.
• Analysis of prediction errors from similar or previous models can
help determine confidence intervals
The ingredients of prediction error are actually:
•bias: the bias is how far off on the average the model is from the
truth.
• variance :The variance is a measure of variability. It is
calculated by taking the average of squared deviations from
the mean.
– Variance tells you the degree of spread in your data set. The more
spread the data, the larger the variance is in relation to the mean.
Bias and variance together give us the prediction error.
This difference can be expressed in terms of variance and bias:
e² = var(model) + var(chance) + bias²
where:
• var(model) is the variance due to the training data set selected (reducible)
• var(chance) is the variance due to chance (not reducible)
• bias is the average of all Ŷ over all training data sets minus the
true Y (reducible); it enters the error as its square
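A small R simulation sketch of this decomposition (hypothetical setup: a straight-line model deliberately under-fitting a sine curve, evaluated at a single test point x0):

# Bias-variance decomposition of prediction error at a single test point x0
set.seed(123)
f     <- function(x) sin(x)        # true function
sigma <- 0.5                       # noise standard deviation, so var(chance) = sigma^2
x0    <- 2.0                       # test point

# Fit the same (misspecified) linear model on many different training sets
preds <- replicate(5000, {
  x <- runif(30, 0, 2 * pi)
  y <- f(x) + rnorm(30, sd = sigma)
  fit <- lm(y ~ x)
  predict(fit, newdata = data.frame(x = x0))
})

bias       <- mean(preds) - f(x0)  # average prediction minus the true value (reducible)
var_model  <- var(preds)           # variance due to the training set selected (reducible)
var_chance <- sigma^2              # variance due to chance (not reducible)
bias^2 + var_model + var_chance    # expected squared prediction error at x0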