Lecture on
Big Data Analytics
Introduction to course
Mrs. Archana Shirke,
Department of Information Technology,
Fr. C. R. I. T., Vashi.
Big Data Analytics
Syllabus
Modules
1. Introduction to Big Data
2. Introduction to Big Data Frameworks: Hadoop, NoSQL
3. MapReduce Paradigm
4. Mining Big Data Streams
5. Big Data Mining Algorithms
6. Big Data Analytics Applications
Course Objectives
• To provide an overview of the exciting, growing field of Big
Data analytics.
• To discuss the challenges traditional data mining
algorithms face when analysing Big Data.
• To introduce the tools required to manage and analyse
big data, such as Hadoop, NoSQL and MapReduce.
• To teach the fundamental techniques and principles
for achieving big data analytics with scalability and
streaming capability.
• To introduce students to several types of big data,
such as social media, web graphs and data streams.
Course Outcomes
Students will be able to:
• Demonstrate the fundamentals of Big Data analytics.
• Describe Big Data techniques and algorithms.
• Apply Big Data techniques and algorithms to solve real-life
applications.
• Analyse different implementations of Big Data algorithms.
• Evaluate Big Data solutions using various modern tools.
• Investigate the key issues in complex real-world Big Data
applications.
Course Work
• IA1 : 20 M
• IA2 : 20 M
• Assignment Test 1 : 20M
• Assignment Test 2 : 20M
• Innovative teaching-learning methods: TPS, Mind
Map, CBL, etc.
• Prelim exam: 80 M
• Final exam: 80 M
References
• Readings: the book Mining of Massive Datasets
by J. Leskovec, A. Rajaraman and J. Ullman
Free online: http://www.mmds.org
• More online material will be shared as and
when required
It’s going to be fun and hard work.
Big Data Analytics
5W and 1H
• What ???
• Why ???
• When ???
• Where ???
• Who ???
• How much ???
What is Big Data?
• IBM defines -- “Big data is data that exceeds the
processing capacity of conventional database systems.
The data is too big, moves too fast, or doesn't fit the
structures of your database architectures. To gain value
from this data, you must choose an alternative way to
process it.”
Why do we use Big Data?
• Big data is used for Smart Decisions
– Time and Cost Reduction
– Faster, better decision making
– New product/Personalized offerings etc.
Ex: Traffic
Why do we use Big Data?
When do we use Big Data?
• Discovery of useful, possibly unexpected, patterns in data
• Non-trivial extraction of implicit, previously unknown and
potentially useful information from data
• Exploration & analysis, by automatic or semi-automatic
means, of large quantities of data in order to discover
meaningful patterns.
Where is Big Data ???
• Big data is everywhere
• Lots of data is being collected and warehoused
– Web data, e-commerce
– purchases at department/grocery stores
– Bank/Credit Card transactions
– Social Network
– Sensor Network(IoT)
How is big data generated?
How much?
Big Data Size varies
• Byte (1) : one grain of rice
• Kilobyte (10^3) : cup of rice
• Megabyte (10^6) : 8 bags of rice
• Gigabyte (10^9) : 3 semi trucks
• Terabyte (10^12) : 2 container ships
• Petabyte (10^15) : entire city
• Exabyte (10^18) : half of India
• Zettabyte (10^21) : fills the Pacific Ocean
• Yottabyte (10^24) : Earth-sized ball of rice
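The unit ladder above can be checked with a few lines of Python. This is only an illustrative sketch: the names `UNITS`, `human_readable` and `next_unit` are made up for this example, and decimal (power-of-1000) prefixes are assumed, matching the powers of ten on the slide.

```python
# Decimal (SI) prefixes, in the same order as the slide: each step is x1000.
UNITS = ["B", "KB", "MB", "GB", "TB", "PB", "EB", "ZB", "YB"]

def human_readable(num_bytes: float) -> str:
    """Format a byte count using the largest unit that keeps the value below 1000."""
    value = float(num_bytes)
    for unit in UNITS:
        # Stop at the first unit where the value fits, or at the last unit.
        if value < 1000 or unit == UNITS[-1]:
            return f"{value:.1f} {unit}"
        value /= 1000

def next_unit(unit: str) -> str:
    """Return the unit that follows the given one on the ladder."""
    return UNITS[UNITS.index(unit) + 1]

print(human_readable(1500))          # a little over one kilobyte
print(human_readable(35 * 10**21))   # 35 zettabytes
print(next_unit("PB"))               # the unit after petabyte
```

Walking the list in order makes it easy to see that each unit is a thousand times the previous one.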
Big Data Technologies
TOP 5 BIG DATA TECHNOLOGIES
• 1. Hadoop Ecosystem
• 2. Artificial Intelligence
• 3. NoSQL Database
• 4. R Programming
• 5. Data Lakes
EMERGING BIG DATA TECHNOLOGIES
• 1. TensorFlow
• 2. Beam
• 3. Docker
• 4. Airflow
• 5. Kubernetes
• 6. Blockchain
Who uses???
Prerequisite Quiz
https://forms.office.com/r/NXwFWd4mL2
Q1
Which unit of data comes next after the Petabyte?
A. Exabyte
B. Zettabyte
C. Yottabyte
D. Terabyte
• Answer: A) Exabyte
Q2
Big data is used to make smart decisions for:
A. Time and Cost Reduction
B. New product offerings
C. Personalized offerings
D. All of the above
• Answer: D) All of the above
Q3
Which one of the following is not part of the Big Data
process?
A. Big Data Analytics
B. Descriptive Analysis
C. Customer Analytics
D. Oracle Analytics
• Answer: D) Oracle Analytics
Q4
Which of the following tools are used for big data analytics?
A. Hadoop
B. Cloudera
C. HDInsight
D. All of the above
• Answer: D) All of the above
Q5
What are the main characteristics of Big Data?
A. Volume, Vocabulary, Value
B. Velocity, Visualization, Vagueness
C. Variety, Veracity, Value
D. Volume, Velocity, Variety
• Answer: D) Volume, Velocity, Variety
Summary
• Introduction
• Syllabus
• Expectations for course/Term work
• 5W and 1H of BDA
Thank you
Lecture on
Big Data Analytics
Introduction to Big Data
Mrs. Archana Shirke,
Department of Information Technology,
Fr. C. R. I. T., Vashi.
Modules
1. Introduction to Big Data
• Introduction to Big Data, characteristics, types
• Traditional vs. Big Data business approach, Big Data Challenges,
• Examples of Big Data in Real Life, Big Data Applications
2. Introduction to Big Data Frameworks: Hadoop, NOSQL
3. MapReduce Paradigm
4. Mining Big Data Streams
5. Big Data Mining Algorithms
6. Big Data Analytics Applications
Introduction
• Big data is data whose scale, diversity and complexity
require new architectures, techniques, algorithms and
analytics to manage it and to extract value and hidden knowledge
from it.
• Datasets whose size is beyond the ability of typical database
software tools to capture, store, manage, and analyze.
• IBM defines -- “Big data is data that exceeds the processing
capacity of conventional database systems. The data is too
big, moves too fast, or doesn't fit the structures of your
database architectures. To gain value from this data, you must
choose an alternative way to process it.”
Introduction
• Gartner defines - “Big Data in general is defined as high
volume, velocity and variety information assets that
demand cost-effective, innovative forms of information
processing for enhanced insight and decision making.”
• O’Reilly defines - Big Data is often described as extremely
large data sets that have grown beyond the ability to
manage and analyze them with traditional data
processing tools.
• Webopedia defines - Big Data is a phrase used to mean
a massive volume of both structured and
unstructured data that is so large it is difficult to process
using traditional database and software techniques.
Big Data characteristics: 3V’s
1. Characteristics : Volume(Scale)
• Data Volume
– 44x increase from 2009 to 2020
– From 0.8 zettabytes (ZB) to 35 ZB
• Data volume is increasing exponentially
Exponential increase in
collected/generated data
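To see what "exponential" means for these figures, the overall growth factor can be turned into an average yearly rate. The sketch below is a back-of-the-envelope calculation only: it assumes smooth compound growth between the two quoted data points (0.8 ZB in 2009 and 35 ZB in 2020), and the function name `annual_growth_rate` is invented for illustration.

```python
def annual_growth_rate(start: float, end: float, years: int) -> float:
    """Compound annual growth rate that takes `start` to `end` in `years` steps."""
    return (end / start) ** (1 / years) - 1

start_zb, end_zb = 0.8, 35.0          # figures quoted on the slide
years = 2020 - 2009                   # 11 yearly steps

growth_factor = end_zb / start_zb     # close to the quoted 44x
rate = annual_growth_rate(start_zb, end_zb, years)

print(f"{growth_factor:.1f}x overall, about {rate:.0%} per year")
```

Under this assumption the volume grows by a little over 40% every year, which is why the curve bends upward so sharply.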
1. Characteristics : Volume (Scale)
• 12+ TBs of tweet data every day
• 25+ TBs of log data every day
• ? TBs of data every day
• 30 billion RFID tags today (1.3B in 2005)
• 4.6 billion camera phones worldwide
• 100s of millions of GPS-enabled devices sold annually
• 2+ billion people on the Web by end 2011
• 76 million smart meters in 2009… 200M by 2014
2. Characteristics: Variety (Complexity)
• Various formats, types, and structures
• Text, numerical, images, audio, video, sequences, time series, social
media data, multi-dim arrays, etc…
• Static data vs. streaming data
• A single application can be generating/collecting many types of data
To extract knowledge, all these
types of data need to be linked
together
A Single View to the Customer
[Figure: the Customer at the centre, linked to Social Media, Banking and Finance, Our Known History, Gaming, Entertain, and Purchase]
3. Characteristics : Velocity (Speed)
• Data is being generated fast and needs to be processed
fast
• Online Data Analytics
• Late decisions mean missed opportunities
• Examples
– E-Promotions: based on your current location, your purchase
history and what you like, send promotions right now for the store next
to you
– Healthcare monitoring: sensors monitor your activities and
body; any abnormal measurements require immediate reaction
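The healthcare example can be sketched as a tiny stream processor that reacts to each reading as it arrives instead of waiting for a batch. Everything here is an assumption made for illustration: the function name `detect_anomalies`, the window size, the three-standard-deviation rule, and the sample heart-rate values.

```python
from collections import deque
from statistics import mean, stdev

def detect_anomalies(readings, window=5, threshold=3.0):
    """Yield (index, value) for readings far from the mean of the recent window."""
    recent = deque(maxlen=window)          # only the latest `window` normal readings
    for i, value in enumerate(readings):
        if len(recent) == window:
            mu, sigma = mean(recent), stdev(recent)
            if sigma > 0 and abs(value - mu) > threshold * sigma:
                yield i, value             # react immediately
                continue                   # keep the outlier out of the baseline
        recent.append(value)

# Hypothetical heart-rate stream with one abnormal spike.
heart_rate = [72, 74, 71, 73, 72, 75, 140, 73, 72]
alerts = list(detect_anomalies(heart_rate))
print(alerts)
```

Because the function is a generator, an alert is produced as soon as the abnormal value is seen, which is the point of velocity: the decision cannot wait for the full data set.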
The Model Has Changed…
• The Model of Generating/Consuming Data
has Changed
Old Model: few companies are generating data, all others are consuming data
New Model: all of us are generating data, and all of us are
consuming data
Big Data characteristics: 4 V’s
4. Veracity: a big data characteristic related to consistency, accuracy,
quality, and trustworthiness. Data veracity refers to the
bias, noise, and abnormality in data.
Big Data characteristics: 5 V’s
Big Data characteristics: 6 V’s
Big Data characteristics: 7 V’s
Characteristics : V’s
1. Volume – Size / Quantity
2. Velocity – Speed
3. Variety – Type / Nature
4. Veracity – Quality
5. Value – Importance
6. Variability – Data Differentiation / Change in Meaning
7. Visualization – Data Act / Data Process
8. Validity – Authenticity
9. Vocabulary – Data Terminology
10. Venue – Different Platform
11. Vagueness – Uncertainties in the meaning of …
12. Volatility – Duration of Use
13. Virality – Spreading Speed
14. Viscosity – Lag of Event
15. Verbosity – Redundancy
16. Voluntariness – Willful availability of data, used according to the context
17. Versatility – Flexible enough to be used differently for different contexts
Big Data Vs Small Data
Big Data tools
Big Data Certification: example
Scope of Big Data
• Numerous Job opportunities
– Big Data Analyst,
– Big Data Engineer,
– Big Data solution architect etc.
• Rising demand for Analytics Professional
• Salary Aspects
• Adoption of Big Data analytics across the world
Big Data Everywhere
• Lots of data is being collected and warehoused
– Web data, e-commerce
– purchases at department/grocery stores
– Bank/Credit Card transactions
– Social Network
How much data? (one infographic slide per year, 2013 to 2022)
2013-2022
Analysis of YouTube Data per minute
• 2013 - Users share 48 hours of new videos / minute
• 2014 - Users share 72 hours of new videos / minute
• 2015 - Users share 300 hours of new videos / minute
• 2016 - Users share 900 hours of new videos / minute
• 2017 - Users watch 4,146,600 videos / minute
• 2018 - Users watch 4,333,560 videos / minute
• 2019 - Users watch 4,500,000 videos / minute
• 2020 - 500 hours of video streamed / minute
• 2021 - 694 hours of video streamed / minute
• 2022 - Users upload 500 hours of video / minute
Big Data Trends to Consider in 2022
What to do with these data?
• Aggregation and Statistics
– Data warehouse and OLAP
• Indexing, Searching, and Querying
– Keyword based search
– Pattern matching
• Knowledge discovery
– Data Mining
– Statistical Modeling
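As a small illustration of the "Indexing, Searching, and Querying" item, keyword search is commonly built on an inverted index that maps each word to the documents containing it. The sketch below is a toy version with made-up documents; the names `build_index` and `search` are hypothetical and no real search library is implied.

```python
from collections import defaultdict

# Hypothetical mini-collection of documents, keyed by id.
docs = {
    1: "big data needs new tools",
    2: "data mining finds patterns in data",
    3: "hadoop stores big files",
}

def build_index(documents):
    """Map every word to the set of document ids in which it occurs."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index

def search(index, *keywords):
    """Return the sorted ids of documents that contain every keyword."""
    sets = [index.get(k.lower(), set()) for k in keywords]
    return sorted(set.intersection(*sets)) if sets else []

index = build_index(docs)
print(search(index, "data"))         # documents mentioning "data"
print(search(index, "big", "data"))  # documents mentioning both words
```

The index is built once and then answers each query with set lookups, instead of rescanning every document.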
Quiz Time ???
https://forms.office.com/r/C4Tx88fnKr
Q1
The feature of big data that refers to the quality of
the stored data is:
A. Variety
B. Volume
C. Variability
D. Veracity
• Answer: D) Veracity
Q2
Which of the following is not true of the Big Data approach?
A. Handles structured data
B. Handles frequent updates of large data
C. Handles semi-structured data
D. Handles unstructured data
• Answer: B) Handles frequent updates of large data
Q3
What does “Velocity” in Big Data mean?
A. Speed of individual machine processors
B. Speed of input data generation
C. Speed of storing data only
D. Speed of storing and processing data
• Answer: D) Speed of storing and processing data
Q4
In Big Data environment data resides in
A. Distributed File system
B. Central server
C. Data warehouse
D. Database
• Answer: A) Distributed File system
Q5
The World Wide Web (WWW) and the Internet of
Things (IoT) mainly use
A. Structured data
B. Unstructured data
C. Multimedia data
D. Relational data
• Answer: B) Unstructured data
What is Data Mining?
• Discovery of useful, possibly unexpected, patterns in data
• Non-trivial extraction of implicit, previously unknown and
potentially useful information from data
• Exploration & analysis, by automatic or semi-automatic
means, of large quantities of data in order to discover
meaningful patterns
Data Mining Tasks
• Predictive methods
– Use some variables to predict unknown or future values
of other variables
– Example: Recommender systems
• Descriptive methods
– Find human-interpretable patterns that describe the
data
– Example: Clustering
Data Mining Tasks
• Classification [Predictive]
• Regression [Predictive]
• Deviation Detection [Predictive]
• Collaborative Filtering [Predictive]
• Clustering [Descriptive]
• Association Rule Discovery [Descriptive]
• Sequential Pattern Discovery [Descriptive]
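To make one of the descriptive tasks concrete, association rule discovery scores a rule such as bread → milk by its support and confidence. The sketch below uses a made-up set of four shopping baskets, and the helper names `support` and `confidence` are illustrative only.

```python
# Hypothetical market-basket transactions.
transactions = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"butter", "milk"},
    {"bread", "butter"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent, transactions):
    """Of the transactions containing the antecedent, the fraction that also contain the consequent."""
    both = set(antecedent) | set(consequent)
    return support(both, transactions) / support(antecedent, transactions)

print(support({"bread", "milk"}, transactions))        # how often both appear together
print(confidence({"bread"}, {"milk"}, transactions))   # strength of bread -> milk
```

Here bread and milk appear together in 2 of 4 baskets, and 2 of the 3 baskets with bread also contain milk.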
Data Mining V/s Big Data
• How to manage very large amounts of data and extract
value and knowledge from them?
Big Data Mining
• Big data is the asset, and data mining is the "handler" that
is used to extract beneficial results from it.
Analysis Vs Analytics
Analysis - What happened? Why?
Analytics - What will happen?
Analysis / Analytics
• Analysis is detailed examination of the elements or
structure of something, typically as a basis for
discussion or interpretation
• Analytics is the process of, or the results from,
analysis of data for purposes of making decisions,
reaching conclusions, or disproving models or
theories about how some currently interesting
aspect of the world works.
Business Analysis / Analytics
• Business analysis
– concerned with functions and process.
– It has its own architecture domains : Enterprise , Process
– improves performance by standardizing processes
• Business analytics
– concerned with data and reporting.
– It also has its architecture domains : Information , Data
– improves performance by analyzing metrics
– reports to spot problems and interesting findings that could turn
into improvement opportunities.
What is Big Data Analytics
Why Big Data Analytics??
• Technology advances now make it possible to analyze entire data sets
and not just subsets.
• Every interaction rather than just every transaction can be analyzed.
• Analysis of multi-structured data may produce additional insight for
making smart decisions from an organization's point of view.
Big Data Analytics
• Scalability (big data)
• Algorithms
• Computing architectures
• Automation for handling large data
Technology for Big Data Analytics
• Horizontal scaling
– distributing the workload across multiple
independent machines to improve processing
capability.
• Vertical Scaling
– installing more processors, more memory and
faster hardware typically within a single server
• Peer to Peer Network
– a decentralized and distributed network architecture
in which millions of machines (peers) connected in a
network serve and consume resources.
– The Message Passing Interface (MPI) is the communication
scheme used to communicate and exchange data
between peers.
– Broadcasting messages is cheap, but the
aggregation of data/results is much more expensive.
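Horizontal scaling can be sketched on one machine by splitting the input into partitions, letting independent workers process them, and then aggregating the partial results. This is a simplified stand-in, not a real cluster: threads from the standard library play the role of separate machines, and the names `count_words`, `partition` and `distributed_word_count` are invented for this example.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

def count_words(lines):
    """Work done independently by one worker on its own partition."""
    counts = Counter()
    for line in lines:
        counts.update(line.lower().split())
    return counts

def partition(items, n_parts):
    """Split items into n_parts roughly equal slices."""
    return [items[i::n_parts] for i in range(n_parts)]

def distributed_word_count(lines, n_workers=3):
    # Scatter: each worker receives one partition of the input.
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partials = list(pool.map(count_words, partition(lines, n_workers)))
    # Gather: combining the partial results is the aggregation step.
    total = Counter()
    for partial in partials:
        total += partial
    return total

lines = ["big data", "data mining", "big data analytics", "stream mining"]
result = distributed_word_count(lines)
print(result.most_common(3))
```

Adding workers only helps while the partitions stay independent; the final merge is done in one place, which mirrors the remark that aggregation is the expensive part.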
Technology for Big Data Analytics
Type of Data
• Structured Data
– Relational Data.
– Ex. Tables/Transaction/Legacy Data
• Unstructured Data
– Text Data.
– Ex. Web
• Semi-structured Data
– Document data
– Ex. Email
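The three types differ mainly in how much structure a program can rely on when reading them. The sketch below parses one tiny made-up sample of each kind using only the Python standard library; the sample values and variable names are illustrative assumptions.

```python
import csv
import io
import json
import re

# Structured: fixed columns, every row follows the same schema.
structured = "txn_id,amount\n101,250.00\n102,99.50\n"
rows = list(csv.DictReader(io.StringIO(structured)))
total = sum(float(r["amount"]) for r in rows)

# Semi-structured: self-describing fields, but no fixed schema (an email-like record).
semi_structured = '{"from": "a@example.com", "subject": "Invoice", "body": "Amount due: 250"}'
email = json.loads(semi_structured)

# Unstructured: free text, so meaning has to be extracted by pattern matching or mining.
unstructured = "Loved the new phone, but the battery drains fast. Battery life matters!"
battery_mentions = len(re.findall(r"\bbattery\b", unstructured, flags=re.IGNORECASE))

print(total, email["subject"], battery_mentions)
```

The structured sample can be summed directly by column name, the semi-structured one is accessed by key, and the unstructured one needs text processing before it yields anything countable.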
Big Data Challenges
Challenges of Big Data
• Capture
• Curation
• Storage
• Search
• Sharing
• Transfer
• Analysis
• Visualize
Issues in Big data
• Scalability
• Heterogeneity and Incompleteness
• Precision
• Human Collaboration
• Privacy
• Data Visualization
• Data Redundancy and Compression
Examples of Big Data in Real Life
Quiz Time ???
https://forms.office.com/r/RD8jmxvmzb
Q1
Which is not a technical challenge for Big Data
A. Quality of data
B. Fault tolerance
C. Scalability
D. Massive data volumes
• Answer: D) Massive data volumes
Q2
Which is not a feature of Big Data Analytics?
A. Highly scalable analytics processes
B. Flexibility
C. Real-time results
D. Exotic Hardware
• Answer: D) Exotic Hardware
Q3
All of the following are steps followed to
deploy a Big Data solution except:
A. Data Ingestion
B. Data Processing
C. Data Dissemination
D. Data Storage
• Answer: C) Data Dissemination
Q4
Which of the following applies to Big Data Analytics?
A. Stores only structured data in data marts and data warehouses.
B. Handles large volumes of transactions, but only up to an extent.
C. Handles petabytes or zettabytes of data, in a variety of
formats.
D. Essentially deals only with sampling the data.
• Answer: C) Handles petabytes or zettabytes of data, in a
variety of formats.
Q5
Which of the following would use big data analytics?
A. Healthcare
B. Public Agencies
C. Retail Companies
D. All of the above
• Answer: D) All of the above
Summary
• Introduction to Big Data
• Big Data characteristics
• Types of Big Data
• Big Data Challenges
• Examples of Big Data in Real Life
• Big Data Applications
Thank you
Lecture on
Big Data Analytics
Introduction to Big Data
Mrs. Archana Shirke,
Department of Information Technology,
Fr. C. R. I. T., Vashi.
Modules
1. Introduction to Big Data
• Introduction to Big Data, characteristics, types
• Traditional vs. Big Data business approach, Big Data Challenges,
• Examples of Big Data in Real Life, Big Data Applications
2. Introduction to Big Data Frameworks: Hadoop, NOSQL
3. MapReduce Paradigm
4. Mining Big Data Streams
5. Big Data Mining Algorithms
6. Big Data Analytics Applications
TRADITIONAL vs BIG DATA approach
TRADITIONAL DATA | BIG DATA
Data integration is very easy. | Data integration is very difficult.
A normal system configuration is capable of processing traditional data. | A high system configuration is required to process big data.
Traditional database tools are required to perform any database operation. | Special kinds of database tools are required to perform any database operation.
Its data model is strict-schema based and static. | Its data model is flat-schema based and dynamic.
Its data sources include ERP transaction data, CRM transaction data, financial data, organizational data, web … | Its data sources include social media, device data, sensor data, video, images, audio etc.
Traditional data is generated at the enterprise level. | Big data is generated both outside and at the enterprise level.
Its volume ranges from Gigabytes to Terabytes. | Its volume ranges from Petabytes to Exabytes or Zettabytes.
A traditional database system deals with structured data. | A big data system deals with structured, semi-structured and unstructured data.
The traditional data source is centralized and is managed in centralized form. | The big data source is distributed and is managed in distributed form.
Traditional data is of manageable volume. | Big data is of huge volume, which becomes unmanageable.
TPS Activity
(Think-Pair and Share)
TPS
Problem statement: Select any real-life big data
use case and justify the need for big data analytics. Write
a one-page report and submit it.
• Think (5 minutes): Individually, students write down the real-life
application they have chosen.
• Pair (5 minutes): Each student discusses the written problem statement
with a partner.
• Share (10-15 minutes): A few pairs discuss their solution in
detail. Pairs that used different use cases from the ones
already discussed are then allowed to speak. Students need to justify
their solution with facts.
Summary
• Introduction to Big Data
• Big Data characteristics
• Types of Big Data
• Big Data Challenges
• Examples of Big Data in Real Life
• Big Data Applications
Thank you