CET4001B Big Data
Technologies
School of Computer Engineering and Technology
3/18/2024 Big Data Analytics Lab 1
Unit- I : Introduction to Big Data
• What is Big Data
• Overview of big data analytics
• Traditional database systems vs. big
data systems
• 9 V's of big data
• Significance of big data and real
world challenges
• Architecture of big data systems
• Big data applications
• Data analytics life cycle
Syllabus
UNIT I- Introduction to Big Data 2
Motivation For BIG DATA
1. Huge volume of data:
Rather than thousands or millions of rows, Big Data can run to billions of rows
and millions of columns, driven by applications such as Twitter, Facebook and
Instagram.
2. Complexity of data types and structures:
Big Data comes in multiple forms, both structured and unstructured: financial
data, text files, multimedia files, genetic mappings, and the digital traces
left on the web and in other digital repositories for subsequent analysis.
Motivation for BIG data
Contd..
3. High speed of new data creation and growth:
Big Data can describe high-velocity data, with rapid data
ingestion and near real-time analysis.
4. Distributed computing environments and Massively Parallel
Processing (MPP) architectures, which parallelize data
ingestion and analysis, are the preferred approach for processing such
complex data.
Big Data Sources
• Data is created constantly, and at an ever-increasing rate.
• Sources of Big Data:
1. Mobile phones, social media and imaging technologies all create new data,
which must be stored somewhere for some purpose.
2. Devices and sensors automatically generate diagnostic information that
needs to be stored and processed in real time.
Examples of big data
• Photos and video footage uploaded to the World Wide Web.
• Video surveillance, such as the thousands of video cameras spread across a city.
• Mobile devices, which provide geospatial location data of the users, as well as
metadata about text messages, phone calls, and application usage on smartphones.
• Smart devices, which provide sensor-based collection of information from smart
electric grids, smart buildings, and many other public and industry infrastructures.
Statistics of big data
DEFINITION OF BIG DATA
There is no single definition.
• Big data is high-volume, high-velocity, high-variety information assets that
require new forms of processing to enable enhanced decision making, insight
discovery and process optimization. (Doug Laney, Gartner, 2012)
• Big Data is data whose scale, distribution, diversity, and/or timeliness
require the use of new technical architectures and analytics to enable insights
that unlock new sources of business value.
Example: Healthcare application
Overview of Big Data Analytics
• Big data analytics is the often-complex process of examining large and varied data
sets, or big data, to uncover information -- such as hidden patterns, unknown
correlations, market trends and customer preferences -- that can help organizations
make informed business decisions.
Big Data Analytics..
• Big data analytics is a form of advanced analytics, which involves complex applications.
• The types of analytics are as follows:
• Descriptive analytics answers the question of what happened.
• Diagnostic analytics measures historical data against other data to answer the
question of why something happened.
• Predictive analytics tells what is likely to happen.
• Prescriptive analytics literally prescribes what action to take to eliminate a
future problem or take full advantage of a promising trend.
Various types of big data analytics
• Descriptive: A set of techniques for reviewing and examining the data set(s)
to understand the data and analyze business performance.
• Diagnostic: A set of techniques for determining what has happened and why.
• Predictive: A set of techniques that analyze current and historical data to
determine what is most likely to happen (or not happen).
• Prescriptive: A set of techniques for computationally developing and
analyzing alternatives that can become courses of action, either tactical or
strategic, and that may discover the unexpected.
• Decisive: A set of techniques for visualizing information and recommending
courses of action to facilitate human decision-making when presented with
a set of alternatives.
Descriptive Analytics
• Steps:
• Identify the attributes, then
assess/evaluate the attributes
• Estimate the magnitude to correlate the
relative contribution of each attribute to
the final solution
• Accumulate more instances of data from
the data sources
• If possible, perform the steps of
evaluation, classification and
categorization quickly
• Yield a measure of adaptability within the
OODA loop
• At some threshold, crossover into diagnostic
and predictive analytics
Diagnostic Analytics
• Steps:
• Begin with descriptive analytics
• Extract patterns from large data
quantities via data mining
• Correlate data types for
explanation of near-term behavior
– past and present
• Estimate linear/non-linear
behavior not easily identifiable
through other approaches.
• Example: by classifying past insurance
claims, estimate the number of future
claims to flag for investigation with a
high probability of being fraudulent.
Predictive Analytics
• Steps:
• Begin with descriptive AND
diagnostic analytics
• Choose the right data based on
domain knowledge and relationships
among variables
• Choose the right techniques to yield
insight into possible outcomes
• Determine the likelihood of possible
outcomes given initial boundary
conditions
• Remember! Data driven analytics is
non-linear; do NOT treat like an
engineering project
Prescriptive
Analytics
• Steps:
• Begin with predictive analytics
• Determine what should occur and
how to make it so
• Determine the mitigating factors that
lead to desirable/undesirable
outcomes
• “What-if” analysis with local or
global optimization
• Find the best set of prices and
advertising frequency to
maximize revenue
• The right set of business moves
to make to achieve that goal
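The price and advertising example above can be pictured as a small what-if search. This is only a sketch: the revenue model `expected_revenue`, its coefficients and the candidate ranges are hypothetical values invented for illustration, not taken from the slides. It evaluates every (price, ads per week) scenario on a small grid and keeps the one with the highest revenue.

```python
from itertools import product

def expected_revenue(price, ads_per_week):
    """Hypothetical revenue model (an assumption, not from the slides)."""
    # Demand falls linearly with price and rises with advertising.
    demand = max(0.0, 1000 - 40 * price + 25 * ads_per_week)
    ad_cost = 150 * ads_per_week
    return price * demand - ad_cost

def best_scenario(prices, ad_frequencies):
    """Evaluate every what-if scenario; return (revenue, price, ads) of the best."""
    return max(
        (expected_revenue(p, a), p, a)
        for p, a in product(prices, ad_frequencies)
    )

revenue, price, ads = best_scenario(prices=range(5, 21), ad_frequencies=range(0, 11))
print(f"best price={price}, ads/week={ads}, revenue={revenue:.0f}")
```

An exhaustive grid is practical only for a handful of decision variables; larger problems would call for a proper optimization method.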
Decisive Analytics
• Steps:
• Given a set of decision alternatives,
choose the one course of action to
do from possibly many
• But, it may not be the optimal one.
• Visualize alternatives – whole or
partial subset
• Perform exploratory analysis –
what-if and why
• How do I get to there from here?
• How did I get here from there?
What-if analysis
• What-if analysis is the process of calculating backward to find an input by
providing a specific output.
• It works in the opposite fashion to a formula.
• What-if analysis helps to find out what input will result in a specific output.
What-if analysis (contd)
Example of formula:
y = a * (x^2)
Problem: given the values of the input variables a and x, we can compute the
value of the output variable y.
What-if analysis (contd)
Example of what-if analysis:
Suppose a student plans to score an average of 80 in the semester exam. She
scored 82, 70, 83 and 76 in English, Mathematics, Computer Science
and Mechanics respectively.
The Statistics exam is due shortly, and we want to calculate the marks she
needs to score in Statistics to achieve an average of 80 in the semester.
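The student example can be worked backward in a few lines. This is a minimal sketch, assuming five equally weighted subjects; the function name `required_score` is made up for illustration.

```python
def required_score(target_average, known_scores, total_subjects):
    """Work backward from a target average to the one unknown score."""
    # average = (sum(known) + x) / n  =>  x = n * average - sum(known)
    return target_average * total_subjects - sum(known_scores)

scores = {"English": 82, "Mathematics": 70, "Computer Science": 83, "Mechanics": 76}
needed = required_score(80, scores.values(), total_subjects=5)
print(f"Marks needed in Statistics: {needed}")
```

With the four known marks summing to 311, she needs 5 × 80 − 311 = 89 in Statistics.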
What-if analysis (contd)
Before and after scenarios
Big data Analytics
• Big data can deliver value in almost any area of business or
society:
Report on Big Data in Big Companies
Key Roles For A Successful Big Data Analytics Project
1. Business User
Someone who understands the domain area and
usually benefits from the results.
This person can consult and advise the project team on
the context of the project, the value of the results, and
how the outputs will be operationalized [put into operation / use].
Usually a business analyst or subject matter expert in
the project domain fulfills this role.
2. Project Sponsor
Responsible for the genesis [origin/source] of the project.
Provides the impetus [impulse/stimulus] and requirements for the project
and defines the core business problem.
Generally provides the funding and gauges the
degree of value from the final outputs of the working
team.
This person sets the priorities for the project and
clarifies the desired outputs.
3. Project Manager
• Ensures that key milestones [significant stages/events] and objectives are met on
time and at the expected quality.
4. Business Intelligence Analyst
• Provides business domain expertise based on :
• A deep understanding of the data,
• Key Performance Indicators (KPIs)
• key metrics
• Business intelligence from a reporting perspective
• Business intelligence analysts generally create dashboards
and reports and have knowledge of the data feeds and
sources.
KPIs vs Key metrics
• KPIs are measurable values that show you how
effective you are at achieving business objectives.
• Metrics are different in that they simply track the
status of a specific business process.
• Thus KPIs track whether you hit business
objectives/targets, and metrics track processes
KPI
• Example of KPI
• The target for each team was to increase
sales revenue by 20% by the end of this
year (2021)
Team B
Team A increase in sales = 18%
increase in sales = 21%
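The KPI example can be sketched as a check of each result against the target. The 20% target and the two growth figures (18% and 21%) come from the slide; the function name `kpi_met` is an assumption, and the growth values are deliberately left unlabeled by team.

```python
TARGET_GROWTH = 20.0  # KPI target from the slide: +20% sales revenue

def kpi_met(actual_growth, target=TARGET_GROWTH):
    """A KPI is judged against its target; a plain metric is only reported."""
    return actual_growth >= target

for growth in (18.0, 21.0):
    status = "target met" if kpi_met(growth) else "target missed"
    print(f"sales growth {growth:.0f}%: {status}")
```

The growth percentage on its own is a metric; it becomes a KPI only once it is compared with the business target.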
KPIs vs Key Metrics (contd…)
• KPIs address the overall targeted goals or objectives. Metrics do not.
• Key metrics are specific. KPIs are not.
• Accurate tracking of progress (from staff to enterprise level) needs the
use of both KPIs and key metrics.
Example for KPI and Metrics
• Best Social Media Marketing Metrics
• Likes
• Engagement
• Followers growth
• Traffic conversions
• Social interactions
• Social sentiment
• Social visitor goals
• Social shares
• Web visitors from social channel
• Social visitors conversion rates
• Best call center metrics to monitor
Many industry leading companies track these on TV data walls:
• Call completion rate
• Agent utilization
• Answer seizure ratio (ASR)
• First call resolution rate
• Speed of answer (SA)
• Call handling time
• Call drop rate (CDR)
• First contact resolution rate
• Sales per agent
• Lead conversion rate
Business intelligence
• Business Intelligence (BI) refers to technologies, applications and practices
for the collection, integration, analysis, and presentation of business
information.
• The purpose of Business Intelligence is to support better business decision
making.
• Business Intelligence Analysts generally create dashboards and reports and
have knowledge of the data feeds and sources.
5. Database Administrator (DBA)
• Provisions and configures the database environment to support
the analytics needs of the working team.
• These responsibilities may include providing access to key databases
or tables and ensuring the appropriate security levels are in place
related to the data repositories.
6. Data Engineer
• Leverages deep technical skills to assist with tuning Structured Query
Language (SQL) queries for data management and data extraction,
and provides support for data ingestion into the analytic sandbox.
• While the DBA sets up and configures the databases to be
used, the data engineer executes the actual data extractions and
performs substantial data manipulation to facilitate the analytics.
Analytics Sandbox
An Analytics Sandbox is a separate environment
that is part of the architecture, used by multiple
users and maintained with the support of IT.
• Key Characteristics
• The environment is controlled by the analyst
• Allows them to install and use the data tools of their choice
• Allows them to manage the scheduling and processing of the data assets
• Enables analysts to explore and experiment with
internal and external data
• Can hold and process large amounts of data
efficiently from many different data sources:
big data (unstructured), transactional data (structured),
web data, social media data, documents etc.
The Analytical Sandbox: Definition
• A set of resources that enable analytic
professionals to experiment and
reshape data in whatever fashion they
need to
• Data exploration
• Development of analytical processes
• Proofs of concept
• Prototyping
The Analytical Sandbox
An Internal Sandbox
• A portion of an enterprise data warehouse or data mart is carved out to serve as
the analytic sandbox
• Strengths
• Leverages existing hardware resources and infrastructure already in
place
• Ability to directly join production data with sandbox data
• Cost-effective since no new hardware is needed
• Weaknesses
• An additional load on the existing enterprise data warehouse or data
mart
• Can be constrained by production policies and procedures
[Figure: an internal sandbox carved out of the Enterprise Data Warehouse or Data Mart; labels include Sandbox, Core Database Tables, Views, Analytic Data Sets and Additional Data]
The Analytical Sandbox
An External Sandbox
• A physically separate analytic sandbox is created for testing and development of
analytic processes
• Strengths
• A stand-alone environment, with no impact on other processes
• Reduced workload management
• Weaknesses
• The additional cost of the stand-alone system
• Some data movement
[Figure: an external sandbox populated by an extract from the Enterprise Data Warehouse or Data Mart]
The data engineer works closely with the data
scientist to help shape data in the right ways
for analyses.
7. Data Scientist
• Provides subject matter expertise for:
• analytical techniques,
• data modeling,
• applying valid analytical techniques to given business problems.
• Ensures overall analytics objectives are met.
• Designs and executes analytical methods
and approaches with the data available to the
project.
Roles contd…
• Each role plays a critical part in a
successful analytics project.
• Although seven roles are listed,
fewer or more people can
accomplish the work depending on
• the scope of the project,
• organizational structure and
• the skills of the participants.
SUMMARY
Traditional database systems
vs
Big Data systems
Analytics Difference
Characteristics of Big
Data
1. Volume:
Big data first and foremost has to be “big,” and size
in this case is measured as volume.
Example:
From clinical data associated with lab tests and
physician visits, to the administrative data
surrounding payments, this well of information is
already expanding.
When that data is coupled with greater use of
precision medicine, there will be a big data
explosion in health care, especially as genomic and
environmental data become more ubiquitous.
2. Velocity:
Velocity in the context of big data refers to two related
concepts familiar to anyone in healthcare: the rapidly
increasing speed at which new data is being created
by technological advances, and the corresponding
need for that data to be digested and analyzed in near
real-time.
Example:
As more and more medical devices are designed to
monitor patients and collect data, there is great
demand to be able to analyze that data and then to
transmit it back to clinicians and others.
This “internet of things” of healthcare will only lead to
increasing velocity of big data in healthcare.
3. Variety:
With increasing volume and velocity
comes increasing variety. This third “V”
describes just what you’d think: the huge
diversity of data types that healthcare
organizations see every day.
• Example: Electronic health records and medical
devices.
Each one might collect a different kind of
data, which in turn might be interpreted
differently by different physicians, or
made available to a specialist but not a
primary care provider.
• Challenges:
Standardizing and distributing all of that
information so that everyone involved is
on the same page.
4. Veracity
• Veracity refers to how trustworthy or messy the data is: the higher the
trustworthiness of the data, the lower its messiness, and vice versa.
• Since the data is collected from multiple sources, we
need to check the data for accuracy before using it
for business insights.
• It also refers to the assurance of quality, integrity, credibility
and accuracy of the data.
• Veracity and Value together define the data
quality, which can provide great insights to data
scientists.
5. Value
Big data must have value.
That is, if you’re going to invest in the infrastructure
required to collect and interpret data on a
system-wide scale, it’s important to ensure that the
insights that are generated are based on accurate
data and lead to measurable improvements at the
end of the day.
Organizations might use the same tools and
technologies for gathering and analyzing the data
they have available, but how they then put that data
to work is ultimately up to them.
Technical experts will need to be combined with
domain experts who have strong industry knowledge and
the ability to apply this know-how within organisations
for value creation.
6 Vs of Big Data (summary)
Current ‘V’s of Big Data
9 Vs of Big Data
Vs of Big Data
Significance of Big Data
• Driven by specialized analytics systems
and software, as well as high-powered
computing systems, big data analytics
offers various business benefits, including:
• New revenue opportunities
• More effective marketing
• Better customer service
• Improved operational efficiency
• Competitive advantages over rivals
Significance of Big Data Cont.…
1. It helps companies to better understand and serve
customers:
• Examples include the recommendations made by Amazon or
Netflix, and customer acquisition and retention at Coca-Cola.
2. It allows companies to optimize their processes:
• Faster and better decision making
• Examples:
• UOB Bank in Singapore uses Big Data for risk
management
• Uber is able to predict demand, dynamically price journeys
and send the closest driver to the customer
Significance of Big Data Cont.…
3. It improves our health care:
• Government agencies can now predict flu outbreaks and track them in real time, and pharmaceutical
companies are able to use big data analytics to fast-track drug development.
4. It helps us to improve security:
• Government and law enforcement agencies use big data to foil terrorist attacks and detect cyber crime.
5. It allows sports stars to boost their performance:
• Sensors in balls and GPS trackers on clothing allow athletes to analyze and improve upon what they do.
6. Cost reduction:
• Big Data technologies like Hadoop and cloud-based analytics bring significant cost
advantages when it comes to storing large amounts of data.
Real world Challenges
1. Dealing with data Growth
• The most obvious challenge associated with big
data is simply storing and analyzing all that
information.
2. Recruiting and retaining big data talent
• In order to develop, manage and run applications
that generate insights, organizations need
professionals with big data skills.
• Potential pitfalls of big data analytics initiatives
include a lack of internal analytics skills and the
high cost of hiring experienced data scientists and
data engineers to fill the gaps.
Real world Challenges contd..
3. Generating insights in a timely manner
• Business goals can be achieved if data scientists can
extract insights from Big Data and act upon
them quickly.
• Although some organizations are fortunate enough to have
data scientists (most are not), there is a growing
talent gap that makes finding and hiring data
scientists in a timely manner difficult.
Real world Challenges contd..
4. Integrating disparate data sources
• The variety associated with big data leads to challenges in data
integration.
• Big data comes from a lot of different places — enterprise
applications, social media streams, email systems,
employee-created documents, etc. Combining all that data and
reconciling it so that it can be used to create reports can be
incredibly difficult.
5. Validating data
• Often organizations are getting similar pieces of data from different
systems, and the data in those different systems doesn't always
agree.
• For example, the ecommerce system may show daily sales at a
certain level while the enterprise resource planning (ERP) system
has a slightly different number.
Architecture
of Big Data
Systems
Architecture of Big data Systems
4 Core Layers of Big Data Systems Architecture:
• Data Storage layer
• Data Processing layer
• Data Query layer
• Data Visualization layer
Traditional Data Systems:
• Physical layer
• Logical layer
• View layer
Architecture of Big data Systems
(Cont.…)
1. Data Storage layer:
• Must handle heterogeneity by using different data stores
• Polyglot persistence: the approach of identifying an effective data store for each
particular kind of data
• To store large amounts of unstructured data, the Hadoop Distributed File System
(HDFS) can be used.
• For object-based storage, Amazon Simple Storage Service (S3) can be used
• The functionality of this layer is handled by 2 sublayers
• Physical layer: handles large volumes of heterogeneous real-time data
• Data layer: maintains data blocks and the global namespace to access data
• It also maintains tools to organize, access and retrieve
heterogeneous data
Architecture of Big data
Systems (Contd…)
2. Data Processing layer:
Data collected in the storage layer is processed in this
layer in batch or real-time mode
• Batch processing is used for offline analytics
• E.g. Hadoop is a batch processing system that uses the MapReduce
programming technique
• Real-time processing is used for online analytics
• E.g. Apache Storm processes streaming data in real time to support
decision making
• Spark is a time-efficient, in-memory data processing engine that can
execute streaming, machine learning or SQL workloads
• Along with MapReduce and Spark, this layer also supports tools for statistical
modelling and machine learning
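The MapReduce programming technique mentioned above can be pictured with a word count. The sketch below runs in a single Python process and does not use the Hadoop API; the function names are chosen for illustration only. A real framework distributes the map and reduce calls across many machines.

```python
from collections import defaultdict

def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]

def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce: combine the values for one key into a single result."""
    return key, sum(values)

def word_count(lines):
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())

counts = word_count(["big data is big", "data is everywhere"])
print(counts)
```

Because each map call sees only one line and each reduce call sees only one key, both phases can run in parallel, which is what makes the model suit batch processing of large data.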
Architecture of Big data Systems
(Cont.…)
3. Data query layer:
• This layer aims at obtaining data values or
valuable insights from the processing layer
• Hive: used by data analysts to query,
summarize, explore and analyze data stored in Hadoop
to obtain actionable business insights
• Analytics engine: extends the functionality
of the data processing layer with domain-specific
tools for decision making
• Tools in this layer perform descriptive,
predictive and diagnostic analytics
Architecture of Big data Systems
(Cont.…)
4. Data Visualization layer:
• This layer presents the value of the data in presentable,
understandable formats
• It makes use of dashboards, graphs and tables for
visualization
• E.g. Google Charts:
• It is a JavaScript-based charting library meant to enhance web applications by adding
interactive charting capability.
• Google Charts provides a wide variety of charts, for example line charts, spline charts,
area charts, bar charts, pie charts and so on.
• E.g. D3:
• It is a programming tool for visualization
• Users must be knowledgeable in JavaScript to visualize the collected data
effectively
Architecture of Big data Systems
(Cont.…)
The following layers offer common services to the core layers and are also
called service layers.
1. Data Ingestion layer:
• This layer determines the value of information extracted
• Data coming from different sources is prioritized,
validated, categorized and routed to the destination
for effective storage and access
• Data may be ingested in batches periodically or in
real time
• E.g. Sqoop:
• supports bulk data transfer between Hadoop and
structured stores such as Oracle, MySQL
• E.g. Elastic Logstash:
• aggregates data from multiple sources and routes it
Architecture of Big data Systems
(Cont.…)
2. Data Collector layer:
• This layer transports data from the ingestion layer to the rest of the data pipeline
• E.g. Kafka:
• It is a message-oriented middleware used for data collection
• It works with Storm, HBase and Spark for real-time analysis of data
3. Data Security layer:
• This layer provides authentication, authorization, audit, data
encryption and central administration for big data systems
• E.g. Knox in the Hadoop stack, Kerberos, HDFS encryption
Architecture of Big data Systems
(Cont.…)
4. Data Monitoring layer:
• It includes tools for monitoring
performance at the infrastructure,
framework, analytics engine, data
store and application levels
5. Infrastructure layer:
• This layer provides the hardware to
host various big data frameworks, preferably in
cloud infrastructure that is highly
scalable
Big Data Analytics Life Cycle
[Figure: life cycle diagram; surviving stage labels: Analytical Modelling, Communicating the results, Deployment]
1. Business Case Evaluation
• The life cycle must begin with a well-defined business case that presents a
clear understanding of the justification, motivation and goals of carrying
out the analysis.
• An evaluation of a Big Data analytics business case helps decision-makers to
understand the business resources needed and the business challenges to
tackle, including KPIs.
• Initial iterations of the Big Data analytics lifecycle will require more
up-front investment in Big Data technologies, products and training
compared to later iterations.
• The outcome of this stage is an understanding of the budget (h/w, s/w)
required to carry out the analysis project.
2. Data Identification
• Identifying a wider variety of data sources
may increase the probability of finding
hidden patterns and correlations.
• Depending on the business scope of the
analysis project and nature of the business
problems being addressed, the required
datasets and their sources can be
categorized into 2 types:
• Internal datasets: such as data marts
and operational systems; these are typically
compiled and matched against a
pre-defined dataset specification.
• External datasets: publicly available
datasets, content-based web sites,
blogs.
• Review the raw data
• Evaluate the data structures
• Decide on the infrastructure requirements
3. Data Acquisition and Filtering
• The data is gathered from all of the data
sources that were identified during the
previous stage.
4. Data Extraction
• The extent of extraction and
transformation required depends on
the types of analytics and capabilities of
the Big Data solution.
• E.g., extracting the required fields from
delimited textual data, such as with
webserver log files.
• Similarly, extracting text for text
analytics, which requires scans of whole
documents, is simplified if the
underlying Big Data solution can
directly read the document in its native
format.
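Extracting the required fields from delimited text can be sketched as follows. The tab-delimited log layout, the field names in `FIELDS` and the sample records are assumptions made for illustration; real web server logs usually need a format-specific parser.

```python
import csv
import io

# Hypothetical tab-delimited web server log (field layout is an assumption).
FIELDS = ["ip", "timestamp", "method", "path", "status", "bytes"]
RAW_LOG = (
    "203.0.113.7\t2024-03-18T10:15:00Z\tGET\t/index.html\t200\t5120\n"
    "198.51.100.4\t2024-03-18T10:15:02Z\tGET\t/missing\t404\t312\n"
)

def extract(raw, wanted):
    """Keep only the wanted fields from each delimited record."""
    reader = csv.DictReader(io.StringIO(raw), fieldnames=FIELDS, delimiter="\t")
    return [{name: row[name] for name in wanted} for row in reader]

records = extract(RAW_LOG, wanted=["path", "status"])
print(records)
```

All extracted values are still strings at this point; converting and checking them belongs to the validation stage.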
5. Data Validation and Cleansing
• Examining the cleanliness of the
data
• Checking for consistency of data
by identifying missing and
inconsistent values.
• Assessing the consistency of the
data types by checking if values
suit the data type.
• Reviewing the contents of the
data columns for relevant and
consistent values
• Looking for validity of incoming
data by checking for extreme or
incorrect values.
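The checks listed above (missing values, values that do not suit the data type, extreme values) can be sketched in plain Python. The record layout, the `schema` and `limits` arguments and the sample rows are hypothetical, chosen only to trigger each check once.

```python
def validate(records, schema, limits):
    """Return a list of (row_index, field, problem) for each failed check.

    schema maps field -> expected type; limits maps field -> (low, high).
    Both are hypothetical inputs chosen for this sketch.
    """
    problems = []
    for i, row in enumerate(records):
        for field, expected_type in schema.items():
            value = row.get(field)
            if value is None:
                problems.append((i, field, "missing"))
            elif not isinstance(value, expected_type):
                problems.append((i, field, "wrong type"))
            elif field in limits and not (limits[field][0] <= value <= limits[field][1]):
                problems.append((i, field, "out of range"))
    return problems

rows = [
    {"patient_id": 1, "heart_rate": 72},
    {"patient_id": 2, "heart_rate": None},    # missing value
    {"patient_id": 3, "heart_rate": "fast"},  # value does not suit the type
    {"patient_id": 4, "heart_rate": 900},     # extreme value
]
issues = validate(rows, schema={"patient_id": int, "heart_rate": int},
                  limits={"heart_rate": (20, 250)})
print(issues)
```

Reporting problems rather than silently dropping rows leaves the cleansing decision (repair, impute or discard) to the analyst.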
6. Data Aggregation and Representation
• The objective of this phase is to
integrate multiple datasets to
arrive at a unified view.
• The tools for data ingestion,
filtering, extraction, validation,
cleansing and aggregation include
Hadoop, OpenRefine, Alpine
Miner and Data Wrangler.
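Integrating multiple datasets into a unified view can be sketched as a merge on a shared key. The dataset names, fields and the helper `unified_view` are hypothetical; when two datasets disagree on a field, this sketch simply lets the later dataset win, which a real project would need to decide deliberately.

```python
def unified_view(*datasets, key):
    """Merge records from several datasets into one record per key value."""
    merged = {}
    for dataset in datasets:
        for record in dataset:
            # Start a fresh dict per key so the input records are not modified.
            merged.setdefault(record[key], {}).update(record)
    return list(merged.values())

crm = [{"customer_id": 1, "name": "Asha"}, {"customer_id": 2, "name": "Ravi"}]
orders = [{"customer_id": 1, "total_spend": 250.0}]
view = unified_view(crm, orders, key="customer_id")
print(view)
```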
UNIT I- Introduction to Big Data 89
7. Analytical Modelling
• Data analysis helps in forming hypotheses that can be used to understand the data.
• Analytical modelling includes two sub-phases:
1. Model Planning
   1. Data Exploration
      • Helps to clean the data and improve data quality.
   2. Model Selection
      • Commonly used tools are R, SQL Analysis Services, and SAS/ACCESS for RDBMS.
2. Model Building
   • Develop an analytical model that is fitted on the training data over several iterations and evaluated against test data.
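The model building step (fit on training data, evaluate on test data) can be sketched as below. This is a simplified illustration, not a prescribed method: the data points are invented, the split is a fixed 80/20 cut, and the model is a straight line fitted with a closed-form least-squares formula rather than an iterative procedure.

```python
# Hypothetical (x, y) observations, roughly following y = 2x + 1.
xs = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
ys = [3.1, 4.9, 7.2, 9.0, 10.8, 13.1, 15.0, 16.9, 19.2, 20.9]

# Split into training data (first 80%) and test data (last 20%).
split = int(len(xs) * 0.8)
x_train, y_train = xs[:split], ys[:split]
x_test, y_test = xs[split:], ys[split:]


def fit_line(x_values, y_values):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(x_values)
    mean_x = sum(x_values) / n
    mean_y = sum(y_values) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(x_values, y_values))
    sxx = sum((x - mean_x) ** 2 for x in x_values)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def mean_squared_error(slope, intercept, x_values, y_values):
    """Average squared difference between predictions and actual values."""
    errors = [(slope * x + intercept - y) ** 2 for x, y in zip(x_values, y_values)]
    return sum(errors) / len(errors)


slope, intercept = fit_line(x_train, y_train)
test_mse = mean_squared_error(slope, intercept, x_test, y_test)
print(f"slope={slope:.3f} intercept={intercept:.3f} test MSE={test_mse:.4f}")
```

A low error on the held-out test data suggests the model generalizes; if the error is high, the team returns to model planning and tries again.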
8. Communicating
• Record all the findings, then select the most significant ones and share the results with the other stakeholders.
• The team makes recommendations for future work or improvements to existing processes.
9. Deployment
This phase deals with deploying the analytical models in a production environment.
The output of these models can also be used to prescribe some actions such as:
• Optimizing business processes
• Creating alerts
• Extending the functionality of enterprise systems
Data Analytics Life Cycle
• The Data Analytics Lifecycle defines analytics process best practices spanning discovery to project completion.
• The lifecycle draws from established methods in the realm of data analytics and decision science.
• This synthesis was developed after gathering input from data scientists and consulting established approaches that provided input on pieces of the process.
• Traditional projects follow a process-centric approach (Waterfall/Spiral) to develop the project.
• SDLC cannot be applied directly to data analytics projects, as they are data-centric projects.
• We have to follow the CRISP-DM approach for data-oriented projects.
Applications of Big Data across
various industries
Sports Domain
• To understand and study player movement
• E.g. Nike uses big data for eco-friendly product design
Sentiment Analysis
• To understand changing customer interests and identify potential customers
• E.g. Delta Airlines
Behavioral Analysis
• To understand customer behavior
• E.g. Amazon's product recommendations
Healthcare
Big Data Applications
Customer Segmentation
• It is the grouping of similar users based on their purchases, and the recommendation of suitable items to them based on personal or group interest.
• E.g. Pandora provides music recommendations based on static profile, related songs, user interest, and location.
• Netflix uses a collaborative filtering algorithm to recommend movies.
• Amazon
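The idea behind collaborative filtering can be sketched with a toy user-based recommender. This is a teaching sketch only and does not represent the actual algorithm of Netflix or any other company: the users, movie titles, and ratings are invented, and cosine similarity is just one of several possible similarity measures.

```python
from math import sqrt

# Hypothetical ratings: user -> {movie: rating on a 1-5 scale}.
ratings = {
    "alice": {"Movie A": 5, "Movie B": 4, "Movie C": 1},
    "bob": {"Movie A": 5, "Movie B": 5, "Movie D": 4},
    "carol": {"Movie C": 5, "Movie E": 4},
}


def cosine_similarity(a, b):
    """Cosine similarity between two users' rating dictionaries."""
    common = set(a) & set(b)
    if not common:
        return 0.0
    dot = sum(a[m] * b[m] for m in common)
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b)


def recommend(user, ratings, top_n=1):
    """Rank unseen movies by similarity-weighted ratings of other users."""
    scores = {}
    for other, other_ratings in ratings.items():
        if other == user:
            continue
        similarity = cosine_similarity(ratings[user], other_ratings)
        for movie, rating in other_ratings.items():
            if movie not in ratings[user]:
                scores[movie] = scores.get(movie, 0.0) + similarity * rating
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:top_n]


top_for_alice = recommend("alice", ratings)
print(top_for_alice)
```

Here alice's tastes are closest to bob's, so a movie bob rated highly and alice has not seen is ranked first.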
Prediction
• It is the estimation of an outcome based on historical information.
Fraud Detection
• To detect, prevent, and eliminate internal and external fraud.
• An unusual usage pattern on a debit or credit card can alert a bank to a stolen card.
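One simple way to flag an unusual card usage pattern is to compare a new transaction with the cardholder's history. The sketch below uses a z-score rule; the transaction amounts and the threshold of 3 standard deviations are assumptions for illustration, and real fraud detection systems use far richer features and models.

```python
from statistics import mean, stdev

# Hypothetical past transaction amounts for one card.
history = [42.0, 38.5, 51.0, 45.2, 40.0, 47.8, 39.9, 44.1]


def is_unusual(amount, history, threshold=3.0):
    """Flag an amount that lies far from the card's usual spending."""
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        return amount != mu
    z_score = abs(amount - mu) / sigma
    return z_score > threshold


flags = {amount: is_unusual(amount, history) for amount in (43.0, 950.0)}
print(flags)
```

A flagged transaction would typically raise an alert for review rather than being blocked outright.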
Personalized Healthcare
Big Data Architecture for Personalized Healthcare
Personalized Healthcare Contd..
The data processing layer extracts the big data driven phenotype.
The analytic layer uses the following:
• Descriptive analytics to evaluate various statistics and visualize them using charts.
• Diagnostic analytics using survival analysis and regression to correlate the survival rate of patients with heart failure.
• Predictive analytics using classification, clustering, and inferential analysis to predict the survival rate for a new patient.
• Prescriptive analytics for treatment plan and decision support.
Multiple dimensions of Big Data
Data Contd..
Value is generated by:
• acquiring data,
• combining data from different sources,
• providing access to it while ensuring data integrity and preserving privacy.
Value is added by:
• pre-processing,
• validating,
• analyzing,
• augmenting,
• ensuring data integrity and accuracy.
1. Skills
Ensuring the availability of highly and rightly skilled people
who have an excellent grasp of the best practices and
technologies for delivering Big Data Value within
applications and solutions.
There will be the need for data scientists and engineers who have expertise in:
• analytics
• statistics
• machine learning
• data mining
• data management
2. Legal:
• The increased importance of data will intensify the debate on
data ownership and usage,
data protection and privacy,
security,
liability,
cybercrime,
Intellectual Property Rights (IPR) and
impact of insolvencies on data rights.
3. Technical
Key aspects including
• real-time analytics,
• low latency and scalable data processing,
• new and rich user interaction interfaces, and
• linking data, information and content
all have to be advanced to open up new opportunities and to sustain or develop competitive advantages.
4. Application
• Novel applications and solutions must be developed and validated based on technologies and concepts in ecosystems.
• Business and market-ready applications need to be a core target to allow activities to have market impact.
5. Business
• A more efficient use of Big Data, and understanding data as an economic asset, carries great potential for the economy and society.
• The setup of Big Data Value ecosystems and the development of appropriate business models on top of a strong Big Data Value ecosystem must be supported in order to generate the desired positive impact on the economy and employment.
6. Social
Big Data will provide solutions for major societal challenges, such as
• improved efficiency in healthcare information processing, or
• reduced CO2 emissions through climate impact analysis.
In parallel, it is critical for an accelerated adoption of Big Data to increase awareness of the benefits and the value that Big Data can create for business, the public sector, and the citizen.