
BDT-Unit I

The document provides an overview of Big Data, including its definition, significance, and the various types of analytics involved in processing it. It discusses the motivations for Big Data, its sources, and the roles necessary for successful Big Data analytics projects. Additionally, it outlines the data analytics life cycle and the importance of understanding different analytical techniques such as descriptive, diagnostic, predictive, prescriptive, and decisive analytics.

Uploaded by

Aarya Kevadia
Copyright
© © All Rights Reserved

CET4001B Big Data Technologies
School of Computer Engineering and Technology

3/18/2024 Big Data Analytics Lab 1


Unit- I : Introduction to Big Data
• What is Big Data
• Overview of big data analytics
• Traditional database systems vs. big
data systems
• 9 V's of big data
• Significance of big data and real
world challenges
• Architecture of big data systems
• Big data applications
• Data analytics life cycle

Syllabus

UNIT I- Introduction to Big Data 2


Motivation for Big Data
1. Huge volume of data:

Rather than thousands or millions of rows, Big Data can run to billions of rows
and millions of columns, generated by applications such as Twitter, Facebook,
and Instagram.

2. Complexity of data types and structures:

Big Data comes in multiple forms, both structured and unstructured: financial
data, text files, multimedia files, genetic mappings, and the digital traces
left on the web and in other digital repositories for subsequent analysis.

3. High speed of new data creation and growth:

Big Data can describe high-velocity data, with rapid data
ingestion and near-real-time analysis.

4. Distributed computing environments and Massively Parallel
Processing (MPP) architectures, which enable parallelized data
ingestion and analysis, are the preferred approach for processing such
complex data.


Motivation for Big Data (contd.)

Big Data Sources


Big Data Sources (contd.)
• Data is created constantly, and at an ever-increasing rate.

• Sources of Big Data:

1. Mobile phones, social media, and imaging technologies all create new
data, which must be stored somewhere for some purpose.

2. Devices and sensors automatically generate diagnostic information that
needs to be stored and processed in real time.
Examples of big data

• Photos and video footage uploaded to the World Wide Web.

• Video surveillance, such as the thousands of video cameras spread across a city.

• Mobile devices, which provide geospatial location data of the users, as well as
metadata about text messages, phone calls, and application usage on smartphones.

• Smart devices, which provide sensor-based collection of information from smart
electric grids, smart buildings, and many other public and industry infrastructures.


Statistics of big data



DEFINITION OF BIG DATA

There is no single definition:

• Big data is high-volume, high-velocity, high-variety information assets that
require new forms of processing to enable enhanced decision making, insight
discovery and process optimization. (Doug Laney, Gartner, 2012)

• Big Data is data whose scale, distribution, diversity, and/or timeliness
require the use of new technical architectures and analytics to enable insights
that unlock new sources of business value.


Example: Healthcare application

Overview of Big Data Analytics
• Big data analytics is the often-complex process of examining large and varied data
sets, or big data, to uncover information -- such as hidden patterns, unknown
correlations, market trends and customer preferences -- that can help organizations
make informed business decisions.
Big Data Analytics (contd.)

• Big data analytics is a form of advanced analytics, which involves complex applications.
• The types of analytics are:
• Descriptive analytics answers the question of what happened.
• Diagnostic analytics measures historical data against other data to answer the
question of why something happened.
• Predictive analytics tells what is likely to happen.
• Prescriptive analytics literally prescribes what action to take to eliminate a
future problem or take full advantage of a promising trend.


Various types of big data analytics

• Descriptive: A set of techniques for reviewing and examining the data set(s)
to understand the data and analyze business performance.
• Diagnostic: A set of techniques for determining what has happened and why.
• Predictive: A set of techniques that analyze current and historical data to
determine what is most likely (or not likely) to happen.
• Prescriptive: A set of techniques for computationally developing and
analyzing alternatives that can become courses of action, either tactical or
strategic, and that may discover the unexpected.
• Decisive: A set of techniques for visualizing information and recommending
courses of action to facilitate human decision-making when presented with
a set of alternatives.


Descriptive Analytics
• Steps:
• Identify the attributes, then
assess/evaluate the attributes
• Estimate the magnitude to correlate the
relative contribution of each attribute to
the final solution
• Accumulate more instances of data from
the data sources
• If possible, perform the steps of
evaluation, classification and
categorization quickly
• Yield a measure of adaptability within the
OODA loop
• At some threshold, crossover into diagnostic
and predictive analytics
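
As a rough illustration of the first two steps, the sketch below summarizes each attribute of a small made-up data set and estimates how strongly it moves with an outcome. The attribute names, the values, and the choice of Pearson correlation as the measure of "relative contribution" are all assumptions for illustration and are not taken from the slides.

```python
from math import sqrt
from statistics import mean, stdev

# Hypothetical records: two attributes and one outcome per store.
records = [
    {"ad_spend": 10, "staff": 3, "sales": 120},
    {"ad_spend": 20, "staff": 3, "sales": 210},
    {"ad_spend": 30, "staff": 4, "sales": 330},
    {"ad_spend": 40, "staff": 5, "sales": 390},
]

def column(name):
    """All values of one attribute, in record order."""
    return [record[name] for record in records]

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

def describe(name):
    """Summary statistics for one attribute, plus its correlation with sales."""
    values = column(name)
    return {
        "mean": mean(values),
        "stdev": stdev(values),
        "corr_with_sales": pearson(values, column("sales")),
    }

summary = {name: describe(name) for name in ("ad_spend", "staff")}
```

A correlation close to 1 or -1 suggests an attribute worth carrying into diagnostic analysis; it does not by itself show cause.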



Diagnostic Analytics

• Steps:
• Begin with descriptive analytics
• Extract patterns from large data
quantities via data mining
• Correlate data types for
explanation of near-term behavior
– past and present
• Estimate linear/non-linear
behavior not easily identifiable
through other approaches.
• Example: by classifying past insurance
claims, estimate the number of future
claims to flag for investigation with a
high probability of being fraudulent.



Predictive Analytics
• Steps:
• Begin with descriptive AND
diagnostic analytics
• Choose the right data based on
domain knowledge and relationships
among variables
• Choose the right techniques to yield
insight into possible outcomes
• Determine the likelihood of possible
outcomes given initial boundary
conditions
• Remember! Data driven analytics is
non-linear; do NOT treat like an
engineering project



Prescriptive
Analytics
• Steps:
• Begin with predictive analytics
• Determine what should occur and
how to make it so
• Determine the mitigating factors that
lead to desirable/undesirable
outcomes
• “What-if” analysis with local or
global optimization
• Find the best set of prices and
advertising frequency to
maximize revenue
• The right set of business moves
to make to achieve that goal
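
The price and advertising example can be sketched as a small "what-if" search over candidate options. The demand model, the cost per advertisement, and the candidate ranges below are invented for illustration; a real project would fit the model from data (the predictive step) before optimizing over it.

```python
from itertools import product

def expected_revenue(price, ads_per_week):
    """Hypothetical model: demand falls with price and rises with advertising.

    The result is revenue net of an assumed cost of 200 per advertisement.
    """
    demand = max(0, 1000 - 40 * price + 30 * ads_per_week)
    ad_cost = 200 * ads_per_week
    return price * demand - ad_cost

prices = range(5, 21)         # candidate prices (assumed)
ad_frequencies = range(0, 8)  # candidate ads per week (assumed)

# Exhaustive search: evaluate every combination and keep the best one.
best_price, best_ads = max(
    product(prices, ad_frequencies),
    key=lambda option: expected_revenue(*option),
)
best_revenue = expected_revenue(best_price, best_ads)
```

Exhaustive search finds the global optimum on this small grid; with many decision variables a local or heuristic optimizer would usually replace it.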



Decisive Analytics

• Steps:
• Given a set of decision alternatives,
choose one course of action from
possibly many
• The chosen one may not be the optimal one, however.
• Visualize alternatives – whole or
partial subset
• Perform exploratory analysis –
what-if and why
• How do I get to there from here?
• How did I get here from there?



What-if analysis

• The process of calculating backward to find an input that produces a
specific output.

• Works in the opposite fashion to formulae.

• What-if analysis helps to find out what input will result in a specific output.


What-if analysis (contd.)

Example of a formula:

y = a * (x^2)

Problem: given the values of the input variables a and x, we can compute the
value of the output variable y.
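
A minimal sketch of the two directions for this formula: the ordinary calculation from inputs to output, and the what-if calculation from a desired output back to an input. The function names and the sample numbers are hypothetical, and only the non-negative root is returned.

```python
from math import isclose, sqrt

def forward(a, x):
    """Ordinary formula: inputs a and x give the output y = a * x^2."""
    return a * x ** 2

def what_if_x(a, target_y):
    """What-if direction: the non-negative x for which a * x^2 equals target_y."""
    if a == 0 or target_y / a < 0:
        raise ValueError("no real x produces this output")
    return sqrt(target_y / a)

y = forward(2, 3)             # forward: 2 * 3^2
x_needed = what_if_x(2, 50)   # backward: which x gives y = 50 when a = 2?
```

Here the formula can be inverted by hand; when it cannot, spreadsheet-style goal seeking tries candidate inputs until the output matches.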



What-if analysis (contd.)

Example of what-if analysis:

Suppose a student plans to score an average of 80 in the semester exams. She
scored 82, 70, 83 and 76 in English, Mathematics, Computer Science and
Mechanics respectively.

The Statistics exam is due shortly, and we want to calculate the marks she
needs to score in Statistics to achieve an average of 80 in the semester.
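
The student example can be worked backward in a few lines. This sketch assumes the semester has exactly five equally weighted subjects (the four listed plus Statistics); the function name is hypothetical.

```python
def required_score(target_average, known_scores, total_subjects):
    """Marks still needed across the remaining subjects to reach the target average."""
    return target_average * total_subjects - sum(known_scores)

scores = {"English": 82, "Mathematics": 70, "Computer Science": 83, "Mechanics": 76}

# Required total is 80 * 5 = 400; the four known scores add up to 311.
needed_in_statistics = required_score(80, scores.values(), total_subjects=5)
print(needed_in_statistics)  # 400 - 311 = 89
```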



What-if analysis (contd.)

Before and after scenarios

Big Data Analytics
• Big data can deliver value in almost any area of business or society.

Report on Big Data in Big Companies

Key Roles for a Successful Big Data Analytics Project


1. Business User

• Someone who understands the domain area and usually benefits from the results.

• This person can consult and advise the project team on the context of the
project, the value of the results, and how the outputs will be operationalized
(put into operation or use).

• Usually a business analyst or subject matter expert in the project domain
fulfills this role.
2. Project Sponsor

• Responsible for the genesis (origin) of the project.
• Provides the impetus (stimulus) and requirements for the project and defines
the core business problem.
• Generally provides the funding and gauges the degree of value from the final
outputs of the working team.
• Sets the priorities for the project and clarifies the desired outputs.
3. Project Manager

• Ensures that key milestones (significant stages or events) and objectives are
met on time and at the expected quality.


4. Business Intelligence Analyst
• Provides business domain expertise based on :
• A deep understanding of the data,
• Key Performance Indicators (KPIs)
• key metrics
• Business intelligence from a reporting perspective
• Business intelligence analysts generally create dashboards
and reports and have knowledge of the data feeds and
sources.



KPIs vs Key metrics

• KPIs are measurable values that show how effective you are at achieving
business objectives.

• Metrics are different in that they simply track the status of a specific
business process.

• Thus KPIs track whether you hit business objectives/targets, and metrics
track processes.


KPI
• Example of a KPI
• The target for the teams was to increase sales revenue by 20% by the end of
this year (2021)

Team B
Team A increase in sales = 18%
increase in sales = 21%
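
A KPI check of this kind reduces to comparing each achieved value with the target. In the sketch below, the assignment of 21% to Team A and 18% to Team B is an assumption, because the slide layout leaves it unclear which figure belongs to which team; the function name is also hypothetical.

```python
TARGET_GROWTH_PCT = 20  # target stated on the slide

# Assumed assignment of the two figures to the teams (ambiguous in the source).
achieved_growth_pct = {"Team A": 21, "Team B": 18}

def kpi_status(achieved, target=TARGET_GROWTH_PCT):
    """A KPI is judged against its target, not just reported as a number."""
    return "target met" if achieved >= target else "target missed"

report = {team: kpi_status(pct) for team, pct in achieved_growth_pct.items()}
```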



KPIs vs Key Metrics (contd.)

• KPIs address the overall targeted goals or objectives. Metrics do not.
• Key metrics are specific. KPIs are not.
• Accurate tracking of progress (from staff level to enterprise level) needs
the use of both KPIs and key metrics.


Example for KPI and Metrics

Best social media marketing metrics:
• Likes
• Engagement
• Followers growth
• Traffic conversions
• Social interactions
• Social sentiment
• Social visitor goals
• Social shares
• Web visitors from social channel
• Social visitors conversion rates

Best call center metrics to monitor (many industry-leading companies track
these on TV data walls):
• Call completion rate
• Agent utilization
• Answer seizure ratio (ASR)
• First call resolution rate
• Speed of answer (SA)
• Call handling time
• Call drop rate (CDR)
• First contact resolution rate
• Sales per agent
• Lead conversion rate


Business intelligence

• Business Intelligence (BI) refers to technologies, applications and practices
for the collection, integration, analysis, and presentation of business
information.
• The purpose of Business Intelligence is to support better business decision
making.
• Business Intelligence Analysts generally create dashboards and reports and
have knowledge of the data feeds and sources.


5. Database Administrator (DBA)

• Provisions and configures the database environment to support the analytics
needs of the working team.
• These responsibilities may include providing access to key databases or tables
and ensuring the appropriate security levels are in place related to the data
repositories.


6. Data Engineer

• Leverages deep technical skills to assist with tuning SQL queries for data
management and data extraction, and provides support for data ingestion into
the analytic sandbox.
• While the DBA sets up and configures the databases to be used, the data
engineer executes the actual data extractions and performs substantial data
manipulation to facilitate the analytics.


Analytics Sandbox

An analytics sandbox is a separate environment that is part of the architecture,
is used by multiple users, and is maintained with the support of IT.

Key characteristics:

• The environment is controlled by the analyst
• Allows them to install and use the data tools of their choice
• Allows them to manage the scheduling and processing of the data assets
• Enables analysts to explore and experiment with internal and external data
• Can hold and process large amounts of data efficiently from many different data
sources: big data (unstructured), transactional data (structured), web data,
social media data, documents, etc.


The Analytical Sandbox: Definition

• A set of resources that enable analytic professionals to experiment with and
reshape data in whatever fashion they need to, for:
• Data exploration
• Development of analytical processes
• Proofs of concept
• Prototyping
The Analytical Sandbox
An Internal Sandbox
• A portion of an enterprise data warehouse or data mart is carved out to serve as
the analytic sandbox
• Strengths
• Leverages existing hardware resources and infrastructure already in place
• Ability to directly join production data with sandbox data
• Cost-effective since no new hardware is needed
• Weaknesses
• An additional load on the existing enterprise data warehouse or data mart
• Can be constrained by production policies and procedures

[Figure: internal sandbox. Labels: Sandbox; Additional Data; Analytic Views & Core
Database Tables; Enterprise Analytic Data Sets; Enterprise Data Warehouse or Data Mart]
The Analytical Sandbox
An External Sandbox
• A physically separate analytic sandbox is created for testing and development of
analytic processes
• Strengths
• A stand-alone environment, with no impact on other processes
• Reduces workload management
• Weaknesses
• The additional cost of the stand-alone system
• Some data movement

[Figure: external sandbox. Labels: Sandbox; Extract; Enterprise Data Warehouse or
Data Mart]
7. Data Scientist

• The data engineer works closely with the data scientist to help shape data in
the right ways for analyses.
• Provides subject matter expertise for:
• analytical techniques,
• data modeling,
• applying valid analytical techniques to given business problems.
• Ensures overall analytics objectives are met.
• Designs and executes analytical methods and approaches with the data available
to the project.


Roles (contd.)

• Each role plays a critical part in a successful analytics project.
• Although seven roles are listed, fewer or more people can accomplish the work
depending on:
• the scope of the project,
• the organizational structure, and
• the skills of the participants.


SUMMARY





Traditional database systems vs Big Data systems

Analytics Difference




Characteristics of Big
Data



1. Volume

Big data first and foremost has to be “big,” and size in this case is measured
as volume.

Example:

• From clinical data associated with lab tests and physician visits to the
administrative data surrounding payments, this well of information is already
expanding.

• When that data is coupled with greater use of precision medicine, there will
be a big data explosion in health care, especially as genomic and environmental
data become more ubiquitous.
2. Velocity

Velocity in the context of big data refers to two related concepts familiar to
anyone in healthcare: the rapidly increasing speed at which new data is being
created by technological advances, and the corresponding need for that data to
be digested and analyzed in near real-time.

Example:

• As more and more medical devices are designed to monitor patients and collect
data, there is great demand to be able to analyze that data and then to transmit
it back to clinicians and others.

• This “internet of things” of healthcare will only lead to increasing velocity
of big data in healthcare.
3. Variety

With increasing volume and velocity comes increasing variety. This third “V”
describes just what you’d think: the huge diversity of data types that healthcare
organizations see every day.

• Example: electronic health records and medical devices. Each one might collect
a different kind of data, which in turn might be interpreted differently by
different physicians, or made available to a specialist but not a primary care
provider.

• Challenge: standardizing and distributing all of that information so that
everyone involved is on the same page.


4. Veracity

• Veracity refers to how trustworthy or how messy the data is: the more
trustworthy the data, the less messy it is, and vice versa.
• Since the data is collected from multiple sources, we need to check the data
for accuracy before using it for business insights.
• It also refers to the assurance of the quality, integrity, credibility and
accuracy of the data.
• Veracity and Value together define the data quality, which can provide great
insights to data scientists.


5. Value

• Big data must have value.

• That is, if you’re going to invest in the infrastructure required to collect
and interpret data on a system-wide scale, it’s important to ensure that the
insights that are generated are based on accurate data and lead to measurable
improvements at the end of the day.

• Organizations might use the same tools and technologies for gathering and
analyzing the data they have available, but how they then put that data to work
is ultimately up to them.

• Technical experts will need to be combined with domain experts who have strong
industrial knowledge and the ability to apply this know-how within organisations
for value creation.


6 Vs of Big Data (summary)

Current ‘V’s of Big Data

9 Vs of Big Data
Significance of Big Data

• Driven by specialized analytics systems and software, as well as high-powered
computing systems, big data analytics offers various business benefits, including:

• New revenue opportunities
• More effective marketing
• Better customer service
• Improved operational efficiency
• Competitive advantages over rivals


Significance of Big Data (contd.)

1. It helps companies to better understand and serve customers:
• Examples include the recommendations made by Amazon or Netflix, and Coca-Cola
(customer acquisition and retention).

2. It allows companies to optimize their processes:
• Faster and better decision making
• Examples:
• UOB Bank of Singapore uses Big Data for risk management.
• Uber is able to predict demand, dynamically price journeys and send the
closest driver to the customer.
Significance of Big Data (contd.)

3. It improves our health care:
• Government agencies can now predict flu outbreaks and track them in real time,
and pharmaceutical companies are able to use big data analytics to fast-track
drug development.

4. It helps us to improve security:
• Government and law enforcement agencies use big data to foil terrorist attacks
and detect cyber crime.

5. It allows sports stars to boost their performance:
• Sensors in balls and GPS trackers on clothing allow athletes to analyze and
improve upon what they do.

6. Cost reduction:
• Big Data technologies like Hadoop and cloud-based analytics bring significant
cost advantages when it comes to storing large amounts of data.


Real world Challenges

1. Dealing with data Growth


• The most obvious challenge associated with big
data is simply storing and analyzing all that
information.
2. Recruiting and retaining big data talent
• In order to develop, manage and run applications
that generate insights, organizations need
professionals with big data skills.
• Potential pitfalls of big data analytics initiatives
include a lack of internal analytics skills and the
high cost of hiring experienced data scientists and
data engineers to fill the gaps.



Real world Challenges (contd.)

3. Generating insights in a timely manner
• Business goals can be achieved if data scientists can extract insights from
Big Data and act upon them quickly.
• Although some organizations are fortunate enough to have data scientists
(most may not be), there is a growing talent gap that makes finding and hiring
data scientists in a timely manner difficult.


Real world Challenges contd..

4. Integrating disparate data sources


• The variety associated with big data leads to challenges in data
integration.
• Big data comes from a lot of different places — enterprise
applications, social media streams, email systems,
employee-created documents, etc. Combining all that data and
reconciling it so that it can be used to create reports can be
incredibly difficult.
5. Validating data
• Often organizations are getting similar pieces of data from different
systems, and the data in those different systems doesn't always
agree.
• For example, the ecommerce system may show daily sales at a
certain level while the enterprise resource planning (ERP) system
has a slightly different number.
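
The ecommerce versus ERP example can be sketched as a simple reconciliation check. The daily figures, the dates, and the 1% tolerance below are made up for illustration, and the function name is hypothetical.

```python
def reconcile(ecommerce, erp, tolerance=0.01):
    """Return the days on which the two systems disagree.

    A day is flagged when one system has no figure for it, or when the two
    figures differ by more than `tolerance` relative to the larger one.
    """
    mismatches = {}
    for day in sorted(set(ecommerce) | set(erp)):
        a, b = ecommerce.get(day), erp.get(day)
        if a is None or b is None:
            mismatches[day] = (a, b)  # missing in one system
        elif abs(a - b) > tolerance * max(abs(a), abs(b)):
            mismatches[day] = (a, b)  # figures do not agree
    return mismatches

# Hypothetical daily sales totals reported by the two systems.
ecommerce_sales = {"2024-03-01": 10500.0, "2024-03-02": 9800.0, "2024-03-03": 11200.0}
erp_sales = {"2024-03-01": 10500.0, "2024-03-02": 9550.0}

issues = reconcile(ecommerce_sales, erp_sales)
```

Which system counts as the source of truth for a flagged day is a business decision that the check itself cannot make.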



Architecture of Big Data Systems


Architecture of Big Data Systems

4 core layers of Big Data systems architecture:
• Data Storage layer
• Data Processing layer
• Data Query layer
• Data Visualization layer

Traditional data systems:
• Physical layer
• Logical layer
• View layer


Architecture of Big Data Systems (contd.)

1. Data Storage layer:
• Heterogeneous data has to be handled using different data stores
• Polyglot persistence: the approach of identifying an effective data store for
each particular kind of data
• To store large amounts of unstructured data, the Hadoop Distributed File System
(HDFS) can be used
• For object-based storage, Amazon Simple Storage Service (S3) can be used
• The functionality of this layer is handled by 2 sublayers:
• Physical layer: handles large volumes of heterogeneous real-time data
• Data layer: maintains data blocks and the global namespace to access data; it
also maintains tools to organize, access and retrieve heterogeneous data


Architecture of Big Data Systems (contd.)

2. Data Processing layer:
• Data collected in the storage layer is processed in this layer in batch or
real-time mode
• Batch processing is used for offline analytics
• E.g. Hadoop is a batch processing system that uses the MapReduce programming
technique
• Real-time processing is used for online analytics
• E.g. Apache Storm processes streaming data in real time to support decisions
• Spark is a time-efficient, in-memory data processing engine that can execute
streaming, machine learning or SQL workloads
• Along with MapReduce and Spark, this layer also supports tools for statistical
modelling and machine learning
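
The MapReduce technique can be illustrated with a word count in plain Python. This is a single-process sketch of the map, shuffle and reduce phases only; it does not use the Hadoop API, and the function names and input lines are hypothetical.

```python
from collections import defaultdict

def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]

def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)

lines = ["big data needs big storage", "data grows fast"]

# In a real cluster each line (or block of lines) would be mapped on a different node.
pairs = [pair for line in lines for pair in map_phase(line)]
word_counts = dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())
```

Because each map call looks at one line only and each reduce call at one key only, both phases can be spread across many machines, which is what makes the technique suitable for batch processing at scale.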
Architecture of Big Data Systems (contd.)

3. Data Query layer:
• This layer aims at obtaining data values or valuable insights from the
processing layer
• Hive: used by data analysts to query, summarize, explore and analyze
unstructured data to obtain actionable business insights
• Analytics engine: extends the functionality of the data processing layer with
domain-specific tools for decision making
• Tools in this layer perform descriptive, predictive and diagnostic analytics


Architecture of Big Data Systems (contd.)

4. Data Visualization layer:
• This layer presents the value of the data in presentable, understandable formats
• It makes use of dashboards, graphs and table tools for visualization
• E.g. Google Charts:
• It is a JavaScript-based charting library meant to enhance web applications by
adding interactive charting capability.
• Google Charts provides a wide variety of charts, for example line charts,
spline charts, area charts, bar charts, pie charts and so on.
• E.g. D3:
• It is a programming tool for visualization
• The user must be knowledgeable in JavaScript to visualize the collected data
effectively


Architecture of Big Data Systems (contd.)

The following layers offer common services to the core layers and are also
called service layers.

1. Data Ingestion layer:
• This layer determines the value of the information extracted
• Data coming from different sources is prioritized, validated, categorized and
routed to the destination for effective storage and access
• Data may be ingested in batches periodically or in real time
• E.g. Sqoop: supports bulk data transfer between Hadoop and structured stores
such as Oracle and MySQL
• E.g. Elastic Logstash: aggregates data from multiple sources and routes it
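
The prioritize, validate, categorize and route steps of an ingestion layer can be sketched in plain Python. The record fields, the priority scheme and the destination names below are hypothetical; the sketch does not reflect how Sqoop or Logstash are configured.

```python
import heapq

def validate(record):
    """A record is valid when it carries a non-empty text payload."""
    payload = record.get("payload")
    return isinstance(payload, str) and payload != ""

def categorize(record):
    """Assumed rule: sensor data goes to the streaming store, the rest to batch."""
    return "stream" if record["source"] == "sensor" else "batch"

def ingest(records):
    """Process records in priority order (lower number first) and route them."""
    # The index breaks ties so that records themselves are never compared.
    queue = [(record["priority"], index, record) for index, record in enumerate(records)]
    heapq.heapify(queue)
    destinations = {"stream": [], "batch": [], "rejected": []}
    while queue:
        _, _, record = heapq.heappop(queue)
        target = categorize(record) if validate(record) else "rejected"
        destinations[target].append(record.get("payload"))
    return destinations

incoming = [
    {"source": "crm", "priority": 2, "payload": "customer row"},
    {"source": "sensor", "priority": 1, "payload": "temp=21.5"},
    {"source": "sensor", "priority": 1, "payload": ""},
    {"source": "logs", "priority": 3, "payload": "GET /index.html"},
]
routed = ingest(incoming)
```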
Architecture of Big Data Systems (contd.)

2. Data Collector layer:
• This layer transports data from the ingestion layer to the rest of the data
pipeline
• E.g. Kafka:
• It is message-oriented middleware used for data collection
• It works with Storm, HBase and Spark for real-time analysis of data

3. Data Security layer:
• This layer provides authentication, authorization, audit, data encryption and
central administration for big data systems
• E.g. Knox in the Hadoop stack, Kerberos, HDFS encryption
Architecture of Big Data Systems (contd.)

4. Data Monitoring layer:
• It includes tools for monitoring performance at the infrastructure,
framework, analytics engine, data store and application levels

5. Infrastructure layer:
• This layer provides the hardware to host the various big data frameworks;
highly scalable cloud infrastructure is preferable


Big Data Analytics Life Cycle

[Figure: life-cycle diagram. Surviving stage labels: Analytical Modelling;
Communicating the results; Deployment]


1. Business Case Evaluation

• The life cycle must begin with a well-defined business case that presents a
clear understanding of the justification, motivation and goals of carrying out
the analysis.

• An evaluation of a Big Data analytics business case helps decision-makers to
understand the business resources needed and the business challenges to tackle,
including KPIs.

• The outcome of this stage is an understanding of the budget (hardware,
software) required to carry out the analysis project.

• Initial iterations of the Big Data analytics lifecycle will require more
up-front investment in Big Data technologies, products and training compared to
later iterations.


2. Data Identification

• Identifying a wider variety of data sources may increase the probability of
finding hidden patterns and correlations.
• Depending on the business scope of the analysis project and the nature of the
business problems being addressed, the required datasets and their sources can
be categorized into 2 types:
• Internal datasets, such as data marts and operational systems, are typically
compiled and matched against a pre-defined dataset specification.
• External datasets: publicly available datasets, content-based web sites, blogs.
• Review the raw data.
• Evaluate the data structures.
• Decide on the infrastructure requirements.


3. Data Acquisition and Filtering

• The data is gathered from all of the data
sources that were identified during the
previous stage.


4. Data Extraction

• The extent of extraction and
transformation required depends on
the types of analytics and the capabilities of
the Big Data solution.
• For example, the required fields may have to
be extracted from delimited textual data
such as web server log files.
• Similarly, extracting text for text
analytics, which requires scans of whole
documents, is simplified if the
underlying Big Data solution can
directly read the document in its native
format.
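
The web server log case can be sketched in a few lines of Python. This is only an illustration: it assumes the log lines follow the Common Log Format, and the names `LOG_PATTERN`, `extract_fields` and the sample line are hypothetical, not part of any particular Big Data solution.

```python
import re

# Assumed layout: Common Log Format. Real logs may use a different format.
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<timestamp>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) \S+" (?P<status>\d{3}) (?P<size>\d+|-)'
)

def extract_fields(line):
    """Return the fields of interest from one log line, or None if it does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    record = match.groupdict()
    # Convert the numeric fields so later stages can aggregate them.
    record["status"] = int(record["status"])
    record["size"] = 0 if record["size"] == "-" else int(record["size"])
    return record

# Hypothetical sample line for illustration only.
sample = '192.0.2.10 - - [18/Mar/2024:10:15:32 +0000] "GET /index.html HTTP/1.1" 200 5120'
record = extract_fields(sample)
```

Lines that do not match the pattern return `None`, so they can be counted or set aside for the validation stage instead of silently corrupting the extracted dataset.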



5. Data Validation and Cleansing
• Examining the cleanliness of the
data
• Checking for consistency of data
by identifying missing and
inconsistent values.
• Assessing the consistency of the
data types by checking if values
suit the data type.
• Reviewing the contents of the
data columns for relevant and
consistent values
• Looking for validity of incoming
data by checking for extreme or
incorrect values.
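
The checks listed above (missing values, data type, extreme values) might look like the following minimal sketch. The function `validate_records`, the field name `age` and the 0 to 120 range are assumptions chosen for illustration.

```python
def validate_records(records, numeric_field, low, high):
    """Split records into clean rows and (row, reason) pairs for rejected rows."""
    clean, rejected = [], []
    for row in records:
        value = row.get(numeric_field)
        # Check 1: missing values.
        if value is None or value == "":
            rejected.append((row, "missing value"))
            continue
        # Check 2: does the value suit the expected data type?
        try:
            number = float(value)
        except (TypeError, ValueError):
            rejected.append((row, "wrong data type"))
            continue
        # Check 3: extreme or incorrect values outside the plausible range.
        if not low <= number <= high:
            rejected.append((row, "out of range"))
            continue
        clean.append({**row, numeric_field: number})
    return clean, rejected

# Hypothetical sample rows for illustration only.
rows = [
    {"patient": "A", "age": "34"},
    {"patient": "B", "age": ""},
    {"patient": "C", "age": "abc"},
    {"patient": "D", "age": "430"},
]
clean, rejected = validate_records(rows, "age", low=0, high=120)
```

Keeping the rejected rows together with a reason, rather than dropping them, makes it possible to review how much data each check removes.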

6. Data Aggregation and Representation
• The objective of this phase is to
integrate multiple datasets to
arrive at a unified view.
• The tools for data ingestion,
filtering, extraction, validation,
cleansing and aggregation include
Hadoop, OpenRefine, Alpine
Miner and Data Wrangler.
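
As a small illustration of integrating two datasets into a unified view, the sketch below joins customer records to their order totals in plain Python. The dataset layout, the `customer_id` key and the function name `build_unified_view` are assumptions; at scale this step would normally run on one of the tools listed above.

```python
from collections import defaultdict

def build_unified_view(customers, orders):
    """Join customer records to their order totals on the shared 'customer_id' key."""
    totals = defaultdict(float)
    for order in orders:
        totals[order["customer_id"]] += order["amount"]
    # Customers without any orders still appear, with a total of 0.0.
    return [
        {**customer, "total_spent": totals.get(customer["customer_id"], 0.0)}
        for customer in customers
    ]

# Hypothetical sample datasets for illustration only.
customers = [
    {"customer_id": 1, "name": "Asha"},
    {"customer_id": 2, "name": "Ben"},
]
orders = [
    {"customer_id": 1, "amount": 100.0},
    {"customer_id": 1, "amount": 250.0},
]
unified = build_unified_view(customers, orders)
```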



7. Analytical Modelling

• Data analysis helps in forming the
hypotheses that are tested to understand the data.
• Analytical modelling includes two
sub-phases:
1. Model Planning
   1. Data Exploration
      • Helps to clean the data and improve data
      quality.
   2. Model Selection
      • Commonly used tools are R, SQL Server
      Analysis Services and SAS/ACCESS for
      RDBMS.
2. Model Building
   • Develop an analytical model that is fitted on the
   training data and evaluated against the test data;
   the fit is refined over several iterations.
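
The model building step (fit on training data, evaluate on test data) can be sketched with a least-squares line fit in plain Python. The synthetic data, the 40/10 split and the helper names `fit_line` and `mean_squared_error` are assumptions made for this illustration; a real project would use the tools named above or a modelling library.

```python
import random

def fit_line(points):
    """Least-squares fit of y = slope * x + intercept."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def mean_squared_error(points, slope, intercept):
    """Average squared difference between observed and predicted y."""
    return sum((y - (slope * x + intercept)) ** 2 for x, y in points) / len(points)

# Synthetic data: y is roughly 2x + 1 with a little noise (illustration only).
rng = random.Random(0)
data = [(x, 2.0 * x + 1.0 + rng.uniform(-0.5, 0.5)) for x in range(50)]
rng.shuffle(data)

# Hold out part of the data so the model is evaluated on unseen points.
train, test = data[:40], data[40:]
slope, intercept = fit_line(train)
test_error = mean_squared_error(test, slope, intercept)
```

If the test error is much larger than the training error, the model is revisited in the next iteration, which matches the iterative fitting described above.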



8. Communicating the Results

• Record all the findings, then select the most
significant ones and share the results with the
other stakeholders.
• The team makes recommendations for
future work or improvements to existing
processes.


9. Deployment

This phase deals with deploying the
analytical models in a production
environment.

The output of these models can also
be used to prescribe some actions
such as:
• Optimizing business processes
• Creating alerts
• Extending the functionality of enterprise systems


Data Analytics Life Cycle

• The Data Analytics Lifecycle defines analytics process best
practices spanning discovery to project completion.
• The lifecycle draws from established methods in the realm
of data analytics and decision science.
• This synthesis was developed after gathering input from data
scientists and consulting established approaches that
provided input on pieces of the process.
• Traditional projects follow a process-centric approach
(waterfall/spiral) to develop the project.
• The SDLC cannot be applied directly to data analytics projects
because they are data-centric.
• Data-oriented projects follow the CRISP-DM approach instead.




Applications of Big Data across various industries

Sports Domain
• To understand and study player movement
• E.g. Nike uses big data for eco-friendly product design

Sentiment Analysis
• To understand changing customer interest and identify potential customers
• E.g. Delta Airlines

Behavioral Analysis
• To understand customer behavior
• E.g. Amazon's product recommendations

Healthcare


Big Data Applications
Customer Segmentation

• Grouping similar users by their purchases and recommending suitable
items to them based on personal or group interest.
• E.g. Pandora provides music recommendations based on static profile, related
songs, user interest and location.
• Netflix uses a collaborative filtering algorithm to recommend movies.
• Amazon
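
To make the idea of collaborative filtering concrete, here is a toy user-based variant in plain Python. It is a sketch under simplifying assumptions (a tiny in-memory ratings dictionary, cosine similarity over co-rated items) and does not represent the actual system of any company named above; all user names, item names and function names are hypothetical.

```python
from math import sqrt

def cosine_similarity(a, b):
    """Cosine similarity over the items both users have rated."""
    shared = set(a) & set(b)
    if not shared:
        return 0.0
    dot = sum(a[i] * b[i] for i in shared)
    norm_a = sqrt(sum(a[i] ** 2 for i in shared))
    norm_b = sqrt(sum(b[i] ** 2 for i in shared))
    return dot / (norm_a * norm_b)

def recommend(target, ratings):
    """Rank items the target user has not rated by similarity-weighted ratings."""
    scores, weights = {}, {}
    for user, user_ratings in ratings.items():
        if user == target:
            continue
        sim = cosine_similarity(ratings[target], user_ratings)
        if sim <= 0:
            continue
        for item, rating in user_ratings.items():
            if item not in ratings[target]:
                scores[item] = scores.get(item, 0.0) + sim * rating
                weights[item] = weights.get(item, 0.0) + sim
    predicted = {item: scores[item] / weights[item] for item in scores}
    return sorted(predicted, key=predicted.get, reverse=True)

# Hypothetical ratings on a 1 to 5 scale, for illustration only.
ratings = {
    "asha": {"Movie A": 5, "Movie B": 4},
    "ben": {"Movie A": 5, "Movie B": 4, "Movie C": 5},
    "chen": {"Movie A": 1, "Movie B": 5, "Movie D": 2},
}
suggestions = recommend("asha", ratings)
```

Items liked by users whose past ratings resemble the target user's ratings rise to the top of the list, which is the core intuition behind grouping similar users.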

Prediction

• Estimating future outcomes from historical information.

Fraud Detection

• To detect, prevent and eliminate internal and external fraud.
• An unusual usage pattern on a debit or credit card can alert a bank to a stolen card.
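
One simple way to flag an unusual usage pattern is a z-score check against the card's past transaction amounts. The sketch below is only an illustration: the threshold of 3 standard deviations, the sample amounts and the name `flag_unusual` are assumptions, and production fraud detection relies on far richer features and models.

```python
from statistics import mean, stdev

def flag_unusual(history, new_amount, threshold=3.0):
    """Return True if new_amount lies more than `threshold` standard deviations from the mean of history."""
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        # All past amounts are identical, so any different amount is unusual.
        return new_amount != mu
    return abs(new_amount - mu) / sigma > threshold

# Hypothetical past transaction amounts for one card.
history = [42.0, 38.5, 51.0, 45.25, 40.0, 47.5]
typical = flag_unusual(history, 46.0)
suspicious = flag_unusual(history, 900.0)
```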

Personalized Healthcare

[Figure: Big Data architecture for personalized healthcare]


Personalized Healthcare (contd.)

The data processing layer extracts the Big Data-driven phenotype.

The analytic layer uses the following:
• Descriptive analytics to evaluate various statistics and visualize them
using charts.
• Diagnostic analytics, using survival analysis and regression, to correlate
the survival rate of patients with heart failure.
• Predictive analytics, using classification, clustering and inferential
analysis, to predict the survival rate for a new patient.
• Prescriptive analytics for treatment plans and decision support.


Multiple dimensions of Big Data



Data (contd.)

Value is generated by:
• acquiring data,
• combining data from different sources,
• providing access to it while ensuring data integrity and preserving
privacy.

Value is added by:
• pre-processing,
• validating,
• analyzing,
• augmenting,
• ensuring data integrity and accuracy.


1. Skills

Ensuring the availability of highly and rightly skilled people
who have an excellent grasp of the best practices and
technologies for delivering Big Data Value within
applications and solutions.

There will be a need for data scientists and engineers
who have expertise in:
• analytics
• statistics
• machine learning
• data mining
• data management
2. Legal

The increased importance of data will intensify the debate on:
• data ownership and usage,
• data protection and privacy,
• security,
• liability,
• cybercrime,
• Intellectual Property Rights (IPR), and
• the impact of insolvencies on data rights.


3. Technical

Key aspects including:
• real-time analytics,
• low latency and scalable data processing,
• new and rich user interaction interfaces, and
• linking data, information and content.

All have to be advanced to open up new
opportunities and to sustain or develop
competitive advantages.
4. Application

• Novel applications and solutions must be developed
and validated based on technologies and concepts in
ecosystems.
• Business and market ready applications need to be a
core target to allow activities to have market impact.


5. Business

• A more efficient use of Big Data and understanding data as an
economic asset carries great potential for the economy and
society.
• The setup of Big Data Value ecosystems and the development
of appropriate business models on top of a strong Big Data
Value ecosystem must be supported in order to generate the
desired positive impact on economy and employment.


6. Social

Big Data will provide solutions for major societal challenges,
such as:
• improved efficiency in healthcare information processing, or
• reduced CO2 emissions through climate impact analysis.

In parallel, it is critical for an accelerated adoption of Big
Data to increase awareness of the benefits and the Value
that Big Data can create for business, the public sector and
the citizen.