Big Data

Big Data Analytics involves the processing and analysis of large, complex data sets that traditional computing cannot handle, utilizing various methodologies and tools. Key steps include data collection, cleaning, analysis, visualization, and decision-making, with types of analytics such as descriptive, diagnostic, predictive, and prescriptive. The characteristics of Big Data are summarized by the 'Five V's' - Volume, Velocity, Variety, Veracity, and Value, highlighting the importance of modern tools for effective data management.


Big Data Analytics - Overview

What is Big Data Analytics?

Gartner defines Big Data as "high-volume, high-velocity and/or high-variety information that demands cost-effective, innovative forms of information processing that enable enhanced insight, decision making, and process automation."

Big Data is a collection of large data sets that traditional computing approaches cannot process and manage. It is a broad term that refers to the massive volume of complex data sets that businesses and governments generate in today's digital world. It is often measured in terabytes or petabytes and originates from three key sources: transactional data, machine data, and social data.

Big Data encompasses the data itself along with the frameworks, tools, and methodologies used to store, access, analyse and visualise it. Technologically advanced communication channels like social networking and powerful gadgets have created new ways to generate and transform data, and they challenge industry participants to find new ways to handle it. The process of converting large amounts of unstructured raw data, retrieved from different sources, into a data product useful for organizations forms the core of Big Data Analytics.

Steps of Big Data Analytics

Big Data Analytics is a powerful tool which helps unlock the potential of large and complex datasets. To get a better understanding, let's break it down into key steps −

Data Collection
This is the initial step, in which data is collected from different sources like
social media, sensors, online channels, commercial transactions, website logs
etc. Collected data might be structured (predefined organisation, such as
databases), semi-structured (like log files) or unstructured (text documents,
photos, and videos).

Data Cleaning (Data Pre-processing)


The next step is to process collected data by removing errors and making it
suitable and proper for analysis. Collected raw data generally contains errors,
missing values, inconsistencies, and noisy data. Data cleaning entails identifying
and correcting errors to ensure that the data is accurate and consistent. Pre-
processing operations may also involve data transformation, normalisation, and
feature extraction to prepare the data for further analysis.

Overall, data cleaning and pre-processing entail the replacement of missing data, the correction of inaccuracies, and the removal of duplicates. It is like sifting through a treasure trove, separating the rocks and debris and leaving only the valuable gems behind.
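As a simple illustration of this step, below is a minimal data-cleaning sketch in Python using pandas. The file name and column names (raw_transactions.csv, amount, customer_id, country) are hypothetical placeholders.

import pandas as pd

# Load a hypothetical raw dataset (file and column names are illustrative only)
df = pd.read_csv("raw_transactions.csv")

# Remove exact duplicate records
df = df.drop_duplicates()

# Impute missing numeric values with the column median
df["amount"] = df["amount"].fillna(df["amount"].median())

# Drop rows where a mandatory field is still missing
df = df.dropna(subset=["customer_id"])

# Normalise an inconsistent categorical field
df["country"] = df["country"].str.strip().str.upper()

print(df.info())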

Data Analysis
This is a key phase of big data analytics. Different techniques and algorithms
are used to analyse data and derive useful insights. This can include descriptive
analytics (summarising data to better understand its characteristics), diagnostic
analytics (identifying patterns and relationships), predictive analytics (predicting
future trends or outcomes), and prescriptive analytics (making
recommendations or decisions based on the analysis).
Data Visualization
It is the step of presenting data in visual form using charts, graphs and interactive dashboards. Data visualisation techniques portray the data graphically through charts, graphs, dashboards, and other formats to make the insights from data analysis clearer and more actionable.
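For instance, a basic chart can be produced with pandas and Matplotlib; the sales figures below are made-up sample values used only to show the idea.

import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical monthly sales figures used only for illustration
sales = pd.Series([120, 135, 150, 110, 170],
                  index=["Jan", "Feb", "Mar", "Apr", "May"])

# Bar chart summarising the aggregated data
sales.plot(kind="bar", title="Monthly sales (sample data)")
plt.ylabel("Units sold")
plt.tight_layout()
plt.show()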

Interpretation and Decision Making


Once data analytics and visualisation are done and insights gained, stakeholders
analyse the findings to make informed decisions. This decision-making includes
optimising corporate operations, increasing consumer experiences, creating new
products or services, and directing strategic planning.

Data Storage and Management


Once collected, the data must be stored in a way that enables easy retrieval
and analysis. Traditional databases may not be sufficient for handling large
amounts of data, hence many organisations use distributed storage systems
such as Hadoop Distributed File System (HDFS) or cloud-based storage
solutions like Amazon S3.
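As an example of cloud-based storage, a processed file can be uploaded to Amazon S3 with the boto3 library; the bucket name, object key, and local file name below are placeholders, and AWS credentials are assumed to be configured in the environment.

import boto3

# Upload a locally processed file to a hypothetical S3 bucket
s3 = boto3.client("s3")
s3.upload_file("cleaned_transactions.parquet",
               "example-analytics-bucket",
               "datasets/cleaned_transactions.parquet")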

Continuous Learning and Improvement


Big data analytics is a continuous process of collecting, cleaning, and analyzing
data to uncover hidden insights. It helps businesses make better decisions and
gain a competitive edge.

Types of Big-Data

Big Data is generally categorized into three different varieties. They are as
shown below −

 Structured Data
 Semi-Structured Data
 Unstructured Data

Let us discuss each type in detail.


Structured Data
Structured data has a dedicated data model, a well-defined structure, and a
consistent order, and is designed in such a way that it can be easily accessed
and used by humans or computers. Structured data is usually stored in a well-defined tabular form, that is, in rows and columns. Examples: MS Excel spreadsheets, Database Management Systems (DBMS).

Semi-Structured Data
Semi-structured data can be described as another type of structured data. It
inherits some qualities from Structured Data; however, the majority of this type
of data lacks a specific structure and does not follow the formal structure of
data models such as an RDBMS. Example: Comma Separated Values (CSV) File.

Unstructured Data
Unstructured data is a type of data that doesn't follow any structure. It lacks a uniform format and is constantly changing. However, it may occasionally include date- and time-related information. Examples: audio files, images, etc.

Types of Big Data Analytics

Some common types of Big Data analytics are as follows −

Descriptive Analytics
Descriptive analytics answers the question "What is happening in my business?" if the dataset is business-related. Overall, it summarises prior facts and aids in the creation of reports such as a company's income, profit, and sales figures. It also aids the tabulation of social media metrics. It provides comprehensive and accurate summaries of live data and supports effective visualisation.
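A tiny descriptive-analytics sketch with pandas is shown below; the sales records are made-up sample values, used only to show how summary statistics and grouped reports answer "what is happening".

import pandas as pd

# Hypothetical sales records; in practice these would come from a database or file
sales = pd.DataFrame({
    "region": ["North", "South", "North", "East", "South"],
    "revenue": [2500, 1800, 3200, 1500, 2100],
})

# Summary statistics describe what is happening overall
print(sales["revenue"].describe())

# Aggregated report: total revenue per region
print(sales.groupby("region")["revenue"].sum())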

Diagnostic Analytics
Diagnostic analytics determines root causes from data. It answers questions like "Why is it happening?" Some common techniques are drill-down, data mining, and data recovery. Organisations use diagnostic analytics because it provides in-depth insight into a particular problem. Overall, it can drill down to the root causes and isolate all confounding information.
For example − A report from an online store says that sales have decreased,
even though people are still adding items to their shopping carts. Several things
could have caused this, such as the form not loading properly, the shipping cost
being too high, or not enough payment choices being offered. You can use
diagnostic data to figure out why this is happening.

Predictive Analytics
This kind of analytics looks at data from the past and the present to estimate what will happen in the future. Hence, it answers questions like "What will happen in the future?" Data mining, AI, and machine learning are all used in predictive analytics to analyse current data and forecast future outcomes. It can identify things like market trends, customer trends, and so on.

For example − Companies such as Bajaj Finance and PayPal must set rules to keep their customers safe from fraudulent transactions. Such a business uses predictive analytics to analyse all of its past payment and user behaviour data and build a program that can spot fraud.
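The sketch below illustrates the general idea of such a predictive model using scikit-learn on synthetic data; it is not the actual system of any company, and the two behavioural features are invented for the example.

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

# Synthetic historical data: two behavioural features and a fraud label
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 2))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=1000) > 1.5).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# Train a simple classifier that predicts whether a transaction looks fraudulent
model = LogisticRegression().fit(X_train, y_train)
print(classification_report(y_test, model.predict(X_test)))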

Prescriptive Analytics
Prescriptive analytics gives the ability to frame a strategic decision; the analytical results answer "What do I need to do?" Prescriptive analytics works with both descriptive and predictive analytics. Most of the time, it relies on AI and machine learning.

For example − Prescriptive analytics can help a company maximise its business and profit. In the airline industry, for instance, prescriptive analytics applies a set of algorithms that change flight prices automatically based on customer demand, and reduce ticket prices due to bad weather conditions, location, holiday seasons, etc.

Tools and Technologies of Big Data Analytics

Some commonly used big data analytics tools are as follows −

Hadoop
A tool to store and analyze large amounts of data. Hadoop makes it possible to deal with big data; it is a tool which made big data analytics possible.
MongoDB
A tool for managing unstructured data. It is a database specially designed to store, access and process large quantities of unstructured data.

Talend
A tool to use for data integration and management. Talend's solution package
includes complete capabilities for data integration, data quality, master data
management, and data governance. Talend integrates with big data
management tools like Hadoop, Spark, and NoSQL databases allowing
organisations to process and analyse enormous amounts of data efficiently. It
includes connectors and components for interacting with big data technologies,
allowing users to create data pipelines for ingesting, processing, and analysing
large amounts of data.

Cassandra
A distributed database used to handle chunks of data. Cassandra is an open-
source distributed NoSQL database management system that handles massive
amounts of data over several commodity servers, ensuring high availability and
scalability without sacrificing performance.

Spark
Used for real-time processing and analyzing large amounts of data. Apache
Spark is a robust and versatile distributed computing framework that provides a
single platform for big data processing, analytics, and machine learning, making
it popular in industries such as e-commerce, finance, healthcare, and
telecommunications.
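A short PySpark sketch of the kind of batch aggregation Spark is used for is given below; the file orders.csv and its category and revenue columns are hypothetical, and the session runs locally unless cluster settings are supplied.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Start a Spark session (cluster configuration is deployment-specific)
spark = SparkSession.builder.appName("SalesSummary").getOrCreate()

# Read a hypothetical CSV file into a distributed DataFrame
orders = spark.read.csv("orders.csv", header=True, inferSchema=True)

# Aggregate revenue per product category across the cluster
summary = orders.groupBy("category").agg(F.sum("revenue").alias("total_revenue"))
summary.show()

spark.stop()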

Storm
It is an open-source real-time computational system. Apache Storm is a robust
and versatile stream processing framework that allows organisations to process
and analyse real-time data streams on a large scale, making it suited for a wide
range of use cases in industries such as banking, telecommunications, e-
commerce, and IoT.
Kafka
It is a distributed streaming platform that is used for fault-tolerant storage.
Apache Kafka is a versatile and powerful event streaming platform that allows
organisations to create scalable, fault-tolerant, and real-time data pipelines and
streaming applications to efficiently meet their data processing requirements.
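To give a feel for how Kafka is used from code, here is a minimal producer/consumer sketch with the kafka-python client; it assumes a broker running on localhost:9092 and uses a hypothetical transactions topic.

from kafka import KafkaProducer, KafkaConsumer
import json

# Producer: publish an event to a hypothetical "transactions" topic
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)
producer.send("transactions", {"order_id": 42, "amount": 99.5})
producer.flush()

# Consumer: read events back from the same topic
consumer = KafkaConsumer(
    "transactions",
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)
for message in consumer:
    print(message.value)
    break  # stop after the first message in this sketch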

Big Data Analytics - Characteristics


Big Data refers to extremely large data sets that may be analyzed to reveal
patterns, trends, and associations, especially relating to human behaviour and
interactions.

Big Data Characteristics

The characteristics of Big Data, often summarized by the "Five V's," include −

Volume
As its name implies, volume refers to the large amount of data generated and stored every second from IoT devices, social media, videos, financial transactions, and customer logs. The data generated from these devices and sources can range from terabytes to petabytes and beyond. Managing such large quantities of data requires robust storage solutions and advanced data processing techniques. The Hadoop framework is used to store, access and process big data.

Facebook generates about 4 petabytes of data per day − that is, around four million gigabytes. All that data is stored in what is known as the Hive, which contains about 300 petabytes of data [1].
Fig: Minutes spent per day on social apps (Image source: Recode)

Fig: Engagement per user on leading social media apps in India (Image source:
www.statista.com) [2]

From the above graph, we can see how users are devoting their time to accessing different channels and generating data; hence, data volume is growing higher day by day.

Velocity
The speed with which data is generated, processed, and analysed. With the
development and usage of IoT devices and real-time data streams, the velocity
of data has expanded tremendously, demanding systems that can process data
instantly to derive meaningful insights. Examples of high-velocity data sources include sensor feeds, log files, social media updates, clickstreams, and IoT devices.
Variety
Big Data includes different types of data like structured data (found in
databases), unstructured data (like text, images, videos), and semi-structured
data (like JSON and XML). This diversity requires advanced tools for data
integration, storage, and analysis.

Fig: Challenges of managing variety in Big Data, and variety in Big Data applications
Veracity
Veracity refers to the accuracy and trustworthiness of the data. Ensuring data quality,
addressing data discrepancies, and dealing with data ambiguity are all major
issues in Big Data analytics.

Value
The ability to convert large volumes of data into useful insights. Big Data's
ultimate goal is to extract meaningful and actionable insights that can lead to
better decision-making, new products, enhanced consumer experiences, and
competitive advantages.

These qualities characterise the nature of Big Data and highlight the importance
of modern tools and technologies for effective data management, processing,
and analysis.
Big Data Analytics - Data Life Cycle
A life cycle is a process which denotes a sequential flow of one or more activities involved in Big Data Analytics. Before learning about the big data analytics life cycle, let's understand the traditional data mining life cycle.

Traditional Data Mining Life Cycle

A life cycle provides a framework to organize the work systematically for an organization. The framework supports the entire business process and provides valuable business insights, helping the organisation make strategic decisions to survive in a competitive world as well as maximize its profit.

The Traditional Data Mining Life Cycle includes the following phases −

 Problem Definition − This is the initial phase of the data mining process; it includes defining the problem that needs to be uncovered or solved. A problem definition always includes the business goals that need to be achieved and the data that needs to be explored to identify patterns, business trends, and process flows to achieve the defined goals.
 Data Collection − The next step is data collection. This phase involves data
extraction from different sources like databases, weblogs, or social media
platforms that are required for analysis and to do business intelligence. Collected
data is considered raw data because it includes impurities and may not be in the
required formats and structures.
 Data Pre-processing − After data collection, we clean and pre-process the data: removing noise, imputing missing values, transforming data, selecting features, and converting the data into the required format before analysis begins.
 Data Exploration and Visualization − Once pre-processing is done on data, we explore
it to understand its characteristics, and identify patterns and trends. This phase
also includes data visualizations using scatter plots, histograms, or heat maps to
show the data in graphical form.
 Modelling − This phase includes creating data models to solve the realistic problems defined in Phase 1. This could include selecting an effective machine learning algorithm, training the model, and assessing its performance.
 Evaluation − The final stage in data mining is to assess the model's performance and determine whether it matches the business goals defined in Phase 1. If the model is underperforming, you may need to revisit data exploration or feature selection.

CRISP-DM Methodology

CRISP-DM stands for Cross Industry Standard Process for Data Mining; it is a methodology that describes commonly used approaches that data mining experts use to tackle problems in traditional BI data mining, and it is still used by traditional BI data mining teams. The following figure illustrates the major phases of the CRISP-DM cycle and how they are interrelated with one another.
CRISP-DM was introduced in 1996 and the next year, it got underway as a
European Union project under the ESPRIT funding initiative. The project was led
by five companies: SPSS, Teradata, Daimler AG, NCR Corporation, and OHRA
(an insurance company). The project was finally incorporated into SPSS.

Phases of CRISP-DM Life Cycle | Steps of CRISP-DM Life Cycle
 Business Understanding − This phase includes problem definition, project objectives and requirements from a business perspective, and then converts this knowledge into a data mining problem definition. A preliminary plan is designed to achieve the objectives.
 Data Understanding − The data understanding phase starts with initial data collection, to identify data quality issues, discover data insights, or detect interesting subsets to form hypotheses for hidden information.
 Data Preparation − The data preparation phase covers all activities to construct the final dataset (data that will be fed into the modelling tool(s)) from the initial raw data. Data preparation tasks are likely to be performed multiple times, and not in any prescribed order. Tasks include table, record, and attribute selection as well as transformation and cleaning of data for modelling tools.
 Modelling − In this phase, different modelling techniques are selected and applied; different techniques may be available to process the same type of data, and an expert always opts for effective and efficient ones.
 Evaluation − Once the proposed model is completed, and before its final deployment, it is important to evaluate it thoroughly and review the steps executed to construct it, to ensure that the model achieves the desired business objectives.
 Deployment − The creation of the model is generally not the end of the project. Even if the purpose of the model is to increase knowledge of the data, the knowledge gained will need to be organized and presented in a way that is useful to the customer. In many cases, it will be the customer, not the data analyst, who carries out the deployment phase. Even if the analyst deploys the model, the customer needs to understand upfront the actions that will need to be carried out to make use of the created models.
SEMMA Methodology

SEMMA is another methodology, developed by SAS, for data mining modelling. It stands for Sample, Explore, Modify, Model, and Assess.

The description of its phases is as follows −

 Sample − The process starts with data sampling, e.g., selecting the dataset for
modelling. The dataset should be large enough to contain sufficient information
to retrieve, yet small enough to be used efficiently. This phase also deals with
data partitioning.
 Explore − This phase covers the understanding of the data by discovering
anticipated and unanticipated relationships between the variables, and also
abnormalities, with the help of data visualization.
 Modify − The Modify phase contains methods to select, create and transform
variables in preparation for data modelling.
 Model − In the Model phase, the focus is on applying various modelling (data
mining) techniques on the prepared variables to create models that possibly
provide the desired outcome.
 Assess − The evaluation of the modelling results shows the reliability and
usefulness of the created models.

The main difference between CRISP-DM and SEMMA is that SEMMA focuses on the modelling aspect, whereas CRISP-DM gives more importance to the stages of the cycle before modelling, such as understanding the business problem to be solved and understanding and pre-processing the data to be used as input to, for example, machine learning algorithms.

Big Data Life Cycle

Big Data Analytics is a field that involves managing the entire data lifecycle,
including data collection, cleansing, organisation, storage, analysis, and
governance. In the context of big data, traditional approaches are not optimal for analysing data of large volume, high velocity, and varied value.

For example, the SEMMA methodology disregards data collection and pre-processing of different data sources. These stages normally constitute most of the work in a successful big data project. Big Data analytics involves the
identification, acquisition, processing, and analysis of large amounts of raw
data, unstructured and semi-structured data which aims to extract valuable
information for trend identification, enhancing existing company data, and
conducting extensive searches.
The Big Data analytics lifecycle can be divided into the following phases −

 Business Case Evaluation


 Data Identification
 Data Acquisition & Filtering
 Data Extraction
 Data Validation & Cleansing
 Data Aggregation & Representation
 Data Analysis
 Data Visualization
 Utilization of Analysis Results

The primary differences between Big Data Analytics and traditional data analysis
are in the value, velocity, and variety of data processed. To address the specific
requirements for big data analysis, an organised method is required. The
description of the Big Data analytics lifecycle phases is as follows −

Business Case Evaluation


A Big Data analytics lifecycle begins with a well-defined business case that
outlines the problem identification, objective, and goals for conducting the
analysis. Before beginning the real hands-on analytical duties, the Business
Case Evaluation needs the creation, assessment, and approval of a business
case.

An examination of a Big Data analytics business case gives decision-makers direction in understanding the business resources that will be required and the business problems that need to be addressed. The case evaluation examines whether the business problem being addressed is truly a Big Data problem.

Data Identification
The data identification phase focuses on identifying the necessary datasets and
their sources for the analysis project. Identifying a larger range of data sources
may improve the chances of discovering hidden patterns and relationships. The
firm may require internal or external datasets and sources, depending on the
nature of the business problems it is addressing.

Data Acquisition and Filtering


The data acquisition process entails gathering data from all of the sources identified in the previous phase. The data is then subjected to automated filtering to remove corrupted data or records irrelevant to the study objectives. Depending on the type of data source, data might come as a collection of files, such as data acquired from a third-party data provider, or via API integration, such as with Twitter.

Once generated or entering the enterprise boundary, both internal and external data must be saved. For batch analytics, this data is saved to disk and then analysed; in real-time analytics, the data is analysed first and then saved to disk.

Data Extraction
This phase focuses on extracting disparate data and converting it into a format
that the underlying Big Data solution can use for data analysis.

Data Validation and Cleansing


Incorrect data can bias and misrepresent analytical results. Unlike typical enterprise data, which has a predefined structure and is verified before it is fed into analysis, Big Data can arrive unstructured and unvalidated. Its intricacy can make it difficult to develop a set of appropriate validation requirements. Data Validation and Cleansing is responsible for defining complicated validation criteria and removing any known faulty data.

Data Aggregation and Representation


The Data Aggregation and Representation phase focuses on combining multiple
datasets to create a cohesive view. Performing this stage can get tricky due to
variances in −
Data Structure − The data format may be the same, but the data model may
differ.

Semantics − A variable labelled differently in two datasets may signify the same
thing, for example, "surname" and "last name."

Data Analysis
The data analysis phase is responsible for carrying out the actual analysis work,
which usually comprises one or more types of analytics. Especially if the data
analysis is exploratory, we can continue this stage iteratively until we discover
the proper pattern or association.

Data Visualization
The Data Visualization phase presents data graphically to communicate outcomes for effective interpretation by business users. The resulting output supports visual analysis, allowing users to uncover answers to queries they have not yet formulated.

Utilization of Analysis Results


The outcomes are made available to business personnel to support business decision-making, for example via dashboards. The nine phases mentioned above are the primary phases of the Big Data Analytics life cycle.

The following phases can also be taken into consideration −

Research

Analyse what other companies have done in the same situation. This involves
looking for solutions that are reasonable for your company, even though it
involves adapting other solutions to the resources and requirements that your
company has. In this stage, a methodology for the future stages should be
defined.

Human Resources Assessment

Once the problem is defined, it's reasonable to continue analyzing whether the current
staff can complete the project successfully. Traditional BI teams might not be
capable of delivering an optimal solution to all the stages, so it should be
considered before starting the project if there is a need to outsource a part of
the project or hire more people.

Data Acquisition

This section is key in a big data life cycle; it defines which type of profiles would
be needed to deliver the resultant data product. Data gathering is a non-trivial
step of the process; it normally involves gathering unstructured data from
different sources. To give an example, it could involve writing a crawler to
retrieve reviews from a website. This involves dealing with text, perhaps in different languages, and normally requires a significant amount of time to complete.

Data Munging

Once the data is retrieved, for example, from the web, it needs to be stored in
an easy-to-use format. To continue with the review examples, let's assume the
data is retrieved from different sites where each has a different display of the
data.

Suppose one data source gives reviews in terms of a rating in stars; this can be read as a mapping for the response variable y ∈ {1, 2, 3, 4, 5}. Another data source gives reviews using an arrow system, one for upvoting and the other for downvoting. This would imply a response variable of the form y ∈ {positive, negative}.

To combine both data sources, a decision has to be made to make these two
response representations equivalent. This can involve converting the first data
source response representation to the second form, considering one star as
negative and five stars as positive. This process often requires a large time
allocation to be delivered with good quality.
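A small pandas sketch of this munging decision is shown below; the review data is invented, and treating ratings of 4-5 as positive and 1-3 as negative is just one possible convention.

import pandas as pd

# Reviews from the star-rating source (1-5) and from the arrow-voting source
stars = pd.DataFrame({"review_id": [1, 2, 3], "rating": [5, 1, 4]})
arrows = pd.DataFrame({"review_id": [4, 5], "vote": ["up", "down"]})

# Map both representations onto a single response variable y in {positive, negative}
stars["y"] = stars["rating"].map(lambda r: "positive" if r >= 4 else "negative")
arrows["y"] = arrows["vote"].map({"up": "positive", "down": "negative"})

combined = pd.concat([stars[["review_id", "y"]], arrows[["review_id", "y"]]])
print(combined)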
Data Storage
Once the data is processed, it sometimes needs to be stored in a database. Big
data technologies offer plenty of alternatives regarding this point. The most
common alternative is using the Hadoop File System for storage, which provides users with a limited version of SQL known as the Hive Query Language (HiveQL). This allows
most analytics tasks to be done in similar ways as would be done in traditional
BI data warehouses, from the user perspective. Other storage options to be
considered are MongoDB, Redis, and SPARK.

This stage of the cycle is related to the human resources knowledge in terms of
their abilities to implement different architectures. Modified versions of
traditional data warehouses are still being used in large-scale applications. For
example, Teradata and IBM offer SQL databases that can handle terabytes of
data; open-source solutions such as PostgreSQL and MySQL are still being used
for large-scale applications.

Even though there are differences in how the different storages work in the
background, from the client side, most solutions provide an SQL API. Hence
having a good understanding of SQL is still a key skill to have for big data
analytics. A priori, this stage seems to be the most important topic; in practice, this is not true. It is not even an essential stage. It is possible to implement a big data solution that works with real-time data, in which case we only need to gather data to develop the model and then implement it in real time, so there would be no need to formally store the data at all.

Exploratory Data Analysis

Once the data has been cleaned and stored in a way that insights can be
retrieved from it, the data exploration phase is mandatory. The objective of this
stage is to understand the data; this is normally done with statistical techniques and by plotting the data. This is a good stage to evaluate whether the problem
definition makes sense or is feasible.

Data Preparation for Modeling and Assessment

This stage involves reshaping the cleaned data retrieved previously and using
statistical preprocessing for missing values imputation, outlier detection,
normalization, feature extraction and feature selection.
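One way to chain these preprocessing steps is a scikit-learn Pipeline, sketched below on a tiny synthetic feature matrix; the choice of median imputation, standard scaling, and PCA is illustrative rather than prescriptive.

import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# Small synthetic feature matrix with a missing value
X = np.array([[1.0, 200.0], [2.0, np.nan], [3.0, 180.0], [4.0, 220.0]])

# Impute missing values, normalise, and extract a reduced set of features
prep = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
    ("features", PCA(n_components=1)),
])
X_ready = prep.fit_transform(X)
print(X_ready)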

Modelling

The prior stage should have produced several datasets for training and testing,
for example, for a predictive model. This stage involves trying different models with a view to solving the business problem at hand. In practice, it is normally desired that the model gives some insight into the business. Finally, the best model or combination of models is selected by evaluating its performance on a held-out dataset.

Implementation

In this stage, the data product developed is implemented in the data pipeline of
the company. This involves setting up a validation scheme while the data
product is working, to track its performance. For example, in the case of
implementing a predictive model, this stage would involve applying the model
to new data and once the response is available, evaluate the model.

Big Data Analytics - Architecture


What is Big Data Architecture?

Big data architecture is specifically designed to manage data ingestion, data processing, and analysis of data that is too large or complex for conventional relational databases to store, process, and manage. The solution is to organize technology into a big data architecture that is able to manage and process such data.

Key Aspects of Big Data Architecture

The following are some key aspects of big data architecture −

 To store and process data of large size, e.g., 100 GB or more.
 To aggregate and transform a wide variety of unstructured data for analysis and reporting.
 To access, process, and analyse streamed data in real time.

Diagram of Big Data Architecture


The following figure shows the Big Data Architecture with its sequential arrangement of different components. The outcome of one component works as an input to another component, and this process flow continues until the final outcome of processed data is produced.

Here is the diagram of big data architecture −


Components of Big Data Architecture

The following are the different components of big data architecture −

Data Sources
All big data solutions start with one or more data sources. The Big Data
Architecture accommodates various data sources and efficiently manages a wide
range of data types. Some common data sources in big data architecture
include transactional databases, logs, machine-generated data, social media
and web data, streaming data, external data sources, cloud-based data, NoSQL
databases, data warehouses, file systems, APIs, and web services.

These are only a few instances; in reality, the data environment is broad and
constantly changing, with new sources and technologies developing over time.
The primary challenge in big data architecture is successfully integrating,
processing, and analyzing data from various sources in order to gain relevant
insights and drive decision-making.

Data Storage
Data storage is the system for storing and managing large amounts of data in
big data architecture. Big data includes handling large amounts of structured,
semi-structured, and unstructured data; traditional relational databases often
prove inadequate due to scalability and performance limitations.

Distributed file stores, capable of storing large volumes of files in various formats, typically store data for batch processing operations. People often refer to this type of store as a data lake. You can use Azure Data Lake Storage or blob containers in Azure Storage for this purpose. In a big data architecture, the following image shows the key approaches to data storage −
The selection of a data storage system is contingent on different aspects,
including type of the data, performance requirements, scalability, and financial
limitations. Different big data architectures use a blend of these storage
systems to efficiently meet different use cases and objectives.

Batch Processing
Process data with long-running batch jobs that filter, aggregate, and prepare data for analysis; these jobs often involve reading and processing source files and then writing the output to new files. Batch processing is an essential component of big data architecture, allowing for the efficient processing of large amounts of data in scheduled batches. It entails gathering, processing, and analysing data in batches at predetermined intervals rather than in real time.

Batch processing is especially useful for operations that do not require immediate responses, such as data analytics, reporting, and batch-based data conversions. You can run U-SQL jobs in Azure Data Lake Analytics, use Hive, Pig, or custom Map/Reduce jobs in an HDInsight Hadoop cluster, or use Java, Scala, or Python programs in an HDInsight Spark cluster.
Real-time Message Ingestion
Big data architecture plays a significant role in real-time message ingestion, as
it necessitates the real-time capture and processing of data streams during their
generation or reception. This functionality helps enterprises deal with high-
speed data sources such as sensor feeds, log files, social media updates,
clickstreams, and IoT devices, among others.

Real-time message ingestion systems are critical for extracting important insights, identifying anomalies, and responding immediately to occurrences. The following image shows how the different methods for real-time message ingestion work within a big data architecture −

If the solution includes real-time sources, the architecture incorporates a method for capturing and storing real-time messages for stream processing. This could be a simple data store where incoming messages are dropped into a folder for processing. Nevertheless, many solutions need a message ingestion store to act as a buffer for messages and to facilitate scale-out processing, reliable delivery, and other message queuing semantics. Some efficient solutions are Azure Event Hubs, Azure IoT Hubs, and Kafka.

Stream Processing
Stream processing is a type of data processing that continuously processes data records as they are generated or received in real time. It enables enterprises to quickly analyze, transform, and respond to data streams, resulting in timely insights, alerts, and actions. Stream processing is a critical component of big data architecture, especially for dealing with high-volume data sources such as sensor data, logs, social media updates, financial transactions, and IoT device telemetry.

The following figure illustrates how stream processing works within a big data architecture −

After gathering real-time messages, the proposed solution processes the data by filtering, aggregating, and preparing it for analysis. The processed stream data is subsequently stored in an output sink. Azure Stream Analytics offers a managed stream processing service based on continuously executing SQL queries on unbounded streams. In addition, we may employ open-source Apache streaming technologies such as Storm and Spark Streaming on an HDInsight cluster.
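As a sketch of stream processing in code, the following Spark Structured Streaming job counts events per key from a hypothetical Kafka topic named clicks; it assumes a broker on localhost:9092 and that the Spark Kafka connector package is available on the cluster.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("ClickstreamCounts").getOrCreate()

# Read a hypothetical Kafka topic as an unbounded streaming DataFrame
clicks = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "localhost:9092")
          .option("subscribe", "clicks")
          .load())

# Count events per key and continuously write the running totals to the console
counts = clicks.groupBy(F.col("key")).count()
query = (counts.writeStream
         .outputMode("complete")
         .format("console")
         .start())
query.awaitTermination()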

Analytical Data Store


In big data analytics, an Analytical Data Store (ADS) is a customized database
or data storage system designed to deal with complicated analytical queries and
massive amounts of data. An ADS is intended to facilitate ad hoc querying, data
exploration, reporting, and advanced analytics tasks, making it an essential
component of big data systems for business intelligence and analytics. The key
features of Analytical Data Stores in big data analytics are summarized in the following figure −

Analytical tools can query structured data. A low-latency NoSQL technology, such as HBase or an interactive Hive database, could present the data by abstracting information from data files in the distributed data storage system. Azure Synapse Analytics is a managed solution for large-scale, cloud-based data warehousing. You can serve and analyze data using Hive, HBase, and Spark SQL with HDInsight.

Analysis and Reporting


Big data analysis and reporting are the processes of extracting insights,
patterns, and trends from huge and complex information to aid in decision-
making, strategic planning, and operational improvements. It includes different
strategies, tools, and methodologies for analyzing data and presenting results in
a useful and practical fashion.
The following image gives a brief idea of the different analysis and reporting methods in big data analytics −

Most big data solutions aim to extract insights from the data through analysis
and reporting. In order to enable users to analyze data, the architecture may
incorporate a data modeling layer, such as a multidimensional OLAP cube or
tabular data model in Azure Analysis Services. It may also offer self-service
business intelligence by leveraging the modeling and visualization features
found in Microsoft Power BI or Excel. Data scientists or analysts might conduct
interactive data exploration as part of their analysis and reporting processes.

Orchestration
In big data analytics, orchestration refers to the coordination and administration of the different tasks, processes, and resources used to execute data workflows. To ensure that big data analytics workflows run efficiently and reliably, it is necessary to automate the flow of data and processing steps, schedule jobs, manage dependencies, and monitor task performance.

The following figure includes the different steps used in orchestration −


Workflows that convert source data, transport data across different sources and
sinks, load the processed data into an analytical data store, or output the
results directly to a report or dashboard comprise most big data solutions. To
automate these activities, utilize an orchestration tool like Azure Data Factory,
Apache Oozie, or Sqoop.

Big Data Analytics - Methodology


In terms of methodology, big data analytics differs significantly from the traditional statistical approach of experimental design. Analytics starts with data. Normally, we model the data in a way that is able to answer the questions that business professionals have. The objectives of this approach are to predict response behaviour or to understand how the input variables relate to a response.

Typically, statistical experimental designs develop an experiment and then retrieve the resulting data. This enables the generation of data suitable for a statistical model, under the assumptions of independence, normality, and randomization. Big data analytics methodology, by contrast, begins with problem identification, and once the business problem is defined, a research stage is required to design the methodology. However, general guidelines are relevant to mention and apply to almost all problems.

The following figure demonstrates the methodology often followed in Big Data
Analytics −
Big Data Analytics Methodology

The following are the key steps of the big data analytics methodology −

Define Objectives
Clearly outline the analysis's goals and objectives. What insights do you seek?
What business difficulties are you attempting to solve? This stage is critical to
steering the entire process.

Data Collection
Gather relevant data from a variety of sources. This includes structured data
from databases, semi-structured data from logs or JSON files, and unstructured
data from social media, emails, and papers.

Data Pre-processing
This step involves cleaning and pre-processing the data to ensure its quality and
consistency. This includes addressing missing values, deleting duplicates,
resolving inconsistencies, and transforming data into a useful format.

Data Storage and Management


Store the data in an appropriate storage system. This could include a typical
relational database, a NoSQL database, a data lake, or a distributed file system
such as Hadoop Distributed File System (HDFS).

Exploratory Data Analysis (EDA)


This phase includes the identification of data features, finding patterns, and
detecting outliers. We often use visualization tools like histograms, scatter plots,
and box plots.

Feature Engineering
Create new features or modify existing ones to improve the performance of
machine learning models. This could include feature scaling, dimensionality
reduction, or constructing composite features.
Model Selection and Training
Choose relevant machine learning algorithms based on the nature of the
problem and the properties of the data. If labeled data is available, train the
models.

Model Evaluation
Measure the trained models' performance using accuracy, precision, recall, F1-
score, and ROC curves. This helps to determine the best-performing model for
deployment.
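The metrics named above are available in scikit-learn; the labels and probabilities below are invented values used only to show the calls.

from sklearn.metrics import (accuracy_score, precision_score,
                             recall_score, f1_score, roc_auc_score)

# Hypothetical ground-truth labels, predicted labels, and predicted probabilities
y_true = [0, 1, 1, 0, 1, 0, 1, 1]
y_pred = [0, 1, 0, 0, 1, 1, 1, 1]
y_prob = [0.2, 0.9, 0.4, 0.1, 0.8, 0.6, 0.7, 0.95]

print("accuracy :", accuracy_score(y_true, y_pred))
print("precision:", precision_score(y_true, y_pred))
print("recall   :", recall_score(y_true, y_pred))
print("f1-score :", f1_score(y_true, y_pred))
print("roc auc  :", roc_auc_score(y_true, y_prob))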

Deployment
In a production environment, deploy the model for real-world use. This could
include integrating the model with existing systems, creating APIs for model
inference, and establishing monitoring tools.
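A very small sketch of exposing a trained model through an API is shown below using Flask; the model file fraud_model.joblib and the /predict route are hypothetical, and a production deployment would add validation, logging, and monitoring.

from flask import Flask, jsonify, request
import joblib

app = Flask(__name__)

# Load a previously trained model from disk (file name is illustrative)
model = joblib.load("fraud_model.joblib")

@app.route("/predict", methods=["POST"])
def predict():
    # Expect a JSON body such as {"features": [0.3, 1.2]}
    features = request.get_json()["features"]
    prediction = model.predict([features])[0]
    return jsonify({"prediction": int(prediction)})

if __name__ == "__main__":
    app.run(port=5000)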

Monitoring and Maintenance


Continuously monitor the deployed model's performance and data quality. Also, change the analytics pipeline as needed to reflect changing business requirements or data characteristics.

Iterate
Big Data analytics is an iterative process. Analyze the data, collect comments,
and update the models or procedures as needed to increase accuracy and
effectiveness over time.

One of the most important tasks in big data analytics is statistical modeling,
meaning supervised and unsupervised classification or regression problems.
After cleaning and pre-processing the data for modeling, carefully assess
various models with appropriate loss metrics. After implementing the model,
conduct additional evaluations and report the outcomes. A common pitfall in
predictive modeling is to just implement the model and never measure its
performance.

Big Data Analytics - Core Deliverables


Big data analytics entails processing and analysing large and diverse datasets to
discover hidden patterns, correlations, insights, and other valuable information.
As mentioned in the big data life cycle, some core deliverables of big data analytics are shown in the image below −

Machine Learning Implementation

This could be a classification algorithm, a regression model or a segmentation model.
Recommender System

The objective is to develop a system that can recommend options based on user
behaviour. For example on Netflix, based on users' ratings for a particular
movie/web series/show, related movies, web series, and shows are
recommended.

Dashboard

Businesses normally need tools to visualize aggregated data. A dashboard is a graphical representation of data which can be filtered as per users' needs, with the results reflected on screen.

For example, a sales dashboard of a company may contain filter options to visualise sales nation-wise, state-wise, district-wise, zone-wise, or by product, etc.

Insights and Patterns Identification

Big data analytics identifies trends, patterns, and correlations in data that can
be used to make more informed decisions. These insights could be about
customer behaviour, market trends, or operational inefficiencies.

Ad-Hoc Analysis

Ad-hoc analysis in big data analytics is a process of analysing data on the fly or
spontaneously to answer specific, immediate queries or resolve ad-hoc
inquiries. Unlike traditional analysis, which relies on predefined queries or
structured reporting, ad hoc analysis allows users to explore data interactively,
without the requirement for predefined queries or reports.

Predictive Analytics

Big data analytics can forecast future trends, behaviours, and occurrences by
analysing previous data. Predictive analytics helps organisations to anticipate
customer needs, estimate demand, optimise resources, and manage risks.

Data Visualization

Big data analytics entails presenting complex data in visual forms like charts,
graphs, and dashboards. Data visualisation allows stakeholders to better grasp
and analyse the data insights graphically.

Optimization and Efficiency Improvement

Big data analytics enables organisations to optimise processes, operations, and resources by identifying areas for improvement and inefficiencies. This could include optimising supply chain logistics, streamlining manufacturing processes, or improving marketing strategies.

Personalization and Targeting

Big data analytics allows organisations to personalise their products, services, and marketing activities based on individual preferences and behaviour by analysing massive amounts of customer data. This personalised strategy increases customer satisfaction and marketing ROI.

Risk Management and Fraud Detection

Big data analytics can detect abnormalities and patterns that indicate fraudulent
activity or possible threats. This is especially crucial in businesses like finance,
insurance, and cybersecurity, where early discovery can save large losses.

Real-time Decision Making

Big data analytics can deliver insights in real or near real-time, enabling
businesses to make decisions based on data. This competence is critical in
dynamic contexts where quick decisions are required to capitalise on
opportunities or manage risks.

Scalability and Flexibility

Big data analytics solutions are built to manage large amounts of data from
different sources and formats. They provide scalability to support increasing
data quantities, as well as flexibility to react to changing business requirements
and data sources.

Competitive Advantage

Leveraging big data analytics efficiently can give firms a competitive advantage
by allowing them to innovate, optimise processes, and better understand their
consumers and market trends.

Compliance and Regulatory Requirements

Big data analytics could help firms in ensuring compliance with relevant
regulations and standards by analysing and monitoring data for legal and ethical
requirements, particularly in the healthcare and finance industries.

Overall, the core deliverables of big data analytics are focused on using data to
drive strategic decision-making, increase operational efficiency, improve
consumer experiences, and gain a competitive advantage in the marketplace.
Big Data Adoption and Planning Considerations
Adopting big data comes with its own set of challenges and considerations, but
with careful planning, organizations can maximize its benefits. Big Data
initiatives should be strategic and business-driven. The adoption of big data can
facilitate this change. The use of Big Data can be transformative, but it is
usually innovative. Transformation activities are often low-risk and aim to
improve efficiency and effectiveness.

The nature of Big Data and its analytic power brings issues and challenges that need to be planned for from the beginning. For example, the adoption of new technology raises security concerns, and ensuring that it conforms to existing corporate standards needs to be addressed. Issues related to tracking the provenance of a
dataset from its procurement to its utilization are often new requirements for
organizations. It is necessary to plan for the management of the privacy of
constituents whose data is being processed or whose identity is revealed by
analytical processes.

All of the aforementioned factors require that an organisation recognise and implement a set of distinct governance processes and decision frameworks to
ensure that all parties involved understand the nature, consequences, and
management requirements of Big Data. The approach to performing business
analysis is changing with the adoption of Big Data. The Big Data analytics
lifecycle is an effective solution. There are different factors to consider when we
implement Big Data.

The following image depicts big data adoption and planning considerations −
Big Data Adoption and Planning Considerations

The primary big data adoption and planning considerations are as follows −

Organization Prerequisites
Big Data frameworks are not turnkey solutions. Enterprises require data
management and Big Data governance frameworks for data analysis and
analytics to be useful. Effective processes are required for implementing,
customising, filling, and utilising Big Data solutions.

Define Objectives
Outline your aims and objectives for implementing big data. Whether it's
increasing the customer experience, optimising processes, or improving
decision-making, defined objectives always give a positive direction to the
decision-makers to frame strategy.

Data Procurement
The acquisition of Big Data solutions can be cost-effective, due to the
availability of open-source platforms and tools, as well as the potential to
leverage commodity hardware. A substantial budget may still be required to
obtain external data. Most commercially relevant data will have to be
purchased, which may necessitate continuing subscription expenses to ensure
the delivery of updates to obtained datasets.

Infrastructure
Evaluate your current infrastructure to see if it can handle big data processing
and analytics. Consider whether you need to invest in new hardware, software,
or cloud-based solutions to manage the volume, velocity, and variety of data.

Data Strategy
Create a comprehensive data strategy that is aligned with your business
objectives. This includes determining what sorts of data are required, where to
obtain them, how to store and manage them, and how to ensure their quality
and security.

Data Privacy and Security


Analytics on datasets may reveal confidential data about organisations or
individuals. Even datasets that individually contain benign data can reveal private information when the datasets are reviewed collectively. Addressing
these privacy concerns necessitates an awareness of the nature of the data
being collected, as well as relevant data privacy rules and particular procedures
for data tagging and anonymization. Telemetry data, such as a car's GPS record
or smart metre data readings, accumulated over a long period, might expose an
individual's location and behavior.
Securing data networks and repositories using authentication and authorization mechanisms is an essential element in securing big data.

Provenance
Provenance refers to information about the data's origins and processing.
Provenance information is used to determine the validity and quality of data and
can also be used for auditing. It can be difficult to maintain provenance as large amounts of data are collected, integrated, and processed through different phases.

Limited Realtime Support


Dashboards and other applications that require streaming data and alerts
frequently require real-time or near-realtime data transmissions. Different
open-source Big Data solutions and tools are batch-oriented; however, a new
phase of real-time open-source technologies supports streaming data
processing.
Distinct Performance Challenges
With the large amounts of data that Big Data solutions must handle,
performance is frequently an issue. For example, massive datasets combined
with advanced search algorithms can lead to long query times.

Distinct Governance Requirements


Big Data solutions access and generate data, which become corporate assets. A
governance structure is essential to ensure that both the data and the solution
environment are regulated, standardized, and evolved in a controlled way.
Establish strong data governance policies to assure data quality, integrity,
privacy, and compliance with legislation like GDPR and CCPA. Define data
management roles and responsibilities, as well as data access, usage, and
security processes.

Distinct Methodology
A mechanism will be necessary to govern the flow of data into and out of Big
Data systems.

It will also be necessary to explore how to construct feedback loops so that processed data can be revised again.

Continuous Improvement
Big data initiatives are iterative, and require on-going development over time.
Monitor performance indicators, get feedback, and fine-tune your strategy to
ensure that you're getting the most out of your data investments.
By carefully examining and planning for these factors, organisations can
successfully adopt and exploit big data to drive innovation, enhance efficiency,
and gain a competitive advantage in today's data-driven world.

Big Data Analytics - Key Stakeholders


Stakeholders are organisations or business professionals who will benefit from the project. In large organizations, to successfully develop a big data project, it is necessary to have management back the project. This normally involves finding a way to show the business advantages of the project.

We don't have a unique solution to the problem of finding sponsors for a project, but the following key points can help −

 Check who and where are the sponsors of other projects similar to the one that
interests you.
 Having personal contacts in key management positions helps, so any contact can
be triggered if the project is promising.
 Who would benefit from your project? Who would be your client once the project
is on track?
 Develop a simple, clear, and exciting proposal and share it with the key players
in your organization.

Stakeholders include the project sponsor, the project manager, the business intelligence analyst, the data engineer, the data scientist, the database administrator and the business user. The first phase of this Discovery programme is a good time for project managers and key stakeholders to sit together and negotiate appropriate funding at an early stage, so the project keeps functioning rather than being put on hold for later discussions.

A documentation process is a critical part in which the problem statement, project goal statement, and objectives are recorded. The document contains the requirements to achieve the goal and objectives, the success criteria, and the minimum acceptable outcome for the project, agreed with the key stakeholders.

The analytics challenge should be clarified and defined in collaboration with stakeholders. However, in some cases, project sponsors may have a
predetermined answer that can be biased. Thus, the deployment of a more
objective technique is preferable to a pre-defined solution that may be bypassed
by project sponsors. During the "Discovery" phase, hypotheses should be
produced and evaluated in conjunction with stakeholders.
Stakeholders, as domain experts, can provide suggestions and concepts to test while hypotheses are developed. The stakeholders are also interested in the project's results and findings, which should be presented and conveyed to them. The analytic team collaborates with stakeholders at the initial phase of the project to grasp the project requirements, objectives, and hypotheses, and at the end of the project to share the results and findings. The analytic team has more objectivity than the stakeholders.

Several key stakeholders play a critical role in ensuring the success of any Big
Data Analytics project. The following image shows the primary stakeholders
typically involved in Big Data Analytics projects −
Key Stakeholders of Big Data Analytics
Business Executives/Leadership
They set the overall vision and strategy for the organisation, including how Big
Data Analytics will be aligned with business objectives, and they provide the
necessary resources and support for data and analytics initiatives.

Data Scientists/Analysts
These are the experts in creating algorithms, models, and analytical tools to
extract insights from large data. They assess data and make actionable
recommendations to guide company decisions.

IT Professionals
The IT team manages the technical infrastructure necessary for data storage,
processing, and analysis. They ensure data security, scalability, and integration
with existing systems.

Data Engineers
These experts design, implement, and maintain the data architecture and
pipelines required to collect, store, and process huge amounts of data. They
ensure that data is accurate, consistent, and easily accessible.

Data Governance and Compliance Officers


They develop data management policies and procedures to ensure that data is
handled ethically, securely, and in compliance with legislation such as GDPR,
CCPA, and HIPAA, among others.

Business Analysts
They serve as a bridge between business stakeholders and data scientists,
converting business requirements into analytical solutions and vice versa.

End Users/Domain Experts


These are the experts who use the insights gained from big data analytics to
make educated decisions in their domain or department.
Finance Department
Finance stakeholders care about the cost-effectiveness of big data analytics
projects and may provide budgetary supervision and financial analysis.

Marketing and Sales Teams


These teams employ big data analytics insights to optimise marketing efforts,
target customers more effectively, and improve sales methods.

Customer Experience (CX) Teams


They use big data analytics to study customer behaviour, preferences, and
sentiment to improve the entire customer experience.

Legal Department
Legal experts ensure that data is used in accordance with applicable laws and
regulations, and they handle any legal risks related to data collection,
processing, and analysis.

External Partners and Vendors


Organisations may work with external partners or vendors to supply specialised
expertise, tools, or data for big data analytics projects.

The best way to find stakeholders for a project is to understand the problem
and what would be the resulting data product once it has been implemented.
This understanding will give an edge in convincing the management of the
importance of the big data project. Effective collaboration and communication
among these stakeholders are critical for developing successful big data
analytics programmes and realising the full value of data-driven decision-
making.

Big Data Analytics - Data Analyst


A Data Analyst is a person who collects, analyses and interprets data to solve a
particular problem. A data analyst devotes a lot of time to examining the data
and finds insights in terms of graphical reports and dashboards. Hence, a data
analyst has a reporting-oriented profile and has experience in extracting and
analyzing data from traditional data warehouses using SQL.
Working as a data analyst in big data analytics is a dynamic role. Big data
analytics involves analysing large and varied datasets to discover hidden
patterns, unknown relationships, market trends, customer needs, and related
valuable business insights.

In today's market, organizations struggle to find competent data scientists. It is
therefore often a good idea to select promising data analysts and train them in
the relevant skills to become data scientists. A competent data analyst has skills
such as business understanding, SQL programming, report design, and dashboard
creation.

Role and Responsibilities of Data Analyst

The image below summarises the major roles and responsibilities of a
data analyst −

Data Collection
It refers to a process of collecting data from different sources like databases,
data warehouses, APIs, and IoT devices. This could include conducting surveys,
tracking visitor behaviour on a company's website, or buying relevant data sets
from data collection specialists.
Data Cleaning and Pre-processing
Raw data may contain duplicates, errors, or outliers. Cleaning the data eliminates
these errors, inconsistencies, and duplicates, while pre-processing converts the
data into an analytically useful format. Cleaning can be done in a spreadsheet or
with a programming language, and it ensures that your interpretations are correct
and unbiased.
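As an illustration only, here is a minimal cleaning sketch in R using the data.table package on a small, invented data frame; the column names and cleaning rules are assumptions, not part of the tutorial's project −

library(data.table)

raw <- data.table(
   id     = c(1, 2, 2, 3, 4),
   age    = c(25, NA, NA, 130, 41),        # a missing value and an implausible outlier
   salary = c(52000, 61000, 61000, 58000, NA)
)

clean <- unique(raw)                        # drop exact duplicate rows
clean <- clean[!is.na(age) & age <= 100]    # remove missing and outlier ages
clean[is.na(salary), salary := median(clean$salary, na.rm = TRUE)]   # impute missing salaries
print(clean)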

Exploratory Data Analysis (EDA)


Using statistical methods and visualization tools, analysis of data is carried out
to identify trends, patterns or relationships.

Data Modelling
It includes designing database structures, deciding which types of data will be
stored and collected, and defining how data categories relate to one another and
how the data is represented.

Statistical Analysis
Applying statistical techniques to interpret data, validate hypotheses, and make
predictions.
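As a small illustration, a sketch of this kind of analysis in R on simulated data (the group names and effect size are invented) might look like this −

set.seed(42)
group_a <- rnorm(100, mean = 5.0, sd = 1)   # e.g. response times under process A
group_b <- rnorm(100, mean = 5.4, sd = 1)   # e.g. response times under process B

# Two-sample t-test: is the difference in means statistically significant?
result <- t.test(group_a, group_b)
print(result$p.value)

# Correlation between two simulated, related variables
x <- rnorm(100)
y <- 0.7 * x + rnorm(100, sd = 0.5)
print(cor(x, y))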

Machine Learning
Building predictive models using machine learning algorithms to forecast future
trends, classify data, or detect anomalies.

Data Visualization
To communicate data insights effectively to stakeholders, it is necessary to
create visual representations such as charts, graphs and dashboards.

Data Interpretation and Reporting


To communicate findings and recommendations to decision-makers through the
interpretation of analysis results, and preparation of reports or presentations.
Continuous Learning
It includes keeping up to date with the latest developments in data analysis, big
data technologies and business trends.

A data analyst builds their proficiency on a foundation of statistics, programming
languages like Python or R, database fundamentals, SQL, and big data technologies
such as Hadoop, Spark, and NoSQL databases.

What Tools Does a Data Analyst Use?

A data analyst uses a range of tools to complete assigned work more accurately and
efficiently during data analysis. Some common tools used by data analysts are
shown in the image below −

Types of Data Analysts

As technology has advanced rapidly, so have the types and amounts of data that can
be collected, and the ability to classify and analyse data has become an essential
skill in almost every business. Today, virtually every domain employs data
analysts, including criminal justice, fashion, food, technology, business, the
environment, and the public sector, among many others. People who perform data
analysis might be known as −

 Medical and health care analyst
 Market research analyst
 Business analyst
 Business intelligence analyst
 Operations research analyst

Data Analyst Skills

Generally, the skills of data analysts are divided into two major groups,
i.e. technical skills and behavioural skills.

Data Analyst Technical Skills


 Data Cleaning − A data analyst has proficiency in identifying and handling missing
data, outliers, and errors in datasets.
 Database Tools − Microsoft Excel and SQL are essential tools for any data analyst.
Excel is the most widely used tool in industry, while SQL can handle larger
datasets, using queries to manipulate and manage data according to the user's needs.
 Programming Languages − Data Analysts are proficient in languages such as Python,
R, and SQL for data manipulation, analysis, and visualization. Learning Python or R
makes it possible to work with large datasets and complex computations; both are
popular choices for data analysis.
 Data Visualisation − A competent data analyst must present their findings clearly
and compellingly. Knowing how to show data in charts and graphs helps coworkers,
employers, and stakeholders understand your work. Some popular data visualization
tools are Tableau, Jupyter Notebook, and Excel.
 Data Storytelling − Data Analysts can find and communicate insights effectively
through storytelling using data visualization and narrative techniques.
 Statistics and Maths − Statistical methods and tools are used to analyse data
distributions, correlations, and trends. Knowledge of statistics and maths can
guide us to determine which tools are best to use to solve a particular problem,
identify errors in data, and better understand the results.
 Big Data Tools − Data Analysts are familiar with big data processing tools and
frameworks like Hadoop, Spark, or Apache Kafka.
 Data Warehousing − Data Analysts also have an understanding of data warehousing
concepts and work with tools such as Amazon Redshift, Google BigQuery, or
Snowflake.
 Data Governance and Compliance − Data Analysts are aware of data governance
principles, data privacy laws, and regulations (Like GDPR, and HIPAA).
 APIs and Web Scraping − Data Analysts have expertise in pulling data from web
APIs and scraping data from websites using libraries like requests (Python) or
BeautifulSoup; a small R sketch of the same workflow follows this list.
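For illustration only, the following is a minimal sketch of the API-and-scraping workflow in R, using the httr, jsonlite, and rvest packages. The URLs are placeholders rather than real endpoints, and the object names are assumptions −

library(httr)
library(jsonlite)
library(rvest)

# Call a (hypothetical) REST API and parse the JSON response
resp <- GET("https://api.example.com/v1/sales", query = list(year = 2024))
if (status_code(resp) == 200) {
   sales <- fromJSON(content(resp, as = "text", encoding = "UTF-8"))
}

# Scrape all tables from a (hypothetical) HTML page
page <- read_html("https://www.example.com/prices")
tables <- html_table(html_nodes(page, "table"))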

Behavioural Skills
 Problem-solving − A data analyst can understand the problem that needs to be
solved. They identify patterns or trends that might reveal data. Critical thinking
abilities enable analysts to focus on the types of data, identify the most
illuminating methods of analysis, and detect gaps in their work.
 Analytical Thinking − The ability to evaluate complex problems, divide them into
smaller components, and devise logical solutions.
 Communication − As a data analyst, communicating ideas is essential. Data
analysts need solid writing and speaking abilities to communicate with colleagues
and stakeholders.
 Industry Knowledge − Knowing your industry, such as health care, business, or
finance, helps you communicate effectively with colleagues and stakeholders.
 Collaboration − Working well with team members, exchanging expertise, and
contributing to a collaborative environment in which ideas are openly exchanged.
 Time Management − Prioritizing work, meeting deadlines, and devoting time to
various areas of data analysis projects.
 Resilience − Dealing effectively with setbacks or failures in data analysis initiatives
while remaining determined to find solutions.

Role of Data Analysts in Today's Data-Driven World

Data analysts are essential in today's data-driven world and play a vital role on
many levels; some of the reasons are as follows −

 Strategic Decision-Making − Data analysts lay the groundwork for strategic
decision-making by identifying trends and insights that can inform corporate plans
and improve outcomes.
 Improving Efficiency − Data analysts assist firms in streamlining processes,
lowering costs, and increasing productivity by discovering operational
inefficiencies.
 Enhancing Customer Experiences − Analyzing customer data enables organizations to
better understand customer habits and preferences, resulting in better products
and services.
 Risk Management − Data analysis assists firms in identifying potential risks and
obstacles, allowing them to develop mitigation solutions.
 Business Intelligence − Turning raw data into relevant information and
visualizations helps stakeholders understand complex data. Data analysts produce
dashboards, reports, and presentations for data-driven decision-making across a
business.
 Predictive Analytics − Based on historical data, data analysts predict future patterns
and outcomes using statistical modelling and machine learning. This helps firms
anticipate customer wants, optimize resource allocation, and establish proactive
initiatives.
 Continuous Improvement − Data analysts assess and monitor data analysis
processes and methods to improve accuracy, efficiency, and relevance. They keep
up with new technology and best practices to better data analysis.
Big Data Analytics - Data Scientist
The role of a data scientist is normally associated with tasks such as predictive
modeling, developing segmentation algorithms, recommender systems, A/B
testing frameworks and often working with raw unstructured data.

The nature of their work demands a deep understanding of mathematics, applied
statistics and programming. There are a few skills common between a data analyst
and a data scientist, for example, the ability to query databases. Both analyze
data, but the decisions of a data scientist can have a greater impact in an
organization.

Here is a set of skills a data scientist normally needs to have −

 Programming in a statistical package such as R, Python, SAS, SPSS, or Julia
 Able to clean, extract, and explore data from different sources
 Research, design, and implementation of statistical models
 Deep statistical, mathematical, and computer science knowledge

In big data analytics, people normally confuse the role of a data scientist with
that of a data architect. In reality, the difference is quite simple. A data
architect defines the tools and the architecture in which the data will be stored,
whereas a data scientist uses this architecture. Of course, a data scientist
should be able to set up new tools if needed for ad-hoc projects, but the
infrastructure definition and design should not be part of their task.

Big Data Analytics - Problem Definition


Through this tutorial, we will develop a project. Each subsequent chapter in this
tutorial deals with a part of the larger project in its mini-project section. This
is intended to be an applied section that provides exposure to a real-world
problem. In this case, we start with the problem definition of the project.

Project Description

The objective of this project would be to develop a machine learning model to
predict the hourly salary of people using their curriculum vitae (CV) text as
input.

Using the framework defined above, it is simple to define the problem. We can
define X = {x1, x2, ..., xn} as the CVs of users, where, in the simplest
representation, each feature is the number of times a given word appears in the
CV. The response is real valued: we are trying to predict the hourly salary of
individuals in dollars.

These two considerations are enough to conclude that the problem presented
can be solved with a supervised regression algorithm.
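To make the formulation concrete, here is a minimal sketch in R of how such a model could be set up, using the tm package and a handful of invented CVs and hourly salaries; the data is purely illustrative, not part of the tutorial's dataset −

library(tm)

cvs <- c("data analyst sql excel reporting",
   "software engineer java spark big data",
   "senior data scientist python machine learning statistics")
hourly_salary <- c(25, 40, 60)   # invented response values, in dollars

# Build the bag-of-words matrix X: one row per CV, one column per word
corpus <- VCorpus(VectorSource(cvs))
X <- as.matrix(DocumentTermMatrix(corpus))

# Fit a linear model y ~ X: a supervised regression on word counts
# (with only three toy CVs the fit is purely illustrative)
train <- data.frame(salary = hourly_salary, X)
model <- lm(salary ~ ., data = train)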

Problem Definition

Problem Definition is probably one of the most complex and heavily neglected
stages in the big data analytics pipeline. In order to define the problem a data
product would solve, experience is mandatory. Most data scientist aspirants
have little or no experience in this stage.

Most big data problems can be categorized in the following ways −

 Supervised classification
 Supervised regression
 Unsupervised learning
 Learning to rank

Let us now learn more about these four concepts.

Supervised Classification
Given a matrix of features X = {x1, x2, ..., xn} we develop a model M to predict
different classes defined as y = {c1, c2, ..., cn}. For example: Given transactional
data of customers in an insurance company, it is possible to develop a model
that will predict if a client would churn or not. The latter is a binary classification
problem, where there are two classes or target variables: churn and not churn.

Other problems involve predicting more than two classes. For example, we could be
interested in digit recognition, in which case the response vector would be
defined as y = {0, 1, 2, 3, 4, 5, 6, 7, 8, 9}; a state-of-the-art model would be a
convolutional neural network, and the matrix of features would be the pixels of
the image.
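As a hedged illustration of the churn example above, the following R sketch simulates customer data and fits a logistic regression classifier; the feature names and effect sizes are invented −

set.seed(1)
n <- 500
customers <- data.frame(
   monthly_charges = runif(n, 20, 120),
   tenure_months   = rpois(n, 24)
)

# Simulate churn: more likely with high charges and short tenure
p <- plogis(-2 + 0.03 * customers$monthly_charges - 0.05 * customers$tenure_months)
customers$churn <- rbinom(n, 1, p)

# Fit a logistic regression model M to predict the binary class (churn / not churn)
M <- glm(churn ~ monthly_charges + tenure_months, data = customers, family = binomial)
predicted_prob <- predict(M, type = "response")
head(predicted_prob)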

Supervised Regression
In this case, the problem definition is rather similar to the previous example;
the difference lies in the response. In a regression problem, the response y ∈
ℜ, meaning the response is real valued. For example, we can develop a model
to predict the hourly salary of individuals given the corpus of their CV.
Unsupervised Learning
Management is often thirsty for new insights. Segmentation models can provide
this insight in order for the marketing department to develop products for
different segments. A good approach for developing a segmentation model,
rather than thinking of algorithms, is to select features that are relevant to the
segmentation that is desired.

For example, in a telecommunications company, it is interesting to segment clients
by their cellphone usage. This would involve disregarding features that have
nothing to do with the segmentation objective and including only those that do. In
this case, this would mean selecting features such as the number of SMS messages
sent in a month and the number of inbound and outbound minutes.
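A minimal sketch of such a segmentation in R, using simulated usage features and k-means clustering; the feature distributions and the choice of three segments are assumptions made only for illustration −

set.seed(7)
usage <- data.frame(
   sms_per_month    = rpois(300, lambda = 30),
   inbound_minutes  = rnorm(300, mean = 120, sd = 40),
   outbound_minutes = rnorm(300, mean = 150, sd = 50)
)

# Scale the features so no single unit dominates, then run k-means with 3 segments
segments <- kmeans(scale(usage), centers = 3, nstart = 25)
table(segments$cluster)   # size of each segment

# Profile each segment by its average usage
aggregate(usage, by = list(segment = segments$cluster), FUN = mean)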

Learning to Rank
This problem can be considered as a regression problem, but it has particular
characteristics and deserves separate treatment. Given a collection of documents,
we seek to find the most relevant ordering with respect to a query. In order to
develop a supervised learning algorithm, it is necessary to label how relevant an
ordering is for a given query.

It is relevant to note that in order to develop a supervised learning algorithm,
it is necessary to label the training data. This means that in order to train a
model that will, for example, recognize digits from an image, we need to label a
significant amount of examples by hand. There are web services, such as Amazon
Mechanical Turk, that can speed up this process and are commonly used for this
task. Learning algorithms generally improve their performance when provided with
more data, so labeling a decent amount of examples is practically mandatory in
supervised learning.

Big Data Analytics - Data Collection


Data collection plays the most important role in the Big Data cycle. The Internet
provides almost unlimited sources of data for a variety of topics. The
importance of this area depends on the type of business, but traditional
industries can acquire diverse sources of external data and combine them with
their transactional data.

For example, let's assume we would like to build a system that recommends
restaurants. The first step would be to gather data, in this case reviews of
restaurants from different websites, and store them in a database. As we are
interested in raw text and would use it for analytics, it is not that relevant
where the data for developing the model is stored. This may sound contradictory to
the main big data technologies, but in order to implement a big data application,
we simply need to make it work in real time.

Twitter Mini Project

Once the problem is defined, the following stage is to collect the data. The
following mini-project idea is to work on collecting data from the web and
structuring it to be used in a machine learning model. We will collect some tweets
from the Twitter REST API using the R programming language.

First of all, create a Twitter account, and then follow the instructions in
the twitteR package vignette to create a Twitter developer account. This is a
summary of those instructions −

 Go to https://twitter.com/apps/new and log in.
 After filling in the basic info, go to the "Settings" tab and select "Read, Write and
Access direct messages".
 Make sure to click on the save button after doing this.
 In the "Details" tab, take note of your consumer key and consumer secret.
 In your R session, you'll be using the API key and API secret values.
 Finally run the following script. This will install the twitteR package from its
repository on github.
install.packages(c("devtools", "rjson", "bit64", "httr"))

# Make sure to restart your R session at this point


library(devtools)
install_github("geoffjentry/twitteR")

We are interested in getting data where the string "big mac" is included and
finding out which topics stand out about this. In order to do this, the first step is
collecting the data from twitter. Below is our R script to collect required data
from twitter. This code is also available in
bda/part1/collect_data/collect_data_twitter.R file.

rm(list = ls(all = TRUE)); gc() # Clears the global environment

library(twitteR)
Sys.setlocale(category = "LC_ALL", locale = "C")

### Replace the xxx's with the values you got from the previous instructions
# consumer_key = "xxxxxxxxxxxxxxxxxxxx"
# consumer_secret = "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"
# access_token = "xxxxxxxxxx-xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"
# access_token_secret = "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"

# Connect to the Twitter REST API
setup_twitter_oauth(consumer_key, consumer_secret, access_token, access_token_secret)

# Get tweets related to big mac
tweets <- searchTwitter("big mac", n = 200, lang = "en")
df <- twListToDF(tweets)

# Take a look at the data
head(df)

# Check which device is most used
sources <- sapply(tweets, function(x) x$getStatusSource())
sources <- gsub("</a>", "", sources)
sources <- strsplit(sources, ">")
sources <- sapply(sources, function(x) ifelse(length(x) > 1, x[2], x[1]))

source_table = table(sources)
source_table = source_table[source_table > 1]
freq = source_table[order(source_table, decreasing = T)]
as.data.frame(freq)

# Frequency
# Twitter for iPhone    71
# Twitter for Android   29
# Twitter Web Client    25
# recognia              20

Big Data Analytics - Cleansing Data


Once the data is collected, we normally have diverse data sources with different
characteristics. The most immediate step would be to make these data sources
homogeneous and continue to develop our data product. However, it depends
on the type of data. We should ask ourselves if it is practical to homogenize the
data.

Maybe the data sources are completely different, and the information loss would be
large if the sources were homogenized. In this case, we can think of alternatives.
Can one data source help me build a regression model and the other one a
classification model? Is it possible to use the heterogeneity to our advantage
rather than just lose information? Making these decisions is what makes analytics
interesting and challenging.

In the case of reviews, it is possible to have a language for each data source.
Again, we have two choices −

 Homogenization − It involves translating different languages to the language where
we have the most data. The quality of translation services is acceptable, but if we
would like to translate massive amounts of data with an API, the cost would be
significant. There are software tools available for this task, but they would be
costly too.
 Heterogenization − Would it be possible to develop a solution for each language?
As it is simple to detect the language of a corpus, we could develop a
recommender for each language. This would involve more work in terms of tuning
each recommender according to the number of languages available, but it is
definitely a viable option if we have only a few languages; a short
language-detection sketch follows this list.
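For illustration only, here is a minimal R sketch of detecting the language of each review and splitting the corpus accordingly. It assumes the textcat package, which is just one of several language-detection options, and the reviews themselves are invented −

library(textcat)

reviews <- c("The burgers here are amazing",
   "La comida estuvo deliciosa y el servicio excelente",
   "Le service était lent mais les plats étaient bons")

languages <- textcat(reviews)   # e.g. "english", "spanish", "french"
print(languages)

# Split the corpus so each language gets its own recommender / model
by_language <- split(reviews, languages)
str(by_language)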

Twitter Mini Project

In the present case, we need to first clean the unstructured data and then convert
it to a data matrix in order to apply topic modelling to it. In general, when
getting data from Twitter, there are several characters we are not interested in
using, at least in the first stage of the data cleansing process.

For example, after getting the tweets we get these strange characters:
"<ed><U+00A0><U+00BD><ed><U+00B8><U+008B>". These are probably
emoticons, so in order to clean the data, we will just remove them using the
following script. This code is also available in
bda/part1/collect_data/cleaning_data.R file.

rm(list = ls(all = TRUE)); gc() # Clears the global environment


source('collect_data_twitter.R')
# Some tweets
head(df$text)

[1] "Im not a big fan of turkey but baked Mac &
cheese <ed><U+00A0><U+00BD><ed><U+00B8><U+008B>"
[2] "@Jayoh30 Like no special sauce on a big mac. HOW"
### We are interested in the text - Lets clean it!

# We first convert the encoding of the text from latin1 to ASCII


df$text <- sapply(df$text,function(row) iconv(row, "latin1", "ASCII", sub = ""))

# Create a function to clean tweets


clean.text <- function(tx) {
   tx <- gsub("htt.{1,20}", " ", tx, ignore.case = TRUE)                        # remove URLs
   tx <- gsub("[^#[:^punct:]]|@|RT", " ", tx, perl = TRUE, ignore.case = TRUE)  # remove punctuation (except #), @ and RT
   tx <- gsub("[[:digit:]]", " ", tx, ignore.case = TRUE)                       # remove digits
   tx <- gsub(" {1,}", " ", tx, ignore.case = TRUE)                             # collapse repeated spaces
   tx <- gsub("^\\s+|\\s+$", " ", tx, ignore.case = TRUE)                       # trim leading/trailing whitespace
   return(tx)
}
clean_tweets <- lapply(df$text, clean.text)

# Cleaned tweets
head(clean_tweets)
[1] " WeNeedFeminlsm MAC s new make up line features men woc and big girls "
[1] " TravelsPhoto What Happens To Your Body One Hour After A Big Mac "

The final step of the data cleansing mini project is to have cleaned text we can
convert to a matrix and apply an algorithm to. From the text stored in
the clean_tweets vector we can easily convert it to a bag of words matrix and
apply an unsupervised learning algorithm.
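As a hedged sketch of that next step (assuming the tm and topicmodels packages, with a topic count of four chosen purely for illustration), the cleaned tweets could be converted to a document-term matrix and fed to an LDA topic model like this −

library(tm)
library(topicmodels)

# Build a bag-of-words (document-term) matrix from the cleaned tweets
corpus <- VCorpus(VectorSource(unlist(clean_tweets)))
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removeWords, stopwords("english"))

dtm <- DocumentTermMatrix(corpus)
dtm <- dtm[rowSums(as.matrix(dtm)) > 0, ]   # drop tweets that became empty after cleaning

# Fit a 4-topic LDA model and inspect the top five terms per topic
lda_model <- LDA(dtm, k = 4, control = list(seed = 123))
terms(lda_model, 5)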

Big Data Analytics - Summarizing Data


Reporting is very important in big data analytics. Every organization must have
a regular provision of information to support its decision-making process. This
task is normally handled by data analysts with SQL and ETL (extract, transform,
and load) experience.

The team in charge of this task has the responsibility of spreading the
information produced in the big data analytics department to different areas of
the organization.

The following example demonstrates what summarization of data means.


Navigate to the folder bda/part1/summarize_data and inside the folder, open
the summarize_data.Rproj file by double clicking it. Then, open
the summarize_data.R script and take a look at the code, and follow the
explanations presented.

# Install the following packages by running the following code in R.


pkgs = c('data.table', 'ggplot2', 'nycflights13', 'reshape2')
install.packages(pkgs)

The ggplot2 package is great for data visualization. The data.table package is a
great option for fast and memory-efficient summarization in R. Benchmarks have
shown it to be faster than pandas, the Python library used for similar tasks, on
many grouping operations.
Take a look at the data using the following code. This code is also available
in the bda/part1/summarize_data/summarize_data.R file.

library(nycflights13)
library(ggplot2)
library(data.table)
library(reshape2)

# Convert the flights data.frame to a data.table object and call it DT


DT <- as.data.table(flights)

# The data has 336776 rows and 16 columns


dim(DT)

# Take a look at the first rows


head(DT)

# year month day dep_time dep_delay arr_time arr_delay carrier


# 1: 2013 1 1 517 2 830 11 UA
# 2: 2013 1 1 533 4 850 20 UA
# 3: 2013 1 1 542 2 923 33 AA
# 4: 2013 1 1 544 -1 1004 -18 B6
# 5: 2013 1 1 554 -6 812 -25 DL
# 6: 2013 1 1 554 -4 740 12 UA

# tailnum flight origin dest air_time distance hour


minute
# 1: N14228 1545 EWR IAH 227 1400 5 17
# 2: N24211 1714 LGA IAH 227 1416 5 33
# 3: N619AA 1141 JFK MIA 160 1089 5 42
# 4: N804JB 725 JFK BQN 183 1576 5 44
# 5: N668DN 461 LGA ATL 116 762 5 54
# 6: N39463 1696 EWR ORD 150 719 5 54

The following code has an example of data summarization.

### Data Summarization


# Compute the mean arrival delay
DT[, list(mean_arrival_delay = mean(arr_delay, na.rm = TRUE))]
# mean_arrival_delay
# 1: 6.895377
# Now, we compute the same value but for each carrier
mean1 = DT[, list(mean_arrival_delay = mean(arr_delay, na.rm = TRUE)),
by = carrier]
print(mean1)
# carrier mean_arrival_delay
# 1: UA 3.5580111
# 2: AA 0.3642909
# 3: B6 9.4579733
# 4: DL 1.6443409
# 5: EV 15.7964311
# 6: MQ 10.7747334
# 7: US 2.1295951
# 8: WN 9.6491199
# 9: VX 1.7644644
# 10: FL 20.1159055
# 11: AS -9.9308886
# 12: 9E 7.3796692
# 13: F9 21.9207048
# 14: HA -6.9152047
# 15: YV 15.5569853
# 16: OO 11.9310345

# Now let's compute two means in the same line of code


mean2 = DT[, list(mean_departure_delay = mean(dep_delay, na.rm = TRUE),
mean_arrival_delay = mean(arr_delay, na.rm = TRUE)),
by = carrier]
print(mean2)

# carrier mean_departure_delay mean_arrival_delay


# 1: UA 12.106073 3.5580111
# 2: AA 8.586016 0.3642909
# 3: B6 13.022522 9.4579733
# 4: DL 9.264505 1.6443409
# 5: EV 19.955390 15.7964311
# 6: MQ 10.552041 10.7747334
# 7: US 3.782418 2.1295951
# 8: WN 17.711744 9.6491199
# 9: VX 12.869421 1.7644644
# 10: FL 18.726075 20.1159055
# 11: AS 5.804775 -9.9308886
# 12: 9E 16.725769 7.3796692
# 13: F9 20.215543 21.9207048
# 14: HA 4.900585 -6.9152047
# 15: YV 18.996330 15.5569853
# 16: OO 12.586207 11.9310345

### Create a new variable called gain


# this is the difference between arrival delay and departure delay
DT[, gain:= arr_delay - dep_delay]

# Compute the median gain per carrier


median_gain = DT[, median(gain, na.rm = TRUE), by = carrier]
print(median_gain)

Big Data Analytics - Data Exploration


Exploratory data analysis is a concept developed by John Tukey (1977) that
consists of a new perspective on statistics. Tukey's idea was that in traditional
statistics the data was not being explored graphically; it was just being used to
test hypotheses. The first attempt to develop a tool was made at Stanford; the
project was called prim9. The tool was able to visualize data in nine dimensions,
and therefore it was able to provide a multivariate perspective of the data.

Today, exploratory data analysis is a must and has been included in the big data
analytics life cycle. The ability to find insight and communicate it effectively
in an organization is fueled by strong EDA capabilities.

Based on Tukey's ideas, Bell Labs developed the S programming language in order
to provide an interactive interface for doing statistics. The idea of S was to
provide extensive graphical capabilities with an easy-to-use language. In today's
world, in the context of Big Data, R, which is based on the S programming
language, is the most popular software for analytics.

The following program is an example of exploratory data analysis. This code is
also available in the part1/eda/exploratory_data_analysis.R file.

library(nycflights13)
library(ggplot2)
library(data.table)
library(reshape2)

# Using the code from the previous section


# This computes the mean arrival and departure delays by carrier.
DT <- as.data.table(flights)
mean2 = DT[, list(mean_departure_delay = mean(dep_delay, na.rm = TRUE),
mean_arrival_delay = mean(arr_delay, na.rm = TRUE)),
by = carrier]

# In order to plot data in R using ggplot, it is normally necessary to reshape the data
# We want to have the data in long format for plotting with ggplot
dt = melt(mean2, id.vars = 'carrier')

# Take a look at the first rows


print(head(dt))

# Take a look at the help for ?geom_point and ?geom_line to find similar examples
# Here we take the carrier code as the x axis
# the value from the dt data.table goes in the y axis

# The variable column represents the color

p = ggplot(dt, aes(x = carrier, y = value, color = variable, group = variable)) +
   geom_point() + # Plots points
   geom_line() + # Plots lines
   theme_bw() + # Uses a white background
   labs(title = 'Mean arrival and departure delay by carrier',
      x = 'Carrier', y = 'Mean delay')
print(p)

# Save the plot to disk


ggsave('mean_delay_by_carrier.png', p,
width = 10.4, height = 5.07)

The code should produce an image such as the following −


Big Data Analytics - Data Visualization
In order to understand data, it is often useful to visualize it. Normally in Big
Data applications, the interest lies in finding insight rather than just making
beautiful plots. The following are examples of different approaches to
understanding data using plots.

To start analyzing the flights data, we can start by checking if there are
correlations between numeric variables. This code is also available
in bda/part1/data_visualization/data_visualization.R file.

# Install the package corrplot by running


install.packages('corrplot')

# then load the library


library(corrplot)

# Load the following libraries


library(nycflights13)
library(ggplot2)
library(data.table)
library(reshape2)

# We will continue working with the flights data


DT <- as.data.table(flights)
head(DT) # take a look

# We select the numeric variables after inspecting the first rows.


numeric_variables = c('dep_time', 'dep_delay',
'arr_time', 'arr_delay', 'air_time', 'distance')

# Select numeric variables from the DT data.table


dt_num = DT[, numeric_variables, with = FALSE]

# Compute the correlation matrix of dt_num


cor_mat = cor(dt_num, use = "complete.obs")
print(cor_mat)
### Here is the correlation matrix
# dep_time dep_delay arr_time arr_delay air_time
distance
# dep_time 1.00000000 0.25961272 0.66250900 0.23230573 -0.01461948 -
0.01413373
# dep_delay 0.25961272 1.00000000 0.02942101 0.91480276 -0.02240508 -
0.02168090
# arr_time 0.66250900 0.02942101 1.00000000 0.02448214 0.05429603
0.04718917
# arr_delay 0.23230573 0.91480276 0.02448214 1.00000000 -0.03529709 -
0.06186776
# air_time -0.01461948 -0.02240508 0.05429603 -0.03529709 1.00000000
0.99064965
# distance -0.01413373 -0.02168090 0.04718917 -0.06186776 0.99064965
1.00000000

# We can display it visually to get a better understanding of the data


corrplot.mixed(cor_mat, lower = "circle", upper = "ellipse")

# save it to disk
png('corrplot.png')
print(corrplot.mixed(cor_mat, lower = "circle", upper = "ellipse"))
dev.off()

This code generates the following correlation matrix visualization −


We can see in the plot that there is a strong correlation between some of the
variables in the dataset. For example, arrival delay and departure delay seem to
be highly correlated. We can see this because the ellipse shows an almost linear
relationship between the two variables; however, it is not simple to infer
causation from this result.

We cannot say that because two variables are correlated, one has an effect on the
other. We also find in the plot a strong correlation between air time and
distance, which is reasonable to expect, as with greater distance the flight time
should grow.

We can also do univariate analysis of the data. A simple and effective way to
visualize distributions are box-plots. The following code demonstrates how to
produce box-plots and trellis charts using the ggplot2 library. This code is also
available in bda/part1/data_visualization/boxplots.R file.

source('data_visualization.R')
### Analyzing Distributions using box-plots
# The following shows the distance as a function of the carrier
# Define the carrier in the x axis and distance in the y axis
p = ggplot(DT, aes(x = carrier, y = distance, fill = carrier)) +
   geom_boxplot() + # Use the box-plot geom
   theme_bw() + # Leave a white background - more in line with Tufte's principles than the default
   guides(fill = FALSE) + # Remove legend
   labs(title = 'Distance as a function of carrier', # Add labels
      x = 'Carrier', y = 'Distance')
p
# Save to disk
png('boxplot_carrier.png')
print(p)
dev.off()

# Let's add now another variable, the month of each flight


# We will be using facet_wrap for this
p = ggplot(DT, aes(carrier, distance, fill = carrier)) +
   geom_boxplot() +
   theme_bw() +
   guides(fill = FALSE) +
   facet_wrap(~month) + # This creates the trellis plot with the by-month variable
   labs(title = 'Distance as a function of carrier by month',
      x = 'Carrier', y = 'Distance')
p
# The plot shows there aren't clear differences between distance in different months

# Save to disk
png('boxplot_carrier_by_month.png')
print(p)
dev.off()
