Big Data
Gartner defines Big Data as "high-volume, high-velocity and/or high-variety information that demands cost-effective, innovative forms of information processing that enable enhanced insight, decision making, and process automation."
Big Data is a collection of large data sets that traditional computing approaches cannot process and manage. It is a broad term that refers to the
massive volume of complex data sets that businesses and governments
generate in today's digital world. It is often measured in petabytes or terabytes
and originates from three key sources: transactional data, machine data, and
social data.
Big Data Analytics is a powerful tool that helps unlock the potential of large and complex datasets. To get a better understanding, let's break it down into
key steps −
Data Collection
This is the initial step, in which data is collected from different sources like
social media, sensors, online channels, commercial transactions, website logs
etc. Collected data might be structured (predefined organisation, such as
databases), semi-structured (like log files) or unstructured (text documents,
photos, and videos).
Data Analysis
This is a key phase of big data analytics. Different techniques and algorithms
are used to analyse data and derive useful insights. This can include descriptive
analytics (summarising data to better understand its characteristics), diagnostic
analytics (identifying patterns and relationships), predictive analytics (predicting
future trends or outcomes), and prescriptive analytics (making
recommendations or decisions based on the analysis).
Data Visualization
This step presents data in visual form using charts, graphs, and interactive dashboards, making the insights from data analysis clearer and more actionable.
Types of Big-Data
Big Data is generally categorized into three different varieties. They are as
shown below −
Structured Data
Semi-Structured Data
Unstructured Data
Structured Data
Structured data follows a predefined organisation, typically the rows and columns of a relational database, which makes it straightforward to store, query, and analyse.
Semi-Structured Data
Semi-structured data sits between the two: it inherits some qualities of structured data, but most of it lacks a rigid schema and does not follow the formal structure of data models such as an RDBMS. Example: Comma Separated Values (CSV) files.
Unstructured Data
Unstructured data is data that does not follow any predefined structure. It lacks a uniform format and is constantly changing. However, it may occasionally include date- and time-related information. Examples: audio files, images, etc.
Descriptive Analytics
Descriptive analytics answers the question "What is happening in my business?" when the dataset is business-related. Overall, it summarises past facts and aids in the creation of reports such as a company's income, profit, and sales figures, as well as the tabulation of social media metrics. It works with comprehensive, accurate, live data and supports effective visualisation.
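As a small illustration (a hedged sketch in R, using the built-in mtcars dataset as stand-in business data), descriptive analytics often starts with simple summaries and aggregations −
# Descriptive analytics sketch: summarise a dataset to see what is happening
data(mtcars)                                     # built-in sample data, used here as a stand-in
summary(mtcars$mpg)                              # min, quartiles, mean and max of one measure
aggregate(mpg ~ cyl, data = mtcars, FUN = mean)  # average mpg per cylinder group
table(mtcars$cyl)                                # how many records fall in each group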
Diagnostic Analytics
Diagnostic analytics determines root causes from data; it answers the question "Why is it happening?" Some common techniques are drill-down, data mining, and data recovery. Organisations use diagnostic analytics because it provides in-depth insight into a particular problem: it can drill down to the root causes and isolate confounding information.
For example − A report from an online store says that sales have decreased,
even though people are still adding items to their shopping carts. Several things
could have caused this, such as the form not loading properly, the shipping cost
being too high, or not enough payment choices being offered. You can use
diagnostic data to figure out why this is happening.
Predictive Analytics
This kind of analytics looks at data from the past and the present to estimate what will happen in the future; hence, it answers the question "What will happen in the future?" Predictive analytics uses data mining, AI, and machine learning to analyse current data and forecast future outcomes, such as market trends and customer behaviour.
For example − PayPal sets rules to keep its customers safe from fraudulent transactions. The business uses predictive analytics to analyse all of its past payment and user-behaviour data and build a model that can spot fraud.
Prescriptive Analytics
Prescriptive analytics gives the ability to frame a strategic decision; its analytical results answer the question "What do I need to do?" Prescriptive analytics builds on both descriptive and predictive analytics and, most of the time, relies on AI and machine learning.
For example − Prescriptive analytics can help a company maximise its business and profit. In the airline industry, prescriptive analytics applies algorithms that change flight prices automatically based on customer demand and reduce ticket prices due to bad weather conditions, location, holiday seasons, etc.
Hadoop
A tool to store and analyze large amounts of data. Hadoop makes it possible to deal with big data; it is one of the tools that made big data analytics practical.
MongoDB
A tool for managing unstructured data. It is a database specifically designed to store, access, and process large quantities of unstructured data.
Talend
A tool to use for data integration and management. Talend's solution package
includes complete capabilities for data integration, data quality, master data
management, and data governance. Talend integrates with big data
management tools like Hadoop, Spark, and NoSQL databases allowing
organisations to process and analyse enormous amounts of data efficiently. It
includes connectors and components for interacting with big data technologies,
allowing users to create data pipelines for ingesting, processing, and analysing
large amounts of data.
Cassandra
A distributed database used to handle chunks of data. Cassandra is an open-
source distributed NoSQL database management system that handles massive
amounts of data over several commodity servers, ensuring high availability and
scalability without sacrificing performance.
Spark
Used for real-time processing and analyzing large amounts of data. Apache
Spark is a robust and versatile distributed computing framework that provides a
single platform for big data processing, analytics, and machine learning, making
it popular in industries such as e-commerce, finance, healthcare, and
telecommunications.
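For R users, a minimal sketch of using Spark through the sparklyr package could look like the following; this is an assumption of the example (it requires a local Spark installation, the sparklyr and dplyr packages, and the nycflights13 sample data), not part of the original text −
library(sparklyr)
library(dplyr)
sc <- spark_connect(master = "local")                    # start a local Spark session
flights_tbl <- copy_to(sc, nycflights13::flights, "flights", overwrite = TRUE)
flights_tbl %>%                                          # dplyr verbs are translated to Spark SQL
  group_by(carrier) %>%
  summarise(mean_delay = mean(dep_delay, na.rm = TRUE)) %>%
  collect()                                              # bring the aggregated result back into R
spark_disconnect(sc)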
Storm
It is an open-source real-time computational system. Apache Storm is a robust
and versatile stream processing framework that allows organisations to process
and analyse real-time data streams on a large scale, making it suited for a wide
range of use cases in industries such as banking, telecommunications, e-
commerce, and IoT.
Kafka
It is a distributed streaming platform that is used for fault-tolerant storage.
Apache Kafka is a versatile and powerful event streaming platform that allows
organisations to create scalable, fault-tolerant, and real-time data pipelines and
streaming applications to efficiently meet their data processing requirements.
The characteristics of Big Data, often summarized by the "Five V's," include −
Volume
As its name implies, volume refers to the large amount of data generated and stored every second from IoT devices, social media, videos, financial transactions, and customer logs. The data generated from these devices and sources can range from terabytes to petabytes and beyond. Managing such large quantities of data requires robust storage solutions and advanced data-processing techniques; the Hadoop framework, for example, is used to store, access, and process big data.
Facebook generates about 4 petabytes of data per day, which is 4 million gigabytes. All of that data is stored in what is known as the Hive, which contains about 300 petabytes of data [1].
Fig: Minutes spent per day on social apps (Image source: Recode)
Fig: Engagement per user on leading social media apps in India (Image source:
www.statista.com) [2]
From the above graphs, we can see how much time users devote to different channels and how much data they generate; hence, data volume is growing higher day by day.
Velocity
Velocity is the speed at which data is generated, processed, and analysed. With the development and usage of IoT devices and real-time data streams, the velocity of data has expanded tremendously, demanding systems that can process data instantly to derive meaningful insights. Typical high-velocity data sources include sensor readings, application logs, social media updates, financial transactions, and IoT device telemetry.
Variety
Big Data includes different types of data like structured data (found in
databases), unstructured data (like text, images, videos), and semi-structured
data (like JSON and XML). This diversity requires advanced tools for data
integration, storage, and analysis.
Veracity
Veracity refers to the trustworthiness and quality of the data. Because big data arrives from many sources in many formats, it can contain noise, inconsistencies, and bias, so its accuracy must be assessed before it is relied on for decision-making.
Value
The ability to convert large volumes of data into useful insights. Big Data's ultimate goal is to extract meaningful and actionable insights that can lead to better decision-making, new products, enhanced consumer experiences, and competitive advantages.
These qualities characterise the nature of Big Data and highlight the importance
of modern tools and technologies for effective data management, processing,
and analysis.
Big Data Analytics - Data Life Cycle
A life cycle denotes the sequential flow of activities involved in Big Data Analytics. Before learning about the big data analytics life cycle, let's first understand the traditional data mining life cycle.
The Traditional Data Mining Life Cycle includes the following phases −
Problem Definition − This is the initial phase of the data mining process; it includes defining the problem that needs to be uncovered or solved. A problem definition always includes the business goals that need to be achieved and the data that needs to be explored to identify patterns, business trends, and process flows that achieve the defined goals.
Data Collection − The next step is data collection. This phase involves data
extraction from different sources like databases, weblogs, or social media
platforms that are required for analysis and to do business intelligence. Collected
data is considered raw data because it includes impurities and may not be in the
required formats and structures.
Data Pre-processing − After data collection, we clean and pre-process the data: removing noise, imputing missing values, transforming data, selecting features, and converting the data into the required format before the analysis can begin.
Data Exploration and Visualization − Once pre-processing is done on data, we explore
it to understand its characteristics, and identify patterns and trends. This phase
also includes data visualizations using scatter plots, histograms, or heat maps to
show the data in graphical form.
Modelling − This phase includes creating data models to solve realistic problems
defined in Phase 1. This could include an effective machine learning algorithm;
training the model, and assessing its performance.
Evaluation − The final stage in data mining is to assess the model's performance
and determine if it matches your business goals in step 1. If the model is
underperforming, you may need to do data exploration or feature selection once
again.
CRISP-DM Methodology
The CRISP-DM stands for Cross Industry Standard Process for Data Mining; it is
a methodology which describes commonly used approaches that a data mining
expert uses to tackle problems in traditional BI data mining. It is still being used
in traditional BI data mining teams. The following figure illustrates the major phases of the CRISP-DM cycle and how they are interrelated with one another.
CRISP-DM was introduced in 1996 and the next year, it got underway as a
European Union project under the ESPRIT funding initiative. The project was led
by five companies: SPSS, Teradata, Daimler AG, NCR Corporation, and OHRA
(an insurance company). The project was finally incorporated into SPSS.
SEMMA Methodology
SEMMA is another data mining methodology, developed by SAS Institute; its name stands for its five phases: Sample, Explore, Modify, Model, and Assess.
Sample − The process starts with data sampling, e.g., selecting the dataset for modelling. The dataset should be large enough to contain sufficient information to retrieve, yet small enough to be used efficiently. This phase also deals with data partitioning.
Explore − This phase covers the understanding of the data by discovering
anticipated and unanticipated relationships between the variables, and also
abnormalities, with the help of data visualization.
Modify − The Modify phase contains methods to select, create and transform
variables in preparation for data modelling.
Model − In the Model phase, the focus is on applying various modelling (data
mining) techniques on the prepared variables to create models that possibly
provide the desired outcome.
Assess − The evaluation of the modelling results shows the reliability and
usefulness of the created models.
The main difference between CRISP-DM and SEMMA is that SEMMA focuses on the modelling aspect, whereas CRISP-DM gives more importance to the stages of the cycle before modelling, such as understanding the business problem to be solved and understanding and pre-processing the data to be used as input to, for example, machine learning algorithms.
Big Data Analytics is a field that involves managing the entire data lifecycle,
including data collection, cleansing, organisation, storage, analysis, and
governance. In the context of big data, the traditional approaches were not
optimal for analysing large-volume data, data with different values, data
velocity etc.
For example, the SEMMA methodology disregards data collection and pre-processing of different data sources. These stages normally constitute most of
the work in a successful big data project. Big Data analytics involves the
identification, acquisition, processing, and analysis of large amounts of raw
data, unstructured and semi-structured data which aims to extract valuable
information for trend identification, enhancing existing company data, and
conducting extensive searches.
The primary differences between Big Data Analytics and traditional data analysis lie in the value, velocity, and variety of the data processed, so an organised method is required to address the specific requirements of big data analysis. The Big Data analytics lifecycle can be divided into the phases described below −
Data Identification
The data identification phase focuses on identifying the necessary datasets and
their sources for the analysis project. Identifying a larger range of data sources
may improve the chances of discovering hidden patterns and relationships. The
firm may require internal or external datasets and sources, depending on the
nature of the business problems it is addressing.
Data Extraction
This phase focuses on extracting disparate data and converting it into a format
that the underlying Big Data solution can use for data analysis.
Semantics − A variable labelled differently in two datasets may signify the same
thing, for example, "surname" and "last name."
Data Analysis
The data analysis phase is responsible for carrying out the actual analysis work,
which usually comprises one or more types of analytics. Especially if the data
analysis is exploratory, we can continue this stage iteratively until we discover
the proper pattern or association.
Data Visualization
The Data Visualization phase presents data graphically to communicate outcomes for effective interpretation by business users. The resulting output supports visual analysis, allowing users to uncover answers to questions they have not yet formulated.
Research
Analyse what other companies have done in the same situation. This involves
looking for solutions that are reasonable for your company, even though it
involves adapting other solutions to the resources and requirements that your
company has. In this stage, a methodology for the future stages should be
defined.
Once the problem is defined, it's reasonable to continue by analyzing whether the current staff can complete the project successfully. Traditional BI teams might not be
capable of delivering an optimal solution to all the stages, so it should be
considered before starting the project if there is a need to outsource a part of
the project or hire more people.
Data Acquisition
This stage is key in a big data life cycle; it defines which types of profiles are needed to deliver the resulting data product. Data gathering is a non-trivial step of the process; it normally involves gathering unstructured data from different sources. For example, it could involve writing a crawler to retrieve reviews from a website. This involves dealing with text, perhaps in different languages, and normally requires a significant amount of time to complete.
Data Munging
Once the data is retrieved, for example, from the web, it needs to be stored in
an easy-to-use format. To continue with the review examples, let's assume the
data is retrieved from different sites where each has a different display of the
data.
Suppose one data source gives reviews in terms of a rating in stars; therefore it is possible to read this as a mapping for the response variable y ∈ {1, 2, 3, 4, 5}. Another data source gives reviews using an arrow system, one for upvoting and the other for downvoting. This would imply a response variable of the form y ∈ {positive, negative}.
To combine both data sources, a decision has to be made to make these two
response representations equivalent. This can involve converting the first data
source response representation to the second form, considering one star as
negative and five stars as positive. This process often requires a large time
allocation to be delivered with good quality.
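A minimal R sketch of this kind of harmonisation, using small hypothetical vectors stars and arrows to stand in for the two raw sources −
# Harmonise two review sources into one binary response
stars  <- c(1, 3, 5, 4, 2)                               # source 1: ratings in stars, y in {1,...,5}
arrows <- c("up", "down", "up")                          # source 2: up/down votes
stars_to_label <- ifelse(stars >= 4, "positive",
                  ifelse(stars <= 2, "negative", NA))    # 3-star reviews left undecided
arrows_to_label <- ifelse(arrows == "up", "positive", "negative")
y <- c(stars_to_label, arrows_to_label)                  # combined response, y in {positive, negative}
table(y, useNA = "ifany")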
Data Storage
Once the data is processed, it sometimes needs to be stored in a database. Big data technologies offer plenty of alternatives on this point. The most common alternative is using the Hadoop Distributed File System for storage, which provides users a limited version of SQL known as the Hive Query Language. This allows most analytics tasks to be done in a similar way as in traditional BI data warehouses, from the user's perspective. Other storage options to consider are MongoDB, Redis, and Spark.
This stage of the cycle is related to the human resources knowledge in terms of
their abilities to implement different architectures. Modified versions of
traditional data warehouses are still being used in large-scale applications. For
example, Teradata and IBM offer SQL databases that can handle terabytes of
data; open-source solutions such as PostgreSQL and MySQL are still being used
for large-scale applications.
Even though the different storage options work differently in the background, most solutions provide an SQL API from the client side. Hence, having a good understanding of SQL is still a key skill for big data analytics. A priori, this stage seems to be the most important topic; in practice, this is not true. It is not even an essential stage: it is possible to implement a big data solution that works with real-time data, in which case we only need to gather data to develop the model and then implement it in real time, so there would be no need to formally store the data at all.
Data Exploration
Once the data has been cleaned and stored in a way that insights can be retrieved from it, the data exploration phase is mandatory. The objective of this stage is to understand the data; this is normally done with statistical techniques and by plotting the data. This is a good stage to evaluate whether the problem definition makes sense or is feasible.
Data Preparation
This stage involves reshaping the cleaned data retrieved previously and using statistical pre-processing for missing-value imputation, outlier detection, normalization, feature extraction, and feature selection.
Modelling
The prior stage should have produced several datasets for training and testing, for example, for a predictive model. This stage involves trying different models with a view to solving the business problem at hand. In practice, it is normally desired that the model gives some insight into the business. Finally, the best model or combination of models is selected by evaluating its performance on a left-out dataset.
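A minimal sketch of this train-and-evaluate loop in R, using a simple linear model on synthetic data purely for illustration −
set.seed(42)
n <- 200
df <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
df$y <- 3 * df$x1 - 2 * df$x2 + rnorm(n)          # synthetic response
train_idx <- sample(seq_len(n), size = 0.8 * n)   # 80/20 train/test split
train <- df[train_idx, ]
test  <- df[-train_idx, ]
model <- lm(y ~ x1 + x2, data = train)            # candidate model fitted on training data
pred  <- predict(model, newdata = test)
rmse  <- sqrt(mean((test$y - pred)^2))            # evaluate on the left-out dataset
rmse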
Implementation
In this stage, the data product developed is implemented in the data pipeline of
the company. This involves setting up a validation scheme while the data
product is working, to track its performance. For example, in the case of
implementing a predictive model, this stage would involve applying the model
to new data and once the response is available, evaluate the model.
Data Sources
All big data solutions start with one or more data sources. The Big Data
Architecture accommodates various data sources and efficiently manages a wide
range of data types. Some common data sources in big data architecture
include transactional databases, logs, machine-generated data, social media
and web data, streaming data, external data sources, cloud-based data, NoSQL databases, data warehouses, file systems, APIs, and web services.
These are only a few instances; in reality, the data environment is broad and
constantly changing, with new sources and technologies developing over time.
The primary challenge in big data architecture is successfully integrating,
processing, and analyzing data from various sources in order to gain relevant
insights and drive decision-making.
Data Storage
Data storage is the system for storing and managing large amounts of data in
big data architecture. Big data includes handling large amounts of structured,
semi-structured, and unstructured data; traditional relational databases often
prove inadequate due to scalability and performance limitations.
Batch Processing
Batch processing handles data with long-running batch jobs that filter, aggregate, and prepare data for analysis; these jobs often involve reading and processing source files and then writing the output to new files. Batch processing is an essential component of big data architecture, allowing for the efficient processing of large amounts of data in scheduled batches. It entails gathering, processing, and analysing data in batches at predetermined intervals rather than in real time.
Stream Processing
Stream processing is a type of data processing that continuously processes data records as they are generated or received in real time. It enables enterprises to
quickly analyze, transform, and respond to data streams, resulting in timely
insights, alerts, and actions. Stream processing is a critical component of big
data architecture, especially for dealing with high-volume data sources such as
sensor data, logs, social media updates, financial transactions, and IoT device
telemetry.
The following figure illustrates how stream processing works within big data architecture −
Analysis and Reporting
Most big data solutions aim to extract insights from the data through analysis and reporting. In order to enable users to analyze data, the architecture may
incorporate a data modeling layer, such as a multidimensional OLAP cube or
tabular data model in Azure Analysis Services. It may also offer self-service
business intelligence by leveraging the modeling and visualization features
found in Microsoft Power BI or Excel. Data scientists or analysts might conduct
interactive data exploration as part of their analysis and reporting processes.
Orchestration
In big data analytics, orchestration refers to the coordination and administration of the different tasks, processes, and resources used to execute data workflows. To ensure that big data analytics workflows run efficiently and reliably, it is necessary to automate the flow of data and processing steps, schedule jobs, manage dependencies, and monitor task performance.
The following figure demonstrates the methodology often followed in Big Data
Analytics −
Big Data Analytics Methodology
Define Objectives
Clearly outline the analysis's goals and objectives. What insights do you seek?
What business difficulties are you attempting to solve? This stage is critical to
steering the entire process.
Data Collection
Gather relevant data from a variety of sources. This includes structured data from databases, semi-structured data from logs or JSON files, and unstructured data from social media, emails, and documents.
Data Pre-processing
This step involves cleaning and pre-processing the data to ensure its quality and
consistency. This includes addressing missing values, deleting duplicates,
resolving inconsistencies, and transforming data into a useful format.
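A hedged R sketch of these cleaning steps on a small hypothetical data frame raw_df −
# Basic cleaning sketch on a small, made-up data frame
raw_df <- data.frame(id = c(1, 2, 2, 3, 4),
                     amount = c(100, NA, NA, 250, 3000),
                     date = c("2024-01-01", "2024-01-02", "2024-01-02", "not a date", "2024-01-04"))
clean_df <- raw_df[!duplicated(raw_df), ]                                        # delete duplicate rows
clean_df$amount[is.na(clean_df$amount)] <- median(clean_df$amount, na.rm = TRUE) # impute missing values
clean_df$date <- as.Date(clean_df$date, format = "%Y-%m-%d")                     # transform to a proper date type; unparseable values become NA
clean_df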
Feature Engineering
Create new features or modify existing ones to improve the performance of
machine learning models. This could include feature scaling, dimensionality
reduction, or constructing composite features.
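For instance, a small R sketch of scaling and building a composite feature from hypothetical income and debt columns −
features <- data.frame(income = c(30000, 55000, 120000),
                       debt   = c(5000, 20000, 10000))
features$debt_to_income <- features$debt / features$income   # composite feature
scaled <- as.data.frame(scale(features))                      # standardise: zero mean, unit variance
scaled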
Model Selection and Training
Choose relevant machine learning algorithms based on the nature of the
problem and the properties of the data. If labeled data is available, train the
models.
Model Evaluation
Measure the trained models' performance using accuracy, precision, recall, F1-
score, and ROC curves. This helps to determine the best-performing model for
deployment.
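A short sketch of computing these metrics from a confusion matrix in base R, using hypothetical predicted and actual label vectors −
actual    <- factor(c(1, 0, 1, 1, 0, 1, 0, 0, 1, 0), levels = c(0, 1))
predicted <- factor(c(1, 0, 0, 1, 0, 1, 1, 0, 1, 0), levels = c(0, 1))
cm <- table(Predicted = predicted, Actual = actual)   # confusion matrix
tp <- cm["1", "1"]; fp <- cm["1", "0"]
fn <- cm["0", "1"]; tn <- cm["0", "0"]
accuracy  <- (tp + tn) / sum(cm)
precision <- tp / (tp + fp)
recall    <- tp / (tp + fn)
f1        <- 2 * precision * recall / (precision + recall)
round(c(accuracy = accuracy, precision = precision, recall = recall, f1 = f1), 3)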
Deployment
In a production environment, deploy the model for real-world use. This could
include integrating the model with existing systems, creating APIs for model
inference, and establishing monitoring tools.
Iterate
Big Data analytics is an iterative process. Analyze the data, collect comments,
and update the models or procedures as needed to increase accuracy and
effectiveness over time.
One of the most important tasks in big data analytics is statistical modeling,
meaning supervised and unsupervised classification or regression problems.
After cleaning and pre-processing the data for modeling, carefully assess
various models with appropriate loss metrics. After implementing the model,
conduct additional evaluations and report the outcomes. A common pitfall in
predictive modeling is to just implement the model and never measure its
performance.
The objective is to develop a system that can recommend options based on user
behaviour. For example on Netflix, based on users' ratings for a particular
movie/web series/show, related movies, web series, and shows are
recommended.
Dashboard
Big data analytics identifies trends, patterns, and correlations in data that can
be used to make more informed decisions. These insights could be about
customer behaviour, market trends, or operational inefficiencies.
Ad-Hoc Analysis
Ad-hoc analysis in big data analytics is a process of analysing data on the fly or
spontaneously to answer specific, immediate queries or resolve ad-hoc
inquiries. Unlike traditional analysis, which relies on predefined queries or
structured reporting, ad hoc analysis allows users to explore data interactively,
without the requirement for predefined queries or reports.
Predictive Analytics
Big data analytics can forecast future trends, behaviours, and occurrences by
analysing previous data. Predictive analytics helps organisations to anticipate
customer needs, estimate demand, optimise resources, and manage risks.
Data Visualization
Big data analytics entails presenting complex data in visual forms like charts,
graphs, and dashboards. Data visualisation allows stakeholders to better grasp
and analyse the data insights graphically.
Big data analytics can detect abnormalities and patterns that indicate fraudulent
activity or possible threats. This is especially crucial in businesses like finance,
insurance, and cybersecurity, where early discovery can save large losses.
Big data analytics can deliver insights in real or near real-time, enabling
businesses to make decisions based on data. This competence is critical in
dynamic contexts where quick decisions are required to capitalise on
opportunities or manage risks.
Big data analytics solutions are built to manage large amounts of data from
different sources and formats. They provide scalability to support increasing
data quantities, as well as flexibility to react to changing business requirements
and data sources.
Competitive Advantage
Leveraging big data analytics efficiently can give firms a competitive advantage
by allowing them to innovate, optimise processes, and better understand their
consumers and market trends.
Big data analytics could help firms in ensuring compliance with relevant
regulations and standards by analysing and monitoring data for legal and ethical
requirements, particularly in the healthcare and finance industries.
Overall, the core deliverables of big data analytics are focused on using data to
drive strategic decision-making, increase operational efficiency, improve
consumer experiences, and gain a competitive advantage in the marketplace.
Big Data Adoption and Planning
Considerations
Adopting big data comes with its own set of challenges and considerations, but with careful planning, organizations can maximize its benefits. Big Data initiatives should be strategic and business-driven, and the adoption of big data can facilitate this. The use of Big Data can be transformative, but it is usually innovative; transformation activities are often low-risk and aim to improve efficiency and effectiveness.
The nature of Big Data and its analytic power brings issues and challenges that need to be planned for from the beginning. For example, the adoption of new technology raises security concerns, and conformance to existing corporate standards needs to be addressed. Issues related to tracking the provenance of a dataset from its procurement to its utilization are often new requirements for organizations. It is also necessary to plan for managing the privacy of constituents whose data is being processed or whose identity is revealed by analytical processes.
The following image depicts big data adoption and planning considerations −
Big Data Adoption and Planning Considerations
The primary big data adoption and planning considerations are as follows −
Organization Prerequisites
Big Data frameworks are not turnkey solutions. Enterprises require data
management and Big Data governance frameworks for data analysis and
analytics to be useful. Effective processes are required for implementing,
customising, filling, and utilising Big Data solutions.
Define Objectives
Outline your aims and objectives for implementing big data. Whether it's
increasing the customer experience, optimising processes, or improving
decision-making, defined objectives always give a positive direction to the
decision-makers to frame strategy.
Data Procurement
The acquisition of Big Data solutions can be cost-effective, due to the
availability of open-source platforms and tools, as well as the potential to
leverage commodity hardware. A substantial budget may still be required to
obtain external data. Most commercially relevant data will have to be
purchased, which may necessitate continuing subscription expenses to ensure
the delivery of updates to obtained datasets.
Infrastructure
Evaluate your current infrastructure to see if it can handle big data processing
and analytics. Consider whether you need to invest in new hardware, software,
or cloud-based solutions to manage the volume, velocity, and variety of data.
Data Strategy
Create a comprehensive data strategy that is aligned with your business
objectives. This includes determining what sorts of data are required, where to
obtain them, how to store and manage them, and how to ensure their quality
and security.
Provenance
Provenance refers to information about the data's origins and processing.
Provenance information is used to determine the validity and quality of data and
can also be used for auditing. It can be difficult to maintain provenance as a
large size of data is collected, integrated, and processed using different phases.
Distinct Methodology
A mechanism will be necessary to govern the flow of data into and out of Big
Data systems.
It will need to explore how to construct feedback loops so that processed data
can be revised again.
Continuous Improvement
Big data initiatives are iterative and require ongoing development over time.
Monitor performance indicators, get feedback, and fine-tune your strategy to
ensure that you're getting the most out of your data investments.
By carefully examining and planning for these factors, organisations can
successfully adopt and exploit big data to drive innovation, enhance efficiency,
and gain a competitive advantage in today's data-driven world.
There is no unique solution to the problem of finding sponsors for a project, but the following key points can help −
Check who and where are the sponsors of other projects similar to the one that
interests you.
Having personal contacts in key management positions helps, so any contact can
be triggered if the project is promising.
Who would benefit from your project? Who would be your client once the project
is on track?
Develop a simple, clear, and exciting proposal and share it with the key players
in your organization.
Stakeholders include the project sponsor, the project manager, the business intelligence analyst, the data engineer, the data scientist, the database administrator and the business user. The first (discovery) phase of the project is a good time for the project manager and key stakeholders to sit together and negotiate appropriate funding early on, so the project keeps functioning rather than being put on hold for later discussions.
Several key stakeholders play a critical role in ensuring the success of any Big
Data Analytics project. The following image includes some of the key primary
stakeholders typically involved in Big Data Analytics projects −
Key Stakeholders of Big Data Analytics
Business Executives/Leadership
They set an overall vision and strategy for the organisation, which includes how Big Data Analytics will be aligned with business objectives, and they provide the necessary resources and support for these initiatives.
Data Scientists/Analysts
These are the experts in creating algorithms, models, and analytical tools to
extract insights from large data. They assess data and make actionable
recommendations to guide company decisions.
IT Professionals
The IT team manages the technical infrastructure necessary for data storage, processing, and analysis. They ensure data security, scalability, and integration with current systems.
Data Engineers
These experts design, implement, and maintain the data architecture and
pipelines required to collect, store, and process huge amounts of data. They
ensure that data is accurate, consistent, and easily accessible.
Business Analysts
They serve as a bridge between the various stakeholders in the business world
and the data scientists who work together by converting business requirements
into analytical solutions and vice versa.
Legal Department
Legal experts ensure that data is used by applicable laws and regulations, and
they handle any legal risks related to data collection, processing, and analysis.
The best way to find stakeholders for a project is to understand the problem
and what would be the resulting data product once it has been implemented.
This understanding will give an edge in convincing the management of the
importance of the big data project. Effective collaboration and communication
among these stakeholders are critical for developing successful big data
analytics programmes and realising the full value of data-driven decision-
making.
The image below incorporates the major roles and responsibilities of a data analyst −
Data Collection
It refers to a process of collecting data from different sources like databases,
data warehouses, APIs, and IoT devices. This could include conducting surveys,
tracking visitor behaviour on a company's website, or buying relevant data sets
from data collection specialists.
Data Cleaning and Pre-processing
Raw data may contain duplicates, errors, or outliers. Cleaning raw data eliminates errors, inconsistencies, and duplicates, while pre-processing converts the data into an analytically useful format. Cleaning data entails maintaining data quality, whether in a spreadsheet or with a programming language, to ensure that your interpretations are correct and unbiased.
Model Data
It includes designing and creating database structures, selecting the types of data to be stored and collected, and defining how data categories are related and how the data appears.
Statistical Analysis
Applying statistical techniques to interpret data, validate hypotheses, and make
predictions.
Machine Learning
To predict future trends, classify data or detect anomalies by building predictive
models using machine learning algorithms.
Data Visualization
To communicate data insights effectively to stakeholders, it is necessary to
create visual representations such as charts, graphs and dashboards.
A data analyst often uses the following tools to process assigned work more
accurately and efficiently during data analysis. Some common tools used by
data analysts are mentioned in below image −
As technology rapidly advances, so do the types and amounts of data that can be collected, and classifying and analysing data has become an essential skill in almost every business. In the current scenario, every domain has data analysis experts, such as data analysts in the criminal justice, fashion, food, technology, business, environment, and public sectors, among many others. People who perform data analysis are known by a variety of job titles.
Generally, the skills of data analysts are divided into two major groups, i.e., Technical Skills and Behavioural Skills.
Behavioural Skills
Problem-solving − A data analyst must understand the problem that needs to be solved and identify the patterns or trends that the data might reveal. Critical-thinking abilities enable analysts to focus on the right types of data, identify the most illuminating methods of analysis, and detect gaps in their work.
Analytical Thinking − The ability to evaluate complex problems, divide them into
smaller components, and devise logical solutions.
Communication − As a data analyst, communicating ideas is essential. Data
analysts need solid writing and speaking abilities to communicate with colleagues
and stakeholders.
Industry Knowledge − Knowing your industry, such as health care, business, or finance, helps you communicate effectively with colleagues and stakeholders in that domain.
Collaboration − Working well with team members, exchanging expertise, and
contributing to a collaborative environment in which ideas are openly exchanged.
Time Management − Prioritizing work, meeting deadlines, and devoting time to
various areas of data analysis projects.
Resilience − Dealing effectively with setbacks or failures in data analysis initiatives
while remaining determined to find solutions.
Data analysts are essential to today's data-driven world; they play a vital role on many levels. Some of the reasons are as follows −
Strategic Decision-Making − Data analysts lay the framework for strategic decision-making by identifying trends and insights that can inform corporate plans and improve outcomes.
Improving Efficiency − Data analysts assist firms in streamlining processes,
lowering costs, and increasing productivity by discovering operational
inefficiencies.
Enhancing Customer Experiences − Analyzing customer data enables organizations to
better understand customer habits and preferences, resulting in better products
and services.
Risk Management − Data analysis assists firms in identifying potential risks and
obstacles, allowing them to develop mitigation solutions.
Business Intelligence − Turning raw data into relevant information and visualizations helps stakeholders to understand complex data. They produce
dashboards, reports, and presentations for data-driven decision-making across a
business.
Predictive Analytics − Based on historical data, data analysts predict future patterns
and outcomes using statistical modelling and machine learning. This helps firms
anticipate customer wants, optimize resource allocation, and establish proactive
initiatives.
Continuous Improvement − Data analysts assess and monitor data analysis
processes and methods to improve accuracy, efficiency, and relevance. They keep
up with new technology and best practices to better data analysis.
Big Data Analytics - Data Scientist
The role of a data scientist is normally associated with tasks such as predictive
modeling, developing segmentation algorithms, recommender systems, A/B
testing frameworks and often working with raw unstructured data.
In big data analytics, people normally confuse the role of a data scientist with
that of a data architect. In reality, the difference is quite simple. A data
architect defines the tools and the architecture in which the data will be stored, whereas a data scientist uses this architecture. Of course, a data scientist
should be able to set up new tools if needed for ad-hoc projects, but the
infrastructure definition and design should not be a part of his task.
Project Description
Using the framework defined above, it is simple to define the problem. We can define X = {x1, x2, ..., xn} as the CVs of users, where each feature can be, in the simplest way possible, the number of times a given word appears. The response is real-valued: we are trying to predict the hourly salary of individuals in dollars.
These two considerations are enough to conclude that the problem presented
can be solved with a supervised regression algorithm.
Problem Definition
Problem Definition is probably one of the most complex and heavily neglected stages in the big data analytics pipeline. In order to define the problem a data product would solve, experience is mandatory. Most data scientist aspirants have little or no experience in this stage. Most big data problems can be categorised in one of the following ways −
Supervised classification
Supervised regression
Unsupervised learning
Learning to rank
Supervised Classification
Given a matrix of features X = {x1, x2, ..., xn}, we develop a model M to predict different classes defined as y = {c1, c2, ..., cn}. For example: given transactional data of customers in an insurance company, it is possible to develop a model that predicts whether a client will churn or not. This is a binary classification problem, where there are two classes or target variables: churn and not churn.
Other problems involve predicting more than one class; for example, in digit recognition the response vector would be defined as y = {0, 1, 2, 3, 4, 5, 6, 7, 8, 9}, a state-of-the-art model would be a convolutional neural network, and the matrix of features would be the pixels of the image.
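A minimal sketch of such a binary churn classifier in R, using logistic regression on synthetic data purely for illustration (real transactional features would replace x1 and x2) −
set.seed(1)
n <- 500
x1 <- rnorm(n); x2 <- rnorm(n)
churn <- rbinom(n, 1, plogis(1.5 * x1 - x2))           # synthetic churn labels
fit <- glm(churn ~ x1 + x2, family = binomial)         # logistic regression model M
prob <- predict(fit, type = "response")                # predicted probability of churn
pred_class <- ifelse(prob > 0.5, "churn", "not churn")
table(pred_class, churn)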
Supervised Regression
In this case, the problem definition is rather similar to the previous example;
the difference relies on the response. In a regression problem, the response y ∈
ℜ, this means the response is real valued. For example, we can develop a model
to predict the hourly salary of individuals given the corpus of their CV.
Unsupervised Learning
Management is often thirsty for new insights. Segmentation models can provide
this insight in order for the marketing department to develop products for
different segments. A good approach for developing a segmentation model,
rather than thinking of algorithms, is to select features that are relevant to the
segmentation that is desired.
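A hedged sketch of a simple segmentation with k-means in R, using the built-in iris measurements as stand-in customer features −
features <- scale(iris[, 1:4])               # numeric features only, standardised
set.seed(7)
segments <- kmeans(features, centers = 3)    # ask for three segments
table(segments$cluster)                      # segment sizes
aggregate(iris[, 1:4], by = list(segment = segments$cluster), FUN = mean)  # profile each segment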
Learning to Rank
This problem can be considered a regression problem, but it has particular characteristics and deserves separate treatment. Given a collection of documents, the problem involves finding the most relevant ordering for a given query. In order to develop a supervised learning algorithm, it is necessary to label how relevant an ordering is, given a query.
For example, let's assume we would like to build a system that recommends restaurants. The first step would be to gather data, in this case reviews of restaurants from different websites, and store them in a database. As we are interested in raw text and will use it for analytics, it is not that relevant where the data for developing the model is stored. This may sound contradictory with the main big data technologies, but in order to implement a big data application, we simply need it to work in real time.
Once the problem is defined, the following stage is to collect the data. The following mini-project idea is to collect data from the web and structure it so it can be used in a machine learning model. We will collect some tweets from the Twitter REST API using the R programming language.
First of all, create a Twitter account, and then follow the instructions in the twitteR package vignette to create a Twitter developer account, which provides the API keys and access tokens used for authentication.
We are interested in getting data where the string "big mac" is included and finding out which topics stand out about it. In order to do this, the first step is collecting the data from Twitter with an R script; the code is also available in the bda/part1/collect_data/collect_data_twitter.R file.
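The original collection script is not reproduced in this extract; a minimal sketch with the twitteR package (the keys and tokens below are placeholders you replace with the ones from your developer account) might look like this −
library(twitteR)
# Placeholders: use the keys and tokens from your own developer account
setup_twitter_oauth(consumer_key = "YOUR_CONSUMER_KEY",
                    consumer_secret = "YOUR_CONSUMER_SECRET",
                    access_token = "YOUR_ACCESS_TOKEN",
                    access_secret = "YOUR_ACCESS_SECRET")
tweets_raw <- searchTwitter("big mac", n = 500, lang = "en")  # tweets containing "big mac"
tweets_df  <- twListToDF(tweets_raw)                          # convert the list of tweets to a data frame
tweets     <- tweets_df$text                                  # keep only the text for later cleaning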
Maybe the data sources are completely different, and the information loss would be large if the sources were homogenised. In this case, we can think of alternatives. Can one data source help me build a regression model and the other one a classification model? Is it possible to work with the heterogeneity to our advantage rather than just lose information? Making these decisions is what makes analytics interesting and challenging.
In the case of reviews, it is possible to have a different language for each data source. Again, we have two choices − homogenise the sources (for example, by translating everything into a single language) or work with each language separately.
In the present case, we first need to clean the unstructured data and then convert it to a data matrix in order to apply topic modelling to it. In general, when getting data from Twitter, there are several characters we are not interested in, at least in the first stage of the data-cleansing process.
For example, after getting the tweets we get these strange characters:
"<ed><U+00A0><U+00BD><ed><U+00B8><U+008B>". These are probably
emoticons, so in order to clean the data, we will just remove them using the
following script. This code is also available in
bda/part1/collect_data/cleaning_data.R file.
[1] "Im not a big fan of turkey but baked Mac &
cheese <ed><U+00A0><U+00BD><ed><U+00B8><U+008B>"
[2] "@Jayoh30 Like no special sauce on a big mac. HOW"
### We are interested in the text - Let's clean it!
# Cleaned tweets
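# The cleaning script is not reproduced in this extract; the following is a minimal
# sketch, assuming the raw tweet text is in a character vector called tweets
clean_tweets <- iconv(tweets, from = "UTF-8", to = "ASCII", sub = "")   # drop emoticons and other non-ASCII bytes
clean_tweets <- gsub("http\\S+", " ", clean_tweets)                     # remove links
clean_tweets <- gsub("[^[:alnum:][:space:]]", " ", clean_tweets)        # keep only letters, digits and spaces
clean_tweets <- gsub("\\s+", " ", clean_tweets)                         # collapse repeated whitespace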
head(clean_tweets)
[1] " WeNeedFeminlsm MAC s new make up line features men woc and big girls "
[1] " TravelsPhoto What Happens To Your Body One Hour After A Big Mac "
The final step of the data cleansing mini-project is to have clean text that we can convert to a matrix and apply an algorithm to. From the text stored in the clean_tweets vector, we can easily build a bag-of-words matrix and apply an unsupervised learning algorithm.
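A hedged sketch of that last step with the tm and topicmodels packages (assuming both are installed) −
library(tm)
library(topicmodels)
corpus <- VCorpus(VectorSource(clean_tweets))            # one document per tweet
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removeWords, stopwords("en"))
dtm <- DocumentTermMatrix(corpus)                        # bag-of-words matrix
dtm <- dtm[rowSums(as.matrix(dtm)) > 0, ]                # drop empty documents before modelling
lda_fit <- LDA(dtm, k = 4, control = list(seed = 123))   # unsupervised topic model with 4 topics
terms(lda_fit, 5)                                        # top 5 terms per topic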
The team in charge of this task has the responsibility of spreading the
information produced in the big data analytics department to different areas of
the organization.
The ggplot2 package is great for data visualization. The data.table package is a
great option to do fast and memory efficient summarization in R. A recent
benchmark shows it is even faster than pandas, the python library used for
similar tasks.
Take a look at the data using the following code. This code is also available
in bda/part1/summarize_data/summarize_data.Rproj file.
library(nycflights13)
library(ggplot2)
library(data.table)
library(reshape2)
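The summarisation code itself does not appear in this extract; a minimal sketch with data.table, computing mean delays per carrier from the flights data, could be −
# Convert the flights data to a data.table and summarise it
DT <- as.data.table(flights)
summary_dt <- DT[, .(mean_dep_delay = mean(dep_delay, na.rm = TRUE),
                     mean_arr_delay = mean(arr_delay, na.rm = TRUE),
                     n_flights = .N),
                 by = carrier]
summary_dt[order(-mean_dep_delay)]   # carriers with the largest average departure delay first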
Nowadays, exploratory data analysis is a must and has been included in the big data analytics life cycle. The ability to find insight and communicate it effectively in an organization is fuelled by strong EDA capabilities.
Based on Tukey's ideas, Bell Labs developed the S programming language in order to provide an interactive interface for doing statistics. The idea of S was to provide extensive graphical capabilities with an easy-to-use language. In today's world, in the context of Big Data, R, which is based on the S programming language, is the most popular software for analytics.
library(nycflights13)
library(ggplot2)
library(data.table)
library(reshape2)
# Take a look at the help for ?geom_point and geom_line to find similar examples
# Here we take the carrier code as the x axis
# the value from the dt data.table goes in the y axis
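The plotting code is not reproduced in this extract; a hedged sketch consistent with the comments above, building dt with data.table and melting it with reshape2 so that each measure becomes its own line, might be −
DT <- as.data.table(flights)
dt <- DT[, .(dep_delay = mean(dep_delay, na.rm = TRUE),
             arr_delay = mean(arr_delay, na.rm = TRUE)),
         by = carrier]
dt_melted <- melt(dt, id.vars = "carrier")     # long format: carrier, variable, value
p <- ggplot(dt_melted, aes(x = carrier, y = value, colour = variable, group = variable)) +
  geom_point() +    # one point per carrier and measure
  geom_line() +     # connect the points of each measure
  theme_bw()
p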
To start analyzing the flights data, we can start by checking if there are
correlations between numeric variables. This code is also available
in bda/part1/data_visualization/data_visualization.R file.
library(corrplot)
# Correlations between the numeric flight variables (rows with missing values are dropped)
cor_mat <- cor(flights[, c("dep_delay", "arr_delay", "air_time", "distance")],
               use = "complete.obs")
# save it to disk
png('corrplot.png')
print(corrplot.mixed(cor_mat, lower = "circle", upper = "ellipse"))
dev.off()
We can't say that because two variables are correlated, one has an effect on the other. We also find in the plot a strong correlation between air time and distance, which is reasonable to expect: with more distance, the flight time should grow.
We can also do univariate analysis of the data. A simple and effective way to visualize distributions is the box-plot. The following code demonstrates how to produce box-plots and trellis charts using the ggplot2 library. This code is also available in the bda/part1/data_visualization/boxplots.R file.
source('data_visualization.R')
### Analyzing Distributions using box-plots
# The following shows the distance as a function of the carrier
p <- ggplot(DT, aes(x = carrier, y = distance, fill = carrier)) +  # carrier on the x axis, distance on the y axis
  geom_boxplot() +        # Use the box-plot geom
  theme_bw() +            # White background - more in line with Tufte's principles than the default
  guides(fill = FALSE) +  # Remove the legend
  labs(title = 'Distance as a function of carrier',  # Add labels
       x = 'Carrier', y = 'Distance')
p
# Save to disk
png('boxplot_carrier.png')
print(p)
dev.off()
# Trellis chart: the same box-plot split into one panel per month
# (assumes DT has a month column, as the nycflights13 flights data does)
p_month <- p + facet_wrap(~ month)
# Save to disk
png('boxplot_carrier_by_month.png')
print(p_month)
dev.off()