
Big Data Analytics.

UNIT-III
Syllabus: Understanding MapReduce Fundamentals and HBase: MapReduce Framework, Techniques to Optimize MapReduce Jobs, Use of MapReduce, Role of HBase in Big Data Processing. Exploring Hive: Introducing Hive, Getting Started with Hive, Hive Services, Data Types in Hive, Built-in Functions in Hive, Hive DDL, Hive DML.

MapReduce Architecture

MapReduce and HDFS are the two major components of Hadoop that make it so powerful and efficient to use. MapReduce is a programming model used for efficient parallel processing of large data sets in a distributed manner. The data is first split and then combined to produce the final result. MapReduce libraries have been written in many programming languages, with various optimizations. In Hadoop, the purpose of MapReduce is to map each job and then reduce it to equivalent tasks, which lowers the overhead on the cluster network and the processing power required. The MapReduce task is mainly divided into two phases: the Map phase and the Reduce phase.
MapReduce Architecture:

Components of MapReduce Architecture:

1. Client: The MapReduce client is the one who brings the job to MapReduce for processing. There can be multiple clients that continuously send jobs for processing to the Hadoop MapReduce Master.
2. Job: The MapReduce job is the actual work that the client wants to do, made up of many smaller tasks that the client wants to process or execute.
3. Hadoop MapReduce Master: It divides the job into subsequent job-parts.
4. Job-Parts: The tasks or sub-jobs obtained after dividing the main job. The results of all the job-parts are combined to produce the final output.
5. Input Data: The data set that is fed to MapReduce for processing.
6. Output Data: The final result obtained after processing.
In MapReduce, we have a client. The client submits a job to the Hadoop MapReduce Master. The MapReduce Master divides this job into further equivalent job-parts. These job-parts are then made available to the Map and Reduce tasks. The Map and Reduce tasks contain the program written for the particular use case the company is solving; the developer writes the logic that fulfills the requirement. The input data is fed to the Map task, and the Map generates intermediate key-value pairs as its output. The output of the Map, i.e. these key-value pairs, is then fed to the Reducer, and the final output is stored on HDFS. Any number of Map and Reduce tasks can be made available for processing the data as required. The Map and Reduce algorithms are written in an optimized way so that time and space complexity are kept to a minimum.
Let’s discuss the MapReduce phases to get a better understanding of its architecture:
The MapReduce task is mainly divided into 2 phases i.e. Map phase and Reduce phase.
1. Map: As the name suggests, its main use is to map the input data into key-value pairs. The input to the Map may itself be a key-value pair, where the key can be an identifier such as an address and the value is the actual data it holds. The Map() function is executed on each of these input key-value pairs and generates intermediate key-value pairs, which serve as the input for the Reducer or Reduce() function.

2. Reduce: The intermediate key-value pairs that serve as input for the Reducer are shuffled, sorted, and sent to the Reduce() function. The Reducer aggregates or groups the data based on its key-value pairs according to the reducer algorithm written by the developer, as illustrated in the sketch below.
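To make the two phases concrete, here is a minimal word-count sketch written against the standard Hadoop Java MapReduce API. The class names and the word-count use case are illustrative, not taken from the text above; the driver that submits such a job is sketched later in this unit.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map phase: emit an intermediate (word, 1) pair for every word in the input line.
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        StringTokenizer tokens = new StringTokenizer(value.toString());
        while (tokens.hasMoreTokens()) {
            word.set(tokens.nextToken());
            context.write(word, ONE);               // intermediate key-value pair
        }
    }
}

// Reduce phase: the framework has already shuffled and sorted the pairs by key,
// so the reducer only has to aggregate the counts for each word.
class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) {
            sum += v.get();
        }
        context.write(key, new IntWritable(sum));   // final output written to HDFS
    }
}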
How Job tracker and the task tracker deal with MapReduce:
1. Job Tracker: The Job Tracker manages all the resources and all the jobs across the cluster. It also tries to schedule each map task on a Task Tracker running on the same data node as the data it processes, since there can be hundreds of data nodes available in the cluster.

2. Task Tracker: The Task Trackers can be considered the slave processes that work on the instructions given by the Job Tracker. A Task Tracker is deployed on each node in the cluster and executes the Map and Reduce tasks as instructed by the Job Tracker.
There is also one more important component of the MapReduce architecture known as the Job History Server. The Job History Server is a daemon process that saves and stores historical information about tasks and applications; for example, the logs generated during or after job execution are stored on the Job History Server.


How a Job Runs on MapReduce

A MapReduce job can be run with a single method call: submit() on a Job object (you can also call waitForCompletion(), which submits the job if it hasn't already been submitted and then waits for it to finish).
Let’s understand the components –
1. Client: Submits the MapReduce job.
2. YARN node manager: Launches and monitors the compute containers on machines in the cluster.
3. YARN resource manager: Coordinates the allocation of compute resources on the cluster.
4. MapReduce application master: Coordinates the tasks running the MapReduce job.
5. Distributed filesystem: Shares job files with the other entities.

How to Submit a Job?

The submit() method creates an internal JobSubmitter instance and calls submitJobInternal() on it. Having submitted the job, waitForCompletion() polls the job's progress once per second. If the progress has changed since the last report, it reports it to the console. The job counters are displayed when the job completes successfully; otherwise, the error that caused the job to fail is logged to the console.
Steps carried out by JobSubmitter when submitting the job:
• It asks the resource manager for a new application ID, which is used as the MapReduce job ID.
• The output specification of the job is checked. For example, if the output directory already exists or has not been specified, the job is not submitted and an error is thrown to the MapReduce program.
• The input splits for the job are computed. If the splits cannot be computed, the job is not submitted and an error is thrown to the MapReduce program.
• The resources needed to run the job – the job JAR file, the configuration file, and the computed input splits – are copied to the shared filesystem in a directory named after the job ID.
• The job JAR is copied with a high replication factor, controlled by the mapreduce.client.submit.file.replication property, so that there are plenty of copies across the cluster for the node managers to access.
• Finally, it submits the job to the resource manager by calling submitApplication(). A minimal driver that triggers this submission path is sketched below.
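The following is a minimal, illustrative driver for the word-count job sketched earlier, using the standard Hadoop Java API. The input and output paths are taken from the command line and are assumptions of this example, not values from the text above.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");

        job.setJarByClass(WordCountDriver.class);       // the job JAR is copied to the shared filesystem
        job.setMapperClass(WordCountMapper.class);
        job.setReducerClass(WordCountReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));    // input splits are computed from this path
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // must not already exist

        // waitForCompletion() submits the job (through JobSubmitter) and then
        // polls its progress once per second, printing the counters on success.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}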
Use of MapReduce
Here are the top five uses of MapReduce:

• a) Social Media Analytics: MapReduce is used to analyse social media data to find
trends and patterns. This analysis, facilitated by MapReduce, empowers organisations
to make data-driven decisions and tailor their strategies to better engage with their target
audience.
• b) Fraud Detection Systems: MapReduce is used to detect fraudulent activities in
financial transactions. By leveraging this technology, organisations can enhance their
fraud detection capabilities, mitigate risks, and safeguard the integrity of economic
systems.
• c) Entertainment Industry: MapReduce is used to analyse user preferences and
viewing history to recommend movies and TV shows. By analysing this information,
the industry can deliver personalised recommendations for movies and TV shows,
enhancing user experience and satisfaction.
• d) E-commerce Optimisation: MapReduce evaluates consumer buying patterns based
on customers’ interests or historical purchasing patterns. This personalised approach
enhances the overall shopping experience for consumers while improving the efficiency
of e-commerce operations.
• e) Data Warehousing: MapReduce is used to process large volumes of data in data
warehousing applications. In this way, organisations can derive actionable insights
from their data, supporting informed decision-making processes across various
business functions.

Apache HBase
Prerequisite: Introduction to Hadoop. HBase is a data model similar to Google's Bigtable. It is an open-source, distributed database developed by the Apache Software Foundation and written in Java. HBase is an essential part of the Hadoop ecosystem and runs on top of HDFS (Hadoop Distributed File System). It can store massive amounts of data, from terabytes to petabytes. It is column-oriented and horizontally scalable.
Figure – History of HBase

Applications of Apache HBase:

Real-time analytics: HBase is an excellent choice for real-time analytics applications that
require low-latency data access. It provides fast read and write performance and can handle
large amounts of data, making it suitable for real-time data analysis.
Social media applications: HBase is an ideal database for social media applications that
require high scalability and performance. It can handle the large volume of data generated by
social media platforms and provide real-time analytics capabilities.
IoT applications: HBase can be used for Internet of Things (IoT) applications that require
storing and processing large volumes of sensor data. HBase’s scalable architecture and fast
write performance make it a suitable choice for IoT applications that require low-latency data
processing.
Online transaction processing: HBase can be used as an online transaction processing
(OLTP) database, providing high availability, consistency, and low-latency data access.
HBase’s distributed architecture and automatic failover capabilities make it a good fit for OLTP
applications that require high availability.
Ad serving and clickstream analysis: HBase can be used to store and process large volumes
of clickstream data for ad serving and clickstream analysis. HBase’s column-oriented data
storage and indexing capabilities make it a good fit for these types of applications.
Features of HBase –
1. It is linearly and modularly scalable across various nodes, as the data is divided across those nodes.

2. HBase provides consistent reads and writes.

3. It provides atomic reads and writes, meaning that during one read or write process, all other processes are prevented from performing any read or write operations on that data.

4. It provides an easy-to-use Java API for client access (see the sketch below).

5. It supports Thrift and REST APIs for non-Java front ends, with XML, Protobuf, and binary data encoding options.

6. It supports a Block Cache and Bloom Filters for real-time queries and for high-volume query optimization.

7. HBase provides automatic failover support between Region Servers.

8. It supports exporting metrics to files via the Hadoop metrics subsystem.

9. It doesn't enforce relationships within your data.

10. It is a platform for storing and retrieving data with random access.
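As a brief illustration of the Java client API mentioned in feature 4, the sketch below writes and reads a single cell. It assumes an HBase cluster is reachable through an hbase-site.xml on the classpath and that a table named employee with a column family personal has already been created (for example, via the HBase shell); those names are hypothetical.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseClientSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();   // reads hbase-site.xml from the classpath
        try (Connection connection = ConnectionFactory.createConnection(conf);
             Table table = connection.getTable(TableName.valueOf("employee"))) {

            // Write: every cell is addressed by (row key, column family, column qualifier).
            Put put = new Put(Bytes.toBytes("row1"));
            put.addColumn(Bytes.toBytes("personal"), Bytes.toBytes("name"), Bytes.toBytes("Asha"));
            table.put(put);

            // Read: random access to the row by its row key.
            Result result = table.get(new Get(Bytes.toBytes("row1")));
            byte[] name = result.getValue(Bytes.toBytes("personal"), Bytes.toBytes("name"));
            System.out.println("name = " + Bytes.toString(name));
        }
    }
}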

The Facebook Messenger platform originally used Apache Cassandra but shifted to HBase in November 2010. Facebook was building a scalable and robust infrastructure to combine services such as messages, email, chat, and SMS into real-time conversations, and HBase was well suited for that.

RDBMS Vs HBase –

1. RDBMS is mostly Row Oriented whereas HBase is Column Oriented.

2. RDBMS has a fixed schema, whereas in HBase we can also add columns at run time.

3. RDBMS is good for structured data whereas HBase is good for semi-structured data.

4. RDBMS is optimized for joins but HBase is not optimized for joins.
Apache HBase is a NoSQL, column-oriented database that is built on top of the Hadoop
ecosystem. It is designed to provide low-latency, high-throughput access to large-scale,
distributed datasets. Here are some of the advantages and disadvantages of using HBase:
Advantages Of Apache HBase:
1. Scalability: HBase can handle extremely large datasets that can be distributed across a
cluster of machines. It is designed to scale horizontally by adding more nodes to the cluster,
which allows it to handle increasingly larger amounts of data.
2. High-performance: HBase is optimized for low-latency, high-throughput access to data.
It uses a distributed architecture that allows it to process large amounts of data in parallel,
which can result in faster query response times.
3. Flexible data model: HBase’s column-oriented data model allows for flexible schema
design and supports sparse datasets. This can make it easier to work with data that has a
variable or evolving schema.
4. Fault tolerance: HBase is designed to be fault-tolerant by replicating data across multiple
nodes in the cluster. This helps ensure that data is not lost in the event of a hardware or
network failure.
Disadvantages Of Apache HBase:
1. Complexity: HBase can be complex to set up and manage. It requires knowledge of the
Hadoop ecosystem and distributed systems concepts, which can be a steep learning curve
for some users.
2. Limited query language: HBase’s query language, HBase Shell, is not as feature-rich as
SQL. This can make it difficult to perform complex queries and analyses.
3. No support for transactions: HBase does not support transactions, which can make it
difficult to maintain data consistency in some use cases.
4. Not suitable for all use cases: HBase is best suited for use cases where high-throughput and low-latency access to large datasets is required. It may not be the best choice for applications that require real-time processing or strong consistency guarantees.

Apache Hive – Getting Started With HQL Database Creation And Drop Database

Pre-requisite: Hive 3.1.2 Installation, Hadoop 3.1.2 Installation


HiveQL or HQL is the Hive query language that we use to process or query structured data in Hive. HQL syntax is very similar to MySQL but has some significant differences. We will use the hive command, which is a bash shell script, to complete our Hive demo using the CLI (Command Line Interface). We can start the Hive shell by simply typing hive in the terminal. Make sure that the /bin directory of your Hive installation is mentioned in the .bashrc file. The .bashrc file executes automatically when the user logs in to the system, and all the necessary commands mentioned in this script file are run. We can check whether the /bin directory has been added by opening the file with the command shown below.
sudo gedit ~/.bashrc
If the path is not added, add it so that we can run the Hive shell directly from the terminal without moving to the Hive directory. Otherwise, we can start Hive manually by moving to the apache-hive-3.1.2/bin/ directory and running the hive command.
Before using Hive, make sure that all of your Hadoop daemons are started and working. We can start all the Hadoop daemons with the commands below.
start-dfs.sh # this will start namenode, datanode and secondary namenode

start-yarn.sh # this will start node manager and resource manager

jps # To check running daemons


Databases In Apache Hive

A database is a storage schema that contains multiple tables. Hive databases define a namespace for tables. If you don't specify a database name, Hive uses its default database for table creation and other purposes. Creating databases allows multiple users to create tables with the same name in different schemas so that their names don't clash.
So, let’s start our hive shell for performing our tasks with the below command.
hive
See the already existing databases using the below command.
show databases; # this will show the existing databases

Create Database Syntax:


We can create a database with the command below; if the database already exists, Hive will throw an error.
CREATE DATABASE|SCHEMA <database name>   # we can use either DATABASE or SCHEMA to create a DB
Example:
CREATE DATABASE Test; # create database with name Test
show databases; # this will show the existing databases

If we try to create the Test database again, Hive will throw an error/warning that a database named Test already exists. In general, we don't want an error if the database already exists, so we use the CREATE DATABASE command with the [IF NOT EXISTS] clause. This will not throw any error.
CREATE DATABASE|SCHEMA [IF NOT EXISTS] <database name>
Example:
CREATE SCHEMA IF NOT EXISTS Test1;

SHOW DATABASES;

Syntax To Drop Existing Databases:


DROP DATABASE <db_name>;  or  DROP DATABASE IF EXISTS <db_name>;   # the IF EXISTS clause is again used to suppress the error
Example:
DROP DATABASE IF EXISTS Test;
DROP DATABASE Test1;

Now quit hive shell with quit command.


quit;
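The same DDL statements can also be run programmatically through HiveServer2 using the Hive JDBC driver. The sketch below is a minimal example under stated assumptions: HiveServer2 is running on localhost:10000 (the default port) with no authentication, and the hive-jdbc driver is on the classpath – adjust these for your setup.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveDdlSketch {
    public static void main(String[] args) throws Exception {
        // Register the Hive JDBC driver (often optional with modern JDBC, but explicit here).
        Class.forName("org.apache.hive.jdbc.HiveDriver");

        String url = "jdbc:hive2://localhost:10000/default";   // assumed HiveServer2 endpoint
        try (Connection con = DriverManager.getConnection(url, "", "");
             Statement stmt = con.createStatement()) {

            stmt.execute("CREATE DATABASE IF NOT EXISTS Test1");

            // List the databases, just like SHOW DATABASES in the Hive shell.
            try (ResultSet rs = stmt.executeQuery("SHOW DATABASES")) {
                while (rs.next()) {
                    System.out.println(rs.getString(1));
                }
            }

            stmt.execute("DROP DATABASE IF EXISTS Test1");
        }
    }
}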

Hive Services
The following are the services provided by Hive:-

o Hive CLI - The Hive CLI (Command Line Interface) is a shell where we can execute
Hive queries and commands.
o Hive Web User Interface - The Hive Web UI is an alternative to the Hive CLI. It
provides a web-based GUI for executing Hive queries and commands.
o Hive MetaStore - It is a central repository that stores all the structural information of
the various tables and partitions in the warehouse. It also includes metadata about columns
and their types, the serializers and deserializers used to read and write data, and the
corresponding HDFS files where the data is stored.
o Hive Server - It is also referred to as the Apache Thrift Server. It accepts requests from
different clients and forwards them to the Hive Driver.
o Hive Driver - It receives queries from different sources like the web UI, CLI, Thrift, and
JDBC/ODBC drivers. It transfers the queries to the compiler.
o Hive Compiler - The purpose of the compiler is to parse the query and perform semantic
analysis on the different query blocks and expressions. It converts HiveQL statements
into MapReduce jobs.
o Hive Execution Engine - The optimizer generates the logical plan in the form of a DAG of
MapReduce tasks and HDFS tasks. The execution engine then executes the incoming tasks
in the order of their dependencies.
HIVE Data Types

Hive data types are categorized into numeric types, string types, miscellaneous types, and complex types. A list of Hive data types is given below.

Integer Types

Type        Size                    Range
TINYINT     1-byte signed integer   -128 to 127
SMALLINT    2-byte signed integer   -32,768 to 32,767
INT         4-byte signed integer   -2,147,483,648 to 2,147,483,647
BIGINT      8-byte signed integer   -9,223,372,036,854,775,808 to 9,223,372,036,854,775,807
HiveQL - Functions

Hive provides various built-in functions to perform mathematical and aggregate operations. Here, we are going to execute such functions on the records of the table below:
Example of Functions in Hive

Let's create a table and load the data into it by using the following steps: -


o Select the database in which we want to create a table.


hive> use hql;

o Create a hive table using the following command: -


hive> create table employee_data (Id int, Name string, Salary float)
      row format delimited
      fields terminated by ',';

o Now, load the data into the table.


hive> load data local inpath '/home/codegyani/hive/emp_details' into table employee_data;

o Let's fetch the loaded data by using the following command: -


hive> select * from employee_data;
Now, we discuss the mathematical, aggregate, and other built-in functions with corresponding examples.

Mathematical Functions in Hive

The commonly used mathematical functions in Hive are:

Return type   Function                         Description
BIGINT        round(num)                       Returns the BIGINT for the rounded value of DOUBLE num.
BIGINT        floor(num)                       Returns the largest BIGINT that is less than or equal to num.
BIGINT        ceil(num), ceiling(DOUBLE num)   Returns the smallest BIGINT that is greater than or equal to num.
DOUBLE        exp(num)                         Returns the exponential of num.
DOUBLE        ln(num)                          Returns the natural logarithm of num.
DOUBLE        log10(num)                       Returns the base-10 logarithm of num.
DOUBLE        sqrt(num)                        Returns the square root of num.
DOUBLE        abs(num)                         Returns the absolute value of num.
DOUBLE        sin(d)                           Returns the sine of d, in radians.
DOUBLE        asin(d)                          Returns the arcsine of d, in radians.
DOUBLE        cos(d)                           Returns the cosine of d, in radians.
DOUBLE        acos(d)                          Returns the arccosine of d, in radians.
DOUBLE        tan(d)                           Returns the tangent of d, in radians.
DOUBLE        atan(d)                          Returns the arctangent of d, in radians.

Example of Mathematical Functions in Hive

o Let's see an example to fetch the square root of each employee's salary.
hive> select Id, Name, sqrt(Salary) from employee_data;

Aggregate Functions in Hive

In Hive, an aggregate function returns a single value resulting from a computation over many rows. The commonly used aggregate functions are:

Return type   Function            Description
BIGINT        count(*)            Returns the count of the number of rows present in the file.
DOUBLE        sum(col)            Returns the sum of the values.
DOUBLE        sum(DISTINCT col)   Returns the sum of the distinct values.
DOUBLE        avg(col)            Returns the average of the values.
DOUBLE        avg(DISTINCT col)   Returns the average of the distinct values.
DOUBLE        min(col)            Compares the values and returns the minimum one.
DOUBLE        max(col)            Compares the values and returns the maximum one.

Examples of Aggregate Functions in Hive

o Let's see an example to fetch the maximum salary of an employee.
hive> select max(Salary) from employee_data;

o Let's see an example to fetch the minimum salary of an employee.
hive> select min(Salary) from employee_data;
Other built-in Functions in Hive

The following are some other commonly used built-in functions in Hive:

Return type   Function                             Description
INT           length(str)                          Returns the length of the string.
STRING        reverse(str)                         Returns the string in reverse order.
STRING        concat(str1, str2, ...)              Returns the concatenation of two or more strings.
STRING        substr(str, start_index)             Returns the substring of the string, starting from the provided index.
STRING        substr(str, int start, int length)   Returns the substring of the string based on the provided starting index and length.
STRING        upper(str)                           Returns the string in uppercase.
STRING        lower(str)                           Returns the string in lowercase.
STRING        trim(str)                            Returns the string with whitespace removed from both ends.
STRING        ltrim(str)                           Returns the string with whitespace removed from the left-hand side.
STRING        rtrim(str)                           Returns the string with whitespace removed from the right-hand side.

Examples of other built-in Functions in Hive

o Let's see an example to fetch the name of each employee in uppercase.
hive> select Id, upper(Name) from employee_data;

o Let's see an example to fetch the name of each employee in lowercase.
hive> select Id, lower(Name) from employee_data;
Big Data Analytics.
UNIT-II
Syllabus: Introducing Technologies for Handling Big Data: Distributed and Parallel Computing for Big Data, Introducing Hadoop, and Cloud Computing in Big Data. Understanding the Hadoop Ecosystem: Hadoop Ecosystem, Hadoop Distributed File System, MapReduce, Hadoop YARN, Hive, Pig, Sqoop, ZooKeeper, Flume, Oozie.

Difference between Parallel Computing and Distributed Computing
There are mainly two computation types: parallel computing and distributed computing. A computer system performs tasks according to human instructions. A single processor executes only one task at a time, which is not an efficient approach. Parallel computing solves this problem by allowing numerous processors to work on tasks simultaneously, and modern computers support parallel processing to improve system performance. In contrast, distributed computing enables several computers to communicate with one another over a network and work towards a common goal. Distributed computing is commonly used by organizations such as Facebook and Google that allow people to share resources.

In this article, you will learn about the difference between Parallel Computing and Distributed
Computing. But before discussing the differences, you must know about parallel computing and
distributed computing.

What is Parallel Computing?


It is also known as parallel processing. It utilizes several processors, and each processor completes the tasks allocated to it. In other words, parallel computing involves performing numerous tasks simultaneously. Either a shared memory or a distributed memory system can be used for parallel computing. In shared memory systems, all processors share a single memory; in distributed memory systems, each processor has its own memory.

Parallel computing provides numerous advantages. It helps to increase CPU utilization and improves performance because several processors work simultaneously. Moreover, the failure of one CPU has no impact on the functionality of the other CPUs. However, if one processor needs results or instructions from another, it may have to wait, which introduces latency.

Advantages and Disadvantages of Parallel Computing


There are various advantages and disadvantages of parallel computing. Some of the advantages
and disadvantages are as follows:

Advantages

1. It saves time and money because many resources working together cut down on time and costs.
2. It can solve larger problems that are difficult or impossible to handle with serial computing.
3. You can do many things at once using many computing resources.
4. Parallel computing is much better than serial computing for modeling, simulating, and comprehending complicated real-world events.

Disadvantages

1. The multi-core architectures consume a lot of power.


2. Parallel solutions are more difficult to implement, debug, and prove right due to the complexity of
communication and coordination, and they frequently perform worse than their serial equivalents.

What is Distributed Computing?


It comprises several software components that reside on different systems but operate as a single
system. A distributed system's computers can be physically close together and linked by a local
network or geographically distant and linked by a wide area network (WAN). A distributed
system can be made up of any number of different configurations, such as mainframes, PCs,
workstations, and minicomputers. The main aim of distributed computing is to make a network
work as a single computer.

There are various benefits of using distributed computing. It enables scalability and makes it
simpler to share resources. It also aids in the efficiency of computation processes.

Advantages and Disadvantages of Distributed Computing


There are various advantages and disadvantages of distributed computing. Some of the advantages
and disadvantages are as follows:

Advantages

1. It is flexible, making it simple to install, use, and debug new services.


2. In distributed computing, you may add multiple machines as required.
3. If the system crashes on one server, that doesn't affect other servers.
4. A distributed computer system may combine the computational capacity of several computers,
making it faster than traditional systems.

Disadvantages

1. Data security and sharing are the main issues in distributed systems due to the features of open
systems
2. Because of the distribution across multiple servers, troubleshooting and diagnostics are more
challenging.
3. The main disadvantage of distributed computer systems is the lack of software support.

Key differences between the Parallel Computing and Distributed


Computing

Here, you will learn the various key differences between parallel computing and distributed
computation. Some of the key differences between parallel computing and distributed computing
are as follows:
1. Parallel computing is a sort of computation in which various tasks or processes are run at the same
time. In contrast, distributed computing is that type of computing in which the components are
located on various networked systems that interact and coordinate their actions by passing messages
to one another.
2. In parallel computing, processors communicate with one another via a bus. On the other hand, computer systems in distributed computing connect with one another via a network.
3. Parallel computing takes place on a single computer. In contrast, distributed computing takes place
on several computers.
4. Parallel computing aids in improving system performance. On the other hand, distributed
computing allows for scalability, resource sharing, and the efficient completion of computation
tasks.
5. The computer in parallel computing can have shared or distributed memory. In contrast, every
system in distributed computing has its memory.
6. Multiple processors execute multiple tasks simultaneously in parallel computing. In contrast, many
computer systems execute tasks simultaneously in distributed computing.

Head-to-head Comparison between Parallel Computing and Distributed Computing

• Definition: Parallel computing is a type of computation in which various processes run simultaneously. Distributed computing is a type of computing in which the components are located on various networked systems that interact and coordinate their actions by passing messages to one another.
• Communication: In parallel computing, the processors communicate with one another via a bus. In distributed computing, the computer systems connect with one another via a network.
• Functionality: In parallel computing, several processors execute various tasks simultaneously. In distributed computing, several computers execute tasks simultaneously.
• Number of computers: Parallel computing occurs in a single computer system. Distributed computing involves various computers.
• Memory: In parallel computing, the system may have distributed or shared memory. In distributed computing, each computer system has its own memory.
• Usage: Parallel computing helps to improve system performance. Distributed computing allows for scalability, resource sharing, and the efficient completion of computation tasks.

Conclusion
There are two types of computations: parallel computing and distributed computing. Parallel
computing allows several processors to accomplish their tasks at the same time. In contrast,
distributed computing splits a single task among numerous systems to achieve a common goal.

Role of Cloud Computing in Big Data


Analytics
In this day and age where information is everything, organizations are overwhelmed.
This information, often called “big data,” refers to huge, complicated datasets that
ordinary procedures cannot process. Businesses are increasingly turning to cloud
computing in order to unlock the true value of big data and make use of it.
This article examines how cloud platforms can be used to store, manage, and analyze vast amounts of data effectively. It covers the benefits cloud computing brings to big data analytics, the services offered by providers, and considerations for adopting a cloud-based big data strategy.
Table of Content
• The Challenges of Big Data
• Cloud Computing: The Big Data Solution
• Cloud Services for Big Data Analytics
• Benefits Beyond Core Analytics Services
• Choosing the Right Cloud Platform for Big Data Analytics
• Security Considerations for Cloud-Based Big Data Analytics
• Real-World Examples: Unveiling Insights Across Industries
• The Future Of Cloud Computing And Big Data Analytics
• Conclusion
The Challenges of Big Data
Big data poses several problems that impede traditional methods of analyzing data.
These include:
1. Volume: The amount of data being created today is mind-bogglingly large.
Regular storage systems do not have enough space to accommodate all these
massive sets.
2. Variety: Big data comes in different forms such as structured (relational
databases), unstructured (text files, pictures or videos from social media
posts), and semi-structured logs or emails. Traditional tools struggle with this
complexity.
3. Velocity: The speed at which new records are generated keeps rising, which makes real-time analysis difficult when processing is slow.
4. Veracity: Accurate findings require accurate data, since the garbage-in, garbage-out rule applies here too. Cleaning a large dataset with traditional methods can take a very long time.

Cloud Computing: The Big Data Solution
Cloud computing offers an effective solution for dealing with big data. Organizations can store, manage, and analyze their big data efficiently by leveraging the scalability and on-demand resources (such as storage and compute capacity) that the cloud provides. Here's how:
• Scalability: Cloud platforms provide large amounts of storage and processing power on demand, without having to buy hardware infrastructure in advance. If a lot of processing power is required during certain periods, scaling up is easy and quick.
• Cost effectiveness: Organizations pay only for what they use, unlike maintaining on-site infrastructure that may sit idle for much of the year, which results in significant savings.
• Performance: Cloud providers offer high-performance computing resources, such as servers with advanced networking and in-memory capabilities, which enable faster data processing and real-time analytics.
• Accessibility: Cloud-based solutions are accessible from anywhere with an internet connection, so geographical location need not hinder a business from getting value out of its data. This encourages teamwork among members who are far apart and enables analysis to happen around the clock.
• Security: Sensitive data must be well guarded against unauthorized access, modification, or loss, so cloud providers invest heavily in security measures such as encryption, access control, and data residency options for compliance purposes.

Cloud Services for Big Data Analytics


1. Data Ingestion
• Managed data pipelines: These services automate the collection,
transformation and loading of data from different sources into your cloud storage
i.e., Apache Airflow or AWS Glue offered by various service providers.
• Streaming ingestion: Real time ingestion can be achieved using services
like Apache Kafka which allows integration with social media feeds among
others
2. Data Storage
• Object storage: Highly scalable and cost-effective object storage services such as Amazon S3, Azure Blob Storage, and Google Cloud Storage are the best option for storing vast quantities of unstructured and semi-structured data.
• Data lakes: A cloud data lake serves as a centralized repository that stores all of the data in its original format, giving users the opportunity to examine and analyze it at a later time. This saves time because flexible processing can be applied to the data when it is needed.
• Data warehouses: When dealing with large datasets, structured schemas are required for storage and analysis, and this is exactly what a cloud data warehouse provides. It makes querying and reporting easier and faster.
3. Data Processing and Transformation:
• Managed Hadoop and Spark environments: Complex infrastructure setup can
be avoided by using pre-configured managed Hadoop clusters or Spark
clusters provided by various cloud services.
• Serverless information processing: With serverless compute services
like AWS Lambda or Azure Functions, you can run data processing tasks
without managing servers. This simplifies development and scaling.
• Data anonymization and masking: Cloud platforms provide tools and services
to comply with privacy regulations by anonymizing or masking confidential
datasets.
4. Data Analytics and Visualization:
• Business intelligence (BI) tools: Some cloud-based BI applications
like Tableau, Power BI, Looker etc. provide interactive dashboards and reports
for visual big data analysis.
• Managed machine learning (ML) platforms such as Google Cloud AI
Platform, Amazon SageMaker, Azure Machine Learning etc., allow ML
models development, testing, and deployment on massive datasets.
• Predictive analytics and data mining: Cloud platforms are equipped with built-
in facilities both for predictive analytics and data mining that can help you find
patterns or trends in your data to assist you in future forecasting or better decision
making.
Benefits Beyond Core Analytics Services
• Collaboration: Data scientists, analysts, and business users can collaborate easily because all team members access the data through one centralized location and can share insights with each other using the shared storage and communication channels provided by these platforms.
• Disaster Recovery: In case something unexpected happens such as power failure
then rest assured because most cloud providers always ensure that there is
minimum downtime experienced during any disaster recovery process thanks to
their robustness in this area.
• Innovation: Organizations can take advantage of various cutting-edge
technologies that are available through cloud platforms like Artificial
Intelligence (AI) which will help them come up with new data-driven solutions.
By using comprehensive suite of services from different Cloud Providers,
organizations can create an elastic & scalable ecosystem for big-data analytics
that enables maximum value extraction from information assets.
Choosing the Right Cloud Platform for Big Data Analytics
When choosing a cloud platform for big data analytics, there are several factors
that need to be considered:
• Scalability requirements: Evaluate whether the platform can scale resources up
or down as per your fluctuating needs in terms of processing power or storage
space etc.
• Security features: Make sure the chosen provider has strong security measures in place, especially when dealing with sensitive datasets, so that the privacy of the individuals involved is not compromised during analysis.
• Cost considerations: Compare the pricing models offered by various providers against your usage patterns and budget, and then select the most appropriate one.
• Integration capabilities: Check how well the platform integrates with your existing data infrastructure (databases, warehouses, and ETL tools such as Informatica PowerCenter that may already be installed in your environment) to avoid compatibility issues during implementation.
• Vendor lock-in: Choose a platform that supports open standards, which provides the flexibility needed if you later decide to migrate away from the current vendor; such a migration can otherwise require significant investment in both time and money.
Security Considerations for Cloud-Based Big Data Analytics
Security is always paramount when dealing with large volumes of information. Here
are some key security considerations regarding cloud-based big-data analytics:
• Data encryption: Ensure all your stored files/data are encrypted; this helps
safeguard against unauthorized access especially during transmission over
unsecured networks where they might get intercepted easily before reaching
intended recipient(s).
• Access control: Make sure that only authorized personnel are granted access rights, individually or as groups, to particular datasets held within a given storage location (for example, an S3 bucket), so that security is not compromised during analysis.
• Compliance regulations: Confirm that the cloud provider complies fully with the relevant industry standards and data-protection regulations, especially when dealing with information such as healthcare data that must remain confidential throughout its lifecycle.
• Regular security audits: Regularly conduct comprehensive security audits of your cloud environment to identify potential vulnerabilities and address them before they can be exploited by malicious actors, who could otherwise cause reputational or financial damage to the organization.
• Data Copying and Restoration: Keep an all-inclusive plan for data copying and
restoration so that you could retrieve your files if a security breach occurs.
Real-World Examples: Unveiling Insights Across Industries
Cloud-supported massive information analysis is changing the ways of working and
decision-making in many companies. Here are a few interesting instances that
demonstrate such technology’s capabilities:

Examples
1. Retail Industry: The Power of Personalization
Think about a retail environment where product recommendations seem uncannily
accurate and marketing campaigns speak to your soul. This is made possible by
cloud-based big data analytics. Retailers use these tools to process immense
volumes of customer information, such as purchase history, browsing habits
and social media sentiment. They then apply this knowledge to:
• Customize marketing campaigns: Higher conversion rates and increased
customer satisfaction are achieved through targeted email blasts and social media
ads that cater for individual preferences.
• Optimize product recommendations: Recommender systems driven by big
data analytics propose products customers are likely to find interesting thereby
increasing sales and reducing cart abandonment rates.
• Enhance inventory management: Retailers can optimize their inventory levels
by scrutinizing sales trends alongside customer demand patterns which
eliminates stockouts while minimizing clearance sales.
2. Healthcare: From Diagnosis to Personalized Care
The healthcare industry has rapidly adopted cloud-based big data analytics for better
patient care and operational efficiency. Here’s how:
• Improved diagnosis: Healthcare providers can now diagnose patients faster and
more accurately by analyzing medical records together with imaging scans
besides wearable device sensor data.
• Individual treatment plans: Big data analytics makes it possible to create
individualized treatment plans through identification of factors affecting
response to certain drugs or therapies.
• Predictive prevention care: Through cloud based analytics it is possible to
identify people at high risk of particular illnesses before they actually occur thus
leading to better outcomes for patients and lower healthcare expenses.
3. Financial Services: Risk Management & Fraud Detection
Effectively managing risks and making informed decisions are crucial in the ever
changing banking industry. Here’s how financial companies can use big data
analytics in the cloud:
• Identify fraudulent activity: By using advanced algorithms to make sense of
real-time transaction patterns, banks are able to detect and prevent fraudulent
transactions from taking place, thereby protecting both themselves and
customers.
• Evaluate credit riskiness: By checking borrowers’ financial histories against
other types of relevant data points, lenders can make better choices concerning
approvals on loans and interest rates hence reducing credit risk.
• Develop cutting-edge financial products: Banks can use big data analytics to
craft unique financial products for different market segments as they continue
studying their clients’ desires and preferences.
These are only a few instances of the current industry transformations brought about
by cloud-based big data analytics. It is inevitable that as technology advances and
data quantities expand, more inventive applications will surface, enabling businesses
to obtain more profound insights, make fact-based decisions, and accomplish
remarkable outcomes.
The Future Of Cloud Computing And Big Data Analytics
The future of big data analysis is directly related to that of cloud computing. The
significance of cloud platforms will only increase as enterprises grapple with
information overload and seek deeper insights. The following are some tendencies
to watch out for:
• Hybrid and Multi-Cloud Environments: As per their unique needs, companies
will use more and more Hybrid and Multi Cloud approaches to take advantage
of the specific capabilities typical for different providers.
• Serverless Computing: Businesses will increasingly adopt serverless computing because it frees teams from managing the underlying infrastructure so they can concentrate on the analytics itself.
• Integration Of AI & ML: Cloud platforms will seamlessly integrate artificial
intelligence (AI) alongside machine learning (ML) functionalities thus
enabling advanced analytics as well as automated decision making.
• Emphasis on Data Governance and Privacy: To keep pace with shifting rules
on data security and privacy, businesses will need more advanced means of
governing their information, which cloud providers can supply.
Conclusion
Cloud computing has become the bedrock of big data analytics; it is inexpensive,
flexible, secure, and capable of accommodating large quantities of information that
companies can use to make sense of what’s going on around them. As cloud
technology and big data analytics continue to evolve, we can expect even more
powerful tools and services to emerge, enabling organizations to unlock the true
potential of their data and make data-driven decisions that fuel innovation and
success.


Introduction to Hadoop Distributed File System (HDFS)

With growing data velocity the data size easily outgrows the storage limit of a
machine. A solution would be to store the data across a network of machines. Such
filesystems are called distributed filesystems. Since data is stored across a network
all the complications of a network come in.
This is where Hadoop comes in. It provides one of the most reliable filesystems.
HDFS (Hadoop Distributed File System) is a unique design that provides storage
for extremely large files with streaming data access pattern and it runs
on commodity hardware. Let’s elaborate the terms:
• Extremely large files: Here we are talking about data in the range of petabytes (thousands of terabytes).
• Streaming data access pattern: HDFS is designed on the principle of write-once, read-many-times. Once data is written, large portions of the dataset can be processed any number of times.
• Commodity hardware: Hardware that is inexpensive and easily available in the market. This is one of the features that especially distinguishes HDFS from other file systems.
Nodes: Master and slave nodes typically form the HDFS cluster.
1. NameNode (MasterNode):
• Manages all the slave nodes and assigns work to them.
• It executes filesystem namespace operations like opening, closing, and renaming files and directories.
• It should be deployed on reliable, high-configuration hardware, not on commodity hardware.
2. DataNode(SlaveNode):
• Actual worker nodes, who do the actual work like reading, writing, processing
etc.
• They also perform creation, deletion, and replication upon instruction from
the master.
• They can be deployed on commodity hardware.
HDFS daemons: Daemons are the processes running in background.
• Namenodes:
o Run on the master node.
o Store metadata (data about data) like file path, the number of blocks,
block Ids. etc.
o Require high amount of RAM.
o Store meta-data in RAM for fast retrieval i.e to reduce seek time.
Though a persistent copy of it is kept on disk.
• DataNodes:
o Run on slave nodes.
o Require high memory as data is actually stored here.
Data storage in HDFS: Now let’s see how the data is stored in a distributed
manner.

Let's assume that a 100 TB file is inserted. The master node (NameNode) will first divide the file into blocks, say of 10 TB each for this illustration (the default block size is 128 MB in Hadoop 2.x and above). These blocks are then stored across different DataNodes (slave nodes). The DataNodes replicate the blocks among themselves, and the information about which blocks they hold is sent to the master. The default replication factor is 3, meaning 3 replicas of each block are kept (including the original). We can increase or decrease the replication factor by editing the configuration in hdfs-site.xml.
Note: MasterNode has the record of everything, it knows the location and info of
each and every single data nodes and the blocks they contain, i.e. nothing is done
without the permission of master node.
Why divide the file into blocks?
Answer: Let’s assume that we don’t divide, now it’s very difficult to store a 100 TB
file on a single machine. Even if we store, then each read and write operation on
that whole file is going to take very high seek time. But if we have multiple blocks
of size 128MB then its become easy to perform various read and write operations on
it compared to doing it on a whole file at once. So we divide the file to have faster
data access i.e. reduce seek time.
Why replicate the blocks in data nodes while storing?
Answer: Let’s assume we don’t replicate and only one yellow block is present on
datanode D1. Now if the data node D1 crashes we will lose the block and which will
make the overall data inconsistent and faulty. So we replicate the blocks to
achieve fault-tolerance.
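The block size, replication factor, and block placement described above can be inspected programmatically through the HDFS Java API. The sketch below is illustrative only: the path /data/sample.txt is a hypothetical file assumed to already exist in HDFS, and the cluster settings are picked up from the configuration files on the classpath.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsBlockInfo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();      // reads core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);

        Path file = new Path("/data/sample.txt");      // hypothetical file already stored in HDFS
        FileStatus status = fs.getFileStatus(file);

        System.out.println("Block size  : " + status.getBlockSize());    // 128 MB by default
        System.out.println("Replication : " + status.getReplication());  // 3 by default

        // The NameNode's metadata tells us which DataNodes hold each block of the file.
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation block : blocks) {
            System.out.println("Block at offset " + block.getOffset()
                    + " stored on " + String.join(", ", block.getHosts()));
        }
        fs.close();
    }
}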
Terms related to HDFS:
• Heartbeat: The signal that a DataNode continuously sends to the NameNode. If the NameNode doesn't receive a heartbeat from a DataNode, it considers that node dead.
• Balancing: If a DataNode crashes, the blocks present on it are lost as well and become under-replicated compared to the remaining blocks. The master node (NameNode) then signals the DataNodes that contain replicas of those lost blocks to re-replicate them so that the overall distribution of blocks is balanced.
• Replication: It is carried out by the DataNodes.
Note: No two replicas of the same block are stored on the same DataNode.
Features:
• Distributed data storage.
• Blocks reduce seek time.
• The data is highly available as the same block is present at multiple datanodes.
• Even if multiple datanodes are down we can still do our work, thus making it
highly reliable.
• High fault tolerance.
Limitations: Though HDFS provides many features, there are some areas where it doesn't work well.
• Low latency data access: Applications that require low-latency access to data
i.e in the range of milliseconds will not work well with HDFS, because HDFS is
designed keeping in mind that we need high-throughput of data even at the cost
of latency.
• Small file problem: Having lots of small files will result in lots of seeks and lots
of movement from one datanode to another datanode to retrieve each small file,
this whole process is a very inefficient data access pattern.

HADOOP ECOSYSTEM
Overview: Apache Hadoop is an open-source framework intended to make interaction with big data easier. However, for those who are not acquainted with this technology, one question arises: what is big data? Big data is a term for data sets that can't be processed efficiently with traditional approaches such as RDBMS. Hadoop has made its place in the industries and companies that need to work on large data sets that are sensitive and need efficient handling. Hadoop is a framework that enables the processing of large data sets that reside in the form of clusters. Being a framework, Hadoop is made up of several modules that are supported by a large ecosystem of technologies.
Introduction: Hadoop Ecosystem is a platform or a suite which provides various services to
solve the big data problems. It includes Apache projects and various commercial tools and
solutions. There are four major elements of Hadoop i.e. HDFS, MapReduce, YARN, and
Hadoop Common Utilities. Most of the tools or solutions are used to supplement or support
these major elements. All these tools work collectively to provide services such as absorption,
analysis, storage and maintenance of data etc.
Following are the components that collectively form a Hadoop ecosystem:
• HDFS: Hadoop Distributed File System
• YARN: Yet Another Resource Negotiator
• MapReduce: Programming based Data Processing
• Spark: In-Memory data processing
• PIG, HIVE: Query based processing of data services
• HBase: NoSQL Database
• Mahout, Spark MLLib: Machine Learning algorithm libraries
• Solr, Lucene: Searching and Indexing
• Zookeeper: Managing cluster
• Oozie: Job Scheduling

Note: Apart from the above-mentioned components, there are many other components too
that are part of the Hadoop ecosystem.
All these toolkits or components revolve around one term i.e. Data. That’s the beauty of
Hadoop that it revolves around data and hence making its synthesis easier.
HDFS:
• HDFS is the primary or major component of Hadoop ecosystem and is responsible for
storing large data sets of structured or unstructured data across various nodes and thereby
maintaining the metadata in the form of log files.
• HDFS consists of two core components i.e.
1. Name node
2. Data Node
• Name Node is the prime node which contains metadata (data about data) requiring
comparatively fewer resources than the data nodes that stores the actual data. These data
nodes are commodity hardware in the distributed environment. Undoubtedly, making
Hadoop cost effective.
• HDFS maintains all the coordination between the clusters and hardware, thus working at
the heart of the system.
YARN:
• Yet Another Resource Negotiator, as the name implies, YARN is the one who helps to
manage the resources across the clusters. In short, it performs scheduling and resource
allocation for the Hadoop System.
• Consists of three major components i.e.
1. Resource Manager
2. Nodes Manager
3. Application Manager
• Resource manager has the privilege of allocating resources for the applications in a
system whereas Node managers work on the allocation of resources such as CPU,
memory, bandwidth per machine and later on acknowledges the resource manager.
Application manager works as an interface between the resource manager and node
manager and performs negotiations as per the requirement of the two.
MapReduce:
• By making the use of distributed and parallel algorithms, MapReduce makes it possible
to carry over the processing’s logic and helps to write applications which transform big
data sets into a manageable one.
• MapReduce makes the use of two functions i.e. Map() and Reduce() whose task is:
1. Map() performs sorting and filtering of data and thereby organizing them in the form
of group. Map generates a key-value pair based result which is later on processed by
the Reduce() method.
2. Reduce(), as the name suggests does the summarization by aggregating the mapped
data. In simple, Reduce() takes the output generated by Map() as input and combines
those tuples into smaller set of tuples.
PIG:
Pig was basically developed by Yahoo which works on a pig Latin language, which is Query
based language similar to SQL.
• It is a platform for structuring the data flow, processing and analyzing huge data sets.
• Pig does the work of executing commands and in the background, all the activities of
MapReduce are taken care of. After the processing, pig stores the result in HDFS.
• Pig Latin language is specially designed for this framework which runs on Pig Runtime.
Just the way Java runs on the JVM.
• Pig helps to achieve ease of programming and optimization and hence is a major segment
of the Hadoop Ecosystem.
HIVE:
• With the help of an SQL-like methodology and interface, HIVE performs reading and writing of large data sets. Its query language is called HQL (Hive Query Language).
• It is highly scalable, as it allows both real-time (interactive) and batch processing. Also, all the SQL datatypes are supported by Hive, making query processing easier.
• Similar to the Query Processing frameworks, HIVE too comes with two
components: JDBC Drivers and HIVE Command Line.
• JDBC, along with ODBC drivers, works on establishing the data storage permissions and connection, whereas the HIVE command line helps in the processing of queries. A short JDBC sketch is shown below.
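To make the JDBC path concrete, the following is a minimal, hedged sketch of querying HiveServer2 from Java through the Hive JDBC driver; the host, port, credentials, table and column names are placeholders invented for this example.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveJdbcQuery {
    public static void main(String[] args) throws Exception {
        // Register the Hive JDBC driver and connect to HiveServer2 (default port 10000)
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        Connection con = DriverManager.getConnection(
                "jdbc:hive2://localhost:10000/default", "hiveuser", "");
        Statement stmt = con.createStatement();

        // An HQL query against a hypothetical table; Hive turns it into cluster jobs
        ResultSet rs = stmt.executeQuery(
                "SELECT page, COUNT(*) AS hits FROM web_logs GROUP BY page");
        while (rs.next()) {
            System.out.println(rs.getString("page") + " -> " + rs.getLong("hits"));
        }
        rs.close();
        stmt.close();
        con.close();
    }
}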
Mahout:
• Mahout brings machine-learning capability to a system or application. Machine learning, as the name suggests, helps the system to develop itself based on patterns, user/environment interaction, or algorithms.
• It provides various libraries or functionalities such as collaborative filtering, clustering,
and classification which are nothing but concepts of Machine learning. It allows invoking
algorithms as per our need with the help of its own libraries.
Apache Spark:
• It is a platform that handles compute-intensive tasks such as batch processing, interactive or iterative real-time processing, graph processing, and visualization.
• It works on data held in memory and is therefore faster than disk-based MapReduce in terms of optimization.
• Spark is best suited for real-time data, whereas Hadoop MapReduce is best suited for structured data or batch processing; hence both are used together in most companies.
Apache HBase:
• It is a NoSQL database that supports all kinds of data and is thus capable of handling any workload of a Hadoop database. It provides the capabilities of Google's BigTable, so it is able to work on big data sets effectively.
• At times when we need to search for or retrieve a few small records in a huge database, the request must be processed within a very short span of time. At such times HBase comes in handy, because it gives us a fast, fault-tolerant way of storing and looking up such data.
Other Components: Apart from all of these, there are some other components too that carry
out a huge task in order to make Hadoop capable of processing large datasets. They are as
follows:
• Solr, Lucene: These are two services that perform the task of searching and indexing with the help of Java libraries. Lucene in particular is a Java library that also provides a spell-check mechanism; Solr is built on top of Lucene.
• Zookeeper: There was a huge issue of management of coordination and synchronization
among the resources or the components of Hadoop which resulted in inconsistency, often.
Zookeeper overcame all the problems by performing synchronization, inter-component
based communication, grouping, and maintenance.
• Oozie: Oozie simply performs the task of a scheduler, scheduling jobs and binding them together as a single unit. There are two kinds of jobs, i.e., Oozie workflow jobs and Oozie coordinator jobs. Oozie workflow jobs need to be executed in a sequentially ordered manner, whereas Oozie coordinator jobs are triggered when some data or an external stimulus becomes available.

Hadoop YARN Architecture

YARN stands for "Yet Another Resource Negotiator". It was introduced in Hadoop 2.0 to remove the bottleneck on the Job Tracker which was present in Hadoop 1.0. YARN was described as a "Redesigned Resource Manager" at the time of its launch, but it has now evolved to be known as a large-scale distributed operating system used for Big Data processing.
The YARN architecture basically separates the resource management layer from the processing layer. With YARN, the responsibility of the Hadoop 1.0 Job Tracker is split between the resource manager and the application master.

YARN also allows different data processing engines like graph processing, interactive
processing, stream processing as well as batch processing to run and process data stored in
HDFS (Hadoop Distributed File System) thus making the system much more efficient. Through
its various components, it can dynamically allocate various resources and schedule the
application processing. For large volume data processing, it is quite necessary to manage the
available resources properly so that every application can leverage them.
YARN Features: YARN gained popularity because of the following features-
• Scalability: The scheduler in Resource manager of YARN architecture allows Hadoop to
extend and manage thousands of nodes and clusters.
• Compatibility: YARN supports the existing map-reduce applications without disruptions
thus making it compatible with Hadoop 1.0 as well.
• Cluster Utilization: YARN supports dynamic utilization of the cluster in Hadoop, which enables optimized cluster utilization.
• Multi-tenancy: It allows multiple engine access thus giving organizations a benefit of
multi-tenancy.

Hadoop YARN Architecture

The main components of YARN architecture include:

• Client: It submits map-reduce jobs.
• Resource Manager: It is the master daemon of YARN and is responsible for resource
assignment and management among all the applications. Whenever it receives a processing
request, it forwards it to the corresponding node manager and allocates resources for the
completion of the request accordingly. It has two major components:
o Scheduler: It performs scheduling based on the submitted applications and the available resources. It is a pure scheduler, meaning it does not perform other tasks such as monitoring or tracking, and it does not guarantee a restart if a task fails. The YARN scheduler supports plugins such as the Capacity Scheduler and the Fair Scheduler to partition the cluster resources.
o Application manager: It is responsible for accepting the application and
negotiating the first container from the resource manager. It also restarts the
Application Master container if a task fails.
• Node Manager: It takes care of an individual node in the Hadoop cluster and manages the applications and workflow on that particular node. Its primary job is to keep up with the Resource Manager. It registers with the Resource Manager and sends heartbeats with the health status of the node. It monitors resource usage, performs log management, and also kills a container based on directions from the resource manager. It is also responsible for creating the container process and starting it at the request of the Application Master.
• Application Master: An application is a single job submitted to a framework. The
application master is responsible for negotiating resources with the resource manager,
tracking the status and monitoring progress of a single application. The application master
asks the node manager to launch a container by sending it a Container Launch Context (CLC), which includes everything the application needs to run. Once the application is started, it sends a health report to the resource manager from time to time.
• Container: It is a collection of physical resources such as RAM, CPU cores and disk on a
single node. The containers are invoked by Container Launch Context(CLC) which is a
record that contains information such as environment variables, security tokens,
dependencies etc.
Application workflow in Hadoop YARN:

1. Client submits an application
2. The Resource Manager allocates a container to start the Application Manager
3. The Application Manager registers itself with the Resource Manager
4. The Application Manager negotiates containers from the Resource Manager
5. The Application Manager notifies the Node Manager to launch containers
6. Application code is executed in the container
7. Client contacts Resource Manager/Application Manager to monitor application’s status
8. Once the processing is complete, the Application Manager un-registers with the Resource Manager. A client-side sketch of this submission workflow is shown below.
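The sketch below is a simplified, hedged outline of how a client drives steps 1 and 7 of this workflow through YARN's Java client API. The application name, launch command and resource sizes are invented for illustration; a real application would also ship its Application Master code, environment and local resources in the container spec.

import java.util.Collections;
import org.apache.hadoop.yarn.api.records.ApplicationId;
import org.apache.hadoop.yarn.api.records.ApplicationReport;
import org.apache.hadoop.yarn.api.records.ApplicationSubmissionContext;
import org.apache.hadoop.yarn.api.records.ContainerLaunchContext;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.client.api.YarnClientApplication;
import org.apache.hadoop.yarn.conf.YarnConfiguration;
import org.apache.hadoop.yarn.util.Records;

public class SimpleYarnSubmit {
    public static void main(String[] args) throws Exception {
        YarnClient yarnClient = YarnClient.createYarnClient();
        yarnClient.init(new YarnConfiguration());
        yarnClient.start();

        // Step 1: ask the Resource Manager for a new application
        YarnClientApplication app = yarnClient.createApplication();
        ApplicationSubmissionContext ctx = app.getApplicationSubmissionContext();
        ctx.setApplicationName("demo-app");

        // Container spec for the Application Master (command is illustrative only)
        ContainerLaunchContext amContainer = Records.newRecord(ContainerLaunchContext.class);
        amContainer.setCommands(Collections.singletonList("echo launching application master"));
        ctx.setAMContainerSpec(amContainer);
        ctx.setResource(Resource.newInstance(1024, 1)); // 1 GB RAM, 1 vcore for the AM

        // Submit, then poll the Resource Manager for status (step 7)
        ApplicationId appId = yarnClient.submitApplication(ctx);
        ApplicationReport report = yarnClient.getApplicationReport(appId);
        System.out.println(appId + " state: " + report.getYarnApplicationState());

        yarnClient.stop();
    }
}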

Advantages :
• Flexibility: YARN offers flexibility to run various types of distributed processing systems
such as Apache Spark, Apache Flink, Apache Storm, and others. It allows multiple
processing engines to run simultaneously on a single Hadoop cluster.
• Resource Management: YARN provides an efficient way of managing resources in the
Hadoop cluster. It allows administrators to allocate and monitor the resources required by
each application in a cluster, such as CPU, memory, and disk space.
• Scalability: YARN is designed to be highly scalable and can handle thousands of nodes in
a cluster. It can scale up or down based on the requirements of the applications running on
the cluster.
• Improved Performance: YARN offers better performance by providing a centralized
resource management system. It ensures that the resources are optimally utilized, and
applications are efficiently scheduled on the available resources.
• Security: YARN provides robust security features such as Kerberos authentication, Secure
Shell (SSH) access, and secure data transmission. It ensures that the data stored and
processed on the Hadoop cluster is secure.

Disadvantages :

• Complexity: YARN adds complexity to the Hadoop ecosystem. It requires additional configurations and settings, which can be difficult for users who are not familiar with YARN.
• Overhead: YARN introduces additional overhead, which can slow down the performance
of the Hadoop cluster. This overhead is required for managing resources and scheduling
applications.
• Latency: YARN introduces additional latency in the Hadoop ecosystem. This latency can
be caused by resource allocation, application scheduling, and communication between
components.
• Single Point of Failure: YARN can be a single point of failure in the Hadoop cluster. If
YARN fails, it can cause the entire cluster to go down. To avoid this, administrators need
to set up a backup YARN instance for high availability.
• Limited Support: YARN has limited support for non-Java programming languages.
Although it supports multiple processing engines, some engines have limited language
support, which can limit the usability of YARN in certain environments.

Apache Hive
Prerequisites – Introduction to Hadoop, Computing Platforms and Technologies
Apache Hive is a data warehouse and ETL tool (ETL stands for "extract, transform, and load": a process that combines data from multiple sources into a single repository, such as a data warehouse, data store, or data lake) which provides an SQL-like interface between the user and the Hadoop Distributed File System (HDFS), and which integrates with Hadoop. It is built on top of Hadoop. It is a software project that provides data query and analysis. It facilitates reading, writing and handling wide datasets that are stored in distributed storage and queried using Structured Query Language (SQL) syntax. It is not built for Online Transactional Processing (OLTP) workloads. It is frequently used for data warehousing tasks such as data encapsulation, ad-hoc queries, and analysis of huge datasets. It is designed to enhance scalability, extensibility, performance, fault tolerance and loose coupling with its input formats.
Hive was initially developed by Facebook and is now used and developed further by companies such as Amazon and Netflix; it delivers standard SQL functionality for analytics. Without Hive, such SQL-style queries over distributed data would have to be written against the MapReduce Java API. Hive provides portability, as most data warehousing applications already work with SQL-based query languages.
Apache Hive is a data warehouse software project that is built on top of the Hadoop
ecosystem. It provides an SQL-like interface to query and analyze large datasets stored in
Hadoop’s distributed file system (HDFS) or other compatible storage systems.
Hive uses a language called HiveQL, which is similar to SQL, to allow users to express data
queries, transformations, and analyses in a familiar syntax. HiveQL statements are compiled
into MapReduce jobs, which are then executed on the Hadoop cluster to process the data.
Hive includes many features that make it a useful tool for big data analysis, including support
for partitioning, indexing, and user-defined functions (UDFs). It also provides a number of
optimization techniques to improve query performance, such as predicate pushdown, column
pruning, and query parallelization.
Hive can be used for a variety of data processing tasks, such as data warehousing, ETL
(extract, transform, load) pipelines, and ad-hoc data analysis. It is widely used in the big data
industry, especially in companies that have adopted the Hadoop ecosystem as their primary
data processing platform.
Components of Hive:
1. HCatalog –
It is a Hive component and serves as a table and storage management layer for Hadoop. It enables users of different data processing tools, such as Pig and MapReduce, to easily read and write data on the grid.
2. WebHCat –
It provides a service that can be used to run Hadoop MapReduce (or YARN), Pig, or Hive jobs, or to perform Hive metadata operations, over an HTTP interface.
Modes of Hive:
1. Local Mode –
It is used when Hadoop is installed in pseudo-distributed mode with only one data node, when the data is small enough to be restricted to a single local machine, and when processing of such smaller, local datasets is expected to be faster.
2. Map Reduce Mode –
It is used when Hadoop is built with multiple data nodes and the data is divided across the various nodes; it works on huge datasets, executes queries in parallel, and achieves enhanced performance when processing large datasets.
Characteristics of Hive:
1. Databases and tables are built before loading the data.
2. Hive as data warehouse is built to manage and query only structured data which is
residing under tables.
3. When handling structured data, MapReduce lacks optimization and usability features such as UDFs, whereas the Hive framework provides both optimization and ease of use.
4. Programming in Hadoop deals directly with files, so Hive can partition the data using directory structures to improve performance on certain queries (see the sketch after this list).
5. Hive is compatible with various file formats such as TEXTFILE, SEQUENCEFILE, ORC, RCFILE, etc.
6. Hive uses the Derby database for single-user metadata storage and MySQL for multi-user or shared metadata.
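To illustrate characteristics 4 and 5, the hedged sketch below creates a partitioned, delimited-text table through the Hive JDBC driver and loads one day of data into its own partition directory. The table, columns, partition key and HDFS path are invented for the example, and the same statements could equally be typed into the Hive command line.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class HivePartitionedTable {
    public static void main(String[] args) throws Exception {
        Connection con = DriverManager.getConnection(
                "jdbc:hive2://localhost:10000/default", "hiveuser", "");
        Statement stmt = con.createStatement();

        // Each distinct value of the partition column 'dt' gets its own HDFS directory,
        // so queries that filter on dt only scan the matching directories.
        stmt.execute("CREATE TABLE IF NOT EXISTS access_logs ("
                + " ip STRING, url STRING, status INT)"
                + " PARTITIONED BY (dt STRING)"
                + " ROW FORMAT DELIMITED FIELDS TERMINATED BY '\\t'"
                + " STORED AS TEXTFILE");

        // Move one day's tab-separated file into its partition (hypothetical HDFS path)
        stmt.execute("LOAD DATA INPATH '/user/demo/logs/2024-01-01.tsv'"
                + " INTO TABLE access_logs PARTITION (dt='2024-01-01')");

        stmt.close();
        con.close();
    }
}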
Features of Hive:
1. It provides indexes, including bitmap indexes, to accelerate queries (index types including compaction and bitmap indexes are available as of version 0.10).
2. Metadata is stored in an RDBMS, which reduces the time needed to perform semantic checks during query execution.
3. Built-in user-defined functions (UDFs) are provided for manipulating strings, dates, and other data; Hive can also be extended with custom UDFs to handle use cases not covered by the predefined functions.
4. DEFLATE, BWT, Snappy, etc. are the algorithms supported for operating on compressed data stored in the Hadoop ecosystem.
5. It stores schemas in a database and processes the data in the Hadoop Distributed File System (HDFS).
6. It is built for Online Analytical Processing (OLAP).
7. It delivers a query language frequently known as Hive Query Language (HQL or HiveQL).
Advantages:
Scalability: Apache Hive is designed to handle large volumes of data, making it a scalable
solution for big data processing.
Familiar SQL-like interface: Hive uses a SQL-like language called HiveQL, which makes
it easy for SQL users to learn and use.
Integration with Hadoop ecosystem: Hive integrates well with the Hadoop ecosystem,
enabling users to process data using other Hadoop tools like Pig, MapReduce, and Spark.
Supports partitioning and bucketing: Hive supports partitioning and bucketing, which can
improve query performance by limiting the amount of data scanned.
User-defined functions: Hive allows users to define their own functions, which can be used
in HiveQL queries.
Disadvantages:
Limited real-time processing: Hive is designed for batch processing, which means it may
not be the best tool for real-time data processing.
Slow performance: Hive can be slower than traditional relational databases because it is
built on top of Hadoop, which is optimized for batch processing rather than interactive
querying.
Steep learning curve: While Hive uses a SQL-like language, it still requires users to have
knowledge of Hadoop and distributed computing, which can make it difficult for beginners
to use.
Limited flexibility: Hive is not as flexible as other data warehousing tools because it is
designed to work specifically with Hadoop, which can limit its usability in other
environments.

Introduction to Apache Pig

Pig represents Big Data as data flows. Pig is a high-level platform or tool which is used to process large datasets. It provides a high level of abstraction over MapReduce and a high-level scripting language, known as Pig Latin, which is used to develop data analysis code. To process data stored in HDFS, programmers write scripts using the Pig Latin language; internally, the Pig Engine (a component of Apache Pig) converts all these scripts into map and reduce tasks, but these are not visible to the programmers, in order to provide a high level of abstraction. Pig Latin and the Pig Engine are the two main components of the Apache Pig tool. The results of Pig are always stored in HDFS.
Note: The Pig Engine has two types of execution environment, i.e. a local execution environment in a single JVM (used when the dataset is small in size) and a distributed execution environment on a Hadoop cluster.
Need of Pig: One limitation of MapReduce is that the development cycle is very long. Writing the mapper and reducer, compiling and packaging the code, submitting the job and retrieving the output is a time-consuming task. Apache Pig reduces development time by using a multi-query approach. Pig is also beneficial for programmers who do not come from a Java background: 200 lines of Java code can often be written in only about 10 lines of Pig Latin, and programmers who already know SQL need little effort to learn Pig Latin. A minimal sketch of running Pig Latin from Java is shown after this list.
• It uses a multi-query approach, which reduces the length of the code.
• Pig Latin is an SQL-like language.
• It provides many built-in operators.
• It provides nested data types (tuples, bags, maps).
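The following is a minimal, hedged sketch of embedding Pig in Java through its PigServer API to run a word count in local mode; the input file and output directory names are placeholders, and the same four Pig Latin statements could be run unchanged from the Grunt shell.

import org.apache.pig.ExecType;
import org.apache.pig.PigServer;

public class EmbeddedPigWordCount {
    public static void main(String[] args) throws Exception {
        // ExecType.LOCAL runs in a single JVM; ExecType.MAPREDUCE would run on the cluster
        PigServer pig = new PigServer(ExecType.LOCAL);

        pig.registerQuery("lines = LOAD 'input.txt' AS (line:chararray);");
        pig.registerQuery("words = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;");
        pig.registerQuery("grouped = GROUP words BY word;");
        pig.registerQuery("counts = FOREACH grouped GENERATE group, COUNT(words) AS n;");

        // Materialize the relation 'counts' into an output directory
        pig.store("counts", "wordcount_out");
        pig.shutdown();
    }
}

Switching the ExecType to MAPREDUCE would run the same script as MapReduce jobs on the cluster without changing a line of the Pig Latin.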
Evolution of Pig: Earlier in 2006, Apache Pig was developed by Yahoo’s researchers. At that
time, the main idea to develop Pig was to execute the MapReduce jobs on extremely large
datasets. In the year 2007, it moved to Apache Software Foundation(ASF) which makes it an
open source project. The first version (0.1) of Pig came in the year 2008. The latest version of Apache Pig is 0.17, which came in the year 2017.
Features of Apache Pig:
• For performing several operations, Apache Pig provides rich sets of operators such as filtering, joining, sorting, aggregation, etc.
• Easy to learn, read and write. Especially for SQL programmers, Apache Pig is a boon.
• Apache Pig is extensible, so you can write your own processing logic and user-defined functions (UDFs) in Python, Java or other programming languages.
• Join operation is easy in Apache Pig.
• Fewer lines of code.
• Apache Pig allows splits in the pipeline.
• By integrating with other components of the Apache Hadoop ecosystem, such as Apache
Hive, Apache Spark, and Apache ZooKeeper, Apache Pig enables users to take advantage
of these components’ capabilities while transforming data.
• The data structure is multivalued, nested, and richer.
• Pig can handle the analysis of both structured and unstructured data.

Difference between Pig and MapReduce
• Apache Pig is a scripting language, whereas MapReduce is a compiled programming language.
• Pig offers abstraction at a higher level; MapReduce works at a lower level of abstraction.
• Pig needs fewer lines of code as compared to MapReduce; MapReduce needs more lines of code.
• Less development effort is needed for Apache Pig; more development effort is required for MapReduce.
• Code efficiency of Pig is less as compared to MapReduce; MapReduce code efficiency is higher.
• Pig provides built-in functions for ordering, sorting and union; in MapReduce it is hard to perform such data operations.
• Pig allows nested data types like map, tuple and bag; MapReduce does not allow nested data types.
Applications of Apache Pig:
• Pig scripting is used for exploring large datasets.
• It provides support for ad-hoc queries across large datasets.
• It is used in the prototyping of large data-set processing algorithms.
• It is used to process time-sensitive data loads.
• It is used for collecting large amounts of data in the form of search logs and web crawls.
• It is used where analytical insights are needed through sampling.
Types of Data Models in Apache Pig: It consist of the 4 types of data models as follows:
• Atom: It is an atomic data value which is stored as a string. The main advantage of this model is that it can be used both as a number and as a string.
• Tuple: It is an ordered set of the fields.
• Bag: It is a collection of the tuples.
• Map: It is a set of key/value pairs.

Overview of SQOOP in Hadoop
SQOOP :
Before Hadoop and the concept of big data existed, all data used to be stored in relational database management systems. After the introduction of big data concepts, data needed to be stored in a more concise and effective way, and this is where Sqoop comes into existence.
All the data stored in relational database management systems needed to be transferred into the Hadoop structure. Transferring such a large amount of data manually is not feasible, but with the help of Sqoop we are able to do it. Thus Sqoop is defined as the tool which is used to perform data transfer operations between relational database management systems and the Hadoop server; it helps in transferring bulk data from one source to another.
Some of the important Features of the Sqoop :
• Sqoop can also store the results of SQL queries directly in the Hadoop Distributed File System.
• Sqoop helps us load the processed data directly into Hive or HBase.
• It secures the data transfer with the help of Kerberos authentication.
• With the help of Sqoop, we can compress the transferred data.
• Sqoop is highly powerful and efficient in nature.
There are two major operations performed in Sqoop :
1. Import
2. Export
Sqoop Working :

SQOOP ARCHITECTURE
Basically, the operations that take place in Sqoop are user-friendly. Sqoop uses a command-line interface to process user commands; alternatively, it can use Java APIs to interact with the user. When Sqoop receives a command from the user, it handles it and then carries out the further processing of the command. Sqoop is only able to import and export data based on the user's command; it is not able to aggregate data.
Sqoop works in the following manner: it first parses the arguments provided by the user on the command-line interface and then passes those arguments to a further stage, where they are used to configure a map-only job. Once the job is set up, for the import command each mapper task is assigned its respective part of the data to be imported, on the basis of a key column defined by the user on the command line. To increase the efficiency of the process, Sqoop uses a parallel processing technique in which the data is distributed equally among all the mappers. Each mapper then creates an individual connection with the database using the Java database connectivity (JDBC) model and fetches the part of the data assigned to it by Sqoop. Once the data has been fetched, it is written to HDFS, HBase, or Hive, depending on the arguments provided on the command line; thus the Sqoop import process is completed.

The export of data in Sqoop is performed in a similar way. The Sqoop export tool performs the operation by moving a set of files from the Hadoop distributed file system back to a relational database management system. The files given as input during the import process are called records. When the user submits the job, it is mapped into map tasks that bring the files of data from Hadoop storage, and these data files are exported to a structured data destination in the form of a relational database management system such as MySQL, SQL Server, or Oracle.
Let us now understand the two main operations in detail:
Sqoop Import :
The Sqoop import command implements this operation. With the help of the import command, we can import a table from a relational database management system into the Hadoop server. Records in the Hadoop structure are stored in text files, and each record is imported as a separate record on the Hadoop server. We can also create load and partition in Hive while importing data. Sqoop also supports incremental import of data, which means that if we have already imported a database and want to add some more rows, these functions let us add only the new rows to the existing data rather than re-importing the complete database. A hedged sketch of launching such an import from Java is shown below.
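As an illustration, the hedged sketch below launches a Sqoop import from Java simply by invoking the sqoop command-line tool as a child process; the JDBC URL, credentials, table name and HDFS target directory are all hypothetical. An export run would look the same, using the export tool with an --export-dir argument instead of --target-dir.

import java.util.Arrays;

public class SqoopImportLauncher {
    public static void main(String[] args) throws Exception {
        // Builds the familiar "sqoop import ..." command line and runs it
        ProcessBuilder pb = new ProcessBuilder(Arrays.asList(
                "sqoop", "import",
                "--connect", "jdbc:mysql://dbhost:3306/sales",
                "--username", "etl_user",
                "--password", "secret",
                "--table", "customers",
                "--target-dir", "/user/etl/customers",
                "--num-mappers", "4")); // four parallel map tasks
        pb.inheritIO();                 // show Sqoop's console output
        int exitCode = pb.start().waitFor();
        System.out.println("sqoop import finished with exit code " + exitCode);
    }
}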
Sqoop Export :
The Sqoop export command implements the reverse operation. With the help of the export command, we can transfer data from the Hadoop file system back to the relational database management system. The data to be exported is processed into records before the operation is completed. The export of data is done in two steps: the first is to examine the database for metadata, and the second involves the migration of the data.
Here you can get the idea of how the import and export operation is performed in Hadoop with
the help of Sqoop.
Advantages of Sqoop :
• With the help of Sqoop, we can perform transfer operations of data with a variety of
structured data stores like Oracle, Teradata, etc.
• Sqoop helps us to perform ETL operations in a very fast and cost-effective manner.
• With the help of Sqoop, we can perform parallel processing of data, which speeds up the overall process.
• Sqoop uses the MapReduce mechanism for its operations which also supports fault
tolerance.
Disadvantages of Sqoop :
• A failure that occurs during an operation needs a special solution to handle the problem and resume the transfer.
• Sqoop uses JDBC connections to communicate with the relational database management system, which can be an inefficient way of transferring data.
• The performance of a Sqoop export operation depends upon the hardware configuration of the relational database management system.

What is Apache ZooKeeper?
Zookeeper is a distributed, open-source coordination service for distributed applications. It
exposes a simple set of primitives to implement higher-level services for synchronization,
configuration maintenance, and groups and naming.
In a distributed system, there are multiple nodes or machines that need to communicate with
each other and coordinate their actions. ZooKeeper provides a way to ensure that these nodes
are aware of each other and can coordinate their actions. It does this by maintaining a
hierarchical tree of data nodes called “Znodes“, which can be used to store and retrieve data
and maintain state information. ZooKeeper provides a set of primitives, such as locks, barriers,
and queues, that can be used to coordinate the actions of nodes in a distributed system. It also
provides features such as leader election, failover, and recovery, which can help ensure that the
system is resilient to failures. ZooKeeper is widely used in distributed systems such as Hadoop,
Kafka, and HBase, and it has become an essential component of many distributed applications.

Why do we need it?

• Coordination services: the integration/communication of services in a distributed environment.
• Coordination services are complex to get right. They are especially prone to errors such as race conditions and deadlocks.
• Race condition – two or more systems trying to perform the same task at the same time.
• Deadlock – two or more operations waiting for each other.
• To make coordination between distributed environments easy, developers came up with ZooKeeper, which relieves distributed applications of the responsibility of implementing coordination services from scratch.

What is distributed system?

• Multiple computer systems working on a single problem.
• It is a network that consists of autonomous computers that are connected using distributed
middleware.
• Key Features: Concurrent, resource sharing, independent, global, greater fault tolerance,
and price/performance ratio is much better.
• Key Goals: Transparency, Reliability, Performance, Scalability.
• Challenges: Security, Fault, Coordination, and resource sharing.

Coordination Challenge

• Why is coordination in a distributed system the hard problem?
• Coordination or configuration management for a distributed application that has many
systems.
• Master Node where the cluster data is stored.
• Worker nodes or slave nodes get the data from this master node.
• single point of failure.
• synchronization is not easy.
• Careful design and implementation are needed.

Apache Zookeeper

Apache Zookeeper is a distributed, open-source coordination service for distributed systems.
It provides a central place for distributed applications to store data, communicate with one
another, and coordinate activities. Zookeeper is used in distributed systems to coordinate
distributed processes and services. It provides a simple, tree-structured data model, a simple
API, and a distributed protocol to ensure data consistency and availability. Zookeeper is
designed to be highly reliable and fault-tolerant, and it can handle high levels of read and write
throughput.
Zookeeper is implemented in Java and is widely used in distributed systems, particularly in the
Hadoop ecosystem. It is an Apache Software Foundation project and is released under the
Apache License 2.0.

Architecture of Zookeeper

Zookeeper Services
The ZooKeeper architecture consists of a hierarchy of nodes called znodes, organized in a tree-
like structure. Each znode can store data and has a set of permissions that control access to the
znode. The znodes are organized in a hierarchical namespace, similar to a file system. At the
root of the hierarchy is the root znode, and all other znodes are children of the root znode. The
hierarchy is similar to a file system hierarchy, where each znode can have children and
grandchildren, and so on.

Important Components in Zookeeper
ZooKeeper Services
• Leader & Follower
• Request Processor – Active in Leader Node and is responsible for processing write
requests. After processing, it sends changes to the follower nodes
• Atomic Broadcast – Present in both Leader Node and Follower Nodes. It is responsible
for sending the changes to other Nodes.
• In-memory Databases (Replicated Databases)-It is responsible for storing the data in the
zookeeper. Every node contains its own databases. Data is also written to the file system
providing recoverability in case of any problems with the cluster.
Other Components
• Client – One of the nodes in our distributed application cluster. Access information from
the server. Every client sends a message to the server to let the server know that client is
alive.
• Server– Provides all the services to the client. Gives acknowledgment to the client.
• Ensemble– Group of Zookeeper servers. The minimum number of nodes that are required
to form an ensemble is 3.

Zookeeper Data Model

ZooKeeper data model
In Zookeeper, data is stored in a hierarchical namespace, similar to a file system. Each node in
the namespace is called a Znode, and it can store data and have children. Znodes are similar to
files and directories in a file system. Zookeeper provides a simple API for creating, reading,
writing, and deleting Znodes. It also provides mechanisms for detecting changes to the data
stored in Znodes, such as watches and triggers. Znodes maintain a stat structure that includes:
Version number, ACL, Timestamp, Data Length
Types of Znodes:
• Persistence: Alive until they’re explicitly deleted.
• Ephemeral: Active until the client connection is alive.
• Sequential: Either persistent or ephemeral.

Why do we need ZooKeeper in Hadoop?
Zookeeper is used to manage and coordinate the nodes in a Hadoop cluster, including the
NameNode, DataNode, and ResourceManager. In a Hadoop cluster, Zookeeper helps to:
• Maintain configuration information: Zookeeper stores the configuration information for the
Hadoop cluster, including the location of the NameNode, DataNode, and
ResourceManager.
• Manage the state of the cluster: Zookeeper tracks the state of the nodes in the Hadoop
cluster and can be used to detect when a node has failed or become unavailable.
• Coordinate distributed processes: Zookeeper can be used to coordinate distributed
processes, such as job scheduling and resource allocation, across the nodes in a Hadoop
cluster.
Zookeeper helps to ensure the availability and reliability of a Hadoop cluster by providing a
central coordination service for the nodes in the cluster.

How ZooKeeper in Hadoop Works?

ZooKeeper operates as a distributed file system and exposes a simple set of APIs that enable
clients to read and write data to the file system. It stores its data in a tree-like structure called a
znode, which can be thought of as a file or a directory in a traditional file system. ZooKeeper
uses a consensus algorithm to ensure that all of its servers have a consistent view of the data
stored in the Znodes. This means that if a client writes data to a znode, that data will be
replicated to all of the other servers in the ZooKeeper ensemble.
One important feature of ZooKeeper is its ability to support the notion of a “watch.” A watch
allows a client to register for notifications when the data stored in a znode changes. This can
be useful for monitoring changes to the data stored in ZooKeeper and reacting to those changes
in a distributed system.
In Hadoop, ZooKeeper is used for a variety of purposes, including:
• Storing configuration information: ZooKeeper is used to store configuration information
that is shared by multiple Hadoop components. For example, it might be used to store the
locations of NameNodes in a Hadoop cluster or the addresses of JobTracker nodes.
• Providing distributed synchronization: ZooKeeper is used to coordinate the activities of
various Hadoop components and ensure that they are working together in a consistent
manner. For example, it might be used to ensure that only one NameNode is active at a
time in a Hadoop cluster.
• Maintaining naming: ZooKeeper is used to maintain a centralized naming service for
Hadoop components. This can be useful for identifying and locating resources in a
distributed system.
ZooKeeper is an essential component of Hadoop and plays a crucial role in coordinating the
activity of its various subcomponents.

Reading and Writing in Apache Zookeeper

ZooKeeper provides a simple and reliable interface for reading and writing data. The data is
stored in a hierarchical namespace, similar to a file system, with nodes called znodes. Each
znode can store data and have children znodes. ZooKeeper clients can read and write data to
these znodes by using the getData() and setData() methods, respectively. Here is an example
of reading and writing data using the ZooKeeper Java API:
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.ZooDefs.Ids;
import org.apache.zookeeper.ZooKeeper;

public class ZkReadWrite {
    public static void main(String[] args) throws Exception {
        // Connect to the ZooKeeper ensemble (host:port, session timeout in ms, no watcher)
        ZooKeeper zk = new ZooKeeper("localhost:2181", 3000, null);

        // Write data by creating the znode "/myZnode"
        String path = "/myZnode";
        String data = "hello world";
        zk.create(path, data.getBytes(), Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);

        // Read the data back from the znode "/myZnode"
        byte[] bytes = zk.getData(path, false, null);
        String readData = new String(bytes);
        System.out.println(readData);   // prints "hello world"

        // Close the connection to the ZooKeeper ensemble
        zk.close();
    }
}

Session and Watches

Session
• Requests in a session are executed in FIFO order.
• Once the session is established then the session id is assigned to the client.
• Client sends heartbeats to keep the session valid
• session timeout is usually represented in milliseconds

Watches
• Watches are mechanisms for clients to get notifications about the changes in the Zookeeper
• Client can watch while reading a particular znode.
• Znode changes are modifications of the data associated with a znode or changes in the znode's children.
• Watches are triggered only once.
• If the session expires, its watches are also removed. A small Java sketch of registering a watch is shown below.
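To make the watch mechanism concrete, here is a small, hedged Java sketch that registers a one-time watch while reading a znode; the connection string and the znode path /myZnode are the same hypothetical values used in the earlier read/write example.

import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooKeeper;

public class ZnodeWatchExample {
    public static void main(String[] args) throws Exception {
        ZooKeeper zk = new ZooKeeper("localhost:2181", 3000, null);

        // Register a watch while reading "/myZnode"; it fires at most once
        Watcher watcher = new Watcher() {
            @Override
            public void process(WatchedEvent event) {
                // Called when the znode's data changes or the znode is deleted
                System.out.println("Watch fired: " + event.getType() + " on " + event.getPath());
            }
        };
        byte[] current = zk.getData("/myZnode", watcher, null);
        System.out.println("Current value: " + new String(current));

        // Keep the client alive briefly so a change made elsewhere can trigger the watch
        Thread.sleep(10000);
        zk.close();
    }
}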

What is Flume?
Apache Flume is a tool/service/data ingestion mechanism for collecting, aggregating, and transporting large amounts of streaming data, such as log files and events, from various sources to a centralized data store.

Flume is a highly reliable, distributed, and configurable tool. It is principally designed to copy
streaming data (log data) from various web servers to HDFS.

Applications of Flume

Assume an e-commerce web application wants to analyze customer behavior from a particular region. To do so, it would need to move the available log data into Hadoop for analysis. Here, Apache Flume comes to our rescue.

Flume is used to move the log data generated by application servers into HDFS at a higher
speed.

Advantages of Flume

Here are the advantages of using Flume −

• Using Apache Flume we can store the data in to any of the centralized stores (HBase,
HDFS).
• When the rate of incoming data exceeds the rate at which data can be written to the
destination, Flume acts as a mediator between data producers and the centralized stores
and provides a steady flow of data between them.
• Flume provides the feature of contextual routing.
• The transactions in Flume are channel-based where two transactions (one sender and
one receiver) are maintained for each message. It guarantees reliable message delivery.
• Flume is reliable, fault tolerant, scalable, manageable, and customizable.

Features of Flume

Some of the notable features of Flume are as follows −

• Flume ingests log data from multiple web servers into a centralized store (HDFS,
HBase) efficiently.
• Using Flume, we can get the data from multiple servers immediately into Hadoop.
• Along with the log files, Flume is also used to import huge volumes of event data
produced by social networking sites like Facebook and Twitter, and e-commerce
websites like Amazon and Flipkart.
• Flume supports a large set of sources and destinations types.
• Flume supports multi-hop flows, fan-in fan-out flows, contextual routing, etc.
• Flume can be scaled horizontally. A small sketch of handing an event to a Flume agent from an application follows below.
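As a hedged illustration of how an application hands events to Flume, the sketch below uses Flume's client SDK to send one log line to an Avro source that is assumed to be listening on localhost:41414; the host, port and message body are placeholders.

import java.nio.charset.StandardCharsets;
import org.apache.flume.Event;
import org.apache.flume.api.RpcClient;
import org.apache.flume.api.RpcClientFactory;
import org.apache.flume.event.EventBuilder;

public class FlumeEventSender {
    public static void main(String[] args) throws Exception {
        // Connect to a Flume agent whose Avro source listens on this host/port (assumed)
        RpcClient client = RpcClientFactory.getDefaultInstance("localhost", 41414);
        try {
            // Wrap a log line in a Flume event and hand it to the agent
            Event event = EventBuilder.withBody("user=42 action=checkout", StandardCharsets.UTF_8);
            client.append(event);
        } finally {
            client.close();
        }
    }
}

The agent's configured channel and sink (for example an HDFS sink) then take care of buffering the event and delivering it to the centralized store.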

Apache Oozie - Introduction
What is Apache Oozie?

Apache Oozie is a scheduler system to run and manage Hadoop jobs in a distributed environment. It allows multiple complex jobs to be combined and run in a sequential order to achieve a bigger task. Within a sequence of tasks, two or more jobs can also be programmed to run in parallel with each other.

One of the main advantages of Oozie is that it is tightly integrated with Hadoop stack
supporting various Hadoop jobs like Hive, Pig, Sqoop as well as system-specific jobs
like Java and Shell.

Oozie is an Open Source Java Web-Application available under Apache license 2.0. It is
responsible for triggering the workflow actions, which in turn uses the Hadoop execution
engine to actually execute the task. Hence, Oozie is able to leverage the existing Hadoop
machinery for load balancing, fail-over, etc.

Oozie detects completion of tasks through callback and polling. When Oozie starts a task, it
provides a unique callback HTTP URL to the task, and notifies that URL when it is complete.
If the task fails to invoke the callback URL, Oozie can poll the task for completion.

Following three types of jobs are common in Oozie −
• Oozie Workflow Jobs − These are represented as Directed Acyclic Graphs (DAGs) to
specify a sequence of actions to be executed.
• Oozie Coordinator Jobs − These consist of workflow jobs triggered by time and data
availability.
• Oozie Bundle − These can be referred to as a package of multiple coordinator and
workflow jobs.

We will look into each of these in detail in the following chapters.

A sample workflow with Controls (Start, Decision, Fork, Join and End) and Actions (Hive,
Shell, Pig) will look like the following diagram −

A workflow will always start with a Start tag and end with an End tag.

Use-Cases of Apache Oozie

Apache Oozie is used by Hadoop system administrators to run complex log analysis on HDFS.
Hadoop Developers use Oozie for performing ETL operations on data in a sequential order and
saving the output in a specified format (Avro, ORC, etc.) in HDFS.

In an enterprise, Oozie jobs are scheduled as coordinators or bundles.
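The following is a minimal, hedged sketch of submitting and monitoring a workflow from Java with the Oozie client library; the Oozie URL, HDFS application path and property values are placeholders, and the workflow.xml they point to is assumed to already exist in HDFS.

import java.util.Properties;
import org.apache.oozie.client.OozieClient;
import org.apache.oozie.client.WorkflowJob;

public class OozieWorkflowSubmit {
    public static void main(String[] args) throws Exception {
        // Oozie server URL (assumed default port 11000)
        OozieClient oozie = new OozieClient("http://localhost:11000/oozie");

        // Job properties: where the workflow.xml lives, plus values it references
        Properties conf = oozie.createConfiguration();
        conf.setProperty(OozieClient.APP_PATH, "hdfs://namenode:8020/user/demo/apps/my-workflow");
        conf.setProperty("jobTracker", "resourcemanager:8032");
        conf.setProperty("nameNode", "hdfs://namenode:8020");

        // Submit and start the workflow, then poll its status
        String jobId = oozie.run(conf);
        System.out.println("Submitted workflow job: " + jobId);

        Thread.sleep(10000);
        WorkflowJob job = oozie.getJobInfo(jobId);
        System.out.println("Workflow status: " + job.getStatus());
    }
}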

Oozie Editors

Before we dive into Oozie, let's have a quick look at the available editors for Oozie. Most of the time, you won't need an editor and will write the workflows using any popular text editor (like Notepad++, Sublime or Atom), as we will be doing in this tutorial.

But as a beginner it makes sense to create a workflow with the drag-and-drop method using an editor and then see how the workflow gets generated, and also to map the GUI to the actual workflow.xml created by the editor. This is the only section where we will discuss Oozie editors; we won't use them in our tutorial.

The most popular among Oozie editors is Hue.

Hue Editor for Oozie

This editor is very handy to use and is available with almost all Hadoop vendors’ solutions.

The following screenshot shows an example workflow created by this editor.

You can drag and drop controls and actions and add your job inside these actions.

A good resource to learn more on this topic −

http://gethue.com/new-apache-oozie-workflow-coordinator-bundle-editors/

Oozie Eclipse Plugin (OEP)

Oozie Eclipse plugin (OEP) is an Eclipse plugin for editing Apache Oozie workflows
graphically. It is a graphical editor for editing Apache Oozie workflows inside Eclipse.

Composing Apache Oozie workflows is becoming much simpler. It becomes a matter of drag-
and-drop, a matter of connecting lines between the nodes.

The following screenshots are examples of OEP.


Big Data Analytics.
UNIT-I
Syllabus: Get an overview of Big Data: What is Big Data, History of data management -
evolution of big data, Structuring of Big Data, Elements of Big Data, Big Data Analytics.
Exploring the use of Big Data in a business context; use of Big Data in social
networking, use of Big Data in preventing fraudulent activities, use of Big Data in detecting
fraudulent activities in the insurance sector.

Big data analytics uses advanced analytical methods that can extract important business insights from bulk datasets. Within these datasets lie both structured (organized) and unstructured (unorganized) data. Its applications cover different industries such as healthcare, education, insurance, AI, retail, and manufacturing. By analyzing this data, organizations gain better insight into what is working and what is not, so they can make the necessary improvements, develop their production systems, and increase profitability.
This guide discusses in greater detail the concept of big data analytics and how it impacts the decision-making process in many parts of the corporate world. You will also learn about the different types of analyses used in big data and the most commonly used tools, so that you can start your journey towards a data analytics career.
What is Big-Data Analytics?
Big data analytics is all about crunching massive amounts of information to uncover
hidden trends, patterns, and relationships. It’s like sifting through a giant mountain
of data to find the gold nuggets of insight.
Here’s a breakdown of what it involves:
• Collecting Data: Such data is coming from various sources such as social media,
web traffic, sensors and customer reviews.
• Cleaning the Data: Imagine having to assess a pile of rocks that included some
gold pieces in it. You would have to clean the dirt and the debris first. When data
is being cleaned, mistakes must be fixed, duplicates must be removed and the
data must be formatted properly.
• Analyzing the Data: It is here that the wizardry takes place. Data analysts
employ powerful tools and techniques to discover patterns and trends. It is the
same thing as looking for a specific pattern in all those rocks that you sorted
through.
The multi-industrial utilization of big data analytics spans from healthcare to finance
to retail. Through their data, companies can make better decisions, become more
efficient, and get a competitive advantage.
How does big data analytics work?
Big Data Analytics is a powerful tool which helps to find the potential of large and
complex datasets. To get better understanding, let’s break it down into key steps:
• Data Collection: Data is the core of Big Data Analytics. It is the gathering of
data from different sources such as the customers’ comments, surveys, sensors,
social media, and so on. The primary aim of data collection is to compile as much
accurate data as possible. The more data, the more insights.
• Data Cleaning (Data Preprocessing): The next step is to process this
information. It often requires some cleaning. This entails the replacement of
missing data, the correction of inaccuracies, and the removal of duplicates. It is
like sifting through a treasure trove, separating the rocks and debris and leaving
only the valuable gems behind.
• Data Processing: After that we will be working on the data processing. This
process contains such important stages as writing, structuring, and formatting of
data in a way it will be usable for the analysis. It is like a chef who is gathering
the ingredients before cooking. Data processing turns the data into a format suited
for analytics tools to process.
• Data Analysis: Data analysis is being done by means of statistical,
mathematical, and machine learning methods to get out the most important
findings from the processed data. For example, it can uncover customer
preferences, market trends, or patterns in healthcare data.
• Data Visualization: Data analysis usually is presented in visual form, for
illustration – charts, graphs and interactive dashboards. The visualizations
provided a way to simplify the large amounts of data and allowed for decision
makers to quickly detect patterns and trends.
• Data Storage and Management: Storing and managing the analyzed data properly is of utmost importance. It is like digital scrapbooking: you may want to go back to those insights in the long run, so how you store them matters a great deal. Moreover, data protection and adherence to regulations are key issues to be addressed during this crucial stage.
• Continuous Learning and Improvement: Big data analytics is a continuous
process of collecting, cleaning, and analyzing data to uncover hidden insights. It
helps businesses make better decisions and gain a competitive edge.
Types of Big Data Analytics
Big Data Analytics comes in many different types, each serving a different purpose:
1. Descriptive Analytics: This type helps us understand past events. In social
media, it shows performance metrics, like the number of likes on a post.
2. Diagnostic Analytics: Diagnostic analytics delves deeper to uncover the reasons behind past events. In healthcare, it identifies the causes of high patient re-admissions.
3. Predictive Analytics: Predictive analytics forecasts future events based on past
data. Weather forecasting, for example, predicts tomorrow’s weather by
analyzing historical patterns.
4. Prescriptive Analytics: Prescriptive analytics not only predicts results but also offers recommendations for action to achieve the best outcome. In e-commerce, it may suggest the best price for a product to achieve the highest possible profit.
5. Real-time Analytics: The key function of real-time analytics is data processing
in real time. It swiftly allows traders to make decisions based on real-time market
events.
6. Spatial Analytics: Spatial analytics is about location data. In urban management, it optimizes traffic flow using data gathered from sensors and cameras to minimize traffic jams.
7. Text Analytics: Text analytics delves into the unstructured data of text. In the
hotel business, it can use the guest reviews to enhance services and guest
satisfaction.
These types of analytics serve different purposes, making data understandable and
actionable. Whether it’s for business, healthcare, or everyday life, Big Data
Analytics provides a range of tools to turn data into valuable insights, supporting
better decision-making.
Big Data Analytics Technologies and Tools
Big Data Analytics relies on various technologies and tools that might sound
complex, let’s simplify them:
• Hadoop: Imagine Hadoop as an enormous digital warehouse. It’s used by
companies like Amazon to store tons of data efficiently. For instance, when
Amazon suggests products you might like, it’s because Hadoop helps manage
your shopping history.
• Spark: Think of Spark as the super-fast data chef. Netflix uses it to quickly
analyze what you watch and recommend your next binge-worthy show.
• NoSQL Databases: NoSQL databases, like MongoDB, are like digital filing cabinets that Airbnb uses to store your booking details and user data. These databases are popular because they are quick and flexible, so the platform can provide you with the right information when you need it.
• Tableau: Tableau is like an artist that turns data into beautiful pictures. The
World Bank uses it to create interactive charts and graphs that help people
understand complex economic data.
• Python and R: Python and R are like magic tools for data scientists. They use
these languages to solve tricky problems. For example, Kaggle uses them to
predict things like house prices based on past data.
• Machine Learning Frameworks (e.g., TensorFlow): Machine learning frameworks are the tools that make predictions. Airbnb uses TensorFlow to predict which properties are most likely to be booked in certain areas. It helps hosts make smart decisions about pricing and availability.
These tools and technologies are the building blocks of Big Data Analytics and helps
organizations gather, process, understand, and visualize data, making it easier for
them to make decisions based on information.
Benefits of Big Data Analytics
Big Data Analytics offers a host of real-world advantages, and let’s understand with
examples:
1. Informed Decisions: Imagine a store like Walmart. Big Data Analytics helps
them make smart choices about what products to stock. This not only reduces
waste but also keeps customers happy and profits high.
2. Enhanced Customer Experiences: Think about Amazon. Big Data Analytics is
what makes those product suggestions so accurate. It’s like having a personal
shopper who knows your taste and helps you find what you want.
3. Fraud Detection: Credit card companies, like MasterCard, use Big Data
Analytics to catch and stop fraudulent transactions. It’s like having a guardian
that watches over your money and keeps it safe.
4. Optimized Logistics: FedEx, for example, uses Big Data Analytics to deliver
your packages faster and with less impact on the environment. It’s like taking the
fastest route to your destination while also being kind to the planet.
Challenges of Big data analytics
While Big Data Analytics offers incredible benefits, it also comes with its set of
challenges:
• Data Overload: Consider Twitter, where approximately 6,000 tweets are posted
every second. The challenge is sifting through this avalanche of data to find
valuable insights.
• Data Quality: If the input data is inaccurate or incomplete, the insights generated
by Big Data Analytics can be flawed. For example, incorrect sensor readings
could lead to wrong conclusions in weather forecasting.
• Privacy Concerns: With the vast amount of personal data used, like in
Facebook’s ad targeting, there’s a fine line between providing personalized
experiences and infringing on privacy.
• Security Risks: With cyber threats increasing, safeguarding sensitive data
becomes crucial. For instance, banks use Big Data Analytics to detect fraudulent
activities, but they must also protect this information from breaches.
• Costs: Implementing and maintaining Big Data Analytics systems can be
expensive. Airlines like Delta use analytics to optimize flight schedules, but they
need to ensure that the benefits outweigh the costs.
Overcoming these challenges is essential to fully harness the power of Big Data
Analytics. Businesses and organizations must tread carefully, ensuring they make
the most of the insights while addressing these obstacles effectively.
Usage of Big Data Analytics
Big Data Analytics has a significant impact in various sectors:
• Healthcare: It aids in precise diagnoses and disease prediction, elevating patient
care.
• Retail: Amazon’s use of Big Data Analytics offers personalized product
recommendations based on your shopping history, creating a more tailored and
enjoyable shopping experience.
• Finance: Credit card companies such as Visa rely on Big Data Analytics to
swiftly identify and prevent fraudulent transactions, ensuring the safety of your
financial assets.
• Transportation: Companies like Uber use Big Data Analytics to optimize
drivers’ routes and predict demand, reducing wait times and improving overall
transportation experiences.
• Agriculture: Farmers make informed decisions, boosting crop yields while
conserving resources.
• Manufacturing: Companies like General Electric (GE) use Big Data Analytics
to predict machinery maintenance needs, reducing downtime and enhancing
operational efficiency.
Conclusion
Big Data Analytics is a game-changer that’s shaping a smarter future. From
improving healthcare and personalizing shopping to securing finances and
predicting demand, it’s transforming various aspects of our lives. However,
Challenges like managing overwhelming data and safeguarding privacy are real
concerns. In our world flooded with data, Big Data Analytics acts as a guiding light.
It helps us make smarter choices, offers personalized experiences, and uncovers
valuable insights. It’s a powerful and stable tool that promises a better and more
efficient future for everyone.
A Brief History of Big Data Analytics

The advent of big data analytics was a response to the rise of big data that started in the 1990s. Long before the term "big data" was coined, the concept was applied at the dawn of the computer age, when businesses used large spreadsheets to crunch numbers and find trends.
The large amount of data created in the late 1990s and early 2000s was fueled by new data sources. The popularity of mobile devices and search engines created more data than any company knew what to do with. Speed was another factor: the faster the data was created, the more of it had to be handled. In 2005, Gartner explained that these are the "3 Vs" of big data – variety, volume, and velocity. Recent research by IDC projected that data generation would grow tenfold worldwide by 2020.
Anyone who could tame the vast amount of raw, unstructured information would open up a treasure chest of never-before-seen insight into consumer behavior, business operations, natural phenomena, and population change.
Traditional data warehouses and relational databases were not up to the task. Innovation was needed. In 2006, Hadoop was created by engineers from Yahoo and launched as an open-source Apache project. The distributed processing platform made it possible to run big data applications on a clustered platform. This is the main difference between traditional and big data analytics.
At first, big companies like Google and Facebook used big data analytics. In 2010, retailers, banks, manufacturers, and healthcare companies began to understand the value of being big data analytics companies as well.
Initially, large organizations with on-premises data systems were best suited to collect and analyze large data sets. But Amazon Web Services and other cloud platform vendors have made it easy for businesses to use big data analytics services. The ability to handle Hadoop clusters in the cloud has given companies of any size the freedom to spin up and run only what they need on demand.
The big data analytics ecosystem is a key component of the agility required for today's companies to succeed. Insights can be discovered more quickly and efficiently, translating into instant trading decisions that can decide a winner.

History of Big Data

The term ‘Big Data’ has been in use since the early 1990s. Although it is not exactly known who
first used the term, most people credit John R. Mashey (who at the time worked at Silicon Graphics)
for making the term popular.[i] Big Data is now a well-established knowledge domain, both in
academics as well as in industry.

In order to best understand how Big Data was able to grow to such popularity, it is important to
place Big Data into its historic perspective. From a knowledge domain perspective, Big Data is the
combination of the very mature domain of statistics with the relatively young domain of computer
science. As such, it builds upon the collective knowledge of mathematics, statistics and data
analysis techniques in general.

Ever since the early beginnings of civilization, people have tried to use ‘data’ towards better
decision making, or to gain a competitive (or military) advantage. This quest can even be dated
back to the ancient Egyptians and the Roman Empire. The famous Library of Alexandria, which
was established around 300 B.C., can be considered as a first attempt by the ancient Egyptians to
capture all ‘data’ within the empire. It is estimated that the library consisted of 40,000 to 400,000
scrolls (which would be the equivalent of around 100,000 books).[ii] Even the ancient leaders of
the world realized that combining different data sources could result in an advantage over other
competing empires.

Other well documented use cases of the first forms of data analysis come from the Roman empire.
The ancient Roman military utilized very detailed statistical analysis to ‘predict’ at which border
the chance of an enemy insurgency would be the most prevalent. Based on these analyses, they
were able to deploy their armies in the most efficient way possible. It is not a far stretch to consider
these calculations one of the earliest forms of ‘predictive’ data analysis. And again, these analysis
techniques provided the Roman military with an advantage over other armies.

In order to understand the world of Big Data, it is therefore important to realize that most techniques
that are used today (from predictive algorithms to classification techniques) have been developed
centuries ago, and that Big Data continues to build on the work of some of the greatest minds in
history. The key aspect that has changed, of course, is the availability and accessibility to massive
quantities of data. Whereas up until the 1950s, most data analysis was done manually and on paper,
we now have the technology and capability to analyse terabytes of data within split seconds.

Especially since the beginning of the 21st century, the volume and speed with which data is
generated has changed beyond measures of human comprehension. The total amount of data in the
world was 4.4 zettabytes in 2013. That is set to rise steeply to 44 zettabytes by 2020.[iii] To put
that in perspective, 44 zettabytes are the equivalent to 44 trillion gigabytes. Even with the most
advanced technologies today, it is impossible to analyse all this data. The need to process these
increasingly larger (and unstructured) data sets is how traditional data analysis transformed into
‘Big Data’ in the last decade.
Figure: Data and the volume of data in Perspective (source: MyNASAData)

The evolution of Big Data can roughly be subdivided into three main phases.[iv] Each phase was
driven by technological advancements and has its own characteristics and capabilities. In order to
understand the context of Big Data today, it is important to understand how each of these phases
contributed to the modern meaning of Big Data.

Big Data Phase 1 – Structured Content

Data analysis, data analytics and Big Data originate from the longstanding domain of database
management. It relies heavily on the storage, extraction, and optimization techniques that are
common in data that is stored in Relational Database Management Systems (RDBMS). The
techniques that are used in these systems, such as structured query language (SQL) and the
extraction, transformation and loading (ETL) of data, started to professionalize in the 1970s.

Database management and data warehousing systems are still fundamental components of modern-
day Big Data solutions. The ability to quickly store and retrieve data from databases or find
information in large data sets, is still a core requirement for the analysis of Big Data. Relational
database management technology and other data processing technologies that were developed
during this phase, are still strongly embedded in the Big Data solutions from leading IT vendors,
such as Microsoft, Google and Amazon. A number of core technologies and characteristics of this
first phase in the evolution of Big Data are outlined in figure 3.

Big Data Phase 2 – Web Based Unstructured Content

From the early 2000s, the internet and corresponding web applications started to generate
tremendous amounts of data. In addition to the data that these web applications stored in relational
databases, IP-specific search and interaction logs started to generate web based unstructured data.
These unstructured data sources provided organizations with a new form of knowledge: insights
into the needs and behaviours of internet users. With the expansion of web traffic and online stores,
companies such as Yahoo, Amazon and eBay started to analyse customer behaviour by analysing
click-rates, IP-specific location data and search logs, opening a whole new world of possibilities.

From a technical point of view, HTTP-based web traffic introduced a massive increase in semi-
structured and unstructured data (further discussed in chapter 1.6). Besides the standard structured
data types, organizations now needed to find new approaches and storage solutions to deal with
these new data types in order to analyse them effectively. The arrival and growth of social media
data greatly intensified the need for tools, technologies and analytics techniques that were able to
extract meaningful information out of this unstructured data. New technologies, such as network
analysis, web-mining and spatial-temporal analysis, were specifically developed to analyse these
large quantities of web based unstructured data effectively.

Big Data Phase 3 – Mobile and Sensor-based Content

The third and current phase in the evolution of Big Data is driven by the rapid adoption of mobile
technology and devices, and the data they generate. The number of mobile devices and tablets
surpassed the number of laptops and PCs for the first time in 2011.[v] In 2020, there are an
estimated 10 billion devices that are connected to the internet. And all of these devices generate
data every single second of the day.

Mobile devices not only give the possibility to analyse behavioural data (such as clicks and search
queries), but they also provide the opportunity to store and analyse location-based GPS data.
Through these mobile devices and tablets, it is possible to track movement, analyse physical
behaviour and even health-related data (for example the number of steps you take per day). And
because these devices are connected to the internet almost every single moment, the data that these
devices generate provide a real-time and unprecedented picture of people’s behaviour.

Simultaneously, the rise of sensor-based internet-enabled devices is increasing the creation of data
to even greater volumes. Famously coined the ‘Internet of Things’ (IoT), millions of new TVs,
thermostats, wearables and even refrigerators are connected to the internet every single day,
providing massive additional data sets. Since this development is not expected to stop anytime
soon, it could be stated that the race to extract meaningful and valuable information out of these
new data sources has only just begun. A summary of the evolution of Big Data and its key
characteristics per phase is outlined in figure 3.

Figure: The Three major Phases in the evolution of Big Data


STRUCTURED DATA (Types of Big Data)
2.5 quintillion bytes of data are generated every day by users. Predictions by Statista
suggest that by the end of 2021, 74 Zettabytes (74 trillion GB) of data would be
generated by the internet. Managing such a vast and continuous outpouring of
data is increasingly difficult. Big Data was introduced to manage such huge, complex
data; it is concerned with extracting meaningful information from large and complex
data sets that cannot be processed or analyzed by traditional methods.
Not all data can be stored in the same way. The methods for data storage can be
accurately evaluated only after the type of data has been identified. A cloud service, like
Microsoft Azure, is a one-stop destination for storing all kinds of data: blobs, queues,
files, tables, disks, and application data. However, even within the cloud, there are
special services to deal with specific sub-categories of data.
For example, Azure Cloud Services like Azure SQL and Azure Cosmos DB help in
handling and managing widely varied kinds of data.
Application data is the data that is created, read, updated, deleted, or processed by
applications. This data could be generated via web apps, Android apps, iOS apps, or
any other applications. Due to the wide diversity in the kinds of data being
used, determining the storage approach is a little nuanced.

Types of Big Data

Structured Data

• Structured data can be crudely defined as the data that resides in a fixed field
within a record.
• It is the type of data most familiar from our everyday lives, for example a birthday or an address.
• A certain schema binds it, so all the data has the same set of properties. Structured
data is also called relational data. It is split into multiple tables to enhance the
integrity of the data by creating a single record to depict an entity. Relationships
are enforced by the application of table constraints.
• The business value of structured data lies within how well an organization can
utilize its existing systems and processes for analysis purposes.
Sources of structured data

A Structured Query Language (SQL) is needed to bring the data together. Structured
data is easy to enter, query, and analyze. All of the data follows the same format.
However, forcing a consistent structure also means that altering the structure is difficult,
as each record has to be updated to adhere to the new structure. Examples of
structured data include numbers, dates, strings, etc. The business data of an e-
commerce website can be considered to be structured data.

Name    Class   Section   Roll No   Grade
Geek1   11      A         1         A
Geek2   11      A         2         B
Geek3   11      A         3         A
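
As a rough illustration of how such schema-bound data is stored and queried with SQL, here is a minimal sketch using Python's built-in sqlite3 module; the table and column names simply mirror the sample table above and are illustrative, not part of any particular system.
Python

import sqlite3

# An in-memory database is enough for a quick demonstration.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE students (name TEXT, class INTEGER, section TEXT, roll_no INTEGER, grade TEXT)"
)
conn.executemany(
    "INSERT INTO students VALUES (?, ?, ?, ?, ?)",
    [("Geek1", 11, "A", 1, "A"), ("Geek2", 11, "A", 2, "B"), ("Geek3", 11, "A", 3, "A")],
)

# Because every record follows the same schema, querying is straightforward.
for row in conn.execute("SELECT name, grade FROM students WHERE grade = 'A'"):
    print(row)   # ('Geek1', 'A') and ('Geek3', 'A')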

Cons of Structured Data


1. Structured data can only be leveraged in cases of predefined functionalities. This
means that structured data has limited flexibility and is suitable for certain
specific use cases only.
2. Structured data is stored in a data warehouse with rigid constraints and a definite
schema. Any change in requirements would mean updating all of that structured
data to meet the new needs. This is a massive drawback in terms of resource and
time management.

Semi-Structured Data
• Semi-structured data is not bound by any rigid schema for data storage and
handling. The data is not in the relational format and is not neatly organized into
rows and columns like that in a spreadsheet. However, there are some features
like key-value pairs that help in discerning the different entities from each other.
• Since semi-structured data doesn’t need a structured query language, it is
commonly called NoSQL data.
• A data serialization language is used to exchange semi-structured data across
systems that may even have varied underlying infrastructure.
• Semi-structured content is often used to store metadata about a business process
but it can also include files containing machine instructions for computer
programs.
• This type of information typically comes from external sources such as social
media platforms or other web-based data feeds.

Semi-Structured Data

Data is created in plain text so that different text-editing tools can be used to draw
valuable insights. Due to a simple format, data serialization readers can be
implemented on hardware with limited processing resources and bandwidth.
Data Serialization Languages
Software developers use serialization languages to write in-memory data to files and to
transmit, store, and parse it. The sender and the receiver don't need to know anything about
the other system; as long as the same serialization language is used, the data can be
understood by both systems comfortably. There are three predominantly used
serialization languages.
1. XML– XML stands for eXtensible Markup Language. It is a text-based markup
language designed to store and transport data. XML parsers can be found in almost
all popular development platforms. It is human and machine-readable. XML has
definite standards for schema, transformation, and display. It is self-descriptive.
Below is an example of a programmer’s details in XML.
XML

<ProgrammerDetails>
  <FirstName>Jane</FirstName>
  <LastName>Doe</LastName>
  <CodingPlatforms>
    <CodingPlatform Type="Fav">GeeksforGeeks</CodingPlatform>
    <CodingPlatform Type="2ndFav">Code4Eva!</CodingPlatform>
    <CodingPlatform Type="3rdFav">CodeisLife</CodingPlatform>
  </CodingPlatforms>
</ProgrammerDetails>

<!--The 2ndFav and 3rdFav Coding Platforms are imaginative because Geeksforgeeks is the best!-->

XML expresses the data using tags (text within angle brackets) to shape the data
(for example, FirstName) and attributes (for example, Type) to describe it. However, because
XML is a verbose and voluminous language, other formats have gained more popularity.
2. JSON– JSON (JavaScript Object Notation) is a lightweight open-standard file
format for data interchange. JSON is easy to use and uses human/machine-readable
text to store and transmit data objects.
JSON

{
  "firstName": "Jane",
  "lastName": "Doe",
  "codingPlatforms": [
    { "type": "Fav", "value": "Geeksforgeeks" },
    { "type": "2ndFav", "value": "Code4Eva!" },
    { "type": "3rdFav", "value": "CodeisLife" }
  ]
}

This format isn’t as formal as XML. It’s more like a key/value pair model than a
formal data depiction. Javascript has inbuilt support for JSON. Although JSON is
very popular amongst web developers, non-technical personnel find it tedious to
work with JSON due to its heavy dependence on JavaScript and structural characters
(braces, commas, etc.)

3. YAML – YAML is a user-friendly data serialization language. The name is a recursive
acronym for “YAML Ain’t Markup Language”. It has been adopted by technical and non-
technical users all across the globe owing to its simplicity. The data structure is
defined by line separation and indentation, which reduces the dependency on structural
characters. YAML is easy to read, and its popularity is a result of being both human-
and machine-readable.
YAML example
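
A minimal YAML sample of the same programmer details shown in the XML and JSON snippets above (purely illustrative); note how indentation and line breaks take the place of tags and braces:
YAML

firstName: Jane
lastName: Doe
codingPlatforms:
  - type: Fav
    value: GeeksforGeeks
  - type: 2ndFav
    value: Code4Eva!
  - type: 3rdFav
    value: CodeisLife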

A product catalog organized by tags is an example of semi-structured data.

Unstructured Data

• Unstructured data is the kind of data that doesn’t adhere to any definite schema
or set of rules. Its arrangement is unplanned and haphazard.
• Photos, videos, text documents, and log files can be generally considered
unstructured data. Even though the metadata accompanying an image or a video
may be semi-structured, the actual data being dealt with is unstructured.
• Additionally, Unstructured data is also known as “dark data” because it cannot
be analyzed without the proper software tools.

Un-structured Data
Summary
Applications data can be classified as structured, semi-structured, and unstructured
data. Structured data is neatly organized and obeys a fixed set of rules. Semi-
structured data doesn’t obey a rigid schema, but it has certain discernible features that
help organize it. Data serialization languages are used to convert data objects into a
byte stream. These include XML, JSON, and YAML. Unstructured data doesn’t
have any structure at all. All these three kinds of data are present in an application.
All three of them play equally important roles in developing resourceful and
attractive applications.

ELEMENTS OF BIG DATA

6V’s of Big Data


In recent years, Big Data was defined by the “3 Vs”, but there are now “6 Vs” of Big
Data, which are also termed the characteristics of Big Data, as follows:
1. Volume:
• The name ‘Big Data’ itself refers to a size that is enormous.
• Volume means a huge amount of data.
• The size of data plays a very crucial role in determining its value. Whether a
particular data set can actually be considered Big Data or not depends largely
on its volume.
• Hence, while dealing with Big Data it is necessary to consider the characteristic
‘Volume’.
• Example: In the year 2016, the estimated global mobile traffic was 6.2 Exabytes
(6.2 billion GB) per month. It was also estimated that by the year 2020 there
would be almost 40,000 Exabytes of data.
2. Velocity:
• Velocity refers to the high speed at which data accumulates.
• In Big Data, data flows in at high velocity from sources like machines, networks,
social media, and mobile phones.
• There is a massive and continuous flow of data. This determines how fast data
is generated and how fast it must be processed to meet demand.
• Sampling data can help in dealing with issues of velocity.
• Example: More than 3.5 billion searches are made on Google every day.
Also, the number of Facebook users increases by roughly 22% year over year.
3. Variety:
• It refers to nature of data that is structured, semi-structured and unstructured
data.
• It also refers to heterogeneous sources.
• Variety is basically the arrival of data from new sources that are both inside and
outside of an enterprise. It can be structured, semi-structured and unstructured.
o Structured data: This is basically organized data. It generally
refers to data that has a defined length and format.
o Semi-structured data: This is basically semi-organized data.
It is generally a form of data that does not conform to the formal
structure of data. Log files are examples of this type of data.
o Unstructured data: This data basically refers to unorganized data. It
generally refers to data that doesn’t fit neatly into the traditional row
and column structure of the relational database. Texts, pictures,
videos etc. are the examples of unstructured data which can’t be
stored in the form of rows and columns.
4. Veracity:
• It refers to inconsistencies and uncertainty in data: the available data can
sometimes be messy, and its quality and accuracy are difficult to control.
• Big Data is also variable because of the multitude of data dimensions resulting
from multiple disparate data types and sources.
• Example: Data in bulk could create confusion, whereas too little data could
convey only half or incomplete information.
5. Value:
• After taking the previous Vs into account, there comes one more V, which stands
for Value. Bulk data that has no value is of no good to the company unless it
is turned into something useful.
• Data in itself is of no use or importance; it needs to be converted into
something valuable in order to extract information. Hence, Value can be considered
the most important of the 6 Vs.
6. Variability:
• How often does the structure of your data change?
• How often does the meaning or shape of your data change?
• Example: it is as if you ate the same ice cream daily but the taste kept
changing.

What is Big Data Analytics?


Big data analysis uses advanced analytical methods that can extract important
business insights from bulk datasets. Within these datasets lie both structured
(organized) and unstructured (unorganized) data. Its applications cover different
industries such as healthcare, education, insurance, AI, retail, and manufacturing. By
analyzing this data, organizations gain better insight into what is working and what is not,
so they can make the necessary improvements, refine their production systems, and
increase profitability.
This guide will discuss in greater detail the concept of big data analytics and how it
impacts the decision-making process in many parts of the corporate world. You will
also learn about the different types of analyses used in big data, the commonly used
tools, and the courses that can be recommended to start your journey towards a
data analytics career.
Table of Content
• What is Big-Data Analytics?
• How does big data analytics work?
• Types of Big Data Analytics
• Big Data Analytics Technologies and Tools
• Benefits of Big Data Analytics
• Challenges of Big data analytics
• Usage of Big Data Analytics
• Conclusion
• FAQs on Big Data Analytics
What is Big-Data Analytics?
Big data analytics is all about crunching massive amounts of information to uncover
hidden trends, patterns, and relationships. It’s like sifting through a giant mountain
of data to find the gold nuggets of insight.
Here’s a breakdown of what it involves:
• Collecting Data: Data comes from various sources such as social media,
web traffic, sensors, and customer reviews.
• Cleaning the Data: Imagine having to assess a pile of rocks that included some
gold pieces in it. You would have to clean the dirt and the debris first. When data
is being cleaned, mistakes must be fixed, duplicates must be removed and the
data must be formatted properly.
• Analyzing the Data: It is here that the wizardry takes place. Data analysts
employ powerful tools and techniques to discover patterns and trends. It is the
same thing as looking for a specific pattern in all those rocks that you sorted
through.
The multi-industrial utilization of big data analytics spans from healthcare to finance
to retail. Through their data, companies can make better decisions, become more
efficient, and get a competitive advantage.
How does big data analytics work?
Big Data Analytics is a powerful tool which helps to find the potential of large and
complex datasets. To get better understanding, let’s break it down into key steps:
• Data Collection: Data is the core of Big Data Analytics. It is the gathering of
data from different sources such as the customers’ comments, surveys, sensors,
social media, and so on. The primary aim of data collection is to compile as much
accurate data as possible. The more data, the more insights.
• Data Cleaning (Data Preprocessing): The next step is to process this
information, which often requires some cleaning. This entails replacing
missing data, correcting inaccuracies, and removing duplicates. It is
like sifting through a treasure trove, separating the rocks and debris and leaving
only the valuable gems behind (a short code sketch of this step appears after this list).
• Data Processing: After that comes data processing. This
step involves organizing, structuring, and formatting the data so that it is
usable for analysis. It is like a chef gathering and preparing
the ingredients before cooking. Data processing turns the data into a format suited
for analytics tools to process.
• Data Analysis: Data analysis is done by means of statistical,
mathematical, and machine learning methods to extract the most important
findings from the processed data. For example, it can uncover customer
preferences, market trends, or patterns in healthcare data.
• Data Visualization: The results of data analysis are usually presented in visual form, for
illustration – charts, graphs, and interactive dashboards. These visualizations
simplify large amounts of data and allow decision
makers to quickly detect patterns and trends.
• Data Storage and Management: Storing and managing the analyzed data properly is of
utmost importance. It is like digital scrapbooking: you may want to go
back to those insights in the long run, so how you store them matters a great deal.
Moreover, data protection and adherence to regulations are the key
issues to be addressed during this crucial stage.
• Continuous Learning and Improvement: Big data analytics is a continuous
process of collecting, cleaning, and analyzing data to uncover hidden insights. It
helps businesses make better decisions and gain a competitive edge.
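
The following is a minimal, hedged sketch of the data cleaning and first-pass analysis steps described above. It assumes the pandas library is installed (pip install pandas); the column names and values are made up purely for illustration.
Python

import pandas as pd

# A tiny, hypothetical batch of raw records with a duplicate row and a missing value.
raw = pd.DataFrame({
    "customer_id": [1, 1, 2, 3],
    "amount":      [20.0, 20.0, None, 35.5],
    "channel":     ["web", "web", "mobile", "web"],
})

clean = (
    raw.drop_duplicates()  # remove duplicate records
       .assign(amount=lambda df: df["amount"].fillna(df["amount"].median()))  # fill missing values
)

# A first pass at analysis: average spend per channel.
print(clean.groupby("channel")["amount"].mean())
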
Types of Big Data Analytics
Big Data Analytics comes in many different types, each serving a different purpose:
1. Descriptive Analytics: This type helps us understand past events. In social
media, it shows performance metrics, like the number of likes on a post.
2. Diagnostic Analytics: Diagnostic analytics delves deeper to uncover the
reasons behind past events. In healthcare, it identifies the causes of high patient
re-admissions.
3. Predictive Analytics: Predictive analytics forecasts future events based on past
data. Weather forecasting, for example, predicts tomorrow’s weather by
analyzing historical patterns.
4. Prescriptive Analytics: This category not only predicts results but also
offers recommendations for action to achieve the best results. In e-commerce, it
may suggest the best price for a product to achieve the highest possible profit.
5. Real-time Analytics: The key function of real-time analytics is data processing
in real time. It swiftly allows traders to make decisions based on real-time market
events.
6. Spatial Analytics: Spatial analytics is about location data. In urban
management, it optimizes traffic flow using data from sensors and cameras
to minimize traffic jams.
7. Text Analytics: Text analytics delves into the unstructured data of text. In the
hotel business, it can use the guest reviews to enhance services and guest
satisfaction.
These types of analytics serve different purposes, making data understandable and
actionable. Whether it’s for business, healthcare, or everyday life, Big Data
Analytics provides a range of tools to turn data into valuable insights, supporting
better decision-making.
Big Data Analytics Technologies and Tools
Big Data Analytics relies on various technologies and tools that might sound
complex, let’s simplify them:
• Hadoop: Imagine Hadoop as an enormous digital warehouse. It’s used by
companies like Amazon to store tons of data efficiently. For instance, when
Amazon suggests products you might like, it’s because Hadoop helps manage
your shopping history.
• Spark: Think of Spark as the super-fast data chef. Netflix uses it to quickly
analyze what you watch and recommend your next binge-worthy show.
• NoSQL Databases: NoSQL databases, like MongoDB, are like digital filing
cabinets that Airbnb uses to store your booking details and user data. These
databases are popular because they are quick and flexible, so the platform can
provide you with the right information when you need it.
• Tableau: Tableau is like an artist that turns data into beautiful pictures. The
World Bank uses it to create interactive charts and graphs that help people
understand complex economic data.
• Python and R: Python and R are like magic tools for data scientists, who use
these languages to solve tricky problems. For example, Kaggle competitors use them to
predict things like house prices based on past data.
• Machine Learning Frameworks (e.g., TensorFlow): Machine
learning frameworks are the tools that make predictions. Airbnb uses TensorFlow to
predict which properties are most likely to be booked in certain areas. It helps
hosts make smart decisions about pricing and availability.
These tools and technologies are the building blocks of Big Data Analytics and help
organizations gather, process, understand, and visualize data, making it easier for
them to make decisions based on information (a small Spark-style sketch follows below).
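
As a purely illustrative sketch of the kind of distributed processing Spark performs, here is a tiny PySpark job. It assumes pyspark is installed and a local Spark runtime is available; the file name orders.csv and its columns are hypothetical.
Python

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("toy-analytics").getOrCreate()

# Hypothetical CSV of orders with columns: order_id, product, amount
orders = spark.read.csv("orders.csv", header=True, inferSchema=True)

# Distributed aggregation: total revenue per product, highest first.
revenue = orders.groupBy("product").agg(F.sum("amount").alias("revenue"))
revenue.orderBy(F.desc("revenue")).show(10)

spark.stop()
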
Benefits of Big Data Analytics
Big Data Analytics offers a host of real-world advantages, and let’s understand with
examples:
1. Informed Decisions: Imagine a store like Walmart. Big Data Analytics helps
them make smart choices about what products to stock. This not only reduces
waste but also keeps customers happy and profits high.
2. Enhanced Customer Experiences: Think about Amazon. Big Data Analytics is
what makes those product suggestions so accurate. It’s like having a personal
shopper who knows your taste and helps you find what you want.
3. Fraud Detection: Credit card companies, like MasterCard, use Big Data
Analytics to catch and stop fraudulent transactions. It’s like having a guardian
that watches over your money and keeps it safe.
4. Optimized Logistics: FedEx, for example, uses Big Data Analytics to deliver
your packages faster and with less impact on the environment. It’s like taking the
fastest route to your destination while also being kind to the planet.
Challenges of Big data analytics
While Big Data Analytics offers incredible benefits, it also comes with its set of
challenges:
• Data Overload: Consider Twitter, where approximately 6,000 tweets are posted
every second. The challenge is sifting through this avalanche of data to find
valuable insights.
• Data Quality: If the input data is inaccurate or incomplete, the insights generated
by Big Data Analytics can be flawed. For example, incorrect sensor readings
could lead to wrong conclusions in weather forecasting.
• Privacy Concerns: With the vast amount of personal data used, like in
Facebook’s ad targeting, there’s a fine line between providing personalized
experiences and infringing on privacy.
• Security Risks: With cyber threats increasing, safeguarding sensitive data
becomes crucial. For instance, banks use Big Data Analytics to detect fraudulent
activities, but they must also protect this information from breaches.
• Costs: Implementing and maintaining Big Data Analytics systems can be
expensive. Airlines like Delta use analytics to optimize flight schedules, but they
need to ensure that the benefits outweigh the costs.
Overcoming these challenges is essential to fully harness the power of Big Data
Analytics. Businesses and organizations must tread carefully, ensuring they make
the most of the insights while addressing these obstacles effectively.
Usage of Big Data Analytics
Big Data Analytics has a significant impact in various sectors:
• Healthcare: It aids in precise diagnoses and disease prediction, elevating patient
care.
• Retail: Amazon’s use of Big Data Analytics offers personalized product
recommendations based on your shopping history, creating a more tailored and
enjoyable shopping experience.
• Finance: Credit card companies such as Visa rely on Big Data Analytics to
swiftly identify and prevent fraudulent transactions, ensuring the safety of your
financial assets.
• Transportation: Companies like Uber use Big Data Analytics to optimize
drivers’ routes and predict demand, reducing wait times and improving overall
transportation experiences.
• Agriculture: Farmers make informed decisions, boosting crop yields while
conserving resources.
• Manufacturing: Companies like General Electric (GE) use Big Data Analytics
to predict machinery maintenance needs, reducing downtime and enhancing
operational efficiency.
Conclusion
Big Data Analytics is a game-changer that’s shaping a smarter future. From
improving healthcare and personalizing shopping to securing finances and
predicting demand, it’s transforming various aspects of our lives. However,
Challenges like managing overwhelming data and safeguarding privacy are real
concerns. In our world flooded with data, Big Data Analytics acts as a guiding light.
It helps us make smarter choices, offers personalized experiences, and uncovers
valuable insights. It’s a powerful and stable tool that promises a better and more
efficient future for everyone

Nowadays, social media marketing derives more of its significance from business insights than from
being only a communication tool. Social media has changed greatly since it was first created
to establish connections between individuals. Additionally, social media marketers now have to combine
roles traditionally performed by technicians and business people. The striking numbers are
revealed in Domo's Data Never Sleeps Report 8.0, which also shows how quickly the volume
of data they work with is increasing.

How Is Social Media Being Affected by Big Data?
The extensive use of these big data tactics is clearly demonstrated by the influx of posts,
comments, likes, dislikes, followings, and followers from social media sources, such as the top
3 leaders - Facebook, Youtube, and Instagram. Facebook is not going away, as evidenced by
Statista's estimate that it had 2.38 billion active monthly users in the first quarter of 2019.

Handling these massive amounts of information created every single second is crucial.
Successful firms pay attention to what their consumers say because both positive and negative
comments can affect their ability to attract new customers and maintain their good name.

Big data is essential to marketing analytics' ability to forecast future customer behavior without
exaggeration. Many businesses invest in big data solution technologies to track customers'
experiences in social media in real-time.

Advantages of Using Big Data in Social Media:


Let's take a quick look at the top eight advantages of big data analytics for social media marketing.

1. Channels of communication:

AI strategies enable the processing of data from various channels, particularly when
synchronization and a widely used log-in technology are used. Many business websites
encourage users to sign up using Google or Facebook accounts, allowing marketers to access
data from social media activity, browser history, desktop and mobile applications, cloud
storage, and other sources to learn more about their customers.
2. Real-time communication:

The key to a successful market study is user behavior on social media, such as advertisements
clicked, pages visited and followed, comments left, links saved, and friends added. No other
source can provide a more accurate and current picture of market demand. The most important
thing is to take advantage of the circumstance earlier than competitors because it changes so
quickly.

3. Intended audience:

Similar to other company endeavors, social media marketing aims to boost sales, but there is
no point in marketing meat to vegans. Knowing your intended audience is therefore crucial. The
breadth of ML solutions allows for extracting useful insights from various social network
activities, including millions of photographs, music preferences, locations, and many other
activities.

4. Future forecasts:

Using big data strategy and predictive analytics in the media allows for better decision-making
based on historical data. Data-driven businesses frequently achieve great success because
computers can predict future customer preferences. Even if they evolve over time, habits and
interests generally stay connected. Following a purchase on social media, there is a strong
likelihood that the consumer will select related goods.

5. Security concerns:

Private information is extremely important to customers due to the rise of social media and the
public presentation of personal information, weird as it may sound. Although there is still much
need for improvement in this area, most businesses give security concerns a top priority. Data
vendors, marketers, and business owners must provide data security against leaks to
unauthorized third parties. Different forms of protection are suggested by big data solutions,
such as voice and facial recognition, authorization, check-in notifications, etc.

6. Campaign analysis:

The ups and downs of ROI indicators can be tracked accurately thanks to big data analytics.
Marketers can learn more about a social media campaign's success. Predictive analytics tools
excel when it comes to predicting the goods and services that customers will demand.
Measuring user interactions and responses to online advertisements across various social media
platforms can reveal much about consumer behavior and purchasing habits. Overall, the
success or failure of a campaign can be predicted based on past customer behavior gathered
from social media, historical website data, email subscriptions, and other forms of digital
contact.

7. Affordable costs:

Because so many elements must be considered, pricing decisions can occasionally be difficult.
Typically, they begin with product costs, competitive pressures, market demand, revenue targets,
and currency and inflation levels, and end with the global economic scenario. In order
to fully understand how much your loyal customers are willing to spend on your products, a
solid Big Data strategy on social media should not only involve lavish payments to your
Instagram influencers. It should also involve regular communication with these customers,
perhaps through A/B testing or online surveys. All of this can assist marketers in making more
precise and flexible price adjustments to meet client expectations.

8. Innovation potential:

Through media monitoring, businesses can thoroughly understand their goods and target market
using data science. Tools for social media analytics can be set up to find market-wide
capability gaps. For instance, user input expressing a need for lighter, more relaxed running
shoes helped to propel the minimalist innovation in the market for running shoes. The most
prosperous businesses in recent years have been those that can mine consumer feedback from
social media platforms and use it to reinvent their businesses.

How big data analytics is used in social media

Almost all of us are familiar with social media in 2021. There are many social media platforms
run by different companies in the world. Not to forget, the large number of social media users
that have been added in recent years. With the rise of social media, the amount of data produced
by different platforms is unmatchable. The likes, shares, and comments across social media
platforms contain information regarding user behavior.
This is why business organizations are using big data analytics to make the best use of the data
available on social media platforms. Big data analytics is widely used in social media to shape
marketing strategies and much more. Read on to learn seven ways big data analytics is
used in social media.
• Omnichannel presence
Many business applications and websites have a social media integration. Customers can log
in to a business application using their social media credentials. It helps businesses to collect
customer data from social media platforms and use them to provide better services. You can
get access to social media posts, browser history, and much more. Since customers have an
omnichannel presence, you can collect data from all sources to know more about the
preferences of customers.
• Real-time activity monitoring
Social media is a place where you immediately get to know when someone has liked a post or
shared a product link. Businesses monitor the activity of customers on social media to gauge
their current mood. If a social media user is liking your product posts, you can quickly send
them an email to convert them into a customer. No other platform can reveal customer
preferences in real time the way social media can. Big data is used extensively to collect real-
time activity reports on social media.
• Forecasting
When big data is mixed with modern-day technologies like ML and AI, it can predict customer
preferences. Based on customer habits on social media, AI/ML algorithms predict their
demands. Businesses then focus on releasing new products/services as per the future demands.
For example, when a customer buys something online, the chances of them buying similar
products increase.
• Security
Data vendors cannot legally transfer customer data into the wrong hands. When customers share
data on social media, that data can only be used for legitimate business purposes, and business
organizations must not let it fall into the wrong hands. Big data is also used to enhance
the security of social media platforms themselves, partly based on customer suggestions.

• Campaign monitoring

Marketers run social media campaigns to boost their ROI (Return on Investment). Using big
data, marketers can know how well a social media campaign has performed. Young aspirants
can go for big data training to know more about how to run social media campaigns and study
high-end analytics.

• Product pricing

When a firm launches its product on social media, customers give their valuable opinions.
Social media is widely used to determine whether customers are satisfied with the pricing of a
product or not. Big data training includes data collection from social media channels and how
to analyze them.
• Ad creation

Social media is used to collect info about customer preferences. Based on that info, targeted,
and personalized advertisements are displayed on social media channels. Technologies
like Hadoop programming and Python programming are also used by big data analysts in
social media.

The Importance of Big Data Analytics in Terms of Fraud Prevention:

As online purchase, payment, and money transfer transactions increase, the risks of
fraud that may occur through these transactions also increase. It was very difficult
for companies to process and analyze the huge amount of data that emerged from
these transactions and use it in fraud detection. At this point, we come across an
indispensable facilitating tool: big data analytics for fraud detection. Using big data
analytics in some points of fraud detection provides many advantages.

One of the most important points when detecting fraud is to take action quickly. It
may take a long time to identify the suspicious transactions among the large volume of
irregular data that these transactions generate.
Some transactions may also be wrongly flagged as suspicious as a result
of these long analyses. During this evaluation process, there will still be a need for
people, that is a manual workload, to analyze the data and check for suspicious
transactions and misinterpretations.

To protect the company and customers from harm, it is necessary to draw up rules
based on this data and look at past fraudulent activities, so that we can establish
systems that can prevent possible damages and frauds that may occur.

All these mean more cost, time, and manual work. Big data analytics plays the
biggest helping role in solving these issues. Using data analyzed with techniques in
big data analytics can provide:

• Low costs
• More accurate and precise detections
• Optimized workflows and efficiency of systems
• Better services to customers
In addition, data mining and machine learning techniques built on big data analytics are used
in fraud analytics. These tools enable payment fraud analytics,
financial fraud analytics, and insurance fraud detection analytics.

What are the Common Problems in Big Data Analytics in Fraud Detection?

We mentioned the importance of big data analytics in detecting fraud. Although it


makes it easier to detect fraud, it can also bring some problems with it. Some of these
problems can be listed as:

• Unrelated or Insufficient Data: The data from the transactions may come from
many different sources. In some cases, false results can be obtained in fraud
detection due to insufficient or irrelevant data. Detection can be based on the
inappropriate rules used in the algorithm. Because of this risk of failure, companies
may be hesitant to use big data analytics and machine learning.

• High Costs: Big data analytics and fraud detection systems may cause some costs
such as the cost of software, and hardware systems, the cost of components used for
the sustainability of these systems, and the time spent.

• Dynamic Fraud Methods: As technology develops, fraud methods develop at the
same pace. In order to keep up with this speed and detect fraud, it is necessary to constantly
monitor the data and feed the algorithms with rules derived from new and accurate data
analytics.

• Data Security: While processing the data and making decisions with this data
analytics system, the security of the data itself is also a concern that must be
considered and regularly checked.
Solutions To Big Data Analytics Problems

• It is necessary to filter out unnecessary data by processing the complex data coming from
many channels with appropriate analyses and big data analytics. This organized, prepared
data is then given to the algorithms, which ensure that fraudulent transactions
are detected and quick action is taken.

• Monitoring this data, along with reports and alarms, from a single tool with easy,
visualized dashboards prevents wasted money and time. Even though such a
tool requires an upfront investment, it will provide far more benefit than it costs
in the long run by preventing the fraudulent transactions it detects.

• In conclusion, an engineering system should be established to analyze big data and
to manage and control its analytics. It is also necessary to ensure data security by involving
cyber security experts. Most importantly, it is beneficial to use
software such as Formica, which provides features such as data processing,
analysis, inference, and alerting in the field of fraud within the company and
saves time and effort by helping analysts and engineers.

Fraudulent activities, including e-commerce scams, insurance fraud, cybersecurity threats, and
financing fraud, pose significant risks to both individuals and companies across various
industries such as retail, insurance, banking, and healthcare.

To combat these risks, businesses increasingly adopt advanced fraud prevention technologies
and robust risk management strategies that depend on Big Data. For instance, predictive
analytics models, alternative data sources, and advanced machine learning techniques empower
decision-makers to develop innovative approaches and methodologies to proactively prevent
fraud.

These technologies analyze large volumes of data to identify patterns and anomalies in
transactions that indicate fraudulent behavior, allowing businesses to take proper action.
Understanding big data analytics

Big data analytics involves processing and analyzing large and complex data sets, known as
big data, to extract valuable insights. This field helps in the discovery of trends, patterns, and
correlations within vast amounts of raw data, assisting analysts in making informed decisions.

By using the growing data generated from various sources, such as IoT sensors, social media,
and even financial data from institutions, transactions, and smart devices, organizations can get
actionable insights through advanced analytic techniques.

In response to current challenges, companies are shifting to advanced data analytics techniques
for fraud prevention technologies and risk management strategies that use Big Data.
Techniques like predictive analytics, alternative data, and machine learning are helping create
new ways to prevent fraud.

Applications of big data analytics in fraud detection

Here are a few applications of big data analytics in fraud detection:

Real-time fraud monitoring

One of the main benefits of using big data in fraud detection is the ability to perform real-time
analytics and monitoring. Traditional methods of detecting fraud often depend on past data
analysis, which may not be fast enough to stop advanced fraudsters.

Big data analytics supports instant analysis of transactions, user behavior, and patterns,
allowing organizations to monitor, detect, and respond to potential fraud as it happens.

Pattern recognition

Integrating machine learning algorithms with big data analytics boosts insurance fraud
detection analytics and prevention. These algorithms learn from historical data, identifying
patterns and trends linked to fraudulent activities.
As they continuously evolve, machine learning models become highly effective at predicting
fraudulent activity and preventing fraud before it happens, offering a proactive
defense mechanism.

Anomaly detection

Big data allows advanced behavioral analytics, which involves analyzing user behavior
patterns to identify anomalies. By specifying a baseline of normal user behavior, organizations
can quickly detect deviations that may indicate fraud.

This approach to payment fraud analytics is especially effective in online banking, e-commerce,
and other digital transactions, where abnormal patterns can be easily identified and
investigated.
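The following is a minimal sketch of the baseline-and-deviation idea described above, not a production method: it flags transactions whose amount deviates strongly from a customer's normal behaviour. All numbers and the threshold are made up for illustration.
Python

from statistics import mean, stdev

past_amounts = [42.0, 39.5, 45.0, 41.2, 38.9, 44.3]   # hypothetical history for one customer
baseline, spread = mean(past_amounts), stdev(past_amounts)

def is_anomalous(amount, threshold=3.0):
    """Return True if the amount is more than `threshold` standard deviations from the baseline."""
    return abs(amount - baseline) > threshold * spread

print(is_anomalous(43.0))    # False – within normal behaviour
print(is_anomalous(900.0))   # True  – a strong deviation worth investigating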

Predictive modeling and risk assessment

Predictive models can assist organizations in predicting fraud scenarios and identifying
suspicious activities. These models can include variables such as transaction volume, velocity,
or customer behavior patterns to evaluate the likelihood of fraud.

With these insights, organizations can assess the risk, allocate resources to detect
fraud more effectively, and take proactive steps to prevent fraudulent activities before they
occur.
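As a purely illustrative sketch of such a predictive model, here is a toy risk-scoring example. It assumes scikit-learn is installed (pip install scikit-learn); the features (amount, velocity, new-device flag) and the training data are invented for the example.
Python

from sklearn.linear_model import LogisticRegression

# Each row: [transaction_amount, transactions_last_hour (velocity), is_new_device]
X = [[25, 1, 0], [40, 2, 0], [980, 9, 1], [15, 1, 0], [750, 7, 1], [60, 3, 0]]
y = [0, 0, 1, 0, 1, 0]          # 1 = known fraudulent, 0 = legitimate

model = LogisticRegression().fit(X, y)

# Score an incoming transaction: probability that it is fraudulent.
incoming = [[820, 8, 1]]
print(model.predict_proba(incoming)[0][1])   # a high value would route it for review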

Examples of big data analytics for fraud detection and prevention

To demonstrate how big data analytics improves credit card
fraud detection and prevention in banking and finance, here are some real-world applications:

• PayPal: Using machine learning to analyze billions of transactions, PayPal detects potentially
fraudulent transactions and activities in milliseconds, leading to significant savings and improved
customer satisfaction.

• Mastercard: Using data mining to identify fraud patterns across millions of merchants and
cardholders, Mastercard offers fraud prevention solutions like Mastercard Safety Net and
Mastercard Identity Check.
• HSBC: By integrating and analyzing data from customer profiles, transaction
records, and external databases, HSBC combats money laundering and financial crime using
techniques such as network analysis and entity resolution.

• American Express: Using natural language generation and geospatial analysis, American
Express analyzes customers’ spending habits, preferences, and locations to create personalized
fraud alerts.

Challenges and considerations

Here are some challenges to using big data for fraud detection.

Data quality and integration issues

Ensuring high-quality and reliable data involves techniques such as data cleaning, validation,
and integration. These processes are important to make sure the data used for financial
fraud analytics and detection is accurate and trustworthy. Many organizations face integration
issues when using big data analysis, especially those without sufficiently powerful computing
infrastructure; adequate, up-to-date hardware is needed to keep up with these technologies.

Privacy and security concerns

Maintaining the privacy and security of data is necessary. This requires following regulations
and ethical standards and implementing measures like encryption,
anonymization, and access control to protect sensitive information. Find tools that will help
you maintain privacy and keep you compliant with all applicable rules and regulations.

Regulatory compliance requirements

Organizations must adhere to various regulatory requirements to maintain correct data
handling. They must also keep themselves current on applicable laws and standards and build
compliance measures into data management practices.

Benefits of big data analytics for fraud detection

Here are some of the benefits of big data analysis for fraud detection:
Improved accuracy and efficiency in fraud detection

Big data analytics improves the accuracy and efficiency of fraud detection by processing vast
amounts of data quickly. This capability of big data fraud detection allows organizations to
mine call data records and identify fraudulent activities more accurately and efficiently.

Reduced false positives and false negatives

By using advanced analytic techniques, big data analytics reduces false positives for legitimate
transactions and false negatives for fraudulent transactions. This guarantees that genuine
transactions are not mistakenly flagged as fraud and that actual fraudulent activities are
accurately detected.

Faster response times and better decision-making

The real-time processing capabilities of big data analytics allow faster response times to
potential fraud. This quick data analysis also supports better decision-making as fraud
tactics evolve, allowing organizations to act accordingly to prevent fraud.

Easier compliance with regulatory standards

Big data analytics helps organizations comply with regulatory standards by providing robust
data management and data security measures. It assures that data handling practices meet
regulatory requirements, reducing the risk of non-compliance.

Best practices for big data analytics in fraud detection and prevention

To maximize the benefits of fraud data analytics and address the challenges of big data
analytics for fraud detection and prevention in banking and finance, consider the following best
practices:

• Define clear objectives and metrics: Set targets people can work towards. Your fraud data
analytics will then be specific to helping achieve these goals, rather than general and
perhaps less relevant.
• Implement a robust data governance framework: Develop a comprehensive data
governance framework that includes policies, processes, roles, and responsibilities for
maintaining data quality, privacy, security, and compliance. This framework should be flexible
enough to adapt to evolving regulations and business needs.

• Adopt an integrated approach to data management: This approach uses various data
sources, types, formats, and technologies, allowing for more comprehensive and accurate fraud
detection capabilities.

• Apply a combination of analytical methods: Use a combination of testing and confirming
analytical methods to find patterns that balance discovery and validation. This approach allows
you to find new fraud patterns while confirming the validity of known patterns, balancing
complexity with simplicity.

• Implement a continuous improvement process: Set up an iterative process of testing, learning,
and improving fraud detection. This continuous cycle helps you adapt to the ever-changing
nature and dynamics of fraud, assuring your fraud detection solutions remain effective.

By following these best practices, organizations can improve their ability to detect and prevent
fraud, using Big Data and advanced analytics techniques to stay ahead of evolving threats.

Conclusion

Throughout this article, we’ve explored the role of big data analytics in fraud detection
and prevention. We’ve discussed how various techniques, such as real-time fraud
monitoring, pattern recognition, anomaly detection, and predictive modeling, can
significantly improve the accuracy, efficiency, and effectiveness of fraud detection
efforts.

Real-world examples from industry leaders like PayPal, Mastercard, HSBC, and American
Express have shown the practical applications, potential threats, and benefits of using Big Data
analytics tools in combating fraudulent activities.

The future of big data analytics in fraud detection and prevention looks promising, with
continuous advancements in machine learning, AI, and data processing technologies. These
developments will further refine and improve the capabilities of fraud detection software,
making them more adaptive and robust.

How a Big Data Strategy Can Fight Insurance Fraud
Find out how financial institutions are combating fraud, and identify its behavioral
patterns through Data Science

Insurers have understood that they need a Big Data strategy
for various purposes. Not all of them, however, already use tools to detect fraud.

The fight against banking and insurance crimes is a daily challenge for financial
institutions around the world.

Fraud comes in many forms, from credit card scams to fake bank slips, data theft on
fake websites, and irregular purchases. During the pandemic, fraud increased by 70%.
The growth of fraud attempts has led banks and insurers to invest in anti-fraud
technologies, but fraudsters are getting smarter.

For all institutions, the sophistication of this type of attack is a problem that needs
to be solved efficiently.

Winning the war on fraud requires companies to outsmart criminals. The good news
is that technology can help.

Thanks to Data Science, it’s now possible to improve fraud management in real-
time, with more effective results and increased customer satisfaction. With data
processing and analysis, Big Data, Artificial Intelligence, and Machine Learning, we
can identify new attack patterns quickly.

Continue reading and understand how a Data Science strategy can help insurers
avoid headaches and financial damage!

Fraud Fighting Challenges

According to a survey, the most prominent challenges institutions face in the fight
against fraud are directly linked to the digital transformation that the banking and
insurance sector has undergone.
The increase in the use of digital channels during 2020 has expanded the scale of
fraud. Right now, banks and startups are opening accounts through apps. Not
installing mechanisms to combat fraud from account inception could put future
operations at risk.

These problems are not unique to the banking and insurance industry, but they are of
particular concern because fraudsters heavily target it.

The two main components in fighting fraud are detection and prevention.

Fraud detection refers to the ability to detect fraudulent events, recognize patterns,
and identify if fraud has occurred.

Prevention, which is much more complicated, seeks to analyze and predict fraudulent
events before they occur.
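
To make that distinction concrete, the short Python sketch below illustrates the predictive (prevention) side: a classifier trained on labeled historical events scores a new event before it is approved. The features, data, and the 0.8 review threshold are invented for illustration.

# Minimal prevention sketch: score an event before approval.
# All features, labels, and the 0.8 threshold are illustrative assumptions.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(7)
# Historical events: [amount, new_device, country_mismatch]; label 1 = confirmed fraud.
X_hist = rng.random((500, 3))
y_hist = (0.6 * X_hist[:, 0] + 0.3 * X_hist[:, 1] + 0.4 * X_hist[:, 2]
          + 0.2 * rng.random(500) > 0.8).astype(int)

clf = LogisticRegression().fit(X_hist, y_hist)

incoming = np.array([[0.95, 1.0, 0.9]])        # a new, not-yet-approved event
p_fraud = clf.predict_proba(incoming)[0, 1]
decision = "hold for review" if p_fraud > 0.8 else "approve"
print(f"fraud probability = {p_fraud:.2f} -> {decision}")

Detection, by contrast, typically runs the same kind of scoring retrospectively over events that have already completed, looking for fraud that slipped through.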

The most common moments where fraud occurs are:

• Issuing a credit card
• Financing electronics
• Buying a cell phone
• Opening a bank account
• Buying a car
• Starting a business

The main concerns are related to:

1. Data theft: Institutions are more prone to crimes based on stolen identity, and
customers are more prone to scams, as personal data is used to gain company/client
trust.

2. Faster Payment Processing: Shortening the time it takes to process payments poses
the challenge of real-time prevention, which requires well-protected systems and
automation.

3. Open banking: Data accessibility requires robust security mechanisms, such as
identity verification, interconnected between various institutions.

4. Increase in digital channels: Being present on multiple channels makes the fraud
prevention strategy and policy more complex. Cohesion is needed.

5. Social engineering: Scams in which customers are manipulated into voluntarily making
payments or transfers to fraudsters. They are notoriously difficult to detect.

Lack of protection also brings risks to institutions, which may become legally liable
for customer losses.
Furthermore, a lack of security damages credibility and operations as a whole.
Investing in prevention solutions prevents losses and criminal liability, in addition
to improving the institution’s image.

Fraud Fighting Strategies and Standards

Prevention is the key to strategies against banking scams, but it is not enough on its
own. The strategy also needs tools to predict, detect, and respond to threats.

So we’re talking about a strategy that integrates data science across the institution,
from tools to people, and from governance to culture.

Yes, technology is an excellent ally in ensuring that fraud monitoring is proactive
rather than reactive, allowing banking institutions to identify and anticipate
fraudulent actions before they generate losses.

Imagine that the bank wants to start a relationship with a company or individual: how
do you prevent fraud?

The first step is to carry out extensive research on the history of that institution or
potential client, understanding their behavior.

The good news is that this process can be fully automated and executed quickly.

With a simple search on a sophisticated Big Data platform, it is possible to gather
relevant information and make decisions based on that data.

The benefits of this data sweep are clear. From the data consolidated in a single
report, prepared with combined criteria, managers can understand the consumer’s
profile before closing the deal, validating the registration and identifying possible
risk factors.
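
As a hedged illustration of what such an automated data sweep might look like, the sketch below merges records from a few assumed sources into a single consolidated profile and scores it against combined criteria. The source names, fields, and weights are hypothetical.

# Hypothetical onboarding sketch: consolidate records from assumed sources
# into one risk profile; fields and weights are illustrative only.
import pandas as pd

credit_bureau = pd.DataFrame({"customer_id": [1, 2], "defaults": [0, 3]})
internal_history = pd.DataFrame({"customer_id": [1, 2], "chargebacks": [1, 4]})
watchlists = pd.DataFrame({"customer_id": [2], "on_watchlist": [True]})

profile = (credit_bureau
           .merge(internal_history, on="customer_id", how="outer")
           .merge(watchlists, on="customer_id", how="left")
           .fillna({"on_watchlist": False}))

# Combined criteria -> a single consolidated report per prospect.
profile["risk_score"] = (profile["defaults"] * 2
                         + profile["chargebacks"]
                         + profile["on_watchlist"].astype(int) * 5)
profile["decision"] = profile["risk_score"].apply(
    lambda s: "escalate" if s >= 8 else "proceed")
print(profile)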

Another preventive measure available to companies is the definition of interest groups
that are frequently monitored, with alerts triggered in case of suspicious actions.
There are also more advanced anti-fraud mechanisms, which we call enhanced intelligence.

In this case, an extra layer of technology is added to existing solutions to increase
the power of analytics for decisions drawn from the data.

Personal documents offered as part of the validation process undergo rigorous
verification procedures, including the use of facial recognition as proof of life.

Fraud Fighting Case Study

Denmark’s largest bank has a great example of how Artificial Intelligence and
Machine Learning can provide excellent results in fraud detection.

The institution adopted a set of technologies to create and launch a fraud detection
platform based on Artificial Intelligence. The solution uses Machine Learning to
analyze tens of thousands of features, monitoring millions of banking transactions
online in real-time to provide insight that differentiates honest activities from
fraudulent ones.

The Danish bank’s anti-fraud program is the first to put Machine Learning
techniques into production while also developing deep learning models to test out
strategies.

The team began work within the bank’s existing infrastructure and then created
advanced Machine Learning models to detect fraud in millions of transactions per
year and at peak hours.

To ensure transparency and encourage trust, the mechanism includes an interpretation
layer on top of the Machine Learning models, explaining blocked activity.
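
The case study does not describe how that interpretation layer works internally. One simple way such a layer is often approximated is to report the per-feature contributions of a linear model for each blocked transaction, as in the hypothetical sketch below; the features, data, and model are assumptions, not the bank’s actual system.

# Illustrative interpretation layer: explain a blocked transaction by the
# per-feature contributions of a linear model (assumed, not the bank's system).
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
features = ["amount_zscore", "new_beneficiary", "night_time"]
X = rng.random((400, 3))
y = (X @ np.array([1.5, 1.0, 0.8]) + 0.3 * rng.random(400) > 1.8).astype(int)

model = LogisticRegression().fit(X, y)

blocked_tx = np.array([0.9, 1.0, 0.2])          # the transaction that was blocked
contributions = model.coef_[0] * blocked_tx      # linear contribution per feature
for name, value in sorted(zip(features, contributions), key=lambda t: -t[1]):
    print(f"{name:16s} contributed {value:+.2f} to the block decision")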

The fact is, every bank needs a scalable and robust analytics platform and a roadmap
and digitization strategy to bring data science into the organization.

With so many online transactions, credit cards, and mobile payments, banks demand
real-time solutions to detect fraud efficiently.

AI helps uncover data ‘anomalies’ through transaction analysis and identifies
fraudulent operations through data and user behavior. Machine Learning contributes
its predictive capacity thanks to current technological capabilities. Rapid machine
learning ‘disarms’ criminals, preventing financial theft in real-time.
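
One simple way to sketch this kind of real-time behavioral check is to compare each new transaction against a running profile of the user’s own history and flag large deviations. The example below uses an online mean/variance (Welford’s method) with invented amounts and an assumed z-score threshold.

# Streaming sketch: flag a transaction when it deviates strongly from the
# user's own running profile (Welford-style online mean/variance).
# Amounts and the z-score threshold are illustrative assumptions.
import math
from collections import defaultdict

profiles = defaultdict(lambda: {"n": 0, "mean": 0.0, "m2": 0.0})

def score(user, amount, z_threshold=4.0):
    p = profiles[user]
    if p["n"] >= 5:                                   # need some history first
        std = math.sqrt(p["m2"] / (p["n"] - 1))
        z = (amount - p["mean"]) / std if std > 0 else 0.0
        if z > z_threshold:
            print(f"ALERT user={user} amount={amount} z={z:.1f}")
    # online update of the user's profile
    p["n"] += 1
    delta = amount - p["mean"]
    p["mean"] += delta / p["n"]
    p["m2"] += delta * (amount - p["mean"])

for amount in [40, 55, 38, 60, 47, 52, 45, 2500]:     # the last one should alert
    score("user-123", amount)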

This entire process takes place in a matter of minutes, sometimes seconds.

Soon after, fraudsters develop new fraud patterns. In other words, these are short
windows of action and learning for ML/AI to handle.

The Fraud Prevention Cycle: Continuous Improvement in Defense

Data processing is at the heart of the project! Gathering, storing, structuring, and
cross-checking information is the best way to detect fraud efficiently. The analysis
of fraudulent behavior is crucial to the definition of a propensity indicator.
This acts as an irregularity alert to interrupt the payment process and deepen the
claim analysis. For this, it is essential that managers carefully look into the
monitoring and detection of threats.

As this is a continuous cycle, actions must be constant, organized, and closely
monitored.

The efficient work of fraud prevention depends on the team’s analytical capacity.

Below, we have a step-by-step guide for creating a fraud prevention cycle.

DATA

To identify patterns of fraudulent behavior, companies need to process datasets –
often unstructured ones.

From Data Science, it is possible to identify fraud-prone behaviors. After processing,
this sea of data is organized for visualization and understanding, bringing sets of
information to life in dashboards.
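
As a small, hedged example of this DATA step, the snippet below parses hypothetical free-text claim log lines into a structured table that a dashboard could consume; the log format, fields, and aggregation are invented for illustration.

# Hypothetical DATA step: turn free-text claim logs into a structured table
# suitable for dashboards. The log format and fields are invented.
import re
import pandas as pd

raw_logs = [
    "2024-03-01 claim=C-101 amount=1200.50 region=SP channel=app",
    "2024-03-01 claim=C-102 amount=98000.00 region=RJ channel=branch",
    "2024-03-02 claim=C-103 amount=410.00 region=SP channel=web",
]

pattern = re.compile(
    r"(?P<date>\S+) claim=(?P<claim>\S+) amount=(?P<amount>\S+) "
    r"region=(?P<region>\S+) channel=(?P<channel>\S+)")

records = [m.groupdict() for line in raw_logs if (m := pattern.match(line))]
df = pd.DataFrame(records)
df["amount"] = df["amount"].astype(float)

# Aggregations a fraud dashboard might display.
print(df.groupby("region")["amount"].agg(["count", "sum"]))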

ANALYSIS

This is where you find out if there is fraud being committed at that time and
understand the path pursued by the scammers and their strategies. Here, the
organization and dashboard visualization takes place only with the information
necessary for the fraud-fighting processes.

Upon finding the pattern defined by the propensity indicator, the team performs a more
accurate analysis in search of irregularities that prove fraud. Digital solutions can
provide detailed, real-time information for diagnoses that lead to more informed decisions.

CORRECTION

The third stage of the fraud prevention cycle comes into play when the previously
taken steps are insufficient to prevent fraudulent attacks.

In addition to reviewing the security techniques applied to prevent the recurrence of
cases, it is essential to check the entire preventive process, accumulate lessons
learned, and reinforce the need for policies to combat fraud in institutions.

Combating fraud deserves your attention

The message for banks and insurance companies is: invest in analytics and data
technologies.
Even within a sector continuously developing ‘State of the Art’ solutions to financial
crimes, a more focused look is needed for data analysis and monitoring. Advanced
analytics models facilitate this process through the use of detailed customer
information.

The challenge is to do this work without compromising the quality of the customer
experience, which is at the heart of the strategy, for customers who are increasingly
demanding and expect intuitive, responsive, and secure solutions.

Invest in combining solutions with multiple layers of defense. Not least because, as
markets become more mature from a digital perspective, threats gain new levels of
complexity. Fraud has been and will increasingly become a digital arms race.

