Hadoop Questions

Zookeeper is a distributed coordination service that helps manage and synchronize distributed applications through a hierarchical data structure called znodes. It provides essential features such as leader election, failover, and recovery, making it a critical component in systems like Hadoop and Kafka. Data virtualization is another technology that allows for real-time access to data from multiple sources without physical relocation, enhancing data integration and accessibility.


Define Zookeeper

Zookeeper is a distributed, open-source coordination service for
distributed applications. It exposes a simple set of primitives that can
be used to implement higher-level services for synchronization,
configuration maintenance, and group membership and naming.
In a distributed system, there are multiple nodes or machines that
need to communicate with each other and coordinate their actions.
ZooKeeper provides a way to ensure that these nodes are aware of
each other and can coordinate their actions. It does this by
maintaining a hierarchical tree of data nodes called “Znodes”,
which can be used to store and retrieve data and maintain state
information. ZooKeeper provides a set of primitives, such as locks,
barriers, and queues, that can be used to coordinate the actions of
nodes in a distributed system. It also provides features such as
leader election, failover, and recovery, which can help ensure that
the system is resilient to failures. ZooKeeper is widely used in
distributed systems such as Hadoop, Kafka, and HBase, and it has
become an essential component of many distributed applications.
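For illustration, here is a minimal sketch using the standard Apache ZooKeeper Java client. The connection string and the znode path are placeholders, not anything mandated by ZooKeeper itself.

import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class ZnodeBasics {
    public static void main(String[] args) throws Exception {
        // Connect to a ZooKeeper ensemble (placeholder address), 5-second session timeout.
        ZooKeeper zk = new ZooKeeper("localhost:2181", 5000, event -> {});

        // Create a persistent znode holding a small piece of configuration data.
        if (zk.exists("/app-config", false) == null) {
            zk.create("/app-config", "v1".getBytes(),
                      ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
        }

        // Read the data back; the Stat argument is null because we do not need metadata here.
        byte[] data = zk.getData("/app-config", false, null);
        System.out.println(new String(data));

        zk.close();
    }
}

The same handful of calls (create, exists, getData, delete) is what higher-level recipes such as locks and leader election are built from.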

Why do we need it?

 Coordination services: The integration/communication of
services in a distributed environment.
 Coordination services are complex to get right. They are
especially prone to errors such as race conditions and
deadlocks.
 Race condition – Two or more systems try to perform the
same task at the same time.
 Deadlock – Two or more operations wait for each other
indefinitely.
 To make coordination across distributed environments
easy, developers came up with ZooKeeper, which relieves
distributed applications of the responsibility of
implementing coordination services from scratch.

Apache Zookeeper

Apache Zookeeper is a distributed, open-source coordination


service for distributed systems. It provides a central place for
distributed applications to store data, communicate with one
another, and coordinate activities. Zookeeper is used in distributed
systems to coordinate distributed processes and services. It
provides a simple, tree-structured data model, a simple API, and a
distributed protocol to ensure data consistency and availability.
Zookeeper is designed to be highly reliable and fault-tolerant, and
it can handle high levels of read and write throughput.
Zookeeper is implemented in Java and is widely used in distributed
systems, particularly in the Hadoop ecosystem. It is an Apache
Software Foundation project and is released under the Apache
License 2.0.

Architecture of Zookeeper

The ZooKeeper architecture consists of a hierarchy of nodes called


znodes, organized in a tree-like structure. Each znode can store
data and has a set of permissions that control access to the znode.
The znodes are organized in a hierarchical namespace, similar to a
file system. At the root of the hierarchy is the root znode, and all
other znodes are children of the root znode. The hierarchy is similar
to a file system hierarchy, where each znode can have children and
grandchildren, and so on.

Important Components in Zookeeper


ZooKeeper Services

 Leader & Follower
 Request Processor – Active in the Leader node and
responsible for processing write requests. After processing,
it sends the changes to the Follower nodes.
 Atomic Broadcast – Present in both the Leader node and
Follower nodes. It is responsible for sending the changes to
the other nodes.
 In-memory Databases (Replicated Databases) – Responsible
for storing the data in ZooKeeper. Every node contains its
own database. Data is also written to the file system,
providing recoverability in case of any problems with the
cluster.
Other Components
 Client – One of the nodes in our distributed application
cluster. It accesses information from the server, and every
client sends a heartbeat message to the server to let the
server know that it is alive.
 Server – Provides all the services to the client and gives
acknowledgments to the client.
 Ensemble – A group of ZooKeeper servers. The minimum
number of nodes required to form an ensemble is 3.

Zookeeper Data Model


ZooKeeper data model

In Zookeeper, data is stored in a hierarchical namespace, similar to


a file system. Each node in the namespace is called a Znode, and it
can store data and have children. Znodes are similar to files and
directories in a file system. Zookeeper provides a simple API for
creating, reading, writing, and deleting Znodes. It also provides
mechanisms for detecting changes to the data stored in Znodes,
such as watches and triggers. Znodes maintain a stat structure that
includes: Version number, ACL, Timestamp, Data Length
Types of Znodes (see the sketch that follows):
 Persistent: Alive until they’re explicitly deleted.
 Ephemeral: Alive only as long as the client session that created them is alive.
 Sequential: Can be either persistent or ephemeral; ZooKeeper appends a monotonically increasing counter to the znode name.
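The three flavours map onto CreateMode values in the Java client. A minimal sketch with placeholder paths follows; a sequential znode is obtained by combining PERSISTENT_SEQUENTIAL (or EPHEMERAL_SEQUENTIAL) with a name prefix.

import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class ZnodeTypes {
    public static void main(String[] args) throws Exception {
        ZooKeeper zk = new ZooKeeper("localhost:2181", 5000, event -> {});

        // Persistent: survives until it is explicitly deleted.
        if (zk.exists("/workers", false) == null) {
            zk.create("/workers", new byte[0],
                    ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
        }

        // Ephemeral: removed automatically when this client's session ends,
        // which is why ephemeral znodes are used for liveness/membership.
        zk.create("/workers/worker-1", new byte[0],
                ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);

        // Sequential (combined here with persistent): ZooKeeper appends a
        // monotonically increasing counter, e.g. /workers/task-0000000002.
        String path = zk.create("/workers/task-", new byte[0],
                ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT_SEQUENTIAL);
        System.out.println("Created " + path);

        zk.close();
    }
}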

How Does ZooKeeper Work in Hadoop?

ZooKeeper behaves much like a distributed file system and exposes a
simple set of APIs that enable clients to read and write data to it. It
stores its data in a tree-like structure of znodes, each of which can be
thought of as a file or a directory in a traditional file
system. ZooKeeper uses a consensus algorithm to ensure that all of
its servers have a consistent view of the data stored in the znodes.
This means that if a client writes data to a znode, that data will be
replicated to all of the other servers in the ZooKeeper ensemble.
One important feature of ZooKeeper is its ability to support the
notion of a “watch.” A watch allows a client to register for
notifications when the data stored in a znode changes. This can be
useful for monitoring changes to the data stored in ZooKeeper and
reacting to those changes in a distributed system.
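A sketch of how a client might register such a watch through the Java API follows; the /app-config path is the same placeholder used earlier, and because watches are one-shot, the callback re-registers itself.

import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooKeeper;

public class ConfigWatcher implements Watcher {
    private final ZooKeeper zk;

    public ConfigWatcher(ZooKeeper zk) { this.zk = zk; }

    public void watchConfig() throws Exception {
        // Passing "this" as the watcher registers a one-time watch on the znode's data.
        byte[] data = zk.getData("/app-config", this, null);
        System.out.println("Current config: " + new String(data));
    }

    @Override
    public void process(WatchedEvent event) {
        if (event.getType() == Event.EventType.NodeDataChanged) {
            try {
                // Watches fire only once, so re-register by reading the data again.
                watchConfig();
            } catch (Exception e) {
                e.printStackTrace();
            }
        }
    }
}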
In Hadoop, ZooKeeper is used for a variety of purposes, including:
 Storing configuration information: ZooKeeper is used to
store configuration information that is shared by multiple
Hadoop components. For example, it might be used to
store the locations of NameNodes in a Hadoop cluster or
the addresses of JobTracker nodes.
 Providing distributed synchronization: ZooKeeper is used to
coordinate the activities of various Hadoop components
and ensure that they are working together in a consistent
manner. For example, it might be used to ensure that only
one NameNode is active at a time in a Hadoop cluster.
 Maintaining naming: ZooKeeper is used to maintain a
centralized naming service for Hadoop components. This
can be useful for identifying and locating resources in a
distributed system.
ZooKeeper is an essential component of Hadoop and plays a crucial
role in coordinating the activity of its various subcomponents.

Data Virtualization

The foundation of data virtualization technology is the execution of
distributed data management processes, mostly for queries, against
numerous heterogeneous data sources, and the federation of query
results into virtual views. Applications, query/reporting tools,
message-oriented middleware, or other parts of the data
management infrastructure then consume these virtual views.
Instead of performing data movement and physically storing
integrated views in a destination data structure, data virtualization
can be utilized to construct virtualized and integrated views of data
in memory. To make querying logic simpler, it provides an
abstraction layer over the actual physical implementation of data.
It is a method for combining data from various sources and different
types into a comprehensive, logical representation without
physically relocating the data. Simply put, users can theoretically
access and examine data while it still exists in its original sources
thanks to specialized middleware.

Features of Data Virtualization

 Time-to-market acceleration from data to final
product:- Virtual data objects that already include integrated
data can be created considerably more quickly than with
traditional ETL tools and databases. Customers can now
get the information they require more easily.
 One-Stop Security:- The contemporary data architecture
makes it feasible to access data from a single location. Data
can be secured down to the row and column level thanks to
the virtual layer that grants access to all organizational
data. Authorizing numerous user groups on the same virtual
dataset is feasible by using data masking, anonymization,
and pseudonymization.
 Combine data explicitly from different sources:- The
virtual data layer makes it simple to incorporate distributed
data from Data Warehouses, Big Data Platforms, Data lakes,
Cloud Solutions, and Machine Learning into user-required
data objects.
 Flexibility:- It is feasible to react quickly to new advances
in various sectors thanks to data virtualization. This is up to
ten times faster than conventional ETL and data
warehousing methods. By providing integrated virtual data
objects, data virtualization enables you to reply instantly to
fresh data requests. This does away with the necessity to
copy data to various data levels but just makes it virtually
accessible.

Layers of Data Virtualization

The following are the working layers in a data virtualization architecture.


 Connection layer: With the use of connectors and
communication protocols, this layer is in charge of accessing
the data dispersed across numerous source systems that
contain both organized and unstructured data. Platforms for
data virtualization can connect to various data sources, such
as SQL and NoSQL databases like MySQL, Oracle and
MongoDB etc.
 Abstraction layer: The abstraction layer, also known as
the virtual or semantic layer, serves as a link between all
data sources and all business users, forming the backbone
of the entire virtualization system. This tier just holds the
logical views and information required to access the
sources; it does not itself store any data. The complexity of
the underlying data structures is hidden from end users,
who only see the schematic data models thanks to the
abstraction layer.
 Consumption layer: This tier of the data virtualization
architecture offers a single point of access to the data
stored in the underlying sources. Depending on the
type of consumer, different protocols and connectors are
used to deliver the abstracted data representations.
Consumers can interface with the virtual layer using SQL and
a variety of APIs, such as REST and SOAP, as well as access
standards like JDBC and ODBC (see the sketch below). A
variety of corporate users, tools, and apps can access data
virtualization software, including well-known tools like
Tableau, Cognos, and Power BI.
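As a purely hypothetical illustration of the consumption layer, the sketch below queries a virtual view over plain JDBC. The driver URL, credentials, and the customer_360 view name are invented placeholders; an actual data virtualization product ships its own JDBC driver and connection details.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class VirtualViewQuery {
    public static void main(String[] args) throws Exception {
        String url = "jdbc:virtualization://dv-server:9999/sales"; // placeholder URL
        try (Connection conn = DriverManager.getConnection(url, "analyst", "secret");
             Statement stmt = conn.createStatement();
             // "customer_360" is a hypothetical virtual view that federates a CRM
             // database, a data lake, and a SaaS source behind one logical schema.
             ResultSet rs = stmt.executeQuery(
                     "SELECT customer_id, total_spend FROM customer_360 WHERE region = 'EU'")) {
            while (rs.next()) {
                System.out.println(rs.getString("customer_id") + " -> " + rs.getDouble("total_spend"));
            }
        }
    }
}

From the consumer's point of view this is an ordinary SQL query; the federation across sources happens inside the virtualization layer.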

Applications of Data Virtualization

 Migration: Think of a scenario where you migrate a CRM


system to the cloud from a traditional one. Or a gradual
migration of old systems to the cloud. You can accomplish
this with data virtualization without halting operations or
reporting.
 Uses In Operations:- For call centres or customer support
systems, data silos are a big source of misery that have
lasted for a very long time. A bank would, for instance,
choose one call centre for credit cards and another for home
loans. Data virtualization that spans data silos enables
everyone, from a call centre to a database manager, to see
the full range of data repositories from a single point of
access.
 Agile BI:- With data virtualization, you can use your data
for data science, API or system linkages, governed
(regulated), and self-service BI. Additionally, it’s perfect for
“agile” BI, which involves developing dashboards and
reports in incredibly fast iterations that include testing,
piloting, and production. Would you wish to add new sources
to your current BI stream by connecting SaaS cloud services
like Salesforce or Google Analytics? You may! You may
combine all of your data with data virtualization, even in a
hybrid environment. Additionally, you don’t need to worry
about security because it is highly centralised.
 Data Integration:- This is the most likely situation you will
encounter because practically every company contains data
from multiple different data sources. Connecting an
antiquated client/server-based data source with modern
digital platforms like social media is required for that.
After connecting through methods like Java DAO, ODBC, SOAP, or
other APIs, you use the data catalogue to search your data.
Constructing these connections can still be the most difficult
part, even with data virtualization.
 Accessing Real-Time Data:- Are your SLA agreements
under pressure and a source system not performing
adequately in terms of (near) real-time accessibility to
massive amounts of data? Data virtualization allows you to
blend real-time data from the source system with historical
data that has been “offloaded” to a different source. You can
prevent overtaxing your source systems by optimising your
caching or conducting more intelligent system queries.
Without initially copying every type of data with ETL
operations, even near real-time analytics on huge data are
feasible. Additionally, it is simple to create a virtual data
mart by combining an outdated data warehouse with a fresh
data source.

Advantages of Data Virtualization

 Data virtualization enables real-time access to and


manipulation of source data through the virtual/logical layer
without physically relocating the data to a new location. ETL
is typically not required.
 Comparing the implementation of data virtualization to the
construction of a separate consolidated store, the former
takes less funding and resources.
 There is no need to relocate the material, and access levels
may be controlled.
 Without worrying about a data type or where the data is
located, users can build and execute whatever reports and
analyses they require.
 Through a single virtual layer, all corporate data is
accessible to all consumers and use cases.
Top 5 Uses of MapReduce
By spreading processing across numerous nodes and then
merging or reducing the results from those nodes, MapReduce
can handle very large data volumes. This makes it
suitable for the following use cases:

Uses of MapReduce

1. Entertainment
Hadoop MapReduce assists end users in finding the most
popular movies based on their preferences and previous
viewing history. It primarily concentrates on their clicks and
logs.
Various OTT services, including Netflix, regularly release many
web series and movies. It may have happened to you that you
couldn’t pick which movie to watch, so you looked at Netflix’s
recommendations and decided to watch one of the suggested
series or films. Netflix uses Hadoop and MapReduce to indicate
to the user some well-known movies based on what they have
watched and which movies they enjoy. MapReduce can
examine user clicks and logs to learn how they watch movies.

2. E-commerce
Several e-commerce companies, including Flipkart, Amazon,
and eBay, employ MapReduce to evaluate consumer buying
patterns based on customers’ interests or historical purchasing
patterns. For various e-commerce businesses, it provides
product suggestion methods by analyzing data, purchase
history, and user interaction logs.
Many e-commerce vendors use the MapReduce programming
model to identify popular products based on customer
preferences or purchasing behavior. Making item proposals for
e-commerce inventory is part of it, as is looking at website
records, purchase histories, user interaction logs, etc., for
product recommendations.

3. Social media
Nearly 500 million tweets, or roughly 6,000 per second, are sent
daily on the microblogging platform Twitter. MapReduce
processes Twitter data, performing operations such as
tokenization, filtering, counting, and aggregating counters, as
sketched in the example after this list.
 Tokenization: It creates key-value pairs from the
tokenized tweets by mapping the tweets as maps of
tokens.
 Filtering: The terms that are not wanted are
removed from the token maps.
 Counting: A counter is generated for each tokenized
word.
 Aggregate counters: Comparable counter values are
grouped and combined into small, manageable
aggregates.
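A minimal sketch of this tokenize/filter/count pipeline written against the standard Hadoop MapReduce API; the stop-word list and the input/output paths are illustrative only.

import java.io.IOException;
import java.util.Set;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class TweetTokenCount {

    // Map: tokenize each tweet and emit (token, 1), filtering unwanted terms.
    public static class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final Set<String> STOP_WORDS = Set.of("the", "a", "rt"); // illustrative
        private final static IntWritable ONE = new IntWritable(1);
        private final Text token = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            for (String word : value.toString().toLowerCase().split("\\W+")) {
                if (!word.isEmpty() && !STOP_WORDS.contains(word)) {
                    token.set(word);
                    context.write(token, ONE);
                }
            }
        }
    }

    // Reduce: aggregate the per-token counters into a total per token.
    public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) sum += v.get();
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "tweet token count");
        job.setJarByClass(TweetTokenCount.class);
        job.setMapperClass(TokenMapper.class);
        job.setCombinerClass(SumReducer.class);   // combiner performs local aggregation
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}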

4. Data warehouse
Systems that handle enormous volumes of information are
known as data warehouse systems. The star schema, which
consists of a fact table and several dimension tables, is the
most popular data warehouse model. In a shared-nothing
architecture, storing all the necessary data on a single node is
impossible, so retrieving data from other nodes is essential.
This results in network congestion and slow query execution
speeds. If the dimensions are not too big, users can replicate
them over nodes to get around this issue and maximize
parallelism. Using MapReduce, we may build specialized
business logic for data insights while analyzing enormous data
volumes in data warehouses.

5. Fraud detection
Conventional methods of preventing fraud are not always very
effective. For instance, data analysts typically manage
inaccurate payments by auditing a tiny sample of claims and
requesting medical records from specific submitters. Hadoop is
a system well suited for handling large volumes of data needed
to create fraud detection algorithms. Financial businesses,
including banks, insurance companies, and payment locations,
use Hadoop and MapReduce for fraud detection, pattern
recognition evidence, and business analytics through
transaction analysis.
CAP Theorem
Introduction

We have had significant advances in distributed databases to handle the


proliferation of data. This has allowed us to handle increased traffic with lower
latency, allowed an easier expansion of the database system, provided better fault
tolerance in terms of replication, and so much more.

NoSQL databases have undoubtedly been at the forefront of this shift in the
domain of distributed databases, and they provide a lot of flexibility in how you
handle your data. Plus, there is a plethora of them out there!

However, the real question is which one to use? The answer to this question lies
not only in the properties of these databases but also in understanding a
fundamental theorem. A theorem that has gained renewed attention since the
advent of such databases in the realm of databases. Yes, I’m talking about the CAP
theorem!

In simple terms, the CAP theorem lets you determine how you want to handle your
distributed database systems when a few database servers refuse to communicate
with each other due to some fault in the system. However, there exists some
misunderstanding. So, in this article, we will try to understand the CAP theorem
and how it helps to choose the right distributed database system.

But first, let’s get a basic understanding of distributed database systems.

Distributed Database Systems

In a NoSQL type distributed database system, multiple computers, or nodes, work


together to give an impression of a single working database unit to the user. They
store the data in these multiple nodes. And each of these nodes runs an instance of
the database server and they communicate with each other in some way.

When a user wants to write to the database, the data is appropriately written to a
node in the distributed database. The user may not be aware of where the data is
written. Similarly, when a user wants to retrieve the data, it connects to the nearest
node in the system which retrieves the data for it, without the user knowing about
this.

This way, a user simply interacts with the system as if it is interacting with a single
database. Internally the nodes communicate with each other, retrieving data that
the user is looking for, from the relevant node, or storing the data provided by the
user.

Now, the benefits of a distributed system are quite obvious. With the increase in
traffic from the users, we can easily scale our database by adding more nodes to the
system. Since these nodes are commodity hardware, they are relatively cheaper
than adding more resources to each of the nodes individually. That is, horizontal
scaling is cheaper than vertical scaling.
This horizontal scaling makes replication of data cheaper and easier. This means
that the system can now easily handle more user traffic by appropriately
distributing the traffic amongst the replicated nodes.

So the real problem comes while choosing the appropriate distributed database
system for a task. To answer such a question, we need to understand the CAP theorem.

Understanding CAP theorem with an Example

CAP theorem, also known as Brewer’s theorem, stands for Consistency,


Availability and Partition Tolerance. But let’s try to understand each, with an
example.

Availability
Imagine there is a very popular mobile operator in your city and you are its
customer because of the amazing plans it offers. Besides that, they also provide an
amazing customer care service where you can call anytime and get your queries
and concerns answered quickly and efficiently. Whenever a customer calls them,
the mobile operator is able to connect them to one of their customer care operators.

The customer is able to obtain any information they require about their
account, like balance, usage, or other details. We call
this Availability because every customer is able to connect to an operator and get
the information they need.
Consistency
Now, you have recently shifted to a new house in the city and you want to update
your address registered with the mobile operator. You decide to call the customer
care operator and update it with them. When you call, you connect with an
operator. This operator makes the relevant changes in the system. But once you
have put down the phone, you realize you told them the correct street name but the
old house number (old habits die hard!).

So you frantically call the customer care again. This time when you call, you
connect with a different customer care operator but they are able to access your
records as well and know that you have recently updated your address. They make
the relevant changes in the house number and the rest of the address is the same as
the one you told the last operator.
Partition tolerance
Recently you have noticed that your current mobile plan does not suit you. You do
not access that much mobile data any longer because you have good wi-fi facilities
at home and at the office, and you hardly step outside anywhere. Therefore, you
want to update your mobile plan. So you decide to call the customer care once
again.

On connecting with the operator this time, they tell you that they have not been
able to update their records due to some issues. So the information lying with the
operator might not be up to date, therefore they cannot update the information. We
can say here that the service is broken or there is no Partition tolerance.
Understanding the Terms of the CAP theorem

Now let’s take up these terms one by one and try to understand them in a more
formal manner.

Consistency

Consistency means that the user should be able to see the same data no matter
which node they connect to on the system. This data is the most recent data written
to the system. So if a write operation has occurred on a node, it should be
replicated to all its replicas. So that whenever a user connects to the system, they
can see that same information.

However, having a system that maintains consistency instantaneously and globally


is near impossible. Therefore, the goal is to make this transition fast enough so that
it is hardly noticeable.
Consistency is of importance when it is required that all the clients or users view
the same data. This is important in places that deal with financial or personal
information. For example, your bank account should reflect the same balance
whether you view it from your PC, tablet, or smartphone!

Availability

Availability means that every request from the user should elicit a response from
the system. Whether the user wants to read or write, the user should get a response
even if the operation was unsuccessful. This way, every operation is bound to
terminate.

For example, when you visit your bank’s ATM, you are able to access your
account and its related information. Now even if you go to some other ATM, you
should still be able to access your account. If you are only able to access your
account from one ATM and not another, this means that the information is not
available with all the ATMs.

Availability is of importance when it is required that the client or user be able to


access the data at all times, even if it is not consistent. For example, you should be
able to see your friend’s Whatsapp status even if you are viewing an outdated one
due to some network failure.

Partition Tolerance

Partition refers to a communication break between nodes within a distributed


system. Meaning, if a node cannot receive any messages from another node in the
system, there is a partition between the two nodes. Partition could have been
because of network failure, server crash, or any other reason.
So, if Partition means a break in communication then Partition tolerance would
mean that the system should still be able to work even if there is a partition in the
system. Meaning if a node fails to communicate, then one of the replicas of the
node should be able to retrieve the data required by the user.

This is handled by keeping replicas of the records in multiple different nodes. So


that even if a partition occurs, we are able to retrieve the data from its replica. As
you must have guessed already, partition tolerance is a must for any distributed
database system.

What is the CAP Theorem?

In the last section, you understood what each term means in the CAP theorem.
Now let us understand the theorem itself.

The CAP theorem states that a distributed database system has to make a tradeoff
between Consistency and Availability when a Partition occurs.
A distributed database system is bound to have partitions in a real-world system
due to network failure or some other reason. Therefore, partition tolerance is a
property we cannot avoid while building our system. So a distributed system will
either choose to give up on Consistency or Availability but not on Partition
tolerance.

For example in a distributed system, if a partition occurs between two nodes, it is


impossible to provide consistent data on both the nodes and availability of
complete data. Therefore, in such a scenario we either choose to compromise on
Consistency or on Availability. Hence, a NoSQL distributed database is either
characterized as CP or AP. CA type databases are generally the monolithic
databases that work on a single node and provide no distribution. Hence, they
require no partition tolerance.
Understanding CP with MongoDB

Let’s try to understand how a distributed system would work when it decides to
give up on Availability during a partition with the help of MongoDB.

MongoDB is a NoSQL database that stores data in one or more Primary nodes in
the form of JSON-like documents. Each Primary node has multiple replica sets that update
themselves asynchronously using the operation log file of their respective primary
node. The replica set nodes in the system send a heartbeat (ping) to every other
node to keep track if other replicas or primary nodes are alive or dead. If no
heartbeat is received within 10 seconds, then that node is marked as inaccessible.

If a Primary node becomes inaccessible, then one of the secondary nodes needs to
become the primary node. Till a new primary is elected from amongst the
secondary nodes, the system remains unavailable to the user to make any new
write query. Therefore, the MongoDB system behaves as a Consistent system and
compromises on Availability during a partition.

Understanding AP with Cassandra

Now let’s also look at how a system compromises on Consistency. For this, we
will look at the Cassandra database which is called a highly available database.

Cassandra is a peer-to-peer system. It consists of multiple nodes in the system. And


each node can accept a read or write request from the user. Cassandra maintains
multiple replicas of data in separate nodes. This gives it a masterless node
architecture where there are multiple points of failure instead of a single point.
The replication factor determines the number of replicas of data. If the replication
factor is 3, then we will replicate the data in three nodes in a clockwise manner.

A situation can occur where a partition occurs and the replica does not get an
updated copy of the data. In such a situation the replica nodes will still be available
to the user but the data will be inconsistent. However, Cassandra also provides
eventual consistency. Meaning, all updates will reach all the replicas eventually.
But in the meantime, it allows divergent versions of the same data to exist
temporarily. Until we update them to the consistent state.

Therefore, by allowing nodes to be available throughout and allowing temporarily
inconsistent data to exist in the system, Cassandra is an AP database that
compromises on consistency.
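The following toy sketch (not Cassandra's actual implementation) illustrates the underlying trade-off: with N replicas, requiring W acknowledgements per write gives stronger consistency, but the write must be refused, i.e. availability is sacrificed, whenever fewer than W replicas are reachable during a partition.

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class QuorumDemo {
    static class Replica {
        boolean reachable = true;
        final Map<String, String> data = new HashMap<>();
    }

    private final List<Replica> replicas = new ArrayList<>();
    private final int writeQuorum; // W

    QuorumDemo(int n, int writeQuorum) {
        for (int i = 0; i < n; i++) replicas.add(new Replica());
        this.writeQuorum = writeQuorum;
    }

    boolean write(String key, String value) {
        int acks = 0;
        for (Replica r : replicas) {
            if (r.reachable) {       // only reachable replicas can acknowledge
                r.data.put(key, value);
                acks++;
            }
        }
        return acks >= writeQuorum;  // refuse the write if the quorum is not met
    }

    public static void main(String[] args) {
        QuorumDemo db = new QuorumDemo(3, 2);        // N = 3, W = 2
        db.replicas.get(2).reachable = false;        // simulate a partition of one replica
        System.out.println("write ok? " + db.write("plan", "unlimited")); // true: 2 acks
        db.replicas.get(1).reachable = false;        // partition a second replica
        System.out.println("write ok? " + db.write("plan", "basic"));     // false: unavailable
    }
}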

Note that I have considered the MongoDB and Cassandra databases to be in their
default configurations.
Define DataNode?

NameNode
NameNode can be regarded as the system’s master. It keeps
track of the file system tree and metadata for all of the system’s
files and folders. Metadata information is stored in two files:
‘Namespace image’ and ‘edit log.’ Namenode is aware of all data
nodes carrying data blocks for a particular file, but it does not
keep track of block positions. When the system starts, this
information is rebuilt from data nodes each time.

NameNode is the HDFS controller and manager since it is aware
of the state and metadata of all HDFS files, including file
permissions, names, and block locations. Because the metadata is
small, it is kept in the memory of the NameNode, allowing for
speedier access. Furthermore, even though the HDFS cluster is
accessed by several clients at the same time, all of this metadata
is handled by this single machine. The NameNode performs file system
actions such as opening, closing, renaming, and so on.

DataNode

DataNodes are the slave nodes in HDFS. The actual data is stored
on DataNodes. A functional filesystem has more than one
DataNode, with data replicated across them. On startup, a
DataNode connects to the NameNode; spinning until that service
comes up.
Data storage nodes (DataNode)
A DataNode's primary role in a Hadoop cluster is to store data,
and jobs are executed as tasks on these nodes. Tasks
are scheduled so that batch processing happens near the data,
by allocating tasks to the nodes that most likely hold the data
to be processed. This also ensures that batch jobs are optimized
from an execution perspective and perform well thanks to data
locality.

The data node is a commodity computer with the GNU/Linux


operating system and data node software installed. In a cluster,
there will be a data node for each node (common
hardware/system). These nodes are in charge of the system’s
data storage.

Datanodes respond to client requests by performing read-write


operations on file systems. They also carry out actions such as
block creation, deletion, and replication in accordance with the
name node’s instructions.
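For illustration, a small sketch using the Hadoop FileSystem Java API; the file path is a placeholder. The client obtains metadata from the NameNode, while the bytes themselves are streamed to and from DataNodes.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadWrite {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration(); // picks up core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);

        Path file = new Path("/user/demo/notes.txt"); // placeholder path

        // Write: the NameNode allocates blocks; DataNodes store and replicate them.
        try (FSDataOutputStream out = fs.create(file, true)) {
            out.writeUTF("hello hdfs");
        }

        // Read: block locations come from the NameNode, the data from the DataNodes.
        try (FSDataInputStream in = fs.open(file)) {
            System.out.println(in.readUTF());
        }

        fs.close();
    }
}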
Define ACID properties

A.C.I.D. properties: Atomicity, Consistency, Isolation,

and Durability
ACID is an acronym that refers to the set of 4 key properties
that define a transaction: Atomicity, Consistency,
Isolation, and Durability. If a database operation has these
ACID properties, it can be called an ACID transaction, and data
storage systems that apply these operations are called
transactional systems. ACID transactions guarantee that each
read, write, or modification of a table has the following
properties:
 Atomicity - each statement in a transaction (to read,
write, update or delete data) is treated as a single unit.
Either the entire statement is executed, or none of it is
executed. This property prevents data loss and corruption
from occurring if, for example, your streaming data
source fails mid-stream.
 Consistency - ensures that transactions only make
changes to tables in predefined, predictable ways.
Transactional consistency ensures that corruption or
errors in your data do not create unintended
consequences for the integrity of your table.
 Isolation - when multiple users are reading and writing
from the same table all at once, isolation of their
transactions ensures that the concurrent transactions
don't interfere with or affect one another. Each request
can occur as though they were occurring one by one,
even though they're actually occurring simultaneously.
 Durability - ensures that changes to your data made by
successfully executed transactions will be saved, even in
the event of system failure.

Why are ACID transactions a good thing to have?


ACID transactions ensure the highest possible data reliability
and integrity. They ensure that your data never falls into an
inconsistent state because of an operation that only partially
completes. For example, without ACID transactions, if you were
writing some data to a database table, but the power went out
unexpectedly, it's possible that only some of your data would
have been saved, while some of it would not. Now your
database is in an inconsistent state that is very difficult and
time-consuming to recover from.
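As a hedged sketch of the same idea, the classic funds-transfer example below uses an explicit JDBC transaction; the JDBC URL, credentials, and table layout are placeholders.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.SQLException;

public class TransferFunds {
    public static void main(String[] args) throws SQLException {
        String url = "jdbc:postgresql://localhost:5432/bank"; // placeholder
        try (Connection conn = DriverManager.getConnection(url, "app", "secret")) {
            conn.setAutoCommit(false); // start an explicit transaction
            try (PreparedStatement debit = conn.prepareStatement(
                         "UPDATE accounts SET balance = balance - ? WHERE id = ?");
                 PreparedStatement credit = conn.prepareStatement(
                         "UPDATE accounts SET balance = balance + ? WHERE id = ?")) {
                debit.setLong(1, 100); debit.setLong(2, 1); debit.executeUpdate();
                credit.setLong(1, 100); credit.setLong(2, 2); credit.executeUpdate();
                conn.commit();   // Durability: changes survive once commit returns
            } catch (SQLException e) {
                conn.rollback(); // Atomicity: a failure undoes both statements
                throw e;
            }
        }
    }
}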

What is polyglot persistence?


Polyglot persistence is a conceptual term that refers to the use of different data
storage approaches and technologies to support the unique storage requirements of
various data types that live within enterprise applications. Polyglot persistence
essentially revolves around the idea that an application can benefit from using
more than one core database or storage technology.
Polyglot persistence suggests that an organization's database engineer or architect
should place a priority on figuring out how they'll need to manipulate application
data and identifying the database technology that will best fit those needs. This
approach is advocated as a way to address storage performance issues, simplify
data operations and eliminate potential fragmentation upfront.

Core tenets of polyglot persistence


Polyglot persistence draws from many of the same ideas encapsulated in polyglot
programming, which refers to the practice of writing applications using a mix of
languages. The idea is that this allows developers to take full advantage of each
various languages' suitability for solving different types of application
development and management challenges.

Polyglot refers to the ability to communicate fluently in several languages. In the


context of application development, it is used to describe an application's ability to
operate and communicate through multiple programming languages. In data
storage, the term persistence refers to survived data after the application process it
was designed to support is terminated. This data is then stored in a non-volatile
storage location.

However, not all organizations are in a place where this approach will benefit their
operations, and an enterprise should ask several questions before it makes the
decision to implement polyglot persistence. Some of these specific questions
include the following:

 How will developers and database managers be trained to work with


the new system?

 Is there an expert available who can help get the system up and
running?

 Is there existing in-house expertise that can help mentor the rest of the
staff?

 Who will provide support and repairs when problems occur?


Once satisfactory answers have been found for each of these questions, polyglot
persistence can be implemented to help an enterprise begin to make the move away
from singular relational databases and toward a mixture of data sources.

How to use polyglot persistence


Database architects and engineers must design and calibrate their polyglot
persistence approach for the specific data architecture that exists within their
enterprise. Luckily, polyglot persistence techniques readily adapt to the use
of SQL, NoSQL, or hybrid database systems.

Various factors should be taken into consideration when deciding to move to a


polyglot persistence storage system. For one, it will likely require the addition of a
new database -- along with other supporting technologies -- in order to sustain the
new implementation. It's also worth determining whether it will be necessary to
support ACID versus BASE compliance.

The most important factor to understand is the data flow within the organization.
An easy way to do this is to establish an owner for each piece of data in the system.
Introducing this detail at the overall system architecture level will allow developers
to see which owner is able to modify their corresponding piece of data as well as
how this data will be distributed in the system, thus making the work on each of
the different pieces of data more feasible.

There are several standalone tooling options available to automate polyglot


persistence, such as CloudBoost.io, though large platform providers like Red Hat
and Azure also provide support for this approach.

Data storage types


Polyglot persistence makes use of a wide array of data storage types. Here are
some examples of the various components and hardware involved:

 External hard drives

 Network-attached storage

 Cloud-based storage
 Solid-state drives

 Document, graph and search databases

 Row-store databases

 Data caches

Polyglot persistence strengths and weaknesses


The polyglot persistence approach carries its fair share of both strengths and
weaknesses. Some strengths of polyglot persistence include the following:

 Simplified data management operations

 Reduced amounts of data fragmentation

 Improved response time to data requests

 Increased scalability for applications and systems

 More flexibility in deciding where data is stored

On the other hand, some of polyglot persistence's major weaknesses include the
following:

 The need to remodel data systems based on the specific data


architecture

 Database engineers and architects become more responsible for data


storage

 Database interactions can become more complicated

 New data stores may add a need for more training

 Complex database integration may incur higher operational expenses

Example of polyglot persistence


E-commerce can provide us with a good example of where an organization might
apply polyglot persistence. The average web-based retail site often uses many
types of data for a shopping cart component: transactional data, session
information, inventory counts, order histories and customer profiles.
In the past, an organization might use a single database engine to handle all these
different types of data. However, doing so requires extensive data conversion to
format the different data types within a single, relational database. Polyglot
persistence, however, suggests that the shopping cart (and related e-commerce)
data be divided into databases that are best suited for each data type, thus taking
the pressure away from one overused data location.
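A purely illustrative sketch of that idea follows; the store interfaces are hypothetical stand-ins for whatever key-value, document, and relational engines a team actually picks, and the main method wires in console-printing fakes.

public class PolyglotCheckout {
    interface SessionStore { void put(String sessionId, String attrsJson); }
    interface CatalogStore { void save(String productJson); }
    interface OrderStore   { void insertOrder(long orderId, long customerId, double total); }

    public static void main(String[] args) {
        // In-memory stand-ins; real code would plug in key-value, document, and SQL clients.
        SessionStore sessions = (id, attrs) -> System.out.println("KV store  <- " + id + " " + attrs);
        CatalogStore catalog  = json       -> System.out.println("Doc store <- " + json);
        OrderStore   orders   = (o, c, t)  -> System.out.println("SQL store <- order " + o);

        long orderId = 42L; // placeholder ID
        orders.insertOrder(orderId, 7L, 99.90);                       // transactional data -> relational store
        catalog.save("{\"sku\":\"A-1\",\"name\":\"Gift card\"}");     // flexible product doc -> document store
        sessions.put("sess-123", "{\"lastOrder\":" + orderId + "}");  // session state -> key-value cache
    }
}

The routing decision, not the specific engines, is the point: each data type lands in the store best suited to it.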

Longs

1.) Architecture of HBase

HBase architecture has 3 main components: HMaster,
Region Server, and Zookeeper.
Figure – Architecture of HBase
All 3 components are described below:
1. HMaster –
The implementation of the Master Server in HBase is HMaster. It
is the process that assigns regions to Region Servers
and handles DDL (create, delete table) operations. It monitors
all Region Server instances present in the cluster. In a
distributed environment, the Master runs several background
threads. HMaster has many responsibilities, such as controlling
load balancing, failover, etc.

2. Region Server –
HBase tables are divided horizontally by row key range into
Regions. Regions are the basic building blocks of an HBase
cluster: each holds a portion of a table's data and is
composed of Column Families. A Region Server runs on an HDFS
DataNode present in the Hadoop cluster. A Region Server is
responsible for handling, managing, and executing read and
write operations on its set of regions. The default size of
a region is 256 MB.

3. Zookeeper –
It acts as a coordinator in HBase. It provides services like
maintaining configuration information, naming, distributed
synchronization, server failure notification, etc. Clients use
Zookeeper to locate region servers, as shown in the client
sketch after this list.
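For illustration, a minimal client sketch using the HBase Java API; the accounts table and its info column family are placeholders and are assumed to already exist. The client uses ZooKeeper to find the hbase:meta region, then talks directly to the owning Region Server.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseQuickstart {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("accounts"))) {

            // Write one cell: row key "user1", column family "info", qualifier "balance".
            Put put = new Put(Bytes.toBytes("user1"));
            put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("balance"), Bytes.toBytes("100"));
            table.put(put);

            // Read it back by row key.
            Result result = table.get(new Get(Bytes.toBytes("user1")));
            byte[] value = result.getValue(Bytes.toBytes("info"), Bytes.toBytes("balance"));
            System.out.println(Bytes.toString(value));
        }
    }
}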

Advantages of HBase –

1. Can store large data sets

2. Database can be shared

3. Cost-effective from gigabytes to petabytes

4. High availability through failover and replication

Disadvantages of HBase –

1. No built-in SQL support

2. No transaction support

3. Sorted only on key

4. Memory issues on the cluster

Comparison between HBase and HDFS:

 HBase provides low latency access while HDFS provides high


latency operations.

 HBase supports random read and writes while HDFS


supports Write once Read Many times.

 HBase is accessed through shell commands, Java API, REST,


Avro or Thrift API while HDFS is accessed through
MapReduce jobs.

Features of HBase architecture :

Distributed and Scalable: HBase is designed to be distributed and


scalable, which means it can handle large datasets and can scale
out horizontally by adding more nodes to the cluster.
Column-oriented Storage: HBase stores data in a column-
oriented manner, which means data is organized by columns rather
than rows. This allows for efficient data retrieval and aggregation.
Hadoop Integration: HBase is built on top of Hadoop, which
means it can leverage Hadoop’s distributed file system (HDFS) for
storage and MapReduce for data processing.
Consistency and Replication: HBase provides strong consistency
guarantees for read and write operations, and supports replication
of data across multiple nodes for fault tolerance.
Built-in Caching: HBase has a built-in caching mechanism that can
cache frequently accessed data in memory, which can improve
query performance.
Compression: HBase supports compression of data, which can
reduce storage requirements and improve query performance.
Flexible Schema: HBase supports flexible schemas, which means
the schema can be updated on the fly without requiring a database
schema migration.
Note – HBase is extensively used for online, real-time operations;
for example, in banking applications it can support real-time data
updates for ATM transactions.

2.) HDFS Architecture – Detailed Explanation

Hadoop is an open-source framework for distributed storage and processing.


It can be used to store large amounts of data in a reliable, scalable, and
inexpensive manner. It originated in 2005, with Yahoo! as an early major
contributor, as a means of storing and processing large datasets. Hadoop
provides MapReduce for distributed
processing, HDFS for storing data, and YARN for managing compute
resources. By using Hadoop, you can process huge amounts of data quickly
and efficiently. Hadoop can be used to run enterprise applications such as
analytics and data mining. HDFS is the core component of Hadoop. It is a
distributed file system that provides capacity and reliability for distributed
applications. It stores files across multiple machines, enabling high availability
and scalability. HDFS is designed to handle large volumes of data across
many servers. It also provides fault tolerance through replication and auto-
scalability. As a result, HDFS can serve as a reliable source of storage for
your application’s data files while providing optimum performance. HDFS is
implemented as a distributed file system with multiple data nodes spread
across the cluster to store files.

What is Hadoop HDFS?

Hadoop is a software framework that enables distributed storage and


processing of large data sets. It consists of several open source projects,
including HDFS, MapReduce, and Yarn. While Hadoop can be used for
different purposes, the two most common are Big Data analytics and NoSQL
database management. HDFS stands for “Hadoop Distributed File System”
and is a decentralized file system that stores data across multiple computers
in a cluster. This makes it ideal for large-scale storage as it distributes the load
across multiple machines so there’s less pressure on each individual machine.
MapReduce is a programming model that allows users to write code once and
execute it across many servers. When combined with HDFS, MapReduce can
be used to process massive data sets in parallel by dividing work up into
smaller chunks and executing them simultaneously.

HDFS Architecture

HDFS is an Open source component of the Apache Software Foundation that


manages data. HDFS has scalability, availability, and replication as key
features. Name nodes, secondary name nodes, data nodes, checkpoint
nodes, backup nodes, and blocks all make up the architecture of HDFS.
HDFS is fault-tolerant and replicated. Files are distributed across the cluster
machines using the NameNode and DataNodes. The primary difference
between HDFS and Apache HBase is that Apache HBase is a non-relational
database, whereas HDFS is a distributed file system (a data store, not a database).

HDFS is composed of master-slave architecture, which includes the following


elements:
NameNode

All the blocks on DataNodes are handled by NameNode, which is known as


the master node. It performs the following functions:

1. Monitors and controls all the DataNode instances.
2. Permits the user to access a file.
3. Keeps a record of which blocks are stored on which DataNode.
4. EditLogs are committed to disk after every write operation to the
NameNode's metadata; the EditLog records every change made to the
file system namespace so that it can be replayed after a restart.
5. Tracks the replication state of every block and instructs DataNodes to
add or remove block replicas as needed.
6. Because DataNodes are independent of one another, the failure of a
single DataNode does not impact the rest of the cluster; the NameNode
simply re-replicates the lost blocks on the remaining DataNodes.

There are two kinds of files in NameNode: FsImage files and EditLogs files:

1. FsImage: It contains all the details about a filesystem, including all the
directories and files, in a hierarchical format. It is also called a file image
because it resembles a photograph.
2. EditLogs: The EditLogs file keeps track of what modifications have
been made to the files of the filesystem.
Secondary NameNode

The secondary NameNode is not a standby NameNode; its role is to perform
periodic checkpoints of the file system metadata. The secondary NameNode
performs the following duties.

1. It periodically downloads the current FsImage and the accumulated
EditLogs from the NameNode.
2. It merges the EditLogs into the FsImage to produce a new, compacted
FsImage (a checkpoint), which keeps the EditLog from growing without
bound.
3. It uploads the new FsImage back to the NameNode. If the NameNode's
metadata is ever lost or corrupted, this checkpointed FsImage can be
used to recover most of the namespace.

DataNode

Every slave machine that contains data runs a DataNode. A DataNode
stores blocks as files on its local file system (for example, ext3 or ext4).
DataNodes do the following:

1. DataNodes store the actual data.
2. They handle the requested operations on files, such as reading block
content and writing new data, as described above.
3. They follow the NameNode's instructions, including block creation,
deletion, and replication.
Checkpoint Node

A Checkpoint node creates checkpoints at specified intervals: it fetches the
FsImage and EditLogs from the NameNode, merges them to produce a new
image, and delivers the checkpoint back to the NameNode. Its directory
structure is identical to that of the NameNode, so the checkpointed image is
always available.

Backup Node

A Backup node keeps an up-to-date, in-memory copy of the file system
namespace. It receives a stream of edits from the NameNode and applies
them to its own copy of the FsImage, so it always holds a current checkpoint
without having to download the FsImage and EditLogs. Note that the Backup
node maintains only the namespace metadata; the actual block data, and its
high availability, are still provided by the DataNodes through replication.

Blocks

Files in HDFS are split into fixed-size blocks. The default block size is 128 MB,
and it can be changed per cluster or per file depending on the performance
required. Data is written to the DataNodes block by block, and each block is
replicated across several DataNodes to ensure data consistency and fault
tolerance. If a node fails, the system automatically recovers the data from a
replica and re-replicates it across the remaining healthy nodes. DataNodes
store the blocks as ordinary files on their local file systems, and this block
structure is what allows HDFS to scale horizontally as the number of users
and the volume of data increase. When a file is larger than the block size,
the data that does not fit in one block is placed in the next block. For
example, if a file is 135 MB and the block size is 128 MB, two blocks will be
created: the first block will be 128 MB, and the second block will hold the
remaining 7 MB.
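A tiny worked sketch of that block arithmetic, matching the 135 MB / 128 MB example above:

public class BlockMath {
    public static void main(String[] args) {
        long blockSize = 128L * 1024 * 1024;           // 128 MB default block size
        long fileSize  = 135L * 1024 * 1024;           // 135 MB file

        long fullBlocks  = fileSize / blockSize;       // 1 full 128 MB block
        long remainder   = fileSize % blockSize;       // 7 MB left over
        long totalBlocks = fullBlocks + (remainder > 0 ? 1 : 0);

        System.out.println(totalBlocks + " blocks; last block holds "
                + remainder / (1024 * 1024) + " MB");  // prints: 2 blocks; last block holds 7 MB
    }
}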
Features of HDFS

The following are the main advantages of HDFS:

 HDFS can be configured to create multiple replicas of a particular file. If
any one replica fails, the user can still access the data from the other
replicas. HDFS also provides the option to configure automatic failover,
so in case of any hardware failure or error, the user can get their data
from another node where it has been replicated. Software failover is
also supported; it is similar to automatic failover but performed at the
data-provider level.

 Horizontal scalability means that the cluster can grow by adding more
nodes, while the data still appears as a single file system. Vertical
scalability means adding more resources (CPU, memory, disk) to the
existing nodes. Data can be replicated to ensure data integrity, and the
number of copies is controlled by a per-file replication factor. HDFS can
store up to 5 PB of data in a single cluster and handles the load by
automatically choosing suitable DataNodes to store data on. Data can
be read quickly because it is stored on multiple nodes, and replication
across nodes increases the reliability of the data.

 Data is stored on HDFS, not on the local filesystem of your computer. In


the event of a failure, the data is stored on a separate server, and can
be accessed by the application running on your local computer. Data is
replicated on multiple servers to ensure that even in the event of a
server failure, your data is still accessible. Data can be accessed via a
client tool such as the Java client, the Python client, or the CLI. Access
to data is accessible via a wide variety of client tools. This makes it
possible to access data from a wide variety of programming languages.

Replication Management in HDFS Architecture

HDFS is able to survive machine crashes and recover from data corruption.
HDFS operates on the principle of duplicates, so in the event of a failure it can
continue operating as long as replicas are available. Data is duplicated and
stored on different machines in the HDFS cluster; by default, a replica of every
block is stored on three DataNodes. The NameNode is responsible for
maintaining these copies: it keeps track of which blocks are under- or
over-replicated and adds or deletes copies accordingly.

Write Operation

When a client writes a file, it first asks the NameNode for a set of DataNodes
to hold each block. The client splits the file into blocks and streams each block
to the first DataNode in the list; that DataNode forwards the data to the
second DataNode, which forwards it to the third, forming a replication
pipeline. Acknowledgements travel back up the pipeline to the client, and the
process continues until all DataNodes have received the data. Once the last
block has been written, the DataNodes report the new blocks to the
NameNode and the client tells the NameNode that the file is complete.
Splitting a file into blocks in this way lets HDFS balance storage across the
cluster and improves fault tolerance and availability.

Read Operation

To read a file, the client asks the NameNode for the list of blocks that make
up the file and the DataNodes that hold each block. The NameNode only has
metadata, so the actual data does not pass through it; instead, the client
connects directly to the nearest DataNode holding each block and reads the
data from there. If a DataNode is unavailable or a block is corrupted, the
client simply reads the same block from another replica. The blocks are then
reassembled on the client side to reconstruct the file.

Advantages of HDFS Architecture

1. It is a highly scalable data storage system. This makes it ideal for data-
intensive applications like Hadoop and streaming analytics. Another
major benefit of Hadoop is that it is easy to set up. This makes it ideal
for non-technical users.
2. It is very easy to implement, yet very robust. There is a lot of flexibility
you get with Hadoop. It is a fast and reliable file system.
3. This makes Hadoop a great fit for a wide range of data applications.
The most common one is analytics. You can use Hadoop to process
large amounts of data quickly, and then analyze it to find trends or make
recommendations. The most common type of application that uses
Hadoop analytics is data crunching.
4. You can scale the cluster horizontally by adding more nodes, or vertically by increasing the capacity of the existing nodes. If you have many clients whose data needs to be stored on HDFS, you can easily scale your cluster horizontally by adding more nodes; once the capacity of the cluster is increased, it can serve more clients.
5. Storage can be set up as a centralized database, distributed across a cluster of commodity personal computers, or a combination of both; a common setup is to run the cluster nodes as virtual machines on existing servers.
6. Data locality, that is, processing data on the node where it is stored, reduces the overhead of data movement across the cluster, and replication provides high availability of data.
7. Automatic data replication can be accomplished with a variety of
technologies, including RAID, Hadoop, and database replication.
Logging data and monitoring it for anomalies can also help to detect
and respond to hardware and software failures.

Disadvantages of HDFS Architecture

1. It is important to have a backup and security plan in place. The cost of downtime can be extremely high, so it is important to keep things running smoothly; if your company does not have a data backup plan, you are putting your company's data at risk.
2. Data held in a single location is vulnerable to hacking and to disasters. To protect the data, back it up to a remote location so that, in the event of a disaster, it can be quickly restored to its original location.
3. Using the data outside HDFS requires copying it out, either manually or through a data migration process. Only once the data is copied to the local environment can it be accessed, analyzed, and used for other purposes.

3) Similarities and dissimilarities among Relational and Non-relational databases
Relational databases organize data into tables with rows and columns,
where each row represents a record, and each column represents a field.
This structure allows for easy querying and data manipulation
using Structured Query Language (SQL).
Some popular SQL database systems include:

Oracle, Microsoft SQL Server, PostgreSQL, MySQL, MariaDB


A non-relational database, also known as a NoSQL database, is a type of
database that does not use the traditional table-based relational structure.
Instead, non-relational databases use various data models, such as key-
value, document, column-family, and graph.
This allows for greater flexibility in storing and managing data, especially
for large-scale, distributed, and unstructured data. To break it down, let's go
through the types of non-relational data models one by one.

 Key-value: This data model comprises two parts: a key and a value. The key is like an index, used to look up and access the value containing the data stored in the database (a minimal sketch follows the list of databases below).
 Document: Documents are self-contained, meaning all the information related to a single record is stored within one document. This makes it easier to add or modify data as needed.
 Column family: Column-family databases store data in columns grouped into families, so related columns can be read and written together.
 Graph: Graph databases use nodes and edges to represent relationships between different data sets.
Some popular NoSQL databases include:

 MongoDB, Google Cloud Firestore, Cassandra, Redis, Apache HBase, Amazon DynamoDB
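
To make the key-value model concrete, here is a small, purely illustrative Java sketch that mimics how a key-value store is used. The class is hypothetical; it stands in for a real client such as Redis or DynamoDB and is not a real driver.

import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

/** Toy in-memory key-value "store" used only to illustrate the data model. */
public class KeyValueSketch {
    private final Map<String, String> store = new HashMap<>();

    // put(key, value): the key acts as the index used to find the value later
    public void put(String key, String value) {
        store.put(key, value);
    }

    // get(key): look up the value by its key; no schema or joins involved
    public Optional<String> get(String key) {
        return Optional.ofNullable(store.get(key));
    }

    public static void main(String[] args) {
        KeyValueSketch kv = new KeyValueSketch();
        kv.put("user:1001:name", "Asha");          // keys are often namespaced strings
        kv.put("user:1001:city", "Vijayawada");
        System.out.println(kv.get("user:1001:name").orElse("<missing>"));
    }
}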

What are the similarities between SQL and NoSQL databases?


At a high level, NoSQL and SQL databases have many similarities. In
addition to supporting data storage and queries, they both also allow one
to retrieve, update, and delete stored data. However, under the surface
lie some significant differences that affect NoSQL versus SQL
performance, scalability, and flexibility.

Data Structure
A data structure is the way data is organized and stored in a database.
Relational databases use a table-based structure with rows and columns,
while non-relational databases use various data models, such as key-
value, document, column-family, and graph.

Performance
Relational databases can provide strong data consistency and integrity but
may be slower in performance for certain use cases. Non-relational
databases can offer faster performance for specific use cases, such as big
data and real-time processing.

Scalability
Relational databases have limited scalability, making them less suitable for
large datasets and high read/write loads. Non-relational databases are
highly scalable and can handle large-scale, distributed data more
efficiently.

Query Language
Relational databases use SQL for querying and manipulating data, while
non-relational databases typically use their own query languages or APIs,
which can vary between different databases.
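
As a hedged illustration of this difference, the sketch below sends a SQL query through JDBC; the connection URL, credentials, and table are assumptions. A NoSQL store would instead be queried through its own client API (for example, a document database driver exposes a find method that takes a filter object rather than an SQL string).

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class SqlQueryExample {
    public static void main(String[] args) throws Exception {
        // Hypothetical connection details for a relational database.
        String url = "jdbc:postgresql://localhost:5432/shop";
        try (Connection conn = DriverManager.getConnection(url, "user", "password");
             PreparedStatement stmt =
                     conn.prepareStatement("SELECT name, city FROM customers WHERE city = ?")) {
            stmt.setString(1, "Vijayawada");
            try (ResultSet rs = stmt.executeQuery()) {
                while (rs.next()) {
                    System.out.println(rs.getString("name") + " - " + rs.getString("city"));
                }
            }
        }
    }
}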

Schema
Relational databases have a predefined schema which makes them better
suited for structured data. Non-relational databases, however, are more
flexible and can accommodate various types of data.

Development
Relational databases require more development effort when it comes to
creating complex queries or changing the database structure. On the other
hand, non-relational databases are easier to develop and require fewer
resources.

4)Virtualization Approaches in Big Data?

VIRTUALIZATION AND BIG DATA


• Virtualization is a process that allows you to run the images of multiple operating systems on a single physical computer. These images of operating systems run on virtual machines.
• A virtual machine is basically a software representation of a physical machine that can execute and perform the same functions as a physical machine.
• Each virtual machine contains a separate copy of the operating system with its own virtual hardware resources, devices, and applications.

• The operating system that runs inside a virtual machine is known as the guest, while the operating system that runs the virtual machines is known as the host. A guest operating system runs on a hardware virtualization layer, which sits on top of the hardware of the physical machine.

[Diagram: the virtualization stack. Three virtual machines run on the virtualization software, which runs on the host operating system, which runs on the physical hardware.]
• Data virtualization software acts as a bridge across multiple, diverse data
sources, bringing critical decision-making data together in one virtual
place to fuel analytics.
• It is a method for combining data from various sources and different
types into a comprehensive, logical representation without physically
relocating the data
• Instead of performing data movement and physically storing integrated
views in a destination data structure, data virtualization can be utilized
to construct virtualized and integrated views of data in memory. To
make querying logic simpler, it provides an abstraction layer over the
actual physical implementation of data
A hypervisor allows a single host computer to build and run multiple virtual machines (VMs) by sharing its resources, including memory and processing power.

The hypervisor is a hardware virtualization technique that allows multiple guest operating systems (OS) to run on a single host system at the same time. A hypervisor is sometimes also called a virtual machine manager (VMM).
Virtualization has three characteristics that support the scalability and
operating efficiency required for big data environments:

• Partitioning: In virtualization, many applications and operating systems


are supported in a single physical system by partitioning the available
resources.
• Isolation: Each virtual machine is isolated from its host physical system
and other virtualized machines. Because of this isolation, if one virtual
instance crashes, the other virtual machines and the host system aren't
affected. In addition, data isn’t shared between one virtual instance and
another.
• Encapsulation: A virtual machine can be represented as a single file, so
you can identify it easily based on the services it provides.

VIRTUALIZATION APPROACHES

1)SERVER VIRTUALIZATION
In server virtualization, one physical server is partitioned into multiple virtual
servers. The hardware and resources of a machine — including the random
access memory (RAM), CPU, hard drive, and network controller — can be
virtualized into a series of virtual machines that each runs its own applications
and operating system. A single thin layer of software, the virtual machine monitor (also called a hypervisor), is inserted above the hardware; it manages the traffic between the virtual machines and the physical machine.

2)Application virtualization
Application virtualization means encapsulating applications so that they are not dependent on the underlying physical computer system. It improves the manageability and portability of applications and can be used along with server virtualization to meet business needs. Application virtualization ensures that big data applications can access resources on the basis of their relative priority with each other. Big data applications have significant IT resource requirements, and application virtualization can help them access those resources at low cost.

[Diagram: application virtualization layer between the user and the server.]

3)Network Virtualization
Network virtualization uses virtual networking as a pool of connection resources. When implementing network virtualization, you don't need to rely on the physical network for managing traffic between connections; you can create as many virtual networks as you need from your single physical implementation. In a big data environment, network virtualization helps in defining different networks with different sets of performance characteristics and capacities to manage the large distributed data required for the analysis.

4)Process and memory virtualization


Processor virtualization optimizes the power of the processor and maximizes its performance, while memory virtualization decouples memory from the individual servers. Big data analysis needs systems with high processing power (CPU) and memory (RAM) for performing complex computations. These computations can take a lot of time if CPU and memory resources are not sufficient. Processor and memory virtualization can therefore increase the speed of processing and deliver your analysis results sooner.

5)Data and storage Virtualization


Data virtualization provides an abstract service that delivers data continuously in a consistent form, without requiring knowledge of the underlying physical database, and is used to create a platform that can provide dynamically linked data services. Storage virtualization, on the other hand, combines physical storage resources so that they can be shared in a more effective way. In a big data environment, you may sometimes need to access only a certain type of data, say only database data; virtualization proves useful in this case because a virtual image of the database can be stored and invoked whenever required without consuming valuable data centre resources or capacity. In addition, storage virtualization is used to store large volumes of unstructured and structured data.

5.) Big Data Stack

Here's a closer look at the layers of the big data stack (usually shown as a diagram) and the relationship between the components:

 Interfaces and feeds: On either side of the diagram are


indications of interfaces and feeds into and out of both internally
managed data and data feeds from external sources. To understand
how big data works in the real world, start by understanding this
necessity.

What makes big data big is that it relies on picking up lots of data
from lots of sources. Therefore, open application programming
interfaces (APIs) will be core to any big data architecture.
In addition, keep in mind that interfaces exist at every level and
between every layer of the stack. Without integration services, big
data can't happen.

 Redundant physical infrastructure: The supporting physical


infrastructure is fundamental to the operation and scalability of a
big data architecture. Without the availability of robust physical
infrastructures, big data would probably not have emerged as such
an important trend.

To support an unanticipated or unpredictable volume of data, a


physical infrastructure for big data has to be different than that for
traditional data. The physical infrastructure is based on a distributed
computing model. This means that data may be physically stored in
many different locations and can be linked together through
networks, the use of a distributed file system, and various big data
analytic tools and applications.

 Security infrastructure: The more important big data analysis


becomes to companies, the more important it will be to secure that
data. For example, if you are a healthcare company, you will
probably want to use big data applications to determine changes in
demographics or shifts in patient needs.

This data about your constituents needs to be protected both to


meet compliance requirements and to protect the patients' privacy.
You will need to take into account who is allowed to see the data
and under what circumstances they are allowed to do so. You will
need to be able to verify the identity of users as well as protect the
identity of patients.

 Operational data sources: When you think about big data,


understand that you have to incorporate all the data sources that
will give you a complete picture of your business and see how the
data impacts the way you operate your business.
Traditionally, an operational data source consisted of highly
structured data managed by the line of business in a relational
database. But as the world changes, it is important to understand
that operational data now has to encompass a broader set of data
sources.

(or)
 Security infrastructure - The information about your
constituents must be protected in order to comply with
regulatory requirements as well as to protect their privacy.
 Operational data sources - A relational database was traditionally used to store the highly structured data handled by the line of business.
 Organizing Databases and tools - structured databases and tools used to organize and process the data.
 Analytical Data warehouse - The addition of an analytical
data warehouse simplifies the data for the development of
reports.
 Reporting and visualization - Enable the processing of data
while providing a user-friendly depiction of the results.

Big Data Application


A big data application, that is, a custom application of big data for business, brings together cloud computing, virtualization, Hadoop, and other technologies in order to fulfill the needs of the business.

With the development of the internet and technologies such as big


data, this field of marketing transitioned to the digital realm, which
is now known as Digital Marketing in modern times. Today, thanks to
big data, you can collect massive volumes of information and learn
about the preferences of millions of customers in a matter of
seconds. Business analysts examine this data to assist marketers in running campaigns, increasing click-through rates, placing relevant adverts, improving the product, and covering the finer details needed to reach the target audience.

For example, Amazon gathered information about the purchases


made by millions of people all over the world through its website.
They conducted research on the purchasing habits and payment
methods of their clients and used the findings to develop new offers
and advertising campaigns.

6.)Key Features of MapReduce


The following advanced features characterize MapReduce:

1. Highly scalable
A framework with excellent scalability is Apache Hadoop
MapReduce. This is because of its capacity for distributing and
storing large amounts of data across numerous servers. These
servers can all run simultaneously and are all reasonably
priced.
By adding servers to the cluster, we can simply grow the
amount of storage and computing power. We may improve the
capacity of nodes or add any number of nodes (horizontal
scalability) to attain high computing power. Organizations may
execute applications from massive sets of nodes, potentially
using thousands of terabytes of data, thanks to Hadoop
MapReduce programming.

2. Versatile
Businesses can use MapReduce programming to access new
data sources. It makes it possible for companies to work with
many forms of data. Enterprises can access both structured and unstructured data with this method and acquire valuable insights from the various data sources.
Since Hadoop is an open-source project, its source code is
freely accessible for review, alterations, and analyses. This
enables businesses to alter the code to meet their specific
needs. The MapReduce framework supports data from sources
including email, social media, and clickstreams in different
languages.

3. Secure
The MapReduce programming model uses the HBase and HDFS
security approaches, and only authenticated users are
permitted to view and manipulate the data. HDFS uses a
replication technique in Hadoop 2 to provide fault tolerance.
Depending on the replication factor, it makes a clone of each
block on the various machines. One can therefore access data
from the other devices that house a replica of the same data if
any machine in a cluster goes down. In Hadoop 3, erasure coding can take the place of this replication technique; it delivers the same level of fault tolerance while using less storage space, with a storage overhead of less than 50%.

4. Affordability
With the help of the MapReduce programming framework and
Hadoop’s scalable design, big data volumes may be stored and
processed very affordably. Such a system is particularly cost-
effective and highly scalable, making it ideal for business
models that must store data that is constantly expanding to
meet the demands of the present.
In terms of scalability, processing data with older, conventional
relational database management systems was not as simple as
it is with the Hadoop system. In these situations, the company
had to minimize the data and execute classification based on
presumptions about how specific data could be relevant to the
organization, hence deleting the raw data. The MapReduce
programming model in the Hadoop scale-out architecture helps
in this situation.
5. Fast-paced
The Hadoop Distributed File System, a distributed storage
technique used by MapReduce, is a mapping system for finding
data in a cluster. The data processing technologies, such as
MapReduce programming, are typically placed on the same
servers that enable quicker data processing.
Thanks to Hadoop’s distributed data storage, users may
process data in a distributed manner across a cluster of nodes.
As a result, it gives the Hadoop architecture the capacity to
process data exceptionally quickly. Hadoop MapReduce can
process unstructured or semi-structured data in high numbers
in a shorter time.

6. Based on a simple programming model


Hadoop MapReduce is built on a straightforward programming
model and is one of the technology’s many noteworthy
features. This enables programmers to create MapReduce
applications that can handle tasks quickly and effectively. Java
is a very well-liked and simple-to-learn programming language
used to develop the MapReduce programming model.
Java programming is simple to learn, and anyone can create a data processing model that works for their company (a minimal word-count sketch appears at the end of this section). Hadoop is straightforward to use because users don't need to worry about how the computation is distributed; the framework itself handles that.

7. Parallel processing-compatible
The parallel processing involved in MapReduce programming is
one of its key components. The tasks are divided in the
programming paradigm to enable the simultaneous execution
of independent activities. As a result, the program runs faster
because of the parallel processing, which makes it simpler for
the processes to handle each job. Multiple processors can carry
out these broken-down tasks thanks to parallel processing.
Consequently, the entire software runs faster.

8. Reliable
The same set of data is transferred to some other nodes in a
cluster each time a collection of information is sent to a single
node. Therefore, even if one node fails, backup copies are
always available on other nodes that may still be retrieved
whenever necessary. This ensures high data availability.
The framework offers a way to guarantee data trustworthiness
through the use of Block Scanner, Volume Scanner, Disk
Checker, and Directory Scanner modules. Your data is safely
saved in the cluster and is accessible from another machine
that has a copy of the data if your device fails or the data
becomes corrupt.

9. Highly available
Hadoop’s fault tolerance feature ensures that even if one of the
DataNodes fails, the user may still access the data from other
DataNodes that have copies of it. Moreover, the high
accessibility Hadoop cluster comprises two or more active and
passive NameNodes running on hot standby. The active
NameNode is the active node. A passive node is a backup node
that applies changes made in active NameNode’s edit logs to
its namespace.
Key Features of MapReduce
There are some key features of MapReduce below:

Scalability
MapReduce can scale to process vast amounts of data by distributing tasks across a
large number of nodes in a cluster. This allows it to handle massive datasets, making
it suitable for Big Data applications.
Fault Tolerance
MapReduce incorporates built-in fault tolerance to ensure the reliable processing of
data. It automatically detects and handles node failures, rerunning tasks on available
nodes as needed.
Data Locality
MapReduce takes advantage of data locality by processing data on the same node
where it is stored, minimizing data movement across the network and improving
overall performance.
Simplicity
The MapReduce programming model abstracts away many complexities associated
with distributed computing, allowing developers to focus on their data processing
logic rather than low-level details.
Cost-Effective Solution
Hadoop's scalable architecture and MapReduce programming framework make
storing and processing extensive data sets very economical.
Parallel Programming
Tasks are divided into programming models to allow for the simultaneous execution
of independent operations. As a result, programs run faster due to parallel
processing, making it easier for a process to handle each job. Thanks to parallel
processing, these distributed tasks can be performed by multiple processors.
Therefore, all software runs faster.
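
As a hedged illustration of the programming model described above, here is a minimal word-count sketch written against the standard Hadoop MapReduce Java API. The input and output paths are taken from the command line and are assumptions for this example, not part of the original notes.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map phase: emit (word, 1) for every word in the input split.
    public static class TokenizerMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer tokens = new StringTokenizer(value.toString());
            while (tokens.hasMoreTokens()) {
                word.set(tokens.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce phase: sum the counts for each word.
    public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(SumReducer.class);   // combiner cuts data moved across the network
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // e.g. /data/input
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // e.g. /data/output
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}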

HBase
HBase is a data model similar to Google's Bigtable. It is an open-source, distributed database developed by the Apache Software Foundation and written in Java, and it is an essential part of the Hadoop ecosystem. HBase runs on top of HDFS (Hadoop Distributed File System) and can store massive amounts of data, from terabytes to petabytes. It is column-oriented and horizontally scalable. HBase is well suited for sparse data sets, which are very common in big data use cases, and it provides APIs enabling development in practically any programming language. It provides random, real-time read/write access to data in the Hadoop file system.

Why HBase

o RDBMS becomes exponentially slower as the data grows large
o It expects data to be highly structured, i.e. able to fit into a well-defined schema
o Any change in schema might require downtime
o For sparse datasets, there is too much overhead in maintaining NULL values

HBase Architecture
[Diagram: HBase architecture, not reproduced here.]
FEATURES OF HBase
1. It is linearly scalable across various nodes as well as modularly scalable, as data is divided across various nodes.
2. HBase provides consistent reads and writes.
3. It provides atomic reads and writes, meaning that during one read or write operation, other processes are prevented from performing conflicting read or write operations.
4. It provides an easy-to-use Java API for client access (see the sketch after this list).
5. It supports Thrift and REST APIs for non-Java front ends, which support XML, Protobuf, and binary data encoding options.
6. It supports a Block Cache and Bloom Filters for real-time queries and for high-volume query optimization.
7. HBase provides automatic failover support between Region Servers.
8. It supports exporting metrics to files via the Hadoop metrics subsystem.
9. It doesn't enforce relationships within your data.
10. It is a platform for storing and retrieving data with random access.
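
A hedged sketch of the Java client API mentioned in point 4; the ZooKeeper quorum, table name, and column family are assumptions, and the table is assumed to already exist.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseClientExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        conf.set("hbase.zookeeper.quorum", "zk-host"); // hypothetical ZooKeeper host

        try (Connection connection = ConnectionFactory.createConnection(conf);
             Table table = connection.getTable(TableName.valueOf("users"))) { // assumed table

            // Write: a Put addresses a row key, a column family, and a column qualifier.
            Put put = new Put(Bytes.toBytes("row1"));
            put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("name"), Bytes.toBytes("Asha"));
            table.put(put);

            // Read: a Get fetches the row back by its key.
            Result result = table.get(new Get(Bytes.toBytes("row1")));
            byte[] value = result.getValue(Bytes.toBytes("info"), Bytes.toBytes("name"));
            System.out.println("name = " + Bytes.toString(value));
        }
    }
}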
