Hadoop Questions
Apache Zookeeper
Architecture of Zookeeper
Data Virtualization
The foundation of data virtualization technology is the execution of
distributed data management processes, mostly queries, against
numerous heterogeneous data sources, and the federation of the query
results into virtual views. Applications, query/reporting tools,
message-oriented middleware, or other parts of the data
management infrastructure then consume these virtual views.
Instead of performing data movement and physically storing
integrated views in a destination data structure, data virtualization
can be utilized to construct virtualized and integrated views of data
in memory. To make querying logic simpler, it provides an
abstraction layer over the actual physical implementation of data.
It is a method for combining data from various sources and different
types into a comprehensive, logical representation without
physically relocating the data. Simply put, users can theoretically
access and examine data while it still exists in its original sources
thanks to specialized middleware.
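As a minimal, hypothetical sketch of the idea in Java (requires Java 16+ for records): two in-memory lists stand in for a CRM system and an order database, and an integrated "virtual view" is assembled on demand without copying either dataset into a new store. All names here are illustrative, not part of any real virtualization product.

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class VirtualViewDemo {
    // Two heterogeneous "sources": a CRM record set and an order record set.
    record Customer(int id, String name) {}
    record Order(int customerId, double amount) {}

    public static void main(String[] args) {
        List<Customer> crm = List.of(new Customer(1, "Asha"), new Customer(2, "Ravi"));
        List<Order> orders = List.of(new Order(1, 120.0), new Order(1, 80.0), new Order(2, 45.5));

        // Build an integrated, in-memory "virtual view": total spend per customer name.
        // The underlying data stays where it is; only the combined view is materialized on demand.
        Map<String, Double> view = crm.stream().collect(Collectors.toMap(
                Customer::name,
                c -> orders.stream()
                           .filter(o -> o.customerId() == c.id())
                           .mapToDouble(Order::amount)
                           .sum()));

        view.forEach((name, total) -> System.out.println(name + " -> " + total));
    }
}
```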
Uses of MapReduce
1. Entertainment
Hadoop MapReduce assists end users in finding the most
popular movies based on their preferences and previous
viewing history. It primarily concentrates on their clicks and
logs.
Various OTT services, including Netflix, regularly release many
web series and movies. It may have happened to you that you
couldn’t pick which movie to watch, so you looked at Netflix’s
recommendations and decided to watch one of the suggested
series or films. Netflix uses Hadoop and MapReduce to suggest
popular movies to users based on what they have watched and
which movies they enjoy. MapReduce can examine user clicks and
logs to learn how they watch movies.
2. E-commerce
Several e-commerce companies, including Flipkart, Amazon,
and eBay, employ MapReduce to evaluate consumer buying
patterns based on customers’ interests or historical purchasing
patterns. For various e-commerce businesses, it provides
product suggestion methods by analyzing data, purchase
history, and user interaction logs.
Many e-commerce vendors use the MapReduce programming
model to identify popular products based on customer
preferences or purchasing behavior. This includes generating
item recommendations for e-commerce inventory by analyzing
website records, purchase histories, user interaction logs, and
similar data.
3. Social media
Nearly 500 million tweets are sent daily on the microblogging
platform Twitter, amounting to several thousand every second.
MapReduce processes Twitter data through operations such as
tokenization, filtering, counting, and aggregating counters; a
sketch of such a pipeline follows this list.
Tokenization: Tweets are split into tokens and mapped
into key-value pairs, one per token.
Filtering: Unwanted terms (stop words, noise) are
removed from the token maps.
Counting: A counter is emitted for each remaining word.
Aggregate counters: Comparable counter values are
grouped and summed into small, manageable result
sets.
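Below is a minimal sketch of such a pipeline written against the standard Hadoop MapReduce Java API. It assumes the tweet text arrives one tweet per input line; the stop-word set and class names are illustrative, not taken from any particular Twitter pipeline.

```java
import java.io.IOException;
import java.util.Set;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class TweetWordCount {

    // Map phase: tokenization + filtering. Each tweet line is split into tokens,
    // unwanted words are dropped, and (word, 1) pairs are emitted.
    public static class TokenizerMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final Set<String> STOP_WORDS = Set.of("the", "a", "an", "rt");
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            for (String token : value.toString().toLowerCase().split("\\W+")) {
                if (!token.isEmpty() && !STOP_WORDS.contains(token)) {   // filtering
                    word.set(token);
                    context.write(word, ONE);                            // emit token counter
                }
            }
        }
    }

    // Reduce phase: aggregate counters. All counts for the same word are summed.
    public static class CountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }
}
```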
4. Data warehouse
Systems that handle enormous volumes of information are
known as data warehouse systems. The star schema, which
consists of a fact table and several dimension tables, is the
most popular data warehouse model. In a shared-nothing
architecture, storing all the necessary data on a single node is
impossible, so retrieving data from other nodes is essential.
This results in network congestion and slow query execution
speeds. If the dimensions are not too big, users can replicate
them over nodes to get around this issue and maximize
parallelism. Using MapReduce, we may build specialized
business logic for data insights while analyzing enormous data
volumes in data warehouses.
5. Fraud detection
Conventional methods of preventing fraud are not always very
effective. For instance, data analysts typically manage
inaccurate payments by auditing a tiny sample of claims and
requesting medical records from specific submitters. Hadoop is
a system well suited for handling large volumes of data needed
to create fraud detection algorithms. Financial businesses,
including banks, insurance companies, and payment providers,
use Hadoop and MapReduce for fraud detection, pattern
recognition, and business analytics through transaction
analysis.
CAP Theorem
Introduction
The NoSQL databases have inadvertently been at the forefront of this shift in the
domain of distributed databases. And they have been providing you with lots of
flexibility in terms of handling your data. Plus there is a plethora of them out there!
However, the real question is which one to use? The answer to this question lies
not only in the properties of these databases but also in understanding a
fundamental theorem. A theorem that has gained renewed attention since the
advent of such databases in the realm of databases. Yes, I’m talking about the CAP
theorem!
In simple terms, the CAP theorem lets you determine how you want to handle your
distributed database systems when a few database servers refuse to communicate
with each other due to some fault in the system. However, there is a fair amount of
misunderstanding around it. So, in this article, we will try to understand the CAP theorem
and how it helps in choosing the right distributed database system.
When a user wants to write to the database, the data is appropriately written to a
node in the distributed database. The user may not be aware of where the data is
written. Similarly, when a user wants to retrieve the data, it connects to the nearest
node in the system which retrieves the data for it, without the user knowing about
this.
This way, a user simply interacts with the system as if it is interacting with a single
database. Internally the nodes communicate with each other, retrieving data that
the user is looking for, from the relevant node, or storing the data provided by the
user.
Now, the benefits of a distributed system are quite obvious. With the increase in
traffic from the users, we can easily scale our database by adding more nodes to the
system. Since these nodes are commodity hardware, they are relatively cheaper
than adding more resources to each of the nodes individually. That is, horizontal
scaling is cheaper than vertical scaling.
This horizontal scaling makes replication of data cheaper and easier. This means
that the system can now easily handle more user traffic by appropriately
distributing it amongst the replicated nodes.
So the real problem arises when choosing the appropriate distributed database system
for a task. To answer this question, we need to understand the CAP theorem.
Availability
Imagine there is a very popular mobile operator in your city and you are its
customer because of the amazing plans it offers. Besides that, they also provide an
amazing customer care service where you can call anytime and get your queries
and concerns answered quickly and efficiently. Whenever a customer calls them,
the mobile operator is able to connect them to one of their customer care operators.
The customer is able to obtain any information they need about their account,
such as balance, usage, or other details. We call
this Availability because every customer is able to connect to an operator and get
the information they ask for.
Consistency
Now, you have recently shifted to a new house in the city and you want to update
your address registered with the mobile operator. You decide to call the customer
care operator and update it with them. When you call, you connect with an
operator. This operator makes the relevant changes in the system. But once you
have put down the phone, you realize you told them the correct street name but the
old house number (old habits die hard!).
So you frantically call the customer care again. This time when you call, you
connect with a different customer care operator but they are able to access your
records as well and know that you have recently updated your address. They make
the relevant changes in the house number and the rest of the address is the same as
the one you told the last operator.
Partition tolerance
Recently you have noticed that your current mobile plan does not suit you. You do
not access that much mobile data any longer because you have good wi-fi facilities
at home and at the office, and you hardly step outside anywhere. Therefore, you
want to update your mobile plan. So you decide to call the customer care once
again.
On connecting with the operator this time, they tell you that they have not been
able to update their records due to some issues. So the information lying with the
operator might not be up to date, therefore they cannot update the information. We
can say here that the service is broken or there is no Partition tolerance.
Understanding the Terms of the CAP theorem
Now let’s take up these terms one by one and try to understand them in a more
formal manner.
Consistency
Consistency means that the user should be able to see the same data no matter
which node they connect to on the system. This data is the most recent data written
to the system. So if a write operation has occurred on a node, it should be
replicated to all its replicas. So that whenever a user connects to the system, they
can see that same information.
Availability
Availability means that every request from the user should elicit a response from
the system. Whether the user wants to read or write, the user should get a response
even if the operation was unsuccessful. This way, every operation is bound to
terminate.
For example, when you visit your bank’s ATM, you are able to access your
account and its related information. Now even if you go to some other ATM, you
should still be able to access your account. If you are only able to access your
account from one ATM and not another, this means that the information is not
available with all the ATMs.
Partition Tolerance
Partition tolerance means that the system continues to operate even when some
nodes cannot communicate with each other, that is, when the network between
them is partitioned due to failures or other faults.
In the last sections, you understood what each term means in the CAP theorem.
Now let us understand the theorem itself.
The CAP theorem states that a distributed database system has to make a tradeoff
between Consistency and Availability when a Partition occurs.
A distributed database system is bound to have partitions in a real-world system
due to network failure or some other reason. Therefore, partition tolerance is a
property we cannot avoid while building our system. So a distributed system will
either choose to give up on Consistency or Availability but not on Partition
tolerance.
Let’s try to understand how a distributed system would work when it decides to
give up on Availability during a partition with the help of MongoDB.
MongoDB is a NoSQL database that stores data as JSON-like documents on one
or more primary nodes. Each primary node belongs to a replica set whose
secondary nodes update themselves asynchronously from the primary's operation
log (oplog). The replica set members send a heartbeat (ping) to every other
member to keep track of whether the other replicas or the primary are alive or
dead. If no heartbeat is received within 10 seconds, that node is marked as inaccessible.
If a Primary node becomes inaccessible, then one of the secondary nodes needs to
become the primary node. Till a new primary is elected from amongst the
secondary nodes, the system remains unavailable to the user to make any new
write query. Therefore, the MongoDB system behaves as a Consistent system and
compromises on Availability during a partition.
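As a hedged illustration (the database, collection, and connection string below are placeholders, not from the text), the MongoDB Java driver lets a client lean toward consistency by requiring acknowledgement from a majority of replica-set members before a write succeeds, and by reading only majority-committed data; if a majority cannot be reached during a partition, such operations wait or fail rather than return stale data.

```java
import com.mongodb.ReadConcern;
import com.mongodb.WriteConcern;
import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import org.bson.Document;

public class ConsistentMongoWrite {
    public static void main(String[] args) {
        // Placeholder connection string for a local replica set.
        try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) {
            MongoCollection<Document> orders = client.getDatabase("shop")
                    .getCollection("orders")
                    .withWriteConcern(WriteConcern.MAJORITY)   // write must reach a majority of nodes
                    .withReadConcern(ReadConcern.MAJORITY);    // reads see only majority-committed data

            orders.insertOne(new Document("orderId", 1001).append("status", "NEW"));
            System.out.println("First order: " + orders.find().first());
        }
    }
}
```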
Now let’s also look at how a system compromises on Consistency. For this, we
will look at the Cassandra database which is called a highly available database.
A situation can occur where a partition occurs and the replica does not get an
updated copy of the data. In such a situation the replica nodes will still be available
to the user, but the data will be inconsistent. However, Cassandra also provides
eventual consistency, meaning all updates will reach all the replicas eventually.
In the meantime, it allows divergent versions of the same data to exist
temporarily, until they converge to a consistent state.
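With the DataStax Java driver (v4.x), this tradeoff is exposed per request through tunable consistency levels; the keyspace, table, and session settings below are placeholders. Reading at ONE keeps the request available during a partition at the risk of stale data, while QUORUM trades some availability for stronger consistency.

```java
import com.datastax.oss.driver.api.core.CqlSession;
import com.datastax.oss.driver.api.core.DefaultConsistencyLevel;
import com.datastax.oss.driver.api.core.cql.Row;
import com.datastax.oss.driver.api.core.cql.SimpleStatement;

public class TunableConsistencyDemo {
    public static void main(String[] args) {
        try (CqlSession session = CqlSession.builder().build()) {  // assumes a local contact point
            // Availability-leaning read: any single replica may answer, possibly with stale data.
            SimpleStatement fastRead = SimpleStatement
                    .newInstance("SELECT balance FROM shop.accounts WHERE id = ?", 42)
                    .setConsistencyLevel(DefaultConsistencyLevel.ONE);

            // Consistency-leaning read: a majority of replicas must agree before responding.
            SimpleStatement safeRead = fastRead.setConsistencyLevel(DefaultConsistencyLevel.QUORUM);

            Row row = session.execute(safeRead).one();
            System.out.println(row == null ? "no row" : row.getBigDecimal("balance"));
        }
    }
}
```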
Note that I have considered the MongoDB and Cassandra databases to be in their
default configurations here.
Define DataNode?
NameNode
NameNode can be regarded as the system’s master. It keeps
track of the file system tree and metadata for all of the system’s
files and folders. Metadata information is stored in two files:
‘Namespace image’ and ‘edit log.’ The NameNode knows which
DataNodes hold the blocks of a particular file, but it does not
persist block locations; this information is rebuilt from the
DataNodes’ block reports each time the system starts.
DataNode
DataNodes are the slave nodes in HDFS. The actual data is stored
on DataNodes. A functional filesystem has more than one
DataNode, with data replicated across them. On startup, a
DataNode connects to the NameNode; spinning until that service
comes up.
Data storage nodes (DataNode)
A Data node's primary role in a Hadoop cluster is to store data,
and the jobs are executed as tasks on these nodes. The tasks
are scheduled so that batch processing is done near the data, by
allocating tasks to the nodes most likely to already hold the data
to be processed. This also ensures that batch jobs are optimized
from an execution perspective and perform well thanks to
near-data processing.
Atomicity, Consistency, Isolation, and Durability (ACID)
ACID is an acronym that refers to the set of 4 key properties
that define a transaction: Atomicity, Consistency,
Isolation, and Durability. If a database operation has these
ACID properties, it can be called an ACID transaction, and data
storage systems that apply these operations are called
transactional systems. ACID transactions guarantee that each
read, write, or modification of a table has the following
properties:
Atomicity - each statement in a transaction (to read,
write, update, or delete data) is treated as a single unit.
Either the entire statement is executed, or none of it is
executed. This property prevents data loss and corruption
from occurring if, for example, your streaming data
source fails mid-stream.
Consistency - ensures that transactions only make
changes to tables in predefined, predictable ways.
Transactional consistency ensures that corruption or
errors in your data do not create unintended
consequences for the integrity of your table.
Isolation - when multiple users are reading and writing
from the same table all at once, isolation of their
transactions ensures that the concurrent transactions
don't interfere with or affect one another. Each request
can occur as though they were occurring one by one,
even though they're actually occurring simultaneously.
Durability - ensures that changes to your data made by
successfully executed transactions will be saved, even in
the event of system failure.
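A small, hedged illustration of atomicity and durability using plain JDBC (the table, columns, and connection URL are hypothetical): both updates either commit together or are rolled back together.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.SQLException;

public class TransferExample {
    public static void main(String[] args) throws SQLException {
        // Hypothetical JDBC URL; any ACID-compliant relational database would do.
        try (Connection conn = DriverManager.getConnection("jdbc:postgresql://localhost/bank", "app", "secret")) {
            conn.setAutoCommit(false);  // group the statements into one transaction
            try (PreparedStatement debit = conn.prepareStatement(
                         "UPDATE accounts SET balance = balance - ? WHERE id = ?");
                 PreparedStatement credit = conn.prepareStatement(
                         "UPDATE accounts SET balance = balance + ? WHERE id = ?")) {

                debit.setInt(1, 100);  debit.setInt(2, 1);  debit.executeUpdate();
                credit.setInt(1, 100); credit.setInt(2, 2); credit.executeUpdate();

                conn.commit();          // durability: once commit returns, the change survives a crash
            } catch (SQLException e) {
                conn.rollback();        // atomicity: neither update is applied if either fails
                throw e;
            }
        }
    }
}
```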
However, not all organizations are in a place where this approach will benefit their
operations, and an enterprise should ask several questions before it makes the
decision to implement polyglot persistence. Some of these specific questions
include the following:
Is there an expert available who can help get the system up and
running?
Is there existing in-house expertise that can help mentor the rest of the
staff?
The most important factor to understand is the data flow within the organization.
An easy way to do this is to establish an owner for each piece of data in the system.
Introducing this detail at the overall system architecture level will allow developers
to see which owner is able to modify their corresponding piece of data as well as
how this data will be distributed in the system, thus making the work on each of
the different pieces of data more feasible.
Network-attached storage
Cloud-based storage
Solid-state drives
Row-store databases
Data caches
On the other hand, some of polyglot persistence's major weaknesses include the
following:
Longs
1. HMaster –
HMaster is the implementation of the Master Server in HBase. It
is the process that assigns regions to Region Servers and
handles DDL operations (such as creating and deleting tables).
It monitors all Region Server instances present in the cluster. In
a distributed environment, the Master runs several background
threads. HMaster has many responsibilities, such as controlling
load balancing, failover, etc.
2. Region Server –
HBase tables are divided horizontally by row key range into
Regions. Regions are the basic building blocks of an HBase
cluster: they hold the distributed portions of tables and are
made up of column families. A Region Server runs on an HDFS
DataNode in the Hadoop cluster and is responsible for handling,
managing, and executing read and write HBase operations on
its set of regions. The default size of a region is 256 MB.
3. Zookeeper –
It is like a coordinator in HBase. It provides services like
maintaining configuration information, naming, providing
distributed synchronization, server failure notification etc.
Clients look up region server locations via ZooKeeper and then communicate with the region servers directly.
Advantages of HBase –
Disadvantages of HBase –
2. No transaction support
HDFS Architecture
There are two kinds of files in NameNode: FsImage files and EditLogs files:
1. FsImage: It contains all the details about the filesystem, including all the
directories and files, in a hierarchical format. It is called the file system
image because it is a point-in-time snapshot of the filesystem namespace.
2. EditLogs: The EditLogs file keeps track of what modifications have
been made to the files of the filesystem.
Secondary NameNode
1. It periodically fetches the EditLogs (the transaction log of namespace
changes) and the current FsImage from the NameNode and merges them
into a new, compacted FsImage, so that all of this log data is consolidated
in one place and can be replayed quickly.
2. The merged FsImage is sent back to the NameNode and also kept on the
Secondary NameNode's local disk. This file holds the filesystem metadata:
the description of directories, files, and their block mappings. Because the
metadata is written to disk, it can be read back after a restart to rebuild
the namespace, while the actual data blocks remain stored on the
DataNodes.
3. The checkpointed FsImage can also be used for recovery and backup. If the
NameNode's metadata is lost or corrupted, a new FsImage can be created
from the most recent checkpoint. In the same way, data stored in a Hadoop
cluster can be backed up to another Hadoop cluster or to a local file
system. Note that the Secondary NameNode is a checkpointing helper; it is
not a hot standby for the NameNode.
DataNode
Backup Node
A Backup Node keeps an always up-to-date, in-memory copy of the filesystem
namespace: it receives a live stream of namespace edits from the active
NameNode and also writes checkpoints of the FsImage and EditLogs to its own
disk. It deals only with NameNode metadata; it does not store file data, and it is
not automatically promoted if the active NameNode fails (automatic failover is
the job of a Standby NameNode in an HDFS high-availability setup). Failures of
DataNodes are handled differently: lost blocks are recovered from replicas held
on other DataNodes, not from the Backup Node.
Blocks
The default block size is 128 MB, and it can be changed (for example, to 64 MB
or 256 MB) depending on the performance required. Files are written to the
DataNodes block by block, and new data is appended at the end of the file's last
block. Blocks are replicated across DataNodes to ensure consistency and fault
tolerance: if a node fails, the system automatically re-replicates its blocks from
the remaining healthy copies. DataNodes store each block as files on their local
disks, while HDFS presents them to clients as one logical file system. This
architecture allows HDFS to scale horizontally as the number of users and the
volume of data increase. When a file is larger than the block size, the extra data
simply goes into the next block. For example, if a file is 135 MB and the block
size is 128 MB, two blocks will be created: the first block will be 128 MB, and
the second will hold the remaining 7 MB rather than occupying a full block of
disk space. This ensures that storage is not wasted on partially filled blocks.
Features of HDFS
Horizontal scalability means the cluster grows by adding more nodes,
while the data spread across those nodes is still presented as a single
file system; vertical scalability means adding more resources (CPU,
memory, disk) to existing nodes. Data is replicated to ensure integrity,
and the replication factor controls how many copies of each block are
kept. HDFS can store petabytes of data in a single cluster and handles
the load by automatically choosing suitable DataNodes to store data on.
Data can be read quickly because it is stored on multiple nodes, and
replication across nodes increases the reliability of the data.
HDFS is able to survive machine crashes and recover from data corruption.
It operates on the principle of replication, so in the event of a failure it can
continue operating as long as replicas are available. Data is duplicated and
stored on different machines in the HDFS cluster; by default, a replica of
every block is stored on three DataNodes. The NameNode maintains these
copies: it keeps track of which blocks are under- or over-replicated and
adds or deletes copies accordingly.
Write Operation
To write a file, the client first asks the NameNode where to store it; the
NameNode records the file in its namespace and returns, for each block, a list of
DataNodes that will hold the replicas. The client splits the file into blocks and
streams each block to the first DataNode in the list; DataNode 1 receives the
data and passes it on to DataNode 2, which passes it to DataNode 3, forming a
replication pipeline. Acknowledgements flow back along the pipeline to the
client, and the process continues until all DataNodes have received the data.
When the last block has been written, the client closes the file and the
NameNode commits it, so the complete file is available to the application.
Splitting a file into blocks in this way lets the NameNode spread storage across
the cluster and improves fault tolerance and availability.
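From the client's point of view, this whole pipeline is hidden behind the HDFS FileSystem API. A minimal hedged sketch follows; the NameNode address and path are placeholders.

```java
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode:8020");   // placeholder NameNode address

        try (FileSystem fs = FileSystem.get(conf);
             FSDataOutputStream out = fs.create(new Path("/data/events/part-0001.txt"))) {
            // The client just writes bytes; HDFS splits them into blocks and
            // pushes each block through the DataNode replication pipeline.
            out.write("user=42,action=click\n".getBytes(StandardCharsets.UTF_8));
        }
    }
}
```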
Read Operation
To read a file, the client first asks the NameNode for the file's block locations.
The NameNode holds only metadata; it returns, for each block, the list of
DataNodes that hold a replica, ordered by proximity to the client. The client then
reads each block directly from the nearest DataNode. If a DataNode fails or a
replica turns out to be corrupt mid-read, the client simply switches to another
DataNode that has a copy of the same block. The blocks are reassembled on the
client side into the original file, and checksums are verified as the data is read.
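The read side is symmetrical in the client API: the client opens the file, the NameNode supplies block locations behind the scenes, and the bytes stream directly from the DataNodes. A hedged sketch with placeholder addresses and paths:

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode:8020");   // placeholder NameNode address

        try (FileSystem fs = FileSystem.get(conf);
             BufferedReader reader = new BufferedReader(
                     new InputStreamReader(fs.open(new Path("/data/events/part-0001.txt")),
                                           StandardCharsets.UTF_8))) {
            // fs.open() returns a stream that fetches each block from the
            // nearest DataNode holding a replica; failed replicas are skipped.
            reader.lines().forEach(System.out::println);
        }
    }
}
```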
1. It is a highly scalable data storage system. This makes it ideal for data-
intensive applications like Hadoop and streaming analytics. Another
major benefit of Hadoop is that it is easy to set up. This makes it ideal
for non-technical users.
2. It is very easy to implement, yet very robust. There is a lot of flexibility
you get with Hadoop. It is a fast and reliable file system.
3. This makes Hadoop a great fit for a wide range of data applications.
The most common one is analytics. You can use Hadoop to process
large amounts of data quickly, and then analyze it to find trends or make
recommendations. The most common type of application that uses
Hadoop analytics is data crunching.
4. You can grow the cluster either by adding more nodes (horizontal
scaling) or by adding more resources to the existing nodes (vertical
scaling). If many clients need to store data on HDFS, you can easily
scale the cluster horizontally by adding more nodes; once the cluster
has grown, it can serve more clients.
5. Storage can be set up as a centralized database, distributed across a
cluster of commodity machines, or a combination of both. A common
setup is to run the storage nodes as virtual machines on existing
servers.
6. Specialization reduces the overhead of data movement across the
cluster and provides high availability of data.
7. Automatic data replication can be accomplished with a variety of
technologies, including RAID, Hadoop, and database replication.
Logging data and monitoring it for anomalies can also help to detect
and respond to hardware and software failures.
Key-value: This data model comprises two parts: a key and a value.
The key is like an index, used to look up and access the value
containing the data stored in the database.
Document: Documents are self-contained, meaning all the
information related to a single record is stored within one document.
This makes it easier to add or modify data as needed.
Column family: Column-family databases store data in columns that are grouped into column families rather than in rows.
Graph: Graph databases use nodes and edges to represent
relationships between different data sets.
Some popular NoSQL databases include MongoDB, Cassandra, and HBase.
Data Structure
A data structure is the way data is organized and stored in a database.
Relational databases use a table-based structure with rows and columns,
while non-relational databases use various data models, such as key-
value, document, column-family, and graph.
Performance
Relational databases can provide strong data consistency and integrity but
may be slower in performance for certain use cases. Non-relational
databases can offer faster performance for specific use cases, such as big
data and real-time processing.
Scalability
Relational databases have limited scalability, making them less suitable for
large datasets and high read/write loads. Non-relational databases are
highly scalable and can handle large-scale, distributed data more
efficiently.
Query Language
Relational databases use SQL for querying and manipulating data, while
non-relational databases typically use their own query languages or APIs,
which can vary between different databases.
Schema
Relational databases have a predefined schema which makes them better
suited for structured data. Non-relational databases, however, are more
flexible and can accommodate various types of data.
Development
Relational databases require more development effort when it comes to
creating complex queries or changing the database structure. On the other
hand, non-relational databases are easier to develop and require fewer
resources.
(Diagram: virtual machines running on virtualization software, which in turn runs on the physical hardware.)
• Data virtualization software acts as a bridge across multiple, diverse data
sources, bringing critical decision-making data together in one virtual
place to fuel analytics.
• It is a method for combining data from various sources and different
types into a comprehensive, logical representation without physically
relocating the data
• Instead of performing data movement and physically storing integrated
views in a destination data structure, data virtualization can be utilized
to construct virtualized and integrated views of data in memory. To
make querying logic simpler, it provides an abstraction layer over the
actual physical implementation of data
A hypervisor allows a single host computer to support multiple virtual machines
(VMs) by sharing resources, including memory and processing power; it is the
layer on which the VMs are built and run.
VIRTUALIZATION APPROACHES
1)SERVER VIRTUALIZATION
In server virtualization, one physical server is partitioned into multiple virtual
servers. The hardware and resources of a machine — including the random
access memory (RAM), CPU, hard drive, and network controller — can be
virtualized into a series of virtual machines that each runs its own applications
and operating system. A single thin layer of software, the virtual machine
monitor (also called a hypervisor), is inserted on top of the hardware; it manages
the traffic between the virtual machines and the physical machine.
2)Application virtualization
Application virtualization means encapsulating applications in a way that they
would not be dependent on the underlying physical computer system.
Application virtualization improves the manageability and portability of
applications. It can be used along with server virtualization to meet business
needs. Application virtualization ensures that big data applications can access
resources based on their relative priority with respect to each other. Big data
applications have significant IT resource requirements, and application
virtualization can help them access those resources at low cost.
(Diagram: users accessing applications that are virtualized on a server.)
3)Network Virtualization
Network virtualization uses virtual networking as a tool for connecting
resources. When implementing network virtualization, you do not need to rely
on the physical network to manage traffic between connections; you can create
as many virtual networks as you need from a single physical implementation.
In a big data environment, network virtualization helps in defining different
networks with different sets of performance and capacity to manage the large,
distributed data required for big data analysis.
Here's a closer look at what's in the image and the relationship between
the components:
What makes big data big is that it relies on picking up lots of data
from lots of sources. Therefore, open application programming
interfaces (APIs) will be core to any big data architecture.
In addition, keep in mind that interfaces exist at every level and
between every layer of the stack. Without integration services, big
data can't happen.
(or)
Security infrastructure - The information about your
constituents must be protected in order to comply with
regulatory requirements as well as to protect their privacy.
Operational data sources - Relational databases used to store
the highly structured data handled by the line of business.
Organizing databases and tools - Structured databases and
tools used to organize, store, and process the data.
Analytical Data warehouse - The addition of an analytical
data warehouse simplifies the data for the development of
reports.
Reporting and visualization - Enable the processing of data
while providing a user-friendly depiction of the results.
1. Highly scalable
A framework with excellent scalability is Apache Hadoop
MapReduce. This is because of its capacity for distributing and
storing large amounts of data across numerous servers. These
servers can all run simultaneously and are all reasonably
priced.
By adding servers to the cluster, we can simply grow the
amount of storage and computing power. We may improve the
capacity of nodes or add any number of nodes (horizontal
scalability) to attain high computing power. Organizations may
execute applications from massive sets of nodes, potentially
using thousands of terabytes of data, thanks to Hadoop
MapReduce programming.
2. Versatile
Businesses can use MapReduce programming to access new
data sources. It makes it possible for companies to work with
many forms of data. Enterprises can access both organized and
unstructured data with this method and acquire valuable
insights from the various data sources.
Since Hadoop is an open-source project, its source code is
freely accessible for review, alterations, and analyses. This
enables businesses to alter the code to meet their specific
needs. The MapReduce framework supports data from sources
including email, social media, and clickstreams in different
languages.
3. Secure
The MapReduce programming model uses the HBase and HDFS
security approaches, and only authenticated users are
permitted to view and manipulate the data. HDFS uses a
replication technique in Hadoop 2 to provide fault tolerance.
Depending on the replication factor, it makes a clone of each
block on the various machines. One can therefore access data
from the other devices that house a replica of the same data if
any machine in a cluster goes down. In Hadoop 3, erasure coding
has taken over the role of this replication technique. Erasure
coding delivers a comparable level of fault tolerance with far less
storage: the storage overhead with erasure coding is below 50%.
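As a rough worked illustration, using the widely used RS(6,3) Reed-Solomon policy in Hadoop 3: every 6 data blocks are stored with 3 parity blocks, a storage overhead of 3/6 = 50%, and any 3 blocks of the 9-block stripe can be lost and reconstructed. Three-way replication, by contrast, stores 2 extra copies of every block (200% overhead) and tolerates the loss of 2 copies.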
4. Affordability
With the help of the MapReduce programming framework and
Hadoop’s scalable design, big data volumes may be stored and
processed very affordably. Such a system is particularly cost-
effective and highly scalable, making it ideal for business
models that must store data that is constantly expanding to
meet the demands of the present.
In terms of scalability, processing data with older, conventional
relational database management systems was not as simple as
it is with the Hadoop system. In these situations, the company
had to minimize the data and execute classification based on
presumptions about how specific data could be relevant to the
organization, hence deleting the raw data. The MapReduce
programming model in the Hadoop scale-out architecture helps
in this situation.
5. Fast-paced
The Hadoop Distributed File System, a distributed storage
technique used by MapReduce, is a mapping system for finding
data in a cluster. The data processing technologies, such as
MapReduce programming, are typically placed on the same
servers that hold the data, which enables quicker data processing.
Thanks to Hadoop’s distributed data storage, users may
process data in a distributed manner across a cluster of nodes.
As a result, it gives the Hadoop architecture the capacity to
process data exceptionally quickly. Hadoop MapReduce can
process unstructured or semi-structured data in high numbers
in a shorter time.
7. Parallel processing-compatible
The parallel processing involved in MapReduce programming is
one of its key components. The tasks are divided in the
programming paradigm to enable the simultaneous execution
of independent activities. As a result, the program runs faster
because of the parallel processing, which makes it simpler for
the processes to handle each job. Multiple processors can carry
out these broken-down tasks thanks to parallel processing.
Consequently, the entire software runs faster.
8. Reliable
The same set of data is transferred to some other nodes in a
cluster each time a collection of information is sent to a single
node. Therefore, even if one node fails, backup copies are
always available on other nodes that may still be retrieved
whenever necessary. This ensures high data availability.
The framework offers a way to guarantee data trustworthiness
through the use of Block Scanner, Volume Scanner, Disk
Checker, and Directory Scanner modules. Your data is safely
saved in the cluster and is accessible from another machine
that has a copy of the data if your device fails or the data
becomes corrupt.
9. Highly available
Hadoop’s fault tolerance feature ensures that even if one of the
DataNodes fails, the user may still access the data from other
DataNodes that have copies of it. Moreover, the high
accessibility Hadoop cluster comprises two or more active and
passive NameNodes running on hot standby. The active
NameNode is the active node. A passive node is a backup node
that applies changes made in active NameNode’s edit logs to
its namespace.
Key Features of MapReduce
There are some key features of MapReduce below:
Scalability
MapReduce can scale to process vast amounts of data by distributing tasks across a
large number of nodes in a cluster. This allows it to handle massive datasets, making
it suitable for Big Data applications.
Fault Tolerance
MapReduce incorporates built-in fault tolerance to ensure the reliable processing of
data. It automatically detects and handles node failures, rerunning tasks on available
nodes as needed.
Data Locality
MapReduce takes advantage of data locality by processing data on the same node
where it is stored, minimizing data movement across the network and improving
overall performance.
Simplicity
The MapReduce programming model abstracts away many complexities associated
with distributed computing, allowing developers to focus on their data processing
logic rather than low-level details.
Cost-Effective Solution
Hadoop's scalable architecture and MapReduce programming framework make
storing and processing extensive data sets very economical.
Parallel Programming
Tasks are divided into programming models to allow for the simultaneous execution
of independent operations. As a result, programs run faster due to parallel
processing, making it easier for a process to handle each job. Thanks to parallel
processing, these distributed tasks can be performed by multiple processors.
Therefore, all software runs faster.
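To tie these features together, a typical MapReduce job is configured and submitted by a driver class. Below is a hedged sketch using the standard Hadoop MapReduce API; the mapper/reducer classes (from the earlier TweetWordCount sketch) and the input/output paths are placeholders.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCountDriver.class);

        // Mapper and reducer from the earlier sketch (placeholder class names).
        job.setMapperClass(TweetWordCount.TokenizerMapper.class);
        job.setCombinerClass(TweetWordCount.CountReducer.class);   // local aggregation cuts shuffle traffic
        job.setReducerClass(TweetWordCount.CountReducer.class);

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        FileInputFormat.addInputPath(job, new Path("/data/tweets"));        // placeholder input path
        FileOutputFormat.setOutputPath(job, new Path("/data/wordcounts"));  // placeholder output path

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```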
HBase
HBase is a data model that is similar to Google’s big table. It is an open
source, distributed database developed by Apache software foundation written
in Java. HBase is an essential part of our Hadoop ecosystem. HBase runs on
top of HDFS (Hadoop Distributed File System). It can store massive amounts
of data, from terabytes to petabytes. It is column-oriented and horizontally
scalable. HBase is well suited for sparse data sets, which are very common in big
data use cases. HBase provides APIs enabling development in practically any
programming language, and it provides random, real-time read/write access to
data in the Hadoop File System.
Why HBase
HBase Architecture
FEATURES OF HBase
1. It is linearly scalable across various nodes as well as modularly
scalable, as data is divided across various nodes.
2. HBase provides consistent reads and writes.
3. It provides atomic reads and writes: while one read or write
operation on a row is in progress, other processes are prevented
from performing conflicting read or write operations on it.
4. It provides easy to use Java API for client access.
5. It supports Thrift and REST API for non-Java front ends which
supports XML, Protobuf and binary data encoding options.
6. It supports a Block Cache and Bloom Filters for real-time queries
and for high volume query optimization.
7. HBase provides automatic failure support between Region Servers.
8. It supports exporting metrics via the Hadoop metrics subsystem
to files.
9. It doesn’t enforce relationships within your data.
10. It is a platform for storing and retrieving data with random access.
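A minimal hedged sketch of the HBase Java client API mentioned above; the table name, column family, row key, and values are placeholders.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseClientExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();   // reads hbase-site.xml from the classpath

        try (Connection connection = ConnectionFactory.createConnection(conf);
             Table users = connection.getTable(TableName.valueOf("users"))) {

            // Random real-time write: one row, one column family, one qualifier.
            Put put = new Put(Bytes.toBytes("user#42"));
            put.addColumn(Bytes.toBytes("profile"), Bytes.toBytes("city"), Bytes.toBytes("Pune"));
            users.put(put);

            // Random real-time read of the same cell.
            Result result = users.get(new Get(Bytes.toBytes("user#42")));
            System.out.println(Bytes.toString(
                    result.getValue(Bytes.toBytes("profile"), Bytes.toBytes("city"))));
        }
    }
}
```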