
Unit - 3

3.1 Processing Data with Hadoop:

Data processing in Hadoop is a fundamental aspect of working with large datasets. Hadoop, an
open-source framework, enables distributed storage and processing of massive amounts of data
across clusters of commodity hardware. This distributed approach makes it possible to handle
datasets that would be too large and complex for traditional single-server systems.

Data Ingestion and Storage (HDFS):

• Data Ingestion: Data from various sources (e.g., weblogs, social media feeds, sensor data) is ingested into the Hadoop cluster.
• Hadoop Distributed File System (HDFS): Hadoop uses HDFS as its storage layer. HDFS divides the input data into large blocks (typically 128 MB or 256 MB) and distributes these blocks across multiple nodes in the cluster.
• Replication: To ensure fault tolerance, HDFS replicates each data block multiple times (typically 3) and stores these replicas on different nodes. If one node fails, the data is still available from other replicas.
• NameNode and DataNodes: HDFS has a master-slave architecture. The NameNode manages the file system namespace and metadata (location of blocks). DataNodes store the actual data blocks.
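As a brief, hedged illustration of how data can be ingested into HDFS programmatically (the class name and the local and HDFS paths below are placeholders, not part of the original material), the Java FileSystem API can copy a local file into the cluster; block placement and replication are then handled by the NameNode and DataNodes:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsIngestExample {
    public static void main(String[] args) throws Exception {
        // Picks up fs.defaultFS (the NameNode address) from core-site.xml on the classpath.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        Path localFile = new Path("/tmp/weblogs.txt");   // placeholder local path
        Path hdfsDir = new Path("/data/raw/weblogs");    // placeholder HDFS directory

        if (!fs.exists(hdfsDir)) {
            fs.mkdirs(hdfsDir);                          // create the target directory
        }
        // Copy the local file into HDFS; it is split into blocks and replicated
        // across DataNodes according to dfs.blocksize and dfs.replication.
        fs.copyFromLocalFile(localFile, hdfsDir);
        fs.close();
    }
}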

Data Processing Framework (MapReduce & YARN):


• MapReduce: This is the original programming model for processing data in Hadoop. It involves two main functions:
  o Map: Takes input data as key-value pairs and processes each pair to generate intermediate key-value pairs. This stage runs in parallel across the data blocks.
  o Reduce: Takes the intermediate key-value pairs (grouped by key) from the map stage and aggregates or combines the values to produce the final output. This stage also runs in parallel.
• Yet Another Resource Negotiator (YARN): Introduced in Hadoop 2.0, YARN is the resource management layer. It decouples resource management from the MapReduce programming model, allowing Hadoop to support other processing frameworks (like Spark and Tez).
  o ResourceManager: Manages the allocation of cluster resources (CPU, memory).
  o NodeManager: Manages the resources on individual nodes.
  o ApplicationMaster: Manages the lifecycle of each application (e.g., a MapReduce job).

Data Processing Workflow:

The typical data processing flow in Hadoop using MapReduce involves the following steps:

• Input Splitting: The input data is divided into smaller, logical chunks called splits. Each split is processed by a separate map task.
• Mapping: Each map task processes its assigned input split and generates intermediate key-value pairs based on the logic defined in the mapper function.
• Shuffling and Sorting: The intermediate key-value pairs from all the map tasks are shuffled and sorted based on the keys. This process groups all the values associated with the same key together.
• Reducing: Each reduce task processes the sorted intermediate key-value pairs for a specific set of keys. It applies the logic defined in the reducer function to aggregate, filter, or transform the data, producing the final output.
• Output: The output from the reduce tasks is written back to HDFS (see the sketch after this list).
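To make this workflow concrete, below is a minimal word-count sketch against the Hadoop MapReduce Java API (org.apache.hadoop.mapreduce). The class names, input path, and output path are illustrative placeholders rather than a prescribed solution.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map phase: emit (word, 1) for every word in the assigned input split.
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);   // intermediate <key, value> pair
            }
        }
    }

    // Reduce phase: sum the counts for each word after shuffle and sort.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);     // final output written to HDFS
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));    // HDFS input directory
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // HDFS output directory
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Submitting the packaged job (for example with hadoop jar, using placeholder input and output directories) lets the framework handle input splitting, the shuffle and sort, and writing the reducer output back to HDFS.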

3.2 MapReduce Working


• MapReduce programming helps to process massive amounts of data in parallel.
• The input data set is split into independent chunks. Map tasks process these independent chunks completely in parallel.
• The reduce task provides the reduced output by combining the output of the various mappers.

There are two daemons associated with MapReduce programming: JobTracker and TaskTracker.

JobTracker:
The JobTracker is the master daemon responsible for executing the MapReduce job. It provides connectivity between Hadoop and the application.

Whenever code is submitted to the cluster, the JobTracker creates the execution plan by deciding which task to assign to which node.

It also monitors all the running tasks. When a task fails, it automatically reschedules the task to a different node after a predefined number of retries.

There is one JobTracker process running per Hadoop cluster. The JobTracker process runs in its own Java Virtual Machine process.

Fig. Job Tracker and Task Tracker interaction

TaskTracker:
This daemon is responsible for executing the individual tasks that are assigned by the JobTracker.

The TaskTracker continuously sends heartbeat messages to the JobTracker. When the JobTracker fails to receive a heartbeat message from a TaskTracker, it assumes that the TaskTracker has failed and resubmits the task to another available node in the cluster.

MapReduce Framework

Phases:
• Map: Converts input into key-value pairs.
• Reduce: Combines the output of mappers and produces a reduced result set.

Daemons:
• JobTracker: Master, schedules tasks.
• TaskTracker: Slave, executes tasks.

MapReduce working:
MapReduce divides a data analysis task into two parts – Map and Reduce.
In the example given below, there are two mappers and one reducer.
Each mapper works on the partial data set that is stored on that node, and the reducer combines the output from the mappers to produce the reduced result set.
Steps:
1. First, the input dataset is split into multiple pieces of data.
2. Next, the framework creates a master and several slave processes and
executes the worker processes remotely.
3. Several map tasks work simultaneously and read pieces of data that
were assigned to each map task.
4. Map worker uses partitioner function to divide the data into
regions.
5. When the map slaves complete their work, the master instructs the
reduce slaves to begin their work.
6. When all the reduce slaves complete their work, the master
transfers the control to the user program.

Fig. MapReduce Programming Architecture


In MapReduce programming, jobs (applications) are split into a set of map tasks and reduce tasks.
The map task takes care of loading, parsing, transforming and filtering. The responsibility of the reduce task is grouping and aggregating the data produced by the map tasks to generate the final output.
Each map task is broken down into the following phases:
1. RecordReader 2. Mapper 3. Combiner 4. Partitioner
The output produced by the map task is known as intermediate <key, value> pairs. These intermediate <key, value> pairs are sent to the reducer.
The reduce tasks are broken down into the following phases:
1. Shuffle 2. Sort 3. Reducer 4. Output Format
Hadoop assigns map tasks to the DataNode where the actual data to be processed resides. This way, Hadoop ensures data locality. Data locality means that data is not moved over the network; only the computational code is moved to process the data, which saves network bandwidth.

Mapper Phases:
The Mapper maps the input <key, value> pairs into a set of intermediate <key, value> pairs.
Each map task is broken into the following phases:

1. RecordReader: Converts the byte-oriented view of the input into a record-oriented view and presents it to the Mapper task, supplying it with keys and values.
i) InputFormat: reads the given input file and splits it using the method getSplits().
ii) It then defines a RecordReader using createRecordReader(), which is responsible for generating <key, value> pairs.

2. Mapper: The map function works on the <key, value> pairs produced by the RecordReader and generates intermediate <key, value> pairs.
Methods:
- protected void cleanup(Context context): called once at the end of the task.
- protected void map(KEYIN key, VALUEIN value, Context context): called once for each key-value pair in the input split.
- void run(Context context): the user can override this method for complete control over the execution of the Mapper.
- protected void setup(Context context): called once at the beginning of the task to perform the activities required to initialize the map() method.
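The following is a small illustrative sketch (the class name, stop-word list, and filtering logic are assumptions, not taken from the original text) showing how setup(), map(), and cleanup() fit together in a custom Mapper:

import java.io.IOException;
import java.util.HashSet;
import java.util.Set;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Emits (word, 1) for every word that is not a stop word.
public class StopWordFilterMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

    private final static IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();
    private final Set<String> stopWords = new HashSet<>();

    @Override
    protected void setup(Context context) {
        // Called once per task, before any map() call; a real job might instead
        // load this list from the job configuration or a distributed cache file.
        stopWords.add("the");
        stopWords.add("a");
        stopWords.add("of");
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        StringTokenizer itr = new StringTokenizer(value.toString().toLowerCase());
        while (itr.hasMoreTokens()) {
            String token = itr.nextToken();
            if (!stopWords.contains(token)) {
                word.set(token);
                context.write(word, ONE);  // intermediate <key, value> pair
            }
        }
    }

    @Override
    protected void cleanup(Context context) {
        // Called once per task after the last map() call; nothing heavy to release here.
        stopWords.clear();
    }
}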

3. Combiner: It takes the intermediate <key, value> pairs provided by the mapper and applies a user-specified aggregate function to the output of a single mapper only. It is also known as a local reducer.
We can optionally specify a combiner using Job.setCombinerClass(ReducerClass) to perform local aggregation on the intermediate outputs.

Fig. MapReduce without Combiner class

Fig. MapReduce with Combiner class
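A hedged driver sketch showing where Job.setCombinerClass() fits; it reuses the TokenizerMapper and IntSumReducer classes assumed in the earlier word-count sketch, and the job name and paths are placeholders:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountWithCombiner {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count with combiner");
        job.setJarByClass(WordCountWithCombiner.class);
        job.setMapperClass(WordCount.TokenizerMapper.class);
        // Combiner: runs on each mapper's output before the shuffle,
        // reducing the volume of intermediate data sent over the network.
        job.setCombinerClass(WordCount.IntSumReducer.class);
        job.setReducerClass(WordCount.IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Reusing the reducer as the combiner is only valid because summing counts is both commutative and associative; a combiner must never change the final result.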

4. Partitioner: It takes the intermediate <key, value> pairs produced by the mapper and splits them into partitions using a user-defined condition.
The default behavior is to hash the key to determine the reducer. The user can control the partitioning by overriding the method:
int getPartition(KEY key, VALUE value, int numPartitions)
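A minimal sketch of a custom partitioner, assuming a job configured with two reduce tasks; the class name and the a-m / n-z routing rule are illustrative assumptions:

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Routes intermediate <key, value> pairs to reducers by the first letter of the key.
public class AlphabetPartitioner extends Partitioner<Text, IntWritable> {

    @Override
    public int getPartition(Text key, IntWritable value, int numPartitions) {
        String k = key.toString();
        if (k.isEmpty() || numPartitions < 2) {
            return 0;                       // fall back to a single partition
        }
        char first = Character.toLowerCase(k.charAt(0));
        return (first >= 'a' && first <= 'm') ? 0 : 1;
    }
}

It would be registered in the driver with job.setPartitionerClass(AlphabetPartitioner.class) and job.setNumReduceTasks(2).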
Reducer Phases:
1. Shuffle & Sort:
• Downloads the grouped key-value pairs onto the local machine where the Reducer is running.
• The individual <key, value> pairs are sorted by key into a larger data list.
• The data list groups the equivalent keys together so that their values can be iterated easily in the Reducer task.
2. Reducer:
• The Reducer takes the grouped key-value data as input and runs a Reducer function on each group.
• Here, the data can be aggregated, filtered, and combined in a number of ways, and it can require a wide range of processing.
• Once the execution is over, it gives zero or more key-value pairs to the final step.
Methods:
- protected void cleanup(Context context): called once at the end of the task.
- protected void reduce(KEYIN key, Iterable<VALUEIN> values, Context context): called once for each key, with all of the values grouped for that key.
- void run(Context context): the user can override this method for complete control over the execution of the Reducer.
- protected void setup(Context context): called once at the beginning of the task to perform the activities required to initialize the reduce() method.
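As a hedged illustration of these methods (the class name and the "wordcount.min.count" configuration key are assumptions), here is a reducer that reads a threshold in setup() and emits only the keys whose summed count reaches it:

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Sums the counts for each key and emits only keys whose total meets a threshold.
public class ThresholdSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

    private int minCount;
    private final IntWritable result = new IntWritable();

    @Override
    protected void setup(Context context) {
        // Called once before any reduce() call; reads a job-level setting.
        minCount = context.getConfiguration().getInt("wordcount.min.count", 1);
    }

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable val : values) {
            sum += val.get();
        }
        if (sum >= minCount) {
            result.set(sum);
            context.write(key, result);  // zero or more output pairs per key
        }
    }
}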

3. Output Format:
• In the output phase, we have an output formatter that translates the final key-value pairs from the Reducer function and writes them onto a file using a record writer.

What is NoSQL?

NoSQL databases (AKA "not only SQL") store data differently than relational tables.
NoSQL databases come in a variety of types based on their data model. The main
types are document, key-value, wide-column, and graph. They provide flexible
schemas and scale easily with large amounts of big data and high user loads.


What is a NoSQL database?


When people use the term “NoSQL database,” they typically use it to refer to any
non-relational database. Some say the term “NoSQL” stands for “non-SQL” while
others say it stands for “not only SQL.” Either way, most agree that NoSQL databases
store data in a more natural and flexible way. NoSQL, as opposed to SQL, is a
database management approach, whereas SQL is just a query language, similar to the
query languages of NoSQL databases.

Types of databases — NoSQL

Over time, four major types of NoSQL databases have emerged: document databases,
key-value databases, wide-column stores, and graph databases. Nowadays, multi-
model databases are also becoming quite popular.

Document-oriented databases

A document-oriented database stores data in documents similar to JSON (JavaScript Object Notation) objects. Each document contains pairs of fields and values. The values can typically be a variety of types, including strings, numbers, booleans, arrays, or even other objects. A document database offers a flexible data model, well suited for semi-structured and unstructured data sets. Document databases also support nested structures, making it easy to represent complex relationships or hierarchical data.

Examples of document databases are MongoDB and Couchbase. A typical document will look like the following:

Code Snippet

"_id": "12345",
"name": "foo bar",
"email": "foo@bar.com",

"address": {

"street": "123 foo street",


"city": "some city",

"state": "some state",


"zip": "123456"

},
"hobbies": ["music", "guitar", "reading"]

Key-value databases

A key-value store is a simpler type of database where each item contains keys and
values. Each key is unique and associated with a single value. They are used for
caching and session management and provide high performance in reads and writes
because they tend to store things in memory. Examples are Amazon DynamoDB and
Redis. A simple view of data stored in a key-value database is given below:

Code Snippet

Key: user:12345
Value: {"name": "foo bar", "email": "foo@bar.com", "designation": "software
developer"}

Wide-column stores

Wide-column stores store data in tables, rows, and dynamic columns. The data is
stored in tables. However, unlike traditional SQL databases, wide-column stores are
flexible, where different rows can have different sets of columns. These databases can
employ column compression techniques to reduce the storage space and enhance
performance. The wide rows and columns enable efficient retrieval of sparse and wide
data. Some examples of wide-column stores are Apache Cassandra and HBase. A typical example of how data is stored in a wide-column store is as follows:

name      | id    | email       | dob        | city
Foo bar   | 12345 | foo@bar.com |            | Some city
Carn Yale | 34521 | bar@foo.com | 12-05-1972 |

Graph databases

A graph database stores data in the form of nodes and edges. Nodes typically store
information about people, places, and things (like nouns), while edges store
information about the relationships between the nodes. They work well for highly
connected data, where the relationships or patterns may not be very obvious initially.
Examples of graph databases are Neo4j and Amazon Neptune. MongoDB also provides graph traversal capabilities using the $graphLookup stage of the aggregation pipeline. Below is an example of how data can be stored:
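A simple, purely hypothetical illustration of nodes and edges in a graph store (the entities and relationship names are made up):

Code Snippet

Nodes:
  (user:12345)  {"name": "foo bar"}
  (user:67890)  {"name": "carn yale"}
  (group:42)    {"name": "guitar club"}

Edges:
  (user:12345) -[:FRIENDS_WITH]-> (user:67890)
  (user:12345) -[:MEMBER_OF]->    (group:42)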
NoSQL databases offer several advantages, primarily due to their flexibility,
scalability, and ability to handle large, diverse datasets. They excel at storing and
retrieving data with minimal schema requirements, making them suitable for various
data structures and formats. Their scalability and distributed nature allow them to
handle massive workloads and changing data volumes.
Key advantages of NoSQL databases:
• Flexibility: NoSQL databases can store various data types, including structured, semi-structured, and unstructured data, without rigid schema requirements. This makes them ideal for applications with evolving data models.
• Scalability: NoSQL databases are designed for horizontal scaling, meaning you can easily add more servers to handle increasing data volume and traffic without downtime. This is particularly beneficial for cloud environments.
• High performance: NoSQL databases are often optimized for specific data models and access patterns, leading to faster query execution and lower latency.
• Ease of use: Many NoSQL databases offer user-friendly APIs and simplified data access, making them easier to develop with compared to traditional SQL databases.
• Big data handling: NoSQL databases are well suited for managing large datasets and providing fast analytics.
• Cost-effective: NoSQL databases can be more cost-effective than relational databases, especially for cloud deployments.

The five critical differences between SQL and NoSQL are:
• SQL databases are relational, and NoSQL databases are non-relational.
• SQL databases use structured query language (SQL) and have a predefined schema. NoSQL databases have dynamic schemas for unstructured data.
• SQL databases are vertically scalable, while NoSQL databases are horizontally scalable.
• SQL databases are table-based, while NoSQL databases are document, key-value, graph, or wide-column stores.
• SQL databases are better for multi-row transactions, while NoSQL is better for unstructured data like documents or JSON.

NewSQL (pronounced "new ess-cue-ell" or "new sequel") is a relational database management system (RDBMS) that aims to provide NoSQL system scalability while also maintaining the consistency of a traditional database system.

NewSQL combines ACID (atomicity, consistency, isolation and durability) compliance with horizontal scaling for online transaction processing workloads.
Enterprise systems that handle data, such as financial and order processing systems,
are too big for a traditional relational database. At the same time, these enterprise
systems aren’t practical for NoSQL systems because they have transactional and
consistency requirements. NewSQL provides the scale and reliability without
requiring more infrastructure or development expenditures.
NewSQL uses SQL to ingest new information, execute transaction processing at a
large scale, and change the contents of the database. The main categories of NewSQL
include new architectures, transparent sharding middleware, SQL engines and
database as a service (DBaaS).

Key Features of NewSQL


NewSQL databases incorporate several key features that make them stand out in the
data management landscape, especially when compared to traditional relational
database management systems (RDBMS) and NoSQL databases. Here are the core
features that define NewSQL systems:

SQL Compatibility with Scalability and Performance

NewSQL systems are designed to retain the use of SQL, the structured query
language widely used in traditional databases, while overcoming the performance and
scalability limitations of conventional SQL databases. NewSQL systems introduce
distributed, scalable architectures, often referred to as a shared-nothing architecture,
which allows for horizontal scaling. This means that as data volumes grow, NewSQL
databases can efficiently distribute workloads across multiple nodes without
sacrificing SQL functionality.

ACID Compliance

ACID properties—atomicity, consistency, isolation, and durability—are crucial for ensuring reliable transaction processing in databases. Unlike NoSQL databases, which
often relax ACID compliance in favor of scalability and flexibility, NewSQL systems
maintain full ACID guarantees, even in distributed environments. This allows
NewSQL systems to offer the consistency and transactional integrity of traditional
SQL databases while meeting modern scalability demands.

Distributed, Shared-Nothing Architecture

NewSQL databases typically adopt a distributed system architecture to improve scalability and availability. The shared-nothing model ensures that there is no single
point of contention between nodes, making these databases ideal for handling big
data workloads and maintaining high availability. By partitioning data across many
servers, NewSQL systems provide fault tolerance and the ability to scale out
seamlessly as the dataset grows.

High Availability and Fault Tolerance

Many NewSQL databases are designed with high availability and fault tolerance in
mind. They use mechanisms such as replication and automatic failover to ensure
that data is accessible at all times, even in the event of hardware or network failures.
Systems like Google Spanner exemplify this by providing geographically distributed
databases that ensure data is always available, even across data centers.
Improvements over Traditional Relational Databases

NewSQL addresses the primary performance and scalability challenges of traditional relational databases. While relational databases offer strong consistency, they often
struggle with horizontal scalability, making them less suitable for modern, high-
velocity workloads. NewSQL resolves this by combining the transactional integrity of
relational databases with the ability to scale across distributed systems.

Comparison of Key Features

Feature                | Traditional SQL Database | NoSQL Database           | NewSQL Database
ACID Compliance        | Full                     | Rarely                   | Full
Horizontal Scalability | Limited                  | High                     | High
SQL Support            | Yes                      | No (in most cases)       | Yes
Consistency            | Strong                   | Eventual (in many cases) | Strong
Fault Tolerance        | Moderate (HA required)   | High                     | High
Comparison of NewSQL vs. SQL and NoSQL
When discussing NewSQL, it's important to understand its position between
traditional SQL databases and the newer NoSQL systems. Each of these database
types offers distinct advantages and disadvantages, depending on the workload and
business needs.

NewSQL vs. SQL

NewSQL retains many core features of traditional SQL databases, such as SQL
support and ACID compliance. However, NewSQL improves upon SQL databases
by providing:

• Scalability: Traditional SQL databases, while powerful, typically struggle to scale horizontally, which limits their ability to handle large volumes of data across distributed systems. NewSQL overcomes this with distributed architectures that allow horizontal scaling across nodes.

• Performance: NewSQL databases enhance performance, especially in high-traffic environments where online transaction processing (OLTP) is critical. Systems like Google Spanner demonstrate how NewSQL can combine the reliability of traditional SQL with modern performance needs.

• Fault tolerance: NewSQL is built with high availability in mind. Through mechanisms such as replication and automatic failover, it provides better data durability and fault tolerance than traditional SQL systems.

Table: Comparison of NewSQL, SQL, and NoSQL


Feature         | SQL Database                   | NoSQL Database               | NewSQL Database
ACID Compliance | Full                           | Limited (BASE model)         | Full
SQL Support     | Yes                            | No                           | Yes
Scalability     | Vertical                       | Horizontal                   | Horizontal
Consistency     | Strong                         | Eventual                     | Strong
Data Structure  | Structured (schemas)           | Unstructured/Semi-structured | Structured
Use Case        | Transaction-heavy applications | Web-scale applications       | OLTP with scalability
Fault Tolerance | Moderate (with add-ons)        | High (built-in)              | High (built-in)

Feature               | SQL                                                                   | NoSQL                                                                                        | NewSQL
Relational Property   | Yes, it follows relational modelling to a large extent.              | No, it does not follow a relational model; it was designed to be entirely different from it. | Yes, since the relational model is equally essential for real-time analytics.
ACID                  | Yes, ACID properties are fundamental to their application.           | No, it rather provides for CAP support.                                                      | Yes, ACID properties are taken care of.
SQL Support           | Support for SQL.                                                      | No support for old SQL.                                                                      | Yes, proper support and even enhanced functionalities for old SQL.
OLTP                  | It supports such databases, but it is not the best suited.            | Inefficient for OLTP databases.                                                              | Fully supports OLTP databases and is highly efficient.
Scaling               | Vertical scaling                                                      | Horizontal scaling                                                                           | Vertical + Horizontal scaling
Query Handling        | Can handle simple queries with ease and fails when they get complex. | Better than SQL for processing complex queries.                                              | Highly efficient in processing complex queries as well as smaller queries.
Distributed Databases | No                                                                    | Yes                                                                                          | Yes
