
Unit - 3

3.1 Processing Data with Hadoop:

Data processing in Hadoop is a fundamental aspect of working with large datasets. Hadoop, an
open-source framework, enables distributed storage and processing of massive amounts of data
across clusters of commodity hardware. This distributed approach makes it possible to handle
datasets that would be too large and complex for traditional single-server systems.

Data Ingestion and Storage (HDFS):

• Data Ingestion: Data from various sources (e.g., weblogs, social media feeds, sensor data) is ingested into the Hadoop cluster.
• Hadoop Distributed File System (HDFS): Hadoop uses HDFS as its storage layer. HDFS divides the input data into large blocks (typically 128 MB or 256 MB) and distributes these blocks across multiple nodes in the cluster.
• Replication: To ensure fault tolerance, HDFS replicates each data block multiple times (typically 3) and stores these replicas on different nodes. If one node fails, the data is still available from other replicas.
• NameNode and DataNodes: HDFS has a master-slave architecture. The NameNode manages the file system namespace and metadata (location of blocks). DataNodes store the actual data blocks.
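As a brief, hedged illustration of how data can be ingested into HDFS programmatically (the class name and the local and HDFS paths below are placeholders, not part of the original material), the Java FileSystem API can copy a local file into the cluster; block placement and replication are then handled by the NameNode and DataNodes:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsIngestExample {
    public static void main(String[] args) throws Exception {
        // Picks up fs.defaultFS (the NameNode address) from core-site.xml on the classpath.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        Path localFile = new Path("/tmp/weblogs.txt");   // placeholder local path
        Path hdfsDir = new Path("/data/raw/weblogs");    // placeholder HDFS directory

        if (!fs.exists(hdfsDir)) {
            fs.mkdirs(hdfsDir);                          // create the target directory
        }
        // Copy the local file into HDFS; it is split into blocks and replicated
        // across DataNodes according to dfs.blocksize and dfs.replication.
        fs.copyFromLocalFile(localFile, hdfsDir);
        fs.close();
    }
}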

Data Processing Framework (MapReduce & YARN):


• MapReduce: This is the original programming model for processing data in Hadoop. It involves two main functions:
  o Map: Takes input data as key-value pairs and processes each pair to generate intermediate key-value pairs. This stage runs in parallel across the data blocks.
  o Reduce: Takes the intermediate key-value pairs (grouped by key) from the map stage and aggregates or combines the values to produce the final output. This stage also runs in parallel.
• Yet Another Resource Negotiator (YARN): Introduced in Hadoop 2.0, YARN is the resource management layer. It decouples resource management from the MapReduce programming model, allowing Hadoop to support other processing frameworks (like Spark and Tez).
  o ResourceManager: Manages the allocation of cluster resources (CPU, memory).
  o NodeManager: Manages the resources on individual nodes.
  o ApplicationMaster: Manages the lifecycle of each application (e.g., a MapReduce job).

Data Processing Workflow:

The typical data processing flow in Hadoop using MapReduce involves the following steps:

• Input Splitting: The input data is divided into smaller, logical chunks called splits. Each split is processed by a separate map task.
• Mapping: Each map task processes its assigned input split and generates intermediate key-value pairs based on the logic defined in the mapper function.
• Shuffling and Sorting: The intermediate key-value pairs from all the map tasks are shuffled and sorted based on the keys. This process groups all the values associated with the same key together.
• Reducing: Each reduce task processes the sorted intermediate key-value pairs for a specific set of keys. It applies the logic defined in the reducer function to aggregate, filter, or transform the data, producing the final output.
• Output: The output from the reduce tasks is written back to HDFS (see the sketch after this list).
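To make this workflow concrete, below is a minimal word-count sketch against the Hadoop MapReduce Java API (org.apache.hadoop.mapreduce). The class names, input path, and output path are illustrative placeholders rather than a prescribed solution.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map phase: emit (word, 1) for every word in the assigned input split.
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);   // intermediate <key, value> pair
            }
        }
    }

    // Reduce phase: sum the counts for each word after shuffle and sort.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);     // final output written to HDFS
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));    // HDFS input directory
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // HDFS output directory
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Submitting the packaged job (for example with hadoop jar, using placeholder input and output directories) lets the framework handle input splitting, the shuffle and sort, and writing the reducer output back to HDFS.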

3.2 MapReduce Working


• MapReduce programming helps to process massive amounts of data in parallel.
• The input data set is split into independent chunks. Map tasks process these independent chunks completely in parallel.
• The reduce task provides the reduced output by combining the output of the various mappers.

There are two daemons associated with MapReduce programming: JobTracker and TaskTracker.

JobTracker:
The JobTracker is the master daemon responsible for executing the MapReduce job. It provides connectivity between Hadoop and the application.

Whenever code is submitted to the cluster, the JobTracker creates the execution plan by deciding which task to assign to which node.

It also monitors all the running tasks. When a task fails, it automatically reschedules the task to a different node after a predefined number of retries.

There is one JobTracker process running per Hadoop cluster. The JobTracker process runs in its own Java Virtual Machine process.

Fig. Job Tracker and Task Tracker interaction

TaskTracker:
This daemon is responsible for executing the individual tasks that are assigned by the JobTracker.

The TaskTracker continuously sends heartbeat messages to the JobTracker. When the JobTracker fails to receive a heartbeat message from a TaskTracker, it assumes that the TaskTracker has failed and resubmits the task to another available node in the cluster.

MapReduce Framework

Phases:
• Map: Converts input into key-value pairs.
• Reduce: Combines the output of mappers and produces a reduced result set.

Daemons:
• JobTracker: Master, schedules tasks.
• TaskTracker: Slave, executes tasks.

MapReduce working:
MapReduce divides a data analysis task into two parts – Map and Reduce.
In the example given below, there are two mappers and one reducer.
Each mapper works on the partial data set that is stored on that node, and the reducer combines the output from the mappers to produce the reduced result set.
Steps:
1. First, the input dataset is split into multiple pieces of data.
2. Next, the framework creates a master and several slave processes and
executes the worker processes remotely.
3. Several map tasks work simultaneously and read pieces of data that
were assigned to each map task.
4. Map worker uses partitioner function to divide the data into
regions.
5. When the map slaves complete their work, the master instructs the
reduce slaves to begin their work.
6. When all the reduce slaves complete their work, the master
transfers the control to the user program.

Fig. MapReduce Programming Architecture


In MapReduce programming, jobs (applications) are split into a set of map tasks and reduce tasks.
The map task takes care of loading, parsing, transforming and filtering. The responsibility of the reduce task is grouping and aggregating the data produced by the map tasks to generate the final output.
Each map task is broken down into the following phases:
1. RecordReader 2. Mapper 3. Combiner 4. Partitioner
The output produced by the map task is known as intermediate <key, value> pairs. These intermediate <key, value> pairs are sent to the reducer.
The reduce tasks are broken down into the following phases:
1. Shuffle 2. Sort 3. Reducer 4. Output Format
Hadoop assigns map tasks to the DataNode where the actual data to be processed resides. This way, Hadoop ensures data locality. Data locality means that data is not moved over the network; only the computational code is moved to process the data, which saves network bandwidth.

Mapper Phases:
The Mapper maps the input <key, value> pairs into a set of intermediate <key, value> pairs.
Each map task is broken into the following phases:

1. RecordReader: Converts the byte-oriented view of the input into a record-oriented view and presents it to the Mapper task, supplying it with keys and values.
i) InputFormat: reads the given input file and splits it using the method getSplits().
ii) It then defines a RecordReader using createRecordReader(), which is responsible for generating <key, value> pairs.

2. Mapper: The map function works on the <key, value> pairs produced by the RecordReader and generates intermediate <key, value> pairs.
Methods:
- protected void cleanup(Context context): called once at the end of the task.
- protected void map(KEYIN key, VALUEIN value, Context context): called once for each key-value pair in the input split.
- void run(Context context): the user can override this method for complete control over the execution of the Mapper.
- protected void setup(Context context): called once at the beginning of the task to perform the activities required to initialize the map() method.
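The following is a small illustrative sketch (the class name, stop-word list, and filtering logic are assumptions, not taken from the original text) showing how setup(), map(), and cleanup() fit together in a custom Mapper:

import java.io.IOException;
import java.util.HashSet;
import java.util.Set;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Emits (word, 1) for every word that is not a stop word.
public class StopWordFilterMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

    private final static IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();
    private final Set<String> stopWords = new HashSet<>();

    @Override
    protected void setup(Context context) {
        // Called once per task, before any map() call; a real job might instead
        // load this list from the job configuration or a distributed cache file.
        stopWords.add("the");
        stopWords.add("a");
        stopWords.add("of");
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        StringTokenizer itr = new StringTokenizer(value.toString().toLowerCase());
        while (itr.hasMoreTokens()) {
            String token = itr.nextToken();
            if (!stopWords.contains(token)) {
                word.set(token);
                context.write(word, ONE);  // intermediate <key, value> pair
            }
        }
    }

    @Override
    protected void cleanup(Context context) {
        // Called once per task after the last map() call; nothing heavy to release here.
        stopWords.clear();
    }
}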

3. Combiner: It takes the intermediate <key, value> pairs provided by the mapper and applies a user-specified aggregate function to the output of a single mapper only. It is also known as a local reducer.
We can optionally specify a combiner using Job.setCombinerClass(ReducerClass) to perform local aggregation on the intermediate outputs.

Fig. MapReduce without Combiner class

Fig. MapReduce with Combiner class
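A hedged driver sketch showing where Job.setCombinerClass() fits; it reuses the TokenizerMapper and IntSumReducer classes assumed in the earlier word-count sketch, and the job name and paths are placeholders:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountWithCombiner {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count with combiner");
        job.setJarByClass(WordCountWithCombiner.class);
        job.setMapperClass(WordCount.TokenizerMapper.class);
        // Combiner: runs on each mapper's output before the shuffle,
        // reducing the volume of intermediate data sent over the network.
        job.setCombinerClass(WordCount.IntSumReducer.class);
        job.setReducerClass(WordCount.IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Reusing the reducer as the combiner is only valid because summing counts is both commutative and associative; a combiner must never change the final result.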

4. Partitioner: It takes the intermediate <key, value> pairs produced by the mapper and splits them into partitions using a user-defined condition.
The default behavior is to hash the key to determine the reducer. The user can control the partitioning by overriding the method:
int getPartition(KEY key, VALUE value, int numPartitions)
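A minimal sketch of a custom partitioner, assuming a job configured with two reduce tasks; the class name and the a-m / n-z routing rule are illustrative assumptions:

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Routes intermediate <key, value> pairs to reducers by the first letter of the key.
public class AlphabetPartitioner extends Partitioner<Text, IntWritable> {

    @Override
    public int getPartition(Text key, IntWritable value, int numPartitions) {
        String k = key.toString();
        if (k.isEmpty() || numPartitions < 2) {
            return 0;                       // fall back to a single partition
        }
        char first = Character.toLowerCase(k.charAt(0));
        return (first >= 'a' && first <= 'm') ? 0 : 1;
    }
}

It would be registered in the driver with job.setPartitionerClass(AlphabetPartitioner.class) and job.setNumReduceTasks(2).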
Reducer Phases:
1. Shuffle & Sort:
• Downloads the grouped key-value pairs onto the local machine where the Reducer is running.
• The individual <key, value> pairs are sorted by key into a larger data list.
• The data list groups the equivalent keys together so that their values can be iterated easily in the Reducer task.
2. Reducer:
• The Reducer takes the grouped key-value data as input and runs a Reducer function on each group.
• Here, the data can be aggregated, filtered, and combined in a number of ways, and it can require a wide range of processing.
• Once the execution is over, it gives zero or more key-value pairs to the final step.
Methods:
- protected void cleanup(Context context): called once at the end of the task.
- protected void reduce(KEYIN key, Iterable<VALUEIN> values, Context context): called once for each key, with all of the values grouped for that key.
- void run(Context context): the user can override this method for complete control over the execution of the Reducer.
- protected void setup(Context context): called once at the beginning of the task to perform the activities required to initialize the reduce() method.
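As a hedged illustration of these methods (the class name and the "wordcount.min.count" configuration key are assumptions), here is a reducer that reads a threshold in setup() and emits only the keys whose summed count reaches it:

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Sums the counts for each key and emits only keys whose total meets a threshold.
public class ThresholdSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

    private int minCount;
    private final IntWritable result = new IntWritable();

    @Override
    protected void setup(Context context) {
        // Called once before any reduce() call; reads a job-level setting.
        minCount = context.getConfiguration().getInt("wordcount.min.count", 1);
    }

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable val : values) {
            sum += val.get();
        }
        if (sum >= minCount) {
            result.set(sum);
            context.write(key, result);  // zero or more output pairs per key
        }
    }
}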

3. Output Format:
• In the output phase, we have an output formatter that translates the final key-value pairs from the Reducer function and writes them onto a file using a record writer.

What is NoSQL?

NoSQL databases (AKA "not only SQL") store data differently than relational tables.
NoSQL databases come in a variety of types based on their data model. The main
types are document, key-value, wide-column, and graph. They provide flexible
schemas and scale easily with large amounts of big data and high user loads.


What is a NoSQL database?


When people use the term “NoSQL database,” they typically use it to refer to any
non-relational database. Some say the term “NoSQL” stands for “non-SQL” while
others say it stands for “not only SQL.” Either way, most agree that NoSQL databases
store data in a more natural and flexible way. NoSQL, as opposed to SQL, is a
database management approach, whereas SQL is just a query language, similar to the
query languages of NoSQL databases.

Types of databases — NoSQL

Over time, four major types of NoSQL databases have emerged: document databases,
key-value databases, wide-column stores, and graph databases. Nowadays, multi-
model databases are also becoming quite popular.

Document-oriented databases

A document-oriented database stores data in documents similar to JSON (JavaScript Object Notation) objects. Each document contains pairs of fields and values. The values can typically be a variety of types, including strings, numbers, booleans, arrays, or even other objects. A document database offers a flexible data model, well suited for semi-structured and unstructured data sets. Document databases also support nested structures, making it easy to represent complex relationships or hierarchical data.

Examples of document databases are MongoDB and Couchbase. A typical document will look like the following:

Code Snippet

"_id": "12345",
"name": "foo bar",
"email": "foo@bar.com",

"address": {

"street": "123 foo street",


"city": "some city",

"state": "some state",


"zip": "123456"

},
"hobbies": ["music", "guitar", "reading"]

Key-value databases

A key-value store is a simpler type of database where each item contains keys and
values. Each key is unique and associated with a single value. They are used for
caching and session management and provide high performance in reads and writes
because they tend to store things in memory. Examples are Amazon DynamoDB and
Redis. A simple view of data stored in a key-value database is given below:

Code Snippet

Key: user:12345
Value: {"name": "foo bar", "email": "foo@bar.com", "designation": "software
developer"}

Wide-column stores

Wide-column stores store data in tables, rows, and dynamic columns. The data is
stored in tables. However, unlike traditional SQL databases, wide-column stores are
flexible, where different rows can have different sets of columns. These databases can
employ column compression techniques to reduce the storage space and enhance
performance. The wide rows and columns enable efficient retrieval of sparse and wide
data. Some examples of wide-column stores are Apache Cassandra and HBase. A typical example of how data is stored in a wide-column store is as follows:

name      | id    | email       | dob        | city
Foo bar   | 12345 | foo@bar.com |            | Some city
Carn Yale | 34521 | bar@foo.com | 12-05-1972 |

Graph databases

A graph database stores data in the form of nodes and edges. Nodes typically store
information about people, places, and things (like nouns), while edges store
information about the relationships between the nodes. They work well for highly
connected data, where the relationships or patterns may not be very obvious initially.
Examples of graph databases are Neo4j and Amazon Neptune. MongoDB also provides graph traversal capabilities using the $graphLookup stage of the aggregation pipeline. Below is an example of how data can be stored:
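A simple, purely hypothetical illustration of nodes and edges in a graph store (the entities and relationship names are made up):

Code Snippet

Nodes:
  (user:12345)  {"name": "foo bar"}
  (user:67890)  {"name": "carn yale"}
  (group:42)    {"name": "guitar club"}

Edges:
  (user:12345) -[:FRIENDS_WITH]-> (user:67890)
  (user:12345) -[:MEMBER_OF]->    (group:42)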
NoSQL databases offer several advantages, primarily due to their flexibility,
scalability, and ability to handle large, diverse datasets. They excel at storing and
retrieving data with minimal schema requirements, making them suitable for various
data structures and formats. Their scalability and distributed nature allow them to
handle massive workloads and changing data volumes.
Key advantages of NoSQL databases:
• Flexibility: NoSQL databases can store various data types, including structured, semi-structured, and unstructured data, without rigid schema requirements. This makes them ideal for applications with evolving data models.
• Scalability: NoSQL databases are designed for horizontal scaling, meaning you can easily add more servers to handle increasing data volume and traffic without downtime. This is particularly beneficial for cloud environments.
• High performance: NoSQL databases are often optimized for specific data models and access patterns, leading to faster query execution and lower latency.
• Ease of use: Many NoSQL databases offer user-friendly APIs and simplified data access, making them easier to develop with compared to traditional SQL databases.
• Big data handling: NoSQL databases are well suited for managing large datasets and providing fast analytics.
• Cost-effective: NoSQL databases can be more cost-effective than relational databases, especially for cloud deployments.

The five critical differences between SQL and NoSQL are:
• SQL databases are relational, and NoSQL databases are non-relational.
• SQL databases use structured query language (SQL) and have a predefined schema. NoSQL databases have dynamic schemas for unstructured data.
• SQL databases are vertically scalable, while NoSQL databases are horizontally scalable.
• SQL databases are table-based, while NoSQL databases are document, key-value, graph, or wide-column stores.
• SQL databases are better for multi-row transactions, while NoSQL is better for unstructured data like documents or JSON.

NewSQL (pronounced "new ess-cue-ell" or "new sequel") is a relational database management system (RDBMS) that aims to provide NoSQL system scalability while also maintaining the consistency of a traditional database system.

NewSQL combines ACID (atomicity, consistency, isolation and durability) compliance with horizontal scaling for online transaction processing workloads.
Enterprise systems that handle data, such as financial and order processing systems,
are too big for a traditional relational database. At the same time, these enterprise
systems aren’t practical for NoSQL systems because they have transactional and
consistency requirements. NewSQL provides the scale and reliability without
requiring more infrastructure or development expenditures.
NewSQL uses SQL to ingest new information, execute transaction processing at a
large scale, and change the contents of the database. The main categories of NewSQL
include new architectures, transparent sharding middleware, SQL engines and
database as a service (DBaaS).

Key Features of NewSQL


NewSQL databases incorporate several key features that make them stand out in the
data management landscape, especially when compared to traditional relational
database management systems (RDBMS) and NoSQL databases. Here are the core
features that define NewSQL systems:

SQL Compatibility with Scalability and Performance

NewSQL systems are designed to retain the use of SQL, the structured query
language widely used in traditional databases, while overcoming the performance and
scalability limitations of conventional SQL databases. NewSQL systems introduce
distributed, scalable architectures, often referred to as a shared-nothing architecture,
which allows for horizontal scaling. This means that as data volumes grow, NewSQL
databases can efficiently distribute workloads across multiple nodes without
sacrificing SQL functionality.

ACID Compliance

ACID properties—atomicity, consistency, isolation, and durability—are crucial for ensuring reliable transaction processing in databases. Unlike NoSQL databases, which
often relax ACID compliance in favor of scalability and flexibility, NewSQL systems
maintain full ACID guarantees, even in distributed environments. This allows
NewSQL systems to offer the consistency and transactional integrity of traditional
SQL databases while meeting modern scalability demands.

Distributed, Shared-Nothing Architecture

NewSQL databases typically adopt a distributed system architecture to improve scalability and availability. The shared-nothing model ensures that there is no single
point of contention between nodes, making these databases ideal for handling big
data workloads and maintaining high availability. By partitioning data across many
servers, NewSQL systems provide fault tolerance and the ability to scale out
seamlessly as the dataset grows.

High Availability and Fault Tolerance

Many NewSQL databases are designed with high availability and fault tolerance in
mind. They use mechanisms such as replication and automatic failover to ensure
that data is accessible at all times, even in the event of hardware or network failures.
Systems like Google Spanner exemplify this by providing geographically distributed
databases that ensure data is always available, even across data centers.
Improvements over Traditional Relational Databases

NewSQL addresses the primary performance and scalability challenges of traditional relational databases. While relational databases offer strong consistency, they often
struggle with horizontal scalability, making them less suitable for modern, high-
velocity workloads. NewSQL resolves this by combining the transactional integrity of
relational databases with the ability to scale across distributed systems.

Comparison of Key Features

Feature                | Traditional SQL Database | NoSQL Database           | NewSQL Database
ACID Compliance        | Full                     | Rarely                   | Full
Horizontal Scalability | Limited                  | High                     | High
SQL Support            | Yes                      | No (in most cases)       | Yes
Consistency            | Strong                   | Eventual (in many cases) | Strong
Fault Tolerance        | Moderate (HA required)   | High                     | High
Comparison of NewSQL vs. SQL and NoSQL
When discussing NewSQL, it's important to understand its position between
traditional SQL databases and the newer NoSQL systems. Each of these database
types offers distinct advantages and disadvantages, depending on the workload and
business needs.

NewSQL vs. SQL

NewSQL retains many core features of traditional SQL databases, such as SQL
support and ACID compliance. However, NewSQL improves upon SQL databases
by providing:

• Scalability: Traditional SQL databases, while powerful, typically struggle to scale horizontally, which limits their ability to handle large volumes of data across distributed systems. NewSQL overcomes this with distributed architectures that allow horizontal scaling across nodes.

• Performance: NewSQL databases enhance performance, especially in high-traffic environments where online transaction processing (OLTP) is critical. Systems like Google Spanner demonstrate how NewSQL can combine the reliability of traditional SQL with modern performance needs.

• Fault tolerance: NewSQL is built with high availability in mind. Through mechanisms such as replication and automatic failover, it provides better data durability and fault tolerance than traditional SQL systems.

Table: Comparison of NewSQL, SQL, and NoSQL


Feature         | SQL Database                   | NoSQL Database               | NewSQL Database
ACID Compliance | Full                           | Limited (BASE model)         | Full
SQL Support     | Yes                            | No                           | Yes
Scalability     | Vertical                       | Horizontal                   | Horizontal
Consistency     | Strong                         | Eventual                     | Strong
Data Structure  | Structured (schemas)           | Unstructured/Semi-structured | Structured
Use Case        | Transaction-heavy applications | Web-scale applications       | OLTP with scalability
Fault Tolerance | Moderate (with add-ons)        | High (built-in)              | High (built-in)

Feature               | SQL                                                                   | NoSQL                                                                                        | NewSQL
Relational Property   | Yes, it follows relational modelling to a large extent.              | No, it does not follow a relational model; it was designed to be entirely different from it. | Yes, since the relational model is equally essential for real-time analytics.
ACID                  | Yes, ACID properties are fundamental to their application.           | No, it rather provides for CAP support.                                                      | Yes, ACID properties are taken care of.
SQL Support           | Support for SQL.                                                      | No support for old SQL.                                                                      | Yes, proper support and even enhanced functionalities for old SQL.
OLTP                  | It supports such databases, but it is not the best suited.            | Inefficient for OLTP databases.                                                              | Fully supports OLTP databases and is highly efficient.
Scaling               | Vertical scaling                                                      | Horizontal scaling                                                                           | Vertical + Horizontal scaling
Query Handling        | Can handle simple queries with ease and fails when they get complex. | Better than SQL for processing complex queries.                                              | Highly efficient in processing complex queries as well as smaller queries.
Distributed Databases | No                                                                    | Yes                                                                                          | Yes
