Bda TT

Big Data refers to large and complex data sets that cannot be managed by traditional databases, characterized by the '5 Vs': Volume, Variety, Velocity, Veracity, and Value. It encompasses structured, semi-structured, and unstructured data, requiring advanced technologies such as Hadoop for processing and analysis. Hadoop, an open-source framework, enables distributed data storage and processing, offering greater scalability and fault tolerance than traditional systems.

What is Big Data?

Big Data is the term for collections of data sets so large and complex that they are difficult to store, manage, process, and analyze using on-hand database system tools or traditional data processing applications.

Characteristics of Big Data


Volume – The most prominent feature of Big Data is its size. Volume refers to the huge
amount of data generated and stored, often measured in petabytes or exabytes. Such
large-scale data cannot be processed using normal computers and requires advanced
technologies. For example, platforms like Instagram or Twitter generate massive volumes of
data daily through posts, comments, likes, and media sharing. These enormous datasets hold
great potential for analysis and pattern discovery.
Variety – Big Data comes in different formats and structures. It may be structured (tables,
databases), semi-structured (JSON, XML), or unstructured (videos, images, social media
posts). Sources such as Facebook, Twitter, Pinterest, Google Ads, and CRM systems
contribute to this variety, which makes processing and analysis more complex but also more
insightful.
Velocity – Velocity refers to the speed at which data is generated and needs to be processed.
In today’s world, data flows continuously in real-time from social media, sensors, financial
transactions, and other sources. Systems must be able to handle this fast-paced data to ensure
timely analysis and decision-making.
Value – Beyond just collecting and storing data, it is important to extract meaningful value
from it. Not all data is useful; therefore, identifying, processing, and analyzing valuable data
is crucial to gain insights, improve decision-making, and support innovation.
Veracity – Veracity refers to the accuracy, quality, and trustworthiness of data. Since Big
Data often comes from multiple sources, it may include inconsistencies, duplicates, or
incomplete information. Ensuring clean, reliable, and consistent data is essential for effective
analysis and decision-making.
Normally, transactional or operational data in ERP applications is stored in megabytes,
customer-related data in CRM systems is in gigabytes, web-related data is in terabytes, and
real-time data such as clickstreams, sensor feeds, or mobile app data reaches petabytes. A
massive volume of information is also generated through social media, where the velocity of
uploads is extremely high, and the variety of formats includes pictures, videos, and
unstructured text. These aspects highlight the three core “Vs” of Big Data—volume, velocity,
and variety—which add complexity to analysis and data management. The concept of
Volume in Big Data can be better understood by breaking it into three dimensions: length,
breadth, and depth. Length refers to the time dimension, meaning how long data needs to
be collected—some data may be gathered over minutes (real-time sensor data) while others
may span years (customer history). Breadth relates to discovery and integration, i.e., the
number and variety of sources from which data is collected (databases, sensors, social media,
mobile apps) and how this diverse information is combined into a unified system. Depth
signifies the level of analysis and insights, or how deeply the data is explored to extract
patterns, relationships, and meaningful conclusions. In short, volume is not only about the
size of data but also about the time span of collection, the diversity of sources, and the extent
of analysis applied.

Since data from multiple sources is often random and unstructured, its growing volume
outpaces traditional storage, slowing down data warehousing. Controlling the flow is also
difficult, as the veracity (accuracy and genuineness) of data must be ensured before use.
Finally, velocity also matters for time-limited processes, as streaming data from the real
world into organizations enables maximizing the value of information. Thus, Big Data is
defined by five key components: Volume, Velocity, Variety, Veracity, and Value.

Types of Big Data


Traditionally, in databases, datasets are organized as rows in tables representing transactions,
known as operational data, which are structured with fixed data types such as numbers,
strings, or characters. These transactions and their histories are typically processed through
ETL (Extract, Transform, Load) into data warehouses, where data mining algorithms uncover
hidden patterns for business predictions. However, with mergers and acquisitions, companies
faced challenges migrating huge volumes of transactional data with different structures,
leading to the development of open standards like XML (semi-structured data using tags) and
RDF (graph data) by IBM and W3C. In addition, streaming data emerged, which consists of
continuously generated, ordered sequences such as phone conversations, ATM transactions,
network traffic, searches, and sensor readings, scanned only once due to limited resources.
Beyond structured and semi-structured data, businesses also began relying on unstructured
information such as emails, audio, video, logs, blogs, forums, social media, clickstreams, and
mobile app data for decision-making. Since these arrive at high speed, storing them is often
impractical, so real-time processing and technologies like columnar databases are used.
Hence, modern data can be broadly classified as structured, semi-structured, unstructured,
real-time, event-based, or complex.

1. Structured Data
●​ Definition:​
Highly organized data that conforms to a fixed schema, making it easy to store,
process, and analyze.
●​ Examples:​
Relational databases, spreadsheets, financial transaction records, customer profiles,
and sensor data.
●​ Characteristics:​
Data is typically stored in tables with rows and columns, and traditional analytics
tools can easily read it.
2. Unstructured Data
●​ Definition:​
Data that does not have a predefined format or organization, making it challenging to
process with traditional tools.
●​ Examples:​
Text messages, social media posts, videos, audio files, images, and natural language
documents.
●​ Characteristics:​
Constitutes the largest portion of data and requires advanced tools and techniques for
analysis, often generated by humans or machines like satellite imagery.
3. Semi-Structured Data
●​ Definition: Data that is not stored in a rigid relational database format but contains
some organizational properties, such as tags or keywords, that make it easier to
manage and process than unstructured data.
●​ Examples: XML documents, JSON files, emails, and web pages with tags.
●​ Characteristics: Sits between structured and unstructured data, allowing for some
flexibility and adaptability in its structure.

Traditional vs. Big Data Business Approach


Case Study of Big Data Solutions
Data is as valuable as money because it helps organizations understand customer needs,
innovate, and stay competitive. Big Data Analytics allows companies to analyze huge
datasets quickly for better decision-making. For example, insurance companies can detect
fraud, manufacturers can spot supply chain problems early, service firms like hotels and
retailers can improve customer loyalty, public services can optimize delivery, and smart cities
can use data from sensors, crime reports, energy, and finance to improve citizens’ lives.
Case Study 1 (Hadoop): Hadoop helps solve the problem of processing very large data by
splitting it into smaller parts and running them in parallel on multiple machines. For example,
customer feedback emails can be analyzed using a word count program to quickly find how
many times products were returned or refunds requested, giving insights into supplier
performance.
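
The word-count logic in Case Study 1 can be sketched in plain Python to show what the map and reduce phases compute; the sample feedback lines, the word of interest ("refund"), and the function names below are illustrative assumptions, not Hadoop's actual Java API.

from collections import defaultdict

def map_phase(document):
    # Map: emit (word, 1) for every word in one feedback email.
    for word in document.lower().split():
        yield (word.strip(".,!?"), 1)

def reduce_phase(word, counts):
    # Reduce: sum all the counts collected for a single word.
    return (word, sum(counts))

# Toy corpus standing in for customer feedback emails (assumed data).
emails = [
    "Product damaged, please refund my order",
    "Requesting refund, item returned yesterday",
]

# Shuffle step: group intermediate (word, 1) pairs by key, as Hadoop would.
grouped = defaultdict(list)
for email in emails:
    for word, one in map_phase(email):
        grouped[word].append(one)

results = dict(reduce_phase(w, c) for w, c in grouped.items())
print(results["refund"])   # -> 2 occurrences of "refund" in this toy data

In a real cluster the same map and reduce functions would run as parallel tasks on many machines, each working on its own split of the mailbox.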
Case Study 2 (Clickstream Analytics): Websites generate clickstream data when visitors
interact with them (pages viewed, time spent, clicks, exit pages). Analyzing this helps
companies understand customer behavior, improve website design, predict customer loyalty,
and create better marketing strategies. Since billions of clicks are generated, Hadoop and
parallel processing make the analysis much faster compared to traditional databases. This
allows organizations to respond quickly, personalize customer experiences, and make smarter
business decisions.
Concept of Hadoop
What is Hadoop?
●​ Hadoop is an open-source software framework designed to store and process huge
amounts of data in a distributed and parallel way across clusters of commodity
(low-cost) computers.
● Instead of moving large amounts of data across networks, Hadoop moves the
computation to where the data is stored, making it much faster and more efficient.
●​ It follows a master-slave distributed architecture where the workload is divided into
smaller parts and executed simultaneously on different machines.
Advantages of Hadoop over Traditional RDBMS
1.​ Scalability – Can easily add more nodes to scale up with little admin work.
2.​ Flexibility – Can store unstructured data (text, images, videos) without predefined
schemas.
3.​ Unlimited Storage – No restrictions on how much or how long data can be stored.
4.​ No Preprocessing Needed – Raw data can be stored directly, unlike in data
warehouses.
5.​ Fault Tolerance – Multiple copies of data are stored; if one node fails, others take over
automatically.
Hadoop Goals
1.​ Scalable – Works from a single server to thousands of servers.
2.​ Fault-Tolerant – Automatically handles hardware failures.
3.​ Economical – Uses commodity hardware instead of expensive high-end systems.
4.​ Resilient – Fault tolerance built into the software at the application layer.
Hadoop Assumptions
1.​ Hardware failures are expected in large clusters.
2.​ Batch processing is preferred (focus on high throughput rather than low latency).
3.​ Designed for very large datasets (GBs to TBs).
4.​ Portability across platforms is important.
5.​ Should support high aggregate data bandwidth across hundreds of nodes.
6.​ Must handle tens of millions of files in one instance.
7.​ Follows a write-once, read-many access model.

Performance Terms
● Latency = time taken to produce a single result (low latency is not a priority in Hadoop).
● Throughput = number of results/actions completed per unit time (high throughput is
Hadoop's main goal).
● Hadoop achieves high throughput at the cost of higher latency by using batch processing.
Hadoop Core Components
HDFS - Hadoop Distributed File System

●​ HDFS (Hadoop Distributed File System) is the storage layer of Hadoop, modeled
after Google File System (GFS).
●​ Designed for commodity hardware, high throughput, scalability, and availability.
●​ Optimized for large files (gigabytes and above).
●​ Uses replication for fault tolerance; data blocks are automatically replicated across
multiple nodes. (HDFS ensures fault tolerance by storing multiple copies (default 3)
of each data block on different nodes, so if a node fails, the data remains available and
missing copies are automatically recreated.)
●​ Works in a write-once, read-many pattern (originally immutable files, later supports
appends)
Key Characteristics
1.​ Scalable Storage – Can handle clusters with thousands of nodes.
2. Replication – Default block size = 64 MB; default replication factor = 3 (see the worked example after this list).
3.​ Streaming Data Access – Supports write-once, read-many access patterns.
4.​ File Appends – Recent HDFS versions allow appending to existing files.
5.​ Minimal Node Communication – Nodes communicate only as needed.
6.​ Sits on Top of Local File System – e.g., ext3, ext4.
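
As a quick worked example of the defaults in item 2 above (64 MB blocks, replication factor 3), the arithmetic below is illustrative only and assumes a 1 GB file:

# Back-of-the-envelope HDFS storage arithmetic (all values assumed).
file_size_mb  = 1024     # a 1 GB file
block_size_mb = 64       # default block size in these notes
replication   = 3        # default replication factor

blocks   = -(-file_size_mb // block_size_mb)   # ceiling division -> 16 blocks
replicas = blocks * replication                # 48 block replicas across the cluster
raw_mb   = file_size_mb * replication          # ~3072 MB of raw disk consumed

print(blocks, replicas, raw_mb)                # 16 48 3072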
HDFS Components
1. NameNode (Master)
●​ Maintains and Manages DataNodes
● Stores all the file system metadata, i.e., information about data blocks such as the
location of stored blocks, file sizes, permissions, etc.
●​ The NameNode manages metadata operations such as opening or closing files, but it
does not handle the actual data; instead, it directs clients to the DataNodes where the
file blocks are stored, so the data flows directly between the client and the DataNodes
while the NameNode acts like a traffic controller.
●​ The NameNode coordinates read and write operations by managing metadata, but the
actual data is transferred directly between the client and the DataNodes, ensuring
efficient data flow.
●​ Heartbeat messages: These are simple signals that tell the NameNode the DataNode is
active and functioning. If heartbeats stop, the NameNode assumes the node has failed.
●​ Block reports: These provide detailed information about all the data blocks stored on
that DataNode, including block IDs, status, and locations.
●​ Manages block replication: The NameNode keeps track of how many copies of each
data block exist and on which DataNodes. If a block is lost due to node failure, the
NameNode automatically creates new copies on healthy nodes to maintain the desired
replication level.
●​ Handles authorization and authentication: It ensures that only authorized users or
applications can access or modify files.
●​ Creates checkpoints: The NameNode periodically saves the current state of the file
system metadata, so in case of failure, it can recover quickly without losing important
information.
2. DataNode (Slave)
●​ Slave daemons (Stores actual data blocks on local disks)
●​ Handles read/write requests from clients.
●​ Communicate with other Data Nodes to replicate its data blocks for redundancy
●​ Sends periodic heartbeats and block reports to the NameNode.
●​ Maintains block integrity across multiple volumes.
3. Secondary NameNode
●​ Performs periodic checkpoints of the filesystem metadata.
●​ Helps restart NameNode in case of failure.
● Takes snapshots of HDFS metadata from the NameNode at regular intervals.
HDFS Advantages
●​ Reliable and fault-tolerant due to replication.
●​ High throughput for large file processing.
●​ Can scale to thousands of nodes economically using commodity hardware.
●​ Supports distributed computation close to data, reducing network bottlenecks.

MapReduce
The Shared Nothing Principle means that each node in the system is self-sufficient and
operates independently, without relying on the resources of other nodes. MapReduce works in
a batch-processing manner, which makes it unsuitable for real-time streaming data. Its main
strength lies in the ability to parallelize tasks across large volumes of raw data. It is especially
effective for problems that can be broken down into many smaller, independent
sub-problems, where the results from each sub-problem can later be aggregated to produce
the final solution to the larger problem.
The MapReduce algorithm enables parallel processing and consists of two sequential phases:
map and reduce. In the map phase, the input is represented as a set of key–value pairs, and a
user-defined function is applied to each pair to produce a set of intermediate key–value pairs.
In the reduce phase, these intermediate pairs are grouped according to their keys, and the
values associated with each key are combined using a user-defined reduce function to
produce the final result. In some cases, depending on the nature of the operation, the reduce
phase may not be required, and the output from the map phase alone is sufficient.
In MapReduce, the processing tasks are managed by two main components at the cluster
level: the JobTracker and the TaskTrackers. The JobTracker acts as the master and is
responsible for scheduling jobs and managing computational resources across the cluster,
which is why it runs on only one node. Each MapReduce job is divided into multiple tasks,
and these tasks are assigned to different TaskTrackers based on the location of the data stored
on their respective nodes. TaskTrackers, which run on every slave node in the cluster, are
responsible for executing the assigned tasks. The JobTracker continuously monitors the
TaskTrackers to ensure they are completing their tasks successfully.
In MapReduce, computation is moved closer to the data instead of transferring large volumes
of data to the processing engine. This approach is important because moving massive datasets
across the network is costly and significantly reduces network performance. If the data were
always sent to a single processing unit, it would take much longer to complete the task and
create a bottleneck in the system. Additionally, relying heavily on one central unit would risk
overloading the master node, which could lead to failures. By processing data locally on the
nodes where it is stored, MapReduce ensures faster execution, reduces network congestion,
and improves overall system reliability.
MapReduce also has certain limitations and cannot be applied to every type of problem. It is
not suitable for computations that rely on previously computed values, since each task in
MapReduce works independently without awareness of past results. Similarly, algorithms that
depend on a shared global state cannot be efficiently implemented, as the system follows the
shared-nothing principle and each node operates in isolation. In addition, problems that
require handling massive data correlations or complex interrelations across the entire dataset
are difficult to process with MapReduce, because the model is designed for independent
parallel tasks rather than highly interconnected operations.

YARN
Yet Another Resource Negotiator (YARN) was introduced to solve the problems of the
original MapReduce 1.0 architecture, particularly the limitations of the JobTracker service. In
large Hadoop clusters, which can consist of tens of thousands of nodes, MapReduce 1.0
suffered from issues such as poor scalability, high memory usage, synchronization
challenges, and a single point of failure (SPOF) at the JobTracker. To overcome these
problems, YARN became a core component of Apache Hadoop.
YARN separates the two main responsibilities of the JobTracker—resource management and
job scheduling/monitoring—into different services. The Resource Manager (RM) acts as a
global manager that allocates resources among all applications in the cluster, while each
application gets its own ApplicationMaster (AM) to handle job-specific scheduling and
execution. This design removes the burden from a single node and distributes responsibilities
across the cluster, improving scalability and reliability.
The RM includes a scheduler that allocates resources based on constraints like queue
capacities and user limits, as well as the specific needs of each application. The
NodeManager, running on each node, is responsible for launching application containers,
monitoring resource usage (CPU, memory, disk, and network), and reporting back to the RM.
Each AM runs inside a container and negotiates resources with the RM while coordinating
with NodeManagers to execute and monitor its tasks.

In Hadoop YARN, a container is the basic unit of resource allocation where tasks (like Map
or Reduce tasks) actually run. It is a collection of resources (CPU, memory, disk, network)
assigned by the ResourceManager (RM) to an application. The NodeManager launches and
manages these containers on individual worker nodes. Think of a container as a lightweight,
isolated environment that holds the resources required for a single task or application
component, ensuring fair sharing and efficient use of cluster resources.
●​ Resource Manager (RM) Scheduler – The RM has a scheduler that decides how to
share the cluster’s resources. It looks at rules like how much capacity each queue has,
how many resources each user is allowed to use, and what each application needs,
before giving out resources.
●​ Scheduling based on requirements – The scheduler doesn’t just randomly assign
resources. It checks what each application is asking for (like how much memory, how
many CPU cores) and then assigns resources accordingly.
●​ NodeManager – This runs on every machine in the cluster. Its job is to start and
manage “containers” (little isolated environments where tasks run). It also keeps an
eye on how much CPU, memory, disk, and network the containers are using, and
reports this back to the RM so the system knows what’s going on.
●​ ApplicationMaster (AM) – Each application gets its own AM, which also runs inside
a container. The AM talks to the RM’s scheduler to request the resources it needs,
keeps track of which containers are running for that application, and monitors their
progress until the job finishes.

Hadoop Ecosystem

HBase – HBase is an open-source, column-oriented NoSQL database built on top of Hadoop,
inspired by Google’s Bigtable. It is distributed, scalable, versioned, and non-relational,
making it ideal for handling very large datasets with random, real-time read and write access.
Unlike traditional relational databases, it stores data in a column-family format, which is
efficient for sparse data.
Hive – Hive is a data warehouse software built on Hadoop that allows users to read, write,
and manage large datasets stored in HDFS. It provides an SQL-like interface (HiveQL),
making it easy for people familiar with SQL to query and analyze big data without writing
complex MapReduce code. Hive is best suited for batch processing and structured data
analysis.
Sqoop – Sqoop (SQL-to-Hadoop) is a tool that enables efficient bulk data transfer between
Hadoop and structured databases such as MySQL, Oracle, or PostgreSQL. It simplifies the
process of importing data from relational databases into Hadoop for analysis, and exporting
processed Hadoop data back into relational databases.
Pig – Apache Pig provides a high-level scripting language (Pig Latin) to make working with
MapReduce easier. It abstracts the low-level programming complexity of Java MapReduce
and offers a SQL-like flow for transforming, analyzing, and processing large datasets. Pig is
useful for data pipelines and iterative processing.
ZooKeeper – ZooKeeper is a distributed, open-source coordination service that provides
synchronization, configuration management, and naming for distributed applications. It acts
like a central service that helps different nodes in a Hadoop cluster coordinate with each
other, ensuring fault tolerance and consistency.
Mahout – Apache Mahout is a library of scalable machine learning algorithms designed to
run in a distributed environment like Hadoop. It provides implementations of clustering,
classification, and recommendation algorithms, enabling big data analytics and predictive
modeling.
Flume – Apache Flume is a reliable, distributed tool for collecting, aggregating, and
transporting large volumes of streaming data such as log files, events, or social media feeds.
It moves data efficiently from multiple sources to a centralized data store like HDFS, making
it ideal for real-time data ingestion.
Oozie – Apache Oozie is a scheduling tool for Hadoop jobs. Instead of running each job
separately, Oozie allows you to combine multiple jobs (like MapReduce, Pig, Hive, or
Sqoop) into one workflow and execute them in sequence or based on conditions. It runs as a
Java web application and integrates directly with Hadoop’s YARN to manage workflows,
making it easier to automate complex data pipelines.
Ambari – Apache Ambari is an open-source platform that simplifies Hadoop cluster setup,
monitoring, and management through a web interface and APIs, reducing manual effort.
Kafka – Apache Kafka is a distributed messaging system for real-time data streams with high
throughput and low latency.
Storm – Apache Storm processes data streams in real time, meaning it handles data as it
arrives rather than in batches. This makes it ideal for applications that need instant insights,
like fraud detection, live monitoring, or alerting systems, because results are computed
immediately.
Spark – Apache Spark is a fast data processing engine that keeps data in memory to reduce
slow disk access. It can handle large-scale batch jobs, real-time streaming, machine learning,
and SQL queries within one framework, making it highly versatile and much faster than
traditional MapReduce.
A mnemonic to help remember the ecosystem:
“A Huge Hive (HBase + Hive) wanted to scoop up data (Sqoop) using a Pig script (Pig).
Then the ZooKeeper guided the Ambari team (ZooKeeper + Ambari) and Oozie scheduled
their work (Oozie). They sent messages through Kafka, processed them instantly with Storm,
and collected logs via Flume. Finally, the Spark engine ran fast analytics, while Mahout
predicted trends.”


Hadoop limitations
Security Concerns – Hadoop’s security features are disabled by default because enabling
them can be complex. This means that without proper configuration, sensitive data is at
significant risk. Additionally, Hadoop does not provide encryption for data stored on disks or
transmitted across the network, which makes it less appealing for government agencies or
organizations that require stringent data protection measures.
Vulnerable By Nature – Since Hadoop is almost entirely written in Java, it inherits the
security challenges associated with one of the most widely used programming languages.
Java’s popularity also makes it a frequent target for cyberattacks. Some experts have even
suggested considering alternative platforms that are inherently safer and more efficient than
Hadoop.
Not Fit for Small Data – Hadoop is optimized for large-scale datasets, and its design makes it
inefficient for handling many small files. The Hadoop Distributed File System (HDFS)
struggles with random reads on small files, which can slow performance. As a result,
organizations with small data volumes may find Hadoop overkill or unsuitable for their
needs.
Potential Stability Issues – Being an open-source project, Hadoop is developed and
maintained by a large community of contributors. While this brings continuous
improvements, it can also lead to occasional stability problems. Organizations are advised to
use the latest stable release or rely on third-party vendors who can manage and troubleshoot
potential issues effectively.
General Limitations – Hadoop is not necessarily the ultimate solution for every big data
problem. For instance, Google has developed Cloud Dataflow as an alternative. Relying
solely on Hadoop could mean missing out on other tools and technologies that may offer
better efficiency, flexibility, or additional features for big data processing.

Module 2:
Distributed File Systems:
In recent times, a new parallel-computing architecture called cluster computing has come into use. A
computing cluster is basically a group of ordinary, low-cost computers (commodity
hardware) connected together—usually with Ethernet cables or affordable network
switches—to act as a single powerful system. Instead of buying one supercomputer, you
combine many machines to share the workload. Frameworks like MapReduce are designed to
run on these clusters, breaking down large-scale data processing tasks into smaller chunks
that can be executed in parallel across multiple machines. This setup allows tasks to be done
efficiently and ensures the system is fault-tolerant—if one computer fails during computation,
the job continues on other machines without stopping.
Compute nodes, typically 8–64 per rack, are mounted in racks and connected to each other
and to the network by Ethernet or a switch. Failure at the node level (disk failure) and at the
rack level (network failure) is taken care of by replicating data in secondary nodes. All the
tasks are completed independently and so if any task fails, it can be re-started without
affecting the other tasks.
In parallel computing, tasks are divided among multiple compute nodes (each node = one
machine with CPU, memory, cache, and disk). Traditionally, large scientific applications
required special-purpose parallel supercomputers with many processors and expensive
hardware. But clusters of commodity hardware (ordinary, cheaper machines) made parallel
processing much more affordable while still achieving high performance. To support this,
special distributed file systems were developed to handle huge amounts of data across many
nodes:

● Google File System (GFS)
● Hadoop Distributed File System (HDFS)
● CloudStore (KFS)

Physical Organization of Compute Nodes


In cluster computing, compute nodes (servers) are physically arranged in racks, with each
rack typically containing 8–64 nodes that are internally connected using gigabit Ethernet.
When there are multiple racks, they are linked together through an additional network layer
or switch, forming a larger cluster. Hadoop works most efficiently on Linux-based machines,
where a client machine is used to load data and MapReduce programs into the cluster and
then retrieve the results after execution. In small clusters with fewer than 40 nodes, the
JobTracker (which manages job execution) and the NameNode (which manages file system
metadata) can be hosted on the same physical server. For medium and large clusters, these
roles are separated onto different servers to balance the workload and improve performance.
Hadoop can also run in virtual machines such as VMware for small-scale or testing
environments (like a laptop), but in production, virtualization adds overhead that reduces
performance, so real physical machines are preferred.

A cluster is a collection of interconnected nodes (computers) linked through a network,
allowing them to communicate and work together as a single system. In a computer cluster,
these nodes collaborate to share processing power, storage, and tasks, making the cluster
more powerful and efficient than individual machines.
(In ppt)
Map Task: In a MapReduce job, the map task processes the input data. For example, if the
goal is to count word occurrences in documents, the map function reads each document,
breaks it into words, and produces key-value pairs of the form (word, 1). So, for words w1,
w2, …, wn in a document, the mapper outputs (w1,1), (w2,1), …, (wn,1). These outputs are
then grouped together for the next stage.
Combiners​
Combiners are optimization components in MapReduce that act as small local reducers on
each mapper’s output. Instead of sending all raw key-value pairs across the network, the
combiner performs partial aggregation right after the map phase. This works only when the
reduce function is associative and commutative (e.g., addition, counting), because the result
will be the same no matter the order or grouping. By reducing data size early, combiners
minimize network traffic and make the whole MapReduce job more efficient.
Example of Combiner
●​ Map Output (before combiner): Suppose a mapper processes a document containing
the words: cat cat dog. It emits → (cat,1), (cat,1), (dog,1).
●​ Combiner Execution (local aggregation): The combiner runs on this mapper’s output
and aggregates locally: (cat,2), (dog,1).
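
A minimal Python sketch of the example above, showing how the combiner's local aggregation shrinks what the mapper ships across the network (the function names are illustrative, not Hadoop's API):

from collections import Counter

def map_output(document):
    # Mapper output before the combiner: one (word, 1) pair per word.
    return [(word, 1) for word in document.split()]

def combine(pairs):
    # Combiner: partially aggregate the mapper's own output locally.
    counts = Counter()
    for word, one in pairs:
        counts[word] += one
    return list(counts.items())

raw = map_output("cat cat dog")
print(raw)            # [('cat', 1), ('cat', 1), ('dog', 1)] -> 3 pairs to send
print(combine(raw))   # [('cat', 2), ('dog', 1)]             -> only 2 pairs to send

Because addition is associative and commutative, the reducer produces the same final totals whether or not the combiner ran.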

Grouping by Key: After the map tasks finish, their outputs (key-value pairs) are sent to the
master controller, which organizes them for the reduce phase. Since there are usually multiple
reducers, the master must decide which reducer handles which keys. To do this, it applies a
hash function to each key, producing a bucket number between 0 and r–1 (where r is the
number of reducers). Each bucket corresponds to one reducer, so all key-value pairs with the
same key are sent to the same reducer. This ensures proper grouping by key and allows
reducers to process their assigned data in parallel.
Partitioner: The partitioner decides which reducer a (key, value) pair should go to, ensuring
that all values for the same key end up at the same reducer.
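
The hash-based assignment described in the two paragraphs above fits in a few lines; the keys and reducer count here are assumptions for illustration:

def partition(key, num_reducers):
    # Return the bucket (0 .. r-1), i.e. the reducer responsible for this key.
    return hash(key) % num_reducers

r = 4   # assumed number of reducers
for key in ["cat", "dog", "cat", "fish"]:
    print(key, "-> reducer", partition(key, r))
# Within one run, every occurrence of "cat" hashes to the same bucket,
# so all of its values meet at a single reducer.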
Reduce Tasks: A reduce task takes input as (key, [list of values]) and combines the values.
Continuing the word count example, the reducer simply sums up all the 1’s for a given word
and outputs (word, total count). This produces the final result.
Coping with Node Failures: MapReduce is designed to handle hardware failures. If the
master node fails, it’s the worst case, but often masters are backed up. If a node running map
or reduce tasks fails, the master detects it and reassigns those tasks to other healthy nodes.
This ensures fault tolerance and job completion.
Number of Mappers and Reducers: The number of mappers in Hadoop is decided by how the
input data is split into blocks—each block of the file creates one map task. For example, if a
file is 2 GB and the block size is 64 MB, then 2 GB ÷ 64 MB = 32 mappers. The number of
reducers, on the other hand, is not tied to input size but is set by the user or system
configuration.
More Insight on Mappers: In Hadoop, the number of map tasks depends on the number of
input files or splits, not on whether files are empty or not. So, if a directory has 4 files,
Hadoop still creates 4 map tasks, even if one file contains no data. If the directory instead has
4 subfolders and each subfolder (except one) contains 3 files, then there are 9 files total,
which means 9 map tasks are created.
Algorithms Using MapReduce: MapReduce is not suited for every problem, especially online
transactional systems like retail sales. However, it is ideal for analytical tasks over large
datasets, such as similarity detection, pattern mining, and statistical analysis.
Google’s Implementation of MapReduce: Google originally created MapReduce to handle
large-scale matrix-vector multiplications required for computing PageRank, which ranks
billions of web pages. MapReduce’s iterative processing made it well-suited for such
algorithms, and it was also extended to power recommender systems.
Applications of MapReduce: Typical applications include matrix-vector calculations and
database-style relational operations such as selection, projection, union, intersection,
difference, natural joins, grouping, and aggregation. These tasks scale efficiently on large
clusters using MapReduce.
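
As a hedged illustration of the matrix–vector use case (a toy sketch, not Google's implementation), the map step can emit (row, m_ij * v_j) pairs and the reduce step can sum them per row:

from collections import defaultdict

# Assumed sparse matrix M as (i, j, value) triples and a small vector v.
M = [(0, 0, 2.0), (0, 1, 1.0), (1, 1, 3.0)]
v = [4.0, 5.0]

def map_step(i, j, m_ij):
    # Emit (row index, partial product m_ij * v_j).
    return (i, m_ij * v[j])

def reduce_step(i, partials):
    # Sum the partial products for row i to obtain x_i.
    return (i, sum(partials))

grouped = defaultdict(list)
for i, j, m_ij in M:
    key, val = map_step(i, j, m_ij)
    grouped[key].append(val)

x = dict(reduce_step(i, parts) for i, parts in grouped.items())
print(x)   # {0: 13.0, 1: 15.0}, since x0 = 2*4 + 1*5 and x1 = 3*5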
Large-Scale File-System Organisation
Large-Scale File-System Organisation refers to how big data systems like Hadoop organize
and manage files across distributed clusters. Instead of storing data on a single machine, files
are split into fixed-size blocks (e.g., 64 MB or 128 MB) and distributed across many nodes
for parallelism and fault tolerance. Each block is replicated (usually 3 times) across different
nodes and racks to ensure reliability in case of failures. The NameNode manages metadata
(like filenames, block locations, and permissions), while DataNodes store the actual blocks.
This design supports scalability, high throughput, and fault tolerance, making it well-suited
for handling massive datasets in large-scale applications.
NoSQL:​

Introduction to NoSQL
NoSQL (Not Only SQL) is a modern database management system designed to handle
massive amounts of unstructured or semi-structured data across distributed environments.
Unlike traditional relational databases that rely on fixed schemas, tables, and complex JOIN
operations, NoSQL supports multiple data models such as key-value stores, document stores,
column-family databases, and graph databases. It is schema-free. It is designed to run in
parallel across many processors, so tasks are divided and executed simultaneously, speeding
up performance. The system uses a shared-nothing architecture, where each machine (node)
has its own RAM and disk instead of sharing with others. This avoids bottlenecks, ensures
fault tolerance, and allows the database to scale out easily and cheaply by adding more
low-cost commodity servers.
NoSQL databases are built for scalability, performance, and agility, offering linear scalability,
meaning performance grows consistently with the addition of more processors. They are
optimized for modern applications where data is vast, dynamic, and unpredictable, unlike
relational databases that work best with structured and predictable datasets. NoSQL trades the
strict ACID (Atomicity, Consistency, Isolation, Durability) guarantees of relational systems
for BASE (Basically Available, Soft state, Eventual consistency) principles, ensuring fault
tolerance, high availability, and distributed reliability.
The rise of big data, social media, and large-scale applications has fueled the adoption of
NoSQL, as traditional relational models cannot efficiently scale or adapt to frequent changes.
Companies like Netflix, LinkedIn, and Twitter rely on NoSQL to analyze user interactions
and social network data in real time. With over 150 implementations such as Google
BigTable, Apache Hadoop, MemcacheDB, and SimpleDB, NoSQL has become the backbone
of many data-intensive industries.
In short, NoSQL represents the next generation of databases that are non-relational,
schema-free, free of joins, horizontally scalable, and capable of managing huge volumes of
distributed data. They use simple APIs, allow easy replication, and exploit modern
commodity hardware, making them a perfect fit for today’s dynamic, large-scale applications.
NoSQL Business Drivers
Volume: Organizations shift from relying on faster single processors to using many
processors working together. Big data is too large to be handled by a single machine, so it is
split into smaller chunks (fragmented) and distributed across multiple inexpensive computers
(commodity machines) that work together as a cluster. Each machine processes its portion of
the data in parallel, which increases speed and efficiency. Technologies like Hadoop, HDFS,
and MapReduce make this possible.
Velocity: Real-time demands from social networks or e-commerce sites require databases to
handle high-frequency reads and writes. Traditional RDBMS often struggle under sudden
traffic bursts, especially when indexing many columns, causing slow responses during peak
activity.
Variety: Traditional databases face challenges when adding new columns or storing
uncommon attributes, often resulting in sparse matrices. Schema changes require downtime
and affect availability, making RDBMS less flexible for unpredictable data.
Agility: In RDBMS, handling complex queries, especially those involving nested or repeated
data structures, is cumbersome. To manage such data, an object-relational mapping (ORM)
layer is used, which translates between objects in the application and relational tables. This process is complex, requires
experienced developers familiar with frameworks like Java Hibernate, and can slow down
both implementation and testing.
Desirable NoSQL Features:
1.​ 24×7 Availability: NoSQL replicates both function and data across nodes to avoid
single points of failure, ensuring fault tolerance and continuous operation. In the
highly competitive world today, downtime is equated to real dollars lost and is deadly
to a company’s reputation. Hardware failures are bound to occur. Care has to be taken
that there is no single point of failure and system needs to show fault tolerance. For
this, both function and data are to be replicated so that if database servers or “nodes”
fail, the other nodes in the system are able to continue with operations without data
loss. NoSQL database environments are able to provide this facility. System updates
can be made dynamically without having to take the database offline.
2.​ Location Transparency: Users can read/write data without worrying about the
physical storage location, with updates propagated across distributed nodes. Location
Transparency means that in a distributed system or NoSQL database, users or
applications can access data without knowing which physical server or node stores it.
The system automatically handles the underlying distribution of data across multiple
nodes, so read and write requests are directed to the correct locations behind the
scenes.
3.​ Schema-less Data Model: Most of the business data is unstructured and unpredictable
which a RDBMS cannot cater to. NoSQL database system is a schema-free flexible
data model that can easily accept all types of structured, semi-structured and
unstructured data.
4.​ Modern Transaction Analysis: Traditional RDBMS enforces strict ACID properties,
which can limit performance for big data and real-time analytics. NoSQL, on the
other hand, follows the CAP theorem, prioritizing Availability and Partition Tolerance
while providing eventual consistency instead of immediate strict consistency. By
avoiding JOINs and foreign key constraints, NoSQL enables faster, more scalable
queries, making it ideal for applications like customer profiling, product reviews,
recommendation engines, and social media analytics where perfect ACID compliance
is not required.

5.​ Big Data Architecture: Big Data poses challenges in volume, speed, and diversity, and
NoSQL addresses these with a distributed architecture that supports horizontal
scaling—adding more servers or nodes to handle increased data and
workload—real-time streaming, and diverse storage models like key-value (Redis),
document (MongoDB), wide-column (Cassandra), and graph (Neo4j). It also
integrates with frameworks like Hadoop, Spark, and HDFS for large-scale processing,
enabling enterprises to efficiently manage current demands and future growth.

CAP theorem

Partition Tolerance (P) means that a distributed system can keep working correctly even if
some nodes cannot communicate with each other due to network failures or delays. In other
words, the system is resilient to “partitions”—situations where the network splits into
separate segments that can’t talk to each other.
CAP Trade-offs: The CAP theorem states that a distributed system can guarantee at most two
of the three properties—Consistency, Availability, and Partition Tolerance—at the same time.
Systems must trade off between consistency and availability when partitions occur.
Eventual Consistency: Databases like Amazon DynamoDB and Cassandra prioritize partition
tolerance and eventual consistency (availability), meaning updates are eventually propagated
to all nodes, but clients may temporarily see outdated or inconsistent data.
Strong Consistency: Databases like HBase prioritize consistency and partition tolerance,
ensuring that all updates are immediately visible to clients. In this case, the system may
become unavailable during network partitions to maintain consistency.
ACID vs. BASE Properties
NoSQL Data Architecture Patterns: Key-Value Stores, Graph Stores, Column-Family (Bigtable) Stores, Document Stores
Variations of NoSQL Architectural Patterns
Key Value Stores:
A key-value store is a type of NoSQL database that stores data as pairs: a key and a
corresponding value. The key acts as a unique identifier, while the value can be a list, string,
JSON, BLOB, or more complex data. This model is schema-less, meaning there’s no fixed
structure, and keys can be generated manually or automatically. Common examples include
Redis, Amazon DynamoDB, Azure Table Storage, Riak, and Memcache.

Data Access and API:​


Clients interact with key-value stores using simple operations:

●​ Get(key): returns the value for any given key, or it may return an error message if
there’s no key in the key-value store.
●​ Put(key, value): adds a new key-value pair to the table and will update a value if this
key is already present.
●​ Delete(key): removes a key and its value from the table, or it may return an error
message if there’s no key in the key-value store.​

Rules:

1.​ All keys must be unique.​

2.​ Values cannot be queried directly; only keys are indexed.

Key-value stores organize data using hash tables, where each key maps directly to its value,
enabling very fast lookups. To manage large datasets, keys are often grouped into logical
buckets; multiple buckets can contain the same key name, but the real key is determined by
combining the bucket and key, ensuring uniqueness. Additionally, caching mechanisms are
used to store frequently accessed key-value pairs in memory, which reduces disk access and
significantly improves read and write performance.
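
A minimal in-memory sketch of the Get/Put/Delete interface and the bucket-plus-key idea described above (the class and method names are hypothetical, not any particular product's API):

class KeyValueStore:
    # Toy key-value store: a hash table keyed by the (bucket, key) pair.

    def __init__(self):
        self._table = {}

    def put(self, key, value, bucket="default"):
        # Adds the pair, or overwrites the value if the key already exists.
        self._table[(bucket, key)] = value

    def get(self, key, bucket="default"):
        # Returns the value, or raises an error if the key is absent.
        if (bucket, key) not in self._table:
            raise KeyError("no such key: " + bucket + "/" + key)
        return self._table[(bucket, key)]

    def delete(self, key, bucket="default"):
        # Removes the key and its value if present.
        self._table.pop((bucket, key), None)

store = KeyValueStore()
store.put("amphora", {"definition": "a two-handled ancient jar"}, bucket="dictionary")
print(store.get("amphora", bucket="dictionary"))
store.delete("amphora", bucket="dictionary")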
Consider the analogy of looking up a word in a dictionary:

●​ The key is the word you're looking up, like "amphora". It's a unique identifier that
points to specific information. Just as you can quickly find a word in a dictionary, a
key-value store uses the key to quickly retrieve the associated data.
●​ The value is all the information associated with that word. For "amphora," the value
includes its definition, plural form, similar words, and etymology. In a database, this
value can be anything from a simple string to a complex object, but it is always
retrieved as a single unit when you use its key.

Key-value databases are suitable for applications for which the ability to store and retrieve
data in a fast and efficient manner is more important than imposing structure or constraints on
the data.
Use Cases:

Storing Web Pages in a Key-Value Store: Key-value stores are well-suited for storing web
pages and large digital assets due to their fast access and simple structure. Web search
engines, like Google, use web crawlers to automatically visit websites and store the content
of each page. The words on each page are indexed as keys, allowing quick keyword searches.

Cloud Storage (Amazon S3): Organizations with thousands or millions of digital assets, such
as images, videos, and audio files, use key-value stores for storage. Amazon S3 provides a
REST API to store and retrieve these assets efficiently, treating each asset as a value and
assigning a unique key for access.

Benefits:

Key-value stores provide fast, scalable, and reliable data storage. They are ideal for
applications where speed and efficiency matter more than enforcing a structured schema.
Additional features like metadata tags, access control, and precision monitoring enhance
usability for enterprise applications.
Weaknesses of Key-Value Stores:

●​ Key-value stores prioritize availability and partition tolerance, but they often sacrifice
consistency, meaning updates to data may not be immediately visible across all nodes.
Because of this, they cannot efficiently update parts of a value or perform complex
queries like traditional relational databases, limiting their functionality.
● Another drawback is the scalability of keys: as the volume of data grows, ensuring that all
keys remain unique becomes increasingly difficult, which can complicate data
management and integrity.

GRAPH DATABASE / GRAPH STORES:

Graph Databases: Graph databases store data as nodes (entities or objects), relationships
(connections between nodes), and properties (attributes of nodes or relationships).Graph
databases are built on graph theory, designed to manage data where elements (nodes) are
highly interconnected through relationships (edges). Unlike relational databases, they excel at
handling complex relationships with an undetermined number of connections. Graph
databases are ideal for applications with intricate object relationships, such as social
networks, video-sharing platforms, and rule-based engines. A graph store organizes data into
three components: nodes, relationships, and properties. Nodes represent entities (e.g.,
“Employee”), relationships describe how nodes are connected (e.g., “works_for”), and
properties are key–value pairs describing attributes (e.g., empno: 2300, empname: Tom)

Data fields

1. Nodes

2. Relationships

3. Properties
For example, in Neo4j or VertexDB, nodes represent nouns (like “Person” or “Product”),
relationships represent verbs (like “buys” or “knows”), and properties describe attributes (like
age, price, or rating). These databases are especially powerful for graph-like queries such as
finding the shortest path between nodes, measuring node similarity, checking neighbor
properties, or analyzing overall connectivity.
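
To make the node/relationship/property model concrete, here is a small sketch using the open-source networkx library as a stand-in for a real graph database such as Neo4j; the employee and company data are assumptions for illustration:

import networkx as nx

# Nodes are entities, edges are relationships, and both can carry properties.
g = nx.DiGraph()
g.add_node("Tom", empno=2300, kind="Employee")
g.add_node("Acme", kind="Company")
g.add_node("Anna", empno=2301, kind="Employee")

g.add_edge("Tom", "Acme", relationship="works_for")
g.add_edge("Anna", "Acme", relationship="works_for")
g.add_edge("Tom", "Anna", relationship="knows")

# Graph-style queries: neighbours, node properties, shortest path.
print(list(g.successors("Tom")))            # ['Acme', 'Anna']
print(g.nodes["Tom"]["empno"])              # 2300
print(nx.shortest_path(g, "Tom", "Acme"))   # ['Tom', 'Acme']

A dedicated graph database executes this kind of traversal natively and at much larger scale, which is what makes it suited to social-network style workloads.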
Column Family Stores/ Columnar Databases

Columnar databases are a hybrid between traditional RDBMS and key-value stores. Unlike
row-oriented storage, they store values in columns or groups of columns, which allows
efficient querying of specific attributes. These databases can scale to handle large volumes of
data and are widely used in OLAP (Online Analytical Processing) systems. Examples include
HBase, Cassandra, Hypertable, and Vertica.

Example Structure: Consider an example EmployeeIndia record:

EmployeeIndia: {
    address: {
        city: "Mumbai",
        pincode: 400058
    },
    projectDetails: {
        durationDays: 100,
        cost: 50000
    }
}

●​ The outermost key EmployeeIndia represents the row.


●​ Column families are address and projectDetails.
●​ The address column family contains columns like city and pincode, while
projectDetails contains durationDays and cost. Columns are accessed through their
column family. This structure allows flexible, fast access to only the required columns
without reading entire rows.

DOCUMENT STORES:

Key-value stores are the simplest form of NoSQL databases. Each key maps to a value,
which is often a Binary Large Object (BLOB) containing the actual data. These values lack a
formal structure, so the system cannot index or search within them. When a key is provided,
the system simply returns the associated value. This makes key-value stores very fast for
simple lookups but unsuitable for complex queries or searching inside the data.

Document Stores: Document stores extend the idea of key-value stores by making the value a
structured document, often in formats like JSON, BSON, XML, or PDF. Unlike key-value
stores, everything inside a document is automatically indexed, allowing searches not just by
key but also by content or nested fields. For example, in a document where the root is
Employee, a path like Employee[id='2300']/Address/street/BuildingName/text() can directly
access nested values. Embedded documents and arrays reduce the need for JOINs, improving
query performance and speed. MongoDB is a popular document store that uses BSON
(Binary JSON) format.
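
A hedged sketch of the same idea with MongoDB's Python driver (pymongo); the database name, collection, and document fields are assumptions for illustration, and a local mongod instance is assumed to be running:

from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")   # assumed local server
employees = client["company_db"]["employees"]       # database / collection

# The value is a structured document, including a nested Address sub-document.
employees.insert_one({
    "id": "2300",
    "name": "Tom",
    "Address": {"street": {"BuildingName": "Sunrise Towers"}, "city": "Mumbai"},
})

# Unlike a plain key-value store, nested fields can be queried directly;
# dot notation plays the role of the path expression shown above.
doc = employees.find_one(
    {"id": "2300"},
    {"Address.street.BuildingName": 1, "_id": 0},
)
print(doc)   # {'Address': {'street': {'BuildingName': 'Sunrise Towers'}}}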
NoSQL Case Studies

1) Amazon DynamoDB (Key–Value Store)

●​ Background: Amazon initially used RDBMS for shopping cart and checkout systems but faced issues
with scalability and 24×7 availability.​

●​ Solution: Introduced DynamoDB, a key–value NoSQL store to handle cart and session data; final
order is still stored in RDBMS.​

●​ Salient Features:​

1.​ Scalable – throughput can be changed dynamically using AWS APIs.​

2.​ Automated storage scaling – storage grows on demand.​

3.​ Shared-nothing architecture – distributes tables over hundreds of servers.​

4.​ Fault tolerance – data replicated across multiple availability zones.​

5.​ Schema-free – supports multiple data types (string, number, binary, sets).​

6.​ Efficient indexing – primary keys + secondary indexes.​

7.​ Strong consistency – ensures latest value reads; supports atomic counters.​

8.​ Security – integrates with AWS IAM for access control.​

9.​ Monitoring – integrates with CloudWatch for throughput/latency metrics.​

10.​ Integration – works with Amazon Redshift (data warehouse) and Elastic MapReduce
(Hadoop).​

●​ Use Case: Shopping cart, session management, large-scale e-commerce workloads.​
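
A hedged sketch of how the shopping-cart use case above might look through the AWS Python SDK (boto3); the table name, key schema, and region are assumptions, and valid AWS credentials plus an existing table are presumed:

import boto3

# Assumes a DynamoDB table "ShoppingCart" with partition key "session_id".
dynamodb = boto3.resource("dynamodb", region_name="us-east-1")
cart = dynamodb.Table("ShoppingCart")

# Put: store one user session's cart as a single schema-free item.
cart.put_item(Item={
    "session_id": "sess-42",
    "items": [{"sku": "BOOK-101", "qty": 2}, {"sku": "PEN-7", "qty": 1}],
})

# Get: retrieve the whole cart by its key.
response = cart.get_item(Key={"session_id": "sess-42"})
print(response.get("Item"))

# Delete: clear the cart once the final order has been written to the RDBMS.
cart.delete_item(Key={"session_id": "sess-42"})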

2) Google BigTable (Column-Family Store)

●​ Background: Designed to manage petabytes of data distributed over 100,000+ nodes. Built on
Google File System (GFS), MapReduce, Chubby lock service.​

● Architecture: Column-family model – data stored as a multi-dimensional sorted map indexed by
row key, column key, and timestamp.

●​ Strengths:​

○​ Highly scalable, distributed over commodity hardware.​


○​ Efficient for sparse, large datasets.​

○​ Optimized for analytical and OLAP systems.​

●​ Use Cases: Web indexing, Google Earth, Google Analytics, personalized search.​

3) MongoDB (Document Store)

●​ Background: Developed by Eliot Horowitz’s team (10gen) to overcome RDBMS limitations for high
availability and agility.​

●​ Architecture:​

○​ Stores data as JSON/BSON documents.​

○​ Schema-free, flexible, supports embedded documents and arrays.​

○​ Everything inside a document is indexed → fast queries without JOINs.​

●​ Strengths:​

○​ Supports dynamic schema, easy to evolve models.​

○​ Focused on speed, power, and flexibility.​

○​ Collections ≈ Tables, Documents ≈ Rows, Fields ≈ Columns (in RDBMS).​

●​ Use Cases: Ad serving (high-speed, real-time ad delivery), content management, sentiment analysis,
product catalogs, gaming user data.​

4) Neo4j (Graph Database)

●​ Background: Open-source, developed by Neo Technology (since 2003, released 2007). Implemented
in Java and Scala.​

● Architecture: Property Graph Model – consists of nodes (entities), edges/relationships, and
properties (attributes as key–value pairs).

●​ Strengths:​

○​ Handles highly connected data efficiently.​

○​ ACID-compliant, supports failover for production use.​


○​ Supports CQL (Cypher Query Language), indexing via Apache Lucene, and unique
constraints.​

○​ Native graph processing engine for fast traversals.​

●​ Use Cases: Social networks (Facebook, LinkedIn), recommendations, fraud detection, network
management, scientific research, organizational analytics.

NoSQL Solutions for Big Data

Big Data Challenges:​


Big Data refers to datasets so large or complex that a single processor cannot efficiently
manage them. While some problems can be solved on a single processor, Big Data problems
require distributed computing to handle their scale. Key considerations include whether the
analysis needs all the data or just a subset, and how quickly the data must be processed.
Main Use Cases:​
NoSQL solutions excel in handling large-scale data from diverse sources. Examples include
bulk image processing, such as NASA or medical imaging projects, and initiatives like the
New York Times converting 3.3 million historical newspaper scans into web formats using
Amazon EC2 and Hadoop. Other examples include public web page data (news, product
reviews, blog posts) and remote sensor data from vehicles, road infrastructure, or
environmental monitoring (e.g., garden moisture sensors).

NoSQL Solutions for Big Data:​


NoSQL databases are designed to efficiently handle high-volume input/output operations and
scale linearly with growing datasets. They are operationally efficient, reducing the need for
large teams to manage servers. Reports and analyses can often be performed by
non-programmers using simple tools. NoSQL also addresses distributed computing
challenges, including network latency and potential node failures. Furthermore, these systems
can support both large-scale batch processing and time-critical event processing, making
them versatile for a range of Big Data applications.

They solve big data problems by:

●​ Moving queries to data (instead of moving huge datasets across the network).
●​ Using hash rings for even load distribution.
●​ Employing replication to scale reads and ensure availability.
●​ Distributing queries evenly to DataNodes to avoid bottlenecks.

Moving queries to the data, not data to the queries: Instead of moving huge amounts of data
across the network to the query, NoSQL systems move the query logic closer to where the
data resides. This minimizes data transfer, reduces latency, and allows distributed nodes to
process only their portion of the dataset before sending results back.

Using hash rings is a technique in NoSQL systems to distribute data evenly and efficiently
across multiple servers in a cluster. The idea is based on consistent hashing, where each data
item is assigned a hash value (like a 40-character random key) and mapped to a position on a
circular structure called a hash ring. Each server (or node) in the cluster also has a position on
this ring, and any data item is stored on the first node it encounters when moving clockwise
around the ring.

When new nodes are added, they simply take over a portion of the hash space, meaning only
a small subset of data needs to be reassigned instead of redistributing everything. This makes
scaling smooth and efficient. The system also supports replication by storing each item on
multiple nodes—for example, one as the primary and others as secondary replicas. If a
primary node fails, the data can still be retrieved from a secondary node, ensuring fault
tolerance and high availability.
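
A minimal Python sketch of the hash-ring idea is given below. The node names, the sample key, and the use of SHA-1 are illustrative assumptions only; real systems such as Riak or Cassandra add refinements like virtual nodes, but the clockwise lookup and replica placement follow the same principle described above.

import hashlib
from bisect import bisect_right

def ring_position(value, ring_size=2**32):
    # Hash a string to a position on the circular ring.
    return int(hashlib.sha1(value.encode()).hexdigest(), 16) % ring_size

class HashRing:
    def __init__(self, nodes, replicas=2):
        self.replicas = replicas                              # secondary copies per item
        self.ring = sorted((ring_position(n), n) for n in nodes)
        self.positions = [p for p, _ in self.ring]

    def nodes_for(self, key):
        # Walk clockwise from the key's position: the first node found is the
        # primary, the following distinct nodes hold the replicas.
        idx = bisect_right(self.positions, ring_position(key))
        chosen = []
        for i in range(len(self.ring)):
            node = self.ring[(idx + i) % len(self.ring)][1]
            if node not in chosen:
                chosen.append(node)
            if len(chosen) == self.replicas + 1:              # primary + replicas
                break
        return chosen

ring = HashRing(["node-A", "node-B", "node-C", "node-D"])
print(ring.nodes_for("user:42"))   # e.g. ['node-C', 'node-A', 'node-B']

Adding a new node only shifts ownership of the keys that fall between it and its clockwise neighbour, which is why rebalancing touches a small fraction of the data.
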
Analyzing big data with a shared-nothing architecture,

The Shared-Nothing concept means that every node in the system has its own CPU, memory,
and storage, with no resources shared among nodes. This contrasts with other
resource-sharing architectures: in a Shared RAM system, multiple CPUs share the same
memory, which is useful for computation-heavy tasks like graph processing; in a Shared Disk
system, CPUs share access to storage through a Storage Area Network (SAN), often used for
traditional databases. In Shared-Nothing systems, commodity machines work independently
with no shared resources, making them ideal for NoSQL databases. The advantages of this
approach include high scalability and fault tolerance, since queries are distributed across
multiple nodes and there is no single point of failure. Additionally, key–value and document
stores are highly cache-friendly and perform efficiently in this setup. Popular NoSQL
systems such as BigTable, Cassandra, and HBase use the shared-nothing architecture to
handle massive data workloads effectively.

Understanding the types of big data problems,

Polyglot Persistence is when an application uses multiple database types together, each
chosen for what it does best—for example, a document store for user profiles, a key–value
store for caching, and a relational database for transactions. Big Data Problems can be
classified into two types: read-mostly data, which is accessed a lot but rarely updated (like
data warehouses, images, or sensor logs), and read-write (transactional) data, which changes
often and must be highly available (like banking or online shopping). Aggregate-Oriented
NoSQL Databases—such as key–value, document, and column-family stores—organize
related data together into aggregates, making them easier to distribute across clusters for
scalability and performance.

Choosing distribution models:

Two main techniques are used: sharding and replication, and systems may adopt either or a
combination of both. For instance, Riak uses both sharding and replication together.

Sharding refers to the horizontal partitioning of a large database, where rows are divided into
smaller, more manageable subsets. Each partition forms a shard, representing a fraction of the
overall dataset, and these shards can be distributed across different servers or even different
physical locations. This approach ensures that large datasets are broken down and distributed
for better performance and scalability.
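
As a small illustration of the idea, the sketch below assigns rows of a made-up orders table to shards by hashing the row key; the table, key, and shard count are assumptions for the example, and real systems also offer range-based or directory-based sharding.

import hashlib

NUM_SHARDS = 4

def shard_for(row_key):
    # Hash the key and map it to one of the shards (0..NUM_SHARDS-1).
    digest = hashlib.md5(str(row_key).encode()).hexdigest()
    return int(digest, 16) % NUM_SHARDS

orders = [{"order_id": i, "amount": 100 + i} for i in range(10)]
shards = {s: [] for s in range(NUM_SHARDS)}
for row in orders:
    shards[shard_for(row["order_id"])].append(row)

for s, rows in shards.items():
    print("shard", s, "holds", len(rows), "rows")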

Replication, on the other hand, involves copying entire datasets across multiple servers to
improve data availability and reliability. Replication can be implemented in two ways:
master–slave and peer-to-peer. In master–slave replication, a single node (the master) holds
the authoritative copy of data and handles all write operations, while slave nodes synchronize
from the master and serve read requests. This reduces update conflicts but the downside is
that if the master crashes, the whole system may stop working — this is called a single point
of failure (SPOF). To reduce this risk, systems may use RAID drives (to protect against disk
failure) or set up a standby master (a backup master that is kept updated with the current
master’s data). However, making the standby automatically take over when the main master
fails (called failover) is tricky, and testing this backup without risking the live cluster is also
difficult.

In contrast, peer-to-peer replication eliminates reliance on a single master. Every node in the
cluster can accept writes and acts like a master, ensuring that even if one node fails, the
system continues to function. However, this requires constant synchronization among nodes,
increasing communication overhead.

Ultimately, the choice of distribution model depends on business needs. For example,
master–slave replication is often suitable for batch processing jobs where write conflicts are
minimal, while peer-to-peer replication is favored for high-availability systems where fault
tolerance is critical. Hadoop’s early versions adopted a master–slave design with a
NameNode as the master, though later versions improved fault tolerance by addressing SPOF
issues. Similarly, HBase follows a master–slave model, whereas Cassandra uses a
peer-to-peer model to achieve high availability and scalability.

master-slave versus peer-to-peer,


Master-Slave Architecture

In this model, there's a single, centralized Master node that controls all the other Slave nodes.

●​ Master Node: The master node, also known as the NameNode in Hadoop, acts as a
central authority. It manages the entire cluster, keeps a database of all the other nodes,
and handles all client requests.
●​ Slave Nodes: The slave nodes perform the actual work as instructed by the master.
They don't communicate with each other directly.
●​ Fault Tolerance: To prevent a single point of failure, a Standby Master is often used. It
takes over if the primary master fails. However, if both the master and standby fail,
the entire system can go down.

This architecture is simpler to set up and manage but can be a bottleneck for large systems.

Peer-to-Peer Architecture

In this model, all nodes are equal and can communicate directly with each other. There is no
central authority.

●​ All Nodes are Peers: Each node in the cluster has a copy of the cluster's information
and can process requests. Requests can be distributed among any of the nodes.
●​ Decentralized: There is no single master node. This eliminates the bottleneck and
single point of failure inherent in the master-slave model.
●​ Fault Tolerance: If any single node crashes, the other nodes can easily take over its
work and the system continues to function. This makes peer-to-peer systems highly
fault-tolerant and resilient.
Module 3:

The Map Tasks,

Grouping by Key,
The Reduce Tasks,

Combiners,
Details of MapReduce Execution,
Distributed Execution and Coordination:

When a user submits a MapReduce job, the system takes care of the complex task of
distributing the code across a cluster of machines. It handles all the scheduling and
synchronization to ensure the job runs smoothly and efficiently. The system's scheduler aims
to give all users a fair share of the cluster's processing power. A key feature for optimizing
performance is speculative execution. If a particular machine is running a task very slowly,
the scheduler can assign an additional instance of that same task to another machine.
Whichever instance finishes first "wins," and the slower one is terminated. This helps to
prevent a single slow node from delaying the entire job.

Synchronization and Data Flow:

MapReduce jobs are divided into two main phases: the Map phase and the Reduce phase. The
execution of these phases is strictly synchronized. The Reduce phase cannot begin until all
the tasks in the Map phase are fully completed. The Map tasks generate intermediate data,
which is a set of key-value pairs. This data must be prepared for the Reduce phase through a
crucial step called shuffle/sort. This process involves all the nodes that ran the Map tasks and
all the nodes that will run the Reduce tasks. The key-value pairs are grouped by their key, and
then sorted, before being sent to the appropriate Reduce task, ensuring that all data for a
given key is processed together. This synchronized process is fundamental to the MapReduce
model's ability to process vast amounts of data in a structured and reliable way.
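
The shuffle/sort step can be pictured with a small local Python simulation; the intermediate key-value pairs below are made up, and a real framework performs this grouping across many machines.

from itertools import groupby
from operator import itemgetter

# Intermediate (key, value) pairs as emitted by the map tasks.
map_output = [("hadoop", 1), ("big", 1), ("data", 1), ("hadoop", 1), ("data", 1)]

shuffled = sorted(map_output, key=itemgetter(0))          # sort by key
for key, pairs in groupby(shuffled, key=itemgetter(0)):   # group by key
    values = [v for _, v in pairs]
    print(key, values)   # e.g. data [1, 1] is handed to one reduce call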

Coping With Node Failures.


(The topics listed above were already covered in the earlier sections of these notes; the points below add further detail.)
Algorithms Using MapReduce:

Matrix-Vector multiplication by MapReduce,


https://www.youtube.com/watch?v=fboNRKmCM8k&pp=ygUpbWF0cml4IHZlY3RvciBtdWx0aXBsaWNhdGlvbiBieSBtYXByZWR1Y2U%3D
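
In outline, the standard approach (assuming the vector v is small enough to be available to every map task) is: the map function turns each matrix entry m_ij into the pair (i, m_ij * v_j), and the reduce function for key i sums these partial products to give the i-th entry of the result. A minimal local simulation, with a made-up 2x2 matrix stored as sparse (i, j, value) triples:

from collections import defaultdict

M = [(0, 0, 1.0), (0, 1, 2.0), (1, 0, 3.0), (1, 1, 4.0)]   # example matrix entries
v = [10.0, 20.0]                                            # vector, known to every mapper

# Map: each entry m_ij contributes the pair (i, m_ij * v_j).
intermediate = [(i, value * v[j]) for (i, j, value) in M]

# Shuffle/sort: group the partial products by row index i.
groups = defaultdict(list)
for i, partial in intermediate:
    groups[i].append(partial)

# Reduce: sum the partial products for each row i.
result = {i: sum(parts) for i, parts in groups.items()}
print(result)   # {0: 50.0, 1: 110.0}
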
Relational-Algebra Operations,

The following operations are performed either in mapper or in reducer:

• Selection
Pick rows (tuples) from a table that satisfy a given condition C.

●​ Example: From a Student table, keep only students with cgpa > 8.

• Projection
Pick columns (attributes) from a table.

●​ Think: Like hiding unwanted columns and keeping only what you need.​

●​ Example: From a Student table, keep only name and cgpa.

• Union, intersection and difference

Union (R ∪ S)

Combine rows from two tables (no duplicates).

●​ Example: Students_in_Sports ∪ Students_in_Music = all students in either sports or music.

Intersection (R ∩ S)

Keep only the rows that are in both tables.

●​ Example: Students_in_Sports ∩ Students_in_Music = students who are in both sports and music.​

Difference (R – S)

Keep rows from the first table that are not in the second.

●​ Example: Students_in_Sports – Students_in_Music = students only in sports, not in music.​

• Natural join

A Natural Join combines two tables (relations) by matching rows that have the same values in
all common attributes. If the shared columns match, the result includes a single merged row
with attributes from both tables; if they don’t match, that pair is ignored. For example, if
Students and Courses both have an attribute student_id, a natural join will combine rows only
when student_id values are the same. It avoids duplicate columns for the shared attributes.
• Grouping and aggregation

Grouping and Aggregation means splitting rows of a table into groups based on certain
attributes, and then applying functions (like SUM, COUNT, AVG, MIN, MAX) to other
attributes within each group.

For example, if you have a relation Sales(seller, product, amount), and you group by seller,
you can compute:

●​ SUM(amount) → total sales per seller​

●​ COUNT(product) → number of products sold per seller​

This is written as γX(R), where X includes grouping attributes and aggregate expressions. For
instance:​
γ seller, SUM(amount), COUNT(product) (Sales) gives total and count of sales grouped by
each seller.
Computing Selections by MapReduce,

Map Function: For every tuple t in relation R, it checks if the tuple satisfies a condition C
(like age > 30). If it does, the mapper outputs (t, t) as a key-value pair, where both key and
value are the tuple itself. If it doesn’t satisfy the condition, nothing is emitted.​
Reduce Function: The reducer here is just the identity function, meaning it doesn’t need to
compute anything extra. It simply takes the mapper’s output and writes it directly to the final
result.
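
A minimal local simulation of this, with a made-up relation R of (name, age) tuples and the condition age > 30:

# Selection in MapReduce style: map emits (t, t) only for tuples that satisfy
# the condition C; the reduce step is the identity.
R = [("ana", 25), ("raj", 34), ("lee", 41)]
C = lambda t: t[1] > 30                       # the selection condition

map_output = [(t, t) for t in R if C(t)]      # map: filter and emit (t, t)
result = [t for (t, _) in map_output]         # reduce: identity, just pass through
print(result)   # [('raj', 34), ('lee', 41)]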

Computing Projections by MapReduce,

Map Function: For every tuple t in relation R, it constructs a smaller tuple t′ by keeping only
the attributes in set S (the attributes we want in the projection). Then it outputs (t′, t′) as the
key-value pair.
Reduce Function: Since multiple identical tuples t′ may appear, the reducer groups them. For
each key t′, it sees a list [t′, t′, …, t′]. The reducer outputs only one (t′, t′), effectively
removing duplicates.
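
A minimal local simulation, with a made-up relation R and the attribute set S chosen as the name and cgpa positions; the set() call stands in for the reducer's duplicate elimination:

# Projection in MapReduce style: map builds the reduced tuple t', reduce keeps
# a single copy of each t' to remove duplicates.
R = [("ana", "CSE", 8.1), ("raj", "ECE", 7.4), ("ana", "CSE", 8.1)]
S = (0, 2)                                            # keep name and cgpa only

map_output = [tuple(t[i] for i in S) for t in R]      # map: emit t' for each tuple
result = sorted(set(map_output))                      # reduce: drop duplicate t'
print(result)   # [('ana', 8.1), ('raj', 7.4)]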

Union, Intersection, and Difference by MapReduce,
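
The notes do not spell these out, so the sketch below follows the usual textbook formulation: the map phase tags every tuple with the relation it came from, and the reduce phase decides from the collected tags whether the tuple belongs in the union, the intersection, or the difference. The relations R and S here are made up.

from collections import defaultdict

R = [("ana",), ("raj",), ("lee",)]
S = [("raj",), ("kim",)]

# Map: tag every tuple with its source relation.
map_output = [(t, "R") for t in R] + [(t, "S") for t in S]

# Shuffle/sort: group the tags by tuple.
groups = defaultdict(list)
for t, tag in map_output:
    groups[t].append(tag)

# Reduce: emit each tuple according to the operation wanted.
union        = list(groups)                                               # in R or S
intersection = [t for t, tags in groups.items() if set(tags) == {"R", "S"}]
difference   = [t for t, tags in groups.items() if set(tags) == {"R"}]    # only in R

print(union)
print(intersection)
print(difference)
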


Computing Natural Join by MapReduce,
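
A sketch of the usual reduce-side join, assuming made-up Students and Courses relations that share student_id: the map phase emits the join key together with each tuple tagged by its source relation, and the reducer pairs every Students tuple with every Courses tuple that carries the same key.

from collections import defaultdict

students = [(1, "ana"), (2, "raj")]                  # (student_id, name)
courses  = [(1, "BDA"), (1, "ML"), (3, "OS")]        # (student_id, course)

# Map: emit (join_key, (relation_tag, rest_of_tuple)).
map_output  = [(sid, ("Students", name))   for sid, name   in students]
map_output += [(sid, ("Courses",  course)) for sid, course in courses]

# Shuffle/sort: group the tagged tuples by the join key.
groups = defaultdict(list)
for key, tagged in map_output:
    groups[key].append(tagged)

# Reduce: for each key, combine every Students tuple with every Courses tuple.
joined = []
for sid, tagged in groups.items():
    names    = [x for tag, x in tagged if tag == "Students"]
    subjects = [x for tag, x in tagged if tag == "Courses"]
    joined += [(sid, n, c) for n in names for c in subjects]

print(joined)   # [(1, 'ana', 'BDA'), (1, 'ana', 'ML')]; unmatched keys are dropped
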
Grouping and Aggregation by MapReduce
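
A sketch of γ seller, SUM(amount), COUNT(product) (Sales) in MapReduce style, using a made-up Sales relation; the map phase emits (seller, amount) and the reducer applies the aggregates to each group.

from collections import defaultdict

sales = [("asha", "pen", 50), ("ravi", "book", 120), ("asha", "book", 200)]

# Map: emit (grouping attribute, value to aggregate) for every Sales tuple.
map_output = [(seller, amount) for seller, product, amount in sales]

# Shuffle/sort: group the amounts by seller.
groups = defaultdict(list)
for seller, amount in map_output:
    groups[seller].append(amount)

# Reduce: apply SUM and COUNT to each group.
totals = {seller: (sum(a), len(a)) for seller, a in groups.items()}
print(totals)   # {'asha': (250, 2), 'ravi': (120, 1)}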

QUESTION(PYQ)

Hadoop Ecosystem Component:

●​ The component used is MapReduce (for processing large datasets in parallel).


●​ HDFS (Hadoop Distributed File System) stores the documents across the cluster.
●​ Together, HDFS + MapReduce solve this problem efficiently.

Complete Flow to Solve the Problem

●​ Data Storage in HDFS


○​ All documents are stored in HDFS.
○​ HDFS splits large files into blocks (default 64/128 MB) and distributes them
across DataNodes.
●​ Input Splitting
○​ The input documents are divided into splits, and each split is processed by a
Mapper.
●​ Map Phase
○​ Each Mapper reads its input split line by line.
○​ It tokenizes text into words.
○​ For every word, the Mapper outputs (word, 1).
○​ Example: For the line “Big data Hadoop” →​
Output: (Big,1), (data,1), (Hadoop,1).
●​ Shuffle & Sort (Grouping by Key)
○​ The framework groups all values belonging to the same key (word).
○​ Example: (Hadoop,1), (Hadoop,1), (Hadoop,1) → grouped as (Hadoop,
[1,1,1]).
●​ Reduce Phase
○​ Each Reducer takes a key and its list of values.
○​ It aggregates them using sum.
○​ Example: (Hadoop, [1,1,1]) → (Hadoop, 3).
●​ Output
○​ The final output is written back to HDFS as a file containing words and their
total counts.
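
The same flow can be sketched as a pair of Hadoop Streaming scripts; the file names are illustrative only, and each script simply reads standard input and writes tab-separated key-value pairs, with the framework performing the shuffle and sort in between.

# --- mapper.py : emits (word, 1) for every word read from standard input ---
import sys

for line in sys.stdin:
    for word in line.strip().split():
        print(word + "\t1")

# --- reducer.py : input arrives grouped and sorted by word, so a running sum per word suffices ---
import sys

current_word, count = None, 0
for line in sys.stdin:
    word, value = line.rstrip("\n").split("\t")
    if word == current_word:
        count += int(value)
    else:
        if current_word is not None:
            print(current_word + "\t" + str(count))
        current_word, count = word, int(value)
if current_word is not None:
    print(current_word + "\t" + str(count))

In a streaming job these two scripts would be passed to Hadoop through the streaming jar's -mapper and -reducer options; the exact command depends on the installation.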
