Introduction: MapReduce
•   The concept of MapReduce was pioneered by Google.
   •   The original paper titled "MapReduce: Simplified Data Processing on Large
       Clusters" was written by Jeffrey Dean and Sanjay Ghemawat, and it was
       published in 2004.
   •   In the paper, they introduced the MapReduce programming model and
       described its implementation at Google for processing large-scale data across
       distributed clusters.
   •   MapReduce became a fundamental framework for distributed computing and
       played a significant role in the development of big data technologies.
   •   While Google introduced the concept, the open-source Apache Hadoop project
       later implemented its own version of MapReduce, making it accessible to a
       broader community of developers and organizations.
        Prerequisites that can help you grasp MapReduce more effectively
1. Programming Languages:
   •   Proficiency in a programming language is crucial.
   •   Java is commonly used in the Hadoop ecosystem, and many MapReduce
       examples are written in Java.
   •   Knowledge of Python can also be useful.
2. Distributed Systems:
   •   Understanding the basics of distributed computing is essential.
   •   Familiarize yourself with concepts like nodes, clusters, parallel processing, and
       the challenges associated with distributed systems.
3. Hadoop Ecosystem:
   •   MapReduce is often associated with the Hadoop framework.
   •   Therefore, it's helpful to have a basic understanding of Hadoop and its
       ecosystem components, such as HDFS (Hadoop Distributed File System) and
       YARN (Yet Another Resource Negotiator).
4. Basic Understanding of Big Data:
   •   MapReduce is commonly used in the context of big data processing.
   •   It's beneficial to have a foundational understanding of what constitutes "big
       data," the challenges associated with large datasets, and the motivation behind
       distributed computing for big data.
5. Linux/Unix Commands:
   •   Many big data platforms, including Hadoop, are typically deployed on Unix-
       like systems.
   •   Familiarity with basic command-line operations in a Unix environment can be
       helpful for interacting with Hadoop clusters.
6. SQL (Structured Query Language):
   •   If you are planning to use tools like Apache Hive, which provides a SQL-like
       interface for querying data in Hadoop, a basic understanding of SQL can be
       beneficial.
7. Concepts of Data Storage and Retrieval:
   •   Understanding how data is stored and retrieved in a distributed environment
       is crucial.
   •   Concepts like Sharding, replication, and indexing are relevant.
8. Algorithmic and Problem-Solving Skills:
   •   MapReduce involves breaking down problems into smaller tasks that can be
       executed in parallel.
   •   Strong algorithmic and problem-solving skills are valuable for designing
       efficient MapReduce jobs.
                                     Explanation
Q: Describe MapReduce Execution steps with a neat diagram 12 M
   •   MapReduce is a programming model and processing technique designed for
       processing and generating large datasets that can be parallelized across a
       distributed cluster of computers.
   •   A job means a MapReduce program.
   •   Each job consists of several smaller units, called MapReduce tasks.
   •   The basic idea behind MapReduce is to divide a large computation into smaller
       tasks that can be performed in parallel across multiple nodes in a cluster.
In a MapReduce job
   1. The data is split into smaller chunks, and a "map" function is applied to each
       chunk independently.
   2. The results are then shuffled and sorted, and a "reduce" function is applied to
       combine the intermediate results into the final output.
The MapReduce programming approach allows for efficient processing of large datasets
in a distributed computing environment, as the local sketch below illustrates.
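To make the flow concrete, here is a minimal plain-Java sketch that simulates split, map,
shuffle/sort, and reduce on an in-memory list. It is an illustration only, not an actual
Hadoop job; the class and variable names are chosen just for this example.

import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

// A local, single-JVM simulation of the MapReduce flow, for illustration only.
public class LocalMapReduceSketch {
    public static void main(String[] args) {
        // 1. The input is split into smaller chunks.
        List<String> chunks = List.of("hello Hadoop hi Hadoop", "Hello MongoDB hi Cassandra Hadoop");

        // 2. Map phase: each chunk is processed independently, emitting intermediate (word, 1) pairs.
        List<Map.Entry<String, Integer>> intermediate = chunks.stream()
                .flatMap(chunk -> Arrays.stream(chunk.split("\\s+")).map(word -> Map.entry(word, 1)))
                .collect(Collectors.toList());

        // 3. Shuffle and sort: group the intermediate values by key (TreeMap keeps keys sorted).
        Map<String, List<Integer>> grouped = intermediate.stream()
                .collect(Collectors.groupingBy(Map.Entry::getKey, TreeMap::new,
                        Collectors.mapping(Map.Entry::getValue, Collectors.toList())));

        // 4. Reduce phase: combine each group's list of values into the final (word, count) output.
        grouped.forEach((word, ones) ->
                System.out.println(word + "\t" + ones.stream().mapToInt(Integer::intValue).sum()));
    }
}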
JobTracker and Task Tracker
   •   MapReduce consists of a single master JobTracker and one slave TaskTracker
       per cluster node.
   •   The master is responsible for scheduling the component tasks in a job onto the
       slaves, monitoring them and re-executing the failed tasks.
   •   The slaves execute the tasks as directed by the master.
   •   The MapReduce framework operates exclusively on (key, value) pairs.
   •   The framework views the input to the task as a set of (key, value) pairs and
       produces a set of (key, value) pairs as the output of the task, possibly of
       different types, as the sketch below summarizes.
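The change of key-value types from (k1, v1) to (k2, v2) to (k3, v3) can be summarized
with the following simplified generic signatures. This is a conceptual sketch only; the
actual Hadoop API uses the Mapper and Reducer base classes shown later.

import java.util.List;
import java.util.Map;

// Conceptual type contract only; not the real Hadoop interfaces.
interface MapFunction<K1, V1, K2, V2> {
    List<Map.Entry<K2, V2>> map(K1 key, V1 value);            // map:    (k1, v1) -> list(k2, v2)
}

interface ReduceFunction<K2, V2, K3, V3> {
    List<Map.Entry<K3, V3>> reduce(K2 key, List<V2> values);  // reduce: (k2, list(v2)) -> list(k3, v3)
}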
Map-Tasks
A map task is a task that implements the map() function,
which runs user application code for each key-value pair (k1, v1).
   •   Key k1 is a set of keys.
   •   Key k1 maps to a group of data values.
   •   Values v1 are large strings read from the input file(s).
   •   The output of map() is either zero pairs (when no values are found) or a set of
       intermediate key-value pairs (k2, v2), as in the word-count mapper sketch below.
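As a concrete sketch, a word-count map() written against the Hadoop Java API typically
looks like the standard example below; the class name TokenizerMapper follows the common
tutorial example.

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// For each input pair (k1 = byte offset, v1 = line of text), emit (k2 = word, v2 = 1).
public class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        StringTokenizer itr = new StringTokenizer(value.toString());
        while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, one);   // intermediate key-value pair (k2, v2)
        }
    }
}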
Reduce Task
   •   Refers to a task which takes the output (k2, v2) from the map as input and
       combines those data pieces into a smaller set of data, often with the help of a
       combiner.
   •   The reduce task is always performed after the map task.
Key-Value Pair
Each phase (Map phase and Reduce phase) of MapReduce has key-value pairs as
input and output.
Data should be first converted into key-value pairs before it is passed to the Mapper,
as the Mapper only understands key-value pairs of data.
Key-value pairs in Hadoop MapReduce are generated as follows:
   •   InputSplit - Defines a logical representation of the data and presents the split
       data for processing by an individual map().
   •   RecordReader - Communicates with the InputSplit and converts the split into
       records, which are in the form of key-value pairs in a format suitable for reading
       by the Mapper.
   •   RecordReader uses TextInputFormat by default for converting data into key-value
       pairs, as illustrated below.
   •   RecordReader communicates with the InputSplit until the file is read.
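For instance, with the default TextInputFormat, each record handed to the Mapper is a
(byte offset, line of text) pair. The following rough local sketch (plain Java, not Hadoop
code) shows what those records look like for a small split:

// Rough illustration of what TextInputFormat's RecordReader emits:
// key = byte offset of the line within the split, value = the line's text.
public class RecordReaderSketch {
    public static void main(String[] args) {
        String split = "hello Hadoop\nhi Hadoop\nHello MongoDB\n";
        long offset = 0;
        for (String line : split.split("\n")) {
            System.out.println("(" + offset + ", \"" + line + "\")");  // (key, value) for the Mapper
            offset += line.getBytes().length + 1;                      // +1 for the newline
        }
    }
}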
Grouping by Key
   •   When a map task completes, the shuffle process aggregates (combines) all the
       Mapper outputs by grouping the keys of the Mapper output, and each value v2 is
       appended to a list of values.
   •   A "Group By" operation on the intermediate keys creates list(v2).
Shuffle and Sorting Phase
   •   All pairs with the same group key (k2) are collected and grouped together,
       creating one group for each key.
   •   The shuffle output format will be a list of (k2, list(v2)) pairs. Thus, a different
       subset of the intermediate key space is assigned to each reduce node, as the
       partitioning sketch below illustrates.
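The assignment of intermediate keys to reduce nodes is usually done by hashing the key.
A minimal sketch of that idea, mirroring what Hadoop's default HashPartitioner does:

// Decide which reduce task receives a given intermediate key (k2).
public class PartitionSketch {
    static int partitionFor(String key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;  // non-negative bucket
    }

    public static void main(String[] args) {
        int reducers = 3;
        for (String k2 : new String[] {"hello", "Hadoop", "hi", "MongoDB", "Cassandra"}) {
            System.out.println(k2 + " -> reducer " + partitionFor(k2, reducers));
        }
    }
}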
Reduce Tasks
   •   Implements reduce(), which takes the shuffled and sorted Mapper output, grouped
       by key as (k2, list(v2)), and applies the reduce function in parallel to each group.
   •   The reduce function iterates over the list of values associated with a key and
       produces outputs such as aggregations and statistics.
   •   The reduce function emits zero or more key-value pairs (k3, v3) to the final
       output file, as in the reducer sketch below. Reduce: (k2, list(v2)) -> list(k3, v3)
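As a concrete sketch, the matching word-count reduce() in the Hadoop Java API is
typically written as below; the class name IntSumReducer follows the common tutorial
example.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// For each group (k2 = word, list(v2) = counts), emit (k3 = word, v3 = total count).
public class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable val : values) {
            sum += val.get();           // iterate over the list of values for this key
        }
        result.set(sum);
        context.write(key, result);     // final output pair (k3, v3)
    }
}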
                        MapReduce Implementation
•   MapReduce is a programming model and processing technique for handling
    large datasets in a parallel and distributed fashion.
•   The word count problem is a classic example of a task that can be solved using
    MapReduce.
   •   Below is a mathematical representation of the MapReduce algorithm for the
       word count problem.
Example:
Step 1: Input Document:
D="hello Hadoop, hi Hadoop, Hello MongoDB, hi Cassandra Hadoop"
Step 2: Map Function:
The Map function processes each word in the document and emits key-value pairs
where the key is the word, and the value is 1 (indicating the count).
Map("hello”) →{("hello",1)},
Map("Hadoop”) →{("Hadoop",1), ("Hadoop",1), ("Hadoop",1)},
Map("hi”) →{("hi",1), ("hi",1)}, …
Step 3: Shuffle and Sort (Grouping by Key):
Group and sort the intermediate key-value pairs by key.
("hello", [1]), ("Hadoop", [1,1,1]), ("hi", [1,1]), …
Step 4: Reduce Function:
The Reduce function takes each unique key and the list of values and calculates the
sum.
Reduce ("hello", [1]) →{("hello",1)},
Reduce ("Hadoop", [1,1,1]) →{("Hadoop",3)},
Reduce ("hi", [1,1]) →{("hi",2)}, …
Step 5: Final Output:
{("hello",1), ("Hadoop",3), ("hi",2), ("Hello",1), ("MongoDB",1), ("Cassandra",1)}