Big Data & Cloud Computing-LDS7005M
MSc Computer Science
Week 8: Big Data Processing & Storage
Module Director(s): Dr Gayathri
Lecturer(s): Dr Efosa
Objectives
▪ Develop a comprehensive understanding of data analytics programs
▪ Analyse the Hadoop architecture from an overview perspective
▪ Understand how Hadoop MapReduce works
▪ Understand how cloud infrastructure facilitates big data processing and storage
Introduction to Hadoop and Spark
Antonino Virgillito
(Anukrati Mehta, 2022)
Large-scale Computation
Traditional solutions for computing large quantities of data relied mainly on the processor:
• Complex processing made on data moved into memory
• Scale only by adding power (more memory, a faster processor)
• Works for relatively small to medium amounts of data, but cannot keep up with larger datasets
How to cope with today’s indefinitely growing production of data?
• Terabytes per day
Distributed Computing
• Multiple machines connected to each other and cooperating on a common job
• «Cluster»
• Challenges
  • Complexity of coordination – all processes and data have to be kept synchronized with the global system state
  • Failures
  • Data distribution
Introduction to Hadoop
Hadoop is an open-source framework designed for distributed storage and processing of large data sets using a cluster of commodity hardware.
It provides a scalable, fault-tolerant, and cost-effective solution for handling massive amounts of data across distributed computing nodes.
Hadoop is part of the Apache Software Foundation and consists of two main components:
• Hadoop Distributed File System (HDFS)
• MapReduce
Hadoop
• Open-source platform for distributed processing of large datasets
• Based on a project developed at Google
• Functions:
  • Distribution of data and processing across machines
  • Management of the cluster
• Simplified programming model: easy to write distributed algorithms
Hadoop Scalability
• Hadoop can reach massive scalability by exploiting a simple distribution architecture and coordination model
• Huge clusters can be made up using (cheap) commodity hardware
• Clusters can easily scale up with little or no modifications to the programs
Hadoop Concepts
Applications are written Inter-node Data is distributed in Data is replicated for Scalability and fault-
in common high-level communication is limited advance availability and reliability tolerance
languages to the minimum
Bring the computation close to
the data
Scalability and Fault-Tolerance
Scalability principle
• Capacity can be increased by adding nodes to the cluster
• Increasing load does not cause failures, but in the worst case only a graceful degradation of performance
Fault-tolerance
• Failures of nodes are considered inevitable and are coped with in the architecture of the platform
• The system continues to function when a node fails – tasks are re-scheduled
• Data replication guarantees no data is lost
• Dynamic reconfiguration of the cluster when nodes join and leave
Benefits of Hadoop
• Previously impossible or impractical analysis made possible
• Lower cost of hardware
• Less time
• Ask bigger questions
Hadoop Components
[Diagram: Hadoop ecosystem – Hive, Pig, Sqoop, HBase, Flume, Mahout and Oozie built around the Core Components]
(Data Flair, 2024)
Hadoop Core Components
HDFS: Hadoop Distributed File System
• Abstraction of a file system over a cluster
• Stores large amounts of data by transparently spreading it over different machines
MapReduce
• Simple programming model that enables parallel execution of data processing programs
• Executes the work on the data, near the data
In a nutshell: HDFS places the data on the cluster and MapReduce does the processing work
https://www.simplilearn.com/tutorials/hadoop-tutorial/what-is-hadoop
(Ref: Simplilearn tutorial, 2024)
Structure of a Hadoop Cluster
Hadoop Cluster: group of machines working together to store and process data
• Any number of “worker” nodes: run both HDFS and MapReduce components
• Two “Master” nodes:
  • Name Node: manages HDFS
  • Job Tracker: manages MapReduce
Hadoop Principle
▪ Hadoop is basically a middleware platform that manages a cluster of machines
▪ The core component is a distributed file system (HDFS)
▪ Files in HDFS are split into blocks that are scattered over the cluster
▪ The cluster can grow indefinitely simply by adding new nodes
[Diagram: one big data set split into blocks and spread across the HDFS nodes of the cluster]
The MapReduce Paradigm
• Parallel processing paradigm
• Programmer is unaware of parallelism
• Programs are structured into a two-phase execution
  • Map: data elements are classified into categories
  • Reduce: an algorithm is applied to all the elements of the same category
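To make the two phases concrete, here is a minimal pure-Python sketch (illustrative only – this is not Hadoop code): the map step classifies elements into categories as key-value pairs, a shuffle step groups pairs by key, and the reduce step applies an aggregation to each group.

from collections import defaultdict

# Map: classify each element into a category, emitting (key, value) pairs.
def map_phase(numbers):
    for n in numbers:
        yield ("even" if n % 2 == 0 else "odd", n)

# Shuffle: group all values that share the same key.
def shuffle(pairs):
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

# Reduce: apply an algorithm (here, a sum) to every group.
def reduce_phase(groups):
    return {key: sum(values) for key, values in groups.items()}

data = [1, 2, 3, 4, 5, 6]
print(reduce_phase(shuffle(map_phase(data))))   # {'odd': 9, 'even': 12}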
MapReduce Concepts
MapReduce abstracts
Automatic
A clean abstraction for all the ‘housekeeping’
parallelization and Fault-tolerance
programmers away from the
distribution
developer
Developer can simply
MapReduce programs Can be written in any
All of Hadoop is concentrate on writing
are usually written in language using
written in Java the Map and Reduce
Java Hadoop Streaming
functions
MapReduce and Hadoop
MapReduce is logically placed on top of HDFS
[Diagram: Hadoop = MapReduce layer on top of the HDFS layer]
HDFS & MapReduce
HDFS is a distributed file system that stores data across multiple machines in a Hadoop cluster.
Characteristics:
• Data is divided into blocks and distributed across nodes in the cluster.
• Provides fault tolerance by replicating data across multiple nodes.
• Enables high-throughput data access and parallel processing.
MapReduce is a programming model and processing engine for distributed data processing.
Characteristics:
• Breaks down data processing tasks into two phases: Map and Reduce.
• The Map phase processes input data and produces key-value pairs.
• The Reduce phase aggregates and processes the results from the Map phase to produce the final output.
MapReduce and Hadoop
▪ MR works on (big) files loaded on HDFS
▪ Each node in the cluster executes the MR program in parallel, applying the map and reduce phases on the blocks it stores
▪ Output is written to HDFS
[Diagram: each cluster node runs MR on the HDFS blocks it stores]
Scalability principle:
The computation should be performed where the data is located
Hadoop in MS Azure
Hadoop & Map Reduce Example
• The Hadoop Streaming command is used to run custom scripts (like Python) in a
Hadoop MapReduce job.
Basic Hadoop Streaming Command:
hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-*.jar \
-input <input_directory_or_file> \
-output <output_directory> \
-mapper "<your_mapper_script>" \
-reducer "<your_reducer_script>"
Hadoop & Map Reduce Example
(Optional)
1. Requirements before installation – JDK from Oracle
2. Set environment variables
3. Hadoop download and set up
4. MapReduce Word Count Python File
5. For this demonstration, I will be using Hadoop version 3.2.4 and JDK 1.8.0_202 (jdk-8u202-windows-x64.exe)
Hadoop & Map Reduce Example
• Create these two folders in the C: directory when installing the JDK
• Create environment variables based on the location of the JDK and the JDK/bin folder. You can verify if
this is correct using the CMD prompt below:
This shows we have successfully installed the required Java
runtime.
Hadoop & Map Reduce Example
• Extract the Hadoop archive (this will take a few minutes) and move it to the local disk where you installed the JDK
• When you check the Hadoop folder, you will see its contents
Hadoop & Map Reduce Example
• We also need to set environment variables for Hadoop. This is done in the etc folder:
replace the bin folder and configure the XML files in the etc folder according to the configuration provided by Apache.
When done, enter the following at an administrator command prompt:
- hdfs namenode -format
  formats the namenode
- jps
  lists the running services
- start-all.cmd
  starts all the Hadoop daemons
- jps
  lists the running services again after the start
- cls
  clears the console of previous output
HDFS (Hadoop Distributed File System) Web UI
• On this local machine, it is accessible via http://127.0.0.1:9870
• It provides information about the file system, such as the status of the NameNode, storage capacity, and block distribution.
• You can use this interface to browse HDFS, check file locations, and monitor the health of the distributed storage system.
• From the running services in cmd, check for the local web URL from the Apache Hadoop Distribution namenode.
• Let’s go over to my browser to see this.
Hadoop browse files
• View Utilities > Browse files: to check whether there are uploaded files.
Datanode information
• You can check the datanode to see the node running. This will let you know if it’s a single node or a cluster.
YARN Resource Manager Web UI
• To check the cluster information and scheduler, it is accessible via http://127.0.0.1:8088
• It displays information about the cluster’s resource usage, running applications, and job statuses.
• You can monitor and manage tasks, containers, and application logs through this interface.
Word count example using Hadoop and Python
In the MapReduce word count example, we find out the frequency of each word. Here, the role of the Mapper is to emit each word as a key with a count of 1, and the role of the Reducer is to aggregate the counts of the common keys. So, everything is represented in the form of key-value pairs.
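For instance, assuming a hypothetical one-line input "big data big cloud" (not taken from the lecture material), the Mapper would emit the pairs (big, 1), (data, 1), (big, 1), (cloud, 1); after grouping by key, the Reducer would produce (big, 2), (cloud, 1), (data, 1).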
Prerequisite commands
• hdfs dfs -mkdir -p /user/hadoop/input #To Create an HDFS Directory for Input
• hdfs dfs -put C:\hadoop\mapreduce\input.txt /user/hadoop/input #To copy
input.txt from Local to HDFS. This command uploads input.txt from your local
system to HDFS at /user/hadoop/input/input.txt
• hdfs dfs -ls /user/hadoop/input # Check whether the file was uploaded
successfully
• hdfs dfs -cat /user/hadoop/input/input.txt #To read the uploaded file.
• hdfs dfs -ls /user/hadoop/output # After the job runs, _SUCCESS indicates the job completed successfully and part-00000 contains the word count results.
• Store mapper.py and reducer.py inside your working directory; let’s keep them here: “C:\hadoop\mapreduce\”
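The demonstration’s own scripts are not reproduced here, so below is a minimal sketch of what mapper.py and reducer.py could look like for a Hadoop Streaming word count, assuming whitespace-delimited text input; the scripts read from stdin and write tab-separated key-value pairs to stdout, as Hadoop Streaming expects.

#!/usr/bin/env python
# mapper.py – emit each word with a count of 1
import sys

for line in sys.stdin:
    for word in line.strip().split():
        print(f"{word}\t1")

#!/usr/bin/env python
# reducer.py – sum the counts for each word
# (Hadoop Streaming delivers the mapper output sorted by key)
import sys

current_word, current_count = None, 0
for line in sys.stdin:
    word, count = line.rstrip("\n").split("\t", 1)
    if word == current_word:
        current_count += int(count)
    else:
        if current_word is not None:
            print(f"{current_word}\t{current_count}")
        current_word, current_count = word, int(count)
if current_word is not None:
    print(f"{current_word}\t{current_count}")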
Running the Hadoop Streaming job
hadoop jar %HADOOP_HOME%/share/hadoop/tools/lib/hadoop-streaming-*.jar \
  -input /user/hadoop/input/input.txt \
  -output /user/hadoop/output \
  -mapper "python mapper.py" \
  -reducer "python reducer.py"
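Once the job finishes, the word counts can be inspected directly from HDFS, e.g. with hdfs dfs -cat /user/hadoop/output/part-00000 (the part-00000 file mentioned above).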
Hadoop Completed Job
Hadoop Completed Job
After Reducer & Final Output
After Mapper
Real-case scenarios for Hadoop & MapReduce
Fraud detection in financial systems
Mapper checks transactions for abnormal patterns (large withdrawals, multiple withdrawals,
location mismatches) while Reducer aggregates risky transactions and flags potential fraud.
Weather forecasting
Mapper processes satellite temperature and pressure readings. Reducer calculates the average values and trends for forecasts.
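To connect the fraud detection scenario back to Hadoop Streaming, here is a small illustrative Python mapper sketch; the comma-separated record layout (account_id, amount, country) and the 10,000 threshold are hypothetical assumptions, not part of the lecture material.

#!/usr/bin/env python
# fraud_mapper.py – illustrative sketch only; assumes comma-separated lines of
# account_id,amount,country (hypothetical record format)
import sys

THRESHOLD = 10000  # assumed limit for an unusually large withdrawal

for line in sys.stdin:
    try:
        account_id, amount, country = line.strip().split(",")
    except ValueError:
        continue  # skip malformed records
    if float(amount) > THRESHOLD:
        # emit the account as the key so a reducer can aggregate risky activity
        print(f"{account_id}\t{amount},{country}")

A matching reducer would then count or total the flagged transactions per account and emit the accounts whose aggregated activity looks risky.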
Seminar Activity
• Jupyter visualisations – this week
• Mock presentation – 10 mins
Support available
• Academic Quality to check in with Student Support and Guidance Manager and Head of Student Opportunities
• Academic:
• Academic Skills Session (Compulsory – Scheduled session available on timetable)
• Small Group Academic Writing Tutorials both online and in-person (currently covering: Writing Critically, Essay Writing, Report Writing, Paraphrasing and Harvard Style Referencing); 1:1 support available on request if needed
• Targeted 1:1 support for students referred for Academic Misconduct
• Skills Guides available online to support
• Wellbeing:
• 1:1 Wellbeing Appointments
• Mental Health Support (online)
• Welfare Appointments (online)
• Wellbeing Breakfasts
Any Questions ?
Thank You ☺