Big Data & Cloud Computing-LDS7005M
MSc Computer Science
Week 8: Big Data Processing & Storage
Module Director(s): Dr Gayathri
Lecturer(s): Dr Efosa
Objectives
▪ Develop a comprehensive understanding of data analytics programs
▪ Analyse the Hadoop architecture from an overview perspective
▪ Understand how Hadoop MapReduce works
▪ Understand how cloud infrastructure facilitates big data processing and storage
Introduction to Hadoop and Spark
Antonino Virgillito
(Anukrati Mehta, 2022)
Large-scale Computation
Traditional solutions for computing large quantities of data relied mainly on the processor:
• Complex processing made on data moved into memory
• Scale only by adding power (more memory, a faster processor)
• Works for relatively small to medium amounts of data, but cannot keep up with larger datasets
How to cope with today’s indefinitely growing production of data?
• Terabytes per day
Distributed Computing
• Multiple machines connected to each other and cooperating on a common job
• «Cluster»
• Challenges
  • Complexity of coordination – all processes and data have to be kept synchronized with the global system state
  • Failures
  • Data distribution
Introduction to Hadoop
Hadoop is an open-source framework designed for distributed storage and processing of large data sets using a cluster of commodity hardware.
It provides a scalable, fault-tolerant, and cost-effective solution for handling massive amounts of data across distributed computing nodes.
Hadoop is part of the Apache Software Foundation and consists of two main components:
• Hadoop Distributed File System (HDFS)
• MapReduce
Hadoop
• Open-source platform for distributed processing of large datasets
• Based on a project developed at Google
• Functions:
  • Distribution of data and processing across machines
  • Management of the cluster
• Simplified programming model: easy to write distributed algorithms
Hadoop Scalability
• Hadoop can reach massive scalability by exploiting a simple distribution architecture and coordination model
• Huge clusters can be made up using (cheap) commodity hardware
• Clusters can easily scale up with little or no modifications to the programs
Hadoop Concepts
Applications are written Inter-node Data is distributed in Data is replicated for Scalability and fault-
in common high-level communication is limited advance availability and reliability tolerance
languages to the minimum
Bring the computation close to
the data
Scalability and Fault-Tolerance
Scalability principle
• Capacity can be increased by adding nodes to the cluster
• Increasing load does not cause failures, but in the worst case only a graceful degradation of performance
Fault-tolerance
• Failures of nodes are considered inevitable and are coped with in the architecture of the platform
• The system continues to function when a node fails – tasks are re-scheduled
• Data replication guarantees no data is lost
• Dynamic reconfiguration of the cluster when nodes join and leave
Benefits of Hadoop
• Previously impossible or impractical analysis made possible
• Lower cost of hardware
• Less time
• Ask bigger questions
Hadoop Components
[Diagram: Hadoop ecosystem – Hive, Pig, Sqoop, HBase, Flume, Mahout and Oozie built around the Core Components]
(Data Flair, 2024)
Hadoop Core Components
HDFS: Hadoop Distributed File System
• Abstraction of a file system over a cluster
• Stores large amounts of data by transparently spreading it over different machines
MapReduce
• Simple programming model that enables parallel execution of data processing programs
• Executes the work on the data, near the data
In a nutshell: HDFS places the data on the cluster and MapReduce does the processing work
https://www.simplilearn.com/tutorials/hadoop-tutorial/what-is-hadoop
(Ref: Simplilearn tutorial, 2024)
Structure of a Hadoop Cluster
Hadoop Cluster: group of machines working together to store and process data
• Any number of “worker” nodes: run both HDFS and MapReduce components
• Two “Master” nodes:
  • Name Node: manages HDFS
  • Job Tracker: manages MapReduce
Hadoop Principle
▪ Hadoop is basically a middleware platform that manages a cluster of machines
▪ The core component is a distributed file system (HDFS)
▪ Files in HDFS are split into blocks that are scattered over the cluster
▪ The cluster can grow indefinitely simply by adding new nodes
[Diagram: one big data set split into blocks and spread across the HDFS nodes of the cluster]
The MapReduce Paradigm
• Parallel processing paradigm
• Programmer is unaware of parallelism
• Programs are structured into a two-phase execution
  • Map: data elements are classified into categories
  • Reduce: an algorithm is applied to all the elements of the same category
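To make the two phases concrete, here is a minimal pure-Python sketch (illustrative only – this is not Hadoop code): the map step classifies elements into categories as key-value pairs, a shuffle step groups pairs by key, and the reduce step applies an aggregation to each group.

from collections import defaultdict

# Map: classify each element into a category, emitting (key, value) pairs.
def map_phase(numbers):
    for n in numbers:
        yield ("even" if n % 2 == 0 else "odd", n)

# Shuffle: group all values that share the same key.
def shuffle(pairs):
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

# Reduce: apply an algorithm (here, a sum) to every group.
def reduce_phase(groups):
    return {key: sum(values) for key, values in groups.items()}

data = [1, 2, 3, 4, 5, 6]
print(reduce_phase(shuffle(map_phase(data))))   # {'odd': 9, 'even': 12}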
MapReduce Concepts
MapReduce abstracts
Automatic
A clean abstraction for all the ‘housekeeping’
parallelization and Fault-tolerance
programmers away from the
distribution
developer
Developer can simply
MapReduce programs Can be written in any
All of Hadoop is concentrate on writing
are usually written in language using
written in Java the Map and Reduce
Java Hadoop Streaming
functions
MapReduce and Hadoop
MapReduce is logically placed on top of HDFS
[Diagram: Hadoop = MapReduce layer on top of the HDFS layer]
HDFS & MapReduce
HDFS is a distributed file system that stores data across multiple machines in a Hadoop cluster.
Characteristics:
• Data is divided into blocks and distributed across nodes in the cluster.
• Provides fault tolerance by replicating data across multiple nodes.
• Enables high-throughput data access and parallel processing.
MapReduce is a programming model and processing engine for distributed data processing.
Characteristics:
• Breaks down data processing tasks into two phases: Map and Reduce.
• The Map phase processes input data and produces key-value pairs.
• The Reduce phase aggregates and processes the results from the Map phase to produce the final output.
MapReduce and Hadoop
▪ MR works on (big) files loaded on HDFS
▪ Each node in the cluster executes the MR program in parallel, applying the map and reduce phases on the blocks it stores
▪ Output is written to HDFS
[Diagram: each cluster node runs MR on the HDFS blocks it stores]
Scalability principle:
The computation should be performed where the data is located
Hadoop in MS Azure
Hadoop & Map Reduce Example
• The Hadoop Streaming command is used to run custom scripts (like Python) in a
Hadoop MapReduce job.
Basic Hadoop Streaming Command:
hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-*.jar \
-input <input_directory_or_file> \
-output <output_directory> \
-mapper "<your_mapper_script>" \
-reducer "<your_reducer_script>"
Hadoop & Map Reduce Example
(Optional)
1. Requirements before installation – JDK from Oracle
2. Set environment variables
3. Hadoop download and set up
4. MapReduce Word Count Python File
5. For this demonstration, I will be using Hadoop version 3.2.4 and JDK 1.8.0_202 (jdk-8u202-windows-x64.exe)
Hadoop & Map Reduce Example
• Create these two folders in the C: directory when installing the JDK
• Create environment variables based on the location of the JDK and the JDK/bin folder. You can verify if
this is correct using the CMD prompt below:
This shows we have successfully installed the required Java
runtime.
Hadoop & Map Reduce Example
• Extract the Hadoop archive (this will take a few minutes) and move it to the local disk where you installed the JDK
• When you check the Hadoop folder, you will see its contents
Hadoop & Map Reduce Example
• We also need to set environment variables for Hadoop. This is done in the etc folder:
replace the bin folder and configure the XML files in the etc folder according to the configuration provided by Apache.
When done, enter the following at an administrator command prompt:
- hdfs namenode -format
  formats the namenode
- jps
  lists the running services
- start-all.cmd
  starts all the Hadoop daemons
- jps
  lists the running services again after the start
- cls
  clears the console of previous output
HDFS (Hadoop Distributed File System) Web UI
• On this local machine, it is accessible via http://127.0.0.1:9870
• It provides information about the file system, such as the status of the NameNode, storage capacity, and block distribution.
• You can use this interface to browse HDFS, check file locations, and monitor the health of the distributed storage system.
• From the running services in cmd, check for the local web URL from the Apache Hadoop Distribution namenode.
• Let’s go over to my browser to see this.
Hadoop browse files
• View Utilities > Browse files: to check whether there are uploaded files.
Datanode information
• You can check the datanode to see the node running. This will let you know if it’s a single node or a cluster.
YARN Resource Manager Web UI
• To check the cluster information and scheduler, it is accessible via http://127.0.0.1:8088
• It displays information about the cluster’s resource usage, running applications, and job statuses.
• You can monitor and manage tasks, containers, and application logs through this interface.
Word count example using Hadoop and Python
In the MapReduce word count example, we find out the frequency of each word. Here, the role of the Mapper is to emit each word as a key with a count of 1, and the role of the Reducer is to aggregate the counts of the common keys. So, everything is represented in the form of key-value pairs.
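For instance, assuming a hypothetical one-line input "big data big cloud" (not taken from the lecture material), the Mapper would emit the pairs (big, 1), (data, 1), (big, 1), (cloud, 1); after grouping by key, the Reducer would produce (big, 2), (cloud, 1), (data, 1).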
Prerequisite commands
• hdfs dfs -mkdir -p /user/hadoop/input #To Create an HDFS Directory for Input
• hdfs dfs -put C:\hadoop\mapreduce\input.txt /user/hadoop/input #To copy
input.txt from Local to HDFS. This command uploads input.txt from your local
system to HDFS at /user/hadoop/input/input.txt
• hdfs dfs -ls /user/hadoop/input # Check whether the file was uploaded
successfully
• hdfs dfs -cat /user/hadoop/input/input.txt #To read the uploaded file.
• hdfs dfs -ls /user/hadoop/output # After the job runs, _SUCCESS indicates the job completed successfully and part-00000 contains the word count results.
• Store mapper.py and reducer.py inside your working directory; let’s keep them here: “C:\hadoop\mapreduce\”
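The demonstration’s own scripts are not reproduced here, so below is a minimal sketch of what mapper.py and reducer.py could look like for a Hadoop Streaming word count, assuming whitespace-delimited text input; the scripts read from stdin and write tab-separated key-value pairs to stdout, as Hadoop Streaming expects.

#!/usr/bin/env python
# mapper.py – emit each word with a count of 1
import sys

for line in sys.stdin:
    for word in line.strip().split():
        print(f"{word}\t1")

#!/usr/bin/env python
# reducer.py – sum the counts for each word
# (Hadoop Streaming delivers the mapper output sorted by key)
import sys

current_word, current_count = None, 0
for line in sys.stdin:
    word, count = line.rstrip("\n").split("\t", 1)
    if word == current_word:
        current_count += int(count)
    else:
        if current_word is not None:
            print(f"{current_word}\t{current_count}")
        current_word, current_count = word, int(count)
if current_word is not None:
    print(f"{current_word}\t{current_count}")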
Running the Hadoop Streaming job
hadoop jar %HADOOP_HOME%/share/hadoop/tools/lib/hadoop-streaming-*.jar \
  -input /user/hadoop/input/input.txt \
  -output /user/hadoop/output \
  -mapper "python mapper.py" \
  -reducer "python reducer.py"
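Once the job finishes, the word counts can be inspected directly from HDFS, e.g. with hdfs dfs -cat /user/hadoop/output/part-00000 (the part-00000 file mentioned above).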
Hadoop Completed Job
Hadoop Completed Job
After Reducer & Final Output
After Mapper
Real-case scenarios for Hadoop & MapReduce
Fraud detection in financial systems
Mapper checks transactions for abnormal patterns (large withdrawals, multiple withdrawals,
location mismatches) while Reducer aggregates risky transactions and flags potential fraud.
Weather forecasting
Mapper processes satellite temperature and pressure readings. Reducer calculates the average values and trends for forecasts.
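To connect the fraud detection scenario back to Hadoop Streaming, here is a small illustrative Python mapper sketch; the comma-separated record layout (account_id, amount, country) and the 10,000 threshold are hypothetical assumptions, not part of the lecture material.

#!/usr/bin/env python
# fraud_mapper.py – illustrative sketch only; assumes comma-separated lines of
# account_id,amount,country (hypothetical record format)
import sys

THRESHOLD = 10000  # assumed limit for an unusually large withdrawal

for line in sys.stdin:
    try:
        account_id, amount, country = line.strip().split(",")
    except ValueError:
        continue  # skip malformed records
    if float(amount) > THRESHOLD:
        # emit the account as the key so a reducer can aggregate risky activity
        print(f"{account_id}\t{amount},{country}")

A matching reducer would then count or total the flagged transactions per account and emit the accounts whose aggregated activity looks risky.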
Seminar Activity
• Jupyter visualisations – this week
• Mock presentation – 10 mins
Support available
• Academic Quality to check in with Student Support and Guidance Manager and Head of Student Opportunities
• Academic:
• Academic Skills Session (Compulsory – Scheduled session available on timetable)
• Small Group Academic Writing Tutorials both online and in-person (currently covering: Writing Critically, Essay Writing, Report Writing, Paraphrasing and Harvard Style Referencing); 1:1 support available on request if needed
• Targeted 1:1 support for students referred for Academic Misconduct
• Skills Guides available online to support
• Wellbeing:
• 1:1 Wellbeing Appointments
• Mental Health Support (online)
• Welfare Appointments (online)
• Wellbeing Breakfasts
Any Questions ?
Thank You ☺