The HADOOP platform
Introduction
    "In pioneer days they used oxen for heavy pulling, and when one ox
    couldn't budge a log, they didn't try to grow a larger ox. We shouldn't
    be trying for bigger computers, but for more systems of computers."
       Introduction
       Before talking about Hadoop, do you know
       the prefixes?
Sign   Prefix   Factor   Example
K      Kilo     10^3     Page of text
M      Mega     10^6     Transfer speed per second
G      Giga     10^9     DVD, USB key
T      Tera     10^12    Hard disk
P      Peta     10^15
E      Exa      10^18    Facebook, Amazon
Z      Zetta    10^21    Entire internet since 2010
 Introduction
Processing such large amounts of data requires special methods. A classic
DBMS is unable to process so much information.
The data are distributed across multiple machines (up to several million
computers) in data centers:
  ➔ A special file system presents a single space that can contain gigantic
    and/or very numerous files,
  ➔ Specific databases (HBase, Cassandra, ElasticSearch).
Processing of the "map-reduce" type:
  ➔ Algorithms that are easy to write,
  ➔ Executions that are easy to parallelize.
 Introduction
Imagine 5000 computers connected together forming a
cluster.
Data center
Introduction
Each of these blade servers can look like this: 4 multi-core CPUs, 1 TB of
RAM, 24 TB of hard disks, about $5,000 (prices and technology change
constantly).
Blade server
 Connected machines
All these machines are connected to each other in order to
share storage space and computing power.
The Cloud is an example of distributed storage space: files
are stored on different machines, usually in duplicate to
prevent failure.
The execution of programs is also distributed: they are
executed on one or more machines on the network.
This whole module aims to teach application
programming on a cluster, using Hadoop tools.
Hadoop is a distributed data management and processing system.
It contains many components, including:
   ● HDFS (Hadoop Distributed File System), a file system that
     distributes data over many machines,
   ● YARN, a mechanism for scheduling programs of the MapReduce type.
We will first present HDFS then YARN/MapReduce.
HDFS is a distributed file system:
   ● Files and folders are organized in a tree (as in Unix); the files are
     stored on a large number of machines in such a way that the exact
     position of a file is invisible.
   ● Access is transparent, whatever the machines that contain the files.
   ● Files are kept in several copies, for reliability and to allow
     multiple simultaneous accesses.
   HDFS makes it possible to see all the folders and files of these
   thousands of machines as a single tree, containing petabytes (PB) of
   data, as if they were on the local hard disk.
  File organization
From the user's point of view, HDFS looks like a Unix file system: there is
a root, directories and files. Files have an owner, a group and access
rights.
Under the root /, there is:
   ●Directories for Hadoop services: /hbase, /tmp, /var
   ➔a directory for users' personal files:
     ✔/user. In this directory, there are also three system
     folders: /user/hive, /user/history and /user/spark.
   ➔a directory to deposit files to share with all users: /share
You will need to distinguish between
HDFS files and "normal" files.
    Install Hadoop 3.2.2
Step 1 : Installing Java
First you should install the JDK (Java Development Kit).
 Step 2 : Create environment variables
For Java, we create a new user variable called "JAVA_HOME".
It contains the Java directory.
For Hadoop, we create a new user variable called
"HADOOP_HOME". It contains the Hadoop directory.
For Spark, we create a new user variable called
"SPARK_HOME". It contains the Spark directory.
  Install Hadoop 3.2.2
Step 3 : Create environment paths
For Java, we create a new user path entry.
It contains the Java directory + "\bin".
For Hadoop, we create two new user path entries:
the 1st contains the Hadoop directory + "\bin",
the 2nd contains the Hadoop directory + "\sbin".
For Spark, we create two new user path entries:
the 1st contains the Spark directory + "\bin",
the 2nd contains the Spark directory + "\sbin".
Install Hadoop 3.2.2
 Step 4 : Update the Hadoop xml files
In the directory "Hadoop\hadoop-3.2.2\etc\Hadoop", open the file
"core-site.xml" and change the temporary directory setting according to
your Hadoop directory.
Install Hadoop 3.2.2
 Step 4 : Update the Hadoop xml files
In the directory "Hadoop\hadoop-3.2.2\etc\Hadoop", open the file
"hdfs-site.xml" and change the namenode and datanode directories
according to your Hadoop directory.
Install Hadoop 3.2.2
 Step 5 : Update the Hadoop-env command file
In the directory "Hadoop\hadoop-3.2.2\etc\Hadoop", open the file
"Hadoop-env" and add the following commands at the end:
        set HADOOP_PREFIX=%HADOOP_HOME%
        set HADOOP_CONF_DIR=%HADOOP_PREFIX%\etc\hadoop
        set YARN_CONF_DIR=%HADOOP_CONF_DIR%
        set PATH=%PATH%;%HADOOP_PREFIX%\bin
Install Hadoop 3.2.2
 Step 6 : start hadoop
In the Command Prompt, execute the command "spark-shell", then the
command "for %I in (.) do echo %~sI".
This last command must be executed in the Java directory to display the
short name of your installed JDK.
Use the short name to update the "Hadoop-env" file.
Install Hadoop 3.2.2
 Step 7 : start hadoop
In a new Command Prompt, execute the command “start-dfs” to
start the Hadoop services.
Install Hadoop 3.2.2
Step 8 : Open the NameNode web interface at http://localhost:9870/
    How does HDFS work?
         As in many file systems, each HDFS file is split into fixed-size blocks.
An HDFS block is 256 MB. Depending on its size, a file will need a certain
number of blocks; on HDFS, the last block of a file holds only the remaining bytes.
         The blocks of the same file are not necessarily all on the same
machine. They are each copied to different machines in order to be
accessed simultaneously by several processes. By default, each block is
copied to 3 different machines (this is configurable).
         This replication of blocks on several machines also makes it
possible to guard against breakdowns. Each file is therefore found in several
copies and in different places.
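To make the arithmetic concrete, here is a small Python sketch (purely illustrative, not part of HDFS) that computes how a file of a given size is cut into 256 MB blocks:

    BLOCK_SIZE = 256 * 1024 * 1024        # 256 MB, the block size used in this course

    def hdfs_blocks(file_size):
        # return the list of block sizes (in bytes) for a file of file_size bytes
        full, remainder = divmod(file_size, BLOCK_SIZE)
        return [BLOCK_SIZE] * full + ([remainder] if remainder else [])

    # a 600 MB file needs 3 blocks: two full blocks and a last block of 88 MB
    print([b // (1024 * 1024) for b in hdfs_blocks(600 * 1024 * 1024)])   # [256, 256, 88]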
    Organization of machines for HDFS
       An HDFS cluster is made up of machines playing different, mutually
  exclusive roles:
• One of the machines is the HDFS master, called the namenode. This
  machine contains all the file names and blocks, like a big phone book.
• Another machine is the secondary namenode, a kind of backup
  namenode, which saves backups of the directory at regular intervals.
• Some machines are clients. These are access points to the cluster to
  connect to and work with.
• All other machines are datanodes. They store blocks of file content.
  Diagram of HDFS nodes
      The datanodes contain blocks (A, B, C...); the namenode knows
where the files are: which blocks and on which datanodes.
  More explanation
Datanodes contain blocks. The same blocks are duplicated (replicated) on
different datanodes, generally 3 times. This ensures:
    • data reliability in the event of a datanode failure,
    • parallel access by different processes to the same data.
The namenode knows both:
    • which blocks each file is made of,
    • on which datanodes the desired blocks are located.
This is called metadata.
Major drawback: failure of the namenode = death of HDFS. For this reason
there is the secondary namenode: it archives the metadata, for example
every hour.
Java API for HDFS
Hadoop offers a complete Java API for accessing HDFS files. It is
based on two main classes:
   • FileSystem represents the file tree (file system). This class allows
     copying local files to HDFS (and vice versa), renaming, creating
     and deleting files and folders.
   • FileStatus manages the information of a file or folder:
             its size with getLen(),
             its nature with isDirectory() and isFile().
These classes need to know the configuration of the HDFS cluster, via the
Configuration class. Full file names (paths) are represented by the Path
class.
    Java API example
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.Path;

Configuration conf = new Configuration();
FileSystem fs = FileSystem.get(conf);
// full name of the HDFS file /user/userName/hello.txt
Path fullName = new Path("/user/userName", "hello.txt");
// get its size in bytes, then rename it
FileStatus infos = fs.getFileStatus(fullName);
System.out.println(Long.toString(infos.getLen()) + " bytes");
fs.rename(fullName, new Path("/user/etudiant1", "g_mor.txt"));
Displaying the list of blocks in a file
import ...;

public class HDFSinfo {
    public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path fullName = new Path("test.txt");
        FileStatus infos = fs.getFileStatus(fullName);
        // retrieve the location of every block of the file
        BlockLocation[] blocks =
            fs.getFileBlockLocations(infos, 0, infos.getLen());
        for (BlockLocation blocloc : blocks)
            System.out.println(blocloc.toString());
    }
}
Reading an HDFS file
import java.io.*;
import ...;

public class HDFSread {
    public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path fullName = new Path("apitest.txt");
        // open the HDFS file and print its first line
        FSDataInputStream inStream = fs.open(fullName);
        InputStreamReader isr = new InputStreamReader(inStream);
        BufferedReader br = new BufferedReader(isr);
        String line = br.readLine();
        System.out.println(line);
        inStream.close();
        fs.close();
    }
}
Creating an HDFS File
import ...;

public class HDFSwrite {
    public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path fullName = new Path("apitest.txt");
        // create the file only if it does not exist yet
        if (! fs.exists(fullName)) {
            FSDataOutputStream outStream = fs.create(fullName);
            outStream.writeUTF("Hello world!");
            outStream.close();
        }
        fs.close();
    }
}
MapReduce algorithms
 Principles
  We want to collect synthetic information from a data set.
  Examples on a list of items with a price:
        Calculate the total amount of sales of an item,
        Find the most expensive item,
        Calculate the average price of items.
  For each of these examples, the problem can be written as the
  composition of two functions:
         map: extraction/calculation of information from each tuple,
         reduce: grouping of this information.
MapReduce algorithms
    Example
Calculating the maximum, average or total price can be
written using algorithms of the type:
for each tuple:
       value = FunctionM(current tuple)
return FunctionR(values encountered)
MapReduce algorithms
 Example
  FunctionM is a mapping function: it calculates the value that
             interests us from a tuple,
  FunctionR is a grouping function (aggregation): maximum, sum,
             count, average, distinct...
  For example, FunctionM retrieves the price of a car and
  FunctionR calculates the max of a set of values:
  all_prices = list()
  for each car:
          all_prices.add( getPrice(current car) )
  return max(all_prices)
   MapReduce algorithms
      Example in Python
  from functools import reduce

  data = [
   {'id':1, 'mark':'Renault', 'model':'Clio', 'price':4200},
   {'id':2, 'mark':'Fiat',    'model':'500',  'price':8840},
   {'id':3, 'mark':'Peugeot', 'model':'206',  'price':4300},
   {'id':4, 'mark':'Peugeot', 'model':'306',  'price':6140} ]

  # returns the price of the car passed as a parameter
  def getPrice(car): return car['price']

  # show the list of car prices
  print(list(map(getPrice, data)))

  # display the highest price
  print(reduce(max, map(getPrice, data)))
   MapReduce algorithms
       Example in Python
map(function, list) applies the function to each element of the list. It
performs the "for" loop of the previous algorithm and returns the list of
car prices; the result contains as many values as the input list.
reduce(function, list) aggregates the values of the list with the function
and returns the final result.
    MapReduce algorithms
        Example in Python
These two functions constitute a "map-reduce" pair, and the goal of this
course is to understand and program them. The key point is the
possibility of parallelizing these functions in order to compute much
faster on a machine with several cores or on a set of machines linked
together.
    MapReduce algorithms
    Parallelization of Map
The map function is parallelizable by nature, because the
calculations are independent.
Example, for 4 elements to process:
value1 = FunctionM(element1)
value2 = FunctionM(element2)
value3 = FunctionM(element3)
value4 = FunctionM(element4)
    MapReduce algorithms
    Parallelization of Map
The four calculations can be done simultaneously, for
example on 4 different machines, provided that the data is
copied there.
Note: the mapped function must be a pure function of its
parameter, it must have no side effects such as modifying a
global variable or remembering its previous values.
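As an illustration in Python (a sketch using the standard multiprocessing module, not Hadoop itself), the mapped function can be run on several cores at once, provided it is a pure function defined at module level:

    from multiprocessing import Pool

    # pure function: its result depends only on its argument (no side effects)
    def getPrice(car):
        return car['price']

    if __name__ == '__main__':
        data = [{'id': 1, 'price': 4200}, {'id': 2, 'price': 8840},
                {'id': 3, 'price': 4300}, {'id': 4, 'price': 6140}]
        with Pool(4) as pool:              # 4 worker processes, like 4 machines
            prices = pool.map(getPrice, data)
        print(prices)                      # [4200, 8840, 4300, 6140]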
    MapReduce algorithms
    Parallelization of Reduce
The reduce function can only be partially parallelized, in a hierarchical
(tree) form, for example:
inter1&2 = FunctionR(value1, value2)
inter3&4 = FunctionR(value3, value4)
result   = FunctionR(inter1&2, inter3&4)
    Only the first two calculations can be done simultaneously; the third
  must wait. If there were more values, we would proceed as follows:
    MapReduce algorithms
    Parallelization of Reduce
1. parallel computation of FunctionR on all pairs of values produced
   by the map,
2. parallel computation of FunctionR on all pairs of intermediate values
   from the previous phase,
3. and so on, until there is only one value left.
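The following Python sketch (illustrative only, and assuming FunctionR is associative) performs this hierarchical reduction; in a real cluster, the pairs of each level would be computed on different machines:

    def tree_reduce(functionR, values):
        # reduce the values pairwise, level by level
        values = list(values)
        while len(values) > 1:
            next_level = []
            for i in range(0, len(values) - 1, 2):
                next_level.append(functionR(values[i], values[i + 1]))  # pairs can run in parallel
            if len(values) % 2 == 1:
                next_level.append(values[-1])                           # odd value passes to the next level
            values = next_level
        return values[0]

    print(tree_reduce(max, [4200, 8840, 4300, 6140]))   # 8840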
MapReduce algorithms
Schema: Data → Map → Reduce
    YARN and MapReduce
    What is YARN?
       YARN (Yet Another Resource Negotiator) is a mechanism
in Hadoop for managing jobs on a cluster of machines.
       YARN allows users to launch MapReduce jobs on data present in
HDFS, to follow (monitor) their progress and to retrieve the messages
(logs) displayed by the programs.
       If needed, YARN can move a process from one machine to another
in the event of failure or of progress deemed too slow. In practice, YARN
is transparent to the user: we launch the execution of a MapReduce
program and YARN ensures that it is executed as quickly as possible.
     YARN and MapReduce
     What is MapReduce?
      MapReduce is a Java environment for writing programs
for YARN. Java is not the simplest language for this: there are packages
to import, class paths to provide...
     There are several points to know:
 • Principles of a MapReduce job in Hadoop,
 • Map function programming,
 • Programming the Reduce function,
 • Programming a MapReduce job that calls the
   two functions,
 • Launching the job and retrieving the results.
    YARN and MapReduce
    Key-value pairs
        It is actually a bit more complicated than initially explained. The
data exchanged between Map and Reduce, and indeed throughout the job,
are (key, value) pairs:
       a key: any type of data (integer, text...),
       a value: any type of data.
 Everything is represented like this. For example:
   a text file is a set of (line number, line) pairs,
   a weather file is a set of (date and time, temperature) pairs.
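For instance, a small Python sketch (not the Hadoop API) of a text file seen as (key, value) pairs, where the key is the line number:

    # the string stands in for the content of a file
    text = "first line\nsecond line\nthird line"
    pairs = list(enumerate(text.splitlines()))
    print(pairs)   # [(0, 'first line'), (1, 'second line'), (2, 'third line')]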
    YARN and MapReduce
    Map
        The Map function receives a pair as input and can
produce any number of pairs as output: none, one, or many, at
will. The types of inputs and outputs are as desired.
This very unconstrained specification allows a lot of things. In general,
the pairs that Map receives are made up as follows:
  the value, of type text, is one of the lines (or one of the tuples) of
   the file to process,
  the key, of type integer, is the position of this line in the file
     (called the offset).
    YARN and MapReduce
    Map
       It should be understood that YARN launches an instance
of Map for each row of each data file to be processed. Each
instance processes the row assigned to it and produces output
pairs.
  YARN and MapReduce
  Map schema
The MAP tasks each process a pair and produce
0..n pairs. The same keys and/or values may be
produced.
    YARN and MapReduce
    Reduce
        The Reduce function receives a list of pairs as input. These
are the pairs produced by instances of Map. Reduce can produce
any number of output pairs, but most of the time it's just one.
However, the crucial point is that the input pairs processed by an
instance of Reduce all have the same key.
        YARN launches a Reduce instance for each different key that
the Map instances have produced, and provides them with only
the pairs with the same key. This is what makes it possible
to aggregate the values. In general, Reduce must do some processing on
the values, such as adding them all together or determining the largest
of them. When we design a MapReduce process, we have to think about
the keys and values necessary for it to work.
YARN and MapReduce
Reduce schema
Reduce tasks receive a list of pairs that all
have the same key and produce a pair that
contains the expected result. This output
pair can have the same key as the input
pair.
        YARN and MapReduce
        Example
A telephone company wants to calculate the total duration of a subscriber's
telephone calls from a CSV file containing all calls from all subscribers
(subscriber number, called number, date, call duration).
This problem is handled as follows:
    1. In input, we have the calls file (1 call per line)
    2. YARN launches one instance of Map function per call
    3. Each instance of Map receives a pair (offset, line) and produces a
       pair (subscriber number, duration) or nothing if it is not the
       subscriber that we want. NB: the offset is useless here.
    4. YARN sends all pairs to a single instance of Reduce
       (because there is only one different key)
    5. The Reduce instance adds all the values of the pairs it receives and
       produces a single output pair (subscriber number, total duration).
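Below is a plain Python sketch of this example (illustrative data and names, not the Hadoop API): the map step keeps only the calls of the wanted subscriber, and the reduce step adds their durations:

    from functools import reduce

    # (subscriber number, called number, date, call duration)
    calls = [
        ('0555', '0666', '2023-01-02', 120),
        ('0777', '0888', '2023-01-02', 30),
        ('0555', '0999', '2023-01-03', 45),
    ]
    subscriber = '0555'

    # Map: one (key, value) pair per call of the wanted subscriber, nothing otherwise
    pairs = [(c[0], c[3]) for c in calls if c[0] == subscriber]

    # Reduce: only one key here, so a single instance adds all the durations
    total = reduce(lambda a, b: a + b, (v for _, v in pairs))
    print((subscriber, total))   # ('0555', 165)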
        YARN and MapReduce
        MapReduce job phases
A MapReduce job consists of several phases:
1. Pre-processing of the input data, e.g. decompression of the files
2. Split: separation of the data into separately processable blocks, put in the
   form of (key, value) pairs, e.g. lines or tuples
3. Map: application of the map function to every (key, value) pair formed from
   the input data; this produces other (key, value) pairs as output
4. Shuffle & Sort: redistribution of the data so that the pairs produced by Map
   that have the same key end up on the same machine
5. Reduce: aggregation of the pairs having the same key to obtain the
   final result.
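The following Python sketch (a toy word count, not the Hadoop API) walks through these phases on a single machine:

    from collections import defaultdict

    data = "big data big cluster data data"

    # Split: form (key, value) pairs, here (offset, word)
    pairs = list(enumerate(data.split()))

    # Map: each pair produces (word, 1)
    mapped = [(word, 1) for _, word in pairs]

    # Shuffle & Sort: group the values by key
    groups = defaultdict(list)
    for key, value in mapped:
        groups[key].append(value)

    # Reduce: aggregate the values of each key
    result = {key: sum(values) for key, values in groups.items()}
    print(result)   # {'big': 2, 'data': 3, 'cluster': 1}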
YARN and MapReduce
Schema: Data → (Split, Map, Shuffle & Sort, Reduce) → Result
        Workout 1
        Some commands
hadoop version                       # display the Hadoop version
hadoop fs -mkdir /test               # create a new directory named "test"
hadoop fs -ls /                      # list the content of the root directory
hadoop fs -copyFromLocal <localsrc> <hdfs destination>
                                     # copy a file from the local file system to HDFS
hadoop fs -put <localsrc> <dest>     # copy a local file to the Hadoop file system
hadoop fs -get <src> <localdest>     # copy a file or directory from the Hadoop
                                     # file system to the local file system
hadoop fs -cat /path_to_file_in_hdfs # display the content of an HDFS file
Workout 2
Map reduce
[Diagram: a namenode and p datanodes; each datanode stores blocks
(Block 1..m) and each block contains tuples (Tuple 1..n).
Parameters: Nb_tuples = n, Nb_blocks = m, Nb_datanodes = p, Block_size]
        Workout 2
        Student score (example)
 Tuple architecture
{'st_id': 89, 'sp': 'GLSD', 'math': 3.09, 'phy': 16.89, 'sci': 14.26, 'phyl': 12.45, 'geog': 19.15, 'eng': 14.1}
Block architecture
           {'id_bk': 4, 'data': [{'st_id': 16, 'sp': 'STW', 'math': 10.42, 'phy':
           19.07, 'sci': 18.46, 'phyl': 5.47, 'geog': 10.92, 'eng': 17.65}, {'st_id':
           17, 'sp': 'STW', 'math': 12.88, 'phy': 1.15, 'sci': 12.46, 'phyl': 17.02,
           'geog': 17.39, 'eng': 17.22}, {'st_id': 18, 'sp': 'STW', 'math': 6.33,
           'phy': 14.58, 'sci': 19.39, 'phyl': 17.84, 'geog': 0.02, 'eng': 14.65},
           {'st_id': 19, 'sp': 'ST', 'math': 7.81, 'phy': 4.39, 'sci': 9.6, 'phyl': 13.19,
           'geog': 18.34, 'eng': 14.88}]}
            Workout 2
            Student score (example)
   Data node architecture
[{'id_bk': 0, 'data': [{'st_id': 0, 'sp': 'STW', 'math': 18.23, 'phy': 5.32, 'sci': 17.34, 'phyl': 4.61, 'geog': 19.37, 'eng': 1.8}, {'st_id': 1,
'sp': 'ST', 'math': 4.53, 'phy': 13.81, 'sci': 12.32, 'phyl': 4.95, 'geog': 0.31, 'eng': 17.78}, {'st_id': 2, 'sp': 'MATH', 'math': 6.65, 'phy':
14.35, 'sci': 5.35, 'phyl': 0.9, 'geog': 15.71, 'eng': 8.39}, {'st_id': 3, 'sp': 'ST', 'math': 17.9, 'phy': 17.06, 'sci': 5.42, 'phyl': 1.4, 'geog':
7.23, 'eng': 18.99}]}, {'id_bk': 1, 'data': [{'st_id': 4, 'sp': 'MATH', 'math': 15.0, 'phy': 2.47, 'sci': 19.41, 'phyl': 16.03, 'geog': 0.79,
'eng': 6.61}, {'st_id': 5, 'sp': 'ST', 'math': 0.6, 'phy': 9.12, 'sci': 9.59, 'phyl': 6.29, 'geog': 11.94, 'eng': 4.59}, {'st_id': 6, 'sp': 'STW',
'math': 15.37, 'phy': 12.88, 'sci': 17.88, 'phyl': 7.88, 'geog': 12.73, 'eng': 14.11}, {'st_id': 7, 'sp': 'MATH', 'math': 14.83, 'phy':
9.07, 'sci': 9.59, 'phyl': 15.93, 'geog': 4.28, 'eng': 14.38}]}, {'id_bk': 2, 'data': [{'st_id': 8, 'sp': 'STW', 'math': 2.16, 'phy': 9.32,
'sci': 15.19, 'phyl': 13.02, 'geog': 5.38, 'eng': 19.62}, {'st_id': 9, 'sp': 'MATH', 'math': 4.69, 'phy': 4.85, 'sci': 12.8, 'phyl': 18.44,
'geog': 1.99, 'eng': 3.64}, {'st_id': 10, 'sp': 'GLSD', 'math': 9.84, 'phy': 9.0, 'sci': 3.49, 'phyl': 9.05, 'geog': 9.78, 'eng': 16.54},
{'st_id': 11, 'sp': 'GLSD', 'math': 11.52, 'phy': 2.56, 'sci': 16.79, 'phyl': 12.35, 'geog': 15.25, 'eng': 9.84}]}, {'id_bk': 2, 'data':
[{'st_id': 8, 'sp': 'STW', 'math': 2.16, 'phy': 9.32, 'sci': 15.19, 'phyl': 13.02, 'geog': 5.38, 'eng': 19.62}, {'st_id': 9, 'sp': 'MATH',
'math': 4.69, 'phy': 4.85, 'sci': 12.8, 'phyl': 18.44, 'geog': 1.99, 'eng': 3.64}, {'st_id': 10, 'sp': 'GLSD', 'math': 9.84, 'phy': 9.0, 'sci':
3.49, 'phyl': 9.05, 'geog': 9.78, 'eng': 16.54}, {'st_id': 11, 'sp': 'GLSD', 'math': 11.52, 'phy': 2.56, 'sci': 16.79, 'phyl': 12.35, 'geog':
15.25, 'eng': 9.84}]}, {'id_bk': 3, 'data': [{'st_id': 12, 'sp': 'ST', 'math': 13.3, 'phy': 8.36, 'sci': 11.37, 'phyl': 14.56, 'geog': 7.35, 'eng':
11.54}, {'st_id': 13, 'sp': 'GLSD', 'math': 3.62, 'phy': 8.83, 'sci': 4.58, 'phyl': 3.66, 'geog': 4.38, 'eng': 4.07}, {'st_id': 14, 'sp': 'GLSD',
'math': 12.52, 'phy': 14.36, 'sci': 6.33, 'phyl': 0.19, 'geog': 12.93, 'eng': 10.98}, {'st_id': 15, 'sp': 'MATH', 'math': 1.8, 'phy': 18.2,
'sci': 5.26, 'phyl': 19.19, 'geog': 14.35, 'eng': 18.45}]}, {'id_bk': 3, 'data': [{'st_id': 12, 'sp': 'ST', 'math': 13.3, 'phy': 8.36, 'sci': 11.37,
'phyl': 14.56, 'geog': 7.35, 'eng': 11.54}, {'st_id': 13, 'sp': 'GLSD', 'math': 3.62, 'phy': 8.83, 'sci': 4.58, 'phyl': 3.66, 'geog': 4.38,
'eng': 4.07}, {'st_id': 14, 'sp': 'GLSD', 'math': 12.52, 'phy': 14.36, 'sci': 6.33, 'phyl': 0.19, 'geog': 12.93, 'eng': 10.98}, {'st_id': 15,
'sp': 'MATH', 'math': 1.8, 'phy': 18.2, 'sci': 5.26, 'phyl': 19.19, 'geog': 14.35, 'eng': 18.45}]}]
             Workout 2
             Student score (example)
Dataset architecture
 [[{'id_bk': 0, 'data': [{'st_id': 0, 'sp': 'STW', 'math': 18.23, 'phy': 5.32, 'sci': 17.34, 'phyl': 4.61, 'geog': 19.37, 'eng': 1.8}, {'st_id': 1, 'sp': 'ST', 'math': 4.53, 'phy': 13.81, 'sci': 12.32, 'phyl': 4.95, 'geog': 0.31, 'eng':
 17.78}, {'st_id': 2, 'sp': 'MATH', 'math': 6.65, 'phy': 14.35, 'sci': 5.35, 'phyl': 0.9, 'geog': 15.71, 'eng': 8.39}, {'st_id': 3, 'sp': 'ST', 'math': 17.9, 'phy': 17.06, 'sci': 5.42, 'phyl': 1.4, 'geog': 7.23, 'eng': 18.99}]},
 {'id_bk': 1, 'data': [{'st_id': 4, 'sp': 'MATH', 'math': 15.0, 'phy': 2.47, 'sci': 19.41, 'phyl': 16.03, 'geog': 0.79, 'eng': 6.61}, {'st_id': 5, 'sp': 'ST', 'math': 0.6, 'phy': 9.12, 'sci': 9.59, 'phyl': 6.29, 'geog': 11.94, 'eng':
 4.59}, {'st_id': 6, 'sp': 'STW', 'math': 15.37, 'phy': 12.88, 'sci': 17.88, 'phyl': 7.88, 'geog': 12.73, 'eng': 14.11}, {'st_id': 7, 'sp': 'MATH', 'math': 14.83, 'phy': 9.07, 'sci': 9.59, 'phyl': 15.93, 'geog': 4.28, 'eng':
 14.38}]}, {'id_bk': 2, 'data': [{'st_id': 8, 'sp': 'STW', 'math': 2.16, 'phy': 9.32, 'sci': 15.19, 'phyl': 13.02, 'geog': 5.38, 'eng': 19.62}, {'st_id': 9, 'sp': 'MATH', 'math': 4.69, 'phy': 4.85, 'sci': 12.8, 'phyl': 18.44, 'geog':
 1.99, 'eng': 3.64}, {'st_id': 10, 'sp': 'GLSD', 'math': 9.84, 'phy': 9.0, 'sci': 3.49, 'phyl': 9.05, 'geog': 9.78, 'eng': 16.54}, {'st_id': 11, 'sp': 'GLSD', 'math': 11.52, 'phy': 2.56, 'sci': 16.79, 'phyl': 12.35, 'geog': 15.25,
 'eng': 9.84}]}, {'id_bk': 2, 'data': [{'st_id': 8, 'sp': 'STW', 'math': 2.16, 'phy': 9.32, 'sci': 15.19, 'phyl': 13.02, 'geog': 5.38, 'eng': 19.62}, {'st_id': 9, 'sp': 'MATH', 'math': 4.69, 'phy': 4.85, 'sci': 12.8, 'phyl': 18.44,
 'geog': 1.99, 'eng': 3.64}, {'st_id': 10, 'sp': 'GLSD', 'math': 9.84, 'phy': 9.0, 'sci': 3.49, 'phyl': 9.05, 'geog': 9.78, 'eng': 16.54}, {'st_id': 11, 'sp': 'GLSD', 'math': 11.52, 'phy': 2.56, 'sci': 16.79, 'phyl': 12.35, 'geog':
 15.25, 'eng': 9.84}]}, {'id_bk': 3, 'data': [{'st_id': 12, 'sp': 'ST', 'math': 13.3, 'phy': 8.36, 'sci': 11.37, 'phyl': 14.56, 'geog': 7.35, 'eng': 11.54}, {'st_id': 13, 'sp': 'GLSD', 'math': 3.62, 'phy': 8.83, 'sci': 4.58, 'phyl':
 3.66, 'geog': 4.38, 'eng': 4.07}, {'st_id': 14, 'sp': 'GLSD', 'math': 12.52, 'phy': 14.36, 'sci': 6.33, 'phyl': 0.19, 'geog': 12.93, 'eng': 10.98}, {'st_id': 15, 'sp': 'MATH', 'math': 1.8, 'phy': 18.2, 'sci': 5.26, 'phyl': 19.19,
 'geog': 14.35, 'eng': 18.45}]}, {'id_bk': 3, 'data': [{'st_id': 12, 'sp': 'ST', 'math': 13.3, 'phy': 8.36, 'sci': 11.37, 'phyl': 14.56, 'geog': 7.35, 'eng': 11.54}, {'st_id': 13, 'sp': 'GLSD', 'math': 3.62, 'phy': 8.83, 'sci': 4.58,
 'phyl': 3.66, 'geog': 4.38, 'eng': 4.07}, {'st_id': 14, 'sp': 'GLSD', 'math': 12.52, 'phy': 14.36, 'sci': 6.33, 'phyl': 0.19, 'geog': 12.93, 'eng': 10.98}, {'st_id': 15, 'sp': 'MATH', 'math': 1.8, 'phy': 18.2, 'sci': 5.26, 'phyl':
 19.19, 'geog': 14.35, 'eng': 18.45}]}], [{'id_bk': 1, 'data': [{'st_id': 4, 'sp': 'MATH', 'math': 15.0, 'phy': 2.47, 'sci': 19.41, 'phyl': 16.03, 'geog': 0.79, 'eng': 6.61}, {'st_id': 5, 'sp': 'ST', 'math': 0.6, 'phy': 9.12,
 'sci': 9.59, 'phyl': 6.29, 'geog': 11.94, 'eng': 4.59}, {'st_id': 6, 'sp': 'STW', 'math': 15.37, 'phy': 12.88, 'sci': 17.88, 'phyl': 7.88, 'geog': 12.73, 'eng': 14.11}, {'st_id': 7, 'sp': 'MATH', 'math': 14.83, 'phy': 9.07,
 'sci': 9.59, 'phyl': 15.93, 'geog': 4.28, 'eng': 14.38}]}, {'id_bk': 3, 'data': [{'st_id': 12, 'sp': 'ST', 'math': 13.3, 'phy': 8.36, 'sci': 11.37, 'phyl': 14.56, 'geog': 7.35, 'eng': 11.54}, {'st_id': 13, 'sp': 'GLSD', 'math':
 3.62, 'phy': 8.83, 'sci': 4.58, 'phyl': 3.66, 'geog': 4.38, 'eng': 4.07}, {'st_id': 14, 'sp': 'GLSD', 'math': 12.52, 'phy': 14.36, 'sci': 6.33, 'phyl': 0.19, 'geog': 12.93, 'eng': 10.98}, {'st_id': 15, 'sp': 'MATH', 'math':
 1.8, 'phy': 18.2, 'sci': 5.26, 'phyl': 19.19, 'geog': 14.35, 'eng': 18.45}]}, {'id_bk': 4, 'data': [{'st_id': 16, 'sp': 'STW', 'math': 10.42, 'phy': 19.07, 'sci': 18.46, 'phyl': 5.47, 'geog': 10.92, 'eng': 17.65}, {'st_id': 17,
 'sp': 'STW', 'math': 12.88, 'phy': 1.15, 'sci': 12.46, 'phyl': 17.02, 'geog': 17.39, 'eng': 17.22}, {'st_id': 18, 'sp': 'STW', 'math': 6.33, 'phy': 14.58, 'sci': 19.39, 'phyl': 17.84, 'geog': 0.02, 'eng': 14.65}, {'st_id':
 19, 'sp': 'ST', 'math': 7.81, 'phy': 4.39, 'sci': 9.6, 'phyl': 13.19, 'geog': 18.34, 'eng': 14.88}]}, {'id_bk': 4, 'data': [{'st_id': 16, 'sp': 'STW', 'math': 10.42, 'phy': 19.07, 'sci': 18.46, 'phyl': 5.47, 'geog': 10.92,
 'eng': 17.65}, {'st_id': 17, 'sp': 'STW', 'math': 12.88, 'phy': 1.15, 'sci': 12.46, 'phyl': 17.02, 'geog': 17.39, 'eng': 17.22}, {'st_id': 18, 'sp': 'STW', 'math': 6.33, 'phy': 14.58, 'sci': 19.39, 'phyl': 17.84, 'geog':
 0.02, 'eng': 14.65}, {'st_id': 19, 'sp': 'ST', 'math': 7.81, 'phy': 4.39, 'sci': 9.6, 'phyl': 13.19, 'geog': 18.34, 'eng': 14.88}]}], [{'id_bk': 0, 'data': [{'st_id': 0, 'sp': 'STW', 'math': 18.23, 'phy': 5.32, 'sci': 17.34, 'phyl':
 4.61, 'geog': 19.37, 'eng': 1.8}, {'st_id': 1, 'sp': 'ST', 'math': 4.53, 'phy': 13.81, 'sci': 12.32, 'phyl': 4.95, 'geog': 0.31, 'eng': 17.78}, {'st_id': 2, 'sp': 'MATH', 'math': 6.65, 'phy': 14.35, 'sci': 5.35, 'phyl': 0.9, 'geog':
 15.71, 'eng': 8.39}, {'st_id': 3, 'sp': 'ST', 'math': 17.9, 'phy': 17.06, 'sci': 5.42, 'phyl': 1.4, 'geog': 7.23, 'eng': 18.99}]}, {'id_bk': 0, 'data': [{'st_id': 0, 'sp': 'STW', 'math': 18.23, 'phy': 5.32, 'sci': 17.34, 'phyl': 4.61,
 'geog': 19.37, 'eng': 1.8}, {'st_id': 1, 'sp': 'ST', 'math': 4.53, 'phy': 13.81, 'sci': 12.32, 'phyl': 4.95, 'geog': 0.31, 'eng': 17.78}, {'st_id': 2, 'sp': 'MATH', 'math': 6.65, 'phy': 14.35, 'sci': 5.35, 'phyl': 0.9, 'geog': 15.71,
 'eng': 8.39}, {'st_id': 3, 'sp': 'ST', 'math': 17.9, 'phy': 17.06, 'sci': 5.42, 'phyl': 1.4, 'geog': 7.23, 'eng': 18.99}]}, {'id_bk': 1, 'data': [{'st_id': 4, 'sp': 'MATH', 'math': 15.0, 'phy': 2.47, 'sci': 19.41, 'phyl': 16.03, 'geog':
 0.79, 'eng': 6.61}, {'st_id': 5, 'sp': 'ST', 'math': 0.6, 'phy': 9.12, 'sci': 9.59, 'phyl': 6.29, 'geog': 11.94, 'eng': 4.59}, {'st_id': 6, 'sp': 'STW', 'math': 15.37, 'phy': 12.88, 'sci': 17.88, 'phyl': 7.88, 'geog': 12.73, 'eng':
 14.11}, {'st_id': 7, 'sp': 'MATH', 'math': 14.83, 'phy': 9.07, 'sci': 9.59, 'phyl': 15.93, 'geog': 4.28, 'eng': 14.38}]}, {'id_bk': 2, 'data': [{'st_id': 8, 'sp': 'STW', 'math': 2.16, 'phy': 9.32, 'sci': 15.19, 'phyl': 13.02, 'geog':
 5.38, 'eng': 19.62}, {'st_id': 9, 'sp': 'MATH', 'math': 4.69, 'phy': 4.85, 'sci': 12.8, 'phyl': 18.44, 'geog': 1.99, 'eng': 3.64}, {'st_id': 10, 'sp': 'GLSD', 'math': 9.84, 'phy': 9.0, 'sci': 3.49, 'phyl': 9.05, 'geog': 9.78, 'eng':
 16.54}, {'st_id': 11, 'sp': 'GLSD', 'math': 11.52, 'phy': 2.56, 'sci': 16.79, 'phyl': 12.35, 'geog': 15.25, 'eng': 9.84}]}, {'id_bk': 4, 'data': [{'st_id': 16, 'sp': 'STW', 'math': 10.42, 'phy': 19.07, 'sci': 18.46, 'phyl': 5.47,
 'geog': 10.92, 'eng': 17.65}, {'st_id': 17, 'sp': 'STW', 'math': 12.88, 'phy': 1.15, 'sci': 12.46, 'phyl': 17.02, 'geog': 17.39, 'eng': 17.22}, {'st_id': 18, 'sp': 'STW', 'math': 6.33, 'phy': 14.58, 'sci': 19.39, 'phyl': 17.84,
 'geog': 0.02, 'eng': 14.65}, {'st_id': 19, 'sp': 'ST', 'math': 7.81, 'phy': 4.39, 'sci': 9.6, 'phyl': 13.19, 'geog': 18.34, 'eng': 14.88}]}]]
Workout 2
System architecture: generate a random dataset
(parameters: Nb_tuples = n, Nb_blocks = m, Nb_datanodes = p, Block_size, Specialities, nb_copies)

import random
import numpy as np              # used by the following slides (findBlock, getAverage)
from functools import reduce    # used by the Reduce function of the workout

specialities=['ST','MATH','GLSD','STW']

def randomDataset(nb_tuples=2000, block_size=15, nb_dataNode=7, nb_copies=3,
                  specialities=specialities):
  # create the empty datanodes
  data_nodes=[]
  for i in range(nb_dataNode):
    data_nodes.append([])
  # calculate the number of blocks
  nb_blocks=int(nb_tuples/block_size)
  if (nb_tuples%block_size!=0):
      nb_blocks=nb_blocks+1
  # create the blocks and fill them with random student tuples
  for i in range(nb_blocks):
    block={}
    block['id_bk']=i
    block['data']=[]
    for j in range(block_size):
        if (i*block_size+j==nb_tuples):
            break
        sp=random.randint(0, len(specialities)-1)
        tuple={}
        tuple['st_id']=i*block_size+j
        tuple['sp']=specialities[sp]
        tuple['math']= round(random.uniform(0.,20.),2)
        tuple['phy']= round(random.uniform(0.,20.),2)
        tuple['sci']= round(random.uniform(0.,20.),2)
        tuple['phyl']= round(random.uniform(0.,20.),2)
        tuple['geog']=round(random.uniform(0.,20.),2)
        tuple['eng']= round(random.uniform(0.,20.),2)
        block['data'].append(tuple)
    # save nb_copies copies of the block on randomly chosen datanodes
    for k in range(nb_copies):
      dns=[]
      dn=random.randint(0, len(data_nodes)-1)
      while dn in dns:
         dn=random.randint(0, len(data_nodes)-1)
      data_nodes[dn].append(block)
      dns.append(dn)
  return [data_nodes,nb_blocks]
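For instance (illustrative parameters, not imposed by the workout), a small dataset of the kind shown on the previous slides can be generated with:

    # hypothetical small call so that the output stays readable
    dataset = randomDataset(nb_tuples=20, block_size=4, nb_dataNode=5, nb_copies=2)
    data_nodes, nb_blocks = dataset
    print(nb_blocks)              # 5 blocks of 4 tuples each
    print(len(data_nodes[0]))     # number of block copies stored on datanode 0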
Workout 2
System architecture – walkthrough of randomDataset (specialities=['ST','MATH','GLSD','STW']):
  1. data_nodes=[] and the loop over nb_dataNode: create the empty datanodes,
  2. nb_blocks=int(nb_tuples/block_size), plus one if there is a remainder: calculate the number of blocks,
  3. for i in range(nb_blocks): create the blocks (id_bk and an empty data list),
  4. for j in range(block_size): create the tuples (st_id, speciality and random marks) and append them to the block,
  5. for k in range(nb_copies): save the block copies on randomly chosen datanodes.
Workout 2
System architecture: recuperate the dataset (one copy of each block, without repetition)

def findBlock(id,dataset):
  # visit the datanodes in a random order
  random_sort=np.arange(len(dataset[0]))
  for i in range(len(dataset[0])-1):
     n=random.randint(0, len(dataset[0])-1)
     m=random.randint(0, len(dataset[0])-1)
     x=random_sort[n]
     random_sort[n]=random_sort[m]
     random_sort[m]=x
  # return the first copy found of the block, together with its datanode index
  for i in random_sort:
     for data in dataset[0][i]:
        if data['id_bk']==id:
            return [data['data'],i]
  return

def recuperateDataset(dataset):
  # rebuild the dataset without repetition: one copy of each block
  result=[]
  nb_blk=dataset[1]
  for i in range(nb_blk):
     blk=findBlock(i, dataset)
     result.append(blk)
  return result

Pipeline: Generate a dataset → Random dataset → Recuperate dataset → dataset without repetition
Workout 2 – Random dataset (example output)
[Example output of randomDataset with small parameters: a list of datanodes, each
holding block copies of the form {'id_bk': ..., 'data': [student tuples]}; because of
the nb_copies replication, each block appears on several datanodes (with repetition).]
Dataset without repetition (output of recuperateDataset): each entry is [block data, datanode index]
[[[{'st_id': 0, 'sp': 'GLSD', 'math': 16.25, 'phy': 11.24, 'sci': 12.98, 'phyl': 19.56, 'geog': 13.32, 'eng': 7.6}, {'st_id': 1, 'sp': 'ST', 'math':
                                    1.94, 'phy': 4.64, 'sci': 7.1, 'phyl': 14.25, 'geog': 4.37, 'eng': 18.26}, {'st_id': 2, 'sp': 'GLSD', 'math': 8.32, 'phy': 16.65, 'sci': 1.0, 'phyl':
                                    6.32, 'geog': 18.18, 'eng': 9.48}, {'st_id': 3, 'sp': 'MATH', 'math': 12.83, 'phy': 15.81, 'sci': 16.64, 'phyl': 7.04, 'geog': 9.32, 'eng': 8.49},
{'st_id': 4, 'sp': 'STW', 'math': 16.75, 'phy': 0.65, 'sci': 13.35, 'phyl': 5.9, 'geog': 19.54, 'eng': 18.28}], 2], [[{'st_id': 5, 'sp': 'MATH',
                                    'math': 2.5, 'phy': 9.33, 'sci': 2.48, 'phyl': 13.07, 'geog': 10.56, 'eng': 0.09}, {'st_id': 6, 'sp': 'ST', 'math': 14.41, 'phy': 7.07, 'sci': 15.94,
                                    'phyl': 0.26, 'geog': 13.12, 'eng': 13.41}, {'st_id': 7, 'sp': 'STW', 'math': 3.82, 'phy': 1.25, 'sci': 2.14, 'phyl': 15.8, 'geog': 0.01, 'eng':
                                    13.38}, {'st_id': 8, 'sp': 'ST', 'math': 8.7, 'phy': 13.91, 'sci': 0.35, 'phyl': 4.84, 'geog': 6.87, 'eng': 15.38}, {'st_id': 9, 'sp': 'ST', 'math':
                                    17.09, 'phy': 12.64, 'sci': 18.08, 'phyl': 11.72, 'geog': 19.83, 'eng': 14.23}], 0], [[{'st_id': 10, 'sp': 'ST', 'math': 9.3, 'phy': 4.03, 'sci': 4.65,
                                    'phyl': 19.33, 'geog': 12.55, 'eng': 15.24}, {'st_id': 11, 'sp': 'MATH', 'math': 7.95, 'phy': 16.42, 'sci': 5.62, 'phyl': 7.67, 'geog': 3.43, 'eng':
                                    14.16}, {'st_id': 12, 'sp': 'ST', 'math': 15.14, 'phy': 9.67, 'sci': 13.69, 'phyl': 10.49, 'geog': 5.98, 'eng': 3.39}, {'st_id': 13, 'sp': 'MATH',
                                    'math': 16.25, 'phy': 19.06, 'sci': 18.48, 'phyl': 12.06, 'geog': 11.89, 'eng': 18.62}, {'st_id': 14, 'sp': 'GLSD', 'math': 10.12, 'phy': 13.78,
                                    'sci': 15.21, 'phyl': 19.79, 'geog': 18.01, 'eng': 0.44}], 2], [[{'st_id': 15, 'sp': 'STW', 'math': 10.99, 'phy': 3.04, 'sci': 6.65, 'phyl': 10.15,
'geog': 18.88, 'eng': 7.9}, {'st_id': 16, 'sp': 'ST', 'math': 16.4, 'phy': 8.89, 'sci': 18.13, 'phyl': 17.48, 'geog': 7.92, 'eng': 7.57}, {'st_id': 17,
                                    'sp': 'GLSD', 'math': 16.25, 'phy': 13.82, 'sci': 4.84, 'phyl': 12.99, 'geog': 1.66, 'eng': 2.01}, {'st_id': 18, 'sp': 'MATH', 'math': 12.63,
                                    'phy': 6.93, 'sci': 18.15, 'phyl': 11.98, 'geog': 6.87, 'eng': 10.46}, {'st_id': 19, 'sp': 'GLSD', 'math': 8.75, 'phy': 14.73, 'sci': 10.59, 'phyl':
                                    3.2, 'geog': 18.01, 'eng': 14.63}], 2]]
Workout 2
Map function: compute each student's weighted average

def getAverage(student):
   # weighted average of the marks, using the per-subject coefficients
   # (the dict 'coefficients' is defined elsewhere in the workout)
   marks=np.array([*student.values()][2:])
   coef=np.array([*coefficients.values()])
   return {'st_id':student['st_id'],
           'avg':round(np.dot(marks,coef)/coef.sum(),2)}

def getAverages(dataset):
   # the Map step: apply getAverage to every tuple of every block
   studentaverages=[]
   for block in dataset:
      if (block!=None):
         avgs= list(map(getAverage,block[0]))
         studentaverages.append([avgs,block[1]])
   return studentaverages

Pipeline: dataset without repetition → Map function → Averages
Workout 2 – Averages (output of the Map function)

[[[{'st_id': 0, 'avg': 3.93}, {'st_id': 1, 'avg': 8.73}, {'st_id': 2, 'avg': 9.49},
   {'st_id': 3, 'avg': 5.02}, {'st_id': 4, 'avg': 16.3}], 0],
 [[{'st_id': 5, 'avg': 12.71}, {'st_id': 6, 'avg': 14.06}, {'st_id': 7, 'avg': 11.17},
   {'st_id': 8, 'avg': 13.31}, {'st_id': 9, 'avg': 8.71}], 2],
 [[{'st_id': 10, 'avg': 16.49}, {'st_id': 11, 'avg': 6.88}, {'st_id': 12, 'avg': 7.9},
   {'st_id': 13, 'avg': 11.04}, {'st_id': 14, 'avg': 7.88}], 0],
 [[{'st_id': 15, 'avg': 15.84}, {'st_id': 16, 'avg': 8.25}, {'st_id': 17, 'avg': 14.43},
   {'st_id': 18, 'avg': 12.81}, {'st_id': 19, 'avg': 10.53}], 2]]
Workout 2
Reduce function: sum and average per block, then the final result (sum, nb_values)

def avgSum(st1,st2):
  return {'avg':round(st1['avg']+st2['avg'],2)}

def sum_blocks(dataset):
  res=[]
  nb_tuples=0
  blk_nb=0
  for data in dataset:
    nb_tuples=nb_tuples+len(data[0])
    sr={}
    sr['block']=blk_nb
    sr['DN']=data[1]
    # sum, number of values and average of the block
    sr['sum']=reduce(avgSum, data[0])['avg']
    sr['nb_val']=len(data[0])
    sr['avg']=round(sr['sum']/len(data[0]),2)
    blk_nb=blk_nb+1
    res.append(sr)
  # reduce the per-block sums into the global sum and average
  rs=map(lambda r:{'avg':r['sum']}, res)
  sumb=reduce(avgSum, rs)
  return {'detail':res,
          'res':[{'sum':sumb['avg'],'nb_val':nb_tuples,
                  'avg':round(sumb['avg']/nb_tuples,2)}]}

Pipeline: Averages → Reduce function → final result
Workout 2 – Result of sum_blocks

detail of blocks:
   block  DN    sum  nb_val    avg
0      0   0  43.47       5   8.69
1      1   2  59.96       5  11.99
2      2   0  50.19       5  10.04
3      3   2  61.86       5  12.37

final result:
      sum  nb_val    avg
0  215.48      20  10.77