The HADOOP platform
Introduction
    "In pioneer days they used oxen for heavy pulling, and when one ox
    couldn't budge a log, they didn't try to grow a larger ox. We shouldn't
    be trying for bigger computers, but for more systems of computers."
       Introduction
       Before talking about Hadoop, do you know
       the prefixes?
Sign   Prefix   Factor   Example
K      Kilo     10^3     Page of text
M      Mega     10^6     Transfer speed per second
G      Giga     10^9     DVD, USB key
T      Tera     10^12    Hard disk
P      Peta     10^15
E      Exa      10^18    Facebook, Amazon
Z      Zetta    10^21    Entire internet since 2010
 Introduction
Processing such large amounts of data requires special methods. A classic
DBMS is unable to process so much information.
The data are distributed across multiple machines (up to several million
computers) in data centers:
  ➔ A special file system presents a single space that can contain gigantic
    and/or very numerous files,
  ➔ Specific databases (HBase, Cassandra, ElasticSearch).
Processing of the "map-reduce" type:
  ➔ Algorithms that are easy to write,
  ➔ Executions that are easy to parallelize.
 Introduction
Imagine 5000 computers connected together forming a
cluster.
Data center
Introduction
Each of these blade servers can look like this: 4 multi-core CPUs, 1 TB of
RAM, 24 TB of hard disks, about $5,000 (prices and technology change
constantly).
Blade server
 Connected machines
All these machines are connected to each other in order to
share storage space and computing power.
The Cloud is an example of distributed storage space: files
are stored on different machines, usually in duplicate to
prevent failure.
The execution of programs is also distributed: they are
executed on one or more machines on the network.
This whole module aims to teach application
programming on a cluster, using Hadoop tools.
Hadoop is a distributed data management and processing system.
It contains many components, including:
   ● HDFS (Hadoop Distributed File System), a file system that
     distributes data over many machines,
   ● YARN, a mechanism for scheduling programs of the MapReduce type.
We will first present HDFS then YARN/MapReduce.
HDFS is a distributed file system:
   ● Files and folders are organized in a tree (as in Unix); the files are
     stored on a large number of machines in such a way that the exact
     position of a file is invisible.
   ● Access is transparent, whatever the machines that contain the files.
   ● Files are kept in several copies, for reliability and to allow
     multiple simultaneous accesses.
   HDFS makes it possible to see all the folders and files of these
   thousands of machines as a single tree, containing petabytes (PB) of
   data, as if they were on the local hard disk.
  File organization
From the user's point of view, HDFS looks like a Unix file system: there is
a root, directories and files. Files have an owner, a group and access
rights.
Under the root /, there is:
   ●Directories for Hadoop services: /hbase, /tmp, /var
   ➔a directory for users' personal files:
     ✔/user. In this directory, there are also three system
     folders: /user/hive, /user/history and /user/spark.
   ➔a directory to deposit files to share with all users: /share
You will need to distinguish between
HDFS files and "normal" files.
    Install Hadoop 3.2.2
Step 1 : Installing Java
First you should install the JDK (Java Development Kit).
 Step 2 : Create environment variables
For Java, we create a new user variable called "JAVA_HOME".
It contains the Java directory.
For Hadoop, we create a new user variable called
"HADOOP_HOME". It contains the Hadoop directory.
For Spark, we create a new user variable called
"SPARK_HOME". It contains the Spark directory.
  Install Hadoop 3.2.2
Step 3 : Create environment paths
For Java, we create a new user path entry.
It contains the Java directory + "\bin".
For Hadoop, we create two new user path entries:
the 1st contains the Hadoop directory + "\bin",
the 2nd contains the Hadoop directory + "\sbin".
For Spark, we create two new user path entries:
the 1st contains the Spark directory + "\bin",
the 2nd contains the Spark directory + "\sbin".
Install Hadoop 3.2.2
 Step 4 : Update the Hadoop xml files
In the directory "Hadoop\hadoop-3.2.2\etc\Hadoop", open the file
"core-site.xml" and change the temporary directory setting according to
your Hadoop directory.
Install Hadoop 3.2.2
 Step 4 : Update the Hadoop xml files
In the directory "Hadoop\hadoop-3.2.2\etc\Hadoop", open the file
"hdfs-site.xml" and change the namenode and datanode directories
according to your Hadoop directory.
Install Hadoop 3.2.2
 Step 5 : Update the Hadoop-env command file
In the directory "Hadoop\hadoop-3.2.2\etc\Hadoop", open the file
"Hadoop-env" and add the following commands at the end:
        set HADOOP_PREFIX=%HADOOP_HOME%
        set HADOOP_CONF_DIR=%HADOOP_PREFIX%\etc\hadoop
        set YARN_CONF_DIR=%HADOOP_CONF_DIR%
        set PATH=%PATH%;%HADOOP_PREFIX%\bin
Install Hadoop 3.2.2
 Step 6 : start hadoop
In the Command Prompt, execute the command "spark-shell", then the
command "for %I in (.) do echo %~sI".
This last command must be executed in the Java directory to display the
short name of your installed JDK.
Use the short name to update the "Hadoop-env" file.
Install Hadoop 3.2.2
 Step 7 : start hadoop
In a new Command Prompt, execute the command “start-dfs” to
start the Hadoop services.
Install Hadoop 3.2.2
Step 8 : Open the NameNode web interface at http://localhost:9870/
    How does HDFS work?
         As in many file systems, each HDFS file is split into fixed-size blocks.
An HDFS block is 256 MB. Depending on its size, a file will need a certain
number of blocks; on HDFS, the last block of a file holds only the remaining bytes.
         The blocks of the same file are not necessarily all on the same
machine. They are each copied to different machines in order to be
accessed simultaneously by several processes. By default, each block is
copied to 3 different machines (this is configurable).
         This replication of blocks on several machines also makes it
possible to guard against breakdowns. Each file is therefore found in several
copies and in different places.
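To make the arithmetic concrete, here is a small Python sketch (purely illustrative, not part of HDFS) that computes how a file of a given size is cut into 256 MB blocks:

    BLOCK_SIZE = 256 * 1024 * 1024        # 256 MB, the block size used in this course

    def hdfs_blocks(file_size):
        # return the list of block sizes (in bytes) for a file of file_size bytes
        full, remainder = divmod(file_size, BLOCK_SIZE)
        return [BLOCK_SIZE] * full + ([remainder] if remainder else [])

    # a 600 MB file needs 3 blocks: two full blocks and a last block of 88 MB
    print([b // (1024 * 1024) for b in hdfs_blocks(600 * 1024 * 1024)])   # [256, 256, 88]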
    Organization of machines for HDFS
       An HDFS cluster is made up of machines playing different, mutually
  exclusive roles:
• One of the machines is the HDFS master, called the namenode. This
  machine contains all the file names and blocks, like a big phone book.
• Another machine is the secondary namenode, a kind of backup
  namenode, which saves backups of the directory at regular intervals.
• Some machines are clients. These are access points to the cluster to
  connect to and work with.
• All other machines are datanodes. They store blocks of file content.
  Diagram of HDFS nodes
      The datanodes contain blocks (A, B, C...); the namenode knows
where the files are: which blocks and on which datanodes.
  More explanation
Datanodes contain blocks. The same blocks are duplicated (replicated) on
different datanodes, generally 3 times. This ensures:
    • data reliability in the event of a datanode failure,
    • parallel access by different processes to the same data.
The namenode knows both:
    • which blocks each file is made of,
    • on which datanodes the desired blocks are located.
This is called metadata.
Major drawback: failure of the namenode = death of HDFS. For this reason
there is the secondary namenode: it archives the metadata, for example
every hour.
Java API for HDFS
Hadoop offers a complete Java API for accessing HDFS files. It is
based on two main classes:
   • FileSystem represents the file tree (file system). This class allows
     copying local files to HDFS (and vice versa), renaming, creating
     and deleting files and folders.
   • FileStatus manages the information of a file or folder:
             its size with getLen(),
             its nature with isDirectory() and isFile().
These classes need to know the configuration of the HDFS cluster, via the
Configuration class. Full file names (paths) are represented by the Path
class.
    Java API example
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.Path;

Configuration conf = new Configuration();
FileSystem fs = FileSystem.get(conf);
// full name of the HDFS file /user/userName/hello.txt
Path fullName = new Path("/user/userName", "hello.txt");
// get its size in bytes, then rename it
FileStatus infos = fs.getFileStatus(fullName);
System.out.println(Long.toString(infos.getLen()) + " bytes");
fs.rename(fullName, new Path("/user/etudiant1", "g_mor.txt"));
Displaying the list of blocks in a file
import ...;

public class HDFSinfo {
    public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path fullName = new Path("test.txt");
        FileStatus infos = fs.getFileStatus(fullName);
        // retrieve the location of every block of the file
        BlockLocation[] blocks =
            fs.getFileBlockLocations(infos, 0, infos.getLen());
        for (BlockLocation blocloc : blocks)
            System.out.println(blocloc.toString());
    }
}
Reading an HDFS file
import java.io.*;
import ...;

public class HDFSread {
    public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path fullName = new Path("apitest.txt");
        // open the HDFS file and print its first line
        FSDataInputStream inStream = fs.open(fullName);
        InputStreamReader isr = new InputStreamReader(inStream);
        BufferedReader br = new BufferedReader(isr);
        String line = br.readLine();
        System.out.println(line);
        inStream.close();
        fs.close();
    }
}
Creating an HDFS File
import ...;

public class HDFSwrite {
    public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path fullName = new Path("apitest.txt");
        // create the file only if it does not exist yet
        if (! fs.exists(fullName)) {
            FSDataOutputStream outStream = fs.create(fullName);
            outStream.writeUTF("Hello world!");
            outStream.close();
        }
        fs.close();
    }
}
MapReduce algorithms
 Principles
  We want to collect synthetic information from a data set.
  Examples on a list of items with a price:
        Calculate the total amount of sales of an item,
        Find the most expensive item,
        Calculate the average price of items.
  For each of these examples, the problem can be written as the
  composition of two functions:
         map: extraction/calculation of information from each tuple,
         reduce: grouping of this information.
MapReduce algorithms
    Example
Calculating the maximum, average or total price can be
written using algorithms of the type:
for each tuple:
       value = FunctionM(current tuple)
return FunctionR(values encountered)
MapReduce algorithms
 Example
  FunctionM is a mapping function: it calculates the value that
             interests us from a tuple,
  FunctionR is a grouping function (aggregation): maximum, sum,
             count, average, distinct...
  For example, FunctionM retrieves the price of a car and
  FunctionR calculates the max of a set of values:
  all_prices = list()
  for each car:
          all_prices.add( getPrice(current car) )
  return max(all_prices)
   MapReduce algorithms
      Example in Python
  from functools import reduce

  data = [
   {'id':1, 'mark':'Renault', 'model':'Clio', 'price':4200},
   {'id':2, 'mark':'Fiat',    'model':'500',  'price':8840},
   {'id':3, 'mark':'Peugeot', 'model':'206',  'price':4300},
   {'id':4, 'mark':'Peugeot', 'model':'306',  'price':6140} ]

  # returns the price of the car passed as a parameter
  def getPrice(car): return car['price']

  # show the list of car prices
  print(list(map(getPrice, data)))

  # display the highest price
  print(reduce(max, map(getPrice, data)))
   MapReduce algorithms
       Example in Python
map(function, list) applies the function to each element of the list. It
performs the "for" loop of the previous algorithm and returns the list of
car prices; the result contains as many values as the input list.
reduce(function, list) aggregates the values of the list with the function
and returns the final result.
    MapReduce algorithms
        Example in Python
These two functions constitute a "map-reduce" pair, and the goal of this
course is to understand and program them. The key point is the
possibility of parallelizing these functions in order to compute much
faster on a machine with several cores or on a set of machines linked
together.
    MapReduce algorithms
    Parallelization of Map
The map function is parallelizable by nature, because the
calculations are independent.
Example, for 4 elements to process:
value1 = FunctionM(element1)
value2 = FunctionM(element2)
value3 = FunctionM(element3)
value4 = FunctionM(element4)
    MapReduce algorithms
    Parallelization of Map
The four calculations can be done simultaneously, for
example on 4 different machines, provided that the data is
copied there.
Note: the mapped function must be a pure function of its
parameter, it must have no side effects such as modifying a
global variable or remembering its previous values.
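As an illustration in Python (a sketch using the standard multiprocessing module, not Hadoop itself), the mapped function can be run on several cores at once, provided it is a pure function defined at module level:

    from multiprocessing import Pool

    # pure function: its result depends only on its argument (no side effects)
    def getPrice(car):
        return car['price']

    if __name__ == '__main__':
        data = [{'id': 1, 'price': 4200}, {'id': 2, 'price': 8840},
                {'id': 3, 'price': 4300}, {'id': 4, 'price': 6140}]
        with Pool(4) as pool:              # 4 worker processes, like 4 machines
            prices = pool.map(getPrice, data)
        print(prices)                      # [4200, 8840, 4300, 6140]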
    MapReduce algorithms
    Parallelization of Reduce
The reduce function can only be partially parallelized, in a hierarchical
(tree) form, for example:
inter1&2 = FunctionR(value1, value2)
inter3&4 = FunctionR(value3, value4)
result   = FunctionR(inter1&2, inter3&4)
    Only the first two calculations can be done simultaneously; the third
  must wait. If there were more values, we would proceed as follows:
    MapReduce algorithms
    Parallelization of Reduce
1. parallel computation of FunctionR on all pairs of values produced
   by the map,
2. parallel computation of FunctionR on all pairs of intermediate values
   from the previous phase,
3. and so on, until there is only one value left.
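The following Python sketch (illustrative only, and assuming FunctionR is associative) performs this hierarchical reduction; in a real cluster, the pairs of each level would be computed on different machines:

    def tree_reduce(functionR, values):
        # reduce the values pairwise, level by level
        values = list(values)
        while len(values) > 1:
            next_level = []
            for i in range(0, len(values) - 1, 2):
                next_level.append(functionR(values[i], values[i + 1]))  # pairs can run in parallel
            if len(values) % 2 == 1:
                next_level.append(values[-1])                           # odd value passes to the next level
            values = next_level
        return values[0]

    print(tree_reduce(max, [4200, 8840, 4300, 6140]))   # 8840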
MapReduce algorithms
Schema: Data → Map → Reduce
    YARN and MapReduce
    What is YARN?
       YARN (Yet Another Resource Negotiator) is a mechanism
in Hadoop for managing jobs on a cluster of machines.
       YARN allows users to launch MapReduce jobs on data present in
HDFS, to follow (monitor) their progress and to retrieve the messages
(logs) displayed by the programs.
       If needed, YARN can move a process from one machine to another
in the event of failure or of progress deemed too slow. In practice, YARN
is transparent to the user: we launch the execution of a MapReduce
program and YARN ensures that it is executed as quickly as possible.
     YARN and MapReduce
     What is MapReduce?
      MapReduce is a Java environment for writing programs
for YARN. Java is not the simplest language for this: there are packages
to import, class paths to provide...
     There are several points to know:
 • Principles of a MapReduce job in Hadoop,
 • Map function programming,
 • Programming the Reduce function,
 • Programming a MapReduce job that calls the
   two functions,
 • Launching the job and retrieving the results.
    YARN and MapReduce
    Key-value pairs
        It is actually a bit more complicated than initially explained. The
data exchanged between Map and Reduce, and indeed throughout the job,
are (key, value) pairs:
       a key: any type of data (integer, text...),
       a value: any type of data.
 Everything is represented like this. For example:
   a text file is a set of (line number, line) pairs,
   a weather file is a set of (date and time, temperature) pairs.
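For instance, a small Python sketch (not the Hadoop API) of a text file seen as (key, value) pairs, where the key is the line number:

    # the string stands in for the content of a file
    text = "first line\nsecond line\nthird line"
    pairs = list(enumerate(text.splitlines()))
    print(pairs)   # [(0, 'first line'), (1, 'second line'), (2, 'third line')]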
    YARN and MapReduce
    Map
        The Map function receives a pair as input and can
produce any number of pairs as output: none, one, or many, at
will. The types of inputs and outputs are as desired.
This very unconstrained specification allows a lot of things. In general,
the pairs that Map receives are made up as follows:
  the value, of type text, is one of the lines (or one of the tuples) of
   the file to process,
  the key, of type integer, is the position of this line in the file
     (called the offset).
    YARN and MapReduce
    Map
       It should be understood that YARN launches an instance
of Map for each row of each data file to be processed. Each
instance processes the row assigned to it and produces output
pairs.
  YARN and MapReduce
  Map schema
The MAP tasks each process a pair and produce
0..n pairs. The same keys and/or values may be
produced.
    YARN and MapReduce
    Reduce
        The Reduce function receives a list of pairs as input. These
are the pairs produced by instances of Map. Reduce can produce
any number of output pairs, but most of the time it's just one.
However, the crucial point is that the input pairs processed by an
instance of Reduce all have the same key.
        YARN launches a Reduce instance for each different key that
the Map instances have produced, and provides them with only
the pairs with the same key. This is what makes it possible
to aggregate the values. In general, Reduce must do some processing on
the values, such as adding them all together or determining the largest
of them. When we design a MapReduce process, we have to think about
the keys and values necessary for it to work.
YARN and MapReduce
Reduce schema
Reduce tasks receive a list of pairs that all
have the same key and produce a pair that
contains the expected result. This output
pair can have the same key as the input
pair.
        YARN and MapReduce
        Example
A telephone company wants to calculate the total duration of a subscriber's
telephone calls from a CSV file containing all calls from all subscribers
(subscriber number, called number, date, call duration).
This problem is handled as follows:
    1. In input, we have the calls file (1 call per line)
    2. YARN launches one instance of Map function per call
    3. Each instance of Map receives a pair (offset, line) and produces a
       pair (subscriber number, duration) or nothing if it is not the
       subscriber that we want. NB: the offset is useless here.
    4. YARN sends all pairs to a single instance of Reduce
       (because there is only one different key)
    5. The Reduce instance adds all the values of the pairs it receives and
       produces a single output pair (subscriber number, total duration).
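Below is a plain Python sketch of this example (illustrative data and names, not the Hadoop API): the map step keeps only the calls of the wanted subscriber, and the reduce step adds their durations:

    from functools import reduce

    # (subscriber number, called number, date, call duration)
    calls = [
        ('0555', '0666', '2023-01-02', 120),
        ('0777', '0888', '2023-01-02', 30),
        ('0555', '0999', '2023-01-03', 45),
    ]
    subscriber = '0555'

    # Map: one (key, value) pair per call of the wanted subscriber, nothing otherwise
    pairs = [(c[0], c[3]) for c in calls if c[0] == subscriber]

    # Reduce: only one key here, so a single instance adds all the durations
    total = reduce(lambda a, b: a + b, (v for _, v in pairs))
    print((subscriber, total))   # ('0555', 165)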
        YARN and MapReduce
        MapReduce job phases
A MapReduce job consists of several phases:
1. Pre-processing of the input data, e.g. decompression of the files
2. Split: separation of the data into separately processable blocks, put in the
   form of (key, value) pairs, e.g. lines or tuples
3. Map: application of the map function to every (key, value) pair formed from
   the input data; this produces other (key, value) pairs as output
4. Shuffle & Sort: redistribution of the data so that the pairs produced by Map
   that have the same key end up on the same machine
5. Reduce: aggregation of the pairs having the same key to obtain the
   final result.
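The following Python sketch (a toy word count, not the Hadoop API) walks through these phases on a single machine:

    from collections import defaultdict

    data = "big data big cluster data data"

    # Split: form (key, value) pairs, here (offset, word)
    pairs = list(enumerate(data.split()))

    # Map: each pair produces (word, 1)
    mapped = [(word, 1) for _, word in pairs]

    # Shuffle & Sort: group the values by key
    groups = defaultdict(list)
    for key, value in mapped:
        groups[key].append(value)

    # Reduce: aggregate the values of each key
    result = {key: sum(values) for key, values in groups.items()}
    print(result)   # {'big': 2, 'data': 3, 'cluster': 1}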
YARN and MapReduce
Schema: Data → (Split, Map, Shuffle & Sort, Reduce) → Result
        Workout 1
        Some commands
hadoop version                       # display the Hadoop version
hadoop fs -mkdir /test               # create a new directory named "test"
hadoop fs -ls /                      # list the content of the root directory
hadoop fs -copyFromLocal <localsrc> <hdfs destination>
                                     # copy a file from the local file system to HDFS
hadoop fs -put <localsrc> <dest>     # copy a local file to the Hadoop file system
hadoop fs -get <src> <localdest>     # copy a file or directory from the Hadoop
                                     # file system to the local file system
hadoop fs -cat /path_to_file_in_hdfs # display the content of an HDFS file
Workout 2
Map reduce
[Diagram: a namenode and p datanodes; each datanode stores blocks
(Block 1..m) and each block contains tuples (Tuple 1..n).
Parameters: Nb_tuples = n, Nb_blocks = m, Nb_datanodes = p, Block_size]
        Workout 2
        Student score (example)
 Tuple architecture
{'st_id': 89, 'sp': 'GLSD', 'math': 3.09, 'phy': 16.89, 'sci': 14.26, 'phyl': 12.45, 'geog': 19.15, 'eng': 14.1}
Block architecture
           {'id_bk': 4, 'data': [{'st_id': 16, 'sp': 'STW', 'math': 10.42, 'phy':
           19.07, 'sci': 18.46, 'phyl': 5.47, 'geog': 10.92, 'eng': 17.65}, {'st_id':
           17, 'sp': 'STW', 'math': 12.88, 'phy': 1.15, 'sci': 12.46, 'phyl': 17.02,
           'geog': 17.39, 'eng': 17.22}, {'st_id': 18, 'sp': 'STW', 'math': 6.33,
           'phy': 14.58, 'sci': 19.39, 'phyl': 17.84, 'geog': 0.02, 'eng': 14.65},
           {'st_id': 19, 'sp': 'ST', 'math': 7.81, 'phy': 4.39, 'sci': 9.6, 'phyl': 13.19,
           'geog': 18.34, 'eng': 14.88}]}
            Workout 2
            Student score (example)
   Data node architecture
[{'id_bk': 0, 'data': [{'st_id': 0, 'sp': 'STW', 'math': 18.23, 'phy': 5.32, 'sci': 17.34, 'phyl': 4.61, 'geog': 19.37, 'eng': 1.8}, {'st_id': 1,
'sp': 'ST', 'math': 4.53, 'phy': 13.81, 'sci': 12.32, 'phyl': 4.95, 'geog': 0.31, 'eng': 17.78}, {'st_id': 2, 'sp': 'MATH', 'math': 6.65, 'phy':
14.35, 'sci': 5.35, 'phyl': 0.9, 'geog': 15.71, 'eng': 8.39}, {'st_id': 3, 'sp': 'ST', 'math': 17.9, 'phy': 17.06, 'sci': 5.42, 'phyl': 1.4, 'geog':
7.23, 'eng': 18.99}]}, {'id_bk': 1, 'data': [{'st_id': 4, 'sp': 'MATH', 'math': 15.0, 'phy': 2.47, 'sci': 19.41, 'phyl': 16.03, 'geog': 0.79,
'eng': 6.61}, {'st_id': 5, 'sp': 'ST', 'math': 0.6, 'phy': 9.12, 'sci': 9.59, 'phyl': 6.29, 'geog': 11.94, 'eng': 4.59}, {'st_id': 6, 'sp': 'STW',
'math': 15.37, 'phy': 12.88, 'sci': 17.88, 'phyl': 7.88, 'geog': 12.73, 'eng': 14.11}, {'st_id': 7, 'sp': 'MATH', 'math': 14.83, 'phy':
9.07, 'sci': 9.59, 'phyl': 15.93, 'geog': 4.28, 'eng': 14.38}]}, {'id_bk': 2, 'data': [{'st_id': 8, 'sp': 'STW', 'math': 2.16, 'phy': 9.32,
'sci': 15.19, 'phyl': 13.02, 'geog': 5.38, 'eng': 19.62}, {'st_id': 9, 'sp': 'MATH', 'math': 4.69, 'phy': 4.85, 'sci': 12.8, 'phyl': 18.44,
'geog': 1.99, 'eng': 3.64}, {'st_id': 10, 'sp': 'GLSD', 'math': 9.84, 'phy': 9.0, 'sci': 3.49, 'phyl': 9.05, 'geog': 9.78, 'eng': 16.54},
{'st_id': 11, 'sp': 'GLSD', 'math': 11.52, 'phy': 2.56, 'sci': 16.79, 'phyl': 12.35, 'geog': 15.25, 'eng': 9.84}]}, {'id_bk': 2, 'data':
[{'st_id': 8, 'sp': 'STW', 'math': 2.16, 'phy': 9.32, 'sci': 15.19, 'phyl': 13.02, 'geog': 5.38, 'eng': 19.62}, {'st_id': 9, 'sp': 'MATH',
'math': 4.69, 'phy': 4.85, 'sci': 12.8, 'phyl': 18.44, 'geog': 1.99, 'eng': 3.64}, {'st_id': 10, 'sp': 'GLSD', 'math': 9.84, 'phy': 9.0, 'sci':
3.49, 'phyl': 9.05, 'geog': 9.78, 'eng': 16.54}, {'st_id': 11, 'sp': 'GLSD', 'math': 11.52, 'phy': 2.56, 'sci': 16.79, 'phyl': 12.35, 'geog':
15.25, 'eng': 9.84}]}, {'id_bk': 3, 'data': [{'st_id': 12, 'sp': 'ST', 'math': 13.3, 'phy': 8.36, 'sci': 11.37, 'phyl': 14.56, 'geog': 7.35, 'eng':
11.54}, {'st_id': 13, 'sp': 'GLSD', 'math': 3.62, 'phy': 8.83, 'sci': 4.58, 'phyl': 3.66, 'geog': 4.38, 'eng': 4.07}, {'st_id': 14, 'sp': 'GLSD',
'math': 12.52, 'phy': 14.36, 'sci': 6.33, 'phyl': 0.19, 'geog': 12.93, 'eng': 10.98}, {'st_id': 15, 'sp': 'MATH', 'math': 1.8, 'phy': 18.2,
'sci': 5.26, 'phyl': 19.19, 'geog': 14.35, 'eng': 18.45}]}, {'id_bk': 3, 'data': [{'st_id': 12, 'sp': 'ST', 'math': 13.3, 'phy': 8.36, 'sci': 11.37,
'phyl': 14.56, 'geog': 7.35, 'eng': 11.54}, {'st_id': 13, 'sp': 'GLSD', 'math': 3.62, 'phy': 8.83, 'sci': 4.58, 'phyl': 3.66, 'geog': 4.38,
'eng': 4.07}, {'st_id': 14, 'sp': 'GLSD', 'math': 12.52, 'phy': 14.36, 'sci': 6.33, 'phyl': 0.19, 'geog': 12.93, 'eng': 10.98}, {'st_id': 15,
'sp': 'MATH', 'math': 1.8, 'phy': 18.2, 'sci': 5.26, 'phyl': 19.19, 'geog': 14.35, 'eng': 18.45}]}]
             Workout 2
             Student score (example)
Dataset architecture
 [[{'id_bk': 0, 'data': [{'st_id': 0, 'sp': 'STW', 'math': 18.23, 'phy': 5.32, 'sci': 17.34, 'phyl': 4.61, 'geog': 19.37, 'eng': 1.8}, {'st_id': 1, 'sp': 'ST', 'math': 4.53, 'phy': 13.81, 'sci': 12.32, 'phyl': 4.95, 'geog': 0.31, 'eng':
 17.78}, {'st_id': 2, 'sp': 'MATH', 'math': 6.65, 'phy': 14.35, 'sci': 5.35, 'phyl': 0.9, 'geog': 15.71, 'eng': 8.39}, {'st_id': 3, 'sp': 'ST', 'math': 17.9, 'phy': 17.06, 'sci': 5.42, 'phyl': 1.4, 'geog': 7.23, 'eng': 18.99}]},
 {'id_bk': 1, 'data': [{'st_id': 4, 'sp': 'MATH', 'math': 15.0, 'phy': 2.47, 'sci': 19.41, 'phyl': 16.03, 'geog': 0.79, 'eng': 6.61}, {'st_id': 5, 'sp': 'ST', 'math': 0.6, 'phy': 9.12, 'sci': 9.59, 'phyl': 6.29, 'geog': 11.94, 'eng':
 4.59}, {'st_id': 6, 'sp': 'STW', 'math': 15.37, 'phy': 12.88, 'sci': 17.88, 'phyl': 7.88, 'geog': 12.73, 'eng': 14.11}, {'st_id': 7, 'sp': 'MATH', 'math': 14.83, 'phy': 9.07, 'sci': 9.59, 'phyl': 15.93, 'geog': 4.28, 'eng':
 14.38}]}, {'id_bk': 2, 'data': [{'st_id': 8, 'sp': 'STW', 'math': 2.16, 'phy': 9.32, 'sci': 15.19, 'phyl': 13.02, 'geog': 5.38, 'eng': 19.62}, {'st_id': 9, 'sp': 'MATH', 'math': 4.69, 'phy': 4.85, 'sci': 12.8, 'phyl': 18.44, 'geog':
 1.99, 'eng': 3.64}, {'st_id': 10, 'sp': 'GLSD', 'math': 9.84, 'phy': 9.0, 'sci': 3.49, 'phyl': 9.05, 'geog': 9.78, 'eng': 16.54}, {'st_id': 11, 'sp': 'GLSD', 'math': 11.52, 'phy': 2.56, 'sci': 16.79, 'phyl': 12.35, 'geog': 15.25,
 'eng': 9.84}]}, {'id_bk': 2, 'data': [{'st_id': 8, 'sp': 'STW', 'math': 2.16, 'phy': 9.32, 'sci': 15.19, 'phyl': 13.02, 'geog': 5.38, 'eng': 19.62}, {'st_id': 9, 'sp': 'MATH', 'math': 4.69, 'phy': 4.85, 'sci': 12.8, 'phyl': 18.44,
 'geog': 1.99, 'eng': 3.64}, {'st_id': 10, 'sp': 'GLSD', 'math': 9.84, 'phy': 9.0, 'sci': 3.49, 'phyl': 9.05, 'geog': 9.78, 'eng': 16.54}, {'st_id': 11, 'sp': 'GLSD', 'math': 11.52, 'phy': 2.56, 'sci': 16.79, 'phyl': 12.35, 'geog':
 15.25, 'eng': 9.84}]}, {'id_bk': 3, 'data': [{'st_id': 12, 'sp': 'ST', 'math': 13.3, 'phy': 8.36, 'sci': 11.37, 'phyl': 14.56, 'geog': 7.35, 'eng': 11.54}, {'st_id': 13, 'sp': 'GLSD', 'math': 3.62, 'phy': 8.83, 'sci': 4.58, 'phyl':
 3.66, 'geog': 4.38, 'eng': 4.07}, {'st_id': 14, 'sp': 'GLSD', 'math': 12.52, 'phy': 14.36, 'sci': 6.33, 'phyl': 0.19, 'geog': 12.93, 'eng': 10.98}, {'st_id': 15, 'sp': 'MATH', 'math': 1.8, 'phy': 18.2, 'sci': 5.26, 'phyl': 19.19,
 'geog': 14.35, 'eng': 18.45}]}, {'id_bk': 3, 'data': [{'st_id': 12, 'sp': 'ST', 'math': 13.3, 'phy': 8.36, 'sci': 11.37, 'phyl': 14.56, 'geog': 7.35, 'eng': 11.54}, {'st_id': 13, 'sp': 'GLSD', 'math': 3.62, 'phy': 8.83, 'sci': 4.58,
 'phyl': 3.66, 'geog': 4.38, 'eng': 4.07}, {'st_id': 14, 'sp': 'GLSD', 'math': 12.52, 'phy': 14.36, 'sci': 6.33, 'phyl': 0.19, 'geog': 12.93, 'eng': 10.98}, {'st_id': 15, 'sp': 'MATH', 'math': 1.8, 'phy': 18.2, 'sci': 5.26, 'phyl':
 19.19, 'geog': 14.35, 'eng': 18.45}]}], [{'id_bk': 1, 'data': [{'st_id': 4, 'sp': 'MATH', 'math': 15.0, 'phy': 2.47, 'sci': 19.41, 'phyl': 16.03, 'geog': 0.79, 'eng': 6.61}, {'st_id': 5, 'sp': 'ST', 'math': 0.6, 'phy': 9.12,
 'sci': 9.59, 'phyl': 6.29, 'geog': 11.94, 'eng': 4.59}, {'st_id': 6, 'sp': 'STW', 'math': 15.37, 'phy': 12.88, 'sci': 17.88, 'phyl': 7.88, 'geog': 12.73, 'eng': 14.11}, {'st_id': 7, 'sp': 'MATH', 'math': 14.83, 'phy': 9.07,
 'sci': 9.59, 'phyl': 15.93, 'geog': 4.28, 'eng': 14.38}]}, {'id_bk': 3, 'data': [{'st_id': 12, 'sp': 'ST', 'math': 13.3, 'phy': 8.36, 'sci': 11.37, 'phyl': 14.56, 'geog': 7.35, 'eng': 11.54}, {'st_id': 13, 'sp': 'GLSD', 'math':
 3.62, 'phy': 8.83, 'sci': 4.58, 'phyl': 3.66, 'geog': 4.38, 'eng': 4.07}, {'st_id': 14, 'sp': 'GLSD', 'math': 12.52, 'phy': 14.36, 'sci': 6.33, 'phyl': 0.19, 'geog': 12.93, 'eng': 10.98}, {'st_id': 15, 'sp': 'MATH', 'math':
 1.8, 'phy': 18.2, 'sci': 5.26, 'phyl': 19.19, 'geog': 14.35, 'eng': 18.45}]}, {'id_bk': 4, 'data': [{'st_id': 16, 'sp': 'STW', 'math': 10.42, 'phy': 19.07, 'sci': 18.46, 'phyl': 5.47, 'geog': 10.92, 'eng': 17.65}, {'st_id': 17,
 'sp': 'STW', 'math': 12.88, 'phy': 1.15, 'sci': 12.46, 'phyl': 17.02, 'geog': 17.39, 'eng': 17.22}, {'st_id': 18, 'sp': 'STW', 'math': 6.33, 'phy': 14.58, 'sci': 19.39, 'phyl': 17.84, 'geog': 0.02, 'eng': 14.65}, {'st_id':
 19, 'sp': 'ST', 'math': 7.81, 'phy': 4.39, 'sci': 9.6, 'phyl': 13.19, 'geog': 18.34, 'eng': 14.88}]}, {'id_bk': 4, 'data': [{'st_id': 16, 'sp': 'STW', 'math': 10.42, 'phy': 19.07, 'sci': 18.46, 'phyl': 5.47, 'geog': 10.92,
 'eng': 17.65}, {'st_id': 17, 'sp': 'STW', 'math': 12.88, 'phy': 1.15, 'sci': 12.46, 'phyl': 17.02, 'geog': 17.39, 'eng': 17.22}, {'st_id': 18, 'sp': 'STW', 'math': 6.33, 'phy': 14.58, 'sci': 19.39, 'phyl': 17.84, 'geog':
 0.02, 'eng': 14.65}, {'st_id': 19, 'sp': 'ST', 'math': 7.81, 'phy': 4.39, 'sci': 9.6, 'phyl': 13.19, 'geog': 18.34, 'eng': 14.88}]}], [{'id_bk': 0, 'data': [{'st_id': 0, 'sp': 'STW', 'math': 18.23, 'phy': 5.32, 'sci': 17.34, 'phyl':
 4.61, 'geog': 19.37, 'eng': 1.8}, {'st_id': 1, 'sp': 'ST', 'math': 4.53, 'phy': 13.81, 'sci': 12.32, 'phyl': 4.95, 'geog': 0.31, 'eng': 17.78}, {'st_id': 2, 'sp': 'MATH', 'math': 6.65, 'phy': 14.35, 'sci': 5.35, 'phyl': 0.9, 'geog':
 15.71, 'eng': 8.39}, {'st_id': 3, 'sp': 'ST', 'math': 17.9, 'phy': 17.06, 'sci': 5.42, 'phyl': 1.4, 'geog': 7.23, 'eng': 18.99}]}, {'id_bk': 0, 'data': [{'st_id': 0, 'sp': 'STW', 'math': 18.23, 'phy': 5.32, 'sci': 17.34, 'phyl': 4.61,
 'geog': 19.37, 'eng': 1.8}, {'st_id': 1, 'sp': 'ST', 'math': 4.53, 'phy': 13.81, 'sci': 12.32, 'phyl': 4.95, 'geog': 0.31, 'eng': 17.78}, {'st_id': 2, 'sp': 'MATH', 'math': 6.65, 'phy': 14.35, 'sci': 5.35, 'phyl': 0.9, 'geog': 15.71,
 'eng': 8.39}, {'st_id': 3, 'sp': 'ST', 'math': 17.9, 'phy': 17.06, 'sci': 5.42, 'phyl': 1.4, 'geog': 7.23, 'eng': 18.99}]}, {'id_bk': 1, 'data': [{'st_id': 4, 'sp': 'MATH', 'math': 15.0, 'phy': 2.47, 'sci': 19.41, 'phyl': 16.03, 'geog':
 0.79, 'eng': 6.61}, {'st_id': 5, 'sp': 'ST', 'math': 0.6, 'phy': 9.12, 'sci': 9.59, 'phyl': 6.29, 'geog': 11.94, 'eng': 4.59}, {'st_id': 6, 'sp': 'STW', 'math': 15.37, 'phy': 12.88, 'sci': 17.88, 'phyl': 7.88, 'geog': 12.73, 'eng':
 14.11}, {'st_id': 7, 'sp': 'MATH', 'math': 14.83, 'phy': 9.07, 'sci': 9.59, 'phyl': 15.93, 'geog': 4.28, 'eng': 14.38}]}, {'id_bk': 2, 'data': [{'st_id': 8, 'sp': 'STW', 'math': 2.16, 'phy': 9.32, 'sci': 15.19, 'phyl': 13.02, 'geog':
 5.38, 'eng': 19.62}, {'st_id': 9, 'sp': 'MATH', 'math': 4.69, 'phy': 4.85, 'sci': 12.8, 'phyl': 18.44, 'geog': 1.99, 'eng': 3.64}, {'st_id': 10, 'sp': 'GLSD', 'math': 9.84, 'phy': 9.0, 'sci': 3.49, 'phyl': 9.05, 'geog': 9.78, 'eng':
 16.54}, {'st_id': 11, 'sp': 'GLSD', 'math': 11.52, 'phy': 2.56, 'sci': 16.79, 'phyl': 12.35, 'geog': 15.25, 'eng': 9.84}]}, {'id_bk': 4, 'data': [{'st_id': 16, 'sp': 'STW', 'math': 10.42, 'phy': 19.07, 'sci': 18.46, 'phyl': 5.47,
 'geog': 10.92, 'eng': 17.65}, {'st_id': 17, 'sp': 'STW', 'math': 12.88, 'phy': 1.15, 'sci': 12.46, 'phyl': 17.02, 'geog': 17.39, 'eng': 17.22}, {'st_id': 18, 'sp': 'STW', 'math': 6.33, 'phy': 14.58, 'sci': 19.39, 'phyl': 17.84,
 'geog': 0.02, 'eng': 14.65}, {'st_id': 19, 'sp': 'ST', 'math': 7.81, 'phy': 4.39, 'sci': 9.6, 'phyl': 13.19, 'geog': 18.34, 'eng': 14.88}]}]]
Workout 2
System architecture: generate a random dataset
(parameters: Nb_tuples = n, Nb_blocks = m, Nb_datanodes = p, Block_size, Specialities, nb_copies)

import random
import numpy as np              # used by the following slides (findBlock, getAverage)
from functools import reduce    # used by the Reduce function of the workout

specialities=['ST','MATH','GLSD','STW']

def randomDataset(nb_tuples=2000, block_size=15, nb_dataNode=7, nb_copies=3,
                  specialities=specialities):
  # create the empty datanodes
  data_nodes=[]
  for i in range(nb_dataNode):
    data_nodes.append([])
  # calculate the number of blocks
  nb_blocks=int(nb_tuples/block_size)
  if (nb_tuples%block_size!=0):
      nb_blocks=nb_blocks+1
  # create the blocks and fill them with random student tuples
  for i in range(nb_blocks):
    block={}
    block['id_bk']=i
    block['data']=[]
    for j in range(block_size):
        if (i*block_size+j==nb_tuples):
            break
        sp=random.randint(0, len(specialities)-1)
        tuple={}
        tuple['st_id']=i*block_size+j
        tuple['sp']=specialities[sp]
        tuple['math']= round(random.uniform(0.,20.),2)
        tuple['phy']= round(random.uniform(0.,20.),2)
        tuple['sci']= round(random.uniform(0.,20.),2)
        tuple['phyl']= round(random.uniform(0.,20.),2)
        tuple['geog']=round(random.uniform(0.,20.),2)
        tuple['eng']= round(random.uniform(0.,20.),2)
        block['data'].append(tuple)
    # save nb_copies copies of the block on randomly chosen datanodes
    for k in range(nb_copies):
      dns=[]
      dn=random.randint(0, len(data_nodes)-1)
      while dn in dns:
         dn=random.randint(0, len(data_nodes)-1)
      data_nodes[dn].append(block)
      dns.append(dn)
  return [data_nodes,nb_blocks]
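For instance (illustrative parameters, not imposed by the workout), a small dataset of the kind shown on the previous slides can be generated with:

    # hypothetical small call so that the output stays readable
    dataset = randomDataset(nb_tuples=20, block_size=4, nb_dataNode=5, nb_copies=2)
    data_nodes, nb_blocks = dataset
    print(nb_blocks)              # 5 blocks of 4 tuples each
    print(len(data_nodes[0]))     # number of block copies stored on datanode 0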
Workout 2
System architecture – walkthrough of randomDataset (specialities=['ST','MATH','GLSD','STW']):
  1. data_nodes=[] and the loop over nb_dataNode: create the empty datanodes,
  2. nb_blocks=int(nb_tuples/block_size), plus one if there is a remainder: calculate the number of blocks,
  3. for i in range(nb_blocks): create the blocks (id_bk and an empty data list),
  4. for j in range(block_size): create the tuples (st_id, speciality and random marks) and append them to the block,
  5. for k in range(nb_copies): save the block copies on randomly chosen datanodes.
Workout 2
System architecture: recuperate the dataset (one copy of each block, without repetition)

def findBlock(id,dataset):
  # visit the datanodes in a random order
  random_sort=np.arange(len(dataset[0]))
  for i in range(len(dataset[0])-1):
     n=random.randint(0, len(dataset[0])-1)
     m=random.randint(0, len(dataset[0])-1)
     x=random_sort[n]
     random_sort[n]=random_sort[m]
     random_sort[m]=x
  # return the first copy found of the block, together with its datanode index
  for i in random_sort:
     for data in dataset[0][i]:
        if data['id_bk']==id:
            return [data['data'],i]
  return

def recuperateDataset(dataset):
  # rebuild the dataset without repetition: one copy of each block
  result=[]
  nb_blk=dataset[1]
  for i in range(nb_blk):
     blk=findBlock(i, dataset)
     result.append(blk)
  return result

Pipeline: Generate a dataset → Random dataset → Recuperate dataset → dataset without repetition
Workout 2 – Random dataset (example output)
[Example output of randomDataset with small parameters: a list of datanodes, each
holding block copies of the form {'id_bk': ..., 'data': [student tuples]}; because of
the nb_copies replication, each block appears on several datanodes (with repetition).]
Dataset without repetition (output of recuperateDataset): each entry is [block data, datanode index]
[[[{'st_id': 0, 'sp': 'GLSD', 'math': 16.25, 'phy': 11.24, 'sci': 12.98, 'phyl': 19.56, 'geog': 13.32, 'eng': 7.6}, {'st_id': 1, 'sp': 'ST', 'math':
                                    1.94, 'phy': 4.64, 'sci': 7.1, 'phyl': 14.25, 'geog': 4.37, 'eng': 18.26}, {'st_id': 2, 'sp': 'GLSD', 'math': 8.32, 'phy': 16.65, 'sci': 1.0, 'phyl':
                                    6.32, 'geog': 18.18, 'eng': 9.48}, {'st_id': 3, 'sp': 'MATH', 'math': 12.83, 'phy': 15.81, 'sci': 16.64, 'phyl': 7.04, 'geog': 9.32, 'eng': 8.49},
{'st_id': 4, 'sp': 'STW', 'math': 16.75, 'phy': 0.65, 'sci': 13.35, 'phyl': 5.9, 'geog': 19.54, 'eng': 18.28}], 2], [[{'st_id': 5, 'sp': 'MATH',
                                    'math': 2.5, 'phy': 9.33, 'sci': 2.48, 'phyl': 13.07, 'geog': 10.56, 'eng': 0.09}, {'st_id': 6, 'sp': 'ST', 'math': 14.41, 'phy': 7.07, 'sci': 15.94,
                                    'phyl': 0.26, 'geog': 13.12, 'eng': 13.41}, {'st_id': 7, 'sp': 'STW', 'math': 3.82, 'phy': 1.25, 'sci': 2.14, 'phyl': 15.8, 'geog': 0.01, 'eng':
                                    13.38}, {'st_id': 8, 'sp': 'ST', 'math': 8.7, 'phy': 13.91, 'sci': 0.35, 'phyl': 4.84, 'geog': 6.87, 'eng': 15.38}, {'st_id': 9, 'sp': 'ST', 'math':
                                    17.09, 'phy': 12.64, 'sci': 18.08, 'phyl': 11.72, 'geog': 19.83, 'eng': 14.23}], 0], [[{'st_id': 10, 'sp': 'ST', 'math': 9.3, 'phy': 4.03, 'sci': 4.65,
                                    'phyl': 19.33, 'geog': 12.55, 'eng': 15.24}, {'st_id': 11, 'sp': 'MATH', 'math': 7.95, 'phy': 16.42, 'sci': 5.62, 'phyl': 7.67, 'geog': 3.43, 'eng':
                                    14.16}, {'st_id': 12, 'sp': 'ST', 'math': 15.14, 'phy': 9.67, 'sci': 13.69, 'phyl': 10.49, 'geog': 5.98, 'eng': 3.39}, {'st_id': 13, 'sp': 'MATH',
                                    'math': 16.25, 'phy': 19.06, 'sci': 18.48, 'phyl': 12.06, 'geog': 11.89, 'eng': 18.62}, {'st_id': 14, 'sp': 'GLSD', 'math': 10.12, 'phy': 13.78,
                                    'sci': 15.21, 'phyl': 19.79, 'geog': 18.01, 'eng': 0.44}], 2], [[{'st_id': 15, 'sp': 'STW', 'math': 10.99, 'phy': 3.04, 'sci': 6.65, 'phyl': 10.15,
'geog': 18.88, 'eng': 7.9}, {'st_id': 16, 'sp': 'ST', 'math': 16.4, 'phy': 8.89, 'sci': 18.13, 'phyl': 17.48, 'geog': 7.92, 'eng': 7.57}, {'st_id': 17,
                                    'sp': 'GLSD', 'math': 16.25, 'phy': 13.82, 'sci': 4.84, 'phyl': 12.99, 'geog': 1.66, 'eng': 2.01}, {'st_id': 18, 'sp': 'MATH', 'math': 12.63,
                                    'phy': 6.93, 'sci': 18.15, 'phyl': 11.98, 'geog': 6.87, 'eng': 10.46}, {'st_id': 19, 'sp': 'GLSD', 'math': 8.75, 'phy': 14.73, 'sci': 10.59, 'phyl':
                                    3.2, 'geog': 18.01, 'eng': 14.63}], 2]]
Workout 2
Map function: compute each student's weighted average

def getAverage(student):
   # weighted average of the marks, using the per-subject coefficients
   # (the dict 'coefficients' is defined elsewhere in the workout)
   marks=np.array([*student.values()][2:])
   coef=np.array([*coefficients.values()])
   return {'st_id':student['st_id'],
           'avg':round(np.dot(marks,coef)/coef.sum(),2)}

def getAverages(dataset):
   # the Map step: apply getAverage to every tuple of every block
   studentaverages=[]
   for block in dataset:
      if (block!=None):
         avgs= list(map(getAverage,block[0]))
         studentaverages.append([avgs,block[1]])
   return studentaverages

Pipeline: dataset without repetition → Map function → Averages
Workout 2 – Averages (output of the Map function)

[[[{'st_id': 0, 'avg': 3.93}, {'st_id': 1, 'avg': 8.73}, {'st_id': 2, 'avg': 9.49},
   {'st_id': 3, 'avg': 5.02}, {'st_id': 4, 'avg': 16.3}], 0],
 [[{'st_id': 5, 'avg': 12.71}, {'st_id': 6, 'avg': 14.06}, {'st_id': 7, 'avg': 11.17},
   {'st_id': 8, 'avg': 13.31}, {'st_id': 9, 'avg': 8.71}], 2],
 [[{'st_id': 10, 'avg': 16.49}, {'st_id': 11, 'avg': 6.88}, {'st_id': 12, 'avg': 7.9},
   {'st_id': 13, 'avg': 11.04}, {'st_id': 14, 'avg': 7.88}], 0],
 [[{'st_id': 15, 'avg': 15.84}, {'st_id': 16, 'avg': 8.25}, {'st_id': 17, 'avg': 14.43},
   {'st_id': 18, 'avg': 12.81}, {'st_id': 19, 'avg': 10.53}], 2]]
Workout 2
Reduce function: sum and average per block, then the final result (sum, nb_values)

def avgSum(st1,st2):
  return {'avg':round(st1['avg']+st2['avg'],2)}

def sum_blocks(dataset):
  res=[]
  nb_tuples=0
  blk_nb=0
  for data in dataset:
    nb_tuples=nb_tuples+len(data[0])
    sr={}
    sr['block']=blk_nb
    sr['DN']=data[1]
    # sum, number of values and average of the block
    sr['sum']=reduce(avgSum, data[0])['avg']
    sr['nb_val']=len(data[0])
    sr['avg']=round(sr['sum']/len(data[0]),2)
    blk_nb=blk_nb+1
    res.append(sr)
  # reduce the per-block sums into the global sum and average
  rs=map(lambda r:{'avg':r['sum']}, res)
  sumb=reduce(avgSum, rs)
  return {'detail':res,
          'res':[{'sum':sumb['avg'],'nb_val':nb_tuples,
                  'avg':round(sumb['avg']/nb_tuples,2)}]}

Pipeline: Averages → Reduce function → final result
Workout 2 – Result of sum_blocks

detail of blocks:
   block  DN    sum  nb_val    avg
0      0   0  43.47       5   8.69
1      1   2  59.96       5  11.99
2      2   0  50.19       5  10.04
3      3   2  61.86       5  12.37

final result:
      sum  nb_val    avg
0  215.48      20  10.77