HPC Cluster Job Submission
Indian Institute of Information
Technology, Allahabad
By
Netweb Technologies India Pvt. Ltd.
Plot No-H1, Pocket- 9,
Faridabad Industrial Town (FIT)
Sector- 57, Faridabad, Ballabgarh,
State – Haryana- 121004, India
Table of Contents
1. Login to HPC (named Surya) cluster
2. Job Submission
3. Sample job scripts
a. General script structure
b. MPI jobs
c. TensorFlow
4. Job scheduler commands
1. Login to HPC (named Surya) cluster
(a) Log in to the Surya cluster with your username and password in either of the following two ways:
i. Run ssh username@surya.iiita.ac.in from your terminal
ii. Use PuTTY and enter surya.iiita.ac.in (or 172.20.70.12)
Note: If you use surya.iiita.ac.in instead of the IP address to log in, your primary DNS
should be set to 172.31.1.21 (the IIITA DNS server IP address).
Enter your username and password when prompted.
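If you log in often, an ssh client alias can shorten the command. A minimal ~/.ssh/config entry might look like the sketch below (the alias name surya and the placeholder your_username are illustrative, not part of the cluster setup):

```
Host surya
    HostName surya.iiita.ac.in
    User your_username
```

With this entry in place, ssh surya is equivalent to ssh your_username@surya.iiita.ac.in.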
The scheduler used to schedule jobs on the cluster is PBS Pro.
2. Job Submission
Write the job scheduler script (a shell script; samples are given in Section 3), load the
required modules, and submit the job.
Load the required module (for example, the anaconda module if you want to use TensorFlow) using the steps below:
(a) Check available module ($ is your command prompt and should not be written)
$ module avail
(b) Load the module
$ module load <module_name>
(c) Check the loaded modules with the command below
$ module list
Then submit the job with qsub -V <script_name>. For example:
$ qsub -V job1.sh
(d) Available queues in the cluster (testing phase only):
(1) prerunl : unlimited walltime, 160 cores
(2) preruns : 4 hours walltime, 160 cores
(3) prerungl : unlimited walltime, 2 GPUs with 40 cores or 1 GPU with 20 cores
(4) prerungs : 4 hours walltime, 1 GPU with 20 cores and 190 GB memory
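A queue is selected in the script with #PBS -q (or on the qsub command line with -q <queue>). As a sketch, a script header targeting the short GPU queue might begin as below; the job name gpu_test is illustrative, and any additional GPU-specific resource selectors depend on the site's PBS Pro configuration:

```
#!/bin/bash
#PBS -N gpu_test
#PBS -q prerungs
#PBS -l nodes=1:ppn=20
#PBS -o out.log
#PBS -e error.log
cd $PBS_O_WORKDIR
# application command goes here
```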
3. Sample Job Scripts
(a) General script structure. Save the script as <NameOfScript>.sh, for example job1.sh:
#!/bin/bash
##name of the job
#PBS -N jobname
##job output log
#PBS -o out.log
##job/application error logs
#PBS -e error.log
##requesting number of nodes and resources
#PBS -l nodes=4:ppn=40
##selecting queue
#PBS -q preruns
cd $PBS_O_WORKDIR
#job command without hash
Note: Lines above starting with two hash symbols (##) are comments, while lines starting
with #PBS are scheduler directives. #PBS directives set properties of the job; for example,
#PBS -l nodes=4:ppn=40 requests that the scheduler assign 4 nodes with 40 processors per
node (ppn) to the job.
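Since bash treats #PBS lines as ordinary comments, a job script also runs as a normal shell script outside the scheduler; only PBS interprets the directives. A quick local illustration (the file name demo.sh is arbitrary):

```shell
# Create a script containing a #PBS directive, then run it with bash:
cat > demo.sh <<'EOF'
#!/bin/bash
#PBS -N demo
echo "bash ignored the #PBS line"
EOF
bash demo.sh   # prints: bash ignored the #PBS line
```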
=====================================================================
Command to submit the job (the -V flag exports your current environment variables,
including loaded modules, to the job):
$ qsub -V job1.sh
(b) MPI jobs
#!/bin/bash
#PBS -N cpi
#PBS -o out.log
#PBS -e error.log
#PBS -l nodes=4:ppn=40
#PBS -q preruns
cd $PBS_O_WORKDIR
mpiexec.hydra -machinefile $PBS_NODEFILE -np 160 ./a.out
Note: ./a.out is the executable. If you have your own MPI C code, for example Helloworld.c
(shown below), first compile it with mpicc to get an executable, then use that executable
at the end of the mpiexec.hydra command above.
// Helloworld.c (using MPI)
#include <mpi.h>
#include <stdio.h>

int main(int argc, char** argv) {
    // Initialize the MPI environment
    MPI_Init(NULL, NULL);

    // Get the number of processes
    int world_size;
    MPI_Comm_size(MPI_COMM_WORLD, &world_size);

    // Get the rank of the process
    int world_rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);

    // Get the name of the processor
    char processor_name[MPI_MAX_PROCESSOR_NAME];
    int name_len;
    MPI_Get_processor_name(processor_name, &name_len);

    // Print a hello world message
    printf("Hello world from processor %s, rank %d out of %d processors\n",
           processor_name, world_rank, world_size);

    // Finalize the MPI environment
    MPI_Finalize();
    return 0;
}
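On the cluster login node, the compile-and-submit sequence described above would be mpicc Helloworld.c -o helloworld followed by qsub -V on a job script that launches ./helloworld. A sketch of generating such a script is shown below (mpi_job.sh and helloworld are illustrative names; mpicc and qsub themselves are available only on the cluster, so they appear here as comments):

```shell
# On the cluster you would first run:  mpicc Helloworld.c -o helloworld
# and then submit with:                qsub -V mpi_job.sh
# Generate a job script that launches the compiled MPI binary:
cat > mpi_job.sh <<'EOF'
#!/bin/bash
#PBS -N helloworld
#PBS -o out.log
#PBS -e error.log
#PBS -l nodes=4:ppn=40
#PBS -q preruns
cd $PBS_O_WORKDIR
mpiexec.hydra -machinefile $PBS_NODEFILE -np 160 ./helloworld
EOF
grep 'mpiexec' mpi_job.sh   # shows the launch line
```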
---------------------------------------------------------------------------------------------------------------------
(c) TensorFlow jobs
Step 1: Load the anaconda module using the command below:
$ module load utils/anaconda3.5
Step 2: Write your TensorFlow program; a basic "Hello, TensorFlow" example (tensortest.py) is given below:
# TensorFlow 1.x API (tf.Session was removed in TensorFlow 2.x)
import tensorflow as tf
hello = tf.constant('Hello, TensorFlow!')
sess = tf.Session()
print(sess.run(hello))   # prints b'Hello, TensorFlow!'
Step 3: Write the job script (tensor_jobscipt.sh) as shown below (modify the file name
accordingly):
#!/bin/bash
#PBS -N tensorflow
#PBS -l nodes=1:ppn=1
#PBS -o outlog
#PBS -e errorlog
cd $PBS_O_WORKDIR
python tensortest.py
Step 4: Submit the job using the following command:
$ qsub -V tensor_jobscipt.sh
Your output will be in the outlog file.
A screenshot of the running job is shown below.
4. Job scheduler commands
(a) Submit a job to the scheduler: $ qsub -V <script_name>
(b) Check job status: $ qstat
(c) Check where jobs are running: $ qstat -n
(d) Show full information about a job: $ qstat -f <job_id>
(only part of the information is shown in the screenshot)
(e) Delete a job from the queue: $ qdel <job_id> (it may take 5 to 10 seconds)
(f) Check queue information: $ qstat -q