LSF for Users
Mike Page
mpage@ucar.edu
SCD Consulting Services Group
SCD/HSS/CSG
What is LSF?
LSF - Load Sharing Facility
Batch Management Subsystem
for multi-host, multi-vendor complexes
Same role as LoadLeveler or NQE, with the capability to manage
computing resources across multiple platforms
LSF runs on the Lightning cluster
------------------------------------------------------------------------------
Documentation: /usr/local/docs/LSF/6.0/*.pdf
Hardware description: http://www.scd.ucar.edu/docs/lightning/overview.html
At a Lightning command line, enter: man lsfintro
Further reading: http://accl.grc.nasa.gov/lsf/about.html
To be able to access LSF
This has been added to your login processing:
. /usr/local/lsf/conf/profile.lsf (sh users)
or
source /usr/local/lsf/conf/cshrc.lsf (csh users)
These commands are executed before you receive a command prompt.
There is no need for you to add anything to your login files in order to use
LSF.
These commands define the LSF environment:
LSF_SERVERDIR, LSF_BINDIR, LSF_LIBDIR, XLSF_UIDDIR, LSF_ENVDIR, PATH, MANPATH
-------------------------------------------------------------------
Check: env | grep -i lsf
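The check above can be made a little more explicit with a short sketch that reports which of the listed variables are unset. The function name check_lsf_env is hypothetical, and PATH and MANPATH are left out because they are normally set with or without LSF.

```shell
#!/bin/sh
# Print the names of the LSF variables (from the list above) that are unset or empty.
check_lsf_env() {
    missing=""
    for name in LSF_SERVERDIR LSF_BINDIR LSF_LIBDIR XLSF_UIDDIR LSF_ENVDIR; do
        # Indirect lookup of the variable whose name is stored in $name.
        eval "value=\${$name:-}"
        [ -n "$value" ] || missing="$missing $name"
    done
    # Strip the leading space before printing.
    echo "${missing# }"
}

result=$(check_lsf_env)
if [ -z "$result" ]; then
    echo "LSF environment variables are all set"
else
    echo "Not set: $result"
fi
```

If anything is reported as not set, the login processing described above probably did not run for this shell.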
Essential Commands
for Users
• bhosts
• bqueues
• bsub
• bjobs
• bhist
• bkill
• bpeek
• bmod
• bbot/btop
• bswitch
• bstop/bresume
Essential Commands
Purpose
• bhosts - information about available hosts (lshosts)
• bqueues - information about available queues
• bsub - submit jobs to batch subsystem
• bjobs - list jobs in the batch subsystem
• bhist - displays historical information about user’s jobs
• bpeek - displays stdout and stderr of user’s unfinished job
• bmod - modifies job submission options for user’s job
Essential Commands
Purpose (cont’d)
• bbot/btop - moves a pending job relative to user’s last/first job
in a queue
• bswitch - switches user’s unfinished jobs from one queue to
another
• bstop/bresume - suspends/resumes user’s unfinished jobs
• bkill - kill, suspend or resume user’s jobs
Essential Commands: bhosts
bhosts [-w|-l][-R "res_req"][host_name|host_group]
Displays information about hosts/platforms
lshosts [-w | -l] [-R "res_req"] [host_name | cluster_name]
lshosts -s [shared_resource_name ...]
Displays hosts and their static resource information
ln0126en$ bhosts
HOST_NAME STATUS JL/U MAX NJOBS RUN SSUSP USUSP RSV
ln0126en ok - 2 0 0 0 0 0
ln0127en ok - 2 0 0 0 0 0
ln0128en ok - 2 0 0 0 0 0
.
.
.
ln0440en ok - 2 0 0 0 0 0
ln0441en ok - 2 0 0 0 0 0
ln0442en ok - 2 0 0 0 0 0
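Because bhosts prints one host per line in fixed columns, its output is easy to summarize with awk. The sketch below counts free job slots (MAX minus NJOBS) on hosts whose status is ok. It runs on three lines copied from the listing above; the function name free_slots is hypothetical.

```shell
#!/bin/sh
# Three hosts copied from the sample bhosts listing.
sample='HOST_NAME STATUS JL/U MAX NJOBS RUN SSUSP USUSP RSV
ln0126en ok - 2 0 0 0 0 0
ln0127en ok - 2 0 0 0 0 0
ln0128en ok - 2 0 0 0 0 0'

# Sum MAX - NJOBS over hosts with STATUS "ok".
# Hosts whose MAX is not a number (for example "-") are skipped.
free_slots() {
    awk 'NR > 1 && $2 == "ok" && $4 ~ /^[0-9]+$/ { free += $4 - $5 }
         END { print free + 0 }'
}

printf '%s\n' "$sample" | free_slots
```

On the cluster the same filter would be used as: bhosts | free_slots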
Essential Commands: bqueues
bqueues [-w|-l|-r][-m host_name|-m all]
[-u user_name|-u all][queue_name …]
Displays information about queues.
By default, returns the following information about all queues: queue
name, queue priority, queue status, job slot statistics, and job state
statistics.
ln0126en$ bqueues
QUEUE_NAME PRIO STATUS MAX JL/U JL/P JL/H NJOBS PEND RUN SUSP
special 500 Open:Active - - - - 0 0 0 0
premium 300 Open:Active - - - - 0 0 0 0
regular 200 Open:Active - - - - 0 0 0 0
economy 160 Open:Active - - - - 0 0 0 0
hold 104 Open:Active - - - - 0 0 0 0
standby 100 Open:Active - - - - 0 0 0 0
share 100 Open:Active - - - - 0 0 0 0
Essential Commands: bsub
bsub [options] command [cmd_args]
Submits a job for batch execution
OPTION LIST
-B Sends mail at dispatch and initiation times.
-H Holds job in PSUSP and waits for bresume
-I | -Ip | -Is Submits as batch interactive
-K Submits job and waits for it to complete, blocking the command line and printing status updates
-N Sends job report by e-mail (use only with -I | -Is | -Ip or -o)
-r Rerun job on another host if host terminates
-x Exclusive execution mode
-a esub_parameters Specifies parallel job launcher (PJL) to be used
-b [[month:]day:]hour:minute Dispatch date/time
-C core_limit Limits size of core dumps (-C 0 recommended?)
-c [hours:]minutes[/host_name | /host_model] CPU time limit
-D data_limit
-e err_file File to use as stderr
-E "pre_exec_command [arguments ...]" Pre-exec command invoked before batch stream command processing
-ext[sched] "external_scheduler_options" N/A
-f "local_file operator [remote_file]" ... Files to be copied between local/remote systems
-F file_limit Per process file size limit
Essential Commands: bsub (cont’d)
bsub [options] command [cmd_args]
OPTION LIST (cont’d)
-g job_group_name Submits job to a job group
-G user_group Associates job with a specific group
-i input_file | -is input_file Specifies stdin for job
-J job_name | -J "job_name[index_list]%job_slot_limit" Specifies job name
-k "checkpoint_dir [checkpoint_period][method=method_name]" Makes a job checkpointable and specifies checkpoint directory
-L login_shell Uses login_shell for runtime environment
-m "host_name[@cluster_name][+[pref_level]] | host_group[+[pref_level]] ..." Selects and ranks hosts/groups on which to run
-M mem_limit Sets per process memory limit
-n min_proc[,max_proc] Sets min/max number of processors required to run job
-o out_file Specifies stdout
-P project_name Specifies project name
-p process_limit Limits total number of processes
-q queue_name Specifies queue for job (default provided by system)
-R "res_req" Specifies resource requirements
-sla service_class_name Specifies service class for job
-sp priority Specifies priority amongst user’s jobs
-S stack_limit Sets per-process stack limit
Essential Commands: bsub (cont’d)
bsub [options] command [cmd_args]
OPTION LIST (cont’d)
-t [[month:]day:]hour:minute Specifies job termination date
-T thread_limit Sets limit on number of concurrent threads
-U reservation_ID Uses a reservation created with the brsvadd command
-u mail_user Mail-to address
-v swap_limit Sets total process virtual memory limit
-w 'dependency_expression' Defines dependencies to be met before job initiation
-wa '[signal | command | CHKPNT]' Specifies action to be taken before job control step occurs
-wt '[hours:]minutes' Specifies time interval before job control occurs to send warning signal
-W [hours:]minutes[/host_name | /host_model] Specifies run time limit for job
-Zs Spools command file and runs from there
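Several limits above (-c, -W, -wt) take a value in the form [hours:]minutes. As a small illustration, the sketch below converts a run time given in minutes to that form and prints a bsub command line that uses it. The helper name to_wlimit is hypothetical, and the command is only printed, not executed.

```shell
#!/bin/sh
# Convert a number of minutes to the [hours:]minutes form.
to_wlimit() {
    total=$1
    hours=$((total / 60))
    minutes=$((total % 60))
    if [ "$hours" -gt 0 ]; then
        printf '%d:%02d\n' "$hours" "$minutes"
    else
        printf '%d\n' "$minutes"
    fi
}

# Show (but do not run) a submission with a 90-minute run time limit.
echo "bsub -W $(to_wlimit 90) -q regular -n 2 a.out"
```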
The Importance of Being <
LSF usage is different from LL/NQS
bsub a.out
bsub -n 2 a.out
bsub myscript
bsub -q queuename a.out
bsub -i infile -o outfile -e errfile a.out
bsub < myscript
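The redirection matters because bsub reads embedded #BSUB lines only when the script arrives on standard input. With "bsub myscript" the script is simply run as the command, and options must be given on the command line. To preview which options a script carries, the sketch below lists its #BSUB lines. The function name list_bsub_options and the demonstration script are hypothetical.

```shell
#!/bin/sh
# Write a small demonstration script to a temporary file.
script=$(mktemp)
cat > "$script" <<'EOF'
#!/bin/ksh
#BSUB -n 1                 # number of tasks
#BSUB -q regular           # queue
#BSUB -o demo.out          # output filename
./a.out
EOF

# Keep lines that start with #BSUB, drop the prefix, then drop trailing comments.
# Caveat: an option value that itself contains '#' would be cut short.
list_bsub_options() {
    sed -n 's/^#BSUB[[:space:]]*//p' "$1" | sed 's/[[:space:]]*#.*$//'
}

list_bsub_options "$script"
rm -f "$script"
```

The three lines printed (-n 1, -q regular, -o demo.out) are the options that would take effect when the script is submitted through standard input.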
Sample LSF script
Serial Job
#!/bin/ksh
#
# LSF batch script to run a serial code
#
#BSUB -P 93300070 # Project 93300070
#BSUB -n 1 # number of tasks
#BSUB -J seriallsf.test # job name
#BSUB -o seriallsf.out # output filename
#BSUB -e seriallsf.err # error filename
#BSUB -q regular # queue
# Fortran example
pgf90 -o samp_f -Mextend samp.f
./samp_f
# C example
pgcc -o samp_c samp.c
./samp_c
# C++ example
pgCC --no_auto_instantiation -o samp_cc samp.cc
./samp_cc
bsub < serial.lsf
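On a successful submission bsub prints a line of the form "Job <jobid> is submitted to queue <queue>." Capturing the job ID makes it easy to pass to bjobs, bpeek, or bkill afterwards. The sketch below extracts the ID from such a line; the sample response and its job ID are made up for illustration, and job_id_from is a hypothetical helper name.

```shell
#!/bin/sh
# Hypothetical response line in the usual bsub format (the job ID is invented).
response='Job <1234> is submitted to queue <regular>.'

# Extract the number between the first pair of angle brackets.
job_id_from() {
    printf '%s\n' "$1" | sed -n 's/^Job <\([0-9][0-9]*\)>.*/\1/p'
}

jobid=$(job_id_from "$response")
# Print the follow-up commands rather than running them.
echo "bjobs -l $jobid"
echo "bpeek $jobid"
echo "bkill $jobid"
```

In a real session the response would come from the submission itself, for example: response=$(bsub < serial.lsf)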
Sample LSF script
MPI Job
#!/bin/ksh
#
# LSF batch script to run the test MPI code
#
#BSUB -P 93300070 # Project 93300070
#BSUB -a mpich_gm # select the mpich-gm esub
#BSUB -x # exclusive use of node (not_shared)
#BSUB -n 2 # number of total tasks
#BSUB -R "span[ptile=1]" # run 1 task per node
#BSUB -J mpilsf.test # job name
#BSUB -o mpilsf.out # output filename
#BSUB -e mpilsf.err # error filename
#BSUB -q regular # queue
# Fortran example
mpif90 -o mpi_samp_f mpisamp.f
mpirun.lsf ./mpi_samp_f
# C example
mpicc -o mpi_samp_c mpisamp.c
mpirun.lsf ./mpi_samp_c
# C++ example
mpicxx -o mpi_samp_cc mpisamp.cc
mpirun.lsf ./mpi_samp_cc
bsub < mpi.lsf
Sample LSF script
OpenMP Job
#!/bin/ksh
#
# LSF script to run the test OMP codes
#
#BSUB -P 93300070 # Proposal group 2 - Project 93300070
#BSUB -a mpich_gm # select the mpich-gm esub
#BSUB -x # exclusive use of node
#BSUB -n 2 # number of tasks
#BSUB -R "span[hosts=1]" # jobs run on one host
#BSUB -J omplsf.test # job name
#BSUB -o omplsf.out # output filename
#BSUB -e omplsf.err # error filename
#BSUB -q regular # queue
# Fortran example
pgf90 -o samp_f -Mextend -mp samp.f
export OMP_NUM_THREADS=1
./samp_f
export OMP_NUM_THREADS=2
./samp_f
# C example
pgcc -mp -o samp_c samp.c
export OMP_NUM_THREADS=1
./samp_c
export OMP_NUM_THREADS=2
./samp_c
# C++ example
pgCC --no_auto_instantiation -mp -o samp_cc samp.cc
export OMP_NUM_THREADS=1
./samp_cc
export OMP_NUM_THREADS=2
./samp_cc
bsub < omp.lsf
Sample LSF script
MPMD Job
#!/bin/ksh
#
# LSF batch script to run the test MPMD codes
#
#BSUB -P 93300070 # Project 93300070
#BSUB -a mpich_gm
#BSUB -n 2
#BSUB -x
#BSUB -R "span[ptile=1]"
#BSUB -o mpmdlsf.out # output filename
#BSUB -e mpmdlsf.err # error filename
#BSUB -J mpmdlsf.test # job name
#BSUB -q regular # queue
#
#Build pgfile for mpmd run
rm -f pgfile
touch pgfile
#
EXE=../bin/itmpmd
#
j=0
for h in `echo $LSB_HOSTS`
do
echo ${h}" "${j}" "${EXE}${j} >> pgfile
j=`expr $j + 1`
done
#cat pgfile
# Fortran example
mpif90 -Mextend -o $EXE'0' ../src/mpmd/itmpmd.f
mpif90 -Mextend -o $EXE'1' ../src/mpmd/itmpmd.f
mpirun -pg pgfile /bin/pwd
# C example
mpicc -o $EXE'0' ../src/mpmd/itmpmd.c
mpicc -o $EXE'1' ../src/mpmd/itmpmd.c
mpirun -pg pgfile /bin/pwd
# C++ example
mpicxx --no_auto_instantiation -o $EXE'0' ../src/mpmd/itmpmd.cc
mpicxx --no_auto_instantiation -o $EXE'1' ../src/mpmd/itmpmd.cc
mpirun -pg pgfile /bin/pwd
rm $EXE'0' $EXE'1' pgfile
bsub < mpmd.lsf
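The pgfile loop in the MPMD script can be tried outside a batch job by supplying a stand-in host list. The sketch below reproduces the loop as a function and prints the lines it would write. The function name build_pgfile is hypothetical, and the two default host names are taken from the bhosts listing purely as placeholders for the value LSF puts in LSB_HOSTS at run time.

```shell
#!/bin/sh
# Print one pgfile line per host: host name, loop counter, and the
# executable path with the counter appended (as in the MPMD script).
build_pgfile() {
    exe=$1
    shift
    j=0
    for h in "$@"; do
        echo "$h $j $exe$j"
        j=$((j + 1))
    done
}

# Inside a job LSB_HOSTS is set by LSF; outside one, fall back to two sample hosts.
# The expansion is left unquoted on purpose so that it splits into one word per host.
build_pgfile ../bin/itmpmd ${LSB_HOSTS:-ln0126en ln0127en}
```

With the fallback host list this prints one line for ln0126en with ../bin/itmpmd0 and one for ln0127en with ../bin/itmpmd1, matching the two executables the script builds.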
Sample LSF script
Hybrid Job
#!/bin/ksh
#
# LSF batch script to run the test mixed MPI/OMP codes
#
#BSUB -a mpich_gm # select mpich_gm esub
#BSUB -x # exclusive use of node
#BSUB -n 2 # sum of number of tasks
#BSUB -R "span[ptile=1]" # number of processes per node
#BSUB -o mixlsf.out # output filename
#BSUB -e mixlsf.err # error filename
#BSUB -J mixlsf.test # job name
#BSUB -q regular # queue
#
#Build pgfile for mix run
rm -f pgfile
touch pgfile
#
EXE=${PWD}/mix
#
echo $LSB_HOSTS
j=0
for h in `echo $LSB_HOSTS`
do
echo ${h}" "${j}" "${EXE} >> pgfile
j=`expr $j + 1`
done
# Fortran example
mpif90 -Mextend -mp -lmp -o mix mix.f
export OMP_NUM_THREADS=1
mpirun-env.pl -pg pgfile $EXE
export OMP_NUM_THREADS=2
mpirun-env.pl -pg pgfile $EXE
# C example
mpicc -mp -o mix mix.c
export OMP_NUM_THREADS=1
mpirun-env.pl -pg pgfile $EXE
export OMP_NUM_THREADS=2
mpirun-env.pl -pg pgfile $EXE
# C++ example
mpicxx --no_auto_instantiation -mp -o mix mix.cc
export OMP_NUM_THREADS=1
mpirun-env.pl -pg pgfile $EXE
export OMP_NUM_THREADS=2
mpirun-env.pl -pg pgfile $EXE
rm pgfile
bsub < mix.lsf
Essential Commands: bjobs
bjobs - Displays information about LSF jobs
bjobs -u user_name
bjobs -u all
bjobs -l
bjobs -r
bjobs -s
bjobs -q queue_name
Essential Commands: bhist
bhist - displays historical information about jobs
bhist -J job_name
bhist -C start_time,end_time
bhist -D start_time,end_time
bhist -S start_time,end_time
bhist -T start_time,end_time
Essential Commands: bpeek
bpeek - displays stdout and stderr of user’s selected, unfinished job
bpeek -f uses ‘tail -f’ to display output instead of ‘cat’
bpeek [-q queue_name | -m host_name | -J job_name |
job_ID | "job_ID[index_list]"]
Essential Commands: bmod
bmod - modifies job submission options of a job
bmod [bsub options] [job_ID | "job_ID[index]"]
bmod -g job_group_name | -gn [job_ID]
bmod [-sla service_class_name | -slan] [job_ID]
bmod [-h | -V]
Essential Commands: bbot, btop
bbot - moves a pending job relative to the last job in the
queue
bbot job_ID | "job_ID[index_list]" [position]
bbot [-h | -V]
btop - moves a pending job relative to the first job in the
queue
btop job_ID | "job_ID[index_list]" [position]
btop [-h | -V]
Essential Commands: bswitch
bswitch - switches unfinished jobs from one queue to
another
bswitch [-J job_name] [-m host_name | -m host_group]
[-q queue_name] [-u user_name | -u user_group | -u all]
destination_queue [0]
bswitch destination_queue [job_ID | "job_ID[index_list]"] ...
bswitch [-h | -V]
Essential Commands: bstop/bresume
bstop - suspends unfinished jobs
bstop [-a] [-d] [-g job_group_name |-sla service_class_name]
[-J job_name] [-m host_name | -m host_group]
[-q queue_name] [-u user_name | -u user_group | -u all] [0]
[job_ID | "job_ID[index]"] ...
bstop [-h | -V]
bresume - resumes one or more suspended jobs
bresume [-g job_group_name] [-J job_name] [-m host_name ]
[-q queue_name] [-u user_name | -u user_group | -u all ] [0]
bresume [job_ID | "job_ID[index_list]"] ...
bresume [-h | -V]
Essential Commands: bkill
bkill - sends signals to kill, suspend, or resume unfinished
jobs
bkill [-l] [-g job_group_name | -sla service_class_name]
[-J job_name] [-m host_name | -m host_group]
[-q queue_name] [-r | -s (signal_value | signal_name)]
[-u user_name | -u user_group | -u all]
[job_ID ... | 0 | "job_ID[index]" ...]
bkill [-h | -V]
Questions?
Comments?