Parent job in slurm exits before child jobs | nested slurm jobs #315

@ag1805x

I am trying to set up a small test project that uses batchtools on Slurm. The problem is that the parent job exits from Slurm before all of the child jobs have completed. How can I solve this?

The main R script that submits the jobs and the associated configuration files are as follows:

run_batchtools_job.R

library(batchtools)

reg <- makeRegistry(file.dir = "slurm_registry", seed = 5081, conf.file = "Scripts/batch_tools_test/.batchtools.conf.R")

my_fun <- function(x) {
  Sys.sleep(x)  
  return(x^2)
}

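# Create one job per value of x (51 jobs in total).
ids <- batchMap(fun = my_fun, x = 100:150, reg = reg)
# Submit all jobs with the requested resources.
done <- submitJobs(ids = ids, reg = reg, resources = list(partition = "small", walltime = 86400, memory = 1024, ntasks = 1))
# Block until the jobs finish, then report their status.
waitForJobs(ids = ids, reg = reg)
getStatus(ids = ids, reg = reg)

# Collect the results into a list.
final_res <- reduceResultsList(ids = ids, reg = reg)
print(class(final_res))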

.batchtools.conf.R

cluster.functions <- makeClusterFunctionsSlurm(template = "Scripts/batch_tools_test/slurm_config.tmpl", 
                                               array.jobs = TRUE, 
                                               scheduler.latency = 60,
                                               fs.latency = 30)
max.concurrent.jobs <- 5

slurm_config.tmpl

#!/bin/bash
#SBATCH --job-name=<%= job.name %>
#SBATCH --output=<%= log.file %>
#SBATCH --ntasks=<%= resources$ntasks %>
#SBATCH --mem=<%= resources$memory %>MB
#SBATCH --partition=<%= resources$partition %>

module load  r/4.3.3

Rscript -e 'batchtools::doJobCollection("<%= uri %>")'

I submit the run_batchtools_job.R script to slurm using the following sbatch script.

run_batchtools.sh

#!/bin/bash
#SBATCH --job-name=batchtools_test
#SBATCH --output=batchtools_test.log
#SBATCH --ntasks=1
#SBATCH --time=01:00:00
#SBATCH --mem=2G
#SBATCH --partition=small

# Load R
module load  r/4.3.3

# Run your R script
Rscript Scripts/batch_tools_test/run_batchtools_job.R

I observed that the batchtools_test job exits before all of the child jobs spawned by submitJobs have finished. As a result, final_res is empty.

When I checked getErrorMessages, several jobs were listed as 'not terminated'. However, when I manually inspected the logs and the results in the registry directories, every job had completed as expected.
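
For anyone trying to reproduce this, here is a minimal sketch of the status checks described above, with the reduce step gated on the return value of waitForJobs. It is not the Slurm setup itself. It assumes a temporary registry (file.dir = NA) with whatever cluster functions batchtools picks up by default (the interactive backend unless a configuration file is found), and it uses much shorter sleeps and only five jobs so that it runs anywhere. On the cluster, the existing registry could presumably be opened read-only with loadRegistry("slurm_registry", writeable = FALSE) instead.

```r
library(batchtools)

# Temporary registry; stands in for "slurm_registry" only so that the
# sketch runs without a scheduler (assumption, not the original setup).
reg <- makeRegistry(file.dir = NA, seed = 5081)

my_fun <- function(x) {
  Sys.sleep(x / 1000)  # far shorter than the original 100 to 150 seconds
  x^2
}

ids <- batchMap(fun = my_fun, x = 100:104, reg = reg)
submitJobs(ids = ids, reg = reg)

# waitForJobs() returns TRUE only if every job terminated successfully,
# so its value can decide whether reducing the results makes sense.
all_ok <- waitForJobs(ids = ids, reg = reg)

# Status overview plus the per-job views used in the report above.
print(getStatus(ids = ids, reg = reg))
not_done <- findNotDone(ids = ids, reg = reg)      # jobs without a result
errors <- getErrorMessages(ids = ids, reg = reg)   # termination and error flags

if (all_ok) {
  final_res <- reduceResultsList(ids = ids, reg = reg)
  print(class(final_res))
} else {
  # Show which jobs batchtools considers unfinished or failed.
  print(not_done)
  print(errors)
}
```

If all_ok comes back FALSE while the log files look complete, comparing not_done with the files in the registry's results directory should show which jobs batchtools has not yet registered as finished.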

How can I overcome this issue?
