[RESOLVED] SLURM scheduling - sessions vs. batch file

I’m trying to run a recipe for multiple subjects/sessions in parallel using the SLURM scheduler. When I specify just the batch file (in the recipe file), it schedules only a 2-job array of tasks for the first two of my 24 sessions. If I specify the sessions explicitly with the --sessions parameter, then it works as expected and schedules all 24. It was my understanding that when the sessions aren’t explicitly stated, all the sessions specified in the batch file would be run.

The qunex call that does not work as expected is:

qunex_container run_recipe \
   --recipe_file="${STUDY_FOLDER}/processing/recipe.yaml" \
   --recipe="ncanda_hcp" \
   --bash_post="${BASH_POST}" \
   --bind="${STUDY_FOLDER}:${STUDY_FOLDER}" \
   --container="${QUNEX_CONTAINER}" \
   --steps="hcp_pre_freesurfer" \
   --scheduler="SLURM,array,time=23:59:59,mem=12G,jobname=ncanda,partition=batch"

If I add the --sessions parameter listing those 24 sessions from the batch file (--sessions="S00081,S00096,S00097,S00108,S00131,S00135,S00141,S00159,S00162,S00164,S00168,S00170,S00176,S00185,S00187,S00192,S00195,S00215,S00230,S00253,S00269,S00271,S00290,S00298"), then it works as expected and creates all 24 jobs.

My batch file and recipe YAML file (saved as .txt) are attached.
batch.txt (14.0 KB)
recipe.txt (1.3 KB)

Thanks very much for any help!

Hi delrubin,

I think this is expected behavior. qunex_container is not the actual QuNex execution but an orchestration script that sorts out scheduling and containerization for you. Since you are not providing it with the relevant information (the batch file), it does not know how to schedule/parallelize over the sessions. The batch file is provided in the recipe file, which is only read inside the job / inside the container.

The fix is simple: just add the --batchfile=... parameter to the qunex_container call. That way it will have the session information and will be able to create multiple jobs. Also, you may not even need a job array and can skip the array keyword. A job array is useful when you have a huge number of sessions, as SLURM will then create one main orchestration job that spawns the others. I do not think that is needed here.
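
For concreteness, the adjusted call could look something like the sketch below. The --batchfile path is an assumption based on your folder layout (point it at your actual batch file), and the array keyword is dropped as suggested above:

# the --batchfile path here is a guess; substitute the path to your own batch file
qunex_container run_recipe \
   --recipe_file="${STUDY_FOLDER}/processing/recipe.yaml" \
   --recipe="ncanda_hcp" \
   --batchfile="${STUDY_FOLDER}/processing/batch.txt" \
   --bash_post="${BASH_POST}" \
   --bind="${STUDY_FOLDER}:${STUDY_FOLDER}" \
   --container="${QUNEX_CONTAINER}" \
   --steps="hcp_pre_freesurfer" \
   --scheduler="SLURM,time=23:59:59,mem=12G,jobname=ncanda,partition=batch"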

Best, Jure

Hi Jure,

Ah – I did include the batchfile parameter at the top of the recipe file as a global parameter. Is that not equivalent to providing it in the qunex_container call?
Eventually I’ll want to do many more subjects (I have 800 total), so I was just testing out the behavior for now.

Thanks!

No, that is not equivalent; things are a bit complex due to scheduling and containers :). We try to hide these complexities from QuNex users, but sometimes we cannot do so completely (as in your case).

qunex_container is a helper script for orchestrating and scheduling commands. The recipe file is read by the run_recipe command, which is executed on the system’s compute nodes inside the container. qunex_container, on the other hand, is executed on the login node and prepares everything so that the correct sessions are passed to, and used only by, the jobs they belong to. In a way, qunex_container uses the batch file to distribute sessions between jobs, while recipe execution is a within-job/within-container affair.
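
To make the division of labor concrete, here is a purely illustrative sketch of what the orchestration amounts to; this is not actual QuNex code, and the sessions.txt file with one session ID per line is hypothetical:

# NOT QuNex internals; just an illustration of "one scheduled job per session".
# The login-node script reads the session list and submits a SLURM job for each;
# the real work (run_recipe) then happens inside the container on a compute node.
while read -r session; do
   sbatch --time=23:59:59 --mem=12G --partition=batch --job-name=ncanda \
      --wrap="qunex run_recipe --recipe_file=recipe.yaml --recipe=ncanda_hcp --sessions=${session}"
done < sessions.txt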

Best, Jure


Thanks again – I included the --batchfile=... parameter in the qunex_container call. At first it gave me the error “It seems like you passed the batchfile both through the sessions and the batchfile parameters!”, but once I commented out the batchfile parameter in the global_parameters section of the recipe file, it worked fine.
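
For anyone who runs into the same error, the recipe-file side of the change amounts to something like this sketch (the exact layout of your recipe file may differ, and the path shown is hypothetical):

global_parameters:
   # batchfile is now passed on the qunex_container command line instead,
   # so it is commented out here to avoid passing it twice:
   # batchfile: "${STUDY_FOLDER}/processing/batch.txt"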
