[RESOLVED] Hcp_diffusion fails with eddy input error

Hello,

I was running hcp_diffusion for the HCP Early Psychosis dataset after completing all the FreeSurfer steps. The job was submitted to GPU nodes, running through Singularity. However, I received the comlog below, which seems to report several file-reading errors. Here are the QuNex options I used when submitting the job:

--qunex_command="hcp_diffusion"
--qunex_options="--sessions=/u/project/CCN/kkarlsgo/data/HCP/HCP_EOP/imagingcollection01/qunex_studyfolder/processing/hcpep_batch.txt --sessionsfolder=/u/project/CCN/kkarlsgo/data/HCP/HCP_EOP/imagingcollection01/qunex_studyfolder/sessions --parsessions=1 --overwrite=yes"
--scheduler_options="-l h_data=12G,h_rt=23:00:00,gpu,RTX2080Ti"
--logdir="/u/project/CCN/kkarlsgo/data/HCP/HCP_EOP/imagingcollection01/qunex_studyfolder/processing/logs/manual"
--array="yes"
--sessions="1001_01_MR 1002_01_MR 1003_01_MR"

Here is the Diffusion directory of subject 1001_01_MR after running hcp_diffusion:

I have also attached the comlog and my parameter file below. Please let me know if I can provide any more info. Thank you so much for your help with troubleshooting!

error_hcp_diffusion_1001_01_MR_2022-09-29_13.50.1664484607.log (90.6 KB)

hcpep_batch.txt (137.6 KB)

Warmly,
Haley

Hi Haley,

From your description it is not clear exactly how you are invoking the QuNex command. Could you provide the whole job submission script? The traditional and officially supported way to use containers is through the specialized qunex_container script we prepared, so it may be best to use that approach. See https://qunex.readthedocs.io/en/latest/wiki/UsageDocs/RunningQunexContainer.html for a detailed description. Note, however, that qunex_container only supports SLURM and PBS scheduling; I am not sure which scheduler you use.
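For illustration, a qunex_container invocation might look roughly like the sketch below. The image path and scheduler string are placeholders, not values from this thread, and the exact parameter names should be checked against the documentation linked above:

```shell
# Hypothetical sketch of a qunex_container call; the container image
# path and scheduler string are placeholders.
qunex_container hcp_diffusion \
    --sessionsfolder=/path/to/qunex_studyfolder/sessions \
    --sessions="1001_01_MR" \
    --container=/path/to/qunex_suite.sif \
    --scheduler="SLURM,time=23:00:00,mem-per-cpu=12G,gres=gpu:1"
```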

A common problem with GPU commands and Singularity is that everything is run without the --nv flag, which gives the system inside the QuNex container access to the host's CUDA libraries.

Cheers, Jure

Hi Jure,

Thank you so much for your reply! I've attached my job submission script (written by a labmate) here. Our university cluster only supports SGE, so we use this submission script to schedule the jobs in batch. This submission code worked well for my PreFS/FS/PostFS steps. The QuNex container version I used was 0.90.6.

hoffman2_submit_qunex.txt (8.6 KB)

Many thanks,
Haley

Hi Haley,

As mentioned, when using GPUs with Singularity you need to add the --nv flag. For starters, try adding it on lines 176 and 205 of your script: wherever you have singularity exec, change it to singularity exec --nv. After you do that, try rerunning everything.
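As a sketch (the image name and trailing arguments are placeholders, not the actual values from your script):

```shell
# Before: the container cannot see the host NVIDIA driver/CUDA libraries.
singularity exec qunex_suite.sif qunex hcp_diffusion ...

# After: --nv binds the host NVIDIA libraries into the container.
singularity exec --nv qunex_suite.sif qunex hcp_diffusion ...
```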

Furthermore, you may need to load some CUDA (NVIDIA GPU) libraries once you get to the compute node where everything is executed. In our case, for example, I need to run module load cuda/9.1.85. When using the qunex entry point, I can do this with the --bash="module load cuda/9.1.85" parameter. I am not sure whether you need to do this, or how to do it with your custom launch script.
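In an SGE job script, that might look roughly like the sketch below. The resource line is taken from your first post; the module name, image, and trailing arguments are placeholders (check module avail cuda on your cluster):

```shell
#!/bin/bash
#$ -l h_data=12G,h_rt=23:00:00,gpu,RTX2080Ti
#$ -cwd

# Load the host CUDA libraries before launching the container
# (the module name is a placeholder; see `module avail cuda`).
module load cuda/9.1.85

# --nv exposes the host GPU to the container.
singularity exec --nv qunex_suite.sif qunex hcp_diffusion ...
```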

Also, when uploading failure logs, it is best to provide both the comlog and the runlog (usually both are created); this will give me some additional insight. By default these are in processing/logs/comlogs and processing/logs/runlogs.

As always, we advise you to use the latest QuNex version, which is 0.94.14 at the moment.

Let me know how this goes.

A last-resort option would be to use --hcp_dwi_nogpu when running hcp_diffusion. This runs everything without GPU support; the results will be the same, but it will take much longer to finish.
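Based on the options from your first post, that would just mean appending the flag to --qunex_options, roughly:

```shell
# Appending the CPU-only flag to the options from the first post
# (long paths shortened with "..." for readability).
--qunex_options="--sessions=.../hcpep_batch.txt --sessionsfolder=.../sessions --parsessions=1 --overwrite=yes --hcp_dwi_nogpu"
```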

Hi Jure,

Sorry for the delayed testing and response and thank you for the suggestions!
It turns out that the initial error was caused by an import error earlier in the processing steps. After that was fixed, I switched to the newest version of the container, and it worked well for both the GPU and CPU options. Thank you again for your help!

Best,
Haley

Glad we resolved it!
