[RESOLVED] Issue running FSLBedpostxGPU

Description:

I am attempting to run the FSL bedpostx command on the HCP-unrelated sample. However, when I run the command using qunexContainer, the SLURM job fails but doesn’t report an error. When I run it without the container, there seems to be an issue with reading DICOM files. QuNex command calls and outputs are listed below.

Call:
All calls were run from: /gpfs/project/fas/n3/Studies/Connectome/processing/logs

sourceQunex
qunexContainer FSLBedpostxGPU \
--sessionsfolder="/gpfs/project/fas/n3/Studies/Connectome/subjects" \
--sessions="101410, 102008, 102109, 102614" \
--fibers='3' \
--burnin='3000' \
--model='3' \
--overwrite="no" \
--cores=4 \
--scheduler="SLURM,time=1-00:00:00,ntasks=4,cpus-per-task=1,mem-per-cpu=20000,partition=pi_anticevic_gpu" \
--container="/gpfs/project/fas/n3/software/Singularity/qunex_suite-latest.sif"

And I also tried:

sshgpu
sourceQunex

qunex FSLBedpostxGPU \
--sessionsfolder="/gpfs/project/fas/n3/Studies/Connectome/subjects" \
--sessions="101410, 102008, 102109, 102614" \
--fibers='3' \
--burnin='3000' \
--model='3' \
--overwrite="no" \
--cores=4 \
--scheduler="SLURM,time=1-00:00:00,ntasks=4,cpus-per-task=1,mem-per-cpu=20000,partition=pi_anticevic_gpu" 

Logs:
Runlogs and comlogs are not being generated.

Example slurm output when running qunexContainer call:

/gpfs/project/fas/n3/Studies/Connectome/processing/logs/slurm-9081999.out

Terminal output when not using the container:

     ........................ Running Qu|Nex v0.62.6 ........................ 

Traceback (most recent call last):
  File "/gpfs/project/fas/n3/software/qunex/niutilities/gmri", line 4, in <module>
    import niutilities as niu
  File "/gpfs/loomis/pi/n3/software/qunex/niutilities/niutilities/__init__.py", line 2, in <module>
    import g_dicom
  File "/gpfs/loomis/pi/n3/software/qunex/niutilities/niutilities/g_dicom.py", line 46, in <module>
    import dicom.filereader as dfr
ImportError: No module named dicom.filereader


Hi Amber, we will assign a developer to this issue and will let you know as soon as we have a solution.

Hi Amber,

when using qunexContainer with a Singularity container (.sif file) to execute commands that require a GPU (CUDA), you need to specify the --nv flag. See the CUDA/GPU processing section in oriadev / qunex / wiki / UsageDocs / RunningQunexContainer on Bitbucket. The reason is that the system needs to give the container access to the GPU libraries and drivers. So your call should look like this:

qunexContainer DWIFSLbedpostxGPU \
--sessionsfolder="/gpfs/project/fas/n3/Studies/Connectome/subjects" \
--sessions="101410, 102008, 102109, 102614" \
--fibers='3' \
--burnin='3000' \
--model='3' \
--overwrite="no" \
--parsessions=4 \
--scheduler="SLURM,time=1-00:00:00,ntasks=4,cpus-per-task=1,mem-per-cpu=20000,partition=pi_anticevic_gpu" \
--container="/gpfs/project/fas/n3/software/Singularity/qunex_suite-latest.sif"
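For reference, the --nv flag presumably maps onto Singularity’s own --nv option when qunexContainer builds the container call. Outside of QuNex, the equivalent bare Singularity invocation would look roughly like this (a sketch using the image path from your call; I have not run this on your cluster):

```shell
# Hedged sketch: Singularity's --nv flag binds the host's NVIDIA libraries
# and device files into the container so CUDA tools inside it can see the GPU.
singularity exec --nv \
    /gpfs/project/fas/n3/software/Singularity/qunex_suite-latest.sif \
    nvidia-smi
```

If nvidia-smi inside the container lists a device, the GPU pass-through is working.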

I also made two small changes. FSLBedpostxGPU is now known as DWIFSLbedpostxGPU (note that this will change again in the near future, when we make the command naming format consistent across the whole suite). I also changed the cores parameter to parsessions (its new name). Using the old/deprecated names would also work, but you would see some warnings.

Your command does not work when running from source because it seems you have some issues with your environment (no privileges to access some files, missing libraries …). Try running qunex_environment_status to see what is going on there.
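As a side note on the ImportError itself: pydicom renamed its top-level module from `dicom` to `pydicom` in version 1.0, so an environment that only has a newer pydicom would produce exactly this error in code that does `import dicom.filereader`. A minimal, hedged way to check which generation your active Python environment sees (the helper function name is mine, not part of QuNex; substitute the interpreter QuNex actually uses for python3):

```shell
# Hedged sketch: report which pydicom generation is importable.
# "old"  -> pre-1.0 `dicom` module (what g_dicom.py expects)
# "new"  -> pydicom >= 1.0, where the module was renamed
# "none" -> neither is installed
check_dicom() {
  if python3 -c "import dicom" 2>/dev/null; then
    echo "old"
  elif python3 -c "import pydicom" 2>/dev/null; then
    echo "new"
  else
    echo "none"
  fi
}
check_dicom
```

If this prints "new" or "none", the source install is missing the old-style pydicom that this QuNex version expects.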

Regards, Jure

Hi Jure,

Thank you for the command alterations. I’ve tried running the command you posted with the --nv flag, like so, from the /gpfs/project/fas/n3/Studies/Connectome/processing/logs/ directory.

qunexContainer DWIFSLbedpostxGPU \
--sessionsfolder="/gpfs/project/fas/n3/Studies/Connectome/subjects" \
--sessions="110616, 110613" \
--fibers='3' \
--burnin='3000' \
--model='3' \
--nv \
--overwrite="yes" \
--parsessions=4 \
--scheduler="SLURM,time=1-00:00:00,ntasks=4,cpus-per-task=1,mem-per-cpu=20000,partition=pi_anticevic_gpu" \
--container="/gpfs/project/fas/n3/software/Singularity/qunex_suite-latest.sif"

A sample output slurm file is here: /gpfs/project/fas/n3/Studies/Connectome/processing/logs/slurm-11655680.out

I see this error in the slurm file:

cuda error at CUDA/init_gpu.cu:17. no CUDA-capable device is detected
ERROR: init_gpu: no CUDA-capable device is detected

I’ve tried activating the QuNex environment, which should have taken care of my environment variables, and there don’t seem to be any red flags when I run qunex_environment_status.

Any help would be greatly appreciated!

Hi Amber,

a possible issue is that you selected a GPU node but did not allocate a GPU. Try:

--scheduler="SLURM,time=12:00:00,ntasks=1,cpus-per-task=1,mem-per-cpu=16000,partition=pi_anticevic_gpu,gres=gpu:1,jobname=FSLBedpostxGPU"

Note the gres=gpu:1, which allocates a GPU device to your job.
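A quick way to confirm the allocation, assuming standard SLURM tooling on your cluster (a sketch; partition name taken from your call):

```shell
# Hedged sketch: request one GPU interactively and check that CUDA can see it.
srun --partition=pi_anticevic_gpu --gres=gpu:1 --time=00:05:00 nvidia-smi

# Inside a running job, SLURM records the devices it granted; an empty value
# means no GPU was allocated to the job.
echo "$CUDA_VISIBLE_DEVICES"
```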

Cheers, Jure

Hi Jure, I have gotten bedpostx to run with your modifications, thank you.

I’ve moved on to trying the ProbTrackx command and am again running into issues, this time a different one: the job ends after a few seconds, no comlogs or runlogs are generated, and the slurm output doesn’t report an error. Happy to post this as a separate issue if need be.

command run from: /gpfs/project/fas/n3/Studies/Connectome/processing/logs

command:
qunexContainer ProbtrackxGPUDense \
--sessionsfolder="/gpfs/project/fas/n3/Studies/Connectome/subjects" \
--sessions="107321, 108121" \
--omatrix1='yes' \
--omatrix3='yes' \
--nv \
--overwrite="yes" \
--parsessions=4 \
--scheduler="SLURM,time=12:00:00,ntasks=1,cpus-per-task=1,mem-per-cpu=16000,partition=pi_anticevic_gpu,gres=gpu:1,jobname=ProbtrackxGPUDense" \
--container="/gpfs/project/fas/n3/software/Singularity/qunex_suite-latest.sif"

slurm output: /gpfs/project/fas/n3/Studies/Connectome/processing/logs/slurm-12172811.out

Thank you again for your help so far!

Hi Amber,

It seems that ProbtrackxGPUDense needs to be revised in accordance with the latest changes in the codebase. I will let you know once I review and debug it (probably next week).

Cheers, Jure

Hi Amber,

Sorry for the late reply; we are quite busy finalizing things for the QuNex public release. Before running ProbtrackxGPUDense you need to run pre-tractography (DWIpreTractography). But even with this, it could happen that ProbtrackxGPUDense will not work. I have fixed this in the develop version of the codebase. How urgent is this for you? If it is urgent, please contact me offline and we will find a workaround.

Cheers, Jure