[RESOLVED] Issue running DWIFSLbedpostxGPU

Description:

I am once again running DWIFSLbedpostxGPU; however, I'm using commands that used to work but now produce an error. I source the QuNex environment (source $TOOLS/$QUNEXREPO/env/qunex_environment.sh) before running the call.

Call:
I run the following command from

/gpfs/project/fas/n3/Studies/Connectome/processing/logs/bedpostx/

qunexContainer DWIFSLbedpostxGPU \
  --sessionsfolder="/gpfs/project/fas/n3/Studies/Connectome/subjects" \
  --sessions="116524" \
  --fibers='3' \
  --burnin='3000' \
  --model='3' \
  --nv \
  --overwrite="yes" \
  --parsessions=4 \
  --scheduler="SLURM,time=12:00:00,ntasks=1,cpus-per-task=1,mem-per-cpu=16000,partition=pi_anticevic_gpu,gres=gpu:1,jobname=FSLBedpostxGPU" \
  --container="/gpfs/project/fas/n3/software/Singularity/qunex_suite-latest.sif"

Logs:

/gpfs/project/fas/n3/Studies/Connectome/processing/logs/bedpostx/slurm-30252864.out

I do see the following error in the log file, so my first guess is that there is a permissions issue:

/opt/qunex/qx_library/etc/fsl_gpu_binaries/bedpostx_gpu_cuda_9.1/bedpostx_gpu/bin/xfibres_gpu: error while loading shared libraries: libbedpostx_cuda.so: cannot open shared object file: No such file or directory

Hi Amber,

On HPC (high performance computing) systems you often have to load CUDA libraries (the libraries that enable GPU execution) manually. To do that, add --bash="module load CUDA/9.1.85" to your command call.
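To make it concrete, the --bash value is injected into the job script that gets submitted to SLURM, so the CUDA module is loaded on the compute node before bedpostx starts. Below is a trimmed sketch of such a job script; the real one QuNex generates contains additional logging and bookkeeping:

#!/bin/sh
#SBATCH --partition=pi_anticevic_gpu
#SBATCH --gres=gpu:1
#SBATCH --time=12:00:00
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
#SBATCH --mem-per-cpu=16000
#SBATCH --job-name=FSLBedpostxGPU

module load CUDA/9.1.85    # injected by --bash
# ... followed by the generated dwi_fsl_bedpostx_gpu run script and log handling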

I also made a couple of smaller adjustments that are not necessary for the execution to work, but they make the command call cleaner. First, the parsessions parameter (the number of sessions run in parallel) is redundant in your case, as you are executing the command on a single session (116524). Next, I replaced qunexContainer with qunex_container and DWIFSLbedpostxGPU with dwi_fsl_bedpostx_gpu. The names you are using are legacy names; they would still work (everything is backwards compatible), but for clarity and consistency we encourage you to use the new snake_case command names.

So your command should be:

qunex_container dwi_fsl_bedpostx_gpu \
  --sessionsfolder="/gpfs/project/fas/n3/Studies/Connectome/subjects" \
  --sessions="116524" \
  --fibers="3" \
  --burnin="3000" \
  --model="3" \
  --overwrite="yes" \
  --bash="module load CUDA/9.1.85" \
  --nv \
  --scheduler="SLURM,time=12:00:00,ntasks=1,cpus-per-task=1,mem-per-cpu=16000,partition=pi_anticevic_gpu,gres=gpu:1,jobname=FSLBedpostxGPU" \
  --container="/gpfs/project/fas/n3/software/Singularity/qunex_suite-latest.sif"

Let me know how it goes.

Cheers, Jure

Hi Jure, thanks for the quick reply.

I ran your command exactly, but the error persists.

qunex_container dwi_fsl_bedpostx_gpu \
  --sessionsfolder="/gpfs/project/fas/n3/Studies/Connectome/subjects" \
  --sessions="116524" \
  --fibers="3" \
  --burnin="3000" \
  --model="3" \
  --overwrite="yes" \
  --bash="module load CUDA/9.1.85" \
  --nv \
  --scheduler="SLURM,time=12:00:00,ntasks=1,cpus-per-task=1,mem-per-cpu=16000,partition=pi_anticevic_gpu,gres=gpu:1,jobname=FSLBedpostxGPU" \
  --container="/gpfs/project/fas/n3/software/Singularity/qunex_suite-latest.sif"

The new slurm output is located here:

/gpfs/project/fas/n3/Studies/Connectome/processing/logs/bedpostx/slurm-30361642.out

Hi Amber,

I am still running some tests on this; I will get back to you once I have a tested solution in hand.

Jure

I believe I got to the bottom of it. Unfortunately, the container fix will only land with the next update (0.90.9), which should happen soon. In the meantime, you can run bedpostx outside of the container. Below is an example of a run that works:

export STUDYFOLDER="/gpfs/project/fas/n3/Studies/QuNexAcceptTest/Results/jd2528/0.90.8/QuNex_acceptance_MB_Yale_BIC_Prisma_20210629_040649"

qunex dwi_fsl_bedpostx_gpu \
	--sessionsfolder="${STUDYFOLDER}/sessions" \
	--sessions="${STUDYFOLDER}/processing/batch.txt" \
	--overwrite="yes" \
	--bash="module load CUDA/9.1.85" \
	--scheduler="SLURM,time=12:00:00,ntasks=1,cpus-per-task=1,mem-per-cpu=64000,partition=pi_anticevic_gpu,gres=gpu:1,jobname=bedpostx"

Okay, I'll wait for the container update. In the meantime I tried to run it in the terminal and have attached the full terminal output. The command runs, but it generates a bunch of empty files/NIfTI images. I tried it with and without the --nv flag and got the same result; all logs here are associated with the command that includes the --nv flag.

Command output log:
/gpfs/project/fas/n3/Studies/Connectome/processing/logs/comlogs/error_dwi_fsl_bedpostx_gpu_116524_2021-07-19_15.46.0445751991.log - here I notice the same issue as before: /gpfs/project/fas/n3/software/qunex/qx_library/etc/fsl_gpu_binaries/bedpostx_gpu_cuda_9.1/bedpostx_gpu/bin/xfibres_gpu: error while loading shared libraries: libbedpostx_cuda.so: cannot open shared object file: No such file or directory

Command log:
Here it seems like the command is cut off after "--species=": /gpfs/project/fas/n3/Studies/Connectome/processing/logs/runlogs/Log-dwi_fsl_bedpostx_gpu_2021-07-19_15.46.0445751991.log
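For reference, this is roughly how I checked the logs and outputs (a rough sketch; the paths are the ones above, and the output-directory pattern in the last command is just a guess at where bedpostx writes):

# which comlogs contain the shared-library error
grep -l "libbedpostx_cuda.so" /gpfs/project/fas/n3/Studies/Connectome/processing/logs/comlogs/*dwi_fsl_bedpostx_gpu*.log

# confirm the loader cannot resolve the library for the binary named in the error
ldd /gpfs/project/fas/n3/software/qunex/qx_library/etc/fsl_gpu_binaries/bedpostx_gpu_cuda_9.1/bedpostx_gpu/bin/xfibres_gpu | grep "not found"

# list the (near-)empty NIfTI images left behind for this session
find /gpfs/project/fas/n3/Studies/Connectome/subjects/116524 -path "*bedpost*" -name "*.nii.gz" -size -2k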

Command run and full terminal output:

(qunex)[QuNex bedpostx]$ qunex dwi_fsl_bedpostx_gpu \
> --sessionsfolder="/gpfs/project/fas/n3/Studies/Connectome/subjects" \
> --sessions="116524" \
> --burnin='3000' \
> --model='3' \
> --nv \
> --overwrite="yes" \
> --bash="module load CUDA/9.1.85" \
> --scheduler="SLURM,time=12:00:00,ntasks=1,cpus-per-task=1,mem-per-cpu=64000,partition=pi_anticevic_gpu,gres=gpu:1,jobname=bedpostx"

 ........................ Running QuNex v0.90.8 ........................ 


WARNING: You are attempting to execute QuNex command using an outdated QuNex file hierarchy: 

       Found: --> /gpfs/project/fas/n3/Studies/Connectome/subjects

     Note: Current version of QuNex supports the following default specification: 
       --> /gpfs/project/fas/n3/Studies/Connectome/sessions

       QuNex will proceed but please consider renaming your directories per latest specs:
          https://bitbucket.org/oriadev/qunex/wiki/Overview/DataHierarchy


--- Full QuNex call for command: dwi_fsl_bedpostx_gpu 

. /gpfs/project/fas/n3/software/qunex/bash/qx_utilities/dwi_fsl_bedpostx_gpu.sh --sessionsfolder='/gpfs/project/fas/n3/Studies/Connectome/subjects' --session='116524' --fibers='' --weight='' --burnin='3000' --jumps='' --sample='' --model='3' --rician='' --overwrite='yes' --species= 

-------------------------------------------------------------- 




# Generated by QuNex 0.90.8 on 2021-07-19_15.46.1626724011.099514 /gpfs/loomis/pi/n3/Studies/Connectome/processing/logs/comlogs/tmp_schedule_2021-07-19_15.46.1626724011.099514.log
# /gpfs/loomis/pi/n3/Studies/Connectome/processing/logs/comlogs/tmp_schedule_2021-07-19_15.46.1626724011.099514.log
started running schedule at 2021-07-19 15:46:51, track progress in /gpfs/loomis/pi/n3/Studies/Connectome/processing/logs/comlogs/tmp_schedule_2021-07-19_15.46.1626724011.099514.log
# Generated by QuNex 0.90.8 on 2021-07-19_15.46.1626724011.099514
#
Started running schedule at 2021-07-19 15:46:51
call: gmri schedule command=". /gpfs/project/fas/n3/Studies/Connectome/processing/logs/comlogs/Run_dwi_fsl_bedpostx_gpu_116524_2021-07-19_15.46.0445751991.sh 2>&1 | tee -a /gpfs/project/fas/n3/Studies/Connectome/processing/logs/comlogs/tmp_dwi_fsl_bedpostx_gpu_116524_2021-07-19_15.46.0445751991.log; cat /gpfs/project/fas/n3/Studies/Connectome/processing/logs/comlogs/tmp_dwi_fsl_bedpostx_gpu_116524_2021-07-19_15.46.0445751991.log | grep 'Successful completion' > /gpfs/project/fas/n3/Studies/Connectome/processing/runchecks/CompletionCheck_dwi_fsl_bedpostx_gpu_2021-07-19_15.46.0445751991.Pass; cat /gpfs/project/fas/n3/Studies/Connectome/processing/logs/comlogs/tmp_dwi_fsl_bedpostx_gpu_116524_2021-07-19_15.46.0445751991.log | grep 'ERROR' > /gpfs/project/fas/n3/Studies/Connectome/processing/runchecks/CompletionCheck_dwi_fsl_bedpostx_gpu_2021-07-19_15.46.0445751991.Fail; if [[ -s /gpfs/project/fas/n3/Studies/Connectome/processing/runchecks/CompletionCheck_dwi_fsl_bedpostx_gpu_2021-07-19_15.46.0445751991.Pass && ! -s /gpfs/project/fas/n3/Studies/Connectome/processing/runchecks/CompletionCheck_dwi_fsl_bedpostx_gpu_2021-07-19_15.46.0445751991.Fail ]]; then mv /gpfs/project/fas/n3/Studies/Connectome/processing/logs/comlogs/tmp_dwi_fsl_bedpostx_gpu_116524_2021-07-19_15.46.0445751991.log /gpfs/project/fas/n3/Studies/Connectome/processing/logs/comlogs/done_dwi_fsl_bedpostx_gpu_116524_2021-07-19_15.46.0445751991.log; echo ''; echo ' ===> Successful completion of dwi_fsl_bedpostx_gpu. Check final QuNex log output:'; echo ''; echo '    /gpfs/project/fas/n3/Studies/Connectome/processing/logs/comlogs/done_dwi_fsl_bedpostx_gpu_116524_2021-07-19_15.46.0445751991.log'; echo ''; echo 'QUNEX PASSED!'; echo ''; else mv /gpfs/project/fas/n3/Studies/Connectome/processing/logs/comlogs/tmp_dwi_fsl_bedpostx_gpu_116524_2021-07-19_15.46.0445751991.log /gpfs/project/fas/n3/Studies/Connectome/processing/logs/comlogs/error_dwi_fsl_bedpostx_gpu_116524_2021-07-19_15.46.0445751991.log; echo ''; echo ' ===> ERROR during dwi_fsl_bedpostx_gpu. Check final QuNex error log output:'; echo ''; echo '    /gpfs/project/fas/n3/Studies/Connectome/processing/logs/comlogs/error_dwi_fsl_bedpostx_gpu_116524_2021-07-19_15.46.0445751991.log'; echo ''; echo ''; echo 'QUNEX FAILED!'; fi; if [[ -f 0 && ! -s 0 ]]; then echo 'delete' >> qunex_garbage0; fi; if [[ -s 1 ]]; then cat 1 | grep 'qunex' > qunex_garbage1; fi; if [[ -s 2 ]]; then cat 2 | grep 'FSL_FIX_MCRROOT' >> qunex_garbage2; fi; if [[ -s qunex_garbage0 ]]; then rm 0; rm qunex_garbage0; fi; if [[ -s qunex_garbage1 ]]; then rm 1; rm qunex_garbage1; fi; if [[ -s qunex_garbage2 ]]; then rm 2; rm qunex_garbage2; fi" bash="module load CUDA/9.1.85" settings="SLURM,time=12:00:00,ntasks=1,cpus-per-task=1,mem-per-cpu=64000,partition=pi_anticevic_gpu,gres=gpu:1,jobname=bedpostx"
-----------------------------------------
Submitting:
------------------------------
#!/bin/sh
#SBATCH --cpus-per-task=1
#SBATCH --gres=gpu:1
#SBATCH --partition=pi_anticevic_gpu
#SBATCH --ntasks=1
#SBATCH --time=12:00:00
#SBATCH --mem-per-cpu=64000
#SBATCH --job-name=bedpostx

module load CUDA/9.1.85
. /gpfs/project/fas/n3/Studies/Connectome/processing/logs/comlogs/Run_dwi_fsl_bedpostx_gpu_116524_2021-07-19_15.46.0445751991.sh 2>&1 | tee -a /gpfs/project/fas/n3/Studies/Connectome/processing/logs/comlogs/tmp_dwi_fsl_bedpostx_gpu_116524_2021-07-19_15.46.0445751991.log; cat /gpfs/project/fas/n3/Studies/Connectome/processing/logs/comlogs/tmp_dwi_fsl_bedpostx_gpu_116524_2021-07-19_15.46.0445751991.log | grep 'Successful completion' > /gpfs/project/fas/n3/Studies/Connectome/processing/runchecks/CompletionCheck_dwi_fsl_bedpostx_gpu_2021-07-19_15.46.0445751991.Pass; cat /gpfs/project/fas/n3/Studies/Connectome/processing/logs/comlogs/tmp_dwi_fsl_bedpostx_gpu_116524_2021-07-19_15.46.0445751991.log | grep 'ERROR' > /gpfs/project/fas/n3/Studies/Connectome/processing/runchecks/CompletionCheck_dwi_fsl_bedpostx_gpu_2021-07-19_15.46.0445751991.Fail; if [[ -s /gpfs/project/fas/n3/Studies/Connectome/processing/runchecks/CompletionCheck_dwi_fsl_bedpostx_gpu_2021-07-19_15.46.0445751991.Pass && ! -s /gpfs/project/fas/n3/Studies/Connectome/processing/runchecks/CompletionCheck_dwi_fsl_bedpostx_gpu_2021-07-19_15.46.0445751991.Fail ]]; then mv /gpfs/project/fas/n3/Studies/Connectome/processing/logs/comlogs/tmp_dwi_fsl_bedpostx_gpu_116524_2021-07-19_15.46.0445751991.log /gpfs/project/fas/n3/Studies/Connectome/processing/logs/comlogs/done_dwi_fsl_bedpostx_gpu_116524_2021-07-19_15.46.0445751991.log; echo ''; echo ' ===> Successful completion of dwi_fsl_bedpostx_gpu. Check final QuNex log output:'; echo ''; echo '    /gpfs/project/fas/n3/Studies/Connectome/processing/logs/comlogs/done_dwi_fsl_bedpostx_gpu_116524_2021-07-19_15.46.0445751991.log'; echo ''; echo 'QUNEX PASSED!'; echo ''; else mv /gpfs/project/fas/n3/Studies/Connectome/processing/logs/comlogs/tmp_dwi_fsl_bedpostx_gpu_116524_2021-07-19_15.46.0445751991.log /gpfs/project/fas/n3/Studies/Connectome/processing/logs/comlogs/error_dwi_fsl_bedpostx_gpu_116524_2021-07-19_15.46.0445751991.log; echo ''; echo ' ===> ERROR during dwi_fsl_bedpostx_gpu. Check final QuNex error log output:'; echo ''; echo '    /gpfs/project/fas/n3/Studies/Connectome/processing/logs/comlogs/error_dwi_fsl_bedpostx_gpu_116524_2021-07-19_15.46.0445751991.log'; echo ''; echo ''; echo 'QUNEX FAILED!'; fi; if [[ -f 0 && ! -s 0 ]]; then echo 'delete' >> qunex_garbage0; fi; if [[ -s 1 ]]; then cat 1 | grep 'qunex' > qunex_garbage1; fi; if [[ -s 2 ]]; then cat 2 | grep 'FSL_FIX_MCRROOT' >> qunex_garbage2; fi; if [[ -s qunex_garbage0 ]]; then rm 0; rm qunex_garbage0; fi; if [[ -s qunex_garbage1 ]]; then rm 1; rm qunex_garbage1; fi; if [[ -s qunex_garbage2 ]]; then rm 2; rm qunex_garbage2; fi

-----------------------------------------
Finished at 2021-07-19 15:46:51

===> Successful completion of task
-------------------------------------------------------------- 

   Data successfully submitted to scheduler 
   Scheduler details: SLURM,time=12:00:00,ntasks=1,cpus-per-task=1,mem-per-cpu=64000,partition=pi_anticevic_gpu,gres=gpu:1,jobname=bedpostx 
   Command log: /gpfs/project/fas/n3/Studies/Connectome/processing/logs/runlogs/Log-dwi_fsl_bedpostx_gpu_2021-07-19_15.46.0445751991.log 
   Command output: /gpfs/project/fas/n3/Studies/Connectome/processing/logs/comlogs/tmp_dwi_fsl_bedpostx_gpu_116524_2021-07-19_15.46.0445751991.log  

-------------------------------------------------------------- 

(qunex)[QuNex bedpostx]$ Submitted batch job 31578818

Yeah, this is my bad; the patch I prepared is only in our develop version. I just tested what you pasted above on the develop version and it works (bedpostx for the session you selected is running as I type this). To use the develop version on the Yale Grace system, type qx_setenv and then develop. To get back to production, use qx_setenv followed by master.
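In shell terms, the switch is just these two inputs (assuming qx_setenv is used exactly as described above, with the version name entered after it):

# on Grace: switch the QuNex environment to the develop version
qx_setenv
develop

# and back to production
qx_setenv
master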

The change will be propagated into master along with a new container release soon (I would say within a week).

I just released a new version on Grace; please test it and let me know how it goes.

Hi Jure,
I ran the following command and it worked perfectly. Thanks for resolving this.

qunex_container dwi_fsl_bedpostx_gpu \
--sessionsfolder="/gpfs/project/fas/n3/Studies/Connectome/subjects" \
--sessions="116524" \
--burnin='3000' \
--model='3' \
--nv \
--overwrite="yes" \
--bash="module load CUDA/9.1.85" \
--scheduler="SLURM,time=12:00:00,ntasks=1,cpus-per-task=1,mem-per-cpu=64000,partition=pi_anticevic_gpu,gres=gpu:1,jobname=bedpostx" \
--container="/gpfs/project/fas/n3/software/Singularity/qunex_suite-0.90.10.sif"

Thanks for reporting this. I will now label this issue as resolved.
