[RESOLVED] dwi_probtrackx_dense_gpu error

Description:

Hello, I’m processing my DWI data and completed the dwi_bedpostx_gpu and dwi_pre_tractography steps without issues. When I tried to run dwi_probtrackx_dense_gpu, it failed with an error. Any thoughts on what is causing it?

Thank you.

Estephan

P.S.: What is the purpose of the dwi_pre_tractography command? Is it equivalent to the “Registration within FDT” step described in the FSL wiki (https://fsl.fmrib.ox.ac.uk/fsl/fslwiki/FDT/UserGuide#Registration_within_FDT)? What are the expected output files from this command?

Call:

msi_resources_time=12:00:00; msi_resources_nodes=1; msi_resources_ntaskspernode=1; msi_resources_mem=64000; msi_queue=a100-4; msi_resources_gpu=gpu:a100:1; msi_resources_jobname=PROBTRACKX; \
study_sharedfolder=/home/moanae/shared/project_K99_ChrTMDHCP_qunex02; \
qunex_container dwi_probtrackx_dense_gpu \
--batchfile=${study_sharedfolder}/processing/batch_K99Aim2.txt --sessionsfolder=${study_sharedfolder}/sessions \
--omatrix1="yes" --omatrix3="yes" --nsamplesmatrix1="10000" --nsamplesmatrix3="3000" --distancecorrection="yes" --storestreamlineslength="yes" --overwrite="yes" \
--nv \
--bash_pre="module load cuda/11.2" \
--scheduler=SLURM,time=${msi_resources_time},nodes=${msi_resources_nodes},cpus-per-task=${msi_resources_ntaskspernode},mem=${msi_resources_mem},partition=${msi_queue},gres=${msi_resources_gpu},jobname=${msi_resources_jobname} \
--bind=${study_sharedfolder}:${study_sharedfolder} --container=${HOME}/qunex/qunex_suite-0.98.0.sif

Logs:

# Generated by QuNex 0.98.0 on 2023-06-11_14.18.55.930987
#


-- qunex.sh: Specified Command-Line Options - Start --
   Study Folder: /home/moanae/shared/project_K99_ChrTMDHCP_qunex02
   Sessions Folder: /home/moanae/shared/project_K99_ChrTMDHCP_qunex02/sessions
   Session: 10001
   probtrackX GPU scripts Folder: /opt/qunex/bash/qx_utilities/diffusion_tractography_dense/tractography_gpu_scripts
   Compute Matrix1: yes
   Compute Matrix3: yes
   Number of samples for Matrix1: 10000
   Number of samples for Matrix3: 3000
   Distance correction: yes
   Store streamlines length: yes
   Overwrite prior run: yes
   No GPU: no
-- qunex.sh: Specified Command-Line Options - End --

------------------------- Start of work --------------------------------


   --- probtrackX GPU for session 10001...


 --- Removing existing Probtrackxgpu Matrix1 dense run for 10001...


Checking if ProbtrackX Matrix 1 and dense connectome was completed on 10001...


ProbtrackX Matrix 1 solution and dense connectome incomplete for 10001. Starting run with 10000 samples...

Running the following probtrackX GPU command: 

---------------------------

   /opt/qunex/bash/qx_utilities/diffusion_tractography_dense/tractography_gpu_scripts/run_matrix1.sh /home/moanae/shared/project_K99_ChrTMDHCP_qunex02/sessions 10001 10000 yes yes no

---------------------------

-- Queueing Probtrackx

PROBTRACKX2 VERSION GPU
Log directory is: /home/moanae/shared/project_K99_ChrTMDHCP_qunex02/sessions/10001/hcp/10001/MNINonLinear/Results/Tractography
Running in seedmask mode
/home/moanae/shared/project_K99_ChrTMDHCP_qunex02/sessions/10001/hcp/10001/MNINonLinear/Results/Tractography/commands_Mat1.sh: line 1: 1014741 Killed                  /opt/fsl/fsl/bin/probtrackx2_gpu10.2 --loopcheck --forcedir --fibthresh=0.01 --cthr=0.2 --sampvox=2 --randfib=1 --nsamples=10000 --nsteps=2000 --steplength=0.5 --pd --ompl --samples=/home/moanae/shared/project_K99_ChrTMDHCP_qunex02/sessions/10001/hcp/10001/T1w/Diffusion.bedpostX/merged --mask=/home/moanae/shared/project_K99_ChrTMDHCP_qunex02/sessions/10001/hcp/10001/T1w/Diffusion.bedpostX/nodif_brain_mask --meshspace=caret --seed=/home/moanae/shared/project_K99_ChrTMDHCP_qunex02/sessions/10001/hcp/10001/MNINonLinear/Results/Tractography/Mat1_seeds --seedref=/opt/fsl/fsl/data/standard/MNI152_T1_2mm_brain_mask --xfm=/home/moanae/shared/project_K99_ChrTMDHCP_qunex02/sessions/10001/hcp/10001/MNINonLinear/xfms/standard2acpc_dc --invxfm=/home/moanae/shared/project_K99_ChrTMDHCP_qunex02/sessions/10001/hcp/10001/MNINonLinear/xfms/acpc_dc2standard --stop=/home/moanae/shared/project_K99_ChrTMDHCP_qunex02/sessions/10001/hcp/10001/MNINonLinear/Results/Tractography/stop --wtstop=/home/moanae/shared/project_K99_ChrTMDHCP_qunex02/sessions/10001/hcp/10001/MNINonLinear/Results/Tractography/wtstop --forcefirststep --waypoints=/home/moanae/shared/project_K99_ChrTMDHCP_qunex02/sessions/10001/hcp/10001/MNINonLinear/ROIs/Whole_Brain_Trajectory_ROI_2 --omatrix1 --dir=/home/moanae/shared/project_K99_ChrTMDHCP_qunex02/sessions/10001/hcp/10001/MNINonLinear/Results/Tractography

-- Queueing Post-Matrix 1 Calls


While running:
/opt/workbench/workbench-1.5.0/bin_rh_linux64/../exe_rh_linux64/wb_command -probtrackx-dot-convert /home/moanae/shared/project_K99_ChrTMDHCP_qunex02/sessions/10001/hcp/10001/MNINonLinear/Results/Tractography/fdt_matrix1.dot /home/moanae/shared/project_K99_ChrTMDHCP_qunex02/sessions/10001/hcp/10001/MNINonLinear/Results/Tractography/Mat1.dconn.nii -row-cifti /opt/qunex/qx_library/etc/diffusion_tractography_dense/templates/91282_Greyordinates.dscalar.nii COLUMN -col-cifti /opt/qunex/qx_library/etc/diffusion_tractography_dense/templates/91282_Greyordinates.dscalar.nii COLUMN

ERROR: error opening text file '/home/moanae/shared/project_K99_ChrTMDHCP_qunex02/sessions/10001/hcp/10001/MNINonLinear/Results/Tractography/fdt_matrix1.dot'


While running:
/opt/workbench/workbench-1.5.0/bin_rh_linux64/../exe_rh_linux64/wb_command -cifti-transpose /home/moanae/shared/project_K99_ChrTMDHCP_qunex02/sessions/10001/hcp/10001/MNINonLinear/Results/Tractography/Mat1.dconn.nii /home/moanae/shared/project_K99_ChrTMDHCP_qunex02/sessions/10001/hcp/10001/MNINonLinear/Results/Tractography/Mat1_transp.dconn.nii

ERROR: failed to open file '/home/moanae/shared/project_K99_ChrTMDHCP_qunex02/sessions/10001/hcp/10001/MNINonLinear/Results/Tractography/Mat1.dconn.nii', file does not exist, or folder permissions prevent seeing it


While running:
/opt/workbench/workbench-1.5.0/bin_rh_linux64/../exe_rh_linux64/wb_command -cifti-average /home/moanae/shared/project_K99_ChrTMDHCP_qunex02/sessions/10001/hcp/10001/MNINonLinear/Results/Tractography/Conn1.dconn.nii -cifti /home/moanae/shared/project_K99_ChrTMDHCP_qunex02/sessions/10001/hcp/10001/MNINonLinear/Results/Tractography/Mat1.dconn.nii -cifti /home/moanae/shared/project_K99_ChrTMDHCP_qunex02/sessions/10001/hcp/10001/MNINonLinear/Results/Tractography/Mat1_transp.dconn.nii

ERROR: failed to open file '/home/moanae/shared/project_K99_ChrTMDHCP_qunex02/sessions/10001/hcp/10001/MNINonLinear/Results/Tractography/Mat1.dconn.nii', file does not exist, or folder permissions prevent seeing it


While running:
/opt/workbench/workbench-1.5.0/bin_rh_linux64/../exe_rh_linux64/wb_command -cifti-reduce /home/moanae/shared/project_K99_ChrTMDHCP_qunex02/sessions/10001/hcp/10001/MNINonLinear/Results/Tractography/Conn1.dconn.nii SUM /home/moanae/shared/project_K99_ChrTMDHCP_qunex02/sessions/10001/hcp/10001/MNINonLinear/Results/Tractography/Conn1_sum.dscalar.nii

ERROR: failed to open file '/home/moanae/shared/project_K99_ChrTMDHCP_qunex02/sessions/10001/hcp/10001/MNINonLinear/Results/Tractography/Conn1.dconn.nii', file does not exist, or folder permissions prevent seeing it

mv: cannot stat ‘/home/moanae/shared/project_K99_ChrTMDHCP_qunex02/sessions/10001/hcp/10001/MNINonLinear/Results/Tractography/waytotal’: No such file or directory
cat: /home/moanae/shared/project_K99_ChrTMDHCP_qunex02/sessions/10001/hcp/10001/MNINonLinear/Results/Tractography/Conn1_waytotal: No such file or directory

While running:
/opt/workbench/workbench-1.5.0/bin_rh_linux64/../exe_rh_linux64/wb_command -cifti-math a/ /home/moanae/shared/project_K99_ChrTMDHCP_qunex02/sessions/10001/hcp/10001/MNINonLinear/Results/Tractography/Conn1_waytotnorm.dconn.nii -var a /home/moanae/shared/project_K99_ChrTMDHCP_qunex02/sessions/10001/hcp/10001/MNINonLinear/Results/Tractography/Conn1.dconn.nii

ERROR: failed to open file '/home/moanae/shared/project_K99_ChrTMDHCP_qunex02/sessions/10001/hcp/10001/MNINonLinear/Results/Tractography/Conn1.dconn.nii', file does not exist, or folder permissions prevent seeing it


While running:
/opt/workbench/workbench-1.5.0/bin_rh_linux64/../exe_rh_linux64/wb_command -cifti-math 'log(1+a)' /home/moanae/shared/project_K99_ChrTMDHCP_qunex02/sessions/10001/hcp/10001/MNINonLinear/Results/Tractography/Conn1_waytotnorm_log.dconn.nii -var a /home/moanae/shared/project_K99_ChrTMDHCP_qunex02/sessions/10001/hcp/10001/MNINonLinear/Results/Tractography/Conn1_waytotnorm.dconn.nii

ERROR: failed to open file '/home/moanae/shared/project_K99_ChrTMDHCP_qunex02/sessions/10001/hcp/10001/MNINonLinear/Results/Tractography/Conn1_waytotnorm.dconn.nii', file does not exist, or folder permissions prevent seeing it

gzip: /home/moanae/shared/project_K99_ChrTMDHCP_qunex02/sessions/10001/hcp/10001/MNINonLinear/Results/Tractography/Conn1.dconn.nii: No such file or directory
gzip: /home/moanae/shared/project_K99_ChrTMDHCP_qunex02/sessions/10001/hcp/10001/MNINonLinear/Results/Tractography/Conn1_waytotnorm.dconn.nii: No such file or directory
gzip: /home/moanae/shared/project_K99_ChrTMDHCP_qunex02/sessions/10001/hcp/10001/MNINonLinear/Results/Tractography/Conn1_waytotnorm_log.dconn.nii: No such file or directory
gzip: /home/moanae/shared/project_K99_ChrTMDHCP_qunex02/sessions/10001/hcp/10001/MNINonLinear/Results/Tractography/Mat1.dconn.nii: No such file or directory
gzip: /home/moanae/shared/project_K99_ChrTMDHCP_qunex02/sessions/10001/hcp/10001/MNINonLinear/Results/Tractography/Mat1_transp.dconn.nii: No such file or directory
gzip: /home/moanae/shared/project_K99_ChrTMDHCP_qunex02/sessions/10001/hcp/10001/MNINonLinear/Results/Tractography/fdt_matrix1.dot: No such file or directory

-- Matrix 1 Probtrackx Completed successfully.

ERROR: dwi_probtracx_dense_gpu for 10001 failed!

 --- Removing existing Probtrackxgpu Matrix3 dense run for 10001...


Checking if ProbtrackX Matrix 3 and dense connectome was completed on 10001...


ProbtrackX Matrix 3 solution and dense connectome incomplete for 10001. Starting run with 3000 samples...

Running the following probtrackX GPU command: 

---------------------------

   /opt/qunex/bash/qx_utilities/diffusion_tractography_dense/tractography_gpu_scripts/run_matrix3.sh /home/moanae/shared/project_K99_ChrTMDHCP_qunex02/sessions 10001 3000 yes yes no

---------------------------

-- Queueing Probtrackx

PROBTRACKX2 VERSION GPU
Log directory is: /home/moanae/shared/project_K99_ChrTMDHCP_qunex02/sessions/10001/hcp/10001/MNINonLinear/Results/Tractography
Running in seedmask mode
/home/moanae/shared/project_K99_ChrTMDHCP_qunex02/sessions/10001/hcp/10001/MNINonLinear/Results/Tractography/commands_Mat3.sh: line 1: 1015368 Killed                  /opt/fsl/fsl/bin/probtrackx2_gpu10.2 --loopcheck --forcedir --fibthresh=0.01 --cthr=0.2 --sampvox=2 --randfib=1 --nsamples=3000 --nsteps=2000 --steplength=0.5 --pd --ompl --samples=/home/moanae/shared/project_K99_ChrTMDHCP_qunex02/sessions/10001/hcp/10001/T1w/Diffusion.bedpostX/merged --mask=/home/moanae/shared/project_K99_ChrTMDHCP_qunex02/sessions/10001/hcp/10001/T1w/Diffusion.bedpostX/nodif_brain_mask --meshspace=caret --seed=/home/moanae/shared/project_K99_ChrTMDHCP_qunex02/sessions/10001/hcp/10001/MNINonLinear/ROIs/Whole_Brain_Trajectory_ROI_2 --seedref=/opt/fsl/fsl/data/standard/MNI152_T1_2mm_brain_mask --xfm=/home/moanae/shared/project_K99_ChrTMDHCP_qunex02/sessions/10001/hcp/10001/MNINonLinear/xfms/standard2acpc_dc --invxfm=/home/moanae/shared/project_K99_ChrTMDHCP_qunex02/sessions/10001/hcp/10001/MNINonLinear/xfms/acpc_dc2standard --stop=/home/moanae/shared/project_K99_ChrTMDHCP_qunex02/sessions/10001/hcp/10001/MNINonLinear/Results/Tractography/stop --wtstop=/home/moanae/shared/project_K99_ChrTMDHCP_qunex02/sessions/10001/hcp/10001/MNINonLinear/Results/Tractography/wtstop --waypoints=/home/moanae/shared/project_K99_ChrTMDHCP_qunex02/sessions/10001/hcp/10001/MNINonLinear/ROIs/Whole_Brain_Trajectory_ROI_2 --omatrix3 --target3=/home/moanae/shared/project_K99_ChrTMDHCP_qunex02/sessions/10001/hcp/10001/MNINonLinear/Results/Tractography/Mat3_targets --dir=/home/moanae/shared/project_K99_ChrTMDHCP_qunex02/sessions/10001/hcp/10001/MNINonLinear/Results/Tractography

-- Queueing Post-Matrix 3 Calls


While running:
/opt/workbench/workbench-1.5.0/bin_rh_linux64/../exe_rh_linux64/wb_command -probtrackx-dot-convert /home/moanae/shared/project_K99_ChrTMDHCP_qunex02/sessions/10001/hcp/10001/MNINonLinear/Results/Tractography/fdt_matrix3.dot /home/moanae/shared/project_K99_ChrTMDHCP_qunex02/sessions/10001/hcp/10001/MNINonLinear/Results/Tractography/Conn3.dconn.nii -row-cifti /opt/qunex/qx_library/etc/diffusion_tractography_dense/templates/91282_Greyordinates.dscalar.nii COLUMN -col-cifti /opt/qunex/qx_library/etc/diffusion_tractography_dense/templates/91282_Greyordinates.dscalar.nii COLUMN -make-symmetric

ERROR: error opening text file '/home/moanae/shared/project_K99_ChrTMDHCP_qunex02/sessions/10001/hcp/10001/MNINonLinear/Results/Tractography/fdt_matrix3.dot'


While running:
/opt/workbench/workbench-1.5.0/bin_rh_linux64/../exe_rh_linux64/wb_command -cifti-reduce /home/moanae/shared/project_K99_ChrTMDHCP_qunex02/sessions/10001/hcp/10001/MNINonLinear/Results/Tractography/Conn3.dconn.nii SUM /home/moanae/shared/project_K99_ChrTMDHCP_qunex02/sessions/10001/hcp/10001/MNINonLinear/Results/Tractography/Conn3_sum.dscalar.nii

ERROR: failed to open file '/home/moanae/shared/project_K99_ChrTMDHCP_qunex02/sessions/10001/hcp/10001/MNINonLinear/Results/Tractography/Conn3.dconn.nii', file does not exist, or folder permissions prevent seeing it

mv: cannot stat ‘/home/moanae/shared/project_K99_ChrTMDHCP_qunex02/sessions/10001/hcp/10001/MNINonLinear/Results/Tractography/waytotal’: No such file or directory
cat: /home/moanae/shared/project_K99_ChrTMDHCP_qunex02/sessions/10001/hcp/10001/MNINonLinear/Results/Tractography/Conn3_waytotal: No such file or directory

While running:
/opt/workbench/workbench-1.5.0/bin_rh_linux64/../exe_rh_linux64/wb_command -cifti-math a/ /home/moanae/shared/project_K99_ChrTMDHCP_qunex02/sessions/10001/hcp/10001/MNINonLinear/Results/Tractography/Conn3_waytotnorm.dconn.nii -var a /home/moanae/shared/project_K99_ChrTMDHCP_qunex02/sessions/10001/hcp/10001/MNINonLinear/Results/Tractography/Conn3.dconn.nii

ERROR: failed to open file '/home/moanae/shared/project_K99_ChrTMDHCP_qunex02/sessions/10001/hcp/10001/MNINonLinear/Results/Tractography/Conn3.dconn.nii', file does not exist, or folder permissions prevent seeing it


While running:
/opt/workbench/workbench-1.5.0/bin_rh_linux64/../exe_rh_linux64/wb_command -cifti-math 'log(1+a)' /home/moanae/shared/project_K99_ChrTMDHCP_qunex02/sessions/10001/hcp/10001/MNINonLinear/Results/Tractography/Conn3_waytotnorm_log.dconn.nii -var a /home/moanae/shared/project_K99_ChrTMDHCP_qunex02/sessions/10001/hcp/10001/MNINonLinear/Results/Tractography/Conn3_waytotnorm.dconn.nii

ERROR: failed to open file '/home/moanae/shared/project_K99_ChrTMDHCP_qunex02/sessions/10001/hcp/10001/MNINonLinear/Results/Tractography/Conn3_waytotnorm.dconn.nii', file does not exist, or folder permissions prevent seeing it

gzip: /home/moanae/shared/project_K99_ChrTMDHCP_qunex02/sessions/10001/hcp/10001/MNINonLinear/Results/Tractography/Conn3.dconn.nii: No such file or directory
gzip: /home/moanae/shared/project_K99_ChrTMDHCP_qunex02/sessions/10001/hcp/10001/MNINonLinear/Results/Tractography/Conn3_waytotnorm.dconn.nii: No such file or directory
gzip: /home/moanae/shared/project_K99_ChrTMDHCP_qunex02/sessions/10001/hcp/10001/MNINonLinear/Results/Tractography/Conn3_waytotnorm_log.dconn.nii: No such file or directory
gzip: /home/moanae/shared/project_K99_ChrTMDHCP_qunex02/sessions/10001/hcp/10001/MNINonLinear/Results/Tractography/fdt_matrix3.dot: No such file or directory

-- Matrix 3 Probtrackx Completed successfully.

ERROR: dwi_probtracx_dense_gpu for 10001 failed!

ERROR: dwi_probtracx_dense_gpu run did not complete successfully



Hi Estephan,

Yes, dwi_pre_tractography is a requirement for dwi_probtrackx_dense_gpu. The command generates the pre-tractography dense trajectory space. This looks like a CUDA/GPU issue:

Killed                  /opt/fsl/fsl/bin/probtrackx2_gpu10.2

Could you try rerunning this with --nogpu="yes" added to the command call?

Another thing worth trying is adding --cuda_path=<PATH TO CUDA ON YOUR SYSTEM> to the call.
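
When the log is long, it can help to list which binaries were actually killed. The "Killed" notice is printed by bash when a child process receives SIGKILL; on its own it does not say what sent the signal. A minimal sketch, assuming the log is plain text; `killed_binaries` is a hypothetical helper, not part of QuNex:

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of QuNex): print the binary named in each
# bash "Killed" notice found in a log read from stdin.
killed_binaries() {
    # bash reports: "<script>: line N: <pid> Killed   <command> <args...>"
    # The whitespace-separated field right after "Killed" is the command.
    awk '{
        for (i = 1; i < NF; i++)
            if ($i == "Killed") { print $(i + 1); break }
    }'
}

# Usage (the log path is a placeholder):
#   killed_binaries < /path/to/error_dwi_probtrackx_dense_gpu.log
```

For the log above, this would print the probtrackx2_gpu10.2 path once per killed run, which confirms that the failure is in the tractography binary itself and that the later wb_command errors are only follow-on effects of the missing output files.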

Jure

Hi,

As we received no feedback for 3 weeks, I will mark this as resolved. Feel free to post new messages here if there are still issues and I will reopen the topic.

Jure

Hi Jure, sorry for the delay; I was out of town until last week. I tried your suggestions to no avail. I also logged into a GPU-enabled node on our HPC system, and the error still occurred when I ran the command from inside the container (see below). Would this rule out CUDA/GPU issues?

aga13:~ moana004$ module list
Currently Loaded Modulefiles:
 1) singularity/current                            3) cuda/11.2(default)  
 2) python3/3.8.3_anaconda2020.07_mamba(default)  
aga13:~ moana004$ singularity shell -B ${study_sharedfolder}:${study_sharedfolder},${DWIQC_folder}:${DWIQC_folder} ${HOME}/qunex/qunex_suite-0.98.0.sif
Apptainer> source /opt/qunex/env/qunex_environment.sh
--> unsetting the following environment variables: PATH MATLABPATH PYTHONPATH QUNEXVer TOOLS QUNEXREPO QUNEXPATH QUNEXEXTENSIONS QUNEXLIBRARY QUNEXLIBRARYETC TemplateFolder FSL_FIXDIR FREESURFERDIR FREESURFER_HOME FREESURFER_SCHEDULER FreeSurferSchedulerDIR WORKBENCHDIR DCMNIIDIR DICMNIIDIR MATLABDIR MATLABBINDIR OCTAVEDIR OCTAVEPKGDIR OCTAVEBINDIR RDIR HCPWBDIR AFNIDIR PYLIBDIR FSLDIR FSLBINDIR PALMDIR QUNEXMCOMMAND HCPPIPEDIR CARET7DIR GRADUNWARPDIR HCPPIPEDIR_Templates HCPPIPEDIR_Bin HCPPIPEDIR_Config HCPPIPEDIR_PreFS HCPPIPEDIR_FS HCPPIPEDIR_PostFS HCPPIPEDIR_fMRISurf HCPPIPEDIR_fMRIVol HCPPIPEDIR_tfMRI HCPPIPEDIR_dMRI HCPPIPEDIR_dMRITract HCPPIPEDIR_Global HCPPIPEDIR_tfMRIAnalysis HCPCIFTIRWDIR MSMBin HCPPIPEDIR_dMRITractFull HCPPIPEDIR_dMRILegacy AutoPtxFolder EDDYCUDA USEOCTAVE QUNEXENV CONDADIR MSMBINDIR MSMCONFIGDIR R_LIBS FSL_FIX_CIFTIRW FSFAST_HOME SUBJECTS_DIR MINC_BIN_DIR MNI_DIR MINC_LIB_DIR MNI_DATAPATH FSF_OUTPUT_FORMAT
 
Generated by QuNex 
------------------------------------------------------------------------ 
Version: 0.98.0 
User: moana004 
System: aga13 
OS: RedHat Linux #1 SMP Tue Jun 20 11:48:01 UTC 2023 
------------------------------------------------------------------------ 
 
        ██████\                  ║      ██\   ██\                        
       ██  __██\                 ║      ███\  ██ |                       
       ██ /  ██ |██\   ██\       ║      ████\ ██ | ██████\ ██\   ██\     
       ██ |  ██ |██ |  ██ |      ║      ██ ██\██ |██  __██\\██\ ██  | 
       ██ |  ██ |██ |  ██ |      ║      ██ \████ |████████ |\████  /     
       ██ ██\██ |██ |  ██ |      ║      ██ |\███ |██   ____|██  ██\      
       \██████ / \██████  |      ║      ██ | \██ |\███████\██  /\██\     
        \___███\  \______/       ║      \__|  \__| \_______\__/  \__|    
            \___|                ║                                       
 
 
                       DEVELOPED & MAINTAINED BY: 
 
                    Anticevic Lab, Yale University 
               Mind & Brain Lab, University of Ljubljana 
                     Murray Lab, Yale University 
 
                      COPYRIGHT & LICENSE NOTICE: 
 
Use of this software is subject to the terms and conditions defined in 
'LICENSES' which is a part of the QuNex Suite source code package: 
https://gitlab.qunex.yale.edu/qunex/qunex/-/tree/master/LICENSES 
 
 ---> Setting up Octave  

(qunex) [ahl02 ~]$ /opt/qunex/bash/qx_utilities/diffusion_tractography_dense/tractography_gpu_scripts/run_matrix1.sh /home/moanae/shared/project_K99_ChrTMDHCP_qunex02/sessions 10001 10000 yes yes no

-- Queueing Probtrackx

PROBTRACKX2 VERSION GPU
Log directory is: /home/moanae/shared/project_K99_ChrTMDHCP_qunex02/sessions/10001/hcp/10001/MNINonLinear/Results/Tractography
Running in seedmask mode
Loading tractography data
Number of Seeds: 91282
/home/moanae/shared/project_K99_ChrTMDHCP_qunex02/sessions/10001/hcp/10001/MNINonLinear/Results/Tractography/commands_Mat1.sh: line 1: 1901348 Killed                  /opt/fsl/fsl/bin/probtrackx2_gpu10.2 --loopcheck --forcedir --fibthresh=0.01 --cthr=0.2 --sampvox=2 --randfib=1 --nsamples=10000 --nsteps=2000 --steplength=0.5 --pd --ompl --samples=/home/moanae/shared/project_K99_ChrTMDHCP_qunex02/sessions/10001/hcp/10001/T1w/Diffusion.bedpostX/merged --mask=/home/moanae/shared/project_K99_ChrTMDHCP_qunex02/sessions/10001/hcp/10001/T1w/Diffusion.bedpostX/nodif_brain_mask --meshspace=caret --seed=/home/moanae/shared/project_K99_ChrTMDHCP_qunex02/sessions/10001/hcp/10001/MNINonLinear/Results/Tractography/Mat1_seeds --seedref=/opt/fsl/fsl/data/standard/MNI152_T1_2mm_brain_mask --xfm=/home/moanae/shared/project_K99_ChrTMDHCP_qunex02/sessions/10001/hcp/10001/MNINonLinear/xfms/standard2acpc_dc --invxfm=/home/moanae/shared/project_K99_ChrTMDHCP_qunex02/sessions/10001/hcp/10001/MNINonLinear/xfms/acpc_dc2standard --stop=/home/moanae/shared/project_K99_ChrTMDHCP_qunex02/sessions/10001/hcp/10001/MNINonLinear/Results/Tractography/stop --wtstop=/home/moanae/shared/project_K99_ChrTMDHCP_qunex02/sessions/10001/hcp/10001/MNINonLinear/Results/Tractography/wtstop --forcefirststep --waypoints=/home/moanae/shared/project_K99_ChrTMDHCP_qunex02/sessions/10001/hcp/10001/MNINonLinear/ROIs/Whole_Brain_Trajectory_ROI_2 --omatrix1 --dir=/home/moanae/shared/project_K99_ChrTMDHCP_qunex02/sessions/10001/hcp/10001/MNINonLinear/Results/Tractography

-- Queueing Post-Matrix 1 Calls


While running:
/opt/workbench/workbench-1.5.0/bin_rh_linux64/../exe_rh_linux64/wb_command -probtrackx-dot-convert /home/moanae/shared/project_K99_ChrTMDHCP_qunex02/sessions/10001/hcp/10001/MNINonLinear/Results/Tractography/fdt_matrix1.dot /home/moanae/shared/project_K99_ChrTMDHCP_qunex02/sessions/10001/hcp/10001/MNINonLinear/Results/Tractography/Mat1.dconn.nii -row-cifti /opt/qunex/qx_library/etc/diffusion_tractography_dense/templates/91282_Greyordinates.dscalar.nii COLUMN -col-cifti /opt/qunex/qx_library/etc/diffusion_tractography_dense/templates/91282_Greyordinates.dscalar.nii COLUMN

ERROR: error opening text file '/home/moanae/shared/project_K99_ChrTMDHCP_qunex02/sessions/10001/hcp/10001/MNINonLinear/Results/Tractography/fdt_matrix1.dot'


While running:
/opt/workbench/workbench-1.5.0/bin_rh_linux64/../exe_rh_linux64/wb_command -cifti-transpose /home/moanae/shared/project_K99_ChrTMDHCP_qunex02/sessions/10001/hcp/10001/MNINonLinear/Results/Tractography/Mat1.dconn.nii /home/moanae/shared/project_K99_ChrTMDHCP_qunex02/sessions/10001/hcp/10001/MNINonLinear/Results/Tractography/Mat1_transp.dconn.nii

ERROR: failed to open file '/home/moanae/shared/project_K99_ChrTMDHCP_qunex02/sessions/10001/hcp/10001/MNINonLinear/Results/Tractography/Mat1.dconn.nii', file does not exist, or folder permissions prevent seeing it


While running:
/opt/workbench/workbench-1.5.0/bin_rh_linux64/../exe_rh_linux64/wb_command -cifti-average /home/moanae/shared/project_K99_ChrTMDHCP_qunex02/sessions/10001/hcp/10001/MNINonLinear/Results/Tractography/Conn1.dconn.nii -cifti /home/moanae/shared/project_K99_ChrTMDHCP_qunex02/sessions/10001/hcp/10001/MNINonLinear/Results/Tractography/Mat1.dconn.nii -cifti /home/moanae/shared/project_K99_ChrTMDHCP_qunex02/sessions/10001/hcp/10001/MNINonLinear/Results/Tractography/Mat1_transp.dconn.nii

ERROR: failed to open file '/home/moanae/shared/project_K99_ChrTMDHCP_qunex02/sessions/10001/hcp/10001/MNINonLinear/Results/Tractography/Mat1.dconn.nii', file does not exist, or folder permissions prevent seeing it


While running:
/opt/workbench/workbench-1.5.0/bin_rh_linux64/../exe_rh_linux64/wb_command -cifti-reduce /home/moanae/shared/project_K99_ChrTMDHCP_qunex02/sessions/10001/hcp/10001/MNINonLinear/Results/Tractography/Conn1.dconn.nii SUM /home/moanae/shared/project_K99_ChrTMDHCP_qunex02/sessions/10001/hcp/10001/MNINonLinear/Results/Tractography/Conn1_sum.dscalar.nii

ERROR: failed to open file '/home/moanae/shared/project_K99_ChrTMDHCP_qunex02/sessions/10001/hcp/10001/MNINonLinear/Results/Tractography/Conn1.dconn.nii', file does not exist, or folder permissions prevent seeing it

mv: cannot stat ‘/home/moanae/shared/project_K99_ChrTMDHCP_qunex02/sessions/10001/hcp/10001/MNINonLinear/Results/Tractography/waytotal’: No such file or directory
cat: /home/moanae/shared/project_K99_ChrTMDHCP_qunex02/sessions/10001/hcp/10001/MNINonLinear/Results/Tractography/Conn1_waytotal: No such file or directory

While running:
/opt/workbench/workbench-1.5.0/bin_rh_linux64/../exe_rh_linux64/wb_command -cifti-math a/ /home/moanae/shared/project_K99_ChrTMDHCP_qunex02/sessions/10001/hcp/10001/MNINonLinear/Results/Tractography/Conn1_waytotnorm.dconn.nii -var a /home/moanae/shared/project_K99_ChrTMDHCP_qunex02/sessions/10001/hcp/10001/MNINonLinear/Results/Tractography/Conn1.dconn.nii

ERROR: failed to open file '/home/moanae/shared/project_K99_ChrTMDHCP_qunex02/sessions/10001/hcp/10001/MNINonLinear/Results/Tractography/Conn1.dconn.nii', file does not exist, or folder permissions prevent seeing it


While running:
/opt/workbench/workbench-1.5.0/bin_rh_linux64/../exe_rh_linux64/wb_command -cifti-math 'log(1+a)' /home/moanae/shared/project_K99_ChrTMDHCP_qunex02/sessions/10001/hcp/10001/MNINonLinear/Results/Tractography/Conn1_waytotnorm_log.dconn.nii -var a /home/moanae/shared/project_K99_ChrTMDHCP_qunex02/sessions/10001/hcp/10001/MNINonLinear/Results/Tractography/Conn1_waytotnorm.dconn.nii

ERROR: failed to open file '/home/moanae/shared/project_K99_ChrTMDHCP_qunex02/sessions/10001/hcp/10001/MNINonLinear/Results/Tractography/Conn1_waytotnorm.dconn.nii', file does not exist, or folder permissions prevent seeing it

gzip: /home/moanae/shared/project_K99_ChrTMDHCP_qunex02/sessions/10001/hcp/10001/MNINonLinear/Results/Tractography/Conn1.dconn.nii: No such file or directory
gzip: /home/moanae/shared/project_K99_ChrTMDHCP_qunex02/sessions/10001/hcp/10001/MNINonLinear/Results/Tractography/Conn1_waytotnorm.dconn.nii: No such file or directory
gzip: /home/moanae/shared/project_K99_ChrTMDHCP_qunex02/sessions/10001/hcp/10001/MNINonLinear/Results/Tractography/Conn1_waytotnorm_log.dconn.nii: No such file or directory
gzip: /home/moanae/shared/project_K99_ChrTMDHCP_qunex02/sessions/10001/hcp/10001/MNINonLinear/Results/Tractography/Mat1.dconn.nii: No such file or directory
gzip: /home/moanae/shared/project_K99_ChrTMDHCP_qunex02/sessions/10001/hcp/10001/MNINonLinear/Results/Tractography/Mat1_transp.dconn.nii: No such file or directory
gzip: /home/moanae/shared/project_K99_ChrTMDHCP_qunex02/sessions/10001/hcp/10001/MNINonLinear/Results/Tractography/fdt_matrix1.dot: No such file or directory

-- Matrix 1 Probtrackx Completed successfully.

(qunex) [ahl02 ~]$ 

However, I tried these commands to confirm that I was on a GPU-enabled node, and the output suggests errors:

(qunex) [ahl02 ~]$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2019 NVIDIA Corporation
Built on Wed_Oct_23_19:24:38_PDT_2019
Cuda compilation tools, release 10.2, V10.2.89
(qunex) [ahl02 ~]$ nvidia-smi
Failed to initialize NVML: Driver/library version mismatch
(qunex) [ahl02 ~]$ nvidia-settings

ERROR: /lib64/libcairo.so.2: undefined symbol: FT_Get_Var_Design_Coordinates
       libnvidia-gtk3.so: cannot open shared object file: No such file or
       directory
       libnvidia-gtk2.so.530.30.02: cannot open shared object file: No such
       file or directory
       libnvidia-gtk2.so: cannot open shared object file: No such file or
       directory


ERROR: A problem occurred when loading the GUI library. Please check your
       installation and library path. You may need to specify this library when
       calling nvidia-settings. Please run `nvidia-settings --help` for usage
       information.

(qunex) [ahl02 ~]$ 

Do you think this may explain why I got those errors with probtrackx_gpu? Interestingly, I was able to run BEDPOSTX without issues on the same HPC system using this command:

msi_resources_time=12:00:00; msi_resources_nodes=1; msi_resources_ntaskspernode=1; msi_resources_mem=64000; msi_queue=a100-4; msi_resources_gpu=gpu:a100:1; msi_resources_jobname=BEDPOSTX; \
study_sharedfolder=/home/moanae/shared/project_K99_ChrTMDHCP_qunex02; \
qunex_container dwi_bedpostx_gpu \
--batchfile=${study_sharedfolder}/processing/batch_K99Aim2.txt --sessionsfolder=${study_sharedfolder}/sessions \
--fibers="3" --weight="1" --burnin="3000" --jumps="1250" --sample="25" --model="3" --rician="yes" --gradnonlin="yes" --overwrite="yes" \
--nv \
--bash_pre="module load cuda/11.2" \
--scheduler=SLURM,time=${msi_resources_time},nodes=${msi_resources_nodes},cpus-per-task=${msi_resources_ntaskspernode},mem=${msi_resources_mem},partition=${msi_queue},gres=${msi_resources_gpu},jobname=${msi_resources_jobname} \
--bind=${study_sharedfolder}:${study_sharedfolder} --container=${HOME}/qunex/qunex_suite-0.98.0.sif

Anyway, it sounds like this might be something to troubleshoot with the HPC helpdesk, right?

Estephan

It seems the GPU processes get killed quickly (see the "Killed" notification at the end of the command execution). Also, the NVIDIA outputs suggest some issues with CUDA on the system. Note that the Singularity container uses the system's CUDA drivers, so yes, this is a possible cause.
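
One way to narrow this down is to run the same check on the host and inside the container and compare. The NVML "Driver/library version mismatch" message means the user-space NVIDIA library that nvidia-smi loaded does not match the loaded kernel driver. A minimal sketch; `gpu_status` is a hypothetical helper name:

```shell
#!/usr/bin/env bash
# Hypothetical helper: classify whether nvidia-smi can talk to the driver
# from the current shell (host or container).
gpu_status() {
    if ! command -v nvidia-smi >/dev/null 2>&1; then
        echo "missing"   # nvidia-smi is not on PATH
    elif nvidia-smi >/dev/null 2>&1; then
        echo "ok"        # nvidia-smi ran and exited with status 0
    else
        echo "broken"    # e.g. "Failed to initialize NVML"
    fi
}

echo "GPU status in this shell: $(gpu_status)"
```

If the host reports "ok" and the container reports "broken", the container is probably picking up its own NVIDIA libraries instead of the host's. The interactive `singularity shell` call shown earlier does not pass `--nv`, which is the flag that binds the host NVIDIA libraries into the container, so that may be worth ruling out first; this is an assumption, not something confirmed in this thread.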

For the time being, I would advise you to add the --nogpu="yes" option to the command call. I think this should work, as the issues seem CUDA-related. It will take more time, but it is a usable workaround until the CUDA issues are resolved.

OK, I submitted the job using the --nogpu="yes" option and it is running without issues. I requested a 48-hour walltime, but I’m not sure if it will run to completion within this period. I also contacted my HPC helpdesk about the CUDA errors I mentioned; let’s see what they say. Thank you.

Estephan

An update on this issue: I found that requesting 128GB of RAM for the cluster call allowed me to get past this error:

msi_resources_time=06:00:00; msi_resources_nodes=1; msi_resources_ntaskspernode=24; msi_resources_mem=256000; msi_queue=a100-8; msi_resources_gpu=gpu:a100:1; msi_resources_jobname=PROBTRACKX; \
study_sharedfolder=/home/moanae/shared/project_K99_ChrTMDHCP_qunex02; \
qunex_container dwi_probtrackx_dense_gpu \
--batchfile=${study_sharedfolder}/processing/batch_K99Aim2.txt --sessionsfolder=${study_sharedfolder}/sessions --sessions="10001" \
--omatrix1="yes" --omatrix3="yes" --nsamplesmatrix1="10000" --nsamplesmatrix3="3000" --distancecorrection="yes" --storestreamlineslength="yes" --overwrite="yes" \
--nv \
--bash_pre="module load cuda/11.2" \
--scheduler=SLURM,time=${msi_resources_time},nodes=${msi_resources_nodes},cpus-per-task=${msi_resources_ntaskspernode},mem=${msi_resources_mem},partition=${msi_queue},gres=${msi_resources_gpu},jobname=${msi_resources_jobname} \
--bind=${study_sharedfolder}:${study_sharedfolder} --container=${HOME}/qunex/qunex_suite-0.98.4.sif
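
Since raising the memory request changed the outcome, the earlier "Killed" messages were plausibly caused by the job exceeding its host memory limit. On a SLURM cluster this can be checked from the job accounting records. A minimal sketch, assuming `sacct` is available and accounting is enabled; `oom_steps` is a hypothetical helper and the job ID is a placeholder:

```shell
#!/usr/bin/env bash
# Hypothetical helper: read pipe-delimited accounting lines
# (JobID|State|ReqMem|MaxRSS) on stdin and print the job steps that
# SLURM marked as OUT_OF_MEMORY.
oom_steps() {
    awk -F'|' '$2 ~ /^OUT_OF_MEMORY/ {
        print $1 ": requested " $3 ", peak " $4
    }'
}

# Usage on the cluster (replace <JOBID> with the real job ID):
#   sacct -j <JOBID> --parsable2 --noheader \
#         --format=JobID,State,ReqMem,MaxRSS | oom_steps
```

Whether an out-of-memory kill shows up as OUT_OF_MEMORY depends on how the cluster is configured, so empty output does not rule it out; comparing MaxRSS against ReqMem by eye is a reasonable fallback.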

But it then hit another issue, an out-of-memory error, which I believe refers to GPU memory, although the GPU seemed to have enough memory available:

...................Allocated GPU 0...................
Device memory available (MB): 40116 ---- Total device memory(MB): 40536
Memory required for allocating data (MB): 943
CUDA Runtime Error: out of memory
Device memory available after copying data (MB): 39136
Running 376230 streamlines in parallel using 2 STREAMS
Total number of streamlines: 912820000

Full log here:
error_dwi_probtrackx_dense_gpu_10001_2023-08-26_15.31.01.327032.log (12.1 KB)
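
For scale, the numbers in this log can be cross-checked against the earlier output: the total streamline count is the number of seeds (91282) times the samples per seed (10000). The batch count below assumes streamlines are processed in groups of the "in parallel" figure, which is a guess about the log and not documented behaviour:

```shell
#!/usr/bin/env bash
# Seeds and samples come from the logs in this thread; batch is the
# "streamlines in parallel" figure printed by probtrackx2_gpu.
seeds=91282
samples=10000
batch=376230

total=$((seeds * samples))
# Ceiling division: number of batches IF streamlines are processed in
# groups of $batch (an assumption, see the note above).
batches=$(( (total + batch - 1) / batch ))

echo "total streamlines: $total"       # 912820000, matches the log
echo "batches of $batch: $batches"     # 2427 under the assumption above
```

The total matches the "Total number of streamlines" line, so the run is configured as intended; the size of the job itself does not look wrong.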

Then I tried again and looked at the GPU node’s activity, and it seemed to spit out errors related to the GPU:

ahl04:~ moana004$ ssh agb05
!! MSI Cluster Resource
!! Disconnect IMMEDIATELY if you are not an authorized user!
agb05:~ moana004$ nvidia-smi
Sat Aug 26 15:34:08 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.103.01   Driver Version: 470.103.01   CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A100-SXM...  Off  | 00000000:8B:00.0 Off |                    0 |
| N/A   32C    P0    57W / 400W |      0MiB / 40536MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
agb05:~ moana004$ 
Message from syslogd@agb05 at Aug 26 15:34:21 ...
 kernel:NMI watchdog: BUG: soft lockup - CPU#39 stuck for 22s! [probtrackx2_gpu:886161]

Message from syslogd@agb05 at Aug 26 15:34:21 ...
 kernel:NMI watchdog: BUG: soft lockup - CPU#39 stuck for 22s! [probtrackx2_gpu:886161]

Message from syslogd@agb05 at Aug 26 15:35:13 ...
 kernel:NMI watchdog: BUG: soft lockup - CPU#32 stuck for 23s! [probtrackx2_gpu:886161]

Message from syslogd@agb05 at Aug 26 15:35:13 ...
 kernel:NMI watchdog: BUG: soft lockup - CPU#32 stuck for 23s! [probtrackx2_gpu:886161]

Message from syslogd@agb05 at Aug 26 15:35:41 ...
 kernel:NMI watchdog: BUG: soft lockup - CPU#32 stuck for 22s! [probtrackx2_gpu:886161]

Message from syslogd@agb05 at Aug 26 15:35:41 ...
 kernel:NMI watchdog: BUG: soft lockup - CPU#32 stuck for 22s! [probtrackx2_gpu:886161]
Connection to agb05 closed by remote host.
Connection to agb05 closed.

This made me think that the NVIDIA A100 GPU nodes may be the issue, so I tried running the same command on an NVIDIA Tesla V100 GPU node:

msi_resources_time=06:00:00; msi_resources_nodes=1; msi_resources_ntaskspernode=12; msi_resources_mem=128000; msi_queue=v100; msi_resources_gpu=gpu:v100:1; msi_resources_jobname=PROBTRACKX; \
study_sharedfolder=/home/moanae/shared/project_K99_ChrTMDHCP_qunex02; \
qunex_container dwi_probtrackx_dense_gpu \
--batchfile=${study_sharedfolder}/processing/batch_K99Aim2.txt --sessionsfolder=${study_sharedfolder}/sessions --sessions="10001" \
--omatrix1="yes" --omatrix3="yes" --nsamplesmatrix1="10000" --nsamplesmatrix3="3000" --distancecorrection="yes" --storestreamlineslength="yes" --overwrite="yes" \
--nv \
--bash_pre="module load cuda/11.2" \
--scheduler=SLURM,time=${msi_resources_time},nodes=${msi_resources_nodes},cpus-per-task=${msi_resources_ntaskspernode},mem=${msi_resources_mem},partition=${msi_queue},gres=${msi_resources_gpu},jobname=${msi_resources_jobname} \
--bind=${study_sharedfolder}:${study_sharedfolder} --container=${HOME}/qunex/qunex_suite-0.98.4.sif

It got past the errors above, and the GPU is being used without issues:

ln0004:~ moana004$ ssh cn2106 nvidia-smi
!! MSI Cluster Resource
!! Disconnect IMMEDIATELY if you are not an authorized user!
Sat Aug 26 15:51:41 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.103.01   Driver Version: 470.103.01   CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla V100-PCIE...  Off  | 00000000:D8:00.0 Off |                    0 |
| N/A   48C    P0    47W / 250W |  27137MiB / 32510MiB |    100%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A    714213      C   ...l/bin/probtrackx2_gpu10.2    27113MiB |
+-----------------------------------------------------------------------------+

Based on this, it seems there may be an issue between the NVIDIA A100 GPU nodes and the probtrackx_gpu command. What are your thoughts? Any suggestions on how I can work with my local cluster staff to figure this out? Thank you.

Estephan

Hi,

Thanks for the updates. Yes, dwi_probtrackx_dense_gpu needs vast amounts of memory, both system RAM and GPU memory. The above looks like some kind of driver or GPU node issue: the A100 is superior to the V100 and should not run out of memory if the V100 does not.

Do you have any newer CUDA drivers available? The current version of the container also supports those, so you could try CUDA 12+ on the A100 by loading a newer CUDA module in place of the one in --bash_pre="module load cuda/11.2".

Jure
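
Before changing the module in --bash_pre, it may help to check what the GPU node actually offers. The sketch below assumes it is run on the GPU node itself and that the cluster uses an Environment Modules or Lmod setup (suggested by the module load call in the commands above); check_cuda_env is a hypothetical name. Note that the CUDA Version field in the nvidia-smi header is the newest CUDA runtime the installed driver supports (11.4 in the outputs above), so a newer toolkit may also require a driver update by the cluster staff.

```shell
#!/usr/bin/env bash
# Hypothetical helper: report the GPU driver and the CUDA modules on this node.
check_cuda_env() {
    if command -v nvidia-smi >/dev/null 2>&1; then
        # GPU model, driver version and total memory in CSV form
        nvidia-smi --query-gpu=name,driver_version,memory.total --format=csv || true
    else
        echo "nvidia-smi not found (probably not a GPU node)"
    fi
    if type module >/dev/null 2>&1; then
        # List the CUDA toolkits the cluster provides (module prints to stderr)
        module avail cuda 2>&1 || true
    else
        echo "module command not found"
    fi
}

check_cuda_env
```

The output gives the cluster staff a concrete starting point: the driver version installed on the A100 node and the CUDA modules that could replace cuda/11.2.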