Eddy_cuda9.1 erroring out when running dwi_legacy_gpu

Description:

Dear QuNex team,

When I run the dwi_legacy_gpu command on a GPU-enabled AWS instance, eddy_cuda9.1 fails in the function EDDY::EddyCudaHelperFunctions::InitGpu(bool). Could you advise how to debug this further? Thanks in advance.

Call:

Docker command used to start the QuNex container in interactive mode:

docker run --runtime=nvidia --gpus all -v "/test-data":"/data/" -v "/output":"/data/output" -it "gitlab.qunex.yale.edu:5002/qunex/qunexcontainer:0.96.2"

The dwi_legacy_gpu command and parameters are as follows; the DWI data has no field map.

qunex dwi_legacy_gpu \
--sessionsfolder='/data/output/sessions' \
--sessions='10171' \
--diffdatasuffix='DWI_dir64_PA' \
--usefieldmap='no' \
--pedir=2 \
--echospacing='0.69' \
--unwarpdir='-y' \
--scanner='siemens' \
--overwrite='yes' \
--nv

Logs:

nvidia-smi output from host machine

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.142.00   Driver Version: 450.142.00   CUDA Version: 11.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla K80           On   | 00000000:00:1E.0 Off |                    0 |
| N/A   31C    P8    30W / 149W |      0MiB / 11441MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

Qunex Logs:

→ unsetting the following environment variables: PATH MATLABPATH PYTHONPATH QUNEXVer TOOLS QUNEXREPO QUNEXPATH QUNEXLIBRARY QUNEXLIBRARYETC TemplateFolder FSL_FIXDIR FREESURFERDIR FREESURFER_HOME FREESURFER_SCHEDULER FreeSurferSchedulerDIR WORKBENCHDIR DCMNIIDIR DICMNIIDIR MATLABDIR MATLABBINDIR OCTAVEDIR OCTAVEPKGDIR OCTAVEBINDIR RDIR HCPWBDIR AFNIDIR ANTSDIR PYLIBDIR FSLDIR FSLGPUDIR PALMDIR QUNEXMCOMMAND HCPPIPEDIR CARET7DIR GRADUNWARPDIR HCPPIPEDIR_Templates HCPPIPEDIR_Bin HCPPIPEDIR_Config HCPPIPEDIR_PreFS HCPPIPEDIR_FS HCPPIPEDIR_PostFS HCPPIPEDIR_fMRISurf HCPPIPEDIR_fMRIVol HCPPIPEDIR_tfMRI HCPPIPEDIR_dMRI HCPPIPEDIR_dMRITract HCPPIPEDIR_Global HCPPIPEDIR_tfMRIAnalysis HCPCIFTIRWDIR MSMBin HCPPIPEDIR_dMRITractFull HCPPIPEDIR_dMRILegacy AutoPtxFolder FSL_GPU_SCRIPTS FSLGPUBinary EDDYCUDADIR USEOCTAVE QUNEXENV CONDADIR MSMBINDIR MSMCONFIGDIR R_LIBS FSL_FIX_CIFTIRW FSFAST_HOME SUBJECTS_DIR MINC_BIN_DIR MNI_DIR MINC_LIB_DIR MNI_DATAPATH FSF_OUTPUT_FORMAT

Generated by QuNex

Version: 0.96.2
User: root
System: d677e2409d35
OS: RedHat Linux #1 SMP Sun Nov 27 06:09:45 UTC 2022

    ██████\                  ║      ██\   ██\                        
   ██  __██\                 ║      ███\  ██ |                       
   ██ /  ██ |██\   ██\       ║      ████\ ██ | ██████\ ██\   ██\     
   ██ |  ██ |██ |  ██ |      ║      ██ ██\██ |██  __██\\██\ ██  | 
   ██ |  ██ |██ |  ██ |      ║      ██ \████ |████████ |\████  /     
   ██ ██\██ |██ |  ██ |      ║      ██ |\███ |██   ____|██  ██\      
   \██████ / \██████  |      ║      ██ | \██ |\███████\██  /\██\     
    \___███\  \______/       ║      \__|  \__| \_______\__/  \__|    
        \___|                ║                                       


                   DEVELOPED & MAINTAINED BY: 

                Anticevic Lab, Yale University 
           Mind & Brain Lab, University of Ljubljana 
                 Murray Lab, Yale University 

                  COPYRIGHT & LICENSE NOTICE: 

Use of this software is subject to the terms and conditions defined in
'LICENSES' which is a part of the QuNex Suite source code package:

---> Setting up Octave

(/opt/env/qunex) [QuNex qunex]$
(/opt/env/qunex) [QuNex qunex]$ qunex dwi_legacy_gpu \
--sessionsfolder='/data//sessions' \
--sessions='10171' \
--diffdatasuffix='DWI_dir64_PA' \
--usefieldmap='no' \
--pedir=2 \
--echospacing='0.69' \
--unwarpdir='-y' \
--scanner='siemens' \
--overwrite='yes' \
--nv

… Running QuNex v0.96.2 …

NOTE: Processing without FieldMap (TE option not needed)

Running dwi_legacy_gpu with the following parameters:

Study Folder: /data/
Sessions Folder: /data//sessions
Sessions: 10171
Study Log Folder:
Using FieldMap: no
Echo Spacing: 0.69
Phase Encoding Direction: 2
TE value for Fieldmap:
EPI Unwarp Direction: -y
Diffusion Data Suffix Name: DWI_dir64_PA
Overwrite prior run: yes

WARNING: QuNex study folder specification .qunexstudy in /data/ not found.
Check that /data/ is a valid QuNex folder.
Consider re-generating QuNex hierarchy...

--- Full QuNex call for command: dwi_legacy_gpu

/opt/qunex/bash/qx_utilities/dwi_legacy_gpu.sh --sessionsfolder=/data//sessions --session=10171 --usefieldmap=no --pedir=2 --echospacing=0.69 --te= --unwarpdir=-y --diffdatasuffix=DWI_dir64_PA --overwrite=yes



Running dwi_legacy_gpu locally on d677e2409d35
Command log: /data//processing/logs/runlogs/Log-dwi_legacy_gpu_2023-01-24_07.22.50.362874.log
Command output: /data//processing/logs/comlogs/tmp_dwi_legacy_gpu_10171_2023-01-24_07.22.50.362874.log


-- dwi_legacy_gpu.sh: Specified Command-Line Options - Start --
Sessionsfolder: /data//sessions
Session: 10171
Using fieldmap: no
Diffusion data sufix: DWI_dir64_PA
Overwrite: yes
-- dwi_legacy_gpu.sh: Specified Command-Line Options - End --

------------------------- Start of work --------------------------------

--- Establishing paths for all input and output folders:

T1w folder: /data/10171/hcp/10171/T1w
Diffusion folder: /data/10171/hcp/10171/Diffusion
T1w diffusion folder: /data/10171/hcp/10171/T1w/Diffusion

--- Deleting prior runs for 10171_DWI_dir64_PA ...

--- Copying unprocesed data into the Diffusion folder

Copying /data/10171/hcp/10171/unprocessed/Diffusion/10171_DWI_dir64_PA.bval
Copying /data/10171/hcp/10171/unprocessed/Diffusion/10171_DWI_dir64_PA.bvec
Copying /data/10171/hcp/10171/unprocessed/Diffusion/10171_DWI_dir64_PA.nii.gz
--- Setting up acquisition parameters:

Check acquisition parameter files:

acqparams.txt
index.txt

--- Omitting FieldMap step...

Getting the first volume of each DWI image...

Run BET on the B0 EPI image to create masks...

IN=/data/10171/hcp/10171/Diffusion/rawdata/10171_DWI_dir64_PA_nodif
OUT=/data/10171/hcp/10171/Diffusion/rawdata/10171_DWI_dir64_PA_nodif_brain
bet2opts= -m -f 0.35 -v
verbose=1
debug=0
variation=0
min 0 thresh2 0 thresh 74.9385 thresh98 749.385 max 4095
c-of-g 101.297 92.7163 46.3714 mm
radius 73.38 mm
median within-brain intensity 300
self-intersection total 326.99 (threshold=4000.0)

--- Checking if PreFreeSurfer was completed to obtain inputs for epi_reg...

PreFreeSurfer data found:

/data/10171/hcp/10171/T1w/T1w_acpc_dc_restore_brain.nii.gz

FAST already completed.

Setting inputs for epi_reg:

--> T1w Data: /data/10171/hcp/10171/T1w/T1w_acpc_dc_restore
--> T1w BET+FAST Data: /data/10171/hcp/10171/T1w/T1w_acpc_dc_restore_brain
--> WM Segment FAST Data: /data/10171/hcp/10171/T1w/T1w_acpc_dc_restore_brain_pve_2
--> T1w Brain Mask Data: /data/10171/hcp/10171/T1w/T1w_acpc_brain_mask

--- Running eddy_cuda...

Using the following eddy_cuda binary: /opt/fsl/fsl/bin/eddy_cuda9.1

Running command:

/opt/fsl/fsl/bin/eddy_cuda9.1 --imain=/data/10171/hcp/10171/Diffusion/10171_DWI_dir64_PA --mask=/data/10171/hcp/10171/Diffusion/rawdata/10171_DWI_dir64_PA_nodif_brain_mask --acqp=/data/10171/hcp/10171/Diffusion/acqparams/10171_DWI_dir64_PA/acqparams.txt --index=/data/10171/hcp/10171/Diffusion/acqparams/10171_DWI_dir64_PA/index.txt --bvecs=/data/10171/hcp/10171/Diffusion/10171_DWI_dir64_PA.bvec --bvals=/data/10171/hcp/10171/Diffusion/10171_DWI_dir64_PA.bval --fwhm=10,0,0,0,0 --ff=10 --nvoxhp=2000 --flm=quadratic --out=/data/10171/hcp/10171/Diffusion/eddy/10171_DWI_dir64_PA_eddy_corrected --data_is_shelled --repol -v

Reading images
Performing volume-to-volume registration
Running Register

EDDY::: cuda/EddyCudaHelperFunctions.cu::: static void EDDY::EddyCudaHelperFunctions::InitGpu(bool): Exception thrown
EDDY::: cuda/EddyGpuUtils.cu::: static std::shared_ptr<EDDY::DWIPredictionMaker> EDDY::EddyGpuUtils::LoadPredictionMaker(const EDDY::EddyCommandLineOptions&, EDDY::ScanType, const EDDY::ECScanManager&, unsigned int, float, NEWIMAGE::volume<float>&, bool): Exception thrown
EDDY::: eddy.cpp::: EDDY::ReplacementManager* EDDY::Register(const EDDY::EddyCommandLineOptions&, EDDY::ScanType, unsigned int, const std::vector<float, std::allocator<float> >&, EDDY::SecondLevelECModel, bool, EDDY::ECScanManager&, EDDY::ReplacementManager*, NEWMAT::Matrix&, NEWMAT::Matrix&): Exception thrown
EDDY::: Eddy failed with message EDDY::: eddy.cpp::: EDDY::ReplacementManager* EDDY::DoVolumeToVolumeRegistration(const EDDY::EddyCommandLineOptions&, EDDY::ECScanManager&): Exception thrown

Hi!

Welcome to QuNex forums!

Which GPU are you using? I see that nvidia-smi reports CUDA 11, so it is most likely something very new. New GPUs do not support CUDA 9.1 because it is too old. We are currently reworking the whole DWI pipeline so that it will support newer CUDA versions and newer GPUs as well.

One thing you could try is to add:

--bash="export DEFAULT_CUDA_VERSION=10.2"

to the QuNex call. This executes the bash code above before running the command, so QuNex will look for the eddy_cuda10.2 executable. I believe 10.2 is the latest version supported by the FSL in the container. You could also try 11, but I think that one is not yet supported there.
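To see which CUDA builds of eddy the container actually ships before choosing a value for DEFAULT_CUDA_VERSION, something like the following could be run inside the container. This is only a sketch: the helper name `list_eddy_cuda_versions` is made up for illustration (it is not part of QuNex or FSL), and it assumes the binaries follow the `eddy_cuda<version>` naming seen in the logs and live in `$FSLDIR/bin`.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of QuNex or FSL): print the CUDA version
# suffix of every eddy_cuda* binary found in the given directory.
list_eddy_cuda_versions() {
    for f in "$1"/eddy_cuda*; do
        # Skip the unexpanded pattern when nothing matches.
        [ -e "$f" ] || continue
        # Strip everything up to and including "eddy_cuda".
        echo "${f##*/eddy_cuda}"
    done
}

# Usage inside the container; assumes FSLDIR is set by the QuNex environment.
if [ -n "${FSLDIR:-}" ]; then
    list_eddy_cuda_versions "$FSLDIR/bin"
fi
```

If the version you plan to export does not appear in the output, the matching binary is probably not available in that image.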

Kind regards, Jure

Dear Jure,

I have tried adding --bash="export DEFAULT_CUDA_VERSION=10.2".

I still see a similar error; the log trace is below.

qunex dwi_legacy \
 --sessionsfolder='/data/output/dataset/sessions' \
 --sessions='10189' \
 --diffdatasuffix='DWI_dir64_PA' \
 --usefieldmap='no' \
 --pedir=2 \
 --echospacing='0.69' \
 --unwarpdir='-y' \
 --scanner='siemens' \
 --overwrite='yes' \
 --nv \
 --bash="export DEFAULT_CUDA_VERSION=10.2"
........................ Running QuNex v0.96.2 ........................ 


WARNING: Use of a deprecated command! Command dwi_legacy is now known as dwi_legacy_gpu

NOTE: Processing without FieldMap (TE option not needed)

Running dwi_legacy_gpu with the following parameters:
--------------------------------------------------------------
   Study Folder: /data/output/dataset
   Sessions Folder: /data/output/dataset/sessions
   Sessions: 10189
   Study Log Folder: 
   Using FieldMap: no
   Echo Spacing: 0.69
   Phase Encoding Direction: 2
   TE value for Fieldmap: 
   EPI Unwarp Direction: -y
   Diffusion Data Suffix Name: DWI_dir64_PA
   Overwrite prior run: yes


WARNING: QuNex study folder specification .qunexstudy in /data/output/dataset not found. 
         Check that /data/output/dataset is a valid QuNex folder. 
         Consider re-generating QuNex hierarchy... 


--- Full QuNex call for command: dwi_legacy_gpu 

/opt/qunex/bash/qx_utilities/dwi_legacy_gpu.sh     --sessionsfolder=/data/output/dataset/sessions     --session=10189     --usefieldmap=no     --pedir=2     --echospacing=0.69     --te=     --unwarpdir=-y     --diffdatasuffix=DWI_dir64_PA     --overwrite=yes 

-------------------------------------------------------------- 




-------------------------------------------------------------- 

   Running dwi_legacy_gpu locally on 2be9894b0b98 
   Command log:     /data/output/dataset/processing/logs/runlogs/Log-dwi_legacy_gpu_2023-01-30_06.49.22.046785.log   
   Command output: /data/output/dataset/processing/logs/comlogs/tmp_dwi_legacy_gpu_10189_2023-01-30_06.49.22.046785.log  

-------------------------------------------------------------- 



-- dwi_legacy_gpu.sh: Specified Command-Line Options - Start --
   Sessionsfolder: /data/output/dataset/sessions
   Session: 10189
   Using fieldmap: no
   Diffusion data sufix: DWI_dir64_PA
   Overwrite: yes
-- dwi_legacy_gpu.sh: Specified Command-Line Options - End --

 ------------------------- Start of work -------------------------------- 

 --- Establishing paths for all input and output folders: 

T1w folder:           /data/output/dataset/sessions/10189/hcp/10189/T1w
Diffusion folder:     /data/output/dataset/sessions/10189/hcp/10189/Diffusion
T1w diffusion folder: /data/output/dataset/sessions/10189/hcp/10189/T1w/Diffusion

 --- Deleting prior runs for 10189_DWI_dir64_PA ... 

 --- Copying unprocesed data into the Diffusion folder 

Copying /data/output/dataset/sessions/10189/hcp/10189/unprocessed/Diffusion/10189_DWI_dir64_PA.bval
Copying /data/output/dataset/sessions/10189/hcp/10189/unprocessed/Diffusion/10189_DWI_dir64_PA.bvec
Copying /data/output/dataset/sessions/10189/hcp/10189/unprocessed/Diffusion/10189_DWI_dir64_PA.nii.gz
 --- Setting up acquisition parameters: 

Check acquisition parameter files:

acqparams.txt
index.txt


 --- Omitting FieldMap step... 

 Getting the first volume of each DWI image... 

 Run BET on the B0 EPI image to create masks... 

IN=/data/output/dataset/sessions/10189/hcp/10189/Diffusion/rawdata/10189_DWI_dir64_PA_nodif
OUT=/data/output/dataset/sessions/10189/hcp/10189/Diffusion/rawdata/10189_DWI_dir64_PA_nodif_brain
bet2opts= -m -f 0.35 -v
verbose=1
debug=0
variation=0
min 0 thresh2 0 thresh 110.621 thresh98 1106.21 max 3312
c-of-g 97.3852 91.3497 43.6952 mm
radius 68.6348 mm
median within-brain intensity 293
self-intersection total 32.8768 (threshold=4000.0) 

 --- Checking if PreFreeSurfer was completed to obtain inputs for epi_reg... 

 PreFreeSurfer data found:  

/data/output/dataset/sessions/10189/hcp/10189/T1w/T1w_acpc_dc_restore_brain.nii.gz

 FAST already completed. 

 Setting inputs for epi_reg: 
  
 --> T1w Data:             /data/output/dataset/sessions/10189/hcp/10189/T1w/T1w_acpc_dc_restore 
 --> T1w BET+FAST Data:    /data/output/dataset/sessions/10189/hcp/10189/T1w/T1w_acpc_dc_restore_brain 
 --> WM Segment FAST Data: /data/output/dataset/sessions/10189/hcp/10189/T1w/T1w_acpc_dc_restore_brain_pve_2 
 --> T1w Brain Mask Data:  /data/output/dataset/sessions/10189/hcp/10189/T1w/T1w_acpc_brain_mask 

 --- Running eddy_cuda... 

 Using the following eddy_cuda binary: /opt/fsl/fsl-6.0.5.1/bin/eddy_cuda10.2 

Running command:

 /opt/fsl/fsl-6.0.5.1/bin/eddy_cuda10.2 --imain=/data/output/dataset/sessions/10189/hcp/10189/Diffusion/10189_DWI_dir64_PA --mask=/data/output/dataset/sessions/10189/hcp/10189/Diffusion/rawdata/10189_DWI_dir64_PA_nodif_brain_mask --acqp=/data/output/dataset/sessions/10189/hcp/10189/Diffusion/acqparams/10189_DWI_dir64_PA/acqparams.txt --index=/data/output/dataset/sessions/10189/hcp/10189/Diffusion/acqparams/10189_DWI_dir64_PA/index.txt --bvecs=/data/output/dataset/sessions/10189/hcp/10189/Diffusion/10189_DWI_dir64_PA.bvec --bvals=/data/output/dataset/sessions/10189/hcp/10189/Diffusion/10189_DWI_dir64_PA.bval --fwhm=10,0,0,0,0 --ff=10 --nvoxhp=2000 --flm=quadratic --out=/data/output/dataset/sessions/10189/hcp/10189/Diffusion/eddy/10189_DWI_dir64_PA_eddy_corrected --data_is_shelled --repol -v 

Reading images
Performing volume-to-volume registration
Running Register
  **EDDY:::  EddyCudaHelperFunctions::InitGpu: cudaGetDevice returned an error: cudaError_t = 100, cudaErrorName = cudaErrorNoDevice, cudaErrorString = no CUDA-capable device is detected**
  **EDDY:::  cuda/EddyCudaHelperFunctions.cu:::  static void EDDY::EddyCudaHelperFunctions::InitGpu(bool):  Exception thrown**
  
  **EDDY:::  eddy.cpp:::  EDDY::ReplacementManager* EDDY::Register(const EDDY::EddyCommandLineOptions&, EDDY::ScanType, unsigned int, const std::vector<float, std::allocator<float> >&, EDDY::SecondLevelECModel, bool, EDDY::ECScanManager&, EDDY::ReplacementManager*, NEWMAT::Matrix&, NEWMAT::Matrix&):  Exception thrown**
  **EDDY::: Eddy failed with message EDDY:::  eddy.cpp:::  EDDY::ReplacementManager* EDDY::DoVolumeToVolumeRegistration(const EDDY::EddyCommandLineOptions&, EDDY::ECScanManager&):  Exception thrown**

I assume that

nvcc -V

within the container works fine?

Also, you can remove the --nv flag from the command call as this is for Singularity containers only.
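A quick way to run such checks from inside the container could look like the sketch below. It uses only `nvcc -V` (mentioned above) and `nvidia-smi` (already run on the host earlier in this thread); the function name `check_gpu_tools` is hypothetical. If `nvidia-smi` is missing or fails inside the container, that would suggest the GPU is not being passed through by Docker, which would be consistent with the cudaErrorNoDevice message in the log.

```shell
# Hypothetical helper: report whether each GPU-related tool is on PATH
# and whether it exits successfully. Returns non-zero if any check fails.
check_gpu_tools() {
    status=0
    for tool in "nvidia-smi" "nvcc -V"; do
        # Command name is the part before the first space.
        name=${tool%% *}
        if ! command -v "$name" >/dev/null 2>&1; then
            echo "MISSING: $name"
            status=1
        elif $tool >/dev/null 2>&1; then
            echo "OK: $tool"
        else
            echo "FAILED: $tool"
            status=1
        fi
    done
    return $status
}

check_gpu_tools || echo "GPU is probably not usable from this environment"
```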

Jure