Based on your log, I assume that your processing ran out of resources (most likely memory). As you can see, the word Killed appears in the log. This usually happens when the system does not have enough resources to execute something, so the operating system kills the process. Diffusion processing is computationally extremely heavy; in my experience, you need about 32 GB of memory for hcp_diffusion and even more for some of the steps that can follow. For example, dense tractography often needs 64 GB.
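If you want to confirm that it really was the out-of-memory killer, you can usually find a corresponding message in the kernel log on the host (a quick sketch, assuming you have permission to read it):

# look for OOM killer messages around the time the job died
dmesg | grep -i -E "out of memory|killed process"
# or, on systemd-based systems
journalctl -k | grep -i "out of memory"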
I ran into an error while running ‘hcp_diffusion’: ‘CUDA driver version is insufficient for CUDA runtime version’. My CUDA version is 11.5, is that enough to run the analysis?
To help you track down the problem, here is my command:
Can you also please provide the content of ${QUNEX_CONTAINER}? The answer depends on whether you are using Docker or Singularity, since the two handle GPUs differently.
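For reference, it will typically be either a Docker image reference or a path to a Singularity/Apptainer image file, something like this (the values below are purely illustrative):

# Docker
QUNEX_CONTAINER="gitlab.qunex.yale.edu:5002/qunex/qunexcontainer:<version>"
# Singularity / Apptainer
QUNEX_CONTAINER="/path/to/qunex_suite_<version>.sif"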
One thing I noticed is that you should be using bash_pre instead of bash. The bash parameter is not a qunex_container parameter. With qunex_container you have bash_pre, which is executed once you are on the compute node but before entering the container, and bash_post, which is executed on the compute node after entering the container, so inside the container (see General overview — QuNex documentation for additional details).
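Just to illustrate where the parameter sits (a minimal sketch; I am assuming the container is passed via --container and ${QUNEX_CONTAINER}, everything else should stay exactly as in your own call):

# placeholders except for the bash_pre line; keep the rest of your own parameters as they are
qunex_container hcp_diffusion \
    --container="${QUNEX_CONTAINER}" \
    --bash_pre="module load CUDA/12.5"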
These things are sometimes tricky to resolve as they are system/host dependent and often outside of our (QuNex) control. What usually works best is to use the --cuda flag instead of --nv, but for that you need the NVIDIA Container Toolkit installed on your host system.
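If you are unsure whether the toolkit is installed and working, a quick check on the host could look like this (assuming Docker; the CUDA image tag is only an example):

# is the NVIDIA Container Toolkit CLI present?
nvidia-ctk --version
# can a container see the GPUs at all?
docker run --rm --gpus all nvidia/cuda:12.4.1-base-ubuntu22.04 nvidia-smi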
But the error report is the same. My qunex_container is “gitlab.qunex.yale.edu:5002/qunex/qunexcontainer:1.0.4”. I will also try to download the NVIDIA Container Toolkit and follow your suggestion.
Another thing explained on the page I linked above is the --cuda_path parameter. That one binds the local CUDA installation into the container, ensuring that the CUDA runtime and the host CUDA drivers match. The assumption here is that they do actually match on your system; a common issue among our users is that the CUDA drivers and the CUDA runtime do not match on the host. You can try running:
nvcc --version
nvidia-smi
on the host system for some quick CUDA debugging.
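What you want to compare is the toolkit release printed by nvcc against the CUDA Version shown in the header of the nvidia-smi output, which is the highest runtime version the installed driver supports; the nvcc release must not be newer than that. Roughly (version numbers below are only illustrative):

# nvcc --version  ->  Cuda compilation tools, release 12.4, V12.4.131
# nvidia-smi      ->  Driver Version: 550.54.15    CUDA Version: 12.4

If the two do match, you can then point --cuda_path at that installation (e.g. --cuda_path="/usr/local/cuda-12.4", adjusted to the actual path on your host).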
If you have admin privileges on the system, I would advise you to update both the CUDA runtime and the CUDA drivers. On our system we use 12.4, but if you are updating, I would recommend going to the latest version (12.8), as everything should be backwards compatible.
---> QuNex will run the command over 1 sessions. It will utilize:
Maximum sessions run in parallel for a job: 1.
Maximum elements run in parallel for a session: 1.
Up to 1 processes will be utilized for a job.
Job #1 will run sessions: HCPA001
/bin/sh: 1: module: not found
(base) yumingz@localadmin:/usr/local$ docker: Error response from daemon: error while creating mount source path '/usr/local/cuda-12.5': mkdir /usr/local/cuda-12.5: read-only file system
Run 'docker run --help' for more information
I’ve opened up read and write access, but it seems that the path still can’t be mounted into the container.
Unfortunately, the amount of help I can give here is limited, as these are not really QuNex issues but system-specific issues, and since I do not have access to your system, it is hard for me to figure this out. Maybe you can tell me a bit more about the system. Is this a shared high-performance compute cluster, or your personal/lab processing system? How do you load modules, install libraries, etc.?
Anyhow, I see two issues. The first one is /bin/sh: 1: module: not found, which suggests that the module command is not available on the compute node. How do you usually load modules, on the login node before scheduling? If that is the case, then you need to remove the --bash_pre="module load CUDA/12.5" parameter and just call module load CUDA/12.5 before executing the qunex_container call.
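In other words, something along these lines (the module name is taken from your call; adjust as needed for your system):

# on the login node, before calling qunex_container
module load CUDA/12.5
# then run your qunex_container command exactly as before,
# just without the --bash_pre="module load CUDA/12.5" parameter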
For the mount issue, it seems like the path you are mounting from (/usr/local/cuda-12.5) does not exist, so Docker is trying to create the mount source path itself. This could well be because the module load ... call failed.
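A simple sanity check on the host would be:

# does the exact path Docker is trying to mount exist?
ls -ld /usr/local/cuda-12.5
# which CUDA installations are present at all?
ls -d /usr/local/cuda*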
How I usually debug this is to fire up a compute node interactively and then execute things there step by step, without scheduling everything at once.
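For example, if your cluster uses SLURM (an assumption on my part; the partition and resource values below are placeholders), I would grab an interactive GPU node with something like:

# request an interactive shell on a GPU node (SLURM; adjust partition/resources)
srun --partition=gpu --gres=gpu:1 --mem=32G --time=02:00:00 --pty bash
# then run module load, nvidia-smi, and the qunex_container call step by step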
Unfortunately, I’m still having problems with my CUDA configuration, but I’ve managed to get it running successfully on a server that already has the environment set up, thanks for the help!
Yeah, issues like this are usually system dependent and, without access to the system in question, hard to figure out. You also often need admin privileges to update the various drivers and packages …