yzhong
March 10, 2025, 2:55am
Hi all,
When I tried running the “hcp_diffusion” command (without GPU), an error occurred. Could you please help me figure out why this error happened?
Here is my command:
qunex_container hcp_diffusion \
--sessionsfolder="${STUDY_FOLDER}/sessions" \
--batchfile="${STUDY_FOLDER}/processing/batch.txt" \
--overwrite="no" \
--hcp_nogpu \
--container="${QUNEX_CONTAINER}"
and my error log is attached below:
error_hcp_diffusion_HCPA001_2025-03-07_09.39.59.692004.log (78.0 KB)
Thanks for the help!
Best,
Acacius
Hi Acacius,
Based on your log, I assume that your processing ran out of resources (most likely memory). As you can see, the log contains the word Killed. This usually happens when the system does not have enough resources to execute something, so the operating system kills the process. Diffusion preprocessing is computationally extremely heavy; in my experience you need about 32 GB of memory for hcp_diffusion, and even more for some of the steps that can follow. For example, for dense tractography you often need 64 GB.
Best, Jure
yzhong
March 10, 2025, 6:43am
Got it. Many thanks for your kind help!
Best,
Acacius
yzhong
March 14, 2025, 2:32am
Hi Jure,
I ran into an error while running ‘hcp_diffusion’: ‘CUDA driver version is insufficient for CUDA runtime version’. My CUDA version is 11.5; is that enough to run the analysis?
To help diagnose the problem, here is my command:
qunex_container hcp_diffusion \
--sessionsfolder="${STUDY_FOLDER}/sessions" \
--batchfile="${STUDY_FOLDER}/processing/batch.txt" \
--overwrite="no" \
--container="${QUNEX_CONTAINER}" \
--bash="module load CUDA/11.5" \
--parelements=2 \
--cuda-version=11.5 \
--nv
And here is my log file:
(base) yumingz@localadmin:~$ qunex_container hcp_diffusion \
--sessionsfolder="${STUDY_FOLDER}/sessions" \
--batchfile="${STUDY_FOLDER}/processing/batch.txt" \
--overwrite="no" \
--container="${QUNEX_CONTAINER}" \
--bash="module load CUDA/11.5" \
--parelements=2 \
--cuda-version=11.5 \
--nv
---> QuNex will run the command over 1 sessions. It will utilize:
Maximum sessions run in parallel for a job: 1.
Maximum elements run in parallel for a session: 2.
Up to 2 processes will be utilized for a job.
Job #1 will run sessions: HCPA001
(base) yumingz@localadmin:~$ ---> unsetting the following environment variables: PATH MATLABPATH PYTHONPATH QUNEXVer TOOLS QUNEXREPO QUNEXPATH QUNEXEXTENSIONS QUNEXLIBRARY QUNEXLIBRARYETC TemplateFolder FSL_FIXDIR FREESURFERDIR FREESURFER_HOME FREESURFER_SCHEDULER FreeSurferSchedulerDIR WORKBENCHDIR DCMNIIDIR DICMNIIDIR MATLABDIR MATLABBINDIR OCTAVEDIR OCTAVEPKGDIR OCTAVEBINDIR RDIR HCPWBDIR AFNIDIR PYLIBDIR FSLDIR FSLBINDIR PALMDIR QUNEXMCOMMAND HCPPIPEDIR CARET7DIR GRADUNWARPDIR HCPPIPEDIR_Templates HCPPIPEDIR_Bin HCPPIPEDIR_Config HCPPIPEDIR_PreFS HCPPIPEDIR_FS HCPPIPEDIR_FS_CUSTOM HCPPIPEDIR_PostFS HCPPIPEDIR_fMRISurf HCPPIPEDIR_fMRIVol HCPPIPEDIR_tfMRI HCPPIPEDIR_dMRI HCPPIPEDIR_dMRITract HCPPIPEDIR_Global HCPPIPEDIR_tfMRIAnalysis HCPCIFTIRWDIR MSMBin HCPPIPEDIR_dMRITractFull HCPPIPEDIR_dMRILegacy AutoPtxFolder EDDYCUDA USEOCTAVE QUNEXENV CONDADIR MSMBINDIR MSMCONFIGDIR R_LIBS FSL_FIX_CIFTIRW FSFAST_HOME SUBJECTS_DIR MINC_BIN_DIR MNI_DIR MINC_LIB_DIR MNI_DATAPATH FSF_OUTPUT_FORMAT ANTSDIR CUDIMOT
========================================================================
Generated by QuNex
------------------------------------------------------------------------
Version: 1.0.4 [QIO]
User: root
System: 321e4d905de6
OS: Debian Linux #48~22.04.1-Ubuntu SMP PREEMPT_DYNAMIC Mon Oct 7 11:24:13 UTC 2
------------------------------------------------------------------------
[QuNex ASCII logo]
DEVELOPED & MAINTAINED BY:
Mind & Brain Lab, University of Ljubljana
Cho Lab, Yale University
COPYRIGHT & LICENSE NOTICE:
Use of this software is subject to the terms and conditions defined in
QuNex LICENSES which can be found in the LICENSES folder of the QuNex
repository or at https://qunex.yale.edu/qunex-registration
========================================================================
---> Setting up Octave
.......................... Running QuNex v1.0.4 [QIO] ..........................
--- Full QuNex call for command: hcp_diffusion
qunex hcp_diffusion --sessionsfolder="/home/yumingz/qunex/diffusion/sessions" --overwrite="no" --bash="module load CUDA/11.5" --parelements="2" --cuda-version="11.5" --batchfile="/home/yumingz/qunex/diffusion/processing/batch.txt" --sessions="HCPA001"
---------------------------------------------------------
# Generated by QuNex 1.0.4 [QIO] on 2025-03-14_01.33.17.107961#
=================================================================
qunex hcp_diffusion \
--sessionsfolder="/home/yumingz/qunex/diffusion/sessions" \
--overwrite="no" \
--bash="module load CUDA/11.5" \
--parelements="2" \
--cuda-version="11.5" \
--sessions="/home/yumingz/qunex/diffusion/processing/batch.txt" \
--sessionids="HCPA001" \
=================================================================
Starting multiprocessing sessions in /home/yumingz/qunex/diffusion/processing/batch.txt with a pool of 1 concurrent processes
Starting processing of sessions HCPA001 at Friday, 14. March 2025 01:33:17
Running external command: /opt/HCP/HCPpipelines/DiffusionPreprocessing/DiffPreprocPipeline.sh --path="/home/yumingz/qunex/diffusion/sessions/HCPA001/hcp" --subject="HCPA001" --PEdir=2 --posData="/home/yumingz/qunex/diffusion/sessions/HCPA001/hcp/HCPA001/unprocessed/Diffusion/HCPA001_DWI_dir98_PA.nii.gz@/home/yumingz/qunex/diffusion/sessions/HCPA001/hcp/HCPA001/unprocessed/Diffusion/HCPA001_DWI_dir99_PA.nii.gz" --negData="/home/yumingz/qunex/diffusion/sessions/HCPA001/hcp/HCPA001/unprocessed/Diffusion/HCPA001_DWI_dir98_AP.nii.gz@/home/yumingz/qunex/diffusion/sessions/HCPA001/hcp/HCPA001/unprocessed/Diffusion/HCPA001_DWI_dir99_AP.nii.gz" --echospacing-seconds="0.000689998" --gdcoeffs="NONE" --combine-data-flag="1" --printcom="" --gpu=True --cuda-version=10.2
You can follow command's progress in:
/home/yumingz/qunex/diffusion/processing/logs/comlogs/tmp_hcp_diffusion_HCPA001_2025-03-14_01.33.17.108756.log
------------------------------------------------------------
------------------------------------------------------------
Session id: HCPA001
[started on Friday, 14. March 2025 01:33:17]
Running HCP DiffusionPreprocessing Pipeline [HCPStyleData] ...
---> The following pos direction files were found:
HCPA001_DWI_dir98_PA.nii.gz
HCPA001_DWI_dir99_PA.nii.gz
---> The following neg direction files were found:
HCPA001_DWI_dir98_AP.nii.gz
HCPA001_DWI_dir99_AP.nii.gz
---> Using image specific EchoSpacing: 0.000689998 s
------------------------------------------------------------
Running HCP Pipelines command via QuNex:
/opt/HCP/HCPpipelines/DiffusionPreprocessing/DiffPreprocPipeline.sh
--path="/home/yumingz/qunex/diffusion/sessions/HCPA001/hcp"
--subject="HCPA001"
--PEdir=2
--posData="/home/yumingz/qunex/diffusion/sessions/HCPA001/hcp/HCPA001/unprocessed/Diffusion/HCPA001_DWI_dir98_PA.nii.gz@/home/yumingz/qunex/diffusion/sessions/HCPA001/hcp/HCPA001/unprocessed/Diffusion/HCPA001_DWI_dir99_PA.nii.gz"
--negData="/home/yumingz/qunex/diffusion/sessions/HCPA001/hcp/HCPA001/unprocessed/Diffusion/HCPA001_DWI_dir98_AP.nii.gz@/home/yumingz/qunex/diffusion/sessions/HCPA001/hcp/HCPA001/unprocessed/Diffusion/HCPA001_DWI_dir99_AP.nii.gz"
--echospacing-seconds="0.000689998"
--gdcoeffs="NONE"
--combine-data-flag="1"
--printcom=""
--gpu=True
--cuda-version=10.2
------------------------------------------------------------
Running HCP Diffusion Preprocessing
ERROR: Running HCP Diffusion Preprocessing failed with error 1
...
command executed:
/opt/HCP/HCPpipelines/DiffusionPreprocessing/DiffPreprocPipeline.sh --path="/home/yumingz/qunex/diffusion/sessions/HCPA001/hcp" --subject="HCPA001" --PEdir=2 --posData="/home/yumingz/qunex/diffusion/sessions/HCPA001/hcp/HCPA001/unprocessed/Diffusion/HCPA001_DWI_dir98_PA.nii.gz@/home/yumingz/qunex/diffusion/sessions/HCPA001/hcp/HCPA001/unprocessed/Diffusion/HCPA001_DWI_dir99_PA.nii.gz" --negData="/home/yumingz/qunex/diffusion/sessions/HCPA001/hcp/HCPA001/unprocessed/Diffusion/HCPA001_DWI_dir98_AP.nii.gz@/home/yumingz/qunex/diffusion/sessions/HCPA001/hcp/HCPA001/unprocessed/Diffusion/HCPA001_DWI_dir99_AP.nii.gz" --echospacing-seconds="0.000689998" --gdcoeffs="NONE" --combine-data-flag="1" --printcom="" --gpu=True --cuda-version=10.2
---> logfile: /home/yumingz/qunex/diffusion/processing/logs/comlogs/error_hcp_diffusion_HCPA001_2025-03-14_01.33.17.108756.log
HCP Diffusion Preprocessing completed on Friday, 14. March 2025 02:12:30
------------------------------------------------------------
---> Final report for command hcp_diffusion
... HCPA001 ---> Error
---> Not all tasks completed fully!
error_hcp_diffusion_HCPA001_2025-03-14_01.33.17.108756.log (79.6 KB)
Thanks for your help!
Best,
Acacius
Hi Acacius,
Can you also please provide the content of ${QUNEX_CONTAINER}? This matters because Docker and Singularity handle GPUs differently.
One thing I noticed is that you should be using bash_pre instead of bash. The bash parameter is not for qunex_container. With qunex_container you have bash_pre, which is executed once you are on the compute node but pre entering the container, and bash_post, which is executed on the compute node post entering the container, so inside the container (see General overview — QuNex documentation for additional details).
These things are sometimes tricky to resolve as they are system/host dependent and often outside of our (QuNex) control. What works best is to use the --cuda flag instead of --nv, but for this you need the NVIDIA Container Toolkit installed on your host system. For additional details, please check the section Using CUDA in the container at General overview — QuNex documentation.
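A quick way to decide between the two flags is to check whether the toolkit CLI is on your PATH. A minimal sketch (pick_gpu_flag is a hypothetical helper written for this post; nvidia-ctk is the toolkit's CLI on most installs):

```shell
# Sketch: suggest --cuda when the NVIDIA Container Toolkit CLI is found
# on PATH, otherwise fall back to --nv.
pick_gpu_flag() {
    # $1: name of the toolkit binary to look for
    if command -v "$1" >/dev/null 2>&1; then
        echo "--cuda"
    else
        echo "--nv"
    fi
}

echo "Suggested qunex_container flag: $(pick_gpu_flag nvidia-ctk)"
```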
Note that --cuda-version is irrelevant here. As long as you have a CUDA version above 10.2, you should be good to go once we sort out the other things.
To sum up, try:
qunex_container hcp_diffusion \
--sessionsfolder="${STUDY_FOLDER}/sessions" \
--batchfile="${STUDY_FOLDER}/processing/batch.txt" \
--overwrite="yes" \
--container="${QUNEX_CONTAINER}" \
--bash_pre="module load CUDA/11.5" \
--cuda
Replace --cuda with --nv if you do not have the toolkit. Let me know how it goes.
Best, Jure
yzhong
March 14, 2025, 9:13am
Hi Jure,
I have also tried:
qunex_container hcp_diffusion \
--sessionsfolder="${STUDY_FOLDER}/sessions" \
--batchfile="${STUDY_FOLDER}/processing/batch.txt" \
--overwrite="yes" \
--container="${QUNEX_CONTAINER}" \
--bash_pre="module load CUDA/11.5" \
--nv
But the error report is the same. My QuNex container is “gitlab.qunex.yale.edu:5002/qunex/qunexcontainer:1.0.4”. I will also try installing the NVIDIA Container Toolkit, following your suggestion.
If there are any updates, I will let you know.
Thanks for your kind help!
Best,
Acacius
Another thing explained on the page I linked above is the --cuda_path parameter. That one binds the local CUDA installation into the container, ensuring that the CUDA runtime and the host CUDA drivers match. The assumption here is that they actually match on your system; a common issue among our users is that the CUDA drivers and CUDA runtime do not match on the host. You can try running:
nvcc --version
nvidia-smi
on the host system for some quick CUDA debugging.
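For context: nvcc reports the installed CUDA runtime release, while the "CUDA Version" line in nvidia-smi shows the highest release the driver supports; the runtime must not be newer than that. A small helper to pull the release number out of nvcc's output could look like this (parse_nvcc_version is written for this post, not part of any toolkit):

```shell
# Sketch: extract the CUDA release number from `nvcc --version` output.
parse_nvcc_version() {
    # nvcc prints a line such as:
    #   Cuda compilation tools, release 11.5, V11.5.119
    sed -n 's/.*release \([0-9][0-9]*\.[0-9][0-9]*\).*/\1/p'
}

# On a real host:
#   nvcc --version | parse_nvcc_version      # runtime release, e.g. 11.5
#   nvidia-smi | grep "CUDA Version"         # max release the driver supports
echo "Cuda compilation tools, release 11.5, V11.5.119" | parse_nvcc_version
```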
If you have admin privileges on the system, I would advise you to update the CUDA runtime and CUDA drivers. On our system we use 12.4, but if you are updating, I would recommend updating to the latest version (12.8), as everything should be backwards compatible.
Best, Jure