Based on your log, I assume that your processing ran out of resources (most likely memory). As you can see, the word Killed appears in the log. This usually happens when the system does not have enough resources to execute something, so the operating system kills the process. Diffusion processing is computationally extremely heavy; in my experience, you need about 32 GB of memory for hcp_diffusion and even more for some of the steps that can follow. For example, dense tractography often needs 64 GB.
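If you want to confirm that it really was the out-of-memory killer, you can usually find a corresponding message in the kernel log on the host (a quick sketch, assuming you have permission to read it):

# look for OOM killer messages around the time the job died
dmesg | grep -i -E "out of memory|killed process"
# or, on systemd-based systems
journalctl -k | grep -i "out of memory"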
I ran into an error while running ‘hcp_diffusion’: ‘CUDA driver version is insufficient for CUDA runtime version’. My CUDA version is 11.5, is that enough to run the analysis?
To help you track down the problem, here is my command:
Can you also please provide the content of ${QUNEX_CONTAINER}? The answer depends on whether you are using Docker or Singularity, since the two handle GPUs differently.
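For reference, it will typically be either a Docker image reference or a path to a Singularity/Apptainer image file, something like this (the values below are purely illustrative):

# Docker
QUNEX_CONTAINER="gitlab.qunex.yale.edu:5002/qunex/qunexcontainer:<version>"
# Singularity / Apptainer
QUNEX_CONTAINER="/path/to/qunex_suite_<version>.sif"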
One thing I noticed is that you should be using bash_pre instead of bash. The bash parameter is not a qunex_container parameter. With qunex_container you have bash_pre, which is executed once you are on the compute node but before entering the container, and bash_post, which is executed on the compute node after entering the container, so inside the container (see General overview — QuNex documentation for additional details).
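Just to illustrate where the parameter sits (a minimal sketch; I am assuming the container is passed via --container and ${QUNEX_CONTAINER}, everything else should stay exactly as in your own call):

# placeholders except for the bash_pre line; keep the rest of your own parameters as they are
qunex_container hcp_diffusion \
    --container="${QUNEX_CONTAINER}" \
    --bash_pre="module load CUDA/12.5"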
These things are sometimes tricky to resolve as they are system/host dependent and often outside of our (QuNex) control. What usually works best is to use the --cuda flag instead of --nv, but for that you need the NVIDIA Container Toolkit installed on your host system.
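If you are unsure whether the toolkit is installed and working, a quick check on the host could look like this (assuming Docker; the CUDA image tag is only an example):

# is the NVIDIA Container Toolkit CLI present?
nvidia-ctk --version
# can a container see the GPUs at all?
docker run --rm --gpus all nvidia/cuda:12.4.1-base-ubuntu22.04 nvidia-smi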
But the error report is the same. My qunex_container is “gitlab.qunex.yale.edu:5002/qunex/qunexcontainer:1.0.4”. I will also try to download the NVIDIA Container Toolkit and follow your suggestion.
Another thing explained on the page I linked above is the --cuda_path parameter. That one binds the local CUDA installation into the container, ensuring that the CUDA runtime and the host CUDA drivers match. The assumption here is that they do actually match on your system; a common issue among our users is that the CUDA drivers and the CUDA runtime do not match on the host. You can try running:
nvcc --version
nvidia-smi
on the host system for some quick CUDA debugging.
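What you want to compare is the toolkit release printed by nvcc against the CUDA Version shown in the header of the nvidia-smi output, which is the highest runtime version the installed driver supports; the nvcc release must not be newer than that. Roughly (version numbers below are only illustrative):

# nvcc --version  ->  Cuda compilation tools, release 12.4, V12.4.131
# nvidia-smi      ->  Driver Version: 550.54.15    CUDA Version: 12.4

If the two do match, you can then point --cuda_path at that installation (e.g. --cuda_path="/usr/local/cuda-12.4", adjusted to the actual path on your host).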
If you have admin privileges on the system, I would advise you to update both the CUDA runtime and the CUDA drivers. On our system we use 12.4, but if you are updating, I would recommend going to the latest version (12.8), as everything should be backwards compatible.
---> QuNex will run the command over 1 sessions. It will utilize:
Maximum sessions run in parallel for a job: 1.
Maximum elements run in parallel for a session: 1.
Up to 1 processes will be utilized for a job.
Job #1 will run sessions: HCPA001
/bin/sh: 1: module: not found
(base) yumingz@localadmin:/usr/local$ docker: Error response from daemon: error while creating mount source path '/usr/local/cuda-12.5': mkdir /usr/local/cuda-12.5: read-only file system
Run 'docker run --help' for more information
I’ve opened up read and write access, but it seems that the path still can’t be mounted into the container.
Unfortunately, the amount of help I can give here is limited, as these are not really QuNex issues but system-specific issues, and since I do not have access to your system, it is hard for me to figure this out. Maybe you can tell me a bit more about the system. Is this a shared high-performance compute cluster, or your personal/lab processing system? How do you load modules, install libraries, etc.?
Anyhow, I see two issues. The first one is /bin/sh: 1: module: not found, which suggests that the module command is not available on the compute node. How do you usually load modules, on the login node before scheduling? If that is the case, then you need to remove the --bash_pre="module load CUDA/12.5" parameter and just call module load CUDA/12.5 before executing the qunex_container call.
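In other words, something along these lines (the module name is taken from your call; adjust as needed for your system):

# on the login node, before calling qunex_container
module load CUDA/12.5
# then run your qunex_container command exactly as before,
# just without the --bash_pre="module load CUDA/12.5" parameter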
For the mount issue, it seems like the path you are mounting from (/usr/local/cuda-12.5) does not exist, so Docker is trying to create the mount source path itself. This could well be because the module load ... call failed.
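A simple sanity check on the host would be:

# does the exact path Docker is trying to mount exist?
ls -ld /usr/local/cuda-12.5
# which CUDA installations are present at all?
ls -d /usr/local/cuda*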
How I usually debug this is to fire up a compute node interactively and then execute things there step by step, without scheduling everything at once.
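For example, if your cluster uses SLURM (an assumption on my part; the partition and resource values below are placeholders), I would grab an interactive GPU node with something like:

# request an interactive shell on a GPU node (SLURM; adjust partition/resources)
srun --partition=gpu --gres=gpu:1 --mem=32G --time=02:00:00 --pty bash
# then run module load, nvidia-smi, and the qunex_container call step by step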
Unfortunately, I’m still having problems with my CUDA configuration, but I’ve managed to get it running successfully on a server that already has the environment set up, thanks for the help!
Yeah, issues like this are usually system dependent and, without access to the system in question, hard to figure out. You also often need admin privileges to update the various drivers and packages …