Forgive me for asking a basic question.
I was wondering whether there is a way to run hcp_diffusion with qunex_container. I tried following the instructions from oriadev / qunex / wiki / UsageDocs / RunningQunexContainer — Bitbucket, but it failed when I used the container (gitlab.qunex.yale.edu:5002/qunex/qunexcontainer:0.90.6).
Your help would be much appreciated. Thank you very much.
All QuNex functionalities also work when running through the container, so running hcp_diffusion is possible through qunex_container. Quick Start (oriadev / qunex / wiki / Overview / QuickStart — Bitbucket) is a good place to start using the qunex_container script. There you will also see how to onboard the data and prepare it for the later processing steps.
I did follow the quick start guide and ran all the turnkey steps listed on the page using the sample data. Forgive my ignorance, as I am very new to QuNex. I tried the following:
sudo ./qunex_container hcp_diffusion --sessions="HCPA001" --sessionsfolder="/home/mrilab/qunex/quickstart/sessions" --nv --container="gitlab.qunex.yale.edu:5002/qunex/qunexcontainer:0.90.6" --overwrite="yes"
However, there was an error shown below:
ERROR: Study folder or sessions folder is not defined or missing.
Check your inputs and re-run QuNex.
Your help would be much appreciated and apologies for the simple question!
Thanks for reporting this. I will investigate and get back to you. The good news is that it seems to be a bug in the qunex_container script, so it should be easy for me to fix and then also easy to update on your end.
There was a bug when passing one of the parameters to hcp_diffusion through qunex_container. With the updated script, I managed to execute the code snippet below without any problems. The snippet runs hcp_diffusion on the quick start data on our system.
Note that I had to load the CUDA module on our system first (through the bash parameter). You most likely do not need to do this since you are running things locally.
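A sketch of the kind of call that ran on our system; treat it as illustrative, since the STUDY_FOLDER and CONTAINER variables and the exact module name depend on your setup:

# Illustrative values; adjust paths and module name to your system
STUDY_FOLDER="/home/mrilab/qunex/quickstart"
CONTAINER="gitlab.qunex.yale.edu:5002/qunex/qunexcontainer:0.90.6"

qunex_container hcp_diffusion \
    --sessionsfolder="${STUDY_FOLDER}/sessions" \
    --sessions="HCPA001" \
    --overwrite="yes" \
    --container="${CONTAINER}" \
    --nv \
    --bash="module load CUDA/9.1"

The commands given through the bash parameter run before the QuNex command itself, which is how the CUDA module gets loaded on our cluster; on a local machine you can simply omit it.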
Thanks so much for your timely help! The QuNex container runs fine if I do not use a scheduler (when I don't use a scheduler, does that mean parallel computing won't be used?).
However, when I used the scheduler, the following error appeared after running:
qunex_container hcp_diffusion --sessionsfolder="${STUDY_FOLDER}/sessions" --sessions="${STUDY_FOLDER}/processing/HCPA001_batch.txt" --overwrite="yes" --container="${CONTAINER}" --nv --scheduler="SLURM,ntasks=1,cpus-per-task=1,mem-per-cpu=64000,gres=gpu:0,jobname=hcp_diffusion"
Submitting:
#!/bin/sh
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
#SBATCH --mem-per-cpu=64000
#SBATCH --gres=gpu:0
#SBATCH --job-name=hcp_diffusion
sudo docker run -v "$(pwd)":/data -v /home/mrilab:/home/mrilab gitlab.qunex.yale.edu:5002/qunex/qunexcontainer:0.90.6 bash /home/mrilab/tmppnmwui05
/bin/sh: 1: sbatch: not found
There are two possible explanations for the above error, and neither is QuNex related. The first is that you are trying to use a scheduler on a system that does not support or need scheduling. The second is that you are indeed working on a high performance computing (HPC) system, but your system uses a scheduler other than SLURM. The missing sbatch command is part of SLURM.
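If you are unsure which scheduler (if any) your system provides, a quick check for the common submission commands can help; these are standard system utilities, not QuNex commands:

command -v sbatch && echo "SLURM is available"
command -v qsub && echo "PBS/Torque/SGE is available"
command -v bsub && echo "LSF is available"

If none of these are found, you are most likely on a plain workstation and should simply drop the --scheduler parameter.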
HPCs are dedicated and very expensive systems into which users log in remotely to execute their programs. They are traditionally built and maintained by research institutions (universities, institutes ...). Scheduling is used only on HPCs; the logic is that many users share the same system, and if everyone could run whatever they wanted, the system would become overburdened. This is where scheduling systems come in: with scheduling, users do not execute commands directly but instead reserve resources and queue jobs for execution. In other words, the scheduler allocates a time slot in which there are enough resources for your job, and your job runs in that slot. So scheduling really has nothing to do with parallel execution; it is about whether you are running commands on a system with a single concurrent user (or a few), or on a multi-user HPC system.
With QuNex, things can run in parallel both locally (e.g. on your computer or laptop) and on an HPC. The quick start example has only one session, so it cannot run multiple sessions in parallel. If your dataset has multiple sessions, you can run several of them in parallel by specifying the parsessions parameter; for example, adding --parsessions=2 to the above command would process hcp_diffusion on 2 sessions simultaneously (see the sketch below). Note, though, that such processing needs a lot of resources, and your personal computer might not have enough RAM or CPU power for it to work well. Some commands can also run several elements in parallel; this is tuned through the parelements parameter. Sometimes a session has multiple elements of the same kind, for example multiple BOLD images. In fact, the quick start has 2 BOLDs, and using --parelements=2 would process both BOLDs in parallel in the hcp_fmri_volume and hcp_fmri_surface steps.
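As a hypothetical illustration (the second session ID here is made up, and the variables are the same illustrative ones as above), running two sessions in parallel could look like this:

# HCPA002 is a hypothetical second session for illustration
qunex_container hcp_diffusion \
    --sessionsfolder="${STUDY_FOLDER}/sessions" \
    --sessions="HCPA001,HCPA002" \
    --parsessions=2 \
    --overwrite="yes" \
    --container="${CONTAINER}" \
    --nv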
Yes, this seems like a problem with your GPU. Which GPU do you have? The first line suggests that the eddy_cuda binary used by HCP's diffusion pipeline cannot get access to your device (cudaGetDevice returned an unknown error code). What happens when you run nvcc --version on your system, do you get the version of your CUDA libraries?
My GPU is a GeForce RTX 3090 running on Ubuntu 20.04 LTS with NVIDIA driver 460. But it looks like there is no nvcc command on my system.
Forgive my ignorance, I thought the QuNex container would have already taken care of the CUDA version, etc. Or do I still have to install the CUDA toolkit (regardless of the version?) on my local host?
Yes, you need CUDA installed on your system. The Singularity container needs to use the CUDA installation that sits outside of the container, hence the --nv flag, which links the system inside the container with your hardware and drivers. This is just how things are designed, and we cannot get around it by installing CUDA inside the container.
The CUDA version that QuNex is tested on is 9.1.
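Once CUDA is installed, you can verify that both the driver and the toolkit are visible with the standard NVIDIA utilities:

nvidia-smi        # reports the driver version and the GPUs it can see
nvcc --version    # reports the installed CUDA toolkit version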
Another, less ideal solution is to add the --hcp_dwi_nogpu flag to the command call. This forces non-GPU processing of hcp_diffusion, but makes the processing much slower.
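For example, reusing the illustrative variables from above, a CPU-only call could look like the sketch below (the --nv flag should not be needed since no GPU is used):

qunex_container hcp_diffusion \
    --sessionsfolder="${STUDY_FOLDER}/sessions" \
    --sessions="HCPA001" \
    --overwrite="yes" \
    --container="${CONTAINER}" \
    --hcp_dwi_nogpu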