[RESOLVED] Qunex container support for hcp_diffusion

Dear QuNex Experts:

Forgive me for asking a basic question.
I was wondering if there is a way to run hcp_diffusion with qunex_container? I tried following the instructions from the wiki page oriadev / qunex / wiki / UsageDocs / RunningQunexContainer on Bitbucket, but it failed when I used the container (gitlab.qunex.yale.edu:5002/qunex/qunexcontainer:0.90.6).

Your help would be much appreciated. Thank you very much.

Best regards,

Hi Ed,

All QuNex functionality also works when running through the container, so running hcp_diffusion is possible through qunex_container. The Quick Start (oriadev / qunex / wiki / Overview / QuickStart on Bitbucket) is a good place to start with the qunex_container script. There you will also see how to onboard the data and prepare it for the later processing steps.

To run hcp_diffusion on your data, you first have to run part of the HCP minimal preprocessing pipeline (https://bitbucket.org/oriadev/qunex/wiki/UsageDocs/HCPPreprocessing.md); I believe you have to run hcp_pre_freesurfer. After that processing completes, you can run the hcp_diffusion command (https://bitbucket.org/oriadev/qunex/wiki/UsageDocs/HCPDiffusion.md).
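For reference, that sequence might look roughly like the following with qunex_container. This is only a sketch: the study folder, batch file name, and container tag below are illustrative placeholders taken from this thread, not verified commands.

```shell
# Illustrative sketch only -- paths, batch file, and container tag are
# placeholders and may differ on your system.
STUDY_FOLDER="/home/mrilab/qunex/quickstart"
CONTAINER="gitlab.qunex.yale.edu:5002/qunex/qunexcontainer:0.90.6"

# Structural preprocessing has to complete first
qunex_container hcp_pre_freesurfer \
  --sessionsfolder="${STUDY_FOLDER}/sessions" \
  --sessions="${STUDY_FOLDER}/processing/HCPA001_batch.txt" \
  --container="${CONTAINER}"

# Then diffusion preprocessing (GPU passthrough via --nv)
qunex_container hcp_diffusion \
  --sessionsfolder="${STUDY_FOLDER}/sessions" \
  --sessions="${STUDY_FOLDER}/processing/HCPA001_batch.txt" \
  --container="${CONTAINER}" \
  --nv
```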

If you stumble upon any problems please post them here and I will gladly help you out.

Cheers, Jure

Hi Jure:

Really appreciate your swift response indeed.

I did follow the quick start guide and ran all the turnkey steps listed on the page using the sample data. Forgive my ignorance, as I am very new to QuNex; I tried the following:
sudo ./qunex_container hcp_diffusion --sessions="HCPA001" --sessionsfolder="/home/mrilab/qunex/quickstart/sessions" --nv --container="gitlab.qunex.yale.edu:5002/qunex/qunexcontainer:0.90.6" --overwrite="yes"

However, there was an error shown below:
ERROR: Study folder or sessions folder is not defined or missing.
Check your inputs and re-run QuNex.

Your help would be much appreciated and apologies for the simple question!


Hi Ed,

thanks for reporting this. I will investigate and get back to you. The good news is that it seems to be a bug in the qunex_container script, so it should be easy for me to fix and then easy for you to update on your end.

Cheers, Jure

Thanks a tonne Jure!

Please update your qunex_container script by executing

wget --no-check-certificate -r 'https://drive.google.com/uc?export=download&id=1wdWgKvr67yX5J8pVUa6tBGXNAg3fssWs' -O qunex_container

There was a bug when passing one of the parameters to hcp_diffusion through qunex_container. With the updated script, I managed to execute the code snippet below without any problems. The snippet runs hcp_diffusion on the quick start data on our system.


qunex_container hcp_diffusion \
  --sessionsfolder="${STUDY_FOLDER}/sessions" \
  --sessions="${STUDY_FOLDER}/processing/HCPA001_batch.txt" \
  --overwrite="yes" \
  --container="${CONTAINER}" \
  --bash="module load CUDA/9.1.85" \
  --nv
Note that I have to load the CUDA module on the system first (through the bash parameter). You most likely do not have to do this since you are running things locally.

Let me know how it goes.

Hi Jure:

Thanks so much for your timely help! The QuNex container runs fine if I do not use the scheduler (when I don't use the scheduler, does that mean parallel computing won't be used?).

However, when I used the scheduler, the following error appeared after running qunex_container hcp_diffusion --sessionsfolder="${STUDY_FOLDER}/sessions" --sessions="${STUDY_FOLDER}/processing/HCPA001_batch.txt" --overwrite="yes" --container="${CONTAINER}" --nv --scheduler="SLURM,ntasks=1,cpus-per-task=1,mem-per-cpu=64000,gres=gpu:0,jobname=hcp_diffusion":


#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
#SBATCH --mem-per-cpu=64000
#SBATCH --gres=gpu:0
#SBATCH --job-name=hcp_diffusion
sudo docker run -v "$(pwd)":/data -v /home/mrilab:/home/mrilab gitlab.qunex.yale.edu:5002/qunex/qunexcontainer:0.90.6 bash /home/mrilab/tmppnmwui05
/bin/sh: 1: sbatch: not found

Your help would be greatly appreciated!

Many thanks,

Hi Ed,

There are two possible explanations for the above error, and neither is QuNex-related. The first is that you are trying to use a scheduler on a system that does not support or need scheduling. The second is that you are indeed working on a high performance computing (HPC) system, but your system uses a scheduler other than SLURM; the missing sbatch is a SLURM command.

HPCs are dedicated and very expensive systems onto which users log in remotely and then execute their programs. They are traditionally built and maintained by research institutions (universities, institutes, ...). Scheduling is used only on HPCs: since many users share the same system, letting everyone run whatever they want would overburden it. This is where scheduling systems come in. With scheduling, users do not execute commands directly; instead they reserve resources and queue jobs for execution. In other words, the scheduler allocates a time slot in which there are enough resources for your job, and your job runs in that slot. So scheduling really has nothing to do with parallel execution; it is about whether you are running commands on a system with one (or a few) concurrent users or on a multi-user HPC system.
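To illustrate what the scheduler option does conceptually, here is a small, hypothetical sketch (not QuNex's actual code) of how a --scheduler string like the one you passed could be expanded into the #SBATCH directives you saw in the output:

```shell
# Hypothetical illustration of expanding a scheduler specification string
# into SLURM #SBATCH directives. NOT the actual qunex_container logic.
scheduler="SLURM,ntasks=1,cpus-per-task=1,mem-per-cpu=64000,gres=gpu:0,jobname=hcp_diffusion"

# Drop the leading scheduler name, then turn each key=value pair
# into an #SBATCH directive.
opts="${scheduler#SLURM,}"
script="#!/bin/bash"
IFS=',' read -ra pairs <<< "${opts}"
for pair in "${pairs[@]}"; do
    key="${pair%%=*}"
    value="${pair#*=}"
    # the spec uses "jobname" while sbatch expects "job-name"
    [ "${key}" = "jobname" ] && key="job-name"
    script="${script}
#SBATCH --${key}=${value}"
done
echo "${script}"
```

The resulting script is what gets handed to sbatch, which is why a missing sbatch binary makes the scheduled run fail immediately.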

With QuNex, things can run in parallel both locally (e.g., on your computer or laptop) and on an HPC. The quick start example has only one session, so it cannot run multiple sessions in parallel. If your dataset has multiple sessions, you can run several of them in parallel by specifying the parsessions parameter; for example, adding --parsessions=2 to the above command would run hcp_diffusion on 2 sessions simultaneously. Note, though, that such processing needs a lot of resources, and your personal computer might not have enough RAM or CPU power for it to work well. Some commands can also run several elements in parallel; this is tuned through the parelements parameter. Sometimes a session has multiple elements of the same kind, for example multiple BOLD images. In fact, the quick start has 2 BOLDs, and using --parelements=2 would process both BOLDs in parallel in the hcp_fmri_volume and hcp_fmri_surface steps.
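As a sketch of how those flags are combined (the study folder, batch file, and container variables are placeholders, and the batch file name is illustrative), a multi-session run might be invoked like this:

```shell
# Illustrative only: run up to 2 sessions, and up to 2 within-session
# elements (e.g. BOLD images), in parallel. Variables are placeholders.
qunex_container hcp_fmri_volume \
  --sessionsfolder="${STUDY_FOLDER}/sessions" \
  --sessions="${STUDY_FOLDER}/processing/batch.txt" \
  --container="${CONTAINER}" \
  --parsessions=2 \
  --parelements=2
```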

Cheers, Jure

Hi Jure:

I see, thanks a lot for your detailed explanation! In that case I probably won't need the scheduler, as I am not using an HPC.

After running hcp_diffusion without the scheduler flag, it looks like there is another problem with eddy_cuda (see the errors below for your reference):
EDDY::: EddyGpuUtils::InitGpu: cudeGetDevice returned an unknown error code
EDDY::: cuda/EddyGpuUtils.cu::: static void EDDY::EddyGpuUtils::InitGpu(bool): Exception thrown
EDDY::: cuda/EddyGpuUtils.cu::: static std::shared_ptr<EDDY::DWIPredictionMaker> EDDY::EddyGpuUtils::LoadPredictionMaker(const EDDY::EddyCommandLineOptions&, EDDY::ScanType, const EDDY::ECScanManager&, unsigned int, float, NEWIMAGE::volume<float>&, bool): Exception thrown
EDDY::: eddy.cpp::: EDDY::ReplacementManager* EDDY::Register(const EDDY::EddyCommandLineOptions&, EDDY::ScanType, unsigned int, const std::vector<float, std::allocator<float> >&, EDDY::SecondLevelECModel, bool, EDDY::ECScanManager&, EDDY::ReplacementManager*, NEWMAT::Matrix&, NEWMAT::Matrix&): Exception thrown
EDDY::: Eddy failed with message EDDY::: eddy.cpp::: EDDY::ReplacementManager* EDDY::DoVolumeToVolumeRegistration(const EDDY::EddyCommandLineOptions&, EDDY::ECScanManager&): Exception thrown
Wed Jun 23 01:33:36 EDT 2021:run_eddy.sh: Completed with return value: 1

I was wondering if this has anything to do with the problem with the local GPU?

Many thanks as always.

Yes, this seems like a problem with your GPU. Which GPU do you have? The first line suggests that the eddy_cuda script used by HCP's diffusion pipeline cannot get access to your device (cudeGetDevice returned an unknown error code). What happens when you run nvcc --version on your system? Do you get the version of your CUDA libraries?
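A couple of generic diagnostics (not QuNex-specific commands) can show whether the driver and toolkit are visible at all:

```shell
# Generic GPU/CUDA visibility checks. These only report what is installed;
# they do not modify anything.
if command -v nvidia-smi >/dev/null 2>&1; then
    nvidia-smi            # driver version and visible GPUs
else
    echo "nvidia-smi not found: NVIDIA driver may not be installed"
fi

if command -v nvcc >/dev/null 2>&1; then
    nvcc --version        # CUDA toolkit version
else
    echo "nvcc not found: CUDA toolkit is not on PATH"
fi
```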

Hi Jure:

My GPU is a GeForce RTX 3090 running on Ubuntu 20.04 LTS with NVIDIA driver 460. But it looks like there is no nvcc command.

Forgive my ignorance; I thought the QuNex container would have already taken care of the CUDA version, etc. Or do I still have to install the CUDA toolkit (regardless of the version?) on my local host?

Many thanks for your help indeed.


Yes, you need CUDA installed on your system. The Singularity container needs to use the CUDA that sits outside the container, hence the --nv flag, which links the system inside the container with your hardware and drivers. This is just how things are designed; we cannot get around it by installing CUDA inside the container.

The CUDA version that QuNex is tested on is 9.1.

Another, less ideal, solution is to add the --hcp_dwi_nogpu flag to the command call. This forces non-GPU processing of hcp_diffusion, but it makes the processing much slower.
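For completeness, a CPU-only run might look like this; everything except the --hcp_dwi_nogpu flag reuses the same placeholder variables as the earlier snippet, so treat it as a sketch rather than a verified command:

```shell
# Fallback example: CPU-only eddy processing (much slower, but needs no CUDA).
# STUDY_FOLDER and CONTAINER are placeholders.
qunex_container hcp_diffusion \
  --sessionsfolder="${STUDY_FOLDER}/sessions" \
  --sessions="${STUDY_FOLDER}/processing/HCPA001_batch.txt" \
  --overwrite="yes" \
  --container="${CONTAINER}" \
  --hcp_dwi_nogpu
```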

Hi Jure:

Understood. I will install CUDA toolkit 11.2 and try again.

Many thanks for your help!


I am not sure if eddy_cuda supports the latest CUDA toolkit (11.2). I recommend you install version 9.1.

For CUDA 9.1, I can't seem to find installation instructions for Ubuntu 20.04 LTS on NVIDIA's website.

Or are there any tricks you could kindly recommend for installing CUDA 9.1 on Ubuntu 20.04?

Many thanks Jure!

Unfortunately, I cannot provide much help here; the only thing I have is the page from which you can download it (CUDA Toolkit 9.1 Download - Archived | NVIDIA Developer).

That's OK; really appreciate your help along the way, Jure! Thanks a lot.