[RESOLVED] Issues with dwi_probtrackx_dense_gpu

Dear Jure:

I have been trying to run dwi_probtrackx_dense_gpu using the 0.99.2d release, but I got the following error:

# Generated by QuNex 0.99.2 on 2024-02-09_09.49.18.115382
#


-- qunex.sh: Specified Command-Line Options - Start --
   Study Folder: /home/ehui/qunex/cimt
   Sessions Folder: /home/ehui/qunex/cimt/sessions
   Session: 024A
   probtrackX GPU scripts Folder: /opt/qunex/bash/qx_utilities/diffusion_tractography_dense/tractography_gpu_scripts
   Compute Matrix1: no
   Compute Matrix3: yes
   Number of samples for Matrix1: 10000
   Number of samples for Matrix3: 3000
   Distance correction: no
   Store streamlines length: no
   Overwrite prior run: yes
   No GPU: no
-- qunex.sh: Specified Command-Line Options - End --

------------------------- Start of work --------------------------------


   --- probtrackX GPU for session 024A...


  --- Removing existing Probtrackxgpu Matrix3 dense run for 024A...


Checking if ProbtrackX Matrix 3 and dense connectome was completed on 024A...


ProbtrackX Matrix 3 solution and dense connectome incomplete for 024A. Starting run with 3000 samples...

Running the following probtrackX GPU command:

---------------------------

   /opt/qunex/bash/qx_utilities/diffusion_tractography_dense/tractography_gpu_scripts/run_matrix3.sh /home/ehui/qunex/cimt/sessions 024A 3000 no no no

---------------------------

-- Queueing Probtrackx

/home/ehui/qunex/cimt/sessions/024A/hcp/024A/MNINonLinear/Results/Tractography/commands_Mat3.sh: line 1: /opt/fsl/fsl/bin/probtrackx2_gpu10.1: No such file or directory

-- Queueing Post-Matrix 3 Calls

Please note the CUDA versions I have installed on my workstation:
CUDA-10.1, CUDA-10.2, CUDA-11.8

Many thanks for your help as always!

Best,
Ed

Hi Ed,

Could you please provide the full command call that you used? I believe this is a minor thing; we just need to set the correct CUDA version to use.

Best, Jure

Hi Jure:

Absolutely, please see below:

#!/bin/bash

#cd ~/qunex

#sudo docker login gitlab.qunex.yale.edu:5002 -u ehui
#sudo docker pull gitlab.qunex.yale.edu:5002/qunex/qunexcontainer:0.93.6

export SDIR="${HOME}"
PATH=${SDIR}/qunex:${PATH}
export STUDY_NAME="cimt"
export WORK_DIR="${SDIR}/qunex"
export QUNEX_CONTAINER="${HOME}/qunex/qunexcontainer/qunex_suite-0.99.2d.sif" 
export RAW_DATA="${WORK_DIR}/data"
export INPUT_BATCH_FILE="${RAW_DATA}/006_batch.txt"
export INPUT_MAPPING_FILE="${RAW_DATA}/006_mapping__.txt" #pci.txt"
export SESSIONS="024A" 

qunex_container dwi_probtrackx_dense_gpu \
  --sessionsfolder="${WORK_DIR}/${STUDY_NAME}/sessions" \
  --sessions="${SESSIONS}" \
  --omatrix3="yes" \
  --overwrite="yes" \
  --container="${QUNEX_CONTAINER}" \
  --bash_pre="module load CUDA/10.1" \
  --bash_post="export DEFAULT_CUDA_VERSION=10.1" \
  --bind="/usr/local/cuda-10.1/:/usr/local/cuda/" \
  --nv;

Thanks a lot,
Ed

Please try replacing your command with:

qunex_container dwi_probtrackx_dense_gpu \
  --sessionsfolder="${WORK_DIR}/${STUDY_NAME}/sessions" \
  --sessions="${SESSIONS}" \
  --omatrix3="yes" \
  --overwrite="yes" \
  --container="${QUNEX_CONTAINER}" \
  --bash_pre="module load CUDA/10.2" \
  --bind="/usr/local/cuda-10.2/:/usr/local/cuda/" \
  --nv;

10.2 is the default CUDA version and the one we have tested the most. I think this should work. Let me know how it goes.
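
If the missing binary error comes back, it can also help to check which probtrackx2_gpu builds actually ship inside the container. A rough sketch, assuming a Singularity/Apptainer image and the /opt/fsl/fsl/bin path taken from your error log above:

   # list the GPU builds of probtrackx2 inside the container image
   singularity exec "${QUNEX_CONTAINER}" ls /opt/fsl/fsl/bin/ | grep -i probtrackx2_gpu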

Hello Jure:

My apologies for the late reply.

It works! Many thanks for your help indeed!

Best,
Ed

Hi Jure:

I was previously able to run dwi_probtrackx_dense_gpu using the command you suggested, but for some reason the problem has re-emerged:

...................Allocated GPU 0...................
Device memory available (MB): 48412 ---- Total device memory(MB): 48685
Memory required for allocating data (MB): 897
CUDA Runtime Error: out of memory
Device memory available after copying data (MB): 47476
Running 456412 streamlines in parallel using 2 STREAMS
Total number of streamlines: 577848000

Do you know what could have caused this problem?

Many thanks,
Ed

Hi,

This is now a different error. Here the problem is that your GPU does not have enough memory. Based on the above info this is weird, as it seems like only a portion of the GPU's memory would be used here. What is your GPU? Is anyone else using the system/GPU?
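
If it helps, a quick way to check the GPU model and whether any other processes are holding memory is something along these lines (standard nvidia-smi options):

   # GPU model and memory usage at a glance
   nvidia-smi --query-gpu=name,memory.used,memory.free,memory.total --format=csv
   # the Processes table at the bottom of plain nvidia-smi lists anything else using the GPU
   nvidia-smi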

Best, Jure

Hi Jure:

My GPU is an A6000, and no one else was using it at the time. The strange thing is that back in February it worked fine.

Thanks,
Ed

Hi Jure:

Weirdly, it works now when there is a concurrent process using the same GPU. It fails when there isn't one.

Best,
Ed

Hi @ehui,

Thank you for reporting the issue and providing a potential solution. I also saw your post on the FSL mailing list.

I ran into the exact same problem with an A100 on a research cluster. I tried running a PyTorch script in the background that uses the same GPU, but probtrackx still gives the same error. (I have tried a few different CUDA versions.)

Could you kindly let me know what concurrent process works for you? Thanks!

The error:

...................Allocated GPU 0...................
Device memory available (MB): 39396 ---- Total device memory(MB): 40326
Memory required for allocating data (MB): 924
CUDA Runtime Error: out of memory
Device memory available after copying data (MB): 38434
Running 369480 streamlines in parallel using 2 STREAMS
Total number of streamlines: 912820000

The simple PyTorch program:

import torch
import time

# Check if CUDA is available, otherwise use CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Create a small tensor on GPU
tensor = torch.tensor([1.0, 2.0, 3.0], device=device)

# Run indefinitely
while True:
    # Perform a simple addition
    tensor += torch.tensor([1.0, 1.0, 1.0], device=device)
    tensor -= torch.tensor([1.0, 1.0, 1.0], device=device)
    # print(f"Updated tensor: {tensor}")
    
    # Sleep for a second to make the loop slower
    time.sleep(1)

With it running in the background, occupying some memory:

+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.15              Driver Version: 550.54.15      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA A100-SXM4-40GB          Off |   00000000:01:00.0 Off |                    0 |
| N/A   31C    P0             61W /  400W |     504MiB /  40960MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
                                                                                         
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A   1385338      C   python                                        494MiB |
+-----------------------------------------------------------------------------------------+

Best,
Zhen-Qi Liu

Hi,

This seems like a system issue. Have you tried contacting your system admins? An out-of-memory error triggered by that simple hello-world PyTorch script does not make much sense.

Best, Jure

Hi Jure,

Sorry about the confusion! The PyTorch script runs fine without error. I meant that probtrackx still gives the out-of-memory error when “there is a concurrent process using the same GPU”.

Something worth noting is that the available memory reported by probtrackx (close to all of the GPU's memory) does not match the nvidia-smi report (~500 MiB occupied by the PyTorch script), but I don't know the tool well enough to tell whether that is an issue.
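
If useful, one way to compare the two reports would be to log nvidia-smi in the background while probtrackx runs; just a rough sketch using standard nvidia-smi options (-l samples every N seconds):

   # sample GPU memory once per second during the probtrackx run, then compare
   # with the "Device memory available" values probtrackx prints at startup
   nvidia-smi --query-gpu=timestamp,memory.used,memory.free --format=csv -l 1 > gpu_mem.log &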

Best,
Zhen-Qi

The weirdest thing is that, based on the log, the process tries to copy ~900 MB of data to the GPU and runs out of memory, even though the GPU has roughly 40 GB of it …

Unfortunately, I have no idea how to help you out here, as I have not yet seen this myself on our system.

I agree that it might have more to do with probtrackx itself rather than with QuNex.

It seems the reported errors are mostly on the A100 and A6000; it would be great if you could test with these GPUs, so we can be sure to pinpoint the issue to probtrackx.

https://www.jiscmail.ac.uk/cgi-bin/wa-jisc.exe?A2=ind2404&L=FSL&P=R55148
https://www.jiscmail.ac.uk/cgi-bin/wa-jisc.exe?A2=ind1902&L=FSL&D=0&P=332507

Thanks!

Best,
Zhen-Qi

I believe the version that ships with FSL is built and tested for CUDA 10.2 (same as eddy_gpu). Do you know if we could somehow get a different build of it? The thing is that our hands are pretty tied when it comes to these external tools; to fix issues like this, the developers of the tool usually need to take care of it.
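
For reference, one way to check which build is installed (outside the container) would be something like the sketch below. It assumes $FSLDIR points at a local FSL install and that the CUDA runtime is dynamically linked, which may not be the case for every build; the exact binary name depends on what the listing shows.

   # older FSL installs name the GPU binaries per CUDA version, e.g. probtrackx2_gpu10.1
   ls "${FSLDIR}/bin/" | grep -i probtrackx2_gpu
   # if the binary is dynamically linked, this shows which CUDA runtime it expects
   ldd "${FSLDIR}/bin/probtrackx2_gpu10.1" | grep -i cuda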

Hi,

I’m not an expert on this. I only briefly looked at the FSL code repo, and it seems they have a pretty complex and streamlined build system (which includes configs for CUDA versions up to 11.x and for the Ampere architecture, although I don’t know if/how those are used).

It is certainly weird that probtrackx fails but eddy and bedpostx work (unless probtrackx was built differently somehow). If that’s the case, it might be something in the code that has to be fixed by the original developers.

Best,
Zhen-Qi