I have been trying to run dwi_probtrackx_dense_gpu using the 0.99.2d release, but I got the following error:
# Generated by QuNex 0.99.2 on 2024-02-09_09.49.18.115382
#
-- qunex.sh: Specified Command-Line Options - Start --
Study Folder: /home/ehui/qunex/cimt
Sessions Folder: /home/ehui/qunex/cimt/sessions
Session: 024A
probtrackX GPU scripts Folder: /opt/qunex/bash/qx_utilities/diffusion_tractography_dense/tractography_gpu_scripts
Compute Matrix1: no
Compute Matrix3: yes
Number of samples for Matrix1: 10000
Number of samples for Matrix3: 3000
Distance correction: no
Store streamlines length: no
Overwrite prior run: yes
No GPU: no
-- qunex.sh: Specified Command-Line Options - End --
------------------------- Start of work --------------------------------
--- probtrackX GPU for session 024A...
--- Removing existing Probtrackxgpu Matrix3 dense run for 024A...
Checking if ProbtrackX Matrix 3 and dense connectome was completed on 024A...
ProbtrackX Matrix 3 solution and dense connectome incomplete for 024A. Starting run with 3000 samples...
Running the following probtrackX GPU command:
---------------------------
/opt/qunex/bash/qx_utilities/diffusion_tractography_dense/tractography_gpu_scripts/run_matrix3.sh /home/ehui/qunex/cimt/sessions 024A 3000 no no no
---------------------------
-- Queueing Probtrackx
/home/ehui/qunex/cimt/sessions/024A/hcp/024A/MNINonLinear/Results/Tractography/commands_Mat3.sh: line 1: /opt/fsl/fsl/bin/probtrackx2_gpu10.1: No such file or directory
-- Queueing Post-Matrix 3 Calls
For reference, these are the CUDA versions installed on my workstation:
CUDA-10.1, CUDA-10.2, CUDA-11.8
Could you please provide the full command call that you used? I believe this is a minor thing; we just need to set the correct CUDA version.
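The `No such file or directory` error above means the generated run script is calling a binary named probtrackx2_gpu10.1 that does not exist in the FSL bin directory. One common workaround (a sketch under assumptions, not an official fix; paths are hypothetical and should be replaced with your actual FSL installation, e.g. /opt/fsl/fsl/bin) is to symlink an existing versioned build, such as probtrackx2_gpu10.2, to the name the script expects. Demonstrated here in a throwaway directory so nothing on the system is touched:

```shell
# Hypothetical workaround sketch: bridge a missing versioned binary name with
# a symlink. We use a temp directory with a stand-in script; on a real system
# you would operate on your actual FSL bin directory instead.
bindir=$(mktemp -d)
printf '#!/bin/sh\necho probtrackx2_gpu CUDA 10.2 build\n' > "$bindir/probtrackx2_gpu10.2"
chmod +x "$bindir/probtrackx2_gpu10.2"
# Expose the existing build under the name the generated script expects:
ln -s "$bindir/probtrackx2_gpu10.2" "$bindir/probtrackx2_gpu10.1"
"$bindir/probtrackx2_gpu10.1"   # now resolves via the symlink
```

Before symlinking on a real system, check which builds actually shipped (`ls /opt/fsl/fsl/bin/probtrackx2_gpu*`) and which CUDA toolkits are installed; a symlink only fixes the name, and mismatched CUDA runtime libraries can still fail at load time.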
I was previously able to run dwi_probtrackx_dense_gpu after using the command you suggested. But for some reason, the problem re-emerged:
...................Allocated GPU 0...................
Device memory available (MB): 48412 ---- Total device memory(MB): 48685
Memory required for allocating data (MB): 897
CUDA Runtime Error: out of memory
Device memory available after copying data (MB): 47476
Running 456412 streamlines in parallel using 2 STREAMS
Total number of streamlines: 577848000
This is now a different error. Here the problem is that your GPU does not have enough memory. Based on the info above this is odd, as it seems like only a small portion of the GPU's memory should be used here. What is your GPU? Is anyone else using the system/GPU?
Thank you for reporting the issue and providing a potential solution. I also saw your post in the FSL mailing list.
I ran into the exact same problem with an A100 on a research cluster. I tried running a PyTorch script in the background that uses the same GPU, but probtrackx still gives the same error. (I have tried a few different CUDA versions.)
Could you let me know what concurrent process works for you? Thanks!
The error:
...................Allocated GPU 0...................
Device memory available (MB): 39396 ---- Total device memory(MB): 40326
Memory required for allocating data (MB): 924
CUDA Runtime Error: out of memory
Device memory available after copying data (MB): 38434
Running 369480 streamlines in parallel using 2 STREAMS
Total number of streamlines: 912820000
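For scale, the two numbers in that log imply how many sequential batches probtrackx would have to run: the total streamline count divided by the number processed in parallel. This is pure back-of-the-envelope arithmetic on the logged values, not a description of probtrackx's actual scheduler; note the OOM happens earlier, while allocating data, so the batch count only illustrates the job size:

```python
import math

# Numbers taken from the log output above
total_streamlines = 912_820_000
parallel = 369_480            # streamlines run in parallel per iteration

# Sequential batches needed to cover all streamlines
batches = math.ceil(total_streamlines / parallel)
print(batches)  # -> 2471
```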
The simple PyTorch program:

import torch
import time

# Check if CUDA is available, otherwise use CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Create a small tensor on the GPU
tensor = torch.tensor([1.0, 2.0, 3.0], device=device)

# Run indefinitely
while True:
    # Perform a simple addition and subtraction
    tensor += torch.tensor([1.0, 1.0, 1.0], device=device)
    tensor -= torch.tensor([1.0, 1.0, 1.0], device=device)
    # print(f"Updated tensor: {tensor}")
    # Sleep for a second to slow the loop down
    time.sleep(1)
With it running in the background, occupying some memory:
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.15 Driver Version: 550.54.15 CUDA Version: 12.4 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA A100-SXM4-40GB Off | 00000000:01:00.0 Off | 0 |
| N/A 31C P0 61W / 400W | 504MiB / 40960MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| 0 N/A N/A 1385338 C python 494MiB |
+-----------------------------------------------------------------------------------------+
This seems like a system issue. Have you tried contacting your system admins? An out-of-memory error triggered by that simple hello-world PyTorch script does not make much sense.
Sorry about the confusion! The PyTorch script runs fine without error. I meant that probtrackx still gives the out-of-memory error when "there is a concurrent process using the same GPU".
Something worth noting is that the available memory reported by probtrackx (close to the full capacity) does not match the nvidia-smi report (~500 MiB occupied by the PyTorch script), but I don't know the tools well enough to tell whether that is an issue.
The weirdest thing is that, based on the log, the process tries to copy ~900 MB of data to the GPU and runs out of memory, even though the GPU has ~40 GB of it ...
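One way to narrow down the mismatch is to ask the CUDA runtime directly how much memory it thinks is free, and compare that with nvidia-smi and with probtrackx's log. A minimal sketch using PyTorch's `torch.cuda.mem_get_info` (a real API, but the comparison workflow itself is just a debugging suggestion); it degrades gracefully when no GPU or no PyTorch is available:

```python
# Hedged debugging sketch: report free/total device memory as seen by the
# CUDA runtime (via PyTorch), for comparison against nvidia-smi output.
def gpu_memory_report():
    try:
        import torch
    except ImportError:
        return "PyTorch not installed"
    if not torch.cuda.is_available():
        return "no CUDA device visible"
    # mem_get_info returns (free_bytes, total_bytes) for the given device
    free_b, total_b = torch.cuda.mem_get_info(0)
    return f"free MiB: {free_b // 2**20}, total MiB: {total_b // 2**20}"

print(gpu_memory_report())
```

If this reports nearly all 40 GB free while probtrackx still OOMs on a ~900 MB copy, that would point at something inside probtrackx's allocation logic (e.g. how it sizes its buffers) rather than actual memory pressure.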
Unfortunately, I have no idea how to help you out here, as I have not seen this myself on our system yet.
I agree that it likely has more to do with probtrackx itself than with QuNex.
It seems the reported errors occur mostly on A100 and A6000 GPUs; it would be great if you could try testing with these GPUs, so we could be sure to pinpoint the issue to probtrackx.
I believe the version shipped with FSL is built and tested for CUDA 10.2 (same as eddy_gpu). Do you know if we could somehow get a different build of it? The thing is, our hands are pretty tied when it comes to these external tools; usually, to fix issues like these, the developers of the tool need to take care of it.
I'm not an expert on this. I only briefly looked at the FSL code repo, and they seem to have a fairly complex, streamlined build system (which includes configs for CUDA versions up to 11.x and for the Ampere architecture, although I don't know if or how those are used).
It is certainly weird that probtrackx fails while eddy and bedpostx work (unless probtrackx was built differently somehow). If that's the case, it might be something in the code that has to be fixed by the original developers.