[RESOLVED] Dwi_probtrackx_dense_gpu error

An update on this issue: I found that requesting 128 GB of RAM for the cluster call allowed me to get past this error:

msi_resources_time=06:00:00; msi_resources_nodes=1; msi_resources_ntaskspernode=24; msi_resources_mem=256000; msi_queue=a100-8; msi_resources_gpu=gpu:a100:1; msi_resources_jobname=PROBTRACKX; \
study_sharedfolder=/home/moanae/shared/project_K99_ChrTMDHCP_qunex02; \
qunex_container dwi_probtrackx_dense_gpu \
--batchfile=${study_sharedfolder}/processing/batch_K99Aim2.txt --sessionsfolder=${study_sharedfolder}/sessions --sessions="10001" \
--omatrix1="yes" --omatrix3="yes" --nsamplesmatrix1="10000" --nsamplesmatrix3="3000" --distancecorrection="yes" --storestreamlineslength="yes" --overwrite="yes" \
--nv \
--bash_pre="module load cuda/11.2" \
--scheduler=SLURM,time=${msi_resources_time},nodes=${msi_resources_nodes},cpus-per-task=${msi_resources_ntaskspernode},mem=${msi_resources_mem},partition=${msi_queue},gres=${msi_resources_gpu},jobname=${msi_resources_jobname} \
--bind=${study_sharedfolder}:${study_sharedfolder} --container=${HOME}/qunex/qunex_suite-0.98.4.sif
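
For completeness, once a job finishes, SLURM's accounting tools can confirm how much memory it actually peaked at versus what was requested (a minimal sketch, assuming job accounting is enabled on the cluster; <jobid> is a placeholder for the actual SLURM job ID):

# Compare requested memory (ReqMem) with peak resident memory (MaxRSS) for a completed job
sacct -j <jobid> --format=JobID,JobName,ReqMem,MaxRSS,Elapsed,State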

But it then hit another out-of-memory error, which I believe refers to GPU memory, even though the device seemed to have enough memory available:

...................Allocated GPU 0...................
Device memory available (MB): 40116 ---- Total device memory(MB): 40536
Memory required for allocating data (MB): 943
CUDA Runtime Error: out of memory
Device memory available after copying data (MB): 39136
Running 376230 streamlines in parallel using 2 STREAMS
Total number of streamlines: 912820000
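
As a sanity check on the numbers in the log: assuming matrix1 seeds the standard 91,282 dense grayordinates, the reported streamline total is exactly what the --nsamplesmatrix1 setting implies, so the count itself looks as expected:

# total streamlines = number of seed grayordinates x samples per seed
# (assumption: 91,282 dense grayordinate seeds, as in a standard dense connectome)
echo $(( 91282 * 10000 ))   # 912820000, matching "Total number of streamlines" above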

Full log here:
error_dwi_probtrackx_dense_gpu_10001_2023-08-26_15.31.01.327032.log (12.1 KB)

Then I tried again and watched the GPU node's activity, and the node started reporting kernel errors related to the GPU job:

ahl04:~ moana004$ ssh agb05
!! MSI Cluster Resource
!! Disconnect IMMEDIATELY if you are not an authorized user!
agb05:~ moana004$ nvidia-smi
Sat Aug 26 15:34:08 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.103.01   Driver Version: 470.103.01   CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A100-SXM...  Off  | 00000000:8B:00.0 Off |                    0 |
| N/A   32C    P0    57W / 400W |      0MiB / 40536MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
agb05:~ moana004$ 
Message from syslogd@agb05 at Aug 26 15:34:21 ...
 kernel:NMI watchdog: BUG: soft lockup - CPU#39 stuck for 22s! [probtrackx2_gpu:886161]

Message from syslogd@agb05 at Aug 26 15:34:21 ...
 kernel:NMI watchdog: BUG: soft lockup - CPU#39 stuck for 22s! [probtrackx2_gpu:886161]

Message from syslogd@agb05 at Aug 26 15:35:13 ...
 kernel:NMI watchdog: BUG: soft lockup - CPU#32 stuck for 23s! [probtrackx2_gpu:886161]

Message from syslogd@agb05 at Aug 26 15:35:13 ...
 kernel:NMI watchdog: BUG: soft lockup - CPU#32 stuck for 23s! [probtrackx2_gpu:886161]

Message from syslogd@agb05 at Aug 26 15:35:41 ...
 kernel:NMI watchdog: BUG: soft lockup - CPU#32 stuck for 22s! [probtrackx2_gpu:886161]

Message from syslogd@agb05 at Aug 26 15:35:41 ...
 kernel:NMI watchdog: BUG: soft lockup - CPU#32 stuck for 22s! [probtrackx2_gpu:886161]
Connection to agb05 closed by remote host.
Connection to agb05 closed.
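
In case it helps when reporting this, the same soft-lockup messages should also be visible in the node's kernel logs; a rough sketch of what the admins (or anyone with sufficient privileges on agb05) could pull up:

# Kernel messages around the soft lockups (usually needs root or adm privileges)
dmesg -T | grep -i "soft lockup"
journalctl -k --since "2023-08-26 15:30" | grep -i probtrackx2_gpu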

This made me think that this GPU cluster (NVIDIA A100) may be the issue, so I tried running the same command on an NVIDIA Tesla V100 GPU node:

msi_resources_time=06:00:00; msi_resources_nodes=1; msi_resources_ntaskspernode=12; msi_resources_mem=128000; msi_queue=v100; msi_resources_gpu=gpu:v100:1; msi_resources_jobname=PROBTRACKX; \
study_sharedfolder=/home/moanae/shared/project_K99_ChrTMDHCP_qunex02; \
qunex_container dwi_probtrackx_dense_gpu \
--batchfile=${study_sharedfolder}/processing/batch_K99Aim2.txt --sessionsfolder=${study_sharedfolder}/sessions --sessions="10001" \
--omatrix1="yes" --omatrix3="yes" --nsamplesmatrix1="10000" --nsamplesmatrix3="3000" --distancecorrection="yes" --storestreamlineslength="yes" --overwrite="yes" \
--nv \
--bash_pre="module load cuda/11.2" \
--scheduler=SLURM,time=${msi_resources_time},nodes=${msi_resources_nodes},cpus-per-task=${msi_resources_ntaskspernode},mem=${msi_resources_mem},partition=${msi_queue},gres=${msi_resources_gpu},jobname=${msi_resources_jobname} \
--bind=${study_sharedfolder}:${study_sharedfolder} --container=${HOME}/qunex/qunex_suite-0.98.4.sif

This run got past the errors above, and the GPU is being used without issues:

ln0004:~ moana004$ ssh cn2106 nvidia-smi
!! MSI Cluster Resource
!! Disconnect IMMEDIATELY if you are not an authorized user!
Sat Aug 26 15:51:41 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.103.01   Driver Version: 470.103.01   CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla V100-PCIE...  Off  | 00000000:D8:00.0 Off |                    0 |
| N/A   48C    P0    47W / 250W |  27137MiB / 32510MiB |    100%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A    714213      C   ...l/bin/probtrackx2_gpu10.2    27113MiB |
+-----------------------------------------------------------------------------+
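
(As a lighter-weight way to keep an eye on the run, nvidia-smi can also report just a few fields instead of the full table; these are standard query fields:)

# Spot-check GPU model, utilization, and memory use on the compute node
ssh cn2106 'nvidia-smi --query-gpu=name,utilization.gpu,memory.used,memory.total --format=csv'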

Based on this, it seems there might be some issue between the NVIDIA A100 GPU cluster and the probtrackx2_gpu command. What are your thoughts on this? Any suggestions on how I can work with my local cluster staff to figure this out? Thank you.
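
In case it helps the discussion, here is roughly the kind of information I could collect for them (a sketch; <jobid> is a placeholder for the failing A100 job):

# Which node and resources the failing job ran on
scontrol show job <jobid> | grep -E "NodeList|TRES"
# GPU model and driver on the affected node
ssh agb05 nvidia-smi --query-gpu=name,driver_version --format=csv
# Software used in the call: container qunex_suite-0.98.4.sif, module cuda/11.2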

Estephan