Updates on this issue. I found that requesting more RAM for the cluster call (256000 MB in the command below) allowed me to get past this error:
msi_resources_time=06:00:00; msi_resources_nodes=1; msi_resources_ntaskspernode=24; msi_resources_mem=256000; msi_queue=a100-8; msi_resources_gpu=gpu:a100:1; msi_resources_jobname=PROBTRACKX; \
study_sharedfolder=/home/moanae/shared/project_K99_ChrTMDHCP_qunex02; \
qunex_container dwi_probtrackx_dense_gpu \
--batchfile=${study_sharedfolder}/processing/batch_K99Aim2.txt --sessionsfolder=${study_sharedfolder}/sessions --sessions="10001" \
--omatrix1="yes" --omatrix3="yes" --nsamplesmatrix1="10000" --nsamplesmatrix3="3000" --distancecorrection="yes" --storestreamlineslength="yes" --overwrite="yes" \
--nv \
--bash_pre="module load cuda/11.2" \
--scheduler=SLURM,time=${msi_resources_time},nodes=${msi_resources_nodes},cpus-per-task=${msi_resources_ntaskspernode},mem=${msi_resources_mem},partition=${msi_queue},gres=${msi_resources_gpu},jobname=${msi_resources_jobname} \
--bind=${study_sharedfolder}:${study_sharedfolder} --container=${HOME}/qunex/qunex_suite-0.98.4.sif
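For reference, this is roughly how I can double-check what the scheduler actually granted once the job is submitted (the job ID below is just a placeholder, and the exact sacct fields available may vary by SLURM version):
# Placeholder job ID; substitute the ID reported when the job is submitted
sacct -j 12345678 --format=JobID,Partition,ReqMem,MaxRSS,Elapsed,State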
But it then hit another out-of-memory error, which I believe refers to GPU memory, even though the device seemed to have enough memory available:
...................Allocated GPU 0...................
Device memory available (MB): 40116 ---- Total device memory(MB): 40536
Memory required for allocating data (MB): 943
CUDA Runtime Error: out of memory
Device memory available after copying data (MB): 39136
Running 376230 streamlines in parallel using 2 STREAMS
Total number of streamlines: 912820000
Full log here:
error_dwi_probtrackx_dense_gpu_10001_2023-08-26_15.31.01.327032.log (12.1 KB)
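To keep an eye on the device while the job runs, I can poll nvidia-smi on the compute node, for example (standard query fields, sampled every 5 seconds):
# Stream GPU memory and utilization from the A100 node every 5 seconds
ssh agb05 nvidia-smi --query-gpu=timestamp,memory.used,memory.total,utilization.gpu --format=csv -l 5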
Then I tried again and watched the GPU node’s activity; the node began emitting kernel soft-lockup errors tied to the probtrackx2_gpu process, and eventually closed my connection:
ahl04:~ moana004$ ssh agb05
!! MSI Cluster Resource
!! Disconnect IMMEDIATELY if you are not an authorized user!
agb05:~ moana004$ nvidia-smi
Sat Aug 26 15:34:08 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.103.01 Driver Version: 470.103.01 CUDA Version: 11.4 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA A100-SXM... Off | 00000000:8B:00.0 Off | 0 |
| N/A 32C P0 57W / 400W | 0MiB / 40536MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
agb05:~ moana004$
Message from syslogd@agb05 at Aug 26 15:34:21 ...
kernel:NMI watchdog: BUG: soft lockup - CPU#39 stuck for 22s! [probtrackx2_gpu:886161]
Message from syslogd@agb05 at Aug 26 15:34:21 ...
kernel:NMI watchdog: BUG: soft lockup - CPU#39 stuck for 22s! [probtrackx2_gpu:886161]
Message from syslogd@agb05 at Aug 26 15:35:13 ...
kernel:NMI watchdog: BUG: soft lockup - CPU#32 stuck for 23s! [probtrackx2_gpu:886161]
Message from syslogd@agb05 at Aug 26 15:35:13 ...
kernel:NMI watchdog: BUG: soft lockup - CPU#32 stuck for 23s! [probtrackx2_gpu:886161]
Message from syslogd@agb05 at Aug 26 15:35:41 ...
kernel:NMI watchdog: BUG: soft lockup - CPU#32 stuck for 22s! [probtrackx2_gpu:886161]
Message from syslogd@agb05 at Aug 26 15:35:41 ...
kernel:NMI watchdog: BUG: soft lockup - CPU#32 stuck for 22s! [probtrackx2_gpu:886161]
Connection to agb05 closed by remote host.
Connection to agb05 closed.
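For the conversation with the cluster staff, these are the kinds of commands (likely needing admin privileges on the node) that should capture the kernel-side picture of the lockups:
# Kernel ring buffer with human-readable timestamps, filtered to the lockups
dmesg -T | grep -i "soft lockup"
# Or, where systemd-journald is available, the kernel log around the incident
journalctl -k --since "2023-08-26 15:30" --until "2023-08-26 15:40"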
These soft lockups made me think that this GPU cluster (NVIDIA A100 nodes) might be the issue, so I tried running the same command on an NVIDIA Tesla V100 GPU node:
msi_resources_time=06:00:00; msi_resources_nodes=1; msi_resources_ntaskspernode=12; msi_resources_mem=128000; msi_queue=v100; msi_resources_gpu=gpu:v100:1; msi_resources_jobname=PROBTRACKX; \
study_sharedfolder=/home/moanae/shared/project_K99_ChrTMDHCP_qunex02; \
qunex_container dwi_probtrackx_dense_gpu \
--batchfile=${study_sharedfolder}/processing/batch_K99Aim2.txt --sessionsfolder=${study_sharedfolder}/sessions --sessions="10001" \
--omatrix1="yes" --omatrix3="yes" --nsamplesmatrix1="10000" --nsamplesmatrix3="3000" --distancecorrection="yes" --storestreamlineslength="yes" --overwrite="yes" \
--nv \
--bash_pre="module load cuda/11.2" \
--scheduler=SLURM,time=${msi_resources_time},nodes=${msi_resources_nodes},cpus-per-task=${msi_resources_ntaskspernode},mem=${msi_resources_mem},partition=${msi_queue},gres=${msi_resources_gpu},jobname=${msi_resources_jobname} \
--bind=${study_sharedfolder}:${study_sharedfolder} --container=${HOME}/qunex/qunex_suite-0.98.4.sif
This run got past the above errors, and the GPU is being used without issue:
ln0004:~ moana004$ ssh cn2106 nvidia-smi
!! MSI Cluster Resource
!! Disconnect IMMEDIATELY if you are not an authorized user!
Sat Aug 26 15:51:41 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.103.01 Driver Version: 470.103.01 CUDA Version: 11.4 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 Tesla V100-PCIE... Off | 00000000:D8:00.0 Off | 0 |
| N/A 48C P0 47W / 250W | 27137MiB / 32510MiB | 100% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| 0 N/A N/A 714213 C ...l/bin/probtrackx2_gpu10.2 27113MiB |
+-----------------------------------------------------------------------------+
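Before pinging the cluster staff, I was also thinking of gathering a concrete comparison between the two setups, along these lines (assuming the FSL binaries sit under $FSLDIR/bin inside the container, which the truncated process name above seems to suggest):
# List the GPU builds of probtrackx2 bundled in the QuNex container
singularity exec ${HOME}/qunex/qunex_suite-0.98.4.sif bash -c 'ls $FSLDIR/bin/probtrackx2_gpu*'
# Compare GPU model, driver, and total memory on the two node types
ssh agb05  nvidia-smi --query-gpu=name,driver_version,memory.total --format=csv
ssh cn2106 nvidia-smi --query-gpu=name,driver_version,memory.total --format=csv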
Based on all this, it seems there might be some issue between the NVIDIA A100 GPU cluster and probtrackx2_gpu. What are your thoughts on this? Any suggestions on how I can work with my local cluster staff to figure this out? Thank you.
Estephan