[RESOLVED] hcp_fmri_surface LSF job getting stuck

Hello,

I have been trying to run hcp_fmri_surface for 6 subjects from the HCP_EP dataset as an LSF job. Previously, I was able to test this command successfully on individual subjects, both with and without run_turnkey, and without using an LSF job. I have also successfully run an LSF job for hcp_fmri_volume for these 6 subjects. Each subject has 4 BOLD runs, so I allotted 24 cores on a single high-memory node. For hcp_fmri_surface, I tried increasing the memory from 8 GB per core to 32 GB and the time limit from 4 hours to 24 hours. For individual subjects it does not take more than an hour, but just to rule out resources as the issue for the LSF job, I kept increasing those two values; that did not solve the problem. The process does not move beyond the resampling and smoothing step. I terminated it at 21.5 hours, since that was already too long.

I have attached the output log of the LSF job and the error log for one of the BOLD runs for one of the subjects. The error log of the LSF job basically says that the process received a keyboard interrupt. When I tried 8, 16, and 24 GB of memory per core, the process timed out, showing the tmp_logs for each BOLD run. Could the QuNex failure message in the output log file be due to the keyboard interrupt?

I have already reached out to the HPC Helpdesk at my institute, and they said they cannot help: since hcp_fmri_volume ran successfully, it is not an issue with the HPC system. Please let me know how to troubleshoot this, and whether you need any more information.

output_file_fmriSurf_test_group.out (22.5 KB)
error_hcp_fmri_surface_rfMRI_REST1_PA_1003_01_MR_2024-10-17_13.06.55.259367.log (5.6 KB)

Thanks,
Mona

Hi Mona,

If the scheduled job worked for hcp_fmri_volume, then it should also work for hcp_fmri_surface. The code that prepares everything for scheduling is the same in both cases, so if it works for one command, it should work for all the others as well.

Yes, the error you are posting is most likely due to manual termination. I have never seen hcp_fmri_surface complete in an hour as you describe; the resampling and smoothing step is very complex. What I do see is that you are creating 1 job that processes 6 sessions in parallel, meaning that the resources you allocated are shared by 6 simultaneous hcp_fmri_surface runs. If you want your system to process that in an "unconstrained" manner, you need a lot of resources.

Your approach is a bit unconventional; we usually create 6 jobs, where each one processes a single session, and then run these jobs in parallel. If you paste the command call you are using, I can maybe see how to change this behavior.

Best, Jure

Hi Jure,

Thank you for the response. As I mentioned in my previous message, I tested running the pipeline on a few individual subjects to get a feel for it, and in all these cases hcp_fmri_surface for each BOLD run finished in less than an hour. How long does it usually take? I am attaching the log of the command for one of the BOLDs of one of the subjects where it finished successfully. Does taking less than an hour mean something did not get done properly?

Here are the command calls I have been using along with the LSF job command.

Calls:

qunex_container hcp_fmri_surface  \
    --sessionsfolder="${STUDY}/sessions" \
    --batchfile="${STUDY}/processing/batch1.txt" \
    --bind="$WORK_DIR:$WORK_DIR"  \
    --hcp_filename='userdefined'  \
    --parsessions="6"  \
    --parelements="4"  \
    --overwrite="yes"  \
    --container="$QUNEX_CONTAINER"

LSF:
#BSUB -P acc_WenglerLab_SczEnsembles
#BSUB -J fmriSurf_test_group         # Job name
#BSUB -oo output_file_fmriSurf_test_group.out  # Output file
#BSUB -eo error_file_fmriSurf_test_group.err   # Error file
#BSUB -R "span[hosts=1]"     # Run on a single host
#BSUB -q premium               # Queue name
#BSUB -R rusage[mem=32000]
#BSUB -R himem
#BSUB -n 24
#BSUB -W 24:00
#BSUB -L /bin/bash

I started learning MRI processing a couple of months ago, so a lot of things are still new to me. I know how long FreeSurfer takes. If you could let me know how long the following steps generally take, that would be very helpful:

  1. Post_freesurfer
  2. fmri_volume
  3. fmri_surface
  4. icafix
  5. post_fix (I thought post_fix was a default option under icafix, but I don’t know why it did not get done for 2 individual subjects; hence I resorted to running it separately).
  6. msmall
  7. dedrift_and_resample (something similar to post_fix happened here)

done_hcp_fmri_surface_rfMRI_REST2_AP_1002_01_MR_2024-09-18_12.36.33.972756.log (7.2 KB)

Thanks,
Mona

Hi Mona,

  1. In my previous post I did not notice that your system is using the LSF scheduler. QuNex has out-of-the-box support for the SLURM and PBS scheduling systems, which makes life easier through a special --scheduler parameter. Using it automatically splits the processing so that each session is processed as its own job. Unfortunately, we do not support LSF (we used to, but we no longer have access to an LSF system, so we cannot develop and test against it). If you want to process your sessions this way, you will need to code a loop that iterates over sessions and schedules a job for each one. Something along these lines:
#!/bin/bash

SESSIONS="S001,S002,S003,S004"

# convert into an array
IFS=',' read -r -a session_array <<< "$SESSIONS"

# loop over the array and execute
for session in "${session_array[@]}"; do
  qunex_container hcp_fmri_surface  \
    --sessionsfolder="${STUDY}/sessions" \
    --batchfile="${STUDY}/processing/batch1.txt" \
    --sessions="${session}" \
    --bind="$WORK_DIR:$WORK_DIR"  \
    --hcp_filename='userdefined'  \
    --parelements="4"  \
    --overwrite="yes"  \
    --container="$QUNEX_CONTAINER"
done

Note that the above does not do any scheduling; it executes qunex_container separately for each session. To integrate scheduling into this logic, you need to combine the above with your current LSF script. I suspect your IT folks can help you here, as they should be experts on the LSF scheduler. If you do not want to deal with this, you can simply allocate more time so the process finishes in your current setup.
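To sketch how the loop and LSF could be combined (this is only an illustration, not something we can test on LSF ourselves: the session IDs, file names, and resource values below are placeholders you would adapt to your cluster), you could generate one small job script per session and submit each with bsub:

```shell
#!/bin/bash
# Sketch: one LSF job per session. Session IDs, paths, and resource
# values are placeholders; adapt them to your cluster and data.
SESSIONS="S001,S002,S003,S004"

# convert into an array
IFS=',' read -r -a session_array <<< "$SESSIONS"

for session in "${session_array[@]}"; do
  job_script="job_${session}.lsf"
  # Write a per-session job script reusing your #BSUB settings,
  # but sized for a single session (e.g., 4 cores for 4 BOLDs).
  # Escaped variables (\$STUDY etc.) are expanded at job run time.
  cat > "$job_script" <<EOF
#BSUB -J fmriSurf_${session}
#BSUB -oo output_${session}.out
#BSUB -eo error_${session}.err
#BSUB -R "span[hosts=1]"
#BSUB -q premium
#BSUB -R rusage[mem=8000]
#BSUB -n 4
#BSUB -W 24:00
#BSUB -L /bin/bash

qunex_container hcp_fmri_surface \\
    --sessionsfolder="\${STUDY}/sessions" \\
    --batchfile="\${STUDY}/processing/batch1.txt" \\
    --sessions="${session}" \\
    --bind="\$WORK_DIR:\$WORK_DIR" \\
    --hcp_filename='userdefined' \\
    --parelements="4" \\
    --overwrite="yes" \\
    --container="\$QUNEX_CONTAINER"
EOF
  # Submit the generated script (uncomment on your cluster):
  # bsub < "$job_script"
done
```

Each generated script asks only for the resources a single session needs, and LSF then runs the jobs in parallel on whatever nodes are free.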

  2. Ah, I see that you ran hcp_fmri_surface over a single session and a single BOLD. In that case, it can complete in a couple of hours. I was talking about processing all BOLDs in a non-parallel way (which is the default if you do not set the --parelements parameter). If your session has, for example, 8 BOLDs, then this can easily take a day. Everything also depends on how powerful your processing system is and on the quality of your data (data with a lot of defects can take very long to process), so it is hard to give good estimates. Below are some ballparks based on our quickstart (2 resting-state BOLDs). I lean towards conservative numbers; in practice, on a decent system, processing should be faster.

  1. hcp_post_freesurfer, ~4 hours.

  2. hcp_fmri_volume, ~8 hours.

  3. hcp_fmri_surface, ~8 hours.

  4. hcp_icafix, ~4 hours.

  5. hcp_post_fix, ~30 minutes.

  6. hcp_msmall, ~4 hours.

  7. hcp_dedrift_and_resample, ~30 minutes.

Like I said, if you are processing multiple sessions at the same time with more imaging data (e.g., 4 or 8 BOLDs instead of 2), things will start to add up.

Best, Jure

Hi Jure,

Thank you so much for the detailed response! I tried the approach you suggested. I have not heard back from the IT folks at my institute yet, so I just tried it without scheduling, but it has been more than 24 hours and hcp_fmri_surface is still stuck at the resampling and smoothing step. Do you have any suggestions on what to do? Should I troubleshoot any of the previous steps, or the data, or something else? Sorry to bother you with this, but I have been trying to get this done for a while now. I simply cannot understand why this batch run does not finish when it ran smoothly for individual subjects and the batch run for hcp_fmri_volume completed.

Thanks,
Mona

Hi Mona,

the things you are describing are most likely related to specifics of your system and not to QuNex, so it is hard for me to help you out. As you said, jobs finish when they are much smaller in scope but get stuck or become extremely slow when you scale them up. There are several possibilities here: maybe you are not allocating enough resources, or maybe your system is unable to handle that much work in parallel.

I do not think a solution without scheduling will work. When you log in to a supercomputing cluster, you traditionally end up on a weak login node that is not suitable for neuroimaging processing. Only when you schedule jobs are they executed on the powerful compute nodes.

Another possibility is that the processing is not actually stuck in the resampling and smoothing step, but crashed abruptly, and that is simply the last entry you see in the log.

Best, Jure

Hi Jure,

I was able to run a batch for hcp_fmri_surface command successfully! Thank you so much for your help!

Thanks,
Mona