[RESOLVED] Job stops running w/ no error

Hello,

I am running the quick start instructions and, with minimal modifications (see here), got them to submit a job to my scheduler.

I see the SLURM log file in my $HOME directory, and it looks like the job begins successfully:

--> unsetting the following environment variables: PATH MATLABPATH PYTHONPATH QUNEXVer TOOLS QUNEXREPO QUNEXPATH QUNEXLIBRARY QUNEXLIBRARYETC TemplateFolder FSL_FIXDIR FREESURFERDIR FREESURFER_HOME FREESURFER_SCHEDULER FreeSurferSchedulerDIR WORKBENCHDIR DCMNIIDIR DICMNIIDIR MATLABDIR MATLABBINDIR OCTAVEDIR OCTAVEPKGDIR OCTAVEBINDIR RDIR HCPWBDIR AFNIDIR PYLIBDIR FSLDIR FSLGPUDIR PALMDIR QUNEXMCOMMAND HCPPIPEDIR CARET7DIR GRADUNWARPDIR HCPPIPEDIR_Templates HCPPIPEDIR_Bin HCPPIPEDIR_Config HCPPIPEDIR_PreFS HCPPIPEDIR_FS HCPPIPEDIR_PostFS HCPPIPEDIR_fMRISurf HCPPIPEDIR_fMRIVol HCPPIPEDIR_tfMRI HCPPIPEDIR_dMRI HCPPIPEDIR_dMRITract HCPPIPEDIR_Global HCPPIPEDIR_tfMRIAnalysis HCPCIFTIRWDIR MSMBin HCPPIPEDIR_dMRITractFull HCPPIPEDIR_dMRILegacy AutoPtxFolder FSLGPUScripts FSLGPUBinary EDDYCUDADIR USEOCTAVE QUNEXENV CONDADIR MSMBINDIR MSMCONFIGDIR R_LIBS FSL_FIX_CIFTIRW FSFAST_HOME SUBJECTS_DIR MINC_BIN_DIR MNI_DIR MINC_LIB_DIR MNI_DATAPATH FSF_OUTPUT_FORMAT

Generated by QuNex
------------------------------------------------------------------------
Version: 0.93.6
User: smeisler
System: node039
OS: RedHat Linux #1 SMP Wed Aug 7 18:08:02 UTC 2019
------------------------------------------------------------------------

        ██████\                  ║      ██\   ██\
       ██  __██\                 ║      ███\  ██ |
       ██ /  ██ |██\   ██\       ║      ████\ ██ | ██████\ ██\   ██\
       ██ |  ██ |██ |  ██ |      ║      ██ ██\██ |██  __██\\██\ ██  |
       ██ |  ██ |██ |  ██ |      ║      ██ \████ |████████ |\████  /
       ██ ██\██ |██ |  ██ |      ║      ██ |\███ |██   ____|██  ██\
       \██████ / \██████  |      ║      ██ | \██ |\███████\██  /\██\
        \___███\  \______/       ║      \__|  \__| \_______\__/  \__|
            \___|                ║


                       DEVELOPED & MAINTAINED BY:

                    Anticevic Lab, Yale University
               Mind & Brain Lab, University of Ljubljana
                     Murray Lab, Yale University

                      COPYRIGHT & LICENSE NOTICE:

Use of this software is subject to the terms and conditions defined in
'LICENSE.md' which is a part of the QuNex Suite source code package:
https://bitbucket.org/oriadev/qunex/src/master/LICENSE.md

 ---> Setting up Octave


 ........................ Running QuNex v0.93.6 ........................

but the job stops almost immediately (according to squeue). I do not see any error messages in any of the outputs. Any idea what might be happening here, or how best to debug it?
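
For what it's worth, this is roughly how I was watching the job (the job ID below is a placeholder):

# watch the queue, then check the final state and exit code once the job disappears
squeue -u $USER
sacct -j <jobid> --format=JobID,State,ExitCode,Elapsed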

Thanks,
Steven

I already updated the qunex_container script, so you can use that one. Logs for specific QuNex commands can be found in <study_folder>/processing/logs/comlogs, while the runlogs subfolder holds the more general, top-level logs. The reason why everything breaks is probably in there: if a log has an error prefix, that step went wrong, and the first step with an error is the one to focus on. If it crashed immediately, there is probably something wrong with data onboarding (import_dicom and map_raw_data).
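
A quick way to scan for a failing step from the command line (paths assume the default study layout, so adjust <study_folder> to yours):

# list command logs in chronological order and flag any whose names start with "error"
ls -ltr <study_folder>/processing/logs/comlogs
ls <study_folder>/processing/logs/comlogs | grep -i '^error'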

One issue could be that the data file was not downloaded completely; HCPA001.zip should be around 2.35 GB. If it is smaller, something went wrong during the download. A better way of downloading files from Google Drive is gdown:

pip install gdown
gdown --id 1CbN9dtOQk3PwUeqnBdNeYmWizay2gSy7

This should download the full data file.
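
To verify the download before re-running (the size check is approximate; unzip -t just tests archive integrity):

# the archive should be roughly 2.35 GB and the integrity test should report no errors
ls -lh HCPA001.zip
unzip -t HCPA001.zip > /dev/null && echo "archive OK"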

If you cannot install gdown, another option is to run this:

wget --load-cookies /tmp/cookies.txt "https://docs.google.com/uc?export=download&confirm=$(wget --quiet --save-cookies /tmp/cookies.txt --keep-session-cookies --no-check-certificate 'https://docs.google.com/uc?export=download&id=1CbN9dtOQk3PwUeqnBdNeYmWizay2gSy7' -O- | sed -rn 's/.*confirm=([0-9A-Za-z_]+).*/\1\n/p')&id=1CbN9dtOQk3PwUeqnBdNeYmWizay2gSy7" -O HCPA001.zip && rm -rf /tmp/cookies.txt

I will update the quick start guide. Thanks for reporting this!

Thanks again for the prompt reply! Looks promising, as the file I originally downloaded was less than 1 MB (which is something I should have flagged much earlier :smile:). I am using gdown now and will retry, then update accordingly.

Unfortunately, I am experiencing the same issue (I did update the qunex_container script).

The job runs for about 5 seconds; besides the SLURM log file in my home directory, no output files or directories are created (including /processing/logs/comlogs).

As a sanity check, I just re-tested everything on our end and it seems to be working; it is just about to start the HCP Pipelines. In your case, is the study folder (named quickstart by default) created inside the work dir? If not, then it seems the scheduler or the system kills the job immediately. It is weird that there is no log of that, though.
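
For example, something like this should list it if it exists (replace <work_dir> with the folder you pass to qunex_container):

# check whether the quickstart study folder was created
ls -ld <work_dir>/quickstart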

What are the settings you are using for the scheduler parameter? Those are quite system dependent. What is provided in the quick start is just a general setup that should in theory work on most systems, but many HPC systems require at least some reconfiguration.
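
As an illustration, the scheduler parameter is passed as a comma-separated list of sbatch-style options appended to the qunex_container call; the values below are placeholders only, so adjust the partition, time, and memory to your cluster:

# illustrative SLURM settings -- adapt to your cluster's partitions and limits
--scheduler="SLURM,jobname=qx_quickstart,time=04:00:00,cpus-per-task=2,mem-per-cpu=16000,partition=general"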

The study folder is not created.

I used the default call from the quick start guide; the scheduler settings all appear to be valid SLURM sbatch options.

I will try to do some more debugging, but nothing seems out of the ordinary.

Yes, the variables are valid, but maybe your system does not allow such a configuration for some reason. One more thing that could help you debug: enter the QuNex container manually and try executing some QuNex commands in there, just so we know whether the issue is with the container or with the system.

# singularity
singularity shell <path_to_the_sif_image, e.g. /data/containers/qunex_suite-0.93.6.sif>

# or docker
docker run -it gitlab.qunex.yale.edu:5002/qunex/qunexcontainer:0.93.6 /bin/bash

# source qunex environment, I believe this is not required with docker
source /opt/qunex/env/qunex_environment.sh

# check if env status is OK
qunex_env_status

# create a study
qunex create_study --studyfolder=/home/test_study

If the singularity or docker commands are not recognized, you might need to load the appropriate modules or switch from a login node to a compute node:

srun --ntasks-per-node=1 --time=02:00:00 --mem-per-cpu=16000 --pty bash -i
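
If the singularity command itself is what is missing, loading an environment module usually helps; the module names vary by cluster, so the ones below are just examples:

# check which modules are available and load the right one for your system
module avail singularity
module load singularity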

Hope this gives us some answers.

Hi Steven! Any updates regarding this? Can I help out somehow?