[RESOLVED] importDICOM: command not working on .tgz files

Description:

I have downloaded data from the NIMH Data Archive (NDA) and am trying to onboard it into QuNex. The DICOMs are stored in .tgz files within the sessions/<session_id>/inbox folder, but importDICOM does not seem to recognize the .tgz files as files that need to be processed. I have tried two separate commands.

Call:
Container tag: qunex_suite_0.61.17.sif

qunex importDICOM \
  --sessionsfolder="/gpfs/loomis/pi/n3/Studies/Ryan_Test/sessions" \
  --sessions="NDARINVBEG60AUC_baselineY1" \
  --check="any" \
  --masterinbox="none" \
  --archive="leave" \
  --overwrite=yes \
  --options="addImageType:1"

-----and------

qunex importHCP \
  --sessionsfolder="/gpfs/loomis/pi/n3/Studies/Ryan_Test/sessions" \
  --inbox="/gpfs/loomis/pi/n3/Studies/Ryan_Test/sessions/NDARINVBEG60AUC_baselineY1/inbox" \
  --action='link' \
  --overwrite='yes' \
  --archive='leave' \
  --sessions=NDARINVBEG60AUC_baselineY1

Logs:
None: output in terminal

Running importDICOM
---> Checking for folders to process in '/gpfs/loomis/pi/n3/Studies/Ryan_Test/sessions'
---> Found the following folders to process:
subject: NDARINVBEG60AUC, session: baselineY1 ... NDARINVBEG60AUC_baselineY1 <= NDARINVBEG60AUC_baselineY1 <-- NDARINVBEG60AUC_baselineY1
---> Starting to process 1 packets ...
---=== PROCESSING NDARINVBEG60AUC_baselineY1 ===---
Running sortDicom
---> Processing 19 files from /gpfs/loomis/pi/n3/Studies/Ryan_Test/sessions/NDARINVBEG60AUC_baselineY1/inbox
---> Created a dicom superfolder
---> Done
Running dicom2niix

---> Analyzing data
Final report

Failed to process:
... NDARINVBEG60AUC_baselineY1 [/gpfs/loomis/pi/n3/Studies/Ryan_Test/sessions/NDARINVBEG60AUC_baselineY1]
dicom2niix: No source DICOM files
===> ERROR in completing importDICOM:
Some packages failed to process
Please check report!

-----and------

Running importHCP

--> identifying files in /gpfs/loomis/pi/n3/Studies/Ryan_Test/sessions/NDARINVBEG60AUC_baselineY1/inbox
===> ERROR in completing importHCP:
No files found
No files were found to be processed at the specified inbox [/gpfs/loomis/pi/n3/Studies/Ryan_Test/sessions/NDARINVBEG60AUC_baselineY1/inbox]!
Please check your path!

Path:

/gpfs/project/fas/n3/Studies/Ryan_Test/sessions/NDARINVBEG60AUC_baselineY1
.
└── inbox
    ├── NDARINVBEG60AUC_baselineYear1Arm1_ABCD-Diffusion-FM_20180218132700.tgz
    ├── NDARINVBEG60AUC_baselineYear1Arm1_ABCD-DTI_20180218132823.tgz
    ├── NDARINVBEG60AUC_baselineYear1Arm1_ABCD-fMRI-FM_20180218131258.tgz
    ├── NDARINVBEG60AUC_baselineYear1Arm1_ABCD-fMRI-FM_20180218134318.tgz
    ├── NDARINVBEG60AUC_baselineYear1Arm1_ABCD-fMRI-FM_20180218135605.tgz
    ├── NDARINVBEG60AUC_baselineYear1Arm1_ABCD-fMRI-FM_20180218141046.tgz
    ├── NDARINVBEG60AUC_baselineYear1Arm1_ABCD-fMRI-FM_20180218142507.tgz
    ├── NDARINVBEG60AUC_baselineYear1Arm1_ABCD-MID-fMRI_20180218135711.tgz
    ├── NDARINVBEG60AUC_baselineYear1Arm1_ABCD-MID-fMRI_20180218140441.tgz
    ├── NDARINVBEG60AUC_baselineYear1Arm1_ABCD-nBack-fMRI_20180218141144.tgz
    ├── NDARINVBEG60AUC_baselineYear1Arm1_ABCD-nBack-fMRI_20180218141937.tgz
    ├── NDARINVBEG60AUC_baselineYear1Arm1_ABCD-rsfMRI_20180218131359.tgz
    ├── NDARINVBEG60AUC_baselineYear1Arm1_ABCD-rsfMRI_20180218132017.tgz
    ├── NDARINVBEG60AUC_baselineYear1Arm1_ABCD-rsfMRI_20180218134421.tgz
    ├── NDARINVBEG60AUC_baselineYear1Arm1_ABCD-rsfMRI_20180218135014.tgz
    ├── NDARINVBEG60AUC_baselineYear1Arm1_ABCD-SST-fMRI_20180218142604.tgz
    ├── NDARINVBEG60AUC_baselineYear1Arm1_ABCD-SST-fMRI_20180218143250.tgz
    ├── NDARINVBEG60AUC_baselineYear1Arm1_ABCD-T1_20180218130616.tgz
    └── NDARINVBEG60AUC_baselineYear1Arm1_ABCD-T2_20180218133622.tgz

I have been able to get importDICOM to work when I manually untar the .tgz files.

Hi Ryan, we will assign a developer to this issue and will let you know once it has been resolved.

The issue has been resolved and will be available in the next released version. If you have access to our develop version you can use that one.

There may be another part to this issue. When I ran tar -xvf on each of the .tgz files within the <session_id>/inbox folder, the contents of each .tgz file were placed into a new folder within the session's inbox folder: <session_id>/inbox/ses-baselineY1Arm1/*. When I then ran importDICOM on this session, it was still unable to find the DICOMs in this location. However, when I moved the folders directly containing the .dcm files back up into the inbox folder, importDICOM worked. In summary, to get importDICOM to run successfully I first had to untar the .tgz files and then move their contents back up to the inbox folder; importDICOM could not find the .dcm files when they sat further down the file tree.

For example, importDICOM did not work with this file organization (the result of running tar -xvf on all .tgz files):

/gpfs/loomis/pi/n3/Studies/Ryan_Test/sessions/NDARINV591LR2LK_baselineY1/inbox
└── ses-baselineYear1Arm1
    ├── anat
        ├── ABCD-T1_run-20170614132102
            ├── *.dcm
        ├── ABCD-T1_run-20170614132102.json
        ├── ABCD-T2_run-20170614135218
            ├── *.dcm
        ├── ABCD-T2_run-20170614135218.json

When I moved the dicom folders further up the tree, importDICOM worked, shown below:

/gpfs/loomis/pi/n3/Studies/Ryan_Test/sessions/NDARINV591LR2LK_baselineY1/inbox
├── ABCD-T1_run-20170614132102
│   └── *.dcm
├── ABCD-T1_run-20170614132102.json
├── ABCD-T2_run-20170614135218
│   └── *.dcm
└── ABCD-T2_run-20170614135218.json
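
The manual workaround described above (untar each .tgz, then move the folders that directly contain .dcm files up to the inbox) could be scripted roughly as follows. This is only a sketch, not part of QuNex: the function name `extract_and_flatten` is made up for illustration, and it assumes that each DICOM folder has a unique name and that a matching .json sidecar, if present, sits next to the folder.

```python
import shutil
import tarfile
from pathlib import Path


def extract_and_flatten(inbox):
    """Unpack every .tgz in `inbox`, then move each folder that directly
    contains .dcm files (plus its .json sidecar) to the top of `inbox`."""
    inbox = Path(inbox)

    # Unpack each archive in place, like running `tar -xf` inside the inbox.
    for archive in sorted(inbox.glob("*.tgz")):
        with tarfile.open(archive, "r:gz") as tar:
            tar.extractall(inbox)

    # Collect the folders first so that moving them does not disturb the walk.
    dicom_dirs = sorted({p.parent for p in inbox.rglob("*.dcm")})
    moved = []
    for folder in dicom_dirs:
        if folder == inbox or folder.parent == inbox:
            continue  # already at the level where importDICOM found the files
        target = inbox / folder.name
        if target.exists():
            raise FileExistsError(f"{target} already exists, not overwriting")
        sidecar = folder.parent / (folder.name + ".json")
        shutil.move(str(folder), str(target))
        if sidecar.is_file():
            shutil.move(str(sidecar), str(inbox / sidecar.name))
        moved.append(target)
    return moved
```

The now-empty intermediate folders (for example ses-baselineYear1Arm1/anat) are left in place; whether they need to be removed before running importDICOM is not something this thread establishes.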

Hi, did you try version 0.90.0, which is now live on grace and supports .tgz archives? It should work on .tgz files; however, I will have to investigate further whether there are still issues caused by the file structure you described.

I am now using version 0.90.1, and yes, import_dicom now recognizes the .tgz files. However, I am running into another issue.

Right now, the .tgz files that I wish to import are located in the $study_folder/sessions/archive/BIDS folder. They ended up there because I first tried running import_BIDS, which was the wrong command to use: the archives contain DICOMs, not NIfTI files.

qunex_container import_dicom \
  --sessionsfolder="$study_folder/sessions" \
  --masterinbox="$study_folder/sessions/archive/BIDS" \
  --archive=leave \
  --check=no \
  --nameformat="(?P<subject_id>.?)_(?P<session_name>.).*.(?:.tgz…$|$)" \
  --overwrite=no \
  --parelements=10 \
  --container="$qunex_container" \
  --scheduler="SLURM,time=3-24:00:00,ntasks=30,cpus-per-task=1,mem-per-cpu=8000,partition=pi_anticevic"

I have run this command, and it works well on the .tgz files located in that archive folder. However, it returns an error as soon as it tries to import the DICOMs from a second .tgz file for a subject who already has an existing session folder.

To explain: each imaging acquisition for every subject is stored in its own .tgz file. I would like import_dicom to extract the DICOMs from all of the .tgz files associated with a subject and import them into that subject's session folder. To give a concrete example, let's take NDARINVG4Y3RT8P_baselineYear1Arm1.

This subject/session combination has many .tgz files associated with it. import_dicom works successfully on the first .tgz file it finds for this subject, but returns an error as soon as it reaches a second .tgz file:

ERROR
Traceback (most recent call last):
  File "/opt/qunex/python/qx_utilities/general/core.py", line 565, in runWithLog
    result = function(**args)
  File "/opt/qunex/python/qx_utilities/general/dicom.py", line 2587, in import_dicom
    os.makedirs(ifolder)
  File "/opt/env/qunex/lib/python2.7/os.py", line 157, in makedirs
    mkdir(name, mode)
OSError: [Errno 17] File exists: '/gpfs/project/fas/n3/Studies/ABCD/site21/sessions/NDARINVG4Y3RT8P_baselineYear1Arm1/inbox'
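
For readers unfamiliar with this error: the traceback shows `os.makedirs` being called on a session inbox folder that already exists. The snippet below reproduces that behavior in isolation and shows one common way of tolerating an existing folder. It is a sketch only; `ensure_dir` is a hypothetical helper, not QuNex code, and whether silently reusing an existing inbox would be the right behavior for import_dicom is a separate question.

```python
import errno
import os
import tempfile


def ensure_dir(path):
    """Create `path` unless it already exists as a directory."""
    try:
        os.makedirs(path)
    except OSError as err:
        # EEXIST (errno 17) is acceptable only if the path really is a directory.
        if err.errno != errno.EEXIST or not os.path.isdir(path):
            raise


# Stand-in for <sessions folder>/<session id>/inbox, created in a temp folder.
inbox = os.path.join(tempfile.mkdtemp(), "NDARINVG4Y3RT8P_baselineYear1Arm1", "inbox")

os.makedirs(inbox)  # first packet: the inbox folder is created
try:
    os.makedirs(inbox)  # second packet mapped to the same session id
except OSError as err:
    second_call_errno = err.errno  # EEXIST, reported as "[Errno 17] File exists"

ensure_dir(inbox)  # the tolerant variant does not raise
```

The try/except form is used because the traceback points at Python 2.7, where `os.makedirs` has no `exist_ok` argument (it was added in Python 3.2).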

I am thinking a potential workaround could be to write a script that places the .tgz files directly into the individual sessions/<session_id>/inbox folders, and then to run import_dicom without specifying a master inbox. However, if there is an easy solution to this, please let me know!

I have to investigate this a bit further. I will get back to you once I have more information.

Also, it would help me with debugging if you could provide the exact value for the used $study_folder variable.

Cheers, Jure

/gpfs/project/fas/n3/Studies/ABCD/site21

@ryan.aker hi!

The issue you are running into is due to the organization of the input dataset. We have not encountered this kind of dataset organization before, and import_dicom is not (yet) equipped to handle it directly. Let me describe how the dataset is organized and what import_dicom expects.

The data are located in the study inbox folder at /gpfs/loomis/pi/n3/Studies/ABCD/site21/sessions/archive/BIDS. The folder contains .tgz files, one file per acquired sequence. There is no additional organization of the files within the folder.

The issue with processing the ABCD study master inbox using import_dicom
One way of using import_dicom is to have a "master inbox" where all the data for a study is located. In this case import_dicom is written to work with "packets", where each packet contains all the dicom files from a data acquisition session. A packet can be a compressed archive or a folder with dicom files. In processing the data from the master inbox, import_dicom identifies each packet and creates a <study folder>/sessions/<session id> folder for that packet. The <session id> is extracted from the packet name using the regular expression pattern that is specified by the --nameformat parameter.
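
To make the name extraction concrete, here is a small sketch of how a pattern with named groups picks a subject id and a session name out of a packet name. It uses the folder-style pattern that appears in this thread and assumes the pattern is applied with Python's `re.match`; how import_dicom applies it internally is not shown here, and `parse_packet_name` is a hypothetical helper.

```python
import re

# Pattern from this thread for packets named <subject id>_<session name>.
NAMEFORMAT = r"(?P<subject_id>.*?)_(?P<session_name>.*)"


def parse_packet_name(name, pattern=NAMEFORMAT):
    """Return (subject_id, session_name), or None if the name does not match."""
    match = re.match(pattern, name)
    if match is None:
        return None
    return match.group("subject_id"), match.group("session_name")


# A packet named after the session splits at the first underscore.
print(parse_packet_name("NDARINV0APJMRD1_baselineYear1Arm1"))
# ('NDARINV0APJMRD1', 'baselineYear1Arm1')

# Applied to a raw per-sequence file name, everything after the first
# underscore ends up in the session name.
print(parse_packet_name("NDARINV0APJMRD1_baselineYear1Arm1_ABCD-T1_20170625114157.tgz"))
# ('NDARINV0APJMRD1', 'baselineYear1Arm1_ABCD-T1_20170625114157.tgz')
```

The second call illustrates why this particular pattern is meant for packets named after the session rather than for the individual per-sequence .tgz file names.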

In your case this does not work, as there are multiple .tgz files for each session. Each of the files is identified and processed as a separate packet, however, as they belong to the same session id, an error will be reported when the second packet from the same session is processed. To correctly map the files to individual session folders, the files from the same session would need to be moved to a subfolder with the session id. E.g., all the NDARINV0APJMRD1_baselineYear1Arm1.* files should be moved to a NDARINV0APJMRD1_baselineYear1Arm1 subfolder in the dataset master inbox. Once that is done, the import_dicom command could be run on the master inbox folder with --nameformat="(?P<subject_id>.*?)_(?P<session_name>.*)" to extract the subject id and the session name from the folder name.
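
Moving the files into per-session subfolders can be scripted. The sketch below is one possible way to do it, not a QuNex utility: `group_by_session` and the `SESSION_ID` pattern are hypothetical, and the pattern assumes that the session id is exactly the first two underscore-separated fields of each file name, as in the ABCD file names shown in this thread.

```python
import re
import shutil
from pathlib import Path

# Assumption: <subject id>_<session name>_<sequence>_<timestamp>.tgz
SESSION_ID = re.compile(r"^([^_]+_[^_]+)_.+\.tgz$")


def group_by_session(masterinbox):
    """Move each .tgz file into a subfolder named after its session id."""
    masterinbox = Path(masterinbox)
    moved = {}
    for archive in sorted(masterinbox.glob("*.tgz")):
        match = SESSION_ID.match(archive.name)
        if match is None:
            continue  # leave files with unexpected names untouched
        session_id = match.group(1)
        session_dir = masterinbox / session_id
        session_dir.mkdir(exist_ok=True)
        shutil.move(str(archive), str(session_dir / archive.name))
        moved.setdefault(session_id, []).append(archive.name)
    return moved
```

Calling `group_by_session` on the master inbox returns a dictionary that maps each session id to the files moved into its subfolder, which is handy for checking the result before running import_dicom.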

At this point you would encounter a second problem: process_dicom expects the packets (in this case the folders) to contain individual dicom files, whereas in your case the dicoms from each sequence are combined in compressed .tgz files, which import_dicom currently does not handle. Once the .tgz files are copied to the <study>/sessions/<session id>/inbox folder, they are not unpacked, so no dicom files are found in the next step. To resolve this, we will change the import_dicom command to identify any compressed files in the inbox folder before it attempts to identify and process the dicom files. This change will be included in a future QuNex release.

Before we update the code, these are the possible solutions:

Solution 1: manually unzip files in the master inbox.
For this solution, you would move the .tgz files into the relevant subfolders as described above and unpack all the .tgz files so that only unpacked dicom files remain in each subfolder. Then you could run import_dicom:

qunex import_dicom \
  --sessionsfolder="<path to study sessions folder>" \
  --masterinbox="<path to master inbox folder>" \
  --archive=leave \
  --nameformat="(?P<subject_id>.*?)_(?P<session_name>.*)"

Prepared this way the command should complete in one go.

Solution 2: manually unzip files after they are moved to sessions folder
If you do not unpack the .tgz files and run the above call, the call will copy the .tgz files but report that no DICOM files were found. Once the execution stops, you can manually untar all the files in the sessions' inbox folders and run import_dicom again, this time setting --masterinbox=none and specifying --sessions="*baseline*" to find all the baseline sessions in the sessions folder. You will also need to specify --overwrite=yes, as the first run of import_dicom created (empty) dicom and nii subfolders.

Solution 3: run import_dicom twice
In this scenario, you run import_dicom once to sort the .tgz files into correct <session id>/inbox folders and then run it again with the same settings as in solution 2. import_dicom will unpack and process the dicom files. I have prepared an example session:

/gpfs/loomis/pi/n3/Studies/ABCD/site21/sessions/masterinbox
└── NDARINV0APJMRD1_baselineYear1Arm1
    ├── NDARINV0APJMRD1_baselineYear1Arm1_ABCD-Diffusion-FM-AP_20170625115751.tgz
    ├── NDARINV0APJMRD1_baselineYear1Arm1_ABCD-Diffusion-FM-PA_20170625115713.tgz
    ├── NDARINV0APJMRD1_baselineYear1Arm1_ABCD-DTI_20170625115843.tgz
    ├── NDARINV0APJMRD1_baselineYear1Arm1_ABCD-fMRI-FM-AP_20170625114257.tgz
    ├── NDARINV0APJMRD1_baselineYear1Arm1_ABCD-fMRI-FM-AP_20170625121303.tgz
    ├── NDARINV0APJMRD1_baselineYear1Arm1_ABCD-fMRI-FM-AP_20170625122513.tgz
    ├── NDARINV0APJMRD1_baselineYear1Arm1_ABCD-fMRI-FM-AP_20170625123909.tgz
    ├── NDARINV0APJMRD1_baselineYear1Arm1_ABCD-fMRI-FM-AP_20170625125119.tgz
    ├── NDARINV0APJMRD1_baselineYear1Arm1_ABCD-fMRI-FM-AP_20170625130406.tgz
    ├── NDARINV0APJMRD1_baselineYear1Arm1_ABCD-fMRI-FM-PA_20170625114238.tgz
    ├── NDARINV0APJMRD1_baselineYear1Arm1_ABCD-fMRI-FM-PA_20170625121244.tgz
    ├── NDARINV0APJMRD1_baselineYear1Arm1_ABCD-fMRI-FM-PA_20170625122454.tgz
    ├── NDARINV0APJMRD1_baselineYear1Arm1_ABCD-fMRI-FM-PA_20170625123850.tgz
    ├── NDARINV0APJMRD1_baselineYear1Arm1_ABCD-fMRI-FM-PA_20170625125100.tgz
    ├── NDARINV0APJMRD1_baselineYear1Arm1_ABCD-fMRI-FM-PA_20170625130347.tgz
    ├── NDARINV0APJMRD1_baselineYear1Arm1_ABCD-MID-fMRI_20170625125209.tgz
    ├── NDARINV0APJMRD1_baselineYear1Arm1_ABCD-MID-fMRI_20170625125754.tgz
    ├── NDARINV0APJMRD1_baselineYear1Arm1_ABCD-nBack-fMRI_20170625124027.tgz
    ├── NDARINV0APJMRD1_baselineYear1Arm1_ABCD-nBack-fMRI_20170625124542.tgz
    ├── NDARINV0APJMRD1_baselineYear1Arm1_ABCD-rsfMRI_20170625114409.tgz
    ├── NDARINV0APJMRD1_baselineYear1Arm1_ABCD-rsfMRI_20170625114956.tgz
    ├── NDARINV0APJMRD1_baselineYear1Arm1_ABCD-rsfMRI_20170625121348.tgz
    ├── NDARINV0APJMRD1_baselineYear1Arm1_ABCD-rsfMRI_20170625121924.tgz
    ├── NDARINV0APJMRD1_baselineYear1Arm1_ABCD-rsfMRI_20170625130448.tgz
    ├── NDARINV0APJMRD1_baselineYear1Arm1_ABCD-SST-fMRI_20170625122617.tgz
    ├── NDARINV0APJMRD1_baselineYear1Arm1_ABCD-SST-fMRI_20170625123229.tgz
    ├── NDARINV0APJMRD1_baselineYear1Arm1_ABCD-T1_20170625114157.tgz
    ├── NDARINV0APJMRD1_baselineYear1Arm1_ABCD-T1-NORM_20170625114157.tgz
    ├── NDARINV0APJMRD1_baselineYear1Arm1_ABCD-T2_20170625121227.tgz
    └── NDARINV0APJMRD1_baselineYear1Arm1_ABCD-T2-NORM_20170625121228.tgz

and created a study folder at:

/gpfs/loomis/pi/n3/Studies/MBLab/abcdtest

Running these two commands processes the data successfully:

# -- run to move the .tgz files to the correct <session id>/inbox folders

qunex import_dicom \
  --sessionsfolder=/gpfs/loomis/pi/n3/Studies/MBLab/abcdtest/sessions \
  --masterinbox=/gpfs/loomis/pi/n3/Studies/ABCD/site21/sessions/masterinbox \
  --archive=leave \
  --nameformat="(?P<subject_id>.*?)_(?P<session_name>.*)"

# -- run to process the data in the <session id>/inbox folders

qunex import_dicom \
  --sessionsfolder=/gpfs/loomis/pi/n3/Studies/MBLab/abcdtest/sessions \
  --masterinbox=none \
  --sessions="*baseline*" \
  --overwrite=yes

This solution is basically the same as solution 2 but without the manual extraction step. There is one caveat: it required a slight change in the code, which is now in the latest develop branch but is not yet available in master or in the container.

Hi @grega, that is very helpful! Thank you so much for having a look at this!