Usage Guide

Project Configuration

The config directory should have three subdirectories: datasets/, extraction/, and signatures/.

datasets

Each dataset needs a configuration file with the following settings filled in:

DATA_SOURCE: ""    # where the data came from, will be used for data organization
DATASET_NAME: ""   # the name of the dataset, will be used for data organization

### CLINICAL VARIABLE INFORMATION ###
CLINICAL:
    FILE: ""                     # Name of the clinical data file associated with the data. Not a full path, just the name including the file suffix.
    OUTCOME_VARIABLES:
        time_label: ""           # Column name for survival time in the `FILE`, should be a numeric type
        event_label: ""          # Column name for survival event in the `FILE`, can be numeric, string, or bool
        convert_to_years: False  # Boolean, whether the `time_label` needs to be converted from days to years
        event_value_mapping: {}  # Customize the `event_label` bool or string mapping to numeric type. Should be in the order {0: Alive_value, 1: Dead_value}
    EXCLUSION_VARIABLES: {}      # Column values of rows to drop in the clinical data (Ex. `{column_name: [val1, val2]}` )

### MED-IMAGETOOLS settings
MIT:
    MODALITIES:                 # Modalities to process with autopipeline
        image: CT
        mask: RTSTRUCT     
    ROI_STRATEGY: MERGE         # How to handle multiple ROI matches 
    ROI_MATCH_MAP:              # Matching map for ROIs in dataset (use if you only want to process some of the masks in a segmentation)
        KEY:ROI_NAME            # NOTE: there can be no spaces in KEY:ROI_NAME

### READII settings
READII:
    IMAGE_TYPES:                # Selection of image types to generate and perform feature extraction on (negative control settings)
        regions:                # Areas of image to apply permutation to
            - "full"
        permutations:           # Permutation type to apply to region
            - "original"
        crop:                   # How to crop the image prior to feature extraction
    TRAIN_TEST_SPLIT:           # If using data for modelling, set up method of data splitting here
        split: False            # Whether to split the data
        split_variable: {}      # What variable from `CLINICAL.FILE` to use to split the data and values to group by (Ex. {'split_var': ['training', 'test']})
        impute: null            # Value to impute for samples missing a `split_variable` value. Should be one of the values provided in `split_variable`. If none is provided, nothing is imputed and samples with no split value are dropped.

RANDOM_SEED: 10                 # Seed for reproducibility of analysis.
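
To make the CLINICAL settings concrete, the sketch below shows one way the OUTCOME_VARIABLES and EXCLUSION_VARIABLES options could be applied to rows of clinical data. It is an illustration only, not the pipeline's implementation: the function name `apply_clinical_settings`, the 365-day year, and the example column names and values are all assumptions.

```python
DAYS_PER_YEAR = 365  # assumption: the pipeline may use a different divisor (e.g. 365.25)


def apply_clinical_settings(rows, outcome, exclusion):
    """Hypothetical helper: filter rows and standardize the outcome columns.

    rows      -- list of dicts, one per patient (e.g. from csv.DictReader)
    outcome   -- dict shaped like CLINICAL.OUTCOME_VARIABLES
    exclusion -- dict shaped like CLINICAL.EXCLUSION_VARIABLES
    """
    # The config maps {0: alive_value, 1: dead_value}; invert it so raw
    # values in the event column can be replaced by 0 or 1.
    raw_to_code = {raw: code for code, raw in outcome["event_value_mapping"].items()}

    cleaned = []
    for row in rows:
        # Drop the row if any excluded column holds one of its excluded values.
        if any(row.get(col) in values for col, values in exclusion.items()):
            continue
        new_row = dict(row)
        time_value = float(row[outcome["time_label"]])
        if outcome["convert_to_years"]:
            time_value /= DAYS_PER_YEAR
        new_row[outcome["time_label"]] = time_value
        event_value = row[outcome["event_label"]]
        # Values missing from the mapping are passed through unchanged.
        new_row[outcome["event_label"]] = raw_to_code.get(event_value, event_value)
        cleaned.append(new_row)
    return cleaned


# Example settings and data (column names and values are made up).
outcome_settings = {
    "time_label": "survival_days",
    "event_label": "status",
    "convert_to_years": True,
    "event_value_mapping": {0: "Alive", 1: "Dead"},
}
exclusion_settings = {"histology": ["unknown"]}
clinical_rows = [
    {"patient": "A", "survival_days": "730", "status": "Dead", "histology": "adeno"},
    {"patient": "B", "survival_days": "365", "status": "Alive", "histology": "unknown"},
]
cleaned_rows = apply_clinical_settings(clinical_rows, outcome_settings, exclusion_settings)
```

In this example patient B is dropped by the exclusion rule, and patient A's survival time is converted from days to years with the event recoded to a number.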

extraction

This directory should store any configuration settings used for feature extraction. They should be named/organized by the feature extraction method.

Example: pyradiomics_original_all_features.yaml

Different configuration set-ups can be documented here.

PyRadiomics

PyRadiomics feature extraction settings YAML files should be kept here. See the PyRadiomics 'Parameter File' documentation for details about the file format.

signatures

Files in this directory should list selected features in a radiomic signature and the corresponding weights from a fitted prediction model.

PyRadiomics CoxPH signature

signature:
    'original_firstorder_Energy': 1.74e-11
    'original_shape_Compactness1': -1.65e+01
    'original_glrlm_GrayLevelNonUniformity': 4.95e-05
    'wavelet-HLH_glrlm_GrayLevelNonUniformity': 2.81e-06
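
In a Cox proportional hazards model, a patient's risk score is the weighted sum of feature values, with the fitted coefficients as weights. The sketch below applies the weights listed above to one set of feature values. The function name `linear_risk_score` and the feature values are invented for illustration, and the pipeline's own scoring code may differ.

```python
# Weights copied from the signature example above.
SIGNATURE = {
    "original_firstorder_Energy": 1.74e-11,
    "original_shape_Compactness1": -1.65e+01,
    "original_glrlm_GrayLevelNonUniformity": 4.95e-05,
    "wavelet-HLH_glrlm_GrayLevelNonUniformity": 2.81e-06,
}


def linear_risk_score(features, signature):
    """Hypothetical helper: weighted sum of the signature's features."""
    missing = [name for name in signature if name not in features]
    if missing:
        raise KeyError(f"missing signature features: {missing}")
    # Features that are not part of the signature are ignored.
    return sum(weight * features[name] for name, weight in signature.items())


# Made-up feature values for a single patient.
patient_features = {
    "original_firstorder_Energy": 1.0e11,
    "original_shape_Compactness1": 0.02,
    "original_glrlm_GrayLevelNonUniformity": 1000.0,
    "wavelet-HLH_glrlm_GrayLevelNonUniformity": 10000.0,
    "original_shape_Sphericity": 0.5,  # not in the signature, so unused
}
score = linear_risk_score(patient_features, SIGNATURE)
```

Raising an error on missing features, rather than silently treating them as zero, is a design choice of this sketch so that a mismatch between the signature file and the extracted features is noticed early.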

Data Setup

All data should be stored in a Data directory separate from this project directory. Within the project repo, there's a data directory containing rawdata, procdata, and results directories. The rawdata and procdata directories should be symbolic links pointing to the corresponding data directory in your separate Data directory.

Aliasing `rawdata` and `procdata`

To set up the symbolic links for the rawdata and procdata directories, run the following commands, starting from your project directory:

ln -s /path/to/separate/data/dir/rawdata/{DiseaseRegion}/{DATASET_SOURCE}_{DATASET_NAME} data/rawdata/{DATASET_SOURCE}_{DATASET_NAME}

ln -s /path/to/separate/data/dir/procdata/{DiseaseRegion}/{DATASET_SOURCE}_{DATASET_NAME} data/procdata/{DATASET_SOURCE}_{DATASET_NAME}

Note

You will need to perform this step for each dataset you wish to process with the READII-2-ROQC pipeline.
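
When several datasets need linking, the same links can be created from Python. This is a minimal sketch using only the standard library; the function name `link_dataset` and its arguments are hypothetical, and the path layout mirrors the placeholders in the `ln -s` commands above.

```python
from pathlib import Path


def link_dataset(data_root, project_root, disease_region, dataset_id):
    """Hypothetical helper: create the rawdata and procdata symlinks.

    data_root      -- the separate Data directory
    project_root   -- the project directory containing data/
    disease_region -- the {DiseaseRegion} placeholder
    dataset_id     -- the {DATASET_SOURCE}_{DATASET_NAME} placeholder
    """
    links = []
    for kind in ("rawdata", "procdata"):
        target = Path(data_root) / kind / disease_region / dataset_id
        link = Path(project_root) / "data" / kind / dataset_id
        link.parent.mkdir(parents=True, exist_ok=True)
        # Leave anything that already exists at the link location untouched.
        if not link.is_symlink() and not link.exists():
            link.symlink_to(target, target_is_directory=True)
        links.append(link)
    return links
```

Calling `link_dataset` once per dataset, with the same placeholders you would substitute into the shell commands, has the same effect as running the two `ln -s` commands by hand.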

Documenting datasets

When a new dataset has been added to the rawdata directory, you MUST document it on the Data Sources page.

Copy the following template and fill it in accordingly for each dataset. If anything about the dataset changes, make sure to keep this page up to date.

Data Source Template

NSCLC-Radiomics

- **Name**: NSCLC-Radiomics (or Lung1)
- **Version/Date**: Version 4: Updated 2020/10/22
- **URL**: <https://www.cancerimagingarchive.net/collection/nsclc-radiomics/>
- **Access Method**: NBIA Data Retriever
- **Access Date**: 2025-04-23
- **Data Format**: DICOM
- **Citation**: Aerts, H. J. W. L., Wee, L., Rios Velazquez, E., Leijenaar, R. T. H., Parmar, C., Grossmann, P., Carvalho, S., Bussink, J., Monshouwer, R., Haibe-Kains, B., Rietveld, D., Hoebers, F., Rietbergen, M. M., Leemans, C. R., Dekker, A., Quackenbush, J., Gillies, R. J., Lambin, P. (2014). Data From NSCLC-Radiomics (version 4) [Data set]. The Cancer Imaging Archive. https://doi.org/10.7937/K9/TCIA.2015.PF0M9REI 
- **License**: [CC BY-NC 3.0](https://creativecommons.org/licenses/by-nc/3.0/)
- **Data Types**: 
    - Images: CT, RTSTRUCT
    - Clinical: CSV
- **Sample Size**: 422 subjects
- **ROI Name**: Tumour = GTV-1
- **Notes**: LUNG-128 does not have a GTV segmentation, so only 421 patients are processed.

Project `data` Directory Tree

data
|-- procdata
|   `-- {DATASET_SOURCE}_{DATASET_NAME} --> /path/to/separate/data/dir/procdata/{DiseaseRegion}/{DATASET_SOURCE}_{DATASET_NAME}
|       |-- correlations
|       |   `-- {extraction_method}
|       |       `-- {extraction_configuration_file_name}
|       |           |-- {image_type}_{correlation_method}_matrix.csv
|       |           `-- {image_type}_v_{image_type}_{correlation_method}_matrix.csv
|       |-- features
|       |   `-- {extraction_method}
|       |       |-- extraction_method_index.csv
|       |       `-- {extraction_configuration_file_name}
|       |           `-- {PatientID}_{SampleNumber}
|       |               `-- {ROI_name}
|       |                   |-- original_full_features.csv
|       |                   |-- {permutation}_{region}_features.csv
|       |                   `-- {permutation}_{region}_features.csv
|       |-- images
|       |   |-- mit_{DATASET_NAME}
|       |   |   `-- {PatientID}_{SampleNumber}
|       |   |       |-- {ImageModality}_{SeriesInstanceUID}
|       |   |       |   `-- {ImageModality}.nii.gz
|       |   |       `-- {SegmentationModality}_{SeriesInstanceUID}
|       |   |           `-- {ROI_name}.nii.gz
|       |   `-- readii_{DATASET_NAME}
|       |       `-- {PatientID}_{SampleNumber}
|       |           `-- {ImageModality}_{SeriesInstanceUID}
|       |               |-- {permutation}_{region}.nii.gz
|       |               `-- {permutation}_{region}.nii.gz
|       `-- signatures
|           `-- {signature_name}
|               |-- full_original_signature_features.csv
|               `-- {permutation}_{region}_signature_features.csv
|-- rawdata
|   `-- {DATASET_SOURCE}_{DATASET_NAME} --> /path/to/separate/data/dir/rawdata/{DiseaseRegion}/{DATASET_SOURCE}_{DATASET_NAME}
|       |-- clinical
|       |   `-- {Clinical Data File}.csv OR {Clinical Data File}.xlsx
|       `-- images
|           `-- {DATASET_NAME}
|               |-- {Sample1 DICOM directory}
|               |-- {Sample2 DICOM directory}
|               |-- ...
|               `-- {SampleN DICOM directory}
`-- results
    `-- {DATASET_SOURCE}_{DATASET_NAME}
        |-- correlation
        |   `-- {extraction_method}
        |       |-- {extraction_configuration_file_name}
        |       `-- {signature_name}
        |-- features
        |   `-- {extraction_method}
        |       `-- {extraction_configuration_file_name}
        `-- prediction
            `-- {signature_name}
                |-- prediction_metrics.csv
                `-- hazards_{bootstrap_count}
                    |-- original_full_features.csv
                    |-- {permutation}_{region}_features.csv
                    `-- {permutation}_{region}_features.csv
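
The layout above is regular enough that output locations can be derived from a handful of identifiers. As a sketch, the hypothetical helper below assembles the path of one feature file under data/procdata. The function name, argument names, and example identifiers are assumptions, not values taken from the pipeline.

```python
from pathlib import PurePosixPath


def features_csv_path(dataset_id, extraction_method, config_name,
                      sample_id, roi_name,
                      permutation="original", region="full"):
    """Hypothetical helper: path of a feature file, following the tree above.

    dataset_id  -- the {DATASET_SOURCE}_{DATASET_NAME} placeholder
    sample_id   -- the {PatientID}_{SampleNumber} placeholder
    config_name -- the {extraction_configuration_file_name} placeholder
    """
    return (
        PurePosixPath("data") / "procdata" / dataset_id
        / "features" / extraction_method / config_name
        / sample_id / roi_name
        / f"{permutation}_{region}_features.csv"
    )


# Example identifiers are made up for illustration.
example_path = features_csv_path(
    dataset_id="SOURCE_NSCLC-Radiomics",
    extraction_method="pyradiomics",
    config_name="pyradiomics_original_all_features",
    sample_id="PATIENT-001_0000",
    roi_name="GTV",
)
```

`PurePosixPath` is used so the result looks the same on every operating system; swap in `Path` when the path needs to be opened on disk.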

Best Practices

- Store raw data in data/rawdata/ and never modify it.
- Store processed data in data/procdata/, and keep all code used to generate it in workflow/scripts/.
- Track data provenance (where the data came from and how it was modified).
- Respect data usage agreements and licenses! This is especially important for data that should not be shared publicly.

Running Your Analysis

The pipeline is currently run via pixi tasks. The following example shows how to run the pipeline using the NSCLC-Radiomics data.

DICOM Image and Mask file processing with Med-ImageTools

This step converts the DICOM image files to NIfTI files, creates a unique ID for Image and Mask pairs, and generates an index file containing relevant metadata.

Step 1: Run Med-ImageTools

This step converts the DICOM files to NIfTIs, assigns unique SampleIDs to image and mask pairs, and generates an index table listing each file with its associated metadata (e.g. DICOM tags).

pixi run mit NSCLC-Radiomics 'CT,RTSTRUCT' SEPARATE 'GTV:GTV-1,gtv-pre-op'
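
The last argument of the command is a ROI match map in KEY:ROI_NAME form. The sketch below shows one plausible reading of such a string, in which a key maps to a comma-separated list of ROI names. It is not Med-ImageTools code, and the function name `parse_roi_match_map` is hypothetical.

```python
def parse_roi_match_map(text):
    """Hypothetical parser for a 'KEY:name1,name2' ROI match string."""
    if " " in text:
        # The dataset configuration notes that KEY:ROI_NAME must not contain spaces.
        raise ValueError("ROI match map must not contain spaces")
    key, separator, names = text.partition(":")
    if not separator or not key or not names:
        raise ValueError(f"expected 'KEY:ROI_NAME', got {text!r}")
    return {key: names.split(",")}


roi_map = parse_roi_match_map("GTV:GTV-1,gtv-pre-op")
```

Under this reading, any segmentation named `GTV-1` or `gtv-pre-op` would be treated as the `GTV` region of interest.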

Step 2: Generate negative control images with READII

This step creates and saves READII negative controls specified in the config file for the provided dataset.

pixi run readii_negative NSCLC-Radiomics
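
The image types produced by this step come from the `regions` and `permutations` lists in the READII settings, and the output files in the data directory tree are named with a `{permutation}_{region}` pattern. A sketch of that naming, assuming every permutation is combined with every region (the function name `image_type_names` is hypothetical, and READII itself may restrict which combinations are generated):

```python
from itertools import product


def image_type_names(regions, permutations):
    """Hypothetical helper: one '{permutation}_{region}' name per combination."""
    return [
        f"{permutation}_{region}"
        for permutation, region in product(permutations, regions)
    ]


# Values from the dataset configuration shown in Project Configuration.
names = image_type_names(regions=["full"], permutations=["original"])
print(names)  # ['original_full']
```

With the default configuration this yields only the unmodified image, which matches the `original_full_features.csv` file in the directory tree.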

Step 3: Run feature extraction

This step first generates an index file for the specified feature extraction method, where each row contains the information for one image and mask pair, and then runs feature extraction on each pair using the given configuration file.

pixi run extract NSCLC-Radiomics pyradiomics pyradiomics_original_all_features.yaml
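
As an illustration of what such an index might look like, the sketch below writes one row per image and mask pair using the standard `csv` module. The column names (`SampleID`, `Image`, `Mask`), the function name, and the example paths are hypothetical; the pipeline's actual index columns may differ.

```python
import csv
import io


def build_extraction_index(pairs):
    """Hypothetical helper: CSV text with one row per image and mask pair.

    pairs -- iterable of (sample_id, image_path, mask_path) tuples
    """
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=["SampleID", "Image", "Mask"], lineterminator="\n"
    )
    writer.writeheader()
    for sample_id, image_path, mask_path in pairs:
        writer.writerow({"SampleID": sample_id, "Image": image_path, "Mask": mask_path})
    return buffer.getvalue()


# Made-up sample with paths shaped like the images/ branch of the data tree.
index_text = build_extraction_index([
    ("PATIENT-001_0000",
     "PATIENT-001_0000/CT_1234/CT.nii.gz",
     "PATIENT-001_0000/RTSTRUCT_5678/GTV.nii.gz"),
])
```

Writing the index before extraction means the list of pairs can be inspected, or filtered, without rerunning the earlier steps.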

Data splitting

aerts_original

Train: NSCLC-Radiomics
Validation: HN1, RADCURE-test

aerts_RADCURE_refit

Train: RADCURE-train
Validation: HN1, RADCURE-test

r2r_NSCLC