Taming the Chaos: AutoPipeline for Medical Image Processing#
TL;DR: From Raw Data to Research-Ready in Minutes#
AutoPipeline bridges the gap between raw clinical data and research analysis, making it easier for imaging researchers of all experience levels to work with complex medical imaging data.
By standardizing the processing pipeline and capturing provenance information, it also improves reproducibility - a crucial aspect of scientific research.
The Medical Imaging Data Problem#
Ever spent hours organizing messy DICOM files from different scanners and institutions? Medical imaging datasets are notoriously complex - with multiple scan types, segmentation masks, and treatment plans all interconnected but scattered across folder structures that seem designed to confuse.
As an imaging researcher or student, you've probably faced this frustration:
- Folder chaos: Hundreds of nested directories with cryptic names
- Format soup: DICOM files that need conversion to research-friendly formats
- Relationship puzzles: CT scans that reference segmentation masks that reference treatment plans
- Name confusion: Different institutions using different terms for the same anatomical structures
AutoPipeline: Streamlined Medical Imaging Workflow#
AutoPipeline is the core automation feature of med-imagetools designed to rescue you from these headaches. It provides a streamlined, reproducible workflow that:
- Crawls through your messy DICOM directories to discover all files
- Indexes the metadata and builds relationships between different series
- Processes linked series together, applying transformations as needed
- Organizes the output in a clean, standardized structure
- Tracks everything it does for complete reproducibility
Think of it as your personal research assistant that handles all the tedious data preparation work, letting you focus on the exciting science!
How AutoPipeline Works#
Behind the scenes, AutoPipeline brings together several powerful components:
flowchart TD
subgraph Ingest["Ingest"]
A[Input DICOM Directory]
B[Crawler]
C[Index File]
A --> B --> C
end
subgraph Relationship_Builder["Relationship Builder"]
D[Interlacer]
E[Query Results]
D --> E
end
subgraph Processing["Processing"]
F[Sample Input]
G["MedImage Objects: Scan/VectorMask"]
H["Transforms (optional): Resample, Windowing"]
I[Processed Images]
F --> G --> H --> I
end
subgraph Output["Output"]
J[Sample Output]
K[NIfTI Files]
J --> K
end
Ingest --> Relationship_Builder
Relationship_Builder --> Processing
Processing --> Output
Crawler: The Data Explorer#
First, AutoPipeline dispatches the Crawler to search recursively through your input directories. The Crawler identifies all DICOM files and extracts key metadata like:
- Patient identifiers
- Study and series UIDs
- Modality information (CT, MR, RTSTRUCT, etc.)
- Spatial characteristics
- References between series
This information is compiled into an index that serves as a map of your entire dataset.
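The discovery step can be illustrated with a minimal standard-library sketch. It is not the Crawler's actual implementation: it only assumes that the files follow the DICOM Part 10 file format, where a 128-byte preamble is followed by the four bytes `DICM`. The function names `is_dicom_file` and `find_dicom_files` are hypothetical, and a real crawler would go on to parse the metadata listed above with a DICOM library.

```python
import os


def is_dicom_file(path):
    """Return True if the file has the DICOM Part 10 magic bytes at offset 128."""
    try:
        with open(path, "rb") as handle:
            handle.seek(128)  # skip the 128-byte preamble
            return handle.read(4) == b"DICM"
    except OSError:
        return False


def find_dicom_files(root):
    """Recursively collect paths under `root` that look like DICOM files."""
    found = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if is_dicom_file(path):
                found.append(path)
    return sorted(found)
```

Checking the magic bytes avoids relying on file extensions, which are often missing or inconsistent in clinical exports.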
Interlacer: The Relationship Builder#
Next, the Interlacer takes this index and constructs a hierarchical forest that represents the relationships between different series. It understands the DICOM standard and knows, for example, that:
- An RTSTRUCT (radiation therapy structure) references a specific CT series
- A PET scan might be registered to a corresponding CT
- A treatment plan (RTPLAN) might hold the information about the referenced RTSTRUCT used in an RTDOSE file
The Interlacer can visualize these relationships and query them based on modality combinations you're interested in.
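The idea of a forest built from series references can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the Interlacer's API: the row fields (`series`, `modality`, `ref`), the function names, and the rule that a query matches a root-to-descendant path in the given order are all hypothetical. `PT` is the DICOM modality code for PET.

```python
from collections import defaultdict

# Hypothetical index rows: each series names its modality and, optionally,
# the series it references (None for a root such as a CT).
INDEX = [
    {"series": "ct1", "modality": "CT", "ref": None},
    {"series": "rs1", "modality": "RTSTRUCT", "ref": "ct1"},
    {"series": "pt1", "modality": "PT", "ref": "ct1"},
    {"series": "ct2", "modality": "CT", "ref": None},
]


def build_forest(index):
    """Return (roots, children), where children maps a series to its referrers."""
    known = {row["series"] for row in index}
    children = defaultdict(list)
    roots = []
    for row in index:
        if row["ref"] in known:
            children[row["ref"]].append(row["series"])
        else:
            roots.append(row["series"])  # no known parent: start of a tree
    return roots, dict(children)


def query(index, modalities):
    """Find root-to-descendant paths whose modalities match the query in order."""
    by_id = {row["series"]: row for row in index}
    roots, children = build_forest(index)
    wanted = modalities.split(",")
    matches = []

    def walk(series, path):
        path = path + [series]
        if by_id[series]["modality"] != wanted[len(path) - 1]:
            return
        if len(path) == len(wanted):
            matches.append(path)
            return
        for child in children.get(series, []):
            walk(child, path)

    for root in roots:
        walk(root, [])
    return matches
```

With this toy index, `query(INDEX, "CT,RTSTRUCT")` returns the single linked pair `ct1` and `rs1`, while the unreferenced `ct2` only appears in a plain `"CT"` query.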
Sample Processing: The Conversion Engine#
When you specify which modalities you want (like "CT,RTSTRUCT"), AutoPipeline:
- Queries the Interlacer for samples matching your criteria
- Loads the DICOM data into memory as MedImage objects
- Applies transformations like resampling or intensity windowing
- Saves the results in standard research formats (NIfTI)
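Intensity windowing, one of the transforms named above, is simple enough to sketch in plain Python. This illustrates the general technique rather than the transform class med-imagetools ships: values are clipped to the range `[level - width/2, level + width/2]`. The function name and the example window (level 40, width 400) are illustrative choices.

```python
def apply_window(values, level, width):
    """Clip intensities to the window centred on `level` with total `width`."""
    lower = level - width / 2
    upper = level + width / 2
    return [min(max(value, lower), upper) for value in values]


# Hounsfield-unit samples: air, water, soft tissue, dense bone.
samples = [-1000, 0, 40, 500]
windowed = apply_window(samples, level=40, width=400)
```

Values inside the window pass through unchanged; values outside it are pinned to the nearest window edge.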
For segmentation data (RTSTRUCT or SEG), AutoPipeline can match region names (like "parotid" or "GTV") to standardized keys using regular expressions, solving the problem of inconsistent naming across institutions.
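Regex-based name standardization can be sketched with Python's `re` module. This is a simplified illustration, not the ROIMatcher implementation: the helper `match_roi`, the whole-name matching rule, and the case-insensitive default are assumptions made for the example, and the pattern table is illustrative.

```python
import re

# Standardized keys mapped to lists of regular expressions (illustrative values).
ROI_PATTERNS = {
    "GTV": ["GTV", "Gross.*Volume"],
    "Parotid_L": ["LeftParotid", "PAROTID_L", "L_Parotid"],
    "Mandible": ["mandible.*"],
}


def match_roi(name, patterns=ROI_PATTERNS, ignore_case=True):
    """Return the first standardized key whose pattern matches the whole name."""
    flags = re.IGNORECASE if ignore_case else 0
    for key, regexes in patterns.items():
        for regex in regexes:
            if re.fullmatch(regex, name, flags):
                return key
    return None  # no pattern matched this ROI name


raw_names = ["gtv", "Gross Tumor Volume", "l_parotid", "Mandible_bone", "Heart"]
standardized = {name: match_roi(name) for name in raw_names}
```

Names that match no pattern map to `None`, which makes it easy to report ROIs that would otherwise be dropped silently.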
Using AutoPipeline: A Quick Start#
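Using AutoPipeline is simple! The command-line interface takes an input directory, an output directory, and the modalities to process:
imgtools autopipeline /path/to/dicoms/ /path/to/output/ \
--modalities CT,RTSTRUCT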
Standardizing Region Names with ROI Matching#
A common challenge in medical imaging is inconsistent naming of regions of interest (ROIs). AutoPipeline solves this with powerful pattern matching.
Med-ImageTools implements the ROIMatcher class, which allows you to define pattern matching rules for your ROIs. AutoPipeline will match the ROI names from your RTSTRUCT or SEG files against these patterns and standardize them in the output files.
These rules can be given on the command line, in a YAML file, or both.
Create a YAML file like this:
# roi_patterns.yaml
GTV: ["GTV", "gtv", "Gross.*Volume"]
Parotid_L: ["LeftParotid", "PAROTID_L", "L_Parotid"]
Parotid_R: ["RightParotid", "PAROTID_R", "R_Parotid"]
Cord: ["SpinalCord", "Cord", "Spinal_Cord"]
Mandible: ["mandible.*"]
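Then run AutoPipeline with the --roi-match-yaml option:
imgtools autopipeline /path/to/dicoms/ /path/to/output/ \
--modalities CT,RTSTRUCT \
--roi-match-yaml roi_patterns.yaml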
Advanced Features#
Parallel Processing#
Processing large datasets? AutoPipeline can parallelize operations:
imgtools autopipeline /path/to/dicoms/ /path/to/output/ \
--modalities CT,RTSTRUCT \
--jobs 8 # Use 8 parallel processes
Custom Output Formatting#
Control your output file structure with formatting patterns:
imgtools autopipeline /path/to/dicoms/ /path/to/output/ \
--modalities CT,RTSTRUCT \
--filename-format "{PatientID}/{Modality}/{ImageID}.nii.gz"
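To see how such a pattern expands, the sketch below assumes the placeholders behave like Python format fields. That is an assumption for illustration; the exact pattern syntax and the set of available keys are not specified here, and the metadata values are made up.

```python
from pathlib import PurePosixPath

PATTERN = "{PatientID}/{Modality}/{ImageID}.nii.gz"

# Hypothetical metadata for a single image; real values would come from the index.
metadata = {"PatientID": "HN-001", "Modality": "CT", "ImageID": "CT_0001"}

# Fill each placeholder with the matching metadata value.
relative_path = PurePosixPath(PATTERN.format(**metadata))
```

Under this assumption each image lands in a per-patient, per-modality folder below the output directory.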
Handling Existing Files#
Choose how to handle existing files in your output directory:
imgtools autopipeline /path/to/dicoms/ /path/to/output/ \
--modalities CT,RTSTRUCT \
--existing-file-mode skip # Options: skip, overwrite, fail
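The three modes can be read as a small decision rule. The sketch below shows one plausible interpretation, assuming `skip` leaves the existing file alone, `overwrite` replaces it, and `fail` raises an error; the helper `resolve_existing` is hypothetical and not part of med-imagetools.

```python
from pathlib import Path


def resolve_existing(path, mode):
    """Decide whether to write `path` under a skip/overwrite/fail policy.

    Returns True if the caller should write the file, False if it should skip.
    """
    if mode not in {"skip", "overwrite", "fail"}:
        raise ValueError(f"unknown mode: {mode!r}")
    if not Path(path).exists():
        return True  # nothing to collide with
    if mode == "skip":
        return False
    if mode == "overwrite":
        return True
    raise FileExistsError(f"{path} already exists")
```

Under this reading, `skip` suits resuming an interrupted run, while `fail` guards against accidentally mixing outputs from two different runs.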
Additional Resources#
For more details on the components that AutoPipeline uses:
- Crawler Documentation
- Interlacer Documentation
- ROI Matching and Masks (TODO)