# Methodology

## Overview
Med-ImageNet addresses the gap between raw, heterogeneous public oncology imaging and the harmonized, well-documented datasets that AI practitioners need. This page describes the principles and processes that guide data collection, curation, and publication.
## Data Collection and Curation

### Source Selection
Collections are drawn from publicly accessible, institutional-quality archives — primarily TCIA via the Imaging Data Commons (IDC), with additional sources on S3, Zenodo, HuggingFace, and Dropbox. Each collection's backend and file type are recorded in a validated `source.json` manifest using Pydantic-enforced schemas.
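The sketch below shows what a Pydantic-enforced manifest entry might look like. The field names, enum values, and model name are assumptions for illustration only, not the actual Med-ImageNet schema.

```python
from enum import Enum
from pydantic import BaseModel


class Backend(str, Enum):
    """Hypothetical storage backends, mirroring the sources named above."""
    IDC = "idc"
    S3 = "s3"
    ZENODO = "zenodo"
    HUGGINGFACE = "huggingface"
    DROPBOX = "dropbox"


class FileType(str, Enum):
    DICOM = "dicom"
    NIFTI = "nifti"


class SourceManifest(BaseModel):
    """Illustrative sketch of one source.json record; not the real schema."""
    collection: str
    backend: Backend
    file_type: FileType
    url: str


# Validation coerces the raw strings into the enums, or raises on bad input.
manifest = SourceManifest(
    collection="example-ct",
    backend="idc",
    file_type="dicom",
    url="https://example.org/example-ct",
)
```

Because invalid backends or file types raise a validation error at load time, malformed manifests are rejected before any download begins.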
### Indexing
Each collection is indexed via a per-collection `index.csv` that records available series along with modality, body part, and study-level metadata. For DICOM-native collections, series are identified by `SeriesInstanceUID` and a companion `crawl_db.json` stores granular DICOM tag-level detail, enabling fine-grained filtering via tag-based query rules. For NIfTI-native collections, the index records file paths and modality labels directly, supporting the same query interface without requiring DICOM headers. Both index types are published as a versioned Hugging Face dataset and updated via commit-tracked snapshots.
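As a rough illustration of how a tag-based query rule selects series from such an index, the snippet below filters a toy in-memory table with pandas. The column names mirror the fields mentioned above, but the table contents and exact column spellings are assumptions.

```python
import pandas as pd

# Toy stand-in for a per-collection index.csv; values are made up.
index = pd.DataFrame(
    [
        {"SeriesInstanceUID": "1.2.3.1", "Modality": "CT", "BodyPart": "CHEST"},
        {"SeriesInstanceUID": "1.2.3.2", "Modality": "MR", "BodyPart": "BRAIN"},
        {"SeriesInstanceUID": "1.2.3.3", "Modality": "CT", "BodyPart": "ABDOMEN"},
    ]
)

# A tag-based query rule: keep only chest CT series.
selected = index[(index["Modality"] == "CT") & (index["BodyPart"] == "CHEST")]
uids = selected["SeriesInstanceUID"].tolist()  # ["1.2.3.1"]
```

The same boolean-filter pattern applies whether the index rows describe DICOM series or NIfTI file paths, which is what lets one query interface cover both index types.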
### Preprocessing
For DICOM-native collections, the optional processing flag (`imgnet download -p`) passes raw data through the med-imagetools Autopipeline, which performs DICOM ingestion, voxel harmonization, intensity normalization, and conversion to NIfTI (`.nii.gz`) with an accompanying CSV index under `procdata/`. NIfTI-native collections are already in an analysis-ready format and are downloaded directly, bypassing the conversion pipeline.
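One of the Autopipeline steps named above is intensity normalization. The following NumPy sketch shows a common variant, z-score normalization of a volume; it is a minimal illustration of the concept, not the med-imagetools implementation.

```python
import numpy as np


def zscore_normalize(volume: np.ndarray) -> np.ndarray:
    """Rescale voxel intensities to zero mean and unit variance."""
    mean = volume.mean()
    std = volume.std()
    if std == 0:  # constant volume: only center it
        return volume - mean
    return (volume - mean) / std


# Toy volume standing in for a CT series converted to NIfTI.
rng = np.random.default_rng(0)
vol = rng.normal(loc=40.0, scale=15.0, size=(8, 8, 8))
norm = zscore_normalize(vol)
```

Normalizing intensities (together with resampling voxels to a common spacing) is what makes volumes from heterogeneous scanners directly comparable for model training.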
## Standards and Formats
Med-ImageNet supports two internationally recognized medical imaging formats as first-class data types:
- DICOM — the standard acquisition format for clinical imaging, used for raw data from TCIA/IDC and other clinical archives. DICOM's rich header metadata enables tag-based query rules for fine-grained series selection.
- NIfTI (`.nii.gz`) — a volumetric format widely adopted in neuroimaging and AI research, used both as the output of Med-ImageNet's preprocessing pipeline and as the native format for collections that are published directly in NIfTI (e.g., from HuggingFace or Zenodo sources).
Tabular metadata uses CSV, structured configuration uses JSON, and query interoperability between CLI commands is achieved through a compact, reproducible token format (msgpack + zlib + base64).
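The query token described above amounts to a serialize, compress, and encode round trip. The sketch below illustrates the idea using only the standard library, substituting `json` for msgpack to stay dependency-free; the function names are hypothetical, not the actual CLI internals.

```python
import base64
import json
import zlib


def encode_token(query: dict) -> str:
    """Serialize -> compress -> base64 (Med-ImageNet uses msgpack where this uses json)."""
    # sort_keys makes equal queries produce identical tokens, i.e. reproducible.
    raw = json.dumps(query, sort_keys=True).encode("utf-8")
    return base64.urlsafe_b64encode(zlib.compress(raw)).decode("ascii")


def decode_token(token: str) -> dict:
    """Invert encode_token: base64 -> decompress -> deserialize."""
    return json.loads(zlib.decompress(base64.urlsafe_b64decode(token)))


query = {"Modality": "CT", "BodyPart": "CHEST"}
token = encode_token(query)
roundtrip = decode_token(token)
```

A compact, URL-safe string like this can be copied between CLI commands, so a selection made by one command can be replayed exactly by another.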
## FAIR Principles
The platform's architecture embodies the FAIR principles:
| Principle | Implementation |
|---|---|
| Findable | All collections are indexed by SeriesInstanceUID, modality, and body part; the index is published on Hugging Face with persistent identifiers. |
| Accessible | Open-source CLI and Python API; data retrieved from public archives via standard protocols (HTTPS, S3). |
| Interoperable | Both DICOM and NIfTI are domain-standard formats with broad tooling support; the unified query interface abstracts over format differences so users interact with a single API regardless of source file type. Validated Pydantic schemas enforce consistent structure across heterogeneous sources. |
| Reusable | GPL-3.0 licensed software; versioned index snapshots; configurable query rules enable adaptation to diverse research questions. |