Crawler#
The Crawler
in Med-ImageTools has the core responsibility of walking through
a directory and collecting all the dicom files (.dcm
/.DCM
), regardless
of the directory structure.
It was designed to extract as much information as possible from the files, to build an internal database of the data users have.
Metadata extraction#
As of the 2.0 release, the crawler implements a modality-specific set of extractions which facilitates:
- raw extraction of a DICOM tag (i.e
SeriesInstanceUID
) - and possible 'ComputedValue' which requires computation on one or more tags
(i.e
ROINames
inRTSTRUCT
files are computed from theROIContourSequence
tag).
This is done using the ModalityMedataExtractor base class, which defines the base_tags
that are common to all modalities, and the computed_tags
which sub-classes
can implement to extract modality-specific information on top of their own modality_tags
.
See the following api documentation for the supported modalities:
- CTMetadataExtractor
- MRMetadataExtractor
- PTMetadataExtractor
- SEGMetadataExtractor
- RTSTRUCTMetadataExtractor
- RTDOSEMetadataExtractor
- RTPLANMetadataExtractor
- SRMetadataExtractor
For each of the above modalities, there is a modality_tags
property that
returns a set of tags that can be directly extracted from the DICOM file.
There also may be a computed_tags
property that returns a mapping of
keys defined by Med-ImageTools (i.e SegSpacing
in the SEG
extractor)
to a callable function that will compute the value from the DICOM file.
how can I find what tags are extracted?
You can view the modality_tags
and computed_tags
for each modality
in the above links, and open the Souce code
dropdown for each property.
Assumptions made when crawling#
As of 2.0, the crawler makes the following assumptions:
-
All DICOM files within the specified input directory are valid DICOM files. That is, they all contain a valid DICOM header and can be read by pydicom.
-
All DICOM files belonging to the same series are in the same directory. This assumption is made, to simplify pointing to a series of files, by just pointing to the directory containing the files.
Note
Though we require that all files in a series are in the same directory, we do not require that all files in a directory belong to the same series. That is, a directory may contain files from multiple series, and the crawler will still be able to extract the series information from each file.
Reference building#
One of the main features of Med-ImageTools is building the internal database of all the complex relationships between DICOM series. The first step in this process is to extract the necessary information from each DICOM file, which is modality-specific, and given the evolution of the DICOM standard, may change over time.
For example, while modern RTSTRUCT
files contain a RTReferencedSeriesSequence
tag that references the SeriesInstanceUID
of the series that the structure
was derived from, this is not always present in older files.
As such, we have to rely on the ReferencedSOPInstanceUID
tag, which is a
unique identifier for all the files in the Referenced Series.
During the crawling process, we also build a database that maps each found
DICOM file to its corresponding series SOPInstanceUID -> SeriesInstanceUID
.
If the RTSTRUCT
file does not contain the RTReferencedSeriesSequence
,
we use all the ReferencedSOPInstanceUID
tags in the file to find the
corresponding seires using the mapping.
After a successful crawl, the Crawler
will have two database interfaces:
-
crawl_db
: This is the database that contains all the information extracted from the DICOM files. It is a dictionary mapping theSeriesInstanceUID
to all the metadatal extracted from the files in the series using theModalityMetadataExtractor
extractors. -
index
: This is a slimmed database that contains the core metadata to build a representation of the data relationships.It contains the following information:
PatientID
StudyInstanceUID
SeriesInstanceUID
SubSeries
(possible unique Acquisition)Modality
ReferencedModality
ReferencedSeriesUID
instances
(number of instances in the series)folder
(path to the folder containing the series)
These files are stored in a .imgtools
directory next to the input directory.
That is if the input directory is /path/to/DICOM_FILES
, the index output files
will be stored in /path/to/.imgtools/DICOM_FILES
.