diff --git a/README.md b/README.md
index 0f404f98..349e1ad9 100644
--- a/README.md
+++ b/README.md
@@ -81,6 +81,17 @@ client.download_from_selection(
 )
 ```
 
+## The `indices` of `idc-index`
+
+`idc-index` is named this way because it wraps indices of IDC data: tables
+containing the most important metadata attributes describing the files available
+in IDC. The main metadata index is available in the `index` variable (which is a
+pandas `DataFrame`) of `IDCClient`. Additional index tables are also available:
+`clinical_index` contains non-DICOM clinical data, while the slide microscopy
+specific tables (indicated by the prefix `sm`) include metadata attributes
+specific to slide microscopy images. A description of available attributes for
+all indices can be found [here](column_descriptions).
+
 ## Tutorial
 
 Please check out
diff --git a/docs/column_descriptions.md b/docs/column_descriptions.md
new file mode 100644
index 00000000..d662b41e
--- /dev/null
+++ b/docs/column_descriptions.md
@@ -0,0 +1,116 @@
+# Metadata attributes in `idc-index`'s index tables
+
+## `index`
+
+The following is the list of the columns included in `index`. You can use those
+to select cohorts and subset data. `index` is series-based, i.e., it has one
+row per DICOM series.
+
+- non-DICOM attributes assigned/curated by IDC:
+
+  - `collection_id`: short string with the identifier of the collection the
+    series belongs to
+  - `analysis_result_id`: this string is not empty if the specific series is
+    part of an analysis results collection; analysis results can be added to a
+    given collection over time
+  - `source_DOI`: Digital Object Identifier of the dataset that contains the
+    given series; note that a given collection can include one or more DOIs,
+    since analysis results added to the collection would typically have
+    independent DOI values!
+  - `instanceCount`: number of files in the series (typically, this matches the
+    number of slices in cross-sectional modalities)
+  - `license_short_name`: short name of the license that governs the use of the
+    files corresponding to the series
+  - `series_aws_url`: location of the series files in a public AWS bucket
+  - `series_size_MB`: total disk size needed to store the series
+
+- DICOM attributes extracted from the files:
+  - `PatientID`: identifier of the patient
+  - `PatientAge` and `PatientSex`: attributes containing patient age and sex
+  - `StudyInstanceUID`: unique identifier of the DICOM study
+  - `StudyDescription`: textual description of the study content
+  - `StudyDate`: date of the study (note that those dates are shifted, and are
+    not real dates when images were acquired, to protect patient privacy)
+  - `SeriesInstanceUID`: unique identifier of the DICOM series
+  - `SeriesDate`: date when the series was acquired
+  - `SeriesDescription`: textual description of the series content
+  - `SeriesNumber`: series number
+  - `BodyPartExamined`: body part imaged
+  - `Modality`: acquisition modality
+  - `Manufacturer`: manufacturer of the equipment that generated the series
+  - `ManufacturerModelName`: model name of the equipment
+
+## `sm_index`
+
+The following is the list of the columns included in `sm_index`. `sm_index` is
+series-based, i.e., it has one row per DICOM series, but only includes series
+with slide microscopy data.
+
+- DICOM attributes extracted from the files:
+  - `SeriesInstanceUID`: unique identifier of the DICOM series: one DICOM series
+    = one slide
+  - `embeddingMedium`: describes in what medium the slide was embedded before
+    the image was obtained
+  - `tissueFixative`: describes tissue fixatives used before the image was
+    obtained
+  - `staining_usingSubstance`: describes staining steps the specimen underwent
+    before the image was obtained
+  - `max_TotalPixelMatrixColumns`: width of the image at the maximum resolution
+  - `max_TotalMatrixRows`: height of the image at the maximum resolution
+  - `min_PixelSpacing_2sf`: pixel spacing in mm at the maximum resolution layer,
+    rounded to 2 significant figures
+  - `ObjectiveLensPower`: power of the objective lens of the equipment used to
+    digitize the slide
+  - `primaryAnatomicStructure`: anatomic location from where the imaged specimen
+    was collected
+  - `primaryAnatomicStructureModifier`: additional characteristics of the
+    specimen, such as whether it is a tumor or normal tissue
+  - `illuminationType`: specifies the type of illumination used when obtaining
+    the image
+
+For `embeddingMedium`, `tissueFixative`, `staining_usingSubstance`,
+`primaryAnatomicStructure`, `primaryAnatomicStructureModifier` and
+`illuminationType`, each attribute exists in two columns, with the suffixes
+`_code_designator_value_str` and `_CodeMeaning`, which indicate whether the
+column contains CodeSchemeDesignator and CodeValue, or CodeMeaning. If this is
+new to you, a brief explanation of the three-value coding scheme in DICOM can be
+found at https://learn.canceridc.dev/dicom/coding-schemes.
+
+## `sm_instance_index`
+
+The following is the list of the columns included in `sm_instance_index`.
+`sm_instance_index` is instance-based, i.e., it has one row per DICOM instance
+(pyramid level of a slide, plus potentially thumbnail or label images), but only
+includes DICOM instances of the slide microscopy modality.
+
+- DICOM attributes extracted from the files:
+
+  - `SOPInstanceUID`: unique identifier of the DICOM instance: one DICOM
+    instance = one level/label/thumbnail image of the slide
+  - `SeriesInstanceUID`: unique identifier of the DICOM series: one DICOM series
+    = one slide
+  - `embeddingMedium`: describes in what medium the slide was embedded before
+    the image was obtained
+  - `tissueFixative`: describes tissue fixatives used before the image was
+    obtained
+  - `staining_usingSubstance`: describes staining steps the specimen underwent
+    before the image was obtained
+  - `max_TotalPixelMatrixColumns`: width of the image at the maximum resolution
+  - `max_TotalMatrixRows`: height of the image at the maximum resolution
+  - `PixelSpacing_0`: pixel spacing in mm
+  - `ImageType`: specifies further characteristics of the image in a list,
+    including as the third value whether it is a VOLUME, LABEL, OVERVIEW or
+    THUMBNAIL image.
+  - `TransferSyntaxUID`: specifies the encoding scheme used for the image data
+  - `instance_size`: specifies the DICOM instance's size in bytes
+
+- non-DICOM attributes assigned/curated by IDC:
+  - `crdc_instance_uuid`: globally unique, versioned identifier of the DICOM
+    instance
+
+For `embeddingMedium`, `tissueFixative`, and `staining_usingSubstance`, each
+attribute exists in two columns, with the suffixes `_code_designator_value_str`
+and `_CodeMeaning`, which indicate whether the column contains
+CodeSchemeDesignator and CodeValue, or CodeMeaning. If this is new to you, a
+brief explanation of the three-value coding scheme in DICOM can be found at
+https://learn.canceridc.dev/dicom/coding-schemes.
diff --git a/docs/index.md b/docs/index.md
index 8bd83e6b..13b97b88 100644
--- a/docs/index.md
+++ b/docs/index.md
@@ -21,101 +21,14 @@ starting a discussion in [IDC User forum](https://discourse.canceridc.dev/).
 :start-after:
 ```
 
-## The `index` of `idc-index`
-
-`idc-index` is named this way because it wraps indices of IDC data: tables
-containing most important metadata attributes describing the files available in
-IDC. The main metadata index is available in the `index` variable (which is a pandas
-`DataFrame`) of `IDCClient`.
-Additional index tables such as the `clinical_index` contain non-DICOM clinical data or
-slide microscopy specific tables (indicated by the prefix `sm`) include metadata attributes
-specific to slide microscopy images.
-
-
-
-The following is the list of the columns included in `index`. You can use those
-to select cohorts and subsetting data. `idc-index` is series-based, i.e, it has
-one row per DICOM series.
-
-- non-DICOM attributes assigned/curated by IDC:
-  - `collection_id`: short string with the identifier of the collection the
-    series belongs to
-  - `analysis_result_id`: this string is not empty if the specific series is
-    part of an analysis results collection; analysis results can be added to a
-    given collection over time
-  - `source_DOI`: Digital Object Identifier of the dataset that contains the
-    given series; note that a given collection can include one or more DOIs,
-    since analysis results added to the collection would typically have
-    independent DOI values!
-  - `instanceCount`: number of files in the series (typically, this matches the
-    number of slices in cross-sectional modalities)
-  - `license_short_name`: short name of the license that governs the use of the
-    files corresponding to the series
-  - `series_aws_url`: location of the series files in a public AWS bucket
-  - `series_size_MB`: total disk size needed to store the series
-
-- DICOM attributes extracted from the files
-  - `PatientID`: identifier of the patient
-  - `PatientAge` and `PatientSex`: attributes containing patient age and sex
-  - `StudyInstanceUID`: unique identifier of the DICOM study
-  - `StudyDescription`: textual description of the study content
-  - `StudyDate`: date of the study (note that those dates are shifted, and are
-    not real dates when images were acquired, to protect patient privacy)
-  - `SeriesInstanceUID`: unique identifier of the DICOM series
-  - `SeriesDate`: date when the series was acquired
-  - `SeriesDescription`: textual description of the series content
-  - `SeriesNumber`: series number
-  - `BodyPartExamined`: body part imaged
-  - `Modality`: acquisition modality
-  - `Manufacturer`: manufacturer of the equipment that generated the series
-  - `ManufacturerModelName`: model name of the equipment
-
-The following is the list of the columns included in `sm_index`. `sm_index` is series-based, i.e, it has
-one row per DICOM series, but only includes series with slide microscopy data.
-
-- DICOM attributes extracted from the files:
-  - `SeriesInstanceUID`: unique identifier of the DICOM series: one DICOM series = one slide
-  - `embeddingMedium`: describes in what medium the slide was embedded before the image was obtained
-  - `tissueFixative`: describes tissue fixatives used before the image was obtained
-  - `staining_usingSubstance`: describes staining steps the specimen underwent before the image was obtained
-  - `max_TotalPixelMatrixColumns`: width of the image at the maximum resolution
-  - `max_TotalMatrixRows`: height of the image at the maximum resolution
-  - `min_PixelSpacing_2sf`: pixel spacing in mm at the maximum resolution layer, rounded to 2 significant figures
-  - `ObjectiveLensPower`: power of the objective lens of the equipment used to digitize the slide
-  - `primaryAnatomicStructure`: anatomic location from where the imaged specimen was collected
-  - `primaryAnatomicStructureModifier`: additional characteristics of the specimen, such as whether it is a tumor or normal tissue
-  - `illuminationType`: specifies the type of illumination used when obtainig the image
-
-In case of `embeddingMedium`, `tissueFixative`, `staining_usingSubstance`, `primaryAnatomicStructure`, `primaryAnatomicStructureModifier` and `illuminationType` the attributes exist with suffix `_code_designator_value_str` and `_CodeMeaning`, which indicates whether the column contains CodeSchemeDesignator and CodeValue, or CodeMeaning. If this is new to you, a brief explanation on the three-value based coding scheme in DICOM can be found at https://learn.canceridc.dev/dicom/coding-schemes.
-
-The following is the list of the columns included in `sm_instance_index`. `sm_instance_index` is instance-based, i.e, it has
-one row per DICOM instance (pyramid level of a slide, plus potentially thumbnail or label images), but only includes DICOM instances of the slide microscopy modality.
-
-- DICOM attributes extracted from the files:
-  - `SOPInstanceUID`: unique identifier of the DICOM instance: one DICOM instance = one level/label/thumbnail image of the slide
-  - `SeriesInstanceUID`: unique identifier of the DICOM series: one DICOM series = one slide
-  - `embeddingMedium`: describes in what medium the slide was embedded before the image was obtained
-  - `tissueFixative`: describes tissue fixatives used before the image was obtained
-  - `staining_usingSubstance`: describes staining steps the specimen underwent before the image was obtained
-  - `max_TotalPixelMatrixColumns`: width of the image at the maximum resolution
-  - `max_TotalMatrixRows`: height of the image at the maximum resolution
-  - `PixelSpacing_0`: pixel spacing in mm
-  - `ImageType`: specifies further characteristics of the image, including whether it is a VOLUME, LABEL, OVERVIEW or THUMBNAIL image.
-  - `TransferSyntaxUID`: specifies the encoding scheme used for the image data
-  - `instance_size`: specifies the DICOM instance's size in bytes
-
-- non-DICOM attributes assigned/curated by IDC:
-  - `crdc_instance_uuid`: globally unique, versioned identifier of the DICOM instance
-
-In case of `embeddingMedium`, `tissueFixative`, and `staining_usingSubstance` the attributes exist with suffix `_code_designator_value_str` and `_CodeMeaning`, which indicates whether the column contains CodeSchemeDesignator and CodeValue, or CodeMeaning. If this is new to you, a brief explanation on the three-value based coding scheme in DICOM can be found at https://learn.canceridc.dev/dicom/coding-schemes.
-
 ## Contents
 
 ```{toctree}
-:maxdepth: 1
+:maxdepth: 2
 :titlesonly:
 :caption: API docs
 
+column_descriptions
 api/idc_index
 ```
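
The new `docs/column_descriptions.md` says the `index` columns can be used to select cohorts and that `index` has one row per DICOM series. A minimal sketch of what that could look like with plain pandas follows. It builds a toy `DataFrame` rather than calling `idc-index`; only the column names come from the documentation in this diff, while the rows, collection names and UIDs are invented for illustration.

```python
import pandas as pd

# Toy stand-in for the `index` DataFrame of `IDCClient`. The rows are
# hypothetical; only the column names are taken from column_descriptions.md.
index = pd.DataFrame(
    {
        "collection_id": ["collection_a", "collection_a", "collection_b"],
        "PatientID": ["P1", "P2", "P3"],
        "SeriesInstanceUID": ["1.2.3.1", "1.2.3.2", "1.2.3.3"],
        "Modality": ["MR", "CT", "MR"],
        "series_size_MB": [12.5, 80.0, 20.0],
    }
)

# `index` is series-based (one row per DICOM series), so a boolean mask over
# its columns selects a cohort of series.
is_mr = index["Modality"] == "MR"
in_collection = index["collection_id"] == "collection_a"
cohort = index[is_mr & in_collection]

# Series identifiers in the cohort and the disk space they would need.
series_uids = cohort["SeriesInstanceUID"].tolist()
total_size_mb = cohort["series_size_MB"].sum()
print(series_uids, total_size_mb)
```

With a real client, the same filtering would presumably apply to the `index` attribute of `IDCClient` unchanged, since the documentation describes it as a pandas `DataFrame`.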