
Commit

include column descriptions from separate file
minor text changes

updated ImageType description

check if include command works now

Moved section introducing indices of idc-index into README

Test to see if docs are build correctly

Docs: Added separate page for column description of indices and linked to it from README.

removed test

fixed link

removed doubled text

adapted toctree to generate column_descriptions.html

corrected toctree

another try

next try

a test

probably solved

added title and sections
DanielaSchacherer committed Sep 20, 2024
1 parent 1361412 commit f18ec0b
Showing 3 changed files with 129 additions and 89 deletions.
11 changes: 11 additions & 0 deletions README.md
@@ -81,6 +81,17 @@ client.download_from_selection(
)
```

## The `indices` of `idc-index`

`idc-index` is named this way because it wraps indices of IDC data: tables
containing the most important metadata attributes describing the files available
in IDC. The main metadata index is available in the `index` variable (which is a
pandas `DataFrame`) of `IDCClient`. Additional index tables are also available:
`clinical_index` contains non-DICOM clinical data, and the slide microscopy
tables (indicated by the prefix `sm`) include metadata attributes specific to
slide microscopy images. A description of the attributes available in all
indices can be found [here](column_descriptions).

## Tutorial

Please check out
116 changes: 116 additions & 0 deletions docs/column_descriptions.md
@@ -0,0 +1,116 @@
# Metadata attributes in `idc-index`'s index tables

## `index`

The following is the list of the columns included in `index`. You can use them
to select cohorts and subset data. `index` is series-based, i.e., it has one
row per DICOM series.

- non-DICOM attributes assigned/curated by IDC:

- `collection_id`: short string with the identifier of the collection the
series belongs to
- `analysis_result_id`: this string is not empty if the specific series is
part of an analysis results collection; analysis results can be added to a
given collection over time
- `source_DOI`: Digital Object Identifier of the dataset that contains the
given series; note that a given collection can include one or more DOIs,
since analysis results added to the collection would typically have
independent DOI values!
- `instanceCount`: number of files in the series (typically, this matches the
number of slices in cross-sectional modalities)
- `license_short_name`: short name of the license that governs the use of the
files corresponding to the series
- `series_aws_url`: location of the series files in a public AWS bucket
- `series_size_MB`: total disk size needed to store the series

- DICOM attributes extracted from the files
- `PatientID`: identifier of the patient
- `PatientAge` and `PatientSex`: attributes containing patient age and sex
- `StudyInstanceUID`: unique identifier of the DICOM study
- `StudyDescription`: textual description of the study content
- `StudyDate`: date of the study (note that those dates are shifted, and are
not real dates when images were acquired, to protect patient privacy)
- `SeriesInstanceUID`: unique identifier of the DICOM series
- `SeriesDate`: date when the series was acquired
- `SeriesDescription`: textual description of the series content
- `SeriesNumber`: series number
- `BodyPartExamined`: body part imaged
- `Modality`: acquisition modality
- `Manufacturer`: manufacturer of the equipment that generated the series
- `ManufacturerModelName`: model name of the equipment
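
As a minimal sketch of cohort selection with these columns, the snippet below filters a toy pandas `DataFrame` that stands in for `index`. The rows and values are invented for illustration; only the column names come from the list above.

```python
import pandas as pd

# Toy stand-in for the `index` table; all values are made up for illustration.
index = pd.DataFrame(
    {
        "collection_id": ["demo_a", "demo_a", "demo_b"],
        "SeriesInstanceUID": ["1.2.3.1", "1.2.3.2", "1.2.3.3"],
        "Modality": ["CT", "MR", "CT"],
        "BodyPartExamined": ["CHEST", "BRAIN", "CHEST"],
        "series_size_MB": [120.5, 80.0, 200.25],
    }
)

# Select a cohort: all CT series of the chest.
cohort = index[(index["Modality"] == "CT") & (index["BodyPartExamined"] == "CHEST")]

# `index` has one row per series, so each row contributes one series UID.
series_uids = cohort["SeriesInstanceUID"].tolist()

# Disk space needed to store the selected series.
total_size_mb = cohort["series_size_MB"].sum()
```

Because the table is series-based, the resulting list of `SeriesInstanceUID` values identifies exactly the series in the cohort.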

## `sm_index`

The following is the list of the columns included in `sm_index`. `sm_index` is
series-based, i.e., it has one row per DICOM series, but only includes series
with slide microscopy data.

- DICOM attributes extracted from the files:
- `SeriesInstanceUID`: unique identifier of the DICOM series: one DICOM series
= one slide
- `embeddingMedium`: describes in what medium the slide was embedded before
the image was obtained
- `tissueFixative`: describes tissue fixatives used before the image was
obtained
- `staining_usingSubstance`: describes staining steps the specimen underwent
before the image was obtained
- `max_TotalPixelMatrixColumns`: width of the image at the maximum resolution
- `max_TotalMatrixRows`: height of the image at the maximum resolution
- `min_PixelSpacing_2sf`: pixel spacing in mm at the maximum resolution layer,
rounded to 2 significant figures
- `ObjectiveLensPower`: power of the objective lens of the equipment used to
digitize the slide
- `primaryAnatomicStructure`: anatomic location from where the imaged specimen
was collected
- `primaryAnatomicStructureModifier`: additional characteristics of the
specimen, such as whether it is a tumor or normal tissue
- `illuminationType`: specifies the type of illumination used when obtaining
the image
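
Since both `index` and `sm_index` have one row per DICOM series, they can be combined on `SeriesInstanceUID`. The following is a sketch under that assumption, using toy tables with invented values in place of the real ones.

```python
import pandas as pd

# Toy stand-ins for `index` and `sm_index`; all values are made up.
index = pd.DataFrame(
    {
        "SeriesInstanceUID": ["1.2.3.1", "1.2.3.2", "1.2.3.3"],
        "collection_id": ["demo_a", "demo_a", "demo_b"],
        "Modality": ["SM", "CT", "SM"],
    }
)
sm_index = pd.DataFrame(
    {
        "SeriesInstanceUID": ["1.2.3.1", "1.2.3.3"],
        "ObjectiveLensPower": [40, 20],
    }
)

# Attach the general series metadata to each slide. `validate` raises an
# error if either table had more than one row per series.
slides = sm_index.merge(
    index, on="SeriesInstanceUID", how="left", validate="one_to_one"
)

# Keep only slides digitized with an objective lens power of at least 40.
high_power = slides[slides["ObjectiveLensPower"] >= 40]
high_power_collections = high_power["collection_id"].tolist()
```

A left merge starting from `sm_index` keeps only slide microscopy series, which matches the scope of that table.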

For `embeddingMedium`, `tissueFixative`, `staining_usingSubstance`,
`primaryAnatomicStructure`, `primaryAnatomicStructureModifier` and
`illuminationType`, each attribute is provided as two columns, one with the
suffix `_code_designator_value_str` and one with the suffix `_CodeMeaning`. The
suffix indicates whether the column contains the CodeSchemeDesignator and
CodeValue, or the CodeMeaning. If this is new to you, a brief explanation of the
three-value coding scheme in DICOM can be found at
https://learn.canceridc.dev/dicom/coding-schemes.
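
To illustrate the suffix convention, here is a small sketch. The helper `coded_columns` is hypothetical (it is not part of `idc-index`), and the cell values and their format are invented; real cells may be formatted differently or hold several values per slide.

```python
import pandas as pd


def coded_columns(attribute):
    """Return the (code, meaning) column names for a coded attribute.

    Hypothetical helper that only applies the documented suffixes.
    """
    return f"{attribute}_code_designator_value_str", f"{attribute}_CodeMeaning"


# Toy stand-in for `sm_index`; the cell values are placeholders, not real codes.
sm_index = pd.DataFrame(
    {
        "SeriesInstanceUID": ["1.2.3.1", "1.2.3.2"],
        "tissueFixative_code_designator_value_str": ["DEMO:001", "DEMO:002"],
        "tissueFixative_CodeMeaning": ["fixative A", "fixative B"],
    }
)

code_col, meaning_col = coded_columns("tissueFixative")

# Filter by the human-readable meaning, then read the matching code.
matches = sm_index[sm_index[meaning_col] == "fixative A"]
matching_codes = matches[code_col].tolist()
```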

## `sm_instance_index`

The following is the list of the columns included in `sm_instance_index`.
`sm_instance_index` is instance-based, i.e., it has one row per DICOM instance
(pyramid level of a slide, plus potentially thumbnail or label images), but only
includes DICOM instances of the slide microscopy modality.

- DICOM attributes extracted from the files:

- `SOPInstanceUID`: unique identifier of the DICOM instance: one DICOM
instance = one level/label/thumbnail image of the slide
- `SeriesInstanceUID`: unique identifier of the DICOM series: one DICOM series
= one slide
- `embeddingMedium`: describes in what medium the slide was embedded before
the image was obtained
- `tissueFixative`: describes tissue fixatives used before the image was
obtained
- `staining_usingSubstance`: describes staining steps the specimen underwent
before the image was obtained
- `max_TotalPixelMatrixColumns`: width of the image at the maximum resolution
- `max_TotalMatrixRows`: height of the image at the maximum resolution
- `PixelSpacing_0`: pixel spacing in mm
- `ImageType`: specifies further characteristics of the image in a list,
including as the third value whether it is a VOLUME, LABEL, OVERVIEW or
THUMBNAIL image.
- `TransferSyntaxUID`: specifies the encoding scheme used for the image data
- `instance_size`: specifies the DICOM instance's size in bytes

- non-DICOM attributes assigned/curated by IDC:
- `crdc_instance_uuid`: globally unique, versioned identifier of the DICOM
instance

For `embeddingMedium`, `tissueFixative`, and `staining_usingSubstance`, each
attribute is provided as two columns, one with the suffix
`_code_designator_value_str` and one with the suffix `_CodeMeaning`. The suffix
indicates whether the column contains the CodeSchemeDesignator and CodeValue, or
the CodeMeaning. If this is new to you, a brief explanation of the three-value
coding scheme in DICOM can be found at
https://learn.canceridc.dev/dicom/coding-schemes.
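
As a sketch of how the third value of `ImageType` can be used to separate pyramid levels from label or thumbnail images, the snippet below works on a toy table with invented values. It assumes each `ImageType` cell is a Python list of strings; the real table may store the value in a different form.

```python
import pandas as pd

# Toy stand-in for `sm_instance_index`; all values are made up, and the
# list-valued `ImageType` cells are an assumption about the storage format.
sm_instance_index = pd.DataFrame(
    {
        "SOPInstanceUID": ["1.2.3.1.1", "1.2.3.1.2", "1.2.3.1.3"],
        "SeriesInstanceUID": ["1.2.3.1", "1.2.3.1", "1.2.3.1"],
        "ImageType": [
            ["ORIGINAL", "PRIMARY", "VOLUME", "NONE"],
            ["DERIVED", "PRIMARY", "VOLUME", "RESAMPLED"],
            ["ORIGINAL", "PRIMARY", "LABEL", "NONE"],
        ],
        "instance_size": [5000000, 1200000, 40000],
    }
)

# The third value (position 2) says whether the instance is a VOLUME,
# LABEL, OVERVIEW or THUMBNAIL image.
flavor = sm_instance_index["ImageType"].apply(lambda values: values[2])

# Keep only the pyramid levels and add up their size in bytes.
volume_levels = sm_instance_index[flavor == "VOLUME"]
volume_bytes = int(volume_levels["instance_size"].sum())
```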
91 changes: 2 additions & 89 deletions docs/index.md
@@ -21,101 +21,14 @@ starting a discussion in [IDC User forum](https://discourse.canceridc.dev/).
:start-after: <!-- SPHINX-START -->
```

## The `index` of `idc-index`

`idc-index` is named this way because it wraps indices of IDC data: tables
containing most important metadata attributes describing the files available in
IDC. The main metadata index is available in the `index` variable (which is a pandas
`DataFrame`) of `IDCClient`.
Additional index tables such as the `clinical_index` contain non-DICOM clinical data or
slide microscopy specific tables (indicated by the prefix `sm`) include metadata attributes
specific to slide microscopy images.



The following is the list of the columns included in `index`. You can use those
to select cohorts and subsetting data. `idc-index` is series-based, i.e, it has
one row per DICOM series.

- non-DICOM attributes assigned/curated by IDC:
- `collection_id`: short string with the identifier of the collection the
series belongs to
- `analysis_result_id`: this string is not empty if the specific series is
part of an analysis results collection; analysis results can be added to a
given collection over time
- `source_DOI`: Digital Object Identifier of the dataset that contains the
given series; note that a given collection can include one or more DOIs,
since analysis results added to the collection would typically have
independent DOI values!
- `instanceCount`: number of files in the series (typically, this matches the
number of slices in cross-sectional modalities)
- `license_short_name`: short name of the license that governs the use of the
files corresponding to the series
- `series_aws_url`: location of the series files in a public AWS bucket
- `series_size_MB`: total disk size needed to store the series

- DICOM attributes extracted from the files
- `PatientID`: identifier of the patient
- `PatientAge` and `PatientSex`: attributes containing patient age and sex
- `StudyInstanceUID`: unique identifier of the DICOM study
- `StudyDescription`: textual description of the study content
- `StudyDate`: date of the study (note that those dates are shifted, and are
not real dates when images were acquired, to protect patient privacy)
- `SeriesInstanceUID`: unique identifier of the DICOM series
- `SeriesDate`: date when the series was acquired
- `SeriesDescription`: textual description of the series content
- `SeriesNumber`: series number
- `BodyPartExamined`: body part imaged
- `Modality`: acquisition modality
- `Manufacturer`: manufacturer of the equipment that generated the series
- `ManufacturerModelName`: model name of the equipment

The following is the list of the columns included in `sm_index`. `sm_index` is series-based, i.e, it has
one row per DICOM series, but only includes series with slide microscopy data.

- DICOM attributes extracted from the files:
- `SeriesInstanceUID`: unique identifier of the DICOM series: one DICOM series = one slide
- `embeddingMedium`: describes in what medium the slide was embedded before the image was obtained
- `tissueFixative`: describes tissue fixatives used before the image was obtained
- `staining_usingSubstance`: describes staining steps the specimen underwent before the image was obtained
- `max_TotalPixelMatrixColumns`: width of the image at the maximum resolution
- `max_TotalMatrixRows`: height of the image at the maximum resolution
- `min_PixelSpacing_2sf`: pixel spacing in mm at the maximum resolution layer, rounded to 2 significant figures
- `ObjectiveLensPower`: power of the objective lens of the equipment used to digitize the slide
- `primaryAnatomicStructure`: anatomic location from where the imaged specimen was collected
- `primaryAnatomicStructureModifier`: additional characteristics of the specimen, such as whether it is a tumor or normal tissue
- `illuminationType`: specifies the type of illumination used when obtainig the image

In case of `embeddingMedium`, `tissueFixative`, `staining_usingSubstance`, `primaryAnatomicStructure`, `primaryAnatomicStructureModifier` and `illuminationType` the attributes exist with suffix `_code_designator_value_str` and `_CodeMeaning`, which indicates whether the column contains CodeSchemeDesignator and CodeValue, or CodeMeaning. If this is new to you, a brief explanation on the three-value based coding scheme in DICOM can be found at https://learn.canceridc.dev/dicom/coding-schemes.

The following is the list of the columns included in `sm_instance_index`. `sm_instance_index` is instance-based, i.e, it has
one row per DICOM instance (pyramid level of a slide, plus potentially thumbnail or label images), but only includes DICOM instances of the slide microscopy modality.

- DICOM attributes extracted from the files:
- `SOPInstanceUID`: unique identifier of the DICOM instance: one DICOM instance = one level/label/thumbnail image of the slide
- `SeriesInstanceUID`: unique identifier of the DICOM series: one DICOM series = one slide
- `embeddingMedium`: describes in what medium the slide was embedded before the image was obtained
- `tissueFixative`: describes tissue fixatives used before the image was obtained
- `staining_usingSubstance`: describes staining steps the specimen underwent before the image was obtained
- `max_TotalPixelMatrixColumns`: width of the image at the maximum resolution
- `max_TotalMatrixRows`: height of the image at the maximum resolution
- `PixelSpacing_0`: pixel spacing in mm
- `ImageType`: specifies further characteristics of the image, including whether it is a VOLUME, LABEL, OVERVIEW or THUMBNAIL image.
- `TransferSyntaxUID`: specifies the encoding scheme used for the image data
- `instance_size`: specifies the DICOM instance's size in bytes

- non-DICOM attributes assigned/curated by IDC:
- `crdc_instance_uuid`: globally unique, versioned identifier of the DICOM instance

In case of `embeddingMedium`, `tissueFixative`, and `staining_usingSubstance` the attributes exist with suffix `_code_designator_value_str` and `_CodeMeaning`, which indicates whether the column contains CodeSchemeDesignator and CodeValue, or CodeMeaning. If this is new to you, a brief explanation on the three-value based coding scheme in DICOM can be found at https://learn.canceridc.dev/dicom/coding-schemes.

## Contents

```{toctree}
:maxdepth: 1
:maxdepth: 2
:titlesonly:
:caption: API docs
column_descriptions
api/idc_index
```
