SMI metadata collection

Dependencies

Some of the scripts require metadata generated by other scripts. For example, all scripts have a dependency on populate_catalogue.py because they require the metadata catalogue to be initialised. Other dependencies are:

| Script | Dependency | Reason |
| --- | --- | --- |
| promotion_status.py | create_blocklists.py | To set the promotion status to blocked, the script needs to know which tags and modalities are considered blocked. |
| tag_quality.py | public_status.py | Allows prioritisation of tags by the public status. |
| tag_quality.py | promotion_status.py | Allows prioritisation of tags by the promotion status. |
| tag_quality.py | counts | To calculate percentages. |

Initialise catalogue

This script will initialise the metadata catalogue with two collections, modalities and tags.

$ python populate_catalogue.py -i

Note: This script uses a default analytics database as both the source of metadata and destination for the catalogue. It assumes the metadata is in a series collection under the analytics database.

This will generate a modalities collection containing all the modalities found in the raw DICOM database and their corresponding tags:

[
   {
       "modality": "<MODALITY_NAME>",
       "tags": [
           {
               "tag": "<TAG_NAME>"
           }
       ]
   }
]

And a tags collection containing all the tags found in the raw DICOM database and the modalities they can be found in:

[
   {
       "tag": "<TAG_NAME>",
       "modalities": ["<MODALITY_NAME>"]
   }
]
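
The relationship between the two collections can be illustrated with a minimal sketch that inverts modalities documents into tags documents. This is not the code in populate_catalogue.py; the function name `invert_modalities` and the sample modality and tag names are assumptions made for illustration only.

```python
from collections import defaultdict


def invert_modalities(modalities):
    """Build tags-style documents from modalities-style documents."""
    tag_to_modalities = defaultdict(list)
    for doc in modalities:
        for entry in doc["tags"]:
            # Record each modality once per tag, keeping first-seen order.
            if doc["modality"] not in tag_to_modalities[entry["tag"]]:
                tag_to_modalities[entry["tag"]].append(doc["modality"])
    return [
        {"tag": tag, "modalities": mods}
        for tag, mods in sorted(tag_to_modalities.items())
    ]


# Illustrative input only; real documents come from the catalogue.
modalities = [
    {"modality": "CT", "tags": [{"tag": "PatientID"}, {"tag": "KVP"}]},
    {"modality": "MR", "tags": [{"tag": "PatientID"}]},
]
tags = invert_modalities(modalities)
```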

Note: The tag extraction only covers top-level tags. See an example of the difficulty with querying nested objects with unknown keys here.

Generate blocklists

Blocklists are used to indicate modalities and tags that have been blocked from being processed and that require no further analysis or additional metadata.

If you have specific tags and modalities to block, you can create JSON blocklists to indicate what these are. For modalities, this would need to look something like this:

modality_blocklist.json

[
    {
        "modality": "<MODALITY_NAME>",
        "blockReason": "<DESCRIPTION>"
    }
]

And for tags, something like this:

tag_blocklist.json

[
    {
        "tag": "<TAG_NAME>",
        "modality": "<MODALITY|all>",
        "blockReason": "<DESCRIPTION>"
    }
]

Note: Blocking a modality will also block any modality-specific tags.

And you can pass them to the create_blocklists.py script to load them into the metadata catalogue:

$ python create_blocklists.py -m modality_blocklist.json -t tag_blocklist.json

If you do not have a list of specific modalities and tags, but want to block any that match a criterion (e.g., containing Unknown in the name), you can pass this criterion to the create_blocklists.py script instead:

$ python create_blocklists.py -b "Unknown"

You can also specify both:

$ python create_blocklists.py -m modality_blocklist.json -t tag_blocklist.json -b "Unknown"
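
As a rough sketch of what criterion-based blocking might produce, the snippet below builds blocklist entries in the JSON shapes shown above. It assumes a case-sensitive substring match, that matching tags are blocked for all modalities, and a made-up blockReason text; the actual behaviour of the -b option may differ. The function name `build_blocklists` is hypothetical.

```python
def build_blocklists(modalities, tags, criterion):
    """Return (modality_blocklist, tag_blocklist) for names containing criterion."""
    # Assumption: plain, case-sensitive substring matching.
    reason = f"Name contains '{criterion}'"
    modality_blocklist = [
        {"modality": name, "blockReason": reason}
        for name in modalities
        if criterion in name
    ]
    tag_blocklist = [
        # Assumption: a criterion match blocks the tag in every modality.
        {"tag": name, "modality": "all", "blockReason": reason}
        for name in tags
        if criterion in name
    ]
    return modality_blocklist, tag_blocklist


# Illustrative names only.
blocked_modalities, blocked_tags = build_blocklists(
    ["CT", "Unknown"], ["PatientID", "UnknownTag"], "Unknown"
)
```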

Perform Mongo counts

To perform study, series and image level counts and statistics on a non-relational database, run the following command:

$ python mongo_counts.py

This will add the MongoDB modality counts to the previously initialised modalities collection:

[
   {
        "modality": "<MODALITY_NAME>",
        "tags": [
            {
                "tag": "<TAG_NAME>"
            }
        ],
        "totalNoImagesRaw": "<NUMBER>",        #new
        "totalNoSeriesRaw": "<NUMBER>",        #new
        "totalNoStudiesRaw": "<NUMBER>",       #new
        "avgNoImgPerSeriesRaw": "<NUMBER>",    #new
        "minNoImgPerSeriesRaw": "<NUMBER>",    #new
        "maxNoImgPerSeriesRaw": "<NUMBER>",    #new
        "stdDevImgPerSeriesRaw": "<NUMBER>",   #new
        "avgNoSeriesPerStudyRaw": "<NUMBER>",  #new
        "minNoSeriesPerStudyRaw": "<NUMBER>",  #new
        "maxNoSeriesPerStudyRaw": "<NUMBER>",  #new
        "stdDevSeriesPerStudyRaw": "<NUMBER>", #new
        "countsPerMonthRaw": [                 #new
            {
                "date": "<YYYY/MM>",
                "imageCount": "<NUMBER>",
                "seriesCount": "<NUMBER>",
                "studyCount": "<NUMBER>"
            }
        ],
        "countsDateRaw": "<DATE_OF_COUNTS>"    #new
    }
]
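
To show how the images-per-series fields relate to each other, here is a minimal sketch that derives them from a list of per-series image counts. It is not taken from mongo_counts.py: the function name `series_stats` is hypothetical, and the use of the population standard deviation is an assumption, since the source does not say which variant is used.

```python
import statistics


def series_stats(images_per_series):
    """Summarise a list holding the number of images in each series."""
    return {
        "totalNoImagesRaw": sum(images_per_series),
        "totalNoSeriesRaw": len(images_per_series),
        "avgNoImgPerSeriesRaw": statistics.mean(images_per_series),
        "minNoImgPerSeriesRaw": min(images_per_series),
        "maxNoImgPerSeriesRaw": max(images_per_series),
        # Assumption: population (not sample) standard deviation.
        "stdDevImgPerSeriesRaw": statistics.pstdev(images_per_series),
    }


# Illustrative counts only: eight series with 2 to 9 images each.
stats = series_stats([2, 4, 4, 4, 5, 5, 7, 9])
```

The series-per-study fields could be computed the same way from a list of per-study series counts.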

Perform MySQL counts

To perform study, series and image level counts and statistics on a relational database, run the following command:

$ python mysql_counts.py -s <STATUS>

Note: The <STATUS> can be a choice of Staging or Processing and it will be attached to the generated metadata as an indicator of the database it was extracted from.

This will add the MySQL modality counts to the modalities collection:

[
   {
        "modality": "<MODALITY_NAME>",
        "tags": [
            {
                "tag": "<TAG_NAME>"
            }
        ],
        "totalNoImagesRaw": "<NUMBER>",
        "totalNoSeriesRaw": "<NUMBER>",
        "totalNoStudiesRaw": "<NUMBER>",
        "avgNoImgPerSeriesRaw": "<NUMBER>",
        "minNoImgPerSeriesRaw": "<NUMBER>",
        "maxNoImgPerSeriesRaw": "<NUMBER>",
        "stdDevImgPerSeriesRaw": "<NUMBER>",
        "avgNoSeriesPerStudyRaw": "<NUMBER>",
        "minNoSeriesPerStudyRaw": "<NUMBER>",
        "maxNoSeriesPerStudyRaw": "<NUMBER>",
        "stdDevSeriesPerStudyRaw": "<NUMBER>",
        "countsPerMonthRaw": [
            {
                "date": "<YYYY/MM>",
                "imageCount": "<NUMBER>",
                "seriesCount": "<NUMBER>",
                "studyCount": "<NUMBER>"
            }
        ],
        "countsDateRaw": "<DATE_OF_COUNTS>",
        "totalNoImages<STATUS>": "<NUMBER>",     #new
        "totalNoSeries<STATUS>": "<NUMBER>",     #new
        "totalNoStudies<STATUS>": "<NUMBER>",    #new
        "countsPerMonth<STATUS>": [              #new
            {
                "date": "<YYYY/MM>",
                "imageCount": "<NUMBER>",
                "seriesCount": "<NUMBER>",
                "studyCount": "<NUMBER>"
            }
        ],
        "countsDate<STATUS>": "<DATE_OF_COUNTS>" #new
   }
]
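
The `<STATUS>` placeholder in the field names above is replaced by the value passed with -s. The following sketch shows one way the suffixed keys could be built; the helper `status_fields` and its parameters are hypothetical and only mirror the field names documented here.

```python
VALID_STATUSES = ("Staging", "Processing")


def status_fields(status, images, series, studies, counts_per_month, counts_date):
    """Return the status-suffixed count fields for one modality."""
    if status not in VALID_STATUSES:
        raise ValueError(f"status must be one of {VALID_STATUSES}, got {status!r}")
    return {
        f"totalNoImages{status}": images,
        f"totalNoSeries{status}": series,
        f"totalNoStudies{status}": studies,
        f"countsPerMonth{status}": counts_per_month,
        f"countsDate{status}": counts_date,
    }


# Illustrative values only.
fields = status_fields("Staging", 10, 2, 1, [], "<DATE_OF_COUNTS>")
```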

Set tag public status

In the context of DICOM tags, there can be tags known as public, which are recognised by the DICOM standard, and private, which are not part of the standard and are specific to the machine generating the information.

$ python public_status.py

This will go through the list of tags in the tags collection and label those without DICOM codes (e.g., Tag Name) as public (True) and those with DICOM codes (e.g., (0000,0000) Tag Name) as not public (False):

[
   {
       "tag": "<TAG_NAME>",
       "modalities": ["<MODALITY_NAME>"],
       "public": "<True/False>" # new
   }
]
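
The labelling rule can be sketched as a check on the tag name's prefix. This is an illustration, not the code in public_status.py: it assumes the DICOM code always appears at the start of the name as four hexadecimal digits, a comma, and four more hexadecimal digits in parentheses. The names `DICOM_CODE_PREFIX` and `is_public` are hypothetical.

```python
import re

# Assumption: a code looks like "(0000,0000)" at the start of the tag name.
DICOM_CODE_PREFIX = re.compile(r"^\([0-9A-Fa-f]{4},[0-9A-Fa-f]{4}\)")


def is_public(tag_name):
    """Return True when the tag name does not start with a DICOM code."""
    return DICOM_CODE_PREFIX.match(tag_name) is None


# Illustrative tag names only.
labels = {name: is_public(name) for name in ["Tag Name", "(0000,0000) Tag Name"]}
```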

Set tag promotion status

The promotion status indicates which stage of the SMI processing pipeline a tag has reached. Each tag has one of the following statuses at any one time:

| Status | Promotion stage |
| --- | --- |
| blocked | On a blocklist. |
| unavailable | Raw, non-relational format. |
| processing | Raw, relational format. |
| available | Anonymised, relational format. |

$ python promotion_status.py

This will add the modality promotion status to the modalities metadata:

[
   {
        "modality": "<MODALITY_NAME>",
        "tags": [
            {
                "tag": "<TAG_NAME>"
            }
        ],
        "totalNoImagesRaw": "<NUMBER>",
        "totalNoSeriesRaw": "<NUMBER>",
        "totalNoStudiesRaw": "<NUMBER>",
        "avgNoImgPerSeriesRaw": "<NUMBER>",
        "minNoImgPerSeriesRaw": "<NUMBER>",
        "maxNoImgPerSeriesRaw": "<NUMBER>",
        "stdDevImgPerSeriesRaw": "<NUMBER>",
        "avgNoSeriesPerStudyRaw": "<NUMBER>",
        "minNoSeriesPerStudyRaw": "<NUMBER>",
        "maxNoSeriesPerStudyRaw": "<NUMBER>",
        "stdDevSeriesPerStudyRaw": "<NUMBER>",
        "countsPerMonthRaw": [
            {
                "date": "<YYYY/MM>",
                "imageCount": "<NUMBER>",
                "seriesCount": "<NUMBER>",
                "studyCount": "<NUMBER>"
            }
        ],
        "countsDateRaw": "<DATE_OF_COUNTS>",
        "totalNoImages<STATUS>": "<NUMBER>",
        "totalNoSeries<STATUS>": "<NUMBER>",
        "totalNoStudies<STATUS>": "<NUMBER>",
        "countsPerMonth<STATUS>": [
            {
                "date": "<YYYY/MM>",
                "imageCount": "<NUMBER>",
                "seriesCount": "<NUMBER>",
                "studyCount": "<NUMBER>"
            }
        ],
        "countsDate<STATUS>": "<DATE_OF_COUNTS>",
        "promotionStatus": "<blocked|unavailable|processing|available>" # new
   }
]

And the following attributes to the tags metadata:

[
   {
       "tag": "<TAG_NAME>",
       "modalities": ["<MODALITY_NAME>"],
       "public": "<True/False>",
       "promotionStatus": "<blocked|unavailable|processing|available>" # new
   }
]
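
One plausible way to resolve a single status from the stages in the table is sketched below. The precedence used here (blocked first, then the furthest stage reached) is an assumption, as are the function name and its boolean parameters; promotion_status.py may decide differently.

```python
def resolve_promotion_status(on_blocklist, in_raw_relational, in_anonymised_relational):
    """Pick one promotion status for a tag or modality."""
    # Assumption: a blocklist entry overrides every other stage.
    if on_blocklist:
        return "blocked"
    # Assumption: otherwise the furthest stage reached wins.
    if in_anonymised_relational:
        return "available"
    if in_raw_relational:
        return "processing"
    # Only present in the raw, non-relational store.
    return "unavailable"


status = resolve_promotion_status(
    on_blocklist=False, in_raw_relational=True, in_anonymised_relational=False
)
```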

Measure tag quality

This calculates completeness for each tag. Completeness is the percentage of all images that hold a usable value for the tag (i.e., excluding missing, null and empty-string values).

For an example dataset containing the following test records:

[
    { "_id" : 1, "item" : "pizza", "description" : "yummy", "quantity" : "infinite", "type" : "food" },
    { "_id" : 2, "item" : null, "description" : null, "quantity" : "1", "type" : "food" },
    { "_id" : 3, "item" : "cricket", "quantity" : "", "type" : "food" },
    { "_id" : 4, "item" : "" }
]

$ python tag_quality.py

The quality check would result in the following metadata:

| Tag | Completeness |
| --- | --- |
| _id | 100% |
| item | 50% |
| description | 25% |
| quantity | 50% |
| type | 75% |

For raw documents, the tag quality measurements will count the following:

  • exists - how many documents (images) have this tag, whether this is an empty string or not. Does not include nulls.
  • emptyStr - how many documents (images) have this tag, but its value is empty string

From these counts, tag completeness percentages can be calculated:

100 * ((exists - emptyStr) / totalNoImages)
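
The formula can be checked against the example records with a short in-memory sketch. It does not query a database and is not the code in tag_quality.py; the function name `completeness` is hypothetical, and a missing key is treated the same as null.

```python
records = [
    {"_id": 1, "item": "pizza", "description": "yummy", "quantity": "infinite", "type": "food"},
    {"_id": 2, "item": None, "description": None, "quantity": "1", "type": "food"},
    {"_id": 3, "item": "cricket", "quantity": "", "type": "food"},
    {"_id": 4, "item": ""},
]


def completeness(records, tag):
    """Apply 100 * ((exists - emptyStr) / totalNoImages) to one tag."""
    # exists: the tag is present and not null (an empty string still counts).
    exists = sum(1 for record in records if record.get(tag) is not None)
    # emptyStr: the tag is present but its value is the empty string.
    empty_str = sum(1 for record in records if record.get(tag) == "")
    return 100 * (exists - empty_str) / len(records)


result = {
    tag: completeness(records, tag)
    for tag in ["_id", "item", "description", "quantity", "type"]
}
# result == {"_id": 100.0, "item": 50.0, "description": 25.0,
#            "quantity": 50.0, "type": 75.0}
```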

You can also prioritise tags by whether they are available or public by specifying the -p flag:

$ python tag_quality.py -p public
$ python tag_quality.py -p available

By default, the priority will be set to all, including blocked tags.

This command will extend the modalities metadata with the following:

[
   {
        "modality": "<MODALITY_NAME>",
        "tags": [
            {
                "tag": "<TAG_NAME>",
                "completenessRaw": "<PERCENT>",                   # new
                "tagQualityDateRaw": "<DATE_OF_TAG_QUALITY_RUN>"  # new
            }
        ],
        "totalNoImagesRaw": "<NUMBER>",
        "totalNoSeriesRaw": "<NUMBER>",
        "totalNoStudiesRaw": "<NUMBER>",
        "avgNoImgPerSeriesRaw": "<NUMBER>",
        "minNoImgPerSeriesRaw": "<NUMBER>",
        "maxNoImgPerSeriesRaw": "<NUMBER>",
        "stdDevImgPerSeriesRaw": "<NUMBER>",
        "avgNoSeriesPerStudyRaw": "<NUMBER>",
        "minNoSeriesPerStudyRaw": "<NUMBER>",
        "maxNoSeriesPerStudyRaw": "<NUMBER>",
        "stdDevSeriesPerStudyRaw": "<NUMBER>",
        "countsPerMonthRaw": [
            {
                "date": "<YYYY/MM>",
                "imageCount": "<NUMBER>",
                "seriesCount": "<NUMBER>",
                "studyCount": "<NUMBER>"
            }
        ],
        "countsDateRaw": "<DATE_OF_COUNTS>",
        "totalNoImages<STATUS>": "<NUMBER>",
        "totalNoSeries<STATUS>": "<NUMBER>",
        "totalNoStudies<STATUS>": "<NUMBER>",
        "countsPerMonth<STATUS>": [
            {
                "date": "<YYYY/MM>",
                "imageCount": "<NUMBER>",
                "seriesCount": "<NUMBER>",
                "studyCount": "<NUMBER>"
            }
        ],
        "countsDate<STATUS>": "<DATE_OF_COUNTS>"
   }
]

Import DICOM standard metadata

The DICOM standard metadata is imported from the Innolitics website, via their repository, which provides JSON lists of the metadata extracted from the official DICOM standard website.

To download the required JSON files from the Innolitics GitHub repository, run the following script (in an online environment):

$ ./dicom_standard_download.sh <OUTPUT_PATH>

Note: Please be mindful of how often you download these files.

This will create a dicom_standard directory at the given location with the following structure:

dicom_standard/
  modalities.json
  modality_levels.json
  modality_tags.json
  tag_confidentiality.json
  tags.json

To import this metadata to the catalogue, run the dicom_standard_import.py script:

$ python dicom_standard_import.py -f <OUTPUT_PATH>/dicom_standard

This will pull the following metadata, matching records between the downloaded files and the catalogue as described for each file below:

modalities.json

  • pull list of modality from SMI metacat modalities collection
  • match name from file to modality from list and pull the following from file into the modalities collection
    • ciodID
    • description
    • linkToStandard

tags.json, tag_confidentiality.json, and modality_tags.json

  • match tag from tags.json to tag from tag_confidentiality.json and merge metadata
  • match tag from merged metadata to tag from modality_tags.json and pull the following from file to the merged metadata
    • type
    • description
    • linkToStandard
    • informationEntity
  • pull list of tag and promotionStatus from SMI metacat tags collection
  • if promotionStatus is not blocked, match keyword from merged metadata to tag from list and import the following from the merged metadata into the tags collection
    • tag as dicomID
    • type
    • description
    • linkToStandard
    • valueRepresentation
    • valueMultiplicity
    • retired
    • basicProfile
    • cleanDescOpt (opt)
    • cleanStructContOpt (opt)
    • cleanGraphOpt (opt)
    • rtnLongFullDatesOpt (opt)
    • rtnLongModifDatesOpt (opt)
    • rtnPatCharsOpt (opt)
    • rtnInstIdOpt (opt)
    • rtnUIDsOpt (opt)
    • rtnDevIdOpt (opt)
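
The tag matching steps above can be sketched with plain dictionaries. This is a simplified illustration, not the code in dicom_standard_import.py: the function names are hypothetical, the sample records are made up, the modality_tags.json step is omitted for brevity, and the field names are assumed to match those listed above.

```python
def merge_by_tag(tags, confidentiality):
    """Merge tags.json records with tag_confidentiality.json records on 'tag'."""
    conf_by_tag = {record["tag"]: record for record in confidentiality}
    # Fields from tags.json take precedence when both files define a key.
    return [{**conf_by_tag.get(record["tag"], {}), **record} for record in tags]


def import_into_catalogue(merged, catalogue_tags):
    """Copy standard metadata onto non-blocked catalogue tags, matched by keyword."""
    by_keyword = {record["keyword"]: record for record in merged}
    fields = ("valueRepresentation", "valueMultiplicity", "retired", "basicProfile")
    result = []
    for doc in catalogue_tags:
        standard = by_keyword.get(doc["tag"])
        if doc.get("promotionStatus") == "blocked" or standard is None:
            result.append(doc)  # left unchanged
            continue
        enriched = dict(doc, dicomID=standard["tag"])
        enriched.update({f: standard[f] for f in fields if f in standard})
        result.append(enriched)
    return result


# Made-up sample records for illustration.
std_tags = [
    {"tag": "(0010,0020)", "keyword": "PatientID", "valueRepresentation": "LO",
     "valueMultiplicity": "1", "retired": "N"},
    {"tag": "(0008,0020)", "keyword": "StudyDate", "valueRepresentation": "DA",
     "valueMultiplicity": "1", "retired": "N"},
]
std_conf = [{"tag": "(0010,0020)", "basicProfile": "Z"}]
catalogue = [
    {"tag": "PatientID", "promotionStatus": "available"},
    {"tag": "StudyDate", "promotionStatus": "blocked"},
]
imported = import_into_catalogue(merge_by_tag(std_tags, std_conf), catalogue)
```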

For modalities, this will create the following metadata:

[
   {
        "modality": "<MODALITY_NAME>",
        "ciodID": "<CIOD_ID>",                       # new
        "description": "<MODALITY_DESCR>",           # new
        "linkToStandard": "<LINK_TO_STANDARD>",      # new
        "standardDate": "<DATE_OF_STANDARD_IMPORT>", # new
        "tags": [
            {
                "tag": "<TAG_NAME>",
                "completenessRaw": "<PERCENT>",
                "tagQualityDateRaw": "<DATE_OF_TAG_QUALITY_RUN>"
            }
        ],
        "totalNoImagesRaw": "<NUMBER>",
        "totalNoSeriesRaw": "<NUMBER>",
        "totalNoStudiesRaw": "<NUMBER>",
        "avgNoImgPerSeriesRaw": "<NUMBER>",
        "minNoImgPerSeriesRaw": "<NUMBER>",
        "maxNoImgPerSeriesRaw": "<NUMBER>",
        "stdDevImgPerSeriesRaw": "<NUMBER>",
        "avgNoSeriesPerStudyRaw": "<NUMBER>",
        "minNoSeriesPerStudyRaw": "<NUMBER>",
        "maxNoSeriesPerStudyRaw": "<NUMBER>",
        "stdDevSeriesPerStudyRaw": "<NUMBER>",
        "countsPerMonthRaw": [
            {
                "date": "<YYYY/MM>",
                "imageCount": "<NUMBER>",
                "seriesCount": "<NUMBER>",
                "studyCount": "<NUMBER>"
            }
        ],
        "countsDateRaw": "<DATE_OF_COUNTS>",
        "totalNoImages<STATUS>": "<NUMBER>",
        "totalNoSeries<STATUS>": "<NUMBER>",
        "totalNoStudies<STATUS>": "<NUMBER>",
        "countsPerMonth<STATUS>": [
            {
                "date": "<YYYY/MM>",
                "imageCount": "<NUMBER>",
                "seriesCount": "<NUMBER>",
                "studyCount": "<NUMBER>"
            }
        ],
        "countsDate<STATUS>": "<DATE_OF_COUNTS>"
   }
]

For tags, this will create the following metadata:

[
   {
       "tag": "<TAG_NAME>",
       "dicomID": "<TAG_DICOM_ID>",                 # new
       "informationEntity": "<TAG_LEVEL>",          # new
       "description": "<TAG_DESCRIPTION>",          # new
       "linkToStandard": "<TAG_LINK_TO_STD>",       # new
       "retired": "<Y/N>",                          # new
       "valueRepresentation": "<VR>",               # new
       "valueMultiplicity": "<VM>",                 # new
       "basicProfile": "<DICOM_BASIC_PROFILE>",     # new
       "<OTHER_OPTIONAL_META>": "<OPT_META>",       # new
       "standardDate": "<DATE_OF_STANDARD_IMPORT>", # new
       "modalities": ["<MODALITY_NAME>"],
       "public": "<True/False>",
       "promotionStatus": "<blocked|unavailable|processing|available>"
   }
]