Some of the scripts require metadata generated by other scripts. For example, all scripts have a dependency on populate_catalogue.py
because they require the metadata catalogue to be initialised. Other dependencies are:

Script | Dependency | Reason |
---|---|---|
`promotion_status.py` | `create_blocklists.py` | To set the promotion status to `blocked`, the script needs to know which tags and modalities are considered blocked. |
`tag_quality.py` | `public_status.py` | Allows prioritisation of tags by public status. |
 | `promotion_status.py` | Allows prioritisation of tags by promotion status. |
 | counts | To calculate completeness percentages. |

This script will initialise the metadata catalogue with two collections, `modalities` and `tags`.
$ python populate_catalogue.py -i
Note: This script uses a default `analytics` database as both the source of metadata and the destination for the catalogue. It assumes the metadata is in a `series` collection under the `analytics` database.
This will generate a `modalities` collection containing all the modalities found in the raw DICOM database and their corresponding tags:
[
{
"modality": "<MODALITY_NAME>",
"tags": [
{
"tag": "<TAG_NAME>"
}
]
}
]
And a `tags` collection containing all the tags found in the raw DICOM database and the modalities they can be found in:
[
{
"tag": "<TAG_NAME>",
"modalities": ["<MODALITY_NAME>"]
}
]
Note: The tag extraction only covers top-level tags, because nested objects with unknown keys are difficult to query.
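
The relationship between the two collections can be illustrated with a minimal sketch. It is not the code of `populate_catalogue.py`: the function name `build_catalogue`, the assumption that each series document carries a top-level `Modality` field, and the choice to skip only `_id` are all illustrative.

```python
from collections import defaultdict


def build_catalogue(series_docs):
    """Build modality -> tags and tag -> modalities mappings.

    Hypothetical helper: assumes every document has a top-level
    "Modality" field and treats every other top-level key except
    "_id" as a tag. Nested keys are ignored, as in the note above.
    """
    tags_by_modality = defaultdict(set)
    modalities_by_tag = defaultdict(set)

    for doc in series_docs:
        modality = doc.get("Modality")
        if modality is None:
            continue  # cannot attribute tags without a modality
        for key in doc:
            if key == "_id":
                continue
            tags_by_modality[modality].add(key)
            modalities_by_tag[key].add(modality)

    modalities = [
        {"modality": m, "tags": [{"tag": t} for t in sorted(tags)]}
        for m, tags in sorted(tags_by_modality.items())
    ]
    tags = [
        {"tag": t, "modalities": sorted(mods)}
        for t, mods in sorted(modalities_by_tag.items())
    ]
    return modalities, tags
```

Sorting the output is only a convenience for reproducible results; the real collections need not be ordered.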
Blocklists are used to indicate modalities and tags that have been blocked from being processed and that require no further analysis or additional metadata.
If you have specific tags and modalities to block, you can create JSON blocklists to indicate what these are. For modalities, this would need to look something like this:
modality_blocklist.json
[
{
"modality": "<MODALITY_NAME>",
"blockReason": "<DESCRIPTION>"
}
]
And for tags, something like this:
tag_blocklist.json
[
{
"tag": "<TAG_NAME>",
"modality": "<MODALITY|all>",
"blockReason": "<DESCRIPTION>"
}
]
Note: Blocking a modality will also block any modality-specific tags.
You can then pass them to the `create_blocklists.py` script to load them into the metadata catalogue:
$ python create_blocklists.py -m modality_blocklist.json -t tag_blocklist.json
If you do not have a list of specific modalities and tags, but want to block any that match a criterion (e.g., containing `Unknown` in the name), you can pass this criterion to the `create_blocklists.py` script instead:
$ python create_blocklists.py -b "Unknown"
You can also specify both:
$ python create_blocklists.py -m modality_blocklist.json -t tag_blocklist.json -b "Unknown"
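
As an illustration of criterion-based blocking, the sketch below builds tag blocklist entries in the `tag_blocklist.json` shape. It assumes a case-sensitive substring match and a `modality` value of `all`; how `create_blocklists.py` actually interprets the `-b` value is not stated here, and the function names are hypothetical.

```python
def match_criterion(names, criterion):
    """Return the names that contain `criterion` as a substring.

    Assumption: the match is a plain, case-sensitive substring test.
    """
    return [name for name in names if criterion in name]


def tag_blocklist_entries(tag_names, criterion):
    """Build entries shaped like tag_blocklist.json for matching tags."""
    reason = f"Name contains '{criterion}'"
    return [
        {"tag": tag, "modality": "all", "blockReason": reason}
        for tag in match_criterion(tag_names, criterion)
    ]
```

For example, `tag_blocklist_entries(["PatientID", "Unknown Tag"], "Unknown")` would produce a single entry for `Unknown Tag`.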
To perform study, series and image level counts and statistics on a non-relational database, run the following command:
$ python mongo_counts.py
This will add the MongoDB modality counts to the previously initialised `modalities` collection:
[
{
"modality": "<MODALITY_NAME>",
"tags": [
{
"tag": "<TAG_NAME>"
}
],
"totalNoImagesRaw": "<NUMBER>", #new
"totalNoSeriesRaw": "<NUMBER>", #new
"totalNoStudiesRaw": "<NUMBER>", #new
"avgNoImgPerSeriesRaw": "<NUMBER>", #new
"minNoImgPerSeriesRaw": "<NUMBER>", #new
"maxNoImgPerSeriesRaw": "<NUMBER>", #new
"stdDevImgPerSeriesRaw": "<NUMBER>", #new
"avgNoSeriesPerStudyRaw": "<NUMBER>", #new
"minNoSeriesPerStudyRaw": "<NUMBER>", #new
"maxNoSeriesPerStudyRaw": "<NUMBER>", #new
"stdDevSeriesPerStudyRaw": "<NUMBER>", #new
"countsPerMonthRaw": [ #new
{
"date": "<YYYY/MM>",
"imageCount": "<NUMBER>",
"seriesCount": "<NUMBER>",
"studyCount": "<NUMBER>"
}
],
"countsDateRaw": "<DATE_OF_COUNTS>" #new
}
]
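
To make the meaning of these fields concrete, here is a small sketch that derives the totals and the per-series and per-study statistics for one modality from a list of image records. It is not the aggregation used by `mongo_counts.py`: the input shape (one `(study_id, series_id)` pair per image) and the use of the population standard deviation are assumptions.

```python
from collections import Counter
from statistics import mean, pstdev


def modality_counts(images):
    """Compute counts for one modality.

    `images` is a non-empty iterable of (study_id, series_id) pairs,
    one pair per image (hypothetical input shape).
    """
    images = list(images)
    # Number of images in each series.
    imgs_per_series = Counter(series for _, series in images)
    # Number of distinct series in each study.
    series_per_study = Counter(study for study, _ in set(images))

    img_counts = list(imgs_per_series.values())
    series_counts = list(series_per_study.values())
    return {
        "totalNoImagesRaw": len(images),
        "totalNoSeriesRaw": len(imgs_per_series),
        "totalNoStudiesRaw": len(series_per_study),
        "avgNoImgPerSeriesRaw": mean(img_counts),
        "minNoImgPerSeriesRaw": min(img_counts),
        "maxNoImgPerSeriesRaw": max(img_counts),
        "stdDevImgPerSeriesRaw": pstdev(img_counts),  # population std dev (assumed)
        "avgNoSeriesPerStudyRaw": mean(series_counts),
        "minNoSeriesPerStudyRaw": min(series_counts),
        "maxNoSeriesPerStudyRaw": max(series_counts),
        "stdDevSeriesPerStudyRaw": pstdev(series_counts),
    }
```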
To perform study, series and image level counts and statistics on a relational database, run the following command:
$ python mysql_counts.py -s <STATUS>
Note: The `<STATUS>` can be either `Staging` or `Processing`, and it will be attached to the generated metadata as an indicator of the database it was extracted from.
This will add the MySQL modality counts to the `modalities` collection:
[
{
"modality": "<MODALITY_NAME>",
"tags": [
{
"tag": "<TAG_NAME>"
}
],
"totalNoImagesRaw": "<NUMBER>",
"totalNoSeriesRaw": "<NUMBER>",
"totalNoStudiesRaw": "<NUMBER>",
"avgNoImgPerSeriesRaw": "<NUMBER>",
"minNoImgPerSeriesRaw": "<NUMBER>",
"maxNoImgPerSeriesRaw": "<NUMBER>",
"stdDevImgPerSeriesRaw": "<NUMBER>",
"avgNoSeriesPerStudyRaw": "<NUMBER>",
"minNoSeriesPerStudyRaw": "<NUMBER>",
"maxNoSeriesPerStudyRaw": "<NUMBER>",
"stdDevSeriesPerStudyRaw": "<NUMBER>",
"countsPerMonthRaw": [
{
"date": "<YYYY/MM>",
"imageCount": "<NUMBER>",
"seriesCount": "<NUMBER>",
"studyCount": "<NUMBER>"
}
],
"countsDateRaw": "<DATE_OF_COUNTS>",
"totalNoImages<STATUS>": "<NUMBER>", #new
"totalNoSeries<STATUS>": "<NUMBER>", #new
"totalNoStudies<STATUS>": "<NUMBER>", #new
"countsPerMonth<STATUS>": [ #new
{
"date": "<YYYY/MM>",
"imageCount": "<NUMBER>",
"seriesCount": "<NUMBER>",
"studyCount": "<NUMBER>"
}
],
"countsDate<STATUS>": "<DATE_OF_COUNTS>" #new
}
]
DICOM tags can be `public`, meaning they are recognised by the DICOM standard, or `private`, meaning they are not part of the standard and are specific to the machine generating the information.
$ python public_status.py
This will go through the list of tags in the `tags` collection and label those without DICOM codes (e.g., `Tag Name`) as public (`True`) and tags with DICOM codes (e.g., `(0000,0000) Tag Name`) as not public (`False`):
[
{
"tag": "<TAG_NAME>",
"modalities": ["<MODALITY_NAME>"],
"public": "<True/False>" # new
}
]
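
The labelling rule can be sketched as a check on the tag name. The regular expression below is an assumption about what a DICOM code prefix looks like (`(gggg,eeee)` with four hexadecimal digits each); `public_status.py` may detect it differently.

```python
import re

# Assumed form of a DICOM code prefix: "(gggg,eeee)" in hexadecimal.
DICOM_CODE_PREFIX = re.compile(r"^\([0-9A-Fa-f]{4},[0-9A-Fa-f]{4}\)")


def is_public(tag_name):
    """Return True when the tag name has no DICOM code prefix."""
    return DICOM_CODE_PREFIX.match(tag_name) is None
```

With this rule, `is_public("Tag Name")` is `True` and `is_public("(0000,0000) Tag Name")` is `False`.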
The promotion status indicates the stage a tag has reached in the SMI processing pipeline. Each tag has one of the following statuses at any one time:

Status | Promotion stage |
---|---|
blocked | On a blocklist. |
unavailable | Raw, non-relational format. |
processing | Raw, relational format. |
available | Anonymised, relational format. |
$ python promotion_status.py
This will add the modality promotion status to the `modalities` metadata:
[
{
"modality": "<MODALITY_NAME>",
"tags": [
{
"tag": "<TAG_NAME>"
}
],
"totalNoImagesRaw": "<NUMBER>",
"totalNoSeriesRaw": "<NUMBER>",
"totalNoStudiesRaw": "<NUMBER>",
"avgNoImgPerSeriesRaw": "<NUMBER>",
"minNoImgPerSeriesRaw": "<NUMBER>",
"maxNoImgPerSeriesRaw": "<NUMBER>",
"stdDevImgPerSeriesRaw": "<NUMBER>",
"avgNoSeriesPerStudyRaw": "<NUMBER>",
"minNoSeriesPerStudyRaw": "<NUMBER>",
"maxNoSeriesPerStudyRaw": "<NUMBER>",
"stdDevSeriesPerStudyRaw": "<NUMBER>",
"countsPerMonthRaw": [
{
"date": "<YYYY/MM>",
"imageCount": "<NUMBER>",
"seriesCount": "<NUMBER>",
"studyCount": "<NUMBER>"
}
],
"countsDateRaw": "<DATE_OF_COUNTS>",
"totalNoImages<STATUS>": "<NUMBER>",
"totalNoSeries<STATUS>": "<NUMBER>",
"totalNoStudies<STATUS>": "<NUMBER>",
"countsPerMonth<STATUS>": [
{
"date": "<YYYY/MM>",
"imageCount": "<NUMBER>",
"seriesCount": "<NUMBER>",
"studyCount": "<NUMBER>"
}
],
"countsDate<STATUS>": "<DATE_OF_COUNTS>",
"promotionStatus": "<blocked|unavailable|processing|available>" # new
}
]
And the following attributes to the `tags` metadata:
[
{
"tag": "<TAG_NAME>",
"modalities": ["<MODALITY_NAME>"],
"public": "<True/False>",
"promotionStatus": "<blocked|unavailable|processing|available>" # new
}
]
This calculates completeness for each tag. Completeness is the percentage of images that hold a usable value for the tag, i.e., a value that is not missing, null, or an empty string.
For an example dataset containing the following test records:
[
{ "_id" : 1, "item" : "pizza", "description" : "yummy", "quantity" : "infinite", "type" : "food" },
{ "_id" : 2, "item" : null, "description" : null, "quantity" : "1", "type" : "food" },
{ "_id" : 3, "item" : "cricket", "quantity" : "", "type" : "food" },
{ "_id" : 4, "item" : "" }
]
$ python tag_quality.py
The quality check would result in the following metadata:
Tag | Completeness |
---|---|
_id | 100% |
item | 50% |
description | 25% |
quantity | 50% |
type | 75% |
For raw documents, the tag quality measurements will count the following:
- `exists`: how many documents (images) have this tag, whether its value is an empty string or not. Does not include nulls.
- `emptyStr`: how many documents (images) have this tag, but its value is an empty string.
From these counts, tag completeness percentages can be calculated:
`100 * ((exists - emptyStr) / totalNoImages)`
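
As a worked check, the sketch below applies this formula to the four example records and reproduces the completeness table. The function is illustrative only and treats a missing tag the same as a null value; it is not the query that `tag_quality.py` runs against the database.

```python
records = [
    {"_id": 1, "item": "pizza", "description": "yummy", "quantity": "infinite", "type": "food"},
    {"_id": 2, "item": None, "description": None, "quantity": "1", "type": "food"},
    {"_id": 3, "item": "cricket", "quantity": "", "type": "food"},
    {"_id": 4, "item": ""},
]


def completeness(records, tag):
    """Return 100 * ((exists - emptyStr) / totalNoImages) for one tag."""
    # exists: tag is present and not null (empty strings still count).
    exists = sum(1 for r in records if r.get(tag) is not None)
    # emptyStr: tag is present but its value is an empty string.
    empty_str = sum(1 for r in records if r.get(tag) == "")
    return 100 * (exists - empty_str) / len(records)


for tag in ("_id", "item", "description", "quantity", "type"):
    print(tag, f"{completeness(records, tag):.0f}%")
```

For `item`, for instance, `exists` is 3 (records 1, 3, and 4), `emptyStr` is 1 (record 4), and there are 4 images in total, giving 50%.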
You can also prioritise tags by whether they are `available` or `public` by specifying the `-p` flag:
$ python tag_quality.py -p public
$ python tag_quality.py -p available
By default, the priority will be set to `all`, including blocked tags.
This command will extend the `modalities` metadata with the following:
[
{
"modality": "<MODALITY_NAME>",
"tags": [
{
"tag": "<TAG_NAME>",
"completenessRaw": "<PERCENT>", # new
"tagQualityDateRaw": "<DATE_OF_TAG_QUALITY_RUN>" # new
}
],
"totalNoImagesRaw": "<NUMBER>",
"totalNoSeriesRaw": "<NUMBER>",
"totalNoStudiesRaw": "<NUMBER>",
"avgNoImgPerSeriesRaw": "<NUMBER>",
"minNoImgPerSeriesRaw": "<NUMBER>",
"maxNoImgPerSeriesRaw": "<NUMBER>",
"stdDevImgPerSeriesRaw": "<NUMBER>",
"avgNoSeriesPerStudyRaw": "<NUMBER>",
"minNoSeriesPerStudyRaw": "<NUMBER>",
"maxNoSeriesPerStudyRaw": "<NUMBER>",
"stdDevSeriesPerStudyRaw": "<NUMBER>",
"countsPerMonthRaw": [
{
"date": "<YYYY/MM>",
"imageCount": "<NUMBER>",
"seriesCount": "<NUMBER>",
"studyCount": "<NUMBER>"
}
],
"countsDateRaw": "<DATE_OF_COUNTS>",
"totalNoImages<STATUS>": "<NUMBER>",
"totalNoSeries<STATUS>": "<NUMBER>",
"totalNoStudies<STATUS>": "<NUMBER>",
"countsPerMonth<STATUS>": [
{
"date": "<YYYY/MM>",
"imageCount": "<NUMBER>",
"seriesCount": "<NUMBER>",
"studyCount": "<NUMBER>"
}
],
"countsDate<STATUS>": "<DATE_OF_COUNTS>"
}
]
The DICOM standard metadata is imported from the Innolitics GitHub repository, which provides JSON lists of the metadata extracted from the official DICOM standard website.
To download the required JSON files from the Innolitics GitHub repository, run the following script (in an online environment):
$ ./dicom_standard_download.sh <OUTPUT_PATH>
Note: Please be mindful of how often you download these files.
This will create a `dicom_standard` directory at the given location with the following structure:
dicom_standard/
modalities.json
modality_levels.json
modality_tags.json
tag_confidentiality.json
tags.json
To import this metadata to the catalogue, run the dicom_standard_import.py script:
$ python dicom_standard_import.py -f <OUTPUT_PATH>/dicom_standard
This will pull the following metadata, based on the following links:

`modalities.json`
- pull list of `modality` from SMI metacat `modalities` collection
- match `name` from file to `modality` from list and pull the following from file into the `modalities` collection
  - ciodID
  - description
  - linkToStandard

`tags.json`, `tag_confidentiality.json`, and `modality_tags.json`
- match `tag` from `tags.json` to `tag` from `tag_confidentiality.json` and merge metadata
- match `tag` from merged metadata to `tag` from `modality_tags.json` and pull the following from file to the merged metadata
  - type
  - description
  - linkToStandard
  - informationEntity
- pull list of `tag` and `promotionStatus` from SMI metacat `tags` collection
- if `promotionStatus` is not `blocked`, match `keyword` from merged metadata to `tag` from list and import the following from the merged metadata into the `tags` collection
  - tag as dicomID
  - type
  - description
  - linkToStandard
  - valueRepresentation
  - valueMultiplicity
  - retired
  - basicProfile
  - cleanDescOpt (opt)
  - cleanStructContOpt (opt)
  - cleanGraphOpt (opt)
  - rtnLongFullDatesOpt (opt)
  - rtnLongModifDatesOpt (opt)
  - rtnPatCharsOpt (opt)
  - rtnInstIdOpt (opt)
  - rtnUIDsOpt (opt)
  - rtnDevIdOpt (opt)
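
The matching steps for tags can be sketched with plain dictionaries. This is a simplified illustration, not the code of `dicom_standard_import.py`: the helper names are hypothetical, the records are assumed to be lists of flat dictionaries, and only the `tag`, `keyword`, and `promotionStatus` fields named above are relied on.

```python
def merge_by_key(left, right, key):
    """Left-join two lists of dicts on `key`.

    Fields from the matching record in `right` are added to each
    record in `left`; records without a match are kept unchanged.
    """
    right_index = {rec[key]: rec for rec in right}
    return [{**rec, **right_index.get(rec[key], {})} for rec in left]


def select_for_import(merged, catalogue_tags):
    """Keep merged records whose keyword matches a non-blocked catalogue tag.

    The standard's `tag` value is stored as `dicomID`, and the
    catalogue's own tag name is kept under `tag`.
    """
    wanted = {
        t["tag"] for t in catalogue_tags
        if t.get("promotionStatus") != "blocked"
    }
    selected = []
    for rec in merged:
        if rec.get("keyword") in wanted:
            entry = {k: v for k, v in rec.items() if k not in ("tag", "keyword")}
            entry["dicomID"] = rec["tag"]
            entry["tag"] = rec["keyword"]
            selected.append(entry)
    return selected
```

A left join is used so that tags missing from `tag_confidentiality.json` are still carried forward without the optional confidentiality fields, which is one plausible reading of the "(opt)" markers above.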
For `modalities`, this will create the following metadata:
[
{
"modality": "<MODALITY_NAME>",
"ciodID": "<CIOD_ID>", # new
"description": "<MODALITY_DESCR>", # new
"linkToStandard": "<LINK_TO_STANDARD>", # new
"standardDate": "<DATE_OF_STANDARD_IMPORT>", # new
"tags": [
{
"tag": "<TAG_NAME>",
"completenessRaw": "<PERCENT>",
"tagQualityDateRaw": "<DATE_OF_TAG_QUALITY_RUN>"
}
],
"totalNoImagesRaw": "<NUMBER>",
"totalNoSeriesRaw": "<NUMBER>",
"totalNoStudiesRaw": "<NUMBER>",
"avgNoImgPerSeriesRaw": "<NUMBER>",
"minNoImgPerSeriesRaw": "<NUMBER>",
"maxNoImgPerSeriesRaw": "<NUMBER>",
"stdDevImgPerSeriesRaw": "<NUMBER>",
"avgNoSeriesPerStudyRaw": "<NUMBER>",
"minNoSeriesPerStudyRaw": "<NUMBER>",
"maxNoSeriesPerStudyRaw": "<NUMBER>",
"stdDevSeriesPerStudyRaw": "<NUMBER>",
"countsPerMonthRaw": [
{
"date": "<YYYY/MM>",
"imageCount": "<NUMBER>",
"seriesCount": "<NUMBER>",
"studyCount": "<NUMBER>"
}
],
"countsDateRaw": "<DATE_OF_COUNTS>",
"totalNoImages<STATUS>": "<NUMBER>",
"totalNoSeries<STATUS>": "<NUMBER>",
"totalNoStudies<STATUS>": "<NUMBER>",
"countsPerMonth<STATUS>": [
{
"date": "<YYYY/MM>",
"imageCount": "<NUMBER>",
"seriesCount": "<NUMBER>",
"studyCount": "<NUMBER>"
}
],
"countsDate<STATUS>": "<DATE_OF_COUNTS>"
}
]
For `tags`, this will create the following metadata:
[
{
"tag": "<TAG_NAME>",
"dicomID": "<TAG_DICOM_ID>", # new
"informationEntity": "<TAG_LEVEL>", # new
"description": "<TAG_DESCRIPTION>", # new
"linkToStandard": "<TAG_LINK_TO_STD>", # new
"retired": "<Y/N>", # new
"valueRepresentation": "<VR>", # new
"valueMultiplicity": "<VM>", # new
"basicProfile": "<DICOM_BASIC_PROFILE>", # new
"<OTHER_OPTIONAL_META>": "<OPT_META>", # new
"standardDate": "<DATE_OF_STANDARD_IMPORT>", # new
"modalities": ["<MODALITY_NAME>"],
"public": "<True/False>",
"promotionStatus": "<blocked|unavailable|processing|available>"
}
]