Improve speed of header deid with lookup tables and caching #289
Profiling showed that repeated regex lookups inside loops were taking a significant amount of time. To reduce this, regular expressions are now pre-compiled outside of loops, and plain string comparison is used wherever possible to avoid the overhead of regex matching for simple checks such as string equality.
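A minimal sketch of the two ideas in this commit, using made-up field names rather than deid's actual code: compile the pattern once outside the loop, and use plain string comparison when only equality is needed. The function names here are hypothetical.

```python
import re

# Hypothetical field names, for illustration only
names = ["PatientName", "PatientID", "StudyDate", "OtherPatientIDs"]


def find_matches_uncompiled(pattern, names):
    # Passes the pattern string on every iteration, so each call goes
    # through the re module's internal pattern cache lookup.
    return [n for n in names if re.search(pattern, n)]


def find_matches_compiled(pattern, names):
    # Compile once, outside the loop, and reuse the compiled object.
    compiled = re.compile(pattern)
    return [n for n in names if compiled.search(n)]


def find_exact(value, names):
    # Equality needs no regex at all.
    return [n for n in names if n == value]


uncompiled = find_matches_uncompiled("Patient", names)
compiled = find_matches_compiled("Patient", names)
exact = find_exact("PatientID", names)
```

Both regex variants return the same matches; only the per-iteration cost differs.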
The `get_fields_with_lookup` function was added to augment the `get_fields` function with a lookup table that allows quick identification of exact tag matches. This optimization significantly reduced the amount of time spent in the exact matching stage of `expand_field_expression`. For top-level DICOM tag searches, the search in "Case 2" would call `name_contains` on every single tag in the DICOM dataset. With lookup tables, we can look up a contender field directly from the value we are checking for an exact match, so the problem becomes a key lookup rather than a search over all fields for a matching identifier.
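To illustrate the difference between scanning every field and doing a key lookup, here is a rough sketch with plain dictionaries standing in for DICOM fields. The data layout and the helper names (`build_lookup`, `find_exact_scan`, `find_exact_lookup`) are assumptions for this example and are not the functions added in this PR.

```python
from collections import defaultdict

# Hypothetical stand-ins for DICOM fields, keyed by a unique id
fields = {
    "(0010,0010)": {"name": "Patient's Name", "keyword": "PatientName"},
    "(0010,0020)": {"name": "Patient ID", "keyword": "PatientID"},
    "(0008,0020)": {"name": "Study Date", "keyword": "StudyDate"},
}


def build_lookup(fields):
    # One pass over the fields builds an index per identifier type.
    lookup = {"name": defaultdict(list), "keyword": defaultdict(list)}
    for uid, field in fields.items():
        lookup["name"][field["name"]].append(uid)
        lookup["keyword"][field["keyword"]].append(uid)
    return lookup


def find_exact_scan(fields, keyword):
    # Without an index, every field is inspected on every query.
    return [uid for uid, f in fields.items() if f["keyword"] == keyword]


def find_exact_lookup(lookup, keyword):
    # With an index, an exact match is a single dictionary access.
    # .get() avoids inserting empty lists for unknown keys.
    return list(lookup["keyword"].get(keyword, []))


lookup = build_lookup(fields)
by_scan = find_exact_scan(fields, "PatientID")
by_lookup = find_exact_lookup(lookup, "PatientID")
missing = find_exact_lookup(lookup, "NotATag")
```

The index costs one extra pass to build, which pays off when many exact-match expressions are evaluated against the same dataset.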
This is excellent! See my questions and comments. Akin to the other, we will need to bump the version and changelog. If you like we can merge the other with a bump to the version, and release both changes under that version.
# only way to enable the use of caching without incurring significant
# performance overhead. Note that adding a proxy class around this
# decreases performance substantially (50% slowdown measured).
FileDataset.__hash__ = lambda self: id(self)
Is this something we should suggest for upstream (in that it might help other projects), or just appropriate to put here?
It's a good question... from my understanding, hash functions are typically supposed to be based on the value of an object (as opposed to its identity). From what I've read, objects by default use their `id()` as their hash method until you define an `__eq__` method on the class, at which point you then have to define your own hash.
I can't think of any reasonable drawback to just using `id()` as the hash in practice, except that using datasets as dictionary keys might be a little funny:
import pydicom
some_dict = {}
ds1 = pydicom.dcmread('./somefile.dcm')
some_dict[ds1] = ds1.filename
# Later, read the same file again
ds2 = pydicom.dcmread('./somefile.dcm')
ds2 in some_dict  # evaluates to False, because ds2 is a different object
It could be worth proposing upstream to pydicom and we could just see if the maintainer is open to the change. But maybe we could just have it here for now until we figure out the next steps on the pydicom side.
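The behavior described in this reply can be reproduced without pydicom. The sketch below uses a small stand-in class (`Record`, hypothetical) that defines `__eq__`, which makes instances unhashable until a `__hash__` is supplied; patching in an identity-based hash then gives the "equal but not found" dictionary behavior.

```python
class Record:
    """Stand-in for a class that defines __eq__ (not a pydicom class)."""

    def __init__(self, value):
        self.value = value

    def __eq__(self, other):
        return isinstance(other, Record) and self.value == other.value


# Defining __eq__ without __hash__ sets __hash__ to None,
# so instances cannot be hashed at this point.
try:
    hash(Record(1))
    hashable_before = True
except TypeError:
    hashable_before = False

# Same kind of patch as in the diff above: hash by identity.
Record.__hash__ = lambda self: id(self)

a = Record(1)
b = Record(1)
cache = {a: "first"}

equal = a == b          # value equality still holds
found_a = a in cache    # the same object is found
found_b = b in cache    # an equal but distinct object is not
```

This is why an identity-based hash works well for per-object caching but is surprising if datasets are used as value-like dictionary keys.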
# Contains
def name_contains(self, expression, whole_string=False):
def name_contains(self, expression):
I appreciate this refactor - the `whole_string` update was not to my liking.
deid/dicom/fields.py
Outdated
- ELEMENT_OFFSET: 2-digit hexadecimal element number (last 8 bits of full element)
"""
regexp_expression = f"^{expression}$" if whole_string else expression
if type(expression) is str:
Any reason to not use `isinstance(expression, str)` here?
This was just an oversight on my part. I'll switch to using `isinstance` here for the type comparisons! (I've been switching between TypeScript and Python a lot recently)
Addressed in dc6fc50
or re.search(regexp_expression, self.stripped_tag)
or re.search(regexp_expression, self.element.name)
or re.search(regexp_expression, self.element.keyword)
expression.search(self.name.lower())
This is much easier to read with the compiled regex.
deid/dicom/fields.py
Outdated
# if no contenders provided, use top level of dicom headers
if contenders is None:
contenders = get_fields(dicom)
contenders, contender_lookup_tables = get_fields_with_lookup(dicom)
Usually needing to return >1 related thing is a pattern or sign for a class. In the future we might consider a class here that has easy accessibility to the tables and then getting a particular item.
That's a good idea!
Added in 0843c36. It was nice to remove all these lookup table variables lying around
deid/dicom/fields.py
Outdated
if expander.lower() in ["endswith", "startswith", "contains"]:
if field.name_contains(expression):
fields[uid] = field
if type(field) is str and string_matches_expander(expander, expression, field):
`isinstance` here again?
return fields
def field_matches_expander(expander, expression_string, expression_re, field):
Could you please make sure all functions have docstrings (you can easily convert the comments I think).
Resolved in 37adf90
deid/dicom/fields.py
Outdated
"""
skip = skip or []
seen = seen or []
fields, new_seen, new_skip = get_fields_inner(
Possibly another opportunity for a class or Dataclass.
I changed this function to be a "private" function and kept the interface as-is for simplicity, since this is only used within the `get_fields_with_lookup` function.
deid/dicom/fields.py
Outdated
"element_keyword": defaultdict(list),
}
for uid, field in fields.items():
if type(field) is not DicomField:
`isinstance`? See https://switowski.com/blog/type-vs-isinstance/. It may only matter (for speed) for older versions of Python, which unfortunately are still present on many of our clusters... 🙃
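Beyond speed, the two checks also differ in behavior for subclasses. A small sketch, with `Tag` as a hypothetical `str` subclass invented for the example:

```python
class Tag(str):
    """Hypothetical str subclass, used only for this illustration."""


plain = "PatientID"
sub = Tag("PatientID")

# An exact type comparison accepts only str itself.
exact_plain = type(plain) is str
exact_sub = type(sub) is str

# isinstance also accepts subclasses of str.
inst_plain = isinstance(plain, str)
inst_sub = isinstance(sub, str)
```

If a field or expression ever arrives as a subclass of the expected type, the `type(...) is` form silently takes the other branch, which is one more reason to prefer `isinstance` in these checks.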
deid/dicom/parser.py
Outdated
if not self.fields:
self.fields = get_fields(
if not self.fields or not self.fields_by_name:
self.fields, self.lookup_tables = get_fields_with_lookup(
We might want to reset the lookup tables when the class is initialized (or on any update to parse a different file, for example). I'm trying to think of whether there is a case where we might generate the lookup table for one DICOM dataset and then load another (and have the tables mixed up, unintentionally combining patient data).
Added as a part of 0843c36!
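For readers following this thread, the concern can be sketched roughly as follows. The `Parser` class, its attribute names, and the dictionary stand-ins for datasets are all assumptions made for the example; the actual change is in 0843c36.

```python
class Parser:
    """Hypothetical parser that caches fields and a lookup table."""

    def __init__(self):
        self.reset()

    def reset(self):
        # Clear everything derived from a dataset.
        self.dicom = None
        self.fields = {}
        self.lookup_tables = {}

    def load(self, dicom):
        # Drop state from the previous dataset before loading a new one,
        # so cached tables can never mix data from two files.
        self.reset()
        self.dicom = dicom

    def get_fields(self):
        # Build fields and the lookup table lazily, once per dataset.
        if not self.fields:
            self.fields = dict(self.dicom)
            self.lookup_tables = {v: k for k, v in self.fields.items()}
        return self.fields


parser = Parser()
parser.load({"PatientName": "Alice"})
parser.get_fields()
first_tables = dict(parser.lookup_tables)

parser.load({"PatientName": "Bob"})
parser.get_fields()
second_tables = dict(parser.lookup_tables)
```

Without the `reset()` call in `load`, the second `get_fields()` would return the first dataset's cached fields, which is the mix-up described above.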
Thanks for the review and feedback @vsoch ! I believe I've addressed all outstanding comments, all tests are passing, and pre-commit hooks are all green.
That sounds like a plan to me. I'll add the changelog adjustments and version bump in the other PR.
Description
Related issues: None
I have been using `deid` in a WebAssembly execution context with Pyodide. In this configuration, performance issues are exacerbated, and I have been noticing prohibitively slow performance on DICOM header de-identification. I spent some time profiling the header deid functionality and identified three main areas where performance was getting limited:
- `field.name_contains` was taking up a very large portion of the overall runtime (the inner loop within case 2 of `expand_field_expression`)
- `get_fields` was being run over and over
This PR consists of three commits, each tailored to one of the above points, plus one extra commit to fix up remaining bugs and get all tests passing.
Performance gains here are, of course, dependent on the input DICOM files as well as the deid recipe that is used. In my test setup with 69 input DICOM files, I observed the following speed improvements:
- `origin/master`: 35.589 seconds
- `get_fields`: 7.344 seconds
- `get_fields`: 5.304 seconds
Checklist
Open questions
In order to get caching working, I had to add a `__hash__` property to `pydicom.FileDataset`. This is obviously not ideal, but there wasn't another way to get the caching performance boost. I had originally thought I could just wrap FileDataset in a proxy class, but the overhead of creating the proxy class seemed to be enough to slow things down as much as just getting a cache miss.
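To make the constraint concrete, here is a sketch that assumes memoization in the style of `functools.lru_cache`; the description above does not say which caching mechanism the PR uses. The `Dataset` class and this `get_fields` are stand-ins written for the example, not pydicom or deid code.

```python
from functools import lru_cache


class Dataset:
    """Stand-in that defines __eq__, so instances start out unhashable."""

    def __init__(self, tags):
        self.tags = tags

    def __eq__(self, other):
        return isinstance(other, Dataset) and self.tags == other.tags


calls = []


@lru_cache(maxsize=None)
def get_fields(ds):
    # Record each real computation so cache hits are observable.
    calls.append(id(ds))
    return tuple(sorted(ds.tags))


ds = Dataset({"PatientID": "123", "StudyDate": "20200101"})

# lru_cache must hash its arguments, so an unhashable dataset fails.
try:
    get_fields(ds)
    cacheable_before = True
except TypeError:
    cacheable_before = False

# Identity-based hash, as in the patch discussed in this PR.
Dataset.__hash__ = lambda self: id(self)

first = get_fields(ds)
second = get_fields(ds)  # served from the cache
hits = get_fields.cache_info().hits
```

One design consequence worth keeping in mind: a cache keyed this way holds a reference to each dataset, so it should be cleared or bounded when many files are processed.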