
Conversation

ReeceStevens (Contributor)

Description

Related issues: None

I have been using deid in a WebAssembly execution context with Pyodide. In this configuration, performance issues are exacerbated, and I have been noticing prohibitively slow performance on DICOM header de-identification. I spent some time profiling the header deid functionality and identified three main areas where performance was being limited:

  1. Regex generation was taking up a significant amount of time
  2. Running field.name_contains was taking up a very large portion of the overall runtime (the inner loop within case 2 of expand_field_expression)
  3. get_fields was being run over and over

This PR consists of three commits, each tailored to one of the above points, plus one extra commit to fix the remaining bugs and get all tests passing.

Performance gains here are, of course, dependent on the input DICOM files as well as the deid recipe that is used. In my test setup with 69 input DICOM files, I observed the following speed improvements:

  • origin/master: 35.589 seconds
  • After pre-compiling regexes: 26.727 seconds
  • After using lookup tables for get_fields: 7.344 seconds
  • After adding caching to get_fields: 5.304 seconds
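
The first of these optimizations can be illustrated with a minimal, self-contained sketch (hypothetical names, not the actual deid code): hoist `re.compile` out of the loop, and fall back to plain string comparison when only equality is needed.

```python
import re

# Hypothetical candidate names standing in for DICOM field keywords.
names = ["PatientName", "PatientID", "StudyDate", "SeriesNumber"] * 250

def search_uncompiled(expression, candidates):
    # Baseline: re.search re-resolves the pattern on every iteration.
    return [n for n in candidates if re.search(expression, n)]

def search_precompiled(expression, candidates):
    # Optimized: compile once outside the loop and reuse the pattern object.
    pattern = re.compile(expression)
    return [n for n in candidates if pattern.search(n)]

def exact_match(expression, candidates):
    # For plain equality checks there is no need for a regex at all.
    return [n for n in candidates if n == expression]

assert search_uncompiled("Patient", names) == search_precompiled("Patient", names)
```

Python's `re` module does cache compiled patterns internally, but the per-call cache lookup still shows up in hot loops, which is why explicit pre-compilation helps here.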

Checklist

  • I have commented my code, particularly in hard-to-understand areas
  • My changes generate no new warnings
  • My code follows the style guidelines of this project

Open questions

In order to get caching working, I had to add a __hash__ method to pydicom.FileDataset. This is obviously not ideal, but I couldn't find another way to get the caching performance boost. I had originally thought I could just wrap FileDataset in a proxy class, but the overhead of creating the proxy class seemed to be enough to slow things down as much as a cache miss would.
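
As a sketch of the pattern in question (using a hypothetical stand-in class rather than the real pydicom.FileDataset, and assuming functools.lru_cache-style memoization), the identity-based __hash__ is what makes the dataset usable as a cache key:

```python
from functools import lru_cache

class FileDataset:
    """Hypothetical stand-in for pydicom.FileDataset; defining __eq__
    removes the default identity-based __hash__."""
    def __init__(self, fields):
        self.fields = fields
    def __eq__(self, other):
        return self.fields == other.fields

# Identity-based hash, as in this PR: restores hashability so the
# dataset object can serve as a cache key.
FileDataset.__hash__ = lambda self: id(self)

call_count = 0

@lru_cache(maxsize=None)
def get_fields(dataset):
    # Stand-in for the expensive field extraction.
    global call_count
    call_count += 1
    return tuple(dataset.fields)

ds = FileDataset(["PatientName", "PatientID"])
get_fields(ds)
get_fields(ds)  # cache hit: the function body does not run again
assert call_count == 1
```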

During profiling, it was identified that repeated regex lookups inside
loops were taking a significant amount of time. To reduce this, regex
expressions were pre-compiled outside of loops, and plain string
comparison was used when possible to sidestep the performance overhead
of regex matching for simple things like string equality comparison.
The `get_fields_with_lookup` function was added to augment the
`get_fields` function with a lookup table that allows quick
identification of exact tag matches.

This optimization significantly reduced the amount of time spent in the
exact matching stage of `expand_field_expression`. For top-level DICOM
tag searches, the search in "Case 2" would call `name_contains` on every
single tag in the DICOM dataset. With lookup tables, we can look up a
contender field based on the values for which we're checking exact
matches; this becomes a key lookup rather than a scan of all fields
for a matching identifier.
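
A simplified sketch of the lookup-table idea (hypothetical field records rather than real DicomField objects):

```python
from collections import defaultdict

# Hypothetical flattened field records keyed by tag string.
fields = {
    "0010,0010": {"name": "PatientName", "keyword": "PatientName"},
    "0010,0020": {"name": "PatientID", "keyword": "PatientID"},
    "0008,0020": {"name": "StudyDate", "keyword": "StudyDate"},
}

def build_lookup(fields):
    # One pass over the dataset builds tables keyed on each identifier kind.
    lookup = {
        "tag": defaultdict(list),
        "name": defaultdict(list),
        "keyword": defaultdict(list),
    }
    for uid, field in fields.items():
        lookup["tag"][uid].append(uid)
        lookup["name"][field["name"].lower()].append(uid)
        lookup["keyword"][field["keyword"].lower()].append(uid)
    return lookup

def find_exact(lookup, expression):
    # Each exact-match query is now a dict key lookup instead of a scan
    # over every field in the dataset.
    key = expression.lower()
    for table in lookup.values():
        if key in table:
            return table[key]
    return []

lookup = build_lookup(fields)
assert find_exact(lookup, "PatientID") == ["0010,0020"]
```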
@vsoch (Member) left a comment:
This is excellent! See my questions and comments. Akin to the other, we will need to bump the version and changelog. If you like we can merge the other with a bump to the version, and release both changes under that version.

# only way to enable the use of caching without incurring significant
# performance overhead. Note that adding a proxy class around this
# decreases performance substantially (50% slowdown measured).
FileDataset.__hash__ = lambda self: id(self)
vsoch (Member):
Is this something we should suggest for upstream (in that it might help other projects), or just appropriate to put here?

ReeceStevens (Contributor Author):
It's a good question... from my understanding, hash functions are typically supposed to be based on the value of an object (as opposed to its identity). From what I've read, objects by default use their id() as their hash until you define an __eq__ method on the class, at which point you have to define your own __hash__.

I can't think of any reasonable drawback to just using id() as the hash in practice, except that using datasets as dictionary keys might be a little funny:

some_dict = {}
ds1 = pydicom.dcmread('./somefile.dcm')
some_dict[ds1] = ds1.filename

# Later, read the same file again
ds2 = pydicom.dcmread('./somefile.dcm')
ds2 in some_dict # evaluates to False

It could be worth proposing upstream to pydicom and we could just see if the maintainer is open to the change. But maybe we could just have it here for now until we figure out the next steps on the pydicom side.

# Contains

def name_contains(self, expression, whole_string=False):
def name_contains(self, expression):
vsoch (Member):

I appreciate this refactor - the whole_string update was not to my liking.

- ELEMENT_OFFSET: 2-digit hexadecimal element number (last 8 bits of full element)
"""
regexp_expression = f"^{expression}$" if whole_string else expression
if type(expression) is str:
vsoch (Member):

Any reason to not use isinstance(expression, str) here?

ReeceStevens (Contributor Author):

This was just an oversight on my part; I'll switch to using isinstance here for the type comparisons! (I've been switching between TypeScript and Python a lot recently.)

ReeceStevens (Contributor Author):

Addressed in dc6fc50

or re.search(regexp_expression, self.stripped_tag)
or re.search(regexp_expression, self.element.name)
or re.search(regexp_expression, self.element.keyword)
expression.search(self.name.lower())
vsoch (Member):

This is much easier to read with the compiled regex.

# if no contenders provided, use top level of dicom headers
if contenders is None:
contenders = get_fields(dicom)
contenders, contender_lookup_tables = get_fields_with_lookup(dicom)
vsoch (Member):

Usually needing to return more than one related thing is a sign that a class may be warranted. In the future we might consider a class here that provides easy access to the tables and to getting a particular item.

ReeceStevens (Contributor Author):

That's a good idea!

ReeceStevens (Contributor Author):

Added in 0843c36. It was nice to remove all these lookup table variables lying around.

if expander.lower() in ["endswith", "startswith", "contains"]:
if field.name_contains(expression):
fields[uid] = field
if type(field) is str and string_matches_expander(expander, expression, field):
vsoch (Member):

isinstance here again?

ReeceStevens (Contributor Author):


return fields

def field_matches_expander(expander, expression_string, expression_re, field):
vsoch (Member):

Could you please make sure all functions have docstrings? (You can easily convert the existing comments, I think.)

ReeceStevens (Contributor Author):

Resolved in 37adf90

"""
skip = skip or []
seen = seen or []
fields, new_seen, new_skip = get_fields_inner(
vsoch (Member):

Possibly another opportunity for a class or dataclass.

ReeceStevens (Contributor Author):

I changed this function to be a "private" function and kept the interface as-is for simplicity, since this is only used within the get_fields_with_lookup function.

"element_keyword": defaultdict(list),
}
for uid, field in fields.items():
if type(field) is not DicomField:
vsoch (Member):

isinstance? See https://switowski.com/blog/type-vs-isinstance/. It may only matter (for speed) on older versions of Python, which unfortunately are still present on many of our clusters... 🙃
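
For reference, a minimal illustration of the behavioral difference: type() compares the exact class, while isinstance() also accepts subclasses.

```python
class TaggedExpression(str):
    # Hypothetical str subclass, e.g. a specialized expression type.
    pass

expr = TaggedExpression("PatientName")

# type() rejects subclasses; isinstance() follows the inheritance chain.
assert type(expr) is not str
assert isinstance(expr, str)
```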

ReeceStevens (Contributor Author):

if not self.fields:
self.fields = get_fields(
if not self.fields or not self.fields_by_name:
self.fields, self.lookup_tables = get_fields_with_lookup(
vsoch (Member):

We might want to reset the lookup tables when the class is initialized (or on any update to parse a different file, for example). I'm trying to think of whether there is a case where we might generate the lookup table for one dicom dataset and then load another (and have the tables mixed up and unintentionally combine patient data).

ReeceStevens (Contributor Author):

Added as a part of 0843c36!

@ReeceStevens (Contributor Author):

Thanks for the review and feedback @vsoch! I believe I've addressed all outstanding comments, all tests are passing, and the pre-commit hooks are all green.

If you like we can merge the other with a bump to the version, and release both changes under that version.

That sounds like a plan to me; I'll add the changelog adjustments and version bump in the other PR.

@ReeceStevens ReeceStevens requested a review from vsoch September 24, 2025 10:57