
Conversation

ReeceStevens (Contributor)

Description

Related issues: None

I have been using deid in a WebAssembly execution context with Pyodide. In this configuration, performance issues are exacerbated, and I have been noticing prohibitively slow performance on DICOM header de-identification. I spent some time profiling the header deid functionality and identified three main areas where performance was being limited:

  1. Regex generation was taking up a significant amount of time
  2. Running field.name_contains was taking up a very large portion of the overall runtime (the inner loop within case 2 of expand_field_expression)
  3. get_fields was being run over and over

This PR consists of three commits, each tailored to one of the above points, plus one extra commit to fix the remaining bugs and get all tests passing.

Performance gains here are, of course, dependent on the input DICOM files as well as the deid recipe that is used. In my test setup with 69 input DICOM files, I observed the following speed improvements:

  • origin/master: 35.589 seconds
  • After pre-compiling regexes: 26.727 seconds
  • After using lookup tables for get_fields: 7.344 seconds
  • After adding caching to get_fields: 5.304 seconds
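
The first of these optimizations can be illustrated with a minimal, self-contained sketch (hypothetical names, not the actual deid code): hoist `re.compile` out of the loop, and fall back to plain string comparison when only equality is needed.

```python
import re

# Hypothetical candidate names standing in for DICOM field keywords.
names = ["PatientName", "PatientID", "StudyDate", "SeriesNumber"] * 250

def search_uncompiled(expression, candidates):
    # Baseline: re.search re-resolves the pattern on every iteration.
    return [n for n in candidates if re.search(expression, n)]

def search_precompiled(expression, candidates):
    # Optimized: compile once outside the loop and reuse the pattern object.
    pattern = re.compile(expression)
    return [n for n in candidates if pattern.search(n)]

def exact_match(expression, candidates):
    # For plain equality checks there is no need for a regex at all.
    return [n for n in candidates if n == expression]

assert search_uncompiled("Patient", names) == search_precompiled("Patient", names)
```

Python's `re` module does cache compiled patterns internally, but the per-call cache lookup still shows up in hot loops, which is why explicit pre-compilation helps here.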

Checklist

  • I have commented my code, particularly in hard-to-understand areas
  • My changes generate no new warnings
  • My code follows the style guidelines of this project

Open questions

In order to get caching working, I had to add a __hash__ method to pydicom.FileDataset. This is obviously not ideal, but I couldn't find another way to get the caching performance boost. I had originally thought I could just wrap FileDataset in a proxy class, but the overhead of creating the proxy class seemed to be enough to slow things down as much as a cache miss would.
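
As a sketch of the pattern in question (using a hypothetical stand-in class rather than the real pydicom.FileDataset, and assuming functools.lru_cache-style memoization), the identity-based __hash__ is what makes the dataset usable as a cache key:

```python
from functools import lru_cache

class FileDataset:
    """Hypothetical stand-in for pydicom.FileDataset; defining __eq__
    removes the default identity-based __hash__."""
    def __init__(self, fields):
        self.fields = fields
    def __eq__(self, other):
        return self.fields == other.fields

# Identity-based hash, as in this PR: restores hashability so the
# dataset object can serve as a cache key.
FileDataset.__hash__ = lambda self: id(self)

call_count = 0

@lru_cache(maxsize=None)
def get_fields(dataset):
    # Stand-in for the expensive field extraction.
    global call_count
    call_count += 1
    return tuple(dataset.fields)

ds = FileDataset(["PatientName", "PatientID"])
get_fields(ds)
get_fields(ds)  # cache hit: the function body does not run again
assert call_count == 1
```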

During profiling, it was identified that repeated regex lookups inside
loops were taking a significant amount of time. To reduce this, regex
expressions were pre-compiled outside of loops, and plain string
comparison was used when possible to sidestep the performance overhead
of regex matching for simple things like string equality comparison.
The `get_fields_with_lookup` function was added to augment the
`get_fields` function with a lookup table that allows quick
identification of exact tag matches.

This optimization significantly reduced the amount of time spent in the
exact matching stage of `expand_field_expression`. For top-level DICOM
tag searches, the search in "Case 2" would call `name_contains` on every
single tag in the DICOM dataset. With lookup tables, we can look up a
contender field based on the values for which we're checking exact
matches; this becomes a key lookup rather than a scan of all fields
for a matching identifier.
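
A simplified sketch of the lookup-table idea (hypothetical field records rather than real DicomField objects):

```python
from collections import defaultdict

# Hypothetical flattened field records keyed by tag string.
fields = {
    "0010,0010": {"name": "PatientName", "keyword": "PatientName"},
    "0010,0020": {"name": "PatientID", "keyword": "PatientID"},
    "0008,0020": {"name": "StudyDate", "keyword": "StudyDate"},
}

def build_lookup(fields):
    # One pass over the dataset builds tables keyed on each identifier kind.
    lookup = {
        "tag": defaultdict(list),
        "name": defaultdict(list),
        "keyword": defaultdict(list),
    }
    for uid, field in fields.items():
        lookup["tag"][uid].append(uid)
        lookup["name"][field["name"].lower()].append(uid)
        lookup["keyword"][field["keyword"].lower()].append(uid)
    return lookup

def find_exact(lookup, expression):
    # Each exact-match query is now a dict key lookup instead of a scan
    # over every field in the dataset.
    key = expression.lower()
    for table in lookup.values():
        if key in table:
            return table[key]
    return []

lookup = build_lookup(fields)
assert find_exact(lookup, "PatientID") == ["0010,0020"]
```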
@vsoch (Member) left a comment:
This is excellent! See my questions and comments. Akin to the other, we will need to bump the version and changelog. If you like we can merge the other with a bump to the version, and release both changes under that version.

# only way to enable the use of caching without incurring significant
# performance overhead. Note that adding a proxy class around this
# decreases performance substantially (50% slowdown measured).
FileDataset.__hash__ = lambda self: id(self)
vsoch (Member):
Is this something we should suggest for upstream (in that it might help other projects), or just appropriate to put here?

ReeceStevens (Contributor Author):
It's a good question... from my understanding, hash functions are typically supposed to be based on the value of an object (as opposed to its identity). From what I've read, objects by default use their id() as their hash until you define an __eq__ method on the class, at which point you have to define your own __hash__.

I can't think of any reasonable drawback to just using id() as the hash in practice, except that using datasets as dictionary keys might be a little funny:

some_dict = {}
ds1 = pydicom.dcmread('./somefile.dcm')
some_dict[ds1] = ds1.filename

# Later, read the same file again
ds2 = pydicom.dcmread('./somefile.dcm')
ds2 in some_dict # evaluates to False

It could be worth proposing upstream to pydicom and we could just see if the maintainer is open to the change. But maybe we could just have it here for now until we figure out the next steps on the pydicom side.

# Contains

def name_contains(self, expression, whole_string=False):
def name_contains(self, expression):
vsoch (Member):

I appreciate this refactor - the whole_string update was not to my liking.

- ELEMENT_OFFSET: 2-digit hexadecimal element number (last 8 bits of full element)
"""
regexp_expression = f"^{expression}$" if whole_string else expression
if type(expression) is str:
vsoch (Member):

Any reason to not use isinstance(expression, str) here?

ReeceStevens (Contributor Author):

This was just an oversight on my part; I'll switch to using isinstance here for the type comparisons! (I've been switching between TypeScript and Python a lot recently.)

ReeceStevens (Contributor Author):

Addressed in dc6fc50

or re.search(regexp_expression, self.stripped_tag)
or re.search(regexp_expression, self.element.name)
or re.search(regexp_expression, self.element.keyword)
expression.search(self.name.lower())
vsoch (Member):

This is much easier to read with the compiled regex.

# if no contenders provided, use top level of dicom headers
if contenders is None:
contenders = get_fields(dicom)
contenders, contender_lookup_tables = get_fields_with_lookup(dicom)
vsoch (Member):

Usually needing to return more than one related thing is a sign that a class may be warranted. In the future we might consider a class here that provides easy access to the tables and to getting a particular item.

ReeceStevens (Contributor Author):

That's a good idea!

ReeceStevens (Contributor Author):

Added in 0843c36. It was nice to remove all these lookup table variables lying around.

if expander.lower() in ["endswith", "startswith", "contains"]:
if field.name_contains(expression):
fields[uid] = field
if type(field) is str and string_matches_expander(expander, expression, field):
vsoch (Member):

isinstance here again?

ReeceStevens (Contributor Author):


return fields

def field_matches_expander(expander, expression_string, expression_re, field):
vsoch (Member):

Could you please make sure all functions have docstrings? (You can easily convert the existing comments, I think.)

ReeceStevens (Contributor Author):

Resolved in 37adf90

"""
skip = skip or []
seen = seen or []
fields, new_seen, new_skip = get_fields_inner(
vsoch (Member):

Possibly another opportunity for a class or dataclass.

ReeceStevens (Contributor Author):

I changed this function to be a "private" function and kept the interface as-is for simplicity, since this is only used within the get_fields_with_lookup function.

"element_keyword": defaultdict(list),
}
for uid, field in fields.items():
if type(field) is not DicomField:
vsoch (Member):

isinstance? See https://switowski.com/blog/type-vs-isinstance/. It may only matter (for speed) on older versions of Python, which unfortunately are still present on many of our clusters... 🙃
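
For reference, a minimal illustration of the behavioral difference: type() compares the exact class, while isinstance() also accepts subclasses.

```python
class TaggedExpression(str):
    # Hypothetical str subclass, e.g. a specialized expression type.
    pass

expr = TaggedExpression("PatientName")

# type() rejects subclasses; isinstance() follows the inheritance chain.
assert type(expr) is not str
assert isinstance(expr, str)
```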

ReeceStevens (Contributor Author):

if not self.fields:
self.fields = get_fields(
if not self.fields or not self.fields_by_name:
self.fields, self.lookup_tables = get_fields_with_lookup(
vsoch (Member):

We might want to reset the lookup tables when the class is initialized (or on any update to parse a different file, for example). I'm trying to think of whether there is a case where we might generate the lookup table for one dicom dataset and then load another (and have the tables mixed up and unintentionally combine patient data).

ReeceStevens (Contributor Author):

Added as a part of 0843c36!

@ReeceStevens (Contributor Author):

Thanks for the review and feedback @vsoch! I believe I've addressed all outstanding comments, all tests are passing, and the pre-commit hooks are all green.

If you like we can merge the other with a bump to the version, and release both changes under that version.

That sounds like a plan to me; I'll add the changelog adjustments and version bump in the other PR.

@ReeceStevens ReeceStevens requested a review from vsoch September 24, 2025 10:57