bytes_terminate_multi(): improve worst-case performance

In theory, this change only helps in the "degenerate" worst case, in which the terminator occurs many times at unaligned offsets before finally occurring at an aligned offset. The terminator also has to be longer than 2 bytes for the change to have any impact: with a 2-byte terminator, the only possible unaligned remainder is 1, so the next aligned offset is the same as `search_index + 1`. Since multi-byte terminators are not yet fully configurable from the .ksy spec and are only available by specifying `type: strz` + `encoding: UTF-{16,32}*`, this only affects zero-terminated UTF-32 strings, which I don't think are as common as UTF-16 strings in binary formats.

These conditions are met, for example, when the input is a UTF-32LE string that contains many character sequences like this:

```
>>> print('\U00000041\U00020000'.encode('utf-32le').hex(' '))
41 00 00 00 00 00 02 00
```

Here, the existing implementation of `bytes_terminate_multi` first finds an occurrence of `'\x00\x00\x00\x00'` at index 1, which is not 4-byte aligned, so it resumes searching from index 2. That search immediately reports a match at index 2, but that is also an unaligned offset, so it is skipped as well.

Resuming the search from offset 2 after a match at offset 1 is pointless, because we know in advance that any matches at offsets 2 and 3 will be rejected. We can save some work by calculating the nearest next aligned offset that we would accept (e.g., after getting a match at offset 1, we continue searching from offset 4), which the new version does.
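
The difference between the two search strategies can be made visible by counting `find` calls on an input built from the byte sequence above. This is only an illustrative sketch: `count_finds` and its `skip_to_aligned` flag are hypothetical names that do not exist in the runtime, and the loop is modeled on the diff rather than copied from `kaitaistruct.py`.

```python
def count_finds(data, term, skip_to_aligned):
    """Return (match_index, number_of_find_calls) for an aligned terminator search.

    Hypothetical helper for illustration only; not part of kaitaistruct.py.
    """
    unit_size = len(term)
    calls = 1
    search_index = data.find(term)
    while search_index != -1:
        mod = search_index % unit_size
        if mod == 0:
            break  # aligned match: this is the real terminator
        # Old strategy: retry one byte later.
        # New strategy: jump straight to the next aligned offset.
        step = (unit_size - mod) if skip_to_aligned else 1
        search_index = data.find(term, search_index + step)
        calls += 1
    return search_index, calls


term = b'\x00\x00\x00\x00'
# 1000 repetitions of U+0041 U+20000, followed by a real UTF-32LE terminator
data = '\U00000041\U00020000'.encode('utf-32le') * 1000 + term

old_result = count_finds(data, term, skip_to_aligned=False)
new_result = count_finds(data, term, skip_to_aligned=True)
print(old_result, new_result)
```

On this input, both strategies locate the terminator at offset 8000, but the one-byte-step variant needs 2002 `find` calls while the skipping variant needs 1002. The number of calls is only a rough proxy for the actual running time.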
kaitai-io · Jul 30, 2024 · 9bdaeb3
1 parent a6adc31
Showing 1 changed file with 3 additions and 2 deletions.
diff --git a/kaitaistruct.py b/kaitaistruct.py
@@ -483,9 +483,10 @@ def bytes_terminate_multi(data, term, include_term):
         while True:
             if search_index == -1:
                 return data[:]
-            if search_index % unit_size == 0:
+            mod = search_index % unit_size
+            if mod == 0:
                 return data[:search_index + (unit_size if include_term else 0)]
-            search_index = data.find(term, search_index + 1)
+            search_index = data.find(term, search_index + (unit_size - mod))
 
     # endregion
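
One way to gain confidence that skipping ahead never misses an aligned terminator is to compare the patched loop against a naive scan that inspects only aligned offsets. The sketch below does this on random inputs. Both function names are hypothetical, and because the start of `bytes_terminate_multi` is not shown in the diff, the initialization of `unit_size` and `search_index` is an assumption.

```python
import random


def terminate_multi_skip(data, term, include_term):
    # Assumed initialization; these two lines are not visible in the diff.
    unit_size = len(term)
    search_index = data.find(term)
    while True:
        if search_index == -1:
            return data[:]
        mod = search_index % unit_size
        if mod == 0:
            return data[:search_index + (unit_size if include_term else 0)]
        search_index = data.find(term, search_index + (unit_size - mod))


def terminate_multi_naive(data, term, include_term):
    # Reference implementation: test each aligned offset in turn.
    unit_size = len(term)
    for i in range(0, len(data) - unit_size + 1, unit_size):
        if data[i:i + unit_size] == term:
            return data[:i + (unit_size if include_term else 0)]
    return data[:]


rng = random.Random(0)
term = b'\x00\x00\x00\x00'
mismatches = 0
for _ in range(2000):
    # Mostly zero bytes, so unaligned matches are frequent
    length = rng.randrange(0, 40)
    data = bytes(rng.choice(b'\x00\x00\x00\x01') for _ in range(length))
    for include_term in (False, True):
        got = terminate_multi_skip(data, term, include_term)
        expected = terminate_multi_naive(data, term, include_term)
        if got != expected:
            mismatches += 1
print(mismatches)
```

The two versions should agree because any aligned match that follows an unaligned one must start at or after the next aligned offset, which is exactly where the patched loop resumes searching.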