Trie has linear insertion time #32
Comments
Yeah, this is a problem. See also the discussion at #12.
Thanks for the hint! The data I'm trying to insert is already sorted in lexicographic order. I haven't checked the C code of libdatrie, but I assume this behaviour arises from the implementation and not from the Python wrapper. Should I file a bug against libdatrie?
@semkath Yes, filing a bug there is a good idea.
See also: scikit-learn/scikit-learn#2639 (comment)
@kmike So, if I understand correctly: the double-array implementation of a trie yields a more succinct data structure than a pointer-based implementation, and the tradeoff for the lower memory usage is degraded performance when updating the trie? In my case I don't really need to update the trie, but of course I do need to build it. For comparison: the same process with a (completely different) trie implementation in Java (namely this one) goes over without a hitch; the average time to insert 10000 entries into the trie is about 5 ms.
Yeah, it was also surprising to me that insertions are not O(1) in libdatrie. I think we should add this to the README, at least until there is a fix in libdatrie. If you don't need to update the trie then check https://github.com/pytries/marisa-trie or https://github.com/pytries/DAWG; they use much less memory and don't have this issue. |
A Patricia trie (similar to the Java one) can be found in BioPython.
The trie created by datrie has linear insertion time, i.e. the more entries are already in the trie, the longer each insertion takes. This means the insertion complexity looks like O(n) to me, with n being the number of entries already in the trie. A quick look at the literature suggests that insertion into a trie should be O(m) instead, with m being the length of the key.
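To illustrate the expected O(m) behaviour, here is a minimal pointer-based trie sketch (not datrie's double-array implementation, and not code from this issue): each insert touches at most one node per character of the key, independent of how many keys are already stored.

```python
class TrieNode:
    __slots__ = ("children", "is_key")

    def __init__(self):
        self.children = {}   # char -> TrieNode
        self.is_key = False


class SimpleTrie:
    def __init__(self):
        self.root = TrieNode()

    def insert(self, key):
        # Walks one node per character: O(len(key)) work,
        # regardless of the number of keys already in the trie.
        node = self.root
        for ch in key:
            node = node.children.setdefault(ch, TrieNode())
        node.is_key = True

    def keys_with_prefix(self, prefix):
        # Descend to the node for the prefix, then collect
        # every key stored in the subtree below it.
        node = self.root
        for ch in prefix:
            if ch not in node.children:
                return []
            node = node.children[ch]
        out = []
        stack = [(node, prefix)]
        while stack:
            cur, path = stack.pop()
            if cur.is_key:
                out.append(path)
            for ch, child in cur.children.items():
                stack.append((child, path + ch))
        return sorted(out)
```

For example, after inserting "foo", "foobar", and "bar", `keys_with_prefix("foo")` returns `["foo", "foobar"]`. A double-array trie trades this simple per-insert cost for a much more compact representation.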
This seems to be an issue, because it basically makes the trie unusable for any sufficiently large collection of entries. For example, I'm trying to store 60 million database keys (strings with a maximum length of about 12 characters) in a trie. I'm using the following code to store the database keys in a trie to perform fast prefix operations on them (such as: "return all database keys that match a certain prefix"):
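The original snippet is not reproduced in this thread. A hedged stand-in for the kind of benchmark described (batch-wise insertion with per-batch timing) might look like the following; it uses a plain dict-of-dicts trie in place of datrie so it runs without third-party dependencies, and synthetic random keys in place of the real database keys:

```python
import random
import string
import time


def make_keys(n, max_len=12, seed=0):
    # Synthetic stand-ins for the database keys described above
    # (strings up to ~12 characters); not the author's real data.
    rng = random.Random(seed)
    return ["".join(rng.choices(string.ascii_lowercase,
                                k=rng.randint(1, max_len)))
            for _ in range(n)]


def insert(trie, key, sentinel="$"):
    # dict-of-dicts trie insert: O(len(key)) per key.
    node = trie
    for ch in key:
        node = node.setdefault(ch, {})
    node[sentinel] = True


def benchmark(keys, batch_size=10_000):
    # Time each batch of insertions. With an O(m) insert, the
    # per-batch time should stay roughly flat as the trie grows;
    # the behaviour reported in this issue would instead show
    # each batch taking longer than the previous one.
    trie = {}
    timings = []
    for start in range(0, len(keys), batch_size):
        t0 = time.perf_counter()
        for key in keys[start:start + batch_size]:
            insert(trie, key)
        timings.append(time.perf_counter() - t0)
    return timings


timings = benchmark(make_keys(50_000))
```

Swapping the dict-based `insert` for `datrie.Trie` assignments in a harness like this would reproduce the growing per-batch times shown in the attached log.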
This code results in the following log file: trie_creation_log.txt
As you can see, the insertion process gets slower and slower as more entries are added to the trie. Given that I have 60 million database keys, it is obviously far too slow for my use case. I have seen trie implementations that don't suffer from this problem, so I wanted to make you aware of it 😃