Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Trie has linear insertion time #32

Open
electrofink opened this issue Jul 17, 2016 · 7 comments
Open

Trie has linear insertion time #32

electrofink opened this issue Jul 17, 2016 · 7 comments

Comments

@electrofink
Copy link

electrofink commented Jul 17, 2016

The trie created by datrie has linear insertion time, i.e. the more entries are already in the trie, the longer it takes. This means the insertion complexiy looks like O(n) to me, with n being the number of elements in the trie. A quick look at the literature suggests that insertion time for a trie should be O(m) instead, with m being the length of the key.
This seems to be an an issue, because it basically makes the trie unusuable for any sufficient large collection of entries. For example, I'm trying to store 60 million database keys (strings with a maximum length of about 12 characters) in a trie. I'm using the following code to store the database keys in a trie to perform fast prefix operations on them (such as: "return all database keys that match a certain prefix"):

def create_trie(file):
    trie = datrie.Trie(string.ascii_uppercase + string.digits)
    i = 0
    start = datetime.datetime.now()
    with open(file) as database_keys:
        for line in database_keys:
            database_key = line.rstrip('\n')
            trie[database_key] = database_key
            i += 1
            if i % 10000 == 0:
                end = datetime.datetime.now()
                delta = end - start
                print('Processed lines: ' + str(i) + ". Time for 10000 lines: " + str(delta), flush=True)
                start = datetime.datetime.now()
    return trie

This code results in the following log file: trie_creation_log.txt
As you can see, the insertion process gets slower and slower as more entries are added to the trie. Given the fact that I have 60 million database keys, it obviously is too slow for my use-case. I have seen trie implementations that don't suffer from this problem, so I wanted to make you aware of this 😃

@kmike
Copy link
Member

kmike commented Jul 17, 2016

Yeah, this is a problem. See also discussion at #12.

@electrofink
Copy link
Author

Thanks for the hint! The data I'm trying to insert is already sorted, in lexicographic order. I didn't check the C code for libdatrie, but I assume this behaviour arises from the implementation and not from the Python wrapper. Should I file a bug against libdatrie?

@kmike
Copy link
Member

kmike commented Jul 17, 2016

@semkath yes, firing a bug there is a good idea.

@kmike
Copy link
Member

kmike commented Jul 17, 2016

@electrofink
Copy link
Author

@kmike So, if I understand correctly: The double array implementation of a trie results in a more succinct data structure than the pointer-based implementation. The tradeoff for less memory usage is degraded behaviour when it comes to updating the trie? In my case, I don't really need to update the trie, but of course I need to build the trie. For comparison: The same process with a (completely different) trie implementation in Java (namely this one) goes over without a hitch, the average time to insert 10000 entries into the trie is about 5ms.
I'm (unfortunately) not an expert in trie implementations but at least to me it was surprising that a trie implementation has O(n) insertion complexity when I expected O(m)

@kmike
Copy link
Member

kmike commented Jul 17, 2016

Yeah, it was also surprising to me that insertions are not O(1) in libdatrie. I think we should add this to the README, at least until there is a fix in libdatrie.

If you don't need to update the trie then check https://github.com/pytries/marisa-trie or https://github.com/pytries/DAWG; they use much less memory and don't have this issue.

@kmike
Copy link
Member

kmike commented Jul 17, 2016

Patricia trie (similar to the Java one) can be found in BioPython.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants