Trie has linear insertion time #32
Comments
Yeah, this is a problem. See also the discussion at #12.
Thanks for the hint! The data I'm trying to insert is already sorted in lexicographic order. I haven't checked the C code of libdatrie, but I assume this behaviour arises from the implementation and not from the Python wrapper. Should I file a bug against libdatrie?
@semkath Yes, filing a bug there is a good idea.
See also: scikit-learn/scikit-learn#2639 (comment)
@kmike So, if I understand correctly: the double-array implementation of a trie yields a more succinct data structure than a pointer-based implementation, and the tradeoff for the lower memory usage is degraded performance when updating the trie? In my case I don't really need to update the trie, but of course I do need to build it. For comparison: the same process with a (completely different) trie implementation in Java (namely this one) goes over without a hitch; the average time to insert 10000 entries into the trie is about 5 ms.
Yeah, it was also surprising to me that insertions are not O(1) in libdatrie. I think we should add this to the README, at least until there is a fix in libdatrie. If you don't need to update the trie then check https://github.com/pytries/marisa-trie or https://github.com/pytries/DAWG; they use much less memory and don't have this issue. |
A Patricia trie (similar to the Java one) can be found in BioPython.
The trie created by datrie has linear insertion time, i.e. the more entries are already in the trie, the longer each insertion takes. This means the insertion complexity looks like O(n) to me, with n being the number of entries already in the trie. A quick look at the literature suggests that insertion into a trie should be O(m) instead, with m being the length of the key.
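To illustrate the expected O(m) behaviour, here is a minimal pointer-based trie sketch (not datrie's double-array implementation, and not code from this issue): each insert touches at most one node per character of the key, independent of how many keys are already stored.

```python
class TrieNode:
    __slots__ = ("children", "is_key")

    def __init__(self):
        self.children = {}   # char -> TrieNode
        self.is_key = False


class SimpleTrie:
    def __init__(self):
        self.root = TrieNode()

    def insert(self, key):
        # Walks one node per character: O(len(key)) work,
        # regardless of the number of keys already in the trie.
        node = self.root
        for ch in key:
            node = node.children.setdefault(ch, TrieNode())
        node.is_key = True

    def keys_with_prefix(self, prefix):
        # Descend to the node for the prefix, then collect
        # every key stored in the subtree below it.
        node = self.root
        for ch in prefix:
            if ch not in node.children:
                return []
            node = node.children[ch]
        out = []
        stack = [(node, prefix)]
        while stack:
            cur, path = stack.pop()
            if cur.is_key:
                out.append(path)
            for ch, child in cur.children.items():
                stack.append((child, path + ch))
        return sorted(out)
```

For example, after inserting "foo", "foobar", and "bar", `keys_with_prefix("foo")` returns `["foo", "foobar"]`. A double-array trie trades this simple per-insert cost for a much more compact representation.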
This seems to be an issue, because it basically makes the trie unusable for any sufficiently large collection of entries. For example, I'm trying to store 60 million database keys (strings with a maximum length of about 12 characters) in a trie. I'm using the following code to store the database keys in a trie to perform fast prefix operations on them (such as: "return all database keys that match a certain prefix"):
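The original snippet is not reproduced in this thread. A hedged stand-in for the kind of benchmark described (batch-wise insertion with per-batch timing) might look like the following; it uses a plain dict-of-dicts trie in place of datrie so it runs without third-party dependencies, and synthetic random keys in place of the real database keys:

```python
import random
import string
import time


def make_keys(n, max_len=12, seed=0):
    # Synthetic stand-ins for the database keys described above
    # (strings up to ~12 characters); not the author's real data.
    rng = random.Random(seed)
    return ["".join(rng.choices(string.ascii_lowercase,
                                k=rng.randint(1, max_len)))
            for _ in range(n)]


def insert(trie, key, sentinel="$"):
    # dict-of-dicts trie insert: O(len(key)) per key.
    node = trie
    for ch in key:
        node = node.setdefault(ch, {})
    node[sentinel] = True


def benchmark(keys, batch_size=10_000):
    # Time each batch of insertions. With an O(m) insert, the
    # per-batch time should stay roughly flat as the trie grows;
    # the behaviour reported in this issue would instead show
    # each batch taking longer than the previous one.
    trie = {}
    timings = []
    for start in range(0, len(keys), batch_size):
        t0 = time.perf_counter()
        for key in keys[start:start + batch_size]:
            insert(trie, key)
        timings.append(time.perf_counter() - t0)
    return timings


timings = benchmark(make_keys(50_000))
```

Swapping the dict-based `insert` for `datrie.Trie` assignments in a harness like this would reproduce the growing per-batch times shown in the attached log.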
This code results in the following log file: trie_creation_log.txt
As you can see, the insertion process gets slower and slower as more entries are added to the trie. Given that I have 60 million database keys, it is obviously far too slow for my use case. I have seen trie implementations that don't suffer from this problem, so I wanted to make you aware of it 😃