Slow when specify word length. #14

EonYang · 2021-04-21T20:55:33Z

Describe the bug

RandomWord is slow.

To Reproduce

run this:

from timeit import timeit
from wonderwords import RandomWord

r = RandomWord()

timeit(lambda: r.word(
    include_parts_of_speech=[
        'nouns',
    ],
    word_min_length=3,
    word_max_length=8), number=100)

Output:

4.55993747399998

Expected behavior

A thousand calls should finish in 1 second.

Additional context

I think the code here is not very efficient.

wonderwordsmodule/wonderwords/random_word.py

Line 183 in 80dcdf5

words = []

It copies all words from the category, then removing them if they don't match the length criteria.
also, removing an item from an array takes O(n) time where n is the length of the array, making the whole operation O(n^2) where n is the numbers of words in the category.

IMO, using words = set() instead of words = [] could make it O(n) time, but it's still not fast enough.

When just requesting one word, we shouldn't even read all words in the category. It's better to pre-sort words in all categories by length, then use bisect to find start_index and end_index that match the length criteria, and just generate a random integer to access read the specific word. This way the overall time complexity is O(log N).

The above approach didn't optimize start_with and end_with, which requires introducing Trie.

The text was updated successfully, but these errors were encountered:

EonYang · 2021-04-21T21:40:54Z

A working code snippet:

from math import floor
from time import perf_counter
from random import randint, random
from wonderwords import RandomWord

sorted_nouns = sorted(RandomWord()._categories['nouns'], key=len)


def bisect_by_length(arr: list, target_length: int):
    l, r = 0, len(arr) - 1
    while l <= r:
        m = l + (r - l) // 2
        if len(arr[m]) < target_length:
            l = m + 1
        else:
            r = m - 1
    return l


def get_word_by_length(min_length: int, max_length: int):
    start = bisect_by_length(sorted_nouns, min_length)
    end = bisect_by_length(sorted_nouns, max_length + 1) - 1
    i = start + floor(random() * (end - start))
    return sorted_nouns[i]

Evaluate it:

def test_get_word_by_length(number: int = 1000):
    start_ts = perf_counter()
    for _ in range(number):
        min_len, max_len = randint(len(sorted_nouns[0]), len(
            sorted_nouns[-1])), randint(len(sorted_nouns[0]), len(sorted_nouns[-1]))
        if min_len > max_len:
            min_len, max_len = max_len, min_len
        w = get_word_by_length(min_len, max_len)
        if min_len == max_len == 18:
            ...
            # no words in the array has length == 18
            # print('min: {}, max: {}, actuall: {}'.format(
            #     min_len, max_len, len(w)))
        else:
            assert min_len <= len(w) <= max_len, 'min: {}, max: {}, actuall: {}'.format(
                min_len, max_len, len(w))
    end_ts = perf_counter()
    t = end_ts - start_ts
    print(f'{number} loops took {round(t, 4)} seconds')
    print(f'each loop took {round(t * 1000 / number, 4)} milliseconds')

test_get_word_by_length(1000000)


# output:
# 1000000 loops took 10.2348 seconds
# each loop took 0.0102 milliseconds

mrmaxguns · 2021-04-22T14:06:47Z

Thank you for your thorough analysis and contribution. I agree; there is much to optimize.

I agree that the categories should be pre-sorted. This could be done when an instance of the class is created.
Why is generating a random integer index faster than random.choice?
I agree with the idea to implement a trie.
Where would the rest of the categorizations (such as RegEx) fit in?

EonYang · 2021-04-22T14:15:56Z

random.choice requires passing a slice of arr, and slicing the arr takes O(n) time and O(n) space.
Random index only takes O(1) time and O(1) space.

I don't have a good solution for Regex. Regex is not very efficient in certain scenarios. Maybe give it its own API and the user will decide to use it when they don't need to worry about performance.

mrmaxguns · 2021-04-24T02:59:39Z

Thanks for the advice. I created a new branch where I'm going to push optimizations. Only implementing the bisect method gave astronomically better results.

On my computer, it used to go at 5.106667959000333, now timeit shows 0.06474015199637506.

Implemented the length filter optimization suggested in #14.

mrmaxguns · 2021-04-26T21:33:31Z

I just implemented a trie. Currently, the system is clunky because for ends_with it simply generates a second trie for all words in reverse. This means that the matching words will then have to be re-reversed to their original states. I'm thinking of a way to forgo this.

Implemented the length filter optimization suggested in #14.

mrmaxguns added enhancement New feature or request optimization something can be optimized labels Apr 22, 2021

mrmaxguns pinned this issue Apr 22, 2021

mrmaxguns added a commit that referenced this issue Apr 24, 2021

feat: optimizations: code is roughly 10x faster

b77f467

Implemented the length filter optimization suggested in #14.

mrmaxguns added the fix implemented A fix has been implemented, the issue will soon be resolved label Apr 24, 2021

mrmaxguns self-assigned this Apr 24, 2021

mrmaxguns added a commit that referenced this issue Apr 9, 2024

feat: optimizations: code is roughly 10x faster

55682e9

Implemented the length filter optimization suggested in #14.

mrmaxguns mentioned this issue Apr 12, 2024

V3.0 #27

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Slow when specify word length. #14

Slow when specify word length. #14

EonYang commented Apr 21, 2021

EonYang commented Apr 21, 2021

mrmaxguns commented Apr 22, 2021

EonYang commented Apr 22, 2021

mrmaxguns commented Apr 24, 2021 •

edited

Loading

mrmaxguns commented Apr 26, 2021

Slow when specify word length. #14

Slow when specify word length. #14

Comments

EonYang commented Apr 21, 2021

EonYang commented Apr 21, 2021

mrmaxguns commented Apr 22, 2021

EonYang commented Apr 22, 2021

mrmaxguns commented Apr 24, 2021 • edited Loading

mrmaxguns commented Apr 26, 2021

mrmaxguns commented Apr 24, 2021 •

edited

Loading