Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Slow when specify word length. #14

Open
EonYang opened this issue Apr 21, 2021 · 5 comments
Open

Slow when specify word length. #14

EonYang opened this issue Apr 21, 2021 · 5 comments
Assignees
Labels
enhancement New feature or request fix implemented A fix has been implemented, the issue will soon be resolved optimization something can be optimized

Comments

@EonYang
Copy link

EonYang commented Apr 21, 2021

Describe the bug

RandomWord is slow.

To Reproduce

run this:

from timeit import timeit
from wonderwords import RandomWord

r = RandomWord()

timeit(lambda: r.word(
    include_parts_of_speech=[
        'nouns',
    ],
    word_min_length=3,
    word_max_length=8), number=100)

Output:

4.55993747399998

Expected behavior

A thousand calls should finish in 1 second.

Additional context

I think the code here is not very efficient.

It copies all words from the category, then removing them if they don't match the length criteria.
also, removing an item from an array takes O(n) time where n is the length of the array, making the whole operation O(n^2) where n is the numbers of words in the category.

IMO, using words = set() instead of words = [] could make it O(n) time, but it's still not fast enough.

When just requesting one word, we shouldn't even read all words in the category. It's better to pre-sort words in all categories by length, then use bisect to find start_index and end_index that match the length criteria, and just generate a random integer to access read the specific word. This way the overall time complexity is O(log N).

The above approach didn't optimize start_with and end_with, which requires introducing Trie.

@EonYang
Copy link
Author

EonYang commented Apr 21, 2021

A working code snippet:

from math import floor
from time import perf_counter
from random import randint, random
from wonderwords import RandomWord

sorted_nouns = sorted(RandomWord()._categories['nouns'], key=len)


def bisect_by_length(arr: list, target_length: int):
    l, r = 0, len(arr) - 1
    while l <= r:
        m = l + (r - l) // 2
        if len(arr[m]) < target_length:
            l = m + 1
        else:
            r = m - 1
    return l


def get_word_by_length(min_length: int, max_length: int):
    start = bisect_by_length(sorted_nouns, min_length)
    end = bisect_by_length(sorted_nouns, max_length + 1) - 1
    i = start + floor(random() * (end - start))
    return sorted_nouns[i]

Evaluate it:

def test_get_word_by_length(number: int = 1000):
    start_ts = perf_counter()
    for _ in range(number):
        min_len, max_len = randint(len(sorted_nouns[0]), len(
            sorted_nouns[-1])), randint(len(sorted_nouns[0]), len(sorted_nouns[-1]))
        if min_len > max_len:
            min_len, max_len = max_len, min_len
        w = get_word_by_length(min_len, max_len)
        if min_len == max_len == 18:
            ...
            # no words in the array has length == 18
            # print('min: {}, max: {}, actuall: {}'.format(
            #     min_len, max_len, len(w)))
        else:
            assert min_len <= len(w) <= max_len, 'min: {}, max: {}, actuall: {}'.format(
                min_len, max_len, len(w))
    end_ts = perf_counter()
    t = end_ts - start_ts
    print(f'{number} loops took {round(t, 4)} seconds')
    print(f'each loop took {round(t * 1000 / number, 4)} milliseconds')

test_get_word_by_length(1000000)


# output:
# 1000000 loops took 10.2348 seconds
# each loop took 0.0102 milliseconds

@mrmaxguns mrmaxguns added enhancement New feature or request optimization something can be optimized labels Apr 22, 2021
@mrmaxguns
Copy link
Owner

Thank you for your thorough analysis and contribution. I agree; there is much to optimize.

  1. I agree that the categories should be pre-sorted. This could be done when an instance of the class is created.
  2. Why is generating a random integer index faster than random.choice?
  3. I agree with the idea to implement a trie.
  4. Where would the rest of the categorizations (such as RegEx) fit in?

@EonYang
Copy link
Author

EonYang commented Apr 22, 2021

random.choice requires passing a slice of arr, and slicing the arr takes O(n) time and O(n) space.
Random index only takes O(1) time and O(1) space.

I don't have a good solution for Regex. Regex is not very efficient in certain scenarios. Maybe give it its own API and the user will decide to use it when they don't need to worry about performance.

@mrmaxguns mrmaxguns pinned this issue Apr 22, 2021
@mrmaxguns
Copy link
Owner

mrmaxguns commented Apr 24, 2021

Thanks for the advice. I created a new branch where I'm going to push optimizations. Only implementing the bisect method gave astronomically better results.

On my computer, it used to go at 5.106667959000333, now timeit shows 0.06474015199637506.

mrmaxguns added a commit that referenced this issue Apr 24, 2021
Implemented the length filter optimization suggested in #14.
@mrmaxguns mrmaxguns added the fix implemented A fix has been implemented, the issue will soon be resolved label Apr 24, 2021
@mrmaxguns mrmaxguns self-assigned this Apr 24, 2021
@mrmaxguns
Copy link
Owner

I just implemented a trie. Currently, the system is clunky because for ends_with it simply generates a second trie for all words in reverse. This means that the matching words will then have to be re-reversed to their original states. I'm thinking of a way to forgo this.

mrmaxguns added a commit that referenced this issue Apr 9, 2024
Implemented the length filter optimization suggested in #14.
@mrmaxguns mrmaxguns mentioned this issue Apr 12, 2024
Merged
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request fix implemented A fix has been implemented, the issue will soon be resolved optimization something can be optimized
Projects
None yet
Development

No branches or pull requests

2 participants