Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Feature request: Time limit on long-running functions #338

Open
shirakaba opened this issue Jul 10, 2023 · 2 comments
Open

Feature request: Time limit on long-running functions #338

shirakaba opened this issue Jul 10, 2023 · 2 comments
Labels
enhancement New feature or request

Comments

@shirakaba
Copy link

shirakaba commented Jul 10, 2023

Background

I'm using fuzz.partial_ratio_alignment() to align each paragraph of a novel to a timestamped transcription from the novel's corresponding audiobook. This will allow me to know how far into the book I've "read" whilst listening to its audiobook.

Approach

For each paragraph in the novel, I take a substring of ~128 characters to query against the full audio transcription (around 200,000 characters).

Here's the general idea, in code:

for paragraph in charlesDickensNovel:
    alignment = fuzz.partial_ratio_alignment(
        paragraph[0:128],
        charlesDickensFullAudioTranscription
    )

Problem

Most of the lines are successfully aligned within milliseconds (or nanoseconds), but perhaps 1% of these queries hang for several seconds.

I suspect that a smaller query would produce a quicker alignment, but I do need to query at least about 128 characters to reduce false-positives. There's also no way around the fact that I have to match against the whole audio transcription.

Requested solution

I actually don't need to match up every line; just a large proportion would be good enough for me to build a progress map. So I'd be happy to cancel any matches that take longer than 100 milliseconds to execute and just move onto the next one. Would it be possible to add an optional time limit onto each of the RapidFuzz APIs? Something like this timeout parameter:

fuzz.partial_ratio_alignment(
  s1=line[0:128],
  s2=charlesDickensFullAudioTranscription,
  timeout=0.1
)

If the timeout is reached, it could simply return None without explanation, or (if it feels more Pythonic) throw an error.

@shirakaba
Copy link
Author

Follow-up: I found that the reason RapidFuzz was hanging was that I'd initialised the substring incorrectly, and so was querying with a larger and larger string (thousands of characters instead of just 128). So I no longer personally need this feature.

However, it still seems like a sensible feature to implement to protect against denial-of-service when users might be able to pass arbitrary search queries to it. In the Node.js package ecosystem (npm), RegEx denial-of-service attacks are treated as a severe vulnerability, for perspective.

@maxbachmann maxbachmann added the enhancement New feature or request label Jul 19, 2023
@maxbachmann
Copy link
Member

I am really still not sure what to do in these cases. One basic issue right now is that when processing very long sequences there is not really any way to stop execution yet. When processing multiple sequences using e.g. process.cdist I do check for KeyboardInterrupts from time to time, but this is not yet done when processing very long sequences, since the calculation is handed off into C++. One of the challenges in this regard is that the C++ implementation is used both standalone and in the Python Wrapper. So I have a hard time extending it with features that are not needed in C++. Thinking about this at the very least I had a rough idea how I could implement the check for KeyboardInterrupts in a compatible manner.

If the timeout is reached, it could simply return None without explanation, or (if it feels more Pythonic) throw an error.

Probably an exception would be the best option. Some scorers can already return None when no good enough match was found making it impossible to differentiate between the errors. In addition an exception would work better with the functions in the processor module.

I will keep this in the back of my head, but I am still unsure whether I will add this feature. I would need to add this argument to every function which adds complexity for all users. One compromise might be a global timeout value the user can set for all internal scorer functions.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

2 participants