search_calls_per_second needs to be dialed down #137
Keeping this open to track other adjustments mentioned here to handle things better in general (e.g. when we hit a 429) rather than just adjusting the limits (done in #140). This is also tied in with a long-running goal I’ve had to revamp rate limiting, since part of the problem here is that we implement the throttle on calls to `WaybackClient` methods rather than where the requests are actually made.
Sketching out some more detailed thoughts here:
@edsu while doing some final cleanup on the patch release for this (sorry for the delays; I’ve been a bit distracted lately), I was re-reading the relevant docstring. Raising instead of retrying seemed like a big change for v0.4.4, so for that release I just made a longer retry delay (see #142). But I’m thinking about making it always raise, instead of ever retrying, on responses with status code 429 in v0.5.0. As a user of this tool, do you have any thoughts on that? Would you prefer the current behavior (retrying but with a very long delay) instead? (Another option might be to make this configurable, but I’d prefer to avoid that complexity and just choose one behavior or the other.)
Yes, I like raising an exception when a 429 is encountered. I think it will help wayback users avoid a situation where they are getting blocked?
Great, that’s about what I was thinking as well. (I think the reason it was being handled here in the first place was that, at EDGI, we used to use this as part of a nightly bulk processing script that took a few hours to run, and having it automatically handle rate limits would have been useful there since you probably don’t want a job that big to stop in the middle.)
We used to retry requests that got a response with a 429 status code (indicating that you've hit a rate limit on the server) automatically, but we've decided this is no longer a good approach. See the discussion in #137 (comment) for more detail, but the short version is that our previous retry behavior was specifically geared to a workflow EDGI used, which had custom rate limits and other server behavior that most users won't have. It also helped to work around a bunch of deficiencies in our client-side rate limiting that were solved in the last release (but do need a *bit* more work in this release, too). It turns out this led to some users getting blocked, and it's better to leave handling this situation to user code.
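Since the thread settles on leaving 429 handling to user code, here is a rough sketch of what that could look like on the caller's side. The exception name, its `retry_after` attribute, and the example URL are assumptions for illustration; check the actual release for the exact API.

```python
import time

from wayback import WaybackClient
# Assumed name: check the release notes for the actual exception raised on a 429.
from wayback.exceptions import RateLimitError

client = WaybackClient()

def search_with_backoff(url, attempts=3):
    """Run a CDX search, waiting out the rate-limit penalty if the server says to stop."""
    for _ in range(attempts):
        try:
            return list(client.search(url, limit=25))
        except RateLimitError as error:
            # Per the discussion below, a 429 means "stop everything" for a while;
            # fall back to 60 seconds if the error carries no Retry-After hint.
            time.sleep(getattr(error, 'retry_after', None) or 60)
    raise RuntimeError(f'Still rate limited after {attempts} attempts')

records = search_with_backoff('https://epa.gov/')
```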
Where we are on 429 responses:
In terms of refactoring rate limits:
Another alternative here I hadn’t thought through before is a bigger refactor: have separate sessions for endpoints/URL patterns that have different rate limits (so one session per group of endpoints). Also, on reflection, I think option (1) is still worth considering. (All that said, I plan to try and do #58 this month, too, which will make the above moot. BUT I expect that issue to be bigger and more complex than it sounds, and I don’t want to force the changes here to wait for that to be done first.)
OK, got some more details from the inside. Limits are calculated over a 5 minute period. Hard limits are:
But they’d like us to set the defaults at 80% of that in the general case. So we should use:
429 responses should stop everything for 60 seconds. We now raise an exception in this case rather than retrying.
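The specific numbers didn’t survive in the comment above, but working backwards from the defaults quoted later in this thread (1.33, 0.8, and 8 requests per second), the 80% rule implies roughly the hard limits below. These values are derived here, not quoted from Internet Archive staff.

```python
# Defaults quoted further down the thread (requests per second).
DEFAULTS = {'/web/timemap': 1.33, '/cdx': 0.8, '/': 8}

# If each default is 80% of the corresponding hard limit, the implied hard
# limits are default / 0.8 (derived, not confirmed values).
implied_hard_limits = {path: round(rate / 0.8, 2) for path, rate in DEFAULTS.items()}
print(implied_hard_limits)  # {'/web/timemap': 1.66, '/cdx': 1.0, '/': 10.0}
```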
This also leaves me wondering if I should change the API so that instead of naming each rate limit:

```python
WaybackSession(
    search_calls_per_second=RateLimit(1),
    memento_calls_per_second=RateLimit(8)
)
```

We just have a dict of URL prefixes and rate limits:

```python
WaybackSession(rate_limits={
    '/cdx/': RateLimit(1),
    '/web/': RateLimit(8),
    '/': RateLimit(10),
})
```

This is probably a little less clear for users (now you have to know what URLs the various services live at if you want to make adjustments, or to understand how the rate limits apply to calls to the various client methods).

Otherwise, I think I may at least deprecate the old argument names and change them to the less verbose:

```python
WaybackSession(
    rate_search=RateLimit(1),
    rate_memento=RateLimit(8)
)
```
I like the route specific names. It should make it easier to connect the limit with the endpoint that is being used?
Thanks! 🙏 I wasn’t sure whether it was better to clarify what client method is getting what limits, or what actual endpoints are getting what limits. If you are more concerned about the latter, that is definitely a good reason to change it up.
I suppose the other complicated thing here is that we need to preserve the defaults for different services, so whatever you pass in will be treated like:

```python
{
    '/web/timemap': DEFAULT_RATE_LIMIT_TIMEMAP,  # shared instance of RateLimit(1.33)
    '/cdx': DEFAULT_RATE_LIMIT_CDX,              # shared instance of RateLimit(0.8)
    '/': DEFAULT_RATE_LIMIT,                     # shared instance of RateLimit(8)
} | rate_limits
```

So you get this behavior (mostly straightforward, but maybe a bit weird):

```python
WaybackSession(rate_limits={'/': 10})
# Resulting limits:
# {
#     '/web/timemap': DEFAULT_RATE_LIMIT_TIMEMAP,
#     '/cdx': DEFAULT_RATE_LIMIT_CDX,
#     '/': 10,
# }
```

But this more complicated behavior if you make a typo:

```python
WaybackSession(rate_limits={'/cdx/': 10})
# Resulting limits:
# {
#     '/web/timemap': DEFAULT_RATE_LIMIT_TIMEMAP,
#     '/cdx/': 10,
#     '/cdx': DEFAULT_RATE_LIMIT_CDX,
#     '/': DEFAULT_RATE_LIMIT,
# }
```

(I guess we could also restrict which prefixes can be given limits and raise an error if you try to set a disallowed one, but that kills some of the flexibility here, and I feel like the nice names are less error-prone, more statically checkable, and more editor-assistable/tab-completable at that point.)
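To make the typo scenario above concrete, here is a minimal sketch of how a prefix-based lookup over the merged dict could behave. None of this is the library's actual code: the default values are taken from the comment above, while the merge helper and the longest-prefix-wins rule are assumptions for illustration.

```python
# Assumed defaults (numbers taken from the comment above, requests per second).
DEFAULT_RATE_LIMITS = {
    '/web/timemap': 1.33,
    '/cdx': 0.8,
    '/': 8,
}

def resolve_rate_limits(user_limits):
    """Merge user-supplied limits over the defaults; only exact key matches override."""
    return {**DEFAULT_RATE_LIMITS, **user_limits}

def limit_for(path, limits):
    """Pick the limit for the longest prefix matching the request path (an assumed rule)."""
    matching = [prefix for prefix in limits if path.startswith(prefix)]
    return limits[max(matching, key=len)]

limits = resolve_rate_limits({'/cdx/': 10})   # the typo: '/cdx/' instead of '/cdx'
print(limits)
# {'/web/timemap': 1.33, '/cdx': 0.8, '/': 8, '/cdx/': 10}
print(limit_for('/cdx/search/cdx', limits))   # 10 -- the typo entry wins for CDX URLs here
print(limit_for('/cdx', limits))              # 0.8 -- but the default entry also survives
```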
Thinking about the above some more, I don’t think I’m going to make any changes to the rate limit arguments right now. There’s definitely room for rethinking this going forward, but I’m not comfortable with including it in the other, more immediate work around rate limiting that this issue is focused on.
Our approach to rate limiting has been in need of major refactoring, and this rearranges and fixes a bunch of things:

- Replace our previous `rate_limited(calls_per_second, group)` function with a `RateLimit` class. The previous implementation led to different rate limits that were intermingled in non-obvious ways. The new class gives us a more straightforward object that models a single limit, independent of any other limits, but which can be shared as needed.
- Update default rate limits based on consultation with Internet Archive staff. These limits are based on what they currently use as standardized limits, but backed off to 80% of the hard limit (which they requested).
- Apply rate limits directly where the requests are made (in `WaybackSession.send()`), so they reliably limit requests to the desired rate. They were previously applied in `WaybackClient` methods, which meant they didn’t account correctly for retries or redirects.
- Some other minor refactorings came along for the ride, as well as starting to do type annotations.

Fixes #137.
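For a sense of what "an object that models a single limit, independent of any other limits, but which can be shared" can mean in practice, here is a minimal sketch of such a class. It is not the implementation from the release, just an illustration of the design: one lock and one timestamp per limit, so sharing an instance shares the limit and separate instances never interfere.

```python
import threading
import time

class RateLimit:
    """A single, self-contained rate limit that can be shared between callers."""

    def __init__(self, calls_per_second):
        self._interval = 1.0 / calls_per_second
        self._lock = threading.Lock()
        self._last_call = 0.0

    def wait(self):
        """Block until the next call is allowed under this limit."""
        with self._lock:
            now = time.monotonic()
            delay = self._last_call + self._interval - now
            if delay > 0:
                time.sleep(delay)
            self._last_call = time.monotonic()

# Sharing one instance means sharing the limit; separate instances are independent.
cdx_limit = RateLimit(0.8)      # e.g. the CDX/search default from this thread
memento_limit = RateLimit(8)    # e.g. the general/memento default
```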
Direct usage of `setup.py` (that is, calling `python setup.py some_command` to do builds and such) was deprecated and has not been supported for quite a while! This moves everything to `pyproject.toml` (the new format for declaring package metadata). It *seems* like `pyproject.toml` supports all the nice things now, so we can also drop `setup.cfg`. (This seems like a good time to do this, since there are some other semi-breaking changes on tap, like [rethinking how rate limit errors are handled](#137 (comment)).)

Co-authored-by: Dan Allan <[email protected]>
I was running some fairly simple data retrieval in this Notebook (see the Wayback section) and I discovered that I got completely blocked from accessing web.archive.org! Luckily I remembered the Internet Archive's #wayback-researchers Slack channel, where I got this response.
I thought that the openwayback module's defaults would have prevented me from going over the 60 requests per minute (one per second) and I thought wayback's support for handling 429 responses would have backed off sufficiently fast. I suspect that the goal posts on the server side have changed recently because I had code that worked a month or so ago, which stopped working (resulting in the block).
I was able to get around this by using a custom WaybackSession where I set the `search_calls_per_second` to `0.5`, but I suspect `1.0` would probably work better. Maybe the default could be moved down from 1.5 to 1.0? Also, perhaps there needs to be some logic to make sure to wait a minute when encountering a 429 as well?
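For anyone landing here with the same problem, the workaround described above looks roughly like this. The `search_calls_per_second` parameter is the one discussed in this thread; the example URL is arbitrary, and the defaults and recommended values may have changed in newer releases.

```python
from wayback import WaybackClient, WaybackSession

# Slow down CDX searches to one call every two seconds.
session = WaybackSession(search_calls_per_second=0.5)
client = WaybackClient(session)

for record in client.search('https://epa.gov/', limit=10):
    print(record.timestamp, record.url)
```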