search_calls_per_second needs to be dialed down #137

Closed
edsu opened this issue Nov 1, 2023 · 12 comments · Fixed by #140 or #147
Labels
enhancement New feature or request

Comments

@edsu
Contributor

edsu commented Nov 1, 2023

I was running some fairly simple data retrieval in this Notebook (see the Wayback section) and I discovered that I got completely blocked from accessing web.archive.org! Luckily I remembered the Internet Archive's #wayback-researchers Slack channel, where I got this response:

Hi edsu - I found your /cdx requests from 4:29UTC. Those requests are limited to an average of 60/min. Over that and we start sending 429s. If 429s are ignored for more than a minute we block the IP at the firewall (no connection) for 1 hour, which is what happened to you. Subsequent 429s over a given period will double that time each occurrence. If you can keep your api requests < 60/minute you will prevent this from happening.

I thought that the wayback module's defaults would have prevented me from going over 60 requests per minute (one per second), and I thought wayback's support for handling 429 responses would have backed off quickly enough. I suspect the goal posts on the server side have changed recently, because I had code that worked a month or so ago which stopped working (resulting in the block).

I was able to get around this by using a custom WaybackSession where I set search_calls_per_second to 0.5, but I suspect 1.0 would probably work better. Maybe the default could be moved down from 1.5 to 1.0?

Also, perhaps there needs to be some logic to make sure we wait a full minute when encountering a 429?
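For reference, the workaround looked roughly like this (a minimal sketch using the current keyword argument; the record fields printed here are from memory):

from wayback import WaybackClient, WaybackSession

# Throttle CDX searches to one call every two seconds (0.5/s), comfortably
# under the server's 60-requests-per-minute budget.
session = WaybackSession(search_calls_per_second=0.5)
client = WaybackClient(session)

for record in client.search('example.com'):
    print(record.timestamp, record.url)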

@edsu edsu changed the title from "Getting blocked after too many requests followed by not waiting after a 429" to "search_calls_per_second needs to be dialed down" Nov 1, 2023
This was referenced Nov 1, 2023
edsu added a commit to edsu/wayback that referenced this issue Nov 1, 2023
@Mr0grog
Member

Mr0grog commented Nov 2, 2023

Keeping this open to track the other adjustments mentioned here for handling things better in general (e.g. when we hit a 429), rather than just adjusting the limits (done in #140).

This is also tied in with a long-running goal I’ve had to revamp rate limiting, since part of the problem here is that we implement the throttle on calls to search() and get_memento() instead of in the actual send() method (or around calls to it), where it should be happening. (Elsewhere in EDGI code, we have a nice RateLimit object that works like a lock that I’d like to use here, so we can pass a rate limit into send() or otherwise share a single lock around the codebase as needed.)
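For illustration, a lock-like RateLimit can be very small (a minimal sketch of the idea, not the actual EDGI implementation):

import threading
import time

class RateLimit:
    """Lock-like throttle: entering it enforces a minimum interval between calls."""

    def __init__(self, calls_per_second):
        self._interval = 1.0 / calls_per_second
        self._lock = threading.Lock()
        self._next_allowed = 0.0

    def __enter__(self):
        with self._lock:
            wait = self._next_allowed - time.monotonic()
            if wait > 0:
                time.sleep(wait)
            self._next_allowed = time.monotonic() + self._interval
        return self

    def __exit__(self, *exc_info):
        return False

# A single shared instance can then wrap every request to the same endpoint:
cdx_limit = RateLimit(1.0)
with cdx_limit:
    ...  # the actual send() call would go here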

@Mr0grog Mr0grog reopened this Nov 2, 2023
@Mr0grog Mr0grog added the enhancement New feature or request label Nov 2, 2023
@Mr0grog
Member

Mr0grog commented Nov 2, 2023

Sketching out some more detailed thoughts here:

  • Should bring in the RateLimit class from web-monitoring-processing and have the session hold two instances of that (one for the CDX server and one for the Memento server, since they do, or at least did, have independent rate limits) instead of just integers.

  • How do the above actually get applied? Some options:

    1. Used automatically in send() based on the URL requested. Pros: easy to implement (otherwise there is a layer of indirection between the client methods and send() — they call request(), which prepares a request to send() and I have to think about how to get through that); don’t have to worry about calling code making mistakes. Cons: maybe too magic, hard to override. Doesn’t give us a way to burst for requests that go together (e.g. following redirects from a memento request).
    2. Have the client pass the rate limit to use into the session when making a request. This keeps the current model, where client methods are in charge of what types of requests get which limits, and means we don’t have to magically associate limits with URLs, just with use cases. But we need to figure out how to actually pass that info all the way down to send().
    3. Have the client use the limits directly. This is kind of like today, but we just need to move where the limits get applied to be directly around calls to request() or send() rather than at the start of methods like get_memento() or search(). This requires those high-level methods to be more careful (and probably more error prone), but keeps the application of the limits in the place where we have more contextual information to do interesting things with (e.g. if we wanted to add the ability to burst briefly for requests that go together).

    I suspect (2) may present pretty hard-to-address logistical hurdles, so either (1) or (3) is better (at least until we get rid of requests — see Pool connections across threads and/or make WaybackClient thread-safe #58). I like that (1) just works, but (3) gives us more flexibility.

  • Consider adding some kind of burst-ability to the RateLimit object (e.g. for CDX, the real budget is 60/min, not 1/s, so you could do a few requests faster than 1/s as long as you don’t blow the 1-minute budget; see the token-bucket sketch after this list). I’m not sure this is that useful for CDX, but it would be super useful for mementos. We get around this right now by not implementing rate limits correctly (that is, applying them at the call to get_memento() rather than around the actual requests that method makes).

  • Add a more restrictive lock to WaybackSession to handle 429s. The idea here is that a 429 response would lock the session for a little while, and maybe subsequent 429s would lock the session for longer if they have a Retry-After header (IIRC, we do not currently obey Retry-After because we discovered they were frequently a lie and excessively long, but it’s possible that’s no longer true). A rough sketch of this lock also follows the list below.

    Note some care needs to be taken here to respect timeouts. If the retry is going to wait for too long, we may just want to abandon early. Or make timeouts predictable by waiting until the timeout and then raising. Dunno. (We may also want to add another value to the tuple version of timeouts that controls the overall timeout, rather than just the connect and read timeouts.)
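On the burst-ability point above, a token-bucket style limiter is one way to model a per-minute budget rather than a fixed per-request interval (a sketch only; the class name is hypothetical, not the library's API):

import threading
import time

class BurstableRateLimit:  # hypothetical
    """Token bucket: up to `burst` calls can go out back to back, as long as
    the long-run average stays at `calls_per_second` (e.g. 60 over a minute)."""

    def __init__(self, calls_per_second, burst):
        self._rate = calls_per_second
        self._capacity = burst
        self._tokens = float(burst)
        self._updated = time.monotonic()
        self._lock = threading.Lock()

    def acquire(self):
        with self._lock:
            now = time.monotonic()
            # Refill tokens for the elapsed time, capped at the burst size.
            self._tokens = min(self._capacity,
                               self._tokens + (now - self._updated) * self._rate)
            self._updated = now
            if self._tokens < 1:
                # Not enough budget: sleep until one token has accrued.
                time.sleep((1 - self._tokens) / self._rate)
                self._tokens = 0.0
                self._updated = time.monotonic()
            else:
                self._tokens -= 1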
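And for the 429 lock idea, a minimal sketch of a shared penalty window (again, hypothetical names, not the library's API):

import threading
import time

class RateLimitBackoff:  # hypothetical
    """After a 429, every caller sharing this object waits out the penalty."""

    def __init__(self):
        self._lock = threading.Lock()
        self._blocked_until = 0.0

    def wait(self):
        # Call before sending a request.
        with self._lock:
            delay = self._blocked_until - time.monotonic()
        if delay > 0:
            time.sleep(delay)

    def note_429(self, retry_after=60.0):
        # Call when a 429 comes back; a (trusted) Retry-After extends the window.
        with self._lock:
            self._blocked_until = max(self._blocked_until,
                                      time.monotonic() + retry_after)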

@Mr0grog
Member

Mr0grog commented Nov 28, 2023

@edsu while doing some final cleanup on the patch release for this (sorry for the delays; I’ve been a bit distracted lately), I was reading the docstring for RateLimitError and it certainly sounds like I had intended to raise on any 429 response instead of just when we’ve exceeded the maximum number of retries like we currently do. Thinking about it now, it feels like that might be more correct, too.

Raising instead of retrying seemed like too big a change for v0.4.4, so for that release I just made the retry delay longer (see #142). But I’m thinking about making it always raise, and never retry, on responses with status code 429 in v0.5.0.

As a user of this tool, do you have any thoughts on that? Would you prefer the current behavior (retrying but with a very long delay) instead? (Another option might be to make this configurable, but I’d prefer to avoid that complexity and just choose one behavior or the other.)

@edsu
Contributor Author

edsu commented Nov 28, 2023

Yes, I like raising an exception when a 429 is encountered. I think it will help wayback users avoid a situation where they are getting blocked?
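For example, with raise-on-429 the calling code would own the pause, roughly like this (a sketch; it assumes RateLimitError stays importable from wayback.exceptions and that a one-minute wait is enough, per the guidance quoted above):

import time
from wayback import WaybackClient
from wayback.exceptions import RateLimitError

client = WaybackClient()
try:
    records = list(client.search('example.com'))
except RateLimitError:
    # The server told us to slow down: back off for a full minute, then retry once.
    time.sleep(60)
    records = list(client.search('example.com'))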

@Mr0grog
Member

Mr0grog commented Nov 28, 2023

Great, that’s about what I was thinking as well. (I think the reason it was being handled here in the first place was that, at EDGI, we used to use this as part of a nightly bulk processing script that took a few hours to run, and having it automatically handle rate limits would have been useful there since you probably don’t want a job that big to stop in the middle.)

Mr0grog added a commit that referenced this issue Dec 4, 2023
We used to retry requests that got a response with a 429 status code (indicating that you've hit a rate limit on the server) automatically, but we've resolved that this is no longer a good approach. See the discussion in #137 (comment) for more detail, but the short version is that our previous retry behavior was specifically geared to a workflow EDGI used, which had custom rate limits and other server behavior that most users won't have, and it also helped to work around a bunch of deficiencies in our client-side rate limiting that were solved in the last release (but do need a *bit* more work in this release, too). It turns out this led to some users getting blocked, and it's better to leave handling this situation to user code.
@Mr0grog Mr0grog added this to the v0.5.0 milestone Dec 4, 2023
Mr0grog added a commit that referenced this issue Dec 6, 2023
We used to retry requests that got a response with a 429 status code (indicating that you've hit a rate limit on the server) automatically, but we've resolved that this is no longer a good approach. See the discussion in #137 (comment) for more detail.
@Mr0grog
Member

Mr0grog commented Dec 6, 2023

Where we are on 429 responses:

  • v0.4.4 adds an automatic one-minute pause before retrying (in cases where these responses were already being retried).
  • v0.5.0 (the next release) will always raise and never retry when receiving 429 responses. The code for this is already merged in.

In terms of refactoring rate limits:

I suspect (2) may present pretty hard-to-address logistical hurdles, so either (1) or (3) are better (at least until we get rid of requests…

Another alternative here I hadn’t thought through before is a bigger refactor — have separate sessions for endpoints/URL patterns that have different rate limits (so one WaybackClient might hold multiple WaybackSession objects). This could be a better way to organize things, but I am somewhat concerned that it separates things too much. I suspect we may want a single connection pool across all requests to the https://web.archive.org origin, even though we may have different rate limits per path. I’ve posted on the Internet Archive Slack to see if I can get some more official, up-to-date details on this stuff.

Also, on reflection, I think option (1) (WaybackSession automatically picks the right rate limit to apply inside send()) is the only currently practical option until we drop our dependency on requests (see #58). Rate limits need to apply to retries and therefore have to be applied inside send(), so (3) is out, and (2) is indeed impractical to implement correctly on top of requests.Session.

(All that said, I plan to try and do #58 this month, too, which will make the above moot. BUT I expect that issue to be bigger and more complex than it sounds, and don’t want to force the changes here to wait for that to be done first.)
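As an illustration of option (1), the prefix matching inside send() could look roughly like this (a sketch built on requests.Session, using the lock-like RateLimit idea above for the limit objects; the class name is hypothetical and this is not the actual implementation):

from urllib.parse import urlparse
import requests

class ThrottledSession(requests.Session):  # hypothetical stand-in for WaybackSession
    def __init__(self, rate_limits):
        super().__init__()
        # e.g. {'/cdx': RateLimit(0.8), '/web/timemap': RateLimit(1.33), '/': RateLimit(8)};
        # a '/' catch-all entry is assumed so every URL matches something.
        self.rate_limits = rate_limits

    def send(self, request, **kwargs):
        path = urlparse(request.url).path
        # Longest matching prefix wins, so '/cdx' beats the catch-all '/'.
        prefix = max((p for p in self.rate_limits if path.startswith(p)), key=len)
        with self.rate_limits[prefix]:
            return super().send(request, **kwargs)

Because this happens inside send(), retries and redirects that go through the same method get throttled too.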

@Mr0grog
Member

Mr0grog commented Dec 7, 2023

OK, got some more details from the inside. Limits are calculated over a 5-minute period. Hard limits are:

  • /cdx/*: 60/min
  • /web/timemap/*: 100/min (but this one has been more malleable than others in the past)
  • /web/NNNNNNNN*/*: 600/min

But they’d like us to set the defaults at 80% of that in the general case. So we should use:

  • /cdx/*: 48/min
  • /web/timemap/*: 80/min (but this one has been more malleable than others in the past)
  • /web/NNNNNNNN*/*: 480/min

429 responses should stop everything for 60 seconds. We now raise a RateLimitError exception here, so I’m not worried too much about this, but I do wonder if we should have a way to lock the rate limit object for that long when raising RateLimitError, so all threads using the same rate limiters get stopped. (Alternatively we leave it up to users to do any thread coordination if they get a RateLimitError.)
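For reference, converting those 80% budgets into the per-second values a session default would use (the constant name below is hypothetical; the numbers match the RateLimit defaults quoted later in this thread):

DEFAULT_RATE_LIMITS = {        # hypothetical constant; values are calls per second
    '/cdx/': 48 / 60,          # 0.8
    '/web/timemap/': 80 / 60,  # ~1.33
    '/web/': 480 / 60,         # 8.0
}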

@Mr0grog
Member

Mr0grog commented Dec 7, 2023

This also leaves me wondering if I should change the API so that instead of naming each rate limit:

WaybackSession(
  search_calls_per_second=RateLimit(1),
  memento_calls_per_second=RateLimit(8)
)

We just have a dict of URL prefixes and rate limits:

WaybackSession(rate_limits={
  '/cdx/': RateLimit(1),
  '/web/': RateLimit(8),
  '/': RateLimit(10),
})

This is probably a little less clear for users (now you have to know what URLs the various services live at if you want to make adjustments or understand how the rate limits apply to calls to get_memento() or search()), but it is more direct and flexible.

Otherwise, I think I may at least deprecate the old argument names and change them to the less verbose:

WaybackSession(
  rate_search=RateLimit(1),
  rate_memento=RateLimit(8)
)

@edsu
Contributor Author

edsu commented Dec 7, 2023

I like the route-specific names. It should make it easier to connect the limit with the endpoint that is being used?

@Mr0grog
Member

Mr0grog commented Dec 7, 2023

Thanks! 🙏 I wasn’t sure whether it was better to clarify what client method is getting what limits, or what actual endpoints are getting what limits. If you are more concerned about the latter, that is definitely a good reason to change it up.

@Mr0grog
Member

Mr0grog commented Dec 7, 2023

I suppose the other complicated thing here is that we need to preserve the defaults for different services, so whatever you pass in will be treated like:

{
    '/web/timemap': DEFAULT_RATE_LIMIT_TIMEMAP,  # shared instance of RateLimit(1.33)
    '/cdx': DEFAULT_RATE_LIMIT_CDX,              # shared instance of RateLimit(0.8)
    '/': DEFAULT_RATE_LIMIT,                     # shared instance of RateLimit(8)
} | rate_limits

So you get this behavior (mostly straightforward, but maybe a bit weird):

WaybackSession(rate_limits = { '/': 10 })
# Resulting limits:
# {
#     '/web/timemap': DEFAULT_RATE_LIMIT_TIMEMAP,
#     '/cdx': DEFAULT_RATE_LIMIT_CDX,
#     '/': 10,
# }

But you get this more complicated behavior if you make a typo:

WaybackSession(rate_limits = { '/cdx/': 10 })
# Resulting limits:
# {
#     '/web/timemap': DEFAULT_RATE_LIMIT_TIMEMAP,
#     '/cdx/': 10,
#     '/cdx': DEFAULT_RATE_LIMIT_CDX,
#     '/': DEFAULT_RATE_LIMIT,
# }

(I guess we could also restrict which prefixes can be given limits and raise an error if you try to set a disallowed one, but that kills some of the flexibility here, and I feel like the nice names are less error-prone and more statically checkable and editor-assistable/tab-completable at that point.)

@Mr0grog
Member

Mr0grog commented Dec 12, 2023

Thinking about the above some more, I don’t think I’m going to make any changes to the WaybackSession arguments right now. Providing a dictionary with paths might seem more straightforward in the simplest cases, but it makes customizing some rates while leaving other defaults in place more complicated. More importantly, it’s harder to make non-breaking tweaks or changes in the library — the default paths cease to be tweakable and I’m not confident we won’t want/need to tweak them (we could do things like provide constants for the paths, but that is both error-prone and unnecessarily complicated for users in my experience).

There’s definitely room for rethinking this going forward but I’m not comfortable with including it in the other, more immediate work around rate limiting that this issue is focused on.

Mr0grog added a commit that referenced this issue Dec 13, 2023
Our approach to rate limiting has been in need of major refactoring, and this rearranges and fixes a bunch of things:

- Replace our previous `rate_limited(calls_per_second, group)` function with a `RateLimit` class. The previous implementation led to different rate limits that were intermingled in non-obvious ways. The new class gives us a more straightforward object that models a single limit, independent of any other limits, but which can be shared as needed.

- Update default rate limits based on consultation with Internet Archive staff. These limits are based on what they currently use as standardized limits, but backed off to 80% of the hard limit (which they requested).

- Apply rate limits directly where the requests are made (in `WaybackSession.send()`), so they reliably limit requests to the desired rate. They were previously applied in `WaybackClient` methods, which meant they didn’t account correctly for retries or redirects.

- Some other minor refactorings came along for the ride, as well as starting to do type annotations.

Fixes #137.
@Mr0grog Mr0grog moved this to Backlog in Wayback Roadmap Dec 13, 2023
@Mr0grog Mr0grog moved this from Backlog to Unreleased in Wayback Roadmap Dec 13, 2023
Mr0grog added a commit that referenced this issue Dec 16, 2023
Direct usage of `setup.py` (that is, calling `python setup.py some_command` to do builds and such) was deprecated and has not been supported for quite a while! This moves everything to `pyproject.toml` (the new format for declaring package metadata). It *seems* like `pyproject.toml` supports all the nice things now, so we can also drop `setup.cfg`.

(This seems like a good time to do this, since there are some other semi-breaking changes on tap, like [rethinking how rate limit errors are handled](#137 (comment)).)


---------

Co-authored-by: Dan Allan <[email protected]>