Add a default `limit` to `WaybackClient.search()` #65
Update: this behavior is as designed, so we should add a default `limit`. However, just setting an extremely high default turns out to be problematic, so I'm now re-thinking that approach. It turns out the query is not maxing out at a given number of results; it maxes out at a given number of blocks searched. That is, the CDX index is broken up into sections called blocks, and it stops after looking through N blocks, regardless of how many results were found. That means there's no guaranteed safe value for `limit`. Side note: the above is not an issue with block-based pagination (using the `page` parameter).
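To make that concrete, here is a rough, hypothetical probe (not from the original discussion; the query URL and limit values are placeholders) that hits the public CDX API with a few different limits and checks whether a resume key comes back. Per the CDX server docs, when `showResumeKey=true` the key appears after a blank line at the end of the response:

```python
import requests

CDX = 'http://web.archive.org/cdx/search/cdx'

# Probe the block-scanning behavior described above: the same query can come
# back with or without a resume key depending on which `limit` you pick, and
# none of these values is guaranteed to be "safe".
for limit in (1_000, 100_000, 500_000):
    resp = requests.get(CDX, params={
        'url': 'example.com',       # placeholder query, not from the thread
        'matchType': 'prefix',
        'showResumeKey': 'true',
        'limit': limit,
    })
    lines = resp.text.splitlines()
    has_key = len(lines) >= 2 and lines[-2] == ''
    print(f'limit={limit}: {len(lines)} lines, resume key present: {has_key}')
```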
Thanks for all this research & documentation @Mr0grog -- it is very much appreciated! Am I correct in saying that the new timemap-cdx API doesn't support `matchType`?
If so, that's news to me! Were you trying `matchType=prefix`?
Yes, I did try it but it threw an error.
Hmmmmm, ok, so given the following query:
Is that consistent with what you're seeing? I'll ping Wayback folks about support for different match types.
Oh, I was using an incorrect URL for some reason: http://web.archive.org/web/timemap/cdx?matchType=prefix/https://inkdroid.org/ (now that I look at the error, it was saying the url parameter was missing). I just tried this instead: http://web.archive.org/web/timemap/cdx?matchType=prefix&url=twitter.com/realDonaldTrump/status/ which seems to work for a bit, but then curl closes after a seemingly random amount of time with an error.
Over on the Wayback package, we discovered a huge bug where timeouts are not actually applied (edgi-govdata-archiving/wayback#66). The fix for that should be merged in v0.3.1 some time in the next week, but we really need to address this now, so I've added a patch here until that release is done. This also adds a limit to the CDX search calls, since we've recently discovered that's important (edgi-govdata-archiving/wayback#65). Finally, this bumps up the timeout on Mementos. We might need to tweak this as we go, but it seems reasonable to raise it, since it hasn't actually been applied at all lately; we might have it set *way* too short.
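For anyone wiring this up, here's a minimal sketch of what a limited search call looks like, assuming the `limit` keyword and the record attributes shown (the URL and limit value are placeholders, not this project's real configuration):

```python
from wayback import WaybackClient

client = WaybackClient()

# Always pass an explicit limit so resume-key pagination actually works
# (see edgi-govdata-archiving/wayback#65). The value is a judgment call:
# too low and sparse queries may return nothing; too high and the server
# may ignore it.
CDX_LIMIT = 1000  # placeholder value; tune per expected capture density

for record in client.search('epa.gov/climatechange', limit=CDX_LIMIT):
    print(record.timestamp, record.url)
```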
Some updates here:
Re: the new CDX search at `/web/timemap/cdx`:
So I think we probably need to ultimately have 3 methods for CDX search (these names are strawman proposals, they probably aren’t great):
I'm also thinking we might want to rename `search()`.
It turns out that the pagination strategy we use with Wayback's CDX search does not work correctly if you don't set a `limit`, so this change sets one by default. However, there is no value that's always safe, so this needs to be used with some care. A limit that's too low may mean you never get any results even when you should, and a limit that's too high may get ignored (or cause other unknown issues). The right values really depend on how frequently you expect the Internet Archive to have captured versions of the URLs you are looking for. :( Fixes #65.
The signature for `search()` was unwieldy, complicated, and included a bunch of parameters that did nothing or that might cause broken behavior and should never be used (e.g. `page` and `pageSize`). I've tried to clean up most of those issues here:

- Adds a default value for `limit` in order to fix a very bad footgun. (See #65)
- Adds a lot more detail to the docs, explaining special formatting for `url` and complex considerations for `limit`.
- Removes non-functional or breakage-prone parameters that were only included because they were part of the HTTP API. In some cases, they are things this library does automatically for you and aren't useful to adjust (e.g. `gzip`); in others you could break things (e.g. `page` and `pageSize`); and some are implementation details that users shouldn't be bothered with (e.g. `resumeKey`, `previous_result`).
- Removes the ability to specify arbitrary extra keyword parameters to be passed directly to the API (there are so many ways to break things here; I argued for this originally so we didn't have to maintain as much, but it's just not good).
- Makes all parameters use snake_case.

Internally, the only real change is that this is now a loop instead of a recursive call (see the sketch below). This was required in order to not expose internal details as parameters, but is also probably better for call stack and memory management on large queries.

Fixes #65.
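As an illustration of that last point, here's a minimal, hypothetical sketch of the loop-shaped pagination. This is not the library's actual code: `fetch_page` is a stand-in for the real HTTP request, and the default value is made up.

```python
DEFAULT_LIMIT = 1000  # made-up default, purely for illustration

def search_cdx(fetch_page, url, limit=DEFAULT_LIMIT):
    """Yield raw CDX lines for `url`, following resume keys in a loop.

    `fetch_page(params) -> (lines, resume_key)` is a hypothetical stand-in
    for the real HTTP request; it is not part of the wayback library.
    """
    resume_key = None
    previous_line = None
    while True:
        params = {'url': url, 'limit': limit, 'showResumeKey': 'true'}
        if resume_key:
            params['resumeKey'] = resume_key
        lines, resume_key = fetch_page(params)
        for line in lines:
            # Guard against the server repeating the last record of the
            # previous page across a resume boundary.
            if line != previous_line:
                yield line
            previous_line = line
        if not resume_key:
            return
```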
Search pagination does not always function correctly if the `limit` parameter is not set, so it might be a good idea to set a [high] default value. For example, @edsu ran into this with a query like the twitter.com/realDonaldTrump/status/ prefix search above: there's no resume key in the response. But there are definitely more records than are returned: if you add `limit=100000` in there, you get a resume key, and can page through results, eventually getting far more than were returned in the query without a limit.

I thought we tested this when originally writing the `search` code, so I'm not sure if something has changed/broken or if I'm just misremembering. Either way, the result is really unintuitive, so it might be a good idea to add a large default for `limit` instead of `None` (I'm thinking `500_000`). The docs could potentially use some clarification here, too.

I'm also trying to get some insight from the Wayback team about whether this behavior is intended or a bug (in which case maybe no change is needed).
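A quick, hypothetical way to see the symptom (the query parameters mirror the prefix search discussed earlier, and the parsing assumes the CDX server's documented `showResumeKey` response format):

```python
import requests

CDX = 'http://web.archive.org/cdx/search/cdx'

def resume_key_for(params):
    """Return the resume key from a CDX response, or None if absent."""
    lines = requests.get(CDX, params=params).text.splitlines()
    # With showResumeKey=true, the key follows a blank line at the end.
    return lines[-1] if len(lines) >= 2 and lines[-2] == '' else None

base = {'url': 'twitter.com/realDonaldTrump/status/',
        'matchType': 'prefix',
        'showResumeKey': 'true'}

# Without a limit: no resume key, so it *looks* like you got everything.
print(resume_key_for(base))
# With an explicit limit: a resume key shows up and paging works.
print(resume_key_for({**base, 'limit': 100_000}))
```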