Add a default limit to WaybackClient.search() #65

Closed · Mr0grog opened this issue Jan 11, 2021 · 7 comments · Fixed by #94
Labels: enhancement (New feature or request)

Comments

Mr0grog (Member) commented Jan 11, 2021

Search pagination does not always function correctly if the limit parameter is not set, so it might be a good idea to set a [high] default value.

For example, @edsu ran into this with a query like:

# Equivalent to:
# http://web.archive.org/cdx/search/cdx?matchType=prefix&showResumeKey=true&url=twitter.com/realDonaldTrump/status/
client.search('twitter.com/realDonaldTrump/status/', matchType='prefix')

There’s no resume key in the response. But there are definitely more records than are returned: if you add limit=100000 in there, you get a resume key, and can page through results, eventually getting far more than were returned in the query without limit.

# Equivalent to:
# http://web.archive.org/cdx/search/cdx?matchType=prefix&showResumeKey=true&limit=100000&url=twitter.com/realDonaldTrump/status/
client.search('twitter.com/realDonaldTrump/status/', matchType='prefix', limit=100_000)

I thought we tested this when originally writing the search code, so I’m not sure if something has changed/broken or if I’m just misremembering. Either way, the result is really unintuitive, so it might be a good idea to add a large default for limit instead of None (I’m thinking 500_000). The docs could potentially use some clarification here, too.
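To see the difference end to end, here's a rough sketch of the comparison (an assumption-laden example: it presumes, as the snippets above suggest, that WaybackClient.search() yields one record per capture and follows resume keys internally when the server provides them):

from wayback import WaybackClient

client = WaybackClient()
url = 'twitter.com/realDonaldTrump/status/'

# Without a limit, the server may stop scanning before it emits a resume
# key, so iteration ends early even though more matching records exist.
without_limit = sum(1 for _ in client.search(url, matchType='prefix'))

# With an explicit limit, the server returns a resume key and the client
# keeps paging, so the total ends up far higher.
with_limit = sum(1 for _ in client.search(url, matchType='prefix',
                                          limit=100_000))

print(f'without limit: {without_limit:,}')
print(f'with limit=100_000: {with_limit:,}')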

I’m also trying to get some insight from the Wayback team about whether this behavior is intended or a bug (in which case maybe no change is needed).

Mr0grog added the enhancement label Jan 11, 2021
Mr0grog (Member, Author) commented Jan 12, 2021

Update: this behavior is as designed, so we should add a default limit.

However, just setting an extremely high default turns out to be problematic, so I’m now thinking 100_000. (And we definitely need to explain this in the docs because there’s actually no guaranteed safe value.)

It turns out the query is not maxing out at a given number of results — it maxes out at a given number of blocks searched. That is, the CDX index is broken up into sections called blocks, and it stops after looking through N blocks, regardless of how many results were found. That means there’s no guaranteed safe value for limit. For example, if the max block count was 5, and you issued a query where the first result was in block 6, even setting limit=1 would not get you any results or resume keys for pagination.
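To make the mechanics concrete, here's a rough sketch of resume-key paging against the raw endpoint (not this library). It assumes the classic plain-text response format, where a blank line separates the result lines from the resume key:

import requests

CDX_URL = 'http://web.archive.org/cdx/search/cdx'

def search_pages(url, **params):
    """Yield raw CDX result lines, following resume keys across requests."""
    params = {'url': url, 'showResumeKey': 'true', 'limit': 100_000, **params}
    while True:
        lines = requests.get(CDX_URL, params=params).text.splitlines()
        if '' in lines:
            # A blank line marks the end of the results; the resume key
            # follows it. Pass it back as resumeKey to get the next page.
            split = lines.index('')
            yield from lines[:split]
            params['resumeKey'] = lines[split + 1]
        else:
            # No resume key: either this was the last page, or the block
            # scan limit was hit before any result was found (the failure
            # mode described above), in which case this is empty.
            yield from lines
            return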

Side note: The above is not an issue with block-based pagination (using the page and pageSize parameters instead of showResumeKey and limit), but block-based pagination does not actually search all the various parts of the CDX index (most notably, it misses the most recent results). That’s why we don’t use that pagination method. The newer web/timemap/cdx endpoint (see #8) only supports block-based pagination, and does include all results, so this will be a non-issue when we implement that.

edsu (Contributor) commented Jan 13, 2021

Thanks for all this research & documentation @Mr0grog -- it is very much appreciated! Am I correct in saying that the new timemap CDX API doesn't support matchType=prefix?

Mr0grog (Member, Author) commented Jan 14, 2021

If so, that’s news to me! Were you trying matchType=prefix out and it didn’t appear to work? I’ve only done minimal exploration with the timemap CDX API.

edsu (Contributor) commented Jan 14, 2021

Yes, I did try it but it threw a 400 BAD REQUEST. So if/when the old CDX API is turned off there will be no way to scan for snapshots.

Mr0grog (Member, Author) commented Jan 14, 2021

Hmmmmm, ok, so given the following query:

http://web.archive.org/web/timemap/cdx?url=twitter.com/realDonaldTrump/status/&pageSize=5&page=0
  • By itself, it works fine.
  • If I add &matchType=exact it works fine. (Makes sense since this should be effectively the same as above.)
  • If I add &matchType=host or &matchType=domain I get a 500 error.
  • If I add &matchType=prefix or add * to the end of the URL (which should be interpreted the same), I get a 500 error.
  • If I change page to a higher number I get a 400 error with the header X-Archive-Wayback-Runtime-Error: page must be smaller than numpages, which makes sense. (Note pagination for this starts at page 0, not page 1.)

Is that consistent with what you’re seeing? I’ll ping Wayback folks about support for different match types.
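For anyone comparing notes, those checks can be reproduced with a short sketch like this (using requests; the URL and parameters are the ones from the query above):

import requests

TIMEMAP_CDX = 'http://web.archive.org/web/timemap/cdx'
common = {'url': 'twitter.com/realDonaldTrump/status/', 'pageSize': 5, 'page': 0}

for extra in [{}, {'matchType': 'exact'}, {'matchType': 'host'},
              {'matchType': 'domain'}, {'matchType': 'prefix'}]:
    response = requests.get(TIMEMAP_CDX, params={**common, **extra})
    print(f'{extra or "no matchType"}: HTTP {response.status_code}')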

edsu (Contributor) commented Jan 14, 2021

Oh, I was using an incorrect URL for some reason:

http://web.archive.org/web/timemap/cdx?matchType=prefix/https://inkdroid.org/

Which, now that I look at the error, was saying it was missing the url parameter.

I just tried this instead:

http://web.archive.org/web/timemap/cdx?matchType=prefix&url=twitter.com/realDonaldTrump/status/

which seems to work for a bit, but then curl closes after a seemingly random amount of time with:

curl: (18) transfer closed with outstanding read data remaining

Mr0grog added a commit to edgi-govdata-archiving/web-monitoring-processing that referenced this issue Mar 17, 2021
Over on the Wayback package, we discovered a huge bug where timeouts are not actually applied. (edgi-govdata-archiving/wayback#66) The fix for that should be merged in v0.3.1 some time in the next week, but we really need to address this now, so I've added a patch here until that release is done.

This also adds a limit to the CDX search calls, since we've recently discovered that's important (edgi-govdata-archiving/wayback#65).

Finally, this bumps up the timeout on Mementos. We might need to tweak this as we go, but it seems reasonable to raise it since it hasn't actually been applied at all lately. We might have it set *way* too short.
Mr0grog (Member, Author) commented Oct 26, 2022

Some updates here:

  • I’m going to do this (adding a default limit) shortly.
  • I think sticking with showResumeKey for pagination on the old endpoint is correct, even though it is problematic in so many ways. There are multiple indexes, and page+pageSize only searches the main index, which at the time of this writing is typically about two months behind. showResumeKey includes all indexes.
  • The search() method needs a major overhaul. Its signature exposes all kinds of internals and non-useful arguments (e.g. page and pageSize shouldn’t be there since we don’t use them for pagination and they could really mess with results, while resumeKey and showResumeKey shouldn’t really be something an end user is bothered with). This can be simpler and better.

Re: the new CDX search at /web/timemap/cdx

  • It turns out these issues with pagination in the old search were one of the major drivers for building the new one.
  • We obviously really need to implement support for the new CDX search (Implement CDX search based on newer timemap CDX API #8), if only for that reason.
  • The new search supports limit, but not showResumeKey, and doesn’t do weird stuff with limit.
  • The new search only paginates with page + pageSize (which are still about blocks; size does not refer to a number of results), is reliable, and includes all the indexes (so it's up to date; see the sketch after this list).
  • BUT if you use a non-exact search (i.e. matchType=prefix|host|domain or you use an * in the URL), it does not include the index for recent SavePageNow captures. It takes roughly 3 days for things in that index to make it into other indexes that do support those queries. So there are still caveats here, but they are simpler to explain and are actually pretty predictable (the out-of-date issue is only a few days, not a few months).
  • archive.org is doing a slow transition to the new search, using it for some things under the hood to test it out.
  • Eventually (no concrete timeline yet) the old search will be replaced with the new one.
  • The new search includes extra fields (length, offset, WARC filename) that they expect to remove when replacing the old system, so we should not expect them to always be present.
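For reference, block-based paging on the new endpoint can be sketched like this (a hypothetical helper; it leans only on the behavior noted above: pages start at 0, and asking for a page past the end returns a 400):

import requests

TIMEMAP_CDX = 'http://web.archive.org/web/timemap/cdx'

def iter_timemap_cdx(url, page_size=1000, **params):
    """Yield raw CDX result lines from /web/timemap/cdx, page by page."""
    page = 0  # pagination starts at page 0, not page 1
    while True:
        response = requests.get(TIMEMAP_CDX, params={
            'url': url, 'pageSize': page_size, 'page': page, **params})
        if response.status_code == 400:
            # "page must be smaller than numpages": every page has been read.
            return
        response.raise_for_status()
        yield from response.text.splitlines()
        page += 1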

So I think we probably need to ultimately have 3 methods for CDX search (these names are strawman proposals; they probably aren't great):

  1. search_v1() uses /cdx/search/cdx and paginates via showResumeKey.
  2. search_v2() uses /web/timemap/cdx and paginates via page + pageSize.
  3. search() just forwards to one of those implementations (see the sketch below).
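As a strawman of the forwarding shape (method names are the placeholders from the list above; bodies elided):

class WaybackClient:
    def search_v1(self, url, **options):
        """Search via /cdx/search/cdx, paging with showResumeKey."""
        ...

    def search_v2(self, url, **options):
        """Search via /web/timemap/cdx, paging with page + pageSize."""
        ...

    def search(self, url, **options):
        # Forward to whichever implementation is the current default;
        # flip this to search_v2 once #8 is implemented.
        return self.search_v1(url, **options)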

I’m also thinking we might want to rename the search*() methods to listMementos() or listCaptures() or something, since the Internet Archive now has an actual free-text search for Wayback (e.g. https://web.archive.org/web/*/environment, which is powered by https://web.archive.org/__wb/search/anchor?q=<text>, but also some endpoints at https://be-api.us.archive.org/ia-pub-fts-api, /services/search/v1/scrape, and /advancedsearch.php, none of which I know enough about to compare pros and cons).

Mr0grog added a commit that referenced this issue Oct 26, 2022
It turns out that the pagination strategy we use with Wayback's CDX search does not work correctly if you don't set a `limit`, so this change sets one by default. However, there is no value that's always safe, so this needs to be used with some care. A limit that's too low may mean you never get any results even when you should, and a limit that's too high may get ignored (or cause other unknown issues). The right values really depend on how frequently you expect the Internet Archive to have captured versions of the URLs you are looking for. :(

Fixes #65.
Mr0grog added a commit that referenced this issue Oct 27, 2022
The signature for `search()` was unwieldy, complicated, and included a bunch of parameters that did nothing or that might cause broken behavior and should never be used (e.g. `page` and `pageSize`). I've tried to clean up most of those issues here:

- Adds a default value for `limit` in order to fix a very bad footgun. (See #65)
- Adds a lot more detail to the docs, explains special formatting for `url` and complex considerations for `limit`.
- Removes non-functional or breakage-prone parameters that were only included because they were part of the HTTP API. In some cases, they are things this library does automatically for you and aren’t useful to adjust (e.g. `gzip`), in others you could break things (e.g. `page` and `pageSize`), and some are implementation details that users shouldn’t be bothered with (e.g. `resumeKey`, `previous_result`).
- Removes the ability to specify arbitrary extra keyword parameters to be passed directly to the API (there are so many ways to break things here; I argued for this originally so we didn’t have to maintain as much, but it’s just not good).
- Makes all parameters use snake_case.

Internally, the only real change is that this is now a loop instead of a recursive call. This was required in order to not expose internal details as parameters, but is also probably better for call stack and memory management on large queries.

Fixes #65.