Implement CDX search based on newer timemap CDX API #8
Comments
Other notes I have discovered in edgi-govdata-archiving/web-monitoring-processing#174: this new API doesn’t support |
Update: since the above conversation happened, Wayback folks have started gently pushing us to more actively use the newer services, like timemap CDX and SPN2. So I think the answer to this issue is probably “yes we should” now. |
Since this is still beta-ish, we should probably implement this alongside the old |
I’ve been holding off on this since @danielballan is in the middle of splitting off this code into http://github.com/edgi-govdata-archiving/wayback. It should be done, but in that new repo whenever it’s ready. |
Note to selves: once this is closed, it might be kind to state in the release notes how to migrate wayback v0.1 code to whatever API we settle on for timemap, if doing so is not too much trouble. |
FWIW, I think the API (from a user of this package’s perspective) would be the same. The Timemap CDX API (which, to be clear, is not the timemap API, which is a whole other thing!):
|
Ah, I was conflating the Timemap CDX API with the timemap API. I have half-absorbed the fact that they are different things, but I got confused here. Which one did wayback v0.1 implement? |
Wayback v0.1 implemented the Timemap API (not Timemap CDX, which isn’t really its name, but it doesn’t have one, and ¯\_(ツ)_/¯). If helpful (since Wayback APIs are a half-documented, scattered situation): The CDX API, which lets you search through a CDX-based index (and returns a subset of fields from each matching CDX record), is at The “Timemap CDX” API is the same thing, but uses different code and (I think?) a separate CDX index, is at (I call it “Timemap CDX” because of the URL. I have also heard “new CDX,” “beta CDX,” “CDX v2,” etc.) The Timemap API is part of the Memento protocol (guide, RFC, Wayback-specific “docs”), which is a semi-standard agreed to by lots of archives. It doesn’t allow searching (it just lists mementos for a given URL), and lists results in HTTP (There is supposed to be an official JSON format, but I don’t know how to get it from Wayback.) |
I kind of feel like Timemap may be redundant when you have CDX available (since you can always search CDX for an “exact” [really SURT, not exact] URL match). But it’s possible timemap may be more optimized. |
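To illustrate why an “exact” CDX match is really a SURT match, here is a rough sketch of a SURT-style key. The function name `simple_surt` is hypothetical, and this is a simplified approximation written for illustration, not the canonicalization the Wayback Machine actually runs (which handles many more cases).

```python
from urllib.parse import urlsplit


def simple_surt(url):
    """Return a simplified SURT-style key for a URL (illustrative only)."""
    # Assume http:// when no scheme is given; the scheme is dropped from the key.
    parts = urlsplit(url if "://" in url else "http://" + url)
    host = (parts.hostname or "").lower()
    # Assumption: a leading "www." is ignored when building the key.
    if host.startswith("www."):
        host = host[4:]
    # Reverse the host labels so the key sorts by domain hierarchy.
    key = ",".join(reversed(host.split("."))) + ")" + (parts.path or "/")
    if parts.query:
        key += "?" + parts.query
    return key


# Different spellings of the same page collapse to one key.
print(simple_surt("http://www.EPA.gov/climate"))
print(simple_surt("https://epa.gov/climate"))
```

Under these assumptions both calls produce the same key, which is why a search for one spelling of a URL can return captures recorded under another.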
Also, the best documentation link I know of is here: https://archive.readme.io/docs It’s mostly links to other docs, but at least it lists most of the APIs. (Not sure how much it’s kept up to date, though. 🙁) |
Some updates here from recent conversations:
So I think we probably need to ultimately have 3 methods for CDX search (these names are strawman proposals, they probably aren’t great):
I’m also thinking we might want to rename That renaming might be out of scope here, though. |
This adds support for the Internet Archive's new, beta CDX search endpoint at `/web/timemap/cdx`. It deals with pagination much better and is eventually slated to replace the search currently at `/cdx/search/cdx`, but is a little slower and still being tested. This commit is a start, but we still need to do more detailed testing and talk more with the Wayback Machine team about things that are unclear here. I'm also not sure if `filter`, `collapse`, `resolveRevisits`, etc. are actually supported. Fixes #8.
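As a minimal sketch of what switching between the two endpoints could look like, the helper below only builds the request URL and makes no network call. The name `build_search_url` and the `use_timemap_cdx` flag are hypothetical, not part of this package, and it is an assumption that the beta endpoint accepts the same `url` and `limit` query parameters as the classic one.

```python
from urllib.parse import urlencode

WAYBACK_HOST = "https://web.archive.org"
CLASSIC_CDX_PATH = "/cdx/search/cdx"    # current search endpoint
TIMEMAP_CDX_PATH = "/web/timemap/cdx"   # new, beta search endpoint


def build_search_url(url, use_timemap_cdx=False, **params):
    """Build a CDX search URL for either endpoint (hypothetical helper)."""
    path = TIMEMAP_CDX_PATH if use_timemap_cdx else CLASSIC_CDX_PATH
    # Assumption: both endpoints take the target URL as the `url` parameter.
    query = urlencode({"url": url, **params})
    return f"{WAYBACK_HOST}{path}?{query}"


print(build_search_url("epa.gov/", limit=5))
print(build_search_url("epa.gov/", use_timemap_cdx=True, limit=5))
```

Keeping the endpoint choice behind a flag would make it easy to compare results from both services while the beta one is still being tested.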
Circling back on the naming issue here, my current feeling is that the name should involve |
From a conversation on the Internet Archive’s Research Slack today:
We need to look into whether we should switch to `/web/timemap/cdx`.