An examples page would be useful #5
I'll add some examples in a comment here. I'm not sure at the moment what the examples in the documentation should look like or how specific they should be. I also might not be able to see which parts look more difficult, so if you think there's a part that isn't explained very well, it would be helpful to let me know.
Thanks. The kind of queries I'd be likely to do are:
I think the first part is probably addressed by one of the examples in the previous comment, but it's worth noting that at the moment there's no way to verify that the script did actually get all the URLs on the site, especially if there are any pages with more than 100 internal links. Setting something like
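As a rough cross-check outside the script itself, the CDX API can list which URLs under the site already have captures, and that list can be compared against the input URLs. The domain and file names below are placeholders, and the comparison ignores URL normalisation, so treat it as a sketch rather than a reliable check.

# List URLs under the site that the Wayback Machine already has captures for
# (placeholder domain), then compare against the URLs you expected the script
# to submit (placeholder file urls.txt).
curl -s 'https://web.archive.org/cdx/search/cdx?url=site.example&matchType=domain&fl=original&collapse=urlkey' | sort -u > archived-urls.txt
# URLs in urls.txt that don't appear in the CDX listing:
comm -23 <(sort -u urls.txt) archived-urls.txt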
I already addressed the latter; because the flag relies on the URL to infer the content type, it would be impossible to make this work on every site, but you could try
It depends on whether you would be doing this before or after running the script on all the input URLs. I don't think you can do this with the script yet, but you could use

While it might be possible to do this using the script, I didn't have this in mind as a use case, since I use the CDX API for queries like this.
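For reference, a typical query of that kind might look like the following; this is illustrative only, since the domain, path, and filters are placeholders rather than the exact query being discussed.

# Illustrative CDX query (placeholder site and path): list successful HTML
# captures under a path prefix from 2019 to 2020, one line per unique snapshot.
curl -s 'https://web.archive.org/cdx/search/cdx?url=site.example/news/*&from=2019&to=2020&filter=statuscode:200&filter=mimetype:text/html&fl=timestamp,original&collapse=digest'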
Again, this isn't something I had in mind for this script, and I'm not actually sure if it's possible. I think if I were to do this I would download data from the CDX API or from normal Wayback URLs (to see where they redirect) instead of using the script.
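For example, following a normal Wayback URL with curl and printing only the final URL shows which capture it ultimately resolves to, including any redirect recorded in the archive; the timestamp and page below are placeholders.

# Follow redirects from a Wayback URL (placeholder timestamp and page) and print
# only the final URL; a partial timestamp redirects to the nearest capture.
curl -sL -o /dev/null -w '%{url_effective}\n' 'https://web.archive.org/web/20200101/http://site.example/old-page'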
I think this is addressed by one of the examples. The capture's URL has no effect on the outlinks detection, so it would work the same as selecting outlinks to the current site (but with a different domain name, of course).
I didn't know about the CDX API. I had a quick look and it's rather daunting. If you could combine it with your comprehensive selection options in an easily accessible bash script, I think that would prove popular. Perhaps cdx_toolkit (https://pypi.org/project/cdx-toolkit/) does what I want, but I suspect you'd have a big head start on me in writing a solution. I will continue looking, but as this stuff is in active development, I might just wait.
The script looks very powerful, but rather daunting, and the -o '' option, which captures outlinks beyond the site, is probably not what people want. My use case is archiving sites that are moribund, so I generally want to archive page X and all pages beyond it in a tree. I think I want:
spn -d 'force_get=1&if_not_archived_within=8640000' -o 'n49' http://www.novacon.org.uk/n49/
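(If I've understood the options correctly, the -d string passes Save Page Now's force_get=1 together with an if_not_archived_within window of 8640000 seconds, i.e. 100 days, and -o 'n49' keeps outlink capture to URLs matching n49.)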
But an examples page would be helpful.