
An examples page would be useful #5

Open
jrbray1 opened this issue May 3, 2021 · 5 comments

jrbray1 commented May 3, 2021

The script looks very powerful, but rather daunting, and the -o '' option, which saves outlinks beyond the site, is probably not what people want. My use case is archiving sites that are moribund, so I generally want to archive page X and all pages below it in its tree. I think I want

spn -d 'force_get=1&if_not_archived_within=8640000' -o 'n49' http://www.novacon.org.uk/n49/

But an examples page would be helpful

overcast07 (Owner) commented

I'll add some examples in a comment here. I'm not sure at the moment what the examples in the documentation should look like or how specific they should be. I also might not be able to tell which parts are harder to follow, so if you think there's a part that isn't explained very well, it would be helpful to let me know.

-o 'n49' would be sufficient, but you might accidentally match too many URLs, so to be safe you could use -o 'https?://www\.novacon\.org\.uk/n49/'.
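
For example, combining that pattern with the -d options from your command, the full invocation might look something like this (untested sketch; adjust the patterns and data string to your needs):

spn.sh -d 'force_get=1&if_not_archived_within=8640000' -o 'https?://www\.novacon\.org\.uk/n49/' http://www.novacon.org.uk/n49/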

jrbray1 (Author) commented May 8, 2021

Thanks. The kind of queries I'd be likely to do are

  1. Submit everything on a site if never recorded before

  2. Record all pdfs on a site, or everything apart from videos

  3. List what things have never been recorded (in combination with --dryrun)

  4. List when things were last recorded

  5. Submit all things mentioned on this site that are hosted on a different site X

overcast07 (Owner) commented May 8, 2021

Basic usage
spn.sh https://example.com/page1/ https://example.com/page2/
spn.sh urls.txt

Run in parallel
spn.sh -p 10 urls.txt

Save all outlinks (until either there are no more URLs or the script is terminated by the user)
spn.sh -o '' https://example.com/

Save all outlinks containing 'youtube' somewhere in the URL
spn.sh -o 'youtube' https://example.com/

Save all outlinks matching either 'youtube' or 'reddit'
spn.sh -o 'youtube|reddit' https://example.com/

Save all outlinks except those matching 'facebook'
spn.sh -o '' -x 'facebook' https://example.com/

Save all outlinks matching either 'youtube' or 'reddit', except those matching 'facebook'
spn.sh -o 'youtube|reddit' -x 'facebook' https://example.com/

Save all outlinks to the subdomain www.example.org
spn.sh -o 'https?://www\.example\.org(/|$)' https://example.com/

Save all outlinks to www.example.org and example.org
spn.sh -o 'https?://(www\.)?example\.org(/|$)' https://example.com/

Save all outlinks to all subdomains of example.org
spn.sh -o 'https?://([^/]+\.)?example\.org(/|$)' https://example.com/

Save all outlinks to example.org/files/ and all its subdirectories
spn.sh -o 'https?://(www\.)?example\.org/files/' https://example.com/

Save all outlinks to example.org/files/ and all its subdirectories, except for links with the file extension ".mp4"
spn.sh -o 'https?://(www\.)?example\.org/files/' -x '\.mp4(\?|$)'  https://example.com/

Save all outlinks matching YouTube video URLs
spn.sh -o 'https?://(www\.|m\.)?youtube\.com/watch\?(.*\&)?v=[a-zA-Z0-9_-]{11}|https?://youtu\.be/[a-zA-Z0-9_-]{11}' https://example.com/

Save all subdirectories and files in an IPFS folder, visiting each file twice (replace the example folder URL with that of the folder to be archived)
spn.sh -o 'https?://(gateway\.)?ipfs\.io/ipfs/(QmUNLLsPACCz1vLxQVkXqqLX5R1X345qqfHbsf67hvA3Nn/.+|[a-zA-Z0-9]{46}\?filename=)' https://ipfs.io/ipfs/QmUNLLsPACCz1vLxQVkXqqLX5R1X345qqfHbsf67hvA3Nn

overcast07 (Owner) commented

Submit everything on a site if never recorded before

I think the first part is probably addressed by one of the examples in the previous comment, but it's worth noting that at the moment there's no way to verify that the script did actually get all the URLs on the site, especially if there are any pages with more than 100 internal links. Setting something like -d if_not_archived_within=1000000000 would probably work for the second part.
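For example, a sketch combining those two pieces (example.org and the outlink pattern are placeholders for the actual site):

spn.sh -d 'if_not_archived_within=1000000000' -o 'https?://(www\.)?example\.org(/|$)' https://example.org/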

Record all pdfs on a site, or everything apart from videos

I already addressed the latter; because the flag relies on the URL to infer the content type, it would be impossible to make this work on every site, but you could try -o 'https?://example\.org/.*\.pdf(\?|$)'.
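
As a concrete sketch for the PDF case (example.org is a placeholder, and this only catches links whose URLs actually end in .pdf):

spn.sh -o 'https?://example\.org/.*\.pdf(\?|$)' https://example.org/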

List what things have never been recorded

It depends on whether you would be doing this before or after running the script on all the input URLs. I don't think you can do this with the script yet, but you could use grep '"first_archive":true' success-json.log to check after running the script. As for before running the script, I think the Wayback CDX API in combination with wget or curl would be more suitable.
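
For checking a single URL before running the script, something like this should work (example URL only; an empty response means there are no captures):

curl -s 'https://web.archive.org/cdx/search/cdx?url=example.org/page1/&limit=1'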

While it might be possible to do this using the script, I didn't have this in mind as a use case, since I use the CDX API for queries like this.

List when things were last recorded

Again, this isn't something I had in mind for this script, and I'm not actually sure if it's possible. I think if I were to do this I would download data from the CDX API or from normal Wayback URLs (to see where they redirect) instead of using the script. wget --max-redirect=0 https://web.archive.org/web/2oe_/http://wayback-fakeurl.archive.org/yt/dQw4w9WgXcQ has been useful for me, for instance.
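
A possible sketch with the CDX API (example.org is a placeholder; a negative limit is supposed to return the most recent capture, whose timestamp field shows when it was taken, though it's worth verifying that behavior):

curl -s 'https://web.archive.org/cdx/search/cdx?url=example.org&limit=-1'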

Submit all things mentioned on this site that are hosted on a different site X

I think this is addressed by one of the examples. The capture's URL has no effect on the outlinks detection, so it would work the same as selecting outlinks to the current site (but with a different domain name, of course).
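
That is, something along these lines, where othersite.example stands in for the other host:

spn.sh -o 'https?://(www\.)?othersite\.example(/|$)' https://example.com/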

jrbray1 (Author) commented May 8, 2021

I didn't know about the CDX API. I had a quick look and it's rather daunting. If you could combine it with your comprehensive selection options in an easily accessible bash script, I think that would prove popular. Perhaps cdx_toolkit (https://pypi.org/project/cdx-toolkit/) does what I want, but I suspect you'd have a big head start on me in writing a solution. I'll keep looking, but as this stuff is in active development, I might just wait.
