
An examples page would be useful #5

Open
jrbray1 opened this issue May 3, 2021 · 5 comments

jrbray1 commented May 3, 2021

The script looks very powerful, but rather daunting, and the -o '' option, which saves outlinks beyond the site, is probably not what people want. My use case is archiving sites that are moribund, so I generally want to archive page X and all pages below it in its tree. I think I want

spn -d 'force_get=1&if_not_archived_within=8640000' -o 'n49' http://www.novacon.org.uk/n49/

But an examples page would be helpful

overcast07 (Owner) commented

I'll add some examples in a comment here. I'm not sure at the moment what the examples in the documentation should look like or how specific they should be. I also might not be able to tell which parts are harder to follow, so if you think there's a part that isn't explained very well, it would be helpful to let me know.

-o 'n49' would be sufficient, but you might accidentally match too many URLs, so to be safe you could use -o 'https?://www\.novacon\.org\.uk/n49/'.
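
For example, combining that pattern with the -d options from your command, the full invocation might look something like this (untested sketch; adjust the patterns and data string to your needs):

spn.sh -d 'force_get=1&if_not_archived_within=8640000' -o 'https?://www\.novacon\.org\.uk/n49/' http://www.novacon.org.uk/n49/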

jrbray1 (Author) commented May 8, 2021

Thanks. The kind of queries I'd be likely to do are

  1. Submit everything on a site if never recorded before

  2. Record all pdfs on a site, or everything apart from videos

  3. List what things have never been recorded (in combination with --dryrun)

  4. List when things were last recorded

  5. Submit all things mentioned on this site that are hosted on a different site X

overcast07 (Owner) commented May 8, 2021

Basic usage
spn.sh https://example.com/page1/ https://example.com/page2/
spn.sh urls.txt

Run in parallel
spn.sh -p 10 urls.txt

Save all outlinks (until either there are no more URLs or the script is terminated by the user)
spn.sh -o '' https://example.com/

Save all outlinks containing 'youtube' somewhere in the URL
spn.sh -o 'youtube' https://example.com/

Save all outlinks matching either 'youtube' or 'reddit'
spn.sh -o 'youtube|reddit' https://example.com/

Save all outlinks except those matching 'facebook'
spn.sh -o '' -x 'facebook' https://example.com/

Save all outlinks matching either 'youtube' or 'reddit', except those matching 'facebook'
spn.sh -o 'youtube|reddit' -x 'facebook' https://example.com/

Save all outlinks to the subdomain www.example.org
spn.sh -o 'https?://www\.example\.org(/|$)' https://example.com/

Save all outlinks to www.example.org and example.org
spn.sh -o 'https?://(www\.)?example\.org(/|$)' https://example.com/

Save all outlinks to all subdomains of example.org
spn.sh -o 'https?://([^/]+\.)?example\.org(/|$)' https://example.com/

Save all outlinks to example.org/files/ and all its subdirectories
spn.sh -o 'https?://(www\.)?example\.org/files/' https://example.com/

Save all outlinks to example.org/files/ and all its subdirectories, except for links with the file extension ".mp4"
spn.sh -o 'https?://(www\.)?example\.org/files/' -x '\.mp4(\?|$)'  https://example.com/

Save all outlinks matching YouTube video URLs
spn.sh -o 'https?://(www\.|m\.)?youtube\.com/watch\?(.*\&)?v=[a-zA-Z0-9_-]{11}|https?://youtu\.be/[a-zA-Z0-9_-]{11}' https://example.com/

Save all subdirectories and files in an IPFS folder, visiting each file twice (replace the example folder URL with that of the folder to be archived)
spn.sh -o 'https?://(gateway\.)?ipfs\.io/ipfs/(QmUNLLsPACCz1vLxQVkXqqLX5R1X345qqfHbsf67hvA3Nn/.+|[a-zA-Z0-9]{46}\?filename=)' https://ipfs.io/ipfs/QmUNLLsPACCz1vLxQVkXqqLX5R1X345qqfHbsf67hvA3Nn

overcast07 (Owner) commented

Submit everything on a site if never recorded before

I think the first part is probably addressed by one of the examples in the previous comment, but it's worth noting that at the moment there's no way to verify that the script did actually get all the URLs on the site, especially if there are any pages with more than 100 internal links. Setting something like -d if_not_archived_within=1000000000 would probably work for the second part.
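For example, a sketch combining those two pieces (example.org and the outlink pattern are placeholders for the actual site):

spn.sh -d 'if_not_archived_within=1000000000' -o 'https?://(www\.)?example\.org(/|$)' https://example.org/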

Record all pdfs on a site, or everything apart from videos

I already addressed the latter; because the flag relies on the URL to infer the content type, it would be impossible to make this work on every site, but you could try -o 'https?://example\.org/.*\.pdf(\?|$)'.
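
As a concrete sketch for the PDF case (example.org is a placeholder, and this only catches links whose URLs actually end in .pdf):

spn.sh -o 'https?://example\.org/.*\.pdf(\?|$)' https://example.org/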

List what things have never been recorded

It depends on whether you would be doing this before or after running the script on all the input URLs. I don't think you can do this with the script yet, but you could use grep '"first_archive":true' success-json.log to check after running the script. As for before running the script, I think the Wayback CDX API in combination with wget or curl would be more suitable.
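
For checking a single URL before running the script, something like this should work (example URL only; an empty response means there are no captures):

curl -s 'https://web.archive.org/cdx/search/cdx?url=example.org/page1/&limit=1'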

While it might be possible to do this using the script, I didn't have this in mind as a use case, since I use the CDX API for queries like this.

List when things were last recorded

Again, this isn't something I had in mind for this script, and I'm not actually sure if it's possible. I think if I were to do this I would download data from the CDX API or from normal Wayback URLs (to see where they redirect) instead of using the script. wget --max-redirect=0 https://web.archive.org/web/2oe_/http://wayback-fakeurl.archive.org/yt/dQw4w9WgXcQ has been useful for me, for instance.
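
A possible sketch with the CDX API (example.org is a placeholder; a negative limit is supposed to return the most recent capture, whose timestamp field shows when it was taken, though it's worth verifying that behavior):

curl -s 'https://web.archive.org/cdx/search/cdx?url=example.org&limit=-1'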

Submit all things mentioned on this site that are hosted on a different site X

I think this is addressed by one of the examples. The capture's URL has no effect on the outlinks detection, so it would work the same as selecting outlinks to the current site (but with a different domain name, of course).
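
That is, something along these lines, where othersite.example stands in for the other host:

spn.sh -o 'https?://(www\.)?othersite\.example(/|$)' https://example.com/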

jrbray1 (Author) commented May 8, 2021

I didn't know about the CDX API. I had a quick look and it's rather daunting. If you could combine it with your comprehensive selection options in an easily accessible bash script, I think that would prove popular. Perhaps cdx_toolkit (https://pypi.org/project/cdx-toolkit/) does what I want, but I suspect you'd have a big head start on me in writing a solution. I'll keep looking, but as this stuff is in active development, I might just wait.
