Added functionality for using savepagenow with authentication #45

Merged
8 commits merged into palewire:main on Jul 1, 2023

Conversation

duckduckgrayduck
Contributor

This PR adds the ability to authenticate Wayback Machine saves. The user needs to set two local environment variables: 'secret', which holds their S3 secret key from the Internet Archive, and 'access_key', which holds their access key from the Internet Archive, as described in the Wayback API spec here: https://docs.google.com/document/d/1Nsv52MvSjbLb2PCpHlat0gkzw0EvtSgpKHu4mk0MnrA/edit?pli=1

Both are optional; if they are not set, it falls back to the default unauthenticated save.
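
A minimal sketch of that fallback, not the PR's actual code: it reads the `access_key` and `secret` environment variables named above and adds an authorization header only when both are set. The `build_headers` helper and the bare `savepagenow` user-agent string are hypothetical, and the `LOW access_key:secret` header format is my reading of the linked Wayback API spec.

```python
import os


def build_headers(env=None):
    """Return request headers, adding auth only when both keys are present."""
    # Allow a plain dict to be passed in so the logic can be checked offline.
    env = os.environ if env is None else env
    headers = {"User-Agent": "savepagenow"}  # placeholder user agent
    access_key = env.get("access_key")
    secret = env.get("secret")
    if access_key and secret:
        # Assumed header format from the Wayback API spec linked above.
        headers["Authorization"] = f"LOW {access_key}:{secret}"
    return headers
```

With neither variable set, the returned headers carry no `Authorization` entry, which corresponds to the unauthenticated save.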

@palewire
Owner

palewire commented Jun 30, 2023

Love the patch. Here's my picky list of picky stuff. If we get this stuff in, I'm ready to merge.

  • Let's get a unit test that does the auth. Please pull the credentials from env variables. I can add a login for myself to link with the GitHub account so it can run in the cloud.
  • Let's add a little snippet to the documentation explaining how to do this, and update whatever would be outdated.
  • Add a custom exception for when a bad username or password is provided.
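
The custom exception requested in the last bullet could look something like this. It is only a sketch: `BadCredentialsError` and `raise_for_auth` are hypothetical names, and treating HTTP 401 as the signal for rejected credentials is an assumption rather than documented Save Page Now behavior.

```python
class BadCredentialsError(Exception):
    """Hypothetical error raised when the access key or secret is rejected."""


def raise_for_auth(status_code):
    """Raise BadCredentialsError if the response looks like an auth failure."""
    # Assumption: the service answers 401 when the credentials are rejected.
    if status_code == 401:
        raise BadCredentialsError(
            "The Internet Archive rejected the supplied access key or secret."
        )
```

A dedicated exception type lets callers tell bad credentials apart from rate-limit or network errors without parsing message strings.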

@duckduckgrayduck
Contributor Author

I've added the unit test (which will only pass once savepagenow is repackaged, because the tests import it as a library and it doesn't have access to the new method yet), changed the user agent back to savepagenow, added documentation (I had to move to a new version of sphinx napoleon to do so, which resulted in a lot of files showing as 'changed'), added custom error messages, and ran black and pylint. Should be ready for review.

@palewire palewire merged commit 73652ed into palewire:main Jul 1, 2023
@palewire
Owner

palewire commented Jul 1, 2023

I made a few tweaks and merged this in. Mainly I'd like to have more specific env variable names. The rest of my changes are all gloss.

Can you point me to where you sourced the 4 vs 12 request limit facts?

@duckduckgrayduck
Contributor Author

https://docs.google.com/document/d/1Nsv52MvSjbLb2PCpHlat0gkzw0EvtSgpKHu4mk0MnrA/edit
Page 8
Max captures per minute for authenticated users = 12 and for anonymous users = 4.

@palewire
Owner

palewire commented Jul 1, 2023

Thanks. We should be out as version 1.3.0. Give it a try. Thanks again.

@duckduckgrayduck
Contributor Author

Works great! Thanks again.

@overcast07

https://docs.google.com/document/d/1Nsv52MvSjbLb2PCpHlat0gkzw0EvtSgpKHu4mk0MnrA/edit Page 8 Max captures per minute for authenticated users = 12 and for anonymous users = 4.

The document has been out of date for a while. Around May, they changed the limit to 6 per minute for authenticated users and 3 per minute for anonymous users, but it seems they never updated the document to reflect it.

@duckduckgrayduck
Contributor Author

@overcast07 I'm not sure where you got those numbers. I've been in direct communication with the Internet Archive folks.

@overcast07

overcast07 commented Sep 27, 2023

@overcast07 I'm not sure where you got those numbers. I've been in direct communication with the Internet Archive folks.

I created and frequently use a Bash script that can submit a list of URLs to Save Page Now, both with and without authentication. I haven't been in contact with the Internet Archive about it (I just didn't have much of a reason to) and they have never tried to contact me.

In my testing, it has been impossible to submit more than 6 URLs per minute for several months. The script submits URLs as frequently as every 3 seconds, and has done this for about 2 years, so it was quite noticeable when there suddenly started being a long gap between successful URL submissions after every 6th URL. Previously, the actual limit was probably 12 URLs, but it wasn't calculated in the same way until earlier this year (you could submit more than 12 URLs per minute by submitting them rapidly before the first one started processing), and shortly after they fixed this the limit was reduced to 6.
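
To illustrate the behaviour described above, here is a rough client-side sliding-window limiter in Python. It is not my Bash script and not how the server counts requests (that is unknown to me); the class name, the 6-per-60-seconds defaults, and the explicit `now` argument are all choices made for this sketch.

```python
from collections import deque


class SlidingWindowLimiter:
    """Allow at most `max_calls` submissions in any `period`-second window."""

    def __init__(self, max_calls=6, period=60.0):
        self.max_calls = max_calls
        self.period = period
        self._stamps = deque()  # times of recent submissions, oldest first

    def wait_time(self, now):
        """Seconds to wait at time `now` before another submission is allowed."""
        # Forget submissions that have fallen out of the window.
        while self._stamps and now - self._stamps[0] >= self.period:
            self._stamps.popleft()
        if len(self._stamps) < self.max_calls:
            return 0.0
        # Window is full: wait until the oldest submission expires.
        return self.period - (now - self._stamps[0])

    def record(self, now):
        """Note that a submission was made at time `now`."""
        self._stamps.append(now)
```

Submitting every 3 seconds fills the window after the 6th URL, and the limiter then reports a long wait until the first submission ages out, which matches the gap after every 6th URL. In real use, `now` would come from something like `time.monotonic()`.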

The website provides an endpoint (https://web.archive.org/save/status/user) which tells you if you don't have any slots left to use. Since May 2023, the Bash script has used that data, when authenticated, to check whether the site would return the "You have already reached the limit of active Save Page Now sessions" message for the next URL submitted, so that it avoids repeatedly receiving that error message.
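
A small Python sketch of that check, assuming the endpoint returns a JSON object with an `available` count of free slots. The field name is an assumption here, and `has_free_slot` is a hypothetical helper; fetching the response body with the authenticated request is left out.

```python
import json


def has_free_slot(status_body):
    """Return True when the status payload reports at least one free slot."""
    data = json.loads(status_body)
    # Assumption: the payload has an integer "available" field; a missing
    # field is treated as "no free slot" so the caller errs on the side of waiting.
    return int(data.get("available", 0)) > 0
```

A caller would hold off on the next submission while this returns False, instead of sending the URL and getting the session-limit message back.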

duckduckgrayduck added a commit to duckduckgrayduck/savepagenow that referenced this pull request Sep 29, 2023
Was originally alerted to this by user overcast07 in this thread: palewire#45
Confirmed with the Internet Archive team the new rate limit
@duckduckgrayduck
Contributor Author

Hey @overcast07, I've contacted the Internet Archive team and just want you to know that you are correct. I've made a new PR to update the documentation for savepagenow: #48
