Change original URLs for archived ones #1

Open
n1c4n0n opened this issue Jun 14, 2020 · 5 comments
Comments


n1c4n0n commented Jun 14, 2020

What do you think of archiving the original URLs and replacing them with their archived versions? I think it'd make this repo more future-proof.

Owner

irsdl commented Jun 14, 2020

Absolutely, I agree. The problem is that automating it might be tricky: some of the links are completely dead, some have been redirected, some show a 404, some show irrelevant data, and some are still alive! Unless we automatically take a copy of them all from the Wayback Machine, it can be really hard (perhaps we can save both the Wayback Machine copy and the page itself if it returns a 200 status). We should be able to use a certain algorithm to choose an appropriate snapshot (for example, for 2010 we perhaps need the first snapshot between 2010 and 2015). I'm not sure how the Wayback Machine works with its APIs, or whether there is a rate limit, etc.

Can you contribute to this perhaps? We can even publish the tool in this repository as well so we can use it in the future too!
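The snapshot-selection idea above (take the first snapshot inside a year window) could be sketched roughly as follows. This assumes the Wayback Machine CDX server endpoint and its `from`, `to`, `filter`, `limit` and `output` query parameters; the function names `build_cdx_query` and `first_snapshot_url` are hypothetical, and the sketch does no network access or rate-limit handling, so fetching the query URL is left to the caller.

```python
from urllib.parse import urlencode

# Assumed endpoint of the Wayback Machine CDX server.
CDX_ENDPOINT = "https://web.archive.org/cdx/search/cdx"


def build_cdx_query(url, first_year, last_year):
    """Build a CDX query for the earliest HTTP 200 capture in a year window."""
    params = {
        "url": url,
        "from": str(first_year),      # e.g. 2010
        "to": str(last_year),         # e.g. 2015
        "output": "json",             # rows as JSON lists, header row first
        "filter": "statuscode:200",   # skip captures of error pages
        "limit": "1",                 # only the first matching capture
    }
    return CDX_ENDPOINT + "?" + urlencode(params)


def first_snapshot_url(cdx_rows):
    """Turn a parsed CDX JSON response into a snapshot URL, or None."""
    # The first row is the header; an empty result has no data row.
    if len(cdx_rows) < 2:
        return None
    record = dict(zip(cdx_rows[0], cdx_rows[1]))
    return "https://web.archive.org/web/{}/{}".format(
        record["timestamp"], record["original"]
    )
```

For a 2010 entry, `build_cdx_query(link, 2010, 2015)` would give the URL to request, and the decoded JSON response would then be passed to `first_snapshot_url`.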

Owner

irsdl commented Jun 14, 2020

Another solution would be to do this manually, but that can take serious time... I may do it as a hobby, but I will probably need help, as categorising them can be a chore too (perhaps saving them all as PDF if they are not already in PDF?).

Author

n1c4n0n commented Jun 15, 2020

@irsdl I started doing something half manually and half automatically. Here's the first test I made to see how that would work:
https://github.com/n1c4n0n/top10webseclist/blob/master/2019.md

I'll start a PR on here asap so we can gradually tweak things as necessary, what do you think?

I'll also change a few things in the tool I've used and upload it so we can work on that too.

Author

n1c4n0n commented Jun 15, 2020

@irsdl We can decide which way is best for archiving purposes, but I think it'll be OK if we just archive them in their original format (be it HTML, PDF, etc.). Tell me what you think.
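As a rough illustration of keeping each item in its original format, a saved copy could be named after the response's `Content-Type` header. Everything here is an assumption made for the sketch: the `EXTENSIONS` table, the `.bin` fallback, and the hypothetical `archive_filename` helper that derives a stable name from a hash of the URL.

```python
import hashlib

# Small, illustrative mapping; extend as new formats turn up.
EXTENSIONS = {
    "text/html": ".html",
    "application/pdf": ".pdf",
    "text/plain": ".txt",
}


def archive_filename(url, content_type):
    """Pick a stable local filename that keeps the original format."""
    # Drop parameters such as "; charset=utf-8" and normalise case.
    mime = content_type.split(";", 1)[0].strip().lower()
    extension = EXTENSIONS.get(mime, ".bin")  # unknown types are kept as-is
    # A URL hash avoids characters that are unsafe in file names.
    digest = hashlib.sha256(url.encode("utf-8")).hexdigest()[:16]
    return digest + extension
```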

Owner

irsdl commented Jun 16, 2020

For 2019 it is easy to do this because the links are still live, and we should be able to just save the endpoints as long as they are not hosted on SlideShare or something like that. I guess the ultimate approach would be to manually hunt them down one by one and save them in an appropriate format. It is a chore, but it can become very valuable. I may start doing this in my spare time ;)
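To decide which links can simply be saved live and which need hunting down, the link states mentioned earlier in the thread (dead, redirected, 404, still alive) could be triaged with something like the sketch below. The function `classify_link` and its category names are hypothetical, the host comparison is only a heuristic for "redirected elsewhere", and pages that return 200 but show irrelevant data would still need a manual check.

```python
from urllib.parse import urlsplit


def classify_link(requested_url, final_url, status):
    """Classify a link from the outcome of fetching it.

    status is the final HTTP status code, or None if the request failed
    (DNS error, timeout, refused connection). final_url is the URL after
    following redirects.
    """
    if status is None:
        return "dead"
    if status == 404:
        return "not_found"
    if status != 200:
        return "error"
    requested_host = urlsplit(requested_url).netloc.lower()
    final_host = urlsplit(final_url).netloc.lower()
    if requested_host != final_host:
        # Landing on another host often means the content has moved or gone.
        return "redirected"
    return "alive"  # safe to save the live page directly
```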
