Periodic CI to find links that have gone bad #1431

Open
markcmiller86 opened this issue Sep 10, 2022 · 11 comments · May be fixed by #1633
Labels
postponed-for-future-phase For item to be dealt with future phase of this project post-ECP Dec 2023 scope: site-internal

Comments

@markcmiller86
Member

A lot of what we publish has links. I think links are really important. But they also go stale over time as other content hosts change how the content we are linking to gets hosted.

We should have something that runs periodically, maybe once a week, and generates a report of bad links.

@markcmiller86
Member Author

I tried https://www.drlinkcheck.com and got these results. It's a $10 monthly subscription for up to 10,000 links.

Overview

Screen Shot 2022-09-10 at 3 31 35 PM

Four bad links found among the 1500 checked

Screen Shot 2022-09-10 at 3 30 57 PM

@bernhold
Member

Sandbox does this for us periodically.

Though I think it would not be a bad idea to also institute some regular checking within our repo. I have actions for the bssw-tutorial website that do this on files that change (though this has had some problems lately) and on a schedule for the whole repository. See https://github.com/bssw-tutorial/bssw-tutorial.github.io/blob/main/.github/workflows/check-pr-urls.yml and https://github.com/bssw-tutorial/bssw-tutorial.github.io/blob/main/.github/workflows/check-all-urls.yml.

@markcmiller86
Member Author

and https://github.com/bssw-tutorial/bssw-tutorial.github.io/blob/main/.github/workflows/check-all-urls.yml.

Thanks for mentioning. Took a very quick look at this run and it looks like it spews all URLs it checked (some timed out...what does that mean?) and then lists those it thinks failed. I clicked on some of the failed links and they worked. Something like this is likely sensitive to intermittent issues (in servers, networking, etc.).

We could expand this to track failed URLs over several successive checks and flag a URL as bad only if it's gone into a consistent failure state. That would take a bit more work because it would require maintaining a list of the failures across CI runs. But I think it's possible.
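
A rough sketch of that idea, assuming the checker can hand us the list of URLs it checked and the list that failed. The function names, the JSON history file, and the threshold of 3 are all hypothetical and not taken from any existing action:

```python
import json
from pathlib import Path


def update_failure_history(history, checked, failed, threshold=3):
    """Update per-URL consecutive-failure counts after one check run.

    history:   dict mapping URL -> consecutive failures seen so far
    checked:   URLs checked in this run
    failed:    subset of `checked` that failed in this run
    threshold: hypothetical cutoff for calling a URL persistently bad

    Returns (new_history, persistent). URLs that passed, or that were not
    checked this run (e.g. removed from the site), drop out of the history.
    """
    failed = set(failed)
    new_history = {}
    for url in checked:
        if url in failed:
            new_history[url] = history.get(url, 0) + 1
    persistent = sorted(u for u, n in new_history.items() if n >= threshold)
    return new_history, persistent


def load_history(path):
    """Read the history file kept between CI runs, or start empty."""
    p = Path(path)
    return json.loads(p.read_text()) if p.exists() else {}


def save_history(path, history):
    """Write the history file so the next CI run can pick it up."""
    Path(path).write_text(json.dumps(history, indent=2, sort_keys=True))
```

The history file would have to survive between CI runs somehow (committed to a branch, or stored as a cache or artifact); which of those is most practical is an open question.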

@bernhold
Member

Yes, it lists all URLs it checks and then the ones that failed. There are configuration options for the timeout on checks and the number of retries to attempt. There is also an exclusion list of links not to check. I had to tweak those some when I first set it up. It's not perfect -- I occasionally have experiences like you had. DOI links in particular like to fail, even though they are in fact good and, because they're DOIs, there is a commitment behind them that they not disappear. My guess is that this may have to do with the redirects (from doi.org to the actual provider). Occasionally other things fail the test too.
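
To make those three knobs concrete, here is a minimal standalone sketch using only the Python standard library. It is not how the action I use is implemented; the function names, defaults, and prefix-based exclusion matching are assumptions for illustration:

```python
import urllib.error
import urllib.request


def default_fetch(url, timeout):
    """Return the final HTTP status code for url (redirects are followed)."""
    req = urllib.request.Request(url, headers={"User-Agent": "link-check-sketch"})
    try:
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # 4xx/5xx responses are raised as exceptions; report the code instead.
        return err.code


def check_url(url, fetch=default_fetch, timeout=10, retries=3):
    """Return True if any of `retries` attempts gets a status below 400."""
    for _ in range(retries):
        try:
            if fetch(url, timeout) < 400:
                return True
        except OSError:
            # Covers timeouts, DNS failures, and refused connections.
            pass
    return False


def find_failures(urls, excluded_prefixes=(), **check_options):
    """Check every URL not matching an excluded prefix; return the failures."""
    to_check = [
        u for u in urls
        if not any(u.startswith(prefix) for prefix in excluded_prefixes)
    ]
    return [u for u in to_check if not check_url(u, **check_options)]
```

For example, `find_failures(urls, excluded_prefixes=("https://doi.org/",), timeout=5, retries=2)` would skip DOI links entirely and give every other link two attempts of up to five seconds each.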

My strategy for bssw-tutorial is to check the links that fail and if they work when I check, I just let it go. If I see something failing several times in a row, I'll consider adding it to the exclusion list. This works fine for the tutorial.

For something on a larger scale, like bssw.io, I can imagine that this might be more of a problem. There are other URL-checker actions out there. I'm using one by Vanessa Sochat which was simple to adopt in part because she provided several good examples. I don't recall anything that described saving a list of failing URLs and comparing from run to run. But that doesn't mean such a thing doesn't exist -- or we could write one, of course.

I will note that in the tutorial, and in general, there are two different use cases:

  1. Checking new/changed files to ensure that links in new content are valid
  2. Periodic checks of the whole site to identify linkrot

They don't necessarily have to use the same tools. It would be nice if they could share a common exclusion list (if necessary).
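
As a sketch of how the two use cases could share one exclusion list: both reduce to "collect URLs from a set of files, then drop the excluded ones", and only the set of files differs. The function names and the exclusion file format (one URL prefix per line, `#` for comments) are assumptions for illustration, not the format any particular action uses:

```python
import re
from pathlib import Path

# Deliberately simple pattern; real checkers handle more edge cases.
URL_RE = re.compile(r"https?://[^\s<>\"')\]]+")


def extract_urls(text):
    """Return the sorted unique URLs in text, minus trailing punctuation."""
    return sorted({match.rstrip(".,;:") for match in URL_RE.findall(text)})


def parse_exclusions(text):
    """Parse the shared exclusion list: one prefix per line, '#' comments."""
    lines = (line.strip() for line in text.splitlines())
    return [line for line in lines if line and not line.startswith("#")]


def urls_to_check(paths, exclusions):
    """Collect URLs from the given files and drop the excluded ones.

    Pass only the changed files for the PR check, or every content file
    for the periodic whole-site check.
    """
    urls = set()
    for path in paths:
        text = Path(path).read_text(encoding="utf-8", errors="ignore")
        urls.update(extract_urls(text))
    return sorted(
        u for u in urls if not any(u.startswith(e) for e in exclusions)
    )
```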

@markcmiller86
Member Author

If I see something failing several times in a row, I'll consider adding it to the exclusion list.

Or, maybe it's time to update or remove that link?

@bernhold
Member

Or, maybe it's time to update or remove that link?

Sorry. 90%+ of the failures are false positives. If it is evident that it is a real failure, of course I'll find an alternative or remove it. The frequent false positives I will add to the exclusion list.

@rinkug
Member

rinkug commented Oct 11, 2022

I have fixed the broken links that Sandbox sent us. We have around 70 broken links, while https://www.drlinkcheck.com/ showed us 5. I am not sure what tool the former folks are using.

@mrmundt
Contributor

mrmundt commented Oct 11, 2022

Just a note on this: There is an available GitHub action for this. The US-RSE uses it for checking spelling and links - https://github.com/USRSE/usrse.github.io/blob/main/.github/workflows/linting.yaml

@rinkug
Member

rinkug commented Mar 31, 2023

@bartlettroscoe : is this action worth implementing for us?

@bartlettroscoe
Member

@bartlettroscoe : is this action worth implementing for us?

Every maintained website should be running regular link checks and fixing issues as they come up.

@bernhold
Member

bernhold commented Apr 1, 2023

I do this for the bssw-tutorial website. Every PR gets checked, and we periodically check all of the links on the site. https://github.com/bssw-tutorial/bssw-tutorial.github.io/tree/main/.github/workflows has two actions, which we should be able to use with little or no modification on BSSw: check-pr-urls.yml and check-all-urls.yml. These actions also reference the file exclude-urls.txt. BSSw's exclude list would be different, but this can be a reference for what it looks like.

@rinkug rinkug added the postponed-for-future-phase For item to be dealt with future phase of this project post-ECP Dec 2023 label Dec 14, 2023
Projects
Status: Backlog
5 participants