429 errors with concurrent mode disabled #152

Open
ChrisChinchilla opened this issue Oct 15, 2020 · 7 comments

ChrisChinchilla commented Oct 15, 2020

Describe the bug

I can see there are similar reports of this issue (#105), but I get the same result even without concurrent mode enabled. Interestingly, the checks work beautifully when run locally, but we get the error when they run in CI jobs.

Non-OK status: 429 --- docs/operational_guide/resource_limits/index.html --> https://github.com/m3db/m3/tree/docs-test/site/content/docs/operational_guide/resource_limits.md

etc. And yes, it's just GitHub links.

To Reproduce

htmltest runs in CI over a Hugo site like this:

docker run --rm -it -v $(PWD)/site:/src klakegg/hugo:ext-alpine
curl https://htmltest.wjdp.uk | bash
./bin/htmltest -c site/.htmltest.yml

.htmltest.yml


DirectoryPath: site/public
IgnoreDirectoryMissingTrailingSlash: true
CheckExternal: true
IgnoreAltMissing: false
CheckImages: false
CheckScripts: false
CheckMeta: false
CheckMetaRefresh: false
CheckLinks: false
EnforceHTML5: false
EnforceHTTPS: false
IgnoreInternalEmptyHash: true
IgnoreEmptyHref: true
TestFilesConcurrently: false
DocumentConcurrencyLimit: 64
HTTPConcurrencyLimit: 8
LogLevel: 0
IgnoreURLs:
- "youtu.be"
- "youtube.com"
- "cassandra.apache.org"

Versions

  • OS: local - macOS 10.15.7 (19H2), CI - Not completely sure… Sorry, Linux of some flavour
  • htmltest: htmltest 0.13.0
@robskillington
@wjdp would you be open to accepting a PR that added a self-imposed rate limit for certain URLs?

e.g. something like this (what do you think of the general concept if someone else were to open the change?)

RateLimit:
- URL: "github.com"
  MaxRequests: 10
  Within: 1s

wjdp (Owner) commented Oct 19, 2020

Hi @ChrisChinchilla 👋

From "works locally, fails on CI" I suspect your local broadband bottlenecks you enough that rate limiting isn't a problem there but the fat pipe at your CI provider allows for a much more effective DoS directed at github 😉 hence the 429s.

This is a problem with htmltest: right now we just scale to whatever bandwidth is available, even if that hits external rate limits.

Hey @robskillington 👋

Yes, this sounds like the solution we need! I think we probably want a generic option first rather than a specific option but am open to either really. Very happy for someone else to pick this up!

Generic option

Rate limit connections per hostname without needing to specify it per host in config. You'd set a limit in config, or perhaps we have a sensible default, which is applied to all hostnames.

Specific option

As you suggest. Or perhaps both? A default and custom overrides.

Config naming

Likely the most contentious part of a PR for this would be the config names.

  • Your suggestion looks alright, but how do we handle the generic/default case (if there is to be one, or for one in the future)?
  • Separate MaxRequests and Within vars are quite flexible, but could we get away with assuming any rate limit is per second? Or do we leave Within in and just default it to 1s so most users don't need to think about it?

Perhaps something like

RateLimit:
- Default: true
  MaxRequests: 20
  Within: 1s [optional]
- URL: "github.com"
  MaxRequests: 10
  Within: 1s [optional]

What do you think?
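
For illustration only, here is a minimal sketch of how per-host limits like the proposed config could be enforced inside the checker, assuming golang.org/x/time/rate for the token bucket. The RateLimitRule fields mirror the proposed config keys above; none of this is existing htmltest code.

package main

import (
	"context"
	"strings"
	"sync"
	"time"

	"golang.org/x/time/rate"
)

// RateLimitRule mirrors the proposed config entries above (hypothetical names,
// not an existing htmltest option).
type RateLimitRule struct {
	Default     bool          // true for the catch-all rule
	URL         string        // hostname substring for a specific rule, e.g. "github.com"
	MaxRequests int           // requests allowed per window
	Within      time.Duration // window length; treated as 1s when left zero
}

// HostLimiter hands out one token-bucket limiter per hostname.
type HostLimiter struct {
	mu       sync.Mutex
	rules    []RateLimitRule
	limiters map[string]*rate.Limiter
}

func NewHostLimiter(rules []RateLimitRule) *HostLimiter {
	return &HostLimiter{rules: rules, limiters: make(map[string]*rate.Limiter)}
}

// Wait blocks until a request to host is allowed under the matching rule.
func (h *HostLimiter) Wait(ctx context.Context, host string) error {
	h.mu.Lock()
	lim, ok := h.limiters[host]
	if !ok {
		rule := h.match(host)
		window := rule.Within
		if window == 0 {
			window = time.Second
		}
		// MaxRequests tokens per window, with a burst of the same size.
		lim = rate.NewLimiter(rate.Limit(float64(rule.MaxRequests)/window.Seconds()), rule.MaxRequests)
		h.limiters[host] = lim
	}
	h.mu.Unlock()
	return lim.Wait(ctx)
}

// match returns the first specific rule whose URL appears in host,
// falling back to the Default rule, then to a built-in 20 req/s.
func (h *HostLimiter) match(host string) RateLimitRule {
	def := RateLimitRule{MaxRequests: 20, Within: time.Second}
	for _, r := range h.rules {
		if r.URL != "" && strings.Contains(host, r.URL) {
			return r
		}
		if r.Default {
			def = r
		}
	}
	return def
}

Each outgoing external check would call Wait with the target hostname before issuing the request, so the github.com override and the default would both come straight from config.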

Other solutions

A thought I had halfway through writing this reply:

If we see a 429 we could back off exponentially and retry to recover. This would solve the problem without resorting to hard-coded rate limits, which could still break against unpredictable server-side limits.

Arguably this could be a better solution as it will fix the 429 problem for everyone without them having to set up rate limits nor htmltest having to impose a default rate limit.

We may find ourselves needing to do both, though, as some servers likely aren't as well-behaved as GitHub with their status codes.

I'm sure there's a Go package that could handle this for us somewhat easily. A cursory Google throws up https://github.com/cenkalti/backoff
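
A rough sketch of that idea with the package above: retry only on 429 and pass every other status straight through. The checkWithBackoff helper and its parameters are made up for illustration, not htmltest code.

package main

import (
	"fmt"
	"net/http"
	"time"

	"github.com/cenkalti/backoff/v4"
)

// checkWithBackoff retries url with exponential backoff for as long as the
// server keeps answering 429, giving up once maxElapsed has passed.
// Hypothetical helper, not part of htmltest.
func checkWithBackoff(client *http.Client, url string, maxElapsed time.Duration) (int, error) {
	var status int

	operation := func() error {
		resp, err := client.Get(url)
		if err != nil {
			// Transport errors may be transient, so retry those too.
			return err
		}
		resp.Body.Close()
		status = resp.StatusCode

		if status == http.StatusTooManyRequests {
			// Retryable: signal the backoff to wait and try again.
			return fmt.Errorf("got 429 from %s", url)
		}
		// Any other status (200, 404, ...) is reported as-is, no retry.
		return nil
	}

	b := backoff.NewExponentialBackOff()
	b.InitialInterval = 2 * time.Second
	b.MaxElapsedTime = maxElapsed

	err := backoff.Retry(operation, b)
	return status, err
}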

@ChrisChinchilla (Author)
OK, thanks @wjdp. We (@robskillington and I work together) ended up getting it working in CI by reducing the links checked to a bare minimum, but hopefully Rob's suggestion will solve the issue for others in the long term :)

hellt commented Aug 11, 2021

I wonder whether HEAD requests are unaffected by rate limiting? I find that using GET with Range=0 is not great for link checking...
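
For what it's worth, a small hypothetical sketch of trying HEAD first and only falling back to GET when the server rejects it; whether rate limiters actually treat HEAD differently is untested.

package main

import "net/http"

// headThenGet tries a cheap HEAD first and falls back to a full GET only
// when HEAD fails or is rejected with 405. Hypothetical helper; whether
// servers count HEAD against the same rate limit is an open question.
func headThenGet(client *http.Client, url string) (int, error) {
	resp, err := client.Head(url)
	if err == nil {
		resp.Body.Close()
		if resp.StatusCode != http.StatusMethodNotAllowed {
			return resp.StatusCode, nil
		}
	}
	resp, err = client.Get(url)
	if err != nil {
		return 0, err
	}
	resp.Body.Close()
	return resp.StatusCode, nil
}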

hellt commented Jan 17, 2022

Hi @robskillington
I take it you never started working on this? Or is there anything in progress that someone could pick up and continue for this issue?

@MarkvanMents
I'm getting this too. It doesn't just apply to GitHub but to lots of other endpoints as well.
It would be useful to have a generic retry for these, but specifying specific sites would also work (although that's more work for me).
Is there any caching of results? I see the same endpoint returning 429 multiple times. Does htmltest check http://example.com/page.html every time it finds it, or does it remember the first result and return it from the cache next time? If there is caching, it might help to reduce the number of 429 errors.
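
To illustrate the caching question, here is a sketch of an in-run result cache that remembers the first status seen per URL. This is only the idea being asked about, not a description of htmltest's actual internals.

package main

import (
	"net/http"
	"sync"
)

// resultCache remembers the first status seen for each URL during a run,
// so repeated references to the same endpoint don't trigger new requests
// (and new chances to hit a 429). Illustrative only.
type resultCache struct {
	mu      sync.Mutex
	results map[string]int
}

func newResultCache() *resultCache {
	return &resultCache{results: make(map[string]int)}
}

func (c *resultCache) check(client *http.Client, url string) (int, error) {
	c.mu.Lock()
	if status, ok := c.results[url]; ok {
		c.mu.Unlock()
		return status, nil
	}
	c.mu.Unlock()

	resp, err := client.Get(url)
	if err != nil {
		return 0, err
	}
	resp.Body.Close()

	c.mu.Lock()
	c.results[url] = resp.StatusCode
	c.mu.Unlock()
	return resp.StatusCode, nil
}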

@brunchboy
I have just run into this in a major way trying to check links on https://afterglow-guide.deepsymmetry.org/ since it has many dozens of links to source code lines in the GitHub repository. For now, I am going to have to disable checking them, but it would be very nice if 429s could lead to per-host exponential backoff (or tailored backoff if an interval could be parsed from the 429 response, but that is a nice-to-have compared to the necessity of just having a queue to retry these links at progressively longer intervals).
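
For the "tailored backoff" case, a sketch of reading the wait hint from a 429's Retry-After header; the helper and its fallback value are hypothetical, not htmltest code.

package main

import (
	"net/http"
	"strconv"
	"time"
)

// retryAfter extracts the wait hinted at by a 429 response's Retry-After
// header, falling back to a caller-supplied default when the header is
// missing or unparseable. Hypothetical helper, not part of htmltest.
func retryAfter(resp *http.Response, fallback time.Duration) time.Duration {
	v := resp.Header.Get("Retry-After")
	if v == "" {
		return fallback
	}
	// Retry-After can be a delay in seconds...
	if secs, err := strconv.Atoi(v); err == nil {
		return time.Duration(secs) * time.Second
	}
	// ...or an HTTP-date.
	if t, err := http.ParseTime(v); err == nil {
		if d := time.Until(t); d > 0 {
			return d
		}
	}
	return fallback
}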
