429 errors with concurrent mode disabled #152

Open
ChrisChinchilla opened this issue Oct 15, 2020 · 7 comments

ChrisChinchilla commented Oct 15, 2020

Describe the bug

I can see there are similar reports of this issue (#105), but I get the same result even without concurrent mode enabled. Interestingly, the checks work beautifully when run locally, but we get the error when they run in CI jobs.

Non-OK status: 429 --- docs/operational_guide/resource_limits/index.html --> https://github.com/m3db/m3/tree/docs-test/site/content/docs/operational_guide/resource_limits.md

etc. And yes, it's just GitHub links.

To Reproduce

htmltest runs in CI over a Hugo site like this:

docker run --rm -it -v $(PWD)/site:/src klakegg/hugo:ext-alpine
curl https://htmltest.wjdp.uk | bash
./bin/htmltest -c site/.htmltest.yml

.htmltest.yml


DirectoryPath: site/public
IgnoreDirectoryMissingTrailingSlash: true
CheckExternal: true
IgnoreAltMissing: false
CheckImages: false
CheckScripts: false
CheckMeta: false
CheckMetaRefresh: false
CheckLinks: false
EnforceHTML5: false
EnforceHTTPS: false
IgnoreInternalEmptyHash: true
IgnoreEmptyHref: true
TestFilesConcurrently: false
DocumentConcurrencyLimit: 64
HTTPConcurrencyLimit: 8
LogLevel: 0
IgnoreURLs:
- "youtu.be"
- "youtube.com"
- "cassandra.apache.org"

Versions

  • OS: local - macOS 10.15.7 (19H2), CI - Not completely sure… Sorry, Linux of some flavour
  • htmltest: htmltest 0.13.0
@robskillington
@wjdp would you be open to accepting a PR that added a self-imposed rate limit for certain URLs?

e.g. something like this (what do you think of the general concept if someone else were to open the change?)

RateLimit:
- URL: "github.com"
  MaxRequests: 10
  Within: 1s

wjdp (Owner) commented Oct 19, 2020

Hi @ChrisChinchilla 👋

From "works locally, fails on CI" I suspect your local broadband bottlenecks you enough that rate limiting isn't a problem there but the fat pipe at your CI provider allows for a much more effective DoS directed at github 😉 hence the 429s.

This is a problem with htmltest: right now we just scale to whatever bandwidth is available, even if that hits external rate limits.

Hey @robskillington 👋

Yes, this sounds like the solution we need! I think we probably want a generic option first rather than a specific option but am open to either really. Very happy for someone else to pick this up!

Generic option

Rate limit connections per hostname without needing to specify it per host in config. You'd set a limit in config, or perhaps we have a sensible default, which is applied to all hostnames.

Specific option

As you suggest. Or perhaps both? A default and custom overrides.

Config naming

Likely the most contentious part of a PR for this would be the config names.

  • Your suggestion looks alright, but how do we handle the generic/default case (if there is to be one, or for one in the future)?
  • Separate MaxRequests and Within vars are quite flexible, but could we get away with assuming any rate limit is per second? Or do we leave Within in and just default it to 1s so most users don't need to think about it?

Perhaps something like

RateLimit:
- Default: true
  MaxRequests: 20
  Within: 1s [optional]
- URL: "github.com"
  MaxRequests: 10
  Within: 1s [optional]

What do you think?
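
For illustration only, here is a minimal sketch of how per-host limits like the proposed config could be enforced inside the checker, assuming golang.org/x/time/rate for the token bucket. The RateLimitRule fields mirror the proposed config keys above; none of this is existing htmltest code.

package main

import (
	"context"
	"strings"
	"sync"
	"time"

	"golang.org/x/time/rate"
)

// RateLimitRule mirrors the proposed config entries above (hypothetical names,
// not an existing htmltest option).
type RateLimitRule struct {
	Default     bool          // true for the catch-all rule
	URL         string        // hostname substring for a specific rule, e.g. "github.com"
	MaxRequests int           // requests allowed per window
	Within      time.Duration // window length; treated as 1s when left zero
}

// HostLimiter hands out one token-bucket limiter per hostname.
type HostLimiter struct {
	mu       sync.Mutex
	rules    []RateLimitRule
	limiters map[string]*rate.Limiter
}

func NewHostLimiter(rules []RateLimitRule) *HostLimiter {
	return &HostLimiter{rules: rules, limiters: make(map[string]*rate.Limiter)}
}

// Wait blocks until a request to host is allowed under the matching rule.
func (h *HostLimiter) Wait(ctx context.Context, host string) error {
	h.mu.Lock()
	lim, ok := h.limiters[host]
	if !ok {
		rule := h.match(host)
		window := rule.Within
		if window == 0 {
			window = time.Second
		}
		// MaxRequests tokens per window, with a burst of the same size.
		lim = rate.NewLimiter(rate.Limit(float64(rule.MaxRequests)/window.Seconds()), rule.MaxRequests)
		h.limiters[host] = lim
	}
	h.mu.Unlock()
	return lim.Wait(ctx)
}

// match returns the first specific rule whose URL appears in host,
// falling back to the Default rule, then to a built-in 20 req/s.
func (h *HostLimiter) match(host string) RateLimitRule {
	def := RateLimitRule{MaxRequests: 20, Within: time.Second}
	for _, r := range h.rules {
		if r.URL != "" && strings.Contains(host, r.URL) {
			return r
		}
		if r.Default {
			def = r
		}
	}
	return def
}

Each outgoing external check would call Wait with the target hostname before issuing the request, so the github.com override and the default would both come straight from config.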

Other solutions

A thought I had halfway through writing this reply:

If we see a 429 we could back off exponentially and retry to recover. This would solve the problem without resorting to hard-coded rate limits, which could still break against unpredictable server-side limits.

Arguably this could be a better solution as it will fix the 429 problem for everyone without them having to set up rate limits nor htmltest having to impose a default rate limit.

We may find ourselves needing to do both, though, as some servers likely aren't as well-behaved as GitHub with their status codes.

I'm sure there's a Go package that could handle this for us somewhat easily. A cursory Google throws up https://github.com/cenkalti/backoff
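
A rough sketch of that idea with the package above: retry only on 429 and pass every other status straight through. The checkWithBackoff helper and its parameters are made up for illustration, not htmltest code.

package main

import (
	"fmt"
	"net/http"
	"time"

	"github.com/cenkalti/backoff/v4"
)

// checkWithBackoff retries url with exponential backoff for as long as the
// server keeps answering 429, giving up once maxElapsed has passed.
// Hypothetical helper, not part of htmltest.
func checkWithBackoff(client *http.Client, url string, maxElapsed time.Duration) (int, error) {
	var status int

	operation := func() error {
		resp, err := client.Get(url)
		if err != nil {
			// Transport errors may be transient, so retry those too.
			return err
		}
		resp.Body.Close()
		status = resp.StatusCode

		if status == http.StatusTooManyRequests {
			// Retryable: signal the backoff to wait and try again.
			return fmt.Errorf("got 429 from %s", url)
		}
		// Any other status (200, 404, ...) is reported as-is, no retry.
		return nil
	}

	b := backoff.NewExponentialBackOff()
	b.InitialInterval = 2 * time.Second
	b.MaxElapsedTime = maxElapsed

	err := backoff.Retry(operation, b)
	return status, err
}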

@ChrisChinchilla (Author)
OK, thanks @wjdp. We (@robskillington and I work together) ended up getting it working in CI by reducing the links checked to a bare minimum, but hopefully Rob's suggestion will solve the issue for others in the long term :)

hellt commented Aug 11, 2021

I wonder whether HEAD requests are unaffected by rate limiting? I find that using GET with Range=0 is not great for link checking...
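
For what it's worth, a small hypothetical sketch of trying HEAD first and only falling back to GET when the server rejects it; whether rate limiters actually treat HEAD differently is untested.

package main

import "net/http"

// headThenGet tries a cheap HEAD first and falls back to a full GET only
// when HEAD fails or is rejected with 405. Hypothetical helper; whether
// servers count HEAD against the same rate limit is an open question.
func headThenGet(client *http.Client, url string) (int, error) {
	resp, err := client.Head(url)
	if err == nil {
		resp.Body.Close()
		if resp.StatusCode != http.StatusMethodNotAllowed {
			return resp.StatusCode, nil
		}
	}
	resp, err = client.Get(url)
	if err != nil {
		return 0, err
	}
	resp.Body.Close()
	return resp.StatusCode, nil
}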

hellt commented Jan 17, 2022

Hi @robskillington
I take it you never started working on this? Or is there anything in progress that someone could pick up and continue for this issue?

@MarkvanMents
I'm getting this too. It doesn't just apply to GitHub but to lots of other endpoints as well.
It would be useful to have a generic retry for these, but specifying specific sites would also work (although that's more work for me).
Is there any caching of results? I see the same endpoint returning 429 multiple times. Does htmltest check http://example.com/page.html every time it finds it, or does it remember the first result and return it from the cache next time? If there is caching, it might help to reduce the number of 429 errors.
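
To illustrate the caching question, here is a sketch of an in-run result cache that remembers the first status seen per URL. This is only the idea being asked about, not a description of htmltest's actual internals.

package main

import (
	"net/http"
	"sync"
)

// resultCache remembers the first status seen for each URL during a run,
// so repeated references to the same endpoint don't trigger new requests
// (and new chances to hit a 429). Illustrative only.
type resultCache struct {
	mu      sync.Mutex
	results map[string]int
}

func newResultCache() *resultCache {
	return &resultCache{results: make(map[string]int)}
}

func (c *resultCache) check(client *http.Client, url string) (int, error) {
	c.mu.Lock()
	if status, ok := c.results[url]; ok {
		c.mu.Unlock()
		return status, nil
	}
	c.mu.Unlock()

	resp, err := client.Get(url)
	if err != nil {
		return 0, err
	}
	resp.Body.Close()

	c.mu.Lock()
	c.results[url] = resp.StatusCode
	c.mu.Unlock()
	return resp.StatusCode, nil
}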

@brunchboy
I have just run into this in a major way trying to check links on https://afterglow-guide.deepsymmetry.org/ since it has many dozens of links to source code lines in the GitHub repository. For now, I am going to have to disable checking them, but it would be very nice if 429s could lead to per-host exponential backoff (or tailored backoff if an interval could be parsed from the 429 response, but that is a nice-to-have compared to the necessity of just having a queue to retry these links at progressively longer intervals).
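
For the "tailored backoff" case, a sketch of reading the wait hint from a 429's Retry-After header; the helper and its fallback value are hypothetical, not htmltest code.

package main

import (
	"net/http"
	"strconv"
	"time"
)

// retryAfter extracts the wait hinted at by a 429 response's Retry-After
// header, falling back to a caller-supplied default when the header is
// missing or unparseable. Hypothetical helper, not part of htmltest.
func retryAfter(resp *http.Response, fallback time.Duration) time.Duration {
	v := resp.Header.Get("Retry-After")
	if v == "" {
		return fallback
	}
	// Retry-After can be a delay in seconds...
	if secs, err := strconv.Atoi(v); err == nil {
		return time.Duration(secs) * time.Second
	}
	// ...or an HTTP-date.
	if t, err := http.ParseTime(v); err == nil {
		if d := time.Until(t); d > 0 {
			return d
		}
	}
	return fallback
}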
