docs: add lychee github action for test purposes #839

TC-MO · 2024-01-17T15:40:23Z

No description provided.

barjin · 2024-02-08T16:05:09Z

Context for other reviewers:

The setup in this PR currently checks for broken link URLs in the .md / .mdx files of the project.
This is slightly dangerous - during build, @docusaurus/plugin-content-docs compiles those markdown files, replacing (among others) Markdown links - something that Lychee doesn't do and fails because of:

See that there are also issues with mixing HTML and JS in the .mdx syntax.

Proposed solution:

We can run the Docusaurus build first and analyze the generated html files.
- The building logic can be stolen from the current building GH action here.
- Note that this might introduce new issues (e.g. this with different names of newly generated JS bundles):
  
  this is probably easily removed with some --exclude directives
It still might consider an article added to the tested PR as missing (as Lychee is taking the links from the "local version", but checking for the resources online, in the currently deployed version, right?)

TC-MO · 2024-02-12T06:50:59Z

If we would be able to check links only after build process is done this would save us a lot of headache with fixing file-path based links
To the point of Lychee considering new page as false positive
- While that is true, this also is an edge case that happens rarely that we would be linking to a new page from older docs, and even if we would to do this, we can just make some workaround for this.

barjin · 2024-02-12T10:36:43Z

Check out my PR which uses the built website version for the link checking. There are still some quirks to figure out, but there's probably less of them than here. Feel free to merge that branch in here and close it... or whatever :)

TC-MO · 2024-02-13T08:55:18Z

I thought that with this new approach we will have way less errors, since the links will be built by docusaurus, but if I see it correctly we have 10x the amount of checked, but similar count of errors?

barjin · 2024-02-13T09:21:38Z

When you look at the single documentation page, it contains many more links than the original markdown document - navigation bars, previous / next article or Edit this page links leading to the GitHub repository of the docs. The hike up in the links count is a result of this (and is IMO correct, as now Lychee tests every single link a user can click, not only the ones directly in the content).

The Edit this page link also causes (most|many) of the issues - GitHub is perhaps not too happy with the Lychee HTTP client accessing it (seems like a bot). Maybe it's safe to ignore those links (https://github.com/apify/apify-docs/edit/master/...), as we can assume they are automatically 1:1 generated - and that they are also for internal use only? Without those, it seems to only report some actual, more interesting issues.

TC-MO · 2024-02-13T09:37:05Z

Ah yes, fair enough the point about lychee now checking also navbars and other clickable content makes perfect sense.

As for Edit this page , I don't think this is for internal use only, as our docs are open source, technically anyone can click on them and create fork/pr to our docs, at the same since those are generated by Docusaurus, if I'm not mistaken, we can safely assume that they are always ok and ignore those in link checking? Or we could just accept 429 status code as valid ? But that seems like not the greatest approach.

TC-MO · 2024-02-13T10:54:42Z

One other thing to consider is that it seems that we have a lot of false positives in the lychee report since a lot of our links are linking to .md files but relatively for example in file sources/academy/webscraping/web_scraping_for_beginners/data_extraction/index.md we are linking to other file using ./browser_devtools.md. Since we are using -- base argument in lychee it tries to check it against docs.apify.com and fails. But, if the link would be sources/academy/webscraping/web_scraping_for_beginners/data_extraction/browser_devtools.md then it seems that lychee checks it as internal md link instead trying to check it against docs.apify.com. So maybe a better approach would be to make sure that all md links are not relative, and thus fixing the false positives, instead of checking after docusaurus builds? @B4nan could you weigh in on the issue?

B4nan · 2024-02-13T11:57:54Z

One other thing to consider is that it seems that we have a lot of false positives in the lychee report since a lot of our links are linking to .md files but relatively for example in file sources/academy/webscraping/web_scraping_for_beginners/data_extraction/index.md we are linking to other file using ./browser_devtools.md. Since we are using -- base argument in lychee it tries to check it against docs.apify.com and fails. But, if the link would be sources/academy/webscraping/web_scraping_for_beginners/data_extraction/browser_devtools.md then it seems that lychee checks it as internal md link instead trying to check it against docs.apify.com.

How is that relevant? The relative paths are transformed to actual links on build time, lychee won't test those but the actual rendered links, right? Sounds like a misconfiguration to me.

TC-MO · 2024-02-13T13:07:17Z

@mtrunkat

TC-MO · 2024-02-13T13:34:30Z

I think this boils down to which approach we choose:

Check after build
This way we check a lot of extra stuff like navbar, github edit links etc. But we can filter them out with proper configuration and we can think less about how we link within our content (but still should default to one way of doing it and make all the links adhere to this way)
Check .md and .mdx files
This way we won't be checking all the extra stuff from the built page, just links within content itself. This approach tho necessitates that we fix all relative links md links because lychee tries to check them against docs.apify.com. Unless I'm mistaken about lychee and the way it tries to check the relative links when using --base argument .

B4nan · 2024-02-13T13:37:12Z

Check .md and .mdx files

Keep in mind docusaurus is already checking those links. A relative path link is always pointing to something local which is exactly the thing that docusaurus verifies (and breaks the CI checks if it fails)

This way we won't be checking all the extra stuff from the built page, just links within content itself. This approach tho necessitates that we fix all relative links md links because lychee tries to check them against docs.apify.com. Unless I'm mistaken about lychee and the way it tries to check the relative links when using --base argument .

I disagree this is "something to fix", the links are not broken.

B4nan · 2024-02-13T13:48:04Z

Found this example repo:

https://github.com/StarRocks/docusaurus-build

Which explicitly skips the links pointing to .md from lychee, sounds like a good approach to me:

  - name: link check
    if: always()
    uses: lycheeverse/[email protected]
    with:
      fail: true
      args: >
        --config docs/lychee.toml
        --offline "docs/**/*.md"

(we should skip both .md and .mdx here I guess)

TC-MO · 2024-02-13T14:09:51Z

Fix was in a sense that it does throw false positives.

I don't think --offline will work in our case. From what I understand it blocks network requests, but we have links to other parts of documentation that need checking (docs > sdk docs for example) and external links that also would be nice to check.

B4nan · 2024-02-13T14:25:57Z

I don't think --offline will work in our case. From what I understand it blocks network requests, but we have links to other parts of documentation that need checking (docs > sdk docs for example) and external links that also would be nice to check.

Those links are never "relative file links", they cannot be, such links can only point to the local files in the specific docusaurus project.

TC-MO · 2024-02-14T12:41:55Z

I think there might have been some misunderstanding regarding the approach to link checking with Lychee.

We're dealing with three types of links in our documentation:

./foo/bar.md - These are relative links within our project. They correctly resolve post-build but currently trigger false positives in Lychee because they're checked against our --base URL, leading to errors.
/foo/bar - These are relative URLs pointing to other parts of our documentation outside this repository (e.g., from docs to SDK docs or CLI docs). To properly check these, we need to use the --base argument.
Standard external links - These link to pages outside our documentation and are not problematic in our current setup.

The challenge arises with how Lychee handles the first type of links when the --base argument is applied. It misinterprets them and tries to resolve them against --base URL, hence the false positives.

We've identified two potential solutions:

Post-build Lychee check: This method seems to address the issue of false positives for internal links. However, it introduces new challenges, such as the need to exclude GitHub links to avoid rate limiting and additional false positives from external sources.
Pre-build Lychee check with modifications to internal links: By changing our internal linking strategy from ./foo/bar.md to a more absolute path like /foo/bar/foobar/bar.md, we can prevent Lychee from misinterpreting them. This adjustment should eliminate the false positives without requiring changes to our use of the --base argument.

B4nan · 2024-02-14T13:03:56Z

I don't think this is about misunderstanding, more like I don't like what you are saying and I have a hard time believing there isn't a better way that does not involve rewriting the content :]

./foo/bar.md - These are relative links within our project. They correctly resolve post-build but currently trigger false positives in Lychee because they're checked against our --base URL, leading to errors.

Yes, and the solution to this is to skip those from lychee, it feels like you don't want to do that, but it is the correct fix.

/foo/bar - These are relative URLs pointing to other parts of our documentation outside this repository (e.g., from docs to SDK docs or CLI docs). To properly check these, we need to use the --base argument.

Not necessarily, they can point to the same docusaurus instance as well, but yes.

The challenge arises with how Lychee handles the first type of links when the --base argument is applied. It misinterprets them and tries to resolve them against --base URL, hence the false positives.

Ignore rules that need to be applied to the source, not to post-processed URLs (so this needs to ignore anything, including the base arg), I am sure lychee allows to do that.

Maybe the offline option is not exactly what we need, but given the linked repo is dealing with exactly this problem, it feels like it might do what we need. We should first try that instead of reasoning about its name and description and whatnot. In the end, checking the links without network access sounds exactly like "skipping the actual problematic check", not sure what it checks in this mode, I'd guess it's gonna be more like an ignore rule rather than offline rule... I will take a closer look myself during the afternoon - just finishing a PR that updates the eslint setup in this repo, so need to finish with that first.

Changing the URLs would be a step back to me - the paths are pretty handy, as they decouple the content from the URL, and they provide intellisense/autocomplete too.

barjin · 2024-02-14T13:29:00Z

--offline is even worse than that, since it only checks links with file:// scheme:

From a glance into the Lychee source code, it's not really equipped for checking for local files - the only notion of a local file is the file:// URL scheme, which we don't do / want / have.

Aside from skipping the GitHub 429s on the edit pages, what other issues are there with checking the built website version? Imo having Lychee check the built, properly formed HTML links solves all our issues just like that. Being able to check both external and internal links is also a plus for me, as that PR of mine already found at least two issues, like linking https://pptr.dev/docs/puppeteer.page.evaluateonnewdocument and https://docs.apify.com/api/v2/ (with trailing slash), which both return 404.

TC-MO · 2024-02-14T13:36:21Z

Doesn't docs.apify.com/api/v2 always return 404 and only after 404 loads the API ref? At least that is how I always saw it when entering API Reference for the first time.

barjin · 2024-02-14T13:44:36Z

That doesn't seem to be the case - check the Network tab while accessing the URLs with and without the trailing slash:

Let's not digress too much from the Lychee thing - we'll have enough time to fix broken links after we know where they are 😄

B4nan · 2024-02-14T13:51:40Z

Doesn't docs.apify.com/api/v2 always return 404 and only after 404 loads the API ref? At least that is how I always saw it when entering API Reference for the first time.

https://docs.apify.com/api/v2 works fine, as that is the actual URL of the page. If you add the trailing slash https://docs.apify.com/api/v2/, you get 404 (which is imo correct, or better say expected). This could be probably improved in the nginx config.

edit: this is rather general issue, its not just about this page - its a client side app, docusaurus does not handle server side routing, so its capability of detecting what is actually missing is a bit weak. we have the exact same problem on every page, if you add the trailing slash, you get 404 in the server response (but things get properly hydrated client side).

TC-MO · 2024-03-07T10:39:51Z

Since we are moving forward with #855 I'll go ahead and close this PR

TC-MO added 13 commits January 17, 2024 16:40

docs: add lychee github action for test purposes

c692d83

bump checkout action & add .lycheeignore file

f8a7759

fix broken regex

aa3ce3e

fix typo in file name

56599f6

add yt to ignored links

87c8ba4

add base URL flag

4352529

add regex to ignore image links

3b7986f

add webp format to ignored

da2a79a

add svg format to ignored

5d5dde3

set sources as start point for lychee - test

f800ecf

remove trailing slash from base arg

7b9f40d

Merge branch 'master' into lychee-test

c6d3107

Merge branch 'master' into lychee-test

757dd5d

github-actions bot assigned TC-MO Feb 5, 2024

github-actions bot added the t-docs Issues owned by technical writing team. label Feb 5, 2024

TC-MO closed this Mar 7, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

docs: add lychee github action for test purposes #839

docs: add lychee github action for test purposes #839

TC-MO commented Jan 17, 2024

barjin commented Feb 8, 2024 •

edited

Loading

TC-MO commented Feb 12, 2024

barjin commented Feb 12, 2024

TC-MO commented Feb 13, 2024

barjin commented Feb 13, 2024

TC-MO commented Feb 13, 2024

TC-MO commented Feb 13, 2024 •

edited

Loading

B4nan commented Feb 13, 2024

TC-MO commented Feb 13, 2024

TC-MO commented Feb 13, 2024

B4nan commented Feb 13, 2024

B4nan commented Feb 13, 2024 •

edited

Loading

TC-MO commented Feb 13, 2024

B4nan commented Feb 13, 2024

TC-MO commented Feb 14, 2024

B4nan commented Feb 14, 2024 •

edited

Loading

barjin commented Feb 14, 2024

TC-MO commented Feb 14, 2024

barjin commented Feb 14, 2024

B4nan commented Feb 14, 2024 •

edited

Loading

TC-MO commented Mar 7, 2024

docs: add lychee github action for test purposes #839

docs: add lychee github action for test purposes #839

Conversation

TC-MO commented Jan 17, 2024

barjin commented Feb 8, 2024 • edited Loading

Context for other reviewers:

Proposed solution:

TC-MO commented Feb 12, 2024

barjin commented Feb 12, 2024

TC-MO commented Feb 13, 2024

barjin commented Feb 13, 2024

TC-MO commented Feb 13, 2024

TC-MO commented Feb 13, 2024 • edited Loading

B4nan commented Feb 13, 2024

TC-MO commented Feb 13, 2024

TC-MO commented Feb 13, 2024

B4nan commented Feb 13, 2024

B4nan commented Feb 13, 2024 • edited Loading

TC-MO commented Feb 13, 2024

B4nan commented Feb 13, 2024

TC-MO commented Feb 14, 2024

B4nan commented Feb 14, 2024 • edited Loading

barjin commented Feb 14, 2024

TC-MO commented Feb 14, 2024

barjin commented Feb 14, 2024

B4nan commented Feb 14, 2024 • edited Loading

TC-MO commented Mar 7, 2024

barjin commented Feb 8, 2024 •

edited

Loading

TC-MO commented Feb 13, 2024 •

edited

Loading

B4nan commented Feb 13, 2024 •

edited

Loading

B4nan commented Feb 14, 2024 •

edited

Loading

B4nan commented Feb 14, 2024 •

edited

Loading