Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

docs: add lychee github action for test purposes #839

Closed
wants to merge 13 commits into from
Closed

Conversation

TC-MO
Copy link
Contributor

@TC-MO TC-MO commented Jan 17, 2024

No description provided.

@github-actions github-actions bot added the t-docs Issues owned by technical writing team. label Feb 5, 2024
@barjin
Copy link
Contributor

barjin commented Feb 8, 2024

Context for other reviewers:

  • The setup in this PR currently checks for broken link URLs in the .md / .mdx files of the project.
  • This is slightly dangerous - during build, @docusaurus/plugin-content-docs compiles those markdown files, replacing (among others) Markdown links - something that Lychee doesn't do and fails because of:

obrazek
See that there are also issues with mixing HTML and JS in the .mdx syntax.

Proposed solution:

  • We can run the Docusaurus build first and analyze the generated html files.
    • The building logic can be stolen from the current building GH action here.
    • Note that this might introduce new issues (e.g. this with different names of newly generated JS bundles):

      obrazek
      this is probably easily removed with some --exclude directives

  • It still might consider an article added to the tested PR as missing (as Lychee is taking the links from the "local version", but checking for the resources online, in the currently deployed version, right?)

@TC-MO
Copy link
Contributor Author

TC-MO commented Feb 12, 2024

  • If we would be able to check links only after build process is done this would save us a lot of headache with fixing file-path based links

  • To the point of Lychee considering new page as false positive

    • While that is true, this also is an edge case that happens rarely that we would be linking to a new page from older docs, and even if we would to do this, we can just make some workaround for this.

@barjin
Copy link
Contributor

barjin commented Feb 12, 2024

Check out my PR which uses the built website version for the link checking. There are still some quirks to figure out, but there's probably less of them than here. Feel free to merge that branch in here and close it... or whatever :)

@TC-MO
Copy link
Contributor Author

TC-MO commented Feb 13, 2024

I thought that with this new approach we will have way less errors, since the links will be built by docusaurus, but if I see it correctly we have 10x the amount of checked, but similar count of errors?

obraz

@barjin
Copy link
Contributor

barjin commented Feb 13, 2024

When you look at the single documentation page, it contains many more links than the original markdown document - navigation bars, previous / next article or Edit this page links leading to the GitHub repository of the docs. The hike up in the links count is a result of this (and is IMO correct, as now Lychee tests every single link a user can click, not only the ones directly in the content).

The Edit this page link also causes (most|many) of the issues - GitHub is perhaps not too happy with the Lychee HTTP client accessing it (seems like a bot). Maybe it's safe to ignore those links (https://github.com/apify/apify-docs/edit/master/...), as we can assume they are automatically 1:1 generated - and that they are also for internal use only? Without those, it seems to only report some actual, more interesting issues.

@TC-MO
Copy link
Contributor Author

TC-MO commented Feb 13, 2024

Ah yes, fair enough the point about lychee now checking also navbars and other clickable content makes perfect sense.

  • As for Edit this page , I don't think this is for internal use only, as our docs are open source, technically anyone can click on them and create fork/pr to our docs, at the same since those are generated by Docusaurus, if I'm not mistaken, we can safely assume that they are always ok and ignore those in link checking? Or we could just accept 429 status code as valid ? But that seems like not the greatest approach.

@TC-MO
Copy link
Contributor Author

TC-MO commented Feb 13, 2024

One other thing to consider is that it seems that we have a lot of false positives in the lychee report since a lot of our links are linking to .md files but relatively for example in file sources/academy/webscraping/web_scraping_for_beginners/data_extraction/index.md we are linking to other file using ./browser_devtools.md. Since we are using -- base argument in lychee it tries to check it against docs.apify.com and fails. But, if the link would be sources/academy/webscraping/web_scraping_for_beginners/data_extraction/browser_devtools.md then it seems that lychee checks it as internal md link instead trying to check it against docs.apify.com. So maybe a better approach would be to make sure that all md links are not relative, and thus fixing the false positives, instead of checking after docusaurus builds? @B4nan could you weigh in on the issue?

@B4nan
Copy link
Member

B4nan commented Feb 13, 2024

One other thing to consider is that it seems that we have a lot of false positives in the lychee report since a lot of our links are linking to .md files but relatively for example in file sources/academy/webscraping/web_scraping_for_beginners/data_extraction/index.md we are linking to other file using ./browser_devtools.md. Since we are using -- base argument in lychee it tries to check it against docs.apify.com and fails. But, if the link would be sources/academy/webscraping/web_scraping_for_beginners/data_extraction/browser_devtools.md then it seems that lychee checks it as internal md link instead trying to check it against docs.apify.com.

How is that relevant? The relative paths are transformed to actual links on build time, lychee won't test those but the actual rendered links, right? Sounds like a misconfiguration to me.

@TC-MO
Copy link
Contributor Author

TC-MO commented Feb 13, 2024

@mtrunkat

@TC-MO
Copy link
Contributor Author

TC-MO commented Feb 13, 2024

I think this boils down to which approach we choose:

  1. Check after build
    This way we check a lot of extra stuff like navbar, github edit links etc. But we can filter them out with proper configuration and we can think less about how we link within our content (but still should default to one way of doing it and make all the links adhere to this way)

  2. Check .md and .mdx files
    This way we won't be checking all the extra stuff from the built page, just links within content itself. This approach tho necessitates that we fix all relative links md links because lychee tries to check them against docs.apify.com. Unless I'm mistaken about lychee and the way it tries to check the relative links when using --base argument .

@B4nan
Copy link
Member

B4nan commented Feb 13, 2024

Check .md and .mdx files

Keep in mind docusaurus is already checking those links. A relative path link is always pointing to something local which is exactly the thing that docusaurus verifies (and breaks the CI checks if it fails)

This way we won't be checking all the extra stuff from the built page, just links within content itself. This approach tho necessitates that we fix all relative links md links because lychee tries to check them against docs.apify.com. Unless I'm mistaken about lychee and the way it tries to check the relative links when using --base argument .

I disagree this is "something to fix", the links are not broken.

@B4nan
Copy link
Member

B4nan commented Feb 13, 2024

Found this example repo:

https://github.com/StarRocks/docusaurus-build

Which explicitly skips the links pointing to .md from lychee, sounds like a good approach to me:

  - name: link check
    if: always()
    uses: lycheeverse/[email protected]
    with:
      fail: true
      args: >
        --config docs/lychee.toml
        --offline "docs/**/*.md"

(we should skip both .md and .mdx here I guess)

@TC-MO
Copy link
Contributor Author

TC-MO commented Feb 13, 2024

Fix was in a sense that it does throw false positives.

I don't think --offline will work in our case. From what I understand it blocks network requests, but we have links to other parts of documentation that need checking (docs > sdk docs for example) and external links that also would be nice to check.

@B4nan
Copy link
Member

B4nan commented Feb 13, 2024

I don't think --offline will work in our case. From what I understand it blocks network requests, but we have links to other parts of documentation that need checking (docs > sdk docs for example) and external links that also would be nice to check.

Those links are never "relative file links", they cannot be, such links can only point to the local files in the specific docusaurus project.

@TC-MO
Copy link
Contributor Author

TC-MO commented Feb 14, 2024

I think there might have been some misunderstanding regarding the approach to link checking with Lychee.

We're dealing with three types of links in our documentation:

  1. ./foo/bar.md - These are relative links within our project. They correctly resolve post-build but currently trigger false positives in Lychee because they're checked against our --base URL, leading to errors.
  2. /foo/bar - These are relative URLs pointing to other parts of our documentation outside this repository (e.g., from docs to SDK docs or CLI docs). To properly check these, we need to use the --base argument.
  3. Standard external links - These link to pages outside our documentation and are not problematic in our current setup.

The challenge arises with how Lychee handles the first type of links when the --base argument is applied. It misinterprets them and tries to resolve them against --base URL, hence the false positives.

We've identified two potential solutions:

  1. Post-build Lychee check: This method seems to address the issue of false positives for internal links. However, it introduces new challenges, such as the need to exclude GitHub links to avoid rate limiting and additional false positives from external sources.
  2. Pre-build Lychee check with modifications to internal links: By changing our internal linking strategy from ./foo/bar.md to a more absolute path like /foo/bar/foobar/bar.md, we can prevent Lychee from misinterpreting them. This adjustment should eliminate the false positives without requiring changes to our use of the --base argument.

@B4nan
Copy link
Member

B4nan commented Feb 14, 2024

I don't think this is about misunderstanding, more like I don't like what you are saying and I have a hard time believing there isn't a better way that does not involve rewriting the content :]

./foo/bar.md - These are relative links within our project. They correctly resolve post-build but currently trigger false positives in Lychee because they're checked against our --base URL, leading to errors.

Yes, and the solution to this is to skip those from lychee, it feels like you don't want to do that, but it is the correct fix.

/foo/bar - These are relative URLs pointing to other parts of our documentation outside this repository (e.g., from docs to SDK docs or CLI docs). To properly check these, we need to use the --base argument.

Not necessarily, they can point to the same docusaurus instance as well, but yes.

The challenge arises with how Lychee handles the first type of links when the --base argument is applied. It misinterprets them and tries to resolve them against --base URL, hence the false positives.

Ignore rules that need to be applied to the source, not to post-processed URLs (so this needs to ignore anything, including the base arg), I am sure lychee allows to do that.

Maybe the offline option is not exactly what we need, but given the linked repo is dealing with exactly this problem, it feels like it might do what we need. We should first try that instead of reasoning about its name and description and whatnot. In the end, checking the links without network access sounds exactly like "skipping the actual problematic check", not sure what it checks in this mode, I'd guess it's gonna be more like an ignore rule rather than offline rule... I will take a closer look myself during the afternoon - just finishing a PR that updates the eslint setup in this repo, so need to finish with that first.

Changing the URLs would be a step back to me - the paths are pretty handy, as they decouple the content from the URL, and they provide intellisense/autocomplete too.

@barjin
Copy link
Contributor

barjin commented Feb 14, 2024

--offline is even worse than that, since it only checks links with file:// scheme:
obrazek

From a glance into the Lychee source code, it's not really equipped for checking for local files - the only notion of a local file is the file:// URL scheme, which we don't do / want / have.

Aside from skipping the GitHub 429s on the edit pages, what other issues are there with checking the built website version? Imo having Lychee check the built, properly formed HTML links solves all our issues just like that. Being able to check both external and internal links is also a plus for me, as that PR of mine already found at least two issues, like linking https://pptr.dev/docs/puppeteer.page.evaluateonnewdocument and https://docs.apify.com/api/v2/ (with trailing slash), which both return 404.

@TC-MO
Copy link
Contributor Author

TC-MO commented Feb 14, 2024

Doesn't docs.apify.com/api/v2 always return 404 and only after 404 loads the API ref? At least that is how I always saw it when entering API Reference for the first time.

@barjin
Copy link
Contributor

barjin commented Feb 14, 2024

That doesn't seem to be the case - check the Network tab while accessing the URLs with and without the trailing slash:

obrazek

Let's not digress too much from the Lychee thing - we'll have enough time to fix broken links after we know where they are 😄

@B4nan
Copy link
Member

B4nan commented Feb 14, 2024

Doesn't docs.apify.com/api/v2 always return 404 and only after 404 loads the API ref? At least that is how I always saw it when entering API Reference for the first time.

https://docs.apify.com/api/v2 works fine, as that is the actual URL of the page. If you add the trailing slash https://docs.apify.com/api/v2/, you get 404 (which is imo correct, or better say expected). This could be probably improved in the nginx config.

edit: this is rather general issue, its not just about this page - its a client side app, docusaurus does not handle server side routing, so its capability of detecting what is actually missing is a bit weak. we have the exact same problem on every page, if you add the trailing slash, you get 404 in the server response (but things get properly hydrated client side).

@TC-MO
Copy link
Contributor Author

TC-MO commented Mar 7, 2024

Since we are moving forward with #855 I'll go ahead and close this PR

@TC-MO TC-MO closed this Mar 7, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
t-docs Issues owned by technical writing team.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants