Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Check for targets prefixed with user-content- #12702

Closed
wants to merge 1 commit into from

Conversation

madmiraal
Copy link

When looking for anchors, check for targets prefixed with user-content-.

Feature or Bugfix

  • Bugfix

Purpose

When parsing user .md and .rst files, GitHub, GitLab and Gitea automatically create href fragment targets to headings. To avoid DOM clobbering they append user-content- to the automatically created name and id targets and use JavaScript to manually scroll to the heading when the URL contains the un-prefixed fragment. Since prefixing targets with user-content- is the de facto standard, to avoid linkcheck.py failing to find the un-prefixed anchors, we need to also check for targets prefixed with user-content- when parsing the HTML looking for a URL's fragment.

Detail

Using Sphinx's own README.rst, we can reference the Features section using https://github.com/sphinx-doc/sphinx#features.

However, when testing this reference with:
index.rst:

https://github.com/sphinx-doc/sphinx#features

conf.py: empty

Running linkcheck:

sphinx-build -b linkcheck . _linkcheck

Without this PR, it fails:

(           index: line    1) broken    https://github.com/sphinx-doc/sphinx#features - Anchor 'features' not found
build finished with problems.

With this PR, it succeeds:

(           index: line    1) ok        https://github.com/sphinx-doc/sphinx#features
build succeeded.

Note: this does not reintroduce regression #9435, because linkcheck.py currently fails with these #Lxxx fragments. The #Lxxx fragments only work with JavaScript. There are no prefixed Lxxx name and id targets.

Relates

@@ -630,21 +633,6 @@ class RateLimit(NamedTuple):
next_check: float


def rewrite_github_anchor(app: Sphinx, uri: str) -> str | None:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should still use this, rather than attempting to insert 'user-content-' for every website.

@jayaddison
Copy link
Contributor

@madmiraal thanks for the pull request!

Similar to @AA-Turner, I'm wary of applying a change to handle a few websites that affects all websites - it's tricky to roll those kind of changes back. This one seems fine, but even so let's figure out what we can figure out.

From some digging around on the web: it seems that the user-content- prefix is a default one added to JavaScript-added DOM content by DOMPurify - I don't know if that's exactly why it's becoming more prevalent, but it does make sense for platforms to do something to prevent user-sourced content from clobbering platform content.

Thinking about what this change does: its transforming broken links into working ones by trying to find differently-named entities on the page.

Thinking aloud: what if instead we emitted a warning and a Did you mean <#user-content-features> ? or similar output -- could we implement that in a way that doesn't require logic tailored to specific websites? And isn't that better long-term, to encourage folks to fix their outbound hyperlinks to entities that can be found in the HTML markup?

@@ -630,21 +633,6 @@ class RateLimit(NamedTuple):
next_check: float


def rewrite_github_anchor(app: Sphinx, uri: str) -> str | None:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please note: some people may still be using rewrite_github_anchor in their Sphinx project configurations. We should add a deprecation notice to the function instead of removing it completely, so that they have time to adjust. Features are removed two major versions after they are first indicated for removal (so Sphinx v10.0 for this).

More information about this can be found in Sphinx's deprecation policy documentation: https://www.sphinx-doc.org/en/master/internals/release-process.html#deprecation-policy

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've updated the PR to add a deprecation message to rewrite_github_anchor instead of deleting it.

@madmiraal
Copy link
Author

Thanks for all the feedback.

I think we should still use this, rather than attempting to insert 'user-content-' for every website.

@AA-Turner

I'm wary of applying a change to handle a few websites that affects all websites

@jayaddison

I agree that this may lead to some false positives, but I think these will be insignificantly few, and there will be many websites using Gitea that would benefit from this too. However, we could add an option defaulting to True to enable or disable this.

what if instead we emitted a warning and a Did you mean <#user-content-features> ? or similar output -- could we implement that in a way that doesn't require logic tailored to specific websites? And isn't that better long-term, to encourage folks to fix their outbound hyperlinks to entities that can be found in the HTML markup?

The problem is: the un-prefixed target is the correct one. It is the one provided; so it's what people would expect to work:
sphinx-features-link
The prefixed target is used by the site's JavaScript to scroll to the correct location and simulate opening the HTML link with the un-prefixed fragment. Without looking under the hood, like we have done, there is no way for the average (user (including) me), to realise this. It's also worth noting that Gitea deliberately removed the user-conent- from the presented href: go-gitea/gitea#11903

Please note: some people may still be using rewrite_github_anchor in their Sphinx project configurations. We should add a deprecation notice to the function instead of removing it completely

Please correct me if I am wrong, but from what I can tell, rewrite_github_anchor is only used by the commented out line:

# app.connect('linkcheck-process-uri', rewrite_github_anchor)

How else would people use it that we would need to mark it as deprecated?

@jayaddison
Copy link
Contributor

It is the one provided; so it's what people would expect to work:

That's the root cause, yep - sites presenting shortened/prettified hyperlinks that act like standard navigation (browser location on page) and yet aren't referencing the corresponding HTML content.

How else would people use it that we would need to mark it as deprecated?

After verifying your example .rst file / use-case, and subsequently checking the behaviour of rewrite_github_anchor, I created the following conf.py in a test project:

from sphinx.builders.linkcheck import rewrite_github_anchor

def setup(app):
    app.connect('linkcheck-process-uri', rewrite_github_anchor)

(in other words: yep, it's commented out, meaning that it isn't enabled by default, but it is still available for usage)

The rewrite_github_anchor function doesn't currently handle line-number fragments (foo.c#L200) but some downstream projects appear to have adapted it slightly to handle that and a few other commonly-encountered code-forge anchors.

@AA-Turner
Copy link
Member

I agree that this may lead to some false positives, but I think these will be insignificantly few, and there will be many websites using Gitea that would benefit from this too. However, we could add an option defaulting to True to enable or disable this.

For now, let's use the site-specific filter with GitHub and perhaps GitLab. If people want to use this with Gitea we will get feature requests, but let's start simple (without another config option!)

A

@AA-Turner
Copy link
Member

@jayaddison please could I test something with you? I've updated the repository settings to give members of the triage team more permissions on GitHub.

  1. Are you able to edit PR titles now?
  2. Are you able to edit other people's comments now?
  3. Is the merge button still disabled?

A

@madmiraal madmiraal force-pushed the fix-linkcheck-anchors branch 3 times, most recently from 961aa32 to 2a4ef91 Compare July 31, 2024 08:53
@jayaddison
Copy link
Contributor

Thanks @AA-Turner - I'll likely only use the second permission very sparingly, but yep, I can confirm that all three permissions/restrictions are in place as expected:

  • Are you able to edit PR titles now?

  • Are you able to edit other people's comments now?

  • Is the merge button still disabled?

@madmiraal
Copy link
Author

For now, let's use the site-specific filter with GitHub and perhaps GitLab. If people want to use this with Gitea we will get feature requests, but let's start simple (without another config option!)

I'm not convinced that making a site-specific filter is maintainable. In addition to Gitea, GitLab is also used by a number of sites that don't use the gitlab.com host. Furthermore in addition to Gitea there is also Forgejo and Gogs; so the number of potential sites increases again.

That been said, I've tested this with a number of Github type sites:

(           index: line    1) ok        https://github.com/sphinx-doc/sphinx#user-content-features
(           index: line    1) broken    https://gitlab.com/gitlab-org/gitlab#gitlab - Anchor 'gitlab' not found
(           index: line    1) ok        https://codeberg.org/FreeBSD/freebsd-src#freebsd-source
(           index: line    1) broken    https://notabug.org/xiph/vorbis#vorbis - Anchor 'vorbis' not found```
(           index: line    1) broken    https://gitlab.xfce.org/xfce/thunar#homepage - Anchor 'homepage' not found
(           index: line    1) ok        https://git.fsfe.org/FSFE/fsfe-website#technical-information

Unfortunately, neither this approach nor the original rewrite_github_anchor approach works on GitLab or GitLab based sites. I assume it is fully JavaScript based; so although it creates a user-content- prefixed id, the id is not retrieved, so appending user-content- using either approach doesn't work. Similarly, Gogs based sites, such as https://notabug.org, don't even create an user-content- id or prefix. So there isn't even a prefixed id or name to look for. Again, I assume it's fully JavaScript based.

Therefore, I'm closing this PR in lieu of finding a more comprehensive solution.

@madmiraal madmiraal closed this Jul 31, 2024
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Aug 31, 2024
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
3 participants