Skip to content

Anchor links changed? #2002

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
To1ne opened this issue May 7, 2025 · 17 comments · Fixed by #2007
Closed

Anchor links changed? #2002

To1ne opened this issue May 7, 2025 · 17 comments · Fixed by #2007
Assignees

Comments

@To1ne
Copy link
Collaborator

To1ne commented May 7, 2025

I ran across a link on the interwebs:

https://git-scm.com/docs/git-clone#Documentation/git-clone.txt---recurse-submodulesltpathspecgt

and I noticed it's not anchoring to the correct piece of the page. The nowadays seems to be:

https://git-scm.com/docs/git-clone#Documentation/git-clone.txt-code--recurse-submodulesltpathspecgtcode

What changed the code to be added around the anchor link? I'd prefer to not have this behavior broken.

@To1ne
Copy link
Collaborator Author

To1ne commented May 7, 2025

I might look into this myself later, but I had to write this down and do something else first.

@dscho
Copy link
Member

dscho commented May 7, 2025

Oh no! I highly suspect the culprit to be related to the recent style changes (@jnavila does this sound plausible to you?)

@To1ne
Copy link
Collaborator Author

To1ne commented May 7, 2025

Oh no! I highly suspect the culprit to be related to the recent style changes

I hope it's recent and not broken since the Rails to Hugo transition, otherwise I expect both types of permalinks to be around on the interwebs. So that means we'd have to figure out a way to support both.

@dscho Shall/Can we add test coverage for these permalinks?

@jnavila
Copy link
Contributor

jnavila commented May 7, 2025

Oh no! I highly suspect the culprit to be related to the recent style changes (@jnavila does this sound plausible to you?)

I suspect so. All the command options are now formatted as <code>, and it seems this bleeds into the automatic anchor generation... Quite an inexpected side-effect. We are never sure that the machine generated anchor name is reproducible.

@jnavila
Copy link
Contributor

jnavila commented May 7, 2025

the generative part is here:

# HTML anchor on hdlist1 (i.e. command options)
html.gsub!(/<dt class="hdlist1">(.*?)<\/dt>/) do |_m|
text = $1.tr("^A-Za-z0-9-", "")
# use txt_path for backward compatibility of already-existing
# deep links shared across the interwebs.
anchor = "#{txt_path}-#{text}"
# handle anchor collisions by appending -1
anchor += "-1" while ids.include?(anchor)

And indeed it works on final html and does not select the "text" part of the html node.

@jnavila
Copy link
Contributor

jnavila commented May 7, 2025

We may have similar issues where I added "<placehoder>" marks. And it will have even a larger change effect after applying the new style formatting of #1921

@dscho
Copy link
Member

dscho commented May 7, 2025

I hope it's recent and not broken since the Rails to Hugo transition

It is recent: https://web.archive.org/web/20250113050613/https://git-scm.com/docs/git-clone#Documentation/git-clone.txt-code-lcode has the old-style links, and is well after the transition.

Shall/Can we add test coverage for these permalinks?

Absolutely! It's relatively easy in the Playwright tests, too, all you do is to call await page.goto(...) with, say, the git-clone page and then verify the link, like this:

// the repository URL is correct
await expect(page.getByRole('link', { name: 'hosted on GitHub' }))
.toHaveAttribute('href', 'https://github.com/progit/progit2')

You probably need to use a different way to locate the link, as its text is not "--local", instead the element following the link has this text, possibly something like this:

// The `<a>` element of class `anchor` directly preceding the `<code>` element whose text is "--local"
const anchor = page.locator('xpath=//code[text()="--local"]/preceding-sibling::a[@class="anchor"][1]')
await expect(anchor).toHaveAttribute('href', /#Documentation\/git-clone\.txt-l$/)

My XPath is not very good, so I'll refrain from refining it: It should probably catch instances where the "--local" text is not enclosed in a tag, too.

it works on final html and does not select the "text" part of the html node.

Yes, and that's most likely really wrong. I guess the easiest way to fix this would not be to switch from html to text, but to do something like this:

-        text = $1.tr("^A-Za-z0-9-", "")
+        text = $1.gsub(/<[^>]*>/, "").tr("^A-Za-z0-9-", "")

This would also proactively address the <placeholder> issue, at the expense of breaking potentially a lot of those <tt> links that are already in wide use.

I guess we could also do this instead:

-        text = $1.tr("^A-Za-z0-9-", "")
+        text = $1.gsub(/<\/?(?:code|placeholder)>/, "").tr("^A-Za-z0-9-", "")

Thoughts?

@jnavila
Copy link
Contributor

jnavila commented May 7, 2025

I was more thinking of settling on a true html parsing such as:

html.gsub!(/<dt class="hdlist1">(.*?)<\/dt>/) do |m| 
  sp = Nokogiri::HTML.parse(m)
  sp.text.tr("^A-Za-z0-9-", "")
end

Even with HTML5 now, we are not limited in the form of ids and do not need the last tr command.

@dscho
Copy link
Member

dscho commented May 8, 2025

@jnavila we're still limited by backwards-compatibility: Previously-working links using anchors should continue to work. That's why I would highly suggest to stick with parsing text rather than html, and to retain the tr() call.

@jnavila
Copy link
Contributor

jnavila commented May 19, 2025

Managing old style anchors does not mean that we still should publish them.

dscho added a commit to dscho/git-scm.com that referenced this issue May 20, 2025
Command options in Git's documentation are now being wrapped in HTML
tags like `<code>` for prettier formatting. Unfortunately, these tags
were being included in the generated anchor IDs, breaking links that
used to work. For example, an anchor for the
`--recurse-submodules[=<pathspec>]` option of the `clone` command would
now include 'code' in the anchor, causing existing deep links across the
web to fail to navigate to said option.

The documentation generation has been using these anchors for years and
many links using the old format (without HTML tags) are widely shared.
These links should continue to work, even as the document formatting
evolves to include more HTML.

This change modifies the anchor ID generation to strip HTML tags before
creating IDs, ensuring backward compatibility with existing references
while also supporting the new HTML-enhanced documentation format. Both
regular and localized documentation processing now handle HTML tags in
command options correctly.

These anchor IDs still include the `lt` and `gt` left-overs from the
pointy brackets that are encoded as `&lt;`/`&gt;` in HTML. This is
intentional because the existing deep links all over the web do include
those.

Fixes git#2002

Reported-by: Toon Claes <[email protected]>
Helped-by: Jean-Noël Avila <[email protected]>
Co-authored-by: Johannes Schindelin <[email protected]>
Signed-off-by: Johannes Schindelin <[email protected]>
@dscho
Copy link
Member

dscho commented May 20, 2025

Hrm. We might have a problem here, one that is not that easy to solve:

git grep 'href=[^>]*\.txt[^>]*code' bdcfcc29f7e775b1cea403f5e6ea101685c79ab4 -- external/docs/content/

shows 16,722 matches. bdcfcc2 was the huge commit that introduced the initial manual pages' HTML for consumption with Hugo.

These anchor IDs would all be broken by #2007 in the current form (using essentially this suggestion to address the problem reported in this here ticket, by removing the <code> and </code> substrings before generating the anchor ID). I.e. it would worsimprove the situation.

For example, https://git-scm.com/docs/git-clone#Documentation/git-clone.txt-code-lcode (notice how the anchor contains code twice?) is totally what was even on the first public Hugo built site. Even the version before the switch to Hugo had that anchor, according to the Wayback machine: https://web.archive.org/web/20240921004857/https://git-scm.com/docs/git-clone#Documentation/git-clone.txt-code-lcode (the last recorded version before the switch to Hugo).

Therefore, a wholesale removal of those <code> parts is inadvisable :-( Unless, that is, we can come up with a smarter idea that lets original anchors work somehow while still transforming the anchors to a nicer format.

My current thinking is that we could potentially detect in JavaScript when an anchor is specified in the current URL (window.location.hash !== '') that does not match any element (document.querySelector(CSS.escape(window.location.hash).slice(1)) === null). If that is the case, that JavaScript code could search for the closest match among the existing anchor IDs ([...document.querySelectorAll('[id]')].map(e => e.id)), and then assign that to window.location.hash. I have tried to come up with something super simple but even a Levenshtein distance is less trivial than I would have liked...

dscho added a commit to dscho/git-scm.com that referenced this issue May 21, 2025
The Git project is in the process of changing the formatting of the
manual pages. Unfortunately this leads to changed (read: backwards
incompatible) anchors.

For example, the following link used to work:

  https://git-scm.com/docs/git-clone#Documentation/git-clone.txt---recurse-submodulesltpathspecgt

Now it no longer jumps to the `--recurse-submodules` description. The
reason is that the anchor changed from:

  #Documentation/git-clone.txt---recurse-submodulesltpathspecgt

to:

  #Documentation/git-clone.txt-code--recurse-submodulesltpathspecgtcode

I was well on my way to change the code to simply strip out the `<code>`
and `</code>` text before generating the anchor in `update-docs.rb`, but
alas... it would break e.g. #Documentation/git-clone.txt-code-lcode,
which is a valid anchor and has been since forever.

Nevertheless, it would be nice if that (quite ugly) anchor was turned
into #Documentation/git-clone.txt-l instead.

Let's bite the bullet and add a small JavaScript snippet that detects
when the URL contains an anchor that is not actually to be found on the
current page, and then look for an existing anchor that is most similar
to that, then use that (if found).

This is not a perfect solution by any stretch of imagination, but it
should definitely help, and it makes a future changes in the way anchors
are generated a lot less controversial.

Fixes git#2002

Signed-off-by: Johannes Schindelin <[email protected]>
dscho added a commit to dscho/git-scm.com that referenced this issue May 21, 2025
Command options in Git's documentation are now being wrapped in HTML
tags like `<code>` for prettier formatting. Unfortunately, these tags
were being included in the generated anchor IDs, breaking links that
used to work. For example, an anchor for the
`--recurse-submodules[=<pathspec>]` option of the `clone` command would
now include 'code' in the anchor, causing existing deep links across the
web to fail to navigate to said option.

The documentation generation has been using these anchors for years and
many links using the old format (without HTML tags) are widely shared.
These links should continue to work, even as the document formatting
evolves to include more HTML.

This change modifies the anchor ID generation to strip HTML tags before
creating IDs, ensuring backward compatibility with existing references
while also supporting the new HTML-enhanced documentation format. Both
regular and localized documentation processing now handle HTML tags in
command options correctly.

These anchor IDs still include the `lt` and `gt` left-overs from the
pointy brackets that are encoded as `&lt;`/`&gt;` in HTML. This is
intentional because the existing deep links all over the web do include
those.

Technically, this here patch is not needed to fix the regression because
the preceding commit introduced code that is running on the client side
and automagically adjusts incorrect anchors in URLs when practical.

Having said that, the resulting anchors look a lot nicer, so let's do
this anyway, even if it is not strictly needed to address
git#2002.

Reported-by: Toon Claes <[email protected]>
Helped-by: Jean-Noël Avila <[email protected]>
Co-authored-by: Johannes Schindelin <[email protected]>
Signed-off-by: Johannes Schindelin <[email protected]>
@dscho
Copy link
Member

dscho commented May 21, 2025

I have tried to come up with something super simple but even a Levenshtein distance is less trivial than I would have liked...

I did find Wu's algorithm to calculate the simple Levenshtein distance with an upper bound and implemented it with some help by Copilot (my intention here is to learn how to use this tool more effectively even if my impatience won toward the end of the exercise).

With this change, I am rather happy to report that both old and new anchor work, and the site is more resilient that way to other, related changes. See for example:

  1. https://dscho.github.io/git-scm.com/docs/git-clone#Documentation/git-clone.txt---recurse-submodulesltpathspecgt

  2. https://dscho.github.io/git-scm.com/docs/git-clone#Documentation/git-clone.txt-code--recurse-submodulesltpathspecgtcode

Both links end up at the same, currently-correct anchor.

@To1ne
Copy link
Collaborator Author

To1ne commented May 21, 2025

Oh, @dscho... You're making it way too complicated.

You know what's better than an anchor? That's right, two anchors!

I've applied the patch below to #2007 and deployed it to my fork. Also both links work:

You know what's the best part: no JS required. As always, the solution is not throwing moar javascript onto the fire. 🚎

One small downside (compared to your solution), the URL in the address bar doesn't get cleaned up.

With these two types of anchors, I'd say: ship it! (pun intended) 🚢

diff --git a/script/update-docs.rb b/script/update-docs.rb
index 6d1791603..0b29ea7b1 100644
--- a/script/update-docs.rb
+++ b/script/update-docs.rb
@@ -471,10 +471,21 @@ def index_doc(filter_tags, doc_list, get_content)
           # use txt_path for backward compatibility of already-existing
           # deep links shared across the interwebs.
           anchor = "#{txt_path}-#{text}"
+
+          # some anchors used to have HTML tags in it, thus
+          # add a "dirty anchor" for backward compatibility.
+          dirty_anchor = txt_path + '-' + inner_html.tr("^A-Za-z0-9-", "")
+          if dirty_anchor != anchor
+            dirty_anchor += "-1" while ids.include?(dirty_anchor)
+            ids.add(dirty_anchor)
+            dirty_anchor_span = "<span id=\"#{dirty_anchor}\"></span>"
+          end
+
           # handle anchor collisions by appending -1
           anchor += "-1" while ids.include?(anchor)
           ids.add(anchor)
-          "<dt class=\"hdlist1\" id=\"#{anchor}\"> <a class=\"anchor\" href=\"##{anchor}\"></a>#{inner_html} </dt>"
+
+          "<dt class=\"hdlist1\" id=\"#{anchor}\">#{dirty_anchor_span}<a class=\"anchor\" href=\"##{anchor}\"></a>#{inner_html} </dt>"
         end
         # Make links relative
         html.gsub!(/(<a href=['"])\/([^'"]*)/) do |relurl|

(FYI still need to fix this for l10n)

@dscho
Copy link
Member

dscho commented May 21, 2025

You know what's better than an anchor? That's right, two anchors!

I had thought about that, but the challenge I am seeing is that we already have plenty of anchors that contain code, see e.g. https://git-scm.com/docs/git-clone#Documentation/git-clone.txt-code-lcode. I am worried about such existing anchors that embed HTML tag names (and there are a lot of them) will be broken by the upcoming style changes where the original anchor ID is recreated neither by stripping all HTML tags nor by stripping none of them.

That's why I would like to at least keep the fuzzy matching logic.

If you insist on having the two-for-one anchor logic, I would like to have that in addition to, not instead of, that fuzzy matching.

BTW in your diff, I do not see any place where dirty_anchor_span is reset to empty in the else case; I guess it is a local variable that wouldn't be initialized unless dirty_anchor != anchor and hence be equivalent to the empty string? My lack of Ruby-fu shows...

@To1ne
Copy link
Collaborator Author

To1ne commented May 21, 2025

I had thought about that, but the challenge I am seeing is that we already have plenty of anchors that contain code, see e.g. https://git-scm.com/docs/git-clone#Documentation/git-clone.txt-code-lcode.

@dscho Yes, but while it creates clutter in the DOM, I don't think there's much of an downside to having double anchors.

I am worried about such existing anchors that embed HTML tag names (and there are a lot of them) will be broken by the upcoming style changes where the original anchor ID is recreated neither by stripping all HTML tags nor by stripping none of them.

Do we have any idea what the "upcoming" changes are? In my experience I consider it impossible to predict which "upcoming" changes will be made in the future. So personally I doubt we can protect ourself from them. I rather solve those issues when they arise.

That's why I would like to at least keep the fuzzy matching logic.

I understand. But at what cost. I mean, a recent patch https://lore.kernel.org/git/20250221-pks-cat-file-object-type-filter-v1-2-0852530888e2@pks.im/ introduces --filter= to git-cat-file(1), while --filters already exists. I imagine the fuzzy anchor navigator might send users from one to the other, can we accept that?

I'm not against the fuzzy anchor navigator, I just want to make sure we're aware of the limitations. In some way you could argue it's better to be sent to --filter if you were looking for --filters. I mean, if you're sent to the first one, both will be in the scroll view, so you see what you need. But I'm afraid we cannot ensure that.

BTW in your diff, I do not see any place where dirty_anchor_span is reset to empty in the else case; I guess it is a local variable that wouldn't be initialized unless dirty_anchor != anchor and hence be equivalent to the empty string? My lack of Ruby-fu shows...

Yeah, it might not be the cleanest code, but it seems to work.


Thinking about it more thoroughly: we should add tests to the script/update-docs.rb script. We already collect all ids in ids, we should store that somewhere and compare that list every time we update the docs. This helps us discover a change before it lands on the website.

@dscho
Copy link
Member

dscho commented May 21, 2025

Do we have any idea what the "upcoming" changes are?

@jnavila talked about adding <placeholder> tags.

I rather solve those issues when they arise.

Good luck with that, the issues' root cause lies in the Git project not exactly accommodating for how its manual pages are rendered on git-scm.com, despite git-scm.com officially being the "Git homepage".

a recent patch https://lore.kernel.org/git/20250221-pks-cat-file-object-type-filter-v1-2-0852530888e2@pks.im/ introduces --filter= to git-cat-file(1), while --filters already exists. I imagine the fuzzy anchor navigator might send users from one to the other, can we accept that?

A best effort is a best effort. Sometimes it misses, but at least it tried.

That's why I would like to at least keep the fuzzy matching logic.

I understand. But at what cost.

The cost is negligible: If the URL contains no anchor, or if the anchor is found, the performance hit is minimal because the early return is hit.

Thinking about it more thoroughly: we should add tests to the script/update-docs.rb script. We already collect all ids in ids, we should store that somewhere and compare that list every time we update the docs. This helps us discover a change before it lands on the website.

Unfortunately, such changes already happened, and I am not sure that we can reasonably keep track of all those changes. There are a lot of anchor IDs. A lot. As I pointed out earlier, even only looking at the ones containing code turns up over 16k matches.

And what to do if an ID truly goes away? Inspect it manually? Or risk an automated way that also could result in false positives?

I do think that #2007 strikes a good balance between complexity and benefit. I am not opposed to add the "two-for-one" approach on top, but I do want to keep that "Postel-izer".

@To1ne
Copy link
Collaborator Author

To1ne commented May 21, 2025

I do think that #2007 strikes a good balance between complexity and benefit. I am not opposed to add the "two-for-one" approach on top, but I do want to keep that "Postel-izer".

@dscho works for me. As you say, broken anchors aren’t the end of the world. I would look different at it if it would cause 404s.

dscho added a commit to dscho/git-scm.com that referenced this issue May 23, 2025
The Git project is in the process of changing the formatting of the
manual pages. Unfortunately this leads to changed (read: backwards
incompatible) anchors.

For example, the following link used to work:

  https://git-scm.com/docs/git-clone#Documentation/git-clone.txt---recurse-submodulesltpathspecgt

Now it no longer jumps to the `--recurse-submodules` description. The
reason is that the anchor changed from:

  #Documentation/git-clone.txt---recurse-submodulesltpathspecgt

to:

  #Documentation/git-clone.txt-code--recurse-submodulesltpathspecgtcode

I was well on my way to change the code to simply strip out the `<code>`
and `</code>` text before generating the anchor in `update-docs.rb`, but
alas... it would break e.g. #Documentation/git-clone.txt-code-lcode,
which is a valid anchor and has been since forever.

Nevertheless, it would be nice if that (quite ugly) anchor was turned
into #Documentation/git-clone.txt-l instead.

Let's bite the bullet and add a small JavaScript snippet that detects
when the URL contains an anchor that is not actually to be found on the
current page, and then look for an existing anchor that is most similar
to that, then use that (if found).

This is not a perfect solution by any stretch of imagination, but it
should definitely help, and it makes a future changes in the way anchors
are generated a lot less controversial.

Fixes git#2002

Signed-off-by: Johannes Schindelin <[email protected]>
dscho added a commit to dscho/git-scm.com that referenced this issue May 23, 2025
Command options in Git's documentation are now being wrapped in HTML
tags like `<code>` for prettier formatting. Unfortunately, these tags
were being included in the generated anchor IDs, breaking links that
used to work. For example, an anchor for the
`--recurse-submodules[=<pathspec>]` option of the `clone` command would
now include 'code' in the anchor, causing existing deep links across the
web to fail to navigate to said option.

The documentation generation has been using these anchors for years and
many links using the old format (without HTML tags) are widely shared.
These links should continue to work, even as the document formatting
evolves to include more HTML.

This change modifies the anchor ID generation to strip HTML tags before
creating IDs, ensuring backward compatibility with existing references
while also supporting the new HTML-enhanced documentation format. Both
regular and localized documentation processing now handle HTML tags in
command options correctly.

These anchor IDs still include the `lt` and `gt` left-overs from the
pointy brackets that are encoded as `&lt;`/`&gt;` in HTML. This is
intentional because the existing deep links all over the web do include
those.

Technically, this here patch is not needed to fix the regression because
the preceding commit introduced code that is running on the client side
and automagically adjusts incorrect anchors in URLs when practical.

Having said that, the resulting anchors look a lot nicer, so let's do
this anyway, even if it is not strictly needed to address
git#2002.

Reported-by: Toon Claes <[email protected]>
Helped-by: Jean-Noël Avila <[email protected]>
Co-authored-by: Johannes Schindelin <[email protected]>
Signed-off-by: Johannes Schindelin <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants