Anchor links changed? #2002

To1ne · 2025-05-07T08:15:20Z

I ran across a link on the interwebs:

https://git-scm.com/docs/git-clone#Documentation/git-clone.txt---recurse-submodulesltpathspecgt

and I noticed it's not anchoring to the correct piece of the page. The nowadays seems to be:

https://git-scm.com/docs/git-clone#Documentation/git-clone.txt-code--recurse-submodulesltpathspecgtcode

What changed the code to be added around the anchor link? I'd prefer to not have this behavior broken.

The text was updated successfully, but these errors were encountered:

To1ne · 2025-05-07T08:16:03Z

I might look into this myself later, but I had to write this down and do something else first.

dscho · 2025-05-07T08:35:21Z

Oh no! I highly suspect the culprit to be related to the recent style changes (@jnavila does this sound plausible to you?)

To1ne · 2025-05-07T11:36:25Z

Oh no! I highly suspect the culprit to be related to the recent style changes

I hope it's recent and not broken since the Rails to Hugo transition, otherwise I expect both types of permalinks to be around on the interwebs. So that means we'd have to figure out a way to support both.

@dscho Shall/Can we add test coverage for these permalinks?

jnavila · 2025-05-07T12:24:51Z

Oh no! I highly suspect the culprit to be related to the recent style changes (@jnavila does this sound plausible to you?)

I suspect so. All the command options are now formatted as <code>, and it seems this bleeds into the automatic anchor generation... Quite an inexpected side-effect. We are never sure that the machine generated anchor name is reproducible.

jnavila · 2025-05-07T12:28:26Z

the generative part is here:

git-scm.com/script/update-docs.rb

Lines 175 to 182 in 2e65651

    
                # HTML anchor on hdlist1 (i.e. command options) 
        
                html.gsub!(/<dt class="hdlist1">(.*?)<\/dt>/) do |_m| 
        
                  text = $1.tr("^A-Za-z0-9-", "") 
        
                  # use txt_path for backward compatibility of already-existing 
        
           # deep links shared across the interwebs. 
        
           anchor = "#{txt_path}-#{text}" 
        
                  # handle anchor collisions by appending -1 
        
                  anchor += "-1" while ids.include?(anchor)

And indeed it works on final html and does not select the "text" part of the html node.

jnavila · 2025-05-07T12:48:29Z

We may have similar issues where I added "<placehoder>" marks. And it will have even a larger change effect after applying the new style formatting of #1921

dscho · 2025-05-07T16:46:55Z

I hope it's recent and not broken since the Rails to Hugo transition

It is recent: https://web.archive.org/web/20250113050613/https://git-scm.com/docs/git-clone#Documentation/git-clone.txt-code-lcode has the old-style links, and is well after the transition.

Shall/Can we add test coverage for these permalinks?

Absolutely! It's relatively easy in the Playwright tests, too, all you do is to call await page.goto(...) with, say, the git-clone page and then verify the link, like this:

git-scm.com/tests/git-scm.spec.js

Lines 203 to 205 in 2e65651

    
           // the repository URL is correct 
        
           await expect(page.getByRole('link', { name: 'hosted on GitHub' })) 
        
             .toHaveAttribute('href', 'https://github.com/progit/progit2')

You probably need to use a different way to locate the link, as its text is not "--local", instead the element following the link has this text, possibly something like this:

// The `<a>` element of class `anchor` directly preceding the `<code>` element whose text is "--local"
const anchor = page.locator('xpath=//code[text()="--local"]/preceding-sibling::a[@class="anchor"][1]')
await expect(anchor).toHaveAttribute('href', /#Documentation\/git-clone\.txt-l$/)

My XPath is not very good, so I'll refrain from refining it: It should probably catch instances where the "--local" text is not enclosed in a tag, too.

it works on final html and does not select the "text" part of the html node.

Yes, and that's most likely really wrong. I guess the easiest way to fix this would not be to switch from html to text, but to do something like this:

-        text = $1.tr("^A-Za-z0-9-", "")
+        text = $1.gsub(/<[^>]*>/, "").tr("^A-Za-z0-9-", "")

This would also proactively address the <placeholder> issue, at the expense of breaking potentially a lot of those <tt> links that are already in wide use.

I guess we could also do this instead:

-        text = $1.tr("^A-Za-z0-9-", "")
+        text = $1.gsub(/<\/?(?:code|placeholder)>/, "").tr("^A-Za-z0-9-", "")

Thoughts?

jnavila · 2025-05-07T19:55:54Z

I was more thinking of settling on a true html parsing such as:

html.gsub!(/<dt class="hdlist1">(.*?)<\/dt>/) do |m| 
  sp = Nokogiri::HTML.parse(m)
  sp.text.tr("^A-Za-z0-9-", "")
end

Even with HTML5 now, we are not limited in the form of ids and do not need the last tr command.

dscho · 2025-05-08T06:47:23Z

@jnavila we're still limited by backwards-compatibility: Previously-working links using anchors should continue to work. That's why I would highly suggest to stick with parsing text rather than html, and to retain the tr() call.

jnavila · 2025-05-19T08:54:05Z

Managing old style anchors does not mean that we still should publish them.

Command options in Git's documentation are now being wrapped in HTML tags like `<code>` for prettier formatting. Unfortunately, these tags were being included in the generated anchor IDs, breaking links that used to work. For example, an anchor for the `--recurse-submodules[=<pathspec>]` option of the `clone` command would now include 'code' in the anchor, causing existing deep links across the web to fail to navigate to said option. The documentation generation has been using these anchors for years and many links using the old format (without HTML tags) are widely shared. These links should continue to work, even as the document formatting evolves to include more HTML. This change modifies the anchor ID generation to strip HTML tags before creating IDs, ensuring backward compatibility with existing references while also supporting the new HTML-enhanced documentation format. Both regular and localized documentation processing now handle HTML tags in command options correctly. These anchor IDs still include the `lt` and `gt` left-overs from the pointy brackets that are encoded as `<`/`>` in HTML. This is intentional because the existing deep links all over the web do include those. Fixes git#2002 Reported-by: Toon Claes <[email protected]> Helped-by: Jean-Noël Avila <[email protected]> Co-authored-by: Johannes Schindelin <[email protected]> Signed-off-by: Johannes Schindelin <[email protected]>

dscho · 2025-05-20T20:16:58Z

Hrm. We might have a problem here, one that is not that easy to solve:

git grep 'href=[^>]*\.txt[^>]*code' bdcfcc29f7e775b1cea403f5e6ea101685c79ab4 -- external/docs/content/

shows 16,722 matches. bdcfcc2 was the huge commit that introduced the initial manual pages' HTML for consumption with Hugo.

These anchor IDs would all be broken by #2007 in the current form (using essentially this suggestion to address the problem reported in this here ticket, by removing the <code> and </code> substrings before generating the anchor ID). I.e. it would worsimprove the situation.

For example, https://git-scm.com/docs/git-clone#Documentation/git-clone.txt-code-lcode (notice how the anchor contains code twice?) is totally what was even on the first public Hugo built site. Even the version before the switch to Hugo had that anchor, according to the Wayback machine: https://web.archive.org/web/20240921004857/https://git-scm.com/docs/git-clone#Documentation/git-clone.txt-code-lcode (the last recorded version before the switch to Hugo).

Therefore, a wholesale removal of those <code> parts is inadvisable :-( Unless, that is, we can come up with a smarter idea that lets original anchors work somehow while still transforming the anchors to a nicer format.

My current thinking is that we could potentially detect in JavaScript when an anchor is specified in the current URL (window.location.hash !== '') that does not match any element (document.querySelector(CSS.escape(window.location.hash).slice(1)) === null). If that is the case, that JavaScript code could search for the closest match among the existing anchor IDs ([...document.querySelectorAll('[id]')].map(e => e.id)), and then assign that to window.location.hash. I have tried to come up with something super simple but even a Levenshtein distance is less trivial than I would have liked...

The Git project is in the process of changing the formatting of the manual pages. Unfortunately this leads to changed (read: backwards incompatible) anchors. For example, the following link used to work: https://git-scm.com/docs/git-clone#Documentation/git-clone.txt---recurse-submodulesltpathspecgt Now it no longer jumps to the `--recurse-submodules` description. The reason is that the anchor changed from: #Documentation/git-clone.txt---recurse-submodulesltpathspecgt to: #Documentation/git-clone.txt-code--recurse-submodulesltpathspecgtcode I was well on my way to change the code to simply strip out the `<code>` and `</code>` text before generating the anchor in `update-docs.rb`, but alas... it would break e.g. #Documentation/git-clone.txt-code-lcode, which is a valid anchor and has been since forever. Nevertheless, it would be nice if that (quite ugly) anchor was turned into #Documentation/git-clone.txt-l instead. Let's bite the bullet and add a small JavaScript snippet that detects when the URL contains an anchor that is not actually to be found on the current page, and then look for an existing anchor that is most similar to that, then use that (if found). This is not a perfect solution by any stretch of imagination, but it should definitely help, and it makes a future changes in the way anchors are generated a lot less controversial. Fixes git#2002 Signed-off-by: Johannes Schindelin <[email protected]>

Command options in Git's documentation are now being wrapped in HTML tags like `<code>` for prettier formatting. Unfortunately, these tags were being included in the generated anchor IDs, breaking links that used to work. For example, an anchor for the `--recurse-submodules[=<pathspec>]` option of the `clone` command would now include 'code' in the anchor, causing existing deep links across the web to fail to navigate to said option. The documentation generation has been using these anchors for years and many links using the old format (without HTML tags) are widely shared. These links should continue to work, even as the document formatting evolves to include more HTML. This change modifies the anchor ID generation to strip HTML tags before creating IDs, ensuring backward compatibility with existing references while also supporting the new HTML-enhanced documentation format. Both regular and localized documentation processing now handle HTML tags in command options correctly. These anchor IDs still include the `lt` and `gt` left-overs from the pointy brackets that are encoded as `<`/`>` in HTML. This is intentional because the existing deep links all over the web do include those. Technically, this here patch is not needed to fix the regression because the preceding commit introduced code that is running on the client side and automagically adjusts incorrect anchors in URLs when practical. Having said that, the resulting anchors look a lot nicer, so let's do this anyway, even if it is not strictly needed to address git#2002. Reported-by: Toon Claes <[email protected]> Helped-by: Jean-Noël Avila <[email protected]> Co-authored-by: Johannes Schindelin <[email protected]> Signed-off-by: Johannes Schindelin <[email protected]>

dscho · 2025-05-21T08:13:25Z

I have tried to come up with something super simple but even a Levenshtein distance is less trivial than I would have liked...

I did find Wu's algorithm to calculate the simple Levenshtein distance with an upper bound and implemented it with some help by Copilot (my intention here is to learn how to use this tool more effectively even if my impatience won toward the end of the exercise).

With this change, I am rather happy to report that both old and new anchor work, and the site is more resilient that way to other, related changes. See for example:

Both links end up at the same, currently-correct anchor.

To1ne · 2025-05-21T08:45:42Z

Oh, @dscho... You're making it way too complicated.

You know what's better than an anchor? That's right, two anchors!

I've applied the patch below to #2007 and deployed it to my fork. Also both links work:

You know what's the best part: no JS required. As always, the solution is not throwing moar javascript onto the fire. 🚎

One small downside (compared to your solution), the URL in the address bar doesn't get cleaned up.

With these two types of anchors, I'd say: ship it! (pun intended) 🚢

diff --git a/script/update-docs.rb b/script/update-docs.rb
index 6d1791603..0b29ea7b1 100644
--- a/script/update-docs.rb
+++ b/script/update-docs.rb
@@ -471,10 +471,21 @@ def index_doc(filter_tags, doc_list, get_content)
           # use txt_path for backward compatibility of already-existing
           # deep links shared across the interwebs.
           anchor = "#{txt_path}-#{text}"
+
+          # some anchors used to have HTML tags in it, thus
+          # add a "dirty anchor" for backward compatibility.
+          dirty_anchor = txt_path + '-' + inner_html.tr("^A-Za-z0-9-", "")
+          if dirty_anchor != anchor
+            dirty_anchor += "-1" while ids.include?(dirty_anchor)
+            ids.add(dirty_anchor)
+            dirty_anchor_span = "<span id=\"#{dirty_anchor}\"></span>"
+          end
+
           # handle anchor collisions by appending -1
           anchor += "-1" while ids.include?(anchor)
           ids.add(anchor)
-          "<dt class=\"hdlist1\" id=\"#{anchor}\"> <a class=\"anchor\" href=\"##{anchor}\"></a>#{inner_html} </dt>"
+
+          "<dt class=\"hdlist1\" id=\"#{anchor}\">#{dirty_anchor_span}<a class=\"anchor\" href=\"##{anchor}\"></a>#{inner_html} </dt>"
         end
         # Make links relative
         html.gsub!(/(<a href=['"])\/([^'"]*)/) do |relurl|

(FYI still need to fix this for l10n)

dscho · 2025-05-21T09:26:46Z

You know what's better than an anchor? That's right, two anchors!

I had thought about that, but the challenge I am seeing is that we already have plenty of anchors that contain code, see e.g. https://git-scm.com/docs/git-clone#Documentation/git-clone.txt-code-lcode. I am worried about such existing anchors that embed HTML tag names (and there are a lot of them) will be broken by the upcoming style changes where the original anchor ID is recreated neither by stripping all HTML tags nor by stripping none of them.

That's why I would like to at least keep the fuzzy matching logic.

If you insist on having the two-for-one anchor logic, I would like to have that in addition to, not instead of, that fuzzy matching.

BTW in your diff, I do not see any place where dirty_anchor_span is reset to empty in the else case; I guess it is a local variable that wouldn't be initialized unless dirty_anchor != anchor and hence be equivalent to the empty string? My lack of Ruby-fu shows...

To1ne · 2025-05-21T12:01:22Z

I had thought about that, but the challenge I am seeing is that we already have plenty of anchors that contain code, see e.g. https://git-scm.com/docs/git-clone#Documentation/git-clone.txt-code-lcode.

@dscho Yes, but while it creates clutter in the DOM, I don't think there's much of an downside to having double anchors.

I am worried about such existing anchors that embed HTML tag names (and there are a lot of them) will be broken by the upcoming style changes where the original anchor ID is recreated neither by stripping all HTML tags nor by stripping none of them.

Do we have any idea what the "upcoming" changes are? In my experience I consider it impossible to predict which "upcoming" changes will be made in the future. So personally I doubt we can protect ourself from them. I rather solve those issues when they arise.

That's why I would like to at least keep the fuzzy matching logic.

I understand. But at what cost. I mean, a recent patch https://lore.kernel.org/git/20250221-pks-cat-file-object-type-filter-v1-2-0852530888e2@pks.im/ introduces --filter= to git-cat-file(1), while --filters already exists. I imagine the fuzzy anchor navigator might send users from one to the other, can we accept that?

I'm not against the fuzzy anchor navigator, I just want to make sure we're aware of the limitations. In some way you could argue it's better to be sent to --filter if you were looking for --filters. I mean, if you're sent to the first one, both will be in the scroll view, so you see what you need. But I'm afraid we cannot ensure that.

BTW in your diff, I do not see any place where dirty_anchor_span is reset to empty in the else case; I guess it is a local variable that wouldn't be initialized unless dirty_anchor != anchor and hence be equivalent to the empty string? My lack of Ruby-fu shows...

Yeah, it might not be the cleanest code, but it seems to work.

Thinking about it more thoroughly: we should add tests to the script/update-docs.rb script. We already collect all ids in ids, we should store that somewhere and compare that list every time we update the docs. This helps us discover a change before it lands on the website.

dscho · 2025-05-21T13:49:52Z

Do we have any idea what the "upcoming" changes are?

@jnavila talked about adding <placeholder> tags.

I rather solve those issues when they arise.

Good luck with that, the issues' root cause lies in the Git project not exactly accommodating for how its manual pages are rendered on git-scm.com, despite git-scm.com officially being the "Git homepage".

a recent patch https://lore.kernel.org/git/20250221-pks-cat-file-object-type-filter-v1-2-0852530888e2@pks.im/ introduces --filter= to git-cat-file(1), while --filters already exists. I imagine the fuzzy anchor navigator might send users from one to the other, can we accept that?

A best effort is a best effort. Sometimes it misses, but at least it tried.

That's why I would like to at least keep the fuzzy matching logic.

I understand. But at what cost.

The cost is negligible: If the URL contains no anchor, or if the anchor is found, the performance hit is minimal because the early return is hit.

Thinking about it more thoroughly: we should add tests to the script/update-docs.rb script. We already collect all ids in ids, we should store that somewhere and compare that list every time we update the docs. This helps us discover a change before it lands on the website.

Unfortunately, such changes already happened, and I am not sure that we can reasonably keep track of all those changes. There are a lot of anchor IDs. A lot. As I pointed out earlier, even only looking at the ones containing code turns up over 16k matches.

And what to do if an ID truly goes away? Inspect it manually? Or risk an automated way that also could result in false positives?

I do think that #2007 strikes a good balance between complexity and benefit. I am not opposed to add the "two-for-one" approach on top, but I do want to keep that "Postel-izer".

To1ne · 2025-05-21T15:01:46Z

I do think that #2007 strikes a good balance between complexity and benefit. I am not opposed to add the "two-for-one" approach on top, but I do want to keep that "Postel-izer".

@dscho works for me. As you say, broken anchors aren’t the end of the world. I would look different at it if it would cause 404s.

The Git project is in the process of changing the formatting of the manual pages. Unfortunately this leads to changed (read: backwards incompatible) anchors. For example, the following link used to work: https://git-scm.com/docs/git-clone#Documentation/git-clone.txt---recurse-submodulesltpathspecgt Now it no longer jumps to the `--recurse-submodules` description. The reason is that the anchor changed from: #Documentation/git-clone.txt---recurse-submodulesltpathspecgt to: #Documentation/git-clone.txt-code--recurse-submodulesltpathspecgtcode I was well on my way to change the code to simply strip out the `<code>` and `</code>` text before generating the anchor in `update-docs.rb`, but alas... it would break e.g. #Documentation/git-clone.txt-code-lcode, which is a valid anchor and has been since forever. Nevertheless, it would be nice if that (quite ugly) anchor was turned into #Documentation/git-clone.txt-l instead. Let's bite the bullet and add a small JavaScript snippet that detects when the URL contains an anchor that is not actually to be found on the current page, and then look for an existing anchor that is most similar to that, then use that (if found). This is not a perfect solution by any stretch of imagination, but it should definitely help, and it makes a future changes in the way anchors are generated a lot less controversial. Fixes git#2002 Signed-off-by: Johannes Schindelin <[email protected]>

Command options in Git's documentation are now being wrapped in HTML tags like `<code>` for prettier formatting. Unfortunately, these tags were being included in the generated anchor IDs, breaking links that used to work. For example, an anchor for the `--recurse-submodules[=<pathspec>]` option of the `clone` command would now include 'code' in the anchor, causing existing deep links across the web to fail to navigate to said option. The documentation generation has been using these anchors for years and many links using the old format (without HTML tags) are widely shared. These links should continue to work, even as the document formatting evolves to include more HTML. This change modifies the anchor ID generation to strip HTML tags before creating IDs, ensuring backward compatibility with existing references while also supporting the new HTML-enhanced documentation format. Both regular and localized documentation processing now handle HTML tags in command options correctly. These anchor IDs still include the `lt` and `gt` left-overs from the pointy brackets that are encoded as `<`/`>` in HTML. This is intentional because the existing deep links all over the web do include those. Technically, this here patch is not needed to fix the regression because the preceding commit introduced code that is running on the client side and automagically adjusts incorrect anchors in URLs when practical. Having said that, the resulting anchors look a lot nicer, so let's do this anyway, even if it is not strictly needed to address git#2002. Reported-by: Toon Claes <[email protected]> Helped-by: Jean-Noël Avila <[email protected]> Co-authored-by: Johannes Schindelin <[email protected]> Signed-off-by: Johannes Schindelin <[email protected]>

dscho mentioned this issue May 13, 2025

New manpage format #1921

Open

dscho assigned Copilot May 20, 2025

Copilot AI mentioned this issue May 20, 2025

Make anchor links in Git's documentation backwards-compatible again #2007

Merged

dscho closed this as completed in #2007 May 23, 2025

Anchor links changed? #2002

Anchor links changed? #2002

Comments

To1ne commented May 7, 2025

To1ne commented May 7, 2025

Uh oh!

dscho commented May 7, 2025

Uh oh!

To1ne commented May 7, 2025

Uh oh!

jnavila commented May 7, 2025

Uh oh!

jnavila commented May 7, 2025

Uh oh!

jnavila commented May 7, 2025

Uh oh!

dscho commented May 7, 2025

Uh oh!

jnavila commented May 7, 2025

Uh oh!

dscho commented May 8, 2025

Uh oh!

jnavila commented May 19, 2025

Uh oh!

dscho commented May 20, 2025

Uh oh!

dscho commented May 21, 2025

Uh oh!

To1ne commented May 21, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

dscho commented May 21, 2025

Uh oh!

To1ne commented May 21, 2025

Uh oh!

dscho commented May 21, 2025

Uh oh!

To1ne commented May 21, 2025

Uh oh!

To1ne commented May 21, 2025 •

edited

Loading