-
Notifications
You must be signed in to change notification settings - Fork 1.3k
Anchor links changed? #2002
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Comments
I might look into this myself later, but I had to write this down and do something else first. |
Oh no! I highly suspect the culprit to be related to the recent style changes (@jnavila does this sound plausible to you?) |
I hope it's recent and not broken since the Rails to Hugo transition, otherwise I expect both types of permalinks to be around on the interwebs. So that means we'd have to figure out a way to support both. @dscho Shall/Can we add test coverage for these permalinks? |
I suspect so. All the command options are now formatted as <code>, and it seems this bleeds into the automatic anchor generation... Quite an inexpected side-effect. We are never sure that the machine generated anchor name is reproducible. |
the generative part is here: git-scm.com/script/update-docs.rb Lines 175 to 182 in 2e65651
And indeed it works on final html and does not select the "text" part of the html node. |
We may have similar issues where I added "<placehoder>" marks. And it will have even a larger change effect after applying the new style formatting of #1921 |
It is recent: https://web.archive.org/web/20250113050613/https://git-scm.com/docs/git-clone#Documentation/git-clone.txt-code-lcode has the old-style links, and is well after the transition.
Absolutely! It's relatively easy in the Playwright tests, too, all you do is to call git-scm.com/tests/git-scm.spec.js Lines 203 to 205 in 2e65651
You probably need to use a different way to locate the link, as its text is not "--local", instead the element following the link has this text, possibly something like this: // The `<a>` element of class `anchor` directly preceding the `<code>` element whose text is "--local"
const anchor = page.locator('xpath=//code[text()="--local"]/preceding-sibling::a[@class="anchor"][1]')
await expect(anchor).toHaveAttribute('href', /#Documentation\/git-clone\.txt-l$/) My XPath is not very good, so I'll refrain from refining it: It should probably catch instances where the "--local" text is not enclosed in a tag, too.
Yes, and that's most likely really wrong. I guess the easiest way to fix this would not be to switch from html to text, but to do something like this: - text = $1.tr("^A-Za-z0-9-", "")
+ text = $1.gsub(/<[^>]*>/, "").tr("^A-Za-z0-9-", "") This would also proactively address the I guess we could also do this instead: - text = $1.tr("^A-Za-z0-9-", "")
+ text = $1.gsub(/<\/?(?:code|placeholder)>/, "").tr("^A-Za-z0-9-", "") Thoughts? |
I was more thinking of settling on a true html parsing such as: html.gsub!(/<dt class="hdlist1">(.*?)<\/dt>/) do |m|
sp = Nokogiri::HTML.parse(m)
sp.text.tr("^A-Za-z0-9-", "")
end Even with HTML5 now, we are not limited in the form of ids and do not need the last |
@jnavila we're still limited by backwards-compatibility: Previously-working links using anchors should continue to work. That's why I would highly suggest to stick with parsing text rather than html, and to retain the |
Managing old style anchors does not mean that we still should publish them. |
Command options in Git's documentation are now being wrapped in HTML tags like `<code>` for prettier formatting. Unfortunately, these tags were being included in the generated anchor IDs, breaking links that used to work. For example, an anchor for the `--recurse-submodules[=<pathspec>]` option of the `clone` command would now include 'code' in the anchor, causing existing deep links across the web to fail to navigate to said option. The documentation generation has been using these anchors for years and many links using the old format (without HTML tags) are widely shared. These links should continue to work, even as the document formatting evolves to include more HTML. This change modifies the anchor ID generation to strip HTML tags before creating IDs, ensuring backward compatibility with existing references while also supporting the new HTML-enhanced documentation format. Both regular and localized documentation processing now handle HTML tags in command options correctly. These anchor IDs still include the `lt` and `gt` left-overs from the pointy brackets that are encoded as `<`/`>` in HTML. This is intentional because the existing deep links all over the web do include those. Fixes git#2002 Reported-by: Toon Claes <[email protected]> Helped-by: Jean-Noël Avila <[email protected]> Co-authored-by: Johannes Schindelin <[email protected]> Signed-off-by: Johannes Schindelin <[email protected]>
Hrm. We might have a problem here, one that is not that easy to solve: git grep 'href=[^>]*\.txt[^>]*code' bdcfcc29f7e775b1cea403f5e6ea101685c79ab4 -- external/docs/content/ shows 16,722 matches. bdcfcc2 was the huge commit that introduced the initial manual pages' HTML for consumption with Hugo. These anchor IDs would all be broken by #2007 in the current form (using essentially this suggestion to address the problem reported in this here ticket, by removing the For example, https://git-scm.com/docs/git-clone#Documentation/git-clone.txt-code-lcode (notice how the anchor contains Therefore, a wholesale removal of those My current thinking is that we could potentially detect in JavaScript when an anchor is specified in the current URL ( |
The Git project is in the process of changing the formatting of the manual pages. Unfortunately this leads to changed (read: backwards incompatible) anchors. For example, the following link used to work: https://git-scm.com/docs/git-clone#Documentation/git-clone.txt---recurse-submodulesltpathspecgt Now it no longer jumps to the `--recurse-submodules` description. The reason is that the anchor changed from: #Documentation/git-clone.txt---recurse-submodulesltpathspecgt to: #Documentation/git-clone.txt-code--recurse-submodulesltpathspecgtcode I was well on my way to change the code to simply strip out the `<code>` and `</code>` text before generating the anchor in `update-docs.rb`, but alas... it would break e.g. #Documentation/git-clone.txt-code-lcode, which is a valid anchor and has been since forever. Nevertheless, it would be nice if that (quite ugly) anchor was turned into #Documentation/git-clone.txt-l instead. Let's bite the bullet and add a small JavaScript snippet that detects when the URL contains an anchor that is not actually to be found on the current page, and then look for an existing anchor that is most similar to that, then use that (if found). This is not a perfect solution by any stretch of imagination, but it should definitely help, and it makes a future changes in the way anchors are generated a lot less controversial. Fixes git#2002 Signed-off-by: Johannes Schindelin <[email protected]>
Command options in Git's documentation are now being wrapped in HTML tags like `<code>` for prettier formatting. Unfortunately, these tags were being included in the generated anchor IDs, breaking links that used to work. For example, an anchor for the `--recurse-submodules[=<pathspec>]` option of the `clone` command would now include 'code' in the anchor, causing existing deep links across the web to fail to navigate to said option. The documentation generation has been using these anchors for years and many links using the old format (without HTML tags) are widely shared. These links should continue to work, even as the document formatting evolves to include more HTML. This change modifies the anchor ID generation to strip HTML tags before creating IDs, ensuring backward compatibility with existing references while also supporting the new HTML-enhanced documentation format. Both regular and localized documentation processing now handle HTML tags in command options correctly. These anchor IDs still include the `lt` and `gt` left-overs from the pointy brackets that are encoded as `<`/`>` in HTML. This is intentional because the existing deep links all over the web do include those. Technically, this here patch is not needed to fix the regression because the preceding commit introduced code that is running on the client side and automagically adjusts incorrect anchors in URLs when practical. Having said that, the resulting anchors look a lot nicer, so let's do this anyway, even if it is not strictly needed to address git#2002. Reported-by: Toon Claes <[email protected]> Helped-by: Jean-Noël Avila <[email protected]> Co-authored-by: Johannes Schindelin <[email protected]> Signed-off-by: Johannes Schindelin <[email protected]>
I did find Wu's algorithm to calculate the simple Levenshtein distance with an upper bound and implemented it with some help by Copilot (my intention here is to learn how to use this tool more effectively even if my impatience won toward the end of the exercise). With this change, I am rather happy to report that both old and new anchor work, and the site is more resilient that way to other, related changes. See for example: Both links end up at the same, currently-correct anchor. |
Oh, @dscho... You're making it way too complicated. You know what's better than an anchor? That's right, two anchors! I've applied the patch below to #2007 and deployed it to my fork. Also both links work:
You know what's the best part: no JS required. As always, the solution is not throwing moar javascript onto the fire. 🚎 One small downside (compared to your solution), the URL in the address bar doesn't get cleaned up. With these two types of anchors, I'd say: ship it! (pun intended) 🚢 diff --git a/script/update-docs.rb b/script/update-docs.rb
index 6d1791603..0b29ea7b1 100644
--- a/script/update-docs.rb
+++ b/script/update-docs.rb
@@ -471,10 +471,21 @@ def index_doc(filter_tags, doc_list, get_content)
# use txt_path for backward compatibility of already-existing
# deep links shared across the interwebs.
anchor = "#{txt_path}-#{text}"
+
+ # some anchors used to have HTML tags in it, thus
+ # add a "dirty anchor" for backward compatibility.
+ dirty_anchor = txt_path + '-' + inner_html.tr("^A-Za-z0-9-", "")
+ if dirty_anchor != anchor
+ dirty_anchor += "-1" while ids.include?(dirty_anchor)
+ ids.add(dirty_anchor)
+ dirty_anchor_span = "<span id=\"#{dirty_anchor}\"></span>"
+ end
+
# handle anchor collisions by appending -1
anchor += "-1" while ids.include?(anchor)
ids.add(anchor)
- "<dt class=\"hdlist1\" id=\"#{anchor}\"> <a class=\"anchor\" href=\"##{anchor}\"></a>#{inner_html} </dt>"
+
+ "<dt class=\"hdlist1\" id=\"#{anchor}\">#{dirty_anchor_span}<a class=\"anchor\" href=\"##{anchor}\"></a>#{inner_html} </dt>"
end
# Make links relative
html.gsub!(/(<a href=['"])\/([^'"]*)/) do |relurl| (FYI still need to fix this for |
I had thought about that, but the challenge I am seeing is that we already have plenty of anchors that contain That's why I would like to at least keep the fuzzy matching logic. If you insist on having the two-for-one anchor logic, I would like to have that in addition to, not instead of, that fuzzy matching. BTW in your |
@dscho Yes, but while it creates clutter in the DOM, I don't think there's much of an downside to having double anchors.
Do we have any idea what the "upcoming" changes are? In my experience I consider it impossible to predict which "upcoming" changes will be made in the future. So personally I doubt we can protect ourself from them. I rather solve those issues when they arise.
I understand. But at what cost. I mean, a recent patch https://lore.kernel.org/git/20250221-pks-cat-file-object-type-filter-v1-2-0852530888e2@pks.im/ introduces I'm not against the fuzzy anchor navigator, I just want to make sure we're aware of the limitations. In some way you could argue it's better to be sent to
Yeah, it might not be the cleanest code, but it seems to work. Thinking about it more thoroughly: we should add tests to the |
@jnavila talked about adding
Good luck with that, the issues' root cause lies in the Git project not exactly accommodating for how its manual pages are rendered on git-scm.com, despite git-scm.com officially being the "Git homepage".
A best effort is a best effort. Sometimes it misses, but at least it tried.
The cost is negligible: If the URL contains no anchor, or if the anchor is found, the performance hit is minimal because the early
Unfortunately, such changes already happened, and I am not sure that we can reasonably keep track of all those changes. There are a lot of anchor IDs. A lot. As I pointed out earlier, even only looking at the ones containing And what to do if an ID truly goes away? Inspect it manually? Or risk an automated way that also could result in false positives? I do think that #2007 strikes a good balance between complexity and benefit. I am not opposed to add the "two-for-one" approach on top, but I do want to keep that "Postel-izer". |
@dscho works for me. As you say, broken anchors aren’t the end of the world. I would look different at it if it would cause 404s. |
The Git project is in the process of changing the formatting of the manual pages. Unfortunately this leads to changed (read: backwards incompatible) anchors. For example, the following link used to work: https://git-scm.com/docs/git-clone#Documentation/git-clone.txt---recurse-submodulesltpathspecgt Now it no longer jumps to the `--recurse-submodules` description. The reason is that the anchor changed from: #Documentation/git-clone.txt---recurse-submodulesltpathspecgt to: #Documentation/git-clone.txt-code--recurse-submodulesltpathspecgtcode I was well on my way to change the code to simply strip out the `<code>` and `</code>` text before generating the anchor in `update-docs.rb`, but alas... it would break e.g. #Documentation/git-clone.txt-code-lcode, which is a valid anchor and has been since forever. Nevertheless, it would be nice if that (quite ugly) anchor was turned into #Documentation/git-clone.txt-l instead. Let's bite the bullet and add a small JavaScript snippet that detects when the URL contains an anchor that is not actually to be found on the current page, and then look for an existing anchor that is most similar to that, then use that (if found). This is not a perfect solution by any stretch of imagination, but it should definitely help, and it makes a future changes in the way anchors are generated a lot less controversial. Fixes git#2002 Signed-off-by: Johannes Schindelin <[email protected]>
Command options in Git's documentation are now being wrapped in HTML tags like `<code>` for prettier formatting. Unfortunately, these tags were being included in the generated anchor IDs, breaking links that used to work. For example, an anchor for the `--recurse-submodules[=<pathspec>]` option of the `clone` command would now include 'code' in the anchor, causing existing deep links across the web to fail to navigate to said option. The documentation generation has been using these anchors for years and many links using the old format (without HTML tags) are widely shared. These links should continue to work, even as the document formatting evolves to include more HTML. This change modifies the anchor ID generation to strip HTML tags before creating IDs, ensuring backward compatibility with existing references while also supporting the new HTML-enhanced documentation format. Both regular and localized documentation processing now handle HTML tags in command options correctly. These anchor IDs still include the `lt` and `gt` left-overs from the pointy brackets that are encoded as `<`/`>` in HTML. This is intentional because the existing deep links all over the web do include those. Technically, this here patch is not needed to fix the regression because the preceding commit introduced code that is running on the client side and automagically adjusts incorrect anchors in URLs when practical. Having said that, the resulting anchors look a lot nicer, so let's do this anyway, even if it is not strictly needed to address git#2002. Reported-by: Toon Claes <[email protected]> Helped-by: Jean-Noël Avila <[email protected]> Co-authored-by: Johannes Schindelin <[email protected]> Signed-off-by: Johannes Schindelin <[email protected]>
I ran across a link on the interwebs:
https://git-scm.com/docs/git-clone#Documentation/git-clone.txt---recurse-submodulesltpathspecgt
and I noticed it's not anchoring to the correct piece of the page. The nowadays seems to be:
https://git-scm.com/docs/git-clone#Documentation/git-clone.txt-code--recurse-submodulesltpathspecgtcode
What changed the
code
to be added around the anchor link? I'd prefer to not have this behavior broken.The text was updated successfully, but these errors were encountered: