Please reconsider the use of collator-based search #233
@hsivonen: This is very thorough and convincing, thank you for writing it up! I'll confess that I'm not an expert in this area so I didn't appreciate some of the nuances back when I was considering these issues. IIRC our primary motivation here was reusing the find-in-page machinery and so we worked backwards to try and specify (loosely, it turns out) what that was doing. But your point about why Ctrl-F find should work differently from text directives makes sense to me. Definitely agree that having UI language change the meaning of a link is wrong - I hadn't realized the collator depended on the UI language! I'll double-check with consumers of this API internally to make sure I'm not missing anything but I believe we'd be ok removing the collator-based search. As for case-sensitivity...I'll have a think on it but as long as it doesn't make links more brittle I don't feel strongly about it. Given that Safari also does this I wonder if they have an opinion here. +@annevk
I think we'd be okay with changing that. As Henri notes, it's not important for the use cases.
The important thing is to be referencing the source DOM and not take rendering into account.
That said, I assume hand-rolling links is pretty rare (I do it sometimes but I'm probably atypical here :) and over time most of these links will be generated by tooling anyway.
I think everyone's in agreement that we should remove collator-based search. We're also ok to make the search case-sensitive so I'll start working on spec-changes in this direction.
@fantasai - the copy-paste example I'm more worried about right now is whitespace, since DOM nodes will include whitespace that's collapsed in the rendered output and not included in copy/paste. I'm worried even programmatic tools will get this wrong, and server-side systems are almost certain to do some basic normalization; I'd guess whitespace collapsing would be a common one. I need to familiarize myself a bit more with the relevant specs first (any pointers would be appreciated) so I don't have great ideas yet, but I wonder if it'd make sense to perform the search over some processed form of the text rather than the source DOM (e.g. what's seen via `innerText`).
I suspect we want to essentially use `innerText`. Having said that, I haven't double-checked how it's implemented today, but my preference would be that this would not be a completely novel code path.
I see, the specification already has the whitespace issue: #98, and it seems like it's still unresolved. So what has Chromium implemented if it's not in the specification?
Chromium seems to use the transformed casing rather than the source casing when creating a text fragment URL from the context menu. Demo: https://software.hixie.ch/utilities/js/live-dom-viewer/#:~:text=LIVE%20DOM%20VIEWER,-Markup%20to%20test That seems like an issue if we move to case-sensitive matching. Since styling can change for various reasons (e.g. viewport size), it seems more robust to match against the non-transformed text. That also seems in line with matching against the source DOM, as suggested above.
Chrome performs the search on the text-as-rendered (i.e. post-`text-transform`). I agree it's bad for the result to depend on style, so ideally we'd match on the pre-`text-transform` text.
WDYT of searching across post-whitespace-collapsed text but before `text-transform` is applied?
Re `innerText` - I think that makes sense as a starting point. In any case, the details would need working out.
That makes sense in the context of how `window.find` works. (I don't know what pre-processing `window.find` does.)
Re: shadow trees, there's a question in #190 whether the search should pierce. I think it should, but I'm currently measuring how often this happens in practice; I'd appreciate additional opinions there. The search is currently specified as a tree walk that collects a search string from text nodes. We could keep the shadow-piercing property by keeping the tree walk and collecting the `innerText` of leaf elements. One issue that we'd have to resolve is how to map a match in the `innerText` string back to a DOM range.
Would it make sense to spec a shared "find text" algorithm with the knobs needed to differentiate behavior among its callers? This algorithm would be a common place that defines how the search text is collected, what's considered "visible", etc. (i.e. much of what this spec does). E.g. the text directive search would call "find text" with its particular options set. @annevk is this essentially what you meant?
@bokand yeah, that is what I meant. I think it has to pierce shadow trees from a UX perspective. We have to be careful, though, that if we ever add some kind of API, it doesn't result in shadow trees being exposed.
Whether 'accents, umlauts, and other marks' are significant depends on language.
This PR includes borrowing text from an example by @hsivonen, which I intend to replace before merging with a better-adapted version. In addition, some of the text and comments from WICG/scroll-to-text-fragment#233 are being adapted into the prose of this document. **_Submitting as draft. Not ready for review._**
Step 6.1 of find a range from a node list has this sentence: “The string search must be performed using a base character comparison, or the primary level, as defined in [UTS10].” Followed by the note: “Intuitively, this is a case-insensitive search also ignoring accents, umlauts, and other marks.”
There are multiple problems with this:
In Chrome, the meaning of links depends on the UI language
Consider the link https://hsivonen.com/test/moz/text-fragment-language.html#:~:text=H%C3%A4n . Test this in Chrome with the UI language set to English, German, and Finnish. A different word (in a Finnish part of the text) is highlighted in the three different UI languages.
Also test the link https://hsivonen.com/test/moz/text-fragment-language.html#:~:text=ILIK in Chrome with the UI language set to English and Turkish. A different word is highlighted depending on the UI language.
Obviously, it’s very bad for the meaning of links to depend on the UI language!
The current situation is not interoperable
Safari (tested 16.6 on Ventura) appears to perform a case-insensitive search that doesn’t ignore diacritics and does not appear to depend on either the UI language or the language of the content (i.e. the case mapping is wrong for languages that use Turkish i).
The use case isn’t well-motivated
Why should the search be ignoring certain aspects of the text? This seems unnecessary when the text fragments are generated by copying concrete text from the page (by plain copy-paste, by using Chrome’s context menu item “Copy link to highlight”, or by a server-side system having loaded the target page for server-side processing). This seems to mainly cater to typing text fragments by hand when the text input method in use isn’t well matched with the page text (e.g. using U.S. English keyboard to try to match English text that has French-style accents like naïve or café). What use case justifies the resulting complexity?
The spec doesn’t actually say what Chrome does
The definition is too vague to actually use collator-based search. The spec doesn’t say how to choose what collation to use. The referenced spec, UTS 10, specifies DUCET, but no browser carries an implementation of raw DUCET. All browsers use CLDR collations instead. Chrome seems to be using the CLDR search collation for the UI language of the browser (as with cmd/ctrl-f).
Distinction from cmd/ctrl-f
Using the search collation for the UI language makes sense for cmd/ctrl-f, since the text is user-entered, so it’s reasonable to align locale-dependent behavior to the locale of the user as opposed to the locale of the page. However, links should mean the same thing when shared between people who use different UI languages.
Speccing the root collation doesn’t make sense
The whole point of collator-based search is to make the meanings of accent-insensitive and case-insensitive language-dependent by reusing the collation data. The meaning of case-insensitive changes for Turkish i, and a number of European Latin-script languages treat a letter bearing what English analyzes as an accent as a base letter in its own right.
Saying that the CLDR search root collation shall always be used would make things always wrong for the languages that were supposed to be the beneficiaries of collator-based search.
Collator-based search doesn’t deal with multiple language spans in the haystack
Collator-based search maps the text for both the needle and the haystack into collation elements, performs the search over collation elements, and then tries to correlate the result back to the original text.
This requires the use of the same mapping from text to collation elements for both the needle and the haystack. In practice, the allocation of tailored primary weights in a language-specific tailoring is scoped to that tailoring, so collator-based search isn’t suited for handling spans of different languages in the haystack.
Creating a Web-exposed dependency on a complex operation
The spec as written brings collator-based search into the scope of the Web Platform when previously collator-based search hasn't been a standardized part of the Web Platform. (WebKit and Blink implement their cmd/ctrl-f functionality via collator-based search but Firefox doesn't. `window.find` exposes this to the Web, but the API remains unstandardized.) Notably, while `Intl.Collator` exposes search collation data, it does so only in the context of providing a sorting API. `Intl.Collator` doesn't provide a search API, so it can only be used for testing full-string matches, not substring or (properly) even prefix matches. (I think it's a design error for `Intl.Collator` to expose search data.)
Collator-based search means shipping ICU4C or undertaking major implementation effort
Before implementing the collator for ICU4X, I surveyed the use cases of collation in Firefox and determined that certain capabilities of ICU4C’s collator aren’t exposed to the Web or otherwise needed by Firefox. (These include: collator-based search, generating search keys that a database can store after up-front computation and then use multiple times with a computationally lighter comparison operation, and generating an alphabetical index as seen in the Contacts apps on mobile phones with items grouped under exemplar characters–typically the recitable alphabet of the locale.)
I wish to keep the option open for Firefox to migrate from ICU4C to ICU4X without bringing collator-based search into scope as a blocker feature to implement.
There’s a scope document that contains the rationale for excluding collator-based search from ICU4X.
I suggest revisiting case-insensitivity and accent-insensitivity in light of use cases and reassessing whether addressing the use cases that would call for case-insensitivity or accent-insensitivity is really worth the complexity.
I am not advocating in favor of case-insensitivity, but I note that implementing case-insensitivity on top of the Unicode database’s notion of fold case would involve much less complexity than involving a collator in any way. Unlike in Safari, it could even be sensitive to Turkish i based on the language declared for each node in text content without making things infeasible. Still, even that level of complexity would need to be justified by use cases.
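The difference between a root case mapping and a language-aware one is visible in JavaScript via `toLocaleLowerCase`; this sketch illustrates the distinction rather than proposing a spec mechanism:

```javascript
// Root case mapping: I lowercases to dotted i.
"ILIK".toLowerCase(); // "ilik" ("marrow" in Turkish)

// Turkish case mapping: I lowercases to dotless ı, giving a different word.
"ILIK".toLocaleLowerCase("tr"); // "ılık" ("lukewarm")
```

This is the same distinction behind the `text=ILIK` example above: a case-insensitive match that uses the UI locale's case mapping can select a different word for Turkish users than for English users.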
Accent-insensitivity is big enough a deal to do well that the use cases would need to be extraordinarily compelling.
Collator-based search is conceptually insensitive to the Unicode Normalization Form. However, in practice the Web is in Normalization Form C, and in any case, generating the URL fragments by copying whatever text the page happens to have doesn't seem to justify making the comparison insensitive to the Unicode Normalization Form.