Mapping from BCP47 extensions to collator options is inconsistent #6033

Open
hsivonen opened this issue Jan 24, 2025 · 11 comments
Labels
2.0-breaking Changes that are breaking API changes C-collator Component: Collation, normalization C-locale Component: Locale identifiers, BCP47

Comments

@hsivonen
Member

hsivonen commented Jan 24, 2025

Currently, the option enum re-exports in the collator look like this:

pub use icu_locale_core::preferences::extensions::unicode::keywords::CollationCaseFirst as CaseFirst;
pub use icu_locale_core::preferences::extensions::unicode::keywords::CollationNumericOrdering as NumericOrdering;
pub use icu_locale_core::preferences::extensions::unicode::keywords::CollationType;
pub use options::AlternateHandling;
pub use options::BackwardSecondLevel;
pub use options::CaseLevel;
// [...]
pub use options::MaxVariable;
// [...]
pub use options::Strength;

That is, the Unicode Collation Identifier from https://www.unicode.org/reports/tr35/#UnicodeCollationIdentifier is supported as CollationType (with deprecated values excluded but the unsupported-by-ICU4X value "ducet" included), but only two of the Setting Options from https://www.unicode.org/reports/tr35/tr35-collation.html#Setting_Options are mapped. The mapped ones are the ones that are supported as part of the locale keys by ECMA-402: https://402.ecma-international.org/#sec-intl.collator
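To make the "only two are mapped" point concrete, here is a minimal sketch of a parser that reads only kf and kn from the -u- extension and ignores the other collation keys. The enums are local stand-ins named after the re-exports above, and `parse_mapped_keys` is a hypothetical helper for illustration, not an ICU4X API. It assumes a lowercase, well-formed BCP47 string.

```rust
/// Local stand-ins for the re-exported enums; not the real ICU4X types.
#[derive(Debug, PartialEq, Clone, Copy)]
enum CaseFirst {
    Upper,
    Lower,
    False,
}

#[derive(Debug, PartialEq, Clone, Copy)]
enum NumericOrdering {
    True,
    False,
}

/// Hypothetical helper: reads only `kf` and `kn` from the `-u-` extension
/// and ignores every other key (ks, ka, kb, kc, kv, ...).
fn parse_mapped_keys(locale: &str) -> (Option<CaseFirst>, Option<NumericOrdering>) {
    let mut case_first = None;
    let mut numeric = None;
    let mut subtags = locale.split('-').peekable();
    // Skip everything up to and including the `u` singleton.
    while let Some(subtag) = subtags.next() {
        if subtag == "u" {
            break;
        }
    }
    while let Some(key) = subtags.next() {
        if key.len() == 1 {
            break; // another extension singleton starts here
        }
        // Gather the value subtags (3 to 8 characters) that follow the key.
        let mut value = Vec::new();
        while let Some(&v) = subtags.peek() {
            if v.len() < 3 {
                break;
            }
            value.push(v);
            subtags.next();
        }
        match (key, value.join("-").as_str()) {
            ("kf", "upper") => case_first = Some(CaseFirst::Upper),
            ("kf", "lower") => case_first = Some(CaseFirst::Lower),
            ("kf", "false") => case_first = Some(CaseFirst::False),
            // A key with no value is shorthand for "true".
            ("kn", "" | "true") => numeric = Some(NumericOrdering::True),
            ("kn", "false") => numeric = Some(NumericOrdering::False),
            _ => {} // unmapped or unknown key: ignored
        }
    }
    (case_first, numeric)
}
```

With this sketch, `da-u-kf-lower-kn-ks-level1` yields a case-first and a numeric setting, while the ks keyword is silently dropped.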

Commentary

co / CollationType

search and searchjl are not like the others. It doesn't actually make sense for the user to be able to specify them in UI that tweaks the sort order. Yet, we have them in CollationType since LDML has them in the same namespace.

ducet isn't supported by ICU4X (or ICU4C/J).

The history of compat (the Arabic sort order ICU4C had before the current Arabic sort order) looks similar to the history of now deprecated/obsolete reformed, big5han and gb2312, and an upstream issue should probably be filed about compat.

(Arguably unihan is a CJK-specific way of saying dict, but that's not an ICU4X-level thing to argue about.

In practice, apart from standard and search, only trad applies to multiple languages as opposed to being specific to one language (or in the case of unihan specific to CJK).

emoji and eor are invitations for confusion, because they only apply to the root, so languages that use the root collation make it look like they can be applied to a variety of languages, when that's not actually the case.)

ks / Strength

This is currently not mapped from a BCP47 extension to an enum in ICU4X.

For the most part, Strength should be a sorting call site setting and not a system-wide user preference, but a case can be made that since the default is level3 even for Japanese but the Japanese standard corresponds to identic, ja-u-ks-identic could be a legitimate user preference for a user who wants adherence to JIS at the expense of performance.

ka / AlternateHandling

This is currently not mapped from a BCP47 extension to an enum in ICU4X.

Since the default for Thai differs from other languages, there's evidence that this can attach to locale, which means that it could be argued to be legitimate to customize the locale by overriding this in the locale identifier.

kb / BackwardSecondLevel

This is currently not mapped from a BCP47 extension to an enum in ICU4X.

Of the locales that we have data for, only fr-CA enables this via locale data. The only way to get this behavior via ECMA-402 is to request fr-CA. In particular, both fr-FR and en-CA resolve to the root collation without this option. I think that we should delete the code for this feature right away if the authorities for fr-CA decide to align with fr-FR and en-CA. At that point, it would be very sad to have the deletion be a "breaking change" due to us having exposed the feature by other means. This is why I argued against exposing BackwardSecondLevel in the ICU4X API, but the group decided otherwise. As long as we have the BackwardSecondLevel enum, it seems weird not to map kb.

kk

ICU4X does not support setting this to false. Setting this to false is an optimization in ICU4C/ICU4J that apparently is not a pure "as if" optimization.

kc / CaseLevel

This is currently not mapped from a BCP47 extension to an enum in ICU4X.

ECMA-402 sensitivity maps to a combination of Strength and CaseLevel. In that sense, it seems quite questionable to treat this as part of user-settable system preference as opposed to treating this as a call-site option.
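For illustration, the sensitivity mapping can be sketched as below. The enums are simplified local stand-ins rather than the ICU4X `Strength` and `CaseLevel` types, and `sensitivity_to_options` is a hypothetical name; the pairing of values follows how I understand ECMA-402 implementations on top of ICU to behave.

```rust
/// Simplified local stand-ins; not the ICU4X enums.
#[derive(Debug, PartialEq, Clone, Copy)]
enum Strength {
    Primary,
    Secondary,
    Tertiary,
}

#[derive(Debug, PartialEq, Clone, Copy)]
enum CaseLevel {
    Off,
    On,
}

/// The four values of the ECMA-402 `sensitivity` option.
#[derive(Debug, Clone, Copy)]
enum Sensitivity {
    Base,
    Accent,
    Case,
    Variant,
}

/// Hypothetical helper: one call-site option expands to two collator settings.
fn sensitivity_to_options(sensitivity: Sensitivity) -> (Strength, CaseLevel) {
    match sensitivity {
        // Only base letters differ.
        Sensitivity::Base => (Strength::Primary, CaseLevel::Off),
        // Base letters and accents differ.
        Sensitivity::Accent => (Strength::Secondary, CaseLevel::Off),
        // Base letters and case differ, accents do not.
        Sensitivity::Case => (Strength::Primary, CaseLevel::On),
        // Base letters, accents, and case all differ.
        Sensitivity::Variant => (Strength::Tertiary, CaseLevel::Off),
    }
}
```

The `Case` arm is the reason kc exists at all from the ECMA-402 point of view: it cannot be expressed with strength alone.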

kf / CaseFirst

Already mapped.

Since Danish and Maltese set this from the locale data, there's evidence of this attaching to the locale, which means that it could be argued to be legitimate to customize the locale by overriding this in the locale identifier, e.g. to specify Danish collation that's aligned with the behavior for other bicameral-script languages on this specific point.

kh

Deprecated in the spec and unsupported by ICU4X.

kn / NumericOrdering

Already mapped.

Arguably, this should be a call-site thing, but it's already mapped and ECMA-402 requires the mapping.

kr

ICU4X only supports this behavior from locale data, because we don't have a Rust port for the code that builds the relevant data structure. Therefore, we can't map this right now.

It's been suggested that ar-u-kr-Latn could have demand in regions where both Arabic and French are used if the user wants French in the Latin script to sort before Arabic in the Arabic script. Simply specifying fr to have French sort before Arabic would alter the way Arabic strings sort among Arabic strings.

kv / MaxVariable

This is currently not mapped from a BCP47 extension to an enum in ICU4X.

I suspect this would be more like a call-site thing than a system-wide preference, but I haven't thought about this deeply.

vt

Deprecated in the spec and unsupported by ICU4X.

@hsivonen hsivonen added 2.0-breaking Changes that are breaking API changes C-collator Component: Collation, normalization labels Jan 24, 2025
@hsivonen
Member Author

CC @Manishearth

@hsivonen hsivonen added the C-locale Component: Locale identifiers, BCP47 label Jan 24, 2025
@Manishearth Manishearth added this to the ICU4X 2.0 ⟨P1⟩ milestone Jan 24, 2025
@srl295
Member

srl295 commented Jan 27, 2025

co: search and searchjl are not like the others. They don't actually make sense to for the user to be able to specify in UI that tweaks the sort order. Yet, we have them in CollationType since LDML has them is the same namespace.

In a popup menu, 'choose your collation', no. They are in the same namespace because, well, they are collation types.

Perhaps what would help here, for UI, is 'collation preference' metadata in CLDR: something to say which collation types are actually of interest to users in which languages.

@sffc
Member

sffc commented Jan 31, 2025

Thanks for the commentary. It is helpful.

What is the concrete proposal?

I see a few options:

  1. Do nothing. The ECMA-402 preferences are in the ICU4X preferences bag, and the rest are in the options bag.
  2. Move all TR35 preferences into the ICU4X preferences bag.
  3. Move some, but not all, of the TR35 preferences into the ICU4X preferences bag, based on the TC's judgement, which will largely be @hsivonen's judgement. It sounds like we might move ks and ka, but leave the others where they are.
    • If we do this, consider also adding these to the ECMA-402 preferences.

And separately, we could consider removing kb from the options bag. @hsivonen says that the TC "decided otherwise", but I don't really recall an in-depth discussion on this, and I'd be fine dropping it in 2.0.

@sffc sffc added the discuss-priority Discuss at the next ICU4X meeting label Jan 31, 2025
@srl295
Member

srl295 commented Jan 31, 2025

Might I also suggest providing feedback/request to CLDR for:

  • guidance on the use of these tags in APIs and/or exposure to end users
  • perhaps preference data on collator types

There are valid concerns and observations written here - please upstream them.

@sffc
Member

sffc commented Jan 31, 2025

CC @markusicu for feedback on the above options proposals.

@robertbastian
Member

  • @hsivonen Three options:
    1. Use ECMA-402 options in ICU4X preferences (status quo)
    2. Add all BCP-47 options to ICU4X preferences.
    3. Look at locale data. If it can be modified, add it to ICU4X preferences.
  • @hsivonen If we map more than what ECMA-402 does, then ECMA-402 implementations need a way to drop the unused extensions. We should probably have a convenient API for that.
  • @hsivonen Not all of the options are equal. Some are deprecated, and one cannot be applied at runtime due to the ICU4C pipeline.
  • @hsivonen We also had a discussion about backward second level. @sffc didn't recall this meeting; I recall it was a discussion with @markusicu. I think we should be ready to delete this code if fr-CA authorities give up on this feature, but we can't do that if it is part of the API. In ECMA-402, the only way to get this behavior is to request fr-CA.
  • @sffc The 3 options listed are sensible. We currently do one of these. The nice thing of using ECMA-402 (status quo) is that we're not the body deciding what goes in the preferences options bag. Some of the collator preferences could/should be discussed first in ECMA-402. Based on ECMA-402 decisions, ICU4X can implement. Does ICU4C follow the options in the preferences? Presumably it does. We shouldn't go with option 2 (obey locale subtags) because then we become the authority. Options 1 or 3 seem likely, but we should consult with i18n people in ECMA-402, ICU, etc.
  • @Manishearth A question... implementation-wise, everything is in the Preferences? And for option 2 we would just need a way to clear out non-ECMA-402 options. Is that sufficient?
  • @sffc I don't necessarily think we should have an API for that; it is an ECMA-402 concern
  • @Manishearth I don't want to discuss API right now: I'm just discussing what is sufficient for the option to make sure I'm not missing subtleties.
  • @hsivonen Yes, it would just be clearing Preferences
  • @Manishearth I'd say, unless we want to put in all the work right now, we should follow ECMA-402 as a default. If someone files bugs, we can address those bugs as they come in.
  • @hsivonen There are a couple of anti-patterns at work, here. I myself might be perpetuating one of these. One is that the stuff in LDML for collation tunables was taken from whatever is in ICU. Then we have the -u-co-xxxxxxxx subtags. So even after ICU features were deprecated, the subtags were still kept around. Ex: the Japanese (?) and the way the Swedish collation was handled, and the old Arabic collation was kept around after IBM Egypt created the new one. So it is not enough to just look at the LDML spec as a guide. Some work went into ECMA-402, although I wasn't present for that.
  • @hsivonen We don't necessarily know why ECMA-402 did what they did. We should hear from @markusicu, @FrankYFTang, @macchiati. If we want to do the due diligence on this, then we should probably do this in the CLDR meeting. My concern is that, with the major stability points -- the ICU4X major releases -- we have to wait until the next one to make changes. On the other hand, just importing all of the options from LDML is repeating the mistake of being too permissive or wholesale uncritical that LDML made when pulling options from ICU.
  • @zbraniecki The first thing is, @hsivonen, there is no good ground. The history of Collator is that, it was agreed by @nciric based on the intersection of the old Microsoft API and Unicode. I don't think we should assume what is in ECMA-402 was properly vetted. It was just the first version. The second question I have is, if we were to follow ECMA-402, we'd have to evaluate whether we want to do this for all components. Is ECMA-402 also how we separate which extensions we use for list, date, etc?
  • @sffc I think @hsivonen is correct that ECMA-402 just did what was best at the time. At the same time, I would prefer ECMA-402 as an authority more than us. I do agree that @markusicu, @macchiati, and @FrankYFTang should be in the room. I think we can do this in a forwards-compatible way. We can have the options in the preferences bag and the options bag. If an option is set in both places, we can decide which place's value wins.
  • @zbraniecki I'd say that newer APIs, starting with Caridy, are well-thought through in ECMA-402. The initial version was a hard-fought battle with Microsoft and Yahoo, and I remember that Collator was a difficult one. The way I think about this is, ECMA-402 is the min, and BCP-47 is the max. Are we comfortable committing to minimal? My sense is that we're not. Even if we accept ECMA-402 as the default, we need a way to consume the rest of the BCP-47 options. We can't say we only support ECMA-402 extensions.
  • @sffc We do support the non-ECMA-402 options, but only programmatically, so it's a bit more work. When we were discussing filtering of ECMA-402 options, then it's a bit more work for ECMA-402. But if you want non-ECMA-402 options, then you have to do the work to set the non-ECMA-402 options. We have at least 3 options that we're fairly confident we should not accept. 2 are marked "deprecated", and the other (2nd-level backwards) is only used for fr-CA as @hsivonen mentioned.
  • @Manishearth I don't think it's more work one way or the other... it seems more work to parse things rather than clear them. If that's the actual problem, one thing you can do is... CollatorOptions could add an API that parses them from a locale. I think it's good for us to have defaults we have confidence in. I hear that the full set is one where we don't have confidence. So we can add an API, not necessarily in 2.0, that is well documented.
  • @hsivonen Maybe we should think about this in terms of use cases. Why would you want these in the locale object in the first place? One is, there's a good use case for expressing in system preferences, the configuration of this level of detail. For this, many of the use cases are, okay, maybe there's one special case for this thing, and maybe there's a special case for that thing, and it's theoretical, such as for strength, there's the Japanese standard, but not really for other languages. Maybe you want to override the Danish or Maltese or Thai thing. Do you really need to tweak it for the given locale? If you're in a region where both fr and ar are used, maybe you want to configure a script? So there are technically use cases, but does anyone really have a UI for setting it? The second use case is whether you might want to serialize a collator configuration to a BCP-47 string and deserialize it later. It seems theoretical.
  • @zbraniecki I like hsivonen's mental model. We should look at it from the angle of consumers. We can evaluate whether any operating system is going to allow setting these values. And I don't see that happening (?). And the second option is, we can think of ICU4X as used for localization of an application at an airport, for example, and they want to use ICU4X, but they might have specific guidance from a government. I don't mind the approach @Manishearth proposed about an API to extract info from a locale.
  • @sffc Responding to @hsivonen's question about whether you need to serialize collator options. We have the ability to add a resolved options API, and that's the way that you would serialize a collator. I am puzzled why Collator put so many options into the locale, when it's not clear whether those options are related to the locale.
  • @zbraniecki I think they made a mistake putting everything in BCP-47.
  • @sffc I want to have this discussion with @markusicu in one way or another, but I don't think this should be a blocker for 2.0. We can do this for 3.0, and a separate discussion is what the timeline for that is. There are multiple pathways to adding this in 2.x. One alternative is adding the options in both places, another alternative is to have helper APIs to convert / migrate the options. We should be deciding the policy, not the specific design itself.
  • @Manishearth Checking in with the status of the discussion... I think everyone generally agrees that the options not in ECMA-402 are marginal right now and need investigation at best.
  • @sffc From my read, there are maybe 1-2 that are strongly motivated, and the rest are weakly or poorly motivated.
  • @Manishearth It seems like the best way to backwards-compatibly handle this is.
  • @hsivonen I think they are all questionably motivated.
  • @Manishearth In that case, I think doing the least amount of work here, and leaving the door open to iterate, seems good. And we should document why we do what we currently do. I would be in support of not having these options anywhere.
  • @sffc So you're suggesting we remove these options from public API?
  • @Manishearth Maybe... we could remove them from the API and worry about it later. If we think these are marginal, it seems fine that we add them when people ask for them. The whole point of options vs preferences is whether it comes from user or code. I'm not advocating strongly, but coming from the perspective of designing a new component, it seems like we wouldn't add these without a solid design.
  • @hsivonen Doesn't make sense to remove from everywhere because for some of them it's a valid callsite decision. We have 3 potential sources of options. We can't throw all of these away from the API. The least questionable is the callsite. Ex: Case handling is a locale thing, alternate handling is a callsite thing, and then it contextually depends with Thai.
  • @Manishearth We have a similar type of issue in Decimal where there are two places to put settings and it's not clear where they go. So I think option 1 with an additional API to build options from the locale if desired, with docs, seems good. Because it sounds like we don't want to remove these from options.
  • @hsivonen The one I'd like to remove is backward-second-level. The rest should be in the API, and it's just a matter of which ones are user options in the locale. Case-first should maybe be in the same bucket as backward-second-level, but ECMA-402 has it. These are mostly good for call sites, but that isn't even consistent with ECMA-402.
  • @zbraniecki I think we trust Henri's judgement, and we should bring a proposal to the meeting with Mark etc.
  • @sffc +1
  • @Manishearth I don't think we need to block 2.0. For 2.0, we do option 1, and make sure they are wrapped in Option, which gives us flexibility in the future. We should add a potential API to consume the settings from Locale as 2.0-stretch. Should we consider removing these all from Preferences?
  • @sffc That's how it was until 3 weeks ago when we accepted a PR from Jedel to upstream them.
  • @Manishearth My litmus test is, it goes on Preferences if there's a reasonable reason, and ECMA-402 is a reasonable reason.
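The "clearing" step mentioned in the notes (dropping the collation keys that ECMA-402 does not read from the locale) could look roughly like the sketch below. It works on a plain lowercase BCP47 string so that it stays self-contained; `strip_non_ecma402_collation_keys` and the constant are hypothetical names, and an actual implementation would presumably operate on a parsed locale or preferences object instead.

```rust
/// LDML collation keys that ECMA-402 does not read from the locale
/// (ECMA-402 only reads co, kf, and kn).
const NON_ECMA402_COLLATION_KEYS: [&str; 9] =
    ["ka", "kb", "kc", "kh", "kk", "kr", "ks", "kv", "vt"];

/// Hypothetical helper: removes the keys listed above, together with their
/// values, from the `-u-` extension and leaves everything else untouched.
fn strip_non_ecma402_collation_keys(locale: &str) -> String {
    let mut out: Vec<&str> = Vec::new();
    let mut in_unicode_ext = false;
    let mut in_private_use = false;
    let mut skipping = false;
    for subtag in locale.split('-') {
        if !in_private_use {
            if subtag.len() == 1 {
                // A singleton starts a new extension; drop a `u` left empty.
                if in_unicode_ext && out.last() == Some(&"u") {
                    out.pop();
                }
                in_unicode_ext = subtag == "u";
                in_private_use = subtag == "x";
                skipping = false;
            } else if in_unicode_ext && subtag.len() == 2 {
                // A new key: decide whether it and its values are dropped.
                skipping = NON_ECMA402_COLLATION_KEYS.iter().any(|k| *k == subtag);
            }
        }
        if !skipping {
            out.push(subtag);
        }
    }
    // Drop a trailing `u` singleton if all of its keywords were removed.
    if in_unicode_ext && out.last() == Some(&"u") {
        out.pop();
    }
    out.join("-")
}
```

For example, `ja-u-co-standard-kn-ks-identic` keeps co and kn and loses ks, and `fr-CA-u-kb-true` collapses back to `fr-CA`.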

@Manishearth Manishearth removed their assignment Feb 9, 2025
@sffc sffc removed the discuss-priority Discuss at the next ICU4X meeting label Feb 10, 2025
@hsivonen
Member Author

Let's try to categorize the settings to:

  • system setting
  • locale data
  • usage scenario (call site)

System Setting

  • co / CollationType

I looked at the GUI system settings available on:

  • Ubuntu 24.04 Gnome
  • macOS 15.3
  • iOS/iPadOS 18.3
  • Android 15 (Google/Pixel version)
  • Windows 11

Of these, the only collation-relevant GUI that I found was on macOS controlling co / CollationType (excluding search types, direct extensions of root like eor and emoji, and deprecated).

(Notably, I failed to find GUI for co / CollationType on Windows 11, and Windows 11 set to zh-TW sorts by stroke count consistent with CLDR defaults. Is there really no GUI for Windows to switch to zhuyin sort instead?)

Locale data

  • kr (script reordering)
  • kf / CaseFirst
  • kb / BackwardSecondLevel

In practice, ka / AlternateHandling occurs in Thai locale data. It's not exactly clear to me why this is. Lao, Khmer, Japanese, and Chinese don't do this, so the other known non-users of spaces don't do the same.

The Japanese locale data explicitly sets strength, but the explicit setting is to the same level (3) as the general default. It would be more JIS-compliant (but slower) for Japanese to set a higher default strength in the locale data. So a locale in theory should be able to raise the default level above 3 via locale data, but then in practice CLDR doesn't do that for the known candidate.

Usage scenario (call site)

AFAICT, these are usage scenario (call site) settings:

  • (The search collations under co if there was a search API.)
  • ks / Strength
  • kc / CaseLevel
  • ka / AlternateHandling
  • kv / MaxVariable
  • kn / NumericOrdering

Is CaseFirst also a usage scenario setting?

Conclusions

  1. If we want to be able to represent system-level preferences as BCP47 locale identifiers, we need to support co.
  2. We should probably drop BackwardSecondLevel from the API and only allow it to be activated by fr-CA locale data, consistent with ECMA-402 implementations.
  3. ECMA-402 requires parsing kn and kf out of a BCP47 string. While last week's meeting indicated that this requirement might not be particularly well motivated and arises from reconciling Microsoft's and ICU's API capabilities, as long as ECMA-402 requires this, we should make it easy enough to implement ECMA-402 on top of ICU4X.
  4. The LDML set of BCP47 extensions for collation enables the full ICU4C collator configuration state to be serialized as a BCP47 string. I am not aware of use cases for this, but that does not mean that there aren't any. I'd appreciate @markusicu's comments on this.

Of the above, removing BackwardSecondLevel from the API would be realistically actionable in the ICU4X 2.0 time frame.
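To illustrate what point 4 means by serializing collator state, here is a minimal sketch covering only a subset of the keys (ka, kc, kn, ks). `CollatorConfig`, `Strength`, and `to_bcp47` are hypothetical names made up for this example and do not correspond to an ICU4X or ICU4C API; the value strings are the ones LDML defines for these keys.

```rust
/// Hypothetical strength enum; not the ICU4X type.
#[derive(Clone, Copy)]
enum Strength {
    Primary,
    Secondary,
    Tertiary,
    Quaternary,
    Identical,
}

/// Hypothetical, partial collator state. `None` means "not overridden".
#[derive(Default)]
struct CollatorConfig {
    alternate_shifted: Option<bool>, // ka
    case_level: Option<bool>,        // kc
    numeric: Option<bool>,           // kn
    strength: Option<Strength>,      // ks
}

fn bool_value(b: bool) -> &'static str {
    if b { "true" } else { "false" }
}

/// Serializes the overridden settings as `-u-` keywords, keys in
/// alphabetical order.
fn to_bcp47(language: &str, config: &CollatorConfig) -> String {
    let mut keywords: Vec<(&str, &str)> = Vec::new();
    if let Some(shifted) = config.alternate_shifted {
        keywords.push(("ka", if shifted { "shifted" } else { "noignore" }));
    }
    if let Some(case_level) = config.case_level {
        keywords.push(("kc", bool_value(case_level)));
    }
    if let Some(numeric) = config.numeric {
        keywords.push(("kn", bool_value(numeric)));
    }
    if let Some(strength) = config.strength {
        let value = match strength {
            Strength::Primary => "level1",
            Strength::Secondary => "level2",
            Strength::Tertiary => "level3",
            Strength::Quaternary => "level4",
            Strength::Identical => "identic",
        };
        keywords.push(("ks", value));
    }
    if keywords.is_empty() {
        return language.to_string();
    }
    let mut out = format!("{}-u", language);
    for (key, value) in keywords {
        out.push('-');
        out.push_str(key);
        out.push('-');
        out.push_str(value);
    }
    out
}
```

A config that only raises the strength to identical for Japanese serializes to `ja-u-ks-identic`, the same string discussed earlier as a possible JIS-oriented user preference.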

Side note: If the potential use cases for ks / Strength for Japanese and kr for Arabic in Arabic + French regions actually have substantial user demand, it would probably be more productive to mint co values for those cases (something like ja-u-co-strictjis or ar-u-co-latnarab) instead of assuming that UI could come about for constructing ks or kr-using BCP47 strings to represent these as user preferences. After all, even co seems to get GUI surface only on macOS. (I'm not sure how to assess the level of demand, since users may not know to contribute to CLDR: I have seen evidence of user demand for a non-root-order Sanskrit collation, but this does not appear to have resulted in CLDR activity.)

@Manishearth
Member

I am in favor of this plan.

@hsivonen
Member Author

Oh, and co could also be a usage scenario thing: de and de-AT phonebk sorts for Contacts apps and dict/unihan sorts for dictionary apps.

@hsivonen hsivonen added discuss-priority Discuss at the next ICU4X meeting and removed discuss-priority Discuss at the next ICU4X meeting labels Feb 13, 2025
@hsivonen
Member Author

While discussing this with @robertbastian in today's meeting, I realized that I can see a pattern in ECMA-402:

ECMA-402 option bag has options that map to a single BCP47 extension key and options that map to a combination of two BCP47 extension keys. The option bag options that map to a single BCP47 extension key are also supported as BCP47 extension keys, but the option bag options that map to a combination of two BCP47 extension keys are not supported as BCP47 extension keys. (And the option bag takes precedence.)
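The precedence rule for the single-key options can be sketched in a few lines. `resolve` is a hypothetical helper, and the third argument (a fallback from locale data) is my assumption about what applies when neither source sets the value.

```rust
/// Hypothetical helper: the option bag wins over the BCP47 extension key,
/// which in turn wins over the assumed locale-data default.
fn resolve<T>(option_bag: Option<T>, locale_extension: Option<T>, locale_data_default: T) -> T {
    option_bag
        .or(locale_extension)
        .unwrap_or(locale_data_default)
}
```

For example, with `numeric: false` in the option bag and `-u-kn-true` in the locale, `resolve(Some(false), Some(true), false)` returns `false`.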

This makes ECMA-402 look quite a bit less arbitrary than I thought at the time of last week's meeting and at the time of filing this issue!

That I had missed this (and that no one pointed it out to me in last week's meeting…) makes me more concerned that I might be missing something about point 4 above, too.

For reference, ECMA-402 has these option bag options:

usage maps to co-search or absence thereof.

localeMatcher is not really an option for the collator itself.

collation maps to co

numeric maps to kn

caseFirst maps to kf

sensitivity maps to a combination of ks and kc.

ignorePunctuation maps to a combination of ka and kv. (Aside: Characterizing it as "ignore punctuation" makes it easier to understand how Thai collation could have this option set as a locale-specific convention without other non-space-using scripts having to do the same.)

As noted, the effect of kb is available in ECMA-402 only as a consequence of requesting fr-CA, which sets the option via locale data.

kh and vt are deprecated and, therefore, are excluded from ECMA-402.

kk leaks a non-as-if ICU4C/J optimization, so it makes sense that it's excluded from ECMA-402.

kr isn't available via the option bag, either, and differs from the other collation-related BCP47 extensions by being list-valued (and as a matter of implementation involving more data builder-like back end needs).

@srl295
Member

srl295 commented Feb 13, 2025

usage maps to co-search or absence thereof.

(The search collations under co if there was a search API.)

Is this saying you should only choose a search collation for use with a search API?

Applications (Usage scenario) should be able to choose search collation even outside of a search API, apologies if I misunderstood.

5 participants