Mapping from BCP47 extensions to collator options is inconsistent #6033
Comments
CC @Manishearth |
In a popup menu, 'choose your collation', no. They are in the same namespace because, well, they are collation types. Perhaps what would help here, for UI, is 'collation preference' metadata in CLDR: something to say which collation types are actually of interest to users in which languages. |
Thanks for the commentary. It is helpful. What is the concrete proposal? I see a few options:
And separately, we could consider removing |
Might I also suggest providing feedback/request to CLDR for:
There are valid concerns and observations written here - please upstream them. |
CC @markusicu for feedback on the above options proposals. |
Let's try to categorize the settings to:
System Setting
I looked at the GUI system settings available on:
Of these, the only collation-relevant GUI that I found was on macOS controlling (Notably, I failed to find GUI for
Locale data
In practice, The Japanese locale data explicitly sets strength, but the explicit setting is to the same level (3) as the general default. It would be more JIS-compliant (but slower) for Japanese to set a higher default strength in the locale data. So a locale in theory should be able to raise the default level above 3 via locale data, but then in practice CLDR doesn't do that for the known candidate.
Usage scenario (call site)
AFAICT, these are usage scenario (call site) settings:
Is
Conclusions
Of the above, removing Side note: If the potential use cases for |
I am in favor of this plan. |
Oh, and |
While discussing this with @robertbastian in today's meeting, I realized that I can see a pattern in ECMA-402: the ECMA-402 option bag has options that map to a single BCP47 extension key and options that map to a combination of two BCP47 extension keys. The options that map to a single extension key are also supported as BCP47 extension keys, but the options that map to a combination of two extension keys are not. (And the option bag takes precedence.) This makes ECMA-402 look quite a bit less arbitrary than I thought at the time of last week's meeting and at the time of filing this issue! That I had missed this (and that no one pointed it out to me in last week's meeting…) makes me more concerned that I might be missing something about point 4 above, too. For reference, ECMA-402 has these option bag options:
As noted, the effect of
|
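For illustration, the "one option, two settings" pattern noted above can be sketched for sensitivity, which the issue text describes as a combination of Strength and CaseLevel. This is a minimal sketch, not the ICU4X API: the enums below are hypothetical stand-ins, and the value-by-value mapping is the conventional ICU-based reading of ECMA-402, stated here as an assumption.

```rust
/// Hypothetical stand-in for the collator strength setting (`ks`).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Strength {
    Primary,
    Secondary,
    Tertiary,
}

/// Hypothetical stand-in for the case level setting (`kc`).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum CaseLevel {
    Off,
    On,
}

/// The four values of the ECMA-402 `sensitivity` option.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Sensitivity {
    Base,
    Accent,
    Case,
    Variant,
}

/// Expand one option-bag value into the two settings it stands for.
/// Assumed mapping: only `Case` turns the case level on, and it does so
/// on top of primary strength so that accents are still ignored.
pub fn sensitivity_to_settings(s: Sensitivity) -> (Strength, CaseLevel) {
    match s {
        Sensitivity::Base => (Strength::Primary, CaseLevel::Off),
        Sensitivity::Accent => (Strength::Secondary, CaseLevel::Off),
        Sensitivity::Case => (Strength::Primary, CaseLevel::On),
        Sensitivity::Variant => (Strength::Tertiary, CaseLevel::Off),
    }
}
```

Because a single sensitivity value fixes two settings at once, there is no single BCP47 key that could carry it, which is consistent with it being option-bag-only.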
…
Is this saying you should only choose a search collation for use with a search API? Applications (usage scenario) should be able to choose a search collation even outside of a search API. Apologies if I misunderstood. |
Currently, the option enum re-exports in the collator look like this:
That is, the Unicode Collation Identifier from https://www.unicode.org/reports/tr35/#UnicodeCollationIdentifier is supported as CollationType (with deprecated values excluded but the unsupported-by-ICU4X value "ducet" included), but only two of the Setting Options from https://www.unicode.org/reports/tr35/tr35-collation.html#Setting_Options are mapped. The mapped ones are the ones that are supported as part of the locale keys by ECMA-402: https://402.ecma-international.org/#sec-intl.collator
Commentary
co/CollationType
search and searchjl are not like the others. They don't actually make sense for the user to be able to specify in UI that tweaks the sort order. Yet, we have them in CollationType since LDML has them in the same namespace.
ducet isn't supported by ICU4X (or ICU4C/J).
The history of compat (the Arabic sort order ICU4C had before the current Arabic sort order) looks similar to the history of the now deprecated/obsolete reformed, big5han and gb2312, and an upstream issue should probably be filed about compat.
(Arguably unihan is a CJK-specific way of saying dict, but that's not an ICU4X-level thing to argue about. In practice, apart from standard and search, only trad applies to multiple languages as opposed to being specific to one language (or in the case of unihan specific to CJK). emoji and eor are invitations for confusion, because they only apply to the root, so languages that use the root collation make it look like they can be applied to a variety of languages, when that's not actually the case.)
ks/Strength
This is currently not mapped from BCP47 extension to enum in ICU4X.
For the most part, Strength should be a sorting call site setting and not a system-wide user preference, but a case can be made that since the default is level3 even for Japanese but the Japanese standard corresponds to identic, ja-u-ks-identic could be a legitimate user preference for a user who wants adherence to JIS at the expense of performance.
ka/AlternateHandling
This is currently not mapped from BCP47 extension to enum in ICU4X.
Since the default for Thai differs from other languages, there's evidence that this can attach to locale, which means that it could be argued to be legitimate to customize the locale by overriding this in the locale identifier.
kb/BackwardSecondLevel
This is currently not mapped from BCP47 extension to enum in ICU4X.
Of the locales that we have data for, only fr-CA enables this via locale data. The only way to get this behavior via ECMA-402 is to request fr-CA. In particular, both fr-FR and en-CA resolve to the root collation without this option. I think that we should delete the code for this feature right away if the authorities for fr-CA decide to align with fr-FR and en-CA. At that point, it would be very sad to have the deletion be a "breaking change" due to us having exposed the feature by other means. This is why I argued against exposing BackwardSecondLevel in the ICU4X API, but the group decided otherwise. As long as we have the BackwardSecondLevel enum, it seems weird not to map kb.
kk
ICU4X does not support setting this to false. Setting this to false is an optimization in ICU4C/ICU4J that apparently is not a pure "as if" optimization.
kc/CaseLevel
This is currently not mapped from BCP47 extension to enum in ICU4X.
ECMA-402 sensitivity maps to a combination of Strength and CaseLevel. In that sense, it seems quite questionable to treat this as part of user-settable system preference as opposed to treating this as a call-site option.
kf/CaseFirst
Already mapped.
Since Danish and Maltese set this from the locale data, there's evidence of this attaching to the locale, which means that it could be argued to be legitimate to customize the locale by overriding this in the locale identifier, e.g. to specify Danish collation that's aligned with the behavior for other bicameral-script languages on this specific point.
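A minimal sketch of the layering this implies for kf: a default from locale data, optionally overridden by the kf value in the locale identifier, optionally overridden again at the call site. The enum and function names are hypothetical, not ICU4X API. Letting the call site win mirrors the ECMA-402 rule mentioned in this thread that the option bag takes precedence, and is an assumption as far as ICU4X is concerned.

```rust
/// Hypothetical stand-in for the case-first setting.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum CaseFirst {
    Off,
    LowerFirst,
    UpperFirst,
}

/// Parse the value of the BCP47 `kf` key ("upper", "lower" or "false").
/// Unknown values yield `None` so that the caller falls back to the default.
pub fn parse_kf(value: &str) -> Option<CaseFirst> {
    match value {
        "false" => Some(CaseFirst::Off),
        "lower" => Some(CaseFirst::LowerFirst),
        "upper" => Some(CaseFirst::UpperFirst),
        _ => None,
    }
}

/// Resolve the effective setting from three layers, strongest first:
/// call-site option, `kf` value from the locale identifier, locale data.
pub fn resolve_case_first(
    locale_data_default: CaseFirst,
    kf_value: Option<&str>,
    call_site: Option<CaseFirst>,
) -> CaseFirst {
    call_site
        .or_else(|| kf_value.and_then(parse_kf))
        .unwrap_or(locale_data_default)
}
```

Under this sketch, a locale whose data sets upper-first keeps that behavior unless the identifier carries something like kf-false, in which case `resolve_case_first(CaseFirst::UpperFirst, Some("false"), None)` yields `CaseFirst::Off`.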
kh
Deprecated in the spec and unsupported by ICU4X.
kn/NumericOrdering
Already mapped.
Arguably, this should be a call-site thing, but it's already mapped and ECMA-402 requires the mapping.
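To make "mapped" concrete, here is a rough sketch of reading kn out of a locale identifier and turning it into an enum. It is a hand-rolled, simplified parser for the -u- extension, not the ICU4X locale API. The function and enum names are hypothetical, and it does not handle private-use sequences before the extension or validate the rest of the tag.

```rust
/// Hypothetical stand-in for the numeric ordering setting.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum NumericOrdering {
    Off,
    On,
}

/// Return the value of `key` inside the `-u-` extension, if the key is present.
/// A key with no value subtags is reported as "true", following the BCP47
/// convention that an omitted value means "true".
pub fn unicode_keyword(locale: &str, key: &str) -> Option<String> {
    let lower = locale.to_ascii_lowercase();
    let mut subtags = lower.split('-').skip_while(|s| *s != "u");
    subtags.next()?; // consume the "u" singleton; None means no -u- extension
    let mut found = false;
    let mut value: Vec<&str> = Vec::new();
    for subtag in subtags {
        match subtag.len() {
            1 => break, // another singleton ends the -u- extension
            2 => {
                if found {
                    break; // the next key ends the value of the wanted key
                }
                found = subtag == key;
            }
            _ => {
                if found {
                    value.push(subtag);
                }
            }
        }
    }
    if !found {
        return None;
    }
    Some(if value.is_empty() { "true".to_string() } else { value.join("-") })
}

/// Map the `kn` keyword to the enum; `None` if absent or not "true"/"false".
pub fn numeric_ordering_from_locale(locale: &str) -> Option<NumericOrdering> {
    match unicode_keyword(locale, "kn")?.as_str() {
        "true" => Some(NumericOrdering::On),
        "false" => Some(NumericOrdering::Off),
        _ => None,
    }
}
```

With this sketch, both en-u-kn-true and the shorter en-u-kn resolve to `NumericOrdering::On`, while a plain en yields `None` and leaves the decision to the locale data default or the call site.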
kr
ICU4X only supports this behavior from locale data, because we don't have a Rust port for the code that builds the relevant data structure. Therefore, we can't map this right now.
It's been suggested that ar-u-kr-Latn could have demand in regions where both Arabic and French are used if the user wants French in the Latin script to sort before Arabic in the Arabic script. Simply specifying fr to have French sort before Arabic would alter the way Arabic strings sort among Arabic strings.
kv/MaxVariable
This is currently not mapped from BCP47 extension to enum in ICU4X.
I suspect this would be more like a call-site thing than a system-wide preference, but I haven't thought about this deeply.
vt
Deprecated in the spec and unsupported by ICU4X.
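The per-key commentary above can be condensed into a small lookup table. This is only a restatement of the status described in this issue as a sketch. The `MappingStatus` enum, the table and the helper functions are hypothetical and do not exist in ICU4X.

```rust
/// Status of a BCP47 collation key, as described in this issue.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum MappingStatus {
    /// The locale keyword is mapped to an option enum.
    Mapped,
    /// An option enum exists, but the keyword is not mapped to it.
    NotMapped,
    /// Behavior is available only via locale data (no runtime override).
    LocaleDataOnly,
    /// The non-default value is not supported at all.
    NotConfigurable,
    /// Deprecated in the spec and unsupported.
    Deprecated,
}

/// (key, related enum name if any, status), in the order discussed above.
pub const KEY_STATUS: &[(&str, Option<&str>, MappingStatus)] = &[
    ("co", Some("CollationType"), MappingStatus::Mapped),
    ("ks", Some("Strength"), MappingStatus::NotMapped),
    ("ka", Some("AlternateHandling"), MappingStatus::NotMapped),
    ("kb", Some("BackwardSecondLevel"), MappingStatus::NotMapped),
    ("kk", None, MappingStatus::NotConfigurable),
    ("kc", Some("CaseLevel"), MappingStatus::NotMapped),
    ("kf", Some("CaseFirst"), MappingStatus::Mapped),
    ("kh", None, MappingStatus::Deprecated),
    ("kn", Some("NumericOrdering"), MappingStatus::Mapped),
    ("kr", None, MappingStatus::LocaleDataOnly),
    ("kv", Some("MaxVariable"), MappingStatus::NotMapped),
    ("vt", None, MappingStatus::Deprecated),
];

/// Look up the status of a single key.
pub fn status_of(key: &str) -> Option<MappingStatus> {
    KEY_STATUS
        .iter()
        .find(|(k, _, _)| *k == key)
        .map(|(_, _, s)| *s)
}

/// List all keys that currently have the given status.
pub fn keys_with(status: MappingStatus) -> Vec<&'static str> {
    KEY_STATUS
        .iter()
        .filter(|(_, _, s)| *s == status)
        .map(|(k, _, _)| *k)
        .collect()
}
```

Calling `keys_with(MappingStatus::NotMapped)` lists the keys whose enum exists without a keyword mapping, which is the inconsistency the issue title refers to.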