ICU-23031 Reinstate special case for "-u-va-posix" lost by ICU-22520 #3379

roubert · 2025-02-07T19:24:23Z

Inside ulocimp_forLanguageTag(), _appendKeywords() in uloc_tag.cpp has a hardcoded special case for -u-va-posix that appends the _POSIX variant, but this was missed during the refactoring for ICU-22520 (no test case covers it).

So the call to ulocimp_forLanguageTag() did more than previously understood, but we still don't want to call it for every language tag that has BCP-47 extensions just to reach this special case. Instead, add the same special case to ulocimp_getSubtags().

For this to work nicely, the loop in _getVariant() that copies variants needs to be refactored so that it can easily break when it encounters the start of any BCP-47 extension (which has the welcome side effect of making it more efficient, since an entire variant can be appended to the output sink at once).

This was broken by commit 678d5c1.
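
For readers following along, here is a rough, self-contained sketch of the loop shape described above. It is not the code in uloc.cpp: copyVariants() is a hypothetical helper, std::string stands in for the output sink, and treating any single-character subtag as the start of an extension is a simplifying assumption.

```cpp
#include <algorithm>
#include <cctype>
#include <cstddef>
#include <string>
#include <string_view>

// Hypothetical helper, NOT ICU's _getVariant(): copies variant subtags such as
// "posix-valencia-u-co-search" into an ICU-style "POSIX_VALENCIA" string and
// stops at the first single-character subtag, which in BCP-47 starts an
// extension (or private-use) sequence.
std::string copyVariants(std::string_view variants) {
    std::string out;
    for (std::string_view sub = variants; !sub.empty();) {
        std::size_t next = sub.find_first_of("_-");
        std::string_view subtag = sub.substr(0, next);  // npos means "to the end"
        if (subtag.length() == 1) break;  // singleton: an extension starts here
        if (!subtag.empty()) {
            if (!out.empty()) out.push_back('_');
            std::size_t start = out.size();
            out.append(subtag);  // append the whole subtag in one call
            std::transform(out.begin() + start, out.end(), out.begin() + start,
                           [](unsigned char c) { return static_cast<char>(std::toupper(c)); });
        }
        if (next == std::string_view::npos) break;
        sub.remove_prefix(next + 1);  // skip past the separator
    }
    return out;
}
```

The point of splitting on separators first is that the early exit for an extension becomes a single check per subtag rather than a check per character.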

Checklist

Required: Issue filed: ICU-23031
Required: The PR title must be prefixed with a JIRA Issue number. Example: "ICU-1234 Fix xyz"
Required: Each commit message must be prefixed with a JIRA Issue number. Example: "ICU-1234 Fix xyz"
Issue accepted (done by Technical Committee after discussion)
Tests included, if applicable
API docs and/or User Guide docs changed or added, if applicable

markusicu · 2025-02-07T20:31:31Z

icu4c/source/common/uloc.cpp

-        for (; index < localeID.size() && !_isTerminator(localeID[index]); index++) {
-            if (index >= MAX_VARIANTS_LENGTH) { // same as length > MAX_VARIANTS_LENGTH
+        for (std::string_view sub = localeID;;) {
+            size_t next = sub.find_first_of(".@_-");


By searching for literal @ instead of calling a function, we break EBCDIC platforms.
For ICU 76, we got a contribution that supposedly made ICU work again on z/OS.
I don't think this has to be a blocker now, but it would be good to at least add a TODO about it.

> By searching for literal @ instead of calling a function, we break EBCDIC platforms.

Sure, but that's not a regression, this loop has been checking for a literal ASCII @ ever since that check was first introduced in 2001 by commit 8e5d162.

markusicu · 2025-02-07T20:35:48Z

icu4c/source/common/uloc.cpp

+            size_t next = sub.find_first_of(".@_-");
+            // For historical reasons, a trailing separator is included in the variant.
+            bool finished = next == std::string_view::npos || next + 1 == sub.length();
+            size_t end = finished ? sub.length() : next;


nit: ICU standard naming is "limit" for an exclusive end
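
To make the trailing-separator behavior in the quoted lines concrete, here is a small standalone sketch that wraps the same three expressions in a function, using "limit" for the exclusive end as suggested. findLimit() is a hypothetical name for illustration only and does not exist in uloc.cpp.

```cpp
#include <cstddef>
#include <string_view>
#include <utility>

// Hypothetical wrapper around the lines quoted above; returns {limit, finished}.
// `limit` is the exclusive end of the current subtag. When the only separator
// left is the last character, it is kept as part of the subtag.
std::pair<std::size_t, bool> findLimit(std::string_view sub) {
    std::size_t next = sub.find_first_of(".@_-");
    bool finished = next == std::string_view::npos || next + 1 == sub.length();
    std::size_t limit = finished ? sub.length() : next;
    return {limit, finished};
}
```

With this, "POSIX_VALENCIA" yields a limit of 5 and is not finished, "POSIX" yields 5 and is finished, and "POSIX_" yields 6 and is finished, so the trailing separator stays in the subtag.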

markusicu · 2025-02-07T20:41:28Z

icu4c/source/common/uloc.cpp

+    constexpr char tail[] = "-u-va-posix";
+    constexpr size_t length = sizeof tail - 1;
+    if (localeID.length() == length && uprv_strnicmp(localeID.data(), tail, length) == 0) {


This only works if va-posix is the only keyword, right?
It doesn't work for something like en-US-u-co-search-va-posix-kc I think.
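
A standalone sketch of the limitation, assuming the check is applied to whatever follows the language and region subtags. isExactlyVaPosix() is a hypothetical name, and the hand-rolled ASCII comparison stands in for ICU's internal uprv_strnicmp(), which is not available outside ICU.

```cpp
#include <cstddef>
#include <string_view>

// ASCII-only lowercasing; a stand-in for the case-insensitive comparison
// that uprv_strnicmp() performs in the quoted code.
constexpr char asciiLower(char c) {
    return (c >= 'A' && c <= 'Z') ? static_cast<char>(c - 'A' + 'a') : c;
}

// Hypothetical helper: true only if `rest` is exactly "-u-va-posix",
// ignoring ASCII case. Any additional keyword makes the length differ.
bool isExactlyVaPosix(std::string_view rest) {
    constexpr std::string_view tail = "-u-va-posix";
    if (rest.length() != tail.length()) return false;
    for (std::size_t i = 0; i < tail.length(); ++i) {
        if (asciiLower(rest[i]) != tail[i]) return false;
    }
    return true;
}
```

Because the length comparison requires an exact match, "-u-va-posix" and "-U-VA-POSIX" pass while "-u-co-search-va-posix-kc" does not.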

richgillam

I had a really hard time reading through the code and figuring out what it did. I'm mostly relying on the fact that Markus didn't squawk to convince myself that the code is okay.

I second Markus's concerns. Like him, I think a "TODO" would be sufficient to deal with the EBCDIC thing and that you should use "limit" instead of "end" for that one variable. I agree it'd be good if this were a little more robust against other locale IDs that have "va-posix" somewhere in their contents, but I think we're fairly unlikely to see that kind of thing and am okay just supporting "-u-va-posix" for now. Again, you might want to add a TODO.

jira-pull-request-webhook · 2025-02-10T16:04:38Z

Notice: the branch changed across the force-push!

icu4c/source/common/uloc.cpp is different
icu4c/source/test/cintltst/cloctst.c is different
icu4c/source/test/cintltst/cloctst.h is different

View Diff Across Force-Push

~ Your Friendly Jira-GitHub PR Checker Bot

richgillam

LOKTM

roubert marked this pull request as ready for review February 7, 2025 20:18

roubert assigned richgillam Feb 7, 2025

roubert requested review from richgillam and markusicu February 7, 2025 20:18

markusicu reviewed Feb 7, 2025

richgillam previously approved these changes Feb 7, 2025

roubert dismissed richgillam’s stale review via aca1c83 February 10, 2025 16:04

roubert force-pushed the 23031 branch from 362aa3e to aca1c83 Compare February 10, 2025 16:04

richgillam approved these changes Feb 10, 2025
