Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ICU-23031 Reinstate special case for "-u-va-posix" lost by ICU-22520 #3379

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

roubert
Copy link
Member

@roubert roubert commented Feb 7, 2025

Inside of locimp_forLanguageTag() in _appendKeywords() in uloc_tag.cpp there's a hardcoded special case for -u-va-posix which appends the _POSIX variant but this was missed during the refactoring made for ICU-22520 (there isn't any test case that covers this).

So the call to locimp_forLanguageTag() did more than previously understood, but we still don't want to have to call that for every language tag that has BCP-47 extensions just in order to get to this special case. Instead, add a special case also to ulocimp_getSubtags().

For this to work nicely, the loop in _getVariant() that copies variants needs to be refactored so that it easily can break when encountering the start of any BCP-47 extension (which also has the welcome side-effect of making it more efficient, being able to append an entire variant at once to the output sink).

This was broken by commit 678d5c1.

Checklist

  • Required: Issue filed: ICU-23031
  • Required: The PR title must be prefixed with a JIRA Issue number. Example: "ICU-1234 Fix xyz"
  • Required: Each commit message must be prefixed with a JIRA Issue number. Example: "ICU-1234 Fix xyz"
  • Issue accepted (done by Technical Committee after discussion)
  • Tests included, if applicable
  • API docs and/or User Guide docs changed or added, if applicable

@roubert roubert marked this pull request as ready for review February 7, 2025 20:18
for (; index < localeID.size() && !_isTerminator(localeID[index]); index++) {
if (index >= MAX_VARIANTS_LENGTH) { // same as length > MAX_VARIANTS_LENGTH
for (std::string_view sub = localeID;;) {
size_t next = sub.find_first_of(".@_-");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

By searching for literal @ instead of calling a function, we break EBCDIC platforms.
For ICU 76, we got a contribution that supposedly made ICU work again on z/OS.
I don't think this has to be a blocker now, but it would be good to at least add a TODO about it.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

By searching for literal @ instead of calling a function, we break EBCDIC platforms.

Sure, but that's not a regression, this loop has been checking for a literal ASCII @ ever since that check was first introduced in 2001 by commit 8e5d162.

size_t next = sub.find_first_of(".@_-");
// For historical reasons, a trailing separator is included in the variant.
bool finished = next == std::string_view::npos || next + 1 == sub.length();
size_t end = finished ? sub.length() : next;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: ICU standard naming is "limit" for an exclusive end

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

Comment on lines 1611 to 1613
constexpr char tail[] = "-u-va-posix";
constexpr size_t length = sizeof tail - 1;
if (localeID.length() == length && uprv_strnicmp(localeID.data(), tail, length) == 0) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This only works if va-posix is the only keyword, right?
It doesn't work for something like en-US-u-co-search-va-posix-kc I think.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

richgillam
richgillam previously approved these changes Feb 7, 2025
Copy link
Contributor

@richgillam richgillam left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had a really hard time reading through the code and figuring out what it did. I'm mostly relying on the fact that Markus didn't squawk to convince myself that the code is okay.

I second Markus's concerns. Like him, I think a "TODO" would be sufficient to deal with the EBCDIC thing and that you should use "limit" instead of "end" for that one variable. I agree it'd be good if this were a little more robust against other locale IDs that have "va-posix" somewhere in their contents, but I think we're fairly unlikely to see that kind of thing and am okay just supporting "-u-va-posix` for now. Again, you might want to add a TODO.

Inside of locimp_forLanguageTag() in _appendKeywords() in uloc_tag.cpp
there's a hardcoded special case for "-u-va-posix" which appends the
"_POSIX" variant but this was missed during the refactoring made for
ICU-22520 (there isn't any test case that covers this).

So the call to locimp_forLanguageTag() did more than previously
understood, but we still don't want to have to call that for every
language tag that has BCP-47 extensions just in order to get to this
special case. Instead, add a special case also to ulocimp_getSubtags().

For this to work nicely, the loop in _getVariant() that copies variants
needs to be refactored so that it easily can break when encountering the
start of any BCP-47 extension (which also has the welcome side-effect of
making it more efficient, being able to append an entire variant at once
to the output sink).

This was broken by commit 678d5c1.
@jira-pull-request-webhook
Copy link

Notice: the branch changed across the force-push!

  • icu4c/source/common/uloc.cpp is different
  • icu4c/source/test/cintltst/cloctst.c is different
  • icu4c/source/test/cintltst/cloctst.h is different

View Diff Across Force-Push

~ Your Friendly Jira-GitHub PR Checker Bot

Copy link
Contributor

@richgillam richgillam left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LOKTM

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants