-
Notifications
You must be signed in to change notification settings - Fork 7.9k
[RFC] Add a locale for grapheme case-insensitive functions #18792
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
ext/intl/grapheme/grapheme_string.c
Outdated
Z_PARAM_STRING(haystack, haystack_len) | ||
Z_PARAM_STRING(needle, needle_len) | ||
Z_PARAM_OPTIONAL | ||
Z_PARAM_LONG(loffset) | ||
Z_PARAM_STRING_OR_NULL(locale, locale_len) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you tell what happens if locale_len == 0 ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, I don't understand. Could you tell me what happens?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I meant if locale_len == 0
, means locale
is empty so in this case what happens down the line ? should we get a default locale in this case ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, In this case, calls default locales (not root locale if locale is NULL
).
I'm wrong. Thanks.
ext/intl/grapheme/grapheme_string.c
Outdated
@@ -84,6 +84,7 @@ PHP_FUNCTION(grapheme_strpos) | |||
char *haystack, *needle; | |||
size_t haystack_len, needle_len; | |||
const char *found; | |||
char *locale = ""; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You should initialise these with = NULL;
, the empty string is done in a code segment like this, and will not be happy when you free it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you very much. Indeed. locale
should = NULL;
. Therefore, signature is wrong.
@@ -1043,7 +1061,7 @@ PHP_FUNCTION(grapheme_levenshtein) | |||
RETVAL_FALSE; | |||
goto out_bi2; | |||
} | |||
UCollator *collator = ucol_open("", &ustatus); | |||
UCollator *collator = ucol_open(locale, &ustatus); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If the locale
parameter is passed in as an actual NULL
, wouldn't this fail?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, that's right. I should change to signature of RFC. Thanks.
I have a probably stupid question but I try to understand the use-case: Do you have an example where the locale is relevant to determine uppercase or lowercase characters? From what I understand for example the lower-case letter So what do we need a locale for if we want to find Or is the idea to also find If the later, then IMO that is faaaaar more than "just" adding a locale to grapheme_functions. |
@heiglandreas
Yes, right.
Yes, later it is. I think below:
Thanks for give to me to example. From https://unicode-org.github.io/icu/userguide/transforms/casemappings.html#full-language-specific-case-mapping , Turkish |
It's midnight. I'll fix this morning. thanks. |
I find this highly problematic as it completely changes the way the This is not to say that it isn't a valuable addition! But I think this should be a separate (set of) functions. So far the This is what most people expect when they pass a string to an This change here though not only compares without considering the case, it also does some character replacements based on the locale. So it is basically doing the following under the hood: $string = 'Aarhus';
$comparison = 'år';
$normalizedString = Transliterator::create('dk-lower')->transliterate($string);
$normalizedComparison = Transliterator::create('dk-lower')->transliterate($comparison);
grapheme_strpos($normalizedString, $normalizedComparison); Where all I would expect is grapheme_strpos(
mb_strtolower('Aarhus'),
mb_strtolower('år')
); To give you an example why that might become really messy: In german we have two distinct different words: When comparing them with a german locale they would not be identical. Fine. But when comparing them with a for example US-locale they would suddenly be the same. Why? Because in languages that do not have the By nonchalantly doing more than a case-insensitive comparison suddenly Or - for the Octoberfest fans: A This is a hard no from my side! |
@heiglandreas I saw your example code, so I understand this PR is many problem.
I'm going to trying to ICU depends, However, this means many problems. Your information is valuable. This RFC moved back to "Under Discussion". |
Dear @heiglandreas (or anyone) Are you satisfied with the current behavior of I think need to locale from your comment. Because changes to behavior of regions. |
I do see the issue with especially the turkish alphabet where the Capital letter "I" (U+0049) is associated with the lower case letter "ı" (U+0131) and the Capital letter "İ" (U+0131) is associated with the lower case letter "i" (U+0069) - so essentially matching the "wrong" letters to one another. TBH I am not sure this is something that can be handled by a standard locale as "en-EN" as that is region-specific and does - in it's general form - not say anything about the used alphabet. So for example "sr-RS" (Serbian as spoken in Serbia) does not say whether to use the cyrilic or the latin alphabet. One would need to be explicit and use "sr-Latn-RS" to also include the alphabet. Similarily would you need to explicitly specify whether to use the Latin, the Arabic or the Cyrilic alphabet in Azerbaijani like via "az-Latn-AZ" or "az-Cyr-AZ". The best option IMO with the current tools would be something like this: $t = Transliterator::create('tr-Lower');
var_dump(grapheme_strpos(
$t->transliterate('İ'),
$t->transliterate('i')
)); But this uses an ID from the ICUs Transliterator which has only very little to do with a locale. So from my point of view a new set of functions (names something like So in this case a Sidenote: mb_strtolower returns grapheme_stripos('İ', 'i̇'); will output |
@heiglandreas Just a moment, please. |
Add
$locale
parameter for grapheme case insensitive functions.RFC: https://wiki.php.net/rfc/grapheme_add_locale_for_case_insensitive