[RFC] Add a locale for grapheme case-insensitive functions #18792

Draft
wants to merge 5 commits into
base: master

Conversation

youkidearitai
Contributor

Add $locale parameter for grapheme case insensitive functions.
RFC: https://wiki.php.net/rfc/grapheme_add_locale_for_case_insensitive

Z_PARAM_STRING(haystack, haystack_len)
Z_PARAM_STRING(needle, needle_len)
Z_PARAM_OPTIONAL
Z_PARAM_LONG(loffset)
Z_PARAM_STRING_OR_NULL(locale, locale_len)
Member

Can you tell what happens if locale_len == 0?

Contributor Author

Sorry, I don't understand. Could you tell me what happens?

Member

@devnexen devnexen Jun 30, 2025

I meant: if locale_len == 0, the locale is empty, so what happens down the line in that case? Should we fall back to a default locale?

Contributor Author

Ah, in this case the default locale is used (not the root locale, if locale is NULL).
I was wrong. Thanks.

@@ -84,6 +84,7 @@ PHP_FUNCTION(grapheme_strpos)
char *haystack, *needle;
size_t haystack_len, needle_len;
const char *found;
char *locale = "";
Member

You should initialise these with = NULL; written like this, the empty string is placed in a code segment, and things will not be happy when you free it.
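
The distinction raised in this review thread (no locale passed, an empty locale, a real locale) can be sketched in plain C. This is a minimal illustration only: the enum and the helper name classify_locale_arg are hypothetical and do not come from php-src, and whether the PR ever frees the pointer is not visible in the excerpt. The point is that a NULL default keeps "not provided" distinguishable from "", which a default of "" cannot do.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical classification of an optional locale argument. */
typedef enum {
    LOCALE_ABSENT, /* pointer is NULL: caller passed nothing (or null) */
    LOCALE_EMPTY,  /* pointer is valid but the string has length 0 */
    LOCALE_GIVEN   /* a non-empty locale identifier such as "da" */
} locale_arg_kind;

locale_arg_kind classify_locale_arg(const char *locale, size_t locale_len)
{
    /* A NULL pointer must never be paired with a non-zero length. */
    assert(locale != NULL || locale_len == 0);

    if (locale == NULL) {
        return LOCALE_ABSENT;
    }
    if (locale_len == 0) {
        return LOCALE_EMPTY;
    }
    return LOCALE_GIVEN;
}
```

With the variable initialised to NULL, an omitted argument classifies as LOCALE_ABSENT; with the variable initialised to "", an omitted argument and an explicit empty string both classify as LOCALE_EMPTY and can no longer be told apart.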

Contributor Author

Thank you very much. Indeed, locale should be = NULL;. Therefore, the signature is wrong.

@@ -1043,7 +1061,7 @@ PHP_FUNCTION(grapheme_levenshtein)
RETVAL_FALSE;
goto out_bi2;
}
-UCollator *collator = ucol_open("", &ustatus);
+UCollator *collator = ucol_open(locale, &ustatus);
Member

If the locale parameter is passed in as an actual NULL, wouldn't this fail?

Contributor Author

Yes, that's right. I should change this to the signature in the RFC. Thanks.

@heiglandreas
Contributor

I have a probably stupid question, but I am trying to understand the use case:

Do you have an example where the locale is relevant to determine uppercase or lowercase characters?

From what I understand for example the lower-case letter å has a corresponding upper-case letter Å - regardless of the locale.

So what do we need a locale for if we want to find år in Århus?

Or is the idea to also find ar in Århus?

If the latter, then IMO that is faaaaar more than "just" adding a locale to grapheme_functions.

@youkidearitai
Contributor Author

@heiglandreas
Thank you very much for the response.

From what I understand for example the lower-case letter å has a corresponding upper-case letter Å - regardless of the locale.

Yes, right.

So what do we need a locale for if we want to find år in Århus?
Or is the idea to also find ar in Århus?

Yes, it is the latter. I am thinking of something like the below:

$ sapi/cli/php -r 'var_dump(grapheme_stripos("Aarhus", "å", locale: "da"));'
int(0)
$ sapi/cli/php -r 'var_dump(grapheme_stripos("Aarhus", "å"));'
bool(false)

Thanks for giving me an example.

According to https://unicode-org.github.io/icu/userguide/transforms/casemappings.html#full-language-specific-case-mapping , the uppercase of the Turkish small letter i is 'İ' (U+0130). This RFC supports these cases.

@youkidearitai
Contributor Author

It's midnight here. I'll fix this in the morning. Thanks.

@heiglandreas
Contributor

I find this highly problematic as it completely changes the way the i comparison functions work.

This is not to say that it isn't a valuable addition! But I think this should be a separate (set of) functions.

So far the i functions compare on a "binary" level ignoring the 3rd bit. So a (01100001) and A (01000001) are the same.

This is what most people expect when they pass a string to an i function.
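
The "binary level" model described above can be sketched in plain C. This is only an illustration of ASCII case folding through bit 0x20 (the third bit from the left in the byte patterns quoted above); the helper names are hypothetical, and nothing here is taken from php-src or ICU. The fold is deliberately restricted to A-Z, because blindly masking that bit would also equate '@' with '`'.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Lowercase one ASCII byte by setting bit 0x20; other bytes pass through. */
unsigned char ascii_fold(unsigned char c)
{
    return (c >= 'A' && c <= 'Z') ? (unsigned char)(c | 0x20) : c;
}

/* Compare two NUL-terminated strings, ignoring ASCII case only. */
bool ascii_caseless_equal(const char *a, const char *b)
{
    assert(a != NULL && b != NULL);

    while (*a != '\0' && *b != '\0') {
        if (ascii_fold((unsigned char)*a) != ascii_fold((unsigned char)*b)) {
            return false;
        }
        a++;
        b++;
    }
    /* Equal only if both strings ended at the same position. */
    return *a == *b;
}
```

Under this model "Aarhus" and "aARHUS" compare equal, while any byte outside A-Z (including every byte of a multi-byte UTF-8 character) is compared exactly as stored.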

This change here though not only compares without considering the case, it also does some character replacements based on the locale.

So it is basically doing the following under the hood:

$string = 'Aarhus';
$comparison = 'år';
$normalizedString = Transliterator::create('dk-lower')->transliterate($string);
$normalizedComparison = Transliterator::create('dk-lower')->transliterate($comparison);
grapheme_strpos($normalizedString, $normalizedComparison);

Where all I would expect is

grapheme_strpos(
	mb_strtolower('Aarhus'),
	mb_strtolower('år')
);

To give you an example why that might become really messy:

In German we have two distinct words: Busse (plural of a bus) and Buße (penance).

When comparing them with a German locale they would not be identical. Fine.

But when comparing them with, for example, a US locale they would suddenly be the same. Why? Because in languages that do not have the ß character this is often converted to ss. Which then gives a totally new meaning to the word.

By nonchalantly doing more than a case-insensitive comparison suddenly bus becomes part of Buße...

Or - for the Octoberfest fans: A Maß suddenly is included in Masse.
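
The concern can be reproduced with a small C sketch that expands the UTF-8 encoding of ß to "ss" before searching. The function expand_sharp_s is a hypothetical, hand-written helper for illustration; it makes no claim about what ICU or the grapheme functions actually do for any given locale.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Copy src into dst, replacing every UTF-8 "ß" (bytes 0xC3 0x9F) with "ss".
 * Returns the number of bytes written (excluding the NUL terminator),
 * or (size_t)-1 if dst is too small. */
size_t expand_sharp_s(const char *src, char *dst, size_t dst_size)
{
    assert(src != NULL && dst != NULL);
    if (dst_size == 0) {
        return (size_t)-1;
    }

    size_t out = 0;
    for (size_t i = 0; src[i] != '\0'; i++) {
        unsigned char c = (unsigned char)src[i];
        if (c == 0xC3 && (unsigned char)src[i + 1] == 0x9F) {
            if (out + 2 >= dst_size) {
                return (size_t)-1; /* no room for "ss" plus NUL */
            }
            dst[out++] = 's';
            dst[out++] = 's';
            i++; /* skip the second byte of "ß" */
        } else {
            if (out + 1 >= dst_size) {
                return (size_t)-1; /* no room for this byte plus NUL */
            }
            dst[out++] = (char)c;
        }
    }
    dst[out] = '\0';
    return out;
}
```

Searching the raw bytes of "Buße" for "Bus" with strstr finds nothing, but after the expansion the buffer holds "Busse" and the same search succeeds, which is exactly the change in meaning described above.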

This is a hard no from my side!

@youkidearitai
Contributor Author

@heiglandreas
Thank you for your comment.
I understand your concern a little.

I saw your example code, and I understand that this PR has many problems.

But when comparing them with a for example US-locale they would suddenly be the same. Why? Because in languages that do not have the ß character this is often converted to ss. Which then gives a totally new meaning to the word.

I was going to rely on ICU for this; however, that causes many problems,
because the meaning changes.

Your information is valuable.
Thank you very much. thanks again.

This RFC moved back to "Under Discussion".
I'm not going to force it.

@youkidearitai
Contributor Author

Dear @heiglandreas (or anyone)
Please give me more information.

Are you satisfied with the current behavior of grapheme_stripos (and the other case-insensitive grapheme functions, which are not locale-aware)?

From your comment, I think a locale is needed, because the behavior changes between regions.

@heiglandreas
Contributor

I do see the issue, especially with the Turkish alphabet, where the capital letter "I" (U+0049) is associated with the lower-case letter "ı" (U+0131) and the capital letter "İ" (U+0130) is associated with the lower-case letter "i" (U+0069) - so essentially matching the "wrong" letters to one another.
This is not only a problem with Turkish but also with some other languages from the area, like Azerbaijani, Crimean Tatar, Gagauz, Kazakh and Tatar.
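
The pairing described above can be written down as a tiny code-point table in C. This is a hand-written sketch covering only ASCII letters and the two capital I code points, using one-to-one (simple) mappings; the function names and the turkic flag are hypothetical and are not ICU APIs.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Simple lowercase mapping for ASCII letters plus I (U+0049) and
 * İ (U+0130). When turkic is true, the Turkish/Azerbaijani pairing
 * of dotted and dotless letters is applied. */
uint32_t lower_i_aware(uint32_t cp, bool turkic)
{
    assert(cp <= 0x10FFFF); /* must be a valid Unicode code point */

    if (cp == 0x0049) {                  /* I */
        return turkic ? 0x0131 : 0x0069; /* ı (dotless) versus i */
    }
    if (cp == 0x0130) {                  /* İ */
        return 0x0069;                   /* i in this simplified table */
    }
    if (cp >= 'A' && cp <= 'Z') {
        return cp + 0x20;
    }
    return cp;
}

/* Case-insensitive comparison of two single code points. */
bool caseless_equal_cp(uint32_t a, uint32_t b, bool turkic)
{
    return lower_i_aware(a, turkic) == lower_i_aware(b, turkic);
}
```

With turkic set to false, "I" matches "i" and not "ı"; with turkic set to true, "I" matches "ı" and no longer matches "i", which is why a single locale-independent rule cannot serve both cases.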

TBH I am not sure this is something that can be handled by a standard locale such as "en-EN", as that is region-specific and does not, in its general form, say anything about the alphabet used.

So for example "sr-RS" (Serbian as spoken in Serbia) does not say whether to use the Cyrillic or the Latin alphabet. One would need to be explicit and use "sr-Latn-RS" to also include the alphabet. Similarly, you would need to explicitly specify whether to use the Latin, the Arabic or the Cyrillic alphabet in Azerbaijani, for example via "az-Latn-AZ" or "az-Cyrl-AZ".

The best option IMO with the current tools would be something like this:

$t = Transliterator::create('tr-Lower');

var_dump(grapheme_strpos(
	$t->transliterate('İ'),
	$t->transliterate('i')
));

But this uses an ID from ICU's Transliterator, which has only very little to do with a locale.

So from my point of view a new set of functions (named something like grapheme_strt* for Grapheme-function (grapheme_) for strings (str) based on transliteration (t)) would make sense, so as not to overload the current functions with a slightly different functionality.

So in this case a grapheme_strtipos('İ', 'i', 'tr-Lower', 0) would return 0

Sidenote: mb_strtolower returns as lower-case version of İ so

grapheme_stripos('İ', '');

will output 0...

@youkidearitai
Contributor Author

@heiglandreas
Thank you. I see.
That means overloading grapheme_stri* does not make sense, so we would go with new functions (grapheme_strt*), right?
I'm interested in that idea.

Just a moment, please.

4 participants