[RFC] Add a locale for grapheme case-insensitive functions #18792

youkidearitai · 2025-06-07T06:57:34Z

Add a $locale parameter to the grapheme case-insensitive functions.
RFC: https://wiki.php.net/rfc/grapheme_add_locale_for_case_insensitive

devnexen · 2025-06-27T14:01:41Z

ext/intl/grapheme/grapheme_string.c

 		Z_PARAM_STRING(haystack, haystack_len)
 		Z_PARAM_STRING(needle, needle_len)
 		Z_PARAM_OPTIONAL
 		Z_PARAM_LONG(loffset)
+		Z_PARAM_STRING_OR_NULL(locale, locale_len)


Can you tell what happens if locale_len == 0?

Sorry, I don't understand. Could you tell me what happens?

I meant: if locale_len == 0, the locale is an empty string, so what happens down the line in that case? Should we fall back to a default locale?

Ah, in this case it calls the default locale (not the root locale, if locale is NULL).
I was wrong. Thanks.

derickr · 2025-06-30T09:41:54Z

ext/intl/grapheme/grapheme_string.c

@@ -84,6 +84,7 @@ PHP_FUNCTION(grapheme_strpos)
 	char *haystack, *needle;
 	size_t haystack_len, needle_len;
 	const char *found;
+	char *locale = "";


You should initialise these with = NULL;. An empty string literal like this one lives in the code segment, and things will not go well when you free it.

Thank you very much. Indeed, locale should be = NULL;. Therefore, the signature is wrong.

derickr · 2025-06-30T09:44:30Z

ext/intl/grapheme/grapheme_string.c

@@ -1043,7 +1061,7 @@ PHP_FUNCTION(grapheme_levenshtein)
 		RETVAL_FALSE;
 		goto out_bi2;
 	}
-	UCollator *collator = ucol_open("", &ustatus);
+	UCollator *collator = ucol_open(locale, &ustatus);


If the locale parameter is passed in as an actual NULL, wouldn't this fail?

Yes, that's right. I should change the signature in the RFC. Thanks.

heiglandreas · 2025-06-30T14:08:09Z

I have a probably stupid question, but I am trying to understand the use case:

Do you have an example where the locale is relevant to determine uppercase or lowercase characters?

From what I understand for example the lower-case letter å has a corresponding upper-case letter Å - regardless of the locale.

So what do we need a locale for if we want to find år in Århus?

Or is the idea to also find ar in Århus?

If the latter, then IMO that is faaaaar more than "just" adding a locale to the grapheme functions.

youkidearitai · 2025-06-30T15:25:45Z

@heiglandreas
Thank you very much for response.

> From what I understand for example the lower-case letter å has a corresponding upper-case letter Å - regardless of the locale.

Yes, right.

> So what do we need a locale for if we want to find år in Århus?
> Or is the idea to also find ar in Århus?

Yes, it is the latter. I have in mind something like the following:

$ sapi/cli/php -r 'var_dump(grapheme_stripos("Aarhus", "å", locale: "da"));'
int(0)
$ sapi/cli/php -r 'var_dump(grapheme_stripos("Aarhus", "å"));'
bool(false)

Thanks for giving me the example.

According to https://unicode-org.github.io/icu/userguide/transforms/casemappings.html#full-language-specific-case-mapping , the Turkish uppercase form of the small letter i is 'İ' (U+0130). This RFC supports such cases.

youkidearitai · 2025-06-30T16:08:55Z

It's midnight here. I'll fix this in the morning. Thanks.

heiglandreas · 2025-06-30T16:15:27Z

I find this highly problematic as it completely changes the way the i comparison functions work.

This is not to say that it isn't a valuable addition! But I think this should be a separate (set of) functions.

So far the i functions compare on a "binary" level ignoring the 3rd bit. So a (01100001) and A (01000001) are the same.

This is what most people expect when they pass a string to an i function.
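
The byte-level idea described above can be sketched in C roughly as follows. This is only an illustration under the assumption of ASCII input; ascii_fold and ascii_equal_nocase are hypothetical names, not PHP or ICU functions, and this is not a claim about how the grapheme_stri* functions are implemented.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Fold an ASCII letter to lowercase by setting bit 0x20.
 * Every other byte is returned unchanged. */
static unsigned char ascii_fold(unsigned char c)
{
    if ((c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z')) {
        return (unsigned char)(c | 0x20);
    }
    return c;
}

/* Compare two NUL-terminated byte strings, ignoring ASCII case only.
 * No locale and no character replacement is involved. */
static bool ascii_equal_nocase(const char *a, const char *b)
{
    assert(a != NULL && b != NULL);
    while (*a != '\0' && *b != '\0') {
        if (ascii_fold((unsigned char)*a) != ascii_fold((unsigned char)*b)) {
            return false;
        }
        a++;
        b++;
    }
    return *a == *b; /* equal only if both strings end together */
}
```

With this sketch, "Aarhus" and "aARHUS" compare equal, while "Busse" and "Bus" do not, because only the case bit of ASCII letters is ignored.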

This change here though not only compares without considering the case, it also does some character replacements based on the locale.

So it is basically doing the following under the hood:

$string = 'Aarhus';
$comparison = 'år';
$normalizedString = Transliterator::create('dk-lower')->transliterate($string);
$normalizedComparison = Transliterator::create('dk-lower')->transliterate($comparison);
grapheme_strpos($normalizedString, $normalizedComparison);

Where all I would expect is

grapheme_strpos(
	mb_strtolower('Aarhus'),
	mb_strtolower('år')
);

To give you an example why that might become really messy:

In German we have two distinct words: Busse (plural of Bus) and Buße (penance).

When comparing them with a German locale they would not be identical. Fine.

But when comparing them with, for example, a US locale they would suddenly be the same. Why? Because in languages that do not have the ß character it is often converted to ss, which then gives a totally new meaning to the word.

By nonchalantly doing more than a case-insensitive comparison suddenly bus becomes part of Buße...

Or - for the Oktoberfest fans: A Maß suddenly is included in Masse.
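
To make the concern concrete, here is a minimal C sketch of a folding step that expands ß to ss before a substring search. It assumes UTF-8 input and handles only ASCII letters and ß; fold_expand_sharp_s is a hypothetical helper written for this illustration and is not part of ICU or of this PR.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Fold UTF-8 text from `in` into `out`: ASCII letters are lowercased and
 * "ß" (bytes 0xC3 0x9F) is expanded to "ss". All other bytes are copied.
 * Returns the folded length, or (size_t)-1 if `out` is too small.
 */
static size_t fold_expand_sharp_s(const char *in, char *out, size_t out_size)
{
    assert(in != NULL && out != NULL);
    if (out_size == 0) {
        return (size_t)-1;
    }
    size_t n = 0;
    for (const unsigned char *p = (const unsigned char *)in; *p != '\0'; p++) {
        if (p[0] == 0xC3 && p[1] == 0x9F) {
            /* need room for two bytes plus the terminating NUL */
            if (n + 2 >= out_size) {
                return (size_t)-1;
            }
            out[n++] = 's';
            out[n++] = 's';
            p++; /* skip the second byte of ß */
        } else {
            /* need room for one byte plus the terminating NUL */
            if (n + 1 >= out_size) {
                return (size_t)-1;
            }
            out[n++] = (*p >= 'A' && *p <= 'Z') ? (char)(*p | 0x20) : (char)*p;
        }
    }
    out[n] = '\0';
    return n;
}
```

After folding, "Buße" becomes "busse", so a plain strstr() for "bus" succeeds, and folded "Maß" ("mass") is found inside folded "Masse". That is the kind of match the comment above objects to.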

This is a hard no from my side!

youkidearitai · 2025-06-30T20:19:31Z

@heiglandreas
Thank you for your comment.
I understand your concern a little.

Having seen your example code, I understand that this PR has many problems.

> But when comparing them with a for example US-locale they would suddenly be the same. Why? Because in languages that do not have the ß character this is often converted to ss. Which then gives a totally new meaning to the word.

I was going to rely on ICU's behavior. However, that causes many problems,
because the meaning changes.

Your information is valuable.
Thank you very much. thanks again.

This RFC has been moved back to "Under Discussion".
I'm not going to force it.

youkidearitai · 2025-07-01T02:15:10Z

Dear @heiglandreas (or anyone)
Please give me more information.

Are you satisfied with the current behavior of grapheme_stripos (not locale-sensitive) and the other case-insensitive grapheme functions?

From your comment, I think a locale is needed, because the behavior changes by region.

heiglandreas · 2025-07-01T05:41:10Z

I do see the issue, especially with the Turkish alphabet, where the capital letter "I" (U+0049) is associated with the lower-case letter "ı" (U+0131) and the capital letter "İ" (U+0130) is associated with the lower-case letter "i" (U+0069) - so a locale-agnostic comparison essentially matches the "wrong" letters to one another.
This is not only a problem with Turkish but also with some other languages from the area, like Azerbaijani, Crimean Tatar, Gagauz, Kazakh and Tatar.

TBH I am not sure this is something that can be handled by a standard locale such as "en-EN", as that is region-specific and does not, in its general form, say anything about the alphabet used.

So for example "sr-RS" (Serbian as spoken in Serbia) does not say whether to use the Cyrillic or the Latin alphabet. One would need to be explicit and use "sr-Latn-RS" to also include the alphabet. Similarly, you would need to explicitly specify whether to use the Latin, the Arabic or the Cyrillic alphabet in Azerbaijani, for example via "az-Latn-AZ" or "az-Cyrl-AZ".
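
The language-specific pairing described above can be written down as a tiny C table lookup. This is a sketch under stated assumptions: it covers only the two code points U+0049 and U+0130, treats only the language tags "tr" and "az" as Turkic, and ignores the context conditions that full Unicode case mapping applies. is_turkic and lower_letter_i are hypothetical names, not ICU functions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Assumption for this sketch: only "tr" and "az" use the dotted/dotless I rules. */
static bool is_turkic(const char *lang)
{
    assert(lang != NULL);
    return strcmp(lang, "tr") == 0 || strcmp(lang, "az") == 0;
}

/* Lowercase a single code point, handling only the letter I variants. */
static uint32_t lower_letter_i(uint32_t cp, const char *lang)
{
    if (is_turkic(lang)) {
        if (cp == 0x0049) return 0x0131; /* I -> ı (dotless i) */
        if (cp == 0x0130) return 0x0069; /* İ -> i */
        return cp;
    }
    if (cp == 0x0049) return 0x0069;     /* I -> i */
    /* Outside Turkic languages, full case mapping lowercases U+0130 to the
     * two-code-point sequence U+0069 U+0307, which a single-code-point
     * function cannot express, so it is returned unchanged here. */
    return cp;
}
```

The same input code point U+0049 yields U+0131 for "tr" but U+0069 for "en", which is why a case-insensitive match cannot be correct for both languages without knowing the language.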

The best option IMO with the current tools would be something like this:

$t = Transliterator::create('tr-Lower');

var_dump(grapheme_strpos(
	$t->transliterate('İ'),
	$t->transliterate('i')
));

But this uses an ID from ICU's Transliterator, which has only very little to do with a locale.

So from my point of view a new set of functions (named something like grapheme_strt*, for grapheme function (grapheme_) for strings (str) based on transliteration (t)) would make sense, so as not to overload the current functions with a slightly different functionality.

So in this case a grapheme_strtipos('İ', 'i', 'tr-Lower', 0) would return 0

Sidenote: mb_strtolower returns i̇ as lower-case version of İ so

grapheme_stripos('İ', 'i̇');

will output 0...

youkidearitai · 2025-07-01T10:30:02Z

@heiglandreas
Thank you. I see.
That means overloading grapheme_stri* does not make sense, so we would go with new functions (grapheme_strt*), right?
I'm interested in that idea.

Just a moment, please.

[RFC] Add a locale for grapheme case-insensitive functions

d3b2061

github-actions bot added the Extension: intl label Jun 7, 2025

Separate parameter of grapheme_strstr and grapheme_stristr

8938426

youkidearitai added the Status: Requires RFC label Jun 16, 2025

Add locale for grapheme_levenshtein function

8f2e544

devnexen reviewed Jun 27, 2025

derickr reviewed Jun 30, 2025

Fix signatures

4988e9a

minor changes

7996330
