Skip to content

<locale>: make collate::hash return the same hash for strings that collate the same #5469

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 4 commits into
base: main
Choose a base branch
from

Conversation

muellerj2
Copy link
Contributor

Fixes #5212.

At first glance, calling LCMapStringEx with LCMAP_HASH seems to be the obvious solution, but the docs state (emphasis mine):

Strings that appear equivalent typically return the same hash (for example, "hello" and "HELLO" with LCMAP_IGNORECASE). However, some complex cases, such as East Asian languages, can have similar strings with identical weights that compare as equal but do not return the same hash.

I briefly tested this and the warning seems to be true, as the following program finds some single-character strings that collate the same but yield different hashes:

#include <iostream>
#include <windows.h>

using namespace std;

int main() {
	const wchar_t* locale = L"ja-JP";
	for (wchar_t w = 1; w != 0; ++w) {
		int hashw;
		if (LCMapStringEx(locale, LCMAP_HASH, &w, 1, reinterpret_cast<LPWSTR>(&hashw), sizeof(int), nullptr, nullptr, 0) == 0) {
			continue;
		}

		for (wchar_t x = w + 1; x != 0; ++x) {
			int hashx;
			if (LCMapStringEx(locale, LCMAP_HASH, &x, 1, reinterpret_cast<LPWSTR>(&hashx), sizeof(int), nullptr, nullptr, 0) == 0) {
				continue;
			}
			if (hashw != hashx && CompareStringEx(locale, 0, &w, 1, &x, 1, nullptr, nullptr, 0) == CSTR_EQUAL) {
				cout << "found different hashes for characters collating the same at: " << int(w) << " : " << int(x) << '\n';
			}
		}
	}
}

Output:

found different hashes for characters collating the same at: 1556 : 64606
found different hashes for characters collating the same at: 1611 : 65137
found different hashes for characters collating the same at: 1614 : 65143
[...]

This means that LCMapStringEx with LCMAP_HASH is not good enough to meet the requirements in [locale.collate.virtuals]/3.

For this reason, this PR computes the sort key first (by calling do_transform) and then hashes the sort key.

@muellerj2 muellerj2 requested a review from a team as a code owner May 3, 2025 18:55
@github-project-automation github-project-automation bot moved this to Initial Review in STL Code Reviews May 3, 2025
@StephanTLavavej StephanTLavavej added the bug Something isn't working label May 3, 2025
@StephanTLavavej StephanTLavavej self-assigned this May 3, 2025
@StephanTLavavej
Copy link
Member

Thanks! 😻 I pushed a trivial nitpick and a change to guard this with SKIP_COLLATE_TRANSFORM_TESTS and a comment explaining why. Please meow if I got that wrong.

@StephanTLavavej StephanTLavavej removed their assignment May 4, 2025
@StephanTLavavej StephanTLavavej moved this from Initial Review to Ready To Merge in STL Code Reviews May 4, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
Status: Ready To Merge
Development

Successfully merging this pull request may close these issues.

<locale>: std::collate_byname<_Elem>::hash() yields different hashes for strings that collate the same
2 participants