Sorting strings with Danish (and probably other languages) letters does not work #6350
Comments
I'm not totally clear what this ticket is asking. Certainly I know that there are users who modify the default collation on PG, and others who manually modify the collation used on columns/indexes. We don't document how to do that, as I would consider it a routine low-level DBA-type operation as opposed to something HAPI specifically needs to bless. Are you requesting documentation on how to change the collation? |
@jamesagnew: You are right that collation is something you need to deal with at the database level. The default collation is set when you create a database. |
Ah I see - So the actual request here is for the sort column to be the "exact" column and not the normalized one yes? Or alternately, perhaps a pluggable strategy for normalization? Would it make more sense to allow for Å to not be converted to A (for example) when normalizing? |
Yes
I don't know the reasoning behind the current normalization (NFD + removal of diacritics), but I guess it is about getting some fuzziness when searching. The code also allows configuring a phonetic encoder (which we don't use). But with no normalization or phonetic encoding, the normalized field would just be a copy of the original. The current indexing also only supports sorting on normalized values: Line 63 in 6f94e22
So when we sort on the original value, then there probably should be another index (to enable an execution plan scanning an index to do sorting). |
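As a rough illustration of the normalization mentioned above (NFD followed by removal of diacritics), here is a minimal Python sketch. It is not HAPI's actual code: the function name `strip_diacritics` is hypothetical, and any other processing HAPI applies to the normalized value is left out.

```python
import unicodedata


def strip_diacritics(text: str) -> str:
    """Hypothetical helper: decompose to NFD, then drop combining marks."""
    decomposed = unicodedata.normalize("NFD", text)
    # unicodedata.combining() returns a non-zero class for combining marks
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


# Å decomposes to A + a combining ring, so only the A survives.
print(strip_diacritics("Århus"))
# Ø has no canonical decomposition, so it passes through unchanged.
print(strip_diacritics("Ørsted"))
```

With this kind of folding, Å and A become indistinguishable, which helps fuzzy searching but loses the information a Danish sort order needs.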
Yeah, that's mostly why I thought maybe option 2 would be a better fix? If the normalization algorithm didn't treat Å/A as the same character, your problem would go away with a simple collation change to the index on that column (or a change to the default collation), which feels like a smaller change perhaps? But of course this would also mean that searches for The alternative is to add configuration to have the sort use the exact column. This means that the search for
|
Changing the collation on SP_VALUE_NORMALIZED would not make the sorting correct in our locale. I can understand why the normalization strips diacritics, but when the Danish Å becomes A, the sorting will not be correct according to the Danish rules. And things are actually a bit more complicated. In Danish, Å = Aa; e.g. the name of my city is Aarhus (previously it was Århus, but it was decided to change it to an older form). And in Danish, Å is sorted last, and Aa must also be sorted last. This is handled by the da_DK collation. The normalization for searching should then convert Å to AA, because Å and A are not the same. I admit it is weird. And, to be honest, we did not take proper care of changing the collation when we created our databases. So they have the default en_US.UTF-8 collation, and it is not realistic for us to change it (because it would require export/import). |
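The ordering rules described in this comment can be sketched in Python. This is a simplified illustration, not the full da_DK collation: the names `DANISH_ALPHABET` and `danish_sort_key` are hypothetical, every "aa" is treated as "å" (a real collation is more careful), and characters outside the alphabet are simply ranked last.

```python
# Simplified Danish alphabet: æ, ø and å come after z, with å last.
DANISH_ALPHABET = "abcdefghijklmnopqrstuvwxyzæøå"
RANK = {ch: i for i, ch in enumerate(DANISH_ALPHABET)}


def danish_sort_key(text: str) -> list[int]:
    """Hypothetical sort key: treat 'aa' as 'å', then rank letter by letter."""
    lowered = text.lower().replace("aa", "å")
    # Characters not in the alphabet (digits, spaces, ...) are ranked after å.
    return [RANK.get(ch, len(RANK)) for ch in lowered]


names = ["Århus", "Zebra", "Aarhus", "Anders", "Østerbro", "Ærø"]
sorted_names = sorted(names, key=danish_sort_key)
print(sorted_names)
# ['Anders', 'Zebra', 'Ærø', 'Østerbro', 'Århus', 'Aarhus']
```

Århus and Aarhus get the same key here, so they end up next to each other at the end of the list, which is the behaviour the comment attributes to the da_DK collation.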
Ok - so changing the normalization rules is out then. I'm still not really clear on what this ticket is requesting then. We could certainly add configuration in HAPI so that it uses the exact column instead of the normalized column for string sorts. I think the expectation on someone using that config is always going to be that they have to create a custom index if they want to use it - there would be no point in us creating an index there just to support configuration that most people would never use (therefore adding a bunch of wasted index space) |
The issue is that we believe that sorting on the normalized string is a defect. Sorting should be done on the original string. |
So presumably then you are advocating for string sorts to just always sort on the exact column? Sorting on the normalized string produces undesired results for the Danish language; that much you've certainly convinced me of, so I can see how the behaviour would be beneficial for you. Sorting on the normalized string will produce more reasonable results than sorting on exact with whatever default collation is in place in many other locales though. And modifying the database collation is an advanced concept that I suspect many users of HAPI wouldn't know how to do. I feel like what you're proposing would lead many less experienced users to not understand why their sort is suddenly producing |
We had some discussion internally on this today. My proposed solution here would be:
|
Yes. Collation support does not require normalization (but it may depend on the database).
Again, it does not require normalization. Collation can be chosen to produce the preferred behaviour. Also for case insensitivity. |
Sorting on normalization by default produces "good enough" behaviour which works consistently, in a default configuration, with no special database settings or control required, across all of the database platforms we support as well as potentially others that people are trying out because hibernate supports them. I'm sorry, you are arguing that sorting on the exact can be made to be a better option for your use case, and I don't disagree. You are missing the point though about why we don't want to switch our default behaviour. I have suggested a change in behaviour that would get you what you need. It sounds like this does not meet your needs, so this ticket may have to be a WONTFIX. |
You are right. I was arguing on the basis of what is supported by PostgreSQL. I can understand why you have chosen the existing solution.
We would be fine with the solution you suggest. It would reduce the stuff we currently have to change in our fork to make things work in our context. |
Describe the bug
When using the jpa-server for sorting, the default normalized sort does not sort Danish letters correctly. It should be possible to select the collation used for sorting.
To Reproduce
Resources:
{
"resourceType": "Questionnaire",
.
.
.
}
Creating more such resources, filled in with the Danish letters z, æ, ø and å, produces a result that is sorted incorrectly.
For example, Å is represented as A, which is wrong: Å is the last letter of the Danish alphabet.
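The effect can be reproduced outside the server with a small Python sketch. It is only an approximation of the setup: `folded_key` is a hypothetical stand-in for the normalized column, the sample titles are made up, and Python's code-point string comparison stands in for whatever collation the database applies.

```python
import unicodedata


def folded_key(text: str) -> str:
    """Hypothetical stand-in for the normalized value: NFD, minus combining marks."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


titles = ["Zebra", "Århus", "Ærø", "Østerbro"]  # made-up sample values
by_folded = sorted(titles, key=folded_key)
print(by_folded)
# ['Århus', 'Zebra', 'Ærø', 'Østerbro']
```

Because Å is folded to A, Århus is placed first instead of last.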
Expected behavior
Danish letters are sorted in this order: Z, Æ, Ø, Å. We also represent Ø as OE, Æ as AE and Å as AA, which means that AA should be sorted together with Å.
Screenshots
N/A
Environment (please complete the following information):
Additional context