Sorting strings with Danish (and properly other languages) letters does not work #6350

jjksse · 2024-10-04T11:04:06Z

Describe the bug
Using the jpa-server for sorting, the default normalized sort does not sort Danish letters correct. It should be possible to select the collation used for sorting.

To Reproduce
Resources:
{
"resourceType": "Questionnaire",
.
.
.

"version": "1.0",
"name": "fff972b2-c0c6-4dcc-9cea-c0f8ff9095af",
"title": "<to fill>",
"status": "active"

}

Making more resources replacing to fill with Danish letters z æ ø and å will result in a sorted result that is incorrect sorted.

ex. Å is represented as A which is wrong. Å is the last Danish letter

Expected behavior
Danish letters is sorted in this way Z Æ Ø Å. We do also represent Ø as OE, Æ as AE and Å as AA which means that AA should be placed together with Å

Screenshots
N/A

Environment (please complete the following information):

HAPI FHIR Version 6.10.5
OS: [e.g. iOS] linux kubernetes
Browser [e.g. N/A

Additional context

The text was updated successfully, but these errors were encountered:

jamesagnew · 2024-10-04T12:39:07Z

I'm not totally clear what this ticket is asking - Certainly I know that there are users who modify the default collation on PG, and others who manually modify the collation used on columns/indexes. We don't document how to do that, as I would consider it a routine low level DBA-type operation as opposed to something HAPI specifically needs to bless.

Are you requesting documentation on how to change the collation?

tyfoni-systematic · 2024-10-08T13:03:08Z

@jamesagnew: You are right that collation is something you need to deal with on database level. The default collation is set when you create a database.
But our problem is that hapi-fhir will do sorting on a normalized string. The normalization in hapi-fhir will convert the danish letter Å to A (it uses NFD and then strips diacritics). This will break the danish collation.
This is probably not the case in all languages (as far as I know).
To fix the sorting, we have modified the SQL generation to use the original string (instead of the normalized one).
I would expect that any other collation would work correctly on the original string input.
The normalization can make sense if you try to make searches more "fuzzy". That is, if i search for: "René", then I also want to find: "Rene".

jamesagnew · 2024-10-08T13:31:58Z

Ah I see - So the actual request here is for the sort column to be the "exact" column and not the normalized one yes?

Or alternately, perhaps a pluggable strategy for normalization? Would it make more sense to allow for Å to not be converted to A (for example) when normalizing?

tyfoni-systematic · 2024-10-08T13:53:03Z

Ah I see - So the actual request here is for the sort column to be the "exact" column and not the normalized one yes?

Yes

Or alternately, perhaps a pluggable strategy for normalization? Would it make more sense to allow for Å to not be converted to A (for example) when normalizing?

I dont know the reasoning that lies behind the current normalization (NFD + removal of diacritics), but I guess it is about getting some fuzzyness when searching. The code also allows configuration for a phonetic encoder (which we dont use). But if no normalization, or phonetic encoding, then the normalized field would just be a copy of the original.

The current indexing is also only supporting sorting on normalized values:

hapi-fhir/hapi-fhir-jpaserver-model/src/main/java/ca/uhn/fhir/jpa/model/entity/ResourceIndexedSearchParamString.java

Line 63 in 6f94e22

// This is used for sorting, and for :contains queries currently

So when we sort on the original value, then there probably should be another index (to enable an execution plan scanning an index to do sorting).

jamesagnew · 2024-10-08T14:43:07Z

Yeah, that's mostly why I thought maybe option 2 would be a better fix?

If the normalization algorithm didn't treat Å/A as the same character, you problem would go away with a simple collation change to the index on that column (or a change to the default collation) which feels like a smaller change perhaps? But of course this would also mean that searches for ?name=A would no longer match names beginning with Å. I don't know where this would be desirable or not.

The alternative is to add configuration to have the sort use the exact column. This means that the search for ?name=A would still work, but would mean:

Users would need to define their own index to support this sort - I don't see us adding the index to support this automatically since it would impact DB space pretty significantly but would be rarely used
Sort would become case sensitive, which might or might not matter

tyfoni-systematic · 2024-10-09T07:08:20Z

Changing the collation on the SP_VALUE_NORMALIZED would not make the sorting correct in our locale. I can understand why the normalization strips diacritics, but in the case with the danish Å becoming A, the sorting will not be correct according to the danish rules. And things are actually a bit more complicated. In danish Å=Aa, eg. the name of my city is Aarhus (previously it was Århus, but it was decided to change it to an older form). And in danish, Å is sorted last, and Aa must also be sorted last. This is handled by the da_DK collation. The normalization for searching should then convert Å to AA, because Å and A is not the same. I admit it is weird.

And, to be honest, we did not take proper care of changing collation when we created our databases. So they have the default en_US.UTF-8 collation, and it is not realistic for us to change it (because it would require export/import).
So, we had to make a workaround, and have done it by explicitly specifying the collation in the ORDER BY clause generated by hapi-fhir. We do this with a custom change in our fork.
And, if we decide that we need an index on SP_VALUE_EXACT, it would also need to be a custom change, because the index must be created with a COLLATE clause to get the behaviour expected in our locale (da_DK).

jamesagnew · 2024-10-09T13:16:12Z

Ok - so changing the normalization rules is out then.

I'm still not really clear on what this ticket is requesting then. We could certainly add configuration in HAPI so that it uses the exact column instead of the normalized column for string sorts. I think the expectation on someone using that config is always going to be that they have to create a custom index if they want to use it - there would be no point in us creating an index there just to support configuration that most people would never use (therefore adding a bunch of wasted index space)

tyfoni-systematic · 2024-10-09T13:31:00Z

I'm still not really clear on what this ticket is requesting then. We could certainly add configuration in HAPI so that it uses the exact column instead of the normalized column for string sorts. I think the expectation on someone using that config is always going to be that they have to create a custom index if they want to use it - there would be no point in us creating an index there just to support configuration that most people would never use (therefore adding a bunch of wasted index space)

The issue is that we believe that sorting on the normalized string is a defect. Sorting should be done on the original string.
Our solution to the problem also covers the fact, that we have set a wrong default collation, which we also handle with a custom change on our fork.
But for those that care about sorting, it might be worth noting this defect, if they have similar issues (for other locales)

jamesagnew · 2024-10-09T14:04:18Z

So presumably then you are advocating for string sorts to just always sort on the exact column?

Sorting on the normalized string produces undesired results for the Danish language - that much you've certainly convinced me of, so I can see how the behaviour would be beneficial for you.

Sorting on the normalized string will produce more reasonable results than sorting on exact with whatever default collation is in place in many other locales though. And modifying the database collation is an advanced concept that I suspect many users of HAPI wouldn't know how to do. I feel like what you're proposing would lead to many less experienced users to not understand why their sort is suddenly producing a, b, c, A, B, C instead of something more expected like a, A, b, B, c, C. No?

jamesagnew · 2024-10-09T20:25:44Z

We had some discussion internally on this today. My proposed solution here would be:

Add a setting that can be used to optionally switch the sort to use the exact column instead of the normalized one
Document that this setting requires the user to manually create an index with a collation they consider appropriate to support the sort
Add a new pointcut that allows for manual manipulation of generated search SQL statements, so that you could inject your custom collation into the search SQL if you want

tyfoni-systematic · 2024-10-10T10:00:28Z

So presumably then you are advocating for string sorts to just always sort on the exact column?

Yes. Collation support does not require normalization (but it may depend on the database).

Sorting on the normalized string will produce more reasonable results than sorting on exact with whatever default collation is in place in many other locales though. And modifying the database collation is an advanced concept that I suspect many users of HAPI wouldn't know how to do. I feel like what you're proposing would lead to many less experienced users to not understand why their sort is suddenly producing a, b, c, A, B, C instead of something more expected like a, A, b, B, c, C. No?

Again, it does not require normalization. Collation can be chosen to produce the preferred behaviour. Also for case insensitivity.
For postgreSQL see: https://www.postgresql.org/docs/current/collation.html#:~:text=23.2.2.4.%20nondeterministic%20collations%20

jamesagnew · 2024-10-10T12:30:44Z

Sorting on normalization by default produces "good enough" behaviour which works consistently, in a default configuration, with no special database settings or control required, across all of the database platforms we support as well as potentially others that people are trying out because hibernate supports them.

I'm sorry, you are arguing that sorting on the exact can be made to be a better option for your use case, and I don't disagree. You are missing the point though about why we don't want to switch our default behaviour.

I have suggested a change in behaviour that would get you what you need. It sounds like this does not meet your needs, so this ticket may have to be a WONTFIX.

tyfoni-systematic · 2024-10-10T13:07:06Z

I'm sorry, you are arguing that sorting on the exact can be made to be a better option for your use case, and I don't disagree. You are missing the point though about why we don't want to switch our default behaviour.

You are right. I was arguing on the basis of what is supported by PostgreSQL. I can understand why you have chosen the existing solution.

I have suggested a change in behaviour that would get you what you need. It sounds like this does not meet your needs, so this ticket may have to be a WONTFIX.

We would be fine with the solution you suggest. It would reduce the stuff we currently have to change in our fork to make things work in our context.

jjksse mentioned this issue Oct 4, 2024

FUT1-13486 changed the Hapi-fhir string order to use Postgress COLLAT… tyfoni-systematic/hapi-fhir#1

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Sorting strings with Danish (and properly other languages) letters does not work #6350

Sorting strings with Danish (and properly other languages) letters does not work #6350

jjksse commented Oct 4, 2024 •

edited

Loading

jamesagnew commented Oct 4, 2024

tyfoni-systematic commented Oct 8, 2024

jamesagnew commented Oct 8, 2024

tyfoni-systematic commented Oct 8, 2024 •

edited

Loading

jamesagnew commented Oct 8, 2024

tyfoni-systematic commented Oct 9, 2024 •

edited

Loading

jamesagnew commented Oct 9, 2024

tyfoni-systematic commented Oct 9, 2024

jamesagnew commented Oct 9, 2024

jamesagnew commented Oct 9, 2024

tyfoni-systematic commented Oct 10, 2024

jamesagnew commented Oct 10, 2024

tyfoni-systematic commented Oct 10, 2024

Sorting strings with Danish (and properly other languages) letters does not work #6350

Sorting strings with Danish (and properly other languages) letters does not work #6350

Comments

jjksse commented Oct 4, 2024 • edited Loading

jamesagnew commented Oct 4, 2024

tyfoni-systematic commented Oct 8, 2024

jamesagnew commented Oct 8, 2024

tyfoni-systematic commented Oct 8, 2024 • edited Loading

jamesagnew commented Oct 8, 2024

tyfoni-systematic commented Oct 9, 2024 • edited Loading

jamesagnew commented Oct 9, 2024

tyfoni-systematic commented Oct 9, 2024

jamesagnew commented Oct 9, 2024

jamesagnew commented Oct 9, 2024

tyfoni-systematic commented Oct 10, 2024

jamesagnew commented Oct 10, 2024

tyfoni-systematic commented Oct 10, 2024

jjksse commented Oct 4, 2024 •

edited

Loading

tyfoni-systematic commented Oct 8, 2024 •

edited

Loading

tyfoni-systematic commented Oct 9, 2024 •

edited

Loading