Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Sorting strings with Danish (and properly other languages) letters does not work #6350

Open
jjksse opened this issue Oct 4, 2024 · 13 comments

Comments

@jjksse
Copy link

jjksse commented Oct 4, 2024

Describe the bug
Using the jpa-server for sorting, the default normalized sort does not sort Danish letters correct. It should be possible to select the collation used for sorting.

To Reproduce
Resources:
{
"resourceType": "Questionnaire",
.
.
.

"version": "1.0",
"name": "fff972b2-c0c6-4dcc-9cea-c0f8ff9095af",
"title": "<to fill>",
"status": "active"

}

Making more resources replacing to fill with Danish letters z æ ø and å will result in a sorted result that is incorrect sorted.

ex. Å is represented as A which is wrong. Å is the last Danish letter

Expected behavior
Danish letters is sorted in this way Z Æ Ø Å. We do also represent Ø as OE, Æ as AE and Å as AA which means that AA should be placed together with Å

Screenshots
N/A

Environment (please complete the following information):

  • HAPI FHIR Version 6.10.5
  • OS: [e.g. iOS] linux kubernetes
  • Browser [e.g. N/A

Additional context

@jamesagnew
Copy link
Collaborator

I'm not totally clear what this ticket is asking - Certainly I know that there are users who modify the default collation on PG, and others who manually modify the collation used on columns/indexes. We don't document how to do that, as I would consider it a routine low level DBA-type operation as opposed to something HAPI specifically needs to bless.

Are you requesting documentation on how to change the collation?

@tyfoni-systematic
Copy link

@jamesagnew: You are right that collation is something you need to deal with on database level. The default collation is set when you create a database.
But our problem is that hapi-fhir will do sorting on a normalized string. The normalization in hapi-fhir will convert the danish letter Å to A (it uses NFD and then strips diacritics). This will break the danish collation.
This is probably not the case in all languages (as far as I know).
To fix the sorting, we have modified the SQL generation to use the original string (instead of the normalized one).
I would expect that any other collation would work correctly on the original string input.
The normalization can make sense if you try to make searches more "fuzzy". That is, if i search for: "René", then I also want to find: "Rene".

@jamesagnew
Copy link
Collaborator

Ah I see - So the actual request here is for the sort column to be the "exact" column and not the normalized one yes?

Or alternately, perhaps a pluggable strategy for normalization? Would it make more sense to allow for Å to not be converted to A (for example) when normalizing?

@tyfoni-systematic
Copy link

tyfoni-systematic commented Oct 8, 2024

Ah I see - So the actual request here is for the sort column to be the "exact" column and not the normalized one yes?

Yes

Or alternately, perhaps a pluggable strategy for normalization? Would it make more sense to allow for Å to not be converted to A (for example) when normalizing?

I dont know the reasoning that lies behind the current normalization (NFD + removal of diacritics), but I guess it is about getting some fuzzyness when searching. The code also allows configuration for a phonetic encoder (which we dont use). But if no normalization, or phonetic encoding, then the normalized field would just be a copy of the original.

The current indexing is also only supporting sorting on normalized values:

So when we sort on the original value, then there probably should be another index (to enable an execution plan scanning an index to do sorting).

@jamesagnew
Copy link
Collaborator

Yeah, that's mostly why I thought maybe option 2 would be a better fix?

If the normalization algorithm didn't treat Å/A as the same character, you problem would go away with a simple collation change to the index on that column (or a change to the default collation) which feels like a smaller change perhaps? But of course this would also mean that searches for ?name=A would no longer match names beginning with Å. I don't know where this would be desirable or not.

The alternative is to add configuration to have the sort use the exact column. This means that the search for ?name=A would still work, but would mean:

  • Users would need to define their own index to support this sort - I don't see us adding the index to support this automatically since it would impact DB space pretty significantly but would be rarely used
  • Sort would become case sensitive, which might or might not matter

@tyfoni-systematic
Copy link

tyfoni-systematic commented Oct 9, 2024

Changing the collation on the SP_VALUE_NORMALIZED would not make the sorting correct in our locale. I can understand why the normalization strips diacritics, but in the case with the danish Å becoming A, the sorting will not be correct according to the danish rules. And things are actually a bit more complicated. In danish Å=Aa, eg. the name of my city is Aarhus (previously it was Århus, but it was decided to change it to an older form). And in danish, Å is sorted last, and Aa must also be sorted last. This is handled by the da_DK collation. The normalization for searching should then convert Å to AA, because Å and A is not the same. I admit it is weird.

And, to be honest, we did not take proper care of changing collation when we created our databases. So they have the default en_US.UTF-8 collation, and it is not realistic for us to change it (because it would require export/import).
So, we had to make a workaround, and have done it by explicitly specifying the collation in the ORDER BY clause generated by hapi-fhir. We do this with a custom change in our fork.
And, if we decide that we need an index on SP_VALUE_EXACT, it would also need to be a custom change, because the index must be created with a COLLATE clause to get the behaviour expected in our locale (da_DK).

@jamesagnew
Copy link
Collaborator

Ok - so changing the normalization rules is out then.

I'm still not really clear on what this ticket is requesting then. We could certainly add configuration in HAPI so that it uses the exact column instead of the normalized column for string sorts. I think the expectation on someone using that config is always going to be that they have to create a custom index if they want to use it - there would be no point in us creating an index there just to support configuration that most people would never use (therefore adding a bunch of wasted index space)

@tyfoni-systematic
Copy link

I'm still not really clear on what this ticket is requesting then. We could certainly add configuration in HAPI so that it uses the exact column instead of the normalized column for string sorts. I think the expectation on someone using that config is always going to be that they have to create a custom index if they want to use it - there would be no point in us creating an index there just to support configuration that most people would never use (therefore adding a bunch of wasted index space)

The issue is that we believe that sorting on the normalized string is a defect. Sorting should be done on the original string.
Our solution to the problem also covers the fact, that we have set a wrong default collation, which we also handle with a custom change on our fork.
But for those that care about sorting, it might be worth noting this defect, if they have similar issues (for other locales)

@jamesagnew
Copy link
Collaborator

So presumably then you are advocating for string sorts to just always sort on the exact column?

Sorting on the normalized string produces undesired results for the Danish language - that much you've certainly convinced me of, so I can see how the behaviour would be beneficial for you.

Sorting on the normalized string will produce more reasonable results than sorting on exact with whatever default collation is in place in many other locales though. And modifying the database collation is an advanced concept that I suspect many users of HAPI wouldn't know how to do. I feel like what you're proposing would lead to many less experienced users to not understand why their sort is suddenly producing a, b, c, A, B, C instead of something more expected like a, A, b, B, c, C. No?

@jamesagnew
Copy link
Collaborator

We had some discussion internally on this today. My proposed solution here would be:

  • Add a setting that can be used to optionally switch the sort to use the exact column instead of the normalized one
  • Document that this setting requires the user to manually create an index with a collation they consider appropriate to support the sort
  • Add a new pointcut that allows for manual manipulation of generated search SQL statements, so that you could inject your custom collation into the search SQL if you want

@tyfoni-systematic
Copy link

So presumably then you are advocating for string sorts to just always sort on the exact column?

Yes. Collation support does not require normalization (but it may depend on the database).

Sorting on the normalized string will produce more reasonable results than sorting on exact with whatever default collation is in place in many other locales though. And modifying the database collation is an advanced concept that I suspect many users of HAPI wouldn't know how to do. I feel like what you're proposing would lead to many less experienced users to not understand why their sort is suddenly producing a, b, c, A, B, C instead of something more expected like a, A, b, B, c, C. No?

Again, it does not require normalization. Collation can be chosen to produce the preferred behaviour. Also for case insensitivity.
For postgreSQL see: https://www.postgresql.org/docs/current/collation.html#:~:text=23.2.2.4.%20nondeterministic%20collations%20

@jamesagnew
Copy link
Collaborator

Sorting on normalization by default produces "good enough" behaviour which works consistently, in a default configuration, with no special database settings or control required, across all of the database platforms we support as well as potentially others that people are trying out because hibernate supports them.

I'm sorry, you are arguing that sorting on the exact can be made to be a better option for your use case, and I don't disagree. You are missing the point though about why we don't want to switch our default behaviour.

I have suggested a change in behaviour that would get you what you need. It sounds like this does not meet your needs, so this ticket may have to be a WONTFIX.

@tyfoni-systematic
Copy link

I'm sorry, you are arguing that sorting on the exact can be made to be a better option for your use case, and I don't disagree. You are missing the point though about why we don't want to switch our default behaviour.

You are right. I was arguing on the basis of what is supported by PostgreSQL. I can understand why you have chosen the existing solution.

I have suggested a change in behaviour that would get you what you need. It sounds like this does not meet your needs, so this ticket may have to be a WONTFIX.

We would be fine with the solution you suggest. It would reduce the stuff we currently have to change in our fork to make things work in our context.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants