Sort order of languages in Kiwix serve / library.kiwix.org is wrong #980
Comments
For info, this is how it is done in pwa.kiwix.org, where the sort order is by local language name, and the English designation is provided in parentheses for those who may be looking for a particular language but don't know the local name or script. (The mapping isn't perfect, because it is provided by an array that I had to populate manually in places where info was missing.) |
This seems clearly buggy, though I wonder a bit why it was not detected earlier. The list should be sorted alphabetically, based on what is displayed (not on a technical ISO language code). I am not in favour of printing anything additional either. To me this should be done in the backend. |
@Jaifroid This is a complex topic (unless there is an established common way to do it).
Then languages with non-Latin scripts will come after languages with Latin scripts.
I guess that in pwa.kiwix.org you map different scripts to the Latin alphabet through some loose phonetic correspondence and treat the order of letters in the Latin alphabet as a good reference for sorting. However, the Latin alphabet lacks dedicated letters for a lot of sounds and has to use digraphs (like sh, dz, ts, etc.). If the language's self-name starts with such a sound (e.g. dz), it is counterintuitive to look it up among languages starting with the first letter of the digraph (d). In addition, the order of sounds differs between alphabets. For example, in Russian the phonetic analog of V (the Cyrillic letter В) is the third letter of the alphabet. I definitely agree that the current sort order is silly for languages with Latin-based scripts. However, the general problem can hardly be solved in a way that doesn't raise similar concerns among speakers of other languages. |
@veloman-yunkan Thanks for the explanation, which makes a lot of sense. Looking at it again, I think my solution was simply to order the list alphabetically by the ISO language codes (en, el, es) that Wikipedia uses, while making sure that same-language groups [like zh: '中文 (Chinese)', lzh: '文言 (Classical Chinese)'] appear together. This seems to give a better result than ordering by English names for languages but displaying localized names. I agree it's not perfect. |
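A minimal sketch of the ordering described above (sort by language code, but keep same-language groups together), written in Python for illustration only. It is not the pwa.kiwix.org implementation: the `LANGUAGE_GROUP` table and both function names are hypothetical, and only the zh/lzh pairing comes from the comment itself.

```python
# Hypothetical table mapping a variant code to the code of its language group.
# Only the lzh -> zh entry is taken from the discussion; a real table would
# have to be curated by hand.
LANGUAGE_GROUP = {"lzh": "zh"}


def grouped_code_key(code):
    """Sort key: group code first, then the group head, then its variants."""
    group = LANGUAGE_GROUP.get(code, code)
    return (group, code != group, code)


def sort_by_grouped_code(labels_by_code):
    """Return the display labels ordered by grouped language code."""
    ordered_codes = sorted(labels_by_code, key=grouped_code_key)
    return [labels_by_code[code] for code in ordered_codes]


# Illustrative labels: local name first, English name in parentheses.
labels = {
    "lzh": "文言 (Classical Chinese)",
    "en": "English",
    "zh": "中文 (Chinese)",
    "es": "Español (Spanish)",
    "el": "Ελληνικά (Greek)",
}

for label in sort_by_grouped_code(labels):
    print(label)
```

With a plain sort on the codes, `lzh` would land between `es` and `zh`; the extra group component in the key is what keeps the two Chinese entries adjacent.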
there is a common way: use libicu https://unicode-org.github.io/icu/userguide/collation/ |
@kelson42 I still think that the issue at hand is different - we are not talking about different locale-dependent collation methods. The challenge is to sort a list of languages that is composed of strings (language self-names) in different languages. The essence of the conflict can be illustrated by a fictional language Zxcvb - in the alphabet of that language Z is the first letter, thus for the list of languages to be intuitive for the speakers of Zxcvb that language must appear in the beginning (but that is absolutely confusing for other users of the Latin-based alphabets). |
@veloman-yunkan I understand, but there is no concrete/real evidence that sorting the language strings (each in its own language) using the collation of the UI language is going to give a wrong result, or is there? If it does, what viable alternative do you propose? To me, there is clearly no perfect solution AFAIK, but we need to be pragmatic. |
Short of allowing users to set their locale and then using a locale-specific sort algorithm (which would probably be nonsensical unless we knew names and spellings of languages in each of the many Wikipedia languages), then we have to have an approximation. The ISO-639 language codes as used by Wikipedia have a natural relationship to the ZIM archives we make. These are mostly the two-letter codes, but they occasionally use three-letter codes for more esoteric languages or variants thereof. I know we don't only do Wikipedia, but it is the most multi-lingual target we have. Unless someone has a better idea, or libicu provides an internationally accepted, global way of sorting international language names that is not specific to any one locale (and still allows giving language names in their native scripts)... |
@Jaifroid AFAIK Wikipedia doesn't use ISO codes for languages in a visible manner, except in URLs... and most users don't know how URLs work. Here you really have a tech. trope IMO. |
Another option would be to use a more sophisticated selector that makes the question of sorting less relevant. The language selector of Wikipedia is reusable AFAIK. |
That sounds interesting! |
I'm not sure if the last sentence is related to the previous one, but:
This is interesting, but I have just switched the UI to a language I don't know (it seems to be Farsi, judging from the URL: https://fa.wikipedia.org/wiki/%D8%B5%D9%81%D8%AD%D9%87%D9%94_%D8%A7%D8%B5%D9%84%DB%8C) and I'm totally lost. I cannot change the language to anything else because I don't know where to click. The only way I can change it is by editing the URL myself. And as the language can be set by a cookie (in our case), it could be really difficult to reset it.
We will have a sort order at some point, whether we choose it deliberately and adapt it as well as we can or not. And as you say, the sorting will never be perfect, if only because we are sorting for two different people at the same time: the one who speaks the language of the current UI and the one who speaks the wanted language. As we don't know the wanted language, I think it is OK to sort languages by the current UI language. |
The best solution would probably be to sort by language names as displayed in the user's selected locale, but I think that would be really complex to programme and would involve a huge matrix of at least 321 x 321 language names, only counting Wikipedia languages and not the different language codes and names used by Gutenberg, PhET, etc. Can we even get those data in a standardized form that doesn't require using an online API (since Kiwix Serve must work offline)? My imperfect/pragmatic solution was to use the Wikipedia list of languages, giving the local name and script first, and the English name in parentheses, supplemented with the language codes used for non-Wikimedia projects (Gutenberg, PhET...). And I didn't have a better sort order than the language codes. (NB: I'm not advocating for this, just documenting it in case any part of it is useful.) |
Not really. First, there are two kinds of languages:
We would only have something like 29xN languages. Second, we mostly do not need a matrix.
We already have a lot of data embedded in our binary (from ICU data to our own translations) so it should not be a problem. |
Ah, OK, so it's less complex than I thought. That's good. |
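The idea discussed above, sorting by the names shown in the current UI language and falling back where a translation is missing, could look roughly like the Python sketch below. Everything in it is an assumption for illustration: the two tables, their contents, and the function names are hypothetical, and the case-folded code point comparison is only a stand-in for a real collator (such as ICU configured for the UI locale).

```python
# Hypothetical translation table: UI language -> {language code -> name}.
NAMES_BY_UI_LANG = {
    "en": {"en": "English", "es": "Spanish", "fr": "French", "fa": "Persian"},
    "fr": {"en": "anglais", "es": "espagnol", "fr": "français"},
}

# Hypothetical table of self-names, used when no translation is available.
SELF_NAMES = {"en": "English", "es": "Español", "fr": "Français", "fa": "فارسی"}


def display_name(code, ui_lang):
    """Name of `code` in the UI language, else its self-name, else the code."""
    translated = NAMES_BY_UI_LANG.get(ui_lang, {})
    return translated.get(code) or SELF_NAMES.get(code, code)


def sorted_languages(codes, ui_lang):
    """Return (displayed name, code) pairs ordered by the displayed name."""
    pairs = [(display_name(code, ui_lang), code) for code in codes]
    # Stand-in comparison: case-folded code point order, not true collation.
    return sorted(pairs, key=lambda pair: (pair[0].casefold(), pair[1]))


for name, code in sorted_languages(["en", "es", "fr", "fa"], "fr"):
    print(code, name)
```

Because each UI language only needs its own column of names plus a shared fallback, no full N x N matrix has to be complete for the list to be sortable.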
BTW, a similar problem is present (as of writing this comment) in the language selector of the main page of Wikipedia (https://www.wikipedia.org): the language list starts with Afrikaans followed by Polski. Also, Bahasa Indonesia is between Hrvatski and Italiano. On the one hand, it looks like trying to present a long list of languages in sorted order is the wrong idea: if one wants to select a particular language, the right tool is a text box with suggestions. But if one wants to see what languages are available, they must have access to the full list; in that case, however, I don't think that the order matters (or, rather, the order is stipulated by some other criteria, e.g. count of books, count of speakers, etc.). |
Well, I'd classify placing "Polski" (in Polish or English) after "Afrikaans" as a bug. I can't see any logic to it, and it's confusing. I don't think the fact that it's not easy to find a sorting mechanism should mean that we use random sorting. While I take the point that a sorted dropdown is probably not the right approach, and an auto-complete text-box would avoid the sorting issue, I also agree that it's good to see the languages that are available. In any case, sorting by the English-language spelling of the localized language names is probably quite insulting to some nationalities, and it's plain bizarre if the list doesn't show the sort key (i.e. doesn't give the language spelling by which it is sorted). |
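As a rough illustration of the "text box with suggestions" alternative mentioned in this exchange, the Python sketch below matches what the user types against either the localized name or the English name, so the position of an entry in a sorted list stops mattering. The `suggest` function and the sample data are hypothetical and are not part of Kiwix or Wikipedia.

```python
def suggest(query, languages):
    """Return localized names whose local or English name starts with query.

    `languages` is a list of (localized name, English name) pairs.
    Matching is case-insensitive and prefix-based only.
    """
    needle = query.casefold()
    return [
        local
        for local, english in languages
        if local.casefold().startswith(needle)
        or english.casefold().startswith(needle)
    ]


# Illustrative data only.
LANGUAGES = [
    ("Español", "Spanish"),
    ("Svenska", "Swedish"),
    ("Polski", "Polish"),
    ("Deutsch", "German"),
]

print(suggest("s", LANGUAGES))    # entries reachable by typing "s"
print(suggest("ger", LANGUAGES))  # English name finds the localized entry
```

A user looking for Spanish finds "Español" by typing either "es" or "sp", which addresses the lookup problem without settling the sort order question.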
See screenshot below. The language "Español" is not sorted under "E", but (I suppose) by its English name "Spanish", i.e., it appears in the list along with other languages beginning with "S". This is unintuitive. It would be better for the list to be sorted in UTF-8 order of localized language names, with English only being used as a fallback if we don't have the localized version (I'm not sure why we wouldn't, however).
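A minimal sketch of the ordering proposed here, in Python and under stated assumptions: each entry carries an English name and, optionally, a localized name; the field names `english` and `self_name` and the function name are hypothetical, and this is not how Kiwix serve is implemented.

```python
def sort_by_localized_name(languages):
    """Order display labels by the UTF-8 byte order of the localized name.

    Each entry is a dict with an 'english' name and an optional 'self_name'.
    The English name is used only when no localized name is available.
    """
    labels = [lang.get("self_name") or lang["english"] for lang in languages]
    # Comparing UTF-8 encoded bytes is equivalent to code point order.
    return sorted(labels, key=lambda label: label.encode("utf-8"))


# Illustrative data; the entry without 'self_name' exercises the fallback.
LANGUAGE_ENTRIES = [
    {"english": "Spanish", "self_name": "Español"},
    {"english": "Swedish", "self_name": "Svenska"},
    {"english": "English", "self_name": "English"},
    {"english": "Russian", "self_name": "Русский"},
    {"english": "Italian"},
]

for label in sort_by_localized_name(LANGUAGE_ENTRIES):
    print(label)
```

With this key "Español" sorts under "E", next to "English". Names in non-Latin scripts such as "Русский" end up after all Latin-script names, which is the limitation raised elsewhere in this thread.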