String.slice is broken on strings with multi-byte Unicode characters #977
Comments
This affects … While 0.18 also used … For now, there are multiple Elm-only fixes, such as using … EDIT: …
I am no longer convinced that this is a bug. Nothing forces a … The …
I think the choice here is up to the language design. One important point is that operations on strings are not consistent. Unfortunately, the behavior as it stands seems to bind Elm to the underlying JavaScript nuances of string handling. It's unfortunate that one might need to use a string type to represent bytes. Perhaps the work going on with Elm in the binary space can allow strings to represent Unicode characters rather than 8-bit chunks.
I think it's worthwhile to look at the Elixir implementation of strings. This conference talk is pretty enlightening: https://youtu.be/zZxBL-lV9uA
Or, for those who prefer text: https://elixir-lang.org/getting-started/binaries-strings-and-char-lists.html

In Lisp, a string is a sequence of characters, and a character is one Unicode code point. Representations vary, but the Lisp I use stores 64 bits per character by default. Lots of wasted space, but in these days of multi-gigabyte RAM, I haven't noticed a problem.

I'm now thinking that this is best fixed by documentation: just let people know that a string is a sequence of bytes, that those bytes are interpreted as UTF-8 when read or printed, and that other operations work only on the bytes.
One important thing to note is that it's often not enough to treat each Unicode code point as a separate character; you also need to include any combining characters that follow the base character and treat the whole sequence as a single unit. For example, a string with "a" (U+0061) followed by a combining grave accent (U+0300) is shown as the single character "à", while it is actually two Unicode code points and three bytes in UTF-8.

EDIT: That article about Elixir strings doesn't seem to consider this at all.
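For illustration, here is how that looks at the JavaScript level (a small sketch runnable in any JS console; nothing Elm-specific is assumed):

```js
// "à" as a precomposed character vs. base letter plus combining grave accent.
const composed = "\u00E0";         // "à" (U+00E0): one code point
const decomposed = "\u0061\u0300"; // "a" (U+0061) + combining grave (U+0300)

console.log(composed.length);                          // 1
console.log(decomposed.length);                        // 2 (rendered as one glyph)
console.log(composed === decomposed);                  // false
console.log(composed === decomposed.normalize("NFC")); // true

// Even slicing code-point-by-code-point can strand a combining mark:
console.log(decomposed.slice(0, 1)); // "a" -- the accent is left behind
```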
More information here: http://unicode.org/faq/char_combmark.html
So since this is such a complex topic, I personally prefer the solution used e.g. in PHP, where a string is just a sequence of bytes and nothing more. (PHP does have additional functions for handling multibyte characters, and Elm could also add either a new type or new functions for some use cases, but I think the base string should be just bytes.)
Just a note: Elm/JavaScript strings are not byte arrays, as said in some previous comments, but arrays of UTF-16 code units.
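A quick JavaScript illustration of that point (each emoji below is one code point stored as a surrogate pair, i.e. two UTF-16 code units):

```js
const s = "🙈🙉🙊";

console.log(s.length);                      // 6  -- UTF-16 code units, not characters
console.log([...s].length);                 // 3  -- the string iterator walks code points
console.log(s.charCodeAt(0).toString(16));  // "d83d"  -- high surrogate of 🙈
console.log(s.codePointAt(0).toString(16)); // "1f648" -- the full code point of 🙈
console.log(s.slice(1));                    // "�🙉🙊" -- code-unit slicing splits 🙈
```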
The JavaScript implementing `String.slice` is not Unicode-enabled. It uses `.slice`, which is a byte-array operator.

SSCCE: https://ellie-app.com/3bs5VffDGP2a1

This displays `"�🙉🙊"` instead of the expected `"🙉🙊"`.

A fix that works at https://github.com/elm/core/blob/1.0.0/src/Elm/Kernel/String.js#L151 is:
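A minimal sketch of that kind of fix, assuming the kernel wrapper has the usual `F3(start, end, str)` shape (not necessarily the exact snippet referenced above), is to slice by code points by spreading the string into an array first:

```js
// Code-point-aware slice: Array.from iterates the string by code points,
// so surrogate pairs are never split. Negative indices still count from
// the end, since Array.prototype.slice handles them the same way.
var _String_slice = F3(function(start, end, str)
{
	return Array.from(str).slice(start, end).join('');
});
```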
Note that I haven't actually tested that against an Elm program, but I did test the returned expression in a JS debugger. It has the problem that it makes a full copy of the string into an array. There is probably a way to do this without allocating any memory except the returned string.
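One direction for that (a sketch with hypothetical helper names, handling only non-negative indices for brevity) is to walk the string once, translating code-point indices into code-unit offsets, and then let the native `slice` produce the result:

```js
// Translate a code-point index into a UTF-16 code-unit offset by stepping
// over surrogate pairs; no intermediate array is allocated.
function codePointOffset(str, codePoints)
{
	var offset = 0;
	while (codePoints > 0 && offset < str.length)
	{
		var unit = str.charCodeAt(offset);
		offset += (unit >= 0xD800 && unit <= 0xDBFF) ? 2 : 1; // surrogate pair?
		codePoints--;
	}
	return offset;
}

function sliceByCodePoints(str, start, end)
{
	return str.slice(codePointOffset(str, start), codePointOffset(str, end));
}

// sliceByCodePoints("🙈🙉🙊", 1, 3) === "🙉🙊"
```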
`String.left`, `String.right`, `String.dropLeft`, and `String.dropRight` all call `String.slice`, so they're all broken.

This bug was also in 0.18 (https://github.com/elm-lang/core/blob/5.1.1/src/Native/String.js#L89).
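Assuming those wrappers map onto `slice` in the usual way, with indices measured via `str.length`, the underlying JavaScript behaves like this on a three-emoji string:

```js
// Assumed index mapping: left n = slice 0 n, right n = slice -n len,
// dropLeft n = slice n len, dropRight n = slice 0 -n.
var s = "🙈🙉🙊";

console.log(s.slice(0, 1));         // String.left 1      -> "�"  (lone high surrogate)
console.log(s.slice(-1, s.length)); // String.right 1     -> "�"  (lone low surrogate)
console.log(s.slice(1, s.length));  // String.dropLeft 1  -> "�🙉🙊"
console.log(s.slice(0, -1));        // String.dropRight 1 -> "🙈🙉�"
```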
Thanks to @stephenreddek in Elm Slack for finding this.