String.slice is broken on strings with multi-byte Unicode characters #977

billstclair · 2018-08-28T22:53:48Z

The JavaScript implementing String.slice is not Unicode-enabled. It uses .slice, which is a byte-array operator.

SSCCE: https://ellie-app.com/3bs5VffDGP2a1

module Main exposing (main)

import Browser
import Html exposing (Html, text)

view : () -> Html ()
view _ =
    text <| String.dropLeft 1 "🙈🙉🙊"

main : Program () () ()
main =
    Browser.sandbox
        { init = ()
        , view = view
        , update = \_ m -> m
        }

This displays "�🙉🙊", not "🙉🙊", as expected.

A fix that works at https://github.com/elm/core/blob/1.0.0/src/Elm/Kernel/String.js#L151 is:

var _String_slice = F3(function(start, end, str) {
    return Array.from(str).slice(start, end).join('');
});

Note that I haven't actually tested that against an Elm program, but I did test the returned expression in a JS debugger. It has the problem that it makes a full copy of the string into an array. There is probably a way to do this without allocating any memory except the returned string.

String.left, String.right, String.dropLeft, and String.dropRight all call String.slice, so they're all broken.

This bug was also in 0.18 (https://github.com/elm-lang/core/blob/5.1.1/src/Native/String.js#L89)

Thanks to @stephenreddek in Elm Slack for finding this.

The text was updated successfully, but these errors were encountered:

stephenreddek · 2018-08-29T14:09:21Z

This affects slice, left, right, dropLeft, dropRight.

While 0.18 also used slice in the same manner, I believe that it is important to treat surrogate pairs consistently between all String functions. I ran into this issue when upgrading teepark/elmoji. It uses uncons to move through a list and look for matches to known emojis. Later, it calls dropLeft with the number of characters that it counted in the emoji match. This now causes an issue because uncons will see "🖖" as 1 character and dropLeft will treat it like 2.

For now, there are multiple Elm-only fixes such as using uncons n times or using List.drop after calling String.toList and finally using String.fromList.

EDIT:
Regex.Match indexes are also unaware of surrogate pairs. It matches String.slice, but would not match uncons.

Addresses: elm/core#977

billstclair · 2018-10-15T11:02:26Z

I am no longer convinced that this is a bug. Nothing forces a String to contain well-formed UTF-8. It's just a sequence of bytes. I use it that way in CustomElement.FileListener in billstclair/elm-custom-element, and am glad for that feature.

The String documentation says that "Strings are not lists of characters." This is pretty obvious from the String.fromList example (from an elm repl transcript):

> String.length <| String.fromList ['🙈','🙉','🙊']
6 : Int

dynajoe · 2018-10-19T03:37:51Z

I think the choice here is up to the language design. I think one of the important mentions is that operations on strings is not consistent. Unfortunately, the behavior as it is seems to bind Elm to the underlying JavaScript nuances with string handling.

It’s unfortunate that one might need to use a string type to represent bytes. Perhaps the work going on with elm in the binary space can allow strings to represent Unicode characters rather than 8 bit chunks.

norpan · 2018-10-19T05:56:39Z

I think it's worthwhile to look at the Elixir implementation of strings. This conference talk is pretty enlightening: https://youtu.be/zZxBL-lV9uA

billstclair · 2018-10-19T09:45:57Z

Or, for those who prefer text: https://elixir-lang.org/getting-started/binaries-strings-and-char-lists.html

In Lisp, a string is a sequence of characters. A character is one unicode code point. Representations vary, but the lisp I use stores 64 bits per character by default. Lots of wasted space, but in these days of multi-gigabyte RAM, I haven't noticed a problem.

I'm now thinking that this is best fixed by documentation, just letting people know that a string is a sequence of bytes, and that those bytes are interpreted as UTF-8 when read or printed, but that other operations work just on the bytes.

malaire · 2018-10-19T11:55:08Z

One important thing to note is that often it's not enough to just consider each Unicode character as a separate character, but you need to include any combining characters following the base character and consider those as a single unit.

For example a string with "a" (U+0061) followed by "`" (U+0300) is shown as single character "à" while it is actually two Unicode characters and three bytes in UTF-8 format.

EDIT: That article about Elixir strings doesn't seem to consider this at all.

malaire · 2018-10-19T12:37:35Z

More information here: http://unicode.org/faq/char_combmark.html

Computing the length or position of a "character" in a Unicode string can be a little complicated, as there are four different approaches to doing so, plus the potential confusion caused by combining characters. The correct choice of which counting method to use depends on what is being counted and what the count or position is used for.

So since this is such a complex topic, I personally prefer the solution used e.g. in PHP where string is just a sequence of bytes and nothing more. (Now PHP does have additional functions for handling multibyte characters, and Elm could also add either new type or new functions for some use cases, but I think that base string should be just bytes.)

malaire · 2019-10-09T08:04:57Z

Just a note: Elm/JavaScript strings are not byte arrays as said in some previous comments but arrays of UTF-16 code units.

stephenreddek added a commit to stephenreddek/elm-emoji that referenced this issue Aug 29, 2018

Switch from the core dropLeft to safely handle surrogate pairs

0fd4879

Addresses: elm/core#977

This was referenced Nov 6, 2019

Invalid or incomplete documentation in many String functions (characters vs UTF-16 code units) #1061

Open

Cannot handle special characters (em dash, emoji, etc.) jxxcarlson/elm-tar#3

Closed

evancz added the breaking would require a MAJOR release label Feb 9, 2021

decioferreira mentioned this issue Sep 15, 2024

fix getCharWidth guida-lang/compiler#14

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

String.slice is broken on strings with multi-byte Unicode characters #977

String.slice is broken on strings with multi-byte Unicode characters #977

billstclair commented Aug 28, 2018 •

edited

Loading

stephenreddek commented Aug 29, 2018 •

edited

Loading

billstclair commented Oct 15, 2018 •

edited

Loading

dynajoe commented Oct 19, 2018

norpan commented Oct 19, 2018

billstclair commented Oct 19, 2018

malaire commented Oct 19, 2018 •

edited

Loading

malaire commented Oct 19, 2018

malaire commented Oct 9, 2019

String.slice is broken on strings with multi-byte Unicode characters #977

String.slice is broken on strings with multi-byte Unicode characters #977

Comments

billstclair commented Aug 28, 2018 • edited Loading

stephenreddek commented Aug 29, 2018 • edited Loading

billstclair commented Oct 15, 2018 • edited Loading

dynajoe commented Oct 19, 2018

norpan commented Oct 19, 2018

billstclair commented Oct 19, 2018

malaire commented Oct 19, 2018 • edited Loading

malaire commented Oct 19, 2018

malaire commented Oct 9, 2019

billstclair commented Aug 28, 2018 •

edited

Loading

stephenreddek commented Aug 29, 2018 •

edited

Loading

billstclair commented Oct 15, 2018 •

edited

Loading

malaire commented Oct 19, 2018 •

edited

Loading