Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

String.slice is broken on strings with multi-byte Unicode characters #977

Open
billstclair opened this issue Aug 28, 2018 · 8 comments
Open
Labels
breaking would require a MAJOR release

Comments

@billstclair
Copy link

billstclair commented Aug 28, 2018

The JavaScript implementing String.slice is not Unicode-enabled. It uses .slice, which is a byte-array operator.

SSCCE: https://ellie-app.com/3bs5VffDGP2a1

module Main exposing (main)

import Browser
import Html exposing (Html, text)

view : () -> Html ()
view _ =
    text <| String.dropLeft 1 "🙈🙉🙊"

main : Program () () ()
main =
    Browser.sandbox
        { init = ()
        , view = view
        , update = \_ m -> m
        }

This displays "�🙉🙊", not "🙉🙊", as expected.

A fix that works at https://github.com/elm/core/blob/1.0.0/src/Elm/Kernel/String.js#L151 is:

var _String_slice = F3(function(start, end, str) {
    return Array.from(str).slice(start, end).join('');
});

Note that I haven't actually tested that against an Elm program, but I did test the returned expression in a JS debugger. It has the problem that it makes a full copy of the string into an array. There is probably a way to do this without allocating any memory except the returned string.

String.left, String.right, String.dropLeft, and String.dropRight all call String.slice, so they're all broken.

This bug was also in 0.18 (https://github.com/elm-lang/core/blob/5.1.1/src/Native/String.js#L89)

Thanks to @stephenreddek in Elm Slack for finding this.

@stephenreddek
Copy link

stephenreddek commented Aug 29, 2018

This affects slice, left, right, dropLeft, dropRight.

While 0.18 also used slice in the same manner, I believe that it is important to treat surrogate pairs consistently between all String functions. I ran into this issue when upgrading teepark/elmoji. It uses uncons to move through a list and look for matches to known emojis. Later, it calls dropLeft with the number of characters that it counted in the emoji match. This now causes an issue because uncons will see "🖖" as 1 character and dropLeft will treat it like 2.

For now, there are multiple Elm-only fixes such as using uncons n times or using List.drop after calling String.toList and finally using String.fromList.

EDIT:
Regex.Match indexes are also unaware of surrogate pairs. It matches String.slice, but would not match uncons.

stephenreddek added a commit to stephenreddek/elm-emoji that referenced this issue Aug 29, 2018
@billstclair
Copy link
Author

billstclair commented Oct 15, 2018

I am no longer convinced that this is a bug. Nothing forces a String to contain well-formed UTF-8. It's just a sequence of bytes. I use it that way in CustomElement.FileListener in billstclair/elm-custom-element, and am glad for that feature.

The String documentation says that "Strings are not lists of characters." This is pretty obvious from the String.fromList example (from an elm repl transcript):

> String.length <| String.fromList ['🙈','🙉','🙊']
6 : Int

@dynajoe
Copy link

dynajoe commented Oct 19, 2018

I think the choice here is up to the language design. I think one of the important mentions is that operations on strings is not consistent. Unfortunately, the behavior as it is seems to bind Elm to the underlying JavaScript nuances with string handling.

It’s unfortunate that one might need to use a string type to represent bytes. Perhaps the work going on with elm in the binary space can allow strings to represent Unicode characters rather than 8 bit chunks.

@norpan
Copy link

norpan commented Oct 19, 2018

I think it's worthwhile to look at the Elixir implementation of strings. This conference talk is pretty enlightening: https://youtu.be/zZxBL-lV9uA

@billstclair
Copy link
Author

Or, for those who prefer text: https://elixir-lang.org/getting-started/binaries-strings-and-char-lists.html

In Lisp, a string is a sequence of characters. A character is one unicode code point. Representations vary, but the lisp I use stores 64 bits per character by default. Lots of wasted space, but in these days of multi-gigabyte RAM, I haven't noticed a problem.

I'm now thinking that this is best fixed by documentation, just letting people know that a string is a sequence of bytes, and that those bytes are interpreted as UTF-8 when read or printed, but that other operations work just on the bytes.

@malaire
Copy link

malaire commented Oct 19, 2018

One important thing to note is that often it's not enough to just consider each Unicode character as a separate character, but you need to include any combining characters following the base character and consider those as a single unit.

For example a string with "a" (U+0061) followed by "`" (U+0300) is shown as single character "à" while it is actually two Unicode characters and three bytes in UTF-8 format.

EDIT: That article about Elixir strings doesn't seem to consider this at all.

@malaire
Copy link

malaire commented Oct 19, 2018

More information here: http://unicode.org/faq/char_combmark.html

Computing the length or position of a "character" in a Unicode string can be a little complicated, as there are four different approaches to doing so, plus the potential confusion caused by combining characters. The correct choice of which counting method to use depends on what is being counted and what the count or position is used for.

So since this is such a complex topic, I personally prefer the solution used e.g. in PHP where string is just a sequence of bytes and nothing more. (Now PHP does have additional functions for handling multibyte characters, and Elm could also add either new type or new functions for some use cases, but I think that base string should be just bytes.)

@malaire
Copy link

malaire commented Oct 9, 2019

Just a note: Elm/JavaScript strings are not byte arrays as said in some previous comments but arrays of UTF-16 code units.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
breaking would require a MAJOR release
Projects
None yet
Development

No branches or pull requests

6 participants