Digits ('0', '1', etc.) are interpreted as emojis #280

monoclex · 2021-07-09T06:40:23Z

Calling unic::emoji::char::is_emoji('0') returns true. I'm not aware of the specifics of the Unicode standard, but I believe that '0' is not an emoji.

This test may be useful to introduce:

#[test]
fn are_nums_emojis() {
    use unic::emoji::char::is_emoji;
    assert_eq!(is_emoji('0'), false);
    assert_eq!(is_emoji('1'), false);
    assert_eq!(is_emoji('2'), false);
    assert_eq!(is_emoji('3'), false);
    assert_eq!(is_emoji('4'), false);
    assert_eq!(is_emoji('5'), false);
    assert_eq!(is_emoji('6'), false);
    assert_eq!(is_emoji('7'), false);
    assert_eq!(is_emoji('8'), false);
    assert_eq!(is_emoji('9'), false);
}


eyeplum · 2021-07-12T01:15:12Z

It looks like there might be a bug in how https://github.com/open-i18n/rust-unic/blob/master/unic/emoji/char/tables/emoji.rsv#L6 is generated, so that U+0030..U+0039 were mistakenly extracted from the original UCD file (https://github.com/open-i18n/data-unicode-ucd/blob/master/data/EmojiSources.txt).

I believe U+0030..U+0039 are only considered emojis when they are followed by U+20E3.
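
To make the "followed by U+20E3" case concrete, here is a minimal sketch of a keycap-sequence check. The function name `is_keycap_sequence` is hypothetical and not part of unic. It assumes the usual keycap shape: a base of `0`-`9`, `#` or `*`, an optional U+FE0F variation selector, then U+20E3.

```rust
/// Returns true if `s` is exactly one keycap sequence: a keycap base
/// (`0`-`9`, `#` or `*`), an optional U+FE0F variation selector, and
/// then U+20E3 COMBINING ENCLOSING KEYCAP.
/// Hypothetical helper for illustration; not part of unic.
fn is_keycap_sequence(s: &str) -> bool {
    let mut chars = s.chars();
    // The first character must be one of the twelve keycap bases.
    match chars.next() {
        Some(c) if c.is_ascii_digit() || c == '#' || c == '*' => {}
        _ => return false,
    }
    // U+FE0F may or may not be present, so skip it if it is there.
    let mut next = chars.next();
    if next == Some('\u{FE0F}') {
        next = chars.next();
    }
    // The sequence must end with U+20E3 and nothing may follow it.
    next == Some('\u{20E3}') && chars.next().is_none()
}
```

With this, a bare `"0"` is rejected, while `"0\u{FE0F}\u{20E3}"` and `"#\u{20E3}"` are accepted.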

eyeplum · 2021-07-12T01:20:24Z

I believe the cause is that https://github.com/open-i18n/rust-unic/blob/master/gen/src/source/emoji/emoji_data.rs#L54 reuses the regex for parsing other binary properties (source::ucd::BINARY_PROPERTIES_REGEX). However, the UCD file for emoji sources has a slightly different format: its code points are represented as a sequence instead of a range.
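
The range-versus-sequence distinction can be sketched as a parser that keeps the two shapes apart. This is not the code in `gen/`: the type `CodePointField` and the function `parse_code_point_field` are hypothetical, and the sketch only assumes that the first `;`-separated field is either `XXXX..YYYY` (a range) or whitespace-separated hex values (a sequence).

```rust
/// How the first field of a UCD-style data line describes code points.
/// Hypothetical type for illustration; not taken from the unic generator.
#[derive(Debug, PartialEq)]
enum CodePointField {
    /// `0030..0039`: every code point in the inclusive range.
    Range(u32, u32),
    /// `0030 20E3`: these code points, in this order, as one unit.
    /// A single code point such as `0030` is a one-element sequence.
    Sequence(Vec<u32>),
}

/// Parses the code point field (the text before the first `;`).
/// Returns `None` for blank lines, comment-only lines and malformed hex.
fn parse_code_point_field(line: &str) -> Option<CodePointField> {
    // Drop any trailing `#` comment, then keep only the first field.
    let data = line.split('#').next()?;
    let field = data.split(';').next()?.trim();
    if field.is_empty() {
        return None;
    }
    let hex = |s: &str| u32::from_str_radix(s.trim(), 16).ok();
    // A `..` separator marks a range.
    if let Some((lo, hi)) = field.split_once("..") {
        return Some(CodePointField::Range(hex(lo)?, hex(hi)?));
    }
    // Otherwise treat the field as a whitespace-separated sequence.
    let points = field
        .split_whitespace()
        .map(hex)
        .collect::<Option<Vec<u32>>>()?;
    Some(CodePointField::Sequence(points))
}
```

A generator built on this could expand only `Range` entries into per-character property tables and handle `Sequence` entries separately, so that the first code point of a sequence is not mistaken for a standalone entry.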

estebank · 2023-03-30T03:44:28Z

This might also be the reason # is considered an emoji, which caused a parsing regression when trying to recover identifiers with emoji in rustc. We're side-stepping the issue by ensuring ASCII chars are never considered emoji on our end.
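
A workaround of this kind can be sketched as a thin wrapper around the property check. This is not the actual rustc change. `is_non_ascii_emoji` and `toy_emoji_property` are hypothetical names, and the predicate parameter stands in for a lookup such as `unic::emoji::char::is_emoji`.

```rust
/// Wraps an emoji predicate so that ASCII characters are never reported
/// as emoji. `has_emoji_property` stands in for a real property lookup
/// such as `unic::emoji::char::is_emoji`.
/// Hypothetical helper for illustration.
fn is_non_ascii_emoji(c: char, has_emoji_property: impl Fn(char) -> bool) -> bool {
    // Check ASCII first, so `#`, `*` and the digits are always rejected.
    !c.is_ascii() && has_emoji_property(c)
}

/// Toy stand-in for the real lookup, covering only a few characters.
/// It mimics the behavior reported in this issue: digits, `#` and `*`
/// count as emoji, and so does U+1F600.
fn toy_emoji_property(c: char) -> bool {
    c.is_ascii_digit() || c == '#' || c == '*' || c == '\u{1F600}'
}
```

Here `is_non_ascii_emoji('#', toy_emoji_property)` is false even though the toy lookup says true for `#`, while U+1F600 is still accepted.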

thomcc · 2023-03-30T14:24:21Z

> I'm not aware of the specifics of the unicode standard, but I believe that '0' is not an emoji.

Unfortunately, the digits (along with #, *, among others) have the Emoji Unicode property, so that belief is somewhat incorrect. I don't think this is a bug so much as a non-intuitive API.
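
One way to see why the API feels non-intuitive is to put the Emoji property next to Emoji_Presentation, which says whether a character is shown as an emoji by default. The sketch below uses a small hand-copied table rather than unic. The function names are hypothetical and the table is deliberately incomplete.

```rust
/// Returns `(Emoji, Emoji_Presentation)` for a handful of characters,
/// or `None` if the character is not covered by this toy table.
/// Hand-copied values for illustration only; not a complete data set.
fn emoji_props(c: char) -> Option<(bool, bool)> {
    match c {
        // Keycap bases: Emoji=Yes, but text presentation by default.
        '0'..='9' | '#' | '*' => Some((true, false)),
        // COPYRIGHT SIGN and REGISTERED SIGN: also text by default.
        '\u{00A9}' | '\u{00AE}' => Some((true, false)),
        // GRINNING FACE: Emoji=Yes and emoji presentation by default.
        '\u{1F600}' => Some((true, true)),
        _ => None,
    }
}

/// True if the character has the Emoji property and is also displayed
/// as an emoji by default. `None` means the toy table has no entry.
fn renders_as_emoji_by_default(c: char) -> Option<bool> {
    emoji_props(c).map(|(emoji, presentation)| emoji && presentation)
}
```

Under these assumptions, a caller who wants "looks like an emoji on screen" is asking about more than the Emoji property alone, which is why a check on that property alone returns true for `'0'`.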

compiler-errors mentioned this issue Mar 30, 2023

Do not consider # an emoji in the lexer rust-lang/rust#109754

Closed

WanderingHogan mentioned this issue Nov 7, 2023

chore(emoji): Fix issue where numbers and some symbols got big_emoji class applied Satellite-im/Uplink#1469

Merged
