Digits ('0', '1', etc.) are interpreted as emojis #280

Open
monoclex opened this issue Jul 9, 2021 · 4 comments
Comments

@monoclex

monoclex commented Jul 9, 2021

Upon calling unic::emoji::char::is_emoji('0'), this library returns true. I'm not aware of the specifics of the unicode standard, but I believe that '0' is not an emoji.

This test may be useful to introduce:

#[test]
fn are_nums_emojis() {
    use unic::emoji::char::is_emoji;
    assert_eq!(is_emoji('0'), false);
    assert_eq!(is_emoji('1'), false);
    assert_eq!(is_emoji('2'), false);
    assert_eq!(is_emoji('3'), false);
    assert_eq!(is_emoji('4'), false);
    assert_eq!(is_emoji('5'), false);
    assert_eq!(is_emoji('6'), false);
    assert_eq!(is_emoji('7'), false);
    assert_eq!(is_emoji('8'), false);
    assert_eq!(is_emoji('9'), false);
}
@eyeplum
Member

eyeplum commented Jul 12, 2021

It looks like there might be a bug in generating https://github.com/open-i18n/rust-unic/blob/master/unic/emoji/char/tables/emoji.rsv#L6, such that U+0030..U+0039 were mistakenly extracted from the original UCD file (https://github.com/open-i18n/data-unicode-ucd/blob/master/data/EmojiSources.txt).

I believe U+0030..U+0039 are only considered emoji when they are followed by U+20E3.
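
To illustrate the "digit followed by U+20E3" idea, here is a minimal sketch of a keycap sequence check using only the standard library. The function name `is_keycap_sequence` is hypothetical and is not part of unic. It assumes the base may be a digit, `#`, or `*`, and that U+FE0F (variation selector-16) may optionally sit between the base and U+20E3.

```rust
/// Returns true if `s` is exactly one keycap sequence:
/// a base (`0`-`9`, `#`, or `*`), an optional U+FE0F, then U+20E3.
/// Hypothetical helper for illustration; not part of unic.
fn is_keycap_sequence(s: &str) -> bool {
    let mut chars = s.chars().peekable();

    // The base must be an ASCII digit, '#', or '*'.
    match chars.next() {
        Some(c) if c.is_ascii_digit() || c == '#' || c == '*' => {}
        _ => return false,
    }

    // U+FE0F (variation selector-16) is treated as optional here.
    if chars.peek() == Some(&'\u{FE0F}') {
        chars.next();
    }

    // The sequence must end with U+20E3, with nothing after it.
    chars.next() == Some('\u{20E3}') && chars.next().is_none()
}
```

With this check, a bare `"0"` is rejected while `"0\u{20E3}"` and `"0\u{FE0F}\u{20E3}"` are accepted, which operates on strings rather than on single `char` values.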

@eyeplum
Member

eyeplum commented Jul 12, 2021

I believe the cause is that https://github.com/open-i18n/rust-unic/blob/master/gen/src/source/emoji/emoji_data.rs#L54 reuses the regex for parsing other binary properties (source::ucd::BINARY_PROPERTIES_REGEX). However, the UCD file for emoji sources has a slightly different format (or rather, the code points are represented as a sequence instead of a range).
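
As a rough sketch of the range-versus-sequence distinction described above, the following parser treats a code point field as either an inclusive range (`XXXX..YYYY`) or a whitespace-separated sequence. The type `CodePoints` and the functions `parse_hex` and `parse_code_points` are hypothetical names for illustration; this is not the generator's actual code, and the input shapes are assumptions rather than lines copied from the data files.

```rust
/// A code point field is either an inclusive range or a sequence.
/// Hypothetical type for illustration only.
#[derive(Debug, PartialEq)]
enum CodePoints {
    Range(u32, u32),
    Sequence(Vec<u32>),
}

/// Parses one hexadecimal code point such as "0030".
fn parse_hex(s: &str) -> Option<u32> {
    u32::from_str_radix(s.trim(), 16).ok()
}

/// Parses "0030..0039" as a range and "0030 20E3" as a sequence.
/// Returns None for empty or malformed input.
fn parse_code_points(field: &str) -> Option<CodePoints> {
    let field = field.trim();

    // A ".." separator marks an inclusive range.
    if let Some((lo, hi)) = field.split_once("..") {
        return Some(CodePoints::Range(parse_hex(lo)?, parse_hex(hi)?));
    }

    // Otherwise, treat the field as one or more code points in order.
    let seq: Option<Vec<u32>> = field.split_whitespace().map(parse_hex).collect();
    match seq {
        Some(v) if !v.is_empty() => Some(CodePoints::Sequence(v)),
        _ => None,
    }
}
```

Keeping the two shapes as separate variants means a sequence such as "0030 20E3" cannot be silently expanded into standalone code points the way a range would be.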

@estebank

This might also be the reason # is considered an emoji, which caused a parsing regression when trying to recover identifiers with emoji in rustc. We're side-stepping the issue by ensuring ASCII chars are never considered emoji on our end.
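
A workaround of this kind might look like the sketch below. It is not the actual rustc code: `is_emoji_non_ascii` is a hypothetical wrapper, and `stand_in_is_emoji` is a tiny stand-in for `unic::emoji::char::is_emoji` that covers only a handful of characters so the example needs no external crate.

```rust
/// Wraps an emoji predicate so that ASCII characters are never reported
/// as emoji. Hypothetical helper; not the code used in rustc.
fn is_emoji_non_ascii(c: char, is_emoji: impl Fn(char) -> bool) -> bool {
    !c.is_ascii() && is_emoji(c)
}

/// Stand-in for `unic::emoji::char::is_emoji`, covering only '#', '*',
/// the ASCII digits, and U+1F600. A real caller would pass the library
/// function instead.
fn stand_in_is_emoji(c: char) -> bool {
    matches!(c, '#' | '*' | '0'..='9' | '\u{1F600}')
}
```

Calling `is_emoji_non_ascii('0', stand_in_is_emoji)` yields false even though the underlying predicate says true, while non-ASCII characters are passed through to the predicate unchanged.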

@thomcc

thomcc commented Mar 30, 2023

> I'm not aware of the specifics of the unicode standard, but I believe that '0' is not an emoji.

Unfortunately, they (along with # and *, among others) have the Emoji Unicode property, so this is somewhat incorrect. I don't think this is a bug so much as a non-intuitive API.
