handle \u200d delimited emoji #20

Closed
lsmith77 opened this issue May 12, 2023 · 3 comments

@lsmith77

Currently 👨🏽‍👩🏽‍👧🏽 is handled as multiple tokens. Note that this is likely related to carpedm20/emoji#204

Ideally it would be handled as a single token.

@bdura added the bug label May 12, 2023
@bdura
Contributor

bdura commented May 12, 2023

Hello @lsmith77, the issue comes from the tokenizer defaults. It so happens that 👨 is one of the icons that are looked for during tokenization.

Given the root of the issue, I don't think we'll be able to do much about it in the near future... However, you could customize your tokenizer to avoid splitting the emoji.
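
For anyone wondering what such a customization could build on, here is a minimal sketch (not from the maintainers, and not spacymoji code) of a pattern that recognizes a whole `\u200d`-joined emoji sequence. The names `ZWJ_SEQUENCE` and `is_zwj_emoji` are hypothetical, and the code point ranges are a rough assumption rather than the full Unicode emoji definition.

```python
import re

ZWJ = "\u200d"  # zero width joiner

# Assumed, simplified notion of one emoji element: a pictographic code point,
# optionally followed by a skin-tone modifier and a variation selector.
_ELEMENT = (
    "[\U0001F300-\U0001FAFF\u2600-\u27BF]"  # base symbol (approximate ranges)
    "[\U0001F3FB-\U0001F3FF]?"              # optional skin-tone modifier
    "\uFE0F?"                               # optional variation selector-16
)

# Two or more elements glued together by ZWJ.
ZWJ_SEQUENCE = re.compile(f"{_ELEMENT}(?:{ZWJ}{_ELEMENT})+")


def is_zwj_emoji(text: str) -> bool:
    """Return True if the whole string is a single ZWJ-joined emoji sequence."""
    return ZWJ_SEQUENCE.fullmatch(text) is not None
```

With spaCy, a predicate like this could probably be assigned to the tokenizer's `token_match` attribute (for example `nlp.tokenizer.token_match = ZWJ_SEQUENCE.fullmatch`) so that a whitespace-delimited sequence is kept as one token. That is untested here, and it would replace any `token_match` already configured.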

@lsmith77
Author

ah bummer .. thx for the feedback though.

@adrianeboyd
Contributor

I'll close this for now because it looks like it isn't a bug in spacymoji itself.

@adrianeboyd closed this as not planned May 17, 2023