Slow to load #1
Hi @ohenrik, I have the same issue and investigated it. The problem is just one line: https://github.com/ines/spacymoji/blob/6bc4e44545d50d642baaba8b5ad36a050e1eb221/spacymoji/__init__.py#L59. It takes time to parse the EMOJI list from the emoji package, which has roughly 3k items. Feeding it through the English language model makes it worse; with a bare pipeline (`nlp = Language(Vocab())`) it parses quickly. Here are my changes, adapted from the original. I hope the maintainers will remove the PhraseMatcher, since it is not efficient to have in the pipeline. An emoji is always one token, so a plain string match like the one in `get_emoji_desc` is enough. I also changed it so the description is stored on the token, which makes reading `emoji_desc` multiple times cheap.

```python
import spacy
from spacy.tokens import Doc, Span, Token
from emoji import UNICODE_EMOJI

# make sure multi-character emoji don't contain whitespace
EMOJI = {e.replace(' ', ''): t for e, t in UNICODE_EMOJI.items()}


class Emoji2(object):
    name = 'emoji2'

    def __init__(self, nlp=None, merge_spans=True, lookup={}, pattern_id='EMOJI',
                 attrs=('has_emoji', 'is_emoji', 'emoji_desc', 'emoji')):
        self._has_emoji, self._is_emoji, self._emoji_desc, self._emoji = attrs
        self.merge_spans = merge_spans
        self.lookup = lookup
        # Add attributes
        Doc.set_extension(self._has_emoji, getter=self.has_emoji)
        Doc.set_extension(self._emoji, getter=self.iter_emoji)
        Span.set_extension(self._has_emoji, getter=self.has_emoji)
        Span.set_extension(self._emoji, getter=self.iter_emoji)
        Token.set_extension(self._is_emoji, default=False)
        Token.set_extension(self._emoji_desc, default=None)

    def __call__(self, doc):
        for token in doc:
            emoji_desc = self.get_emoji_desc(token)
            if emoji_desc:
                token._.set(self._is_emoji, True)
                token._.set(self._emoji_desc, emoji_desc)
        return doc

    def has_emoji(self, tokens):
        return any(token._.get(self._is_emoji) for token in tokens)

    def iter_emoji(self, tokens):
        return [(t.text, i, t._.get(self._emoji_desc))
                for i, t in enumerate(tokens)
                if t._.get(self._is_emoji)]

    def get_emoji_desc(self, token):
        if token.text in self.lookup:
            return self.lookup[token.text]
        if token.text in EMOJI:
            desc = EMOJI[token.text]
            # Here we're converting shortcodes, e.g. ":man_getting_haircut:"
            return desc.replace('_', ' ').replace(':', '')
        return None


nlp = spacy.load('en_core_web_sm')
emoji = Emoji2(nlp)
nlp.add_pipe(emoji, first=True)
print(nlp('🎀 🐈')._.has_emoji)
```
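As a side note, the shortcode-to-description conversion inside `get_emoji_desc` is plain string cleanup and can be checked in isolation (the standalone function name here is mine, not part of the package):

```python
def shortcode_to_desc(shortcode):
    # Same transformation as in get_emoji_desc above:
    # underscores become spaces, colons are stripped.
    return shortcode.replace('_', ' ').replace(':', '')

print(shortcode_to_desc(':man_getting_haircut:'))  # man getting haircut
```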
@kororo, I don't think that's quite correct. First, as Ines explains in the README, some emoji consist of multiple characters, and spaCy doesn't always tokenize them into a single token by default. For example:

```python
import spacy as sp
from emoji import UNICODE_EMOJI

EMOJI = {e.replace(' ', ''): t for e, t in UNICODE_EMOJI.items()}

nlp = sp.load("en_core_web_sm")
long_emoji = [e for e in EMOJI.keys() if len(e) > 4]
txt = f"This is one emoji: {long_emoji[0]} and this another: {long_emoji[1]}."
print(f"Original text:\n'{txt}'")
doc = nlp(txt)
print("\nTokens:")
for t in doc:
    print(f"'{t}'")
```

will produce output in which each long emoji is split across multiple tokens.
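The multi-token point is visible at the code-point level even without spaCy: a skin-tone emoji such as 👍🏿 is two Unicode code points (a base emoji plus a Fitzpatrick modifier), so any matcher that assumes one character or one token per emoji can miss it. A minimal stdlib check:

```python
import unicodedata

# 👍🏿 = base emoji + skin-tone modifier: two code points, not one
thumbs_dark = '\U0001F44D\U0001F3FF'  # 👍🏿
print(len(thumbs_dark))  # 2
for cp in thumbs_dark:
    print(unicodedata.name(cp))
```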
So I think Ines's way of merging is the correct one. However, I found that a simple change from

```python
emoji_patterns1 = [nlp(emoji) for emoji in EMOJI.keys()]
```

to

```python
emoji_patterns2 = list(nlp.pipe(EMOJI.keys()))
```

reduces the execution time on my laptop from ~25s to <4s, without changing the result:

```python
def docs_equal(d1, d2):
    attribs = ["ORTH", "LEMMA", "POS", "TAG"]
    return (d1.to_array(attribs) == d2.to_array(attribs)).all()

all(docs_equal(emoji_patterns1[i], emoji_patterns2[i]) for i in range(len(emoji_patterns1)))
# >> True
```

I think in principle it could be sped up even more by not executing the whole pipeline.
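The intuition behind why `nlp.pipe` is so much faster is batching: fixed per-call overhead is paid once per batch instead of once per document. A toy stand-in that demonstrates the effect (all names hypothetical, no spaCy required):

```python
def process_one(item, cost):
    cost['overhead'] += 1          # fixed setup paid on every call
    return item.upper()

def process_batch(items, cost):
    cost['overhead'] += 1          # fixed setup paid once for the whole batch
    return [item.upper() for item in items]

items = ['grinning', 'cat', 'thumbsup']
c1, c2 = {'overhead': 0}, {'overhead': 0}
out1 = [process_one(i, c1) for i in items]
out2 = process_batch(items, c2)
assert out1 == out2                    # identical results...
print(c1['overhead'], c2['overhead'])  # ...but 3 vs 1 units of overhead
```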
I just confirmed that Ines's test:

```python
doc = nlp(u"This is a test 😻 👍🏿")
assert doc._.has_emoji == True
assert doc[2:5]._.has_emoji == True
assert doc[0]._.is_emoji == False
assert doc[4]._.is_emoji == True
assert doc[5]._.emoji_desc == u'thumbs up dark skin tone'
assert len(doc._.emoji) == 2
assert doc._.emoji[1] == (u'👍🏿', 5, u'thumbs up dark skin tone')
```

passes with the change to

```python
emoji_patterns = list(nlp.tokenizer.pipe(EMOJI.keys()))
```

which takes 155ms instead of 25s on my laptop. If I haven't overlooked anything, this could probably be included as a change.
Thanks so much for looking into this – I can confirm that we need to match both single and multiple tokens for emoji, and that using `nlp.tokenizer.pipe` produces the same results. Would you like to submit a PR with your fix?
Ok, will do. I think the way to go then would be with `nlp.tokenizer.pipe`. When I have a moment I will create the PR.
I'm currently using this in my project, but it seems to take about 10–20 seconds to instantiate this class. Is that normal, or might there be something wrong with my installation?
^ Takes about 21 seconds
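To check whether the delay really comes from constructing the component rather than from `spacy.load`, the two steps could be timed separately. A generic sketch with a hypothetical helper (the usage lines assume spacymoji is installed and are not executed here):

```python
import time

def timed(label, fn, *args, **kwargs):
    # Hypothetical helper: time a single call and report it
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    print(f"{label}: {time.perf_counter() - start:.2f}s")
    return result

# Usage sketch:
# nlp = timed("spacy.load", spacy.load, "en_core_web_sm")
# emoji = timed("Emoji init", Emoji, nlp)
```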