
Slow to load #1

Closed · ohenrik opened this issue Feb 28, 2018 · 5 comments

@ohenrik commented Feb 28, 2018

I’m currently using this in my project, but it seems to take about 10-20 seconds to instantiate this class. Is that normal, or might there be something wrong with my installation?

nlp = spacy.load('en')
emoji = Emoji(nlp)

^ Takes about 21 seconds
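Measured roughly like this (a minimal timing sketch, assuming spacymoji is installed; numbers will of course vary by machine):

import time

import spacy
from spacymoji import Emoji

nlp = spacy.load('en')

start = time.perf_counter()
emoji = Emoji(nlp)  # the slow part
print(f'Emoji init took {time.perf_counter() - start:.1f}s')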

@kororo commented Jul 7, 2018

Hi @ohenrik,

I have the same issue, and after investigating, the problem comes down to a single line in https://github.com/ines/spacymoji/blob/6bc4e44545d50d642baaba8b5ad36a050e1eb221/spacymoji/__init__.py#L59

Parsing the EMOJI list takes time: the emoji package defines roughly 3k entries. Feeding them through the en language model makes it worse; with a bare pipeline like nlp = Language(Vocab()) they parse quickly. Here are my changes, adapted from the original Emoji component.

I hope the maintainers will remove the PhraseMatcher; it is not efficient to have in the pipeline. An emoji is always one token, so a plain string match, as in the get_emoji_desc function below, is enough. I also changed it so the description is cached when emoji_desc is accessed multiple times.

import spacy
from spacy.tokens import Doc, Span, Token
from emoji import UNICODE_EMOJI

# make sure multi-character emoji don't contain whitespace
EMOJI = {e.replace(' ', ''): t for e, t in UNICODE_EMOJI.items()}

class Emoji2(object):
    name = 'emoji2'

    def __init__(self, nlp=None, merge_spans=True, lookup=None, pattern_id='EMOJI',
                 attrs=('has_emoji', 'is_emoji', 'emoji_desc', 'emoji')):
        self._has_emoji, self._is_emoji, self._emoji_desc, self._emoji = attrs
        self.merge_spans = merge_spans
        self.lookup = lookup or {}  # optional custom descriptions, keyed by emoji
        # Add attributes
        Doc.set_extension(self._has_emoji, getter=self.has_emoji)
        Doc.set_extension(self._emoji, getter=self.iter_emoji)
        Span.set_extension(self._has_emoji, getter=self.has_emoji)
        Span.set_extension(self._emoji, getter=self.iter_emoji)
        Token.set_extension(self._is_emoji, default=False)
        Token.set_extension(self._emoji_desc, default=None)

    def __call__(self, doc):
        for token in doc:
            emoji_desc = self.get_emoji_desc(token)
            if emoji_desc:
                token._.set(self._is_emoji, True)
                token._.set(self._emoji_desc, emoji_desc)
        return doc

    def has_emoji(self, tokens):
        return any(token._.get(self._is_emoji) for token in tokens)

    def iter_emoji(self, tokens):
        return [(t.text, i, t._.get(self._emoji_desc))
                for i, t in enumerate(tokens)
                if t._.get(self._is_emoji)]

    def get_emoji_desc(self, token):
        if token.text in self.lookup:
            return self.lookup[token.text]
        if token.text in EMOJI:
            desc = EMOJI[token.text]
            # Here we're converting shortcodes, e.g. ":man_getting_haircut:"
            return desc.replace('_', ' ').replace(':', '')
        return None

nlp = spacy.load('en_core_web_sm')
emoji = Emoji2(nlp)
nlp.add_pipe(emoji, first=True)
print(nlp('🎀 🐈')._.has_emoji)

@buhrmann (Contributor) commented Nov 7, 2018

@kororo, I don't think that's quite correct.

Firstly, as Ines explains in the README, some emoji consist of multiple characters, and spaCy doesn't always tokenize them into a single token by default. For example:

import spacy
from emoji import UNICODE_EMOJI

EMOJI = {e.replace(' ', ''): t for e, t in UNICODE_EMOJI.items()}
nlp = spacy.load("en_core_web_sm")

long_emoji = [e for e in EMOJI.keys() if len(e) > 4] 
txt = f"This is one emoji: {long_emoji[0]} and this another: {long_emoji[1]}."
print(f"Original text:\n'{txt}'")

doc = nlp(txt)
print("\nTokens:")
for t in doc: print(f"'{t}'")

will produce this:

Original text:
'This is one emoji: 🏴󠁧󠁢󠁥󠁮󠁧󠁿 and this another: 🏴󠁧󠁢󠁳󠁣󠁴󠁿.'

Tokens:
'This'
'is'
'one'
'emoji'
':'
'🏴'
'󠁧󠁢󠁥󠁮󠁧󠁿'
'and'
'this'
'another'
':'
'🏴'
'󠁧󠁢󠁳󠁣󠁴󠁿.'

So I think Ines's way of merging is the correct one. However, I found that a simple change from

emoji_patterns1 = [nlp(emoji) for emoji in EMOJI.keys()]

to

emoji_patterns2 = list(nlp.pipe(EMOJI.keys()))

reduces the execution time on my laptop from ~25s to <4s, without changing the result:

def docs_equal(d1, d2):
    attribs = ["ORTH", "LEMMA", "POS", "TAG"]
    return (d1.to_array(attribs) == d2.to_array(attribs)).all()

all([docs_equal(emoji_patterns1[i], emoji_patterns2[i]) for i in range(len(emoji_patterns1))])

>> True

I think in principle it could be sped up even more by not executing the whole pipeline, e.g. using nlp.tokenizer.pipe(EMOJI.keys()). I haven't tried that yet, but I don't think information from the tagger, parser etc. is needed for the PhraseMatcher to work correctly here.
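Something like the following would compare the three variants (just a sketch, untested as written; the tokenizer-only timing is the part I haven't measured yet):

import time

import spacy
from emoji import UNICODE_EMOJI

EMOJI = {e.replace(' ', ''): t for e, t in UNICODE_EMOJI.items()}
nlp = spacy.load("en_core_web_sm")

def timed(label, fn):
    # run fn once and report wall-clock time
    start = time.perf_counter()
    result = fn()
    print(f"{label}: {time.perf_counter() - start:.2f}s")
    return result

# full pipeline, one doc per emoji (the current spacymoji behaviour)
timed("nlp per string", lambda: [nlp(e) for e in EMOJI.keys()])
# full pipeline, batched
timed("nlp.pipe", lambda: list(nlp.pipe(EMOJI.keys())))
# tokenizer only -- should be enough for the PhraseMatcher
timed("tokenizer.pipe", lambda: list(nlp.tokenizer.pipe(EMOJI.keys())))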

@buhrmann (Contributor) commented Nov 7, 2018

I just confirmed that Ines's test:

doc = nlp(u"This is a test 😻 👍🏿")
assert doc._.has_emoji == True
assert doc[2:5]._.has_emoji == True
assert doc[0]._.is_emoji == False
assert doc[4]._.is_emoji == True
assert doc[5]._.emoji_desc == u'thumbs up dark skin tone'
assert len(doc._.emoji) == 2
assert doc._.emoji[1] == (u'👍🏿', 5, u'thumbs up dark skin tone')

passes with the change to

emoji_patterns = list(nlp.tokenizer.pipe(EMOJI.keys()))

which takes 155ms instead of 25s on my laptop.

If I haven't overlooked anything, this could probably be included as a change.
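In spacymoji's __init__.py the change would essentially be this one-line diff (a sketch against the line linked above):

- emoji_patterns = [nlp(emoji) for emoji in EMOJI.keys()]
+ emoji_patterns = list(nlp.tokenizer.pipe(EMOJI.keys()))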

@ines (Member) commented Nov 7, 2018

Thanks so much for looking into this – I can confirm that we need to match both single and multiple tokens for emoji, and that using nlp.pipe is definitely much better than just calling nlp on each string. No idea why I didn't do this properly (especially since this is something I recommend to others a lot) 🤦‍♀️

Would you like to submit a PR with your fix?

ines added the bug label Nov 7, 2018
@buhrmann (Contributor) commented Nov 8, 2018

OK, will do. I think the way to go then would be

emoji_patterns = list(nlp.tokenizer.pipe(EMOJI.keys()))

When I have a moment I will create the PR.

ines closed this as completed in c588f1f Nov 12, 2018