
Slow to load #1

Closed · ohenrik opened this issue Feb 28, 2018 · 5 comments

@ohenrik commented Feb 28, 2018

I’m currently using this in my project, but it seems to take about 10-20 seconds to instantiate this class. Is that normal, or might there be something wrong with my installation?

nlp = spacy.load('en')
emoji = Emoji(nlp)

^ Takes about 21 seconds
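Measured roughly like this (a minimal timing sketch, assuming spacymoji is installed; numbers will of course vary by machine):

import time

import spacy
from spacymoji import Emoji

nlp = spacy.load('en')

start = time.perf_counter()
emoji = Emoji(nlp)  # the slow part
print(f'Emoji init took {time.perf_counter() - start:.1f}s')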

@kororo commented Jul 7, 2018

Hi @ohenrik,

I have the same issue, and after investigating, the problem comes down to a single line in https://github.com/ines/spacymoji/blob/6bc4e44545d50d642baaba8b5ad36a050e1eb221/spacymoji/__init__.py#L59

Parsing the EMOJI list takes time: the emoji package defines roughly 3k entries. Feeding them through the en language model makes it worse; with a bare pipeline like nlp = Language(Vocab()) they parse quickly. Here are my changes, adapted from the original Emoji component.

I hope the maintainers will remove the PhraseMatcher; it is not efficient to have in the pipeline. An emoji is always one token, so a plain string match, as in the get_emoji_desc function below, is enough. I also changed it so the description is cached when emoji_desc is accessed multiple times.

import spacy
from spacy.tokens import Doc, Span, Token
from emoji import UNICODE_EMOJI

# make sure multi-character emoji don't contain whitespace
EMOJI = {e.replace(' ', ''): t for e, t in UNICODE_EMOJI.items()}

class Emoji2(object):
    name = 'emoji2'

    def __init__(self, nlp=None, merge_spans=True, lookup=None, pattern_id='EMOJI',
                 attrs=('has_emoji', 'is_emoji', 'emoji_desc', 'emoji')):
        self._has_emoji, self._is_emoji, self._emoji_desc, self._emoji = attrs
        self.merge_spans = merge_spans
        self.lookup = lookup or {}  # optional custom descriptions, keyed by emoji
        # Add attributes
        Doc.set_extension(self._has_emoji, getter=self.has_emoji)
        Doc.set_extension(self._emoji, getter=self.iter_emoji)
        Span.set_extension(self._has_emoji, getter=self.has_emoji)
        Span.set_extension(self._emoji, getter=self.iter_emoji)
        Token.set_extension(self._is_emoji, default=False)
        Token.set_extension(self._emoji_desc, default=None)

    def __call__(self, doc):
        for token in doc:
            emoji_desc = self.get_emoji_desc(token)
            if emoji_desc:
                token._.set(self._is_emoji, True)
                token._.set(self._emoji_desc, emoji_desc)
        return doc

    def has_emoji(self, tokens):
        return any(token._.get(self._is_emoji) for token in tokens)

    def iter_emoji(self, tokens):
        return [(t.text, i, t._.get(self._emoji_desc))
                for i, t in enumerate(tokens)
                if t._.get(self._is_emoji)]

    def get_emoji_desc(self, token):
        if token.text in self.lookup:
            return self.lookup[token.text]
        if token.text in EMOJI:
            desc = EMOJI[token.text]
            # Here we're converting shortcodes, e.g. ":man_getting_haircut:"
            return desc.replace('_', ' ').replace(':', '')
        return None

nlp = spacy.load('en_core_web_sm')
emoji = Emoji2(nlp)
nlp.add_pipe(emoji, first=True)
print(nlp('🎀 🐈')._.has_emoji)

@buhrmann (Contributor) commented Nov 7, 2018

@kororo, I don't think that's quite correct.

Firstly, as Ines explains in the README, some emoji consist of multiple characters, and spaCy doesn't always tokenize them into a single token by default. For example:

import spacy
from emoji import UNICODE_EMOJI

EMOJI = {e.replace(' ', ''): t for e, t in UNICODE_EMOJI.items()}
nlp = spacy.load("en_core_web_sm")

long_emoji = [e for e in EMOJI.keys() if len(e) > 4] 
txt = f"This is one emoji: {long_emoji[0]} and this another: {long_emoji[1]}."
print(f"Original text:\n'{txt}'")

doc = nlp(txt)
print("\nTokens:")
for t in doc: print(f"'{t}'")

will produce this:

Original text:
'This is one emoji: 🏴󠁧󠁢󠁥󠁮󠁧󠁿 and this another: 🏴󠁧󠁢󠁳󠁣󠁴󠁿.'

Tokens:
'This'
'is'
'one'
'emoji'
':'
'🏴'
'󠁧󠁢󠁥󠁮󠁧󠁿'
'and'
'this'
'another'
':'
'🏴'
'󠁧󠁢󠁳󠁣󠁴󠁿.'

So I think Ines's way of merging is the correct one. However, I found that a simple change from

emoji_patterns1 = [nlp(emoji) for emoji in EMOJI.keys()]

to

emoji_patterns2 = list(nlp.pipe(EMOJI.keys()))

reduces the execution time on my laptop from ~25s to <4s, without changing the result:

def docs_equal(d1, d2):
    attribs = ["ORTH", "LEMMA", "POS", "TAG"]
    return (d1.to_array(attribs) == d2.to_array(attribs)).all()

all([docs_equal(emoji_patterns1[i], emoji_patterns2[i]) for i in range(len(emoji_patterns1))])

>> True

I think in principle it could be sped up even more by not executing the whole pipeline, e.g. using nlp.tokenizer.pipe(EMOJI.keys()). I haven't tried that yet, but I don't think information from the tagger, parser etc. is needed for the PhraseMatcher to work correctly here.
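Something like the following would compare the three variants (just a sketch, untested as written; the tokenizer-only timing is the part I haven't measured yet):

import time

import spacy
from emoji import UNICODE_EMOJI

EMOJI = {e.replace(' ', ''): t for e, t in UNICODE_EMOJI.items()}
nlp = spacy.load("en_core_web_sm")

def timed(label, fn):
    # run fn once and report wall-clock time
    start = time.perf_counter()
    result = fn()
    print(f"{label}: {time.perf_counter() - start:.2f}s")
    return result

# full pipeline, one doc per emoji (the current spacymoji behaviour)
timed("nlp per string", lambda: [nlp(e) for e in EMOJI.keys()])
# full pipeline, batched
timed("nlp.pipe", lambda: list(nlp.pipe(EMOJI.keys())))
# tokenizer only -- should be enough for the PhraseMatcher
timed("tokenizer.pipe", lambda: list(nlp.tokenizer.pipe(EMOJI.keys())))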

@buhrmann (Contributor) commented Nov 7, 2018

I just confirmed that Ines's test:

doc = nlp(u"This is a test 😻 👍🏿")
assert doc._.has_emoji == True
assert doc[2:5]._.has_emoji == True
assert doc[0]._.is_emoji == False
assert doc[4]._.is_emoji == True
assert doc[5]._.emoji_desc == u'thumbs up dark skin tone'
assert len(doc._.emoji) == 2
assert doc._.emoji[1] == (u'👍🏿', 5, u'thumbs up dark skin tone')

passes with the change to

emoji_patterns = list(nlp.tokenizer.pipe(EMOJI.keys()))

which takes 155ms instead of 25s on my laptop.

If I haven't overlooked anything, this could probably be included as a change.
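In spacymoji's __init__.py the change would essentially be this one-line diff (a sketch against the line linked above):

- emoji_patterns = [nlp(emoji) for emoji in EMOJI.keys()]
+ emoji_patterns = list(nlp.tokenizer.pipe(EMOJI.keys()))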

@ines (Member) commented Nov 7, 2018

Thanks so much for looking into this – I can confirm that we need to match both single and multiple tokens for emoji, and that using nlp.pipe is definitely much better than just calling nlp on each string. No idea why I didn't do this properly (especially since this is something I recommend to others a lot) 🤦‍♀️

Would you like to submit a PR with your fix?

ines added the bug label Nov 7, 2018
@buhrmann (Contributor) commented Nov 8, 2018

OK, will do. I think the way to go then would be

emoji_patterns = list(nlp.tokenizer.pipe(EMOJI.keys()))

When I have a moment I will create the PR.

ines closed this as completed in c588f1f Nov 12, 2018