
Span embeddings with HuggingFace #8

Open
ogarciasierra opened this issue Jun 7, 2021 · 4 comments

Comments

@ogarciasierra

Hi everyone. I was wondering whether it is possible to get the same "span contextual embeddings" with a HuggingFace model. I've been able to generate token contextual embeddings (https://discuss.huggingface.co/t/generate-raw-word-embeddings-using-transformer-models-like-bert-for-downstream-process/2958), but I cannot do it with spans. For example, in “three days ago I ate meat”, I would like to get contextual embeddings for “three days ago”, in a similar way to how Tuomo does it with spaCy in the ALT blog.
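
For reference, the token-level extraction I mean looks roughly like this; a minimal sketch, assuming bert-base-uncased and a fast tokenizer (both are illustrative choices on my side):

import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

encoded = tokenizer("three days ago I ate meat", return_tensors="pt")
with torch.no_grad():
    hidden = model(**encoded).last_hidden_state   # (1, num_wordpieces, hidden_size)

# word_ids() maps each wordpiece back to the word it came from, so the
# contextual vector for word 0 ("three") is the mean of its wordpiece vectors
word_ids = encoded.word_ids(batch_index=0)
pieces = [i for i, w in enumerate(word_ids) if w == 0]
word_vector = hidden[0, pieces].mean(dim=0)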

Thanks everyone.

@thiippal
Contributor

thiippal commented Jun 7, 2021

Hi @ogarciasierra!

Just to make sure: I haven't really looked at doing this directly with HuggingFace Transformers, so I assume that you would like to extract contextual word embeddings for spans using spaCy?

@ogarciasierra
Author

Hi @thiippal

I would like to extract contextual embeddings for spans using Hugging Face Transformers or PyTorch. The main thing is to use a Hugging Face model to generate those embeddings; I don't care which library we use for extracting them :)

Thanks!

@thiippal
Contributor

thiippal commented Jun 9, 2021

Okay @ogarciasierra, one way to do this is to follow the process here.

  1. Create the custom component for assigning Transformer features to the vector attribute of spaCy Token/Span/Doc elements (a sketch of one possible component is given after this list).
  2. Then simply take a slice of the Doc object containing the Span of interest and access the vector attribute.
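
For step 1, here is a rough sketch of one possible component, assuming spaCy 3.x with spacy-transformers, which exposes the transformer output on doc._.trf_data. The component name tensor2attr and the mean-pooling strategy are assumptions on my part, not a fixed recipe:

from spacy.language import Language

class Tensor2Attr:
    """Redirect the .vector attribute of Doc, Span and Token objects to the
    transformer output stored under doc._.trf_data."""

    def __call__(self, doc):
        doc.user_hooks["vector"] = self.doc_vector
        doc.user_span_hooks["vector"] = self.span_vector
        doc.user_token_hooks["vector"] = self.token_vector
        return doc

    def doc_vector(self, doc):
        # Treat the whole Doc as a single Span and pool over it
        return self.span_vector(doc[:])

    def span_vector(self, span):
        # Collect the wordpiece rows aligned to the tokens in the Span
        # and average them into a single contextual vector
        trf_data = span.doc._.trf_data
        hidden = trf_data.tensors[0].reshape(-1, trf_data.tensors[0].shape[-1])
        rows = trf_data.align[span.start:span.end].data.flatten()
        return hidden[rows].mean(axis=0)

    def token_vector(self, token):
        # Same idea for a single Token
        trf_data = token.doc._.trf_data
        hidden = trf_data.tensors[0].reshape(-1, trf_data.tensors[0].shape[-1])
        rows = trf_data.align[token.i].data.flatten()
        return hidden[rows].mean(axis=0)

@Language.factory("tensor2attr")
def create_tensor2attr(nlp, name):
    return Tensor2Attr()

# Register the component on a Transformer-powered pipeline, e.g.
# nlp_trf = spacy.load("en_core_web_trf")
# nlp_trf.add_pipe("tensor2attr")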

A demo, which assumes that you've created the custom component and added it to the Transformer-powered spaCy pipeline:

meat = nlp_trf("three days ago I ate meat")
left = nlp_trf("We left Finland three days ago")

meat_span = meat[0:3]    # get the Span for "three days ago" by indexing Token positions
left_span = left[3:6]    # do the same for the second Doc ("three days ago" covers token indices 3–5)

meat_span.similarity(left_span)     # calculate cosine similarity

This outputs 0.8840232, which indicates that the two Spans have similar vectors, while still incorporating information about the context in which each one occurs.

TL;DR: Just slice spaCy Docs and access the representation using the vector attribute.

@ogarciasierra
Author

Yes, I checked your code with spaCy before! But my doubt is about how to do it with a Hugging Face model and its own embeddings. Those trf_data attributes are only available for spaCy models, I'm afraid. The process is amazing with your spaCy tutorial, so I tried to do it with a pre-trained Hugging Face model (it's easy with just one token), but I wasn't able to do it with a span. Sorry to bother you again.
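
For the record, span pooling can also be done with plain Hugging Face Transformers, without spaCy, by averaging the hidden states of the wordpieces whose character offsets fall inside the span. A minimal sketch, assuming bert-base-uncased, a fast tokenizer and mean pooling, all of which are illustrative choices rather than something agreed in this thread:

import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def span_embedding(text, span_start, span_end):
    """Mean-pooled contextual vector for the character span text[span_start:span_end]."""
    encoded = tokenizer(text, return_tensors="pt", return_offsets_mapping=True)
    offsets = encoded.pop("offset_mapping")[0]          # (num_wordpieces, 2) character offsets
    with torch.no_grad():
        hidden = model(**encoded).last_hidden_state[0]  # (num_wordpieces, hidden_size)

    # Keep wordpieces that overlap the character span; special tokens have
    # zero-length (0, 0) offsets and are therefore excluded automatically
    keep = [i for i, (s, e) in enumerate(offsets.tolist())
            if s < span_end and e > span_start and s != e]
    return hidden[keep].mean(dim=0)

meat = "three days ago I ate meat"
left = "We left Finland three days ago"

vec_meat = span_embedding(meat, 0, len("three days ago"))
vec_left = span_embedding(left, left.index("three"), len(left))

# Cosine similarity between the two contextual span vectors
similarity = torch.nn.functional.cosine_similarity(vec_meat, vec_left, dim=0)

Mean pooling is just one choice; taking the first wordpiece of the span or combining several hidden layers are common alternatives.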
