
Aligning tokens with supersenses? #4

Open
victoryhb opened this issue Oct 16, 2020 · 3 comments

Comments

@victoryhb

Thank you very much for sharing the code for your excellent paper.
Pardon me for asking this newbie question: how do I align the tokens in the input sentence with the supersenses output by the model?
For example, the words in the sentence "I went to the store to buy some groceries." do not appear to be aligned with the following senses

['noun.person']
['verb.communication']
['verb.social']
['verb.communication']
['noun.artifact']
['noun.artifact']
['verb.communication']
['verb.cognition']
['noun.artifact']
['noun.artifact']
['adv.all']
['adv.all']

as printed using the following code:

import numpy as np

for i, id_ in enumerate(input_ids[0]):
    print(sensebert_model.tokenizer.convert_ids_to_senses([np.argmax(supersense_logits[0][i])]))

Could you please provide some example code for how to do this properly? Thanks a lot in advance!

@MeMartijn

@victoryhb This might be a long shot, but I was wondering whether you figured this out in the end. I also can't seem to figure out how to align the tokens.

@MeMartijn

@oriram Do you have any hints on how to align the predicted senses to words in sentences?

@oriram
Contributor

oriram commented Jul 22, 2021

Hi @MeMartijn,
There is no clear "alignment", as out-of-vocabulary words are split into multiple tokens (and can therefore have multiple supersenses).
However, you can do one of the following (a rough sketch of both is below):

  • Enumerate over input_ids and the predicted supersenses - this gives you the supersense for each token.
  • Change the tokenizer code so that it returns the index of the first token of each "word", and take that token's supersense as the word's.
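
For anyone looking for a concrete starting point, here is a rough (untested) sketch of both options. It assumes that input_ids and supersense_logits were obtained as in the question above, and that the tokenizer exposes a convert_ids_to_tokens helper alongside convert_ids_to_senses; if it doesn't, that name is an assumption you will need to adapt. Note that the special [CLS] and [SEP] tokens receive supersense predictions too, which is presumably why the snippet above prints more senses than the sentence has words.

import numpy as np

# Assumes input_ids and supersense_logits come from sensebert_model.tokenize()
# and sensebert_model.run(), as in the original question.
tokens = sensebert_model.tokenizer.convert_ids_to_tokens(input_ids[0])  # assumed helper
sense_ids = np.argmax(supersense_logits[0], axis=-1)
senses = sensebert_model.tokenizer.convert_ids_to_senses(list(sense_ids))

# Option 1: one supersense per token (word pieces and special tokens included)
for token, sense in zip(tokens, senses):
    print(token, sense)

# Option 2: merge BERT word pieces ("##...") back into whole words,
# keeping the supersense predicted for the first piece of each word
words, word_senses = [], []
for token, sense in zip(tokens, senses):
    if token.startswith("##") and words:
        words[-1] += token[2:]  # continuation piece: extend the current word
    else:
        words.append(token)
        word_senses.append(sense)
for word, sense in zip(words, word_senses):
    print(word, sense)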

Hope this helps,
Ori
