Vocab specials #230
Conversation
LGTM. I have some questions.

Not sure about `Special`s subclassing `str`. What is the main benefit of this compared to a simple `Special` class?
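One practical benefit of subclassing `str` can be sketched as follows. This is a hypothetical illustration of the idea under discussion, not Podium's actual code; the names `Special`, `UNK`, and the `token` attribute are assumptions for the sketch:

```python
# Hypothetical sketch: a special token implemented as a str subclass.
class Special(str):
    token = None  # each concrete subclass sets its string value

    def __new__(cls):
        # Construct the instance *as* its token string.
        return super().__new__(cls, cls.token)


class UNK(Special):
    token = "<UNK>"


unk = UNK()
# Because UNK is-a str, it flows through any code that expects plain
# token strings -- joins, containment checks, serialization -- with no
# wrapping or unwrapping:
tokens = ["the", "cat", unk]
joined = " ".join(tokens)
```

With a plain (non-`str`) `Special` class, every consumer of a token sequence would need to special-case the token objects or call an explicit conversion.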
podium/storage/vocab.py
Outdated
@@ -32,38 +31,121 @@ def unique(values: Iterable):
         yield element


-class VocabDict(dict):
+class Special(str):
Is the user expected to implement their own `Special`s at some point, or are all `Special`s maintained by us?
It is intended for the user to subclass specials in case of new usages (e.g., Mask is a relatively new case).
2. Adds a stub ``apply`` method which accepts a sequence of tokens and adds the special token to that sequence. In essence, the apply method is a post-tokenization hook which doesn't see the raw data and whose job is to add the special token to the sequence or to replace some of the existing tokens with the special token. The special tokens are applied after all post-tokenization hooks, in the order they are passed to the :class:`podium.storage.vocab.Vocab` constructor. Each concrete implementation of a Special token has to implement this method.
3. Implements singleton-like hash and equality checks. The ``Special`` class overrides the default hash and equals and, instead of checking for string value equality, checks for *class name equality*. We use this type of check to ensure that each Vocab has a single instance of each Special, and for simpler referencing and contains checks.

To better understand how specials work, we will walk through the implementation of one of the special tokens implemented in Podium: the beginning-of-sequence (BOS) token.
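As a rough sketch of the mechanics described above (illustrative only, not Podium's exact implementation), a BOS special combining the stub ``apply`` hook with class-name-based hash and equality might look like:

```python
# Hedged sketch of the Special base class and a BOS subclass; details
# are assumptions consistent with the description, not Podium's code.
class Special(str):
    token = None

    def __new__(cls):
        return super().__new__(cls, cls.token)

    def __hash__(self):
        # Singleton-like: hash by class name, not by string value.
        return hash(self.__class__.__name__)

    def __eq__(self, other):
        # Two instances of the same Special subclass compare equal.
        return self.__class__.__name__ == type(other).__name__

    def apply(self, sequence):
        # Post-tokenization hook: concrete specials must implement this.
        raise NotImplementedError


class BOS(Special):
    token = "<BOS>"

    def apply(self, sequence):
        # Prepend the BOS token to the tokenized sequence.
        return [self] + list(sequence)
```

With this design, `BOS().apply(["hello", "world"])` yields a sequence starting with the `<BOS>` token, and any two `BOS` instances hash and compare equal, which is what makes the single-instance-per-Vocab guarantee and contains checks straightforward.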
Do you maybe happen to know a resource listing the typical Specials used in NLP that we could link here? After a quick Google search I could not find one.
Vocabs in transformers (or tokenizers? not sure where they delegated the vocab) had quite a large number of reserved tokens.
Yes, but this is the best I could find: https://huggingface.co/transformers/main_classes/tokenizer.html#pretrainedtokenizer
docs/source/specials.rst
Outdated
@@ -0,0 +1,29 @@
Special tokens
===============

.. autoclass:: podium.storage.vocab.Special
this should (eventually) get refactored to omit `storage`, but looks good otherwise.
Yes, IDK why the docs refuse to use the shortened versions. I have an idea and might try it out soon.
@FilipBolt all comments addressed
@mttk Can you please reference the issue that this PR will close?
Looks good to me, a few minor comments left, but good to go in my mind.
podium/vocab.py
Outdated
    [self.stoi[token] if token in self.stoi else unk_token for token in data]
)
else:
    # Either UNK is not in Vocab or the user has requested unknown tokens
Makes sense, we might need to update the docs to reflect what happens when UNK is in/out of the vocab.
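The two behaviors in the diff above can be sketched as a standalone function. This is an illustrative reconstruction (the function name `numericalize` and the error handling are assumptions, not Podium's actual code):

```python
# Hypothetical sketch of the lookup logic discussed above: with UNK in the
# vocab, out-of-vocabulary tokens silently map to the UNK index; without
# UNK, an out-of-vocabulary token is an error.
def numericalize(data, stoi, unk_token=None):
    if unk_token is not None:
        # UNK present: fall back to its index for unknown tokens.
        return [stoi[token] if token in stoi else unk_token
                for token in data]
    # UNK absent: every token must be in the vocab.
    missing = [t for t in data if t not in stoi]
    if missing:
        raise KeyError(f"Tokens not in vocab and no UNK configured: {missing}")
    return [stoi[token] for token in data]
```

Documenting both branches (as suggested above) would make it clear to users why the same lookup can either degrade gracefully or raise, depending on whether UNK was configured.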
Documented.
Some outstanding things:

- Add masking functionality in wrapper of vocab -> prepped for this but left for next PR
- (deterministic arg)

Closes #216