Vocab specials #230

Merged · 30 commits · Jan 13, 2021
65 changes: 64 additions & 1 deletion docs/source/advanced.rst

@@ -100,7 +100,7 @@ And we're done! We can now add our hook to the text field either through the :me
Removing punctuation as a posttokenization hook
-----------------------------------------------

We will now similarly define a posttokenization hook to remove punctuation. We will use the punctuation list from Python's built-in ``string`` module, which we will store as an attribute of our hook.

.. code-block:: python

@@ -133,6 +133,69 @@ We can see that our hooks worked: the raw data was lowercased prior to tokenizat

We have prepared a number of predefined hooks which are ready for you to use. You can see them here: :ref:`predefined-hooks`.

.. _specials:

Special tokens
===============
We have mentioned special tokens earlier, but now is the time to elaborate on what exactly they are. In Podium, each special token is a subclass of the Python ``str`` type which also encapsulates the functionality for adding that special token to the tokenized sequence. The ``Vocab`` handles special tokens differently -- each special token is guaranteed a place in the ``Vocab``, which is what makes them... *special*.

Since our idea of special tokens was made to be extensible, we will take a brief look at how they are implemented, so we can better understand how to use them. We mentioned that each special token is a subclass of the Python string, but there is an intermediary -- the :class:`podium.storage.vocab.Special` base class. The ``Special`` base class implements the following functionality, while still being an instance of a string:

1. Extends the constructor of the special token with default value functionality. The default value for each special token should be set via the ``default_value`` class attribute; if another value is passed upon creation, it is used instead.
2. Adds a stub ``apply`` method which accepts a sequence of tokens and adds the special token to that sequence. In essence, the ``apply`` method is a post-tokenization hook (applied to the tokenized sequence after all other post-tokenization hooks) which doesn't see the raw data and whose job is to add the special token to the sequence or replace some of the existing tokens with the special token. The special tokens are applied in the order they are passed to the :class:`podium.storage.vocab.Vocab` constructor. Each concrete implementation of a special token has to implement this method.
3. Implements singleton-like hash and equality checks. The ``Special`` class overrides the default hash and equality and, instead of checking for string value equality, checks for *class name equality*. We use this type of check to ensure that each ``Vocab`` has a single instance of each special, and to simplify referencing and contains checks.
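
The three points above can be sketched in plain Python. This is an illustrative stand-in, not Podium's actual implementation; only ``default_value`` and ``apply`` are names taken from the text, everything else is an assumption:

```python
# Illustrative sketch of a Special-like base class; not Podium's actual code.
class Special(str):
    default_value = None

    def __new__(cls, token=None):
        # 1. Fall back to the class-level default when no value is passed
        if token is None:
            token = cls.default_value
        return super().__new__(cls, token)

    def apply(self, sequence):
        # 2. Concrete specials override this to add themselves to the sequence
        raise NotImplementedError

    def __hash__(self):
        # 3. Singleton-like behavior: identity follows the class name, not the value
        return hash(self.__class__.__name__)

    def __eq__(self, other):
        return type(self) is type(other)

class UNK(Special):
    default_value = "<UNK>"

print(UNK() == UNK("<custom_unk>"))  # True: equality ignores the string value
```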

There are a number of special tokens used throughout NLP for various purposes. The most frequently used are the unknown token (UNK), which serves as a catch-all substitute for tokens not present in the vocabulary, and the padding token (PAD), which is used to pack variable-length sequences into fixed-size batch tensors.
Alongside these two, common special tokens include the beginning-of-sequence and end-of-sequence tokens (BOS, EOS), the separator token (SEP) and the mask token introduced in BERT (MASK).
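
As a rough illustration of what PAD is for, a batching step might pad numericalized sequences to a common length before stacking them into a tensor. The function name and the ``pad_index`` of 1 below are our assumptions for the sketch, not Podium's actual layout:

```python
# Illustrative only: pad variable-length token id sequences to equal length,
# the way an iterator might use the PAD token's vocabulary index.
def pad_batch(batch, pad_index=1):
    max_len = max(len(seq) for seq in batch)
    return [seq + [pad_index] * (max_len - len(seq)) for seq in batch]

print(pad_batch([[4, 7], [4, 7, 9, 2]]))  # [[4, 7, 1, 1], [4, 7, 9, 2]]
```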

To better understand how specials work, we will walk through the implementation of one of the special tokens implemented in Podium: the beginning-of-sequence (BOS) token.

Review comment (Collaborator): Do you maybe happen to know a resource which contains typical Specials used in NLP we could link here? After a quick Google search I could not find one.

Reply (Member, author): Vocabs in transformers (or tokenizers? not sure where they delegated the vocab) had quite a large number of reserved tokens.

.. code-block:: python

>>> from podium.storage.vocab import Special
>>> class BOS(Special):
>>>     default_value = "<BOS>"
>>>
>>>     def apply(self, sequence):
>>>         # Prepend to the sequence
>>>         return [self] + sequence
>>>
>>> bos = BOS()
>>> print(bos)
<BOS>

This code block is the full implementation of a special token! All we needed to do was set the default value and implement the ``apply`` method. The default value is ``None`` unless set, in which case you have to make sure a value is passed upon construction, like so:

.. code-block:: python

>>> my_bos = BOS("<MY_BOS>")
>>> print(my_bos)
<MY_BOS>
>>> print(bos == my_bos)
True

We can also see that although we have changed the string representation of the special token, the equality check still returns ``True`` due to the changes in the ``Special`` base class mentioned earlier.

To see the effect of the ``apply`` method, we will once again take a look at the SST dataset:

.. code-block:: python

>>> from podium import Vocab, Field, LabelField
>>> from podium.datasets import SST
>>>
>>> vocab = Vocab(specials=(bos,))
>>> text = Field(name='text', numericalizer=vocab)
>>> label = LabelField(name='label')
>>> fields = {'text': text, 'label': label}
>>> sst_train, sst_test, sst_dev = SST.get_dataset_splits(fields=fields)
>>> print(sst_train[222].text)
(None, ['<BOS>', 'A', 'slick', ',', 'engrossing', 'melodrama', '.'])

We can see that the special token was indeed added to the beginning of the tokenized sequence.
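
For contrast with the prepending BOS, an appending EOS-style special follows the same pattern. The sketch below is standalone (it uses a minimal stand-in base class rather than importing from Podium), and the EOS behavior shown is our assumption of how such a token would work:

```python
# Standalone sketch: a minimal stand-in for Special plus BOS/EOS tokens,
# showing how each `apply` hook modifies the tokenized sequence in turn.
class _Special(str):
    default_value = None

    def __new__(cls, token=None):
        return super().__new__(cls, token if token is not None else cls.default_value)

class BOS(_Special):
    default_value = "<BOS>"

    def apply(self, sequence):
        return [self] + sequence  # prepend

class EOS(_Special):
    default_value = "<EOS>"

    def apply(self, sequence):
        return sequence + [self]  # append

tokens = ["A", "slick", "melodrama", "."]
for special in (BOS(), EOS()):
    tokens = special.apply(tokens)
print(tokens)  # ['<BOS>', 'A', 'slick', 'melodrama', '.', '<EOS>']
```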

Finally, it is important to note that there is an implicit distinction between special tokens. The unknown (:class:`podium.storage.vocab.UNK`) and padding (:class:`podium.storage.vocab.PAD`) special tokens are what we refer to as **core** special tokens, whose functionality is hardcoded in the implementation of the ``Vocab`` because they are deeply integrated with the way iterators and numericalization work.
The only difference between regular and core specials is that core specials are added to the sequence by other Podium classes (their behavior is hardcoded) rather than through their ``apply`` method.
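
To illustrate why UNK is so deeply tied to numericalization, here is a rough sketch of an out-of-vocabulary fallback; the function and index layout are hypothetical, not Podium's implementation:

```python
# Hypothetical sketch of numericalization with a hardcoded UNK fallback.
def numericalize(tokens, stoi, unk_index=0):
    # Tokens missing from the string-to-index map collapse to the UNK index
    return [stoi.get(token, unk_index) for token in tokens]

stoi = {"<UNK>": 0, "<PAD>": 1, "the": 2, "cat": 3}
print(numericalize(["the", "dog", "cat"], stoi))  # [2, 0, 3]
```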

Custom numericalization functions
===========================================

1 change: 1 addition & 0 deletions docs/source/index.rst

@@ -41,6 +41,7 @@ The documentation is organized in four parts:
:caption: Core package Reference:

vocab_and_fields
specials
datasets
iterators
vectorizers
29 changes: 29 additions & 0 deletions docs/source/specials.rst

@@ -0,0 +1,29 @@
Special tokens
===============
.. autoclass:: podium.vocab.Special
:members:
:no-undoc-members:

The unknown token
^^^^^^^^^^^^^^^^^^
.. autoclass:: podium.vocab.UNK
:members:
:no-undoc-members:

The padding token
^^^^^^^^^^^^^^^^^^
.. autoclass:: podium.vocab.PAD
:members:
:no-undoc-members:

The beginning-of-sequence token
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
.. autoclass:: podium.vocab.BOS
:members:
:no-undoc-members:

The end-of-sequence token
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
.. autoclass:: podium.vocab.EOS
:members:
:no-undoc-members:
4 changes: 2 additions & 2 deletions docs/source/walkthrough.rst

@@ -127,7 +127,7 @@ That's it! We have defined our Fields. In order for them to be initialized, we n
>>> print(small_vocabulary)
Vocab[finalized: True, size: 5000]

Our new Vocab has been limited to the 5000 most frequent words. If your ``Vocab`` contains the unknown special token :class:`podium.vocab.UNK`, the words not present in the vocabulary will be set to the value of the unknown token. The unknown token is one of the default `special` tokens in the Vocab, alongside the padding token :class:`podium.vocab.PAD`. You can read more about these in :ref:`specials`.

You might have noticed that we used a different type of Field: :class:`podium.storage.LabelField` for the label. LabelField is one of the predefined custom Field classes with sensible default constructor arguments for its concrete use-case. We'll take a closer look at LabelFields in the following subsection.

@@ -246,7 +246,7 @@ For this dataset, we need to define three Fields. We also might want the fields
>>> print(dataset)
TabularDataset[Size: 1, Fields: ['premise', 'hypothesis', 'label']]
>>> print(shared_vocab.itos)
['<UNK>', '<PAD>', 'man', 'A', 'inspects', 'the', 'uniform', 'of', 'a', 'figure', 'in', 'some', 'East', 'Asian', 'country', '.', 'The', 'is', 'sleeping']


.. _hf-loading: