Vocab specials (#230)
Disable caching in Field when numericalizer isn't deterministic (via deterministic arg)
Make Specials a singleton string subclass
Add static constructors (from_itos, from_stoi)
Add UNK filtering
Apply non-core specials in Field & add tests for this
Fix previous tests, add new tests
mttk authored Jan 13, 2021
1 parent 37cfb9f commit 5273cac
Showing 11 changed files with 679 additions and 293 deletions.
65 changes: 64 additions & 1 deletion docs/source/advanced.rst
@@ -100,7 +100,7 @@ And we're done! We can now add our hook to the text field either through the :me
Removing punctuation as a posttokenization hook
-----------------------------------------------

We will now similarly define a posttokenization hook to remove punctuation. We will use the punctuation list from Python's built-in ``string`` module, which we will store as an attribute of our hook.

.. code-block:: python
@@ -133,6 +133,69 @@ We can see that our hooks worked: the raw data was lowercased prior to tokenizat

We have prepared a number of predefined hooks which are ready for you to use. You can see them here: :ref:`predefined-hooks`.

.. _specials:

Special tokens
===============
We have earlier mentioned special tokens, but now is the time to elaborate on what exactly they are. In Podium, each special token is a subclass of the python ``str`` which also encapsulates the functionality for adding that special token in the tokenized sequence. The ``Vocab`` handles special tokens differently -- each special token is guaranteed a place in the ``Vocab``, which is what makes them... *special*.

Since our idea of special tokens was made to be extensible, we will take a brief look at how they are implemented, so we can better understand how to use them. We mentioned that each special token is a subclass of the python string, but there is an intermediary -- the :class:`podium.storage.vocab.Special` base class. The ``Special`` base class implements the following functionality, while still being an instance of a string:

1. It extends the special token's constructor with default-value functionality. The default value for each special token is set via the ``default_value`` class attribute; if another value is passed upon creation, that value is used instead.
2. It adds a stub ``apply`` method which accepts a sequence of tokens and adds the special token to that sequence. In essence, the ``apply`` method is a post-tokenization hook (applied to the tokenized sequence after all other post-tokenization hooks) which does not see the raw data and whose job is to add the special token to the sequence or to replace some of the existing tokens with it. Special tokens are applied in the order they are passed to the :class:`podium.storage.vocab.Vocab` constructor. Each concrete special token has to implement this method.
3. It implements singleton-like hash and equality checks. The ``Special`` class overrides the default hash and equality and, instead of comparing string values, compares *class names*. This check ensures that each ``Vocab`` contains a single instance of each special token and simplifies referencing and containment checks.
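To make these three points concrete, here is a minimal, self-contained sketch of how a base class with this behavior could look. This is an illustration of the mechanics described above, not Podium's actual implementation:

```python
# Illustrative sketch of the three behaviors described above;
# Podium's actual Special class may differ in its details.
class Special(str):
    default_value = None

    def __new__(cls, value=None):
        # 1. Fall back to the class-level default_value when no value is given
        if value is None:
            value = cls.default_value
        return super().__new__(cls, value)

    def apply(self, sequence):
        # 2. Stub: concrete special tokens add themselves to the sequence here
        raise NotImplementedError

    def __hash__(self):
        # 3. Singleton-like behavior: identity is the class, not the string value
        return hash(type(self).__name__)

    def __eq__(self, other):
        return type(self).__name__ == type(other).__name__
```

With equality defined this way, two instances of the same special token class compare equal even when their string values differ, which is what enables the simple containment checks inside a vocabulary.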

A number of special tokens are used throughout NLP for various purposes. The most frequently used are the unknown token (UNK), a catch-all substitute for tokens which are not present in the vocabulary, and the padding token (PAD), which is used to pack variable-length sequences into fixed-size batch tensors.
Alongside these two, common special tokens include the beginning-of-sequence and end-of-sequence tokens (BOS, EOS), the separator token (SEP) and the mask token introduced in BERT (MASK).
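As a quick illustration of the padding token's role, the sketch below packs variable-length token sequences into equal-length rows (a plain list of lists standing in for a batch tensor). The ``pad_batch`` helper is hypothetical, not part of Podium's API:

```python
# Hypothetical helper, for illustration only: pad every sequence in a
# batch to the length of the longest one so they can be stacked.
PAD = "<PAD>"

def pad_batch(sequences, pad_token=PAD):
    max_len = max(len(seq) for seq in sequences)
    return [seq + [pad_token] * (max_len - len(seq)) for seq in sequences]

batch = pad_batch([["a", "cat"], ["the", "dog", "barks"]])
print(batch)
# [['a', 'cat', '<PAD>'], ['the', 'dog', 'barks']]
```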

To better understand how specials work, we will walk through the implementation of one of the special tokens available in Podium: the beginning-of-sequence (BOS) token.

.. code-block:: python

    >>> from podium.storage.vocab import Special
    >>> class BOS(Special):
    >>>     default_value = "<BOS>"
    >>>
    >>>     def apply(self, sequence):
    >>>         # Prepend to the sequence
    >>>         return [self] + sequence
    >>>
    >>> bos = BOS()
    >>> print(bos)
    <BOS>

This code block is the full implementation of a special token! All we needed to do was set the default value and implement the ``apply`` method. The default value is ``None`` by default, so if it is not set, you have to make sure a value is passed upon construction, like so:

.. code-block:: python

    >>> my_bos = BOS("<MY_BOS>")
    >>> print(my_bos)
    <MY_BOS>
    >>> print(bos == my_bos)
    True

We can also see that although we have changed the string representation of the special token, the equality check will still return True due to the ``Special`` base class changes mentioned earlier.

To see the effect of the ``apply`` method, we will once again take a look at the SST dataset:

.. code-block:: python

    >>> from podium import Vocab, Field, LabelField
    >>> from podium.datasets import SST
    >>>
    >>> vocab = Vocab(specials=(bos,))
    >>> text = Field(name='text', numericalizer=vocab)
    >>> label = LabelField(name='label')
    >>> fields = {'text': text, 'label': label}
    >>> sst_train, sst_test, sst_dev = SST.get_dataset_splits(fields=fields)
    >>> print(sst_train[222].text)
    (None, ['<BOS>', 'A', 'slick', ',', 'engrossing', 'melodrama', '.'])

We can see that the special token was indeed added to the beginning of the tokenized sequence.

Finally, it is important to note that there is an implicit distinction between special tokens. The unknown (:class:`podium.storage.vocab.UNK`) and padding (:class:`podium.storage.vocab.PAD`) special tokens are something we refer to as **core** special tokens, whose functionality is hardcoded in the implementation of the ``Vocab`` due to them being deeply integrated with the way iterators and numericalization work.
The only difference between regular and core specials is that core specials are added to the sequence by other Podium classes (their behavior is hardcoded) rather than by their ``apply`` method.
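A rough sketch of what this hardcoded behavior amounts to for the unknown token: during numericalization, the vocabulary itself maps out-of-vocabulary tokens to the UNK index, with no ``apply`` hook involved. The ``numericalize`` helper and the mapping below are illustrative, not Podium's internals:

```python
# Illustrative sketch (not Podium's internals): out-of-vocabulary tokens
# fall back to the index of the unknown token during numericalization.
def numericalize(tokens, stoi, unk_token="<UNK>"):
    unk_index = stoi[unk_token]
    return [stoi.get(token, unk_index) for token in tokens]

stoi = {"<UNK>": 0, "<PAD>": 1, "a": 2, "slick": 3}
print(numericalize(["a", "slick", "melodrama"], stoi))
# [2, 3, 0] -- the out-of-vocabulary token maps to the UNK index
```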

Custom numericalization functions
===========================================

1 change: 1 addition & 0 deletions docs/source/index.rst
@@ -41,6 +41,7 @@ The documentation is organized in four parts:
:caption: Core package Reference:

vocab_and_fields
specials
datasets
iterators
vectorizers
29 changes: 29 additions & 0 deletions docs/source/specials.rst
@@ -0,0 +1,29 @@
Special tokens
===============
.. autoclass:: podium.vocab.Special
:members:
:no-undoc-members:

The unknown token
^^^^^^^^^^^^^^^^^^
.. autoclass:: podium.vocab.UNK
:members:
:no-undoc-members:

The padding token
^^^^^^^^^^^^^^^^^^
.. autoclass:: podium.vocab.PAD
:members:
:no-undoc-members:

The beginning-of-sequence token
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
.. autoclass:: podium.vocab.BOS
:members:
:no-undoc-members:

The end-of-sequence token
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
.. autoclass:: podium.vocab.EOS
:members:
:no-undoc-members:
4 changes: 2 additions & 2 deletions docs/source/walkthrough.rst
@@ -127,7 +127,7 @@ That's it! We have defined our Fields. In order for them to be initialized, we n
>>> print(small_vocabulary)
Vocab[finalized: True, size: 5000]
Our new Vocab has been limited to the 5000 most frequent words. If your ``Vocab`` contains the unknown special token :class:`podium.vocab.UNK`, words not present in the vocabulary will be mapped to the unknown token. The unknown token is one of the default `special` tokens in the Vocab, alongside the padding token :class:`podium.vocab.PAD`. You can read more about these in :ref:`specials`.

You might have noticed that we used a different type of Field: :class:`podium.storage.LabelField` for the label. LabelField is one of the predefined custom Field classes with sensible default constructor arguments for its concrete use-case. We'll take a closer look at LabelFields in the following subsection.

@@ -246,7 +246,7 @@ For this dataset, we need to define three Fields. We also might want the fields
>>> print(dataset)
TabularDataset[Size: 1, Fields: ['premise', 'hypothesis', 'label']]
>>> print(shared_vocab.itos)
['<UNK>', '<PAD>', 'man', 'A', 'inspects', 'the', 'uniform', 'of', 'a', 'figure', 'in', 'some', 'East', 'Asian', 'country', '.', 'The', 'is', 'sleeping']
.. _hf-loading: