diff --git a/docs/source/advanced.rst b/docs/source/advanced.rst
index e6cc723e..cee349a9 100644
--- a/docs/source/advanced.rst
+++ b/docs/source/advanced.rst
@@ -100,7 +100,7 @@ And we're done! We can now add our hook to the text field either through the :me
 Removing punctuation as a posttokenization hook
 -----------------------------------------------
 
-We will now similarly define a posttokenization hook to remove punctuation. We will use the punctuation list from python's built-in ``string`` module, which we will store as an attribute of our hook.
+We will now similarly define a posttokenization hook to remove punctuation. We will use the punctuation list from Python's built-in ``string`` module, which we will store as an attribute of our hook.
 
 .. code-block:: python
 
@@ -133,6 +133,69 @@ We can see that our hooks worked: the raw data was lowercased prior to tokenizat
 
 We have prepared a number of predefined hooks which are ready for you to use. You can see them here: :ref:`predefined-hooks`.
 
+.. _specials:
+
+Special tokens
+===============
+We have mentioned special tokens earlier, but now is the time to elaborate on what exactly they are. In Podium, each special token is a subclass of the Python ``str`` class which also encapsulates the functionality for adding that special token to the tokenized sequence. The ``Vocab`` handles special tokens differently -- each special token is guaranteed a place in the ``Vocab``, which is what makes them... *special*.
+
+Since special tokens were designed to be extensible, we will take a brief look at how they are implemented so we can better understand how to use them. We mentioned that each special token is a subclass of the Python string, but there is an intermediary -- the :class:`podium.vocab.Special` base class. The ``Special`` base class implements the following functionality, while still being an instance of a string:
+
+ 1. Extends the constructor of the special token with default value functionality. The default value for each special token should be set via the ``default_value`` class attribute; if another value is passed upon creation, it is used instead.
+ 2. Adds a stub ``apply`` method which accepts a sequence of tokens and adds the special token to that sequence. In essence, the ``apply`` method is a post-tokenization hook (applied to the tokenized sequence after all other post-tokenization hooks) which does not see the raw data and whose job is to add the special token to the sequence or replace some of the existing tokens with it. The special tokens are applied in the order they are passed to the :class:`podium.vocab.Vocab` constructor. Each concrete implementation of a special token has to implement this method.
+ 3. Implements singleton-like hash and equality checks. The ``Special`` class overrides the default hash and equality methods and, instead of checking for string value equality, checks for *class equality*. We use this type of check to ensure that each ``Vocab`` contains a single instance of each special token, and to simplify referencing and containment checks.
+
+A number of special tokens are used throughout NLP for various purposes. The most frequently used ones are the unknown token (UNK), which is used as a catch-all substitute for tokens which are not present in the vocabulary, and the padding token (PAD), which is used to pack variable-length sequences into fixed-size batch tensors.
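+
+To make this more concrete, below is a small illustrative sketch of the unknown token acting as a catch-all during numericalization. The printed indices and token order are only an example -- they depend on the data your vocabulary is built from:
+
+.. code-block:: python
+
+    >>> from podium.vocab import Vocab, UNK, PAD
+    >>> # A toy vocabulary; the outputs below are illustrative
+    >>> vocab = Vocab(specials=(UNK(), PAD()))
+    >>> vocab += ["a", "short", "example"]
+    >>> vocab.finalize()
+    >>> print(vocab.itos)
+    ['<UNK>', '<PAD>', 'a', 'short', 'example']
+    >>> print(vocab.numericalize(["a", "word_not_in_the_vocabulary"]))
+    [2 0]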
+Alongside these two, common special tokens include the beginning-of-sequence and end-of-sequence tokens (BOS, EOS), the separator token (SEP) and the mask token introduced in BERT (MASK).
+
+To better understand how specials work, we will walk through the implementation of one of the special tokens implemented in Podium: the beginning-of-sequence (BOS) token.
+
+.. code-block:: python
+
+    >>> from podium.vocab import Special
+    >>> class BOS(Special):
+    >>>     default_value = "<BOS>"
+    >>>
+    >>>     def apply(self, sequence):
+    >>>         # Prepend to the sequence
+    >>>         return [self] + sequence
+    >>>
+    >>> bos = BOS()
+    >>> print(bos)
+    <BOS>
+
+This code block is the full implementation of a special token! All we needed to do was set the default value and implement the ``apply`` function. The ``default_value`` class attribute is ``None`` by default, so if you do not set it you have to make sure a value is passed upon construction, like so:
+
+.. code-block:: python
+
+    >>> my_bos = BOS("<my_bos>")
+    >>> print(my_bos)
+    <my_bos>
+    >>> print(bos == my_bos)
+    True
+
+We can also see that although we have changed the string representation of the special token, the equality check still returns True due to the overridden equality check of the ``Special`` base class mentioned earlier.
+
+To see the effect of the ``apply`` method, we will once again take a look at the SST dataset:
+
+.. code-block:: python
+
+    >>> from podium import Vocab, Field, LabelField
+    >>> from podium.datasets import SST
+    >>>
+    >>> vocab = Vocab(specials=(bos,))
+    >>> text = Field(name='text', numericalizer=vocab)
+    >>> label = LabelField(name='label')
+    >>> fields = {'text': text, 'label': label}
+    >>> sst_train, sst_test, sst_dev = SST.get_dataset_splits(fields=fields)
+    >>> print(sst_train[222].text)
+    (None, ['<BOS>', 'A', 'slick', ',', 'engrossing', 'melodrama', '.'])
+
+Here we can see that the special token was indeed added to the beginning of the tokenized sequence.
+
+Finally, it is important to note that there is an implicit distinction between special tokens. The unknown (:class:`podium.vocab.UNK`) and padding (:class:`podium.vocab.PAD`) special tokens are what we refer to as **core** special tokens, whose functionality is hardcoded in the implementation of the ``Vocab`` because they are deeply integrated with the way iterators and numericalization work.
+The only difference between normal and core specials is that core specials are added to the sequence by other Podium classes (their behavior is hardcoded) instead of by their ``apply`` method.
+
 Custom numericalization functions
 ===========================================
diff --git a/docs/source/index.rst b/docs/source/index.rst
index 808e9ebd..621a06a5 100644
--- a/docs/source/index.rst
+++ b/docs/source/index.rst
@@ -41,6 +41,7 @@ The documentation is organized in four parts:
    :caption: Core package Reference:
 
    vocab_and_fields
+   specials
    datasets
    iterators
    vectorizers
diff --git a/docs/source/specials.rst b/docs/source/specials.rst
new file mode 100644
index 00000000..d9c8801b
--- /dev/null
+++ b/docs/source/specials.rst
@@ -0,0 +1,29 @@
+Special tokens
+===============
+.. autoclass:: podium.vocab.Special
+   :members:
+   :no-undoc-members:
+
+The unknown token
+^^^^^^^^^^^^^^^^^^
+.. autoclass:: podium.vocab.UNK
+   :members:
+   :no-undoc-members:
+
+The padding token
+^^^^^^^^^^^^^^^^^^
+.. autoclass:: podium.vocab.PAD
+   :members:
+   :no-undoc-members:
+
+The beginning-of-sequence token
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+.. autoclass:: podium.vocab.BOS
+   :members:
+   :no-undoc-members:
+
+The end-of-sequence token
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+.. autoclass:: podium.vocab.EOS
+   :members:
+   :no-undoc-members:
diff --git a/docs/source/walkthrough.rst b/docs/source/walkthrough.rst
index 59e6722a..13c3e40c 100644
--- a/docs/source/walkthrough.rst
+++ b/docs/source/walkthrough.rst
@@ -127,7 +127,7 @@ That's it! We have defined our Fields. In order for them to be initialized, we n
     >>> print(small_vocabulary)
     Vocab[finalized: True, size: 5000]
 
-Our new Vocab has been limited to the 5000 most frequent words. The remaining words will be replaced by the unknown (``<unk>``) token, which is one of the default `special` tokens in the Vocab.
+Our new Vocab has been limited to the 5000 most frequent words. If your `Vocab` contains the unknown special token :class:`podium.vocab.UNK`, the words not present in the vocabulary will be replaced by the unknown token. The unknown token is one of the default `special` tokens in the Vocab, alongside the padding token :class:`podium.vocab.PAD`. You can read more about these in :ref:`specials`.
 
 You might have noticed that we used a different type of Field: :class:`podium.storage.LabelField` for the label. LabelField is one of the predefined custom Field classes with sensible default constructor arguments for its concrete use-case. We'll take a closer look at LabelFields in the following subsection.
 
@@ -246,7 +246,7 @@ For this dataset, we need to define three Fields. We also might want the fields
     >>> print(dataset)
     TabularDataset[Size: 1, Fields: ['premise', 'hypothesis', 'label']]
     >>> print(shared_vocab.itos)
-    [<SpecialVocabSymbols.UNK: '<unk>'>, <SpecialVocabSymbols.PAD: '<pad>'>, 'man', 'A', 'inspects', 'the', 'uniform', 'of', 'a', 'figure', 'in', 'some', 'East', 'Asian', 'country', '.', 'The', 'is', 'sleeping']
+    ['<UNK>', '<PAD>', 'man', 'A', 'inspects', 'the', 'uniform', 'of', 'a', 'figure', 'in', 'some', 'East', 'Asian', 'country', '.', 'The', 'is', 'sleeping']
 
 .. _hf-loading:
 
diff --git a/podium/field.py b/podium/field.py
index e1f259a3..6894de01 100644
--- a/podium/field.py
+++ b/podium/field.py
@@ -65,157 +65,6 @@ def clear(self):
         self.hooks.clear()
 
 
-class MultioutputField:
-    """
-    Field that does pretokenization and tokenization once and passes it to its
-    output fields.
-
-    Output fields are any type of field. The output fields are used only for
-    posttokenization processing (posttokenization hooks and vocab updating).
-    """
-
-    def __init__(
-        self,
-        output_fields: List["Field"],
-        tokenizer: TokenizerType = "split",
-        pretokenize_hooks: Optional[Iterable[PretokenizationHookType]] = None,
-    ):
-        """
-        Field that does pretokenization and tokenization once and passes it to
-        its output fields. Output fields are any type of field. The output
-        fields are used only for posttokenization processing (posttokenization
-        hooks and vocab updating).
-
-        Parameters
-        ----------
-        output_fields : List[Field],
-            List containig the output fields. The pretokenization hooks and tokenizer
-            in these fields are ignored and only posttokenization hooks are used.
-        tokenizer : Optional[Union[str, Callable]]
-            The tokenizer that is to be used when preprocessing raw data
-            (only if 'tokenize' is True). The user can provide his own
-            tokenizer as a callable object or specify one of the premade
-            tokenizers by a string. The available premade tokenizers are:
-
-            - 'split' - default str.split()
-            - 'spacy-lang' - the spacy tokenizer. The language model can be defined
-              by replacing `lang` with the language model name.
For example `spacy-en` - - pretokenize_hooks: Iterable[Callable[[Any], Any]] - Iterable containing pretokenization hooks. Providing hooks in this way is - identical to calling `add_pretokenize_hook`. - """ - - self._tokenizer_arg = tokenizer - self._pretokenization_pipeline = PretokenizationPipeline() - - if pretokenize_hooks is not None: - if not isinstance(pretokenize_hooks, (list, tuple)): - pretokenize_hooks = [pretokenize_hooks] - for hook in pretokenize_hooks: - self.add_pretokenize_hook(hook) - - self._tokenizer = get_tokenizer(tokenizer) - self._output_fields = deque(output_fields) - - def add_pretokenize_hook(self, hook: PretokenizationHookType): - """ - Add a pre-tokenization hook to the MultioutputField. If multiple hooks - are added to the field, the order of their execution will be the same as - the order in which they were added to the field, each subsequent hook - taking the output of the previous hook as its input. If the same - function is added to the Field as a hook multiple times, it will be - executed that many times. The output of the final pre-tokenization hook - is the raw data that the tokenizer will get as its input. - - Pretokenize hooks have the following signature: - func pre_tok_hook(raw_data): - raw_data_out = do_stuff(raw_data) - return raw_data_out - - This can be used to eliminate encoding errors in data, replace numbers - and names, etc. - - Parameters - ---------- - hook : Callable[[Any], Any] - The pre-tokenization hook that we want to add to the field. - """ - self._pretokenization_pipeline.add_hook(hook) - - def _run_pretokenization_hooks(self, data: Any) -> Any: - """ - Runs pretokenization hooks on the raw data and returns the result. - - Parameters - ---------- - data : Any - data to be processed - - Returns - ------- - Any - processed data - """ - - return self._pretokenization_pipeline(data) - - def add_output_field(self, field: "Field"): - """ - Adds the passed field to this field's output fields. - - Parameters - ---------- - field : Field - Field to add to output fields. - """ - self._output_fields.append(field) - - def preprocess(self, data: Any) -> Iterable[Tuple[str, Tuple[Optional[Any], Any]]]: - """ - Preprocesses raw data, tokenizing it if required. The outputfields - update their vocabs if required and preserve the raw data if the output - field's 'keep_raw' is true. - - Parameters - ---------- - data : Any - The raw data that needs to be preprocessed. - - Returns - ------- - Iterable[Tuple[str, Tuple[Optional[Any], Any]]] - An Iterable containing the raw and tokenized data of all the output fields. - The structure of the returned tuples is (name, (raw, tokenized)), where 'name' - is the name of the output field and raw and tokenized are processed data. - - Raises - ------ - If data is None and missing data is not allowed. - """ - data = self._run_pretokenization_hooks(data) - tokens = self._tokenizer(data) if self._tokenizer is not None else data - return tuple(field._process_tokens(data, tokens) for field in self._output_fields) - - def get_output_fields(self) -> Iterable["Field"]: - """ - Returns an Iterable of the contained output fields. - - Returns - ------- - Iterable[Field] : - an Iterable of the contained output fields. - """ - return self._output_fields - - def remove_pretokenize_hooks(self): - """ - Remove all the pre-tokenization hooks that were added to the - MultioutputField. 
- """ - self._pretokenization_pipeline.clear() - - class Field: """ Holds the preprocessing and numericalization logic for a single field of a @@ -232,6 +81,7 @@ def __init__( fixed_length: Optional[int] = None, allow_missing_data: bool = False, disable_batch_matrix: bool = False, + disable_numericalize_caching: bool = False, padding_token: Union[int, float] = -999, missing_data_token: Union[int, float] = -1, pretokenize_hooks: Optional[Iterable[PretokenizationHookType]] = None, @@ -297,6 +147,17 @@ def __init__( If True, a list of unpadded vectors(or other data type) will be returned instead. For missing data, the value in the list will be None. + disable_numericalize_caching : bool + The flag which determines whether the numericalization of this field should be + cached. This flag should be set to True if the numericalization can differ + between `numericalize` function calls for the same instance. When set to False, + the numericalization values will be cached and reused each time the instance + is used as part of a batch. The flag is passed to the numericalizer to indicate + use of its nondeterministic setting. This flag is mainly intended be used in the + case of masked language modelling, where we wish the inputs to be masked + (nondeterministic), and the outputs (labels) to not be masked while using the + same vocabulary. + padding_token : int Padding token used when numericalizer is a callable. If the numericalizer is None or a Vocab, this value is ignored. @@ -327,6 +188,7 @@ def __init__( ) self._name = name self._disable_batch_matrix = disable_batch_matrix + self._disable_numericalize_caching = disable_numericalize_caching self._tokenizer_arg_string = tokenizer if isinstance(tokenizer, str) else None if tokenizer is None: @@ -415,6 +277,10 @@ def vocab(self): """ return self._vocab + @property + def disable_numericalize_caching(self): + return self._disable_numericalize_caching + @property def use_vocab(self): """ @@ -645,8 +511,17 @@ def _process_tokens( """ raw, tokenized = self._run_posttokenization_hooks(raw, tokens) + + # Apply the special tokens. These act as a post-tokenization + # hook, but are applied separately as we want to encapsulate + # that logic in their class to minimize code changes. + if self.use_vocab: + for special_token in self.vocab.specials: + tokenized = special_token.apply(tokenized) + raw = raw if self._keep_raw else None + # Self.eager checks if a vocab is used so this won't error if self.eager and not self.vocab.finalized: self.update_vocab(tokenized) return self.name, (raw, tokenized) @@ -824,6 +699,10 @@ def get_numericalization_for_example( cache_field_name = f"{self.name}_" numericalization = example.get(cache_field_name) + # Check if this concrete field can be cached. + + cache = cache and not self.disable_numericalize_caching + if numericalization is None: example_data = example[self.name] numericalization = self.numericalize(example_data) @@ -883,6 +762,157 @@ def get_output_fields(self) -> Iterable["Field"]: return (self,) +class MultioutputField: + """ + Field that does pretokenization and tokenization once and passes it to its + output fields. + + Output fields are any type of field. The output fields are used only for + posttokenization processing (posttokenization hooks and vocab updating). 
+ """ + + def __init__( + self, + output_fields: List["Field"], + tokenizer: TokenizerType = "split", + pretokenize_hooks: Optional[Iterable[PretokenizationHookType]] = None, + ): + """ + Field that does pretokenization and tokenization once and passes it to + its output fields. Output fields are any type of field. The output + fields are used only for posttokenization processing (posttokenization + hooks and vocab updating). + + Parameters + ---------- + output_fields : List[Field], + List containig the output fields. The pretokenization hooks and tokenizer + in these fields are ignored and only posttokenization hooks are used. + tokenizer : Optional[Union[str, Callable]] + The tokenizer that is to be used when preprocessing raw data + (only if 'tokenize' is True). The user can provide his own + tokenizer as a callable object or specify one of the premade + tokenizers by a string. The available premade tokenizers are: + + - 'split' - default str.split() + - 'spacy-lang' - the spacy tokenizer. The language model can be defined + by replacing `lang` with the language model name. For example `spacy-en` + + pretokenize_hooks: Iterable[Callable[[Any], Any]] + Iterable containing pretokenization hooks. Providing hooks in this way is + identical to calling `add_pretokenize_hook`. + """ + + self._tokenizer_arg = tokenizer + self._pretokenization_pipeline = PretokenizationPipeline() + + if pretokenize_hooks is not None: + if not isinstance(pretokenize_hooks, (list, tuple)): + pretokenize_hooks = [pretokenize_hooks] + for hook in pretokenize_hooks: + self.add_pretokenize_hook(hook) + + self._tokenizer = get_tokenizer(tokenizer) + self._output_fields = deque(output_fields) + + def add_pretokenize_hook(self, hook: PretokenizationHookType): + """ + Add a pre-tokenization hook to the MultioutputField. If multiple hooks + are added to the field, the order of their execution will be the same as + the order in which they were added to the field, each subsequent hook + taking the output of the previous hook as its input. If the same + function is added to the Field as a hook multiple times, it will be + executed that many times. The output of the final pre-tokenization hook + is the raw data that the tokenizer will get as its input. + + Pretokenize hooks have the following signature: + func pre_tok_hook(raw_data): + raw_data_out = do_stuff(raw_data) + return raw_data_out + + This can be used to eliminate encoding errors in data, replace numbers + and names, etc. + + Parameters + ---------- + hook : Callable[[Any], Any] + The pre-tokenization hook that we want to add to the field. + """ + self._pretokenization_pipeline.add_hook(hook) + + def _run_pretokenization_hooks(self, data: Any) -> Any: + """ + Runs pretokenization hooks on the raw data and returns the result. + + Parameters + ---------- + data : Any + data to be processed + + Returns + ------- + Any + processed data + """ + + return self._pretokenization_pipeline(data) + + def add_output_field(self, field: "Field"): + """ + Adds the passed field to this field's output fields. + + Parameters + ---------- + field : Field + Field to add to output fields. + """ + self._output_fields.append(field) + + def preprocess(self, data: Any) -> Iterable[Tuple[str, Tuple[Optional[Any], Any]]]: + """ + Preprocesses raw data, tokenizing it if required. The outputfields + update their vocabs if required and preserve the raw data if the output + field's 'keep_raw' is true. + + Parameters + ---------- + data : Any + The raw data that needs to be preprocessed. 
+ + Returns + ------- + Iterable[Tuple[str, Tuple[Optional[Any], Any]]] + An Iterable containing the raw and tokenized data of all the output fields. + The structure of the returned tuples is (name, (raw, tokenized)), where 'name' + is the name of the output field and raw and tokenized are processed data. + + Raises + ------ + If data is None and missing data is not allowed. + """ + data = self._run_pretokenization_hooks(data) + tokens = self._tokenizer(data) if self._tokenizer is not None else data + return tuple(field._process_tokens(data, tokens) for field in self._output_fields) + + def get_output_fields(self) -> Iterable["Field"]: + """ + Returns an Iterable of the contained output fields. + + Returns + ------- + Iterable[Field] : + an Iterable of the contained output fields. + """ + return self._output_fields + + def remove_pretokenize_hooks(self): + """ + Remove all the pre-tokenization hooks that were added to the + MultioutputField. + """ + self._pretokenization_pipeline.clear() + + class LabelField(Field): """ Field subclass used when no tokenization is required. @@ -893,8 +923,10 @@ class LabelField(Field): def __init__( self, name: str, - numericalizer: NumericalizerType = None, + numericalizer: Optional[Union[Vocab, NumericalizerType]] = None, allow_missing_data: bool = False, + disable_batch_matrix: bool = False, + disable_numericalize_caching: bool = False, is_target: bool = True, missing_data_token: Union[int, float] = -1, pretokenize_hooks: Optional[Iterable[PretokenizationHookType]] = None, @@ -924,6 +956,24 @@ def __init__( If 'allow_missing_data' is True, if a None is sent to be preprocessed, it will be stored and later numericalized properly. + disable_batch_matrix: bool + Whether the batch created for this field will be compressed into a matrix. + If False, the batch returned by an Iterator or Dataset.batch() will contain + a matrix of numericalizations for all examples (if possible). + If True, a list of unpadded vectors(or other data type) will be returned + instead. For missing data, the value in the list will be None. + + disable_numericalize_caching : bool + The flag which determines whether the numericalization of this field should be + cached. This flag should be set to True if the numericalization can differ + between `numericalize` function calls for the same instance. When set to False, + the numericalization values will be cached and reused each time the instance + is used as part of a batch. The flag is passed to the numericalizer to indicate + use of its nondeterministic setting. This flag is mainly intended be used in the + case of masked language modelling, where we wish the inputs to be masked + (nondeterministic), and the outputs (labels) to not be masked while using the + same vocabulary. + is_target : bool Whether this field is a target variable. Affects iteration over batches. 
@@ -956,6 +1006,8 @@ def __init__(
             is_target=is_target,
             fixed_length=1,
             allow_missing_data=allow_missing_data,
+            disable_batch_matrix=disable_batch_matrix,
+            disable_numericalize_caching=disable_numericalize_caching,
             missing_data_token=missing_data_token,
             pretokenize_hooks=pretokenize_hooks,
         )
@@ -972,10 +1024,12 @@ def __init__(
         self,
         name: str,
         tokenizer: TokenizerType = None,
-        numericalizer: NumericalizerType = None,
+        numericalizer: Optional[Union[Vocab, NumericalizerType]] = None,
         num_of_classes: Optional[int] = None,
         is_target: bool = True,
         allow_missing_data: bool = False,
+        disable_batch_matrix: bool = False,
+        disable_numericalize_caching: bool = False,
         missing_data_token: Union[int, float] = -1,
         pretokenize_hooks: Optional[Iterable[PretokenizationHookType]] = None,
         posttokenize_hooks: Optional[Iterable[PosttokenizationHookType]] = None,
@@ -1027,6 +1081,24 @@ def __init__(
             If 'allow_missing_data' is True, if a None is sent to be preprocessed, it will
             be stored and later numericalized properly.
+
+        disable_batch_matrix: bool
+            Whether the batch created for this field will be compressed into a matrix.
+            If False, the batch returned by an Iterator or Dataset.batch() will contain
+            a matrix of numericalizations for all examples (if possible).
+            If True, a list of unpadded vectors (or other data type) will be returned
+            instead. For missing data, the value in the list will be None.
+
+        disable_numericalize_caching : bool
+            The flag which determines whether the numericalization of this field should be
+            cached. This flag should be set to True if the numericalization can differ
+            between `numericalize` function calls for the same instance. When set to False,
+            the numericalization values will be cached and reused each time the instance
+            is used as part of a batch. The flag is passed to the numericalizer to indicate
+            use of its nondeterministic setting. This flag is mainly intended to be used in
+            the case of masked language modelling, where we wish the inputs to be masked
+            (nondeterministic), and the outputs (labels) to not be masked while using the
+            same vocabulary.
 
         missing_data_token : Union[int, float]
             Token to use to mark batch rows as missing. If data for a field is missing,
             its matrix row will be filled with this value. For non-numericalizable fields,
@@ -1065,6 +1137,8 @@ def __init__(
             is_target=is_target,
             fixed_length=num_of_classes,
             allow_missing_data=allow_missing_data,
+            disable_batch_matrix=disable_batch_matrix,
+            disable_numericalize_caching=disable_numericalize_caching,
             missing_data_token=missing_data_token,
             pretokenize_hooks=pretokenize_hooks,
             posttokenize_hooks=posttokenize_hooks,
diff --git a/podium/vocab.py b/podium/vocab.py
index 37797664..e24ecb46 100644
--- a/podium/vocab.py
+++ b/podium/vocab.py
@@ -3,7 +3,6 @@
 """
 import warnings
 from collections import Counter
-from enum import Enum
 from itertools import chain
 from typing import Iterable, Union
 
@@ -32,38 +31,120 @@ def unique(values: Iterable):
         yield element
 
 
-class VocabDict(dict):
+class Special(str):
     """
-    Vocab dictionary class that is used like default dict but without adding
-    missing key to the dictionary.
+    Base class for a special token.
+
+    Every special token is a subclass of string (this way one can easily modify
+    the concrete string representation of the special). The functionality of the
+    special token, which acts the same as a post-tokenization hook, should be
+    implemented in the `apply` instance method for each subclass. We ensure that
+    each special token will be present in the Vocab.
     """
 
-    def __init__(self, default_factory=None, *args, **kwargs):
-        super().__init__(*args, **kwargs)
-        self._default_factory = default_factory
+    default_value = None
+
+    def __new__(cls, token=None):
+        """
+        Provides default value initialization for subclasses.
+
+        If creating a new instance without a string argument, the
+        `default_value` class attribute must be set in the subclass
+        implementation.
+        """
 
-    def __missing__(self, key):
-        if self._default_factory is None:
-            raise KeyError(
-                "Default factory is not defined and key is not in the dictionary."
+        if token is None and cls.default_value is None:
+            error_msg = (
+                "When initializing a special token without argument"
+                f" the {cls.__name__}.default_value attribute must be set."
             )
-        return self._default_factory()
+            raise RuntimeError(error_msg)
+
+        if token is None:
+            token = cls.default_value
+
+        return super(Special, cls).__new__(cls, token)
+
+    def __hash__(self):
+        """
+        Overrides hash.
+
+        Check docs of `__eq__` for motivation.
+        """
+        return hash(self.__class__)
+
+    def __eq__(self, other):
+        """
+        Check equals via class instead of value.
+
+        The motivation behind this is that we want to be able to match the
+        special token by class and not by value, as it is the type of the
+        special token that determines its functionality. This way we allow for
+        the concrete string representation of the special to be easily changed,
+        while retaining simple existence checks for vocab functionality.
+        """
+        return self.__class__ == other.__class__
+
+    def apply(self, sequence):
+        """
+        Apply (insert) the special token in the adequate place in the sequence.
+
+        By default, returns the unchanged sequence.
+        """
+        return sequence
 
 
-class SpecialVocabSymbols(Enum):
+class BOS(Special):
+    """
+    The beginning-of-sequence special token.
     """
-    Class for special vocabular symbols.
-
-    Attributes
-    ----------
-    UNK : str
-        Tag for unknown word
-    PAD : str
-        TAG for padding symbol
+    default_value = "<BOS>"
+
+    def apply(self, sequence):
+        """
+        Apply the BOS token, adding it to the start of the sequence.
+        """
+        return [self] + sequence
+
+
+class EOS(Special):
+    """
+    The end-of-sequence special token.
+    """
+
+    default_value = "<EOS>"
+
+    def apply(self, sequence):
+        """
+        Apply the EOS token, adding it to the end of the sequence.
+        """
+        return sequence + [self]
+
+
+#################
+# Core specials #
+#################
+
+
+class UNK(Special):
+    """
+    The unknown core special token.
+
+    Functionality handled by Vocab.
     """
 
-    UNK = "<unk>"
-    PAD = "<pad>"
+    default_value = "<UNK>"
+
+
+class PAD(Special):
+    """
+    The padding core special token.
+
+    Functionality handled by Vocab.
+    """
+
+    default_value = "<PAD>"
 
 
 class Vocab:
@@ -81,11 +162,14 @@ class Vocab:
         mapping from word string to index
     """
 
+    _unk = UNK()
+    _pad = PAD()
+
     def __init__(
         self,
         max_size=None,
         min_freq=1,
-        specials=(SpecialVocabSymbols.UNK, SpecialVocabSymbols.PAD),
+        specials=(UNK(), PAD()),
         keep_freqs=False,
         eager=True,
     ):
@@ -104,67 +188,103 @@ def __init__(
         keep_freqs : bool
             if true word frequencies will be saved for later use on the
             finalization
+        eager : bool
+            if `True` the frequencies will be built immediately upon
+            dataset loading. While not obvious, the main effect of
+            this argument if set to `True` is that the frequencies of
+            the vocabulary will be based on _all_ datasets
+            that use this vocabulary, while if set to `False`, the
+            vocabulary will be built by iterating again over the
+            datasets passed as argument to the `finalize_fields`
+            function.
""" self._freqs = Counter() self._keep_freqs = keep_freqs self._min_freq = min_freq - self.specials = () if specials is None else specials + self._specials = () if specials is None else specials if not isinstance(self.specials, (tuple, list)): - self.specials = (self.specials,) + self._specials = (self._specials,) self._has_specials = len(self.specials) > 0 - self.itos = list(self.specials) - self._default_unk_index = self._init_default_unk_index(self.specials) - self.stoi = VocabDict(self._default_unk) - self.stoi.update({k: v for v, k in enumerate(self.itos)}) + # Apply uniqueness check + if len(self.specials) > len(set(self.specials)): + error_msg = "Specials may not contain multiple instances of same type." + raise ValueError(error_msg) + + self._itos = list(self.specials) + # self._default_unk_index = self._init_default_unk_index(self.specials) + self._stoi = {k: v for v, k in enumerate(self.itos)} self._max_size = max_size - self.eager = eager - self.finalized = False # flag to know if we're ready to numericalize + self._eager = eager + self._finalized = False # flag to know if we're ready to numericalize + + @property + def freqs(self): + return self._freqs + + @property + def eager(self): + return self._eager + + @property + def finalized(self): + return self._finalized + + @property + def specials(self): + return self._specials + + @property + def itos(self): + return self._itos + + @property + def stoi(self): + return self._stoi - @staticmethod - def _init_default_unk_index(specials): + @classmethod + def from_itos(cls, itos): """ - Method computes index of default unknown symbol in given collection. + Method constructs a vocab from a predefined index-to-string mapping. Parameters ---------- - specials : iter(SpecialVocabSymbols) - collection of special vocab symbols + itos: list | tuple + The index-to-string mapping for tokens in the vocabulary + """ + specials = [token for token in itos if isinstance(token, Special)] - Returns - ------- - index : int or None - index of default unkwnown symbol or None if it doesn't exist + vocab = cls(specials=specials) + vocab._itos = itos + vocab._stoi = {v: k for k, v in enumerate(itos)} + vocab._finalized = True + + return vocab + + @classmethod + def from_stoi(cls, stoi): """ - ind = 0 - for spec in specials: - if spec == SpecialVocabSymbols.UNK: - return ind - ind += 1 - return None + Method constructs a vocab from a predefined index-to-string mapping. - def _default_unk(self): + Parameters + ---------- + stoi: dict + The string-to-index mapping for the vocabulary """ - Method obtains default unknown symbol index. Used for stoi. + specials = [token for token in stoi.keys() if isinstance(token, Special)] - Returns - ------- - index: int - index of default unknown symbol + vocab = cls(specials=specials) + vocab._stoi = stoi + vocab_max_index = max(stoi.values()) + itos = [None] * (vocab_max_index + 1) + for token, index in stoi.items(): + itos[index] = token + vocab._itos = itos + vocab._finalized = True - Raises - ------ - ValueError - If unknown symbol is not present in the vocab. - """ - if self._default_unk_index is None: - raise ValueError( - "Unknown symbol is not present in the vocab but " - "the user asked for the word that isn't in the vocab." - ) - return self._default_unk_index + return vocab def get_freqs(self): """ @@ -186,7 +306,7 @@ def get_freqs(self): "User specified that frequencies aren't kept in " "vocabulary but the get_freqs method is called." 
) - return self._freqs + return self.freqs def padding_index(self): """ @@ -202,9 +322,9 @@ def padding_index(self): ValueError If the padding symbol is not present in the vocabulary. """ - if SpecialVocabSymbols.PAD not in self.stoi: + if Vocab._pad not in self.stoi: raise ValueError("Padding symbol is not in the vocabulary.") - return self.stoi[SpecialVocabSymbols.PAD] + return self.stoi[Vocab._pad] def __iadd__(self, values: Union["Vocab", Iterable]): """ @@ -215,7 +335,9 @@ def __iadd__(self, values: Union["Vocab", Iterable]): values : Iterable or Vocab Values to be added to this Vocab. If Vocab, all of the token frequencies and specials from that Vocab will be - added to this Vocab. + added to this Vocab. Wheen adding two Vocabs with a different string values + for a special token, only the special token instance with the valuefrom the + LHS operand will be used. If Iterable, all of the tokens from the Iterable will be added to this Vocab, increasing the frequencies of those tokens. @@ -258,7 +380,7 @@ def __iadd__(self, values: Union["Vocab", Iterable]): ) # unique is used instead of set to somewhat preserve ordering - self.specials = list(unique(chain(self.specials, other_vocab.specials))) + self._specials = list(unique(chain(self.specials, other_vocab.specials))) self._has_specials = len(self.specials) > 0 self._itos = list(self.specials) self._freqs += other_vocab._freqs # add freqs to this instance @@ -285,7 +407,9 @@ def __add__(self, values: Union["Vocab", Iterable]): ---------- values : Iterable or Vocab If Vocab, a new Vocab will be created containing all of the special symbols - and tokens from both Vocabs. + and tokens from both Vocabs. Wheen adding two Vocabs with a different string + values for a special token, only the special token instance with the value + from the first operand will be used. If Iterable, a new Vocab will be returned containing a copy of this Vocab with the iterables' tokens added. @@ -390,7 +514,7 @@ def finalize(self): if not self._keep_freqs: self._freqs = None # release memory - self.finalized = True + self._finalized = True def numericalize(self, data): """ @@ -398,8 +522,8 @@ def numericalize(self, data): Parameters ---------- - data : iter(str) - iterable collection of tokens + data : str | iter(str) + a single token or iterable collection of tokens Returns ------- @@ -416,7 +540,20 @@ def numericalize(self, data): "Cannot numericalize if the vocabulary has not been " "finalized because itos and stoi are not yet built." 
) - return np.array([self.stoi[token] for token in data]) + + if isinstance(data, str): + # Wrap string into list + data = [data] + + if Vocab._unk in self.stoi: + # If UNK is in the vocabulary, substitute unknown words with its value + unk_token = self.stoi[Vocab._unk] + return np.array( + [self.stoi[token] if token in self.stoi else unk_token for token in data] + ) + else: + # If UNK is not in the vocabulary we filter out unknown words + return np.array([self.stoi[token] for token in data if token in self.stoi]) def reverse_numericalize(self, numericalized_data: Iterable): """ @@ -469,7 +606,7 @@ def __len__(self): """ if self.finalized: return len(self.itos) - return len(self._freqs) + return len(self.freqs) def __eq__(self, other): """ @@ -490,7 +627,7 @@ def __eq__(self, other): return False if self.finalized != other.finalized: return False - if self._freqs != other._freqs: + if self.freqs != other.freqs: return False if self.stoi != other.stoi: return False @@ -511,7 +648,7 @@ def __iter__(self): iterator over vocab tokens """ if not self.finalized: - return iter(self._freqs.keys()) + return iter(self.freqs.keys()) return iter(self.itos) def __repr__(self): diff --git a/tests/conftest.py b/tests/conftest.py index c0007120..6cb5a120 100644 --- a/tests/conftest.py +++ b/tests/conftest.py @@ -51,18 +51,27 @@ def vocab(tabular_dataset_fields): return tabular_dataset_fields["text"].vocab +@pytest.fixture +@pytest.mark.usefixtures("json_file_path") +def cache_disabled_tabular_dataset(json_file_path): + return create_tabular_dataset_from_json( + tabular_dataset_fields(disable_numericalize_caching=True), json_file_path + ) + + @pytest.fixture @pytest.mark.usefixtures("json_file_path") def tabular_dataset(json_file_path): return create_tabular_dataset_from_json(tabular_dataset_fields(), json_file_path) -def tabular_dataset_fields(fixed_length=None): +def tabular_dataset_fields(fixed_length=None, disable_numericalize_caching=False): text = Field( "text", numericalizer=Vocab(eager=True), fixed_length=fixed_length, allow_missing_data=False, + disable_numericalize_caching=disable_numericalize_caching, ) text_missing = Field( "text_with_missing_data", diff --git a/tests/datasets/test_iterator.py b/tests/datasets/test_iterator.py index 67e01e81..c550a7e9 100644 --- a/tests/datasets/test_iterator.py +++ b/tests/datasets/test_iterator.py @@ -167,6 +167,24 @@ def test_lazy_numericalization_caching(tabular_dataset): assert np.all(numericalized_data == cached_data) +@pytest.mark.usefixtures("cache_disabled_tabular_dataset") +def test_caching_disabled(tabular_dataset): + # Run one epoch to cause lazy numericalization + for _ in Iterator(dataset=tabular_dataset, batch_size=10): + pass + + cache_disabled_fields = [ + f for f in tabular_dataset.fields if f.disable_numericalize_caching + ] + # Test if cached data is equal to numericalized data + for example in tabular_dataset: + for field in cache_disabled_fields: + + cache_field_name = f"{field.name}_" + numericalization = example.get(cache_field_name) + assert numericalization is None + + @pytest.mark.usefixtures("tabular_dataset") def test_sort_key(tabular_dataset): def text_len_sort_key(example): diff --git a/tests/test_field.py b/tests/test_field.py index a83bad73..3cbcaa47 100644 --- a/tests/test_field.py +++ b/tests/test_field.py @@ -6,7 +6,7 @@ import pytest from podium.field import Field, LabelField, MultilabelField, MultioutputField -from podium.vocab import SpecialVocabSymbols, Vocab +from podium.vocab import BOS, EOS, PAD, UNK, Vocab ONE_TO_FIVE 
= [1, 2, 3, 4, 5] @@ -37,6 +37,7 @@ def __init__(self, eager=True): self.finalized = False self.numericalized = False self.eager = eager + self.specials = () def padding_index(self): return PAD_NUM @@ -419,6 +420,35 @@ def to_lower_hook(raw, tokenized): assert to_lower_hook.call_count == 2 +def test_field_applies_specials(): + bos, eos = BOS(), EOS() + vocab = Vocab(specials=(bos, eos)) + f = Field(name="F", tokenizer="split", numericalizer=vocab, keep_raw=True) + + _, received = f.preprocess("asd 123 BLA")[0] + expected = ("asd 123 BLA", [bos, "asd", "123", "BLA", eos]) + + assert received == expected + + # Test with empty specials + vocab = Vocab(specials=()) + f = Field(name="F", tokenizer="split", numericalizer=vocab, keep_raw=True) + + _, received = f.preprocess("asd 123 BLA")[0] + expected = ("asd 123 BLA", ["asd", "123", "BLA"]) + + assert received == expected + + # Test core specials are a no-op + vocab = Vocab(specials=(PAD(), UNK())) + f = Field(name="F", tokenizer="split", numericalizer=vocab, keep_raw=True) + + _, received = f.preprocess("asd 123 BLA")[0] + expected = ("asd 123 BLA", ["asd", "123", "BLA"]) + + assert received == expected + + def test_field_is_target(): f1 = Field(name="text", is_target=False) f2 = Field(name="label", is_target=True) @@ -522,7 +552,7 @@ def test_multilabel_field_specials_in_vocab_fail(): with pytest.raises(ValueError): MultilabelField( name="bla", - numericalizer=Vocab(specials=(SpecialVocabSymbols.UNK,)), + numericalizer=Vocab(specials=(UNK())), num_of_classes=10, ) diff --git a/tests/test_vocab.py b/tests/test_vocab.py index 1fef65af..cfd64ff1 100644 --- a/tests/test_vocab.py +++ b/tests/test_vocab.py @@ -1,6 +1,7 @@ import os import dill +import numpy as np import pytest from podium import vocab @@ -104,22 +105,44 @@ def test_empty_specials_get_pad_symbol(): voc.padding_index() -def test_empty_specials_stoi(): +def test_no_unk_filters_unknown_tokens(): voc = vocab.Vocab(specials=[]) data = ["tree", "plant", "grass"] voc = voc + set(data) voc.finalize() + + # Tree is in vocab + assert len(voc.numericalize("tree")) == 1 + # Apple isn't in vocab + assert len(voc.numericalize("apple")) == 0 + # Try with list argument + assert len(voc.numericalize(["tree", "apple"])) == 1 + + +@pytest.mark.parametrize( + "default_instance, second_default_instance, custom_instance", + [ + (vocab.UNK(), vocab.UNK(), vocab.UNK("")), + (vocab.PAD(), vocab.PAD(), vocab.PAD("")), + (vocab.BOS(), vocab.BOS(), vocab.BOS("")), + (vocab.EOS(), vocab.EOS(), vocab.EOS("")), + ], +) +def test_specials_uniqueness(default_instance, second_default_instance, custom_instance): + with pytest.raises(ValueError): + vocab.Vocab(specials=[default_instance, second_default_instance]) + with pytest.raises(ValueError): - voc.stoi["apple"] + vocab.Vocab(specials=[default_instance, custom_instance]) def test_specials_get_pad_symbol(): - voc = vocab.Vocab(specials=(vocab.SpecialVocabSymbols.PAD,)) + voc = vocab.Vocab(specials=(vocab.PAD(),)) data = ["tree", "plant", "grass"] voc = voc + set(data) assert voc.padding_index() == 0 voc.finalize() - assert voc.itos[0] == vocab.SpecialVocabSymbols.PAD + assert voc.itos[0] == vocab.PAD() def test_max_size(): @@ -133,7 +156,7 @@ def test_max_size(): def test_max_size_with_specials(): voc = vocab.Vocab( max_size=2, - specials=[vocab.SpecialVocabSymbols.PAD, vocab.SpecialVocabSymbols.UNK], + specials=[vocab.PAD(), vocab.UNK()], ) data = ["tree", "plant", "grass"] voc = (voc + set(data)) + {"plant"} @@ -142,7 +165,7 @@ def test_max_size_with_specials(): 
def test_size_after_final_with_specials(): - specials = [vocab.SpecialVocabSymbols.PAD, vocab.SpecialVocabSymbols.UNK] + specials = [vocab.PAD(), vocab.UNK()] voc = vocab.Vocab(specials=specials) data = ["tree", "plant", "grass"] voc = (voc + set(data)) + {"plant"} @@ -150,18 +173,25 @@ def test_size_after_final_with_specials(): assert len(voc) == len(data) + len(specials) -def test_enum_special_vocab_symbols(): - assert vocab.SpecialVocabSymbols.PAD.value == "" - assert vocab.SpecialVocabSymbols.UNK.value == "" +def test_special_vocab_symbols(): + assert str(vocab.PAD()) == "" + assert str(vocab.UNK()) == "" + + assert str(vocab.PAD("")) == "" + assert str(vocab.UNK("")) == "" + + # These hold due to overloaded hash/eq + assert vocab.PAD("") == vocab.PAD() + assert vocab.UNK("") == vocab.UNK() def test_get_stoi_for_unknown_word_default_unk(): - specials = [vocab.SpecialVocabSymbols.PAD, vocab.SpecialVocabSymbols.UNK] + specials = [vocab.PAD(), vocab.UNK()] voc = vocab.Vocab(specials=specials) data = ["tree", "plant", "grass"] voc = (voc + set(data)) + {"plant"} voc.finalize() - assert voc.stoi["unknown"] == 1 + assert voc.numericalize("unknown") == 1 def test_iadd_word_after_finalization_error(): @@ -189,19 +219,19 @@ def test_add_vocab_to_vocab(): for word in voc._freqs: assert voc._freqs[word] == expected_freq[word] - voc3 = vocab.Vocab(specials=vocab.SpecialVocabSymbols.UNK) + voc3 = vocab.Vocab(specials=vocab.UNK()) voc3 += data1 voc3 += data3 voc3.finalize() - voc4 = vocab.Vocab(specials=vocab.SpecialVocabSymbols.PAD) + voc4 = vocab.Vocab(specials=vocab.PAD()) voc4 += data2 voc4.finalize() voc = voc3 + voc4 assert set(voc.specials) == { - vocab.SpecialVocabSymbols.PAD, - vocab.SpecialVocabSymbols.UNK, + vocab.PAD(), + vocab.UNK(), } assert voc.finalized assert len(voc.itos) == 7 @@ -212,19 +242,16 @@ def test_iadd_vocab_to_vocab(): data2 = ["a1", "a2", "w1"] expected_freqs = {"w1": 2, "w2": 1, "w3": 1, "a1": 1, "a2": 1} - voc1 = vocab.Vocab(specials=vocab.SpecialVocabSymbols.PAD) + voc1 = vocab.Vocab(specials=vocab.PAD()) voc1 += data1 - voc2 = vocab.Vocab(specials=vocab.SpecialVocabSymbols.UNK) + voc2 = vocab.Vocab(specials=vocab.UNK()) voc2 += data2 voc1 += voc2 assert voc1.get_freqs() == expected_freqs - assert all( - spec in voc1.specials - for spec in (vocab.SpecialVocabSymbols.PAD, vocab.SpecialVocabSymbols.UNK) - ) + assert all(spec in voc1.specials for spec in (vocab.PAD(), vocab.UNK())) def test_add_vocab_to_vocab_error(): @@ -353,13 +380,14 @@ def test_equals_two_vocabs_different_freq(): assert voc1 != voc2 +# This won't fail anymore, should change to +# test_vocab_filer_unk def test_vocab_fail_no_unk(): voc = vocab.Vocab(specials=()) voc += [1, 2, 3, 4, 5] voc.finalize() - with pytest.raises(ValueError): - voc.numericalize([1, 2, 3, 6]) + assert np.array_equal(voc.numericalize([1, 2, 3, 6]), np.array([0, 1, 2])) def test_vocab_has_no_specials(): @@ -374,33 +402,9 @@ def test_vocab_has_specials(): voc = vocab.Vocab() assert voc.has_specials - voc2 = vocab.Vocab(specials=vocab.SpecialVocabSymbols.UNK) + voc2 = vocab.Vocab(specials=vocab.UNK()) assert voc2._has_specials - assert voc2.specials == (vocab.SpecialVocabSymbols.UNK,) - - -def test_vocab_dict_normal_dict_use(): - vocab_dict = vocab.VocabDict() - vocab_dict["first"] = 2 - vocab_dict["second"] = 5 - assert len(vocab_dict) == 2 - assert vocab_dict["first"] == 2 - assert vocab_dict["second"] == 5 - - -def test_vocab_dict_default_factory(): - vocab_dict = vocab.VocabDict(default_factory=lambda: "default") - 
vocab_dict["item"] = 1 - assert len(vocab_dict) == 1 - assert vocab_dict["unkown_element"] == "default" - assert "unkown_element" not in vocab_dict - assert len(vocab_dict) == 1 - - -def test_vocab_dict_default_factory_none_error(): - vocab_dict = vocab.VocabDict(default_factory=None) - with pytest.raises(KeyError): - vocab_dict["item_not_in_dict"] + assert voc2.specials == (vocab.UNK(),) def test_reverse_numericalize(): @@ -421,3 +425,26 @@ def test_reverse_numericalize_not_finalized(): with pytest.raises(RuntimeError): voc.reverse_numericalize(voc.numericalize(words)) + + +def test_vocab_static_constructors(): + specials = [vocab.PAD(), vocab.UNK()] + voc = vocab.Vocab(specials=specials) + data = ["tree", "plant", "grass"] + voc = (voc + set(data)) + {"plant"} + voc.finalize() + + itos2voc = vocab.Vocab.from_itos(voc.itos) + # Only the frequencies will be different because + # we don't transfer this information, so the full + # vocab1 == vocab2 will fail. Perhaps split equality + # checks for vocab on before/after finalization? + + assert itos2voc.itos == voc.itos + assert itos2voc.stoi == voc.stoi + assert itos2voc.specials == voc.specials + + stoi2voc = vocab.Vocab.from_stoi(voc.stoi) + assert stoi2voc.itos == voc.itos + assert stoi2voc.stoi == voc.stoi + assert stoi2voc.specials == voc.specials diff --git a/tests/vectorizers/test_tfidf.py b/tests/vectorizers/test_tfidf.py index 945559e7..4a5be592 100644 --- a/tests/vectorizers/test_tfidf.py +++ b/tests/vectorizers/test_tfidf.py @@ -4,7 +4,7 @@ from podium.field import Field from podium.vectorizers.tfidf import CountVectorizer, TfIdfVectorizer -from podium.vocab import SpecialVocabSymbols, Vocab +from podium.vocab import PAD, UNK, Vocab TABULAR_TEXT = ("a b c", "a", "a b c d", "a", "d b", "d c g", "b b b b b b") @@ -49,7 +49,7 @@ def test_build_count_matrix_from_tensor_without_specials(): def test_build_count_matrix_from_tensor_with_specials(): - vocab = Vocab(specials=(SpecialVocabSymbols.UNK, SpecialVocabSymbols.PAD)) + vocab = Vocab(specials=(UNK(), PAD())) for i in DATA: vocab += i.split(" ") vocab.finalize() @@ -72,7 +72,7 @@ def test_build_count_matrix_from_tensor_with_specials(): def test_build_count_matrix_out_of_vocab_words(): - vocab = Vocab(specials=(SpecialVocabSymbols.UNK, SpecialVocabSymbols.PAD)) + vocab = Vocab(specials=(UNK(), PAD())) vocab_words = ["this", "is", "the", "first", "document"] vocab += vocab_words vocab.finalize() @@ -108,13 +108,11 @@ def test_build_count_matrix_costum_specials_vocab_without_specials(): def test_build_count_matrix_costum_specials_vocab_with_specials(): - vocab = Vocab(specials=(SpecialVocabSymbols.UNK, SpecialVocabSymbols.PAD)) + vocab = Vocab(specials=(UNK(), PAD())) vocab_words = ["this", "is", "the", "first", "document"] vocab += vocab_words vocab.finalize() - tfidf = TfIdfVectorizer( - vocab=vocab, specials=[SpecialVocabSymbols.PAD, "this", "first"] - ) + tfidf = TfIdfVectorizer(vocab=vocab, specials=[PAD(), "this", "first"]) tfidf._init_special_indexes() numericalized_data = get_numericalized_data(data=DATA, vocab=vocab) @@ -126,7 +124,7 @@ def test_build_count_matrix_costum_specials_vocab_with_specials(): def test_specials_indexes(): - specials = (SpecialVocabSymbols.UNK, SpecialVocabSymbols.PAD) + specials = (UNK(), PAD()) vocab = Vocab(specials=specials) for i in DATA: vocab += i.split(" ") @@ -247,7 +245,7 @@ def test_count_vectorizer_examples_none(tabular_dataset): def test_count_matrix_specials_indexes(): - specials = (SpecialVocabSymbols.UNK, SpecialVocabSymbols.PAD) + 
specials = (UNK(), PAD()) vocab = Vocab(specials=specials) for i in DATA: vocab += i.split(" ")