Rework examples #316

Merged 31 commits on Apr 2, 2021

Commits (31)
31b2792
Fix references in colabs
mariosasko Mar 31, 2021
c7f23d6
Rework examples
mttk Mar 31, 2021
a6d88a8
Merge
mttk Mar 31, 2021
ecd9c26
Finalize tfidf example
mttk Mar 31, 2021
f6726e4
Fix shuffling cost
mttk Apr 1, 2021
de5d083
Fix shuffling cost
mttk Apr 1, 2021
9123789
Fix shuffling cost+
mttk Apr 1, 2021
78757a3
Fix shuffling cost+
mttk Apr 1, 2021
1bd5887
Fix shuffling cost+
mttk Apr 1, 2021
7fec7f6
Merge branch 'master' into examples_rework
mttk Apr 1, 2021
6c9b67b
merge
mttk Apr 1, 2021
2ed6b95
Finalize pytorch rnn example
mttk Apr 1, 2021
0183c38
Finalize examples
mttk Apr 1, 2021
0c04847
Move examples back to examples folder
mttk Apr 1, 2021
ab3f504
Merge branch 'examples_rework' of github.com:TakeLab/podium into exam…
mariosasko Apr 1, 2021
9f77e5a
Add examples directory to notebooks
mttk Apr 1, 2021
73ab602
Remove debug print
mttk Apr 1, 2021
7b78c81
Fix tfidf example, improve notebooks
mariosasko Apr 1, 2021
e5cb8c6
Fix conflict
mariosasko Apr 1, 2021
0ce4f20
Move examples notebooks to notebooks/examples
mariosasko Apr 1, 2021
23b44d1
Fix JS colab condition
mariosasko Apr 1, 2021
3827d1b
Merge branch 'master' into examples_rework
mttk Apr 1, 2021
dcb6b9e
Comments
mttk Apr 1, 2021
63bed39
Delete examples (the camera ready ones are migrated into docs)
mttk Apr 1, 2021
6ff3314
Remove examples dir from commands
mttk Apr 1, 2021
715859d
Remove examples dir from action
mttk Apr 1, 2021
cf26d20
Update readme outputs
mttk Apr 1, 2021
d5a37e7
Add roadmap
mttk Apr 2, 2021
7436d30
Add roadmap
mttk Apr 2, 2021
a3cd42e
Add roadmap
mttk Apr 2, 2021
10bf1cf
Polish, comments, rename BasicVectorStorage to WordVectors
mttk Apr 2, 2021
2 changes: 2 additions & 0 deletions docs/source/_static/js/custom.js
@@ -16,6 +16,8 @@ const hasNotebook = [
"advanced",
"preprocessing",
"walkthrough",
"tfidf_example",
"pytorch_rnn_example"
]

function addIcon() {
8 changes: 4 additions & 4 deletions docs/source/advanced.rst
@@ -448,18 +448,18 @@ The ``bucket_sort_key`` function defines how the instances in the dataset should
For Iterator, padding = 148141 out of 281696 = 52.588961149608096%
For BucketIterator, padding = 2125 out of 135680 = 1.5661851415094339%

-As we can see, the difference between using a regular Iterator and a BucketIterator is massive. Not only do we reduce the amount of padding, we have reduced the total amount of tokens processed by about 50%. The SST dataset, however, is a relatively small dataset so this experiment might be a bit biased. Let's take a look at the same statistics for the :class:`podium.datasets.impl.IMDB` dataset. After changing the highligted data loading line in the first snippet to:
+As we can see, the difference between using a regular Iterator and a BucketIterator is massive. Not only do we reduce the amount of padding, we have reduced the total amount of tokens processed by about 50%. The SST dataset, however, is a relatively small dataset so this experiment might be a bit biased. Let's take a look at the same statistics for the :class:`podium.datasets.impl.IMDB` dataset. After changing the data loading line in the first snippet to:

.. code-block:: rest

-train, test = IMDB.get_dataset_splits(fields=fields)
+>>> train, test = IMDB.get_dataset_splits(fields=fields)

And re-running the code, we obtain the following, still significant improvement:

.. code-block:: rest

-For Iterator, padding = 13569936 out of 19414616 = 69.89546432440385%
-For BucketIterator, padding = 259800 out of 6104480 = 4.255890755641758%
+For Iterator, padding = 13569936 out of 19414616 = 69.89%
+For BucketIterator, padding = 259800 out of 6104480 = 4.25%

Generally, using bucketing when iterating over your NLP dataset is preferred and will save you quite a bit of processing time.

273 changes: 273 additions & 0 deletions docs/source/examples/pytorch_rnn_example.rst
@@ -0,0 +1,273 @@
PyTorch RNN classifier
=======================

In this example, we will cover a simple RNN-based classifier model implemented in PyTorch. We will use the IMDB dataset loaded from 🤗/datasets, preprocess it with Fields, and briefly train the model.
While having a GPU is not necessary, it is recommended, as otherwise training the model, even for a single epoch, will take a while.

Loading a dataset from 🤗/datasets
-----------------------------------

As we have covered in :ref:`hf-loading`, we have implemented wrappers around 🤗 dataset classes to enable working with the plethora of datasets implemented therein. We will now briefly go through (1) loading a dataset from 🤗/datasets and (2) wrapping it in Podium classes.

.. code-block:: python

>>> import datasets
>>> imdb = datasets.load_dataset('imdb')
>>> print(imdb)
DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    unsupervised: Dataset({
        features: ['text', 'label'],
        num_rows: 50000
    })
})
>>> from pprint import pprint
>>> pprint(imdb['train'].features)
{'label': ClassLabel(num_classes=2, names=['neg', 'pos'], names_file=None, id=None),
'text': Value(dtype='string', id=None)}

By calling ``load_dataset``, the dataset was downloaded and cached on disk through the ``datasets`` library. The dataset has two splits we are interested in (``train`` and ``test``).
The main thing we need to pay attention to is the ``features`` of the dataset, in this case ``text`` and ``label``. These features, or data columns, need to be mapped to (and processed by) Podium Fields.

For convenience, we have implemented automatic ``Field`` type inference from 🤗 dataset features -- however, it is far from perfect, as we have to make many assumptions along the way. We will now wrap the IMDB dataset in Podium and show the automatically inferred Fields.

.. code-block:: python

>>> from podium.datasets.hf import HFDatasetConverter as HF
>>> splits = HF.from_dataset_dict(imdb)
>>> imdb_train, imdb_test = splits['train'], splits['test']
>>> imdb_train.finalize_fields() # Construct the vocab
>>> print(*imdb_train.fields, sep="\n")
Field({
    name: 'text',
    keep_raw: False,
    is_target: False,
    vocab: Vocab({specials: ('<UNK>', '<PAD>'), eager: False, is_finalized: True, size: 280619})
})
LabelField({
    name: 'label',
    keep_raw: False,
    is_target: True
})

Both Fields were constructed correctly, but there are a couple of drawbacks for this concrete dataset. Firstly, the vocabulary is very large (``280619`` tokens) -- we would like to trim it down to a reasonable size, as we won't be using subword tokenization in this example.

.. code-block:: python

>>> print(imdb_train[0])
Example({
text: (None, ['Bromwell', 'High', 'is', 'a', 'cartoon', 'comedy.', 'It', 'ran', 'at', 'the', 'same', 'time', 'as', 'some', 'other', 'programs', 'about', 'school', 'life,', 'such', 'as', '"Teachers".', 'My', '35', 'years', 'in', 'the', 'teaching', 'profession', 'lead', 'me', 'to', 'believe', 'that', 'Bromwell', "High's", 'satire', 'is', 'much', 'closer', 'to', 'reality', 'than', 'is', '"Teachers".', 'The', 'scramble', 'to', 'survive', 'financially,', 'the', 'insightful', 'students', 'who', 'can', 'see', 'right', 'through', 'their', 'pathetic', "teachers'", 'pomp,', 'the', 'pettiness', 'of', 'the', 'whole', 'situation,', 'all', 'remind', 'me', 'of', 'the', 'schools', 'I', 'knew', 'and', 'their', 'students.', 'When', 'I', 'saw', 'the', 'episode', 'in', 'which', 'a', 'student', 'repeatedly', 'tried', 'to', 'burn', 'down', 'the', 'school,', 'I', 'immediately', 'recalled', '.........', 'at', '..........', 'High.', 'A', 'classic', 'line:', 'INSPECTOR:', "I'm", 'here', 'to', 'sack', 'one', 'of', 'your', 'teachers.', 'STUDENT:', 'Welcome', 'to', 'Bromwell', 'High.', 'I', 'expect', 'that', 'many', 'adults', 'of', 'my', 'age', 'think', 'that', 'Bromwell', 'High', 'is', 'far', 'fetched.', 'What', 'a', 'pity', 'that', 'it', "isn't!"]),
label: (None, 1)
})

When inspecting a concrete instance, there are a few more things to note. Firstly, IMDB instances can be quite long (on average, around 200 tokens per instance); secondly, the text wasn't tokenized properly near sentence boundaries (due to the default ``str.split`` tokenizer); and lastly, the text has varying casing.
We will instead define our own Fields for the corresponding features, add post-tokenization hooks which transform the data, and use those Fields to replace the automatically inferred ones:

.. code-block:: python

>>> from podium import Field, LabelField, Vocab
>>>
>>> # Lowercasing as a post-tokenization hook
>>> def lowercase(raw, tokenized):
...     return raw, [token.lower() for token in tokenized]
>>>
>>> # Truncating as a post-tokenization hook
>>> def truncate(raw, tokenized, max_length=200):
...     return raw, tokenized[:max_length]
>>>
>>> vocab = Vocab(max_size=10000)
>>> text = Field(name="text",
... numericalizer=vocab,
... include_lengths=True,
... tokenizer="spacy-en_core_web_sm",
... posttokenize_hooks=[truncate, lowercase])
>>>
>>> # The labels are already mapped to indices in 🤗/datasets, so we
>>> # pass them through unchanged
>>> label = LabelField(name="label", numericalizer=lambda x: x)
>>> fields = {
... 'text': text,
... 'label': label
... }
>>>
>>> # Use the given Fields to load the dataset again
>>> splits = HF.from_dataset_dict(imdb, fields=fields)
>>> imdb_train, imdb_test = splits['train'], splits['test']
>>> imdb_train.finalize_fields()
>>> print(imdb_train)
HFDatasetConverter({
    dataset_name: imdb,
    size: 25000,
    fields: [
        Field({
            name: 'text',
            keep_raw: False,
            is_target: False,
            vocab: Vocab({specials: ('<UNK>', '<PAD>'), eager: False, is_finalized: True, size: 10000})
        }),
        LabelField({
            name: 'label',
            keep_raw: False,
            is_target: True
        })
    ]
})
>>> print(imdb_train[0])
Example({
text: (None, ['bromwell', 'high', 'is', 'a', 'cartoon', 'comedy', '.', 'it', 'ran', 'at', 'the', 'same', 'time', 'as', 'some', 'other', 'programs', 'about', 'school', 'life', ',', 'such', 'as', '"', 'teachers', '"', '.', 'my', '35', 'years', 'in', 'the', 'teaching', 'profession', 'lead', 'me', 'to', 'believe', 'that', 'bromwell', 'high', "'s", 'satire', 'is', 'much', 'closer', 'to', 'reality', 'than', 'is', '"', 'teachers', '"', '.', 'the', 'scramble', 'to', 'survive', 'financially', ',', 'the', 'insightful', 'students', 'who', 'can', 'see', 'right', 'through', 'their', 'pathetic', 'teachers', "'", 'pomp', ',', 'the', 'pettiness', 'of', 'the', 'whole', 'situation', ',', 'all', 'remind', 'me', 'of', 'the', 'schools', 'i', 'knew', 'and', 'their', 'students', '.', 'when', 'i', 'saw', 'the', 'episode', 'in', 'which', 'a', 'student', 'repeatedly', 'tried', 'to', 'burn', 'down', 'the', 'school', ',', 'i', 'immediately', 'recalled', '.........', 'at', '..........', 'high', '.', 'a', 'classic', 'line', ':', 'inspector', ':', 'i', "'m", 'here', 'to', 'sack', 'one', 'of', 'your', 'teachers', '.', 'student', ':', 'welcome', 'to', 'bromwell', 'high', '.', 'i', 'expect', 'that', 'many', 'adults', 'of', 'my', 'age', 'think', 'that', 'bromwell', 'high', 'is', 'far', 'fetched', '.', 'what', 'a', 'pity', 'that', 'it', 'is', "n't", '!']),
label: (None, 1)
})

Here, we can see the effect of our hooks and of using the spaCy tokenizer. Our dataset is now a bit cleaner to work with. Some data cleaning would still be desirable, such as removing tokens which only contain punctuation; we sketch one such hook below, but leave wiring it into the Field as an exercise for the reader :)
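
A minimal sketch of such a cleaning hook, assuming we simply drop tokens made up entirely of punctuation characters (it follows the same ``(raw, tokenized)`` signature as the ``lowercase`` and ``truncate`` hooks defined earlier; the name ``remove_punct`` is our own):

.. code-block:: python

>>> import string
>>>
>>> # Post-tokenization hook that drops punctuation-only tokens
>>> def remove_punct(raw, tokenized):
...     kept = [tok for tok in tokenized if not all(ch in string.punctuation for ch in tok)]
...     return raw, kept
>>>
>>> # It could then be appended to the Field's hooks, e.g.:
>>> # posttokenize_hooks=[truncate, lowercase, remove_punct]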

Loading pretrained embeddings
-----------------------------
In most use cases, we want to use pretrained word embeddings along with our neural model. With Podium, this process is very simple: if your Field uses a vocabulary, it has already built an inventory of the tokens in your dataset.

As an example, we will use the `GloVe <https://nlp.stanford.edu/projects/glove/>`__ vectors. You can read more about loading pretrained vectors in :ref:`pretrained`; in short, the procedure has two steps: (1) initialize the vector class, which sets all the required paths, and (2) obtain the vectors for a predefined list of words by calling ``load_vocab``.

.. code-block:: python

>>> from podium.vectorizers import GloVe
>>> vocab = fields['text'].vocab
>>> glove = GloVe()
>>> embeddings = glove.load_vocab(vocab)
>>> print(f"For vocabulary of size: {len(vocab)} loaded embedding matrix of shape: {embeddings.shape}")
For vocabulary of size: 10000 loaded embedding matrix of shape: (10000, 300)
>>> # We can obtain vectors for a single word (given the word is loaded) like this:
>>> word = "sport"
>>> print(f"Vector for {word}: {glove.token_to_vector(word)}")
Vector for sport: [ 0.34566 0.15934 0.48444 -0.13693 0.18737 0.2678
-0.39159 0.4931 -0.76111 -1.4586 0.41475 0.55837
...
0.13802 0.36619 0.19734 0.35701 -0.42228 -0.25242
-0.050651 -0.041129 0.15092 0.22084 0.52252 -0.27224 ]
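
As a quick sanity check, you may also want to see how many vocabulary tokens were not found in GloVe. A rough sketch, assuming tokens missing from GloVe end up as all-zero rows in the returned matrix (check your vectorizer's default initialization for missing tokens before relying on this):

.. code-block:: python

>>> import numpy as np
>>> # Count rows of the embedding matrix that are entirely zero
>>> num_missing = int(np.sum(~embeddings.any(axis=1)))
>>> print(f"{num_missing} of {len(vocab)} vocabulary tokens have no pretrained vector")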

Defining a simple neural model in PyTorch
------------------------------------------

In this section, we will implement a very simple neural classification model -- a 2-layer BiGRU with a single hidden layer classifier on top of its last hidden state. Many improvements to the model can be made, but this is not our current focus.

.. code-block:: python

>>> import torch
>>> import torch.nn as nn
>>> import torch.nn.functional as F
>>>
>>> from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence
>>>
>>> class RNNClassifier(nn.Module):
...     def __init__(self, embedding, embed_dim=300, hidden_dim=300, num_labels=2):
...         super(RNNClassifier, self).__init__()
...         self.embedding = embedding
...         self.encoder = nn.GRU(
...             input_size=embed_dim,
...             hidden_size=hidden_dim,
...             num_layers=2,
...             bidirectional=True,
...             dropout=0.3
...         )
...         self.decoder = nn.Sequential(
...             nn.Linear(2*hidden_dim, hidden_dim),
...             nn.Tanh(),
...             nn.Linear(hidden_dim, num_labels)
...         )
...
...     def forward(self, x, lengths):
...         e = self.embedding(x)
...         h_pack = pack_padded_sequence(e,
...                                       lengths,
...                                       enforce_sorted=False,
...                                       batch_first=True)
...
...         _, h = self.encoder(h_pack)  # [2L x B x H]
...
...         # Concatenate the final states of the forward and backward directions
...         h = torch.cat([h[-1], h[-2]], dim=-1)  # [B x 2H]
...         return self.decoder(h)
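
As a quick sanity check of the shapes involved, we can run the untrained model on a small random batch (the dummy sizes below are our own illustration):

.. code-block:: python

>>> # Dummy batch: 4 instances of 20 token indices each, "vocabulary" of 100 tokens
>>> dummy_model = RNNClassifier(nn.Embedding(100, 300))
>>> x = torch.randint(0, 100, (4, 20))       # [B x T] token indices
>>> lengths = torch.tensor([20, 17, 12, 8])  # true (unpadded) lengths
>>> print(dummy_model(x, lengths).shape)
torch.Size([4, 2])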

That's our model. We will now define the prerequisites for PyTorch model training. We will use a GPU for speed, although running the model for a single epoch is possible, albeit time-consuming, even without one.

.. code-block:: python

>>> embed_dim = 300
>>> padding_index = text.vocab.get_padding_index()
>>> embedding_matrix = nn.Embedding(len(text.vocab), embed_dim,
... padding_idx=padding_index)
>>> # Copy the pretrained GloVe word embeddings
>>> embedding_matrix.weight.data.copy_(torch.from_numpy(embeddings))
>>>
>>> device = torch.device("cuda:0")
>>> model = RNNClassifier(embedding_matrix)
>>> model = model.to(device)
>>> criterion = nn.CrossEntropyLoss()
>>> optimizer = torch.optim.Adam(model.parameters())
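
Note that with this setup the embedding matrix remains trainable and will be fine-tuned together with the rest of the model. If you would rather keep the pretrained GloVe vectors fixed, a minimal sketch:

.. code-block:: python

>>> # Optionally freeze the pretrained embeddings so that only the GRU
>>> # and the classifier head receive gradient updates
>>> embedding_matrix.weight.requires_grad = False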

Now that we have the model setup code ready, we will first define a helper method to measure the accuracy of our model after each epoch:

.. code-block:: python

>>> import numpy as np
>>> def update_stats(accuracy, confusion_matrix, logits, y):
...     _, max_ind = torch.max(logits, 1)
...     equal = torch.eq(max_ind, y)
...     correct = int(torch.sum(equal))
...
...     for j, i in zip(max_ind, y):
...         confusion_matrix[int(i), int(j)] += 1
...     return accuracy + correct, confusion_matrix

and now the training loop for the model:

.. code-block:: python

>>> import tqdm
>>> def train(model, data, optimizer, criterion, num_labels):
...     model.train()
...     accuracy, confusion_matrix = 0, np.zeros((num_labels, num_labels), dtype=int)
...     for batch_num, batch in tqdm.tqdm(enumerate(data), total=len(data)):
...         x, lens = batch.text
...         y = batch.label
...         logits = model(x, lens)
...         accuracy, confusion_matrix = update_stats(accuracy, confusion_matrix, logits, y)
...         optimizer.zero_grad()  # clear gradients accumulated in the previous step
...         loss = criterion(logits, y.squeeze())
...         loss.backward()
...         optimizer.step()
...     print("[Accuracy]: {}/{} : {:.3f}%".format(
...         accuracy, len(data)*data.batch_size, accuracy / len(data) / data.batch_size * 100))
...     return accuracy, confusion_matrix
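
The loop above reports accuracy on the data it is trained on. For a held-out estimate, here is a sketch of an evaluation function that mirrors ``train`` but disables gradient tracking (this helper is our own addition; you can call it with an iterator over ``imdb_test`` once iteration is set up in the next section):

.. code-block:: python

>>> def evaluate(model, data, num_labels):
...     model.eval()
...     accuracy, confusion_matrix = 0, np.zeros((num_labels, num_labels), dtype=int)
...     with torch.no_grad():
...         for batch in data:
...             x, lens = batch.text
...             y = batch.label
...             logits = model(x, lens)
...             accuracy, confusion_matrix = update_stats(accuracy, confusion_matrix, logits, y)
...     print("[Eval accuracy]: {}/{} : {:.3f}%".format(
...         accuracy, len(data) * data.batch_size, accuracy / len(data) / data.batch_size * 100))
...     return accuracy, confusion_matrix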

And with that, we are done with our model code. Let's turn back to Podium and see how to set up batching so that our training loop can start ticking.

Minibatching data in Podium
--------------------------------

We have covered batching data in :ref:`minibatching` and advanced batching through bucketing in :ref:`bucketing`. Here we will use the plain Iterator, and leave switching to bucketing as an exercise so you can see how much the model speeds up when padding is minimized. One change we would like to make when iterating over the data is to obtain the data matrices as torch tensors on the ``device`` we defined previously. We will now demonstrate how to do this by setting the ``matrix_class`` argument of the :class:`podium.datasets.Iterator`\:

.. code-block:: python

>>> from podium import Iterator
>>> # Closure for converting data to given device
>>> def gpu_tensor(data):
...     return torch.tensor(data).to(device)
>>> # Initialize our iterator
>>> train_iter = Iterator(imdb_train, batch_size=32, matrix_class=gpu_tensor)
>>>
>>> epochs = 5
>>> for epoch in range(epochs):
...     train(model, train_iter, optimizer, criterion, num_labels=2)
[Accuracy]: 20050/25024 : 80.123%
[Accuracy]: 22683/25024 : 90.645%
[Accuracy]: 23709/25024 : 94.745%
[Accuracy]: 24323/25024 : 97.199%
[Accuracy]: 24595/25024 : 98.286%

And we are done! In our case, the model takes about one minute per epoch on a GPU, but this can be sped up by using bucketing, which we recommend you try out yourself.
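
As a rough sketch of what that could look like, assuming ``BucketIterator`` can be imported like ``Iterator`` and accepts a ``bucket_sort_key`` as described in :ref:`bucketing` (the sort key below, which orders instances by tokenized text length, is our own illustrative choice):

.. code-block:: python

>>> from podium import BucketIterator
>>>
>>> # Sort instances within a look-ahead window by text length so that
>>> # each minibatch contains similarly sized instances and padding is minimized
>>> def text_len_sort_key(instance):
...     raw, tokenized = instance.text
...     return len(tokenized)
>>>
>>> bucket_iter = BucketIterator(imdb_train, batch_size=32,
...                              matrix_class=gpu_tensor,
...                              bucket_sort_key=text_len_sort_key)
>>> # The training loop itself stays the same:
>>> # train(model, bucket_iter, optimizer, criterion, num_labels=2)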