Rework examples #316

Merged — 31 commits merged into master, Apr 2, 2021
Conversation

@mttk (Member) commented Mar 31, 2021:

  • Move examples to RST files
  • Fixes "Error when dataset is None in SingleBatchIterator" (#269) by setting batch_size to a placeholder when no dataset is set in the Iterator
  • Fixes an iterator issue where the whole dataset was copied when shuffled (this caused slowdowns when iterating over HF dataset wrappers, since they are disk-backed and the copy made them concrete; see the sketch below)

Closes #172, #245, #269
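
The shuffle fix in the last bullet boils down to shuffling indices rather than instances. A minimal sketch, assuming only that the dataset supports indexing by a list of indices (as the iterator code discussed below does); `iter_batches` is an illustrative name, not Podium's internals:

```python
import random

def iter_batches(dataset, batch_size, shuffle=True, seed=42):
    # Shuffle a list of indices instead of copying the dataset itself,
    # so a disk-backed dataset (e.g. an HF wrapper) never gets
    # materialized wholesale.
    indices = list(range(len(dataset)))
    if shuffle:
        random.Random(seed).shuffle(indices)
    for i in range(0, len(dataset), batch_size):
        # Only the current batch is pulled into memory.
        yield dataset[indices[i : i + batch_size]]
```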

@mttk mttk self-assigned this Mar 31, 2021
@mttk mttk changed the title from "[WIP] Rework examples" to "Rework examples" Apr 1, 2021
@ivansmokovic (Collaborator) left a comment:

LGTM.

Took a good look at the examples in the docs. Good work, love them 👍.

docs/source/examples/pytorch_rnn_example.rst — review thread, resolved

>>> from podium.vectorizers import TfIdfVectorizer
>>> tfidf_vectorizer = TfIdfVectorizer()
>>> tfidf_vectorizer.fit(train, field=train.field('text'))
@ivansmokovic (Collaborator) commented Apr 1, 2021:

Idea: add the option to pass a str as field. The tfidf_vectorizer could then default to extracting the field from the dataset itself, so it could be written as

tfidf_vectorizer.fit(train, field='text')

which IMO looks nicer and covers 99% of the use case.
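
A hypothetical sketch of what that overload could look like; `TfIdfVectorizerSketch` and `_fit_on_field` are invented names, and only the `dataset.field(name)` lookup from the example above is assumed:

```python
from typing import Union

class TfIdfVectorizerSketch:
    """Illustrative only; not Podium's TfIdfVectorizer."""

    def fit(self, dataset, field: Union[str, "Field"]):
        # Resolve a field name against the dataset itself, so callers
        # can pass field='text' instead of train.field('text').
        if isinstance(field, str):
            field = dataset.field(field)
        return self._fit_on_field(dataset, field)

    def _fit_on_field(self, dataset, field):
        ...  # tf-idf fitting over the given field would go here
```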

@mttk (Member, Author) replied:

True, will add this change in a subsequent PR.

- batch_instances = data[i : i + self.batch_size]
+ for i in range(start, len(self._dataset), self.batch_size):
+     batch_indices = indices[i : i + self.batch_size]
+     batch_instances = self._dataset[batch_indices]
@ivansmokovic (Collaborator) commented:

How does this work with HF datasets? It doesn't cause a copy of the whole dataset as before? Does it only copy the range? In any case, at some point we should wrap this in a view. Leave it as-is for now, as it can be changed later without changing the API.

@mttk (Member, Author) replied:

It doesn't; the HFDatasetConverter implements its own `__getitem__`, which essentially delegates to the underlying Arrow dataset implementation, so everything stays on disk rather than in memory. I'm a bit wary of how indexing vs. slicing behaves speed-wise; I'll check this at some point.
We should definitely explore whether there's a cleaner way of doing this.
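
A lazy view along the lines suggested above might look roughly like this (purely illustrative; not part of this PR):

```python
class DatasetView:
    """Lazy view over a dataset restricted to a list of indices.

    Nothing is copied at construction; instances are fetched from the
    underlying (possibly disk-backed) dataset only on access.
    """

    def __init__(self, dataset, indices):
        self._dataset = dataset
        self._indices = indices

    def __len__(self):
        return len(self._indices)

    def __getitem__(self, i):
        # Delegates to the wrapped dataset, so Arrow-backed storage
        # stays on disk until an item is actually read.
        return self._dataset[self._indices[i]]
```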

podium/datasets/iterator.py — review thread, resolved
Roadmap.md — review thread, resolved
.. code-block:: python

    >>> # Use [1-3]-grams, inclusive
    >>> ngram_hook = NGramHook(1, 3)
@ivansmokovic (Collaborator) commented:

This looks really great.
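
For context, a minimal sketch of the expansion an n-gram hook in the spirit of `NGramHook(1, 3)` performs (illustrative; not Podium's implementation):

```python
def ngrams(tokens, min_n=1, max_n=3):
    # Emit every contiguous n-gram with min_n <= n <= max_n.
    out = []
    for n in range(min_n, max_n + 1):
        for i in range(len(tokens) - n + 1):
            out.append(tuple(tokens[i : i + n]))
    return out

# ngrams(["the", "cat", "sat"], 1, 2)
# -> [('the',), ('cat',), ('sat',), ('the', 'cat'), ('cat', 'sat')]
```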

@@ -494,9 +493,12 @@ def __init__(self, dataset: DatasetBase = None, shuffle=True, add_padding=True):
         iterator. If set to ``False``, numericalized Fields will be
         returned as python lists of ``matrix_class`` instances.
     """
+    batch_size = len(dataset) if dataset else 0
@ivansmokovic (Collaborator) commented:

A batch_size of 0 looks really weird. What exactly does it mean, and how does that fit into range()?

@mttk (Member, Author) replied:

Yeah, I wasn't sure what to put here, to be honest. I didn't want to make the change we discussed in #269, where the dataset arg would be required, because I'd also need to change a lot of dependent code. The iterator will error out during iteration if the dataset isn't set in any case, and the batch size will be set when the dataset is set, so I think this is OK, although not pretty.

@ivansmokovic (Collaborator) replied:

Perhaps just raise an exception, or set it to None?

@mttk (Member, Author) replied:

Set it to None; tests work fine. This will need to be addressed soonish, though.
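
A sketch of the agreed-upon pattern (illustrative names; not the exact PR code): `batch_size` stays `None` until a dataset is attached, and iteration fails fast without one.

```python
class SingleBatchIteratorSketch:
    """Illustrative only; mirrors the fix discussed above."""

    def __init__(self, dataset=None):
        self._dataset = dataset
        # None is the placeholder: no dataset, no meaningful batch size.
        self.batch_size = len(dataset) if dataset is not None else None

    def set_dataset(self, dataset):
        self._dataset = dataset
        self.batch_size = len(dataset)  # a single batch spans the dataset

    def __iter__(self):
        if self._dataset is None:
            # Error out at iteration time, as described above.
            raise RuntimeError("No dataset set for this iterator.")
        yield self._dataset[list(range(len(self._dataset)))]
```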

@mttk mttk merged commit 34513a4 into master Apr 2, 2021
@mttk mttk deleted the examples_rework branch April 2, 2021 11:06
Successfully merging this pull request may close these issues:

  • Error when dataset is None in SingleBatchIterator
  • Move examples to separate directories