Rework examples #316

Merged — 31 commits merged into master, Apr 2, 2021
Conversation

@mttk (Member) commented Mar 31, 2021:

  • Move examples to RST files
  • Fixes "Error when dataset is None in SingleBatchIterator" (#269) by setting batch_size to a placeholder when no dataset is set in the Iterator
  • Fixes an iterator issue where the whole dataset was copied when shuffled (this caused slowdowns when iterating over HF dataset wrappers, since they are disk-backed and the copy made them concrete; see the sketch below)

Closes #172, #245, #269
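
The shuffle fix in the last bullet boils down to shuffling indices rather than instances. A minimal sketch, assuming only that the dataset supports indexing by a list of indices (as the iterator code discussed below does); `iter_batches` is an illustrative name, not Podium's internals:

```python
import random

def iter_batches(dataset, batch_size, shuffle=True, seed=42):
    # Shuffle a list of indices instead of copying the dataset itself,
    # so a disk-backed dataset (e.g. an HF wrapper) never gets
    # materialized wholesale.
    indices = list(range(len(dataset)))
    if shuffle:
        random.Random(seed).shuffle(indices)
    for i in range(0, len(dataset), batch_size):
        # Only the current batch is pulled into memory.
        yield dataset[indices[i : i + batch_size]]
```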

@mttk mttk self-assigned this Mar 31, 2021
@mttk mttk changed the title from "[WIP] Rework examples" to "Rework examples" Apr 1, 2021
@ivansmokovic (Collaborator) left a comment:

LGTM.

Took a good look at the examples in the docs. Good work, love them 👍.

docs/source/examples/pytorch_rnn_example.rst — review thread, resolved

>>> from podium.vectorizers import TfIdfVectorizer
>>> tfidf_vectorizer = TfIdfVectorizer()
>>> tfidf_vectorizer.fit(train, field=train.field('text'))
@ivansmokovic (Collaborator) commented Apr 1, 2021:

Idea: add the option to pass a str as field. The tfidf_vectorizer could then default to extracting the field from the dataset itself, so it could be written as

tfidf_vectorizer.fit(train, field='text')

which IMO looks nicer and covers 99% of the use case.
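
A hypothetical sketch of what that overload could look like; `TfIdfVectorizerSketch` and `_fit_on_field` are invented names, and only the `dataset.field(name)` lookup from the example above is assumed:

```python
from typing import Union

class TfIdfVectorizerSketch:
    """Illustrative only; not Podium's TfIdfVectorizer."""

    def fit(self, dataset, field: Union[str, "Field"]):
        # Resolve a field name against the dataset itself, so callers
        # can pass field='text' instead of train.field('text').
        if isinstance(field, str):
            field = dataset.field(field)
        return self._fit_on_field(dataset, field)

    def _fit_on_field(self, dataset, field):
        ...  # tf-idf fitting over the given field would go here
```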

@mttk (Member, Author) replied:

True, will add this change in a subsequent PR.

- batch_instances = data[i : i + self.batch_size]
+ for i in range(start, len(self._dataset), self.batch_size):
+     batch_indices = indices[i : i + self.batch_size]
+     batch_instances = self._dataset[batch_indices]
@ivansmokovic (Collaborator) commented:

How does this work with HF datasets? It doesn't cause a copy of the whole dataset as before? Does it only copy the range? In any case, at some point we should wrap this in a view. Leave it as-is for now, as it can be changed later without changing the API.

@mttk (Member, Author) replied:

It doesn't; the HFDatasetConverter implements its own `__getitem__`, which essentially delegates to the underlying Arrow dataset implementation, so everything stays on disk rather than in memory. I'm a bit wary of how indexing vs. slicing behaves speed-wise; I'll check this at some point.
We should definitely explore whether there's a cleaner way of doing this.
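
A lazy view along the lines suggested above might look roughly like this (purely illustrative; not part of this PR):

```python
class DatasetView:
    """Lazy view over a dataset restricted to a list of indices.

    Nothing is copied at construction; instances are fetched from the
    underlying (possibly disk-backed) dataset only on access.
    """

    def __init__(self, dataset, indices):
        self._dataset = dataset
        self._indices = indices

    def __len__(self):
        return len(self._indices)

    def __getitem__(self, i):
        # Delegates to the wrapped dataset, so Arrow-backed storage
        # stays on disk until an item is actually read.
        return self._dataset[self._indices[i]]
```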

podium/datasets/iterator.py — review thread, resolved
Roadmap.md — review thread, resolved
.. code-block:: python

    >>> # Use [1-3]-grams, inclusive
    >>> ngram_hook = NGramHook(1, 3)
@ivansmokovic (Collaborator) commented:

This looks really great.
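
For context, a minimal sketch of the expansion an n-gram hook in the spirit of `NGramHook(1, 3)` performs (illustrative; not Podium's implementation):

```python
def ngrams(tokens, min_n=1, max_n=3):
    # Emit every contiguous n-gram with min_n <= n <= max_n.
    out = []
    for n in range(min_n, max_n + 1):
        for i in range(len(tokens) - n + 1):
            out.append(tuple(tokens[i : i + n]))
    return out

# ngrams(["the", "cat", "sat"], 1, 2)
# -> [('the',), ('cat',), ('sat',), ('the', 'cat'), ('cat', 'sat')]
```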

@@ -494,9 +493,12 @@ def __init__(self, dataset: DatasetBase = None, shuffle=True, add_padding=True):
         iterator. If set to ``False``, numericalized Fields will be
         returned as python lists of ``matrix_class`` instances.
     """
+    batch_size = len(dataset) if dataset else 0
@ivansmokovic (Collaborator) commented:

A batch_size of 0 looks really weird. What exactly does it mean, and how does that fit into range()?

@mttk (Member, Author) replied:

Yeah, I wasn't sure what to put here, to be honest. I didn't want to make the change we discussed in #269, where the dataset arg would be required, because I'd also need to change a lot of dependent code. The iterator will error out during iteration if the dataset isn't set in any case, and the batch size will be set when the dataset is set, so I think this is OK, although not pretty.

@ivansmokovic (Collaborator) replied:

Perhaps just raise an exception, or set it to None?

@mttk (Member, Author) replied:

Set it to None; tests work fine. This will need to be addressed soonish, though.
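
A sketch of the agreed-upon pattern (illustrative names; not the exact PR code): `batch_size` stays `None` until a dataset is attached, and iteration fails fast without one.

```python
class SingleBatchIteratorSketch:
    """Illustrative only; mirrors the fix discussed above."""

    def __init__(self, dataset=None):
        self._dataset = dataset
        # None is the placeholder: no dataset, no meaningful batch size.
        self.batch_size = len(dataset) if dataset is not None else None

    def set_dataset(self, dataset):
        self._dataset = dataset
        self.batch_size = len(dataset)  # a single batch spans the dataset

    def __iter__(self):
        if self._dataset is None:
            # Error out at iteration time, as described above.
            raise RuntimeError("No dataset set for this iterator.")
        yield self._dataset[list(range(len(self._dataset)))]
```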

@mttk mttk merged commit 34513a4 into master Apr 2, 2021
@mttk mttk deleted the examples_rework branch April 2, 2021 11:06
Successfully merging this pull request may close these issues:

  • Error when dataset is None in SingleBatchIterator
  • Move examples to separate directories