Rework examples #316
Conversation
LGTM.
Took a good look at the examples in the docs. Good work, love them 👍.
```python
>>> from podium.vectorizers import TfIdfVectorizer
>>> tfidf_vectorizer = TfIdfVectorizer()
>>> tfidf_vectorizer.fit(train, field=train.field('text'))
```
Idea: add the option to pass a `str` as `field`. The `tfidf_vectorizer` could then default to extracting the field from the dataset itself, so this could be written as `tfidf_vectorizer.fit(train, field='text')`, which IMO looks nicer and covers 99% of the use case.
True, will add this change in a subsequent PR.
```diff
-batch_instances = data[i : i + self.batch_size]
+for i in range(start, len(self._dataset), self.batch_size):
+    batch_indices = indices[i : i + self.batch_size]
+    batch_instances = self._dataset[batch_indices]
```
How does this work with HF datasets? Does it avoid copying the whole dataset as before, and only copy the requested range? In any case, at some point we should wrap this in a view. Leave it as-is for now, since it can be changed later without changing the API.
It doesn't copy the whole dataset; `HFDatasetConverter` implements its own `__getitem__`, which essentially delegates indexing to the underlying Arrow dataset implementation, so everything stays on disk rather than in memory. I'm a bit wary of how indexing vs. slicing behaves speed-wise; I'll check this at some point. We should definitely explore whether there's a cleaner way of doing this.
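The delegation pattern being described can be sketched as follows. This is a minimal illustration of the idea, not Podium's actual `HFDatasetConverter`; the class name and the list backend are stand-ins:

```python
# Minimal sketch of a converter that delegates indexing to a wrapped
# dataset, so slices are resolved by the backend rather than by
# materialising the whole dataset up front.

class HFDatasetConverterSketch:
    def __init__(self, hf_dataset):
        self._dataset = hf_dataset

    def __getitem__(self, key):
        # An index, slice, or list of indices is forwarded as-is; an
        # Arrow-backed dataset would then load only the requested rows.
        return self._dataset[key]

    def __len__(self):
        return len(self._dataset)

backend = list(range(10))  # stand-in for an Arrow-backed dataset
converter = HFDatasetConverterSketch(backend)
assert converter[2:5] == [2, 3, 4]
assert len(converter) == 10
```

With a real Arrow backend, the wrapper itself never holds the data, which is why everything stays on disk.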
.. code-block:: python

    >>> # Use [1-3]-grams, inclusive
    >>> ngram_hook = NGramHook(1, 3)
This looks really great.
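As a rough illustration of what an `NGramHook(1, 3)`-style hook computes (a hedged sketch; the function below is hypothetical, not Podium's implementation), such a hook expands a token list into all n-grams for n from `min_n` to `max_n`, inclusive:

```python
# Sketch of an n-gram expansion hook: for NGramHook(1, 3) this yields
# all unigrams, bigrams, and trigrams of the token sequence.
def ngram_hook(tokens, min_n=1, max_n=3):
    ngrams = []
    for n in range(min_n, max_n + 1):
        # Slide a window of width n over the token list.
        for i in range(len(tokens) - n + 1):
            ngrams.append(" ".join(tokens[i : i + n]))
    return ngrams

tokens = ["the", "cat", "sat"]
assert ngram_hook(tokens) == [
    "the", "cat", "sat",        # 1-grams
    "the cat", "cat sat",       # 2-grams
    "the cat sat",              # 3-gram
]
```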
podium/datasets/iterator.py (outdated)

```
@@ -494,9 +493,12 @@ def __init__(self, dataset: DatasetBase = None, shuffle=True, add_padding=True):
            iterator. If set to ``False``, numericalized Fields will be
            returned as python lists of ``matrix_class`` instances.
        """

        batch_size = len(dataset) if dataset else 0
```
A `batch_size` of 0 looks really weird. What exactly does it mean, and how does it fit into `range`?
Yeah, I wasn't sure what to put here, to be honest. I didn't want to make the change we discussed in #269 where the `dataset` arg would be required, because I'd also need to change a lot of dependent code. The iterator will error out during iteration if the dataset isn't set in any case, and the batch size will be set when the dataset is set, so I think this is OK, although not pretty.
Perhaps just raise an exception, or set it to `None`?
Set it to `None`; tests work fine. This will need to be addressed soonish, though.
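The placeholder behaviour being discussed could look roughly like this. This is a hypothetical sketch, not the actual Podium Iterator code; the class and method names are stand-ins:

```python
# Sketch of the discussed fix: leave batch_size as a None placeholder
# until a dataset is attached, then default it to the dataset length,
# and raise if iteration starts with no dataset set.

class SingleBatchIteratorSketch:
    def __init__(self, dataset=None):
        self._dataset = dataset
        # None placeholder instead of a meaningless batch_size of 0.
        self.batch_size = len(dataset) if dataset is not None else None

    def set_dataset(self, dataset):
        self._dataset = dataset
        self.batch_size = len(dataset)

    def __iter__(self):
        if self._dataset is None:
            # Error out during iteration, as described in the thread.
            raise RuntimeError("Dataset not set on this iterator.")
        yield self._dataset[0 : self.batch_size]

it = SingleBatchIteratorSketch()
assert it.batch_size is None
it.set_dataset([1, 2, 3])
assert it.batch_size == 3
assert next(iter(it)) == [1, 2, 3]
```

The `None` placeholder makes the "not configured yet" state explicit, while iteration still fails loudly if no dataset is ever attached.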
Sets `batch_size` to a placeholder if a dataset is not set in the Iterator.

Closes #172, #245, #269