
[REVIEW] Implement Feature Request from #1077 on Left Padding #1126

Closed · wants to merge 36 commits
Changes from 8 commits

Commits:
a52a90c
Update docstrings for issue #1077
Sep 14, 2021
ed61663
Merge branch 'main' into 1077-implement
Sep 16, 2021
ff1e396
Implementation of left padding for issue #1077
Sep 16, 2021
d9e457d
Update #1077 implementation
Sep 16, 2021
e25c6e8
Implement #1077 update with docstring and type hinting.
Sep 16, 2021
295d4e2
Merge branch 'main' into 1077-implement
lesnikow Sep 16, 2021
5166d57
Merge branch 'main' into 1077-implement
lesnikow Sep 17, 2021
e55336c
Merge branch 'main' into 1077-implement
lesnikow Sep 20, 2021
af8aa57
Merge branch 'main' of github.com:NVIDIA/NVTabular into 1077-implement
Sep 23, 2021
299d356
Update tensorflow module docstring for docs syntax
Sep 23, 2021
1285783
Expose pad_left to user
Sep 24, 2021
364bcf1
Skip test_distributed_multigpu()
Sep 24, 2021
071b8bf
Add unit test for torch dataloader and padding argument
Sep 24, 2021
3cce162
Update torch test for padding argument
Sep 24, 2021
cebb715
Update unit test for padding argument
Sep 25, 2021
5acd76a
Update dataloader torch to pass new tests
Sep 25, 2021
1684289
Clean up loader/torch module
Sep 25, 2021
a319501
Clean up test_torch_dataloader module
Sep 25, 2021
0be389e
Update tests
Sep 27, 2021
d93f9c5
Add tests for the TensorFlow runtime dataloader
Sep 28, 2021
0c0ce69
Implement pad_left in _build_sparse_tensor TF
Sep 28, 2021
941d2f3
Update torch loader documentation
Sep 28, 2021
7944b2a
Merge branch 'main' of 1077-implement
Sep 28, 2021
76c0024
Cleanup _build_sparese_tensor for TF dataloader
Sep 28, 2021
46847cb
Add docstring to _build_sparse_tensor() for tf
Sep 28, 2021
c7ae873
Update docstring
Sep 28, 2021
d86cec3
Refactor torch dataloader pad_left and _build_spar
Sep 28, 2021
d90e1df
Update pytest decorator
Sep 28, 2021
b21c57d
Cleanup torch loader
Sep 28, 2021
2150ede
Implement pad_left with TF ops
Sep 29, 2021
a51aa44
Implement pad_left with TF ops cleanup
Sep 29, 2021
01749f9
Merge branch 'main' into 1077-implement
lesnikow Sep 29, 2021
b305afa
Update tensorflow dataloader implementation
Sep 30, 2021
2febf1a
Merge branch '1077-implement' of https://github.com/NVIDIA/NVTabular …
Sep 30, 2021
587ef0c
Update pad_left TF unit tests
Sep 30, 2021
dd9927e
Update pad_left code for TF sparse tensors
Sep 30, 2021
19 changes: 10 additions & 9 deletions nvtabular/loader/tensorflow.py
@@ -140,20 +140,20 @@ class KerasSequenceLoader(tf.keras.utils.Sequence, DataLoader):

Iterator output is of the form `(dict(features), list(labels))`,
where each element of the features dict is a
- `feature_name: feature_tensor` and each elemtn of the labels
+ `feature_name: feature_tensor` and each element of the labels
list is a tensor, and all tensors are of shape `(batch_size, 1)`.
Note that this means vectorized continuous and multi-hot categorical
features are not currently supported.
The underlying NVTabular `Dataset` object is stored in the `data`
attribute, and should be used for updating NVTabular `Workflow`
- statistics::
+ statistics:
(benfred marked this conversation as resolved.)

workflow = nvt.Workflow(...)
dataset = KerasSequenceLoader(...)
workflow.update_stats(dataset.data.to_iter(), record_stats=True)

Parameters
- -------------
+ ----------
- paths_or_dataset: str or list(str)
Either a string representing a file pattern (see `tf.glob` for
pattern rules), a list of filenames to be iterated through, or
@@ -205,6 +205,7 @@ class KerasSequenceLoader(tf.keras.utils.Sequence, DataLoader):
dictionary of key: column_name + value: integer representing max sequence length for column
sparse_dense : bool
bool value to activate transforming sparse tensors to dense

"""

_use_nnz = True
@@ -238,7 +239,7 @@ def __init__(
feature_columns, cat_names, cont_names, schema=dataset.schema
)

- # sort the ccolumns to avoid getting incorrect output
+ # Sort the columns to avoid getting incorrect output.
# (https://github.com/NVIDIA/NVTabular/issues/412)
cat_names = _get_embedding_order(cat_names)
cont_names = _get_embedding_order(cont_names)
@@ -265,19 +266,18 @@ def __init__(
self._map_fns = []

def __len__(self):
"""
recreating since otherwise Keras yells at you
"""
"""Recreating since otherwise Keras yells at you."""
# TODO: what's a better way to do this inheritance
# of the appropriate methods? A Metaclass?
DataLoader.stop(self)
return DataLoader.__len__(self)

def __getitem__(self, idx):
"""
implemented exclusively for consistency
Implemented exclusively for consistency
with Keras model.fit. Does not leverage
passed idx in any way
passed idx in any way.

"""
return DataLoader.__next__(self)

@@ -286,6 +286,7 @@ def map(self, fn):
Applying a function to each batch.

This can for instance be used to add `sample_weight` to the model.

"""
self._map_fns.append(fn)

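For the TensorFlow loader, the intended semantics of a `pad_left` option can be illustrated with a framework-free sketch (a hypothetical helper for illustration, not the `KerasSequenceLoader` implementation): shorter list sequences are completed with zeros on the left instead of the right.

```python
def pad_sequences(rows, seq_limit, pad_left=False):
    """Truncate each variable-length row to seq_limit, then zero-pad the
    remainder on the left or right."""
    padded = []
    for row in rows:
        row = list(row[:seq_limit])
        fill = [0] * (seq_limit - len(row))
        padded.append(fill + row if pad_left else row + fill)
    return padded

print(pad_sequences([[1, 2], [3]], 4, pad_left=True))  # [[0, 0, 1, 2], [0, 0, 0, 3]]
print(pad_sequences([[1, 2], [3]], 4))                 # [[1, 2, 0, 0], [3, 0, 0, 0]]
```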
34 changes: 32 additions & 2 deletions nvtabular/loader/torch.py
@@ -42,7 +42,7 @@ class TorchAsyncItr(torch.utils.data.IterableDataset, DataLoader):
batches are the specified size until the final batch.

Parameters
- -----------
+ ----------
dataset : NVTabular dataset
cats : [str]
the list of categorical columns in the dataset
@@ -174,8 +174,38 @@ def _get_sparse_tensor(self, values, indices, num_rows, seq_limit):
sparse_tensor = sparse_tensor.to_dense()
return sparse_tensor

- def _build_sparse_tensor(self, values, offsets, diff_offsets, num_rows, seq_limit):
+ def _build_sparse_tensor(
+     self, values, offsets, diff_offsets, num_rows, seq_limit, padding: str = ""
Member:

How is this padding value supposed to be passed by the user? It seems like this parameter is only set on a non-public method - and isn't set anywhere

Contributor Author:

Thank you Ben for your review. Yes, let me see how to have this option be user-accessible.

Contributor Author:

I will plan to implement these options as user-accessible argument in the signatures for the TorchAsyncItr class in torch.py and in the KerasSequenceLoader class in tensorflow.py, if there are not objections to this.
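One way to make the option user-accessible, as proposed in this thread, is to accept it on the public loader constructor and consult it inside the private builder. A hypothetical plumbing sketch (class name, signature, and helper are illustrative only, not NVTabular's actual API):

```python
class SparseLoaderSketch:
    def __init__(self, pad_left: bool = False):
        # Public, user-facing flag; stored for the private builder below.
        self.padding = "left" if pad_left else ""

    def _build_row(self, row, seq_limit):
        # Private helper: users never pass `padding` directly.
        row = list(row[:seq_limit])
        fill = [0] * (seq_limit - len(row))
        return fill + row if self.padding == "left" else row + fill

loader = SparseLoaderSketch(pad_left=True)
print(loader._build_row([1, 2], 4))  # [0, 0, 1, 2]
```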

Contributor Author:

@benfred I have seen the test files for the dataloaders here. Based on your knowledge of the existing tests, are there some existing tests that you would advise or guide me that I can use as a template to test this padding feature for the two dataloader implementations?

Contributor Author (lesnikow), Sep 23, 2021:

@benfred Would you have any thoughts or advice on how to implement a user-facing interface to this option? Gabriel had suggested to modify a couple of the private methods in this torch dataloader module. I do not see anywhere though that either of these private methods, _build_sparse_tensor() or _get_indices() are used in this torch module or anywhere else in the codebase. My guess is that he had either wanted to call these private methods directly, or was mistaken where to implement this feature. Would you have any advice or guidance on whether to leave this implementation in the private methods or where to expose the padding side argument to users?

):
"""Builds sparse tensors in our torch dataloader.

Parameters
----------
values :
offsets :
diff_offsets :
num_rows :
seq_limit :
padding : str, optional
Padding mode, choose among 'left' for left padding, 'right' for right padding,
or '' for no padding, by default ''

Returns
-------
torch.sparse
Our built torch sparse tensor.

Raises
------
NotImplementedError
Raises this error when this method is called with a not implemented
padding mode string.
"""
indices = self._get_indices(offsets, diff_offsets)
if padding == "left":
indices[:, 1] = seq_limit - 1 - indices[:, 1]
if padding == "right":
raise NotImplementedError
Member:

Should this fail here? Shouldn't we handle this by default?

Contributor Author (lesnikow), Sep 22, 2021:
From the original issue from Gabriel, it sounded to me, based on what was written there, that right padding has already been implemented. In the description for issue 1077, there is for instance: The PyT and TF Dataloader support padding list (sparse) features to the right, which means that shorter list sequences will be completed with 0s in the right.

I did not want to reduplicate it here to avoid doing the same thing in multiple places of the codebase. Let me investigate some more whether this has been already implemented, and hence should not be duplicated, or whether it makes sense to implement this feature here.

Contributor Author:

@benfred Would you know off-hand, based on your knowledge of the codebase, whether right padding has definitely been or not been implemented elsewhere in the repository?

return self._get_sparse_tensor(values, indices, num_rows, seq_limit)
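The left-padding branch above flips each column index to `seq_limit - 1 - index` before the sparse tensor is materialized. A minimal NumPy sketch of that flip (a standalone illustration, not NVTabular's dataloader code) — note that the flip moves values to the right edge but also mirrors element order within each row:

```python
import numpy as np

def dense_from_ragged(values, offsets, seq_limit, padding=""):
    """Scatter a ragged list column (flat values + row offsets) into a dense
    (num_rows, seq_limit) array, zero-padded on the chosen side."""
    num_rows = len(offsets) - 1
    out = np.zeros((num_rows, seq_limit), dtype=values.dtype)
    for row in range(num_rows):
        row_vals = values[offsets[row]:offsets[row + 1]]
        cols = np.arange(len(row_vals))
        if padding == "left":
            # The index flip from the PR: col -> seq_limit - 1 - col.
            # Values land at the right edge, in reversed order.
            cols = seq_limit - 1 - cols
        out[row, cols] = row_vals
    return out

values = np.array([1, 2, 3])
offsets = np.array([0, 2, 3])  # rows: [1, 2] and [3]
print(dense_from_ragged(values, offsets, 4, padding="left"))
# [[0 0 2 1]
#  [0 0 0 3]]
```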


10 changes: 5 additions & 5 deletions nvtabular/ops/list_slice.py
@@ -61,7 +61,7 @@ def transform(self, col_selector: ColumnSelector, df: DataFrameType) -> DataFrameType:
on_cpu = _is_cpu_object(df)
ret = type(df)()
for col in col_selector.names:
- # handle CPU via normal python slicing (not very efficient)
+ # Handle CPU via normal python slicing (not very efficient).
if on_cpu:
ret[col] = [row[self.start : self.end] for row in df[col]]
else:
@@ -99,8 +99,8 @@ def output_tags(self):

@numba.cuda.jit
def _calculate_row_sizes(start, end, offsets, row_sizes):
"""given a slice (start/end) and existing offsets indicating row lengths, this
calculates the size for each new row after slicing"""
"""Given a slice (start/end) and existing offsets indicating row lengths, this
calculates the size for each new row after slicing."""
rowid = numba.cuda.grid(1)
if rowid < offsets.size - 1:
original_row_size = offsets[rowid + 1] - offsets[rowid]
@@ -120,9 +120,9 @@ def _calculate_row_sizes(start, end, offsets, row_sizes):

@numba.cuda.jit
def _slice_rows(start, offsets, elements, new_offsets, new_elements):
"""slices rows of a list column. requires the 'new_offsets' to
"""Slices rows of a list column. requires the 'new_offsets' to
be previously calculated (meaning that we don't need the 'end' slice index
since thats baked into the new_offsets"""
since thats baked into the new_offsets."""
rowid = numba.cuda.grid(1)
if rowid < (new_offsets.size - 1):
if start >= 0:
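The `_calculate_row_sizes` kernel computes, per row, how many elements survive the slice. The same bookkeeping can be sketched in plain Python using `slice.indices()`, which clamps `start`/`end` and resolves negative indices exactly as Python slicing does (a standalone sketch, not the CUDA kernel itself):

```python
def sliced_row_sizes(start, end, offsets):
    """For each ragged row delimited by offsets[i]:offsets[i+1], return the
    number of elements the slice [start:end] would keep."""
    sizes = []
    for i in range(len(offsets) - 1):
        row_len = offsets[i + 1] - offsets[i]
        # slice.indices clamps out-of-range bounds and maps negatives.
        s, e, _ = slice(start, end).indices(row_len)
        sizes.append(max(e - s, 0))
    return sizes

print(sliced_row_sizes(1, 3, [0, 5, 7, 8]))  # [2, 1, 0]
```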