[REVIEW] Implement Feature Request from #1077 on Left Padding #1126

Closed
wants to merge 36 commits
Changes from 1 commit
Commits (36)
a52a90c
Update docstrings for issue #1077
Sep 14, 2021
ed61663
Merge branch 'main' into 1077-implement
Sep 16, 2021
ff1e396
Implementation of left padding for issue #1077
Sep 16, 2021
d9e457d
Update #1077 implementation
Sep 16, 2021
e25c6e8
Implement #1077 update with docstring and type hinting.
Sep 16, 2021
295d4e2
Merge branch 'main' into 1077-implement
lesnikow Sep 16, 2021
5166d57
Merge branch 'main' into 1077-implement
lesnikow Sep 17, 2021
e55336c
Merge branch 'main' into 1077-implement
lesnikow Sep 20, 2021
af8aa57
Merge branch 'main' of github.com:NVIDIA/NVTabular into 1077-implement
Sep 23, 2021
299d356
Update tensorflow module docstring for docs syntax
Sep 23, 2021
1285783
Expose pad_left to user
Sep 24, 2021
364bcf1
Skip test_distributed_multigpu()
Sep 24, 2021
071b8bf
Add unit test for torch dataloader and padding argument
Sep 24, 2021
3cce162
Update torch test for padding argument
Sep 24, 2021
cebb715
Update unit test for padding argument
Sep 25, 2021
5acd76a
Update dataloader torch to pass new tests
Sep 25, 2021
1684289
Clean up loader/torch module
Sep 25, 2021
a319501
Clean up test_torch_dataloader module
Sep 25, 2021
0be389e
Update tests
Sep 27, 2021
d93f9c5
Add tests for the TensorFlow runtime dataloader
Sep 28, 2021
0c0ce69
Implement pad_left in _build_sparse_tensor TF
Sep 28, 2021
941d2f3
Update torch loader documentation
Sep 28, 2021
7944b2a
Merge branch 'main' of 1077-implement
Sep 28, 2021
76c0024
Cleanup _build_sparese_tensor for TF dataloader
Sep 28, 2021
46847cb
Add docstring to _build_sparse_tensor() for tf
Sep 28, 2021
c7ae873
Update docstring
Sep 28, 2021
d86cec3
Refactor torch dataloader pad_left and _build_spar
Sep 28, 2021
d90e1df
Update pytest decorator
Sep 28, 2021
b21c57d
Cleanup torch loader
Sep 28, 2021
2150ede
Implement pad_left with TF ops
Sep 29, 2021
a51aa44
Implement pad_left with TF ops cleanup
Sep 29, 2021
01749f9
Merge branch 'main' into 1077-implement
lesnikow Sep 29, 2021
b305afa
Update tensorflow dataloader implementation
Sep 30, 2021
2febf1a
Merge branch '1077-implement' of https://github.com/NVIDIA/NVTabular …
Sep 30, 2021
587ef0c
Update pad_left TF unit tests
Sep 30, 2021
dd9927e
Update pad_left code for TF sparse tensors
Sep 30, 2021
Update docstrings for issue #1077
Update docstrings for issue #1077. This touches the tensorflow
and torch dataloader modules and the list_slice op module. The
motivation for this is to improve readability. This commit is
towards resolving issue #1077 on implementing left padding
for sparse sequential features.
Adam Lesnikowski committed Sep 14, 2021

commit a52a90cf03f742d9618cbb94ccd69b7fb98928f9
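
For context on the feature these commits build, "left padding" means that when variable-length (sparse/sequential) rows are packed into a fixed-width dense tensor, the fill values go in front of each row instead of after it. A minimal sketch of the idea in plain NumPy, with hypothetical names not taken from this diff:

    import numpy as np

    def pad_sequences(rows, max_len, pad_left=False):
        # Pad variable-length rows to max_len with zeros. With
        # pad_left=True the zeros go in front, so each row ends at the
        # right edge of the dense tensor.
        out = np.zeros((len(rows), max_len), dtype=np.int64)
        for i, row in enumerate(rows):
            row = row[-max_len:]  # keep at most the last max_len items
            if pad_left:
                out[i, max_len - len(row):] = row
            else:
                out[i, :len(row)] = row
        return out

    rows = [[1, 2, 3], [4, 5]]
    print(pad_sequences(rows, 4))                 # [[1 2 3 0] [4 5 0 0]]
    print(pad_sequences(rows, 4, pad_left=True))  # [[0 1 2 3] [0 0 4 5]]
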
19 changes: 10 additions & 9 deletions nvtabular/loader/tensorflow.py
@@ -140,20 +140,20 @@ class KerasSequenceLoader(tf.keras.utils.Sequence, DataLoader):

Iterator output is of the form `(dict(features), list(labels))`,
where each element of the features dict is a
`feature_name: feature_tensor` and each elemtn of the labels
`feature_name: feature_tensor` and each element of the labels
list is a tensor, and all tensors are of shape `(batch_size, 1)`.
Note that this means vectorized continuous and multi-hot categorical
features are not currently supported.
The underlying NVTabular `Dataset` object is stored in the `data`
attribute, and should be used for updating NVTabular `Workflow`
statistics::
statistics:

workflow = nvt.Workflow(...)
dataset = KerasSequenceLoader(...)
workflow.update_stats(dataset.data.to_iter(), record_stats=True)

Parameters
-------------
----------
- paths_or_dataset: str or list(str)
Either a string representing a file pattern (see `tf.glob` for
pattern rules), a list of filenames to be iterated through, or
@@ -205,6 +205,7 @@ class KerasSequenceLoader(tf.keras.utils.Sequence, DataLoader):
dictionary of key: column_name + value: integer representing max sequence length for column
sparse_dense : bool
bool value to activate transforming sparse tensors to dense

"""

_use_nnz = True
@@ -238,7 +239,7 @@ def __init__(
feature_columns, cat_names, cont_names, schema=dataset.schema
)

# sort the ccolumns to avoid getting incorrect output
# Sort the columns to avoid getting incorrect output.
# (https://github.com/NVIDIA/NVTabular/issues/412)
cat_names = _get_embedding_order(cat_names)
cont_names = _get_embedding_order(cont_names)
@@ -265,19 +266,18 @@ def __init__(
self._map_fns = []

def __len__(self):
"""
recreating since otherwise Keras yells at you
"""
"""Recreating since otherwise Keras yells at you."""
# TODO: what's a better way to do this inheritance
# of the appropriate methods? A Metaclass?
DataLoader.stop(self)
return DataLoader.__len__(self)

def __getitem__(self, idx):
"""
implemented exclusively for consistency
Implemented exclusively for consistency
with Keras model.fit. Does not leverage
passed idx in any way
passed idx in any way.

"""
return DataLoader.__next__(self)

@@ -286,6 +286,7 @@ def map(self, fn):
Applying a function to each batch.

This can for instance be used to add `sample_weight` to the model.

"""
self._map_fns.append(fn)

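To make the docstring above concrete, here is a hedged usage sketch of the loader it describes, expanding the workflow example shown in the excerpt. The keyword arguments and file paths are illustrative assumptions, not taken from this diff; consult the full parameter list for authoritative names and defaults.

    import nvtabular as nvt
    from nvtabular.loader.tensorflow import KerasSequenceLoader

    # paths_or_dataset may be a glob pattern, a list of files, or a Dataset.
    train_loader = KerasSequenceLoader(
        "train/*.parquet",               # hypothetical path
        batch_size=65536,
        label_names=["label"],
        cat_names=["item_id", "user_id"],
        cont_names=["price"],
    )

    # Iterator output is (dict(features), list(labels)), with tensors of
    # shape (batch_size, 1), as the docstring above describes.
    features, labels = next(iter(train_loader))
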
2 changes: 1 addition & 1 deletion nvtabular/loader/torch.py
@@ -42,7 +42,7 @@ class TorchAsyncItr(torch.utils.data.IterableDataset, DataLoader):
batches are the specified size until the final batch.

Parameters
-----------
----------
dataset : NVTabular dataset
cats : [str]
the list of categorical columns in the dataset
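
For orientation, a hedged construction sketch for the class documented above. Only `dataset` and `cats` appear in this excerpt; the remaining keyword arguments are assumptions about the rest of the signature, included purely for illustration.

    import nvtabular as nvt
    from nvtabular.loader.torch import TorchAsyncItr

    dataset = nvt.Dataset("train/*.parquet")   # hypothetical input data
    train_itr = TorchAsyncItr(
        dataset,
        cats=["item_id", "user_id"],   # categorical columns, per the docstring
        conts=["price"],               # assumed continuous-column argument
        labels=["label"],              # assumed label-column argument
        batch_size=65536,
    )

    # Batches are the specified size until the final batch.
    first_batch = next(iter(train_itr))
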
10 changes: 5 additions & 5 deletions nvtabular/ops/list_slice.py
@@ -61,7 +61,7 @@ def transform(self, col_selector: ColumnSelector, df: DataFrameType) -> DataFram
on_cpu = _is_cpu_object(df)
ret = type(df)()
for col in col_selector.names:
# handle CPU via normal python slicing (not very efficient)
# Handle CPU via normal python slicing (not very efficient).
if on_cpu:
ret[col] = [row[self.start : self.end] for row in df[col]]
else:
@@ -99,8 +99,8 @@ def output_tags(self):

@numba.cuda.jit
def _calculate_row_sizes(start, end, offsets, row_sizes):
"""given a slice (start/end) and existing offsets indicating row lengths, this
calculates the size for each new row after slicing"""
"""Given a slice (start/end) and existing offsets indicating row lengths, this
calculates the size for each new row after slicing."""
rowid = numba.cuda.grid(1)
if rowid < offsets.size - 1:
original_row_size = offsets[rowid + 1] - offsets[rowid]
@@ -120,9 +120,9 @@ def _calculate_row_sizes(start, end, offsets, row_sizes):

@numba.cuda.jit
def _slice_rows(start, offsets, elements, new_offsets, new_elements):
"""slices rows of a list column. requires the 'new_offsets' to
"""Slices rows of a list column. requires the 'new_offsets' to
be previously calculated (meaning that we don't need the 'end' slice index
since thats baked into the new_offsets"""
since thats baked into the new_offsets."""
rowid = numba.cuda.grid(1)
if rowid < (new_offsets.size - 1):
if start >= 0:
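
To illustrate what the two kernels above do, here is a hedged pure-Python equivalent of the offsets-based slicing they describe: first compute the new row sizes from the start/end slice and the existing offsets (the role of _calculate_row_sizes), then copy the surviving elements (the role of _slice_rows). The helper name below is hypothetical, not part of the module.

    def slice_list_column(offsets, elements, start, end):
        # offsets delimits each row inside the flat elements array:
        # row i spans elements[offsets[i]:offsets[i + 1]].
        # Step 1: new row sizes after slicing (cf. _calculate_row_sizes).
        new_offsets = [0]
        for i in range(len(offsets) - 1):
            row = elements[offsets[i]:offsets[i + 1]]
            new_offsets.append(new_offsets[-1] + len(row[start:end]))

        # Step 2: copy the elements that survive the slice (cf. _slice_rows).
        new_elements = []
        for i in range(len(offsets) - 1):
            row = elements[offsets[i]:offsets[i + 1]]
            new_elements.extend(row[start:end])

        return new_offsets, new_elements

    # Rows [1, 2, 3] and [4, 5], sliced with [0:2]:
    print(slice_list_column([0, 3, 5], [1, 2, 3, 4, 5], 0, 2))
    # -> ([0, 2, 4], [1, 2, 4, 5])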