[REVIEW] Implement Feature Request from #1077 on Left Padding #1126
Changes from 29 commits
TensorFlow dataloader (`KerasSequenceLoader`):
@@ -140,7 +140,7 @@ class KerasSequenceLoader(tf.keras.utils.Sequence, DataLoader):
     Iterator output is of the form `(dict(features), list(labels))`,
     where each element of the features dict is a
-    `feature_name: feature_tensor` and each elemtn of the labels
+    `feature_name: feature_tensor` and each element of the labels
     list is a tensor, and all tensors are of shape `(batch_size, 1)`.
     Note that this means vectorized continuous and multi-hot categorical
     features are not currently supported.
@@ -153,7 +153,7 @@ class KerasSequenceLoader(tf.keras.utils.Sequence, DataLoader):
     workflow.update_stats(dataset.data.to_iter(), record_stats=True)

     Parameters
-    -------------
+    ----------
     - paths_or_dataset: str or list(str)
         Either a string representing a file pattern (see `tf.glob` for
         pattern rules), a list of filenames to be iterated through, or
@@ -205,6 +205,10 @@ class KerasSequenceLoader(tf.keras.utils.Sequence, DataLoader):
         dictionary of key: column_name + value: integer representing max sequence length for column
     sparse_dense : bool
         bool value to activate transforming sparse tensors to dense
+    pad_left : bool
+        Boolean value to indicate whether to pad on the left. Use True to pad on the left,
+        False to pad on the right. Default: False
+
     """

     _use_nnz = True
@@ -230,6 +234,7 @@ def __init__(
         sparse_names=None,
         sparse_max=None,
         sparse_as_dense=False,
+        pad_left=False,
     ):
         dataset = _validate_dataset(
             paths_or_dataset, batch_size, buffer_size, engine, reader_kwargs
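For context, a hedged sketch of how the new flag would be passed once this change lands; the file pattern, batch size, and column names below are made up, and the import path assumes the usual `nvtabular.loader.tensorflow` module:

```python
from nvtabular.loader.tensorflow import KerasSequenceLoader

# Illustrative only: paths and column names are hypothetical.
train_loader = KerasSequenceLoader(
    "./train/*.parquet",               # file pattern or list of files
    batch_size=65536,
    label_names=["click"],
    cat_names=["item_id", "category"],
    cont_names=["price"],
    sparse_names=["item_id"],          # list-like columns emitted as sparse tensors
    sparse_max={"item_id": 20},        # max sequence length per sparse column
    pad_left=True,                     # new flag: pad variable-length features on the left
)
```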
@@ -238,7 +243,7 @@ def __init__(
             feature_columns, cat_names, cont_names, schema=dataset.schema
         )

-        # sort the ccolumns to avoid getting incorrect output
+        # Sort the columns to avoid getting incorrect output.
         # (https://github.com/NVIDIA/NVTabular/issues/412)
         cat_names = _get_embedding_order(cat_names)
         cont_names = _get_embedding_order(cont_names)
@@ -261,23 +266,23 @@ def __init__(
             sparse_names=sparse_names,
             sparse_max=sparse_max,
             sparse_as_dense=sparse_as_dense,
+            pad_left=pad_left,
         )
         self._map_fns = []

     def __len__(self):
-        """
-        recreating since otherwise Keras yells at you
-        """
+        """Recreating since otherwise Keras yells at you."""
         # TODO: what's a better way to do this inheritance
         # of the appropriate methods? A Metaclass?
         DataLoader.stop(self)
         return DataLoader.__len__(self)

     def __getitem__(self, idx):
         """
-        implemented exclusively for consistency
+        Implemented exclusively for consistency
         with Keras model.fit. Does not leverage
-        passed idx in any way
+        passed idx in any way.

         """
         return DataLoader.__next__(self)
@@ -286,6 +291,7 @@ def map(self, fn):
         Applying a function to each batch.

+        This can for instance be used to add `sample_weight` to the model.

         """
         self._map_fns.append(fn)
@@ -416,8 +422,30 @@ def _get_sparse_tensor(self, values, indices, num_rows, seq_limit):
         return sparse_tensor

     def _build_sparse_tensor(self, values, offsets, diff_offsets, num_rows, seq_limit):
+        """Builds sparse tensors in the TensorFlow dataloader.
+
+        Parameters
+        ----------
+        values :
+        offsets :
+        diff_offsets :
+        num_rows :
+        seq_limit :
+
+        Returns
+        -------
+        tf.sparse
+            Our built TensorFlow sparse tensor.
+
+        """
         ragged = tf.RaggedTensor.from_row_lengths(values=values, row_lengths=diff_offsets)
-        tensor = tf.RaggedTensor.from_tensor(ragged.to_tensor(shape=[None, seq_limit])).to_sparse()
+        if self.pad_left:
+            max_len = max(max(len(row) for row in ragged), seq_limit)
+            tensor = tf.stack([tf.pad(row, [[max_len - len(row), 0]]) for row in ragged], axis=0)
+        else:
+            tensor = ragged.to_tensor(shape=[None, seq_limit])
+
+        tensor = tf.RaggedTensor.from_tensor(tensor).to_sparse()
         if self.sparse_as_dense:
             tensor = tf.sparse.to_dense(tensor)
         return tensor

Review thread on the `tensor = tf.RaggedTensor.from_tensor(tensor).to_sparse()` line:

- @jperez999 Are you able to see this?
- I see it now... Can you time this compared to the example I gave... would like to see the timing on this execution.
- This is good to hear. Yes, one moment please. Is there a particular timing framework or method that you use, for consistency of the timings?
- For this current code block, using

  =========================== test session starts ===========================
  platform linux -- Python 3.8.10, pytest-6.2.5, py-1.10.0, pluggy-1.0.0 -- /usr/bin/python
  cachedir: .pytest_cache
  rootdir: /nvtabular, configfile: pyproject.toml
  collected 77 items / 75 deselected / 2 selected
  tests/unit/loader/test_tf_dataloader.py::test_sparse_tensor_left_padding[False] PASSED [ 50%]
  ============================ slowest durations ============================

- Do you have a commit with the
- Yeah, I just meant grabbing that same mini tensor I create in my example (called digits) and running it through your scenario; the idea is to get a comparison to see which is the fastest method, to ensure that is what we select.
- This is the code I reported the ~.002 seconds execution time
- I see, here are the timings I got.
- Yes, I see the change... Just need to update the torch version now...
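The thread above asks for timings of the per-row `tf.pad` loop against a loop-free alternative. For illustration, here is a minimal standalone sketch of one loop-free way to left-pad a ragged batch, assuming small made-up `values`/`diff_offsets` inputs and that `tf.reverse` dispatches on `RaggedTensor` (it does in recent TensorFlow releases). This is a sketch of the idea discussed in the review, not necessarily the code adopted in later commits:

```python
import tensorflow as tf

# Made-up stand-ins for the dataloader's values / diff_offsets.
values = tf.constant([1, 2, 3, 4, 5, 6], dtype=tf.int64)
diff_offsets = tf.constant([2, 1, 3], dtype=tf.int64)  # per-row lengths
seq_limit = 4

ragged = tf.RaggedTensor.from_row_lengths(values=values, row_lengths=diff_offsets)

# Loop-free left padding: reverse each row, right-pad with to_tensor, reverse back.
left_padded = tf.reverse(
    tf.reverse(ragged, axis=[1]).to_tensor(shape=[None, seq_limit]),
    axis=[1],
)
sparse = tf.RaggedTensor.from_tensor(left_padded).to_sparse()

print(left_padded.numpy())
# [[0 0 1 2]
#  [0 0 0 3]
#  [0 4 5 6]]
```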
PyTorch dataloader (`TorchAsyncItr`):
@@ -42,7 +42,7 @@ class TorchAsyncItr(torch.utils.data.IterableDataset, DataLoader):
     batches are the specified size until the final batch.

     Parameters
-    -----------
+    ----------
     dataset : NVTabular dataset
     cats : [str]
         the list of categorical columns in the dataset
@@ -64,6 +64,10 @@ class TorchAsyncItr(torch.utils.data.IterableDataset, DataLoader):
         dictionary of key: column_name + value: integer representing max sequence length for column
     sparse_dense : bool
         bool value to activate transforming sparse tensors to dense
+    pad_left : bool
+        Boolean value to indicate whether to pad on the left. Use True to pad on the left,
+        False to pad on the right. Default: False
+
     """

     def __init__(
@@ -83,6 +87,7 @@ def __init__(
         sparse_names=None,
         sparse_max=None,
         sparse_as_dense=False,
+        pad_left=False,
     ):
         DataLoader.__init__(
             self,
@@ -101,6 +106,7 @@ def __init__(
             sparse_names=sparse_names,
             sparse_max=sparse_max,
             sparse_as_dense=sparse_as_dense,
+            pad_left=pad_left,
         )

     def __iter__(self):
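As with the TensorFlow loader, a hedged sketch of how the flag would be passed on the PyTorch side; the dataset path and column names are hypothetical, and the import path assumes the usual `nvtabular.loader.torch` module:

```python
import nvtabular as nvt
from nvtabular.loader.torch import TorchAsyncItr

# Illustrative only: the dataset location and column names are made up.
dataset = nvt.Dataset("./train/", engine="parquet")
train_loader = TorchAsyncItr(
    dataset,
    batch_size=65536,
    cats=["item_id", "category"],
    conts=["price"],
    labels=["click"],
    sparse_names=["item_id"],
    sparse_max={"item_id": 20},
    pad_left=True,  # new flag: pad variable-length sparse features on the left
)
```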
@@ -174,8 +180,48 @@ def _get_sparse_tensor(self, values, indices, num_rows, seq_limit):
             sparse_tensor = sparse_tensor.to_dense()
         return sparse_tensor

-    def _build_sparse_tensor(self, values, offsets, diff_offsets, num_rows, seq_limit):
+    def _build_sparse_tensor_helper_process_column(self, col: torch.Tensor) -> torch.Tensor:
+        """Process column by increasing blocks for use in left padding."""
+        col = col.tolist()
+        prev, curr = 0, 0
+        while curr < len(col):
+            if col[curr] >= col[curr - 1]:
+                col[prev:curr] = col[prev:curr][::-1]
+                prev = curr
+            if curr == (len(col) - 1):
+                col[prev : curr + 1] = col[prev : curr + 1][::-1]
+            curr += 1
+        return torch.Tensor(col)
+
+    def _build_sparse_tensor(
+        self,
+        values,
+        offsets,
+        diff_offsets,
+        num_rows,
+        seq_limit,
+    ):
+        """Builds sparse tensors in our torch dataloader.
+
+        Parameters
+        ----------
+        values :
+        offsets :
+        diff_offsets :
+        num_rows :
+        seq_limit :
+
+        Returns
+        -------
+        torch.sparse
+            Our built torch sparse tensor.
+
+        """
         indices = self._get_indices(offsets, diff_offsets)
+        if self.pad_left:
+            indices[:, 1] = self._build_sparse_tensor_helper_process_column(
+                (seq_limit - 1) - indices[:, 1]
+            )
         return self._get_sparse_tensor(values, indices, num_rows, seq_limit)

Review thread on the `while` loop:

- This is the while loop I am talking about, @lesnikow.
- This is good to know, thank you. I have not implemented anything optimized over this yet. I would like to hear your feedback first on the

Review thread on the `if self.pad_left:` branch:

- Again, this iteration logic is not the best methodology for covering this: https://stackoverflow.com/questions/48686945/reshaping-a-tensor-with-padding-in-pytorch something like this would be more efficient. torch.nn.functional has a padding function you could use to your advantage.
- Likewise, this is good to know. I will look into how to do this outlined approach for the Torch implementation here.
- I have the same questions for your second comment as I wrote in reply to your first comment above. Would you have any guidance on this here?
- I read over this Stack Overflow question along with all the replies. These approaches will not directly work for this torch implementation. The main obstacle in these approaches is that methods like
- I still see the while loop here for the torch side; is that accurate?
- This has not been updated yet, since I would like to hear your feedback first on the
- Do you mean the
- Please see line 187.
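The reviewer's suggestion in the threads above is to compute all the pad shifts with array math rather than a per-row `while` loop. A minimal sketch of that idea, assuming `col_indices` holds the within-row positions that `_get_indices` produces and `row_lengths` plays the role of `diff_offsets`; the helper name and inputs are illustrative, not the PR's final code:

```python
import torch

def left_shift_column_indices(col_indices, row_lengths, seq_limit):
    """Shift within-row positions so each row is right-aligned (left-padded).

    col_indices: 1-D LongTensor of positions 0..len(row)-1, concatenated row by row.
    row_lengths: 1-D LongTensor with one entry per row (the role diff_offsets plays).
    """
    # Every element of row i moves right by (seq_limit - len(row_i)).
    shift = (seq_limit - row_lengths).repeat_interleave(row_lengths)
    return col_indices + shift

# Tiny example: rows of lengths 2, 1, 3 left-padded to seq_limit=4.
row_lengths = torch.tensor([2, 1, 3])
col_indices = torch.tensor([0, 1, 0, 0, 1, 2])
print(left_shift_column_indices(col_indices, row_lengths, 4))
# tensor([2, 3, 3, 1, 2, 3])
```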
Review thread on the left-padding implementation:

- Is iterating through each row the only way to do this? You can actually logically figure out all the padding amounts by doing array math, and then you can pass that entire list of "pad_length" entries at once; that way it's not doing each row, one at a time.
- This is good to know. Let me see how to implement this approach you outline here.
- I am having a hard time seeing currently a vectorized or further optimized approach to this code snippet. Figuring out the padding amounts by row is not that difficult, but what to do with that tensor of row lengths is very much unclear to me. For instance there is `tf.pad()`, but this only takes padding tensors of shape `[n, 2]`, where `n` is the rank of the original tensor, so that we may only do a constant amount of padding per dimension. For `torch` there is `torch.nn.functional.pad()`, which also only does a constant amount of left or right padding per dimension. There is `tf.ragged`'s `to_tensor()` method, which implicitly does padding or truncation according to a given shape, but this only does padding on the right. So it is not that clear how to do a further optimized approach using this variable padding tensor. Did you have some other `tensorflow` or `torch` methods in mind, or some other vectorized approach using the variable padding lengths that I am missing? I have tried some others, but to no success yet. In particular, composing `to_tensor()` with `tf.tensor.pad()` with the `pad` method using left padding will not work. I can modify the run-length encoding of the ragged tensor here, but I would guess that this would also not be a vectorized operation.
- @jperez999 How does this current updated implementation of this TensorFlow code snippet, without iterating through the rows of the ragged tensor, address your feedback?
- The for loops you are doing are the problem here, I think... That methodology is extremely slow. You have to go into each row and read the data, as opposed to doing the entire column at once. I think this gives you the padding behavior you want, and it's a tad bit faster: I am clocking this logic in at 0.005383729934692383 and I am clocking your logic at 0.026534557342529297.
- @jperez999 Did you see the updated TensorFlow implementation that I have here that does not use `for` loops? It is at commit 01749f9.
- That commit you linked there references conda environment files... no code changes... As the PR stands, that for-loop logic still exists for both torch and tensorflow.
- That commit is a final merge of main into this branch. The changes are in the predecessors of this commit. They should be viewable here in the web UI or by doing a checkout of that commit. The `for` loop implementation for TensorFlow is an outdated change. I also tagged you with a comment on the code section below. Are you able to see these changes now?
- In particular, in my web UI, the `tensorflow` code snippet you are referencing has a yellow "Outdated" box next to it.
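The ~0.0054 s vs ~0.0265 s numbers quoted above come from the reviewer's own harness, which is not reproduced in this thread. For reference, a rough sketch of how such a comparison could be run, using made-up input sizes and plain `time.perf_counter`; absolute numbers will of course differ by machine and input shape:

```python
import time
import tensorflow as tf

def avg_seconds(fn, n=50):
    """Average wall-clock seconds per call; a stand-in for the harness used in the thread."""
    start = time.perf_counter()
    for _ in range(n):
        fn()
    return (time.perf_counter() - start) / n

# Made-up ragged input: 300 rows with lengths cycling through 2, 1, 3.
row_lengths = tf.constant([2, 1, 3] * 100, dtype=tf.int64)
values = tf.range(tf.reduce_sum(row_lengths), dtype=tf.int64)
ragged = tf.RaggedTensor.from_row_lengths(values=values, row_lengths=row_lengths)
seq_limit = 4

def loop_pad():
    # Per-row Python loop, as in the earlier commit of this PR.
    max_len = max(max(len(row) for row in ragged), seq_limit)
    return tf.stack([tf.pad(row, [[max_len - len(row), 0]]) for row in ragged], axis=0)

def reverse_pad():
    # Loop-free alternative sketched earlier in this review.
    return tf.reverse(tf.reverse(ragged, axis=[1]).to_tensor(shape=[None, seq_limit]), axis=[1])

print("loop-based:", avg_seconds(loop_pad), "s/call")
print("loop-free :", avg_seconds(reverse_pad), "s/call")
```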