-
Notifications
You must be signed in to change notification settings - Fork 304
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Further optimize from_pandas_edgelist
with cudf
#4528
Conversation
if is_src_copied: | ||
src_indices = src_array | ||
else: | ||
src_indices = cp.array(src_array) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe I'm missing something obvious, but couldn't the same behavior be achieved by just doing the following?
if is_src_copied: | |
src_indices = src_array | |
else: | |
src_indices = cp.array(src_array) | |
src_indices = cp.array(src_array, copy=False) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think you meant:
src_indices = cp.array(src_array, copy=not is_src_copied)
which should work as expected.
I found if-else branches more clear to use here, and easy for us to do given that we already have the booleans around.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And once again, if you didn't catch it from one of my previous comments in previous PRs, numpy 2 changed semantics:
https://numpy.org/doc/stable/numpy_2_0_migration_guide.html#adapting-to-changes-in-the-copy-keyword
so I think it's okay to be extra clear here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the explanation, but no, actually that snippet wasn't what I was thinking. But after re-reading again I think there's some necessary side effects that I was missing, but I'd rather not assume I know what they are. Let's chat offline and I'll update the comment afterwards.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think my questioning of why not just use copy=False
and therefore why _cp_iscopied_asarray
is needed at all is because I didn't initially realize this function was intentionally making copies of the incoming dataframe series for the graph to own. That makes sense and I confirmed this with Erik offline (although we should probably think about a future improvement to prevent yet another copy happening when the PLC graph is made, but that can be for later) so my only request is a comment mentioning that. Perhaps something like this here:
# Now that the arrays have been extracted from the input
# dataframe, create copies the graph instance can own. If a
# copy already occurred during the extraction step above,
# don't copy again.
but maybe something even shorter and better, maybe just in the docstring for the function could simply be
This function will create a copy of the input dataframe data
source and target series to be owned by the returned Graph object.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, code comments added. NetworkX doesn't share ownership with input objects, so I figure neither should we.
Thanks @eriknw , I really like the attention to detail with the tests! I just had one question (see review) and I'm wondering if the answer to it could significantly change the PR. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for meeting offline and clarifying things! I updated the review with the explanation. LGTM overall, but I have just one request for a comment to add.
if is_src_copied: | ||
src_indices = src_array | ||
else: | ||
src_indices = cp.array(src_array) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think my questioning of why not just use copy=False
and therefore why _cp_iscopied_asarray
is needed at all is because I didn't initially realize this function was intentionally making copies of the incoming dataframe series for the graph to own. That makes sense and I confirmed this with Erik offline (although we should probably think about a future improvement to prevent yet another copy happening when the PLC graph is made, but that can be for later) so my only request is a comment mentioning that. Perhaps something like this here:
# Now that the arrays have been extracted from the input
# dataframe, create copies the graph instance can own. If a
# copy already occurred during the extraction step above,
# don't copy again.
but maybe something even shorter and better, maybe just in the docstring for the function could simply be
This function will create a copy of the input dataframe data
source and target series to be owned by the returned Graph object.
…share ownership, so neither should we
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the new comments.
/merge |
/merge |
4ff7acb
into
rapidsai:branch-24.08
This continues #4525 (and this comment) to avoid copies and to be more optimal whether using pandas, cudf, or cudf.pandas. Notably, using
s.to_numpy
with cudf will return a numpy array, butcudf.pandas
may return a cupy array (proxy).Also,
s.to_numpy(copy=False)
(from comment) is not used, b/c cudf'sto_numpy
raises ifcopy=False
. We get the behavior we want by not specifyingcopy=
.I don't know if this is the best way to determine whether a copy occurred or not, but this seems like a useful pattern to establish, because we want to make ingest more efficient.
CC @rlratzel