Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Further optimize from_pandas_edgelist with cudf #4528

Merged
merged 12 commits into from
Jul 30, 2024

Conversation

eriknw
Copy link
Contributor

@eriknw eriknw commented Jul 9, 2024

This continues #4525 (and this comment) to avoid copies and to be more optimal whether using pandas, cudf, or cudf.pandas. Notably, using s.to_numpy with cudf will return a numpy array, but cudf.pandas may return a cupy array (proxy).

Also, s.to_numpy(copy=False) (from comment) is not used, b/c cudf's to_numpy raises if copy=False. We get the behavior we want by not specifying copy=.

I don't know if this is the best way to determine whether a copy occurred or not, but this seems like a useful pattern to establish, because we want to make ingest more efficient.

CC @rlratzel

@eriknw eriknw added improvement Improvement / enhancement to an existing function non-breaking Non-breaking change python nx-cugraph labels Jul 9, 2024
@eriknw eriknw requested a review from a team as a code owner July 9, 2024 11:58
Comment on lines +80 to +83
if is_src_copied:
src_indices = src_array
else:
src_indices = cp.array(src_array)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe I'm missing something obvious, but couldn't the same behavior be achieved by just doing the following?

Suggested change
if is_src_copied:
src_indices = src_array
else:
src_indices = cp.array(src_array)
src_indices = cp.array(src_array, copy=False)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you meant:

        src_indices = cp.array(src_array, copy=not is_src_copied)

which should work as expected.

I found if-else branches more clear to use here, and easy for us to do given that we already have the booleans around.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And once again, if you didn't catch it from one of my previous comments in previous PRs, numpy 2 changed semantics:
https://numpy.org/doc/stable/numpy_2_0_migration_guide.html#adapting-to-changes-in-the-copy-keyword
so I think it's okay to be extra clear here.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the explanation, but no, actually that snippet wasn't what I was thinking. But after re-reading again I think there's some necessary side effects that I was missing, but I'd rather not assume I know what they are. Let's chat offline and I'll update the comment afterwards.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think my questioning of why not just use copy=False and therefore why _cp_iscopied_asarray is needed at all is because I didn't initially realize this function was intentionally making copies of the incoming dataframe series for the graph to own. That makes sense and I confirmed this with Erik offline (although we should probably think about a future improvement to prevent yet another copy happening when the PLC graph is made, but that can be for later) so my only request is a comment mentioning that. Perhaps something like this here:

# Now that the arrays have been extracted from the input
# dataframe, create copies the graph instance can own. If a
# copy already occurred during the extraction step above,
# don't copy again.

but maybe something even shorter and better, maybe just in the docstring for the function could simply be

This function will create a copy of the input dataframe data
source and target series to be owned by the returned Graph object.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, code comments added. NetworkX doesn't share ownership with input objects, so I figure neither should we.

@rlratzel
Copy link
Contributor

Thanks @eriknw , I really like the attention to detail with the tests! I just had one question (see review) and I'm wondering if the answer to it could significantly change the PR.

Copy link
Contributor

@rlratzel rlratzel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for meeting offline and clarifying things! I updated the review with the explanation. LGTM overall, but I have just one request for a comment to add.

Comment on lines +80 to +83
if is_src_copied:
src_indices = src_array
else:
src_indices = cp.array(src_array)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think my questioning of why not just use copy=False and therefore why _cp_iscopied_asarray is needed at all is because I didn't initially realize this function was intentionally making copies of the incoming dataframe series for the graph to own. That makes sense and I confirmed this with Erik offline (although we should probably think about a future improvement to prevent yet another copy happening when the PLC graph is made, but that can be for later) so my only request is a comment mentioning that. Perhaps something like this here:

# Now that the arrays have been extracted from the input
# dataframe, create copies the graph instance can own. If a
# copy already occurred during the extraction step above,
# don't copy again.

but maybe something even shorter and better, maybe just in the docstring for the function could simply be

This function will create a copy of the input dataframe data
source and target series to be owned by the returned Graph object.

Copy link
Contributor

@rlratzel rlratzel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the new comments.

@rlratzel
Copy link
Contributor

/merge

@rlratzel
Copy link
Contributor

/merge

@rapids-bot rapids-bot bot merged commit 4ff7acb into rapidsai:branch-24.08 Jul 30, 2024
143 of 144 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
improvement Improvement / enhancement to an existing function non-breaking Non-breaking change nx-cugraph python
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants