[FEA]: Add DASK edgelist and graph support to the Dataset API #4035
Also, I added [...]. Please help me add appropriate labels to this PR. Next week is my final week, so my responses may sometimes be delayed. I will try my best to check for updates and comments here in my spare time. :)
Thank you very much for this PR. I think it looks good but I have a couple suggestions.
/ok to test
Sorry for the delay! I have modified the file download method based on the suggestions, and all the tests pass again.
Thanks!
/merge
Hi! I chose to go beyond docs and take on some simple work. This PR closes #3218.

Here is what I have done in this PR:

- Added `get_dask_edgelist()` and `get_dask_graph()` (and an internal helper function, `__download_dask_csv()`) to the Dataset API.

Here are some additional details regarding this PR:
============================================================ warnings summary ============================================================
cugraph/tests/utils/test_dataset.py::test_get_dask_graph[dataset0]
cugraph/tests/utils/test_dataset.py::test_get_dask_graph[dataset0]
cugraph/tests/utils/test_dataset.py::test_get_dask_graph[dataset0]
cugraph/tests/utils/test_dataset.py::test_weights_dask[dataset0]
cugraph/tests/utils/test_dataset.py::test_weights_dask[dataset0]
cugraph/tests/utils/test_dataset.py::test_weights_dask[dataset0]
cugraph/tests/utils/test_dataset.py::test_weights_dask[dataset0]
cugraph/tests/utils/test_dataset.py::test_weights_dask[dataset0]
cugraph/tests/utils/test_dataset.py::test_weights_dask[dataset0]
  /home/ubuntu/miniconda3/envs/cugraph_dev/lib/python3.10/site-packages/cudf/core/index.py:3284: FutureWarning: cudf.StringIndex is deprecated and will be removed from cudf in a future version. Use cudf.Index with the appropriate dtype instead.
    warnings.warn(

-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
- I think the warnings above come from the `from_dask_cudf_edgelist` call, but I currently have no idea how to remove them. I will do my best to address them if anyone has ideas.
- The `get_edgelist()` function returns a deep copy of the object, but this is not supported for `get_dask_edgelist()`, since only a shallow copy is allowed for a Dask cuDF dataframe (see docs). As a result, if a user modifies the returned dataframe, the changes are reflected in the internal `self._edgelist` object, so a later call to `get_dask_graph()` produces a graph that differs from one constructed directly from the data file.
- The keyword `chunksize` is no longer in use (check docs here). I have checked all related functions in the repository and found that they currently use `chunksize`. If there is a need to change them to `blocksize`, I will create another PR to address this issue.

Any comments and suggestions are welcome!
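
On the warnings point above: one possible way to keep this specific message out of the test summary is a targeted warning filter. This is only a sketch using the standard `warnings` module. `emit_like_cudf()` is a hypothetical stand-in that reproduces the message text from the log, and the filter hides the warning rather than fixing its cause.

```python
import warnings


def emit_like_cudf():
    # Hypothetical stand-in: emits the same message text as the log above.
    warnings.warn(
        "cudf.StringIndex is deprecated and will be removed from cudf in a "
        "future version. Use cudf.Index with the appropriate dtype instead.",
        FutureWarning,
    )


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    # Ignore only this FutureWarning; the message is a regex matched
    # against the start of the warning text.
    warnings.filterwarnings(
        "ignore",
        message=r"cudf\.StringIndex is deprecated",
        category=FutureWarning,
    )
    emit_like_cudf()                                  # suppressed
    warnings.warn("some other warning", UserWarning)  # still recorded
```

In a pytest suite, the same message and category could presumably be expressed through pytest's `filterwarnings` marker or ini option instead.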
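
To illustrate the deep-copy versus shallow-copy concern above, here is a pure-Python analogy built on the standard `copy` module. `ToyDataset` and `get_shallow_edgelist()` are hypothetical names, and plain lists stand in for dataframe columns, so this shows only the general aliasing hazard, not the actual Dask cuDF behavior.

```python
import copy


class ToyDataset:
    """Hypothetical stand-in for the Dataset class; not cuGraph code."""

    def __init__(self):
        # Plain lists play the role of dataframe columns.
        self._edgelist = {"src": [0, 1, 2], "dst": [1, 2, 0]}

    def get_edgelist(self):
        # Deep copy: the caller receives fully independent data.
        return copy.deepcopy(self._edgelist)

    def get_shallow_edgelist(self):
        # Shallow copy: a new dict, but the column lists are shared.
        return copy.copy(self._edgelist)


ds = ToyDataset()

deep = ds.get_edgelist()
deep["src"][0] = 99
deep_leaked = ds._edgelist["src"][0] == 99

shallow = ds.get_shallow_edgelist()
shallow["src"][0] = 99
shallow_leaked = ds._edgelist["src"][0] == 99
```

In this toy version the edit made through the deep copy stays local, while the edit made through the shallow copy reaches the internal state, which is the situation described for `self._edgelist`.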
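
If the `chunksize` to `blocksize` change is made in a follow-up PR, one possible transition pattern is a small shim that accepts both keywords for a while. This is a sketch only: `resolve_blocksize` is a hypothetical helper that does not exist in the repository, and the `"64 MiB"` value is an arbitrary example.

```python
import warnings


def resolve_blocksize(blocksize=None, chunksize=None):
    """Hypothetical helper: map the old `chunksize` keyword onto `blocksize`."""
    if chunksize is not None:
        if blocksize is not None:
            raise ValueError("Pass only one of `blocksize` or `chunksize`.")
        warnings.warn(
            "`chunksize` is deprecated; use `blocksize` instead.",
            FutureWarning,
            stacklevel=2,
        )
        return chunksize
    return blocksize


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    old_style = resolve_blocksize(chunksize="64 MiB")  # warns, still works

new_style = resolve_blocksize(blocksize="64 MiB")      # preferred spelling
```

Existing callers that pass `chunksize` keep working and see a `FutureWarning`, while new code can use `blocksize` directly.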