Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add compatibility with Cudf #969

Open
wants to merge 8 commits into
base: main
Choose a base branch
from
Open

Add compatibility with Cudf #969

wants to merge 8 commits into from

Conversation

pprados
Copy link

@pprados pprados commented Oct 18, 2022

Hello,

Nvidia propose a compatible Panda like framework to use the GPU : cudf

This pull-request add the compatibility to cudf.

Regards

@cosmicBboy
Copy link
Collaborator

this is awesome @pprados !

This PR will also need to add tests to ensure that the pandera API works well with cudf (from experience there might be weird quirks in differences between pandas and other pandas-like frameworks)

For an example, you can take a look at the test suites for the other frameworks:

@pprados
Copy link
Author

pprados commented Oct 18, 2022

The problem with the test for cudf, is, you must have an nvidia card in docker. What do you advise me to do to comply with your platform?

@cosmicBboy
Copy link
Collaborator

I don't know too much about cudf, but is there a "cpu mode" to at least test that the code runs as expected?

How does the cudf project do testing?

@pprados
Copy link
Author

pprados commented Oct 18, 2022

Non. cudf is a GPU implementation of Pandas. All methods run inside the NVidia card. It's not possible to test, without GPU.
If your CI can have an CUDA driver (see https://hub.docker.com/r/nvidia/cuda), I can add some tests.
Else, I can add some tests and run "locally" in my computer.

@cosmicBboy
Copy link
Collaborator

Cool, lemme look into getting a self-hosted GPU runner for this project. https://cirun.io also looks promising, which supports GPU runners.

In the mean-time, would you mind adding tests to this PR? Once the CI runners are up we can add those tests to the ci suite.

@cosmicBboy
Copy link
Collaborator

@pprados see #970

@pprados
Copy link
Author

pprados commented Oct 18, 2022

Ok. I will add some similar test of others frameworks.

@pprados
Copy link
Author

pprados commented Oct 18, 2022

How you start yours tests?

@cosmicBboy
Copy link
Collaborator

pytest tests

See here for more info in general re: contributing

@pprados
Copy link
Author

pprados commented Oct 19, 2022

I suppose the process to instal a conda environment to build have some mistake.
I receive these errors:

Encountered problems while solving:
  - nothing provides requested jupyterlite_sphinx
  - package pyspark-stubs-2.4.0.post5-py_0 requires pyspark >=2.4.0,<3.0.0, but none of the providers can be installed

jupyterlite_sphinx is not in conda repository.

@cosmicBboy
Copy link
Collaborator

looks like jupyterlite_sphinx needs to be placed under the pip key: https://github.com/unionai-oss/pandera/blob/master/environment.yml#L66

@cosmicBboy
Copy link
Collaborator

once #973 is merged you can rebase on that and it should fix the error

@pprados
Copy link
Author

pprados commented Oct 19, 2022

Cool.
I started writing tests (similar of modin) and fixe some bugs.
Note: cudf run only on python 3.9

@cosmicBboy
Copy link
Collaborator

nice, yeah we can add some conditional logic to the CI workflow once #973 is ready

@pprados
Copy link
Author

pprados commented Oct 19, 2022

To test modin, I must set some specific environment variable?

@cosmicBboy
Copy link
Collaborator

are you seeing an error/warning?

@pprados
Copy link
Author

pprados commented Oct 19, 2022

I need rebase from master?

conda env update -n pandera-dev -f environment.yml

fail

Note: with pyspark, I have this error:

py4j.protocol.Py4JError: org.apache.spark.sql.api.python.PythonSQLUtils.isTimestampNTZPreferred does not exist in the JVM
Which version of openjdk are you using?

@pprados
Copy link
Author

pprados commented Oct 19, 2022

The current version has bugs. Wait a moment. I will fix it.

@pprados
Copy link
Author

pprados commented Oct 20, 2022

I rebase the code from the branch fix-dependencies. It's not enough. The pyspark-stubs pose some problems.

@pprados
Copy link
Author

pprados commented Oct 20, 2022

The current version rebase from fix-dependencies and propose all tests for cudf.
My English is too weak to work on the documentation.
I think it's time to test and integrate this code.

@cosmicBboy
Copy link
Collaborator

thanks @pprados! will review this and start getting the CI stuff ready.

A quick look at the diff, there's some commented out code in the tests, is this WIP or did you mean to delete those?
https://github.com/unionai-oss/pandera/pull/969/files#diff-cf3f976c6656875160c19bea55028bd06076bd7bb6a7ada5c378e8a54989a842R265-R286

@pprados
Copy link
Author

pprados commented Oct 21, 2022

Hello.
I fix your remarks and others bugs.

But, I have some problem with make nox-conda.

    make nox-conda
    nox -db conda --envdir .nox-conda "-r"
    Running locally
    nox > Error while collecting sessions.
    nox > Sessions not found: mypy

@cosmicBboy
Copy link
Collaborator

But, I have some problem with make nox-conda.

Ah, yeah I recently replaced the mypy nox session with an equivalent pre-commit hook... lemme fix that

@pprados
Copy link
Author

pprados commented Oct 25, 2022

Just for information, I need this PR to combine your framework Pandera with another : Virtual DataFrame

I would like to propose a link to Pandera, but I can do it, only, if Pandera is compatible with cudf.

May be, later, we can add a mutual link. ;-)

@cosmicBboy
Copy link
Collaborator

Nice, looks like a cool project! FYI I probably won't get to look into #970 until next month, appreciate your patience! (of course would love if you can help integrate with cirun.io to get GPU support in pandera's CI)

@cosmicBboy
Copy link
Collaborator

hi @pprados, due to the security issues associated with using self-hosted runners on a public repo, I'm inclined to wait for gpu-enabled github-hosted runners to add CI for this.

We'll continue to explore options in this repo but to expedite your work, I'd be happy to help merge this into main and tag it as "experimental" in the documentation.

My English is too weak to work on the documentation.

Your English seems strong to me! If you can start a new docs page under this one I can help edit it. Basically you can roughly follow the structure in https://github.com/unionai-oss/pandera/blob/main/docs/source/modin.rst (with a SKIP_CUDF in the :skipif: option) so that the code doesn't actually run.

@pprados
Copy link
Author

pprados commented Dec 3, 2022 via email

@cosmicBboy
Copy link
Collaborator

hey @pprados I just retrofitted this PR with the recent internals rewrite #913 and figured I'd ask if you can test this out on your setup?

I'll also try running the tests on a cuda-enabled google colab notebook.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants