Auto-wrapper for vaex? #1026

Open
ericmjl opened this issue Feb 23, 2022 · 3 comments
Labels
discussion-needed enhancement New feature or request

Comments

@ericmjl
Member

ericmjl commented Feb 23, 2022

Brief Description

I recently read the vaex docs and it looks quite promising for highly scalable dataframe computation. I'd like to kickstart a discussion on what it might take to support vaex with pyjanitor.

Wanted to also make explicit that we don't have to decide on yes/no for this idea!

The vaex docs are available here. Because a lot of our underlying codebase operates on the pandas API, and vaex is supposed to be dataframe API compatible, it appears to me that we should be able to automagically wrap functions in our top-level API and have them "just work" under the df.func namespace.
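
As a rough illustration of the auto-wrapping idea, here is a minimal pure-Python sketch. `ToyFrame`, `FuncNamespace`, `REGISTRY`, and the toy versions of `clean_names` and `remove_empty` are hypothetical stand-ins, not vaex or pyjanitor code. The only assumption carried over from above is that each wrapped function takes the dataframe as its first argument.

```python
from functools import partial


class ToyFrame:
    """Hypothetical stand-in for a dataframe: a dict of column name -> list."""

    def __init__(self, columns):
        self.columns = dict(columns)

    @property
    def func(self):
        # Each access builds a namespace bound to this particular frame.
        return FuncNamespace(self, REGISTRY)


class FuncNamespace:
    """Exposes every registered function as an attribute bound to one frame."""

    def __init__(self, frame, registry):
        self._frame = frame
        self._registry = registry

    def __getattr__(self, name):
        try:
            function = self._registry[name]
        except KeyError:
            raise AttributeError(f"{name!r} is not wrapped for this frame type") from None
        # Pre-fill the frame as the first argument.
        return partial(function, self._frame)


def clean_names(frame):
    """Toy version: lowercase column names and replace spaces with underscores."""
    return ToyFrame(
        {k.strip().lower().replace(" ", "_"): v for k, v in frame.columns.items()}
    )


def remove_empty(frame):
    """Toy version: drop columns whose values are all None."""
    return ToyFrame(
        {k: v for k, v in frame.columns.items() if any(x is not None for x in v)}
    )


# Collect the functions by name instead of wiring each one up by hand.
REGISTRY = {f.__name__: f for f in (clean_names, remove_empty)}

df = ToyFrame({"First Name": ["a", "b"], "Empty Col": [None, None]})
result = df.func.clean_names().func.remove_empty()
print(list(result.columns))  # prints ['first_name']
```

Note that chaining here has to go through `.func` at every step, which is the ergonomic cost of living under a sub-namespace rather than directly on the dataframe.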

@ericmjl ericmjl added enhancement New feature or request discussion-needed labels Feb 23, 2022
@thatlittleboy
Contributor

I'm no expert on Vaex (I've had it in my sights for a while now but never had the time to explore it much).
On a general note, there are a couple of things I see as pertinent to making pyjanitor work with vaex:

  • We need a way to add accessors / methods on the Vaex DataFrame in the same way that pandas_flavor does for pandas DataFrame. So that the end-user can do df.clean_names().remove_empty() and chain pyjanitor methods, regardless of whether they are using a Vaex or pandas df. There is this section on "Extending Vaex" in the docs, but it doesn't seem relevant (?)
  • How about the completeness of the Vaex API compared to pandas? I'm not sure if this is covered somewhere in the docs (e.g. I know libraries like dask/modin/koalas have some notion of "API coverage", I wonder what the number is for vaex.) I ask this because for some pyjanitor functions, we rely heavily on specific pandas functions (factorize, cut etc.).

Let me know if I'm off-course on this!
Overall, at first glance, I'm not opposed to the idea; I just want to flesh out first how well this change will gel with the existing functionality of pyjanitor.

A nice first step, I think, would be to have a working POC for a very simple pyjanitor function (say, shuffle? or also/then?) implemented for a generic Vaex df.
The ideal scenario would be for the pyjanitor API to be identical across the two dataframe types, with only a subset of functions supported for Vaex dfs.
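
To make the shape of such a POC concrete, here is a minimal sketch in plain Python. `PandasLikeFrame`, `VaexLikeFrame`, `register_method`, and `reverse_rows` are hypothetical names invented for illustration; neither frame class touches pandas or Vaex, and the toy `then` only mirrors the idea of applying a function to the frame. It shows one function registered for both frame types and another registered for only one of them.

```python
class PandasLikeFrame:
    """Hypothetical stand-in for a pandas DataFrame."""

    def __init__(self, rows):
        self.rows = list(rows)


class VaexLikeFrame:
    """Hypothetical stand-in for a Vaex DataFrame."""

    def __init__(self, rows):
        self.rows = list(rows)


def register_method(*frame_types):
    """Attach the decorated function as a method on each given frame type."""

    def decorator(function):
        for frame_type in frame_types:
            setattr(frame_type, function.__name__, function)
        return function

    return decorator


@register_method(PandasLikeFrame, VaexLikeFrame)
def then(frame, func):
    """Apply an arbitrary function to the frame and return its result."""
    return func(frame)


@register_method(PandasLikeFrame)  # deliberately not registered for the Vaex stand-in
def reverse_rows(frame):
    """Return a new frame of the same type with the rows reversed."""
    return type(frame)(reversed(frame.rows))


def drop_first(frame):
    """Helper used only for this demo: drop the first row."""
    return type(frame)(frame.rows[1:])


# The same chained call works on both frame types.
outputs = {
    frame_type.__name__: frame_type([1, 2, 3]).then(drop_first).then(drop_first).rows
    for frame_type in (PandasLikeFrame, VaexLikeFrame)
}
print(outputs)  # prints {'PandasLikeFrame': [3], 'VaexLikeFrame': [3]}
print(hasattr(VaexLikeFrame, "reverse_rows"))  # prints False
```

The unsupported function simply does not exist on the second frame type, so calling it there fails with an ordinary `AttributeError`; a real implementation might prefer a more descriptive error.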

@ericmjl
Member Author

ericmjl commented Feb 23, 2022

All-round great pointers, @thatlittleboy!

Yes, I agree a small prototype might be a good starting point. I might take my time on this one, as it is fairly low-priority in the grand scheme of things; our effort on #972 is currently more important.

On the specific questions you raised:

> We need a way to add accessors / methods on the Vaex DataFrame in the same way that pandas_flavor does for pandas DataFrame. So that the end-user can do df.clean_names().remove_empty() and chain pyjanitor methods, regardless of whether they are using a Vaex or pandas df. There is this section on "Extending Vaex" in the docs, but it doesn't seem relevant (?)

vaex's extension API will register functions under the df.func namespace, rather than the df namespace. ("namespace" is a rather generic term here, I guess.) If we don't like the df.func.<some_function> API, we may need to implement our own wrapper, like what's done for the Spark API.

> How about the completeness of the Vaex API compared to pandas? I'm not sure if this is covered somewhere in the docs (e.g. I know libraries like dask/modin/koalas have some notion of "API coverage", I wonder what the number is for vaex.) I ask this because for some pyjanitor functions, we rely heavily on specific pandas functions (factorize, cut etc.).

Not sure here either. I guess a prototype built around the most idiosyncratic pandas functions would be the way to find out!
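
Before writing a full prototype, a crude first pass could simply check which of the names we depend on exist on the other backend at all. The sketch below is plain Python; `api_coverage` and `ToyBackend` are hypothetical, and the list of required names is just the examples mentioned above plus one made up for the demo. It checks names only, so a matching name says nothing about whether the behaviour matches.

```python
def api_coverage(required_names, candidate):
    """Split required attribute names into those the candidate has and those it lacks."""
    supported = sorted(name for name in required_names if hasattr(candidate, name))
    missing = sorted(name for name in required_names if not hasattr(candidate, name))
    return supported, missing


class ToyBackend:
    """Hypothetical backend that implements only part of the needed API."""

    def cut(self):
        ...

    def groupby(self):
        ...


required = {"factorize", "cut", "groupby"}
supported, missing = api_coverage(required, ToyBackend)
print(supported)  # prints ['cut', 'groupby']
print(missing)  # prints ['factorize']
```

In practice `candidate` would be the real dataframe class or module being probed, and `required` would come from auditing which pandas calls each pyjanitor function makes.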

@thatlittleboy
Contributor

thatlittleboy commented Feb 23, 2022

From what I can see, functions exposed as df.func.<some_function> via @register_function seem to accept only expressions / arrays, which seems overly restrictive for what we're trying to do in pyjanitor (e.g. functions that operate on a group of user-defined columns).
Can Vaex's register_function decorator accept a function like def func(*args)?
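
To spell out the gap I mean, here is a small plain-Python sketch of lifting an array-only function into one that takes a frame plus a group of column names. It does not answer the register_function question and uses no Vaex code; `over_columns` and `strip_whitespace` are hypothetical names, and the "frame" is just a dict of column name to list.

```python
from functools import wraps


def over_columns(column_function):
    """Lift a function of one column (a list) into one taking a frame and column names."""

    @wraps(column_function)
    def frame_function(frame, *column_names):
        updated = dict(frame)  # shallow copy so the input frame is left untouched
        for name in column_names:
            updated[name] = column_function(frame[name])
        return updated

    return frame_function


@over_columns
def strip_whitespace(values):
    """Array-level function: only ever sees the values of a single column."""
    return [value.strip() for value in values]


frame = {"a": [" x "], "b": [" y"], "c": [" z "]}
result = strip_whitespace(frame, "a", "b")
print(result)  # prints {'a': ['x'], 'b': ['y'], 'c': [' z ']}
```

Something with this shape would need access to the whole dataframe, which is why an array-only registration mechanism looks too narrow on its own.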

Ah, maybe the dataframe accessors might work... worth a shot regardless. I can't quite tell just by looking 😄

Agreed on the need to wrap if we somehow get the Vaex extensions to work; I'm strongly of the opinion that we need to keep the internal (pyjanitor) API consistent, regardless of the DataFrame type.

2 participants