FEAT Add make_series and make_dataframe #798

Merged

Vincent-Maladiere
Member

Addresses https://github.com/skrub-data/skrub/pull/784/files#r1358253388

This simple PR implements, for both Pandas and Polars:

  • A function to create a dataframe from a dictionary of columns
  • A function to create a series from a 1d array

When merged, this can be used directly in #784 for example.
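A minimal sketch of the pandas side of the two helpers described above. The function names come from the PR title; the `index` and `name` parameters are assumptions made for illustration and are not necessarily the merged signatures. A Polars counterpart would presumably wrap `pl.DataFrame` and `pl.Series` in the same way.

```python
import numpy as np
import pandas as pd


def make_dataframe(columns, index=None):
    """Build a pandas DataFrame from a dict mapping column names to 1d arrays.

    The `index` parameter is an assumption, not taken from the PR.
    """
    return pd.DataFrame(columns, index=index)


def make_series(values, index=None, name=None):
    """Build a pandas Series from a 1d array.

    The `index` and `name` parameters are assumptions, not taken from the PR.
    """
    return pd.Series(values, index=index, name=name)


# Usage: one column from a NumPy array, one from a plain list.
df = make_dataframe({"a": np.array([1, 2, 3]), "b": ["x", "y", "z"]})
s = make_series(np.arange(3), name="c")
```

Keeping both builders behind the same names in each backend module lets calling code stay agnostic about whether it is producing pandas or Polars objects.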

skrub/dataframe/_pandas.py (outdated)
skrub/dataframe/_polars.py (outdated)
X : Pandas dataframe
Converted output.
"""
if not isinstance(X, dict) or not all(
Member

what kind of mistake do you foresee to make this check necessary?

Member Author

Vincent-Maladiere, Oct 31, 2023

It may be slight overkill, but I am thinking of dictionaries of 2d arrays or dataframes.
For instance:

pd.DataFrame(
    dict(
        a=[[1, 2, 1]],
        b=[1, 2, 3],
    )
)

will raise ValueError: All arrays must be of the same length, but this error is not informative enough IMO.
I suspect this kind of error might happen more than we'd expect. WDYT?
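For illustration, a sketch of what a more informative check could look like. The original check is truncated in the diff above, so this is not the code from the PR: `check_columns` is a hypothetical helper, and its error messages are invented for the example.

```python
import numpy as np


def check_columns(X):
    """Hypothetical helper: reject anything that is not a dict of 1d columns."""
    if not isinstance(X, dict):
        raise TypeError(f"Expected a dict of columns, got {type(X).__name__}.")
    for name, col in X.items():
        # np.ndim works on lists, arrays, series and dataframes alike.
        ndim = np.ndim(col)
        if ndim != 1:
            raise ValueError(
                f"Column {name!r} must be 1d, got {ndim} dimension(s)."
            )
    return X


# The dictionary from the example above: column "a" is a nested (2d) list.
bad = dict(a=[[1, 2, 1]], b=[1, 2, 3])
try:
    check_columns(bad)
except ValueError as exc:
    message = str(exc)  # names the offending column instead of a length mismatch
```

The point of such a check is only to name the offending column. Whether that is worth the extra code in a private function is exactly what the thread goes on to discuss.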

Member

But this is a private function, right? It will be used only by skrub code, not directly by users. So if skrub passes it incorrect inputs, there isn't much the user can do about it except report the problem.

By the way, WDYT about renaming dataframe to _dataframe? We want skrub estimators to support both pandas and polars, but I'm not sure we want to publicly expose the machinery that makes it possible.

Member Author

OK for the renaming. I'll remove the check.

Member Author

Vincent-Maladiere commented Oct 31, 2023

I adjusted the API to make the dataframe module private (it seems I still have some work on the examples).
I also did some housekeeping by removing the DataFrameLike and SeriesLike type hints, LMKWYT @jeromedockes :)

@jeromedockes
Member

LGTM, I think once you remove the merge conflict we can merge it

@jeromedockes
Member

I think you also need to update test_select_cols

@jeromedockes
Member

and the docstring of get_df_namespace also mentions skrub.dataframe

Member

jeromedockes left a comment

LGTM, thanks!

@jeromedockes jeromedockes merged commit 5005364 into skrub-data:main Nov 2, 2023
20 of 21 checks passed
@Vincent-Maladiere Vincent-Maladiere deleted the add_namespace_df_builder branch November 9, 2023 16:34