GH-44962: [Python] Clean-up name / field_name handling in pandas compat #44963
@github-actions crossbow submit -g python
Revision: 251cd97 Submitted crossbow builds: ursacomputing/crossbow @ actions-6fcc46cbf4
I'm not an expert here, but tests and CI are passing, so it LGTM. Just a minor question:
python/pyarrow/pandas_compat.py
def construct_metadata(columns_to_convert, df, column_names, column_field_names,
                       index_levels, index_descriptors, preserve_index, types):
This API is supposed to be for internal use only, right?
Should we update the docstring? It seems it was already missing column_names.
This should indeed be internal, yes (although it is accessible as pyarrow.pandas_compat.construct_metadata, which doesn't necessarily look private).
The docstring is not the most informative, but I pushed an update to at least keep it up to date.
And good that you ask, because based on a quick GitHub search it seems that cudf is using this (https://github.com/search?q=repo%3Arapidsai%2Fcudf%20construct_metadata&type=code, e.g. https://github.com/rapidsai/cudf/blob/0c5bd6627159fe44a49e56020f0c0842696bc397/python/cudf/cudf/core/dataframe.py#L5772),
and also legate (https://github.com/nv-legate/legate.pandas/blob/7a97b455999e49c328c1873e49fb65d2eade7f2a/legate/pandas/core/table.py#L1230), but that is not an active project.
While I don't think we should keep too many compatibility guarantees here, let me just make it backwards compatible by making it an optional keyword (later, I think we should consider deprecating this and making it private).
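For illustration only, the "optional keyword" approach could look roughly like the sketch below. `build_metadata` is a hypothetical stand-in, not pyarrow's `construct_metadata`, and the fallback that derives field names with `str()` is an assumption made for the example rather than the behaviour of this PR.

```python
def build_metadata(column_names, column_field_names=None):
    """Hypothetical stand-in showing a backwards-compatible signature change.

    The new argument is an optional keyword placed last, so callers that
    only pass the old arguments keep working unchanged.
    """
    if column_field_names is None:
        # Old-style call: derive string field names from the labels.
        # (Assumed fallback for this sketch only.)
        column_field_names = [str(name) for name in column_names]
    return [
        {"name": name, "field_name": field_name}
        for name, field_name in zip(column_names, column_field_names)
    ]


# Old call site: still valid, field names are derived.
old_style = build_metadata(["a", None])
# New call site: passes the string field names explicitly.
new_style = build_metadata(["a", None], column_field_names=["a", "None"])
```

Both call styles produce the same structure, which is what lets downstream users of the old signature keep running while a later deprecation is considered.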
@github-actions crossbow submit -g python
Revision: fc92d71 Submitted crossbow builds: ursacomputing/crossbow @ actions-5e2c1c2e60
Thanks @jorisvandenbossche, LGTM.
The CI failures are unrelated.
After merging your PR, Conbench analyzed the 3 benchmarking runs that have been run so far on merge-commit 6252e9c. There were no benchmark performance regressions. 🎉 The full Conbench report has more details. It also includes information about 2 possible false positives for unstable benchmarks that are known to sometimes produce them.
Rationale for this change
Small part of #44195, factored out into its own PR because this change is just a small refactor that makes #44195 easier to do but does not change any logic by itself.
We currently store both name and field_name in the pandas metadata. field_name is guaranteed to be a string, and is always exactly the name used in the Arrow schema. name can also be None if the original pandas DataFrame used None as the column label or if it came from an index level without a name.
Until now, we had several places where we used name but then checked for it being None. With this PR, I made the code more consistently use field_name in the cases where it needs the string version, by more consistently passing through both names and field_names.
Are these changes tested?
Existing tests should cover this
Are there any user-facing changes?
No
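To make the name / field_name distinction from the rationale concrete, here is a minimal sketch. The `columns` list is a hand-written stand-in shaped like the per-column entries in the pandas metadata, not output captured from pyarrow; the specific field_name strings and both helper functions are assumptions for illustration.

```python
# Hand-written stand-in for per-column pandas metadata entries.
# The concrete field_name strings are illustrative assumptions.
columns = [
    # Ordinary column: label and schema name coincide.
    {"name": "a", "field_name": "a"},
    # Column whose pandas label was None: name is None, field_name is a string.
    {"name": None, "field_name": "None"},
    # Index level without a name: name is None, field_name is a string.
    {"name": None, "field_name": "__index_level_0__"},
]


def schema_field_names(columns):
    """Names to look up in the Arrow schema; always strings (hypothetical helper)."""
    return [col["field_name"] for col in columns]


def original_labels(columns):
    """Original pandas labels; entries may be None (hypothetical helper)."""
    return [col["name"] for col in columns]


field_names = schema_field_names(columns)
labels = original_labels(columns)
```

Code that needs a string can read `field_name` directly, with no None check; only code that restores the original pandas labels needs `name`.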