GH-44962: [Python] Clean-up name / field_name handling in pandas compat #44963

Merged: 6 commits merged into apache:main from the pandas-metadata-field-name branch on Dec 11, 2024

Conversation

@jorisvandenbossche jorisvandenbossche commented Dec 8, 2024

Rationale for this change

This is a small part of #44195, factored out into its own PR: the change is a small refactor that makes #44195 easier to do but does not itself change any logic.

We currently store both name and field_name in the pandas metadata. field_name is guaranteed to be a string, and is always exactly the name used in the Arrow schema. name can also be None, either because the original pandas DataFrame used None as the column label or because it came from an index level without a name.

Until now, we had several places where we used name and then had to check whether it was None. With this PR, the code uses field_name more consistently wherever it needs the string version, by passing both names and field_names through together.
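As a rough illustration of the distinction described above, here is a minimal, standard-library-only sketch. The helper `column_metadata_entry` is hypothetical and not part of pyarrow, and the choice of the string "None" as the fallback field name is an assumption made for this example; the real metadata can be inspected on a converted table via `schema.pandas_metadata`.

```python
def column_metadata_entry(label):
    """Hypothetical helper mimicking the name / field_name pair for one column."""
    # 'name' may stay None when the original label was None;
    # other labels are simply stringified in this sketch.
    name = None if label is None else str(label)
    # 'field_name' must always be a string; using "None" as the
    # fallback is an assumption of this sketch.
    field_name = "None" if name is None else name
    return {"name": name, "field_name": field_name}


entries = [column_metadata_entry(label) for label in ["a", None, 0]]
for entry in entries:
    print(entry)
```

Code that needs a string can then read `field_name` directly instead of reading `name` and checking it for None.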

Are these changes tested?

Existing tests should cover this

Are there any user-facing changes?

No

@jorisvandenbossche
Member Author

@github-actions crossbow submit -g python

github-actions bot commented Dec 8, 2024

Revision: 251cd97

Submitted crossbow builds: ursacomputing/crossbow @ actions-6fcc46cbf4

Task Status
example-python-minimal-build-fedora-conda GitHub Actions
example-python-minimal-build-ubuntu-venv GitHub Actions
test-conda-python-3.10 GitHub Actions
test-conda-python-3.10-cython2 GitHub Actions
test-conda-python-3.10-hdfs-2.9.2 GitHub Actions
test-conda-python-3.10-hdfs-3.2.1 GitHub Actions
test-conda-python-3.10-pandas-latest-numpy-latest GitHub Actions
test-conda-python-3.10-substrait GitHub Actions
test-conda-python-3.11 GitHub Actions
test-conda-python-3.11-dask-latest GitHub Actions
test-conda-python-3.11-dask-upstream_devel GitHub Actions
test-conda-python-3.11-hypothesis GitHub Actions
test-conda-python-3.11-pandas-latest-numpy-1.26 GitHub Actions
test-conda-python-3.11-pandas-latest-numpy-latest GitHub Actions
test-conda-python-3.11-pandas-nightly-numpy-nightly GitHub Actions
test-conda-python-3.11-pandas-upstream_devel-numpy-nightly GitHub Actions
test-conda-python-3.11-spark-master GitHub Actions
test-conda-python-3.12 GitHub Actions
test-conda-python-3.12-cpython-debug GitHub Actions
test-conda-python-3.13 GitHub Actions
test-conda-python-3.9 GitHub Actions
test-conda-python-3.9-pandas-1.1.3-numpy-1.19.5 GitHub Actions
test-conda-python-emscripten GitHub Actions
test-cuda-python-ubuntu-22.04-cuda-11.7.1 GitHub Actions
test-debian-12-python-3-amd64 GitHub Actions
test-debian-12-python-3-i386 GitHub Actions
test-fedora-39-python-3 GitHub Actions
test-ubuntu-22.04-python-3 GitHub Actions
test-ubuntu-22.04-python-313-freethreading GitHub Actions
test-ubuntu-24.04-python-3 GitHub Actions

@raulcd raulcd left a comment

Not an expert here but tests and CI are passing so it LGTM, just a minor question

Comment on lines 194 to 195
def construct_metadata(columns_to_convert, df, column_names, column_field_names,
                       index_levels, index_descriptors, preserve_index, types):
Member

This API is supposed to be for internal use only, right?
Should we update the docstring? It seems it was already missing column_names.

Member Author

This should indeed be internal, yes (although it is accessible as pyarrow.pandas_compat.construct_metadata, which doesn't necessarily look private).

The docstring is not the most informative, but I pushed an update to at least keep it up to date.

Member Author

And good that you asked, because based on a quick GitHub search it seems that cudf is using this (https://github.com/search?q=repo%3Arapidsai%2Fcudf%20construct_metadata&type=code, e.g. https://github.com/rapidsai/cudf/blob/0c5bd6627159fe44a49e56020f0c0842696bc397/python/cudf/cudf/core/dataframe.py#L5772).
(legate uses it as well (https://github.com/nv-legate/legate.pandas/blob/7a97b455999e49c328c1873e49fb65d2eade7f2a/legate/pandas/core/table.py#L1230), but that is not an active project.)

While I don't think we should keep too many compatibility guarantees here, let me just make it backwards compatible by making it an optional keyword (and later I think we should consider deprecating this and making it private).
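A minimal sketch of what such a backwards-compatible optional keyword could look like. The function `construct_metadata_sketch`, its reduced parameter list, and the fallback of stringifying the names are assumptions for illustration, not the actual pyarrow implementation:

```python
def construct_metadata_sketch(column_names, column_field_names=None):
    """Hypothetical stand-in showing an optional, backwards-compatible keyword."""
    # Older callers that only pass column_names keep working:
    # derive the string field names from the names (assumed fallback).
    if column_field_names is None:
        column_field_names = [str(name) for name in column_names]
    if len(column_names) != len(column_field_names):
        raise ValueError("column_names and column_field_names must match in length")
    return [
        {"name": name, "field_name": field_name}
        for name, field_name in zip(column_names, column_field_names)
    ]


# Old-style call, without the new keyword
old_style = construct_metadata_sketch(["a", None])
# New-style call, passing both lists explicitly
new_style = construct_metadata_sketch(["a", None], column_field_names=["a", "None"])
print(old_style == new_style)
```

Placing the new argument last with a default of None means existing positional callers are unaffected.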

@github-actions github-actions bot added the "awaiting merge" label and removed the "awaiting committer review" label Dec 9, 2024
⚠️ GitHub issue #44962 has been automatically assigned in GitHub to PR creator.

@github-actions github-actions bot added the "awaiting changes" label and removed the "awaiting merge" label Dec 11, 2024
@github-actions github-actions bot added the "awaiting change review" label and removed the "awaiting changes" label Dec 11, 2024
@jorisvandenbossche
Member Author

@github-actions crossbow submit -g python

Revision: fc92d71

Submitted crossbow builds: ursacomputing/crossbow @ actions-5e2c1c2e60

Task Status
example-python-minimal-build-fedora-conda GitHub Actions
example-python-minimal-build-ubuntu-venv GitHub Actions
test-conda-python-3.10 GitHub Actions
test-conda-python-3.10-cython2 GitHub Actions
test-conda-python-3.10-hdfs-2.9.2 GitHub Actions
test-conda-python-3.10-hdfs-3.2.1 GitHub Actions
test-conda-python-3.10-pandas-latest-numpy-latest GitHub Actions
test-conda-python-3.10-substrait GitHub Actions
test-conda-python-3.11 GitHub Actions
test-conda-python-3.11-dask-latest GitHub Actions
test-conda-python-3.11-dask-upstream_devel GitHub Actions
test-conda-python-3.11-hypothesis GitHub Actions
test-conda-python-3.11-pandas-latest-numpy-1.26 GitHub Actions
test-conda-python-3.11-pandas-latest-numpy-latest GitHub Actions
test-conda-python-3.11-pandas-nightly-numpy-nightly GitHub Actions
test-conda-python-3.11-pandas-upstream_devel-numpy-nightly GitHub Actions
test-conda-python-3.11-spark-master GitHub Actions
test-conda-python-3.12 GitHub Actions
test-conda-python-3.12-cpython-debug GitHub Actions
test-conda-python-3.13 GitHub Actions
test-conda-python-3.9 GitHub Actions
test-conda-python-3.9-pandas-1.1.3-numpy-1.19.5 GitHub Actions
test-conda-python-emscripten GitHub Actions
test-cuda-python-ubuntu-22.04-cuda-11.7.1 GitHub Actions
test-debian-12-python-3-amd64 GitHub Actions
test-debian-12-python-3-i386 GitHub Actions
test-fedora-39-python-3 GitHub Actions
test-ubuntu-22.04-python-3 GitHub Actions
test-ubuntu-22.04-python-313-freethreading GitHub Actions
test-ubuntu-24.04-python-3 GitHub Actions

@raulcd raulcd left a comment

Thanks @jorisvandenbossche , LGTM
The CI failures are unrelated

@github-actions github-actions bot added the "awaiting merge" label and removed the "awaiting change review" label Dec 11, 2024
@raulcd raulcd merged commit 6252e9c into apache:main Dec 11, 2024
12 of 14 checks passed
@raulcd raulcd removed the "awaiting merge" label Dec 11, 2024
@jorisvandenbossche jorisvandenbossche deleted the pandas-metadata-field-name branch December 11, 2024 09:29
After merging your PR, Conbench analyzed the 3 benchmarking runs that have been run so far on merge-commit 6252e9c.

There were no benchmark performance regressions. 🎉

The full Conbench report has more details. It also includes information about 2 possible false positives for unstable benchmarks that are known to sometimes produce them.
