GH-38034: [Python] DataFrame Interchange Protocol - correct dtype information for categorical columns #38065

Merged
merged 5 commits into apache:main from gh-38034-categorical-dtype-bug
Oct 10, 2023

Conversation

AlenkaF
Member

@AlenkaF AlenkaF commented Oct 6, 2023

Rationale for this change

See: #38034 (comment)

What changes are included in this PR?

The format string (f_string) for columns with a categorical dtype now reflects the type of the indices of the dictionary data type. The bit width was already correct. From the spec:

For categoricals, the format string describes the type of the
categorical in the data buffer. In case of a separate encoding of
the categorical (e.g. an integer to string mapping), this can
be derived from self.describe_categorical.
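
To illustrate the shape of the corrected dtype tuple, here is a minimal sketch. It is not the pyarrow implementation: `INDEX_FORMATS` and `categorical_dtype` are hypothetical names, and the native-endianness marker `"="` is assumed. The format characters are the integer codes from the Arrow C data interface, so the format string follows the dictionary's index type rather than the type of the categories.

```python
# Hypothetical helper names for illustration; this is not pyarrow's code.
CATEGORICAL = 23  # DtypeKind.CATEGORICAL in the interchange protocol

# Arrow C data interface format characters for integer index types,
# stored as (format string, bit width).
INDEX_FORMATS = {
    "int8": ("c", 8),
    "uint8": ("C", 8),
    "int16": ("s", 16),
    "uint16": ("S", 16),
    "int32": ("i", 32),
    "uint32": ("I", 32),
    "int64": ("l", 64),
    "uint64": ("L", 64),
}


def categorical_dtype(index_type: str) -> tuple:
    """Build a (kind, bit_width, format_string, endianness) tuple.

    The format string describes the indices stored in the data buffer,
    not the categories they point to.
    """
    f_string, bit_width = INDEX_FORMATS[index_type]
    return (CATEGORICAL, bit_width, f_string, "=")  # "=" assumed: native endianness


print(categorical_dtype("int8"))  # (23, 8, 'c', '=')
```

The type of the categories themselves would still be obtained through describe_categorical, as the spec text above says.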

Are these changes tested?

Yes.

Are there any user-facing changes?

No.

@danepitkin
Member

LGTM! Just one linter issue reported.

-    assert col.dtype[0] == 23 # <DtypeKind.CATEGORICAL: 23>
+    assert col.dtype[0] == 23  # <DtypeKind.CATEGORICAL: 23>

@AlenkaF
Member Author

AlenkaF commented Oct 7, 2023

Oh, bummer. Thanks for pinging me Dane!

Review comment on python/pyarrow/interchange/column.py (outdated, resolved)
@github-actions bot added the awaiting merge label and removed the awaiting review label Oct 9, 2023
@AlenkaF merged commit db420c9 into apache:main Oct 10, 2023
10 checks passed
@AlenkaF removed the awaiting merge label Oct 10, 2023
@AlenkaF deleted the gh-38034-categorical-dtype-bug branch October 10, 2023 04:06
@conbench-apache-arrow

After merging your PR, Conbench analyzed the 6 benchmarking runs that have been run so far on merge-commit db420c9.

There were no benchmark performance regressions. 🎉

The full Conbench report has more details. It also includes information about 4 possible false positives for unstable benchmarks that are known to sometimes produce them.

JerAguilon pushed a commit to JerAguilon/arrow that referenced this pull request Oct 23, 2023
…pe information for categorical columns (apache#38065)

### Rationale for this change
See: apache#38034 (comment)

### What changes are included in this PR?

The format string (`f_string`) for columns with a categorical dtype now reflects the type of the indices of the dictionary data type. The bit width was already correct. From the spec:

> For categoricals, the format string describes the type of the
              categorical in the data buffer. In case of a separate encoding of
              the categorical (e.g. an integer to string mapping), this can
              be derived from ``self.describe_categorical``.

### Are these changes tested?

Yes.

### Are there any user-facing changes?

No.
* Closes: apache#38034

Authored-by: AlenkaF <[email protected]>
Signed-off-by: AlenkaF <[email protected]>
loicalleyne pushed a commit to loicalleyne/arrow that referenced this pull request Nov 13, 2023
…pe information for categorical columns (apache#38065)

dgreiss pushed a commit to dgreiss/arrow that referenced this pull request Feb 19, 2024
…pe information for categorical columns (apache#38065)

Development

Successfully merging this pull request may close these issues.

[Python] DataFrame Interchange Protocol - wrong dtype information for categorical columns
3 participants