feat: Allow decoding of non-Polars arrow dictionaries in Arrow and Parquet #20248
Blocked on #20250.
After some consideration,
This removes all instances of the DictionaryEncoder except one which is used for Polars enums and categoricals. This essentially makes it so that the dictionary arrow type is regarded as any other arrow type. Fixes pola-rs#20242. Fixes pola-rs#17945.
Okay, soooooo. This turned into a way larger change than I originally had in mind, but I still think this can be a relatively cohesive patch-set. In the end, I really wanted two things:
The rules for converting from an
Similarly, when we convert a Categorical or Enum to arrow, we now set
Codecov Report
Attention: Patch coverage is
Additional details and impacted files
@@            Coverage Diff             @@
##             main   #20248      +/-   ##
==========================================
- Coverage   79.61%   79.60%   -0.01%
==========================================
  Files        1565     1566       +1
  Lines      218328   218220     -108
  Branches     2478     2465      -13
==========================================
- Hits       173820   173714     -106
+ Misses      43941    43937       -4
- Partials      567      569       +2
/// Propagate the nulls from the dictionary values into the keys and remove those nulls from the
/// values.
pub fn propagate_dictionary_value_nulls(
Ah, good one.
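To illustrate what the doc comment on `propagate_dictionary_value_nulls` describes, here is a minimal sketch. It is not the Polars implementation: the helper name `propagate_value_nulls` is hypothetical, and keys and values are modelled as plain slices of `Option`s rather than Arrow arrays with validity bitmaps.

```rust
/// Hypothetical helper (not the Polars function): move nulls from the
/// dictionary values into the keys and drop them from the values.
fn propagate_value_nulls(
    keys: &[Option<u32>],
    values: &[Option<&str>],
) -> (Vec<Option<u32>>, Vec<String>) {
    // For every old value index, store its index among the non-null values,
    // or `None` if the value itself was null.
    let mut remap: Vec<Option<u32>> = Vec::with_capacity(values.len());
    let mut dense: Vec<String> = Vec::new();
    for value in values {
        match value {
            Some(s) => {
                remap.push(Some(dense.len() as u32));
                dense.push(s.to_string());
            }
            None => remap.push(None),
        }
    }

    // A key stays valid only if it was valid and pointed at a non-null value.
    // Assumption: every valid key is in bounds; otherwise indexing panics.
    let new_keys: Vec<Option<u32>> = keys
        .iter()
        .map(|key| key.and_then(|k| remap[k as usize]))
        .collect();

    (new_keys, dense)
}
```

After this step the values contain no nulls, so all null information lives in the keys.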
@@ -3255,6 +3271,7 @@ impl DataFrame {

pub struct RecordBatchIter<'a> {
    columns: &'a Vec<Column>,
    schema: ArrowSchemaRef,
Why do we want the schema on RecordBatch? Do we want O(1) access somewhere?
We need it to be able to properly round trip types. We need the ArrowFields, the ArrowDataType isn't enough.
Ah right. Convinces me more that we should shift to Fields on Series/Columns.
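The point about fields versus data types can be sketched as follows. This is an illustration under stated assumptions, not Polars or Arrow code: `DataType`, `Field`, `Logical` and the metadata key `hypothetical.logical_type` are all made up for the example. The only fact relied on is that an Arrow field carries metadata in addition to its data type.

```rust
use std::collections::BTreeMap;

// Simplified stand-ins for Arrow's data type and field (hypothetical types).
#[derive(Debug, Clone, PartialEq)]
enum DataType {
    Utf8,
    UInt32,
    Dictionary(Box<DataType>, Box<DataType>),
}

#[derive(Debug, Clone, PartialEq)]
struct Field {
    name: String,
    dtype: DataType,
    metadata: BTreeMap<String, String>,
}

#[derive(Debug, PartialEq)]
enum Logical {
    Categorical,
    Dictionary,
    Other,
}

// Recover the logical type from the whole field; the data type alone
// cannot distinguish the two dictionary cases.
fn logical_type(field: &Field) -> Logical {
    // The metadata key is an assumption made for this sketch.
    let tag = field.metadata.get("hypothetical.logical_type").map(String::as_str);
    match (&field.dtype, tag) {
        (DataType::Dictionary(_, _), Some("categorical")) => Logical::Categorical,
        (DataType::Dictionary(_, _), _) => Logical::Dictionary,
        _ => Logical::Other,
    }
}

// Convenience constructor for a dictionary field with optional tag.
fn dict_field(name: &str, tag: Option<&str>) -> Field {
    let mut metadata = BTreeMap::new();
    if let Some(t) = tag {
        metadata.insert("hypothetical.logical_type".to_string(), t.to_string());
    }
    Field {
        name: name.to_string(),
        dtype: DataType::Dictionary(Box::new(DataType::UInt32), Box::new(DataType::Utf8)),
        metadata,
    }
}
```

Two fields built this way can have identical data types and still map to different logical types, which is why a schema of fields is needed for a faithful round trip.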
This removes all instances of the DictionaryEncoder except one which is used for Polars enums and categoricals. This essentially makes it so that the dictionary arrow type is regarded as any other arrow type.
Fixes #20242.
Fixes #17945.
Fixes #20270.
Fixes #20288.
Fixes #20271.
This should also massively speed up the decoding of enums and categoricals, although that is very much not the goal. The PR unifies the decoder kernels and removes many unnecessary monomorphizations.
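For readers unfamiliar with dictionary encoding, the sketch below shows what decoding a dictionary-encoded column into plain values amounts to. It is a simplified illustration, not the decoder kernel from this PR: the function `decode_dictionary` is hypothetical and works on slices rather than Arrow or Parquet buffers.

```rust
/// Hypothetical sketch: materialize a dictionary-encoded column.
/// Each valid key is an index into `values`; a null key yields a null value.
/// Returns `None` if any key is out of bounds.
fn decode_dictionary<T: Clone>(
    keys: &[Option<usize>],
    values: &[T],
) -> Option<Vec<Option<T>>> {
    keys.iter()
        .map(|key| match key {
            // Null key: the decoded slot is null.
            None => Some(None),
            // Valid key: look up the value, failing on an out-of-range index.
            Some(i) => values.get(*i).cloned().map(Some),
        })
        .collect()
}
```

Because the function is generic over the value type, the same code path handles strings, integers, or any other value type, which is the sense in which a dictionary can be handled like any other type in this sketch.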