-
Notifications
You must be signed in to change notification settings - Fork 174
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix: use arrow's schema instead of spark's for local rel #3602
fix: use arrow's schema instead of spark's for local rel #3602
Conversation
// since daft's Utf8 always maps to Arrow's LargeUtf8, we need to handle this special case | ||
// If the expected physical type is LargeUtf8, but the actual Arrow type is Utf8, we need to convert it | ||
if expected_arrow_physical_type == arrow2::datatypes::DataType::LargeUtf8 | ||
&& arrow_array.data_type() == &arrow2::datatypes::DataType::Utf8 | ||
{ | ||
let utf8_arr = arrow_array | ||
.as_any() | ||
.downcast_ref::<arrow2::array::Utf8Array<i32>>() | ||
.unwrap(); | ||
|
||
let arr = Box::new(utf8_to_large_utf8(utf8_arr)); | ||
|
||
return Ok(Self { | ||
field: physical_field, | ||
data: arr, | ||
marker_: PhantomData, | ||
}); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IMO, we should be able to create a utf8 array from arrow without an explicit cast, but I can also see the argument for wanting to do this cast outside of the constructor. So I'm fine with if we want to do this outside of fn new
.
I think this'll make arrow interop easier as a whole without remembering needing to cast utf8 to largeutf8 every time.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
cc @andrewgazelka. this should supercede #3601
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
also @jaychia, @samster25 do you have any preferences on handling this inside the constructor?
for context, spark uses small utf8, but in rust we don't natively support creating series/DataArray from arrow's smallutf8 array. So this adds a check here to cast the smallutf8 to a largeutf8 when constructing the array.
See the below test for expected usage.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
CodSpeed Performance ReportMerging #3602 will degrade performances by 44%Comparing Summary
Benchmarks breakdown
|
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #3602 +/- ##
==========================================
+ Coverage 77.80% 77.83% +0.02%
==========================================
Files 718 717 -1
Lines 88176 87962 -214
==========================================
- Hits 68607 68465 -142
+ Misses 19569 19497 -72
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
looks good; made #3605 in case we want to revist conversion
No description provided.