fix: use arrow's schema instead of spark's for local rel #3602

Merged 4 commits on Dec 19, 2024

universalmind303
Contributor

No description provided.

Comment on lines +56 to +73
// since daft's Utf8 always maps to Arrow's LargeUtf8, we need to handle this special case
// If the expected physical type is LargeUtf8, but the actual Arrow type is Utf8, we need to convert it
if expected_arrow_physical_type == arrow2::datatypes::DataType::LargeUtf8
    && arrow_array.data_type() == &arrow2::datatypes::DataType::Utf8
{
    let utf8_arr = arrow_array
        .as_any()
        .downcast_ref::<arrow2::array::Utf8Array<i32>>()
        .unwrap();

    let arr = Box::new(utf8_to_large_utf8(utf8_arr));

    return Ok(Self {
        field: physical_field,
        data: arr,
        marker_: PhantomData,
    });
}
Contributor Author

IMO, we should be able to create a utf8 array from arrow without an explicit cast, but I can also see the argument for doing this cast outside of the constructor, so I'm fine if we want to move it out of fn new.

I think this will make arrow interop easier overall, since callers won't have to remember to cast utf8 to largeutf8 every time.
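To make the cast under discussion concrete, here is a minimal std-only sketch of what widening a utf8 array to a largeutf8 array involves. The two structs and `widen_utf8` are hypothetical stand-ins for illustration, not arrow2 or Daft types; they model only the Arrow layout (an offsets buffer plus a byte buffer) and leave out the validity bitmap.

```rust
/// Hypothetical stand-in for an Arrow Utf8 array: i32 offsets into a byte buffer.
#[derive(Debug, Clone, PartialEq)]
pub struct SmallUtf8 {
    pub offsets: Vec<i32>, // length = number of strings + 1
    pub values: Vec<u8>,   // concatenated UTF-8 bytes
}

/// Hypothetical stand-in for an Arrow LargeUtf8 array: same layout, i64 offsets.
#[derive(Debug, Clone, PartialEq)]
pub struct LargeUtf8 {
    pub offsets: Vec<i64>,
    pub values: Vec<u8>,
}

impl SmallUtf8 {
    /// Build the offsets/values pair from string slices.
    /// Sketch only: assumes the total byte length fits in i32.
    pub fn from_strs(items: &[&str]) -> Self {
        let mut offsets = vec![0i32];
        let mut values = Vec::new();
        for s in items {
            values.extend_from_slice(s.as_bytes());
            offsets.push(values.len() as i32);
        }
        Self { offsets, values }
    }
}

/// Widen every offset from i32 to i64; the byte buffer is moved over unchanged.
pub fn widen_utf8(small: SmallUtf8) -> LargeUtf8 {
    LargeUtf8 {
        offsets: small.offsets.into_iter().map(i64::from).collect(),
        values: small.values,
    }
}

impl LargeUtf8 {
    /// Return the i-th string, or None when i is out of range.
    pub fn get(&self, i: usize) -> Option<&str> {
        let start = *self.offsets.get(i)? as usize;
        let end = *self.offsets.get(i + 1)? as usize;
        std::str::from_utf8(self.values.get(start..end)?).ok()
    }
}
```

In this model only the offsets are rewritten, so the cost is one pass over the offsets buffer and the string bytes themselves are never touched.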

Contributor Author

@universalmind303 universalmind303 Dec 18, 2024

cc @andrewgazelka. this should supersede #3601

Contributor Author

also @jaychia, @samster25 do you have any preferences on handling this inside the constructor?

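for context, spark uses Arrow's Utf8 ("small" utf8, 32-bit offsets), but in rust we don't natively support creating a Series/DataArray from an Arrow Utf8 array, since daft's Utf8 always maps to Arrow's LargeUtf8 (64-bit offsets). So this adds a check in the constructor that casts the Utf8 array to LargeUtf8 when constructing the array.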

See the below test for expected usage.
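The constructor check described in this comment can also be sketched in isolation. Everything below is a hypothetical, std-only model (`PhysicalType`, `ArrowArray`, and `DataArraySketch` are illustrative names, not the Daft or arrow2 API), and returning an error on any other type mismatch is an assumption of the sketch rather than something this PR states.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum PhysicalType {
    Utf8,
    LargeUtf8,
    Int64,
}

/// Hypothetical stand-in for a dynamically typed Arrow array.
#[derive(Debug, Clone, PartialEq)]
pub enum ArrowArray {
    Utf8 { offsets: Vec<i32>, values: Vec<u8> },
    LargeUtf8 { offsets: Vec<i64>, values: Vec<u8> },
    Int64(Vec<i64>),
}

impl ArrowArray {
    pub fn data_type(&self) -> PhysicalType {
        match self {
            ArrowArray::Utf8 { .. } => PhysicalType::Utf8,
            ArrowArray::LargeUtf8 { .. } => PhysicalType::LargeUtf8,
            ArrowArray::Int64(_) => PhysicalType::Int64,
        }
    }
}

/// Hypothetical array wrapper whose constructor normalizes Utf8 input.
#[derive(Debug, PartialEq)]
pub struct DataArraySketch {
    pub expected: PhysicalType,
    pub data: ArrowArray,
}

impl DataArraySketch {
    pub fn new(expected: PhysicalType, data: ArrowArray) -> Result<Self, String> {
        let data = match (expected, data) {
            // Special case: LargeUtf8 expected but Utf8 supplied, so widen the offsets.
            (PhysicalType::LargeUtf8, ArrowArray::Utf8 { offsets, values }) => {
                ArrowArray::LargeUtf8 {
                    offsets: offsets.into_iter().map(i64::from).collect(),
                    values,
                }
            }
            // Types already agree: store the array as-is.
            (_, other) if other.data_type() == expected => other,
            // Any other mismatch is rejected (an assumption of this sketch).
            (_, other) => {
                return Err(format!(
                    "expected {:?}, got {:?}",
                    expected,
                    other.data_type()
                ))
            }
        };
        Ok(Self { expected, data })
    }
}
```

Doing the conversion in the constructor means a caller holding a Utf8 array gets a usable array back without an explicit cast; the trade-off raised in the thread is that the conversion then happens implicitly rather than at the call site.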

codspeed-hq bot commented Dec 18, 2024

CodSpeed Performance Report

Merging #3602 will degrade performances by 44%

Comparing universalmind303:local_rel_cleanup (aa0f642) with main (ca4d3f7)

Summary

⚡ 1 improvements
❌ 1 regressions
✅ 25 untouched benchmarks

⚠️ Please fix the performance issues or acknowledge them on CodSpeed.

Benchmarks breakdown

Benchmark                                    main       universalmind303:local_rel_cleanup   Change
test_count[1 Small File]                     3.6 ms     3.2 ms                               +12.36%
test_iter_rows_first_row[100 Small Files]    169.1 ms   301.9 ms                             -44%
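One reading of the Change column that is consistent with the figures above is base time divided by head time, minus one. This is an assumption inferred from the numbers, not a documented CodSpeed formula, and `relative_change_pct` is a hypothetical helper. Under that reading, going from 169.1 ms to 301.9 ms comes out at roughly -44%, even though the head run takes about 1.79x as long.

```rust
/// Relative change in percent, computed as base / head - 1.
/// Assumption: this matches how the report's Change column is derived.
pub fn relative_change_pct(base_ms: f64, head_ms: f64) -> f64 {
    (base_ms / head_ms - 1.0) * 100.0
}

/// How many times longer the head run takes compared with the base run.
pub fn slowdown_factor(base_ms: f64, head_ms: f64) -> f64 {
    head_ms / base_ms
}
```

Applied to the rounded 3.6 ms and 3.2 ms shown for test_count, the same formula gives +12.5% rather than the reported +12.36%, presumably because the displayed timings are rounded.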

codecov bot commented Dec 18, 2024

Codecov Report

Attention: Patch coverage is 98.00000% with 1 line in your changes missing coverage. Please review.

Project coverage is 77.83%. Comparing base (ca4d3f7) to head (aa0f642).
Report is 1 commits behind head on main.

Files with missing lines                                 Patch %   Lines
...ect/src/translation/logical_plan/local_relation.rs    96.29%    1 Missing ⚠️
Additional details and impacted files

@@            Coverage Diff             @@
##             main    #3602      +/-   ##
==========================================
+ Coverage   77.80%   77.83%   +0.02%     
==========================================
  Files         718      717       -1     
  Lines       88176    87962     -214     
==========================================
- Hits        68607    68465     -142     
+ Misses      19569    19497      -72     
Files with missing lines                                 Coverage Δ
src/daft-connect/src/translation/datatype.rs             12.66% <ø> (-10.41%) ⬇️
src/daft-core/src/array/mod.rs                           73.22% <100.00%> (+5.92%) ⬆️
...ect/src/translation/logical_plan/local_relation.rs    94.44% <96.29%> (+3.01%) ⬆️

... and 1 file with indirect coverage changes

Contributor

@andrewgazelka andrewgazelka left a comment

looks good; made #3605 in case we want to revisit conversion

@andrewgazelka andrewgazelka merged commit e0d2b8a into Eventual-Inc:main Dec 19, 2024
40 of 41 checks passed
2 participants