
Update ObjectStore 0.7.0 and Arrow 46.0.0 #7282

Merged (14 commits) Aug 25, 2023

Conversation

tustvold (Contributor) commented Aug 14, 2023

Which issue does this PR close?

Closes #7332

Rationale for this change

What changes are included in this PR?

Are these changes tested?

Are there any user-facing changes?

@github-actions github-actions bot added the core Core DataFusion crate label Aug 14, 2023
@@ -19,8 +19,6 @@

mod arrow_file;
mod avro;
#[cfg(test)]
mod chunked_store;
tustvold (Author):

This was moved upstream

for w in writers.iter_mut() {

// Must drop the stream before creating ObjectMeta below as drop
// triggers finish for ZstdEncoder which writes additional data
tustvold (Author):

We previously got away with this because LocalFileSystem was ignoring the requested range

@tustvold tustvold force-pushed the prepare-object-store-0.7 branch from 84655f0 to ac3c787 Compare August 14, 2023 21:01
@tustvold tustvold force-pushed the prepare-object-store-0.7 branch from ac3c787 to 80e73d7 Compare August 14, 2023 22:12
let is_whole_file_scanned = file_meta.range.is_none();
let decoder = if is_whole_file_scanned {
// For special case: `get_range()` will interpret `start` and `end` as the
// byte range after decompression for compressed files
// Don't seek if no range as breaks FIFO files
tustvold (Author):

This is at best a hack. I'm not really sure how to coherently support FIFO files, and I wonder whether this support really belongs in DataFusion proper.

Contributor:

FYI @metesynnada -- I wonder if you have thoughts about moving FIFO support into a more separated boundary -- I wonder if we could make a special interface that handles incremental streaming somehow, and then implement FIFO support for that interface 🤔

Contributor:

I apologize for seeing your comment late. I will bring up the topic with @ozankabak for further discussion. Additionally, I wanted to inquire if you recommend using a so-called streaming_store rather than an object_store to accommodate streaming use cases.

tustvold (Author) commented Aug 19, 2023:

I think the question is perhaps more whether the streaming operators should be operators in their own right, instead of both streaming and non-streaming use-cases using CsvExec. Perhaps we could introduce a FileStreamExec or something? Both could still make use of object_store and arrow-csv under the hood, but separating them would perhaps better accommodate divergent functionality like schema inference, parallel reads, late materialisation, etc... that doesn't work in the same way for streams?

I dunno, just spitballing here, it seems unfortunate to force a lowest common denominator on CsvExec, where it can't read byte ranges from files...

Contributor:

I will think about the FileStreamExec idea and discuss with @metesynnada. We might be coming to a point where taking such a step may make sense. We will circle back once we have some clarity on our end.

Contributor:

@ozankabak and I agree that implementing FileStreamExec would be a logical choice. We plan on developing a proof of concept for it next week and sharing a design document.

@github-actions github-actions bot added physical-expr Physical Expressions sqllogictest SQL Logic Tests (.slt) substrait labels Aug 18, 2023
assert_eq!(c1.value(0), "1");
assert_eq!(c1.value(1), "0");
assert_eq!(c1.value(2), "1");
assert_eq!(c1.value(0), "true");
@github-actions github-actions bot added the optimizer Optimizer rules label Aug 18, 2023
@@ -201,6 +201,7 @@ pub fn cell_to_string(col: &ArrayRef, row: usize) -> Result<String> {
Ok(NULL_STR.to_string())
} else {
match col.data_type() {
DataType::Null => Ok(NULL_STR.to_string()),
@tustvold tustvold changed the title Prepare for ObjectStore 0.7.0 Update ObjectStore 0.7.0 and Arrow 46.0.0 Aug 21, 2023
@tustvold tustvold marked this pull request as ready for review August 24, 2023 20:40
tustvold (Author):
I'm not sure what the pyarrow CI failures are, but they don't appear to be related to this PR

alamb (Contributor) left a comment:

I saw one todo -- so far this PR looks epic. I look forward to completing my review in the morning

@@ -1329,6 +1321,11 @@ impl ScalarValue {
self.to_array_of_size(1)
}

/// Converts a scalar into an arrow [`Scalar`]
pub fn to_scalar(&self) -> Scalar<ArrayRef> {
Contributor:
👍

@@ -237,6 +237,7 @@ async fn sort_preserving_merge() {
}

#[tokio::test]
#[ignore] // TODO: Fix this
Contributor:

this still seems to be left TODO

tustvold (Author):

Oops, forgot about that 😅

@@ -581,7 +581,9 @@ fn make_dict_batches() -> Vec<RecordBatch> {
// ...
// 0000000002

let values: Vec<_> = (i..i + batch_size).map(|x| format!("{x:010}")).collect();
let values: Vec<_> = (i..i + batch_size)
.map(|x| format!("{:010}", x / 16))
tustvold (Author) commented Aug 25, 2023:

This change was necessary so that the dictionaries contain some repeated values. Without repeats, the RowConverter overheads come to dominate the memory usage and the intermediate memory usage of the merge becomes too great, to the point where testing the spill reservation becomes a moot point.

alamb (Contributor) left a comment:

Thank you @tustvold -- I went through this PR carefully and it looks (really) nice to me. Tightening up the binary expression evaluation addresses something that has bothered me for a long time. Thank you for doing this.

I also checked the size of the datafusion-cli binary (built via cargo build --release):

This branch:

du -h /Users/alamb/Software/target-df/release/datafusion-cli
 72M	/Users/alamb/Software/target-df/release/datafusion-cli

Main:

$ du -h /Users/alamb/Software/target-df2/release/datafusion-cli
 85M	/Users/alamb/Software/target-df2/release/datafusion-cli

or_kleene(&and(&left_is_null, &right_is_null)?, &eq)
}
_ => eq_dyn(left, right),
_ if null_equals_null => not_distinct(&left, &right),
Contributor:

Nice 👍

@@ -302,7 +302,7 @@ CREATE TABLE my_table(c1 float, c2 double, c3 boolean, c4 varchar) AS SELECT *,c
query RRBT rowsort
SELECT * FROM my_table order by c1 LIMIT 1
----
0.00001 0.000000000001 true 1
0.00001 0.000000000001 true true
Contributor:

Converting boolean to "true" seems like an improvement over 1 to me.

//! This module contains computation kernels that are eventually
//! destined for arrow-rs but are in datafusion until they are ported.

use arrow::{array::*, datatypes::ArrowNumericType};
Contributor:

Thank you so much @tustvold -- it is great to see this gone.

Ok(Arc::new(paste::expr! {[<$OP _binary>]}(&ll, &rr)?))
}};
}

/// Invoke a compute kernel on a data array and a scalar value
macro_rules! compute_utf8_op_scalar {
Contributor:

it seems like eventually we could make a ColumnarValue::to_arrow_datum type function that would allow direct invocation of many of these kernels from DataFusion without needing layers of dispatch

tustvold (Author):

Yes, ideally with #7353 we could make ScalarValue implement Datum directly

Contributor:

And then implement Datum directly for ColumnarValue as well 🤔 that would be pretty sweet

Labels
core Core DataFusion crate optimizer Optimizer rules physical-expr Physical Expressions sqllogictest SQL Logic Tests (.slt) substrait

Successfully merging this pull request may close these issues.

Is Distinct From Incorrectly Handles Masked Nulls
5 participants