[POC][wip] faster DefaultEngine parquet reads #595

Draft: zachschuermann wants to merge 2 commits into main

Conversation

@zachschuermann (Collaborator) commented Dec 13, 2024

TLDR

This PR is a POC/exploration of speeding up DefaultEngine parquet reads. The current implementation is unfortunately rather complicated and, despite a solid amount of async code, appears to read all parquet files serially. There are two main outcomes from this exploration:

  1. Empirically showed that the DefaultParquetHandler::read_parquet_files implementation is indeed serial but can be trivially made async/concurrent using typical async code and tokio tasks. In this case we can fire off all IO requests (up to some limit) and then bridge the async-to-sync boundary with an mpsc channel (see the sketch after this list).
  2. Within our Scan::execute implementation we only ever pass a single FileMeta to read_parquet_files, immediately limiting any concurrency implemented in the engine. This wasn't explored further in this PR and likely requires more design work, since a parquet file's partition values must be colocated with the result of the parquet read, which doesn't fit nicely in the existing API.
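
To make the pattern in point 1 concrete, here is a minimal, self-contained sketch of the task-per-file plus mpsc-bridge idea. This is not the kernel's actual API: read_one_file and all the names here are hypothetical stand-ins.

    use std::sync::mpsc;
    use tokio::runtime::Runtime;

    // hypothetical stand-in for the real async parquet read
    async fn read_one_file(file: String) -> String {
        tokio::time::sleep(std::time::Duration::from_millis(100)).await;
        format!("data from {file}")
    }

    fn read_files_concurrently(runtime: &Runtime, files: Vec<String>) -> mpsc::IntoIter<String> {
        let (tx, rx) = mpsc::channel();
        for file in files {
            let tx = tx.clone();
            // fire off all IO immediately; each task reports back over the channel
            runtime.spawn(async move {
                let _ = tx.send(read_one_file(file).await);
            });
        }
        // drop the original sender so the receiver ends once every task finishes
        drop(tx);
        rx.into_iter()
    }

    fn main() {
        let runtime = Runtime::new().unwrap();
        let files: Vec<String> = (0..10).map(|i| format!("file_{i}.parquet")).collect();
        // the consuming side stays synchronous: the caller never touches async
        for data in read_files_concurrently(&runtime, files) {
            println!("{data}");
        }
    }

Note that results arrive in completion order, not input order; a real implementation would need to preserve ordering if callers depend on it.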

Details

Need for a better DefaultParquetHandler::read_parquet_files

The POC here gives an alternative (strawman) as AsyncParquetHandler, which simply launches a tokio task for each parquet file to read. The rudimentary tests (that only work on my machine lol) show that the existing implementation reads each file serially (despite readahead = 10) and that the new implementation indeed fires off all IO immediately. (The tests simulate high IO latency via sleep.)

Future work

  • understand why the existing implementation is serial (obviously a bug and not intended)
  • consider productionization of something like the AsyncParquetHandler
  • include benchmarks/other substantiation of changes made in this area
  • consider if/when/how to integrate with various async runtimes. If the consumer of the kernel is also an async Rust user, it may be beneficial to propagate a runtime handle so that we don't end up with competing runtimes (see the sketch after this list).
  • consider overall execution design in the DefaultEngine: do we want multiple runtimes? (IO runtime and CPU-bound runtime?)
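
As a rough illustration of the runtime-handle bullet above: AsyncParquetHandler is the strawman name from this PR, but the constructor and spawn_read shown here are hypothetical, not its real interface.

    use tokio::runtime::Handle;

    // hypothetical: the engine borrows a handle to the consumer's existing
    // runtime instead of constructing its own, so no two runtimes compete
    struct AsyncParquetHandler {
        handle: Handle,
    }

    impl AsyncParquetHandler {
        fn new(handle: Handle) -> Self {
            Self { handle }
        }

        fn spawn_read(&self, path: String) {
            // IO lands on the consumer's runtime
            self.handle.spawn(async move {
                println!("reading {path}");
            });
        }
    }

A consumer already inside a tokio context could construct this with AsyncParquetHandler::new(Handle::current()).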

Need for a better Scan::execute implementation

Currently, within Scan::execute we take the ScanFile iterator and sequentially call read_parquet_files() on each one:

    let result = scan_files_iter
        .map(move |scan_file| -> DeltaResult<_> {
            // ... [snip] ...
            let read_result_iter = engine.get_parquet_handler().read_parquet_files(
                &[meta],
                global_state.physical_schema.clone(),
                physical_predicate.clone(),
            )?;
            // ... [snip] ...
            Ok(read_result_iter.map(move |read_result| -> DeltaResult<_> {
                // transform the physical data into the correct logical form
                let logical = transform_to_logical_internal(...);
                // ... [snip] ...
                Ok(result)
            }))
        })
        .flatten_ok()
        .map(|x| x?);

Instead, we would ideally pass all the parquet files required for the scan at once and let the engine decide how to schedule IO. This is unfortunately not easy to coordinate, since we have per-file state that must propagate 'through' the read_parquet_files API. That is, if we hand over all the parquet files in the scan at once (spanning multiple partitions), we must somehow align the data we read back with each file's partition values. I didn't dig much further here, but I'm flagging this for optimization soon; one possible API shape is sketched below.
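
One possible (entirely hypothetical) API shape: tag each yielded batch with the index of the file it came from, so the caller can re-attach per-file state such as partition values no matter how many chunks each file produces.

    // hypothetical sketch, not the current kernel API
    struct FileMeta {
        path: String,
    }

    struct Batch; // stand-in for a chunk of arrow data

    // yields (file_index, batch) pairs; a real implementation would fan out
    // IO, while this trivially produces one batch per file
    fn read_parquet_files_tagged(
        files: &[FileMeta],
    ) -> impl Iterator<Item = (usize, Batch)> + '_ {
        files.iter().enumerate().map(|(i, _file)| (i, Batch))
    }

The caller could then index into its own per-file partition values with the tag, independent of how each file was chunked.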

Future work

  • consider all the spots where we do parquet file reads: do we pass in multiple files to leverage concurrency implemented by the engine? No: everywhere except reading all checkpoint parts passes a single file. Note that a future use case is reading all sidecars in a v2Checkpoint
  • how can we fix the execute implementation to allow for reading all parquet files at once and propagating necessary per-file information?

github-actions bot added the breaking-change label (Change that will require a version bump) on Dec 13, 2024
@nicklan (Collaborator) left a comment:

nice, this is clearly simpler and probably better.

Couple of things:

  1. Do you have any rough benchmarks for perf differences?
  2. yeah execute is pretty dumb here. we could consider a with_batch_size arg or something that tells it to try and collect a certain number of files to read before firing off to the parquet handler; I don't think that would be too hard to implement (rough sketch below).
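
A minimal sketch of that batching idea, assuming itertools-style chunking (the helper and the with_batch_size name are hypothetical, not existing API):

    use itertools::Itertools;

    // collect up to batch_size scan files before each parquet-handler call
    fn batched_reads(scan_files: Vec<String>, batch_size: usize) {
        for chunk in &scan_files.into_iter().chunks(batch_size) {
            let files: Vec<String> = chunk.collect();
            // hypothetical: one read_parquet_files call per batch of files
            println!("read_parquet_files on {} files", files.len());
        }
    }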

@@ -137,6 +137,9 @@ fn try_main() -> DeltaResult<()> {
}
})
.try_collect()?;
print_batches(&batches)?;
// print_batches(&batches)?;
@nicklan (Collaborator) commented:

for experimenting, i'd suggest using the multi-threaded reader. although i guess this does help determine how much a single call can read. regardless, read-table-multi-threaded has a --limit option for this case so you can see that some data got returned but not print it all, but it does tell you the total row count. maybe add that as an option here too :)

@zachschuermann (Author) replied:

yep thanks I ended up playing with both but yea the --limit is nicer :)


let file_opener: Arc<dyn FileOpener + Send + Sync> = Arc::from(file_opener);
let len = files.len();
runtime.block_on(async {
@nicklan (Collaborator) commented:

I think you can run basically all of the code in here outside the block_on except the join. So you'd do something like:

let files = files.to_vec();
let mut handles = Vec::with_capacity(len);
for file in files.into_iter() {
  [same code]
}

runtime.block_on(async {
  join_all(handles).await;
});

Just a little more clear what's going on I think
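
A runnable toy version of that restructuring, with fetch as a hypothetical stand-in for the per-file read:

    use futures::future::join_all;
    use tokio::runtime::Runtime;

    // stand-in for the real per-file parquet read
    async fn fetch(i: usize) -> usize {
        i * 2
    }

    fn main() {
        let runtime = Runtime::new().unwrap();
        // build all the task handles synchronously; spawning starts them running
        let mut handles = Vec::new();
        for i in 0..10 {
            handles.push(runtime.spawn(fetch(i)));
        }
        // only the final join needs to block
        let results = runtime.block_on(join_all(handles));
        println!("{results:?}");
    }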


for file in files.into_iter() {
// let permit = semaphore.clone().acquire_owned().await.unwrap();
let tx_clone = tx.clone();
@nicklan (Collaborator) commented:

nit: just call it tx


Ok(Box::pin(async move {
// TODO avoid IO by converting passed file meta to ObjectMeta
let meta = store.head(&path).await?;
let mut reader = ParquetObjectReader::new(store, meta);
if let Some(handle) = handle {
reader = reader.with_runtime(handle);
@nicklan (Collaborator) commented:

what does setting this do?

@zachschuermann (Author) replied:

new in arrow 53.3 i think - lets you push down a runtime for them to schedule their IO on. This has gotten me thinking about various ways to enable this sort of 'runtime passthrough' ourselves. Quoting the with_runtime docs:

> Perform IO on the provided tokio runtime.
>
> Tokio is a cooperative scheduler, and relies on tasks yielding in a timely manner to service IO. Therefore, running IO and CPU-bound tasks, such as parquet decoding, on the same tokio runtime can lead to degraded throughput, dropped connections and other issues. For more information see here.

see with_runtime
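
As a sketch of the pattern those docs recommend: keep a dedicated runtime for IO and push its handle into the reader, so CPU-bound parquet decode on the main runtime cannot starve network tasks. The builder settings below are arbitrary assumptions.

    use tokio::runtime::{Builder, Runtime};

    // a small runtime reserved for IO; CPU-bound decode runs elsewhere
    fn io_runtime() -> Runtime {
        Builder::new_multi_thread()
            .worker_threads(2)
            .thread_name("kernel-io")
            .enable_all()
            .build()
            .unwrap()
    }

    // then, keeping `io` alive for the duration of the reads:
    //   let io = io_runtime();
    //   reader = reader.with_runtime(io.handle().clone());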

@nicklan (Collaborator) replied:

Thanks. Yeah, seems similar to what you're doing.

@zachschuermann (Author) commented:

> nice, this is clearly simpler and probably better.
>
> Couple of things:
>
>   1. Do you have any rough benchmarks for perf differences?
>   2. yeah execute is pretty dumb here. we could consider a with_batch_size arg or something that tells it to try and collect a certain number of files to read before firing off to the parquet handler; I don't think that would be too hard to implement.

Note: 100MB test table with 100 files in the latest snapshot (on S3); read time goes from 50s to 7.58s (a 6.6x speedup) on my M1 Mac.

woo! and yea I just did some heavy-handed hacking to get a somewhat useful benchmark running. I hacked around to get Scan::execute to just hand all the files at once to the reader. The hard part with execute is 'lining up' all the chunks of data: that is, if we have 3 parquet files and read each of them in two chunks, then in our current API we have 6 chunks yielded from the parquet reader and need to somehow 'attach' the appropriate per-file data to each set (e.g. if the first file is partition part=1, then we need to make sure that part=1 is applied to the first two chunks). Anyways, I totally hacked it out and I'm not applying any DVs or partition values appropriately, but just did a simple bake-off to validate. Table size is ~100MB in the latest snapshot, split across 100 files (1MB each):

before my changes (old read_parquet_files)

[20:02] [databricks] read-table-single-threaded ➜ time ../../../target/release/read-table-single-threaded "s3://zach-tables/test_table_100/100_file_table" --public --region us-west-2
Reading s3://zach-tables/test_table_100/100_file_table/
Total rows read: 10000000

________________________________________________________
Executed in   50.02 secs    fish           external
   usr time    1.38 secs    0.14 millis    1.38 secs
   sys time    0.88 secs    3.30 millis    0.88 secs

after my changes (new read_parquet_files)

[20:03] [databricks] read-table-single-threaded ➜ time ../../../target/release/read-table-single-threaded "s3://zach-tables/test_table_100/100_file_table" --public --region us-west-2
Reading s3://zach-tables/test_table_100/100_file_table/
Total rows read: 10000000

________________________________________________________
Executed in    7.58 secs      fish           external
   usr time  968.55 millis    0.12 millis  968.43 millis
   sys time  880.15 millis    2.65 millis  877.50 millis
