TableChangesScan::execute and end to end testing for CDF #580

Merged: 75 commits into delta-io:main, Dec 11, 2024

Conversation

@OussamaSaoudi-db (Collaborator) commented Dec 9, 2024

What changes are proposed in this pull request?

This PR introduces the execute method to TableChangesScan, which performs a CDF scan and returns the change data feed as ScanResults. Each ScanResult holds engine data and a selection vector to apply to the rows of that data.
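As a rough sketch of how an engine might consume this (the module path, the raw_data field, and the raw_mask accessor below are assumptions about the kernel API, not code from this PR):

    use std::sync::Arc;
    use delta_kernel::table_changes::scan::TableChangesScan;
    use delta_kernel::{DeltaResult, Engine};

    fn drain_cdf(scan: &TableChangesScan, engine: Arc<dyn Engine>) -> DeltaResult<()> {
        for result in scan.execute(engine)? {
            let result = result?;
            // Selection vector: rows marked false (e.g. masked out by a
            // deletion vector) must be dropped before surfacing the batch.
            let mask = result.raw_mask().cloned();
            // Engine data for one batch of the change data feed.
            let data = result.raw_data?;
            let _ = (data, mask); // hand off to the engine here
        }
        Ok(())
    }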

A helper method read_scan_file is added to read the rows of a scan_file and perform the physical-to-logical transformation.

The macros sort_lines and assert_batches_sorted_eq are moved to the common crate so they can be shared between the read.rs and the new cdf.rs integration tests.

The cdf-table and cdf-table-non-partitioned test tables are from the delta-rs project.

How was this change tested?

This tests data reading for 3 tables with the following features:

  • a table with deletion vectors involving deletion and restoration
  • a non-partitioned table
  • a partitioned table

@OussamaSaoudi-db (Collaborator Author)

TODO: rename table-with-cdf-and-dv to cdf-table-with-dv

@zachschuermann (Collaborator) left a comment

flushing comments, i'll let you rebase now that the other PR landed

Comment on lines +268 to +269
last_modified: 0,
size: 0,
Collaborator

hm, perhaps we should store these in CdfScanFile? make a follow-up if yes?

Collaborator Author

We discussed this before and the conclusion was that these may someday be useful, but immediately they don't contribute much.

iirc, last_modified may be handy because it could be used to check cache staleness in the engine in case it saves a file in memory without re-reading it for each scan.

Collaborator

right: so I think if the decision is to keep them in FileMeta, then we should have a follow-up to actually use them here and enable further optimizations downstream (without this, downstream optimizations can't make use of them)


let physical_to_logical_expr =
    physical_to_logical_expr(&scan_file, global_state.logical_schema.as_ref(), all_fields)?;
let read_schema = scan_file_read_schema(&scan_file, global_state.read_schema.as_ref());
Collaborator

is this physical_schema? maybe stick to physical/logical naming?

Collaborator Author

Ya sure 👍 In that case, we should probably do some renaming in global scan state and scan/mod.rs at some point.

Collaborator Author

Address this further in this PR too: https://github.com/delta-io/delta-kernel-rs/pull/588/files

Collaborator

cool thanks!

Comment on lines 272 to 275
let read_result_iter =
    engine
        .get_parquet_handler()
        .read_parquet_files(&[file], read_schema, predicate)?;
Collaborator

(don't have to solve now) I wonder if we should introduce this higher in the iterator pipeline we've built up here?

Oversimplification, but instead of something like:

scan_files
    .map(|scan_file| read_parquet_files([scan_file])) // parquet batches
    .map(|batch| do_stuff) // return final batches

could push it up to be

read_parquet_files(scan_files)  // parquet batches
    .map(|batch| do_stuff) // return final batches

Collaborator

unsure how much state you'd need to track and associate with each batch (if it makes things too complicated)

Collaborator Author

Hmm we'd have to somehow associate the scan_file and selection_vector to the batch because it's important for the P2L transformation. So the type of the proposed read_parquet_files would have to be something like Iterator<(EngineData, ResolvedCdfScanFile)>.
and do_stuff: (EngineData, ResolvedCdfScanFile) -> ScanResult
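Sketched with stand-in unit types, that shape might look roughly like the following (illustrative only; read_parquet_files_with_context is hypothetical and the real kernel types are far richer):

    // Stand-ins for the real kernel types.
    struct ResolvedCdfScanFile;
    struct EngineData;
    struct ScanResult;

    // Hypothetical batched reader: one call over all scan files, with each
    // batch still paired with the scan file it came from so the later
    // physical-to-logical step keeps its per-file context.
    fn read_parquet_files_with_context(
        scan_files: impl Iterator<Item = ResolvedCdfScanFile>,
    ) -> impl Iterator<Item = (EngineData, ResolvedCdfScanFile)> {
        scan_files.map(|f| (EngineData, f))
    }

    // do_stuff: (EngineData, ResolvedCdfScanFile) -> ScanResult, as above.
    fn do_stuff((data, scan_file): (EngineData, ResolvedCdfScanFile)) -> ScanResult {
        let _ = (data, scan_file);
        ScanResult
    }

    fn pipeline(
        scan_files: impl Iterator<Item = ResolvedCdfScanFile>,
    ) -> impl Iterator<Item = ScanResult> {
        read_parquet_files_with_context(scan_files).map(do_stuff)
    }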

Collaborator

oh i posted another comment about this, but maybe just make an issue to invest a bit more time into this

@@ -160,7 +160,7 @@ impl ScanResult {
/// store the name of the column, as that's all that's needed during the actual query. For
/// `Partition` we store an index into the logical schema for this query since later we need the
/// data type as well to materialize the partition column.
-#[derive(PartialEq, Debug)]
+#[derive(Clone, PartialEq, Debug)]
Collaborator

Clone needed because now we process the selected column types separately for each commit, where the normal log replay scan does it all in one shot?

Collaborator Author

Ah good point. The reason is really that I don't want to hold a reference to all_fields in TableChangesScan because the iterator needs to be free from lifetimes.

But really all I need is to Arc it. Now we don't need to clone the Vec.

Note that the existing Scan::execute still borrows self

    pub fn execute(
        &self,
        engine: Arc<dyn Engine>,
    ) -> DeltaResult<impl Iterator<Item = DeltaResult<ScanResult>> + '_> {

as you can see with that anonymous lifetime '_. I was planning on opening a followup PR to take that out too.
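For reference, the Arc-based, lifetime-free shape described here might look roughly like this (a simplified sketch with stand-in types, not the exact code in the PR):

    use std::sync::Arc;

    // Simplified stand-in for ColumnType: Selected stores a column name,
    // Partition stores an index into the logical schema (per the doc comment).
    #[derive(Clone, PartialEq, Debug)]
    enum ColumnType {
        Selected(String),
        Partition(usize),
    }

    struct TableChangesScan {
        // An Arc lets the iterator returned by `execute` own a cheap handle
        // instead of borrowing `&self`, so no `'_` lifetime leaks into the
        // return type and the Vec contents are never deep-cloned.
        all_fields: Arc<[ColumnType]>,
    }

    impl TableChangesScan {
        fn execute(&self) -> impl Iterator<Item = ColumnType> + 'static {
            let all_fields = self.all_fields.clone(); // refcount bump only
            (0..all_fields.len()).map(move |i| all_fields[i].clone())
        }
    }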

@@ -29,8 +33,6 @@ pub struct TableChangesScan {
predicate: Option<ExpressionRef>,
// The [`ColumnType`] of all the fields in the `logical_schema`
all_fields: Vec<ColumnType>,
-// `true` if any column in the `logical_schema` is a partition column
-have_partition_cols: bool,
Collaborator

What is special about CDF that we don't need to care about partition columns?
Or did you find a better approach that we should consider backporting to the normal scan as well?

Collaborator Author

have_partition_cols was useful in scan to skip building the physical-to-logical expression (see source). For CDF, we always have to construct an expression because it changes based on the scan file.
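To make "changes based on the scan file" concrete: CDF injects generated columns such as _change_type, and the value of that column depends on whether the batch came from an add, remove, or cdc file, so the projection expression cannot be built once up front. A toy illustration (stand-in enum, not the kernel's Expression API):

    // Stand-in for the kind of file a CDF scan batch originates from.
    enum CdfScanFileType {
        Add,    // data file added in the commit: rows are inserts
        Remove, // data file removed in the commit: rows are deletes
        Cdc,    // change-data file: _change_type is already in the data
    }

    // The generated _change_type column differs per scan file, which is why
    // the physical-to-logical expression is rebuilt for every file.
    fn change_type_source(scan_type: &CdfScanFileType) -> &'static str {
        match scan_type {
            CdfScanFileType::Add => "literal 'insert'",
            CdfScanFileType::Remove => "literal 'delete'",
            CdfScanFileType::Cdc => "read _change_type from the data",
        }
    }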

let table_root = self.table_changes.table_root().clone();
let all_fields = self.all_fields.clone();
let predicate = self.predicate.clone();
let dv_engine_ref = engine.clone();
Collaborator

interesting... we need a separate arc for each of the two map calls, because they co-exist in time and they both outlive this function call.

Collaborator Author

Yeah I was surprised to see that too. I suppose an arc clone is cheap enough given all the processing we're doing.

Collaborator

Yeah, I don't think there's anything wrong -- it was just the first time I'd seen that and was trying to grok it

kernel/src/table_changes/scan.rs (resolved review thread, collapsed)
Comment on lines 38 to 42
if let Some(mask) = mask {
    Ok(filter_record_batch(&record_batch, &mask.into())?)
} else {
    Ok(record_batch)
}
Collaborator

nit: I'm always torn which syntax to use when each case would be a single line either way:

Suggested change
-if let Some(mask) = mask {
-    Ok(filter_record_batch(&record_batch, &mask.into())?)
-} else {
-    Ok(record_batch)
-}
+match mask {
+    Some(mask) => Ok(filter_record_batch(&record_batch, &mask.into())?),
+    None => Ok(record_batch),
+}

Collaborator Author

I think I'd lean towards the match since it feels clearer that there are only two cases.

codecov bot commented Dec 10, 2024

Codecov Report

Attention: Patch coverage is 89.58333% with 10 lines in your changes missing coverage. Please review.

Project coverage is 83.44%. Comparing base (af075a8) to head (192849d).
Report is 1 commit behind head on main.

Files with missing lines             Patch %    Lines
kernel/src/table_changes/scan.rs     88.63%     1 Missing and 9 partials ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #580      +/-   ##
==========================================
+ Coverage   83.21%   83.44%   +0.23%     
==========================================
  Files          74       74              
  Lines       16775    16861      +86     
  Branches    16775    16861      +86     
==========================================
+ Hits        13959    14070     +111     
+ Misses       2168     2130      -38     
- Partials      648      661      +13     


@zachschuermann (Collaborator) left a comment

LGTM woooo!!! but don't think tests are exhaustive (probably need to test some of the filtering/projection/etc. and i'll need to think through more edge cases) but let's merge and take on some more testing soon!

    location,
};
let read_result_iter = engine.get_parquet_handler().read_parquet_files(
    &[file],
Collaborator

regarding that other conversation maybe we just make a follow up to consider pulling read_parquet_files higher in the iterator stack to do better than single-file calls?

/// unpack the test data from {test_parent_dir}/{test_name}.tar.zst into a temp dir, and return the dir it was
/// unpacked into
#[allow(unused)]
pub(crate) fn load_test_data(
Collaborator

nice thanks!
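One possible implementation of that helper (a sketch assuming the tar, zstd, and tempfile crates; the code actually added in this PR may differ):

    use std::{error::Error, fs::File, path::Path};

    /// Unpack {test_parent_dir}/{test_name}.tar.zst into a fresh temp dir and
    /// return it (the directory is deleted when the TempDir is dropped).
    pub(crate) fn load_test_data(
        test_parent_dir: &str,
        test_name: &str,
    ) -> Result<tempfile::TempDir, Box<dyn Error>> {
        let path = Path::new(test_parent_dir).join(format!("{test_name}.tar.zst"));
        let decoder = zstd::Decoder::new(File::open(path)?)?;
        let mut archive = tar::Archive::new(decoder);
        let dir = tempfile::tempdir()?;
        archive.unpack(dir.path())?;
        Ok(dir)
    }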

Ok(batches)
}

fn assert_batches_sorted_eq(expected_lines: &[impl ToString], batches: &[RecordBatch]) {
Collaborator

is this duplicated? can we share code or just create follow up to clean this up soon?

Collaborator Author

Ah good catch. Moved it to common and updated tests. We still passing 😎
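For reference, the shared helper follows the usual pattern of sorting pretty-printed data rows before comparing (a sketch assuming arrow's pretty_format_batches; the exact form in the common crate may differ):

    use arrow::record_batch::RecordBatch;
    use arrow::util::pretty::pretty_format_batches;

    /// Compare pretty-printed batches against the expected lines, sorting the
    /// data rows on both sides first so tests don't depend on row order.
    pub fn assert_batches_sorted_eq(expected_lines: &[impl ToString], batches: &[RecordBatch]) {
        // Keep the top border, header row, header border, and bottom border in
        // place; sort only the data rows in between.
        let mut expected: Vec<String> = expected_lines.iter().map(|l| l.to_string()).collect();
        let n = expected.len();
        if n > 4 {
            expected[3..n - 1].sort_unstable();
        }

        let formatted = pretty_format_batches(batches).unwrap().to_string();
        let mut actual: Vec<String> = formatted.lines().map(|s| s.to_string()).collect();
        let m = actual.len();
        if m > 4 {
            actual[3..m - 1].sort_unstable();
        }

        assert_eq!(expected, actual);
    }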

@OussamaSaoudi-db merged commit 7bcbb57 into delta-io:main on Dec 11, 2024
20 checks passed
@zachschuermann removed the breaking-change label (Change that will require a version bump) on Dec 14, 2024