
ScanFile expression and visitor for CDF #546

Merged
merged 15 commits into from
Dec 9, 2024

Conversation


@OussamaSaoudi-db OussamaSaoudi-db commented Nov 27, 2024

What changes are proposed in this pull request?

This PR introduces four concepts:

  • cdf_scan_row_schema: This is the schema that engine data will be transformed into at the end of the log replay phase. This schema prunes the log schema down only to the fields necessary to produce CDF columns.
  • cdf_scan_row_expression: This is a function that generates an expression to transform engine data into the cdf_scan_row_schema. The function takes a timestamp and commit number as arguments because it inserts these as columns into the output engine data.
  • CDFScanFile: This is a type that holds all the information needed to read a data file and generate its CDF rows. It holds the path, deletion vector, action type, and the paired remove deletion vector. The action type is encoded as the enum CDFScanFileType.
  • CDFScanFileVisitor: This is a visitor that reads engine data with the cdf_scan_row_schema and constructs CDFScanFiles.
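As a rough illustration of the last two concepts, the types might look something like the sketch below. All field names and types here are assumptions drawn from this description, not the kernel's actual definitions:

```rust
use std::collections::HashMap;

// Illustrative sketch only; the real kernel definitions may differ.
#[derive(Debug, Clone, Copy, PartialEq)]
enum CdfScanFileType {
    Add,
    Remove,
    Cdc,
}

// Everything needed to read one data file and produce its CDF rows.
#[derive(Debug, Clone)]
struct CdfScanFile {
    scan_type: CdfScanFileType,
    path: String,
    // Stand-ins for the real deletion-vector descriptors.
    dv_info: Option<String>,
    remove_dv: Option<String>,
    partition_values: HashMap<String, String>,
    commit_version: i64,
    commit_timestamp: i64,
}
```

The visitor would then walk engine data in the cdf_scan_row_schema and emit one such struct per selected file action.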

This PR is only for internal use and is only expected to be used by TableChangesScan::execute when it is implemented. Engines must not use the visitor or CDFScanFile.

This PR affects the following public APIs

No public APIs are changed.

How was this change tested?

I generate a table with add, remove and cdc actions. Then:

  • The table is read,
  • The engine data is transformed by table_changes_action_iter, which applies the cdf_scan_row_expression to transform the engine data into the cdf_scan_row_schema,
  • The transformed engine data is read back using the CDFScanFileVisitor, and the test asserts that the resulting CDFScanFiles are as expected.

This test checks the following cases:

  • A remove with None partition values. An empty hashmap for partition values should be used.
  • A remove with partition values.
  • An add/remove DV pair. This should place the correct remove dv into the add's CdfScanFile.
  • The visitor extracts the correct timestamp and commit version for each file.


codecov bot commented Nov 27, 2024

Codecov Report

Attention: Patch coverage is 93.44828% with 19 lines in your changes missing coverage. Please review.

Project coverage is 82.45%. Comparing base (bea3326) to head (a5b513c).
Report is 1 commit behind head on main.

Files with missing lines Patch % Lines
kernel/src/table_changes/scan_file.rs 93.63% 5 Missing and 13 partials ⚠️
kernel/src/table_changes/log_replay.rs 85.71% 0 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #546      +/-   ##
==========================================
+ Coverage   82.28%   82.45%   +0.16%     
==========================================
  Files          71       72       +1     
  Lines       15777    16054     +277     
  Branches    15777    16054     +277     
==========================================
+ Hits        12982    13237     +255     
- Misses       2167     2175       +8     
- Partials      628      642      +14     


@github-actions github-actions bot added the breaking-change Change that will require a version bump label Nov 29, 2024
@OussamaSaoudi-db OussamaSaoudi-db changed the title [WIP] Implement ScanFile visitor, expression, and struct for TableChanges [WIP] ScanFile expression and visitor for CDF Dec 2, 2024
@nicklan nicklan left a comment

This looks fine. Please note in the PR that this is internal for now. That is, at some point we may want to expose this to engines, but for the first draft of CDF we are only allowing a full execute(), which just uses this internally and doesn't expose it.

///
/// }
/// ```
pub(crate) fn scan_row_schema() -> SchemaRef {
Collaborator

probably doesn't need to be pub(crate). I think it should only be used in this file

Collaborator Author

This will be used by the LogReplayScanner since this is the output schema for the transformed EngineData.

@OussamaSaoudi-db OussamaSaoudi-db removed the breaking-change Change that will require a version bump label Dec 4, 2024
@github-actions github-actions bot added the breaking-change Change that will require a version bump label Dec 4, 2024
@OussamaSaoudi-db OussamaSaoudi-db marked this pull request as ready for review December 4, 2024 23:32
@OussamaSaoudi-db OussamaSaoudi-db changed the title [WIP] ScanFile expression and visitor for CDF ScanFile expression and visitor for CDF Dec 4, 2024
@OussamaSaoudi-db
Collaborator Author

todo: rename to scan_data_* instead of scan_row_*

@OussamaSaoudi-db
Collaborator Author

TODO: Remove the comment on schema because that can become out of sync

@OussamaSaoudi-db
Collaborator Author

TODO: Make these changes: #540 (comment)

@scovich scovich left a comment

quick drive-by, but will wait for pending changes to do a full review

#[allow(unused)]
pub(crate) fn scan_data_to_scan_file(
scan_data: impl Iterator<Item = DeltaResult<TableChangesScanData>>,
) -> impl Iterator<Item = DeltaResult<(CDFScanFile, Option<Arc<HashMap<String, DvInfo>>>)>> {
Collaborator

That type is a real mouthful... worth factoring it out as a named type or even a struct?

Collaborator Author

I was considering introducing two types:

#[allow(unused)]
pub(crate) struct UnresolvedCDFScanFile {
    scan_file: CDFScanFile,
    remove_dvs: Arc<HashMap<String, DvInfo>>,
}
#[allow(unused)]
pub(crate) struct ResolvedCDFScanFile {
    scan_file: CDFScanFile,
    selection_vector: Vec<bool>,
}

Because the step after this will be to resolve deletion vectors. Is this sensible naming?

@OussamaSaoudi-db
Collaborator Author

@scovich you had a question about whether we need to use unsigned ints to track commit versions. I'll check the spec at some point.

Visitor code expects that all numbers are signed ints, but we declare type Version = u64. I'd been keeping versions as u64, but I think we should come up with a consistent approach.


scovich commented Dec 6, 2024

@scovich you had a question about whether we need to use unsigned ints to track commit versions. I'll check the spec at some point.

Visitor code expects that all numbers are signed ints, but we declare type Version = u64. I'd been keeping versions as u64, but I think we should come up with a consistent approach.

Can we make a tracking issue, with the goal to decide once and for all our stance on signed vs. unsigned?
Or maybe #174 is that issue and we just need to do something about it.
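One possible resolution, shown here purely as an illustrative sketch and not a decision: keep `type Version = u64` kernel-wide and convert at the visitor boundary, where engine data surfaces the value as a signed i64, rejecting negatives:

```rust
// Sketch only: the kernel declares `type Version = u64`, while visitor
// getters surface numbers as signed i64. Converting at the boundary
// keeps versions unsigned everywhere else and rejects negative values.
type Version = u64;

fn version_from_visitor(raw: i64) -> Result<Version, String> {
    Version::try_from(raw).map_err(|_| format!("negative commit version: {raw}"))
}
```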

@nicklan nicklan left a comment

given that this is all internal, stamping for now and we can iterate later if needed

}

fn selected_column_names_and_types(&self) -> (&'static [ColumnName], &'static [DataType]) {
static NAMES_AND_TYPES: LazyLock<ColumnNamesAndTypes> =
Collaborator

it's a really nice way to handle this :)

@scovich scovich left a comment

Approach looks sound. Several nits to address (especially the one about schema pruning/nullability), but they should be quick and easy.

use crate::{DeltaResult, EngineData, Error, RowVisitor};

#[allow(unused)]
pub(crate) struct UnresolvedCDFScanFile {
Collaborator

naming nit: There are no good ways to handle acronyms in CamelCase, but I believe the general consensus is that the acronym should be init-cap instead of all-cap so readers can tell where "word" boundaries are

Suggested change
pub(crate) struct UnresolvedCDFScanFile {
pub(crate) struct UnresolvedCdfScanFile {

(see e.g. Cdf action, DvInfo below, etc)

(many more below)

remove_dvs: Arc<HashMap<String, DvInfo>>,
}
#[allow(unused)]
pub(crate) struct ResolvedCDFScanFile {
Collaborator

It would be helpful to have doc comments explaining the difference between "resolved" and "unresolved" CDF files? (I believe it refers to whether the file's deletion vector has been loaded yet)

Collaborator Author

Added docs

Comment on lines +202 to +206
let deletion_vector = StructType::new([
StructField::new("storageType", DataType::STRING, true),
StructField::new("pathOrInlineDv", DataType::STRING, true),
StructField::new("offset", DataType::INTEGER, true),
StructField::new("sizeInBytes", DataType::INTEGER, true),
StructField::new("cardinality", DataType::LONG, true),
]);
Collaborator

AFAIK, this schema is not pruned; should we just use DeletionVectorDescriptor::to_schema instead?
(I think the answer depends on whether we want this read to automatically pick up any new fields added in the future, or if we intentionally restrict it to this specific set of fields)

NOTE: Most of these fields are non-nullable, which is allowed for a read schema. The rule that e.g. arrow-rust follows is that a non-nullable field is allowed inside a nullable field, and will have a null value exactly and only when the outer field was already null. That seems a useful thing to enforce for the read schema.

The non-nullable observation applies to several fields in other structs below, like {add, remove, cdc}.path; the fileConstantValues field should probably be non-nullable as well, but it won't matter in practice because that's anyway a generated field. Note that the Delta spec is ambiguous about whether partitionValues is nullable: it's required for Add, optional for Remove, and unspecified for Cdc.

NOTE: Visitor schemas are different from read schemas, because visitor schemas are exploded out as stand-alone leaf fields. At that point all parent/child relationships are lost, so a field passed to the visitor must be nullable if it or any parent is nullable. Which means all-nullable for all action visitors. Which is why the visitor API takes name/type pairs (no nullability param) and the getters are all nullable.
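The flattening rule described in this note can be sketched as follows; the types are illustrative stand-ins, not the kernel's actual schema API:

```rust
// Sketch of the flattening rule: when a nested schema is exploded into
// stand-alone leaves for a visitor, a leaf is effectively nullable if it
// or any of its ancestors is nullable.
enum Field {
    Leaf { name: &'static str, nullable: bool },
    Struct { nullable: bool, children: Vec<Field> },
}

fn flatten(field: &Field, ancestor_nullable: bool, out: &mut Vec<(&'static str, bool)>) {
    match field {
        Field::Leaf { name, nullable } => out.push((name, *nullable || ancestor_nullable)),
        Field::Struct { nullable, children } => {
            for child in children {
                flatten(child, ancestor_nullable || *nullable, out);
            }
        }
    }
}
```

For example, a non-nullable `path` leaf inside a nullable `add` struct flattens to a nullable leaf, which is why the visitor getters are all nullable.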

Collaborator Author

should we just use DeletionVectorDescriptor::to_schema instead

I'd like to, but we hard code the number of fields in the visitor getters. So any change to the DV schema would break the visitor.

The schema stuff makes sense! I'll keep the top level fields nullable and patch up the nested ones.

Collaborator Author

Do the nullability semantics work atm?
I tried the following schema:

    static CDF_SCAN_ROW_SCHEMA: LazyLock<Arc<StructType>> = LazyLock::new(|| {
        let deletion_vector = StructType::new([
            StructField::new("storageType", DataType::STRING, false),
            StructField::new("pathOrInlineDv", DataType::STRING, false),
            StructField::new("offset", DataType::INTEGER, true),
            StructField::new("sizeInBytes", DataType::INTEGER, false),
            StructField::new("cardinality", DataType::LONG, false),
        ]);
        let partition_values = MapType::new(DataType::STRING, DataType::STRING, true);
        let file_constant_values =
            StructType::new([StructField::new("partitionValues", partition_values, true)]);

        let add = StructType::new([
            StructField::new("path", DataType::STRING, false),
            StructField::new("deletionVector", deletion_vector.clone(), true),
            StructField::new("fileConstantValues", file_constant_values.clone(), false),
        ]);
        let remove = StructType::new([
            StructField::new("path", DataType::STRING, false),
            StructField::new("deletionVector", deletion_vector, true),
            StructField::new("fileConstantValues", file_constant_values.clone(), false),
        ]);
        let cdc = StructType::new([
            StructField::new("path", DataType::STRING, false),
            StructField::new("fileConstantValues", file_constant_values, false),
        ]);

        Arc::new(StructType::new([
            StructField::new("add", add, true),
            StructField::new("remove", remove, true),
            StructField::new("cdc", cdc, true),
            StructField::new("timestamp", DataType::LONG, false),
            StructField::new("commit_version", DataType::LONG, false),
        ]))
    });

The top level is nullable, but I seem to get an issue with arrow on the path field of all things!

 Arrow(InvalidArgumentError("Found unmasked nulls for non-nullable StructArray field \"path\"")

My guess is that it's treating path as non-nullable even when the top level is stated to be nullable.

Collaborator Author

Also given that partitionValues is technically nullable for removes, I'm constructing a default map if I don't find anything. Is this necessary/sensible?

let partition_values = partition_values.unwrap_or_else(Default::default);
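For what it's worth, that line behaves like `unwrap_or_default()`; a minimal illustration (hypothetical helper, not the kernel's code):

```rust
use std::collections::HashMap;

// A missing partitionValues map is normalized to an empty map rather
// than surfaced as an error. Illustrative helper only.
fn normalize(pv: Option<HashMap<String, String>>) -> HashMap<String, String> {
    pv.unwrap_or_default()
}
```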

Collaborator

Found unmasked nulls for non-nullable StructArray field "path"

That error indicates that add was non-null but path was null, which is definitely not a good thing. Do you have a backtrace? I wonder if this tickled another latent bug in our arrow data code, similar to the one that @nicklan hit a while back?

Collaborator Author

Hmm I'll try to track this down and report back 🫡

Collaborator Author

Okay I've narrowed down the point in the code where we fail for this. It's actually in the expression transformation, and not the visitor!

The specific line is here. I don't see anything that handles nested nullability, but maybe arrow is supposed to do that on its own?

I think I've got a reasonable MRE (minimal reproducible example). I'll post an issue.

Collaborator Author

I submitted an issue. Would love for you to sanity check it @scovich:
#578

Collaborator Author

I'll punt on this for now and keep everything nullable. This is what we do in the normal read path, so I think it should be okay for now.

Collaborator

attn @nick

@zachschuermann zachschuermann left a comment

Can we add a scan_data_to_scan_file test (just something through that API)? Otherwise LGTM!

@zachschuermann
Collaborator

(and we can take some of the todo's above from my other reviews as follow-up or just quickly in this PR, either way)

@zachschuermann
Collaborator

and this shouldn't be a breaking change right?

@zachschuermann zachschuermann removed the breaking-change Change that will require a version bump label Dec 6, 2024
@scovich scovich left a comment

Late find while reviewing #568

let scan_data = scan_data?;
let callback: CDFScanCallback<Vec<CDFScanFile>> =
|context, scan_file| context.push(scan_file);
let result = visit_cdf_scan_files(
Collaborator

I don't understand why this method can't take scan_data directly, instead of breaking it up and passing (almost) all of its fields as individual args. Then we could directly probe the remove_dvs while visiting adds, CDFScanFile could hold dv_info for both add and remove, and the next PR would have a much easier time resolving DVs (no need to pass the hashmap to it).

Is it attempting to preserve similarity to the analogous visit_scan_files or something?

Collaborator Author

Good point, I was preserving the similarity with visit_scan_files. IIRC this part of the code has FFI-related reasons for being the way that it is 🤔 I'll look into it, but I'll ping @nicklan in case he has input

Collaborator Author

I ended up switching it, and it simplified things nicely. I guess we can deal with engine support later, given that we aren't targeting that right now.

@OussamaSaoudi-db
Collaborator Author

@zachschuermann @scovich I updated the test to use the log replay. PTAL

Comment on lines +202 to +206
let deletion_vector = StructType::new([
StructField::new("storageType", DataType::STRING, true),
StructField::new("pathOrInlineDv", DataType::STRING, true),
StructField::new("offset", DataType::INTEGER, true),
StructField::new("sizeInBytes", DataType::INTEGER, true),
StructField::new("cardinality", DataType::LONG, true),
]);
Collaborator

attn @nick

@zachschuermann zachschuermann left a comment

LGTM couple questions/nits!

use crate::{DeltaResult, Error, RowVisitor};

// The type of action associated with a [`CdfScanFile`].
#[allow(unused)]
Collaborator

let's not forget to remove all the #[allow(unused)] when we actually use it :)

Collaborator Author

Yeah I think I'll leave this to the final execute pr.

Comment on lines 52 to 53
/// Transforms an iterator of TableChangesScanData into an iterator of
/// `UnresolvedCdfScanFile` by visiting the engine data.
Collaborator

Suggested change
/// Transforms an iterator of TableChangesScanData into an iterator of
/// `UnresolvedCdfScanFile` by visiting the engine data.
/// Transforms an iterator of [`TableChangesScanData`] into an iterator of
/// [`CdfScanFile`] by visiting the engine data.

struct CdfScanFileVisitor<'a, T> {
callback: CdfScanCallback<T>,
selection_vector: &'a [bool],
remove_dvs: &'a Arc<HashMap<String, DvInfo>>,
Collaborator

why ref arc? should this just be arc?

Collaborator Author

I didn't see a reason to do a full Arc clone in the loop. I changed this to instead hold &'a HashMap<String, DvInfo>, and we use Arc::as_ref to borrow it.
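The change described here can be sketched roughly as follows (DvInfo is stood in by String; all names illustrative):

```rust
use std::collections::HashMap;
use std::sync::Arc;

// Borrow the map out of the Arc once via `as_ref`, instead of cloning
// the Arc for every batch: the visitor only needs shared read access
// for the lifetime of the loop.
struct CdfScanFileVisitor<'a> {
    remove_dvs: &'a HashMap<String, String>,
}

fn make_visitor(shared: &Arc<HashMap<String, String>>) -> CdfScanFileVisitor<'_> {
    CdfScanFileVisitor { remove_dvs: shared.as_ref() }
}
```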

@OussamaSaoudi-db OussamaSaoudi-db merged commit ed714c5 into delta-io:main Dec 9, 2024
20 checks passed