TableChangesScan::execute and end to end testing for CDF #580
Conversation
dacfc34 to 2ecf506 (compare)
TODO: rename table-with-cdf-and-dv to cdf-table-with-dv
flushing comments, i'll let you rebase now that the other PR landed
last_modified: 0,
size: 0,
hm, perhaps we should store these in CdfScanFile? make a follow-up if yes?
We discussed this before and the conclusion was that these may someday be useful, but immediately they don't contribute much.
iirc, last_modified may be handy because it could be used to check cache staleness in the engine in case it saves a file in memory without re-reading it for each scan.
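For illustration, a minimal standalone sketch of that engine-side staleness check (all names hypothetical, not kernel API):

use std::collections::HashMap;

struct CachedFile {
    last_modified: i64, // modification time recorded when the file was cached
    bytes: Vec<u8>,
}

struct FileCache {
    entries: HashMap<String, CachedFile>,
}

impl FileCache {
    /// Reuse cached bytes only if the scan reports the same `last_modified`;
    /// a mismatch means the file changed on storage and must be re-read.
    fn get_if_fresh(&self, path: &str, last_modified: i64) -> Option<&[u8]> {
        self.entries
            .get(path)
            .filter(|cached| cached.last_modified == last_modified)
            .map(|cached| cached.bytes.as_slice())
    }
}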
right: so I think if the decision is to keep them in FileMeta, then we should have a follow-up to actually use them here (and enable further optimizations downstream; without this, no downstream optimization can make use of them)
kernel/src/table_changes/scan.rs
Outdated
let physical_to_logical_expr =
    physical_to_logical_expr(&scan_file, global_state.logical_schema.as_ref(), all_fields)?;
let read_schema = scan_file_read_schema(&scan_file, global_state.read_schema.as_ref());
is this physical_schema? maybe stick to physical/logical naming?
Ya sure 👍 In that case, we should probably do some renaming in global scan state and scan/mod.rs at some point.
Address this further in this PR too: https://github.com/delta-io/delta-kernel-rs/pull/588/files
cool thanks!
kernel/src/table_changes/scan.rs
Outdated
let read_result_iter = engine
    .get_parquet_handler()
    .read_parquet_files(&[file], read_schema, predicate)?;
(don't have to solve now) I wonder if we should introduce this higher in the iterator pipeline we've built up here?
Oversimplification, but instead of something like:
scan_files
    .map(|scan_file| read_parquet_files([scan_file])) // parquet batches
    .map(|batch| do_stuff) // return final batches

could push it up to be:

read_parquet_files(scan_files) // parquet batches
    .map(|batch| do_stuff) // return final batches
unsure how much state you'd need to track and associate with each batch (if it makes things too complicated)
Hmm, we'd have to somehow associate the scan_file and selection_vector to the batch, because it's important for the P2L transformation. So the type of the proposed read_parquet_files would have to be something like Iterator<(EngineData, ResolvedCdfScanFile)>, and do_stuff: (EngineData, ResolvedCdfScanFile) -> ScanResult.
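A standalone sketch of that pairing constraint (the types here are stand-ins; the real ones would be EngineData and ResolvedCdfScanFile):

struct ScanFileState(&'static str);
struct Batch(Vec<i32>);

// Batched variant: one call over all scan files, yielding (batch, state)
// pairs so per-file information survives into the downstream `do_stuff` map.
fn read_files_batched(files: Vec<ScanFileState>) -> impl Iterator<Item = (Batch, ScanFileState)> {
    files.into_iter().map(|state| (Batch(vec![1, 2, 3]), state))
}

fn main() {
    let results: Vec<String> = read_files_batched(vec![ScanFileState("a"), ScanFileState("b")])
        // `do_stuff` consumes both pieces, as described above.
        .map(|(batch, state)| format!("{}: {} rows", state.0, batch.0.len()))
        .collect();
    assert_eq!(results, vec!["a: 3 rows", "b: 3 rows"]);
}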
oh i posted another comment about this, but maybe just make an issue to invest a bit more time into this
kernel/src/scan/mod.rs
Outdated
@@ -160,7 +160,7 @@ impl ScanResult {
 /// store the name of the column, as that's all that's needed during the actual query. For
 /// `Partition` we store an index into the logical schema for this query since later we need the
 /// data type as well to materialize the partition column.
-#[derive(PartialEq, Debug)]
+#[derive(Clone, PartialEq, Debug)]
Clone needed because now we process the selected column types separately for each commit, where the normal log replay scan does it all in one shot?
Ah good point. The reason is really that I don't want to hold a reference to all_fields in TableChangesScan because the iterator needs to be free from lifetimes. But really all I need is to Arc it. Now we don't need to clone the Vec.
Note that the existing Scan::execute still borrows self:

pub fn execute(
    &self,
    engine: Arc<dyn Engine>,
) -> DeltaResult<impl Iterator<Item = DeltaResult<ScanResult>> + '_> {

as you can see with that anonymous lifetime '_. I was planning on opening a follow-up PR to take that out too.
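A sketch of that change (ColumnType and the struct body are simplified stand-ins for the kernel's types):

use std::sync::Arc;

#[derive(Clone, Debug)]
enum ColumnType {
    Selected(String),
    Partition(usize),
}

struct TableChangesScan {
    all_fields: Arc<Vec<ColumnType>>,
}

impl TableChangesScan {
    // The closure owns an Arc handle instead of borrowing `self`, so the
    // returned iterator carries no lifetime tied to the scan.
    fn execute(&self) -> impl Iterator<Item = ColumnType> + 'static {
        let all_fields = Arc::clone(&self.all_fields); // refcount bump, no Vec copy
        (0..all_fields.len()).map(move |i| all_fields[i].clone())
    }
}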
@@ -29,8 +33,6 @@ pub struct TableChangesScan {
     predicate: Option<ExpressionRef>,
     // The [`ColumnType`] of all the fields in the `logical_schema`
     all_fields: Vec<ColumnType>,
-    // `true` if any column in the `logical_schema` is a partition column
-    have_partition_cols: bool,
What is special about CDF that we don't need to care about partition columns?
Or did you find a better approach that we should consider backporting to the normal scan as well?
have_partition_cols was useful in scan to skip building the physical to logical expression (source). For CDF, we always have to construct an expression because it changes based on the scan file.
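To make the per-file dependence concrete, an illustrative sketch, assuming the main driver is the `_change_type` column (the enum and function are stand-ins, not the kernel's actual code):

enum CdfScanFileType {
    Add,
    Remove,
    Cdc,
}

// Which `_change_type` literal (if any) the physical-to-logical expression
// would inject for a given scan file.
fn change_type_literal(file_type: &CdfScanFileType) -> Option<&'static str> {
    match file_type {
        CdfScanFileType::Add => Some("insert"),    // added data files imply inserts
        CdfScanFileType::Remove => Some("delete"), // removed data files imply deletes
        // cdc files carry `_change_type` physically, so no literal is injected
        CdfScanFileType::Cdc => None,
    }
}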
let table_root = self.table_changes.table_root().clone();
let all_fields = self.all_fields.clone();
let predicate = self.predicate.clone();
let dv_engine_ref = engine.clone();
interesting... we need a separate arc for each of the two map calls, because they co-exist in time and they both outlive this function call.
Yeah I was surprised to see that too. I suppose an arc clone is cheap enough given all the processing we're doing.
Yeah, I don't think there's anything wrong -- it was just the first time I'd seen that and was trying to grok it
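A minimal standalone reproduction of the pattern, for anyone else grokking it: two `move` closures in one pipeline each need their own owned Arc handle, since both live on inside the returned iterator, past this function's frame (names mirror the snippet above but are otherwise illustrative):

use std::sync::Arc;

fn build_pipeline(engine: Arc<String>) -> impl Iterator<Item = usize> {
    let dv_engine_ref = Arc::clone(&engine); // separate handle for the first closure
    (0..3usize)
        .map(move |i| {
            let _ = dv_engine_ref.len(); // first closure owns `dv_engine_ref`
            i
        })
        .map(move |i| {
            let _ = engine.len(); // second closure owns `engine` itself
            i
        })
}

fn main() {
    // Both closures outlive build_pipeline's frame inside the returned iterator.
    assert_eq!(build_pipeline(Arc::new("engine".into())).count(), 3);
}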
kernel/tests/cdf.rs
Outdated
if let Some(mask) = mask {
    Ok(filter_record_batch(&record_batch, &mask.into())?)
} else {
    Ok(record_batch)
}
nit: I'm always torn which syntax to use when each case would be a single line either way:
if let Some(mask) = mask {
    Ok(filter_record_batch(&record_batch, &mask.into())?)
} else {
    Ok(record_batch)
}

vs.

match mask {
    Some(mask) => Ok(filter_record_batch(&record_batch, &mask.into())?),
    None => Ok(record_batch),
}
I think I'd lean towards the match since it feels clearer that there are only two cases.
Codecov Report
Attention: Patch coverage is …

Additional details and impacted files

@@            Coverage Diff             @@
##             main     #580      +/-   ##
==========================================
+ Coverage   83.21%   83.44%    +0.23%
==========================================
  Files          74       74
  Lines       16775    16861       +86
  Branches    16775    16861       +86
==========================================
+ Hits        13959    14070      +111
+ Misses       2168     2130       -38
- Partials      648      661       +13

☔ View full report in Codecov by Sentry.
LGTM woooo!!! but don't think tests are exhaustive (probably need to test some of the filtering/projection/etc. and i'll need to think through more edge cases) but let's merge and take on some more testing soon!
    location,
};
let read_result_iter = engine.get_parquet_handler().read_parquet_files(
    &[file],
regarding that other conversation maybe we just make a follow up to consider pulling read_parquet_files higher in the iterator stack to do better than single-file calls?
/// unpack the test data from {test_parent_dir}/{test_name}.tar.zst into a temp dir, and return the dir it was
/// unpacked into
#[allow(unused)]
pub(crate) fn load_test_data(
nice thanks!
kernel/tests/cdf.rs
Outdated
    Ok(batches)
}

fn assert_batches_sorted_eq(expected_lines: &[impl ToString], batches: &[RecordBatch]) {
is this duplicated? can we share code or just create follow up to clean this up soon?
Ah good catch. Moved it to common and updated tests. We're still passing 😎
What changes are proposed in this pull request?
This PR introduces the execute method to TableChangesScan, which performs a CDF scan and returns ScanResults for the change data feed. The ScanResult holds engine data and a selection vector to apply to the rows of the data.
A helper method read_scan_file is added to read the rows of a scan_file and perform the physical to logical transformation.
The macros sort_lines and assert_batches_sorted_eq are moved to the common crate so they can be shared between the read.rs and new cdf.rs integration tests.
The cdf-table and cdf-table-non-partitioned test tables are from the delta-rs project.
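As a standalone sketch of the ScanResult contract described here (names hypothetical), assuming the convention that a missing selection vector, or rows past its end, count as selected:

struct ScanResultSketch {
    rows: Vec<&'static str>, // stand-in for the engine data batch
    selection_vector: Option<Vec<bool>>,
}

fn selected_rows(result: &ScanResultSketch) -> Vec<&'static str> {
    result
        .rows
        .iter()
        .enumerate()
        .filter(|(i, _)| {
            result
                .selection_vector
                .as_ref()
                // no vector, or an index past its end, means "row selected"
                .map_or(true, |sv| sv.get(*i).copied().unwrap_or(true))
        })
        .map(|(_, row)| *row)
        .collect()
}

fn main() {
    let result = ScanResultSketch {
        rows: vec!["insert", "delete", "update_postimage"],
        selection_vector: Some(vec![true, false]), // third row falls past the end
    };
    assert_eq!(selected_rows(&result), vec!["insert", "update_postimage"]);
}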
How was this change tested?
This tests data reading for 3 tables with the following features: