test: Port cdf tests from delta-spark to kernel #611
Codecov Report
Attention: Patch coverage is
Additional details and impacted files

@@            Coverage Diff             @@
##             main     #611      +/-   ##
==========================================
+ Coverage   83.67%   83.70%   +0.02%
==========================================
  Files          75       75
  Lines       16950    16951       +1
  Branches    16950    16951       +1
==========================================
+ Hits        14183    14188       +5
  Misses       2100     2100
+ Partials      667      663       -4
looks great thanks! One small thing, could you recreate the archives with `--no-xattrs` passed to tar? Otherwise if you look at these zstd files on a non-macos machine you get lots of:
tar: Ignoring unknown extended header keyword 'LIBARCHIVE.xattr.com.apple.provenance'
`tar -c --no-xattrs [etc]` should do the trick
#[test]
fn simple_cdf_version_ranges() -> DeltaResult<()> {
    let batches = read_cdf_for_table("cdf-table-simple", 0, 0, None)?;
    let mut expected = vec![
I tend to prefer putting the expected output along with the data (like we do with `dat`) rather than spreading it out like this. I think this is okay for now though, and maybe we can make an issue to port each test by recreating the archive with an expected data parquet.
done: #626
this reminds me: we should make an issue (or perhaps we already have one?) to fix all the tests and move away from string matching and instead compare underlying (sorted) data/metadata
cc @nicklan
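A minimal sketch of what comparing sorted data instead of matching strings could look like. The `CdfRow` struct, its field names, and the `rows_match` helper are hypothetical and chosen only to mirror the row shape in the delta-spark samples; they are not kernel types or an existing test utility.

```rust
/// Illustrative row type (an assumption, not a kernel type). The field order
/// follows the delta-spark `Row(...)` samples: id, text, part, change type,
/// and what is assumed to be the commit version.
#[derive(Debug, Clone, PartialEq, Eq, PartialOrd, Ord)]
struct CdfRow {
    id: i64,
    text: String,
    part: i64,
    change_type: String,
    commit_version: i64,
}

/// Hypothetical convenience constructor to keep test data compact.
fn row(id: i64, text: &str, part: i64, change_type: &str, commit_version: i64) -> CdfRow {
    CdfRow {
        id,
        text: text.to_string(),
        part,
        change_type: change_type.to_string(),
        commit_version,
    }
}

/// Sort both sides, then compare the rows themselves. Row order and any
/// table-rendering details no longer affect the result.
fn rows_match(mut actual: Vec<CdfRow>, mut expected: Vec<CdfRow>) -> bool {
    actual.sort();
    expected.sort();
    actual == expected
}
```

In a test, the `actual` rows would be extracted from the batches that were read, and the `expected` rows would come from the data stored alongside the table.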
@nicklan rebuilt with `--no-xattrs`. Seems to have caused a diff on github, but I'm still seeing
did all the expected output come from delta-spark? if yes, then LGTM
// Note: `update_pre` and `update_post` are technically not part of the delta spec, and instead
// should be `update_preimage` and `update_postimage` respectively. However, the tests in
// delta-spark use the post and pre.
how are we observing `update_pre` and `update_post` here then? aren't we reading the CDF and then filling in our own `update_preimage` etc.?
All update `_change_type`s come directly from a cdc file. We don't insert or modify them. In this test, delta-spark wrote update_pre and update_post directly into the cdc file.
@zachschuermann regarding creating an issue, I already put one up for CDF here: #626, but I can expand its scope to remove all string matching from our testing.
@zachschuermann My methodology is as follows: I used their expected results that looked like this (sample below), constructed the tables by hand, and only then did I verify it against our implementation.
checkCDCAnswer(
  log,
  CDCReader.changesToBatchDF(log, 0, 2, spark).filter("_change_type = 'insert'"),
  Range(0, 6).map { i => Row(i, "old", i % 2, "insert", 0) })
checkCDCAnswer(
  log,
  CDCReader.changesToBatchDF(log, 0, 2, spark).filter("_change_type = 'delete'"),
  Seq(0, 2, 3, 4).map { i => Row(i, "old", i % 2, "delete", if (i % 2 == 0) 2 else 1) })
// ...
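For reference, the two expected-row expressions in the Scala sample can be sketched in Rust as below. The `Row` tuple alias and the function names are hypothetical; the tuple positions simply follow the order of the `Row(...)` arguments in the sample, where the fourth value is the `_change_type` and the fifth is assumed to be the commit version.

```rust
/// Illustrative tuple alias (an assumption): values in the same order as the
/// arguments of `Row(...)` in the delta-spark sample.
type Row = (i64, &'static str, i64, &'static str, i64);

/// Mirrors `Range(0, 6).map { i => Row(i, "old", i % 2, "insert", 0) }`.
/// Scala's `Range(0, 6)` excludes 6, like Rust's `0..6`.
fn expected_inserts() -> Vec<Row> {
    (0..6).map(|i| (i, "old", i % 2, "insert", 0)).collect()
}

/// Mirrors `Seq(0, 2, 3, 4).map { i => Row(i, "old", i % 2, "delete", ...) }`,
/// where the last value is 2 for even `i` and 1 for odd `i`.
fn expected_deletes() -> Vec<Row> {
    [0i64, 2, 3, 4]
        .iter()
        .map(|&i| (i, "old", i % 2, "delete", if i % 2 == 0 { 2 } else { 1 }))
        .collect()
}
```

Writing the expectations as generated rows keeps them close to the delta-spark originals, which makes it easier to check by eye that the hand-built tables match.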
What changes are proposed in this pull request?
This PR adds several CDF tests from delta-spark. We check the following:
Table-changes construction is also changed so that the CDF version range is validated before any snapshots are created. This makes the error message clearer when the start version is beyond the end of the table.
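The ordering described above can be illustrated with a small sketch. Everything here is hypothetical: `CdfError`, `validate_cdf_range`, and the idea that the latest table version is already known as a plain `u64` are assumptions made for illustration, not the kernel's actual types or control flow.

```rust
/// Hypothetical error type for an invalid CDF version range.
#[derive(Debug, PartialEq)]
enum CdfError {
    /// The requested start version is past the last version of the table.
    StartBeyondTableEnd { start: u64, latest: u64 },
    /// The requested start version is after the requested end version.
    StartAfterEnd { start: u64, end: u64 },
}

/// Validate the requested range up front and return the resolved
/// `(start, end)` pair. `latest` is assumed to be the table's latest version;
/// a missing `end` is assumed to mean "up to the latest version".
fn validate_cdf_range(start: u64, end: Option<u64>, latest: u64) -> Result<(u64, u64), CdfError> {
    if start > latest {
        return Err(CdfError::StartBeyondTableEnd { start, latest });
    }
    let end = end.unwrap_or(latest);
    if start > end {
        return Err(CdfError::StartAfterEnd { start, end });
    }
    // Only after this point would the (more expensive) snapshots be created.
    Ok((start, end))
}
```

Failing here, before any snapshot work, means the caller sees an error about the version range itself rather than a less specific failure from snapshot construction.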