[DRAFT] Create a POC writer test for discussion purposes #26

Status: Open. MrPowers wants to merge 2 commits into master.

Conversation

MrPowers (Contributor):

This is a POC writer test for discussion purposes.

I want to start brainstorming with the team about how we could structure a write test.

@wjones127 (Collaborator) left a review:

Thanks for drafting this. I used this to write down my thoughts on the different parts of the test.

Two earlier review threads on tests/pyspark_delta/test_pyspark_writer.py are marked resolved (outdated).

Comment on lines 35 to 38 of tests/pyspark_delta/test_pyspark_writer.py:
# append data to the reference table
rdd = spark.sparkContext.parallelize([("z", 9, 9.9)])
df = rdd.toDF(["letter", "number", "a_float"])
df.write.format("delta").mode("append").save("tmp/delta")
Collaborator:

This will be one of the main challenges: how can we describe these operations in an engine-agnostic way? There are two possible levels to do this at:

  • A machine-readable description that could be automatically translated. This would be nice, but I'm not confident it would be feasible.
  • A human-readable description that would have to be hand-written by implementors. I think this would be sufficient, since the semantics of it are going to be checked by the expected result (a hypothetical example follows below).
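
To make that concrete, a hypothetical human-readable description of the append step in this PR (not part of the PR itself, just an illustration) might read:

Operation: append
Given: the reference table at tmp/delta with columns (letter, number, a_float).
Do: append a single row ("z", 9, 9.9) using the engine's native append API, without modifying existing data.
Expect: the table contains all prior rows plus the new row.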

MrPowers (author):

Yep, I vote for the "human-readable" approach as the first pass, and then we can see whether something more clever is possible down the road once we know the problem a bit better.

Collaborator:

+1 for human-readable as a first pass.


Later excerpt from tests/pyspark_delta/test_pyspark_writer.py:

# make assertions & cleanup
actual_df = spark.read.format("delta").load("tmp/delta")
chispa.assert_df_equality(actual_df, expected_df, ignore_row_order=True)
Collaborator:

Besides asserting that the result is right, we also need to validate the log. First, there are probably general invariants we want to check: ones that aren't specific to this operation, but that should always be true after every write. A few examples:

  1. Only one new log entry was added.
  2. All log entries prior are still there and unchanged.
  3. Old data files referenced by the log are still there and unchanged.
  4. The new log entry is valid according to the protocol. (Perhaps we can generate a JSON schema for the delta log entries and use that as a first pass?)

Beyond that, we might also want operation-specific validation. For example, an "append" operation should have dataChange=True on all add actions and shouldn't contain any remove actions.
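
As a rough sketch (not part of this PR), invariants 1 and 2 plus the append-specific condition could be checked with two hypothetical helpers, snapshot_log and check_append_commit; invariant 3 could be handled the same way by also snapshotting the data files:

import json
from pathlib import Path

def snapshot_log(table_path):
    # Capture the raw bytes of every file under _delta_log so it can be diffed later.
    log_dir = Path(table_path) / "_delta_log"
    return {p.name: p.read_bytes() for p in log_dir.iterdir() if p.is_file()}

def check_append_commit(before, after):
    # Invariant 1: exactly one new log entry was added.
    new_entries = sorted(set(after) - set(before))
    assert len(new_entries) == 1, f"expected one new log entry, got {new_entries}"
    # Invariant 2: all prior log entries are still there and unchanged.
    for name, content in before.items():
        assert after.get(name) == content, f"log entry {name} was modified or removed"
    # Append-specific: no remove actions, and every add action has dataChange=True.
    commit_lines = after[new_entries[0]].decode("utf-8").splitlines()
    actions = [json.loads(line) for line in commit_lines if line.strip()]
    assert not any("remove" in a for a in actions), "append should not remove files"
    for action in actions:
        if "add" in action:
            assert action["add"]["dataChange"] is True

In the test above, this would look something like before = snapshot_log("tmp/delta") just before the append, followed by check_append_commit(before, snapshot_log("tmp/delta")) after it.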

MrPowers (author):

Yea, let's brainstorm how to write this code. Parsing the transaction log from scratch in Python seems a little tedious. Perhaps I could use delta-rs for this?
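
For context, a minimal sketch of what that approach might look like, assuming the deltalake Python bindings for delta-rs (not part of this PR):

from deltalake import DeltaTable

dt = DeltaTable("tmp/delta")
print(dt.version())   # latest version of the table
print(dt.files())     # data files in the current snapshot
print(dt.history())   # commit history, including operation metadata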

Collaborator:

Wouldn't using delta-rs in this way be circular? We could just compare the JSON files directly... some more thought needed for the checkpoint files.

MrPowers (author):

@nkarpov - yea, using delta-rs in that way would definitely be circular, especially for testing the delta-rs connector. Less so for testing the Flink, Hive, etc. connectors.

So I guess we just make a Python reader for parsing all the intricacies of the transaction log. That seems kind of complicated as well.

I really don't know the best path forward here.
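
For reference, a hand-rolled reader for just the JSON commit files might not be too bad; a rough sketch (not from the PR, and deliberately ignoring checkpoint parquet files, which are the harder part):

import json
from pathlib import Path

def read_log_actions(table_path):
    # Walk the JSON commit files in version order and yield (version, action) pairs.
    # Checkpoint parquet files and _last_checkpoint are not handled here.
    log_dir = Path(table_path) / "_delta_log"
    for commit in sorted(log_dir.glob("*.json")):
        version = int(commit.stem)
        for line in commit.read_text().splitlines():
            if line.strip():
                yield version, json.loads(line)

# Example: collect the paths of all files added, according to the JSON commits.
added_paths = [a["add"]["path"] for _, a in read_log_actions("tmp/delta") if "add" in a]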

@MrPowers requested a review from @wjones127 on November 18, 2022 at 19:25.
@wjones127 (Collaborator):

This could be a follow-up, but we should also consider how to represent writes that should fail. There will be a lot of test cases where we want to make sure the writers are enforcing rules and stopping writes when they are supposed to.
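
A hedged sketch of one way that could look in the pyspark tests, assuming the same spark session fixture used above and that the reference table already exists at tmp/delta (schema enforcement on append is just an illustrative failure case):

import pytest
from pyspark.sql.utils import AnalysisException

def test_append_with_mismatched_schema_fails(spark):
    # A dataframe whose schema doesn't match the table should be rejected on append.
    bad_df = spark.createDataFrame([("z", "not-a-number")], ["letter", "number"])
    with pytest.raises(AnalysisException):
        bad_df.write.format("delta").mode("append").save("tmp/delta")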
