[DRAFT] Create a POC writer test for discussion purposes #26

Status: Open. MrPowers wants to merge 2 commits into master.

Conversation

MrPowers (Contributor):

This is a POC writer test for discussion purposes.

I want to start brainstorming with the team about how we could structure a write test.

@wjones127 (Collaborator) left a review:

Thanks for drafting this. I used this to write down my thoughts on the different parts of the test.

Two earlier review threads on tests/pyspark_delta/test_pyspark_writer.py are marked resolved (outdated).

Comment on lines 35 to 38 of tests/pyspark_delta/test_pyspark_writer.py:
# append data to the reference table
rdd = spark.sparkContext.parallelize([("z", 9, 9.9)])
df = rdd.toDF(["letter", "number", "a_float"])
df.write.format("delta").mode("append").save("tmp/delta")
Collaborator:

This will be one of the main challenges: how can we describe these operations in an engine-agnostic way? There are two possible levels to do this at:

  • A machine-readable description that could be automatically translated. This would be nice, but I'm not confident it would be feasible.
  • A human-readable description that would have to be hand-written by implementors. I think this would be sufficient, since the semantics of it are going to be checked by the expected result (a hypothetical example follows below).
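
To make that concrete, a hypothetical human-readable description of the append step in this PR (not part of the PR itself, just an illustration) might read:

Operation: append
Given: the reference table at tmp/delta with columns (letter, number, a_float).
Do: append a single row ("z", 9, 9.9) using the engine's native append API, without modifying existing data.
Expect: the table contains all prior rows plus the new row.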

MrPowers (author):

Yep, I vote for the "human-readable" approach as the first pass, and then we can see whether something more clever is possible down the road once we know the problem a bit better.

Collaborator:

+1 for human-readable as a first pass.


Later excerpt from tests/pyspark_delta/test_pyspark_writer.py:

# make assertions & cleanup
actual_df = spark.read.format("delta").load("tmp/delta")
chispa.assert_df_equality(actual_df, expected_df, ignore_row_order=True)
Collaborator:

Besides asserting that the result is right, we also need to validate the log. First, there are probably general invariants we want to check: ones that aren't specific to this operation, but that should always be true after every write. A few examples:

  1. Only one new log entry was added.
  2. All log entries prior are still there and unchanged.
  3. Old data files referenced by the log are still there and unchanged.
  4. The new log entry is valid according to the protocol. (Perhaps we can generate a JSON schema for the delta log entries and use that as a first pass?)

Beyond that, we might also want operation-specific validation. For example, an "append" operation should have dataChange=True on all add actions and shouldn't contain any remove actions.
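
As a rough sketch (not part of this PR), invariants 1 and 2 plus the append-specific condition could be checked with two hypothetical helpers, snapshot_log and check_append_commit; invariant 3 could be handled the same way by also snapshotting the data files:

import json
from pathlib import Path

def snapshot_log(table_path):
    # Capture the raw bytes of every file under _delta_log so it can be diffed later.
    log_dir = Path(table_path) / "_delta_log"
    return {p.name: p.read_bytes() for p in log_dir.iterdir() if p.is_file()}

def check_append_commit(before, after):
    # Invariant 1: exactly one new log entry was added.
    new_entries = sorted(set(after) - set(before))
    assert len(new_entries) == 1, f"expected one new log entry, got {new_entries}"
    # Invariant 2: all prior log entries are still there and unchanged.
    for name, content in before.items():
        assert after.get(name) == content, f"log entry {name} was modified or removed"
    # Append-specific: no remove actions, and every add action has dataChange=True.
    commit_lines = after[new_entries[0]].decode("utf-8").splitlines()
    actions = [json.loads(line) for line in commit_lines if line.strip()]
    assert not any("remove" in a for a in actions), "append should not remove files"
    for action in actions:
        if "add" in action:
            assert action["add"]["dataChange"] is True

In the test above, this would look something like before = snapshot_log("tmp/delta") just before the append, followed by check_append_commit(before, snapshot_log("tmp/delta")) after it.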

MrPowers (author):

Yea, let's brainstorm how to write this code. Parsing the transaction log from scratch in Python seems a little tedious. Perhaps I could use delta-rs for this?
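
For context, a minimal sketch of what that approach might look like, assuming the deltalake Python bindings for delta-rs (not part of this PR):

from deltalake import DeltaTable

dt = DeltaTable("tmp/delta")
print(dt.version())   # latest version of the table
print(dt.files())     # data files in the current snapshot
print(dt.history())   # commit history, including operation metadata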

Collaborator:

Wouldn't using delta-rs in this way be circular? We could just compare the JSON files directly... some more thought needed for the checkpoint files.

MrPowers (author):

@nkarpov - yea, using delta-rs in that way would definitely be circular, especially for testing the delta-rs connector. Less so for testing the Flink, Hive, etc. connectors.

So I guess we just make a Python reader for parsing all the intricacies of the transaction log. That seems kind of complicated as well.

I really don't know the best path forward here.
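
For reference, a hand-rolled reader for just the JSON commit files might not be too bad; a rough sketch (not from the PR, and deliberately ignoring checkpoint parquet files, which are the harder part):

import json
from pathlib import Path

def read_log_actions(table_path):
    # Walk the JSON commit files in version order and yield (version, action) pairs.
    # Checkpoint parquet files and _last_checkpoint are not handled here.
    log_dir = Path(table_path) / "_delta_log"
    for commit in sorted(log_dir.glob("*.json")):
        version = int(commit.stem)
        for line in commit.read_text().splitlines():
            if line.strip():
                yield version, json.loads(line)

# Example: collect the paths of all files added, according to the JSON commits.
added_paths = [a["add"]["path"] for _, a in read_log_actions("tmp/delta") if "add" in a]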

@MrPowers requested a review from @wjones127 on November 18, 2022 at 19:25.
@wjones127 (Collaborator):

This could be a follow-up, but we should also consider how to represent writes that should fail. There will be a lot of test cases where we want to make sure the writers are enforcing rules and stopping writes when they are supposed to.
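
A hedged sketch of one way that could look in the pyspark tests, assuming the same spark session fixture used above and that the reference table already exists at tmp/delta (schema enforcement on append is just an illustrative failure case):

import pytest
from pyspark.sql.utils import AnalysisException

def test_append_with_mismatched_schema_fails(spark):
    # A dataframe whose schema doesn't match the table should be rejected on append.
    bad_df = spark.createDataFrame([("z", "not-a-number")], ["letter", "number"])
    with pytest.raises(AnalysisException):
        bad_df.write.format("delta").mode("append").save("tmp/delta")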
