Prototype parquet-integration-testing integration tests #5956
Conversation
@@ -0,0 +1,285 @@
{
The basic idea would be to check in files like this for all the various files in parquet-testing, and each parquet implementation could write a driver that made the equivalent JSON and checked it against the expected output
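For illustration only, a minimal sketch (in Rust, with serde_json) of the kind of expected-output file being described. Only the `filename` and `rows` fields come from the file in this PR; the column names and values below are invented for illustration and are not the real contents of `alltypes_plain.parquet`:

```rust
// A minimal sketch, not the format in this PR: only "filename" and "rows"
// appear in the checked-in JSON; the column names and values are invented.
use serde_json::json;

fn main() {
    let expected = json!({
        "filename": "alltypes_plain.parquet",
        "rows": [
            // one object per row, keyed by column name (assumed layout)
            { "id": 0, "bool_col": true,  "string_col": "a" },
            { "id": 1, "bool_col": false, "string_col": "b" }
        ]
    });
    println!("{}", serde_json::to_string_pretty(&expected).unwrap());
}
```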
I don't know, I think the Arrow integration testing procedure is relatively fine for Arrow (even though it doesn't allow us to exercise everything, such as dictionary deltas or compression), but it won't scale for Parquet, which has a ton of different options.
Thanks @pitrou -- I agree there are many options for parquet, though I don't understand why that would prevent us from integration testing 🤔 While the "expected data" portion of such a test suite will be substantial, the actual code to make parquet-rs write this JSON format was quite small (and I expect it would be relatively straightforward for other implementations as well). And conveniently, we already have a reasonable test corpus of files with various features in parquet-testing.
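To give a sense of how small such a writer can be, here is a rough sketch using the parquet and arrow crates to dump a file's rows as JSON. This is not the code in this PR: the path is illustrative, and the `filename`/`rows` wrapper would still need to be layered on top of the plain row output:

```rust
// A sketch of a minimal "parquet -> JSON rows" writer using the Arrow reader
// from the parquet crate and the JSON writer from the arrow crate.
use std::fs::File;

use arrow::json::ArrayWriter;
use parquet::arrow::arrow_reader::ParquetRecordBatchReaderBuilder;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Illustrative path; a real driver would iterate over parquet-testing.
    let file = File::open("parquet-testing/data/alltypes_plain.parquet")?;
    let reader = ParquetRecordBatchReaderBuilder::try_new(file)?.build()?;

    // Serialize every record batch as one JSON array of row objects.
    let mut writer = ArrayWriter::new(Vec::new());
    for batch in reader {
        let batch = batch?;
        writer.write_batches(&[&batch])?;
    }
    writer.finish()?;

    println!("{}", String::from_utf8(writer.into_inner())?);
    Ok(())
}
```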
Well, I would have to see a JSON format that covers a substantial number of Parquet features before I can be convinced :-) I'll note that the proposed JSON format features a …
My alternative proposal would be a directory tree with pre-generated integration files, something like:
... which allows us to have many different scenarios without the scaling problem of having all implementations run within the same CI job. The textual …
Thank you @pitrou -- I filed apache/parquet-format#441 to discuss this idea further.
@@ -0,0 +1,285 @@
{
  "filename": "alltypes_plain.parquet",
  "rows": [
We probably need to formalize the expected output for each physical/logical type.
The other thing I was thinking is that it might pay to consider the trade-offs between a row-oriented format like this one and a column-oriented format that is closer to what is actually written in parquet (i.e. rep/def levels and values). Both might be useful in some situations. For instance, I've seen ill-formed parquet files in the wild because of inconsistent rep/def levels, so ensuring there are sanity checks at that level makes sense.
A row-based format would certainly help with the cases I've seen of non-conformant logical nested types like Lists/Maps.
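To make the trade-off concrete, here is a sketch of the two shapes being compared (row-oriented vs. column-oriented with rep/def levels). The field names and the level encoding are assumptions for illustration, not a proposal:

```rust
// Sketch of the two candidate shapes for a nullable list column "ids".
use serde_json::json;

fn main() {
    // Row-oriented: one object per reassembled row; good for catching
    // non-conformant logical nested types such as Lists/Maps.
    let row_oriented = json!({
        "rows": [ { "ids": [1, 2] }, { "ids": null } ]
    });

    // Column-oriented: closer to what is physically written -- per-column
    // definition/repetition levels plus the flat value stream -- which lets
    // a suite assert directly on inconsistent rep/def levels.
    // Levels below assume an optional 3-level list of required int32 values:
    // def 0 = list is null, def 2 = element present; rep 1 = same list.
    let column_oriented = json!({
        "columns": [ {
            "path": "ids.list.element",
            "def_levels": [2, 2, 0],
            "rep_levels": [0, 1, 0],
            "values": [1, 2]
        } ]
    });

    println!("{row_oriented}\n{column_oriented}");
}
```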
I agree that if we want to move forward with this approach we should spend time formalizing and documenting what the expected format means
THIS IS A WIP, to show what such an integration suite might look like
Which issue does this PR close?
To be filed
Rationale for this change
Modeled after the arrow-testing integration tests, for example: https://github.com/apache/arrow-testing/tree/master/data/arrow-ipc-stream/integration
Related:
What changes are included in this PR?
The high level proposal is that each implementation implements a driver that reads each file in parquet-testing, produces the equivalent JSON, and checks it against the expected output (a rough sketch of the check step is included below).
I imagine JSON files for both data and metadata
The exact format of the JSON files is totally TBD -- this PR just makes whatever is convenient for arrow-rs
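As a rough illustration of the check step such a driver might perform; all paths and the `produce_json` helper here are hypothetical, since the actual format and layout are still TBD:

```rust
// Sketch of the "check" half of a driver: produce JSON from the parquet file
// (e.g. as in the writer sketch above) and compare it against the checked-in
// expected file. Paths and produce_json are placeholders, not this PR's code.
use std::fs;

fn produce_json(parquet_path: &str) -> serde_json::Value {
    // Placeholder: a real driver would read the parquet file and serialize
    // its rows (and metadata) to the agreed JSON format.
    let _ = parquet_path;
    serde_json::json!({ "rows": [] })
}

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let parquet_path = "parquet-testing/data/alltypes_plain.parquet";
    let expected_path = "parquet-testing/data/alltypes_plain.json";

    let expected: serde_json::Value =
        serde_json::from_str(&fs::read_to_string(expected_path)?)?;
    let actual = produce_json(parquet_path);

    // Structural comparison of serde_json::Value is enough for a prototype;
    // a real suite would report which row/column differs.
    assert_eq!(expected, actual, "mismatch for {parquet_path}");
    Ok(())
}
```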
Are there any user-facing changes?