Parquet compatibility / integration testing #441
Comments
Right, and also there might be some closed-source Parquet implementations significant enough to participate in integration testing? Cueing in more informed people @julienledem @wesm @gszadovszky. There's also at least one GPU implementation in cuDF, which would complicate integration CI if all implementations had to be run together: https://docs.rapids.ai/api/cudf/stable/user_guide/10min/#reading-writing-parquet-files
Option 4:
I'd personally prefer option 2. I agree with @pitrou that a centralized CI system that can cross-validate all implementations will be very hard (and expensive) to realize. Self-reporting for the purposes of the compatibility matrix is easiest for all, and cheats will be found out soon enough 😄. Second best would be option 3, but I'm curious how often an implementation would be expected to provide files? The full set for each release, or just one for the earliest release that supports the feature? The former could become quite unwieldy given the quick release cycles of some implementations.
This is a good point. Does it apply to all options? If other options solve this by running some drivers manually/internally (outside of official CI), the same solution can apply here too.
I am not sure how Arrow does it. Could it be that this can be done better? Does
We need to specify what to output from drivers anyway, so that work must be done in any option with a driver. What I am avoiding with my proposal is the need for a way to describe row data; instead, I suggest we do that with parquet itself.
My guess is that both Snowflake and Databricks (for their fork of Spark) have made their own implementations, so those would be the two most mainstream platforms that we would want to try to get to participate in an integration testing matrix. Some open source projects (e.g. Apache Impala, though I'm not sure how many Impala users are out there anymore) have Parquet implementations within them, and it may be feasible to create a Docker-based setup to turn them into an integration test target (e.g. Ibis has a docker-compose configuration for running tests against Impala: https://github.com/ibis-project/ibis/blob/main/compose.yaml).
I am pretty sure DuckDB has their own parquet implementation too that would likely be good to get represented: https://github.com/duckdb/duckdb/blob/main/extension/parquet/parquet_reader.cpp
I agree that mandating new files for each release isn't reasonable, and we might leave this up to each implementation. For example, they might decide to upload new files if important changes were made in the writer implementation that would yield significantly different binary output.
Well, option 3 is based on files being uploaded to a specific repo (or directory tree), so there's no need for implementations to run alongside each other in the same CI job.
No, it's based on a Docker setup and it's purely stateless (apart from the optional storage of compilation caching data). I agree there are certainly ways to make things more optimized, but each optimization adds a layer of complexity and fragility, especially if it involves implementations that are maintained independently from each other, and by different teams. Also, it is useful for the worst case to remain reasonably short. For example, using a stateless Docker setup, anyone can replicate the integration locally.
Dremio also has its own (closed source) reader but it uses parquet-java for the write path. So, Dremio would highly benefit from the generated golden files, but currently it does not make sense to provide additional ones.
IMO, I like both options 1 and 2. I think for 2, the core framework can be owned by the parquet community with documentation on how to integrate custom readers (similar to archery; and @alkis, I like the name carpenter). For CI purposes I think having the tool run between C++/Java would be useful.
See related mailing list discussion: https://lists.apache.org/thread/kd3k4q691lp5c4q3r767zb8jltrm9z33
Background
In apache/parquet-site#34 we are adding an "implementation status" matrix for different parquet implementations, to help people understand the supported feature sets of various parquet implementations across the ecosystem.
As we work to fill out this matrix for various parquet implementations, the question arises what does "supports a particular Parquet feature" mean, precisely?
One way to provide a precise definition is to provide a way to automate the check for each feature.
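To make "automate the check for each feature" a bit more concrete, here is a minimal sketch of how per-feature checks could feed a matrix row. The feature names, check functions, and the `supported` / `unsupported` labels are all hypothetical and not part of any existing Parquet tooling; a real check would read or write a file exercising the feature.

```python
from typing import Callable, Dict


# Hypothetical checks: a real one would exercise the implementation's
# reader/writer on a Parquet file that uses the feature in question.
def check_plain_encoding() -> bool:
    return True  # placeholder: pretend this implementation passes


def check_bloom_filter() -> bool:
    raise NotImplementedError("bloom filters not supported")


# Hypothetical feature names, chosen for illustration only.
FEATURE_CHECKS: Dict[str, Callable[[], bool]] = {
    "PLAIN encoding": check_plain_encoding,
    "Bloom filter": check_bloom_filter,
}


def build_matrix_row(checks: Dict[str, Callable[[], bool]]) -> Dict[str, str]:
    """Run every check and record 'supported' or 'unsupported' per feature."""
    row = {}
    for feature, check in checks.items():
        try:
            row[feature] = "supported" if check() else "unsupported"
        except Exception:
            # Any failure, including missing functionality, counts as unsupported.
            row[feature] = "unsupported"
    return row


if __name__ == "__main__":
    print(build_matrix_row(FEATURE_CHECKS))
```

With something like this, "supports feature X" would simply mean "the check for X passes", whichever of the options below ends up supplying the checks.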
Prior Art
parquet-testing
The parquet-testing repository contains example parquet files written with various different features.
The README.md file contains brief descriptions of the contents of these files, but there is no machine-readable description of the data contained within those files.
Apache Arrow
Apache Arrow has a similar feature chart: https://arrow.apache.org/docs/status.html
Part of maintaining this chart is a comprehensive integration suite which programmatically checks if data created by one implementation of Arrow can be read by others.
The suite is implemented using a single integration tool called archery, maintained by the Arrow project in the apache/arrow-testing github repo. Each implementation of Arrow implements a driver program that accepts inputs / generates outputs in a known format, and then archery orchestrates running that driver program.
There are also a number of known "gold files" here which contain JSON representations of data stored in gold master arrow files.
Note that Arrow is somewhat different than Parquet in that most of the Arrow implementations are maintained by the Apache Arrow project itself. In comparison, I believe most of the Parquet implementations are maintained by projects / teams other than Apache Parquet.
Options
Here are some ideas of what a Parquet compatibility test might look like
Option 1: integration harness similar to archery
In this case, an integration harness similar to archery would automatically verify different implementations. This harness could orchestrate workflows such as reading gold parquet files, as well as writing parquet data with one implementation, reading it with another, and verifying their compatibility.
Pros:
Cons:
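As a rough illustration of the Option 1 workflow (write with one implementation, read with every other, compare), a harness loop might look like the sketch below. Everything here is an assumption: the `Driver` type, `run_matrix`, and the two stand-in "implementations" are hypothetical, and the stand-ins serialize to JSON bytes rather than Parquet so the example stays self-contained. A real harness would invoke each implementation's driver program instead.

```python
import itertools
import json
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

Rows = List[dict]


@dataclass
class Driver:
    """Hypothetical per-implementation driver: one writer and one reader."""
    name: str
    write: Callable[[Rows], bytes]
    read: Callable[[bytes], Rows]


def run_matrix(drivers: List[Driver], rows: Rows) -> Dict[Tuple[str, str], bool]:
    """Write with each driver, read with every driver, and compare the rows."""
    results = {}
    for writer, reader in itertools.product(drivers, repeat=2):
        try:
            ok = reader.read(writer.write(rows)) == rows
        except Exception:
            ok = False  # a crash in either driver counts as incompatible
        results[(writer.name, reader.name)] = ok
    return results


# Stand-ins only: these use JSON bytes, NOT Parquet, to keep the sketch runnable.
def _json_write(rows: Rows) -> bytes:
    return json.dumps(rows).encode("utf-8")


def _json_read(data: bytes) -> Rows:
    return json.loads(data.decode("utf-8"))


def _lossy_read(data: bytes) -> Rows:
    # Simulates a reader bug: fields whose value is null are dropped.
    return [{k: v for k, v in r.items() if v is not None} for r in _json_read(data)]


DRIVERS = [
    Driver("impl_a", _json_write, _json_read),
    Driver("impl_b", _json_write, _lossy_read),
]
SAMPLE_ROWS = [{"id": 1, "name": "x"}, {"id": 2, "name": None}]

if __name__ == "__main__":
    for (writer, reader), ok in run_matrix(DRIVERS, SAMPLE_ROWS).items():
        print(f"{writer} -> {reader}: {'ok' if ok else 'FAIL'}")
```

Note that the number of (writer, reader) pairs grows quadratically with the number of implementations, which is part of why running everything together in one CI job gets expensive.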
Option 2: Add golden files to parquet-testing
In this option, we would add golden files to the parquet-testing repo (e.g. JSON formatted) corresponding to each existing .parquet file.
Each implementation could then check compatibility by creating its own driver program.
This approach has a (very) rough prototype here: apache/arrow-rs#5956
Pros:
Cons:
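For a sense of what the comparison step in an Option 2 driver could look like, here is a small sketch. It assumes a golden file layout of `{"rows": [...]}`, which is purely hypothetical (the actual JSON format would still need to be specified), and it takes the decoded rows as plain dicts so it does not depend on any particular Parquet library. This is not the prototype from apache/arrow-rs#5956.

```python
import json
from typing import List


def diff_against_golden(decoded_rows: List[dict], golden_json: str) -> List[str]:
    """Return human-readable mismatches between decoded rows and a golden file.

    Assumes a hypothetical golden layout of {"rows": [ {column: value, ...}, ... ]}.
    """
    expected = json.loads(golden_json)["rows"]
    problems = []
    if len(decoded_rows) != len(expected):
        problems.append(
            f"row count: got {len(decoded_rows)}, expected {len(expected)}"
        )
    # Compare row by row over the common prefix so one report covers both
    # a length mismatch and any differing values.
    for i, (got, want) in enumerate(zip(decoded_rows, expected)):
        if got != want:
            problems.append(f"row {i}: got {got!r}, expected {want!r}")
    return problems


if __name__ == "__main__":
    golden = '{"rows": [{"id": 1}, {"id": 2}]}'
    # In a real driver, these rows would come from the implementation's reader.
    print(diff_against_golden([{"id": 1}, {"id": 3}], golden))
```

An empty list means the implementation decoded the file as the golden file describes; anything else can be reported by the driver in whatever way the implementation's own test suite prefers.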
Option 3: Add golden files and files written by other implementations to parquet-testing
@pitrou suggested what I think is an extension of option 2 on apache/arrow-rs#5956 (comment)
I think this mechanism would allow for cross-implementation integration testing without requiring a unified harness
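To illustrate how an implementation might pick up files written by others without a shared harness, here is a sketch that walks a directory tree. The layout `<root>/<implementation>/<name>.parquet` with a sibling `<name>.json` golden file is an assumption made for this example, not something the proposal specifies, and `foreign_cases` is a hypothetical helper name.

```python
from pathlib import Path
from typing import Iterator, Tuple


def foreign_cases(root: Path, own_name: str) -> Iterator[Tuple[str, Path, Path]]:
    """Yield (writer, parquet_path, golden_path) for files from other implementations.

    Assumes a hypothetical layout: <root>/<implementation>/<name>.parquet
    with a matching <name>.json golden file next to it.
    """
    for impl_dir in sorted(p for p in root.iterdir() if p.is_dir()):
        if impl_dir.name == own_name:
            continue  # skip our own output; we only want cross-implementation cases
        for parquet_path in sorted(impl_dir.glob("*.parquet")):
            golden_path = parquet_path.with_suffix(".json")
            if golden_path.exists():
                yield impl_dir.name, parquet_path, golden_path


if __name__ == "__main__":
    # Hypothetical checkout location and implementation name.
    for writer, parquet_path, golden_path in foreign_cases(Path("data"), "impl_a"):
        print(f"{writer}: {parquet_path.name} checked against {golden_path.name}")
```

Each implementation would run this in its own CI against a checkout of the shared repo, so no two implementations ever need to run in the same job.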