Support source/sink for plain Parquet/ORC/Avro Tables #166
Comments
@anoopj what would the metadata look like for a sink export? I like the idea of a generic bootstrap so that users could take existing data and try out all 3 formats if they want to do some testing with other tools.
Sink could be based on manifest files in SymlinkTextInputFormat. BigQuery also now supports manifest files.
Yes, bootstrap is probably higher priority than sink.
@jackwener any interest in looking into something like this?
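To make the manifest idea concrete, here is a minimal sketch of what a sink export could write. It is not XTable code: the function name `write_symlink_manifests`, the input shape, and the one-`manifest`-file-per-partition layout are all assumptions for illustration. The only fixed point is that a SymlinkTextInputFormat manifest is a plain text file with one data file path per line.

```python
from collections import defaultdict
from pathlib import Path


def write_symlink_manifests(data_files, manifest_root):
    """Write one plain-text `manifest` file per partition directory.

    data_files: iterable of (partition_path, file_uri) pairs; partition_path
    is "" for an unpartitioned table. Both the input shape and the directory
    layout are hypothetical choices for this sketch.
    """
    # Group the data file URIs by the partition they belong to.
    grouped = defaultdict(list)
    for partition_path, uri in data_files:
        grouped[partition_path].append(uri)

    written = []
    for partition_path, uris in sorted(grouped.items()):
        target_dir = Path(manifest_root) / partition_path
        target_dir.mkdir(parents=True, exist_ok=True)
        manifest = target_dir / "manifest"
        # One data file path per line, sorted for deterministic output.
        manifest.write_text("\n".join(sorted(uris)) + "\n")
        written.append(manifest)
    return written
```

An external table defined over `manifest_root` with SymlinkTextInputFormat would then resolve each line to the underlying data file, so exporting a specific snapshot amounts to listing exactly the files that belong to it.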
@the-other-tim-brown I'm trying to find a good first issue to ramp up on XTable. Can I take a look at this one? Perhaps we can split it into separate issues. One initial task could be to add support for Parquet as an input data format, for example. I'm not sure what the code looks like, but ultimately we can create something modular enough to extend to Avro or other formats later, if that hasn't been done already. I would be interested in discussing possible approaches to filling in the partitioning and statistics info. However, just to check that I understand the scenario correctly: if today I wanted to bootstrap 2 different systems, Hudi and Iceberg, with existing Parquet files, couldn't I use the native capabilities of either system for the initial import, and then use the current XTable to generate the metadata files for the other system?
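One possible approach to the partitioning question, sketched under assumptions: if the existing files follow the Hive-style `key=value` directory convention, partition values can be recovered from the file path alone. The helper name `partition_values` is hypothetical, and nothing in this thread confirms that XTable would take this route. Column statistics are not covered here; they would have to come from the file footers.

```python
from pathlib import PurePosixPath
from urllib.parse import unquote

# Directory name Hive writes when the partition value is null.
HIVE_DEFAULT_PARTITION = "__HIVE_DEFAULT_PARTITION__"


def partition_values(file_path, table_root):
    """Return {column: value} parsed from Hive-style `key=value` directories.

    Assumes every directory between table_root and the file is a partition
    directory; a file directly under table_root yields an empty dict.
    """
    relative = PurePosixPath(file_path).relative_to(PurePosixPath(table_root))
    values = {}
    # Every path part except the last (the file name) is a directory.
    for segment in relative.parts[:-1]:
        key, sep, raw = segment.partition("=")
        if not sep or not key:
            raise ValueError(f"not a key=value partition directory: {segment!r}")
        # Hive percent-encodes special characters in partition directory names.
        values[key] = None if raw == HIVE_DEFAULT_PARTITION else unquote(raw)
    return values
```

The values come back as strings (or `None`), so a real source would still need the partition column types, for example from a user-supplied configuration, before building table metadata.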
I think it makes sense to start with just one of the file formats like Parquet. We can discuss how to get the info you would need.
Yes, you could do that as well. There is another issue I had my eye on that I could guide you through as well if you are interested: #411
Ok, if you agree that we want to move away from this workaround approach, then I think supporting Parquet is a good first issue for me to smooth the learning curve.
Ok, this one could be a good next step, but for now I prefer to limit the amount of novelty. I should have some time to start on the Parquet issue next week.
@marqub we do not have a Slack set up for the project yet. I can shoot you an email to connect and discuss any of the details in the meantime.
Hi, is someone working on this? I am new to this project and would like to get started.
@Reactor11 there is a similar effort for a Parquet file source that is being worked on: #553
Supporting plain Parquet/ORC/Avro (partitioned as well as unpartitioned) may be useful for "upgrading" legacy data to table formats. Sink may be useful for exporting a specific snapshot for interoperability reasons.
This feature is lower priority, as Iceberg, Delta, etc. have native support for metadata-only conversions and offer Spark procedures.