More Control over Parquet Writing #6123

Closed
cpwright opened this issue Sep 25, 2024 · 3 comments
Labels: core (Core development tasks), feature request (New feature or request), parquet (Related to the Parquet integration)

Comments

@cpwright
Contributor

As a systems integrator, I want more control over how Parquet files are written so that I can implement a process for transforming data overnight.

This ticket needs more definition before we work on it. I would like to be able to either pass a whole row group of data to the write function at once or, alternatively, pass one column of a row group at a time, so that I can ensure read locality for my input data.
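
To make the two requested call shapes concrete, here is a minimal sketch in Python. Everything in it is hypothetical: `RowGroupWriterSketch`, `write_row_group`, and `write_column` are illustrative names, not an existing Deephaven or Parquet API, and the class only buffers data in memory instead of writing a real file.

```python
from typing import Dict, List, Sequence


class RowGroupWriterSketch:
    """Hypothetical writer interface; it only buffers row groups in memory."""

    def __init__(self, column_names: Sequence[str]) -> None:
        self.column_names = list(column_names)
        self.row_groups: List[Dict[str, list]] = []  # completed row groups
        self._pending: Dict[str, list] = {}  # columns of the open row group

    def write_row_group(self, columns: Dict[str, Sequence]) -> None:
        """Option 1: pass a whole row group at once."""
        if self._pending:
            raise ValueError("a column-at-a-time row group is still open")
        if set(columns) != set(self.column_names):
            raise ValueError("a row group must supply every column exactly once")
        if len({len(values) for values in columns.values()}) > 1:
            raise ValueError("all columns in a row group must have the same length")
        self.row_groups.append(
            {name: list(columns[name]) for name in self.column_names}
        )

    def write_column(self, name: str, values: Sequence) -> None:
        """Option 2: pass one column of the current row group at a time."""
        if name not in self.column_names or name in self._pending:
            raise ValueError(f"unexpected or repeated column: {name}")
        self._pending[name] = list(values)
        # Once every column has arrived, close the row group.
        if len(self._pending) == len(self.column_names):
            pending, self._pending = self._pending, {}
            self.write_row_group(pending)


writer = RowGroupWriterSketch(["sym", "price"])
writer.write_row_group({"sym": ["A", "B"], "price": [1.0, 2.0]})
writer.write_column("sym", ["C"])  # caller reads one input column...
writer.write_column("price", [3.0])  # ...then the next, preserving read locality
```

The column-at-a-time form lets the caller scan each input column contiguously before moving to the next, which is the read-locality concern described above. A real implementation would also have to decide how to handle a row group that is abandoned partway through.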

cpwright added the feature request and triage labels on Sep 25, 2024
rcaudy added the core and parquet labels and removed the triage label on Sep 25, 2024
rcaudy added this to the Backlog milestone on Sep 25, 2024
@rcaudy
Member

rcaudy commented Sep 25, 2024

As noted by @cpwright, we're still defining this ticket and its priority.

One detail we'll need to be sure to handle is data indexes when there are multiple row groups. One approach might be to mirror the row group structure of the "main" file in each index file, as a hint that we potentially need to shift the row sets persisted to the index table in order to compensate for row group shifts in the main table.
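
One possible reading of that approach, sketched in Python under explicit assumptions: the index file has one row group per main-file row group, and each index row group stores row positions relative to the start of its main row group. The function name `shift_index_row_sets` and the dict-of-lists layout are hypothetical and only illustrate the shift arithmetic.

```python
from itertools import accumulate
from typing import Dict, Hashable, List, Sequence


def shift_index_row_sets(
    main_row_group_sizes: Sequence[int],
    index_row_groups: Sequence[Dict[Hashable, List[int]]],
) -> Dict[Hashable, List[int]]:
    """Merge per-row-group index entries into table-wide row positions.

    Assumption: index_row_groups[i] maps each key to row positions that are
    relative to the start of row group i in the main file.
    """
    if len(main_row_group_sizes) != len(index_row_groups):
        raise ValueError("index must mirror the main file's row group structure")

    # First table-wide row position of each main row group: 0, s0, s0 + s1, ...
    offsets = [0, *accumulate(main_row_group_sizes)][:-1]

    merged: Dict[Hashable, List[int]] = {}
    for offset, index_group in zip(offsets, index_row_groups):
        for key, local_rows in index_group.items():
            merged.setdefault(key, []).extend(offset + row for row in local_rows)
    return merged


# Two main row groups holding 3 and 2 rows respectively.
shifted = shift_index_row_sets(
    [3, 2],
    [{"A": [0, 2], "B": [1]}, {"A": [1], "B": [0]}],
)
# shifted == {"A": [0, 2, 4], "B": [1, 3]}
```

Mirroring the row group structure is what makes the offsets recoverable here: without it, the reader would have no way to tell which main row group a persisted row position belongs to.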

@devinrsmith
Member

I could also see #6125 imposing some writing requirements, such as the need to attach field_ids or add key-value (KV) metadata, among other things (I don't know what support we may or may not already have for those kinds of requirements).

@malhotrashivam
Contributor
