Logically repartition files by row splits #14607

Open
AdamGS opened this issue Feb 11, 2025 · 2 comments · May be fixed by #14754
Labels
enhancement New feature or request

Comments

@AdamGS
Contributor

AdamGS commented Feb 11, 2025

Is your feature request related to a problem or challenge?

We’re implementing a file format, Vortex, which has no “row groups” or similar concept. This means a byte range might fall completely within one column, and aligning columns is a non-trivial task. I would like to be able to express repartitioning logic that only splits files logically (by rows and not by bytes).
The existing repartitioning logic in DataFusion (specifically FileGroupPartitioner and FileScanConfig::repartitioned) assumes that files can be split logically by byte ranges (FileRange), and even the rustdoc on it seems very Parquet-specific (even though other formats do support it). This assumes some mapping/alignment between the physical layout and the logical one.
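
To make the distinction concrete, here is a rough sketch of what a purely logical split could look like. It is not DataFusion or Vortex code: `RowRange` and `split_rows` are hypothetical names, and the sketch assumes the total row count of a file is known up front.

```rust
/// Hypothetical half-open row range `[start, end)`; not a DataFusion type.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct RowRange {
    pub start: u64,
    pub end: u64,
}

/// Split `total_rows` into at most `target_partitions` contiguous ranges
/// whose lengths differ by at most one row. No byte offsets are involved.
pub fn split_rows(total_rows: u64, target_partitions: usize) -> Vec<RowRange> {
    if total_rows == 0 || target_partitions == 0 {
        return Vec::new();
    }
    // Never create more partitions than there are rows.
    let parts = (target_partitions as u64).min(total_rows);
    let base = total_rows / parts;
    let remainder = total_rows % parts;

    let mut ranges = Vec::with_capacity(parts as usize);
    let mut start = 0u64;
    for i in 0..parts {
        // The first `remainder` ranges take one extra row each.
        let len = base + u64::from(i < remainder);
        ranges.push(RowRange { start, end: start + len });
        start += len;
    }
    ranges
}
```

For example, `split_rows(10, 3)` produces the ranges `[0, 4)`, `[4, 7)` and `[7, 10)`. Each range covers every column for those rows, so no alignment with the physical layout is needed.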

Describe the solution you'd like

Seems like the best way would be to configure FileGroupPartitioner through FileSource. The other option would be to make FileRange an enum, but that would still mean we (and any other format with a similar structure) will have to maintain our own repartitioning logic.
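
For reference, the enum option might look roughly like the sketch below. `FileSplit` and its methods are hypothetical and are not the existing `FileRange` type. The point is that every consumer would have to match on the variant, and row-based formats would still need their own splitting logic.

```rust
/// Hypothetical enum-shaped range; an assumption for illustration only,
/// not DataFusion's `FileRange`.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum FileSplit {
    /// Half-open byte range `[start, end)` within the file.
    Bytes { start: u64, end: u64 },
    /// Half-open row range `[start, end)` within the file.
    Rows { start: u64, end: u64 },
}

impl FileSplit {
    /// Size of the split in its own unit (bytes or rows).
    pub fn span(&self) -> u64 {
        match *self {
            FileSplit::Bytes { start, end } | FileSplit::Rows { start, end } => {
                end.saturating_sub(start)
            }
        }
    }

    /// True when the split is expressed in rows rather than bytes.
    pub fn is_row_based(&self) -> bool {
        matches!(self, FileSplit::Rows { .. })
    }
}
```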

Describe alternatives you've considered

We can keep the current state, which is maintaining our own repartitioning logic and eventually just reusing FileRange to describe row splits.

Additional context

No response

AdamGS added the enhancement label Feb 11, 2025
@alamb
Contributor

alamb commented Feb 12, 2025

> Seems like the best way would be to configure FileGroupPartitioner through FileSource. The other option would be to make FileRange an enum, but that would still mean we (and any other format with a similar structure) will have to maintain our own repartitioning logic.

I agree -- having a format-specific repartitioner makes the most sense

Not sure if you are following the latest developments on main, but there is the newly added FileSource trait, which includes:

fn supports_repartition(&self, config: &FileScanConfig) -> bool;

Maybe instead of supports_repartition we could extend it to directly do the repartitioning 🤔
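
One possible shape for that idea, as a sketch only: the trait method returns a repartitioned config, or `None` when the source cannot repartition. All names here (`ScanConfig`, `Source`, `ByteOnlySource`, `RowSplitSource`) are hypothetical stand-ins, not the real `FileSource` or `FileScanConfig` definitions.

```rust
/// Stand-in for a scan configuration; NOT DataFusion's `FileScanConfig`.
/// Each partition lists `(file name, first row, end row)` slices,
/// with `first row <= end row` assumed.
#[derive(Debug, Clone, PartialEq)]
pub struct ScanConfig {
    pub partitions: Vec<Vec<(String, u64, u64)>>,
}

/// Hypothetical trait: the source performs the repartitioning itself
/// instead of only reporting whether it is supported.
pub trait Source {
    /// Default: repartitioning is not supported.
    fn repartitioned(&self, _config: &ScanConfig, _target_partitions: usize) -> Option<ScanConfig> {
        None
    }
}

/// A source that keeps the default behaviour.
pub struct ByteOnlySource;
impl Source for ByteOnlySource {}

/// A source that splits every file slice by rows.
pub struct RowSplitSource;

impl Source for RowSplitSource {
    fn repartitioned(&self, config: &ScanConfig, target_partitions: usize) -> Option<ScanConfig> {
        if target_partitions == 0 {
            return None;
        }
        let n = target_partitions as u64;
        let mut partitions = vec![Vec::new(); target_partitions];
        for (file, start, end) in config.partitions.iter().flatten() {
            let (start, end) = (*start, *end);
            let rows = end - start;
            for i in 0..n {
                // Proportional boundaries keep chunks contiguous and near-equal.
                let lo = start + rows * i / n;
                let hi = start + rows * (i + 1) / n;
                if lo < hi {
                    partitions[i as usize].push((file.clone(), lo, hi));
                }
            }
        }
        Some(ScanConfig { partitions })
    }
}
```

With this shape, a caller can fall back to its existing behaviour whenever `None` comes back, so a separate `supports_repartition` check would no longer be needed.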

@AdamGS
Contributor Author

AdamGS commented Feb 13, 2025

That seems like an appropriate place. I'll try to play around with that sometime this week and share if I get anything I like.
