Is your feature request related to a problem or challenge?
We’re implementing a file format, Vortex, which has no “row groups” or similar concept, meaning a byte range might fall completely within one column, and aligning columns is a non-trivial task. I would like to be able to express repartitioning logic that only splits files logically (by rows and not by bytes).
The existing repartitioning logic in DataFusion (specifically FileGroupPartitioner and FileScanConfig::repartitioned) assumes that files can be split logically by byte ranges (FileRange), and even the rustdoc on it seems very Parquet-specific (even though other formats do support it). This assumes some mapping/alignment between the physical layout and the logical one.
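To make "split by rows and not by bytes" concrete, here is a minimal sketch of what a row-based split could look like. `RowRange` and `split_by_rows` are hypothetical names used only for illustration; they are not part of DataFusion or Vortex, and the sketch assumes the file's total row count is known up front.

```rust
/// Hypothetical half-open row range `[start, end)` within a single file.
/// Not a DataFusion type; used here only to illustrate the idea.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct RowRange {
    pub start: u64,
    pub end: u64,
}

/// Split `total_rows` into at most `target_partitions` contiguous,
/// non-overlapping row ranges whose sizes differ by at most one row.
pub fn split_by_rows(total_rows: u64, target_partitions: usize) -> Vec<RowRange> {
    if total_rows == 0 || target_partitions == 0 {
        return Vec::new();
    }
    // Never produce more partitions than there are rows.
    let parts = (target_partitions as u64).min(total_rows);
    let base = total_rows / parts;
    let remainder = total_rows % parts;

    let mut ranges = Vec::with_capacity(parts as usize);
    let mut start = 0u64;
    for i in 0..parts {
        // The first `remainder` partitions take one extra row.
        let len = base + if i < remainder { 1 } else { 0 };
        ranges.push(RowRange { start, end: start + len });
        start += len;
    }
    ranges
}
```

Because the split is expressed in rows, every range lines up across all columns regardless of how the bytes are laid out on disk.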
Describe the solution you'd like
Seems like the best way would be to configure FileGroupPartitioner through FileSource. The other option would be to make FileRange an enum, but that would still mean we (and any other format with a similar structure) will have to maintain our own repartitioning logic.
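A rough sketch of how the two options could fit together, under the assumption that the split descriptor becomes an enum and the format chooses the strategy. Everything here (`FileSplit`, `SplitStrategy`, `ByteSplit`, `RowSplit`) is hypothetical and does not reflect the actual FileSource or FileGroupPartitioner API.

```rust
/// Hypothetical split descriptor: either a byte range or a row range.
/// Both variants are half-open: `[start, end)`.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum FileSplit {
    Bytes { start: u64, end: u64 },
    Rows { start: u64, end: u64 },
}

/// Hypothetical format-specific hook: each format decides how one file
/// is divided into up to `n` splits.
pub trait SplitStrategy {
    fn split(&self, file_size_bytes: u64, row_count: u64, n: u64) -> Vec<FileSplit>;
}

/// Divide `[0, total)` into chunks of `ceil(total / n)` units each.
fn even_ranges(total: u64, n: u64) -> Vec<(u64, u64)> {
    if total == 0 || n == 0 {
        return Vec::new();
    }
    let chunk = (total + n - 1) / n; // ceiling division
    let mut out = Vec::new();
    let mut start = 0;
    while start < total {
        let end = (start + chunk).min(total);
        out.push((start, end));
        start = end;
    }
    out
}

/// Byte-oriented strategy, for formats that can resolve a byte range
/// to whole logical units (e.g. row groups).
pub struct ByteSplit;

/// Row-oriented strategy, for formats without such alignment.
pub struct RowSplit;

impl SplitStrategy for ByteSplit {
    fn split(&self, file_size_bytes: u64, _row_count: u64, n: u64) -> Vec<FileSplit> {
        even_ranges(file_size_bytes, n)
            .into_iter()
            .map(|(start, end)| FileSplit::Bytes { start, end })
            .collect()
    }
}

impl SplitStrategy for RowSplit {
    fn split(&self, _file_size_bytes: u64, row_count: u64, n: u64) -> Vec<FileSplit> {
        even_ranges(row_count, n)
            .into_iter()
            .map(|(start, end)| FileSplit::Rows { start, end })
            .collect()
    }
}
```

With a hook like this, the shared partitioner would only need to ask the format for its splits, so formats such as Vortex would not have to carry their own copy of the repartitioning logic.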
Describe alternatives you've considered
We can keep the current state, which is maintaining our own repartitioning logic and eventually just reusing FileRange to describe row splits.
Additional context
No response
Seems like the best way would be to configure FileGroupPartitioner through FileSource. The other option would be to make FileRange an enum, but that would still mean we (and any other format with a similar structure) will have to maintain our own repartitioning logic.
I agree -- having a Format specific repartitioner makes the most sense
Not sure if you are following the latest developments on main, but there is the newly added FileSource trait