API, Core, Spark: Scan API for partition stats #14508
Conversation
Some background: the current way to query partition stats is through
The content of this PR:
Some possible follow-up steps:
```diff
  value,
  (existingEntry, newEntry) -> {
-   existingEntry.appendStats(newEntry);
+   ((PartitionStats) existingEntry).appendStats(newEntry);
```
If the PartitionStatistics interface had the appendStats function, this cast (and another occurrence) wouldn't be needed. It seemed a bit odd to have it there, but I'm open to making this change to clean up the casts.
I would prefer to keep the interface clean.
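To make the trade-off concrete, here is a minimal self-contained sketch of the two options being discussed. The types here are hypothetical stand-ins, not Iceberg's actual classes: `appendStats` stays on the implementation only (keeping the interface clean), so the merge site needs the cast shown in the diff above.

```java
// Hypothetical stand-ins for the PR's types (not the real Iceberg classes).
interface PartitionStatistics {
  long recordCount();
}

class PartitionStats implements PartitionStatistics {
  private long recordCount;

  PartitionStats(long recordCount) {
    this.recordCount = recordCount;
  }

  @Override
  public long recordCount() {
    return recordCount;
  }

  // appendStats lives only on the implementation, keeping the interface clean
  void appendStats(PartitionStatistics other) {
    this.recordCount += other.recordCount();
  }
}

public class MergeSketch {
  // The merge function receives the interface type, so a cast is required here.
  // Moving appendStats onto the interface would remove the cast at the cost of
  // widening the public API.
  static PartitionStatistics merge(PartitionStatistics existing, PartitionStatistics incoming) {
    ((PartitionStats) existing).appendStats(incoming);
    return existing;
  }

  public static void main(String[] args) {
    PartitionStatistics merged = merge(new PartitionStats(3), new PartitionStats(4));
    System.out.println(merged.recordCount()); // prints 7
  }
}
```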
```java
PartitionStatisticsScan filter(Expression filter);

/**
 * Create a new scan from this with the schema as its projection.
```
Maybe describe what will happen with the PartitionStatistics attributes that are not part of the schema.
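For illustration, one plausible semantics the javadoc could document (a hypothetical sketch, not necessarily what this PR implements): attributes outside the projected schema are simply not populated in the returned rows.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

// Hypothetical illustration of projection semantics: fields outside the
// projected schema are dropped from the returned row. Rows are modeled as
// maps purely for the sake of a runnable example.
public class ProjectionSketch {
  static Map<String, Object> project(Map<String, Object> row, Set<String> projectedColumns) {
    Map<String, Object> projected = new LinkedHashMap<>();
    for (Map.Entry<String, Object> e : row.entrySet()) {
      if (projectedColumns.contains(e.getKey())) {
        projected.put(e.getKey(), e.getValue());
      }
    }
    return projected;
  }

  public static void main(String[] args) {
    Map<String, Object> stats = new LinkedHashMap<>();
    stats.put("partition", "day=2024-01-01");
    stats.put("record_count", 42L);
    stats.put("file_count", 3);

    // project only record_count; partition and file_count are not returned
    System.out.println(project(stats, Set.of("record_count")));
  }
}
```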
```java
/**
 * Create a new scan from this with the schema as its projection.
 *
 * @param schema a projection schema
```
How does the user create the Schema?
I would prefer something like DataFile, where the possible columns are available as constants and their types are available as well. Maybe copy/move/deprecate the schema from the old place.
```java
Types.StructType partitionType = Partitioning.partitionType(table);
Schema schema = PartitionStatsHandler.schema(partitionType, TableUtil.formatVersion(table));

FileFormat fileFormat = FileFormat.fromFileName(statsFile.get().path());
```
I still think that getting the file format from the name is brittle. Maybe not in this PR, but I would love to have this fixed.
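To illustrate the brittleness: a name-based lookup silently depends on the writer having used a recognized extension. The sketch below is a stand-in for what an extension-based lookup does (it is not Iceberg's `FileFormat.fromFileName` implementation); a renamed or extension-less file falls through to an unknown format.

```java
import java.util.Locale;

// Stand-in sketch for an extension-based format lookup. Files without a
// recognized extension fall through to UNKNOWN, which is why deriving the
// format from the name is brittle.
public class FormatFromNameSketch {
  enum Format { PARQUET, ORC, AVRO, UNKNOWN }

  static Format fromFileName(String path) {
    String lower = path.toLowerCase(Locale.ROOT);
    if (lower.endsWith(".parquet")) {
      return Format.PARQUET;
    } else if (lower.endsWith(".orc")) {
      return Format.ORC;
    } else if (lower.endsWith(".avro")) {
      return Format.AVRO;
    }
    return Format.UNKNOWN; // renamed or extension-less files end up here
  }

  public static void main(String[] args) {
    System.out.println(fromFileName("stats/partition-stats-1.parquet")); // PARQUET
    System.out.println(fromFileName("stats/partition-stats-1"));         // UNKNOWN
  }
}
```

Storing the format explicitly in the stats file metadata would avoid the fallthrough case entirely.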
```diff
  try {
-   stats = computeAndMergeStatsIncremental(table, snapshot, partitionType, statisticsFile);
+   stats = computeAndMergeStatsIncremental(table, snapshot, statisticsFile.snapshotId());
```
Why do we remove the partitionType parameter? We recalculate it again later for every file. Isn't that unnecessary?
```java
rec -> {
  return (PartitionStats) recordToPartitionStats(rec);
});
```
Do we need the braces?
```java
}

@Override
public CloseableIterable<PartitionStatistics> scan() {
```
Do we have tests for this?