API, Core, Spark: Scan API for partition stats #14508

gaborkaszab · 2025-11-05T11:30:47Z

No description provided.

gaborkaszab · 2025-11-05T13:03:07Z

Some background: The current way to query partition stats is through PartitionStatsHandler.readPartitionStatsFile(). For that, the user has to put together the schema and get the input file to read. A more convenient API for scanning partition stats would improve usability (a comment on my stats proposal doc also asks for this). Such an API could also offer filter and projection capabilities.

The content of this PR:

Introduce the PartitionStatisticsScan API and its implementation BasePartitionStatisticsScan in core. For simplicity, it only offers the functionality that exists today: no filtering by partition and no projection.
Replace the usage of PartitionStatsHandler.readPartitionStatsFile() with the new API
Introduce the PartitionStatistics interface in the API module and make PartitionStats in core implement it. This is needed so that the Scan API can use it as the return type, since the existing PartitionStats class lives in the core module.
Replace the usage of PartitionStats whenever possible with the new interface.
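
To illustrate the interface/implementation split described in the list above, here is a minimal, self-contained sketch. All types in it are hypothetical stand-ins that only borrow the names from this PR; the accessors shown and the plain Iterable return type are simplifying assumptions, not the actual Iceberg signatures.

```java
import java.util.List;

// Hypothetical stand-in for the interface this PR adds to the API module.
interface PartitionStatistics {
  int specId();

  long dataRecordCount();
}

// Hypothetical stand-in for the existing core class, which now implements the interface.
class PartitionStats implements PartitionStatistics {
  private final int specId;
  private final long dataRecordCount;

  PartitionStats(int specId, long dataRecordCount) {
    this.specId = specId;
    this.dataRecordCount = dataRecordCount;
  }

  @Override
  public int specId() {
    return specId;
  }

  @Override
  public long dataRecordCount() {
    return dataRecordCount;
  }
}

// Simplified scan contract: callers only ever see the interface type.
interface PartitionStatisticsScan {
  Iterable<PartitionStatistics> scan();
}

// In-memory implementation, present only to make the sketch runnable.
class InMemoryPartitionStatisticsScan implements PartitionStatisticsScan {
  private final List<PartitionStatistics> rows;

  InMemoryPartitionStatisticsScan(List<PartitionStatistics> rows) {
    this.rows = rows;
  }

  @Override
  public Iterable<PartitionStatistics> scan() {
    return rows;
  }
}

class ScanSketch {
  // Sums record counts without knowing the concrete stats class.
  static long totalRecords(PartitionStatisticsScan scan) {
    long total = 0L;
    for (PartitionStatistics stats : scan.scan()) {
      total += stats.dataRecordCount();
    }
    return total;
  }
}
```

Because the caller depends only on the interface, the implementing class can be renamed later without touching code like totalRecords.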

These could possibly be some follow-up steps:

Implementation of filter() and project() on the new Scan API
The naming of the affected classes is a bit weird: the interface api/PartitionStatistics is implemented by core/PartitionStats. Ideally the implementation would be named BasePartitionStatistics. As a next step we can introduce a class with the same content under the new name, deprecate the existing one, and remove its usages. Changes within PartitionStats are easier to review if the "renaming" happens in a follow-up PR.
Older Spark versions should be covered

gaborkaszab · 2025-11-05T13:06:09Z

core/src/main/java/org/apache/iceberg/PartitionStatsHandler.java

                value,
                (existingEntry, newEntry) -> {
-                  existingEntry.appendStats(newEntry);
+                  ((PartitionStats) existingEntry).appendStats(newEntry);


If the PartitionStatistics interface had the appendStats function, this cast (and another occurrence) wouldn't be needed. It seemed a bit weird to have it there, but I'm open to making this change to clean up the casts.

I would prefer to keep the interface clean
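
For readers following the trade-off in this thread, a small sketch of why the cast appears: when the map values are typed as a read-only interface, a merge function that mutates the existing entry has to downcast to the implementation. StatsView, MutableStats and MergeSketch are hypothetical names chosen for illustration; this is not the code from PartitionStatsHandler.

```java
import java.util.Map;

// Read-only view, playing the role of the API-module interface.
interface StatsView {
  long recordCount();
}

// Mutable implementation, playing the role of the core class that owns appendStats.
class MutableStats implements StatsView {
  private long recordCount;

  MutableStats(long recordCount) {
    this.recordCount = recordCount;
  }

  @Override
  public long recordCount() {
    return recordCount;
  }

  void appendStats(StatsView other) {
    this.recordCount += other.recordCount();
  }
}

class MergeSketch {
  // Merges incoming entries into target; entries with the same key are appended.
  static Map<String, StatsView> mergeInto(
      Map<String, StatsView> target, Map<String, StatsView> incoming) {
    incoming.forEach(
        (key, value) ->
            target.merge(
                key,
                value,
                (existingEntry, newEntry) -> {
                  // The interface stays free of mutators, so the cast is required here.
                  ((MutableStats) existingEntry).appendStats(newEntry);
                  return existingEntry;
                }));
    return target;
  }
}
```

The cost of a clean interface is confined to the one place that mutates; readers of the stats never see appendStats.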

pvary · 2025-11-06T09:44:43Z

api/src/main/java/org/apache/iceberg/PartitionStatisticsScan.java

+  PartitionStatisticsScan filter(Expression filter);
+
+  /**
+   * Create a new scan from this with the schema as its projection.


Maybe describe what will happen with the PartitionStatistics attributes which are not part of the schema.

pvary · 2025-11-06T09:47:25Z

api/src/main/java/org/apache/iceberg/PartitionStatisticsScan.java

+  /**
+   * Create a new scan from this with the schema as its projection.
+   *
+   * @param schema a projection schema


How does the user create the Schema?

I would prefer something like the DataFile where the possible columns are available as constants, and the type is available as well. Maybe copy/move/deprecate the schema from the old place.
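
One possible shape for this suggestion, sketched with hypothetical types: StatsColumn and StatsColumns do not exist in Iceberg, and a real version would presumably use Types.NestedField constants with field IDs, as DataFile does. The column names are illustrative only.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical column descriptor carrying a name and a Java type.
final class StatsColumn {
  final String name;
  final Class<?> javaType;

  StatsColumn(String name, Class<?> javaType) {
    this.name = name;
    this.javaType = javaType;
  }
}

// Hypothetical holder of well-known columns, in the spirit of the constants on DataFile.
final class StatsColumns {
  static final StatsColumn SPEC_ID = new StatsColumn("spec_id", Integer.class);
  static final StatsColumn DATA_RECORD_COUNT = new StatsColumn("data_record_count", Long.class);
  static final StatsColumn DATA_FILE_COUNT = new StatsColumn("data_file_count", Integer.class);

  private StatsColumns() {}

  // Builds an immutable projection from the shared constants.
  static List<StatsColumn> projection(StatsColumn... columns) {
    return Collections.unmodifiableList(Arrays.asList(columns));
  }
}
```

With constants like these, a caller could write StatsColumns.projection(StatsColumns.SPEC_ID, StatsColumns.DATA_RECORD_COUNT) instead of assembling a schema by hand.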

pvary · 2025-11-06T09:49:43Z

core/src/main/java/org/apache/iceberg/BasePartitionStatisticsScan.java

+    Types.StructType partitionType = Partitioning.partitionType(table);
+    Schema schema = PartitionStatsHandler.schema(partitionType, TableUtil.formatVersion(table));
+
+    FileFormat fileFormat = FileFormat.fromFileName(statsFile.get().path());


I still think that getting the file format from the name is brittle. Maybe not in this PR, but I would love to have this fixed

pvary · 2025-11-06T09:54:34Z

core/src/main/java/org/apache/iceberg/PartitionStatsHandler.java


      try {
-        stats = computeAndMergeStatsIncremental(table, snapshot, partitionType, statisticsFile);
+        stats = computeAndMergeStatsIncremental(table, snapshot, statisticsFile.snapshotId());


Why do we remove the partitionType parameter? We recalculate it later again for every file. Isn't that unnecessary?

pvary · 2025-11-06T09:55:53Z

core/src/main/java/org/apache/iceberg/PartitionStatsHandler.java

+        rec -> {
+          return (PartitionStats) recordToPartitionStats(rec);
+        });


Do we need the braces?
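
For reference, a lambda whose block body is a single return statement can be written as an expression lambda with the same behavior. The sketch below uses a hypothetical convert method in place of recordToPartitionStats(rec), with String standing in for PartitionStats.

```java
import java.util.function.Function;

class LambdaBodySketch {
  // Hypothetical converter standing in for recordToPartitionStats(rec); returns a supertype.
  static CharSequence convert(String rec) {
    return rec.trim();
  }

  // Block body, shaped like the lambda in the diff under review.
  static final Function<String, String> WITH_BRACES =
      rec -> {
        return (String) convert(rec);
      };

  // Expression body: same behavior without the braces or the return keyword.
  static final Function<String, String> WITHOUT_BRACES = rec -> (String) convert(rec);
}
```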

pvary · 2025-11-06T10:00:31Z

core/src/main/java/org/apache/iceberg/BasePartitionStatisticsScan.java

+  }
+
+  @Override
+  public CloseableIterable<PartitionStatistics> scan() {


Do we have tests for this?

API, Core, Spark: Scan API for partition stats

fb06b8a

github-actions bot added API spark core data labels Nov 5, 2025

gaborkaszab requested review from ajantha-bhat, nastra, pvary and rdblue November 5, 2025 13:06

pvary reviewed Nov 6, 2025
