Normalize partitioned and flat object listing #18146
- Refactored helpers related to listing, discovering, and pruning objects based on partitions to normalize the strategy between partitioned and flat tables
```toml
[dev-dependencies]
datafusion-datasource-parquet = { workspace = true }
datafusion = { workspace = true }
```
It looks like this dependency tripped the circular dependency check even though it's a dev dependency for test setup. Is there an alternative mechanism to get access to a `SessionStateBuilder` for testing rather than using the import here?
I agree with your assessment that this is likely to be minimal -- especially given that queries that request thousands of objects will therefore require many thousands of S3 requests for the data files themselves
Thanks @BlakeOrth -- I started reviewing this PR and hope to do more later today
```rust
fn task_ctx(&self) -> Arc<datafusion_execution::TaskContext> {
    unimplemented!()
}
```

```rust
let state = SessionStateBuilder::new().build();
```
In order to avoid circular dependencies (needed to allow datafusion to compile faster), the API needed for the catalog is in the `Session` trait, which is implemented by `SessionState` but can also be implemented by other things.

Thus, in this case the options are:

- Keep the `MockSession` and implement whatever APIs it needs
- Move the tests to the `datafusion` crate (e.g. somewhere in https://github.com/apache/datafusion/blob/main/datafusion/core/tests/core_integration.rs), sketched below
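A minimal sketch of that second option, assuming the test lives in the core `datafusion` crate's integration tests; the test name and body here are hypothetical, not code from this PR:

```rust
// e.g. under datafusion/core/tests/, where depending on the `datafusion`
// crate does not create a circular dev-dependency.
use datafusion::execution::session_state::SessionStateBuilder;

#[test]
fn builds_real_session_state_for_listing_tests() {
    // Use a real SessionState instead of a hand-rolled MockSession.
    let state = SessionStateBuilder::new().with_default_features().build();

    // ...exercise the listing/pruning helpers under test with `state`...
    let _ = state;
}
```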
Which issue does this PR close?

- `ListFilesCache` to be available for partitioned tables #17211

It's not yet clear to me if this will fully close the above issue, or if it's just the first step. I think there may be more work to do, so I'm not going to have this auto-close the issue.
Rationale for this change
tl;dr of the issue: normalizing the access pattern(s) for objects in partitioned tables should not only reduce the number of requests to a backing object store, but also allow any existing and/or future caching mechanisms to apply equally to both directory-partitioned and flat tables.
List requests on `main`:

List requests for this PR:
What changes are included in this PR?
Are these changes tested?
Yes. The internal methods that have been modified are covered by existing tests.
Are there any user-facing changes?
No
Additional Notes
I want to surface that I believe there is a chance of a performance regression for certain queries against certain tables. One performance-related mechanism the existing code implements, but this code currently omits, is (potentially) reducing the number of partitions listed based on query filters. In order for the existing code to exercise this optimization, the query filters must contain all the path elements of a subdirectory as column filters.
For example, given a table with a directory-partitioning structure, a query whose filters cover all the path elements of a subdirectory will only list that subdirectory, whereas a query whose filters are missing one of the higher-level path elements results in listing a broader path. A hypothetical sketch of this is shown below.
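For illustration only: the table name, layout, column names, and queries below are made up rather than taken from this PR, and the comments describe the behavior of the existing implementation as outlined above.

```rust
use datafusion::prelude::*;

// Hypothetical directory-partitioned layout (illustrative only):
//   data/year=2024/month=01/part-0.parquet
//   data/year=2024/month=02/part-0.parquet
//
// Assume `t` was registered as a listing table over `data/` with partition
// columns `year` and `month`.
async fn pruning_example(ctx: &SessionContext) -> datafusion::error::Result<()> {
    // Filters cover every path element of the subdirectory
    // data/year=2024/month=01/, so the existing implementation can restrict
    // its listing to that prefix.
    ctx.sql("SELECT * FROM t WHERE year = 2024 AND month = 1").await?;

    // The higher-level `year` element is missing from the filters, so the
    // listing cannot be narrowed to a single subdirectory and starts from a
    // broader prefix of the table.
    ctx.sql("SELECT * FROM t WHERE month = 1").await?;

    Ok(())
}
```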
I believe the real-world impact of this omission is likely minimal, at least when using high-latency storage such as S3 or other object stores, especially considering the existing implementation is likely to execute multiple sequential `LIST` operations due to its breadth-first search implementation. The most likely configuration for a table that would be negatively impacted is one that holds many thousands of underlying objects (most cloud stores return recursive list requests with page sizes of many hundreds to thousands of objects) with a relatively shallow partition structure. I may be able to find or build a dataset that fulfills these criteria to test this assertion if there's concern about it.

I believe we could also augment the existing low-level `object_store` interactions to allow listing a prefix on a table, which would allow the same pruning of list operations with the code in this PR (a rough sketch of prefix listing is included after these notes). The downside to this approach is that it either complicates future caching efforts or leads to cache fragmentation in a simpler cache implementation. I didn't include these changes in this PR to avoid the change set becoming too large.

cc @alamb
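To make the prefix-listing idea concrete, here is a minimal sketch at the `object_store` level. It is illustrative only: the prefix path and function are hypothetical rather than part of this PR, and it assumes a recent `object_store` release where `ObjectStore::list` returns a stream directly.

```rust
use std::sync::Arc;

use futures::TryStreamExt;
use object_store::{path::Path, ObjectStore};

// Issue a single recursive LIST scoped to one partition directory instead of
// listing the entire table root. The prefix below is a made-up example.
async fn list_partition_prefix(store: Arc<dyn ObjectStore>) -> object_store::Result<()> {
    let prefix = Path::from("table_root/year=2024/month=01");
    let objects: Vec<_> = store.list(Some(&prefix)).try_collect().await?;
    for meta in objects {
        println!("{} ({} bytes)", meta.location, meta.size);
    }
    Ok(())
}
```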