Normalize partitioned and flat object listing #18146
- Refactored helpers related to listing, discovering, and pruning objects based on partitions to normalize the strategy between partitioned and flat tables
```toml
[dev-dependencies]
datafusion-datasource-parquet = { workspace = true }
datafusion = { workspace = true }
```
It looks like this dependency tripped the circular dependency check even though it's a dev dependency for test setup. Is there an alternative mechanism to get access to a `SessionStateBuilder` for testing rather than using the import here?
I agree with your assessment that this is likely to be minimal -- especially given that queries that request thousands of objects will therefore require many thousands of S3 requests for the data files themselves
Thanks @BlakeOrth -- I started reviewing this PR and hope to do more later today
```rust
fn task_ctx(&self) -> Arc<datafusion_execution::TaskContext> {
    unimplemented!()
}
```

```rust
let state = SessionStateBuilder::new().build();
```
In order to avoid circular dependencies (needed to allow datafusion to compile faster), the API needed for the catalog is in the `Session` trait, which is implemented by `SessionState` but can also be implemented by other things.

Thus, in this case the options are:

- Keep the `MockSession` and implement whatever APIs it needs
- Move the tests to the `datafusion` crate (e.g. somewhere in https://github.com/apache/datafusion/blob/main/datafusion/core/tests/core_integration.rs), sketched below
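A minimal sketch of that second option, assuming the test lives in the core `datafusion` crate's integration tests; the test name and body here are hypothetical, not code from this PR:

```rust
// e.g. under datafusion/core/tests/, where depending on the `datafusion`
// crate does not create a circular dev-dependency.
use datafusion::execution::session_state::SessionStateBuilder;

#[test]
fn builds_real_session_state_for_listing_tests() {
    // Use a real SessionState instead of a hand-rolled MockSession.
    let state = SessionStateBuilder::new().with_default_features().build();

    // ...exercise the listing/pruning helpers under test with `state`...
    let _ = state;
}
```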
Which issue does this PR close?

- `ListFilesCache` to be available for partitioned tables #17211

It's not yet clear to me if this will fully close the above issue, or if it's just the first step. I think there may be more work to do, so I'm not going to have this auto-close the issue.
Rationale for this change
tl;dr of the issue: normalizing the access pattern(s) for objects in partitioned tables should not only reduce the number of requests to a backing object store, but also allow any existing and/or future caching mechanisms to apply equally to both directory-partitioned and flat tables.
List requests on `main`:

List requests for this PR:
What changes are included in this PR?
Are these changes tested?
Yes. The internal methods that have been modified are covered by existing tests.
Are there any user-facing changes?
No
Additional Notes
I want to surface that I believe there is a chance of a performance regression for certain queries against certain tables. One performance-related mechanism the existing code implements, but this code currently omits, is (potentially) reducing the number of partitions listed based on query filters. In order for the existing code to exercise this optimization, the query filters must contain all the path elements of a subdirectory as column filters.
For example, given a table with a directory-partitioning structure, a query whose filters cover all the path elements of a subdirectory will only list that subdirectory, whereas a query whose filters are missing one of the higher-level path elements results in listing a broader path. A hypothetical sketch of this is shown below.
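For illustration only: the table name, layout, column names, and queries below are made up rather than taken from this PR, and the comments describe the behavior of the existing implementation as outlined above.

```rust
use datafusion::prelude::*;

// Hypothetical directory-partitioned layout (illustrative only):
//   data/year=2024/month=01/part-0.parquet
//   data/year=2024/month=02/part-0.parquet
//
// Assume `t` was registered as a listing table over `data/` with partition
// columns `year` and `month`.
async fn pruning_example(ctx: &SessionContext) -> datafusion::error::Result<()> {
    // Filters cover every path element of the subdirectory
    // data/year=2024/month=01/, so the existing implementation can restrict
    // its listing to that prefix.
    ctx.sql("SELECT * FROM t WHERE year = 2024 AND month = 1").await?;

    // The higher-level `year` element is missing from the filters, so the
    // listing cannot be narrowed to a single subdirectory and starts from a
    // broader prefix of the table.
    ctx.sql("SELECT * FROM t WHERE month = 1").await?;

    Ok(())
}
```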
I believe the real-world impact of this omission is likely minimal, at least when using high-latency storage such as S3 or other object stores, especially considering the existing implementation is likely to execute multiple sequential `LIST` operations due to its breadth-first search implementation. The most likely configuration for a table that would be negatively impacted is one that holds many thousands of underlying objects (most cloud stores return recursive list requests with page sizes of many hundreds to thousands of objects) with a relatively shallow partition structure. I may be able to find or build a dataset that fulfills these criteria to test this assertion if there's concern about it.

I believe we could also augment the existing low-level `object_store` interactions to allow listing a prefix on a table, which would allow the same pruning of list operations with the code in this PR (a rough sketch of prefix listing is included after these notes). The downside to this approach is that it either complicates future caching efforts or leads to cache fragmentation in a simpler cache implementation. I didn't include these changes in this PR to avoid the change set becoming too large.

cc @alamb
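To make the prefix-listing idea concrete, here is a minimal sketch at the `object_store` level. It is illustrative only: the prefix path and function are hypothetical rather than part of this PR, and it assumes a recent `object_store` release where `ObjectStore::list` returns a stream directly.

```rust
use std::sync::Arc;

use futures::TryStreamExt;
use object_store::{path::Path, ObjectStore};

// Issue a single recursive LIST scoped to one partition directory instead of
// listing the entire table root. The prefix below is a made-up example.
async fn list_partition_prefix(store: Arc<dyn ObjectStore>) -> object_store::Result<()> {
    let prefix = Path::from("table_root/year=2024/month=01");
    let objects: Vec<_> = store.list(Some(&prefix)).try_collect().await?;
    for meta in objects {
        println!("{} ({} bytes)", meta.location, meta.size);
    }
    Ok(())
}
```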