Data skipping correctly handles nested columns and column mapping (#512)
## What changes are proposed in this pull request?

The existing implementation of data skipping has two flaws:

1. Only top-level columns are considered for data skipping (due to how the stats schema is derived)
2. Column mapping is completely ignored -- the data skipping expression with its logical column references is evaluated against physical data. At best this means data skipping is ineffective (because the column doesn't exist in the parquet). At worst, we would attempt to skip over the wrong column (e.g. if a table was upgraded to use column mapping because a column was dropped and re-added, we'd wrongly work with the old/dropped column because its logical and physical names were the same)

It turns out the two issues are intertwined, because both column mapping and nested column references need a schema traversal. So while we _could_ solve them separately, it's actually easier to just do it all at once.

Also -- the data skipping predicate we pass around needs an associated "referenced" schema (in order to build a stats schema); if that schema is empty, it means the data skipping predicate is "static" and should be evaluated once to decide whether to even initiate a log scan. That adds some complexity to the log replay path. But it also allows a predicate like the following to be treated as static, in spite of appearing to reference table columns:

```sql
WHERE my_col < 10 AND FALSE
```

### This PR affects the following public APIs

- `scan::Scan::predicate` renamed as `physical_predicate` to eliminate ambiguity
- `scan::log_replay::scan_action_iter` now takes fewer (and different) params.

## How was this change tested?

Existing unit tests, plus new unit tests that verify the new behavior.
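
To illustrate the schema traversal described above, here is a minimal sketch of resolving a logical (possibly nested) column path to its physical path under column mapping. It is not the code from this PR: the `Field` type, its constructors, and `resolve_physical_path` are hypothetical names invented for the example, and the physical names (`col-1` etc.) are made up.

```rust
/// Hypothetical schema node (not the kernel's real type): each field has a
/// logical name seen by queries and a physical name used in the parquet file.
#[derive(Debug, Clone)]
struct Field {
    logical: String,
    physical: String,
    /// Child fields for a struct column; empty for a leaf column.
    children: Vec<Field>,
}

impl Field {
    fn leaf(logical: &str, physical: &str) -> Self {
        Field {
            logical: logical.to_string(),
            physical: physical.to_string(),
            children: Vec::new(),
        }
    }

    fn nested(logical: &str, physical: &str, children: Vec<Field>) -> Self {
        Field {
            logical: logical.to_string(),
            physical: physical.to_string(),
            children,
        }
    }
}

/// Walk the schema one level per path segment, matching on logical names and
/// collecting the corresponding physical names. Returns `None` if any segment
/// does not exist in the logical schema, in which case the column cannot be
/// used for data skipping.
fn resolve_physical_path(schema: &[Field], logical_path: &[&str]) -> Option<Vec<String>> {
    let mut fields = schema;
    let mut physical = Vec::with_capacity(logical_path.len());
    for segment in logical_path {
        let field = fields.iter().find(|f| f.logical == *segment)?;
        physical.push(field.physical.clone());
        fields = field.children.as_slice();
    }
    Some(physical)
}
```

Because lookup is always by logical name, a physical name that happens to look like a column name (for example, the name of a dropped column) is never matched by accident; only the mapping recorded in the schema decides which physical column is read.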
Showing 9 changed files with 465 additions and 119 deletions.