use a single row_count column during predicate pruning instead of one per column #14295

Open

adriangb wants to merge 2 commits into main from only-require-a-single-row-count-2
Conversation

@adriangb (Contributor) commented Jan 25, 2025

@github-actions bot added the optimizer (Optimizer rules) and core (Core DataFusion crate) labels on Jan 25, 2025
@adriangb (Contributor, Author) commented:
@alamb this seems to be a simple enough change that is almost self-contained: the only breaking change I see would be in PredicateRewriter, which we recently introduced in #12850. I can't be sure of course, but I'd guess we (Pydantic) are the only ones using it, and we'd be happy to absorb the breaking change. The alternative would be to add an option to PredicateRewriter to control this behavior, which requires more code, etc. Up to you 😄.
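For illustration, the opt-in alternative might look something like this (a hypothetical option struct; this is not the real PredicateRewriter API, and not what this PR implements):

    // Hypothetical sketch of the rejected alternative: an option on the
    // rewriter that preserves per-column row_count fields by default.
    // The PR instead changes the rewrite behavior unconditionally.
    #[derive(Default)]
    struct RewriterOptions {
        /// When true, emit one shared `row_count` field instead of one
        /// `{column}_row_count` field per referenced column.
        single_row_count_column: bool,
    }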

@adriangb (Contributor, Author) commented:
I want to point out that this works because of how the RecordBatch is generated:

    for (column, statistics_type, stat_field) in required_columns.iter() {
        let column = Column::from_name(column.name());
        let data_type = stat_field.data_type();
        let num_containers = statistics.num_containers();
        // Fetch the requested statistic for this column; each arm yields
        // one array with a value per container.
        let array = match statistics_type {
            StatisticsType::Min => statistics.min_values(&column),
            StatisticsType::Max => statistics.max_values(&column),
            StatisticsType::NullCount => statistics.null_counts(&column),
            StatisticsType::RowCount => statistics.row_counts(&column),
        };
        // Missing statistics become an all-null array of the right length
        let array = array.unwrap_or_else(|| new_null_array(data_type, num_containers));
        if num_containers != array.len() {
            return internal_err!(
                "mismatched statistics length. Expected {}, got {}",
                num_containers,
                array.len()
            );
        }
        // cast statistics array to required data type (e.g. parquet
        // provides timestamp statistics as "Int64")
        let array = arrow::compute::cast(&array, data_type)?;
        fields.push(stat_field.clone());
        arrays.push(array);
    }

Since the batch is generated from the columns tracked by RequiredColumns, we can rename the field internally with no consequences.

This should also save some work in creating the array, make scanning the record batch faster, etc.
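A minimal sketch of the resulting naming scheme (illustrative only: the enum mirrors DataFusion's StatisticsType, but the helper is hypothetical). Per-column statistics keep their {column}_{stat} field names, while every RowCount request maps to the same shared row_count field, so the statistics batch carries that array once instead of once per column:

    // Sketch only: stat_field_name is a hypothetical helper showing the
    // naming scheme, not DataFusion's actual internals.
    #[derive(Clone, Copy)]
    enum StatisticsType {
        Min,
        Max,
        NullCount,
        RowCount,
    }

    fn stat_field_name(column_name: &str, t: StatisticsType) -> String {
        match t {
            StatisticsType::Min => format!("{column_name}_min"),
            StatisticsType::Max => format!("{column_name}_max"),
            StatisticsType::NullCount => format!("{column_name}_null_count"),
            // Shared across all columns: a container's row count does not
            // depend on which column's predicate asked for it.
            StatisticsType::RowCount => "row_count".to_string(),
        }
    }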

@github-actions bot added the sqllogictest (SQL Logic Tests (.slt)) label on Jan 25, 2025
@adriangb force-pushed the only-require-a-single-row-count-2 branch from adf5e7c to da0f264 on January 25, 2025 at 18:55
@alamb (Contributor) left a comment:
Thank you @adriangb -- I think this is a nice improvement

My only concern is that the RequiredColumns structure can now have repeated field names, which could potentially result in an error when creating a RecordBatch to feed to the pruning statistics.

I wrote a test case showing the issue I am worried about below.

That said, since all the tests are passing it evidently works, and this PR is good to go in my mind.

Maybe as a follow-on PR we can make an API-breaking change (perhaps via deprecation) to update PruningStatistics so that row_count is no longer a function of a column.

    #[test]
    fn test_unique_field_names() {
        // c1 = 100 AND c2 = 200
        let schema: SchemaRef = Arc::new(Schema::new(vec![
            Field::new("c1", DataType::Int32, true),
            Field::new("c2", DataType::Int32, true),
        ]));
        let expr = col("c1").eq(lit(100)).and(col("c2").eq(lit(200)));
        let expr = logical2physical(&expr, &schema);
        let p = PruningPredicate::try_new(expr, Arc::clone(&schema)).unwrap();
        // note pruning expression refers to row_count twice
        assert_eq!(
            "c1_null_count@2 != row_count@3 AND c1_min@0 <= 100 AND 100 <= c1_max@1 AND c2_null_count@6 != row_count@7 AND c2_min@4 <= 200 AND 200 <= c2_max@5",
            p.predicate_expr.to_string()
        );

        // Fields in required schema should be unique, otherwise when creating batches
        // it will fail because of duplicate field names
        let mut fields = HashSet::new();
        for (_col, _ty, field) in p.required_columns().iter() {
            let was_new = fields.insert(field);
            if !was_new {
                panic!(
                    "Duplicate field in required schema: {:?}. Previous fields:\n{:#?}",
                    field, fields
                );
            }
        }
    }

Fails like this:

thread 'pruning::tests::test_unique_field_names' panicked at datafusion/physical-optimizer/src/pruning.rs:4193:17:
Duplicate field in required schema: Field { name: "row_count", data_type: UInt64, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }. Previous fields:
{
    Field {
        name: "c2_min",
        data_type: Int32,
        nullable: true,
        dict_id: 0,
        dict_is_ordered: false,
        metadata: {},
    },
    Field {
        name: "c1_max",
        data_type: Int32,
        nullable: true,
        dict_id: 0,
        dict_is_ordered: false,
        metadata: {},
    },
    Field {
        name: "c1_null_count",
        data_type: UInt64,
        nullable: true,
        dict_id: 0,
        dict_is_ordered: false,
        metadata: {},
    },
    Field {
        name: "row_count",
        data_type: UInt64,
        nullable: true,
        dict_id: 0,
        dict_is_ordered: false,
        metadata: {},
    },
    Field {
        name: "c2_max",
        data_type: Int32,
        nullable: true,
        dict_id: 0,
        dict_is_ordered: false,
        metadata: {},
    },
    Field {
        name: "c1_min",
        data_type: Int32,
        nullable: true,
        dict_id: 0,
        dict_is_ordered: false,
        metadata: {},
    },
    Field {
        name: "c2_null_count",
        data_type: UInt64,
        nullable: true,
        dict_id: 0,
        dict_is_ordered: false,
        metadata: {},
    },
}
stack backtrace:
   0: rust_begin_unwind
             at /rustc/e71f9a9a98b0faf423844bf0ba7438f29dc27d58/library/std/src/panicking.rs:665:5
   1: core::panicking::panic_fmt
             at /rustc/e71f9a9a98b0faf423844bf0ba7438f29dc27d58/library/core/src/panicking.rs:76:14
   2: datafusion_physical_optimizer::pruning::tests::test_unique_field_names
             at ./src/pruning.rs:4193:17
   3: datafusion_physical_optimizer::pruning::tests::test_unique_field_names::{{closure}}
             at ./src/pruning.rs:4172:33
   4: core::ops::function::FnOnce::call_once
             at /Users/andrewlamb/.rustup/toolchains/stable-aarch64-apple-darwin/lib/rustlib/src/rust/library/core/src/ops/function.rs:250:5
   5: core::ops::function::FnOnce::call_once
             at /rustc/e71f9a9a98b0faf423844bf0ba7438f29dc27d58/library/core/src/ops/function.rs:250:5
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.

On this hunk of the diff:

    StatisticsType::Max => "max",
    StatisticsType::NullCount => "null_count",
    StatisticsType::RowCount => "row_count",
    let column_name = column.name();
The only question I have is that PruningStatistics is still defined in terms of Column:

    /// [`UInt64Array`]: arrow::array::UInt64Array
    fn row_counts(&self, column: &Column) -> Option<ArrayRef>;

So it seems that when building the schema for required columns there will be multiple entries for row_count (I provided a test elsewhere):

pub struct RequiredColumns {
    /// The statistics required to evaluate this predicate:
    /// * The unqualified column in the input schema
    /// * Statistics type (e.g. Min or Max or Null_Count)
    /// * The field the statistics value should be placed in for
    ///   pruning predicate evaluation (e.g. `min_value` or `max_value`)
    columns: Vec<(phys_expr::Column, StatisticsType, Field)>,
}
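For reference, a sketch of what that follow-on change might look like (speculative, not a merged design; the real PruningStatistics trait has more methods): since the row count is a property of the container rather than of any column, row_counts could drop its column argument entirely:

    use arrow::array::ArrayRef;

    // Speculative trait shape, for illustration only.
    trait PruningStatistics {
        /// Number of containers (e.g. files or row groups) being pruned.
        fn num_containers(&self) -> usize;

        /// The row count of each container, as a single array; no longer
        /// keyed by column.
        fn row_counts(&self) -> Option<ArrayRef>;
    }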

Labels
core Core DataFusion crate optimizer Optimizer rules sqllogictest SQL Logic Tests (.slt)
Successfully merging this pull request may close these issues.

Why does PruningPredicate reference a row_count for each column?