Add advanced_parquet_index.rs example of indexing into parquet files (#10701)

* Add `advanced_parquet_index.rs` example of indexing into parquet files

* pre-load page index

* fix comment

* Apply suggestions from code review

Thank you @Weijun-H

Co-authored-by: Alex Huang <[email protected]>

* Add ASCII ART

* Update datafusion-examples/README.md

Co-authored-by: Alex Huang <[email protected]>

* Update datafusion-examples/examples/advanced_parquet_index.rs

Co-authored-by: Alex Huang <[email protected]>

* Improve / clarify comments based on review

* Add page index caveat

---------

Co-authored-by: Alex Huang <[email protected]>
alamb and Weijun-H authored Jun 22, 2024
1 parent 6c0e4fb commit ea46e82
Showing 8 changed files with 695 additions and 5 deletions.
1 change: 1 addition & 0 deletions datafusion-examples/README.md
@@ -45,6 +45,7 @@ cargo run --example csv_sql
- [`advanced_udaf.rs`](examples/advanced_udaf.rs): Define and invoke a more complicated User Defined Aggregate Function (UDAF)
- [`advanced_udf.rs`](examples/advanced_udf.rs): Define and invoke a more complicated User Defined Scalar Function (UDF)
- [`advanced_udwf.rs`](examples/advanced_udwf.rs): Define and invoke a more complicated User Defined Window Function (UDWF)
- [`advanced_parquet_index.rs`](examples/advanced_parquet_index.rs): Creates a detailed secondary index that covers the contents of several parquet files
- [`avro_sql.rs`](examples/avro_sql.rs): Build and run a query plan from a SQL statement against a local AVRO file
- [`catalog.rs`](examples/catalog.rs): Register the table into a custom catalog
- [`csv_sql.rs`](examples/csv_sql.rs): Build and run a query plan from a SQL statement against a local CSV file
664 changes: 664 additions & 0 deletions datafusion-examples/examples/advanced_parquet_index.rs

Large diffs are not rendered by default.

7 changes: 7 additions & 0 deletions datafusion/common/src/column.rs
@@ -127,6 +127,13 @@ impl Column {
})
}

/// return the column's name.
///
/// Note: This ignores the relation and returns the column name only.
pub fn name(&self) -> &str {
&self.name
}

/// Serialize column into a flat name string
pub fn flat_name(&self) -> String {
match &self.relation {
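The difference between the new `name()` accessor and the existing `flat_name()` can be illustrated with a minimal sketch. The `Column` below is a simplified stand-in, not DataFusion's type: it assumes `relation` is a plain `Option<String>`, and the `relation.name` output format is an assumption made for illustration.

```rust
/// Simplified stand-in for DataFusion's `Column` (assumption: `relation`
/// is modeled as a plain `String` rather than a table reference type).
#[derive(Debug, Clone)]
pub struct Column {
    pub relation: Option<String>,
    pub name: String,
}

impl Column {
    /// Returns only the column name, ignoring any relation.
    pub fn name(&self) -> &str {
        &self.name
    }

    /// Returns `relation.name` when a relation is present, otherwise `name`.
    /// (The exact formatting here is an assumption of this sketch.)
    pub fn flat_name(&self) -> String {
        match &self.relation {
            Some(r) => format!("{}.{}", r, self.name),
            None => self.name.clone(),
        }
    }
}
```

In this sketch `name()` borrows the stored string and allocates nothing, which makes it a cheap key for lookups such as matching a column against an external index.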
7 changes: 7 additions & 0 deletions datafusion/common/src/config.rs
@@ -1393,6 +1393,13 @@ pub struct TableParquetOptions {
pub key_value_metadata: HashMap<String, Option<String>>,
}

impl TableParquetOptions {
/// Return new default TableParquetOptions
pub fn new() -> Self {
Self::default()
}
}

impl ConfigField for TableParquetOptions {
fn visit<V: Visit>(&self, v: &mut V, key_prefix: &str, description: &'static str) {
self.global.visit(v, key_prefix, description);
@@ -139,6 +139,11 @@ impl ParquetAccessPlan {
self.set(idx, RowGroupAccess::Skip);
}

/// scan the i-th row group
pub fn scan(&mut self, idx: usize) {
self.set(idx, RowGroupAccess::Scan);
}

/// Return true if the i-th row group should be scanned
pub fn should_scan(&self, idx: usize) -> bool {
self.row_groups[idx].should_scan()
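A rough sketch of how `scan` and `skip` could be combined when an external index already knows which row groups match a query. The types below are simplified stand-ins rather than the DataFusion definitions (the stand-in `RowGroupAccess` has only two variants), and `new_none` and `plan_from_index` are hypothetical helpers introduced for illustration.

```rust
/// Simplified stand-in for the per-row-group access decision.
#[derive(Debug, Clone, PartialEq)]
pub enum RowGroupAccess {
    Skip,
    Scan,
}

impl RowGroupAccess {
    pub fn should_scan(&self) -> bool {
        matches!(self, RowGroupAccess::Scan)
    }
}

/// Simplified stand-in for `ParquetAccessPlan`: one decision per row group.
#[derive(Debug, Clone)]
pub struct ParquetAccessPlan {
    row_groups: Vec<RowGroupAccess>,
}

impl ParquetAccessPlan {
    /// Hypothetical constructor for this sketch: start by skipping everything.
    pub fn new_none(num_row_groups: usize) -> Self {
        Self { row_groups: vec![RowGroupAccess::Skip; num_row_groups] }
    }

    pub fn set(&mut self, idx: usize, access: RowGroupAccess) {
        self.row_groups[idx] = access;
    }

    /// Skip the i-th row group.
    pub fn skip(&mut self, idx: usize) {
        self.set(idx, RowGroupAccess::Skip);
    }

    /// Scan the i-th row group.
    pub fn scan(&mut self, idx: usize) {
        self.set(idx, RowGroupAccess::Scan);
    }

    /// Return true if the i-th row group should be scanned.
    pub fn should_scan(&self, idx: usize) -> bool {
        self.row_groups[idx].should_scan()
    }
}

/// Hypothetical helper: scan only the row groups an external index reports
/// as matching. Out-of-range indexes are ignored.
pub fn plan_from_index(num_row_groups: usize, matching: &[usize]) -> ParquetAccessPlan {
    let mut plan = ParquetAccessPlan::new_none(num_row_groups);
    for &idx in matching {
        if idx < num_row_groups {
            plan.scan(idx);
        }
    }
    plan
}
```

Starting from "skip everything" and opting row groups in suits an index that returns a short list of matches. Starting from "scan everything" and calling `skip` fits pruning that rules row groups out one at a time.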
4 changes: 2 additions & 2 deletions datafusion/core/src/datasource/physical_plan/parquet/mod.rs
@@ -186,9 +186,9 @@ pub use writer::plan_to_parquet;
/// let exec = ParquetExec::builder(file_scan_config).build();
/// ```
///
-/// For a complete example, see the [`parquet_index_advanced` example]).
+/// For a complete example, see the [`advanced_parquet_index` example]).
///
-/// [`parquet_index_advanced` example]: https://github.com/apache/datafusion/blob/main/datafusion-examples/examples/parquet_index_advanced.rs
+/// [`parquet_index_advanced` example]: https://github.com/apache/datafusion/blob/main/datafusion-examples/examples/advanced_parquet_index.rs
///
/// # Execution Overview
///
@@ -54,7 +54,7 @@ impl RowGroupAccessPlanFilter {
Self { access_plan }
}

-/// Return true if there are no row groups to scan
+/// Return true if there are no row groups
pub fn is_empty(&self) -> bool {
self.access_plan.is_empty()
}
10 changes: 8 additions & 2 deletions datafusion/core/src/physical_optimizer/pruning.rs
@@ -471,8 +471,10 @@ pub struct PruningPredicate {
/// Original physical predicate from which this predicate expr is derived
/// (required for serialization)
orig_expr: Arc<dyn PhysicalExpr>,
-/// [`LiteralGuarantee`]s that are used to try and prove a predicate can not
-/// possibly evaluate to `true`.
+/// [`LiteralGuarantee`]s used to try and prove a predicate can not possibly
+/// evaluate to `true`.
+///
+/// See [`PruningPredicate::literal_guarantees`] for more details.
literal_guarantees: Vec<LiteralGuarantee>,
}

@@ -595,6 +597,10 @@ impl PruningPredicate {
}

/// Returns a reference to the literal guarantees
///
/// Note that **All** `LiteralGuarantee`s must be satisfied for the
/// expression to possibly be `true`. If any is not satisfied, the
/// expression is guaranteed to be `null` or `false`.
pub fn literal_guarantees(&self) -> &[LiteralGuarantee] {
&self.literal_guarantees
}
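The "all guarantees must be satisfied" rule documented above can be sketched as a check against an external index that records the distinct values each column holds in a file. Everything below is a simplified stand-in: literals are modeled as `i64` rather than DataFusion's scalar type, and `may_match` and `known_values` are hypothetical names introduced for illustration.

```rust
use std::collections::{HashMap, HashSet};

/// Simplified stand-in: whether the column must be one of, or none of,
/// the listed literals for the predicate to possibly be true.
pub enum Guarantee {
    In,
    NotIn,
}

/// Simplified stand-in for a literal guarantee on a single column.
pub struct LiteralGuarantee {
    pub column: String,
    pub guarantee: Guarantee,
    pub literals: HashSet<i64>,
}

/// Hypothetical helper: returns false only when some guarantee provably
/// cannot be satisfied by the values known for its column, in which case
/// the file can be pruned. Columns missing from `known_values` cannot be
/// ruled out, so they count as satisfied.
pub fn may_match(
    guarantees: &[LiteralGuarantee],
    known_values: &HashMap<String, HashSet<i64>>,
) -> bool {
    guarantees.iter().all(|g| match known_values.get(&g.column) {
        None => true,
        Some(values) => match g.guarantee {
            // At least one stored value must be among the literals.
            Guarantee::In => values.iter().any(|v| g.literals.contains(v)),
            // At least one stored value must fall outside the literals.
            Guarantee::NotIn => values.iter().any(|v| !g.literals.contains(v)),
        },
    })
}
```

Because the guarantees are combined with `all`, a single unsatisfied guarantee is enough to prune, which mirrors the note that the expression is then guaranteed to be `null` or `false`.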