Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: merge using partition filters #1958

Merged
merged 12 commits into from
Dec 20, 2023
49 changes: 38 additions & 11 deletions crates/deltalake-core/src/operations/merge.rs
Original file line number Diff line number Diff line change
Expand Up @@ -1474,6 +1474,7 @@ mod tests {
use datafusion_expr::lit;
use datafusion_expr::Expr;
use datafusion_expr::LogicalPlanBuilder;
use datafusion_expr::Operator;
use serde_json::json;
use std::collections::HashMap;
use std::ops::Neg;
Expand Down Expand Up @@ -2569,17 +2570,43 @@ mod tests {

assert!(pred.is_some());

let expected_pred = ["C", "X", "B"]
.into_iter()
.map(|id| {
lit(ScalarValue::Utf8(id.to_owned().into())).eq(col(Column {
relation: Some(target_name.clone()),
name: "id".to_owned(),
}))
})
.reduce(Expr::or)
.unwrap();
let split_pred = {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You don't need to implement this yourself. Datafusion provides a utility for this: https://docs.rs/datafusion/latest/datafusion/logical_expr/utils/fn.split_binary_owned.html

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i'll use that for the first layer unpacking the ORs. but I think I still need to do manual mangling on unwrapping the EQ expressions, because Expr doesn't implement Ord so I need to extract the strings to sort them?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah alright if that's the case then I think this is good!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

split_binary_owned is new to v34: https://docs.rs/datafusion/33.0.0/datafusion/?search=split_binary_owned

if #1983 makes it first then I can update it. :)

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can merge it now just want to give some time before I press the merge button. If you want to wait until after the other PR let me know.

fn split(expr: Expr, parts: &mut Vec<(String, String)>) {
match expr {
Expr::BinaryExpr(ex) if ex.op == Operator::Or => {
split(*ex.left, parts);
split(*ex.right, parts);
}
Expr::BinaryExpr(ex) if ex.op == Operator::Eq => {
let col = match *ex.right {
Expr::Column(col) => col.name,
ex => panic!("expected column in pred, got {ex}!"),
};

let value = match *ex.left {
Expr::Literal(ScalarValue::Utf8(Some(value))) => value,
ex => panic!("expected value in predicate, got {ex}!"),
};

parts.push((col, value))
}

expr => panic!("expected either = or OR, got {expr}"),
}
}

let mut parts = vec![];
split(pred.unwrap(), &mut parts);
parts.sort();
parts
};

let expected_pred_parts = [
("id".to_owned(), "B".to_owned()),
("id".to_owned(), "C".to_owned()),
("id".to_owned(), "X".to_owned()),
];

assert_eq!(pred.unwrap(), expected_pred);
assert_eq!(split_pred, expected_pred_parts);
}
}
Loading