feat: initial logical to physical compile #792

Merged
merged 6 commits into from
Oct 5, 2023
Changes from 5 commits
15 changes: 15 additions & 0 deletions Cargo.lock


15 changes: 15 additions & 0 deletions crates/sparrow-backend/Cargo.toml
@@ -11,12 +11,27 @@ Compilation backend for Kaskada queries.

[dependencies]
arrow-schema.workspace = true
bitvec.workspace = true
derive_more.workspace = true
egg.workspace = true
enum-as-inner.workspace = true
error-stack.workspace = true
hashbrown.workspace = true
index_vec.workspace = true
itertools.workspace = true
rand.workspace = true
smallvec.workspace = true
sparrow-arrow = { path = "../sparrow-arrow" }
sparrow-core = { path = "../sparrow-core" }
sparrow-expressions = { path = "../sparrow-expressions" }
sparrow-logical = { path = "../sparrow-logical" }
sparrow-physical = { path = "../sparrow-physical" }
uuid.workspace = true
static_init.workspace = true
tracing.workspace = true

[dev-dependencies]
insta.workspace = true

[lib]
doctest = false
23 changes: 23 additions & 0 deletions crates/sparrow-backend/src/compile.rs
@@ -0,0 +1,23 @@
use std::borrow::Cow;

use crate::logical_to_physical::LogicalToPhysical;
use crate::Error;

/// Options for compiling logical plans to physical plans.
#[derive(Clone, Debug, Default)]
pub struct CompileOptions {}

/// Compile a logical plan to a physical execution plan.
pub fn compile(
root: &sparrow_logical::ExprRef,
options: Option<&CompileOptions>,
) -> error_stack::Result<sparrow_physical::Plan, Error> {
let _options = if let Some(options) = options {
Cow::Borrowed(options)
} else {
Cow::Owned(CompileOptions::default())
};

let physical = LogicalToPhysical::new().apply(root)?;
Ok(physical)
}
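
The `Option<&CompileOptions>` to `Cow` step in `compile` can be looked at in isolation. The following is a minimal sketch, not the PR's code: `Options` and its `max_steps` field are hypothetical stand-ins (the real `CompileOptions` is empty at this point), and `resolve_options` is an invented helper name. It borrows the caller's options when they are provided and only constructs an owned default otherwise.

```rust
use std::borrow::Cow;

/// Hypothetical stand-in for `CompileOptions`; the `max_steps` field is
/// invented purely for illustration.
#[derive(Clone, Debug, Default, PartialEq)]
pub struct Options {
    pub max_steps: usize,
}

/// Borrow the caller's options if present, otherwise own a default value.
pub fn resolve_options(options: Option<&Options>) -> Cow<'_, Options> {
    match options {
        Some(options) => Cow::Borrowed(options),
        None => Cow::Owned(Options::default()),
    }
}
```

Either way the caller gets something that dereferences to `Options`, so later code does not need to care whether defaults were used.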
23 changes: 23 additions & 0 deletions crates/sparrow-backend/src/error.rs
@@ -0,0 +1,23 @@
use std::borrow::Cow;

#[derive(derive_more::Display, Debug)]
pub enum Error {
#[display(fmt = "no instruction named '{_0}'")]
NoSuchInstruction(String),
#[display(fmt = "invalid logical plan: {_0}")]
InvalidLogicalPlan(Cow<'static, str>),
#[display(fmt = "internal error: {_0}")]
Internal(Cow<'static, str>),
}

impl Error {
pub fn invalid_logical_plan(message: impl Into<Cow<'static, str>>) -> Self {
Self::InvalidLogicalPlan(message.into())
}

pub fn internal(message: impl Into<Cow<'static, str>>) -> Self {
Self::Internal(message.into())
}
}

impl error_stack::Context for Error {}
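
The `impl Into<Cow<'static, str>>` constructors let callers pass either a string literal (no allocation) or a formatted `String`. A minimal std-only sketch of that idea follows; `SketchError` is a hypothetical name, and `Display` is written by hand here as an assumption, where the PR derives it with `derive_more`.

```rust
use std::borrow::Cow;
use std::fmt;

/// Simplified, hypothetical error type mirroring two of the variants above.
#[derive(Debug)]
pub enum SketchError {
    InvalidLogicalPlan(Cow<'static, str>),
    Internal(Cow<'static, str>),
}

impl SketchError {
    /// Accepts `&'static str` (borrowed) or `String` (owned).
    pub fn invalid_logical_plan(message: impl Into<Cow<'static, str>>) -> Self {
        Self::InvalidLogicalPlan(message.into())
    }

    /// Accepts `&'static str` (borrowed) or `String` (owned).
    pub fn internal(message: impl Into<Cow<'static, str>>) -> Self {
        Self::Internal(message.into())
    }
}

// Hand-written `Display` (assumption: stands in for the `derive_more` attributes).
impl fmt::Display for SketchError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            Self::InvalidLogicalPlan(m) => write!(f, "invalid logical plan: {m}"),
            Self::Internal(m) => write!(f, "internal error: {m}"),
        }
    }
}
```

For example, `SketchError::internal("unreachable state")` stores a borrowed literal, while `SketchError::invalid_logical_plan(format!(...))` stores the owned `String`.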
6 changes: 6 additions & 0 deletions crates/sparrow-backend/src/exprs.rs
@@ -0,0 +1,6 @@
mod expr_lang;
mod expr_pattern;
mod expr_vec;

pub(crate) use expr_pattern::*;
pub(crate) use expr_vec::*;
79 changes: 79 additions & 0 deletions crates/sparrow-backend/src/exprs/expr_lang.rs
@@ -0,0 +1,79 @@
use arrow_schema::DataType;
use egg::Id;
use smallvec::SmallVec;
use sparrow_arrow::scalar_value::ScalarValue;

use crate::Error;

#[derive(Hash, PartialOrd, Ord, PartialEq, Eq, Clone, Debug)]
pub(crate) struct ExprLang {
/// The name of the instruction being applied by this expression.
///
/// Similar to an opcode or function.
///
/// Owned strings should generally be interned to the corresponding static string.
pub name: &'static str,
/// Literal arguments to the expression.
pub literal_args: SmallVec<[ScalarValue; 2]>,
/// Arguments to the expression.
pub args: SmallVec<[egg::Id; 2]>,
// TODO: This includes the `DataType` in the enodes.
// This is necessary for ensuring that cast instructions to different types are treated
// as distinct, however it is potentially risky for writing simplifications, since the
// patterns won't have specific types. We may need to make this optional, so only the
// cast instruction has to specify it, and then rely on analysis to infer the types.
pub result_type: DataType,
}

// `egg` requires `Display` for `ExprLang`. Oddly, it is used to pretty print
// only the kind (the instruction name).
impl std::fmt::Display for ExprLang {
fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result {
self.name.fmt(f)
}
}

impl egg::Language for ExprLang {
fn children(&self) -> &[egg::Id] {
&self.args
}

fn children_mut(&mut self) -> &mut [egg::Id] {
&mut self.args
}

fn matches(&self, other: &Self) -> bool {
// Note: As per
// https://egraphs-good.github.io/egg/egg/trait.Language.html#tymethod.matches,
// "This should only consider the operator, not the children `Id`s".
//
// egg itself will check whether the arguments are *equivalent*.
//
// Some instructions (especially `cast`) depend on the `result_type` to
// determine the operation being performed.

// `(field-ref["foo"] ?base)`
// `(cast[i64] ?base)`

self.name == other.name
&& self.literal_args == other.literal_args
&& self.result_type == other.result_type
}
}

impl egg::FromOp for ExprLang {
type Error = error_stack::Report<Error>;

fn from_op(op: &str, children: Vec<Id>) -> Result<Self, Self::Error> {
let name = sparrow_expressions::intern_name(op)
.ok_or_else(|| Error::NoSuchInstruction(op.to_owned()))?;

let args = SmallVec::from_vec(children);
Ok(Self {
name,
literal_args: smallvec::smallvec![],
args,
result_type: DataType::Null,
})
}
}
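
The distinction `matches` draws (compare the operator, ignore the child ids) can be shown without `egg`. This is a simplified sketch under stated assumptions: `Node` is a hypothetical type, children are plain `usize` indices instead of `egg::Id`, and the result type is a string instead of an Arrow `DataType`.

```rust
/// Hypothetical, simplified node; not the PR's `ExprLang`.
#[derive(Clone, Debug, PartialEq, Eq)]
pub struct Node {
    pub name: &'static str,
    pub literal_args: Vec<String>,
    pub args: Vec<usize>,
    pub result_type: &'static str,
}

impl Node {
    /// Compare only the operator: name, literal arguments and result type.
    /// The child ids in `args` are deliberately ignored.
    pub fn matches(&self, other: &Node) -> bool {
        self.name == other.name
            && self.literal_args == other.literal_args
            && self.result_type == other.result_type
    }
}
```

Two `cast` nodes with the same result type match even when their children differ, whereas a `cast` to `i64` and a `cast` to `f64` do not, which is why the result type takes part in the comparison.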
190 changes: 190 additions & 0 deletions crates/sparrow-backend/src/exprs/expr_pattern.rs
@@ -0,0 +1,190 @@
use std::str::FromStr;

use smallvec::{smallvec, SmallVec};
use sparrow_arrow::scalar_value::ScalarValue;

use crate::exprs::expr_lang::ExprLang;
use crate::exprs::ExprVec;
use crate::Error;

/// A representation of an expression with "holes" that may be instantiated.
///
/// For instance, while `(add (source["uuid"]) (literal 1))` is an expression that adds
/// the literal `1` to the identified `source`, `(add ?input (literal 1))` is a pattern
/// with a hole (placeholder) named `?input` that adds 1 to whatever we substitute in for
/// `?input`.
#[derive(Debug, Default)]
pub(crate) struct ExprPattern {
pub(super) expr: egg::PatternAst<ExprLang>,
}

#[static_init::dynamic]
pub(crate) static INPUT_VAR: egg::Var = egg::Var::from_str("?input").unwrap();

impl ExprPattern {
/// Create a new `ExprPattern` which is an identity expression referencing the input.
pub fn new_input() -> error_stack::Result<Self, Error> {
let mut exprs = ExprPattern::default();
exprs.add_var(*INPUT_VAR)?;
Ok(exprs)
}

/// Create an expr pattern applying the named instruction to the given arguments.
///
/// Returns the resulting pattern.
pub fn new_instruction(
name: &'static str,
literal_args: smallvec::SmallVec<[ScalarValue; 2]>,
args: Vec<ExprPattern>,
data_type: arrow_schema::DataType,
) -> error_stack::Result<ExprPattern, Error> {
// NOTE: This adds the pattern for each argument to an `EGraph`
// to add an instruction. This may be overkill, but does simplify
// (a) de-duplicating expressions that appear in multiple arguments
// (b) managing things like "all of these arguments should have a
// single input".
//
// If the use of the EGraph and extractor proves to be too expensive
// we could do this "combine while de-duplicating" ourselves.
let mut graph = egg::EGraph::<egg::ENodeOrVar<ExprLang>, ()>::default();
let mut arg_ids = SmallVec::with_capacity(args.len());
for arg in args {
let id = graph.add_expr(&arg.expr);
arg_ids.push(id);
}

// We can only extract a single expression, so we create one.
// This is why we need to know the instruction to create, rather than
// just returning the resulting `egg::Id` for each argument.
let output = graph.add(egg::ENodeOrVar::ENode(ExprLang {
name,
literal_args,
args: arg_ids,
result_type: data_type,
}));

let cost_function = egg::AstSize;
let extractor = egg::Extractor::new(&graph, cost_function);
let (_best_cost, expr) = extractor.find_best(output);

Ok(ExprPattern { expr })
}

/// Instantiate the pattern.
///
/// Replaces `?input` with the `input` instruction.
pub fn instantiate(
&self,
input_type: arrow_schema::DataType,
) -> error_stack::Result<ExprVec, Error> {
// Note: Instead of instantiating the pattern ourselves (replacing `?input` with the
// input expression) we instead make an `EGraph`, add the input expression, and then
// instantiate the pattern into that.
//
// This lets us extract the *best* (shortest) expression, rather than copying all of
// the pattern. One nice thing about this is that the `EGraph` will de-duplicate
// equivalent operations, etc.
let mut graph = egg::EGraph::<ExprLang, ()>::default();

let input_id = graph.add(ExprLang {
name: "input",
literal_args: smallvec![],
args: smallvec![],
result_type: input_type,
});
let mut subst = egg::Subst::with_capacity(1);
subst.insert(*INPUT_VAR, input_id);

let result = graph.add_instantiation(&self.expr, &subst);

let cost_function = egg::AstSize;
let extractor = egg::Extractor::new(&graph, cost_function);
let (_best_cost, expr) = extractor.find_best(result);

Ok(ExprVec { expr })
}

pub fn len(&self) -> usize {
self.expr.as_ref().len()
}

/// Return true if this pattern just returns `?input`.
///
/// This is used to identify expression patterns that "just pass the value through".
/// For instance, a projection step with the `identity` pattern is a noop and can
/// be removed.
pub fn is_identity(&self) -> bool {
// TODO: We may want to make this more intelligent and detect cases where
// the expression is *equivalent* to the identity. But for now, we think
// we can treat that as an optimization performed by a later pass.
let instructions = self.expr.as_ref();
instructions.len() == 1 && instructions[0] == egg::ENodeOrVar::Var(*INPUT_VAR)
}

/// Return the `egg::Id` corresponding to the last expression.
pub fn last_value(&self) -> egg::Id {
egg::Id::from(self.expr.as_ref().len() - 1)
}

pub fn last(&self) -> &egg::ENodeOrVar<ExprLang> {
self.expr.as_ref().last().expect("non empty")
}

/// Add a physical expression.
///
/// Args:
/// - name: The name of the operation to apply.
/// - literal_args: Literal arguments to the physical expression.
/// - args: The actual arguments to use.
/// - data_type: The result type of the expression.
///
/// Returns the `egg::Id` referencing the expression.
pub fn add_instruction(
&mut self,
name: &'static str,
literal_args: smallvec::SmallVec<[ScalarValue; 2]>,
args: smallvec::SmallVec<[egg::Id; 2]>,
data_type: arrow_schema::DataType,
) -> error_stack::Result<egg::Id, Error> {
let expr = self.expr.add(egg::ENodeOrVar::ENode(ExprLang {
name,
literal_args,
args,
result_type: data_type,
}));

Ok(expr)
}

/// Add a variable to the expression.
pub fn add_var(&mut self, var: egg::Var) -> error_stack::Result<egg::Id, Error> {
Ok(self.expr.add(egg::ENodeOrVar::Var(var)))
}

/// Add the given pattern to this pattern, applying the substitution.
pub fn add_pattern(
&mut self,
pattern: &ExprPattern,
subst: &egg::Subst,
) -> error_stack::Result<egg::Id, Error> {
let mut new_ids = Vec::with_capacity(pattern.len());
for expr in pattern.expr.as_ref() {
let new_id = match expr {
egg::ENodeOrVar::Var(var) => match subst.get(*var) {
Some(existing_id) => *existing_id,
None => self.add_var(*var)?,
},
egg::ENodeOrVar::ENode(node) => {
let mut node = node.clone();
node.args = node
.args
.into_iter()
.map(|arg| new_ids[usize::from(arg)])
.collect();
self.expr.add(egg::ENodeOrVar::ENode(node))
}
};
new_ids.push(new_id);
}
Ok(*new_ids.last().unwrap())
}
}
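
The id remapping in `add_pattern` is easier to follow on a plain vector. Below is a std-only sketch of the same idea, with assumptions labeled: `Node` is a hypothetical flat expression node whose children are indices of earlier entries, variables are identified by name rather than `egg::Var`, and an empty pattern yields `None` instead of panicking.

```rust
use std::collections::HashMap;

/// Hypothetical flat expression node; children refer to earlier indices.
#[derive(Clone, Debug, PartialEq, Eq)]
pub enum Node {
    Var(&'static str),
    Op { name: &'static str, args: Vec<usize> },
}

/// Append `pattern` to `target`, replacing substituted variables and
/// remapping child indices into `target`'s index space.
///
/// Returns the index in `target` of the pattern's last node.
pub fn add_pattern(
    target: &mut Vec<Node>,
    pattern: &[Node],
    subst: &HashMap<&'static str, usize>,
) -> Option<usize> {
    // new_ids[i] records where pattern node `i` ended up in `target`.
    let mut new_ids: Vec<usize> = Vec::with_capacity(pattern.len());
    for node in pattern {
        let new_id = match node {
            Node::Var(var) => match subst.get(var) {
                // Substituted variable: reuse the existing node, add nothing.
                Some(existing) => *existing,
                // Unbound variable: copy it over as-is.
                None => {
                    target.push(node.clone());
                    target.len() - 1
                }
            },
            Node::Op { name, args } => {
                // Children always precede their parent, so they are already mapped.
                let args = args.iter().map(|&arg| new_ids[arg]).collect();
                target.push(Node::Op { name: *name, args });
                target.len() - 1
            }
        };
        new_ids.push(new_id);
    }
    new_ids.last().copied()
}
```

The sketch relies on the same invariant as the original: nodes are stored children-first, so every child index has an entry in `new_ids` by the time its parent is visited.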