Intermediate representation #6973

frankmcsherry started this conversation in Technical musings
Replies: 1 comment
Another consideration: I imagine what we'll see from the SQL front end is a sequence of SQL queries, many of which are either identical, or nearly identical, to prior queries. If we can avoid screwing up our ability to notice that through any eager transformation, possibly for the best (maybe?). For example, I could easily imagine a bunch of queries which are identical except for some constants in expressions (e.g. "look up data for Alice; look up data for Bob; look up ..."). We probably want to eventually notice this and transform the queries to maintained streaming look-ups.
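One cheap way to notice queries that differ only in constants (a hypothetical sketch, not anything in the codebase; `query_template` and its regex are invented for illustration) is to normalize literals out of the query text before comparing:

```python
import re

def query_template(sql: str):
    """Replace string and numeric literals with placeholders, returning the
    normalized template and the extracted constants. Purely illustrative:
    a real implementation would work on the parsed query, not its text."""
    constants = []

    def repl(match):
        constants.append(match.group(0))
        return "?"

    # Single-quoted strings first, then bare numbers (very rough lexing).
    pattern = r"'(?:[^']|'')*'|\b\d+(?:\.\d+)?\b"
    template = re.sub(pattern, repl, sql)
    return template, constants

t1, c1 = query_template("SELECT * FROM people WHERE name = 'Alice'")
t2, c2 = query_template("SELECT * FROM people WHERE name = 'Bob'")
assert t1 == t2 == "SELECT * FROM people WHERE name = ?"
assert (c1, c2) == (["'Alice'"], ["'Bob'"])
```

A real implementation would extract constants from the parsed AST rather than the raw text, but the idea of keying maintained dataflows by a parameterized template is the same.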
---
Our intermediate representation, the `dataflow::Plan` type, was not chosen with any particular eye towards plan transformations, optimization, or potentially anything other than getting the dataflow built. We should probably be more intentional about our choice here, to make sure that we preserve as much information about the query as possible, and present it in an easy-to-use format.

Here are some thoughts, which are not yet actionably structured. We may need to try some of these out to see if they work for us.
There are several things we could discuss, but to summarize:

- Relational joins vs. outer joins.
- Multiway joins.
- Delta joins.
- Moving from `Datum` records to `[Datum]` records.

None of these are "should" choices. I am not sure we have enough experience with the tradeoffs yet.
Relational joins vs outer joins
We currently use the `Join` plan stage for both inner and outer joins. We may want to reduce outer joins down to more elemental operators, for example expressing `x.left_outer_join(y)` as an inner join plus null-padded records for keys of `x` absent from `y`. This might give us more insight into the shareable arrangements available to us (e.g. `y.keys().distinct()` and `left`).

Downsides include the loss of relational information. If we could have realized some structure from `x.left_outer_join(y)`
that is no longer visible from the reduced representation, that is a problem.

Multiway joins
It seems like multiway joins are a useful abstraction, as at some point in the analysis process we will probably want to explore different join orders, or plans like delta queries and worst-case optimal joins. It seems appealing to have all of this information co-located in a single operator, when possible.
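To make the join-order point concrete, here is a toy sketch in plain Python (the relations and key functions are invented for illustration) showing that a three-way join admits multiple binary plans with identical results:

```python
def hash_join(left, right, lk, rk):
    """Binary hash join: index `right` by its key, then probe with `left`."""
    index = {}
    for r in right:
        index.setdefault(rk(r), []).append(r)
    return [l + r for l in left for r in index.get(lk(l), [])]

# Hypothetical relations A(a, b), B(b, c), C(c, d).
A = [(1, 10), (2, 20)]
B = [(10, 100), (20, 200)]
C = [(100, 7)]

# Plan 1: (A join B) join C.
ab_c = hash_join(hash_join(A, B, lambda t: t[1], lambda t: t[0]),
                 C, lambda t: t[3], lambda t: t[0])
# Plan 2: A join (B join C).
a_bc = hash_join(A, hash_join(B, C, lambda t: t[1], lambda t: t[0]),
                 lambda t: t[1], lambda t: t[0])

assert sorted(ab_c) == sorted(a_bc) == [(1, 10, 10, 100, 100, 7)]
```

A multiway operator keeps all inputs and join keys co-located, so the choice between these plans (or a delta query or worst-case optimal plan) can be deferred to a later analysis.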
Downsides include:
Delta joins
The implementation of delta joins uses some weird operators from the `dogsdogsdogs` differential project, but they actually each have fairly crisp semantics. For example, the `propose` operator (which is probably the most important one) takes two input update streams, `left` and `right`, and produces what I've been calling a "half-join", which is essentially look-ups into `right` for each change in `left`. It is like an incremental join that does not respond to changes in `right` (and blocks on changes to `left` until it has the correct answer).

We could introduce this type of operator into our plans, once we understand it a bit better. This would have the advantage that we could push projections and selections backwards through the operator. We could also plausibly find more opportunities for re-use: for a join on relations A, B, C, D, .., Z, the delta rules often look like long chains of half-joins against largely the same collections, which we should be able to reformulate to share those common chains, which probably gives better performance as larger batches move through less dataflow.
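To pin down the half-join semantics informally described above, here is a minimal sketch in plain Python (my own naming and data shapes; not the `dogsdogsdogs` implementation, which also handles timestamps and consolidation):

```python
def propose(left_changes, right_index):
    """Half-join: for each change on the left, look up matching extensions
    in an index over the right. Changes to the right produce no output
    here; this operator only *reacts* to its left input."""
    out = []
    for (key, payload, diff) in left_changes:
        for extension in right_index.get(key, []):
            out.append((key, payload + (extension,), diff))
    return out

# Hypothetical index over the right collection: key -> extensions.
right_index = {"k1": [10, 11], "k2": [20]}

# Two additions and one retraction arriving on the left.
left_changes = [("k1", ("a",), +1), ("k2", ("b",), +1), ("k1", ("c",), -1)]

out = propose(left_changes, right_index)
assert out == [("k1", ("a", 10), +1), ("k1", ("a", 11), +1),
               ("k2", ("b", 20), +1),
               ("k1", ("c", 10), -1), ("k1", ("c", 11), -1)]
```

A delta rule for a join of A, B, .., Z would chain such half-joins, one per other relation, and rules that share a chain could plausibly share the corresponding dataflow and arrangements.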
Tuple vs Datum record types
We currently have a `Datum` type that can be a tuple, but need not be a tuple. This makes it a little less natural to discuss certain relational idioms, like the provenance of particular columns (e.g. from which source relations does a particular field of a tuple derive, so that we can push predicates closer to the sources).

Ideally we prefer a record type that allows us to maintain information about the mapping of "columns" as we cross plan stages, so that any per-record analyses can be propagated backwards through the dataflow graph.
There are several options, which include but are not limited to:

- Move to a `[Datum]` record type, and discuss support and provenance at the column level.
- Keep the current `Datum` types.

One additional bonus for `[Datum]` records is that there is that much less interpretation for the common case of relational equijoins, in the special case where we only plan to pull out fields from the input records and act on them. In addition, it seems easier to specialize trace implementations to store contiguous `[Datum]` slices if we know the records look like that, and harder if there are flavors we might have to decode.
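To illustrate the column-mapping idea with flat records, a small sketch (the plan stages here are hypothetical projections, invented for illustration): each stage records which input column every output column reads, and composing those maps lets a predicate on a late column be pushed back to its source:

```python
# Rows are flat [Datum] tuples; each plan stage records, for every output
# column, which input column it reads (simple projections, for brevity).
stage1 = {0: 0, 1: 2, 2: 3}   # hypothetical projection over a source relation
stage2 = {0: 1, 1: 2}         # a later projection over stage1's output

def compose(later, earlier):
    """Map each final output column all the way back to a source column."""
    return {out: earlier[mid] for out, mid in later.items()}

source_cols = compose(stage2, stage1)
# A predicate on final column 0 can be evaluated on source column 2.
assert source_cols == {0: 2, 1: 3}
```

With a non-tuple `Datum`, a stage would first have to decide whether a record even has columns before any such per-column analysis could propagate backwards.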