Intermediate representation #6973

frankmcsherry started this conversation in Technical musings
Replies: 1 comment
Another consideration: I imagine what we'll see from the SQL front end is a sequence of SQL queries, many of which are either identical, or nearly identical, to prior queries. If we can avoid screwing up our ability to notice that through any eager transformation, possibly for the best (maybe?). For example, I could easily imagine a bunch of queries which are identical except for some constants in expressions (e.g. "look up data for Alice; look up data for Bob; look up ..."). We probably want to eventually notice this and transform the queries to maintained streaming look-ups.
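One cheap way to notice queries that differ only in constants (a hypothetical sketch, not anything in the codebase; `query_template` and its regex are invented for illustration) is to normalize literals out of the query text before comparing:

```python
import re

def query_template(sql: str):
    """Replace string and numeric literals with placeholders, returning the
    normalized template and the extracted constants. Purely illustrative:
    a real implementation would work on the parsed query, not its text."""
    constants = []

    def repl(match):
        constants.append(match.group(0))
        return "?"

    # Single-quoted strings first, then bare numbers (very rough lexing).
    pattern = r"'(?:[^']|'')*'|\b\d+(?:\.\d+)?\b"
    template = re.sub(pattern, repl, sql)
    return template, constants

t1, c1 = query_template("SELECT * FROM people WHERE name = 'Alice'")
t2, c2 = query_template("SELECT * FROM people WHERE name = 'Bob'")
assert t1 == t2 == "SELECT * FROM people WHERE name = ?"
assert (c1, c2) == (["'Alice'"], ["'Bob'"])
```

A real implementation would extract constants from the parsed AST rather than the raw text, but the idea of keying maintained dataflows by a parameterized template is the same.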
---
Our intermediate representation, the `dataflow::Plan` type, was not chosen with any particular eye towards plan transformations, optimization, or potentially anything other than getting the dataflow built. We should probably be more intentional about our choice here, to make sure that we preserve as much information about the query as possible, and present it in an easy-to-use format.

Here are some thoughts, which are not yet actionably structured. We may need to try some of these out to see if they work for us.
There are several things we could discuss, but to summarize:

- Relational joins vs. outer joins.
- Multiway joins.
- Delta joins.
- Moving from `Datum` records to `[Datum]` records.

None of these are "should" choices. I am not sure we have enough experience with the tradeoffs yet.
Relational joins vs outer joins
We currently use the `Join` plan stage for both inner and outer joins. We may want to reduce outer joins down to more elemental operators, for example expressing `x.left_outer_join(y)` as an inner join plus null-padded records for keys of `x` absent from `y`. This might give us more insight into the shareable arrangements available to us (e.g. `y.keys().distinct()` and `left`).

Downsides include the loss of relational information. If we could have realized some structure from `x.left_outer_join(y)`
that is no longer visible from the reduced representation, that is a problem.

Multiway joins
It seems like multiway joins are a useful abstraction, as at some point in the analysis process we will probably want to explore different join orders, or plans like delta queries and worst-case optimal joins. It seems appealing to have all of this information co-located in a single operator, when possible.
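To make the join-order point concrete, here is a toy sketch in plain Python (the relations and key functions are invented for illustration) showing that a three-way join admits multiple binary plans with identical results:

```python
def hash_join(left, right, lk, rk):
    """Binary hash join: index `right` by its key, then probe with `left`."""
    index = {}
    for r in right:
        index.setdefault(rk(r), []).append(r)
    return [l + r for l in left for r in index.get(lk(l), [])]

# Hypothetical relations A(a, b), B(b, c), C(c, d).
A = [(1, 10), (2, 20)]
B = [(10, 100), (20, 200)]
C = [(100, 7)]

# Plan 1: (A join B) join C.
ab_c = hash_join(hash_join(A, B, lambda t: t[1], lambda t: t[0]),
                 C, lambda t: t[3], lambda t: t[0])
# Plan 2: A join (B join C).
a_bc = hash_join(A, hash_join(B, C, lambda t: t[1], lambda t: t[0]),
                 lambda t: t[1], lambda t: t[0])

assert sorted(ab_c) == sorted(a_bc) == [(1, 10, 10, 100, 100, 7)]
```

A multiway operator keeps all inputs and join keys co-located, so the choice between these plans (or a delta query or worst-case optimal plan) can be deferred to a later analysis.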
Downsides include:
Delta joins
The implementation of delta joins uses some weird operators from the `dogsdogsdogs` differential project, but they actually each have fairly crisp semantics. For example, the `propose` operator (which is probably the most important one) takes two input update streams, `left` and `right`, and produces what I've been calling a "half-join", which is essentially look-ups into `right` for each change in `left`. It is like an incremental join that does not respond to changes in `right` (and blocks on changes to `left` until it has the correct answer).

We could introduce this type of operator into our plans, once we understand it a bit better. This would have the advantage that we could push projections and selections backwards through the operator. We could also plausibly find more opportunities for re-use: for a join on relations A, B, C, D, .., Z, the delta rules often look like long chains of half-joins against largely the same collections, which we should be able to reformulate to share those common chains, which probably gives better performance as larger batches move through less dataflow.
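To pin down the half-join semantics informally described above, here is a minimal sketch in plain Python (my own naming and data shapes; not the `dogsdogsdogs` implementation, which also handles timestamps and consolidation):

```python
def propose(left_changes, right_index):
    """Half-join: for each change on the left, look up matching extensions
    in an index over the right. Changes to the right produce no output
    here; this operator only *reacts* to its left input."""
    out = []
    for (key, payload, diff) in left_changes:
        for extension in right_index.get(key, []):
            out.append((key, payload + (extension,), diff))
    return out

# Hypothetical index over the right collection: key -> extensions.
right_index = {"k1": [10, 11], "k2": [20]}

# Two additions and one retraction arriving on the left.
left_changes = [("k1", ("a",), +1), ("k2", ("b",), +1), ("k1", ("c",), -1)]

out = propose(left_changes, right_index)
assert out == [("k1", ("a", 10), +1), ("k1", ("a", 11), +1),
               ("k2", ("b", 20), +1),
               ("k1", ("c", 10), -1), ("k1", ("c", 11), -1)]
```

A delta rule for a join of A, B, .., Z would chain such half-joins, one per other relation, and rules that share a chain could plausibly share the corresponding dataflow and arrangements.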
Tuple vs Datum record types
We currently have a `Datum` type that can be a tuple, but need not be a tuple. This makes it a little less natural to discuss certain relational idioms, like the provenance of particular columns (e.g. from which source relations does a particular field of a tuple derive, so that we can push predicates closer to the sources).

Ideally we prefer a record type that allows us to maintain information about the mapping of "columns" as we cross plan stages, so that any per-record analyses can be propagated backwards through the dataflow graph.
There are several options, which include but are not limited to:

- Move to a `[Datum]` record type, and discuss support and provenance at the column level.
- Keep the current `Datum` types.

One additional bonus for `[Datum]` records is that there is that much less interpretation for the common case of relational equijoins, in the special case where we only plan to pull out fields from the input records and act on them. In addition, it seems easier to specialize trace implementations to store contiguous `[Datum]` slices if we know the records look like that, and harder if there are flavors we might have to decode.
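To illustrate the column-mapping idea with flat records, a small sketch (the plan stages here are hypothetical projections, invented for illustration): each stage records which input column every output column reads, and composing those maps lets a predicate on a late column be pushed back to its source:

```python
# Rows are flat [Datum] tuples; each plan stage records, for every output
# column, which input column it reads (simple projections, for brevity).
stage1 = {0: 0, 1: 2, 2: 3}   # hypothetical projection over a source relation
stage2 = {0: 1, 1: 2}         # a later projection over stage1's output

def compose(later, earlier):
    """Map each final output column all the way back to a source column."""
    return {out: earlier[mid] for out, mid in later.items()}

source_cols = compose(stage2, stage1)
# A predicate on final column 0 can be evaluated on source column 2.
assert source_cols == {0: 2, 1: 3}
```

With a non-tuple `Datum`, a stage would first have to decide whether a record even has columns before any such per-column analysis could propagate backwards.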