XXX: Write README for Substrait dialect.
Signed-off-by: Ingo Müller <[email protected]>
ingomueller-net committed Jul 15, 2024
1 parent 088dec1 commit 433e94b
Showing 2 changed files with 104 additions and 1 deletion.
99 changes: 99 additions & 0 deletions README-Substrait.md
@@ -0,0 +1,99 @@
# Substrait Dialect for MLIR

This project consists of building an input/output dialect in
[MLIR](https://mlir.llvm.org/) for [Substrait](https://substrait.io/), the
cross-language serialization format of database query plans (akin to an
intermediate representation/IR for database queries). The immediate goal is to
create common infrastructure that can be used to implement consumers, producers,
optimizers, and transpilers of Substrait; the broader goal is to study
the viability of using MLIR to implement database query compilers.

## Motivation

Substrait defines a serialization format for data-intensive compute operations
similar to relational algebra as they typically occur in database query plans
and similar systems, i.e., the "intermediate representation" (or IR) of database
queries. This makes it possible to separate the development of user frontends
such as dataframe libraries or SQL dialects (aka "Substrait producers") from
that of backends such as database engines (aka "Substrait consumers") and, thus,
to interoperate more easily between different data processing systems.

While Substrait has significant momentum and finds increasing
[adoption](https://substrait.io/community/powered_by/) in mature systems, it is
only concerned with implementing the *serialization format* of query plans, and
leaves the *handling* of that format and, hence, the *in-memory format* of
plans up to the systems that want to adopt it. This will likely lead to repeated
implementation effort for everything else required to deal with that
intermediate representation, including serialization/deserialization to and from
text and other formats; a host-language representation of the IR such as
native classes; error and location tracking; rewrite engines, rewrite rules, and
pass management; common optimizations such as common sub-expression elimination;
and potentially even full-blown query optimizations.

This project aims to create a base for any system dealing with Substrait by
building a "dialect" for Substrait in [MLIR](https://mlir.llvm.org/). In a way,
it aims to build an *in-memory* format for the concepts defined by Substrait,
for which the latter only describes the *serialization format*. MLIR, which is
part of the LLVM ecosystem, is a generic compiler framework providing
infrastructure for writing compilers for any domain. It makes it easy to add new
IR consisting of domain-specific operations, types, attributes, etc., which are
organized in "dialects" (either in-tree or out-of-tree), as well as rewrites,
passes, conversions, translations, etc. on those dialects. Creating a Substrait
dialect and a number of common related transformations in such a mature
framework has the potential to eliminate some of the repeated effort described
above and, thus, to ease and eventually increase adoption of Substrait. By
extension, building out a dialect for Substrait can show that MLIR is a viable
base for any database-style query compiler.

## Target Use Cases

The aim of the Substrait dialect is to support all of the following use cases:

* Implement the **translation** of the IR of a particular system to or from
Substrait by converting it to or from the Substrait dialect (rather than
Substrait's protobuf messages) and then using the serialization/deserialization
routines from this project.
* Use the Substrait dialect as the **sole in-memory format** for the IR of a
particular system, e.g., parsing some frontend format into its own dialect
and then converting that into the Substrait dialect for export or converting
from the Substrait dialect for import and then translating that into an
execution plan.
* Implement **simplifying and "canonicalizing" transformations** of Substrait
plans such as common sub-expression elimination, dead code elimination,
sub-query/common table-expression inlining, selection and projection
push-down, etc., for example, as part of a producer, consumer, or transpiler.
* Implement **"compatibility rewrites"** that transform plans using
features that are unsupported by a particular consumer into equivalent plans
using features that it does support, for example, as part of a producer,
consumer, or transpiler.

## Design Rationale

The main objective of the Substrait dialect is to allow handling Substrait plans
in MLIR: it replicates the definition of Substrait plans in MLIR. In the
[taxonomy of Niu and Amini](https://www.youtube.com/watch?v=hIt6J1_E21c&t=795s),
this means that the Substrait dialect is both an "input" and an "output"
dialect for Substrait. As such, there is little freedom in designing the
dialect. Where choices remain, the design follows these rules (from most
important to least important):

* Every valid Substrait plan MUST be representable in the dialect.
* Every valid Substrait plan MUST round-trip through the dialect to the same
plan as the input. This includes names and ordering.
* The import routine MUST be able to report all constraint violations of
Substrait plans (such as type mismatches, dangling references, etc.).
* The dialect MAY be able to represent programs that do not correspond to valid
Substrait plans. It MAY be impossible to export those to Substrait. For
example, this makes it possible to represent DAGs of operators rather than just
trees.
* Every valid program in the Substrait dialect that can be exported to Substrait
MUST round-trip through Substrait to a *semantically* equivalent program, which
MAY differ from the original in terms of names, ordering, used operations,
attributes, etc.
* The dialect SHOULD be understood easily by anyone familiar with Substrait. In
particular, the dialect SHOULD use the same terminology as the Substrait
specification wherever applicable.
* The dialect SHOULD follow MLIR conventions, idioms, and best practices.
* The dialect SHOULD reuse types, attributes, operations, and interfaces of
upstream dialects wherever applicable.
* The dialect SHOULD allow simple optimizations and rewrites of Substrait
plans without requiring other dialects.
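
The two round-trip rules above differ in strength, which a small toy model can
make concrete. The following Python sketch is purely illustrative and is not
part of this project: the tuple-based "plan", the `Op` class, and the functions
`import_plan` and `export_program` are all hypothetical stand-ins, and
"semantic equivalence" is approximated here by comparing exported trees.

```python
from dataclasses import dataclass

# Toy stand-ins (assumptions, not the real formats):
# - a "plan" is a tree of (name, children) tuples;
# - a "program" is a flat list of ops whose operands index earlier ops,
#   so one result may be used several times (a DAG).


@dataclass(frozen=True)
class Op:
    name: str
    operands: tuple  # indices of earlier ops in the program


def import_plan(plan):
    """Translate a plan tree into a program, one op per tree node."""
    program = []

    def visit(node):
        name, children = node
        operands = tuple(visit(child) for child in children)
        program.append(Op(name, operands))
        return len(program) - 1

    visit(plan)
    return program


def export_program(program):
    """Translate a program into a plan tree rooted at the last op."""

    def build(index):
        op = program[index]
        return (op.name, tuple(build(i) for i in op.operands))

    return build(len(program) - 1)


# Plan -> dialect -> plan must reproduce the input exactly.
plan = ("join", (("read_a", ()), ("filter", (("read_b", ()),))))
assert export_program(import_plan(plan)) == plan

# Dialect -> plan -> dialect only needs to preserve meaning: the shared
# "read_a" op of this DAG is duplicated when exported to a tree.
dag = [Op("read_a", ()), Op("join", (0, 0))]
round_tripped = import_plan(export_program(dag))
assert round_tripped != dag
assert export_program(round_tripped) == export_program(dag)
```

In this toy model, the tree-shaped plan survives the round trip unchanged,
whereas the DAG-shaped program comes back with the shared operator duplicated:
a different program that still denotes the same plan.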
6 changes: 5 additions & 1 deletion README.md
@@ -22,7 +22,11 @@ The repository currently houses the following projects:

* The [Iterators](README-Iterators.md) dialect: database-style iterators for
expressing computations on streams of data.
* The [Tuple](include/structured/Dialect/Tuple/): ops for manipulation of built-in tuples (used by the Iterators dialect).
* The [Substrait](README-Substrait.md) dialect: an import/export dialect for
[Substrait](https://substrait.io/), the cross-language serialization format
of database query plans.
* The [Tuple](include/structured/Dialect/Tuple/): ops for manipulation of
built-in tuples (used by the Iterators dialect).

## Build Instructions
