From bafb0d968717606e321ffe29a74146ec26e73cd8 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Ingo=20M=C3=BCller?= <ingomueller@google.com>
Date: Mon, 15 Jul 2024 08:29:02 +0000
Subject: [PATCH] XXX: Write README for Substrait dialect.
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Signed-off-by: Ingo Müller <ingomueller@google.com>
---
 README-Substrait.md | 139 ++++++++++++++++++++++++++++++++++++++++++++
 README.md           |   6 +-
 2 files changed, 144 insertions(+), 1 deletion(-)
 create mode 100644 README-Substrait.md

diff --git a/README-Substrait.md b/README-Substrait.md
new file mode 100644
index 000000000000..5649a5c73ae5
--- /dev/null
+++ b/README-Substrait.md
@@ -0,0 +1,139 @@
+# Substrait Dialect for MLIR
+
+This project consists of building an input/output dialect in
+[MLIR](https://mlir.llvm.org/) for [Substrait](https://substrait.io/), the
+cross-language serialization format of database query plans (akin to an
+intermediate representation/IR for database queries). The immediate goal is to
+create common infrastructure that can be used to implement consumers, producers,
+optimizers, and transpilers of Substrait; the broader goal is to study
+the viability of using MLIR to implement database query compilers.
+
+## Motivation
+
+Substrait defines a serialization format for data-intensive compute operations
+similar to relational algebra as they typically occur in database query plans
+and similar systems, i.e., the "intermediate representation" (or IR) of database
+queries. This makes it possible to separate the development of user frontends
+such as dataframe libraries or SQL dialects (aka "Substrait producers") from
+that of backends such as database engines (aka "Substrait consumers") and,
+thus, to interoperate more easily between different data processing systems.
+
+While Substrait has significant momentum and finds increasing
+[adoption](https://substrait.io/community/powered_by/) in mature systems, it is
+only concerned with implementing the *serialization format* of query plans, and
+leaves the *handling* of that format and, hence, the *in-memory format* of plans
+up to the systems that adopt it. This will likely lead to repeated
+implementation effort for everything else required to deal with that
+intermediate representation, including serialization/deserialization to and from
+text and other formats, a host-language representation of the IR such as native
+classes, error and location tracking, rewrite engines, rewrite rules, and pass
+management, common optimizations such as common sub-expression elimination, and
+similar.
+
+This project aims to create a base for any system dealing with Substrait by
+building a "dialect" for Substrait in [MLIR](https://mlir.llvm.org/). In a way,
+it aims to build an *in-memory* format for the concepts defined by Substrait,
+for which the latter only describes their *serialization format*. MLIR is a
+generic compiler framework providing infrastructure for writing compilers from
+any domain and is part of the LLVM ecosystem. It makes it easy to add new IR
+consisting of domain-specific operations, types, attributes, etc., which are
+organized in "dialects" (either in-tree or out-of-tree), as well as rewrites,
+passes, conversions, translations, etc. on those dialects. Creating a Substrait
+dialect and a number of common related transformations in such a mature
+framework has the potential to eliminate some of the repeated effort described
+above and, thus, to ease and eventually increase adoption of Substrait. By
+extension, building out a dialect for Substrait can show that MLIR is a viable
+base for any database-style query compiler.
+
+## Target Use Cases
+
+The aim of the Substrait dialect is to support all of the following use cases:
+
+* Implement the **translation** of the IR of a particular system to or from
+  Substrait by converting it to or from the Substrait dialect (rather than
+  Substrait's protobuf messages) and then using the
+  serialization/deserialization routines from this project.
+* Use the Substrait dialect as the **sole in-memory format** for the IR of a
+  particular system, e.g., parsing some frontend format into its own dialect
+  and then converting that into the Substrait dialect for export or converting
+  from the Substrait dialect for import and then translating that into an
+  execution plan.
+* Implement **simplifying and "canonicalizing" transformations** of Substrait
+  plans such as common sub-expression elimination, dead code elimination,
+  sub-query/common table-expression inlining, selection and projection
+  push-down, etc., for example, as part of a producer, consumer, or transpiler.
+* Implement **"compatibility rewrites"** that transform plans using features
+  that are unsupported by a particular consumer into equivalent plans
+  using features that it does support, for example, as part of a producer,
+  consumer, or transpiler.
+
+## Design Rationale
+
+The main objective of the Substrait dialect is to allow handling Substrait plans
+in MLIR: it replicates the definition of Substrait plans in MLIR. In the
+[taxonomy of Niu and Amini](https://www.youtube.com/watch?v=hIt6J1_E21c&t=795s),
+this means that the Substrait dialect is both an "input" and an "output"
+dialect for Substrait. As such, there is little freedom in designing the
+dialect. To guide the few design choices that remain, we follow these
+principles (from most important to least important):
+
+* Every valid Substrait plan MUST be representable in the dialect.
+* Every valid Substrait plan MUST round-trip through the dialect to the same
+  plan as the input. This includes names and ordering.
+* The import routine MUST be able to report all constraint violations of
+  Substrait plans (such as type mismatches, dangling references, etc.).
+* The dialect MAY be able to represent programs that do not correspond to valid
+  Substrait plans. It MAY be impossible to export those to Substrait. For
+  example, this allows representing DAGs of operators rather than just trees.
+* Every valid program in the Substrait dialect that can be exported to Substrait
+  MUST round-trip through Substrait to a *semantically* equivalent program but
+  MAY be different in terms of names, ordering, used operations, attributes,
+  etc.
+* The dialect SHOULD be understood easily by anyone familiar with Substrait. In
+  particular, the dialect SHOULD use the same terminology as the Substrait
+  specification wherever applicable.
+* The dialect SHOULD follow MLIR conventions, idioms, and best practices.
+* The dialect SHOULD reuse types, attributes, operations, and interfaces of
+  upstream dialects wherever applicable.
+* The dialect SHOULD allow simple optimizations and rewrites of Substrait
+  plans without requiring other dialects.
+* The serialization of the dialect (aka its "assembly") MAY change over time.
+  (In other words, the dialect is not meant as an exchange format between
+  systems -- that's what Substrait is for.)
+
+## Features (Inherited from MLIR)
+
+MLIR provides infrastructure for virtually all aspects of writing a compiler.
+The following is a list of features that we inherit by using MLIR:
+
+* Mostly declarative approach to defining relations and expressions (via ODS).
+* Documentation generation from declared relations and expressions.
+* Declarative serialization/parsing to/from human-readable text representation
+  (via custom assembly).
+* Syntax highlighting, auto-complete, as-you-type diagnostics, code navigation,
+  etc. for the MLIR text format (via LSP servers).
+* (Partially declarative) type deduction framework (via ODS or C++ interface
+  implementations).
+* (Partially declarative) verification of arbitrary consistency constraints,
+  declarative (via ODS) or imperative (via C++ verifiers).
+* Mostly declarative pass management (via ODS).
+* Versatile infrastructure for pattern-based rewriting (via DRR and C++
+  classes).
+* Powerful imperative handling, creation, and modification of IR using native
+  classes for IR components, walkers, builders, (IR) interfaces, etc. (via ODS
+  and C++ infrastructure).
+* Powerful location tracking and location-based error reporting.
+* Generated Python bindings of IR components, passes, and generic infrastructure
+  (via ODS).
+* Powerful command line argument handling and customizable implementation of
+  typical tools (`X-opt`, `X-translate`, `X-lsp-server`, ...).
+* Testing infrastructure that is optimized for compilers (via `lit` and
+  `FileCheck`).
+* A collection of common types and attributes as well as dialects (i.e.,
+  operations) for more or less generic purposes that can be used in or combined
+  with custom dialects and that come with transformations on and conversions
+  to/from other dialects.
+* A collection of interfaces and transformation passes on those interfaces,
+  which makes it easy to extend existing transformations to new dialects.
+* A support library with efficient data structures, platform-independent file
+  system abstraction, string utilities, etc. (via LLVM support library).
diff --git a/README.md b/README.md
index ab0b5aaaaf4c..6d0ea7a4f8b2 100644
--- a/README.md
+++ b/README.md
@@ -22,7 +22,11 @@ The repository currently houses the following projects:
 
 * The [Iterators](README-Iterators.md) dialect: database-style iterators for
   expressing computations on streams of data.
-* The [Tuple](include/structured/Dialect/Tuple/): ops for manipulation of built-in tuples (used by the Iterators dialect).
+* The [Substrait](README-Substrait.md) dialect: an import/export dialect for
+  [Substrait](https://substrait.io/), the cross-language serialization format
+  of database query plans.
+* The [Tuple](include/structured/Dialect/Tuple/) dialect: ops for manipulation
+  of built-in tuples (used by the Iterators dialect).
 
 ## Build Instructions