Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
apacheGH-33985: [C++] Add substrait serialization/deserialization for…
… expressions (apache#34834) ### Rationale for this change Substrait provides a library-independent way to represent compute expressions. By serializing and deserializing pyarrow compute expression to substrait we can allow interoperability with other libraries. Originally it was thought this would not be needed because users would be sending entire query plans (which contain expressions) back and forth and so there was no need to work with expressions by themselves. However, as more and more APIs and integration points emerge it turns out there are situations where serializing expressions by themselves is useful. For example, the proposed datasets protocol, or for the Java JNI datasets implementation (which uses Arrow-C++'s datasets) ### What changes are included in this PR? In Arrow-C++ we add two new methods to serialize and deserialize a collection of named, bound expressions to Substrait's ExtendedExpression message. In pyarrow we expose these two methods and also add utility methods to pyarrow.compute.Expression to convert a single expression to/from substrait (these will be encoded as an ExtendedExpression message with one expression named "expression") In addition, this PR exposed that we do not have very many bindings for arrow-functions to substrait-functions (previous work has mostly focused on the reverse direction). This PR adds many (though not all) new bindings. In addition, this PR adds ToProto for cast and both FromProto and ToProto support for the SingularOrList expression type (we convert is_in to SingularOrList and convert SingularOrList to an or list). This should provide support for all the sargable operators except between (there is no Arrow-C++ between function) and like (we still don't have arrow->substrait bindings for the string functions) which should be a sufficient set of expressions for a first release. ### Are these changes tested? Yes. ### Are there any user-facing changes? There are new features, as described above, but no backwards incompatible changes. ### Caveats There are a fair number of minor inconsistencies or surprises, many of which can be smoothed over by follow-up work. #### Bound Expressions Arrow-C++ has long had a distinction between "unbound expressions" (e.g. `a + b`) and "bound expressions" (e.g. `a:i32 + b:i32`). A bound expression is an expression that has been bound to a schema of some kind. Field references are resolved and the output type is known for every node of the AST. Pyarrow has hidden this complexity and most pyarrow compute expressions that the user encounters will be unbound expressions. Substrait is only capable (currently) of representing bound expressions. As a result, in order to serialize expressions, the user will need to provide an input schema. This can be an inconvenience for some workflows. To resolve this, I would like to eventually add support for unbound expressions to Substrait (substrait-io/substrait#515) Another minor annoyance of bound expressions is that an unbound pyarrow.compute.Expression object will not be equal to a bound pyarrow.compute.Expression object. It would make testing easier if we had a `pyarrow.compute.Expression.equals` variant that did not examine bound fields. #### Named field references Pyarrow datasets users are used to working with named field references. For example, one can set a filter `pc.equal(ds.field("x"), 7)`. Substrait, since it requires everything to be bound, considers named references to be superfluous and does everything in terms of numeric indices into the base schema. So the above expression, after round tripping, would become something like `pc.equal(ds.field(3), 7)` (assuming `"x"` is at index `3` in the schema used for serialization). This is something that can be overcome in the future if Substrait adds support for unbound expressions. Or, if that doesn't happen, it could still be implemented as a Substrait expression hint (this would allow named references to be used even if the user wants to work with bound expressions). #### UDFs UDFs ARE supported by this PR. This covers both "builtin arrow functions that do not exist in substrait (e.g. shift_left)" and "custom UDFs added with `register_scalar_function`". By default, UDFs will not be allowed when converting to Substrait because the resulting message would not be portable (e.g. you can't expect an external system to know about your custom UDFs). However, you can set the `allow_udfs` flag to True and these will be allowed. The Substrait representation will have the URI `urn:arrow:substrait_simple_extension_function`. **Options**: Although UDFs are allowed we do not yet support UDFs that take function options. These are trickier to convert to Substrait (though it should be possible in the future if someone is motivated enough). #### Rough Edges There are a few corner cases: * The function `is_in` converts to Substrait's `SingularOrList`. On conversion back to Arrow this becomes an or list. In other words, the function `is_in(5, [1, 2, 5])` converts to `5 == 1 || 5 == 2 || 5 == 5`. This is because Substrait's or list is more expression and allows things like `5 == field_ref(0) || 5 == 7` which cannot be expressed as an `is_in` function. * Arrow functions can either be converted to Substrait or are considered UDFs. However, there are a small number of functions which can "sometimes" be converted to Substrait depending on the function options. At the moment I think this is only the `is_null` function. The `is_null` function has an option `nan_is_null` which will allow you to consider `NaN` as a null value. Substrait has no single function that evaluates both `NULL` and `NaN` as true. In the meantime you can use `is_null || is_nan`. In the future, should someone want to, they could add special logic to convert this case. * Closes: apache#33985 Lead-authored-by: Weston Pace <[email protected]> Co-authored-by: Joris Van den Bossche <[email protected]> Signed-off-by: Weston Pace <[email protected]>
- Loading branch information