Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: Add fluent expression builder API #11414

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

pedroerp
Copy link
Contributor

@pedroerp pedroerp commented Nov 2, 2024

Summary:
Adding an API to allow developers to fluently create (untyped)
expression trees without having to rely on a SQL parser. Details and
extensively examples provided in code comments and unit tests

Differential Revision: D65371064

Part of #11383

@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Nov 2, 2024
@facebook-github-bot
Copy link
Contributor

This pull request was exported from Phabricator. Differential Revision: D65371064

Copy link

netlify bot commented Nov 2, 2024

Deploy Preview for meta-velox canceled.

Name Link
🔨 Latest commit 996f0af
🔍 Latest deploy log https://app.netlify.com/sites/meta-velox/deploys/6760f397b233e60008e25411

@pedroerp
Copy link
Contributor Author

pedroerp commented Nov 2, 2024

Cc: @mbasmanova @xiaoxmeng @majetideepak @kgpai @Yuhta @bikramSingh91 @rui-mo

What do you guys think about something like this as the API?

Copy link
Collaborator

@majetideepak majetideepak left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@pedroerp this looks nice!
I wonder if it is more convenient to use f"" for fields and l*"" for literals.
eg. f"col" and lf"1.0" for literal float, ld"1.0" for literal double.
Something like the Arrow CDataInterface.
https://arrow.apache.org/docs/format/CDataInterface.html

@@ -44,7 +45,42 @@ class IExpr {
return alias_;
}

std::optional<std::string>& alias() {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: return string_view?

@@ -255,7 +252,7 @@ class PlanBuilder {
}

/// @param tableHandle Optional tableHandle. Other builder arguments such as
/// the subfieldFilters and remainingFilter will be ignored.
/// the `subfieldFilters` and `remainingFilter` will be ignored.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: "``" is not used consistently in the comments throughout.

return head;
}

// Macro to reduce builerplate when overloading C++ operators. The template
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

typo: boilerplate

@pedroerp
Copy link
Contributor Author

pedroerp commented Nov 4, 2024

@majetideepak thanks for the quick review.

I wonder if it is more convenient to use f"" for fields and l*"" for literals. eg. f"col" and lf"1.0" for literal float, ld"1.0" for literal double. Something like the Arrow CDataInterface.

You mean using the same arrow identifiers as custom C++ literals? That is a great idea for consistency, but I don't think C++ let's you define the literals as a prefix; if needs to be a trailing "_xx". But we could make the trailing characters align with arrow for literals.

@facebook-github-bot
Copy link
Contributor

This pull request was exported from Phabricator. Differential Revision: D65371064

@majetideepak
Copy link
Collaborator

@pedroerp You are right. It has to be a suffix https://en.cppreference.com/w/cpp/language/user_literal.

@majetideepak
Copy link
Collaborator

I learned from the above link that we can also have user-defined integer literals. The Arrow CDataInterface will allow us to define NULL literals as well.

/// Conjuncts and "between":
///
/// > ExprPtr e = "a"_f && "b"_f || "c"_f; // "a AND b OR c"
/// > ExprPtr e = between("a"_f, 0L, 10L); // "a between 0 and 10"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What would 'in' look like?

/// execution using type binding from `core::Expressions::inferTypes()`.
///
/// The API provided is as close to the actual expression trees as possible.
/// Filters, arithmetics, function calls, literals, aliases, and more are
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What does 'cast' look like?

Copy link
Contributor Author

@pedroerp pedroerp Nov 5, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Haven't written it yet, but it would follow the exact pattern in the CastExpr node, so something like

cast(BIGINT(), "col"_f)

This would actually be better than SQL as we can be very precise about the type being cast.

/// C++ literals are automatically converted into ConstantExpr (expression
/// literals) when part of an expression. To explicitly create a literal:
///
/// > ExprPtr e = literal(10.3);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would you add some examples of complex types (arrays, maps and structs)?

Would be nice to clarify how to define literals of various integer types.

///
/// > ExprPtr e = literal(10.3);
/// > ExprPtr e = literal("str");
/// > ExprPtr e = literalTimestamp("1970-01-01 01:30:00");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What would timestamp with timezone look like?

/// When building long expressions, be careful about C++ operator precedence.
/// For example:
///
/// > ExprPtr e = "col"_f + 5 * 100;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are you cautioning about plus(col, multiply(5, 100)) vs. plus(col, 500)? This wasn't obvious right away.

/// > ExprPtr e = "col"_f + 5 * literal(100);
/// > ExprPtr e = "col"_f + literal(5) * 100;
///
/// Both will generate "col + 5 * 100", which is "plus(col, multiply(5, 100))".
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this implicitly convert + to "plus" function call? Isn't this assuming "semantics"? Perhaps, list all such implicit conversions somewhere.

using core::CallExpr;
using core::ConstantExpr;
using core::ExprPtr;
using core::FieldAccessExpr;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would you add an example of accessing a field in a struct?

TEST(ExpressionBuilderTest, fieldAccess) {
ExprWrapper result;

result = field("col");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I noticed that applications using Velox library often do not have the luxury of using Velox symbols unqualified and forced to use velox::BIGINT() vs. just BIGINT().

Since this API is all about brevity and readability, how do you think applications can avoid having to prefix all these names with velox::expr::builder?

velox::expr::builder::field("f") vs. field("f") 

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good question. We can naturally move this to the top-level namespace. My concern was to prevent inadvertent symbols clashes by requiring API users to explicitly import namespace. For example, maybe there could be other velox::field() symbols defined somewhere else? This was out of abundant caution though; more likely than not moving to top-level should be fine.

Copy link
Contributor

@Yuhta Yuhta Nov 5, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Curious what is the reason preventing applications from doing using namespace velox::expr::builder. using namespace velox is probably a bit too broad but using namespace velox::expr::builder probably is fine.

In the worst case they can also do namespace VEB = velox::expr::builder and use VEB::field()

TEST(ExpressionBuilderTest, combined) {
ExprWrapper result;

result = 10L * "c1"_f > call("func", 3.4, "g"_f / "h"_f, call("j"));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This expression is not nearly as complex as expressions commonly found in production, yet, even that is notably less readable as compared to SQL.

I think it is useful to have this API, but, perhaps, not as a replacement for SQL.

@pedroerp
Copy link
Contributor Author

pedroerp commented Nov 5, 2024

@mbasmanova thanks for the comments. Will address them in the next few days.

Overall I just wanted to prototype how this API could look like and get a feeling about how the community feels about it, but if we are aligned that this might be useful, I'm happy to complete it add support for the missing expressions you pointed out (complex types, casts, lambdas, etc)

}

// Define C++ operator overload for comparisons (filters).
VELOX_EXPR_BUILDER_OPERATOR(operator==, "eq");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We have namespace prefix for functions (e.g. presto.default.eq) so this need to be configurable. Maybe a global variable for prefix is enough. But I would guess there will be other things people want to configure in the future. So a global singleton of config object probably is better.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1. Besides the prefix, I believe Spark functions have different names, e.g. 'eq' in Presto is called 'equalto' in Spark.

TEST(ExpressionBuilderTest, fieldAccess) {
ExprWrapper result;

result = field("col");
Copy link
Contributor

@Yuhta Yuhta Nov 5, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Curious what is the reason preventing applications from doing using namespace velox::expr::builder. using namespace velox is probably a bit too broad but using namespace velox::expr::builder probably is fine.

In the worst case they can also do namespace VEB = velox::expr::builder and use VEB::field()


/// Field access (references to columns).
inline ExprWrapper field(const std::string& name) {
return {std::make_shared<FieldAccessExpr>(name, std::nullopt)};
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The field access could contain 'input', which makes it a dereference expression. Do we need to support it? Thanks.

Summary:

Adding an API to allow developers to fluently create (untyped)
expression trees without having to rely on a SQL parser. Details and
extensively examples provided in code comments and unit tests.

Part of facebookincubator#11383

Differential Revision: D65371064
@facebook-github-bot
Copy link
Contributor

This pull request was exported from Phabricator. Differential Revision: D65371064

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. fb-exported
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants