Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

perf(sql): avoid extraneous operator parenthesis #10007

Merged
merged 2 commits into from
Sep 4, 2024

Conversation

jcrist
Copy link
Member

@jcrist jcrist commented Sep 3, 2024

This PR is motivated by #9987, and is one part of decreasing the compilation overhead here.

This PR:

  • Adds a benchmark for the problem. On main the SQL generation tests fail due to excessive recursion within SQLGlot
  • Modifies our binary operator handling code to reduce some boilerplate
  • Modifies the parenthesizing behavior of associative operators to avoid unnecessary parenthesis. Doing so unlocks an optimization within SQLGlot to generate SQL without recursion (excessive parenthesis defeat this optimization).
  • Adds a snapshot test for operators to check the behavior

There are still a few cases where we excessively apply parenthesis (mostly in backends where a function is used instead of a binary operator), but this seemed good enough for now and is a nice improvement overall. After this PR the benchmarks now run successfully. I have some follow-ups to further improve performance in other places, but figured I'd split the PRs out by topic.

@jcrist jcrist added performance Issues related to ibis's performance sql Backends that generate SQL labels Sep 3, 2024
@cpcloud
Copy link
Member

cpcloud commented Sep 3, 2024

Nice!

What does the performance versus DuckDB raw SQL look like?

@jcrist
Copy link
Member Author

jcrist commented Sep 3, 2024

from functools import reduce
from operator import add
from timeit import Timer

import pyarrow as pa
import ibis

con = ibis.duckdb.connect()

N = 100
data = pa.Table.from_pydict({f"x{i}": [1, 2, 3] for i in range(N)})
t = con.create_table("test", data)
expr = t.select(sum=reduce(add, map(t.__getitem__, t.columns)))
sql = f"SELECT {'+'.join(t.columns)} AS sum FROM test"

for name, func in [
    ("duckdb", lambda: con.raw_sql(sql).arrow()),
    ("ibis-duckdb", lambda: expr.to_pyarrow()),
]:
    n, t = Timer(func).autorange()
    print(f"{name}: {1000 * t / n:.1f} ms")

On this PR:

$ python bench.py
duckdb: 2.6 ms
ibis-duckdb: 25.9 ms

With another upcoming PR:

$  python bench.py
duckdb: 2.5 ms
ibis-duckdb: 7.2 ms

@jcrist jcrist added the ci-run-cloud Run BigQuery, Snowflake, Databricks, and Athena backend tests label Sep 3, 2024
@ibis-docs-bot ibis-docs-bot bot removed the ci-run-cloud Run BigQuery, Snowflake, Databricks, and Athena backend tests label Sep 3, 2024
@jcrist jcrist requested a review from cpcloud September 3, 2024 22:42
Copy link
Member

@cpcloud cpcloud left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not blocking, but I do think the .associative approach would be a bit cleaner.

@@ -469,6 +460,19 @@ def impl(self, _, *, _name: str = target_name, **kw):
for op, target_name in cls.SIMPLE_OPS.items():
setattr(cls, methodname(op), make_impl(op, target_name))

# Define binary op methods, only if BINARY_INFIX_OPS is set on the
# compiler class.
if binops := cls.__dict__.get("BINARY_INFIX_OPS", {}):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this checking specifically for subclass overrides of this attribute?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah. If we did cls.BINARY_INFIX_OPS it would define the methods in every subclass (since no subclasses currently define BINARY_INFIX_OPS), clobbering any explicit visit_* methods the subclass defined. We'd need to do this for SIMPLE_OPS etc... too, except that every compiler subclass currently sets that attribute themselves with at least one override.

ops.Power,
BINARY_INFIX_OPS = {
# Numeric
ops.Add: (sge.Add, True),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The bool here seems like it should be a property of the operation instead of a flag.

It's hard to tell what the boolean is referring to unless you know where it's being used.

What do you think about sticking a class attribute on the ops for this property?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not a bad idea. I'm going to merge this now but may refactor that in a follow-up.

# compiler class.
if binops := cls.__dict__.get("BINARY_INFIX_OPS", {}):

def make_binop(sge_cls, associative):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems like most or all of the associative arguments can go away if you make it an op attribute.

@jcrist jcrist merged commit f86515c into ibis-project:main Sep 4, 2024
81 checks passed
@jcrist jcrist deleted the avoid-op-parens branch September 4, 2024 15:09
@cpcloud cpcloud added this to the 9.5 milestone Sep 9, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
performance Issues related to ibis's performance sql Backends that generate SQL
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants