Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

POC - Enable support for clickhouse on SL #1592

Draft
wants to merge 7 commits into
base: main
Choose a base branch
from

Conversation

thiagosalvatore
Copy link

No description provided.

Copy link

cla-bot bot commented Jan 13, 2025

Thanks for your pull request, and welcome to our community! We require contributors to sign our Contributor License Agreement and we don't seem to have your signature on file. Check out this article for more information on why we have a CLA.

In order for us to review and merge your code, please submit the Individual Contributor License Agreement form attached above above. If you have questions about the CLA, or if you believe you've received this message in error, please reach out through a comment on this PR.

CLA has not been signed by users: @thiagosalvatore

Copy link

Thank you for your pull request! We could not find a changelog entry for this change. For details on how to document a change, see the contributing guide.

@cla-bot cla-bot bot added the cla:yes label Jan 13, 2025
settings = ["allow_experimental_join_condition = 1", "allow_experimental_analyzer = 1", "join_use_nulls = 0"]
return SqlPlanRenderResult(sql=f"SETTINGS {', '.join(settings)}", bind_parameter_set=SqlBindParameterSet())

def _render_joins_section(self, join_descriptions: Sequence[SqlJoinDescription]) -> Optional[SqlPlanRenderResult]:
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't link this at all.

The problem is that clickhouse doesn't support inequality INNER JOINS

https://clickhouse.com/docs/en/sql-reference/statements/select/join#join-with-inequality-conditions-for-columns-from-different-tables.

The suggested approach is to use CROSS JOIN instead, but it forces a change on the rendering logic of any join that uses inequality.

The tests are passing, but I'm open for suggestions here

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We'll definitely want to discuss this at the MF sync tomorrow. The problem is that a SQL plan theoretically should be engine-agnostic. In this case, it isn't, since the INNER JOIN in the SQL plan is invalid for Clickhouse.

A couple of other potential options:

  • We change the SQL plan to use CROSS JOIN, which I'm assuming would work for all engines. Then, we could implement an engine-specific optimizer that swaps in an INNER JOIN if the engine supports it. In practice, if that optimizer errors for some reason, the user might be surprised to see a CROSS JOIN here when it's not necessary. We might be ok with that tradeoff.
  • Another option would be for us to just create an internal concept of a generic "inequality join" or something similar to use in the SQL plan. And that would get translated into either INNER JOIN or CROSS JOIN by the SQL renderer.
    Both of those options might have issues integrating with the WHERE clause, though, since we would need to include that when CROSS JOIN is used 🤔

Let's hold off on making any changes here for now - I want to see if Paul has any better ideas for how to work around this.

@@ -322,6 +322,9 @@ def _render_limit_section(self, limit_value: Optional[int]) -> Optional[SqlPlanR
return None
return SqlPlanRenderResult(sql=f"LIMIT {limit_value}", bind_parameter_set=SqlBindParameterSet())

def _render_adapter_specific_flags(self) -> Optional[SqlPlanRenderResult]:
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

clickhouse has some flags that can be enabled on a query level. Instead of adding a bunch of workarounds to the statement execution I decided to add it here so if we have other adapters with similar features we can just use it.


if self.sql_engine_type == SqlEngine.CLICKHOUSE:
create_table_statement = (
f"{create_table_statement} ENGINE = MergeTree ORDER BY ({columns_to_insert[0].split(" ")[0]})"
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

clickhouse enforces that every table being created needs to specify what is the engine


def drop_schema(self, schema_name: str, cascade: bool = True) -> None:
"""Drop the given schema from the data warehouse. Only used in tests."""
self.execute(f"DROP SCHEMA IF EXISTS {schema_name}{' CASCADE' if cascade else ''}")
if self.sql_engine_type is SqlEngine.CLICKHOUSE:
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

there's no concept of schema on clickhouse. A schema is a database

@thiagosalvatore thiagosalvatore changed the title Poc clickhouse POC - Enable support for clickhouse on SL Jan 13, 2025
@courtneyholcomb courtneyholcomb added the Run Tests With Other SQL Engines Runs the test suite against the SQL engines in our target environment label Jan 13, 2025
@github-actions github-actions bot removed the Run Tests With Other SQL Engines Runs the test suite against the SQL engines in our target environment label Jan 13, 2025
Copy link
Contributor

@courtneyholcomb courtneyholcomb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's awesome how quickly you were able to figure all this out with so little context! 🙌

}

@override
def render_date_part(self, date_part: DatePart) -> str:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm guessing this will need to be moved to render_extract()?

settings = ["allow_experimental_join_condition = 1", "allow_experimental_analyzer = 1", "join_use_nulls = 0"]
return SqlPlanRenderResult(sql=f"SETTINGS {', '.join(settings)}", bind_parameter_set=SqlBindParameterSet())

def _render_joins_section(self, join_descriptions: Sequence[SqlJoinDescription]) -> Optional[SqlPlanRenderResult]:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We'll definitely want to discuss this at the MF sync tomorrow. The problem is that a SQL plan theoretically should be engine-agnostic. In this case, it isn't, since the INNER JOIN in the SQL plan is invalid for Clickhouse.

A couple of other potential options:

  • We change the SQL plan to use CROSS JOIN, which I'm assuming would work for all engines. Then, we could implement an engine-specific optimizer that swaps in an INNER JOIN if the engine supports it. In practice, if that optimizer errors for some reason, the user might be surprised to see a CROSS JOIN here when it's not necessary. We might be ok with that tradeoff.
  • Another option would be for us to just create an internal concept of a generic "inequality join" or something similar to use in the SQL plan. And that would get translated into either INNER JOIN or CROSS JOIN by the SQL renderer.
    Both of those options might have issues integrating with the WHERE clause, though, since we would need to include that when CROSS JOIN is used 🤔

Let's hold off on making any changes here for now - I want to see if Paul has any better ideas for how to work around this.

@@ -132,8 +140,14 @@ def _quote_escape_value(self, value: str) -> str:

def create_schema(self, schema_name: str) -> None:
"""Create the given schema in a data warehouse. Only used in tutorials and tests."""
self.execute(f"CREATE SCHEMA IF NOT EXISTS {schema_name}")
if self.sql_engine_type is SqlEngine.CLICKHOUSE:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should add this to the renderers so that we don't need any logic here to check the engine type (especially since it's used twice). You can do similar to what we do on line 125 above to get the string needed for timestamp_data_type. You can add a property to the renderer called like schema_str or something like that that just returns "SCHEMA" or "DATABASE".

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm wondering if it makes sense to move this to the renderer given that we only create/drop schemas on tests.

Maybe we can turn this into a property inside the renderer that looks like:

has_schema_support that by default is True and then when this value is false we use the create database instead. WDYT?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure that works too!

@@ -32,6 +32,10 @@
"engine_url": trino://...",
"engine_password": "..."
},
"clickhouse": {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's a 1pass entry that stores these creds for easy access, so it would be great if we could update that too!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants