-
Notifications
You must be signed in to change notification settings - Fork 99
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
POC - Enable support for clickhouse on SL #1592
base: main
Are you sure you want to change the base?
Conversation
Thanks for your pull request, and welcome to our community! We require contributors to sign our Contributor License Agreement and we don't seem to have your signature on file. Check out this article for more information on why we have a CLA. In order for us to review and merge your code, please submit the Individual Contributor License Agreement form attached above above. If you have questions about the CLA, or if you believe you've received this message in error, please reach out through a comment on this PR. CLA has not been signed by users: @thiagosalvatore |
Thank you for your pull request! We could not find a changelog entry for this change. For details on how to document a change, see the contributing guide. |
settings = ["allow_experimental_join_condition = 1", "allow_experimental_analyzer = 1", "join_use_nulls = 0"] | ||
return SqlPlanRenderResult(sql=f"SETTINGS {', '.join(settings)}", bind_parameter_set=SqlBindParameterSet()) | ||
|
||
def _render_joins_section(self, join_descriptions: Sequence[SqlJoinDescription]) -> Optional[SqlPlanRenderResult]: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't link this at all.
The problem is that clickhouse doesn't support inequality INNER JOINS
The suggested approach is to use CROSS JOIN instead, but it forces a change on the rendering logic of any join that uses inequality.
The tests are passing, but I'm open for suggestions here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We'll definitely want to discuss this at the MF sync tomorrow. The problem is that a SQL plan theoretically should be engine-agnostic. In this case, it isn't, since the INNER JOIN
in the SQL plan is invalid for Clickhouse.
A couple of other potential options:
- We change the SQL plan to use
CROSS JOIN
, which I'm assuming would work for all engines. Then, we could implement an engine-specific optimizer that swaps in anINNER JOIN
if the engine supports it. In practice, if that optimizer errors for some reason, the user might be surprised to see aCROSS JOIN
here when it's not necessary. We might be ok with that tradeoff. - Another option would be for us to just create an internal concept of a generic "inequality join" or something similar to use in the SQL plan. And that would get translated into either
INNER JOIN
orCROSS JOIN
by the SQL renderer.
Both of those options might have issues integrating with theWHERE
clause, though, since we would need to include that whenCROSS JOIN
is used 🤔
Let's hold off on making any changes here for now - I want to see if Paul has any better ideas for how to work around this.
@@ -322,6 +322,9 @@ def _render_limit_section(self, limit_value: Optional[int]) -> Optional[SqlPlanR | |||
return None | |||
return SqlPlanRenderResult(sql=f"LIMIT {limit_value}", bind_parameter_set=SqlBindParameterSet()) | |||
|
|||
def _render_adapter_specific_flags(self) -> Optional[SqlPlanRenderResult]: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
clickhouse has some flags that can be enabled on a query level. Instead of adding a bunch of workarounds to the statement execution I decided to add it here so if we have other adapters with similar features we can just use it.
|
||
if self.sql_engine_type == SqlEngine.CLICKHOUSE: | ||
create_table_statement = ( | ||
f"{create_table_statement} ENGINE = MergeTree ORDER BY ({columns_to_insert[0].split(" ")[0]})" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
clickhouse enforces that every table being created needs to specify what is the engine
|
||
def drop_schema(self, schema_name: str, cascade: bool = True) -> None: | ||
"""Drop the given schema from the data warehouse. Only used in tests.""" | ||
self.execute(f"DROP SCHEMA IF EXISTS {schema_name}{' CASCADE' if cascade else ''}") | ||
if self.sql_engine_type is SqlEngine.CLICKHOUSE: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
there's no concept of schema on clickhouse. A schema is a database
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's awesome how quickly you were able to figure all this out with so little context! 🙌
} | ||
|
||
@override | ||
def render_date_part(self, date_part: DatePart) -> str: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm guessing this will need to be moved to render_extract()
?
settings = ["allow_experimental_join_condition = 1", "allow_experimental_analyzer = 1", "join_use_nulls = 0"] | ||
return SqlPlanRenderResult(sql=f"SETTINGS {', '.join(settings)}", bind_parameter_set=SqlBindParameterSet()) | ||
|
||
def _render_joins_section(self, join_descriptions: Sequence[SqlJoinDescription]) -> Optional[SqlPlanRenderResult]: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We'll definitely want to discuss this at the MF sync tomorrow. The problem is that a SQL plan theoretically should be engine-agnostic. In this case, it isn't, since the INNER JOIN
in the SQL plan is invalid for Clickhouse.
A couple of other potential options:
- We change the SQL plan to use
CROSS JOIN
, which I'm assuming would work for all engines. Then, we could implement an engine-specific optimizer that swaps in anINNER JOIN
if the engine supports it. In practice, if that optimizer errors for some reason, the user might be surprised to see aCROSS JOIN
here when it's not necessary. We might be ok with that tradeoff. - Another option would be for us to just create an internal concept of a generic "inequality join" or something similar to use in the SQL plan. And that would get translated into either
INNER JOIN
orCROSS JOIN
by the SQL renderer.
Both of those options might have issues integrating with theWHERE
clause, though, since we would need to include that whenCROSS JOIN
is used 🤔
Let's hold off on making any changes here for now - I want to see if Paul has any better ideas for how to work around this.
@@ -132,8 +140,14 @@ def _quote_escape_value(self, value: str) -> str: | |||
|
|||
def create_schema(self, schema_name: str) -> None: | |||
"""Create the given schema in a data warehouse. Only used in tutorials and tests.""" | |||
self.execute(f"CREATE SCHEMA IF NOT EXISTS {schema_name}") | |||
if self.sql_engine_type is SqlEngine.CLICKHOUSE: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should add this to the renderers so that we don't need any logic here to check the engine type (especially since it's used twice). You can do similar to what we do on line 125 above to get the string needed for timestamp_data_type
. You can add a property to the renderer called like schema_str
or something like that that just returns "SCHEMA"
or "DATABASE"
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm wondering if it makes sense to move this to the renderer given that we only create/drop schemas on tests.
Maybe we can turn this into a property inside the renderer that looks like:
has_schema_support
that by default is True and then when this value is false we use the create database instead. WDYT?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure that works too!
@@ -32,6 +32,10 @@ | |||
"engine_url": trino://...", | |||
"engine_password": "..." | |||
}, | |||
"clickhouse": { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There's a 1pass entry that stores these creds for easy access, so it would be great if we could update that too!
No description provided.