Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Time granularity generates invalid query for Dremio #20349

Open
3 tasks done
jbvsmo opened this issue Jun 10, 2022 · 10 comments
Open
3 tasks done

Time granularity generates invalid query for Dremio #20349

jbvsmo opened this issue Jun 10, 2022 · 10 comments
Labels
#bug Bug report data:connect:dremio Related to Dremio data:connect:drill Related to Drill

Comments

@jbvsmo
Copy link

jbvsmo commented Jun 10, 2022

I'm using Dremio database and it throws a VALIDATION ERROR on any query that tries to group by a time field that has a granularity expression. This might be valid SQL for another database system, but not Dremio. This makes superset useless for Dremio tables since grouping by time is a very common and used feature

How to reproduce the bug

  1. Create a new table chart and pick a timestamp field like 2022-06-10 12:34:56 and select time grain day or month
  2. Select as dimension the same timestamp field
  3. Add any metric such as COUNT(*)
  4. Superset will generate this query:
SELECT DATE_TRUNC('day', "OperationDate") AS "OperationDate",
       COUNT(*) AS "count"
FROM "no-partitions-queries"."messages_operations_v"
GROUP BY DATE_TRUNC('day', "OperationDate")
ORDER BY "count" DESC
LIMIT 10000;

It is not using the name of the column OperationDate in the group by and is rewriting the same expression.

Expected results

to have this clause:

GROUP BY "OperationDate"

Actual results

Query execution error. Details:
VALIDATION ERROR: Expression 'messages_operations_v.OperationDate' is not being grouped

Unexpected error
Error: ('HY000', '[HY000] [Dremio][Connector] (1040) Dremio failed to execute the query: SELECT DATE_TRUNC('day', "OperationDate") AS "OperationDate",\n COUNT(*) AS "count"\nFROM "no-partitions-queries"."messages_operations_v"\nGROUP BY DATE_TRUNC('day', "OperationDate")\nORDER BY "count" DESC\nLIMIT 10000\n[30038]Query execution error. Details:[ \nVALIDATION ERROR: Expression 'messages_operations_v.OperationDate' is not being grouped\n\nSQL Query SELECT DATE_...[see log] (1040) (SQLExecDirectW)')

Screenshots

Screen Shot 2022-06-10 at 10 50 01

Environment

  • browser type and version: Chrome v102.0.5005.61
  • superset version: It show Superset 0.0.0dev -- I'm using the latest docker image
  • python version: 3.8.12
  • node.js version: none
  • any feature flags active: no

Checklist

Make sure to follow these steps before submitting your issue - thank you!

  • I have checked the superset logs for python stacktraces and included it here as text if there are any.
  • I have reproduced the issue with at least the latest released version of superset.
  • I have checked the issue tracker for the same issue and I haven't found one similar.

Additional context

Add any other context about the problem here.

@jbvsmo jbvsmo added the #bug Bug report label Jun 10, 2022
@rusackas
Copy link
Member

Wondering if @eschutho or @zhaoyongjie have any insight on this issue.

@eschutho
Copy link
Member

eschutho commented Jun 14, 2022

@jbvsmo I found this post which looks related to this issue: https://community.dremio.com/t/unable-to-group-by-date-trunc/7953

@rusackas I think it makes sense to investigate this further and whether it's indeed a fix that's needed on our side or Dremio, but I think as as workaround you should be able to create a calculated column on your dataset for DATE_TRUNC('day', "OperationDate") and give it a new name and then use that new column for your dimension/group by (you can use the original column in your time dimension). When trying that change, my query now looks like this one, which matches what looks to be the fix for Dremio:

DEV__Explore-_Untitled_Query_1_06_14_2022_14_17_27

also cc @betodealmeida for any additional insight.

@jbvsmo
Copy link
Author

jbvsmo commented Jun 24, 2022

@eschutho Following up on this, seems like Dremio still doesn't support that syntax and they recommend using the alias on the groupby clause. I asked on that community post you showed and they responded that.

There are a few issues with calculated columns in the dataset, one being annoying to do it for several tables and another that sometimes I want to aggregate by day, month, year, quarter, week... I would have to create many columns when a simply fix of using the correct alias would be enough

@eschutho
Copy link
Member

@jbvsmo you can also create an alias/calculated column without the time coercion if you want it to be more flexible. Here's an example of renaming ds to date:
_DEV__Superset
Then just use date for your time grain and ds for your dimension.

I hear what you're saying about it not being ideal with regard to having to do this on many tables, but it is a temporary workaround for now until Dremio can fix the bug on their end. The code solution on the Superset side around this would need to be something where we alias all time dimensions for Dremio which could get messy, but we're definitely open to any PRs if someone wants to contribute an idea.

@meyergin
Copy link

Same issue at Apache drill,
maybe need to check database driver to choose native name/ alias name.

@mbrannstrom
Copy link

I'm using Apache Drill and having the same problem (as @meyergin also mentioned).

Apache Drill supports GROUP BY both with and without alias. E.g. both these works:

SELECT LEFT(name,2) as name2 FROM ... GROUP BY LEFT(name,2) - OK
SELECT LEFT(name,2) as name2 FROM ... GROUP BY name2 - OK

However, when the alias is the same as the column name, we get a different behaviour:
SELECT LEFT(name,2) as name FROM ... GROUP BY LEFT(name,2) - FAIL - Method used by Superset
SELECT LEFT(name,2) as name FROM ... GROUP BY name - OK

@eschutho's workaround by adding a new calculated column ts, will make superset not use the same alias as the original column. However, this is only reasonable for the time column, and not custom SQL adhoc metrics.

I'm not sure what is "proper SQL", but many engines seem to handle GROUP BY very differently.

In superset/superset/models/helpers.py make_orderby_compatible:

    If needed, make sure aliases for selected columns are not used in
    `ORDER BY`.

    In some databases (e.g. Presto), `ORDER BY` clause is not able to
    automatically pick the source column if a `SELECT` clause alias is named
    the same as a source column. In this case, we update the SELECT alias to
    another name to avoid the conflict.`

So, some adjustment is done for ORDER BY, and it seems like the same is needed for GROUP BY.

@rusackas rusackas added data:connect:dremio Related to Dremio data:connect:drill Related to Drill labels Feb 1, 2024
@rusackas
Copy link
Member

rusackas commented Feb 1, 2024

I'm not sure who might be best to take a look at the current state of affairs here, as I don't know of anyone Drill or Dremio volunteers to reproduce or assess.

I might consider filing the Drill issue separately, lest this issue get conflated and stuck.

CC @naren-dremio in case he knows of anyone fitting to take a look at this from the Dremio side.

@rusackas
Copy link
Member

Anyone still facing this, or looking to work on it? It's at risk of being closed as stale.

@mbrannstrom
Copy link

@rusackas: See issue #28443. The PR #28444 solves my issue with Drill. Just waiting for it to be merged ... and released.

@rusackas
Copy link
Member

rusackas commented Jun 3, 2024

Adding more reviewers to that PR :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
#bug Bug report data:connect:dremio Related to Dremio data:connect:drill Related to Drill
Projects
None yet
Development

No branches or pull requests

5 participants