test(tpcds): add query 64 #9955
This query is absolutely monstrous.
ok, well, it passes on the other backends and the timeout will handle I'm also wondering, for the sake of correctness checks, if we should shunt the |
oh FFS, of course it doesn't time out on some of the DuckDB jobs
This query is cursed
The ClickHouse timeout also does not surprise me, but it also seems to trigger what looks like a resource leak: the remaining TPC-DS queries cannot run, because the client from query 64 appears to still be hanging around.
Two totally weird facts:
|
Here's a stripped-down version of the query that has all the same annoying and mysterious characteristics we're observing:
@tpc_test("ds", result_is_empty=True)
def test_64(store_sales, customer_demographics):
    cd1 = customer_demographics
    cd2 = customer_demographics.view()
    expr = (
        store_sales.join(cd1, _.ss_cdemo_sk == cd1.cd_demo_sk)
        .join(
            cd2[["cd_marital_status"]], cd1.cd_marital_status != cd2.cd_marital_status
        )
        .select(cd1.cd_marital_status)
        .limit(1)
    )
    return expr
Found a reproducer that uses only DuckDB:
import duckdb

con = duckdb.connect()
sql = """
SELECT
  t3.cd_marital_status
FROM read_parquet('ci/ibis-testing-data/tpcds/sf=0.45/parquet/store_sales.parquet') AS t2
JOIN read_parquet('ci/ibis-testing-data/tpcds/sf=0.45/parquet/customer_demographics.parquet') AS t3
  ON ss_cdemo_sk = cd_demo_sk
JOIN read_parquet('ci/ibis-testing-data/tpcds/sf=0.45/parquet/customer_demographics.parquet') AS t4
  ON t3.cd_marital_status <> t4.cd_marital_status
LIMIT 1
"""
result = con.sql(sql)
result = result.arrow()
assert len(result) == 1
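For a rough feel of how large the intermediate result of this query shape can get before the LIMIT is applied, here is a small sketch that counts the rows produced by the two joins. It is only an illustration: it uses the standard-library sqlite3 module to stay dependency-free, the table sizes and contents are made up, and the join_row_counts helper is hypothetical. It says nothing about how DuckDB plans the query or what the upstream bug actually is.

```python
import sqlite3


def join_row_counts(n_sales=1000, n_demo=100):
    """Count rows after the equi-join and after the extra inequality join.

    Synthetic stand-ins for store_sales and customer_demographics; the
    sizes and values are assumptions, not the real TPC-DS data.
    """
    con = sqlite3.connect(":memory:")
    con.execute(
        "CREATE TABLE customer_demographics "
        "(cd_demo_sk INTEGER, cd_marital_status TEXT)"
    )
    con.execute("CREATE TABLE store_sales (ss_cdemo_sk INTEGER)")

    statuses = ["M", "S", "D", "W", "U"]
    con.executemany(
        "INSERT INTO customer_demographics VALUES (?, ?)",
        [(i, statuses[i % len(statuses)]) for i in range(n_demo)],
    )
    con.executemany(
        "INSERT INTO store_sales VALUES (?)",
        [(i % n_demo,) for i in range(n_sales)],
    )

    # Rows after the first (equality) join only.
    (equi,) = con.execute(
        """
        SELECT COUNT(*)
        FROM store_sales AS t2
        JOIN customer_demographics AS t3
          ON t2.ss_cdemo_sk = t3.cd_demo_sk
        """
    ).fetchone()

    # Rows after adding the inequality join, with no LIMIT applied.
    (full,) = con.execute(
        """
        SELECT COUNT(*)
        FROM store_sales AS t2
        JOIN customer_demographics AS t3
          ON t2.ss_cdemo_sk = t3.cd_demo_sk
        JOIN customer_demographics AS t4
          ON t3.cd_marital_status <> t4.cd_marital_status
        """
    ).fetchone()

    con.close()
    return equi, full


equi_rows, full_rows = join_row_counts()
print(equi_rows, full_rows)
```

With these made-up sizes, each sales row matches one demographics row, and each of those pairs with every demographics row that has a different marital status, so the second count grows with the product of the two table sizes.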
Reported upstream here: duckdb/duckdb#13657
Since this is fixed upstream already and the DuckDB release is out soon, I'll suggest that for now we alter our Not entirely sure what to do about ClickHouse yet. |
…erformance bug when using views
Thank you for chugging through this abomination.
Snowflake passing query 64:
|
So this should probably not get merged yet? Or we can merge it and mark the
DuckDB failure and investigate further.
My notes so far:
There is some weird stuff happening only inside Pytest when running
this query on DuckDB. This doesn't happen with any other tpc-ds
queries, and the query in question seems to run without issue on the
other backends that can run the tpc-ds queries.
If I generate sf=1 tpc-ds data in the DuckDB CLI, then run the included
sql file (which is the SQL generated by Ibis for this query), it runs in
about half a second.
If I %load ibis_tpcds_64_local.py and then run the expr there (which is a
copy-paste of the test code), it takes about half a second.
If I remove the pytest.timeout from the tpc_test decorator, this same query,
even at sf=.45 (which is empty at that size), takes minutes to run.
I also thought it might be the difference between using memory as the catalog
instead of tpcds, but no, if I create tpcds.ddb, then create the tables in
there, it still runs very quickly.