Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

test(tpcds): add query 64 #9955

Merged
merged 8 commits into from
Sep 1, 2024
Merged

test(tpcds): add query 64 #9955

merged 8 commits into from
Sep 1, 2024

Conversation

gforsyth
Copy link
Member

@gforsyth gforsyth commented Aug 28, 2024

So this should probably not get merged yet? Or we can merge it and mark the
DuckDB failure and investigate further.

My notes so far:

There is some weird stuff happening only inside Pytest when running
this query on DuckDB. This doesn't happen with any other tpc-ds
queries, and the query in question seems to run without issue on the
other backends that can run the tpc-ds queries.

If I generate sf=1 tpc-ds data in the DuckDB CLI, then run the included
sql file (which is the SQL generated by Ibis for this query), it runs in
about half a second.

If I %load ibis_tpcds_64_local.py and then run the expr there (which
is a copy-paste of the test code), it takes about half a second.

If I remove the pytest.timeout from the tpc_test decorator, this
same query, even at sf=.45 (which is empty at that size), takes
minutes to run.

I also thought it might be the difference between using memory as the catalog instead of tpcds, but no, if I create tpcds.ddb, then create the tables in there, it still runs very quickly.

@cpcloud
Copy link
Member

cpcloud commented Aug 29, 2024

This query is absolutely monstrous.

@gforsyth gforsyth force-pushed the tpcds64 branch 2 times, most recently from d558c19 to 80e9cf9 Compare August 29, 2024 15:07
@gforsyth
Copy link
Member Author

ok, well, it passes on the other backends and the timeout will handle duckdb for now.

I'm also wondering, for the sake of correctness checks, if we should shunt the tpcds data creation into CI at sf=1

@gforsyth
Copy link
Member Author

oh FFS, of course it doesn't timeout on some of the DuckDB jobs

@gforsyth
Copy link
Member Author

This query is cursed

@cpcloud
Copy link
Member

cpcloud commented Aug 30, 2024

The ClickHouse timeout also does not surprise me, but it also seems to trigger what looks like a resource leakage: the remaining TPC-DS queries cannot run, because the client from query 64 appears to still be hanging around.

@cpcloud
Copy link
Member

cpcloud commented Aug 30, 2024

Two totally weird facts:

  1. .tmp grows quite large while waiting for the query to finish
  2. Loading TPC-DS data using CREATE TABLE instead of CREATE VIEW allows the query to finish on DuckDB quickly.

@cpcloud
Copy link
Member

cpcloud commented Aug 30, 2024

Here's a stripped down version of the query that has all the same annoying and mysterious characteristics we're observing:

   @tpc_test("ds", result_is_empty=True)
   def test_64(store_sales, customer_demographics):
       cd1 = customer_demographics
       cd2 = customer_demographics.view()

       expr = (
           store_sales.join(cd1, _.ss_cdemo_sk == cd1.cd_demo_sk)
           .join(
               cd2[["cd_marital_status"]], cd1.cd_marital_status != cd2.cd_marital_status
           )
           .select(cd1.cd_marital_status)
           .limit(1)
       )

       return expr

@cpcloud
Copy link
Member

cpcloud commented Aug 30, 2024

Found a reproducer without pytest:

import duckdb


con = duckdb.connect()
sql = """
SELECT
 t3.cd_marital_status
FROM read_parquet('ci/ibis-testing-data/tpcds/sf=0.45/parquet/store_sales.parquet') AS t2
JOIN read_parquet('ci/ibis-testing-data/tpcds/sf=0.45/parquet/customer_demographics.parquet') AS t3
 ON ss_cdemo_sk = cd_demo_sk
JOIN read_parquet('ci/ibis-testing-data/tpcds/sf=0.45/parquet/customer_demographics.parquet') AS t4
 ON t3.cd_marital_status <> t4.cd_marital_status
LIMIT 1
"""
result = con.sql(sql)
result = result.arrow()
assert len(result) == 1

@cpcloud
Copy link
Member

cpcloud commented Aug 30, 2024

Reported upstream here duckdb/duckdb#13657

@cpcloud
Copy link
Member

cpcloud commented Aug 31, 2024

Since this is fixed upstream already and the DuckDB release is out soon, I'll suggest that for now we alter our duckdb/conftest.py to create tables instead of views.

Not entirely sure what to do about ClickHouse yet.

@cpcloud cpcloud added this to the 9.4 milestone Aug 31, 2024
@cpcloud cpcloud added the tests Issues or PRs related to tests label Aug 31, 2024
@cpcloud cpcloud added clickhouse The ClickHouse backend duckdb The DuckDB backend labels Sep 1, 2024
Copy link
Member

@cpcloud cpcloud left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for chugging through this abomination.

@cpcloud
Copy link
Member

cpcloud commented Sep 1, 2024

Snowflake passing query 64:

❯ pytest -m snowflake ibis/backends/tests/tpc/ds/test_queries.py -k 64 -q
.                                                                                                                                                                                              [100%]
1 passed, 1980 deselected in 18.66s

@cpcloud cpcloud enabled auto-merge (squash) September 1, 2024 14:16
@cpcloud cpcloud merged commit d40e6e9 into ibis-project:main Sep 1, 2024
81 checks passed
@gforsyth gforsyth deleted the tpcds64 branch September 1, 2024 16:33
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
clickhouse The ClickHouse backend duckdb The DuckDB backend tests Issues or PRs related to tests
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants