[TPC-H] Polars does not run at scale 1000 #1389

Open
hendrikmakait opened this issue Feb 13, 2024 · 10 comments
@hendrikmakait (Member)

At scale 1000, Polars fails with pyo3_runtime.PanicException: polars' maximum length reached. Consider installing 'polars-u64-idx'.: TryFromIntError(())

For now, I'll try manually installing polars-u64-idx and re-running the tests. I'll update this issue with my findings.

Cluster: https://cloud.coiled.io/clusters/383561/information?viewedAccount=%22dask-benchmarks%22&tab=Logs&filterPattern=

Traceback:

```
2024-02-13 19:04:05.9950 scheduler
distributed.worker - WARNING - Compute Failed
Key:       _run-3500c860159128597d97e62786e99d95
Function:  _run
args:      (<function test_query_1.<locals>._ at 0x7fe99433e520>)
kwargs:    {}
Exception: 'PanicException("polars\' maximum length reached. Consider installing \'polars-u64-idx\'.: TryFromIntError(())")'

2024-02-13 19:04:05.9360 scheduler
distributed.core - INFO - Event loop was unresponsive in Worker for 33.25s.  This is often caused by long-running GIL-holding functions or moving large chunks of data. This can cause timeouts and instability.

2024-02-13 19:04:05.9350 scheduler
distributed.worker.memory - WARNING - Unmanaged memory use is high. This may indicate a memory leak or the memory may not be released to the OS; see https://distributed.dask.org/en/latest/worker-memory.html#memory-not-released-back-to-the-os for more information. -- Unmanaged memory: 374.80 GiB -- Worker memory limit: 393.79 GiB

2024-02-13 19:04:02.0430 scheduler
pyo3_runtime.PanicException: polars' maximum length reached. Consider installing 'polars-u64-idx'.: TryFromIntError(())

2024-02-13 19:04:02.0250 scheduler
Traceback (most recent call last):
  File "/opt/coiled/env/lib/python3.11/site-packages/polars/utils/_scan.py", line 28, in _execute_from_rust
    return function(with_columns, *args)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/coiled/env/lib/python3.11/site-packages/polars/io/pyarrow_dataset/anonymous_scan.py", line 105, in _scan_pyarrow_dataset_impl
    return from_arrow(ds.to_table(**common_params))  # type: ignore[return-value]
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/coiled/env/lib/python3.11/site-packages/polars/convert.py", line 594, in from_arrow
    return pl.DataFrame._from_arrow(
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/coiled/env/lib/python3.11/site-packages/polars/dataframe/frame.py", line 599, in _from_arrow
    arrow_to_pydf(
  File "/opt/coiled/env/lib/python3.11/site-packages/polars/utils/_construction.py", line 1546, in arrow_to_pydf
    pydf = pydf.rechunk()
           ^^^^^^^^^^^^^^

2024-02-13 19:04:02.0240 scheduler
--- PyO3 is resuming a panic after fetching a PanicException from Python. ---
Python stack trace below:

2024-02-13 19:04:00.4060 scheduler
thread '<unnamed>' panicked at crates/polars-core/src/chunked_array/ops/chunkops.rs:83:62:
polars' maximum length reached. Consider installing 'polars-u64-idx'.: TryFromIntError(())
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
```
@hendrikmakait (Member, Author)

Installing polars-u64-idx removed the error, but now it's running OOM even on query 1: https://cloud.coiled.io/clusters/383584/information?viewedAccount=%22dask-benchmarks%22&tab=Metrics&filterPattern=

I'll abort further tests.

@ritchie46 (Contributor)

@hendrikmakait the u64 error is expected: if you work with more than 4.2 billion rows, you need that version of Polars. The new string type introduced a bug in our out-of-core aggregation. Can you retry with Polars 0.20.8? I expect it to be released this afternoon.
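For context, a back-of-the-envelope check (not from the thread; the per-scale-factor row count is approximate) shows why scale 1000 trips the default build's index limit:

```python
# Rough sketch (assumption: default polars wheels index rows with u32,
# while polars-u64-idx indexes with u64).
U32_MAX = 2**32 - 1  # ~4.29 billion rows representable by the default build

# TPC-H lineitem holds roughly 6 million rows per scale factor,
# so scale 1000 is ~6 billion rows (approximate figure):
lineitem_rows_sf1000 = 6_000_000 * 1000

if lineitem_rows_sf1000 > U32_MAX:
    # This is the situation that triggers the TryFromIntError panic above.
    print("Row count exceeds u32 indexing; polars-u64-idx is required.")
```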

@hendrikmakait (Member, Author)

@ritchie46: Thanks for the additional info, I'll rerun the suite on 0.20.8. What's the performance difference between polars and polars-u64-idx, i.e., would it be fine to run all scales using polars-u64-idx, or would that significantly skew the results?

@hendrikmakait (Member, Author)

Polars still runs OOM on query 1, even with polars-u64-idx=0.20.8: https://cloud.coiled.io/clusters/384122/information?viewedAccount=%22dask-benchmarks%22&tab=Logs&filterPattern=

@ritchie46 (Contributor) commented Feb 14, 2024

> Polars still runs OOM on query 1, even with polars-u64-idx=0.20.8: https://cloud.coiled.io/clusters/384122/information?viewedAccount=%22dask-benchmarks%22&tab=Logs&filterPattern=

Yeah, I see the query starts from a pyarrow_dataset. Polars cannot run the streaming engine on a pyarrow dataset. The query should start with scan_parquet.

So this

```python
def read_data(filename):
    pyarrow_dataset = dataset(filename, format="parquet")
    return pl.scan_pyarrow_dataset(pyarrow_dataset)

    # NOTE: the early return above makes everything below unreachable
    if filename.startswith("s3://"):
        import boto3

        session = boto3.session.Session()
        credentials = session.get_credentials()
        return pl.scan_parquet(
            filename,
            storage_options={
                "aws_access_key_id": credentials.access_key,
                "aws_secret_access_key": credentials.secret_key,
                "region": "us-east-2",
            },
        )
    else:
        return pl.scan_parquet(filename + "/*")
```

should be

```python
def read_data(filename):
    return pl.scan_parquet(filename + "/*")
```
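For illustration, a minimal usage sketch of how the rewritten scan could then run on the streaming engine (the path and query are placeholders, not from the benchmark):

```python
import polars as pl

# Hypothetical sketch: starting from scan_parquet allows the query
# to execute on Polars' out-of-core streaming engine.
lf = pl.scan_parquet("tpch-1000/lineitem/*.parquet")  # placeholder path

result = (
    lf.select(pl.col("l_quantity").sum())
    .collect(streaming=True)  # engage the streaming engine
)
print(result)
```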

> What's the performance difference between polars and polars-u64-idx, i.e., would it be fine to run all scales using polars-u64-idx

The default binary is optimized for smaller datasets. The u64 build is slower if you start from disk. I believe you are benchmarking from S3, so I think the difference will be smaller, but you'll have to try it.

@phofl (Contributor) commented Feb 14, 2024

> Yeah, I see the query starts from a pyarrow_dataset. Polars cannot run the streaming engine on a pyarrow dataset. The query should start with scan_parquet.

#1394

Can Polars figure out storage options automatically now?

@hendrikmakait (Member, Author) commented Feb 15, 2024

To summarize a few findings: It's not stuck per se, but it didn't show much hardware utilization and wasn't done after 30 minutes, so I aborted the test. Looking at the hardware metrics, CPU utilization sits at ~400% most of the time, suggesting that it's still doing something, just not a lot. Looking at a run at scale 100, we can see that CPU is at 100%-200% most of the time, so maybe our configuration is off?

Scale 100 cluster: https://cloud.coiled.io/clusters/385189/information?viewedAccount=%22dask-benchmarks%22&tab=Metrics

@hendrikmakait hendrikmakait changed the title [TPC-H] Polars fails with pyo3_runtime.PanicException: polars' maximum length reached. Consider installing 'polars-u64-idx'.: TryFromIntError(()) [TPC-H] Polars does not run at scale 1000 Feb 15, 2024
@phofl (Contributor) commented Feb 15, 2024

Turns out that using the deprecated pl.count blocks streaming mode. If we use pl.len instead, explain gives us a proper streaming plan; that was very surprising.

#1395
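For illustration, a minimal sketch of swapping pl.count for pl.len in a query-1-style aggregation (the path and column selection are assumptions, not the benchmark code from #1395):

```python
import polars as pl

# Minimal sketch (assumed lineitem path and query-1-style aggregation):
lf = pl.scan_parquet("lineitem/*.parquet")

q = lf.group_by("l_returnflag", "l_linestatus").agg(
    pl.sum("l_quantity").alias("sum_qty"),
    # pl.count() is deprecated and reportedly blocked streaming mode;
    # pl.len() keeps the plan streamable:
    pl.len().alias("count_order"),
)

# The streaming section of the plan should now cover the aggregation:
print(q.explain(streaming=True))
```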

@hendrikmakait (Member, Author)

Switching to scan_parquet brings its own set of problems: #1396
