[TPC-H] Polars does not run at scale 1000 #1389

Open
hendrikmakait opened this issue Feb 13, 2024 · 10 comments
@hendrikmakait (Member)

At scale 1000, Polars fails with pyo3_runtime.PanicException: polars' maximum length reached. Consider installing 'polars-u64-idx'.: TryFromIntError(())

For now, I'll try manually installing polars-u64-idx and re-running the tests. I'll update this issue with my findings.

Cluster: https://cloud.coiled.io/clusters/383561/information?viewedAccount=%22dask-benchmarks%22&tab=Logs&filterPattern=

Traceback:

```
2024-02-13 19:04:05.9950 scheduler
distributed.worker - WARNING - Compute Failed
Key:       _run-3500c860159128597d97e62786e99d95
Function:  _run
args:      (<function test_query_1.<locals>._ at 0x7fe99433e520>)
kwargs:    {}
Exception: 'PanicException("polars\' maximum length reached. Consider installing \'polars-u64-idx\'.: TryFromIntError(())")'

2024-02-13 19:04:05.9360 scheduler
distributed.core - INFO - Event loop was unresponsive in Worker for 33.25s.  This is often caused by long-running GIL-holding functions or moving large chunks of data. This can cause timeouts and instability.

2024-02-13 19:04:05.9350 scheduler
distributed.worker.memory - WARNING - Unmanaged memory use is high. This may indicate a memory leak or the memory may not be released to the OS; see https://distributed.dask.org/en/latest/worker-memory.html#memory-not-released-back-to-the-os for more information. -- Unmanaged memory: 374.80 GiB -- Worker memory limit: 393.79 GiB

2024-02-13 19:04:02.0430 scheduler
pyo3_runtime.PanicException: polars' maximum length reached. Consider installing 'polars-u64-idx'.: TryFromIntError(())

2024-02-13 19:04:02.0250 scheduler
Traceback (most recent call last):
  File "/opt/coiled/env/lib/python3.11/site-packages/polars/utils/_scan.py", line 28, in _execute_from_rust
    return function(with_columns, *args)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/coiled/env/lib/python3.11/site-packages/polars/io/pyarrow_dataset/anonymous_scan.py", line 105, in _scan_pyarrow_dataset_impl
    return from_arrow(ds.to_table(**common_params))  # type: ignore[return-value]
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/coiled/env/lib/python3.11/site-packages/polars/convert.py", line 594, in from_arrow
    return pl.DataFrame._from_arrow(
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/coiled/env/lib/python3.11/site-packages/polars/dataframe/frame.py", line 599, in _from_arrow
    arrow_to_pydf(
  File "/opt/coiled/env/lib/python3.11/site-packages/polars/utils/_construction.py", line 1546, in arrow_to_pydf
    pydf = pydf.rechunk()
           ^^^^^^^^^^^^^^

2024-02-13 19:04:02.0240 scheduler
--- PyO3 is resuming a panic after fetching a PanicException from Python. ---
Python stack trace below:

2024-02-13 19:04:00.4060 scheduler
thread '<unnamed>' panicked at crates/polars-core/src/chunked_array/ops/chunkops.rs:83:62:
polars' maximum length reached. Consider installing 'polars-u64-idx'.: TryFromIntError(())
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
```
@hendrikmakait (Member, Author)

Installing polars-u64-idx removed the error, but now it's running OOM even on query 1: https://cloud.coiled.io/clusters/383584/information?viewedAccount=%22dask-benchmarks%22&tab=Metrics&filterPattern=

I'll abort further tests.

@ritchie46 (Contributor)

@hendrikmakait the u64 error is expected: if you work with more than 4.2 billion rows, you need that version of Polars. The new string type introduced a bug in our out-of-core aggregation. Can you retry with Polars 0.20.8? I expect it to be released this afternoon.
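For context, a back-of-the-envelope check (not from the thread; the per-scale-factor row count is approximate) shows why scale 1000 trips the default build's index limit:

```python
# Rough sketch (assumption: default polars wheels index rows with u32,
# while polars-u64-idx indexes with u64).
U32_MAX = 2**32 - 1  # ~4.29 billion rows representable by the default build

# TPC-H lineitem holds roughly 6 million rows per scale factor,
# so scale 1000 is ~6 billion rows (approximate figure):
lineitem_rows_sf1000 = 6_000_000 * 1000

if lineitem_rows_sf1000 > U32_MAX:
    # This is the situation that triggers the TryFromIntError panic above.
    print("Row count exceeds u32 indexing; polars-u64-idx is required.")
```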

@hendrikmakait (Member, Author)

@ritchie46: Thanks for the additional info, I'll rerun the suite on 0.20.8. What's the performance difference between polars and polars-u64-idx, i.e., would it be fine to run all scales using polars-u64-idx, or would that significantly skew the results?

@hendrikmakait (Member, Author)

Polars still runs OOM on query 1, even with polars-u64-idx=0.20.8: https://cloud.coiled.io/clusters/384122/information?viewedAccount=%22dask-benchmarks%22&tab=Logs&filterPattern=

@ritchie46 (Contributor) commented Feb 14, 2024

> Polars still runs OOM on query 1, even with polars-u64-idx=0.20.8: https://cloud.coiled.io/clusters/384122/information?viewedAccount=%22dask-benchmarks%22&tab=Logs&filterPattern=

Yeah, I see the query starts from a pyarrow_dataset. Polars cannot run the streaming engine on a pyarrow dataset. The query should start with scan_parquet.

So this

```python
def read_data(filename):
    pyarrow_dataset = dataset(filename, format="parquet")
    return pl.scan_pyarrow_dataset(pyarrow_dataset)

    # NOTE: the early return above makes everything below unreachable
    if filename.startswith("s3://"):
        import boto3

        session = boto3.session.Session()
        credentials = session.get_credentials()
        return pl.scan_parquet(
            filename,
            storage_options={
                "aws_access_key_id": credentials.access_key,
                "aws_secret_access_key": credentials.secret_key,
                "region": "us-east-2",
            },
        )
    else:
        return pl.scan_parquet(filename + "/*")
```

should be

```python
def read_data(filename):
    return pl.scan_parquet(filename + "/*")
```
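For illustration, a minimal usage sketch of how the rewritten scan could then run on the streaming engine (the path and query are placeholders, not from the benchmark):

```python
import polars as pl

# Hypothetical sketch: starting from scan_parquet allows the query
# to execute on Polars' out-of-core streaming engine.
lf = pl.scan_parquet("tpch-1000/lineitem/*.parquet")  # placeholder path

result = (
    lf.select(pl.col("l_quantity").sum())
    .collect(streaming=True)  # engage the streaming engine
)
print(result)
```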

> What's the performance difference between polars and polars-u64-idx, i.e., would it be fine to run all scales using polars-u64-idx

The default binary is optimized for smaller datasets. The u64 build is slower if you start from disk. I believe you are benchmarking from S3, so I think the difference will be smaller, but you'll have to try it.

@phofl (Contributor) commented Feb 14, 2024

> Yeah, I see the query starts from a pyarrow_dataset. Polars cannot run the streaming engine on a pyarrow dataset. The query should start with scan_parquet.

#1394

Can Polars figure out storage options automatically now?

@hendrikmakait (Member, Author) commented Feb 15, 2024

To summarize a few findings: It's not stuck per se, but it didn't show much hardware utilization and wasn't done after 30 minutes, so I aborted the test. Looking at the hardware metrics, CPU utilization sits at ~400% most of the time, suggesting that it's still doing something, just not a lot. Looking at a run at scale 100, we can see that CPU is at 100%-200% most of the time, so maybe our configuration is off?

Scale 100 cluster: https://cloud.coiled.io/clusters/385189/information?viewedAccount=%22dask-benchmarks%22&tab=Metrics

@hendrikmakait hendrikmakait changed the title [TPC-H] Polars fails with pyo3_runtime.PanicException: polars' maximum length reached. Consider installing 'polars-u64-idx'.: TryFromIntError(()) [TPC-H] Polars does not run at scale 1000 Feb 15, 2024
@phofl (Contributor) commented Feb 15, 2024

Turns out that using the deprecated pl.count blocks streaming mode. If we use pl.len instead, explain gives us a proper streaming plan; that was very surprising.

#1395
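For illustration, a minimal sketch of swapping pl.count for pl.len in a query-1-style aggregation (the path and column selection are assumptions, not the benchmark code from #1395):

```python
import polars as pl

# Minimal sketch (assumed lineitem path and query-1-style aggregation):
lf = pl.scan_parquet("lineitem/*.parquet")

q = lf.group_by("l_returnflag", "l_linestatus").agg(
    pl.sum("l_quantity").alias("sum_qty"),
    # pl.count() is deprecated and reportedly blocked streaming mode;
    # pl.len() keeps the plan streamable:
    pl.len().alias("count_order"),
)

# The streaming section of the plan should now cover the aggregation:
print(q.explain(streaming=True))
```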

@hendrikmakait (Member, Author)

Switching to scan_parquet brings its own set of problems: #1396
