[TPC-H] Polars does not run at scale 1000 #1389
Comments
Installing … I'll abort further tests.

@hendrikmakait the …

@ritchie46: Thanks for the additional info, I'll rerun the suite on …

Polars still runs OOM on query 1, even with …
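For context, TPC-H query 1 is essentially a full scan of `lineitem` followed by a group-by aggregation. A simplified sketch of what is running out of memory here (the path, date literal, and column subset are illustrative, not the benchmark's exact query):

```python
import polars as pl

# Simplified sketch of TPC-H query 1: scan lineitem, filter on ship date,
# then aggregate per (returnflag, linestatus). The S3 path is hypothetical.
lineitem = pl.scan_parquet("s3://bucket/tpch-1000/lineitem/*")

q1 = (
    lineitem
    .filter(pl.col("l_shipdate") <= pl.date(1998, 9, 2))
    .group_by("l_returnflag", "l_linestatus")
    .agg(
        pl.col("l_quantity").sum().alias("sum_qty"),
        pl.col("l_extendedprice").sum().alias("sum_base_price"),
        pl.len().alias("count_order"),
    )
    .sort("l_returnflag", "l_linestatus")
)

# The streaming engine processes the scan in batches instead of
# materializing the full table in memory.
result = q1.collect(streaming=True)
```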
Yeah, I see the query starts from a pyarrow dataset scan. So this

```python
def read_data(filename):
    pyarrow_dataset = dataset(filename, format="parquet")
    return pl.scan_pyarrow_dataset(pyarrow_dataset)
```

and this

```python
def read_data(filename):
    if filename.startswith("s3://"):
        import boto3

        session = boto3.session.Session()
        credentials = session.get_credentials()
        return pl.scan_parquet(
            filename,
            storage_options={
                "aws_access_key_id": credentials.access_key,
                "aws_secret_access_key": credentials.secret_key,
                "region": "us-east-2",
            },
        )
    else:
        return pl.scan_parquet(filename + "/*")
```

should be

```python
def read_data(filename):
    return pl.scan_parquet(filename + "/*")
```

The default binary is optimized for smaller datasets. It is slower if you start from disk. I believe you are benchmarking from S3, so I think the difference will be less. But you'll have to try it.
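One way to see the difference between the two scan paths (a rough sketch; the exact plan text differs between Polars versions, and the `lineitem` path is hypothetical) is to compare their optimized query plans:

```python
import polars as pl
from pyarrow.dataset import dataset

path = "lineitem"  # hypothetical local directory of parquet files

# Native reader: shows up as a parquet-scan node in the plan, with
# projection/predicate pushdown handled by Polars' own reader.
native = pl.scan_parquet(path + "/*")
print(native.select(pl.len()).explain())

# pyarrow-dataset scan: executed through a Python callback into pyarrow,
# so it bypasses Polars' native parquet reader.
pa_scan = pl.scan_pyarrow_dataset(dataset(path, format="parquet"))
print(pa_scan.select(pl.len()).explain())
```

As far as I understand, only the native scan can be batched by the streaming engine, which is presumably part of why the pyarrow-dataset version was running out of memory.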
Can Polars figure out storage options automatically now?
It looks like we fixed the OOM problem, but now Polars appears to be "stuck": https://cloud.coiled.io/clusters/385166/information?viewedAccount=%22dask-benchmarks%22&tab=Metrics&cluster+network_variation=Cluster+Total+Rate
To summarize a few findings: it's not stuck per se, but it didn't show much hardware utilization and wasn't done after 30 minutes, so I aborted the test. Looking at the hardware metrics, CPU utilization is at ~400% for most of the time, suggesting that it's still doing something, but not a lot. Looking at a run at scale 100, we can see that CPU is at 100%–200% for most of the time, so maybe our configuration is off? Scale 100 cluster: https://cloud.coiled.io/clusters/385189/information?viewedAccount=%22dask-benchmarks%22&tab=Metrics
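If it is a configuration problem, one cheap thing to rule out (a sketch, not something from this run; `POLARS_MAX_THREADS` only takes effect if set before Polars is imported) is whether the thread pool on the VM actually matches its core count:

```python
import os

# "16" is just an illustrative guess at the instance's core count,
# not the benchmark's actual setting.
os.environ.setdefault("POLARS_MAX_THREADS", "16")

import polars as pl

# Reports how many threads Polars will use for query execution
# (pl.threadpool_size() in older releases).
print(pl.thread_pool_size())
```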
```
pyo3_runtime.PanicException: polars' maximum length reached. Consider installing 'polars-u64-idx'.: TryFromIntError(())
```
Turns out, if you use the deprecated `pl.count` it will block the streaming mode. It seems to give us a proper output from `explain` if we use `pl.len`; that was very surprising.
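A small sketch of that check (assuming a Polars version where the no-argument `pl.count()` still exists; `explain(streaming=True)` annotates which part of the plan the streaming engine can run, and the path is hypothetical):

```python
import polars as pl

lf = pl.scan_parquet("lineitem/*")  # hypothetical path to the parquet files

# With pl.len() the aggregation stays inside the streaming section of the plan.
print(lf.select(pl.len()).explain(streaming=True))

# With the deprecated pl.count() the plan (as observed above) falls out of
# the streaming section, so the query runs on the in-memory engine instead.
print(lf.select(pl.count()).explain(streaming=True))
```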
Switching to …
At scale 1000, Polars fails with

```
pyo3_runtime.PanicException: polars' maximum length reached. Consider installing 'polars-u64-idx'.: TryFromIntError(())
```
For now, I'll try manually installing `polars-u64-idx` and re-running the tests. I'll update this issue with my findings.

Cluster: https://cloud.coiled.io/clusters/383561/information?viewedAccount=%22dask-benchmarks%22&tab=Logs&filterPattern=
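For reference, a minimal way to confirm which index build is actually active after swapping wheels (assuming `polars-u64-idx` replaces the default `polars` package and that the installed version exposes `pl.get_index_type()`):

```python
# Assumes the environment was switched with something like:
#   pip uninstall polars && pip install polars-u64-idx
import polars as pl

# The default build uses a 32-bit row index (~4.3 billion row limit, which is
# what the PanicException above is hitting); the u64-idx build uses 64 bits.
print(pl.get_index_type())  # UInt32 on the default build, UInt64 on polars-u64-idx
```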
Traceback: