Downloading https dataset doesn't appropriately push down predicates #12493
Comments
The `object_store` library is only used for Azure/Google/S3, not HTTPS. It looks like the library does support HTTP, so I guess this should be a feature request to use `object_store` for HTTP links too. I don't think the issue is that predicate pushdown isn't working; the issue is that Polars eagerly downloads the files instead of using HTTP range requests, as DuckDB does, to get statistics first. As a corollary, it downloads in serial because it seemingly downloads the whole file at each invocation of `scan_parquet`.
That makes sense. Perhaps a maintainer could convert this into a feature request.
FWIW, most HTTP endpoints will still require downloading the entire file, IIRC.
Do they though? I struggled to find anything authoritative on "percent of web servers that support range requests", but I did find this: https://news.ycombinator.com/item?id=27019819. Without range requests, video and audio streaming don't work, and resuming downloads doesn't work. Web servers (the software) are FOSS, so you're not relying on hosts paying for some premium product to get range request support.
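Range support is easy to probe empirically. The sketch below (hypothetical helper names, standard library only — not anything Polars does) requests a single byte and checks whether the server answers `206 Partial Content`, or at least advertises `Accept-Ranges: bytes`:

```python
from urllib.request import Request, urlopen

def supports_range_requests(status: int, headers: dict) -> bool:
    """Interpret a response to a ranged GET: 206 means the server honored
    the Range header; a plain 200 means it ignored it and sent the full body."""
    if status == 206:
        return True
    # Some servers advertise support via Accept-Ranges even when they
    # answered this particular request with a 200.
    return headers.get("Accept-Ranges", "none").lower() == "bytes"

def probe(url: str) -> bool:
    """Fetch the first byte of `url` and report whether ranges are honored."""
    req = Request(url, headers={"Range": "bytes=0-0"})
    with urlopen(req) as resp:
        return supports_range_requests(resp.status, dict(resp.headers))
```

A client that cares could run such a probe once per host and cache the answer, rather than deciding per request.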
I guess a better way of stating my OP would be that Polars has no way to guarantee that any given HTTP server supports range requests. Some may include the `Accept-Ranges` header, but many don't, so you can end up with a situation where a range request is not accepted. Does Polars error out, or does it retry by downloading the whole file? So while it is possible to do this through `object_store`, there are some considerations around handling these kinds of endpoints, as it will cause some backwards-incompatible changes. People using Polars to read from private servers that may not be set up to handle range requests will be most affected.
Maybe we could try supporting it via `object_store` and let `object_store` error when we cannot do range queries. We can't try-catch and fall back to the current behavior. Note that there are no predicates here, so I would say it is projection pushdown that is not fully utilized.
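For comparison, the try/fall-back idea discussed above could look roughly like this sketch; the callables and exception type are hypothetical stand-ins, not Polars or `object_store` APIs:

```python
from typing import Callable

class RangeNotSupported(Exception):
    """Hypothetical error: the server ignored or rejected the Range header."""

def read_bytes(
    fetch_range: Callable[[int, int], bytes],
    fetch_full: Callable[[], bytes],
    start: int,
    length: int,
) -> bytes:
    """Prefer a ranged fetch; on failure, fall back to the current eager
    behavior of downloading the whole object and slicing locally."""
    try:
        return fetch_range(start, start + length - 1)
    except RangeNotSupported:
        # Eager path: the entire file comes over the wire, defeating
        # the bandwidth savings but preserving correctness.
        return fetch_full()[start : start + length]
```

The backwards-compatibility concern is visible here: silently falling back hides the cost, while erroring out breaks users behind servers without range support.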
Will add http support. This query would have worked:

```python
import polars as pl

base_url = "https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/"
files = []
for i in range(32):
    url = f"{base_url}{i:06}.parquet"
    files.append(url)

df = pl.scan_parquet(files).select(pl.col("size").sum())
print(df.collect())
```

...if pandas hadn't written a dataset mixing floats/ints :')

```
ComputeError: schema of all files in a single scan_parquet must be equal
```
```
Expected: Schema:
name: id, data type: Utf8
name: channel_id, data type: Utf8
name: content, data type: Utf8
name: timestamp, data type: Utf8
name: image_id, data type: Utf8
name: height, data type: Int64
name: width, data type: Int64
name: url, data type: Utf8
name: size, data type: Int64

Got: Schema:
name: id, data type: Utf8
name: channel_id, data type: Utf8
name: content, data type: Utf8
name: timestamp, data type: Utf8
name: image_id, data type: Utf8
name: height, data type: Float64
name: width, data type: Float64
name: url, data type: Utf8
name: size, data type: Int64
```

@kszlim your query will also work, but it will be slower, as we cannot amortize the web costs over a single dataset.
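One workaround for such mixed-type files is to compute a common "supertype" schema and cast every file to it before concatenating. The sketch below is not a Polars API — the dtype names and the single promotion rule (`Int64` widens to `Float64`, matching the `height`/`width` mismatch above) are assumptions for illustration:

```python
# Hypothetical schema-reconciliation helper operating on
# {column: dtype-name} dicts; assumes both files share the same columns.
PROMOTIONS = {frozenset({"Int64", "Float64"}): "Float64"}

def merge_schemas(a: dict, b: dict) -> dict:
    """Merge two schemas, promoting dtypes where a safe widening exists."""
    merged = {}
    for col in a:
        if a[col] == b.get(col):
            merged[col] = a[col]
        else:
            key = frozenset({a[col], b[col]})
            if key not in PROMOTIONS:
                raise TypeError(f"cannot unify {a[col]} and {b[col]} for {col!r}")
            merged[col] = PROMOTIONS[key]
    return merged
```

With the merged schema in hand, each file can be cast to it at scan time and the results concatenated.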
edit: This query now works:

```python
import polars as pl

base_url = "https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/"
files = []
for i in range(56):
    url = f"{base_url}{i:06}.parquet"
    files.append(url)

df = pl.scan_parquet(files).select(pl.col("size").sum())
print(df.collect())
```
Checks
I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of Polars.
Reproducible example
Trying to replicate this https://til.simonwillison.net/duckdb/remote-parquet in polars, I end up downloading the whole dataset (according to nettop).
Log output
```
PID is: 23755
UNION: union is run in parallel
shape: (1, 1)
┌─────────────────┐
│ size            │
│ ---             │
│ i64             │
╞═════════════════╡
│ 162800469938172 │
└─────────────────┘
```
Issue description
It seems to download the files one at a time (at least according to nettop), and I don't see any pruning in the verbose logs based on statistics. Is this an issue with the https implementation in `object_store`, or with how it's called?
Expected behavior
It should only download exactly the data needed to compute the sum (on the order of hundreds of MB, whereas it downloads GBs).
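For context on why only a small fraction needs downloading: a Parquet file ends with its metadata (including column statistics), a 4-byte little-endian length of that metadata, and the magic bytes `PAR1` — that layout is from the Parquet format spec. So a reader can learn everything it needs for pruning with two small range requests: one for the 8-byte tail, one for the metadata it points at. A sketch of parsing the tail (the fetching side is left hypothetical):

```python
import struct

def footer_metadata_length(tail: bytes) -> int:
    """Parse the last 8 bytes of a Parquet file: a little-endian u32 with
    the metadata length, followed by the magic b"PAR1". A range-aware
    reader would next fetch exactly that many bytes before the footer."""
    if len(tail) < 8 or tail[-4:] != b"PAR1":
        raise ValueError("not a Parquet footer")
    (meta_len,) = struct.unpack("<I", tail[-8:-4])
    return meta_len
```

This is the mechanism that lets DuckDB (in the linked TIL) answer the same query while transferring only hundreds of MB.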
Installed versions