
Downloading https dataset doesn't appropriately push down predicates #12493

Closed
2 tasks done
kszlim opened this issue Nov 15, 2023 · 9 comments · Fixed by #12517
Labels
enhancement New feature or an improvement of an existing feature python Related to Python Polars

Comments

@kszlim
Contributor

kszlim commented Nov 15, 2023

Checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of Polars.

Reproducible example

Trying to replicate this https://til.simonwillison.net/duckdb/remote-parquet in polars, I end up downloading the whole dataset (according to nettop).

import polars as pl
import os
import time

pid = os.getpid()
print(f"PID is: {pid}") # for nettop
time.sleep(5)

base_url = "https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/"

ldfs = []

for i in range(56):
    url = f"{base_url}{i:06}.parquet"
    ldfs.append(pl.scan_parquet(url))

ldf = pl.concat(ldfs)
df = ldf.select(pl.col("size").sum())
print(df.collect())

Log output

PID is: 23755
UNION: union is run in parallel
shape: (1, 1)
┌─────────────────┐
│ size            │
│ ---             │
│ i64             │
╞═════════════════╡
│ 162800469938172 │
└─────────────────┘

Issue description

Polars seems to download the files one at a time (at least according to nettop), and I don't see any pruning based on statistics in the verbose logs.

Is this an issue with the https implementation in object_store or how it's called?

Expected behavior

Should only download exactly the data needed to compute the sum (on the order of 100s of MB, whereas it downloads GBs).

Installed versions

--------Version info---------
Polars:              0.19.13
Index type:          UInt32
Platform:            macOS-13.5.2-arm64-arm-64bit
Python:              3.9.6 (default, Aug 11 2023, 19:44:49)
[Clang 15.0.0 (clang-1500.0.40.1)]

----Optional dependencies----
adbc_driver_sqlite:  0.5.1
cloudpickle:         <not installed>
connectorx:          0.3.1
deltalake:           0.10.0
fsspec:              2023.6.0
gevent:              <not installed>
matplotlib:          3.7.2
numpy:               1.25.0
openpyxl:            <not installed>
pandas:              2.0.2
pyarrow:             12.0.1
pydantic:            2.0.2
pyiceberg:           <not installed>
pyxlsb:              <not installed>
sqlalchemy:          2.0.18
xlsx2csv:            0.8.1
xlsxwriter:          3.1.2

@kszlim kszlim added bug Something isn't working python Related to Python Polars labels Nov 15, 2023
@kszlim
Contributor Author

kszlim commented Nov 15, 2023

[screenshot: nettop showing the Python process downloading GBs]

Here's an example of it in progress (you can see it's downloading GBs).

@deanm0000
Collaborator

The object_store library is only used for azure/google/s3, not https. It looks like the library supports http, so I guess this should be a feature request to use object_store for http links too.

I don't think the issue is that predicate pushdown isn't working; the issue is that Polars eagerly downloads files instead of using HTTP range requests to fetch statistics first, as DuckDB does.

Relatedly, it downloads in serial because it seemingly downloads the whole file on each invocation of scan_parquet.

@kszlim
Contributor Author

kszlim commented Nov 15, 2023

That makes sense; it would be great if a maintainer could convert this into a feature request.

@alexander-beedie alexander-beedie added enhancement New feature or an improvement of an existing feature and removed bug Something isn't working labels Nov 15, 2023
@universalmind303
Collaborator

FWIW, most http endpoints will still require downloading the entire file. object_store relies on HTTP range requests for filtering, and there is no way to know ahead of time whether a given server supports range requests.

IIRC, object_store will actually reject the download if it submits a range request and the server does not fulfill it.

@deanm0000
Collaborator

most http endpoints will still require downloading the entire file

Do they though? I struggled to find anything authoritative on "percent of web servers that support range requests" but I did find this https://news.ycombinator.com/item?id=27019819

Without range requests, video and audio streaming don't work, and resuming downloads doesn't work. Web servers (the software) are FOSS, so it's not as if you're relying on hosts paying for some premium product to get range-request support.

@universalmind303
Collaborator

Do they though?

I guess a better way of stating my original point is that Polars has no way to guarantee that any given HTTP server supports range requests. Some may include the accept-ranges: bytes header in the response to a HEAD request, but that is not a requirement.

So you can end up in a situation where the range request is not accepted. Does Polars error out, or does it retry by downloading the whole file?

object-store will error out.

So while it is possible to do this through object-store, there are some considerations around handling these kinds of endpoints, as it will cause some backwards-incompatible changes. People using Polars to read from private servers that may not be set up to handle range requests will be most affected.

@ritchie46
Member

Maybe we could try supporting it via object-store and let object_store error when we cannot do range queries. We can try/catch and fall back to the current behavior.

Note that there are no predicates here, so I would say it is projection pushdown that is not fully utilized.

@ritchie46
Member

Will add http support in scan_parquet.

This query would have worked:

base_url = "https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/"

files = []

for i in range(32):
    url = f"{base_url}{i:06}.parquet"
    files.append(url)


df = pl.scan_parquet(files).select(pl.col("size").sum())
print(df.collect())

If only pandas hadn't written the dataset mixing floats and ints :')

ComputeError: schema of all files in a single scan_parquet must be equal

Expected: Schema:
name: id, data type: Utf8
name: channel_id, data type: Utf8
name: content, data type: Utf8
name: timestamp, data type: Utf8
name: image_id, data type: Utf8
name: height, data type: Int64
name: width, data type: Int64
name: url, data type: Utf8
name: size, data type: Int64


Got: Schema:
name: id, data type: Utf8
name: channel_id, data type: Utf8
name: content, data type: Utf8
name: timestamp, data type: Utf8
name: image_id, data type: Utf8
name: height, data type: Float64
name: width, data type: Float64
name: url, data type: Utf8
name: size, data type: Int64

@kszlim's query will also work, but it will be slower, as we cannot amortize the web costs over a single dataset.

@ritchie46
Member

edit:

This query now works:

base_url = "https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/"

files = []

for i in range(56):
    url = f"{base_url}{i:06}.parquet"
    files.append(url)


df = pl.scan_parquet(files).select(pl.col("size").sum())
print(df.collect())
