
Downloading https dataset doesn't appropriately push down predicates #12493

Closed
2 tasks done
kszlim opened this issue Nov 15, 2023 · 9 comments · Fixed by #12517
Labels
enhancement New feature or an improvement of an existing feature python Related to Python Polars

Comments

@kszlim
Contributor

kszlim commented Nov 15, 2023

Checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of Polars.

Reproducible example

Trying to replicate this https://til.simonwillison.net/duckdb/remote-parquet in polars, I end up downloading the whole dataset (according to nettop).

import polars as pl
import os
import time

pid = os.getpid()
print(f"PID is: {pid}") # for nettop
time.sleep(5)

base_url = "https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/"

ldfs = []

for i in range(56):
    url = f"{base_url}{i:06}.parquet"
    ldfs.append(pl.scan_parquet(url))

ldf = pl.concat(ldfs)
df = ldf.select(pl.col("size").sum())
print(df.collect())

Log output

PID is: 23755
UNION: union is run in parallel
shape: (1, 1)
┌─────────────────┐
│ size            │
│ ---             │
│ i64             │
╞═════════════════╡
│ 162800469938172 │
└─────────────────┘

Issue description

Polars seems to download the files one at a time (at least according to nettop), and I don't see any pruning based on statistics in the verbose logs.

Is this an issue with the https implementation in object_store or how it's called?

Expected behavior

Should only download exactly the data needed to compute the sum (on the order of 100s of MB, whereas it downloads GBs).

Installed versions

--------Version info---------
Polars:              0.19.13
Index type:          UInt32
Platform:            macOS-13.5.2-arm64-arm-64bit
Python:              3.9.6 (default, Aug 11 2023, 19:44:49)
[Clang 15.0.0 (clang-1500.0.40.1)]

----Optional dependencies----
adbc_driver_sqlite:  0.5.1
cloudpickle:         <not installed>
connectorx:          0.3.1
deltalake:           0.10.0
fsspec:              2023.6.0
gevent:              <not installed>
matplotlib:          3.7.2
numpy:               1.25.0
openpyxl:            <not installed>
pandas:              2.0.2
pyarrow:             12.0.1
pydantic:            2.0.2
pyiceberg:           <not installed>
pyxlsb:              <not installed>
sqlalchemy:          2.0.18
xlsx2csv:            0.8.1
xlsxwriter:          3.1.2

@kszlim kszlim added bug Something isn't working python Related to Python Polars labels Nov 15, 2023
@kszlim
Contributor Author

kszlim commented Nov 15, 2023

[screenshot: nettop showing the Python process downloading GBs]

Here's an example of it in progress (you can see it's downloading GBs).

@deanm0000
Collaborator

The object_store library is only used for azure/google/s3, not https. It looks like the library supports http, so I guess this should be a feature request to use object_store for http links too.

I don't think the issue is that predicate pushdown isn't working; the issue is that Polars eagerly downloads files instead of using HTTP range requests to fetch statistics first, as DuckDB does.

Relatedly, it downloads in serial because it seemingly downloads the whole file on each invocation of scan_parquet.

@kszlim
Contributor Author

kszlim commented Nov 15, 2023

That makes sense; it would be great if a maintainer could convert this into a feature request.

@alexander-beedie alexander-beedie added enhancement New feature or an improvement of an existing feature and removed bug Something isn't working labels Nov 15, 2023
@universalmind303
Collaborator

FWIW, most http endpoints will still require downloading the entire file. object_store relies on HTTP range requests for filtering, and there is no way to know ahead of time whether a given server supports range requests.

IIRC, object_store will actually reject the download if it submits a range request and the server does not fulfill it.

@deanm0000
Collaborator

most http endpoints will still require downloading the entire file

Do they though? I struggled to find anything authoritative on "percent of web servers that support range requests" but I did find this https://news.ycombinator.com/item?id=27019819

Without range requests, video and audio streaming don't work, and resuming downloads doesn't work. Web servers (the software) are FOSS, so it's not as if you're relying on hosts paying for some premium product to get range-request support.

@universalmind303
Collaborator

Do they though?

I guess a better way of stating my original point is that Polars has no way to guarantee that any given HTTP server supports range requests. Some may include the accept-ranges: bytes header in the response to a HEAD request, but that is not a requirement.

So you can end up in a situation where the range request is not accepted. Does Polars error out, or does it retry by downloading the whole file?

object-store will error out.

So while it is possible to do this through object-store, there are some considerations around handling these kinds of endpoints, as it will cause some backwards-incompatible changes. People using Polars to read from private servers that may not be set up to handle range requests will be most affected.

@ritchie46
Member

Maybe we could try supporting it via object-store and let object_store error when we cannot do range queries. We can try/catch and fall back to the current behavior.

Note that there are no predicates here, so I would say it is projection pushdown that is not fully utilized.

@ritchie46
Member

Will add http support in scan_parquet.

This query would have worked:

base_url = "https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/"

files = []

for i in range(32):
    url = f"{base_url}{i:06}.parquet"
    files.append(url)


df = pl.scan_parquet(files).select(pl.col("size").sum())
print(df.collect())

If only pandas hadn't written the dataset mixing floats and ints :')

ComputeError: schema of all files in a single scan_parquet must be equal

Expected: Schema:
name: id, data type: Utf8
name: channel_id, data type: Utf8
name: content, data type: Utf8
name: timestamp, data type: Utf8
name: image_id, data type: Utf8
name: height, data type: Int64
name: width, data type: Int64
name: url, data type: Utf8
name: size, data type: Int64


Got: Schema:
name: id, data type: Utf8
name: channel_id, data type: Utf8
name: content, data type: Utf8
name: timestamp, data type: Utf8
name: image_id, data type: Utf8
name: height, data type: Float64
name: width, data type: Float64
name: url, data type: Utf8
name: size, data type: Int64

@kszlim's query will also work, but it will be slower, as we cannot amortize the web costs over a single dataset.

@ritchie46
Member

edit:

This query now works:

base_url = "https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/"

files = []

for i in range(56):
    url = f"{base_url}{i:06}.parquet"
    files.append(url)


df = pl.scan_parquet(files).select(pl.col("size").sum())
print(df.collect())
