
Be able to stream the results of query #3

Open
freddyaboulton opened this issue Nov 21, 2022 · 5 comments
Comments

@freddyaboulton

I'd like to query a large remote dataset (on the hub or elsewhere) and then stream the results of the query so that I don't have to download the entire dataset to my machine.

For example, you could query diffusiondb for images generated with prompts containing the word "ceo" to visualize biases:

SELECT * from poloclub/diffusiondb
WHERE contains('prompt', 'ceo')

This combined with huggingface/dataset-viewer#398 would open the door for a lot of cool applications of gradio + datasets where users could interactively explore datasets that don't fit on their machines and create spaces without having to download/store large datasets.

I see that data can be streamed from duckdb with pyarrow: https://duckdb.org/2021/12/03/duck-arrow.html . I wonder if this can be leveraged for this use case.

@lhoestq commented Nov 22, 2022

With datasets-server we'll store all the datasets as Parquet, so you'll be able to use DuckDB on every dataset, streaming the results from the remote Parquet files.

@mariosasko (Owner)

SELECT * from poloclub/diffusiondb
WHERE contains('prompt', 'ceo')

For this to work, besides storing the Parquet version of a dataset, we would also have to implement an HF filesystem as a DuckDB extension. Otherwise, the full URL path of a Parquet file would have to be used.

Considering this, I think the most flexible solution would be to implement the HF filesystem and Dataset.from_reader/IterableDataset.from_reader that create a HF dataset from a query. Then the workflow would look as follows:

import duckdb

con = duckdb.connect()
# chunk_size: number of rows per Arrow RecordBatch
reader = con.execute(
    "SELECT * FROM 'hf://datasets/poloclub/diffusiondb'"
).fetch_record_batch(chunk_size)
# Proposed API (either variant):
ds = Dataset.from_reader(reader)
# or: ds = IterableDataset.from_reader(reader)

@freddyaboulton (Author)

Would it be possible to patch the query string on the fly in the meantime, replacing poloclub/diffusiondb or hf://datasets/poloclub/diffusiondb with the full path to the Parquet file? I don't think this issue should be blocked on implementing the HF filesystem, which I agree will be great to have.

@mariosasko (Owner)

@freddyaboulton

I don't think an implicit conversion like this is a good design: a repo can contain (multiple) Parquet files of which not all are included in the generated dataset, so those files could not be referenced with your approach. Hence, it would be cleaner to construct Parquet URLs with huggingface_hub.hf_hub_url.

@freddyaboulton (Author)

Makes sense @mariosasko
