
Be able to stream the results of query #3

Open
freddyaboulton opened this issue Nov 21, 2022 · 5 comments
Comments

@freddyaboulton

I'd like to query a large remote dataset (on the hub or elsewhere) and then stream the results of the query so that I don't have to download the entire dataset to my machine.

For example, you could query diffusiondb for images generated with prompts containing the word "ceo" to visualize biases:

SELECT * from poloclub/diffusiondb
WHERE contains('prompt', 'ceo')

This combined with huggingface/dataset-viewer#398 would open the door for a lot of cool applications of gradio + datasets where users could interactively explore datasets that don't fit on their machines and create spaces without having to download/store large datasets.

I see that data can be streamed from duckdb with pyarrow: https://duckdb.org/2021/12/03/duck-arrow.html . I wonder if this can be leveraged for this use case.

@lhoestq commented Nov 22, 2022

With datasets-server we'll store all the datasets as Parquet, so you'll be able to use DuckDB on every dataset, streaming the results from the remote Parquet files.

@mariosasko (Owner)

SELECT * from poloclub/diffusiondb
WHERE contains('prompt', 'ceo')

For this to work, besides storing the Parquet version of a dataset, we would also have to implement an HF filesystem as a DuckDB extension. Otherwise, the full URL path of a Parquet file would have to be used.

Considering this, I think the most flexible solution would be to implement the HF filesystem and Dataset.from_reader/IterableDataset.from_reader that create a HF dataset from a query. Then the workflow would look as follows:

import duckdb

con = duckdb.connect()
# chunk_size: number of rows per Arrow RecordBatch
reader = con.execute(
    "SELECT * FROM 'hf://datasets/poloclub/diffusiondb'"
).fetch_record_batch(chunk_size)
# Proposed API (either variant):
ds = Dataset.from_reader(reader)
# or: ds = IterableDataset.from_reader(reader)

@freddyaboulton (Author)

Would it be possible to patch the query string on the fly in the meantime, replacing poloclub/diffusiondb or hf://datasets/poloclub/diffusiondb with the full path to the Parquet file? I don't think this issue should be blocked on implementing the HF filesystem, which I agree will be great to have.

@mariosasko (Owner)

@freddyaboulton

I don't think an implicit conversion like this is a good design: a repo can contain (multiple) Parquet files of which not all are included in the generated dataset, so those files could not be referenced with your approach. Hence, it would be cleaner to construct Parquet URLs with huggingface_hub.hf_hub_url.

@freddyaboulton (Author)

Makes sense @mariosasko
