Be able to stream the results of query #3
Comments
With datasets-server we'll store all the datasets as Parquet, so you'll be able to use DuckDB on every dataset, streaming the results from the remote Parquet files.
For this to work, besides storing the Parquet version of a dataset, we would also have to implement an HF filesystem as a DuckDB extension. Otherwise, the full URL path of a Parquet file will have to be used. Considering this, I think the most flexible solution would be to implement the HF filesystem and support something like:

import duckdb

con = duckdb.connect()
reader = con.execute("SELECT * FROM 'hf://datasets/poloclub/diffusiondb'").fetch_record_batch(chunk_size)
ds = Dataset.from_reader(reader)  # or IterableDataset.from_reader(reader)
Would it be possible to patch the query string on the fly to replace the hf:// paths with the full Parquet URLs?
I don't think an implicit conversion like this is a good design since a repo can contain (multiple) Parquet files where not all of them are included in the generated dataset, which means these files could not be referenced with your approach. Hence, it would be cleaner to construct the Parquet URLs explicitly.
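For concreteness, a rough sketch of what the explicit-URL approach could look like with DuckDB's httpfs extension; the Parquet file name below is a placeholder, not an actual file in the repo:

import duckdb

con = duckdb.connect()
con.execute("INSTALL httpfs")  # extension needed to read Parquet over HTTP(S)
con.execute("LOAD httpfs")

# Placeholder URL: the real file names/shards depend on the repo layout
url = "https://huggingface.co/datasets/poloclub/diffusiondb/resolve/main/some-file.parquet"
reader = con.execute(f"SELECT * FROM read_parquet('{url}')").fetch_record_batch(10_000)
for batch in reader:  # pyarrow.RecordBatch chunks, streamed instead of materialized at once
    ...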
Makes sense @mariosasko
I'd like to query a large remote dataset (on the hub or elsewhere) and then stream the results of the query so that I don't have to download the entire dataset to my machine.
For example, you could query diffusiondb for images generated with prompts containing the word "ceo" to visualize biases:
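Something along these lines, assuming hf:// paths can be resolved as in the discussion above and that diffusiondb's Parquet files expose a prompt column (both are assumptions):

import duckdb

con = duckdb.connect()
# Hypothetical: relies on hf:// path resolution and on a `prompt` column
# being present in diffusiondb's Parquet files.
query = """
SELECT *
FROM 'hf://datasets/poloclub/diffusiondb'
WHERE prompt ILIKE '%ceo%'
"""
reader = con.execute(query).fetch_record_batch(10_000)
for batch in reader:
    ...  # e.g. render the images in a gradio gallery, batch by batch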
This combined with huggingface/dataset-viewer#398 would open the door for a lot of cool applications of gradio + datasets where users could interactively explore datasets that don't fit on their machines and create spaces without having to download/store large datasets.
I see that data can be streamed from duckdb with pyarrow (https://duckdb.org/2021/12/03/duck-arrow.html). I wonder if this can be leveraged for this use case.
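From that post, a minimal sketch of how a DuckDB query over Parquet data can be streamed back as Arrow record batches; the local path and view name are placeholders:

import duckdb
import pyarrow.dataset as ds

# Placeholder local path; the same pattern would apply once remote Parquet files are reachable.
dataset = ds.dataset("data/", format="parquet")

con = duckdb.connect()
con.register("my_dataset", dataset)  # expose the Arrow dataset to DuckDB (zero-copy)
reader = con.execute("SELECT * FROM my_dataset").fetch_record_batch(100_000)
for batch in reader:  # results arrive as pyarrow.RecordBatch chunks, not one big table
    print(batch.num_rows)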