to_pyarrow_table does not deduce partition filters from filter list #1997
Comments
@zalmane according to the pyarrow docs, filter predicates in the
I also saw this in the to_table docs, but in delta-rs it appears to be a two-step process: it first calls to_pyarrow_dataset, then invokes to_table. So to_pyarrow_dataset would scan these fragments and return a dataset that includes all files, and only then would to_table filter them.
Maybe @wjones127 knows that one. I can only assume it doesn't matter per se.
Did some more digging using plain vanilla pyarrow and strace. It definitely seems that filtering at the dataset level allows files to be skipped; if the filter is only passed to to_table, those files at least get scanned for their metadata or statistics.
At this point, it doesn't scan the files. Their metadata is already known from the Delta log. The partition values and file-level statistics are taken from the Delta log and attached to each of the dataset fragments, and then a dataset is built from those fragments. This all happens in this loop: delta-rs/python/deltalake/table.py, lines 945 to 954 in bc9253c
No IO should be happening here between the scanning of the delta log and creation of the PyArrow dataset. (If there is, that is a bug.) The filters passed into
Do you have a repro here? Are you using
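The idea in the comment above (pruning from log metadata alone, with no file IO) can be illustrated with a small pure-Python sketch. This is not the delta-rs or PyArrow implementation: `LogEntry`, `may_match`, and `prune` are hypothetical names invented for illustration, and the entries only mimic partition values and min/max statistics that a table log might record.

```python
from dataclasses import dataclass, field


@dataclass
class LogEntry:
    """Hypothetical stand-in for one file entry recorded in a table log."""
    path: str
    partition_values: dict
    min_values: dict = field(default_factory=dict)
    max_values: dict = field(default_factory=dict)


def may_match(entry, column, op, value):
    """Return False only when the log metadata proves no row can match."""
    if column in entry.partition_values:
        actual = entry.partition_values[column]
        if op == "=":
            return actual == value
        if op == "!=":
            return actual != value
        return True  # unknown operator: keep the file conservatively
    lo = entry.min_values.get(column)
    hi = entry.max_values.get(column)
    if lo is None or hi is None:
        return True  # no statistics: the file cannot be ruled out
    if op == "=":
        return lo <= value <= hi
    if op == ">":
        return hi > value
    if op == ">=":
        return hi >= value
    if op == "<":
        return lo < value
    if op == "<=":
        return lo <= value
    return True


def prune(entries, filters):
    """Keep entries that may satisfy every (column, op, value) predicate."""
    return [e for e in entries
            if all(may_match(e, c, op, v) for c, op, v in filters)]


# Illustrative log contents; no file is ever opened.
log = [
    LogEntry("year=2022/part-0.parquet", {"year": "2022"}, {"value": 1}, {"value": 10}),
    LogEntry("year=2023/part-0.parquet", {"year": "2023"}, {"value": 1}, {"value": 4}),
    LogEntry("year=2023/part-1.parquet", {"year": "2023"}, {"value": 5}, {"value": 20}),
]
selected_paths = [e.path for e in prune(log, [("year", "=", "2023"), ("value", ">", 4)])]
print(selected_paths)
```

Only the third entry survives here: the first is excluded by its partition value and the second by its maximum statistic, and both decisions use metadata alone.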
If this is still relevant, please ping me.
Description
In the documentation for to_pyarrow_table, it is stated that if filters are used, partitions do not need to be specified.
However, looking at the code here (delta-rs/python/deltalake/table.py, line 969 in bc9253c), it seems that only partitions are used when creating the dataset. The filters are then applied only when converting to a table.
Is this a bug or expected behavior? It seems that for a large Delta table, we would be scanning a lot of irrelevant files in the first stage.
Use Case
Specify a filters list and expect delta-rs to deduce the relevant partition filters and apply them to the dataset.
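As a rough illustration of what this use case asks for, the sketch below splits a flat filter list into predicates on partition columns and everything else. It is an assumption-laden sketch, not delta-rs code: `split_partition_filters` is a hypothetical helper, the filters are assumed to be a single conjunction of `(column, op, value)` tuples, and the set of operators treated as usable for partition filtering is an assumption.

```python
# Operators assumed (for this sketch) to be usable in a partition filter.
PARTITION_OPS = ("=", "!=", "in", "not in")


def split_partition_filters(filters, partition_columns):
    """Split a conjunction of (column, op, value) tuples into two lists:
    predicates that touch only partition columns, and the remainder."""
    partition_filters, remaining = [], []
    for column, op, value in filters:
        if column in partition_columns and op in PARTITION_OPS:
            partition_filters.append((column, op, value))
        else:
            remaining.append((column, op, value))
    return partition_filters, remaining


# Hypothetical table partitioned by "year".
filters = [("year", "=", "2023"), ("value", ">", 4)]
partition_filters, row_filters = split_partition_filters(filters, {"year"})
print(partition_filters)
print(row_filters)
```

In such a scheme the partition predicates could be passed when the dataset is created and the remaining predicates applied when materializing the table. Nested OR-of-AND filter lists are not handled here.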
Related Issue(s)