
to_pyarrow_table does not deduce partition filters from filter list #1997

Closed
zalmane opened this issue Dec 28, 2023 · 6 comments
Labels
enhancement New feature or request

Comments

zalmane commented Dec 28, 2023

Description

The documentation for `to_pyarrow_table` states that if filters are used, partitions do not need to be specified.

However, looking at the code of `to_pyarrow_table`, it seems that only the partitions argument is used when creating the dataset; the filters are only applied afterwards, when converting the dataset to a table.

Is this a bug or expected behavior? It seems that for a large Delta table we would scan a lot of irrelevant files in the first stage.

Use Case
Specify a filters list and expect delta-rs to deduce the relevant partition filters from it and apply them when creating the Dataset.
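A minimal sketch of the requested behavior (this is a hypothetical helper, not delta-rs code): given the table's partition columns, split a DNF-style filter list into partition filters, which could prune files at dataset creation, and the remaining filters, which would still be applied at scan time.

```python
def split_filters(filters, partition_cols):
    """Split a list of (column, op, value) filter tuples into
    (partition_filters, scan_filters) based on the partition columns.

    Hypothetical illustration only; delta-rs does not currently do this.
    """
    partition_filters, scan_filters = [], []
    for col, op, value in filters:
        if col in partition_cols:
            partition_filters.append((col, op, value))
        else:
            scan_filters.append((col, op, value))
    return partition_filters, scan_filters


pf, sf = split_filters(
    [("year", "=", 2023), ("amount", ">", 10)],
    partition_cols={"year"},
)
```

With the split in hand, `pf` could be passed as the `partitions` argument and `sf` kept for the table-level filter.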

Related Issue(s)

@zalmane zalmane added the enhancement New feature or request label Dec 28, 2023
@ion-elgreco
Collaborator

@zalmane according to the PyArrow docs, filter predicates in `.to_table()` use partition information and file statistics to do the table scan, so it should skip files that do not match the predicate


zalmane commented Dec 28, 2023

I also saw this in the `to_table` docs, but in delta-rs it is a two-step process: it first calls `to_pyarrow_dataset`, then invokes `to_table`. So `to_pyarrow_dataset` would scan these fragments and return a dataset that includes all files, and only then would `to_table` filter them.
I assume filtering the partitions in `to_pyarrow_dataset` functions as an optimization? Otherwise, why filter the partitions there at all rather than waiting for the `to_table` call?

@ion-elgreco
Collaborator

Maybe @wjones127 knows that one.

I can only assume it doesn't matter per se


zalmane commented Dec 28, 2023

Did some more digging using plain-vanilla PyArrow and strace. It definitely seems that filtering at the dataset level allows files to be skipped entirely, whereas files only referenced in `to_table` will at least be scanned for their metadata or statistics.
This can be meaningful for large Delta tables that have many partitions and files.
Thoughts?

@wjones127
Collaborator

So the to_pyarrow_dataset would scan these fragments and return a dataset that includes all files

At this point, it doesn't scan the files. Their metadata are already known in the delta log. The partition values and file-level statistics are taken from the delta log and attached to each of the dataset fragments, then a dataset is made of those fragments. This all happens in this loop:

fragments = [
    format.make_fragment(
        file,
        filesystem=filesystem,
        partition_expression=part_expression,
    )
    for file, part_expression in self._table.dataset_partitions(
        self.schema().to_pyarrow(), partitions
    )
]

No IO should be happening here between the scanning of the delta log and creation of the PyArrow dataset. (If there is, that is a bug.)

The filters passed into Dataset.to_table() are first evaluated against the fragment metadata and used to prune them. If you want to see which fragments are pruned, you can see which fragments are left over after passing the same filter to Dataset.get_fragments().

Definitely seems that filtering at the dataset levels allows to skip files, which will otherwise at least get scanned for the metadata or statistics if only referenced in to_table.
This can be meaningful in case of large Delta tables that have a lot of partitions and files.

Do you have a repro here? Are you using FileFormat.make_fragment() like we do?

@ion-elgreco ion-elgreco closed this as not planned Dec 7, 2024
@ion-elgreco
Collaborator

If this is still relevant, please ping me.
