
to_pyarrow_table does not deduce partition filters from filter list #1997

Closed
zalmane opened this issue Dec 28, 2023 · 6 comments
Labels
enhancement New feature or request

Comments

zalmane commented Dec 28, 2023

Description

The documentation for `to_pyarrow_table` states that if filters are used, partitions do not need to be specified.

However, looking at the code of `to_pyarrow_table`, it seems that only the partitions argument is used when creating the dataset; the filters are only applied afterwards, when converting the dataset to a table.

Is this a bug or expected behavior? It seems that for a large Delta table we would scan a lot of irrelevant files in the first stage.

Use Case
Specify a filters list and expect delta-rs to deduce the relevant partition filters from it and apply them when creating the Dataset.
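A minimal sketch of the requested behavior (this is a hypothetical helper, not delta-rs code): given the table's partition columns, split a DNF-style filter list into partition filters, which could prune files at dataset creation, and the remaining filters, which would still be applied at scan time.

```python
def split_filters(filters, partition_cols):
    """Split a list of (column, op, value) filter tuples into
    (partition_filters, scan_filters) based on the partition columns.

    Hypothetical illustration only; delta-rs does not currently do this.
    """
    partition_filters, scan_filters = [], []
    for col, op, value in filters:
        if col in partition_cols:
            partition_filters.append((col, op, value))
        else:
            scan_filters.append((col, op, value))
    return partition_filters, scan_filters


pf, sf = split_filters(
    [("year", "=", 2023), ("amount", ">", 10)],
    partition_cols={"year"},
)
```

With the split in hand, `pf` could be passed as the `partitions` argument and `sf` kept for the table-level filter.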

Related Issue(s)

@zalmane zalmane added the enhancement New feature or request label Dec 28, 2023
@ion-elgreco
Collaborator

@zalmane according to the PyArrow docs, filter predicates in `.to_table()` use partition information and file statistics to do the table scan, so it should skip files that do not match the predicate


zalmane commented Dec 28, 2023

I also saw this in the `to_table` docs, but in delta-rs it is a two-step process: it first calls `to_pyarrow_dataset`, then invokes `to_table`. So `to_pyarrow_dataset` would scan these fragments and return a dataset that includes all files, and only then would `to_table` filter them.
I assume filtering the partitions in `to_pyarrow_dataset` functions as an optimization? Otherwise, why filter the partitions there at all rather than waiting for the `to_table` call?

@ion-elgreco
Collaborator

Maybe @wjones127 knows that one.

I can only assume it doesn't matter per se


zalmane commented Dec 28, 2023

Did some more digging using plain-vanilla PyArrow and strace. It definitely seems that filtering at the dataset level allows files to be skipped entirely, whereas files only referenced in `to_table` will at least be scanned for their metadata or statistics.
This can be meaningful for large Delta tables that have many partitions and files.
Thoughts?

@wjones127
Collaborator

So the to_pyarrow_dataset would scan these fragments and return a dataset that includes all files

At this point, it doesn't scan the files. Their metadata are already known in the delta log. The partition values and file-level statistics are taken from the delta log and attached to each of the dataset fragments, then a dataset is made of those fragments. This all happens in this loop:

fragments = [
    format.make_fragment(
        file,
        filesystem=filesystem,
        partition_expression=part_expression,
    )
    for file, part_expression in self._table.dataset_partitions(
        self.schema().to_pyarrow(), partitions
    )
]

No IO should be happening here between the scanning of the delta log and creation of the PyArrow dataset. (If there is, that is a bug.)

The filters passed into Dataset.to_table() are first evaluated against the fragment metadata and used to prune them. If you want to see which fragments are pruned, you can see which fragments are left over after passing the same filter to Dataset.get_fragments().

Definitely seems that filtering at the dataset levels allows to skip files, which will otherwise at least get scanned for the metadata or statistics if only referenced in to_table.
This can be meaningful in case of large Delta tables that have a lot of partitions and files.

Do you have a repro here? Are you using FileFormat.make_fragment() like we do?

@ion-elgreco ion-elgreco closed this as not planned Dec 7, 2024
@ion-elgreco
Collaborator

If this is still relevant, please ping me.
