Performance issue with range queries over a partitioned table. #811

Open
ryzhyk opened this issue Dec 16, 2024 · 2 comments · May be fixed by #821

Comments

ryzhyk (Contributor) commented Dec 16, 2024

I ran into a performance issue querying an Iceberg table in S3 via the datafusion provider. The table was created using pyiceberg with the following schema:

schema = Schema(
    NestedField(1, "id", LongType(), required=True),
    NestedField(2, "name", StringType(), required=False),
    NestedField(3, "b", BooleanType(), required=True),
    NestedField(4, "ts", TimestampType(), required=True),
    NestedField(5, "dt", DateType(), required=True),
)

The table is partitioned by date extracted from the ts column:

partition_spec = PartitionSpec(
    PartitionField(
        source_id=4, field_id=1000, transform=DayTransform(), name="date"
    )
)
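For reference, Iceberg's `DayTransform` maps each timestamp to the number of days since the Unix epoch, so every calendar day lands in its own partition. A minimal sketch of the transform (illustrative only, not pyiceberg's implementation):

```python
from datetime import datetime, date

def day_transform(ts: datetime) -> int:
    # Iceberg's day transform: days since the Unix epoch (1970-01-01).
    return (ts.date() - date(1970, 1, 1)).days

# Two timestamps on the same day map to the same partition value...
assert day_transform(datetime(2023, 1, 5, 0, 0)) == day_transform(datetime(2023, 1, 5, 23, 59))
# ...and consecutive days map to adjacent partition values.
assert day_transform(datetime(2023, 1, 6)) == day_transform(datetime(2023, 1, 5)) + 1
```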

There are 10,000,000 records in the table spread evenly across ~200 partitions for dates between 2023-01-01 and 2023-08-02.

I query the table using iceberg-rust via the datafusion table provider using range queries of the form:

select * from my_table where ts >= timestamp '2023-01-05T00:00:00' and ts < timestamp '2023-01-06T00:00:00'

I expect this query to be very efficient, since it only needs to read one partition. In reality, however, it takes about as long as scanning the entire table with select * from my_table (approximately 10 seconds). It looks like predicate pushdown doesn't work here for some reason.
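The pruning this query should trigger is easy to model: a day-partitioned table can discard any file whose partition value falls outside the day range implied by the `ts` bounds. A toy illustration (not iceberg-rust's actual pruning code) of why only one of the ~200 partitions should survive this filter:

```python
from datetime import date, datetime, timedelta

EPOCH = date(1970, 1, 1)

def day_value(d: date) -> int:
    # Iceberg day-transform value: days since the Unix epoch.
    return (d - EPOCH).days

# Daily partitions covering 2023-01-01 through 2023-08-02, as in the table above.
partitions = [day_value(date(2023, 1, 1) + timedelta(days=i)) for i in range(214)]

# Bounds from the query: ts >= 2023-01-05T00:00:00 and ts < 2023-01-06T00:00:00.
lo = datetime(2023, 1, 5)
hi = datetime(2023, 1, 6)

# The upper bound is exclusive, so the last day that can contain matching
# rows is the day of (hi - 1 microsecond).
last_day = (hi - timedelta(microseconds=1)).date()
surviving = [p for p in partitions if day_value(lo.date()) <= p <= day_value(last_day)]
# Only the 2023-01-05 partition survives; all other files can be skipped.
```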

Questions:

  • Is this a performance issue in iceberg-rust or am I doing something wrong?
  • Is there a better way to perform this query efficiently?

I am using the latest main branch of this repo.

Thanks in advance!

ryzhyk pushed a commit to feldera/feldera that referenced this issue Dec 17, 2024
Initial implementation of the Iceberg source connector. The connector is
built on the `iceberg` crate, which is still in its early days and has many
limitations and performance issues.

* It currently only supports primitive types (no structs, maps, lists)
* It only supports reading tables (hence no sink connector yet)
* It only supports snapshot reads, not table following, although I think
  the latter could be mostly implemented using available low-level APIs.
* I haven't figured out how to do efficient range queries for
  time-series data: apache/iceberg-rust#811

The implementation has a very similar structure to the Delta Lake
connector and actually shares a bunch of code with it (I moved some of
this code to `adapterslib`, but I copied some other code, which I
thought may diverge in the future).  Both connectors register the table
as a datafusion table provider and mostly work with it via the
datafusion API.

The main difference between Iceberg and Delta is that Iceberg cannot really
be used without a catalog, since the catalog is responsible for tracking the
location of the latest metadata file (the metadata file is the root object
required to do anything with an Iceberg table). We currently support
two of the most common catalog APIs: Glue (for Iceberg tables in AWS)
and REST, which seems to be increasingly popular in the Iceberg
community. We should be able to easily add SQL and Hive catalogs, which
are supported by the `iceberg` crate.

The connector should work with tables in S3, local FS, and GCS, but only the
first two have been tested. The `iceberg` crate currently doesn't
support Azure and other data stores, although it should be easy to add them if
necessary, since they are supported by the `opendal` crate, which
`iceberg` uses for FileIO.

Signed-off-by: Leonid Ryzhyk <[email protected]>
ryzhyk pushed a commit to feldera/feldera that referenced this issue Dec 17, 2024
ryzhyk pushed a commit to feldera/feldera that referenced this issue Dec 17, 2024
ryzhyk pushed a commit to feldera/feldera that referenced this issue Dec 18, 2024
Fokko added a commit to Fokko/iceberg-rust that referenced this issue Dec 18, 2024
Fokko linked a pull request Dec 18, 2024 that will close this issue
Fokko (Contributor) commented Dec 18, 2024

Thanks for reporting this @ryzhyk. I took a look, and it is what I already suspected: the cast from string to datetime is not taken into consideration, so the query falls back to a full table scan.
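The failure mode described here is simple to model: when a predicate literal cannot be (or is not) cast to the column's type, a conservative planner must treat the predicate as potentially matching anything and scan every file. An illustrative sketch of that fallback behavior (not the actual iceberg-rust logic):

```python
from datetime import datetime, date

def to_day_bound(literal):
    # A typed timestamp literal can be compared against day partitions;
    # a raw string that was never cast to a timestamp cannot.
    if isinstance(literal, datetime):
        return (literal.date() - date(1970, 1, 1)).days
    return None  # unknown type: pruning is not possible

def files_to_scan(partitions, literal):
    # Keep only partitions that can satisfy `ts >= literal`.
    bound = to_day_bound(literal)
    if bound is None:
        # Conservative fallback: no pruning, full table scan.
        return list(partitions)
    return [p for p in partitions if p >= bound]

parts = list(range(19358, 19558))  # ~200 daily partitions from 2023-01-01 on

# With a typed timestamp literal, most partitions are pruned:
pruned = files_to_scan(parts, datetime(2023, 1, 5))
# With a raw string literal, the planner falls back to a full scan:
full = files_to_scan(parts, "2023-01-05T00:00:00")
```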

Fokko added a commit to Fokko/iceberg-rust that referenced this issue Dec 18, 2024
ryzhyk (Contributor, Author) commented Dec 18, 2024

Many thanks for investigating the issue and for making a fix. That was really fast!

ryzhyk pushed a commit to feldera/feldera that referenced this issue Dec 18, 2024
github-merge-queue bot pushed a commit to feldera/feldera that referenced this issue Dec 18, 2024