Performance issue with range queries over a partitioned table. #811
Comments
ryzhyk pushed a commit to feldera/feldera that referenced this issue · Dec 17, 2024
Initial implementation of the Iceberg source connector.

The connector is built on the `iceberg` crate, which is still in its early days and has many limitations and performance issues:

* It currently only supports primitive types (no structs, maps, or lists).
* It only supports reading tables (hence no sink connector yet).
* It only supports snapshot reads, not table following, although I think the latter could mostly be implemented using available low-level APIs.
* I haven't figured out how to do efficient range queries for time series data: apache/iceberg-rust#811

The implementation has a very similar structure to the Delta Lake connector and actually shares a bunch of code with it (I moved some of this code to `adapterslib`, but I copied some other code, which I thought may diverge in the future). Both connectors register the table as a datafusion table provider and mostly work with it via the datafusion API.

The main difference between Iceberg and Delta is that Iceberg cannot really be used without a catalog, since the catalog is responsible for tracking the location of the latest metadata file (the metadata file is the root object required to do anything with an Iceberg table). We currently support two of the most common catalog APIs: Glue (for Iceberg tables in AWS) and REST, which seems to be increasingly popular in the Iceberg community. We should be able to easily add SQL and Hive catalogs, which are supported by the `iceberg` crate.

The connector should work with tables in S3, local FS, and GCS, but only the first two have been tested. The `iceberg` crate currently doesn't support Azure and other data stores, although it should be easy to add them if necessary, since they are supported by the `opendal` crate, which `iceberg` uses for FileIO.

Signed-off-by: Leonid Ryzhyk <[email protected]>
Fokko added a commit to Fokko/iceberg-rust that referenced this issue · Dec 18, 2024
Resolves apache#811. Depends on apache#820.
Thanks for reporting this @ryzhyk. I took a look, and it confirms what I already suspected: the cast from string to datetime is not taken into consideration. This means that it falls back to a full table scan.
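A toy stdlib sketch of the failure mode described above, illustrative only and not `iceberg-rust`'s actual code: predicate conversion can derive a prunable partition bound only from a literal it recognizes, so a string literal that still needs a CAST to datetime yields no bound and forces a full scan.

```python
from datetime import datetime
from typing import Optional


def day_bound(literal: object) -> Optional[int]:
    """Toy predicate conversion: a recognized timestamp literal yields a
    day-granularity bound usable for partition pruning; anything else
    (e.g. a string that still needs a CAST) yields None, i.e. the
    planner must fall back to scanning every file."""
    if isinstance(literal, datetime):
        return literal.toordinal()
    return None  # unhandled cast -> cannot prune


# A typed timestamp literal can prune; the equivalent string cannot.
typed = day_bound(datetime(2023, 3, 1))
untyped = day_bound("2023-03-01T00:00:00")
```

The fix referenced in the linked commits amounts to handling that cast during predicate conversion so the string case also produces a bound.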
Many thanks for investigating the issue and for making a fix. That was really fast!
I ran into a performance issue querying an Iceberg table in S3 via the datafusion provider. The table was created using pyiceberg with the following schema:

The table is partitioned by a date extracted from the `ts` column:

There are 10,000,000 records in the table, spread evenly across ~200 partitions for dates between 2023-01-01 and 2023-08-02.
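For intuition, Iceberg's `day` partition transform maps a timestamp to whole days since the Unix epoch, and a range predicate over `ts` should prune the scan to just the partitions whose day value falls inside the transformed bounds. A rough stdlib sketch of that expectation (illustrative only, not `iceberg-rust` code):

```python
from datetime import datetime, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def day_transform(ts: datetime) -> int:
    """Iceberg's `day` transform: whole days since the Unix epoch."""
    return (ts - EPOCH).days


def prune(partition_days, lo: datetime, hi: datetime):
    """Keep only partitions whose day value lies within the predicate's
    transformed bounds -- every other partition can be skipped unread."""
    lo_d, hi_d = day_transform(lo), day_transform(hi)
    return [d for d in partition_days if lo_d <= d <= hi_d]


def utc(*args):
    return datetime(*args, tzinfo=timezone.utc)


# ~200 daily partitions covering 2023-01-01 .. 2023-08-02 (inclusive)
parts = list(range(day_transform(utc(2023, 1, 1)),
                   day_transform(utc(2023, 8, 2)) + 1))

# A one-day range predicate should leave a single partition to read.
hit = prune(parts, utc(2023, 3, 1), utc(2023, 3, 1, 23, 59, 59))
```

With this pruning in effect, a one-day query touches 1 of 214 partitions; without it, all 214 are scanned, which matches the full-table-scan timing reported below.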
I query the table using `iceberg-rust` via the datafusion table provider, using range queries of the form:

I expect this query to be very efficient, as it only needs to read one partition; in reality, however, it takes about as long as scanning the entire table with `select * from my_table` (approximately 10 seconds). It looks like predicate pushdown doesn't work here for some reason.

Questions:

* Is this a limitation of `iceberg-rust`, or am I doing something wrong?

I am using the latest `main` branch of this repo. Thanks in advance!
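The query text itself did not survive this page capture. A plausible form, reusing the `my_table` and `ts` names from the post (the timestamp bounds are assumptions), plus a quick check of the reported figures:

```python
from datetime import date

# Hypothetical range query of the kind the report describes
# (the bounds are illustrative, not taken from the original post):
query = (
    "SELECT * FROM my_table "
    "WHERE ts >= TIMESTAMP '2023-03-01 00:00:00' "
    "AND ts < TIMESTAMP '2023-03-02 00:00:00'"
)

# Sanity-check the reported numbers: daily partitions over the stated range.
total_partitions = (date(2023, 8, 2) - date(2023, 1, 1)).days + 1  # the "~200"
rows_per_partition = 10_000_000 // total_partitions  # rows spread evenly
```

So an ideal plan for this one-day window would read one partition of roughly 47k rows rather than all 10M.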