Arrow performance optimizations #638
Thanks for your contribution! To satisfy the DCO policy in our contributing guide every commit message must include a sign-off message. One or more of your commits is missing this message. You can reword previous commit messages with an interactive rebase ( |
Description
This pull request introduces several performance optimizations for operations involving Apache Arrow tables within the Databricks SQL Python client.
Optimizations
Arrow Table Concatenation Optimizations
Batching Concatenations:
Instead of repeatedly calling pyarrow.concat_tables on pairs of tables (which is inefficient), partial results are now collected into a list (partial_result_chunks) and concatenated only once at the end using pyarrow.concat_tables(partial_result_chunks, use_threads=True).
CloudFetch Downloader Refactor
HTTP Client Consolidation:
Replaces direct use of requests.Session with a singleton pattern via DatabricksHttpClient. This centralizes HTTP handling and is more robust for connection management and configuration.
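For readers unfamiliar with the pattern, a singleton HTTP client might look roughly like the sketch below. It is an assumption-laden illustration and does not reproduce the interface of DatabricksHttpClient: the class name `SharedHttpClient`, the `get_instance` and `get` methods, and the `timeout_seconds` attribute are all hypothetical, and the standard library's `urllib.request` opener stands in for a pooled session.

```python
import threading
import urllib.request


class SharedHttpClient:
    """Hypothetical process-wide HTTP client (not the library's real class)."""

    _instance = None
    _lock = threading.Lock()

    def __init__(self):
        # One opener reused by every caller; stand-in for a shared session.
        self._opener = urllib.request.build_opener()
        # Configuration lives in one place instead of at each call site.
        self.timeout_seconds = 30.0

    @classmethod
    def get_instance(cls):
        # Double-checked locking so concurrent download threads
        # end up sharing a single instance.
        if cls._instance is None:
            with cls._lock:
                if cls._instance is None:
                    cls._instance = cls()
        return cls._instance

    def get(self, url, headers=None):
        # Issue a GET request through the shared opener.
        request = urllib.request.Request(url, headers=headers or {}, method="GET")
        return self._opener.open(request, timeout=self.timeout_seconds)
```

Callers would obtain the client with `SharedHttpClient.get_instance()` rather than constructing their own session, so settings such as timeouts are defined once.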
Benchmarking
Arrow concatenation optimization
Benchmarked using
num_tables: 10000 | row_per_table: 10000 | columns_per_table: 10 | attempts: 10
End to end optimization
The end-to-end test covers both the Arrow update and the HTTP client update.
benchmarking workspace: benchmarking-staging-aws-us-east-1.staging.cloud.databricks.com
test runs: 10 per benchmark
benchmarking query : SELECT * FROM main.tpcds_sf100_delta.catalog_sales WHERE cs_ship_mode_sk <= 14 AND cs_sold_date_sk BETWEEN 2450815 AND (2450815 + 410) LIMIT {LIMIT} OFFSET {row_offset}
Summary
Arrow Optimizations - 81% faster ⚡
End to End optimizations - Greater than 30% faster ⚡