Arrow performance optimizations #638

Merged
merged 5 commits into from
Jul 16, 2025

@jprakash-db jprakash-db commented Jul 15, 2025

Description

This pull request introduces several performance optimizations for operations involving Apache Arrow tables within the Databricks SQL Python client.

  • Reduces overhead when concatenating Arrow tables, especially when data is fetched in batches.
  • Streamlines the HTTP download logic and improves code readability and maintainability.

Optimizations

Arrow Table Concatenation Optimizations

Batching Concatenations:
Instead of repeatedly calling pyarrow.concat_tables on pairs of tables (which is inefficient), partial results are now collected into a list (partial_result_chunks) and concatenated only once at the end using pyarrow.concat_tables(partial_result_chunks, use_threads=True).

CloudFetch Downloader Refactor

HTTP Client Consolidation:
Replaces direct use of requests.Session with a singleton pattern via DatabricksHttpClient. This centralizes HTTP handling and is more robust for connection management and configuration.
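
As a rough illustration of the singleton idea, the sketch below shares one `requests.Session` across all callers. The class `SharedHttpClient` is a hypothetical stand-in written for this example; it does not reproduce the interface or internals of `DatabricksHttpClient`, which are not shown in this description.

```python
import threading

import requests


class SharedHttpClient:
    """Hypothetical singleton wrapper; NOT the real DatabricksHttpClient."""

    _instance = None
    _lock = threading.Lock()

    def __new__(cls):
        # Double-checked locking so concurrent callers create only one instance.
        if cls._instance is None:
            with cls._lock:
                if cls._instance is None:
                    instance = super().__new__(cls)
                    # One Session means one connection pool and one place
                    # to configure retries, headers, or timeouts.
                    instance.session = requests.Session()
                    cls._instance = instance
        return cls._instance

    def get(self, url, **kwargs):
        """Issue a GET through the shared session."""
        return self.session.get(url, **kwargs)


first = SharedHttpClient()
second = SharedHttpClient()
assert first is second  # every caller sees the same client and session
```

Each downloader would then call `SharedHttpClient().get(...)` instead of creating its own session, so connection reuse and configuration live in one place.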

Benchmarking

Arrow concatenation optimization

Benchmarked using:
num_tables: 10000 | row_per_table: 10000 | columns_per_table: 10 | attempts: 10

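| Metric | Pre latency | Post latency | Improvement |
|--------|-------------|--------------|-------------|
| count  | 10          | 10           |             |
| mean   | 9.26s       | 1.48s        |             |
| std    | 0.78s       | 0.61s        |             |
| min    | 8.27s       | 0.005s       |             |
| 95%    | 10.43s      | 1.89s        | 81%         |
| 99%    | 10.55s      | 1.90s        | 82%         |
| max    | 10.58s      | 1.90s        |             |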
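
The description does not say how the Improvement column was computed. Assuming it is the relative latency reduction, (pre - post) / pre, the reported percentile figures can be checked with a few lines. The helper name `latency_reduction_pct` is made up for this example.

```python
def latency_reduction_pct(pre_s: float, post_s: float) -> float:
    """Relative latency reduction in percent: (pre - post) / pre * 100."""
    if pre_s <= 0:
        raise ValueError("pre-change latency must be positive")
    return (pre_s - post_s) / pre_s * 100.0


# Percentile latencies (seconds) from the Arrow concatenation benchmark.
p95 = latency_reduction_pct(10.43, 1.89)
p99 = latency_reduction_pct(10.55, 1.90)
print(f"p95 reduction: {p95:.1f}%  p99 reduction: {p99:.1f}%")
```

Under that assumption both percentiles come out at roughly 82%, in line with the 81% and 82% reported. The small differences are consistent with the published latencies having been rounded before the percentages were derived.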

End to end optimization

The end-to-end test covers both the Arrow update and the HTTP client update.
benchmarking workspace: benchmarking-staging-aws-us-east-1.staging.cloud.databricks.com
test runs: 10 per benchmark
benchmarking query : SELECT * FROM main.tpcds_sf100_delta.catalog_sales WHERE cs_ship_mode_sk <= 14 AND cs_sold_date_sk BETWEEN 2450815 AND (2450815 + 410) LIMIT {LIMIT} OFFSET {row_offset}

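| Number of rows | p95 pre | p95 post | p95 improvement | p99 pre | p99 post | p99 improvement |
|----------------|---------|----------|-----------------|---------|----------|-----------------|
| 10,000         | 4.01s   | 2.71s    | 32.41%          | 4.33s   | 2.85s    | 34.1%           |
| 100,000        | 19.52s  | 14.32s   | 26%             | 22.23s  | 14.55s   | 34.5%           |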

Summary

Arrow optimizations: about 81% lower p95 concatenation latency ⚡

End-to-end optimizations: 26% to 34.5% lower p95/p99 latency ⚡

Thanks for your contribution! To satisfy the DCO policy in our contributing guide every commit message must include a sign-off message. One or more of your commits is missing this message. You can reword previous commit messages with an interactive rebase (git rebase -i main).

@jprakash-db jprakash-db force-pushed the jprakash-db/arrow-optim branch from e1484c2 to 8cdfd88 Compare July 15, 2025 07:14
@jprakash-db jprakash-db marked this pull request as ready for review July 15, 2025 11:02
@jprakash-db jprakash-db merged commit e0ca049 into main Jul 16, 2025
22 of 24 checks passed
2 participants