Arrow performance optimizations #638

jprakash-db · 2025-07-15T07:10:41Z

Description

This pull request introduces several performance optimizations for operations involving Apache Arrow tables within the Databricks SQL Python client.

The goal is to reduce overhead and improve efficiency when concatenating Arrow tables, especially when fetching data in batches.
The PR also streamlines the HTTP download logic and improves code readability and maintainability.

Optimizations

Arrow Table Concatenation Optimizations

Batching Concatenations:
Instead of repeatedly calling pyarrow.concat_tables on pairs of tables (which redoes work on the accumulated result for every new batch), partial results are now collected into a list (partial_result_chunks) and concatenated only once at the end using pyarrow.concat_tables(partial_result_chunks, use_threads=True).

CloudFetch Downloader Refactor

HTTP Client Consolidation:
Replaces direct use of requests.Session with a singleton pattern via DatabricksHttpClient. This centralizes HTTP handling and is more robust for connection management and configuration.
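
As a rough illustration of the singleton idea, the sketch below shows one common way to share a single client object across callers. SharedHttpClient, get_instance, and request_count are hypothetical names chosen for this example; the sketch is not the DatabricksHttpClient implementation and makes no real HTTP calls.

```python
import threading


class SharedHttpClient:
    """Hypothetical singleton wrapper; a real client would own a pooled HTTP session."""

    _instance = None
    _lock = threading.Lock()

    def __init__(self):
        # Placeholder state standing in for a session and its configuration.
        self.request_count = 0

    @classmethod
    def get_instance(cls):
        # Double-checked locking so concurrent callers create only one instance.
        if cls._instance is None:
            with cls._lock:
                if cls._instance is None:
                    cls._instance = cls()
        return cls._instance

    def record_request(self):
        """Stand-in for issuing a request through the shared client."""
        self.request_count += 1
        return self.request_count


first = SharedHttpClient.get_instance()
second = SharedHttpClient.get_instance()
first.record_request()
second.record_request()
print(first is second, first.request_count)
```

Because every caller receives the same object, connection settings and pooling live in one place instead of being recreated per download.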

Benchmarking

Arrow concatenation optimization

Benchmarked using:
num_tables: 10000 | row_per_table: 10000 | columns_per_table: 10 | attempts: 10

Metric	Pre latency	Post latency	Improvement
count	10	10
mean	9.26s	1.48s
std	0.78s	0.61s
min	8.27s	0.005s
95%	10.43s	1.89s	81%
99%	10.55s	1.90s	82%
max	10.58s	1.90s

End to end optimization

This end-to-end benchmark covers both the Arrow update and the HTTP client update.
Benchmarking workspace: benchmarking-staging-aws-us-east-1.staging.cloud.databricks.com
Test runs: 10 per benchmark
Benchmarking query: SELECT * FROM main.tpcds_sf100_delta.catalog_sales WHERE cs_ship_mode_sk <= 14 AND cs_sold_date_sk BETWEEN 2450815 AND (2450815 + 410) LIMIT {LIMIT} OFFSET {row_offset}

Num of Rows	p95 Pre	p95 Post	Improvement	p99 Pre	p99 Post	Improvement
10,000	4.01s	2.71s	32.41%	4.33s	2.85s	34.1%
100,000	19.52s	14.32s	26%	22.23s	14.55s	34.5%
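
The Improvement columns appear to be the relative reduction in latency, (pre - post) / pre. A small sketch that recomputes them from the latencies in the table above, assuming that formula (latency_improvement is a name introduced here for illustration):

```python
def latency_improvement(pre_seconds, post_seconds):
    """Percentage reduction in latency relative to the pre-change value."""
    if pre_seconds <= 0:
        raise ValueError("pre_seconds must be positive")
    return (pre_seconds - post_seconds) / pre_seconds * 100.0


# (rows, percentile) -> (pre latency in s, post latency in s), copied from the table.
results = {
    ("10,000", "p95"): (4.01, 2.71),
    ("10,000", "p99"): (4.33, 2.85),
    ("100,000", "p95"): (19.52, 14.32),
    ("100,000", "p99"): (22.23, 14.55),
}

for (rows, percentile), (pre, post) in results.items():
    pct = latency_improvement(pre, post)
    print(f"{rows} rows {percentile}: {pct:.1f}% lower latency")
```

The recomputed values can differ from the table in the last digit, depending on how the originals were rounded.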

Summary

Arrow optimizations - about 81% lower p95 latency ⚡

End-to-end optimizations - 26% to 34.5% lower p95/p99 latency ⚡

github-actions · 2025-07-15T07:12:22Z

Thanks for your contribution! To satisfy the DCO policy in our contributing guide every commit message must include a sign-off message. One or more of your commits is missing this message. You can reword previous commit messages with an interactive rebase (git rebase -i main).

src/databricks/sql/result_set.py

jprakash-db added 2 commits July 15, 2025 11:36

Minor fix

0cbfae6

Perf update

8cdfd88

jprakash-db had a problem deploying to azure-prod July 15, 2025 07:12 — with GitHub Actions Failure

jprakash-db force-pushed the jprakash-db/arrow-optim branch from e1484c2 to 8cdfd88 Compare July 15, 2025 07:14

merged main

270e27c

jprakash-db temporarily deployed to azure-prod July 15, 2025 08:08 — with GitHub Actions Inactive

more

7c7b121

jprakash-db temporarily deployed to azure-prod July 15, 2025 08:22 — with GitHub Actions Inactive

test fix

e9040cb

jprakash-db had a problem deploying to azure-prod July 15, 2025 11:01 — with GitHub Actions Failure

jprakash-db marked this pull request as ready for review July 15, 2025 11:02

jprakash-db requested review from jayantsing-db, gopalldb, vikrantpuppala and samikshya-db July 15, 2025 11:02

vikrantpuppala reviewed Jul 15, 2025

src/databricks/sql/result_set.py (review thread resolved)

jprakash-db temporarily deployed to azure-prod July 15, 2025 15:19 — with GitHub Actions Inactive

vikrantpuppala approved these changes Jul 16, 2025

jprakash-db merged commit e0ca049 into main Jul 16, 2025
22 of 24 checks passed
