SEA: re-fetch links in case of expiry #635

Draft · wants to merge 380 commits into base: less-defensive-download

Commits (380)
015fb76
remove reliance on schema_bytes in SEA
varun-edachali-dbx Jun 16, 2025
0385ffb
remove redundant note on arrow_schema_bytes
varun-edachali-dbx Jun 16, 2025
5380c7a
use more fetch methods
varun-edachali-dbx Jun 16, 2025
27b781f
remove redundant schema_bytes from parent constructor
varun-edachali-dbx Jun 16, 2025
238dc0a
only call get_chunk_link with non null chunk index
varun-edachali-dbx Jun 16, 2025
b3bb07e
align SeaResultSet structure with ThriftResultSet
varun-edachali-dbx Jun 16, 2025
13e6346
remove _fill_result_buffer from SeaResultSet
varun-edachali-dbx Jun 16, 2025
f90b4d4
reduce code repetition
varun-edachali-dbx Jun 16, 2025
23963fc
align SeaResultSet with ext-links-sea
varun-edachali-dbx Jun 16, 2025
dd43715
remove redundant methods
varun-edachali-dbx Jun 16, 2025
34a7f66
update unit tests
varun-edachali-dbx Jun 16, 2025
715cc13
remove accidental venv changes
varun-edachali-dbx Jun 16, 2025
fb53dd9
pre-fetch next chunk link on processing current
varun-edachali-dbx Jun 17, 2025
d893877
reduce nesting
varun-edachali-dbx Jun 17, 2025
a165f1c
line break after multi line pydoc
varun-edachali-dbx Jun 17, 2025
d68e4ea
re-introduce schema_bytes for better abstraction (likely temporary)
varun-edachali-dbx Jun 17, 2025
be17812
Merge branch 'fetch-json-inline' into ext-links-sea
varun-edachali-dbx Jun 17, 2025
a0705bc
add fetchmany_arrow and fetchall_arrow
varun-edachali-dbx Jun 17, 2025
1b90c4a
Merge branch 'metadata-sea' into fetch-json-inline
varun-edachali-dbx Jun 17, 2025
f7c11b9
remove accidental changes in sea backend tests
varun-edachali-dbx Jun 17, 2025
349c021
Merge branch 'exec-phase-sea' into metadata-sea
varun-edachali-dbx Jun 17, 2025
6229848
remove irrelevant changes
varun-edachali-dbx Jun 17, 2025
fd52356
remove un-necessary test changes
varun-edachali-dbx Jun 17, 2025
64e58b0
remove un-necessary changes in thrift backend tests
varun-edachali-dbx Jun 17, 2025
2903473
remove unimplemented methods test
varun-edachali-dbx Jun 17, 2025
b300709
Merge branch 'metadata-sea' into fetch-json-inline
varun-edachali-dbx Jun 17, 2025
021ff4c
remove unimplemented method tests
varun-edachali-dbx Jun 17, 2025
adecd53
modify example scripts to include fetch calls
varun-edachali-dbx Jun 17, 2025
d33e5fd
Merge branch 'fetch-json-inline' into ext-links-sea
varun-edachali-dbx Jun 17, 2025
e3cef5c
add GetChunksResponse
varun-edachali-dbx Jun 17, 2025
ac50669
remove changes to sea test
varun-edachali-dbx Jun 17, 2025
03cdc4f
re-introduce accidentally removed description extraction method
varun-edachali-dbx Jun 17, 2025
e1842d8
fix type errors (ssl_options, CHUNK_PATH_WITH_ID..., etc.)
varun-edachali-dbx Jun 17, 2025
89a46af
access ssl_options through connection
varun-edachali-dbx Jun 17, 2025
1d0b28b
DEBUG level
varun-edachali-dbx Jun 17, 2025
c8820d4
remove explicit multi chunk test
varun-edachali-dbx Jun 17, 2025
fe47787
move cloud fetch queues back into utils.py
varun-edachali-dbx Jun 17, 2025
74f59b7
remove excess docstrings
varun-edachali-dbx Jun 17, 2025
4b456b2
move ThriftCloudFetchQueue above SeaCloudFetchQueue
varun-edachali-dbx Jun 17, 2025
bfc1f01
fix sea connector tests
varun-edachali-dbx Jun 17, 2025
a4447a1
Merge branch 'fetch-json-inline' into ext-links-sea
varun-edachali-dbx Jun 17, 2025
4883aff
correct patch module path in cloud fetch queue tests
varun-edachali-dbx Jun 17, 2025
0a2cdfd
remove unimplemented methods test
varun-edachali-dbx Jun 17, 2025
cd3378c
correct add_link docstring
varun-edachali-dbx Jun 17, 2025
90bb09c
Merge branch 'sea-migration' into exec-phase-sea
varun-edachali-dbx Jun 17, 2025
cd22389
remove invalid import
varun-edachali-dbx Jun 17, 2025
82e0f8b
Merge branch 'sea-migration' into exec-phase-sea
varun-edachali-dbx Jun 17, 2025
e64b81b
Merge branch 'exec-phase-sea' into metadata-sea
varun-edachali-dbx Jun 17, 2025
27564ca
Merge branch 'metadata-sea' into fetch-json-inline
varun-edachali-dbx Jun 17, 2025
bc467d1
Merge branch 'fetch-json-inline' into ext-links-sea
varun-edachali-dbx Jun 17, 2025
5ab9bbe
better align queries with JDBC impl
varun-edachali-dbx Jun 18, 2025
1ab6e87
line breaks after multi-line PRs
varun-edachali-dbx Jun 18, 2025
f469c24
remove unused imports
varun-edachali-dbx Jun 18, 2025
68ec65f
fix: introduce ExecuteResponse import
varun-edachali-dbx Jun 18, 2025
ffd478e
Merge branch 'sea-migration' into metadata-sea
varun-edachali-dbx Jun 18, 2025
f6d873d
remove unimplemented metadata methods test, un-necessary imports
varun-edachali-dbx Jun 18, 2025
28675f5
introduce unit tests for metadata methods
varun-edachali-dbx Jun 18, 2025
3578659
remove verbosity in ResultSetFilter docstring
varun-edachali-dbx Jun 20, 2025
8713023
remove un-necessary info in ResultSetFilter docstring
varun-edachali-dbx Jun 20, 2025
22dc252
remove explicit type checking, string literals around forward annotat…
varun-edachali-dbx Jun 20, 2025
390f592
house SQL commands in constants
varun-edachali-dbx Jun 20, 2025
dd7dc6a
convert complex types to string if not _use_arrow_native_complex_types
varun-edachali-dbx Jun 23, 2025
28308fe
Merge branch 'metadata-sea' into fetch-json-inline
varun-edachali-dbx Jun 23, 2025
2712d1c
introduce unit tests for altered functionality
varun-edachali-dbx Jun 23, 2025
dabba55
Merge branch 'fetch-json-inline' into ext-links-sea
varun-edachali-dbx Jun 23, 2025
48ad7b3
Revert "Merge branch 'fetch-json-inline' into ext-links-sea"
varun-edachali-dbx Jun 23, 2025
a1f9b9c
reduce verbosity of ResultSetFilter docstring
varun-edachali-dbx Jun 23, 2025
984e8ee
remove unused imports
varun-edachali-dbx Jun 23, 2025
3a999c0
Merge branch 'fetch-json-inline' into ext-links-sea
varun-edachali-dbx Jun 23, 2025
c313c2b
Revert "Merge branch 'fetch-json-inline' into ext-links-sea"
varun-edachali-dbx Jun 23, 2025
3bc615e
Revert "reduce verbosity of ResultSetFilter docstring"
varun-edachali-dbx Jun 23, 2025
b6e1a10
Reapply "Merge branch 'fetch-json-inline' into ext-links-sea"
varun-edachali-dbx Jun 23, 2025
2df3d39
Revert "Merge branch 'fetch-json-inline' into ext-links-sea"
varun-edachali-dbx Jun 23, 2025
5e75fb5
remove un-necessary filters changes
varun-edachali-dbx Jun 23, 2025
20822e4
remove un-necessary backend changes
varun-edachali-dbx Jun 23, 2025
802d045
remove constants changes
varun-edachali-dbx Jun 23, 2025
f3f795a
remove changes in filters tests
varun-edachali-dbx Jun 23, 2025
f6c5950
remove unit test backend and JSON queue changes
varun-edachali-dbx Jun 23, 2025
d210ccd
remove changes in sea result set testing
varun-edachali-dbx Jun 23, 2025
22a953e
Revert "remove changes in sea result set testing"
varun-edachali-dbx Jun 23, 2025
3aed144
Revert "remove unit test backend and JSON queue changes"
varun-edachali-dbx Jun 23, 2025
0fe4da4
Revert "remove changes in filters tests"
varun-edachali-dbx Jun 23, 2025
0e3c0a1
Revert "remove constants changes"
varun-edachali-dbx Jun 23, 2025
93edb93
Revert "remove un-necessary backend changes"
varun-edachali-dbx Jun 23, 2025
871a44f
Revert "remove un-necessary filters changes"
varun-edachali-dbx Jun 23, 2025
0ce144d
remove unused imports
varun-edachali-dbx Jun 23, 2025
08ca60d
Merge branch 'fetch-json-inline' into ext-links-sea
varun-edachali-dbx Jun 23, 2025
8c5cc77
working version
varun-edachali-dbx Jun 23, 2025
7f5c715
adopt _wait_until_command_done
varun-edachali-dbx Jun 23, 2025
9ef5fad
introduce metadata commands
varun-edachali-dbx Jun 23, 2025
44183db
use new backend structure
varun-edachali-dbx Jun 23, 2025
d59b351
constrain backend diff
varun-edachali-dbx Jun 23, 2025
1edc80a
remove changes to filters
varun-edachali-dbx Jun 23, 2025
f82658a
make _parse methods in models internal
varun-edachali-dbx Jun 23, 2025
54eb0a4
reduce changes in unit tests
varun-edachali-dbx Jun 23, 2025
50cc1e2
run small queries with SEA during integration tests
varun-edachali-dbx Jun 24, 2025
242307a
run some tests for sea
varun-edachali-dbx Jun 24, 2025
8a138e8
allow empty schema bytes for alignment with SEA
varun-edachali-dbx Jun 25, 2025
82f9d6b
pass is_vl_op to Sea backend ExecuteResponse
varun-edachali-dbx Jun 25, 2025
35f1ef0
remove catalog requirement in get_tables
varun-edachali-dbx Jun 26, 2025
a515d26
move filters.py to SEA utils
varun-edachali-dbx Jun 26, 2025
59b1330
ensure SeaResultSet
varun-edachali-dbx Jun 26, 2025
293e356
Merge branch 'sea-migration' into metadata-sea
varun-edachali-dbx Jun 26, 2025
dd40beb
prevent circular imports
varun-edachali-dbx Jun 26, 2025
14057ac
remove unused imports
varun-edachali-dbx Jun 26, 2025
a4d5bdb
remove cast, throw error if not SeaResultSet
varun-edachali-dbx Jun 26, 2025
156421a
Merge branch 'metadata-sea' into fetch-json-inline
varun-edachali-dbx Jun 26, 2025
eb1a9b4
pass param as TSparkParameterValue
varun-edachali-dbx Jun 26, 2025
9000666
Merge branch 'fetch-json-inline' into ext-links-sea
varun-edachali-dbx Jun 26, 2025
a3ca7c7
remove failing test (temp)
varun-edachali-dbx Jun 26, 2025
2c22010
remove SeaResultSet type assertion
varun-edachali-dbx Jun 26, 2025
c09508e
change errors to align with spec, instead of arbitrary ValueError
varun-edachali-dbx Jun 26, 2025
e9b1314
make SEA backend methods return SeaResultSet
varun-edachali-dbx Jun 26, 2025
8ede414
use spec-aligned Exceptions in SEA backend
varun-edachali-dbx Jun 26, 2025
09a1b11
remove defensive row type check
varun-edachali-dbx Jun 26, 2025
5e01e7b
Merge branch 'metadata-sea' into fetch-json-inline
varun-edachali-dbx Jun 26, 2025
3becefe
Merge branch 'fetch-json-inline' into ext-links-sea
varun-edachali-dbx Jun 26, 2025
a026d31
raise ProgrammingError for invalid id
varun-edachali-dbx Jun 26, 2025
4446a9e
make is_volume_operation strict bool
varun-edachali-dbx Jun 26, 2025
138359d
remove complex types code
varun-edachali-dbx Jun 26, 2025
b99d0c4
Revert "remove complex types code"
varun-edachali-dbx Jun 26, 2025
21c389d
introduce type conversion for primitive types for JSON + INLINE
varun-edachali-dbx Jun 27, 2025
734321a
Merge branch 'sea-migration' into fetch-json-inline
varun-edachali-dbx Jun 27, 2025
9f0f969
remove SEA running on metadata queries (known failures)
varun-edachali-dbx Jun 27, 2025
04a1936
remove un-necessary docstrings
varun-edachali-dbx Jun 27, 2025
278b8cd
align expected types with databricks sdk
varun-edachali-dbx Jun 27, 2025
91b7f7f
link rest api reference to validate types
varun-edachali-dbx Jun 27, 2025
7a5ae13
remove test_catalogs_returns_arrow_table test
varun-edachali-dbx Jun 27, 2025
f1776f3
fix fetchall_arrow and fetchmany_arrow
varun-edachali-dbx Jun 27, 2025
6143331
remove thrift aligned test_cancel_during_execute from SEA tests
varun-edachali-dbx Jun 27, 2025
8949d0c
Merge branch 'sea-migration' into fetch-json-inline
varun-edachali-dbx Jun 27, 2025
5eaded4
remove un-necessary changes in example scripts
varun-edachali-dbx Jun 27, 2025
eeed9a1
remove un-necessary changes in example scripts
varun-edachali-dbx Jun 27, 2025
f233886
_convert_json_table -> _create_json_table
varun-edachali-dbx Jun 27, 2025
68ac437
remove accidentally removed test
varun-edachali-dbx Jun 27, 2025
7fd0845
remove new unit tests (to be re-added based on new arch)
varun-edachali-dbx Jun 27, 2025
ea7ff73
remove changes in sea_result_set functionality (to be re-added)
varun-edachali-dbx Jun 27, 2025
563da71
introduce more integration tests
varun-edachali-dbx Jun 27, 2025
a018273
remove SEA tests in parameterized queries
varun-edachali-dbx Jun 27, 2025
c0e98f4
remove partial parameter fix changes
varun-edachali-dbx Jun 27, 2025
7343035
remove un-necessary timestamp tests
varun-edachali-dbx Jun 27, 2025
ec500b6
slightly stronger typing of _convert_json_types
varun-edachali-dbx Jun 27, 2025
0b3e91d
stronger typing of json utility funcs
varun-edachali-dbx Jun 27, 2025
7664e44
stronger typing of fetch*_json
varun-edachali-dbx Jun 27, 2025
db7b8e5
remove unused helper methods in SqlType
varun-edachali-dbx Jun 27, 2025
f75f2b5
line breaks after multi line pydocs, remove excess logs
varun-edachali-dbx Jun 27, 2025
e2d4ef5
line breaks after multi line pydocs, reduce diff of redundant changes
varun-edachali-dbx Jun 27, 2025
21e3078
reduce diff of redundant changes
varun-edachali-dbx Jun 27, 2025
bb015e6
mandate ResultData in SeaResultSet constructor
varun-edachali-dbx Jun 27, 2025
3944e39
Merge branch 'fetch-json-inline' into ext-links-sea
varun-edachali-dbx Jun 27, 2025
b3273c7
remove complex type conversion
varun-edachali-dbx Jun 27, 2025
38c2b88
correct fetch*_arrow
varun-edachali-dbx Jun 27, 2025
b77acbe
Merge branch 'sea-migration' into ext-links-sea
varun-edachali-dbx Jul 3, 2025
fa2359d
recover old sea tests
varun-edachali-dbx Jul 3, 2025
c07f709
move queue and result set into SEA specific dir
varun-edachali-dbx Jul 3, 2025
9e4ef2e
pass ssl_options into CloudFetchQueue
varun-edachali-dbx Jul 3, 2025
b00c06c
reduce diff
varun-edachali-dbx Jul 3, 2025
10f55f0
remove redundant conversion.py
varun-edachali-dbx Jul 3, 2025
cd119e9
fix type issues
varun-edachali-dbx Jul 3, 2025
d79638b
ValueError not ProgrammingError
varun-edachali-dbx Jul 3, 2025
f84578a
reduce diff
varun-edachali-dbx Jul 3, 2025
c621c0c
introduce SEA cloudfetch e2e tests
varun-edachali-dbx Jul 3, 2025
7958cd9
allow empty cloudfetch result
varun-edachali-dbx Jul 3, 2025
e2d17ff
add unit tests for CloudFetchQueue and SeaResultSet
varun-edachali-dbx Jul 3, 2025
d348b35
skip pyarrow dependent tests
varun-edachali-dbx Jul 3, 2025
811205e
Merge branch 'sea-migration' into ext-links-sea
varun-edachali-dbx Jul 3, 2025
4bd290e
simplify download process: no pre-fetching
varun-edachali-dbx Jul 4, 2025
dfbbf79
correct class name in logs
varun-edachali-dbx Jul 4, 2025
ed4d7ab
Merge branch 'sea-migration' into ext-links-sea
varun-edachali-dbx Jul 7, 2025
a5e9cdf
align with old impl
varun-edachali-dbx Jul 7, 2025
be16634
align next_n_rows with prev impl
varun-edachali-dbx Jul 7, 2025
6ec8656
align remaining_rows with prev impl
varun-edachali-dbx Jul 7, 2025
7ea7b75
remove un-necessary Optional params
varun-edachali-dbx Jul 7, 2025
64be07b
remove un-necessary changes in thrift field if tests
varun-edachali-dbx Jul 7, 2025
165644c
remove unused imports
varun-edachali-dbx Jul 7, 2025
4a9ba21
init hybrid
varun-edachali-dbx Jul 8, 2025
abef941
run large queries
varun-edachali-dbx Jul 8, 2025
e74ccd1
hybrid disposition
varun-edachali-dbx Jul 8, 2025
06a5d54
remove un-necessary log
varun-edachali-dbx Jul 8, 2025
9cf6e02
formatting (black)
varun-edachali-dbx Jul 8, 2025
b5578ca
remove redundant tests
varun-edachali-dbx Jul 8, 2025
7f67cfc
multi frame decompression of lz4
varun-edachali-dbx Jul 9, 2025
2202057
ensure no compression (temp)
varun-edachali-dbx Jul 9, 2025
92fe5fd
introduce separate link fetcher
varun-edachali-dbx Jul 9, 2025
54b43fa
log time to create table
varun-edachali-dbx Jul 9, 2025
2750c5e
add chunk index to table creation time log
varun-edachali-dbx Jul 9, 2025
239cd8d
remove custom multi-frame decompressor for lz4
varun-edachali-dbx Jul 9, 2025
57ebf06
remove excess logs
varun-edachali-dbx Jul 9, 2025
6894b4c
remove redundant tests (temp)
varun-edachali-dbx Jul 10, 2025
4abb3ad
add link to download manager before notifying consumer
varun-edachali-dbx Jul 10, 2025
fce324b
move link fetching immediately before table creation so link expiry i…
varun-edachali-dbx Jul 11, 2025
671dbca
Merge branch 'ext-links-sea' into sea-hybrid
varun-edachali-dbx Jul 11, 2025
7257168
Merge branch 'sea-hybrid' into sea-decouple-link-fetch
varun-edachali-dbx Jul 11, 2025
9590af7
resolve merge artifacts
varun-edachali-dbx Jul 11, 2025
82fc0b6
remove redundant methods
varun-edachali-dbx Jul 11, 2025
39469fa
Merge branch 'sea-migration' into ext-links-sea
varun-edachali-dbx Jul 11, 2025
b2d1579
formatting (black)
varun-edachali-dbx Jul 11, 2025
bd51b1c
introduce callback to handle link expiry
varun-edachali-dbx Jul 11, 2025
077a71c
fix types
varun-edachali-dbx Jul 11, 2025
7985639
fix param type in unit tests
varun-edachali-dbx Jul 11, 2025
0e1abfa
Merge branch 'ext-links-sea' into sea-hybrid
varun-edachali-dbx Jul 12, 2025
a736bb4
Merge branch 'sea-hybrid' into sea-decouple-link-fetch
varun-edachali-dbx Jul 12, 2025
0868fe3
formatting + minor type fixes
varun-edachali-dbx Jul 12, 2025
3f42a03
Revert "introduce callback to handle link expiry"
varun-edachali-dbx Jul 12, 2025
d038d84
remove unused callback (to be introduced later)
varun-edachali-dbx Jul 12, 2025
dfc32b4
Merge branch 'sea-migration' into ext-links-sea
varun-edachali-dbx Jul 14, 2025
0a0643b
correct param extraction
varun-edachali-dbx Jul 14, 2025
f374f5f
init sea link retry func
varun-edachali-dbx Jul 14, 2025
0829e67
correct unit tests for download management
varun-edachali-dbx Jul 14, 2025
a18be78
create strongly typed Future for download tasks
varun-edachali-dbx Jul 14, 2025
f7fd1d9
remove common constructor for databricks client abc
varun-edachali-dbx Jul 14, 2025
510b0a3
make SEA Http Client instance a private member
varun-edachali-dbx Jul 14, 2025
dd2864b
make GetChunksResponse model more robust
varun-edachali-dbx Jul 14, 2025
c32b281
add link to doc of GetChunk response model
varun-edachali-dbx Jul 14, 2025
0b1eba5
pass result_data instead of "initial links" into SeaCloudFetchQueue
varun-edachali-dbx Jul 14, 2025
777f7c1
move download_manager init into parent CloudFetchQueue
varun-edachali-dbx Jul 14, 2025
130b0d3
raise ServerOperationError for no 0th chunk
varun-edachali-dbx Jul 14, 2025
1920375
unused imports
varun-edachali-dbx Jul 14, 2025
d882c6e
Merge branch 'sea-migration' into ext-links-sea
varun-edachali-dbx Jul 15, 2025
5a43686
return None in case of empty response
varun-edachali-dbx Jul 15, 2025
28c6bb1
ensure table is empty on no initial links
varun-edachali-dbx Jul 15, 2025
4dd9434
Merge branch 'ext-links-sea' into sea-hybrid
varun-edachali-dbx Jul 15, 2025
5791745
account for total chunk count
varun-edachali-dbx Jul 15, 2025
2701e5d
iterate by chunk index instead of link
varun-edachali-dbx Jul 16, 2025
a618115
Merge branch 'sea-hybrid' into sea-decouple-link-fetch
varun-edachali-dbx Jul 16, 2025
eb24409
Merge branch 'sea-migration' into sea-decouple-link-fetch
varun-edachali-dbx Jul 17, 2025
f56497f
make LinkFetcher convert link static
varun-edachali-dbx Jul 17, 2025
76fd190
add helper for link addition, check for edge case to prevent inf wait
varun-edachali-dbx Jul 17, 2025
ab2c2c4
add unit tests for LinkFetcher
varun-edachali-dbx Jul 17, 2025
3af2333
remove un-necessary download manager check
varun-edachali-dbx Jul 17, 2025
480528e
remove un-necessary string literals around param type
varun-edachali-dbx Jul 17, 2025
fba1812
remove duplicate download_manager init
varun-edachali-dbx Jul 17, 2025
f2ea729
Merge branch 'sea-migration' into sea-decouple-link-fetch
varun-edachali-dbx Jul 19, 2025
23dba80
account for empty response in LinkFetcher init
varun-edachali-dbx Jul 20, 2025
26aa45c
make get_chunk_link return mandatory ExternalLink
varun-edachali-dbx Jul 20, 2025
9fd57d8
set shutdown_event instead of breaking on completion so get_chunk_lin…
varun-edachali-dbx Jul 20, 2025
dd009df
docstrings, logging, pydoc
varun-edachali-dbx Jul 21, 2025
7c360d1
use total_chunk_count > 0
varun-edachali-dbx Jul 21, 2025
6b1b972
clarify that link has already been submitted on getting row_offset
varun-edachali-dbx Jul 21, 2025
2ca7c2b
return None for out of range
varun-edachali-dbx Jul 21, 2025
00db613
default link_fetcher to None
varun-edachali-dbx Jul 21, 2025
47bb758
Merge branch 'sea-decouple-link-fetch' into sea-link-expiry
varun-edachali-dbx Jul 21, 2025
2cd802e
reduce repeated init
varun-edachali-dbx Jul 21, 2025
0475abf
Merge branch 'sea-migration' into sea-link-expiry
varun-edachali-dbx Jul 21, 2025
b5fbeb6
Merge branch 'less-defensive-download' into sea-link-expiry
varun-edachali-dbx Jul 21, 2025
b114599
add_link -> add_links for less heavy link additions
varun-edachali-dbx Jul 21, 2025
38c9343
Merge branch 'less-defensive-download' into sea-link-expiry
varun-edachali-dbx Jul 21, 2025
92821e7
move param order
varun-edachali-dbx Jul 21, 2025
63deb04
type issues
varun-edachali-dbx Jul 21, 2025
ade7054
Merge branch 'less-defensive-download' into sea-link-expiry
varun-edachali-dbx Jul 21, 2025
12 changes: 6 additions & 6 deletions examples/experimental/tests/test_sea_sync_query.py
@@ -4,9 +4,10 @@
import os
import sys
import logging
import time
from databricks.sql.client import Connection

logging.basicConfig(level=logging.INFO)
logging.basicConfig(level=logging.DEBUG)
logger = logging.getLogger(__name__)


@@ -51,20 +52,19 @@ def test_sea_sync_query_with_cloud_fetch():
)

# Execute a query that generates large rows to force multiple chunks
requested_row_count = 10000
requested_row_count = 100000000
cursor = connection.cursor()
query = f"""
SELECT
id,
concat('value_', repeat('a', 10000)) as test_value
FROM range(1, {requested_row_count} + 1) AS t(id)
SELECT * FROM samples.tpch.lineitem LIMIT {requested_row_count}
"""

logger.info(
f"Executing synchronous query with cloud fetch to generate {requested_row_count} rows"
)
cursor.execute(query)
results = [cursor.fetchone()]
logger.info("SLEEPING FOR 1000 SECONDS TO EXPIRE LINKS")
time.sleep(1000)
results.extend(cursor.fetchmany(10))
results.extend(cursor.fetchall())
actual_row_count = len(results)
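
For context, a condensed sketch of the scenario this example script now exercises: run a large cloud-fetch query over SEA, sit idle for longer than the presigned-link TTL, then keep fetching. With this PR the connector is expected to re-fetch the expired links transparently instead of surfacing an error. The hostname, HTTP path, token, the `use_sea` keyword name, and the sleep duration below are placeholders, not values prescribed by this change.

```python
# Hypothetical sketch of the link-expiry scenario; connection arguments are placeholders.
import time

from databricks.sql.client import Connection

connection = Connection(
    server_hostname="<workspace-host>",
    http_path="<warehouse-http-path>",
    access_token="<personal-access-token>",
    use_sea=True,          # assumed flag name for the SEA backend
    use_cloud_fetch=True,  # request EXTERNAL_LINKS / cloud fetch results
)

cursor = connection.cursor()
cursor.execute("SELECT * FROM samples.tpch.lineitem LIMIT 100000000")

rows = [cursor.fetchone()]      # the first chunk downloads here
time.sleep(1000)                # outlive the presigned URL TTL
rows.extend(cursor.fetchall())  # should trigger link re-fetch rather than an Error

print(f"fetched {len(rows)} rows after link expiry")
connection.close()
```
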
35 changes: 32 additions & 3 deletions src/databricks/sql/backend/sea/queue.py
@@ -189,9 +189,11 @@ def _add_links(self, links: List[ExternalLink]):
len(links),
", ".join(str(l.chunk_index) for l in links) if links else "<none>",
)
for link in links:
self.chunk_index_to_link[link.chunk_index] = link
self.download_manager.add_link(LinkFetcher._convert_to_thrift_link(link))

self.chunk_index_to_link.update({link.chunk_index: link for link in links})
self.download_manager.add_links(
[LinkFetcher._convert_to_thrift_link(link) for link in links]
)

def _get_next_chunk_index(self) -> Optional[int]:
"""Return the next *chunk_index* that should be requested from the backend, or ``None`` if we have them all."""
@@ -281,9 +283,27 @@ def _worker_loop(self):
with self._link_data_update:
self._link_data_update.notify_all()

def _restart_from_expired_link(self, link: TSparkArrowResultLink):
"""Restart the link fetcher from the expired link."""
self.stop()

with self._link_data_update:
self.download_manager.cancel_tasks_from_offset(link.startRowOffset)

chunks_to_restart = []
for chunk_index, l in self.chunk_index_to_link.items():
if l.row_offset < link.startRowOffset:
continue
chunks_to_restart.append(chunk_index)
for chunk_index in chunks_to_restart:
self.chunk_index_to_link.pop(chunk_index)

self.start()

def start(self):
"""Spawn the worker thread."""
logger.debug("LinkFetcher[%s]: starting worker thread", self._statement_id)
self._shutdown_event.clear()
self._worker_thread = threading.Thread(
target=self._worker_loop, name=f"LinkFetcher-{self._statement_id}"
)
@@ -333,6 +353,7 @@ def __init__(
schema_bytes=None,
lz4_compressed=lz4_compressed,
description=description,
expiry_callback=self._expiry_callback,
# TODO: fix these arguments when telemetry is implemented in SEA
session_id_hex=None,
chunk_id=0,
@@ -363,6 +384,14 @@ def __init__(
# Initialize table and position
self.table = self._create_next_table()

def _expiry_callback(self, link: TSparkArrowResultLink):
logger.info(
f"SeaCloudFetchQueue: Link expired, restarting from offset {link.startRowOffset}"
)
if not self.link_fetcher:
return
self.link_fetcher._restart_from_expired_link(link)

def _create_next_table(self) -> "pyarrow.Table":
"""Create next table by retrieving the logical next downloaded file."""
if self.link_fetcher is None:
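
Taken together, the new pieces in this file form a small restart loop: the queue registers `_expiry_callback` with the download manager, and when a link is reported expired the `LinkFetcher` cancels work at or past that offset, forgets the affected chunk links, and restarts its worker. A simplified, single-threaded model of that flow (toy types in place of the SEA/Thrift models, no locks or worker thread):

```python
# Simplified, self-contained model of the restart-on-expiry flow (illustrative only).
from dataclasses import dataclass
from typing import Dict


@dataclass
class Link:
    chunk_index: int
    row_offset: int


class ToyDownloadManager:
    def cancel_tasks_from_offset(self, start_row_offset: int) -> None:
        print(f"cancelling download tasks from row offset {start_row_offset}")


class ToyLinkFetcher:
    def __init__(self, links: Dict[int, Link]) -> None:
        self.chunk_index_to_link = dict(links)
        self.download_manager = ToyDownloadManager()

    def stop(self) -> None:
        print("stopping worker thread")

    def start(self) -> None:
        print("restarting worker thread")

    def restart_from_expired_link(self, expired: Link) -> None:
        self.stop()
        self.download_manager.cancel_tasks_from_offset(expired.row_offset)
        # Forget every chunk at or beyond the expired offset so it is re-requested.
        stale = [
            idx
            for idx, link in self.chunk_index_to_link.items()
            if link.row_offset >= expired.row_offset
        ]
        for idx in stale:
            self.chunk_index_to_link.pop(idx)
        self.start()


fetcher = ToyLinkFetcher({0: Link(0, 0), 1: Link(1, 1000), 2: Link(2, 2000)})
fetcher.restart_from_expired_link(Link(1, 1000))
print(sorted(fetcher.chunk_index_to_link))  # [0] -- chunks 1 and 2 will be fetched again
```

Note also the one-line change to `start()`: clearing `_shutdown_event` before spawning the thread is what makes the fetcher restartable at all.
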
89 changes: 74 additions & 15 deletions src/databricks/sql/cloudfetch/download_manager.py
@@ -2,7 +2,7 @@

from concurrent.futures import ThreadPoolExecutor, Future
import threading
from typing import List, Union, Tuple, Optional
from typing import Callable, List, Optional, Union, Generic, TypeVar, Tuple

from databricks.sql.cloudfetch.downloader import (
ResultSetDownloadHandler,
@@ -16,6 +16,27 @@

logger = logging.getLogger(__name__)

T = TypeVar("T")


class TaskWithMetadata(Generic[T]):
"""
Wrapper around Future that stores additional metadata (the link).
Provides type-safe access to both the Future result and the associated link.
"""

def __init__(self, future: Future[T], link: TSparkArrowResultLink):
self.future = future
self.link = link

def result(self, timeout: Optional[float] = None) -> T:
"""Get the result of the Future, blocking if necessary."""
return self.future.result(timeout)

def cancel(self) -> bool:
"""Cancel the Future if possible."""
return self.future.cancel()
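
The wrapper exists so that cancellation logic can later see which link a pending download belongs to. A minimal standalone illustration of the idea (a plain dict stands in for `TSparkArrowResultLink`, and `fake_download` is a dummy):

```python
# Minimal illustration of keeping a link next to the Future created for it.
from concurrent.futures import Future, ThreadPoolExecutor
from typing import Generic, TypeVar

T = TypeVar("T")


class TaskWithMetadata(Generic[T]):
    def __init__(self, future: "Future[T]", link: dict):
        self.future = future
        self.link = link


def fake_download(link: dict) -> bytes:
    return b"x" * link["rowCount"]  # stand-in for the real file download


with ThreadPoolExecutor(max_workers=2) as pool:
    link = {"startRowOffset": 0, "rowCount": 10}
    task = TaskWithMetadata(pool.submit(fake_download, link), link)
    # The link remains inspectable (e.g. for offset-based cancellation)
    # even while the download itself is still pending.
    print(task.link["startRowOffset"], len(task.future.result()))
```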


class ResultFileDownloadManager:
def __init__(
@@ -27,6 +48,7 @@ def __init__(
session_id_hex: Optional[str],
statement_id: str,
chunk_id: int,
expiry_callback: Callable[[TSparkArrowResultLink], None],
):
self._pending_links: List[Tuple[int, TSparkArrowResultLink]] = []
self.chunk_id = chunk_id
@@ -44,11 +66,12 @@ def __init__(
self._max_download_threads: int = max_download_threads

self._download_condition = threading.Condition()
self._download_tasks: List[Future[DownloadedFile]] = []
self._download_tasks: List[TaskWithMetadata[DownloadedFile]] = []
self._thread_pool = ThreadPoolExecutor(max_workers=self._max_download_threads)

self._downloadable_result_settings = DownloadableResultSettings(lz4_compressed)
self._ssl_options = ssl_options
self._expiry_callback = expiry_callback
self.session_id_hex = session_id_hex
self.statement_id = statement_id

@@ -89,6 +112,41 @@ def get_next_downloaded_file(self, next_row_offset: int) -> DownloadedFile:

return file

def cancel_tasks_from_offset(self, start_row_offset: int):
"""
Cancel all download tasks starting from a specific row offset.
This is used when links expire and we need to restart from a certain point.

Args:
start_row_offset (int): Row offset from which to cancel tasks
"""

def to_cancel(link: TSparkArrowResultLink) -> bool:
return link.startRowOffset < start_row_offset

tasks_to_cancel = [
task for task in self._download_tasks if to_cancel(task.link)
]
for task in tasks_to_cancel:
task.cancel()
logger.info(
f"ResultFileDownloadManager: cancelled {len(tasks_to_cancel)} tasks from offset {start_row_offset}"
)

# Remove cancelled tasks from the download queue
tasks_to_keep = [
task for task in self._download_tasks if not to_cancel(task.link)
]
self._download_tasks = tasks_to_keep

pending_links_to_keep = [
link for link in self._pending_links if not to_cancel(link[1])
]
self._pending_links = pending_links_to_keep
logger.info(
f"ResultFileDownloadManager: removed {len(self._pending_links) - len(pending_links_to_keep)} links from pending links"
)

def _schedule_downloads(self):
"""
While download queue has a capacity, peek pending links and submit them to thread pool.
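
The cancellation helper added above follows a cancel-then-prune pattern: cancel the matching futures, then rebuild both the in-flight task list and the pending-link list from the complement. A toy standalone version of the same pattern follows; the predicate, data, and timings are illustrative only, not the exact condition used in `cancel_tasks_from_offset`.

```python
# Toy cancel-then-prune over (future, row_offset) download tasks.
import time
from concurrent.futures import ThreadPoolExecutor


def slow_download(offset: int) -> int:
    time.sleep(0.5)  # pretend to download a chunk
    return offset


pool = ThreadPoolExecutor(max_workers=1)
tasks = [(pool.submit(slow_download, off), off) for off in (0, 1000, 2000, 3000)]

cutoff = 2000
cancelled = [t for t in tasks if t[1] >= cutoff]  # illustrative predicate
for future, _ in cancelled:
    future.cancel()  # only futures that have not started can be cancelled

tasks = [t for t in tasks if t[1] < cutoff]  # prune the cancelled entries
print([off for _, off in tasks])  # [0, 1000]
pool.shutdown(wait=False, cancel_futures=True)
```

`Future.cancel()` only succeeds for tasks that have not started running, so a download already in flight will finish and simply be discarded rather than interrupted.
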
@@ -107,34 +165,35 @@ def _schedule_downloads(self):
settings=self._downloadable_result_settings,
link=link,
ssl_options=self._ssl_options,
expiry_callback=self._expiry_callback,
chunk_id=chunk_id,
session_id_hex=self.session_id_hex,
statement_id=self.statement_id,
)
task = self._thread_pool.submit(handler.run)
future = self._thread_pool.submit(handler.run)
task = TaskWithMetadata(future, link)
self._download_tasks.append(task)

with self._download_condition:
self._download_condition.notify_all()

def add_link(self, link: TSparkArrowResultLink):
def add_links(self, links: List[TSparkArrowResultLink]):
"""
Add more links to the download manager.

Args:
link: Link to add
links (List[TSparkArrowResultLink]): The links to add to the download manager.
"""

if link.rowCount <= 0:
return

logger.debug(
"ResultFileDownloadManager: adding file link, start offset {}, row count: {}".format(
link.startRowOffset, link.rowCount
for link in links:
if link.rowCount <= 0:
continue
logger.debug(
"ResultFileDownloadManager: adding file link, start offset {}, row count: {}".format(
link.startRowOffset, link.rowCount
)
)
)
self._pending_links.append((self.chunk_id, link))
self.chunk_id += 1
self._pending_links.append((self.chunk_id, link))
self.chunk_id += 1

self._schedule_downloads()

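
A small sketch of the bookkeeping in the batched `add_links` above (standalone, with dicts standing in for the Thrift link type): empty links are skipped, and every accepted link still receives the next sequential chunk id.

```python
# Standalone sketch of the add_links bookkeeping: skip empty links, number the rest.
pending_links = []
chunk_id = 0

links = [
    {"startRowOffset": 0, "rowCount": 1000},
    {"startRowOffset": 1000, "rowCount": 0},  # skipped: nothing to download
    {"startRowOffset": 1000, "rowCount": 500},
]

for link in links:
    if link["rowCount"] <= 0:
        continue
    pending_links.append((chunk_id, link))
    chunk_id += 1

print([(cid, l["startRowOffset"]) for cid, l in pending_links])  # [(0, 0), (1, 1000)]
```
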
12 changes: 6 additions & 6 deletions src/databricks/sql/cloudfetch/downloader.py
@@ -1,5 +1,6 @@
import logging
from dataclasses import dataclass
from typing import Callable
from typing import Optional

import requests
@@ -69,13 +70,15 @@ def __init__(
settings: DownloadableResultSettings,
link: TSparkArrowResultLink,
ssl_options: SSLOptions,
expiry_callback: Callable[[TSparkArrowResultLink], None],
chunk_id: int,
session_id_hex: Optional[str],
statement_id: str,
):
self.settings = settings
self.link = link
self._ssl_options = ssl_options
self._expiry_callback = expiry_callback
self.chunk_id = chunk_id
self.session_id_hex = session_id_hex
self.statement_id = statement_id
@@ -96,9 +99,7 @@ def run(self) -> DownloadedFile:
)

# Check if link is already expired or is expiring
ResultSetDownloadHandler._validate_link(
self.link, self.settings.link_expiry_buffer_secs
)
self._validate_link(self.link, self.settings.link_expiry_buffer_secs)

session = requests.Session()
session.mount("http://", HTTPAdapter(max_retries=retryPolicy))
@@ -146,8 +147,7 @@ def run(self) -> DownloadedFile:
if session:
session.close()

@staticmethod
def _validate_link(link: TSparkArrowResultLink, expiry_buffer_secs: int):
def _validate_link(self, link: TSparkArrowResultLink, expiry_buffer_secs: int):
"""
Check if a link has expired or will expire.

@@ -159,7 +159,7 @@ def _validate_link(link: TSparkArrowResultLink, expiry_buffer_secs: int):
link.expiryTime <= current_time
or link.expiryTime - current_time <= expiry_buffer_secs
):
raise Error("CloudFetch link has expired")
self._expiry_callback(link)

@staticmethod
def _decompress_data(compressed_data: bytes) -> bytes:
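
The behavioural change above is small but central to the PR: when a link has expired (or will expire within the buffer), the handler now hands the link to the registered callback instead of raising. A standalone sketch of that check, reusing the same condition (field names follow the Thrift result link; the callback is whatever the owning queue registered):

```python
# Standalone sketch: expiry check that defers to a callback instead of raising.
import time
from typing import Callable


def validate_link(link, expiry_buffer_secs: int,
                  expiry_callback: Callable[[object], None]) -> None:
    current_time = int(time.time())
    if (
        link.expiryTime <= current_time
        or link.expiryTime - current_time <= expiry_buffer_secs
    ):
        # Previously: raise Error("CloudFetch link has expired")
        expiry_callback(link)


class FakeLink:
    expiryTime = int(time.time()) - 1  # already expired


validate_link(
    FakeLink(),
    expiry_buffer_secs=0,
    expiry_callback=lambda link: print("expired link, asking the queue to re-fetch"),
)
```
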
12 changes: 10 additions & 2 deletions src/databricks/sql/utils.py
@@ -1,5 +1,5 @@
from __future__ import annotations
from typing import Dict, List, Optional, Union
from typing import Callable, Dict, List, Optional, Union

from dateutil import parser
import datetime
@@ -14,6 +14,8 @@

import lz4.frame

from databricks.sql.exc import Error

try:
import pyarrow
except ImportError:
@@ -227,6 +229,7 @@ def __init__(
schema_bytes: Optional[bytes] = None,
lz4_compressed: bool = True,
description: List[Tuple] = [],
expiry_callback: Callable[[TSparkArrowResultLink], None] = lambda _: None,
):
"""
Initialize the base CloudFetchQueue.
@@ -261,6 +264,7 @@ def __init__(
session_id_hex=session_id_hex,
statement_id=statement_id,
chunk_id=chunk_id,
expiry_callback=expiry_callback,
)

def next_n_rows(self, num_rows: int) -> "pyarrow.Table":
@@ -381,6 +385,7 @@ def __init__(
session_id_hex=session_id_hex,
statement_id=statement_id,
chunk_id=chunk_id,
expiry_callback=self._expiry_callback,
)

self.start_row_index = start_row_offset
@@ -404,11 +409,14 @@ def __init__(
result_link.startRowOffset, result_link.rowCount
)
)
self.download_manager.add_link(result_link)
self.download_manager.add_links(self.result_links)

# Initialize table and position
self.table = self._create_next_table()

def _expiry_callback(self, link: TSparkArrowResultLink):
raise Error("Cloudfetch link has expired")

def _create_next_table(self) -> "pyarrow.Table":
if self.num_links_downloaded >= len(self.result_links):
return self._create_empty_table()
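
For reference, the two policies wired through the shared `expiry_callback` parameter end up looking roughly like this (a sketch pieced together from the diffs above, not a new API):

```python
# Sketch of the two expiry policies threaded through the shared CloudFetchQueue base.
from databricks.sql.exc import Error


def thrift_expiry_callback(link):
    # ThriftCloudFetchQueue keeps the pre-existing behaviour: expiry is fatal.
    raise Error("Cloudfetch link has expired")


def sea_expiry_callback(link, link_fetcher):
    # SeaCloudFetchQueue instead restarts the link fetcher from the expired offset.
    if link_fetcher is not None:
        link_fetcher._restart_from_expired_link(link)
```

The base-class default (`lambda _: None`) keeps existing queue subclasses working without registering anything.
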
4 changes: 4 additions & 0 deletions tests/unit/test_download_manager.py
@@ -14,11 +14,15 @@ class DownloadManagerTests(unittest.TestCase):
def create_download_manager(
self, links, max_download_threads=10, lz4_compressed=True
):
def expiry_callback(link: TSparkArrowResultLink):
return None

return download_manager.ResultFileDownloadManager(
links,
max_download_threads,
lz4_compressed,
ssl_options=SSLOptions(),
expiry_callback=expiry_callback,
session_id_hex=Mock(),
statement_id=Mock(),
chunk_id=0,
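
A hypothetical companion test for the new code path (names and structure are illustrative; the actual suite may cover this differently): an already-expired link should route through the expiry callback rather than raising.

```python
# Hypothetical unit-test sketch: an expired link should invoke the expiry callback.
import time
from unittest.mock import MagicMock, Mock


def test_expired_link_invokes_callback():
    expiry_callback = MagicMock()
    link = Mock(expiryTime=int(time.time()) - 10)  # already expired

    # Same condition as _validate_link in downloader.py.
    buffer_secs = 0
    current_time = int(time.time())
    if (
        link.expiryTime <= current_time
        or link.expiryTime - current_time <= buffer_secs
    ):
        expiry_callback(link)

    expiry_callback.assert_called_once_with(link)
```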