
object_store: Retry on connection duration timeouts? #6287

Open
flokli opened this issue Aug 21, 2024 · 8 comments · May be fixed by #6519
Assignees
Labels
enhancement Any new improvement worthy of a entry in the changelog object-store Object Store Interface

Comments

@flokli

flokli commented Aug 21, 2024

Is your feature request related to a problem or challenge?
I'm using object_store to stream large(r) files, in this case into the body of an HTTP response.

I essentially do a `store.get(path).await?.into_stream()` to get the data stream.

When using it with the Azure backend, I noticed that Azure reliably closes the connection after 30 seconds. Other providers (S3) also explicitly inject errors, but keep most of the error handling in their SDKs.

I know there's some retry logic for some error cases in object_store, but the "connection closed while getting data as a stream" part doesn't seem to be covered. I think there should be a high-level function that retries receiving the remaining data in these error cases (verifying the etag is still the same).
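The resume part of what is being proposed could be sketched as a small helper that, given the bytes already received and the object's total size, yields the range for the follow-up request; the etag check would then guard against the object having changed in between. This is purely illustrative, not object_store API, and `resume_range` is a hypothetical name:

```rust
use std::ops::Range;

// Hypothetical helper (not part of object_store): compute the byte range to
// request when resuming an interrupted streaming GET. The caller would also
// verify the object's etag to ensure it hasn't changed in the meantime.
fn resume_range(bytes_received: u64, total_size: u64) -> Option<Range<u64>> {
    if bytes_received >= total_size {
        None // everything was received; nothing to resume
    } else {
        Some(bytes_received..total_size)
    }
}

fn main() {
    // Connection dropped after 1 MiB of a 10 MiB object: resume from there.
    assert_eq!(resume_range(1_048_576, 10_485_760), Some(1_048_576..10_485_760));
    // Fully received: no follow-up request needed.
    assert_eq!(resume_range(10_485_760, 10_485_760), None);
}
```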

Describe alternatives you've considered
Manually dealing with error handling and retries in all object_store consumers

@flokli flokli added the enhancement Any new improvement worthy of a entry in the changelog label Aug 21, 2024
@cbrewster

cbrewster commented Aug 21, 2024

The golang GCS client supports automatic retrying here, by keeping track of how many bytes the client has already seen and issuing a new ranged read request to resume the read where the client left off:

https://github.com/googleapis/google-cloud-go/blob/9afb797d75499807b29c372ec375668be4d2995e/storage/http_client.go#L1268-L1275

@flokli
Author

flokli commented Aug 22, 2024

So apparently `object_store::buffered::BufReader` might be the answer to this; at the very least it avoids long-running connections to the backend.

I feel like the docstring of `into_stream()` should be extended to point out that you might want to use `object_store::buffered::BufReader` instead.

I'm not entirely happy that `object_store::buffered::BufReader` doesn't do any readahead, meaning we'd have to wait the summed-up round-trip times whenever the buffer runs empty, but that's probably also tricky to do in the general case.

I'm also not sure whether the `get_range` calls it makes are actually retried in case of temporary connectivity issues / rate limiting.

@itsjunetime
Contributor

take

@crepererum crepererum removed their assignment Sep 16, 2024
@erratic-pattern
Contributor

take

@itsjunetime itsjunetime removed their assignment Sep 30, 2024
erratic-pattern added a commit to erratic-pattern/arrow-rs that referenced this issue Oct 6, 2024
Closes apache#6287

This PR includes `reqwest::Error::Decode` as an error case to retry on,
which can occur when a server drops a connection in the middle of
sending the response body.
@erratic-pattern
Contributor

erratic-pattern commented Oct 6, 2024

I've made PR #6519, similar to the closed PR #5383, but it only permits `reqwest::Error::Decode` to be retried, since this is the error type that we see associated with this issue (see #5882).

EDIT: I am not sure this is exactly the same issue that @flokli is seeing, but it is the one being seen in #5882, so maybe I should associate this PR with that issue instead?

@alamb
Contributor

alamb commented Oct 8, 2024

Given the subtle nature of errors and retries in the context of streaming requests, I think the only practical way forward will be to create an example / test case that shows the problem.

The example would likely look something like:

  1. A mock HTTP server that pretends to be an S3 endpoint
  2. Some settings in the mock server that can inject errors (like aborting after some bytes have already been returned)
  3. Tests of how the object store client then handles those errors / how it would automatically retry

This test would also let us explore the various issues / corner cases that could be encountered.

Writing the test harness would likely be a significant undertaking, but that would be the best way to inform a proposal for API changes.

@tustvold
Contributor

tustvold commented Oct 8, 2024

We already have all of this set up as part of the retry tests: https://github.com/apache/arrow-rs/blob/master/object_store/src/client/retry.rs#L505. It would be a relatively straightforward enhancement to inject a failure mid-stream.

Edit: As for what the API would look like: given that the only streaming request is `get`, and the corresponding logic is already extracted into `GetClientExt` (which is shared across all implementations), that would be the obvious place to put any retry logic for streaming requests.

@alamb
Contributor

alamb commented Oct 8, 2024

Perfect! So maybe @erratic-pattern you can make a PR with the failure scenario? It would be valuable to document the current behavior in code, even if we decide we don't have the time to fix it.

7 participants