
Should object store retry on connection reset by peer? #5378

Closed
kszlim opened this issue Feb 9, 2024 · 6 comments
Labels
question Further information is requested

Comments

@kszlim (Contributor) commented Feb 9, 2024

Which part is this question about
Object store

Describe your question
Should object store's retry logic also retry when the connection is reset by a peer?

Additional context
I'm querying an S3 bucket via Polars (which uses object_store under the hood) and I'm encountering this issue. It only happens when I'm querying many files (~10k):

Generic S3 error: Error after 0 retries in 221.182308ms, max_retries:10, retry_timeout:10s, source:error sending request for url (https://s3.us-west-2.amazonaws.com/bucket/some_parquet_id.parquet): connection error: Connection reset by peer (os error 104)

This seems to be because this error isn't covered by object_store's retry policy.

It'd be very handy if it were, though I'm not certain it'd be strictly correct behavior (i.e. should it be handled by the caller of object_store instead, which in this case is Polars?).
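For reference, the max_retries and retry_timeout values in that error come from object_store's RetryConfig. A minimal sketch of how a caller might set them (the bucket, region, and values here are illustrative placeholders, not necessarily how Polars configures the store):

```rust
use std::time::Duration;

use object_store::aws::{AmazonS3, AmazonS3Builder};
use object_store::RetryConfig;

// Sketch only: bucket and region are placeholders; the retry values mirror
// the ones shown in the error message above.
fn build_store() -> object_store::Result<AmazonS3> {
    let retry = RetryConfig {
        max_retries: 10,
        retry_timeout: Duration::from_secs(10),
        ..RetryConfig::default()
    };
    AmazonS3Builder::new()
        .with_bucket_name("bucket")
        .with_region("us-west-2")
        .with_retry(retry)
        .build()
}
```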

kszlim added the question (Further information is requested) label on Feb 9, 2024
@tustvold (Contributor) commented Feb 9, 2024

We probably could retry, but a connection reset normally means you are hitting rate limits and should reduce the amount of concurrent IO you are performing. There is a LimitStore that might achieve this if Polars can't do it itself.
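A rough sketch of that suggestion, wrapping an S3 store in LimitStore to cap concurrent requests (the bucket name and the limit of 16 are placeholders, not values from this issue):

```rust
use object_store::aws::{AmazonS3, AmazonS3Builder};
use object_store::limit::LimitStore;

// Sketch only: LimitStore caps the number of requests in flight at once,
// which is one way to throttle concurrent IO against the bucket.
fn limited_store() -> object_store::Result<LimitStore<AmazonS3>> {
    let s3 = AmazonS3Builder::from_env()
        .with_bucket_name("bucket")
        .build()?;
    Ok(LimitStore::new(s3, 16))
}
```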

@kszlim (Contributor, Author) commented Feb 9, 2024

> ...means you are hitting rate limits and should reduce the amount of concurrent IO you are performing. There is a LimitStore...

Hmm, I've tried reducing concurrency in Polars (to 32) and it doesn't seem to fix this, and I doubt I'm anywhere near the stated S3 rate limit of 5,500 GET requests per partitioned prefix per second.

If it's not incorrect behavior, do you think it makes sense to build this condition into object_store's retries as well, @tustvold?

@kszlim (Contributor, Author) commented Feb 10, 2024

Do you think it'd be alright to modify
https://github.com/apache/arrow-rs/blob/master/object_store/src/client/retry.rs#L266

to

!(e.is_parse() || e.is_parse_status() || e.is_parse_too_large() || e.is_user() || e.is_canceled())

(note the negation), which would be much more aggressive with respect to which states to retry on? Ideally hyper's error Kind would be publicly exposed to allow deeper introspection into which failure modes can be retried, but that doesn't seem like it's going to be on the roadmap for a while (see hyperium/hyper#2845).

Whilst this might be more aggressive than desired, I think users could adjust their retry/backoff config accordingly to compensate. What do you think?
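For illustration, the proposed predicate pulled out into a standalone helper (a sketch only, assuming the hyper 0.14-era error methods named above; the actual retry.rs code is structured differently):

```rust
// Sketch of the proposed change, not the actual retry.rs code: retry on any
// hyper error that is not clearly a parse failure, a user error, or a
// cancelled request.
fn is_retryable(e: &hyper::Error) -> bool {
    !(e.is_parse()
        || e.is_parse_status()
        || e.is_parse_too_large()
        || e.is_user()
        || e.is_canceled())
}
```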

@kszlim (Contributor, Author) commented Feb 10, 2024

I opened PR #5383; feel free to close it if you think it's unreasonable.

@tustvold (Contributor) commented Mar 6, 2024

The conclusion of pola-rs/polars#14598 appears to be that this was an upstream issue, so closing this. Feel free to reopen if I am mistaken.

tustvold closed this as not planned on Mar 6, 2024
@flokli commented Aug 21, 2024

I've been running into something like this: apparently Azure has a 30-second request duration limit and closes the connection afterwards. I opened #6287 for this.
