Add basic retry for rust crypto outgoing requests #4061
Retrying 504 isn't necessarily safe, as the action you requested may have already been performed but the response from the backend was dropped. It can only be retried on idempotent APIs, and even there a resulting error may need to be caught accordingly so the caller isn't told that the API failed due to e.g. M_NOT_FOUND or whatever.
OK, I removed 504 from the list. It's true that some of the affected requests are idempotent, but I'll keep this PR simple for now.
Virtually (?) all Matrix APIs are idempotent, deliberately so that they can be safely retried by clients. I would argue that, for simplicity, all 5xx response codes should be retried. We should maybe also think about timeouts. Should we ever retry when the request times out?
Please fix the PR title; it implies this applies to outgoing requests generally, not specifically to a part of rust crypto.
I updated it. Given that all requests are idempotent, we now retry on any 5xx response.
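To make the rule agreed above concrete, here is a minimal sketch of a status check that retries any 5xx response. The function name isRetriableStatus is hypothetical and is not taken from the PR.

```typescript
// Hypothetical helper (not the PR's code): treat every 5xx response as retriable.
// Client errors (4xx) are not retried, since repeating the request would not help.
function isRetriableStatus(httpStatus: number): boolean {
    return httpStatus >= 500 && httpStatus < 600;
}
```

Under this sketch a 504 is retried like any other server error, which is only safe because the requests involved are assumed to be idempotent.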
currentRetryCount++;

const maybeRetryAfter = this.shouldWaitBeforeRetryingMillis(e);
There is already a class that implements the retry logic. It would probably be better to use that instead of writing something else here. https://github.com/matrix-org/matrix-js-sdk/blob/develop/src/scheduler.ts#L54
Yes, I have seen it. Indeed, I think there should be a retry mechanism inside FetchHttpApi, maybe with a new flag in IRequestOpts, retriable (default true, as most Matrix requests are idempotent).
Then the event scheduler will benefit from it too (unless there needs to be special logic for events).
It will help on other issues like element-hq/element-web#26967
The thing is that it would impact the application more or less everywhere, so I'd like to have support from the web team if we want to go that way.
(As a side note, I noticed that the scheduler blocks retries on 502/M_TOO_LARGE. That might not be common for rust outgoing requests, but we might want to exclude it too.)
There are two different suggestions in this thread:
- MatrixScheduler and OutgoingRequestManager both implement logic to decide whether to retry a request, and if so when. We could extract that logic and share it, rather than having two copies. (Presumably, we would make a new method somewhere, and have MatrixScheduler.RETRY_BACKOFF_RATELIMIT call it.) That seems like quite a good idea to me. @BillCarsonFr: any reason we can't do this?
- Move the retry mechanism down to FetchHttpApi, so that everything (including MatrixScheduler and OutgoingRequestManager) can benefit from it. This seems like a much more invasive change, and one I don't think we should do right now.
any reason we can't do this?
The RETRY_BACKOFF_RATELIMIT function is a bit strange: it has an unused parameter, and it tries to cast a MatrixError into a ConnectionError (this can't work? or maybe at runtime?).
It doesn't want to retry on ConnectionError (for example, if the client's internet connection is down), whereas we do want to retry in that case.
So I think we can't use it directly, as it doesn't seem to be the same logic, and touching it would also be quite invasive. The only thing we could easily reuse is the exponential backoff, but that's just one line.
It's still bad that the logic doesn't match, but we might want to unify that in a second step?
OK, so after discussion I extracted a common retry algorithm shared between the scheduler and the outgoing request processor.
Two slight changes compared to the previous state:
- The outgoing request processor tries 5 times in total instead of 4
- The Matrix scheduler no longer retries on client timeouts (as this timeout is already > 1 min, depending on the browser)
Here is the refactoring commit c2126a6
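A shared backoff helper along the lines described above might look like the sketch below. The names computeBackoffMs, MAX_ATTEMPTS and BASE_DELAY_MS, the 1-second base delay, and the -1 "do not retry" return value are assumptions for illustration, not the code in commit c2126a6.

```typescript
// Assumption: 5 attempts in total, matching the number quoted in the comment above.
const MAX_ATTEMPTS = 5;
// Assumption: the base delay for the exponential backoff is 1 second.
const BASE_DELAY_MS = 1000;

/**
 * Hypothetical shared helper: decides how long to wait before the next attempt.
 *
 * @param attemptsSoFar - number of attempts already made (1 after the first failure)
 * @param retryAfterMs - delay requested by the server, if any
 * @returns the delay in milliseconds, or -1 if no further attempt should be made
 */
function computeBackoffMs(attemptsSoFar: number, retryAfterMs?: number): number {
    if (attemptsSoFar >= MAX_ATTEMPTS) {
        return -1; // attempts exhausted: give up
    }
    if (retryAfterMs !== undefined) {
        return retryAfterMs; // the server told us how long to wait
    }
    // Exponential backoff: 2s, 4s, 8s, 16s for attempts 1 to 4.
    return BASE_DELAY_MS * Math.pow(2, attemptsSoFar);
}
```

Keeping the decision in one pure function is what would let both callers share it: each caller only has to track its own attempt count and whether the error is one it wants to retry at all.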
Looks sensible but a few suggestions for cleanups
// All keys/signatures uploads are, message and to device are because of txn_id, keys claim in worst case will claim
// several keys but won't cause harm.
This sentence is hard to understand.
// All keys/signatures uploads are, message and to device are because of txn_id, keys claim in worst case will claim
// several keys but won't cause harm.
// * All key/signature uploads are idempotent.
// * Room message and to-device send requests are idempotent because of txn_id.
// * Keys claim in worst case will claim several keys but won't cause harm.
thx, updated
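The point about txn_id in the suggested comment can be illustrated with a toy model of a server that deduplicates on the transaction ID. FakeServer and its fields are purely hypothetical and do not reflect how any homeserver is implemented; the sketch only shows why resending with the same txn_id is harmless.

```typescript
// Hypothetical in-memory model of server-side deduplication by transaction ID.
class FakeServer {
    // Maps a transaction ID to the event ID that was created for it.
    private readonly seen = new Map<string, string>();
    public readonly events: string[] = [];

    public send(txnId: string, body: string): string {
        const existing = this.seen.get(txnId);
        if (existing !== undefined) {
            // A retry of a request that already succeeded: return the same
            // event ID instead of storing the message a second time.
            return existing;
        }
        const eventId = `$event${this.events.length}`;
        this.events.push(body);
        this.seen.set(txnId, eventId);
        return eventId;
    }
}

const server = new FakeServer();
const first = server.send("txn1", "hello");
const retried = server.send("txn1", "hello"); // e.g. the first response was lost
```

Here first and retried hold the same event ID and the message is stored once, which is the property that makes a blind retry safe.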
 * is resolved or when the rate limit is reset.
 * @param httpStatus - the HTTP status code of the response
 */
private canRetry(httpStatus: number): boolean {
again, this should probably be after shouldWaitBeforeRetryingMillis for consistency.
yes, changed
looks sensible otherwise
@BillCarsonFr could you explain why we're retrying UIA requests? This just makes it such that the user has to wait 20 seconds to see the error which isn't going to change with retries. element-hq/element-web#27863
This was to fix a login failure which was caused by the homeserver returning a 500 error to
Right but a
If that's the case, it's certainly not the intention of this PR: https://github.com/matrix-org/matrix-js-sdk/pull/4061/files#diff-85a675b0871ea2f4a8991c253d876126df5949b8383211936b1c74657641864cR175-R178
In fact there are even tests that a 401 is not retried?
Add basic retry for outgoing requests supported by the rust SDK.
It will retry 3 times if:
It also looks for retry_after_ms, or falls back to DEFAULT_RETRY_DELAY_MS, to wait for the appropriate time.
Fixes element-hq/element-web#26755, as it ultimately fails on a 500 and will work if retried.
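The behaviour described above can be sketched as a small retry loop. This is an illustration under assumptions, not the PR's implementation: sendWithRetry, the error shape, the 1-second value of DEFAULT_RETRY_DELAY_MS, and treating 429 as retriable alongside 5xx are all hypothetical. The sleep function is injected so the sketch can be exercised without real timers.

```typescript
// Assumption: the real default delay may differ.
const DEFAULT_RETRY_DELAY_MS = 1000;
// "It will retry 3 times", i.e. up to 4 attempts in total.
const MAX_RETRIES = 3;

// Hypothetical error shape: an HTTP status plus an optional server-provided delay.
interface RequestFailure {
    httpStatus?: number;
    data?: { retry_after_ms?: number };
}

async function sendWithRetry<T>(
    send: () => Promise<T>,
    sleep: (ms: number) => Promise<void>,
): Promise<T> {
    let retries = 0;
    for (;;) {
        try {
            return await send();
        } catch (e) {
            const failure = e as RequestFailure;
            const status = failure.httpStatus;
            // Assumption: retry on rate limiting (429) and on any 5xx response.
            const retriable = status !== undefined && (status === 429 || status >= 500);
            if (!retriable || retries >= MAX_RETRIES) {
                throw e; // give up and surface the original error to the caller
            }
            retries++;
            await sleep(failure.data?.retry_after_ms ?? DEFAULT_RETRY_DELAY_MS);
        }
    }
}
```

A caller would pass the request as a closure together with a timer-based sleep, for example `(ms) => new Promise((resolve) => setTimeout(resolve, ms))`.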
Here's what your changelog entry will look like:
🐛 Bug Fixes