Add basic retry for rust crypto outgoing requests #4061

BillCarsonFr · 2024-02-09T18:15:28Z

Add basic retry for outgoing requests supported by the rust SDK.
It will retry 3 times if:

The HTTP status is 429, 500, 502, 503, 504 or 525 (a bit arbitrary; what would you suggest?)
It's a connection error (timeout, connection went down, ...)

It also looks at retry_after_ms or DEFAULT_RETRY_DELAY_MS to wait for the appropriate time.
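
As a rough illustration of the rule described above, the decision could be sketched like this. This is not the code from the PR: the type and function names are hypothetical, the value of `DEFAULT_RETRY_DELAY_MS` is assumed, and treating it as the fallback when the response carries no `retry_after_ms` is one reading of the description.

```typescript
// Hypothetical sketch; names and the default delay value are assumptions.
const MAX_RETRIES = 3;
const DEFAULT_RETRY_DELAY_MS = 1000; // assumed value, not stated in the description

type RequestFailure =
    | { kind: "http"; status: number; retryAfterMs?: number }
    | { kind: "connection" };

/** True for the status codes listed in the PR description. */
function isRetryableStatus(status: number): boolean {
    switch (status) {
        case 429:
        case 500:
        case 502:
        case 503:
        case 504:
        case 525:
            return true;
        default:
            return false;
    }
}

/**
 * Returns how long to wait before the next attempt, or null to give up.
 * `retriesSoFar` counts the retries already made (0 after the first failure).
 */
function nextRetryDelayMs(failure: RequestFailure, retriesSoFar: number): number | null {
    if (retriesSoFar >= MAX_RETRIES) return null;
    if (failure.kind === "connection") return DEFAULT_RETRY_DELAY_MS;
    if (!isRetryableStatus(failure.status)) return null;
    // Prefer the server-provided hint, otherwise fall back to the default.
    return failure.retryAfterMs ?? DEFAULT_RETRY_DELAY_MS;
}
```

With these assumptions, a 429 carrying a `retry_after_ms` hint waits for that hint, a 500 or a connection error waits for the default delay, and a 404 is never retried.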

Fixes element-hq/element-web#26755, as it ultimately fails on a 500 and will work if retried

Checklist

Tests written for new code (and old code if feasible)
Linter and other CI checks pass
Sign-off given on the changes (see CONTRIBUTING.md)

Here's what your changelog entry will look like:

🐛 Bug Fixes

Add basic retry for rust crypto outgoing requests (#4061). Fixes Element-R | "Unable to set up keys" error on register element-hq/element-web#26755.

t3chguy

Retrying 504 isn't necessarily safe as the action you requested may have already been performed but the response from the backend was dropped. It can only be retried on idempotent APIs and there a resulting error may need to be caught accordingly so the caller isn't told that the API failed due to e.g. M_NOT_FOUND or whatever.

BillCarsonFr · 2024-02-12T12:46:37Z

> Retrying 504 isn't necessarily safe as the action you requested may have already been performed but the response from the backend was dropped. It can only be retried on idempotent APIs and there a resulting error may need to be caught accordingly so the caller isn't told that the API failed due to e.g. M_NOT_FOUND or whatever.

OK, I removed 504 from the list. It's true that some of the affected requests are idempotent, but I'll keep this PR simple for now.

richvdh · 2024-02-12T14:51:43Z

Virtually (?) all matrix APIs are idempotent, deliberately so that they can be safely retried by clients. I would argue that, for simplicity, all 5xx response codes should be retried.

We should maybe also think about timeouts. Should we ever retry when the request times out?

t3chguy

Please fix the PR title, this implies it applies to outgoing requests generally, not specifically to a part of rust crypto.

BillCarsonFr · 2024-02-13T13:30:59Z

> Virtually (?) all matrix APIs are idempotent, deliberately so that they can be safely retried by clients. I would argue that, for simplicity, all 5xx response codes should be retried.
>
> We should maybe also think about timeouts. Should we ever retry when the request times out?

Given that all requests are idempotent, I updated the PR: it now retries on any 5xx response.
Also, client timeouts are not currently retried (they will be an AbortError and not a ConnectionError). There is an option to set the local timeout on MatrixClient, but it's not used and defaults to a very high value, so as you said we should probably not retry them.

uhoreg · 2024-02-16T02:33:38Z

src/rust-crypto/OutgoingRequestProcessor.ts

+
+                currentRetryCount++;
+
+                const maybeRetryAfter = this.shouldWaitBeforeRetryingMillis(e);


There is already a class that implements the retry logic. It would probably be better to use that instead of writing something else here. https://github.com/matrix-org/matrix-js-sdk/blob/develop/src/scheduler.ts#L54

Yes, I have seen it. Indeed, I think that there should be a retry mechanism inside the FetchHttpApi, maybe with a new flag in IRequestOpts, retriable (default true, as most Matrix requests are idempotent).
Then the event scheduler will benefit from it too (unless there needs to be special logic for events).
It will help on other issues like element-hq/element-web#26967

The thing is that it would impact the application a bit everywhere, so I'd like to have support from the web team if we want to go that way.

(On a side note, I noticed that the scheduler blocks retries on 502/M_TOO_LARGE. That might not be common for rust outgoing requests, but we might want to exclude it too.)
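
A minimal sketch of what such a flag might look like, assuming a generic wrapper around a request function. `SketchRequestOpts`, `requestWithRetry`, the attempt count and the backoff values are all hypothetical, and nothing here is an existing field or method of the SDK. The sketch also retries on any error, which a real implementation would narrow to retryable failures.

```typescript
// Hypothetical option bag; `retriable` mirrors the flag proposed in the comment
// above and is an assumption, not an existing IRequestOpts field.
interface SketchRequestOpts {
    /** Whether the request may be retried. Defaults to true. */
    retriable?: boolean;
    /** Maximum number of attempts when retriable. Assumed default of 3. */
    maxAttempts?: number;
}

async function requestWithRetry<T>(
    doRequest: () => Promise<T>,
    opts: SketchRequestOpts = {},
    // Injectable so that callers (and tests) can avoid real waiting.
    sleep: (ms: number) => Promise<void> = (ms) =>
        new Promise<void>((resolve) => setTimeout(resolve, ms)),
): Promise<T> {
    // A non-retriable request gets exactly one attempt.
    const maxAttempts = opts.retriable === false ? 1 : (opts.maxAttempts ?? 3);
    let lastError: unknown;
    for (let attempt = 0; attempt < maxAttempts; attempt++) {
        try {
            return await doRequest();
        } catch (e) {
            lastError = e;
            // Back off exponentially (1s, 2s, ...) unless this was the last attempt.
            if (attempt < maxAttempts - 1) await sleep(1000 * 2 ** attempt);
        }
    }
    throw lastError;
}
```

Putting the flag in the options rather than in each caller is what would let every user of the HTTP layer opt out individually, which is the trade-off the comment raises: the behaviour changes everywhere by default.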

There are two different suggestions in this thread:

1. MatrixScheduler and OutgoingRequestManager both implement logic to decide whether to retry a request, and if so when. We could extract that logic and share it, rather than having two copies. (Presumably, we would make a new method somewhere, and have MatrixScheduler.RETRY_BACKOFF_RATELIMIT call it.) That seems like quite a good idea to me. @BillCarsonFr: any reason we can't do this?

2. Move the retry mechanism down to FetchHttpApi, so that everything (including MatrixScheduler and OutgoingRequestManager) can benefit from it. This seems like a much more invasive change, and one I don't think we should do right now.

> any reason we can't do this?

The RETRY_BACKOFF_RATELIMIT function is a bit strange: it has an unused parameter, and it tries to cast a MatrixError into a ConnectionError (this can't work? or maybe at runtime?).
It doesn't want to retry on ConnectionError (like if the client's internet connection is down), and we want that.

So I think we can't use it directly as it doesn't seem to be the same logic, and touching it will be also quite invasive. The only thing we could take easily is the exponential backoff, but it's just one line.

It's still bad that the logic doesn't match, but we might want to unify that in a second step?

Ok, so after discussion I extracted a common retry algorithm shared between the scheduler and the outgoing request processor.
Two slight changes from the previous state:

The outgoing request processor tries 5 times in total instead of 4

The matrix scheduler no longer retries on client timeouts (as this timeout is already > 1 min, depending on the browser)

Here is the refactoring commit c2126a6
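
To make the shape of such a shared algorithm concrete, here is a hedged sketch. It is not the code from the refactoring commit: the error shape, the function name `retryBackoffMs`, the `retryConnectionError` flag, the base delay and the handling of 4xx responses are all assumptions. Only the rules themselves (5 tries in total, no retry on client timeouts, no retry on M_TOO_LARGE, retry on 5xx, exponential backoff) come from this thread.

```typescript
// Hypothetical error shape and names, for illustration only.
interface SketchRequestError {
    name: string; // e.g. "ConnectionError", "AbortError"
    httpStatus?: number;
    errcode?: string; // e.g. "M_TOO_LARGE"
    retryAfterMs?: number;
}

const MAX_TRIES = 5; // 1 initial try + 4 retries
const BASE_DELAY_MS = 1000; // assumed base for the exponential backoff

/**
 * Returns the delay before the next try in milliseconds, or -1 to give up.
 * `triesSoFar` is the number of tries already made (1 after the first failure).
 * `retryConnectionError` lets each caller decide whether connection failures are retried.
 */
function retryBackoffMs(
    err: SketchRequestError,
    triesSoFar: number,
    retryConnectionError: boolean,
): number {
    if (triesSoFar >= MAX_TRIES) return -1;

    const backoff = BASE_DELAY_MS * 2 ** (triesSoFar - 1); // 1s, 2s, 4s, 8s

    // Client-side timeouts are already long, so they are not retried.
    if (err.name === "AbortError") return -1;
    if (err.name === "ConnectionError") return retryConnectionError ? backoff : -1;
    // Retrying an oversized request cannot succeed.
    if (err.errcode === "M_TOO_LARGE") return -1;
    // Rate limited: honour the server hint when there is one.
    if (err.httpStatus === 429) return err.retryAfterMs ?? backoff;
    // Server errors are retried; everything else (e.g. 401) is not.
    if (err.httpStatus !== undefined && err.httpStatus >= 500) return backoff;
    return -1;
}
```

In this sketch the outgoing request processor would pass `retryConnectionError = true` and the scheduler `false`, which is one way to keep a single code path while preserving the difference in behaviour discussed above.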

richvdh

Looks sensible but a few suggestions for cleanups

richvdh · 2024-02-19T11:53:45Z

src/rust-crypto/OutgoingRequestProcessor.ts

+        // All keys/signatures uploads are, message and to device are because of txn_id, keys claim in worst case will claim
+        // several keys but won't cause harm.


This sentence is hard to understand.

Suggested change

-        // All keys/signatures uploads are, message and to device are because of txn_id, keys claim in worst case will claim
-        // several keys but won't cause harm.
+        //  * All key/signature uploads are idempotent.
+        //  * Room message and to-device send requests are idempotent because of txn_id.
+        //  * Keys claim in worst case will claim several keys but won't cause harm.

thx, updated
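
For readers unfamiliar with why a transaction ID makes a send request safe to retry, here is a small illustrative model. `FakeRoomServer` is entirely hypothetical and only mimics the deduplication idea: a request repeated with the same txn_id returns the original event instead of creating a second one.

```typescript
// Hypothetical in-memory model of txn_id deduplication; not homeserver code.
class FakeRoomServer {
    private readonly eventIdByTxnId = new Map<string, string>();
    private nextEventNumber = 1;
    public readonly timeline: string[] = [];

    /** Models a send request keyed by txnId: the same txnId always yields the same event. */
    public send(txnId: string, body: string): string {
        const existing = this.eventIdByTxnId.get(txnId);
        // A retry of a request that already succeeded: return the original event ID.
        if (existing !== undefined) return existing;

        const eventId = `$event${this.nextEventNumber++}`;
        this.eventIdByTxnId.set(txnId, eventId);
        this.timeline.push(body);
        return eventId;
    }
}
```

If the response to the first `send("txn1", ...)` is lost and the client retries with the same txn_id, the timeline in this model still contains a single message, which is the property the code comment relies on.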

richvdh · 2024-02-19T11:55:04Z

src/rust-crypto/OutgoingRequestProcessor.ts

+     * is resolved or when the rate limit is reset.
+     * @param httpStatus - the HTTP status code of the response
+     */
+    private canRetry(httpStatus: number): boolean {


again, this should probably be after shouldWaitBeforeRetryingMillis for consistency.

yes, changed

richvdh

looks sensible otherwise

PR title updated

t3chguy · 2024-07-31T15:03:21Z

@BillCarsonFr could you explain why we're retrying UIA requests? This just makes it such that the user has to wait 20 seconds to see the error which isn't going to change with retries. element-hq/element-web#27863

richvdh · 2024-07-31T15:29:54Z

> could you explain why we're retrying UIA requests?

This was to fix a login failure which was caused by the homeserver returning a 500 error to /keys/upload. /keys/upload is a UIA request, so we have to retry it.

t3chguy · 2024-07-31T15:30:43Z

Right but a ~~403~~ 401 is a valid correct error code during UIA which should not be retried, yet here we're retrying it?

richvdh · 2024-07-31T15:40:30Z

If that's the case, it's certainly not the intention of this PR: https://github.com/matrix-org/matrix-js-sdk/pull/4061/files#diff-85a675b0871ea2f4a8991c253d876126df5949b8383211936b1c74657641864cR175-R178

richvdh · 2024-07-31T15:43:10Z

In fact there are even tests that a 401 is not retried?

t3chguy · 2024-07-31T15:44:28Z

That doesn't match its description

But yes good spot

Add basic retry for outgoing requests

c2126a6

BillCarsonFr requested a review from a team as a code owner February 9, 2024 18:15

BillCarsonFr requested review from andybalaam and uhoreg February 9, 2024 18:15

BillCarsonFr added the T-Defect label Feb 10, 2024

Update doc

c112dcf

github-actions bot deployed to PR Documentation Preview February 12, 2024 08:09

t3chguy requested changes Feb 12, 2024

Remove 504 from retryable

735f2ee

github-actions bot deployed to PR Documentation Preview February 12, 2024 12:47

BillCarsonFr requested a review from t3chguy February 12, 2024 13:03

t3chguy previously requested changes Feb 12, 2024

BillCarsonFr changed the title ~~Add basic retry for outgoing requests~~ Add basic retry for rust crypto outgoing requests Feb 13, 2024

Retry all 5xx and clarify client timeouts

afa6a09

github-actions bot deployed to PR Documentation Preview February 13, 2024 12:47

BillCarsonFr requested a review from t3chguy February 13, 2024 13:31

richvdh self-requested a review February 15, 2024 14:13

uhoreg reviewed Feb 16, 2024

richvdh requested changes Feb 19, 2024

BillCarsonFr added 3 commits February 23, 2024 11:17

code review cleaning

5a707e1

do not retry rust request if M_TOO_LARGE

01c4d0e

Merge branch 'develop' into valere/element-r/retry_outgoing_requests

fd0b03a

github-actions bot deployed to PR Documentation Preview February 23, 2024 10:50

BillCarsonFr requested a review from richvdh February 23, 2024 10:51

refactor use common retry alg between scheduler and rust requests

4dc469b

BillCarsonFr requested a review from a team as a code owner February 23, 2024 12:48

github-actions bot deployed to PR Documentation Preview February 23, 2024 12:49

richvdh reviewed Feb 23, 2024

richvdh approved these changes Feb 23, 2024

BillCarsonFr added 2 commits February 26, 2024 14:39

Code review, cleaning and doc

f57912a

Merge branch 'develop' into valere/element-r/retry_outgoing_requests

afe82fb

github-actions bot deployed to PR Documentation Preview February 26, 2024 13:41

BillCarsonFr enabled auto-merge February 26, 2024 13:43

BillCarsonFr added this pull request to the merge queue Feb 26, 2024

Merged via the queue into develop with commit d3dfcd9 Feb 26, 2024
23 checks passed

BillCarsonFr deleted the valere/element-r/retry_outgoing_requests branch February 26, 2024 14:28

t3chguy mentioned this pull request Jul 31, 2024

X-signing: error message and prompt to enable MAS temporary key reset doesn't display for ~20 seconds element-hq/element-web#27863

Closed
