poc: transaction executor #5273
Conversation
Signed-off-by: andreas-unleash <[email protected]>
*/
private isTransientError(error: any): boolean {
    const transientErrors: { [key: string]: string } = {
        '40001': 'serialization_failure',
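As a side note for readers, here is a minimal self-contained sketch of what such a check could look like against PostgreSQL SQLSTATE codes; the diff above only shows a fragment, so the `40P01` entry, the map name, and the `error.code` access are assumptions rather than the PR's exact code:

```typescript
// Sketch only: PostgreSQL SQLSTATE codes that are generally safe to retry.
// '40001' (serialization_failure) and '40P01' (deadlock_detected) are the
// standard transient concurrency errors; anything else is treated as permanent.
const TRANSIENT_SQLSTATE_CODES: Record<string, string> = {
    '40001': 'serialization_failure',
    '40P01': 'deadlock_detected',
};

function isTransientError(error: any): boolean {
    // node-postgres exposes the SQLSTATE on `error.code`; this assumes the
    // error bubbles up unchanged through the query builder.
    return error?.code !== undefined && error.code in TRANSIENT_SQLSTATE_CODES;
}
```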
Why is this transient?
Serialization failure is an error condition that arises when concurrent database transactions conflict; once the conflicting transaction has finished, a retry usually succeeds, which is why it is considered transient.
The code looks great and elegant, but I'm not fully convinced this will actually solve our problem (are we going to hit the same deadlock 3 times, increasing the load on our system?), and it may also negatively impact our ability to scale. I'd prefer to keep things as they are right now and try to deal with the deadlocks in a different way (we should be able to fix the deadlock itself).
If we were to merge this (which I'd say we should not), I'd like to see some tests.
console.error(
    `Transaction failed: ${error.message}. Retrying in ${delayMillis} ms.`,
);
await delay(delayMillis);
This will hold the connection open for longer, which reduces our ability to answer more requests. It's not only about the increased latency, but more about the server's ability to handle requests.
Makes sense; we could also add the exponential backoff on the client side instead.
It's not my intention to merge this; I just wanted to start a discussion on better handling of some DB errors (since we are staying away from an ORM).
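To make the backoff idea concrete, here is a rough sketch of exponential backoff with jitter, which applies whether the retry lives in the executor or in the clients; the function and parameter names are illustrative, not taken from the PR:

```typescript
// Full jitter: the delay ceiling grows exponentially with the attempt number,
// and the random factor spreads out retries from different pods so they are
// less likely to collide again.
function backoffWithJitter(attempt: number, baseDelayMillis = 100): number {
    const exponentialCeiling = baseDelayMillis * 2 ** attempt;
    return Math.floor(Math.random() * exponentialCeiling);
}

// Simple promise-based sleep, matching the `delay(delayMillis)` call in the diff.
const delay = (ms: number): Promise<void> =>
    new Promise((resolve) => setTimeout(resolve, ms));
```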
I wonder if our problems would go away if we added ISOLATION LEVEL SERIALIZABLE to (some of) our transactions (e.g. the ones that get triggered by a scheduler).
There are things to consider, however:
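To illustrate the SERIALIZABLE suggestion above, here is a hedged sketch assuming a Knex-style setup (which Unleash uses for database access); the connection config and `runScheduledWork` are illustrative only, and the `isolationLevel` transaction option requires a reasonably recent Knex version:

```typescript
import knex from 'knex';

// Illustrative connection config; not taken from the PR.
const db = knex({ client: 'pg', connection: process.env.DATABASE_URL });

async function runScheduledWork(): Promise<void> {
    // Recent Knex versions accept an isolationLevel option; on older versions
    // the same effect can be achieved with
    //   await trx.raw('SET TRANSACTION ISOLATION LEVEL SERIALIZABLE');
    // issued as the first statement inside the transaction.
    await db.transaction(
        async (trx) => {
            // ...the scheduler-triggered writes would go here...
        },
        { isolationLevel: 'serializable' },
    );
}
```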
:) 👍 Great then! We can keep it going, but I'm also wondering how big of a problem it is right now, and whether we have identified where it happens.
Maybe it is worth trying something like this... Every time I work with databases I have to do some memory refreshers, because it's not my area of expertise. Something like this can be tested through load tests to validate that there's no performance regression, but that won't be as good as real production testing. I know that in some languages, like Java, it could be hard to put a feature flag around it, but in Node maybe it's not that difficult. In that case we could test this by targeting specifically the customer hitting these issues most frequently...
So, here are my 2 cents about what I think is happening:
So the only place we get the deadlocks as of now is when trying to commit lastSeen metrics. My hypothesis is that this happens when 2 pods send the same request at the same time, ending up with 2 transactions in a deadlock. Sorting the entries helped, which tells me that even a 1-second delay/retry would get them out of the deadlock. Theoretically, any scheduled task could end up in this state given enough data to persist without (some) transaction management/error handling. I will try to simulate the conditions and give it a go. :)
If this is the only case where this happens, I wouldn't make all operations serializable. I think it's fine to lose some "lastSeen" events; it just means the last-seen data will be eventually consistent. I believe the main issue is the errors triggering some alerts. But considering the last-seen data is eventually consistent, I wonder if we can batch the events in memory and then have a single process persist them on a regular cadence. When returning a last-seen metric we can just read from memory and go to the DB on a miss (using memory as a read-through cache). I can still see 2 of our pods trying to write to the DB at the same time, but that can be handled with jitter, and maybe there it makes sense to do retries...
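As a rough illustration of the batching idea, here is a hypothetical sketch of buffering lastSeen updates in memory and flushing them on an interval; every name in it (`LastSeenBuffer`, `LastSeenStore`, `flushIntervalMs`) is made up for the example and is not how Unleash currently works:

```typescript
// Hypothetical sketch: buffer lastSeen timestamps per feature in memory and
// flush them to the database in one batched write on a fixed cadence.
type LastSeenStore = (
    entries: Array<{ feature: string; lastSeen: Date }>,
) => Promise<void>;

class LastSeenBuffer {
    private buffer = new Map<string, Date>();

    constructor(
        private persist: LastSeenStore,
        flushIntervalMs = 30_000,
    ) {
        // A single timer per pod; two pods can still flush at the same time,
        // which is where jitter/retries on the write would come in.
        setInterval(
            () => this.flush().catch(console.error),
            flushIntervalMs,
        ).unref();
    }

    record(feature: string, seenAt: Date = new Date()): void {
        this.buffer.set(feature, seenAt); // later sightings overwrite earlier ones
    }

    // Read-through: serve from memory first, let callers fall back to the DB on a miss.
    get(feature: string): Date | undefined {
        return this.buffer.get(feature);
    }

    private async flush(): Promise<void> {
        if (this.buffer.size === 0) return;
        const entries = [...this.buffer.entries()].map(([feature, lastSeen]) => ({
            feature,
            lastSeen,
        }));
        this.buffer.clear();
        await this.persist(entries); // sorted/batched upsert happens in the store
    }
}
```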
Sure, but consider the other option I just shared. If you want, we can talk about this problem and do some brainstorming.
Thinking about the deadlock pain we have been experiencing, I think this will help.
Creates a transaction executor that can handle transient errors (like deadlocks) with a configurable number of retries and exponential backoff.
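For readers without access to the diff, here is a condensed, hedged sketch of the described idea, i.e. configurable retries with exponential backoff around a database transaction; the names, defaults, and error-code checks are guesses rather than the PR's actual code:

```typescript
import { Knex } from 'knex';

// Sketch of the described executor: run the work in a transaction, and if it
// fails with a transient error (serialization failure, deadlock), retry with
// exponentially growing delays up to a configurable maximum.
async function executeTransaction<T>(
    db: Knex,
    work: (trx: Knex.Transaction) => Promise<T>,
    maxRetries = 3,
    baseDelayMillis = 100,
): Promise<T> {
    for (let attempt = 0; ; attempt++) {
        try {
            return await db.transaction(work);
        } catch (error: any) {
            // SQLSTATE 40001 = serialization_failure, 40P01 = deadlock_detected.
            const transient = error?.code === '40001' || error?.code === '40P01';
            if (!transient || attempt >= maxRetries) throw error;
            const delayMillis = baseDelayMillis * 2 ** attempt;
            console.error(
                `Transaction failed: ${error.message}. Retrying in ${delayMillis} ms.`,
            );
            await new Promise((resolve) => setTimeout(resolve, delayMillis));
        }
    }
}
```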