poc: transaction executor #5273
Conversation
Signed-off-by: andreas-unleash <[email protected]>
*/
private isTransientError(error: any): boolean {
    const transientErrors: { [key: string]: string } = {
        '40001': 'serialization_failure',
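As a side note for readers, here is a minimal self-contained sketch of what such a check could look like against PostgreSQL SQLSTATE codes; the diff above only shows a fragment, so the `40P01` entry, the map name, and the `error.code` access are assumptions rather than the PR's exact code:

```typescript
// Sketch only: PostgreSQL SQLSTATE codes that are generally safe to retry.
// '40001' (serialization_failure) and '40P01' (deadlock_detected) are the
// standard transient concurrency errors; anything else is treated as permanent.
const TRANSIENT_SQLSTATE_CODES: Record<string, string> = {
    '40001': 'serialization_failure',
    '40P01': 'deadlock_detected',
};

function isTransientError(error: any): boolean {
    // node-postgres exposes the SQLSTATE on `error.code`; this assumes the
    // error bubbles up unchanged through the query builder.
    return error?.code !== undefined && error.code in TRANSIENT_SQLSTATE_CODES;
}
```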
Why is this transient?
Serialization failure is an error condition that arises when concurrent database transactions conflict; once the conflicting transaction has finished, a retry usually succeeds, which is why it is considered transient.
The code looks great and elegant, but I'm not fully convinced this will actually solve our problem (are we going to hit the same deadlock 3 times, increasing the load on our system?), and it may also negatively impact our ability to scale. I'd prefer to keep things as they are right now and try to deal with the deadlocks in a different way (we should be able to fix the deadlock itself).
If we were to merge this (which I'd say we should not), I'd like to see some tests.
console.error(
    `Transaction failed: ${error.message}. Retrying in ${delayMillis} ms.`,
);
await delay(delayMillis);
This will hold the connection open for longer, which reduces our ability to answer more requests. It's not only about the increased latency, but more about the server's ability to handle requests.
Makes sense; we could also add the exponential backoff on the client side instead.
It's not my intention to merge this; I just wanted to start a discussion on better handling of some DB errors (since we are staying away from an ORM).
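To make the backoff idea concrete, here is a rough sketch of exponential backoff with jitter, which applies whether the retry lives in the executor or in the clients; the function and parameter names are illustrative, not taken from the PR:

```typescript
// Full jitter: the delay ceiling grows exponentially with the attempt number,
// and the random factor spreads out retries from different pods so they are
// less likely to collide again.
function backoffWithJitter(attempt: number, baseDelayMillis = 100): number {
    const exponentialCeiling = baseDelayMillis * 2 ** attempt;
    return Math.floor(Math.random() * exponentialCeiling);
}

// Simple promise-based sleep, matching the `delay(delayMillis)` call in the diff.
const delay = (ms: number): Promise<void> =>
    new Promise((resolve) => setTimeout(resolve, ms));
```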
I wonder if our problems would go away if we added ISOLATION LEVEL SERIALIZABLE to (some of) our transactions (e.g. the ones that get triggered by a scheduler).
There are things to consider, however:
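To illustrate the SERIALIZABLE suggestion above, here is a hedged sketch assuming a Knex-style setup (which Unleash uses for database access); the connection config and `runScheduledWork` are illustrative only, and the `isolationLevel` transaction option requires a reasonably recent Knex version:

```typescript
import knex from 'knex';

// Illustrative connection config; not taken from the PR.
const db = knex({ client: 'pg', connection: process.env.DATABASE_URL });

async function runScheduledWork(): Promise<void> {
    // Recent Knex versions accept an isolationLevel option; on older versions
    // the same effect can be achieved with
    //   await trx.raw('SET TRANSACTION ISOLATION LEVEL SERIALIZABLE');
    // issued as the first statement inside the transaction.
    await db.transaction(
        async (trx) => {
            // ...the scheduler-triggered writes would go here...
        },
        { isolationLevel: 'serializable' },
    );
}
```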
:) 👍 Great then! We can keep it going, but I'm also wondering how big of a problem it is right now, and whether we have identified where it happens.
Maybe it is worth trying something like this... Every time I work with databases I have to do some memory refreshers, because it's not my area of expertise. Something like this can be tested through load tests to validate that there's no performance regression, but that won't be as good as real production testing. I know that in some languages, like Java, it could be hard to put a feature flag around it, but in Node maybe it's not that difficult. In that case we could test this by targeting specifically the customer hitting these issues most frequently...
So, here are my 2 cents about what I think is happening:
So the only place we get the deadlocks as of now is when trying to commit lastSeen metrics. My hypothesis is that this happens when 2 pods send the same request at the same time, ending up with 2 transactions in a deadlock. Sorting the entries helped, which tells me that even a 1-second delay/retry would get them out of the deadlock. Theoretically, any scheduled task could end up in this state given enough data to persist without (some) transaction management/error handling. I will try to simulate the conditions and give it a go. :)
If this is the only case where this happens, I wouldn't make all operations serializable. I think it's fine to lose some "lastSeen" events; it just means the last-seen data will be eventually consistent. I believe the main issue is the errors triggering some alerts. But considering the last-seen data is eventually consistent, I wonder if we can batch the events in memory and then have a single process persist them on a regular cadence. When returning a last-seen metric we can just read from memory and go to the DB on a miss (using memory as a read-through cache). I can still see 2 of our pods trying to write to the DB at the same time, but that can be handled with jitter, and maybe there it makes sense to do retries...
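As a rough illustration of the batching idea, here is a hypothetical sketch of buffering lastSeen updates in memory and flushing them on an interval; every name in it (`LastSeenBuffer`, `LastSeenStore`, `flushIntervalMs`) is made up for the example and is not how Unleash currently works:

```typescript
// Hypothetical sketch: buffer lastSeen timestamps per feature in memory and
// flush them to the database in one batched write on a fixed cadence.
type LastSeenStore = (
    entries: Array<{ feature: string; lastSeen: Date }>,
) => Promise<void>;

class LastSeenBuffer {
    private buffer = new Map<string, Date>();

    constructor(
        private persist: LastSeenStore,
        flushIntervalMs = 30_000,
    ) {
        // A single timer per pod; two pods can still flush at the same time,
        // which is where jitter/retries on the write would come in.
        setInterval(
            () => this.flush().catch(console.error),
            flushIntervalMs,
        ).unref();
    }

    record(feature: string, seenAt: Date = new Date()): void {
        this.buffer.set(feature, seenAt); // later sightings overwrite earlier ones
    }

    // Read-through: serve from memory first, let callers fall back to the DB on a miss.
    get(feature: string): Date | undefined {
        return this.buffer.get(feature);
    }

    private async flush(): Promise<void> {
        if (this.buffer.size === 0) return;
        const entries = [...this.buffer.entries()].map(([feature, lastSeen]) => ({
            feature,
            lastSeen,
        }));
        this.buffer.clear();
        await this.persist(entries); // sorted/batched upsert happens in the store
    }
}
```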
Sure, but consider the other option I just shared. If you want, we can talk about this problem and do some brainstorming.
Thinking about the deadlock pain we have been experiencing, I think this will help.
Creates a transaction executor that can handle transient errors (like deadlocks) with a configurable number of retries and exponential backoff.
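For readers without access to the diff, here is a condensed, hedged sketch of the described idea, i.e. configurable retries with exponential backoff around a database transaction; the names, defaults, and error-code checks are guesses rather than the PR's actual code:

```typescript
import { Knex } from 'knex';

// Sketch of the described executor: run the work in a transaction, and if it
// fails with a transient error (serialization failure, deadlock), retry with
// exponentially growing delays up to a configurable maximum.
async function executeTransaction<T>(
    db: Knex,
    work: (trx: Knex.Transaction) => Promise<T>,
    maxRetries = 3,
    baseDelayMillis = 100,
): Promise<T> {
    for (let attempt = 0; ; attempt++) {
        try {
            return await db.transaction(work);
        } catch (error: any) {
            // SQLSTATE 40001 = serialization_failure, 40P01 = deadlock_detected.
            const transient = error?.code === '40001' || error?.code === '40P01';
            if (!transient || attempt >= maxRetries) throw error;
            const delayMillis = baseDelayMillis * 2 ** attempt;
            console.error(
                `Transaction failed: ${error.message}. Retrying in ${delayMillis} ms.`,
            );
            await new Promise((resolve) => setTimeout(resolve, delayMillis));
        }
    }
}
```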