
fix(concurrency): applying blocks concurrently can lead to unexpected errors #700

Conversation

@mtsitrin (Contributor) commented Apr 17, 2024:

  • Mutex-protected applyBlock for the full node (it can be called both from gossiping and when retrieving blocks); see the sketch below
  • Moved m.attemptApplyCachedBlocks from processNextDABatch to syncUntilTarget

Closes #658
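For context, a minimal standalone sketch of the intent, with assumed names and simplified types (Block, Manager, applyBlock); this is not the actual dymint code:

```go
package block

import "sync"

// Block is a simplified stand-in for the real block type.
type Block struct {
	Height uint64
}

// Manager serializes every code path that applies a block.
type Manager struct {
	executeBlockMutex sync.Mutex // name taken from the diff below; the rest is assumed
	height            uint64
}

// applyBlock can be reached from the gossip callback and from the
// DA/settlement retriever; the mutex keeps store and ABCI access from
// interleaving when both paths race on the same height.
func (m *Manager) applyBlock(b *Block) error {
	m.executeBlockMutex.Lock()
	defer m.executeBlockMutex.Unlock()

	if b.Height != m.height+1 {
		return nil // stale or out-of-order block; ignore here
	}
	// ... execute the block against the app and commit the store ...
	m.height = b.Height
	return nil
}
```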


For Author:

  • Targeted PR against the correct branch
  • Included the correct type prefix in the PR title
  • Linked to a GitHub issue with discussion and accepted design
  • Targets only one GitHub issue
  • Wrote unit and integration tests
  • All CI checks have passed
  • Added relevant godoc comments

For Reviewer:

  • Confirmed the correct type prefix in the PR title
  • Reviewers assigned
  • Confirmed all author checklist items have been addressed

After reviewer approval:

  • If the PR targets the main branch, it should be squashed and merged.
  • If the PR targets a release branch, it should be rebased.

@mtsitrin linked an issue Apr 17, 2024 that may be closed by this pull request
block/retriever.go: fixed alert
@mtsitrin marked this pull request as ready for review April 17, 2024 08:27
@mtsitrin requested a review from a team as a code owner April 17, 2024 08:27
@mtsitrin requested review from zale144 and danwt April 17, 2024 08:27
@danwt (Contributor) left a comment:

Hey @mtsitrin, I think it makes sense, but these things are always tricky.

I added some comments, could you please check if they make sense?

Also:

From my analysis, maybe we can combine the produce and execute mutexes into one mutex and rename it to something that makes sense for both cases?

Original problem description:
    since store and ABCI access is not atomic, races can occur
    if two goroutines try to apply a block

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

submitBatchMutex      sync.Mutex
    on handleSubmissionTrigger, try to lock and unlock
        calls produceBlock

produceBlockMutex     sync.Mutex
    on handleSubmissionTrigger
    in ProduceBlockLoop
        on produceBlock, lock and unlock
            calls applyBlock

executeBlockMutex     sync.Mutex
    on applyBlockCallback, lock and unlock
        calls applyBlock

    on initialSync
    in retrieveLoop
        on syncUntilTarget, lock and unlock
            calls applyBlock

My reasoning is that, depending on whether the node runs as an aggregator or not, it never uses both mutexes (see the sketch below).
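A rough standalone sketch of the combine-into-one suggestion, under assumed names (blockMutex, produceBlock, applyBlock); not the actual dymint API:

```go
package block

import "sync"

// Manager holds a single lock shared by the aggregator's produce path and
// the full node's apply path, since a given node only ever uses one of them.
type Manager struct {
	blockMutex sync.Mutex // assumed name; replaces produceBlockMutex + executeBlockMutex
}

// Aggregator path: ProduceBlockLoop / handleSubmissionTrigger end up here.
func (m *Manager) produceBlock() error {
	m.blockMutex.Lock()
	defer m.blockMutex.Unlock()
	// ... build the block and apply it while holding the lock ...
	return nil
}

// Full-node path: the gossip callback and syncUntilTarget end up here.
func (m *Manager) applyBlock() error {
	m.blockMutex.Lock()
	defer m.blockMutex.Unlock()
	// ... apply the gossiped/retrieved block while holding the lock ...
	return nil
}
```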

@mtsitrin (PR author) replied:

@danwt you're right.
I prefer not to do this change now. We have a few more things we can refactor/abstract between the aggregator and non-aggregator paths; otherwise I think it will be more confusing in the current state.

One thing at a time.

@danwt self-requested a review April 17, 2024 16:17
danwt previously approved these changes Apr 17, 2024
@mtsitrin requested a review from danwt April 17, 2024 16:29
@@ -42,6 +42,9 @@ func (m *Manager) RetriveLoop(ctx context.Context) {
// It fetches the batches from the settlement, gets the DA height and gets
// the actual blocks from the DA.
func (m *Manager) syncUntilTarget(ctx context.Context, syncTarget uint64) error {
	m.executeBlockMutex.Lock()
@omritoptix (Contributor) commented Apr 17, 2024:

This process (syncUntilTarget) can be long, as it needs to actually fetch the data from the DA upon a new sync target (which can take dozens of seconds or even longer).

It seems to me that during this time no gossiped blocks will be applied, as the lock will block them (so basically RPC nodes won't provide "real-time" responses while syncing from the DA).

I think the lock should be more fine-grained, on the actual execution of the blocks, and not include the fetching from the DA.

Why are we not just putting it directly on the applyBlock/executeBlock function to only allow one caller access?
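A standalone sketch of the fine-grained variant being suggested here, with hypothetical names and elided bodies (fetchBatch, the apply step); the slow DA fetch runs unlocked and only block execution is serialized:

```go
package block

import (
	"context"
	"sync"
)

type Block struct{ Height uint64 }

type Manager struct {
	executeBlockMutex sync.Mutex
}

// fetchBatch stands in for the (potentially slow) DA retrieval; it runs
// without holding any lock, so gossiped blocks can still be applied meanwhile.
func (m *Manager) fetchBatch(ctx context.Context, daHeight uint64) ([]*Block, error) {
	// ... network round trips to the DA layer, possibly tens of seconds ...
	return nil, nil
}

func (m *Manager) syncUntilTarget(ctx context.Context, syncTarget uint64) error {
	blocks, err := m.fetchBatch(ctx, syncTarget) // unlocked: slow I/O
	if err != nil {
		return err
	}

	// Only the actual execution is serialized with the gossip path.
	m.executeBlockMutex.Lock()
	defer m.executeBlockMutex.Unlock()
	for _, b := range blocks {
		_ = b // ... apply b against the app and commit the store ...
	}
	return nil
}
```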

@mtsitrin (PR author) replied:

Agreed. Fetching data won't be mutex-locked.

> during this time no gossiped blocks will be applied

A gossiped block will be applied if its height is correct, which rather conflicts with the syncing process. I think it's fine for a gossiped block to wait while syncing is in progress;
in the happy flow there's no syncing anyway (as blocks are gossiped).

> directly on the applyBlock/executeBlock function

The lock is more about protecting store.Height(), and not only the execution.

A reviewer (Contributor) replied:

Still not sure I understand why executeBlockMutex can't live inside applyBlock.

The only operation I see being protected by the current implementation is m.store.NextHeight().

If that's indeed the case, I suggest making nextHeight (or height access in general) atomic, thereby simplifying the lock and only putting it on the applyBlock function.
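A standalone sketch of the atomic-height suggestion, assuming Go 1.19+ sync/atomic types and hypothetical names; height reads become lock-free and the mutex can live inside applyBlock alone:

```go
package block

import (
	"sync"
	"sync/atomic"
)

type Block struct{ Height uint64 }

type Manager struct {
	height            atomic.Uint64 // read lock-free from any goroutine
	executeBlockMutex sync.Mutex    // only taken around the actual apply
}

// NextHeight can be called from the gossip and retriever threads without a lock.
func (m *Manager) NextHeight() uint64 {
	return m.height.Load() + 1
}

func (m *Manager) applyBlock(b *Block) error {
	m.executeBlockMutex.Lock()
	defer m.executeBlockMutex.Unlock()

	// Re-check under the lock: another goroutine may have applied this
	// height while we were waiting.
	if b.Height != m.NextHeight() {
		return nil
	}
	// ... execute and commit ...
	m.height.Store(b.Height)
	return nil
}
```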

}
// check for cached blocks
err := m.attemptApplyCachedBlocks(ctx)
A reviewer (Contributor) commented:

This shouldn't be here; it should be in the applyBlockCallback function. Not sure when it was removed from there.

The purpose of this is the following (see the sketch after this list):

  1. The full node is at height x.
  2. The full node gets gossiped blocks at heights x+2, x+3, ..., x+n.
  3. The full node can't apply them until it has received x+1, so it keeps those blocks in the cache.
  4. Once the full node gets x+1, it applies all the rest from the cache.

Not sure why this was moved here and what its purpose is here.
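A standalone sketch of the caching behaviour described in steps 1-4, with assumed names (blockCache, onGossipedBlock, attemptApplyCachedBlocks); out-of-order gossiped blocks are parked until the missing height arrives:

```go
package block

type Block struct{ Height uint64 }

type Manager struct {
	height     uint64
	blockCache map[uint64]*Block // gossiped blocks we cannot apply yet
}

func NewManager() *Manager {
	return &Manager{blockCache: make(map[uint64]*Block)}
}

// onGossipedBlock caches anything that is not immediately applicable.
func (m *Manager) onGossipedBlock(b *Block) error {
	if b.Height <= m.height {
		return nil // already applied
	}
	if b.Height > m.height+1 {
		m.blockCache[b.Height] = b // park x+2, x+3, ..., x+n
		return nil
	}
	if err := m.applyBlock(b); err != nil {
		return err
	}
	// x+1 just landed: drain everything consecutive from the cache.
	return m.attemptApplyCachedBlocks()
}

func (m *Manager) attemptApplyCachedBlocks() error {
	for {
		next, ok := m.blockCache[m.height+1]
		if !ok {
			return nil
		}
		if err := m.applyBlock(next); err != nil {
			return err
		}
		delete(m.blockCache, next.Height)
	}
}

func (m *Manager) applyBlock(b *Block) error {
	// ... execute and commit; simplified ...
	m.height = b.Height
	return nil
}
```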

@mtsitrin (PR author) replied:

I think it's important for it to be here:

  1. The full node missed x.
  2. The full node got x+1 through x+3000.
  3. It got a sync target of, say, 2000 from the SL.

It makes sense to go over the cache and apply whatever is possible; otherwise you'll wait for the next gossiped block.

@danwt (Contributor) left a comment:

Please request my review again after the conflicts and Omri's questions are resolved.

@@ -140,13 +141,17 @@ func (m *Manager) attemptApplyCachedBlocks() error {
m.logger.Debug("applied cached block", "height", expectedHeight)
}

return nil
@mtsitrin (PR author) commented Apr 18, 2024:

I removed the pruning from the handling of each gossiped block, as it's not efficient: it goes over all the cached blocks.
It will be called when syncing the node instead.

block/block.go Outdated
}

// pruneCache prunes the cache of gossiped blocks.
func (m *Manager) pruneCache() {
A reviewer (Contributor) commented:

Not sure why we need this pruneCache vs. just deleting each cached block after we apply it.

@mtsitrin (PR author) replied:

I guess it was to handle the case where blocks weren't applied from the cache but from the sync target (see the sketch below).
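A standalone sketch of that case, under assumed names: when the height was advanced via the sync target rather than via the cache, entries at or below the current height are stale, and pruneCache sweeps them out:

```go
package block

type Block struct{ Height uint64 }

type Manager struct {
	height     uint64
	blockCache map[uint64]*Block
}

// pruneCache drops cached gossiped blocks that are no longer needed because
// the node's height already moved past them (e.g. after syncing to a DA/SL
// sync target rather than applying them via attemptApplyCachedBlocks).
func (m *Manager) pruneCache() {
	for h := range m.blockCache {
		if h <= m.height {
			delete(m.blockCache, h)
		}
	}
}
```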

block/manager.go (outdated, resolved)
block/block.go Outdated
func (m *Manager) attemptApplyCachedBlocks() error {
	m.applyCachedBlockMutex.Lock()
	defer m.applyCachedBlockMutex.Unlock()
	m.executeBlockMutex.Lock()
A reviewer (Contributor) commented:

As mentioned before, I think it's best to have this lock on the ApplyBlock function (and change its name accordingly): even if you hit the race condition on NextHeight because a block at that height is currently being applied, ApplyBlock has a sanity check on the correct height and the block won't be applied.

I think it makes the code much more elegant, simplifies reading, and removes the need to deal with future applyBlock callers.
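A standalone sketch of this proposal, under assumed names: the lock lives inside ApplyBlock itself, and the height sanity check performed under the lock makes a lost race harmless:

```go
package block

import (
	"fmt"
	"sync"
)

type Block struct{ Height uint64 }

type Manager struct {
	applyBlockMutex sync.Mutex
	height          uint64
}

// ApplyBlock takes the lock itself, so callers (gossip callback, retriever,
// cache drain) need no external synchronization.
func (m *Manager) ApplyBlock(b *Block) error {
	m.applyBlockMutex.Lock()
	defer m.applyBlockMutex.Unlock()

	// Sanity check under the lock: if another caller already applied this
	// height, the block is rejected here instead of corrupting state.
	if b.Height != m.height+1 {
		return fmt.Errorf("unexpected height: got %d, want %d", b.Height, m.height+1)
	}
	// ... execute and commit ...
	m.height = b.Height
	return nil
}
```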

@mtsitrin (PR author) replied:

This mutex is not only for applyBlock; it's a mutex between the retriever thread and the gossip thread.
There are multiple parameters that can be accessed concurrently and need protection (e.g. blockCache, store height, "state" (apply block)).

I can change it if you prefer, but IMO it's not hermetic enough.

@omritoptix merged commit 7290af6 into main Apr 21, 2024
4 checks passed
@omritoptix deleted the mtsitrin/658-applying-blocks-concurrently-can-lead-to-unexpected-errors branch April 21, 2024 09:44
Successfully merging this pull request may close these issues.

Applying blocks concurrently can lead to unexpected errors
3 participants