
fix(concurrency): applying blocks concurrently can lead to unexpected errors #700

Conversation

@mtsitrin (Contributor) commented Apr 17, 2024:

  • Mutex-protected applyBlock for the full node (it can be called both from gossiping and when retrieving blocks); see the sketch below
  • Moved m.attemptApplyCachedBlocks from processNextDABatch to syncUntilTarget

Closes #658
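For context, a minimal standalone sketch of the intent, with assumed names and simplified types (Block, Manager, applyBlock); this is not the actual dymint code:

```go
package block

import "sync"

// Block is a simplified stand-in for the real block type.
type Block struct {
	Height uint64
}

// Manager serializes every code path that applies a block.
type Manager struct {
	executeBlockMutex sync.Mutex // name taken from the diff below; the rest is assumed
	height            uint64
}

// applyBlock can be reached from the gossip callback and from the
// DA/settlement retriever; the mutex keeps store and ABCI access from
// interleaving when both paths race on the same height.
func (m *Manager) applyBlock(b *Block) error {
	m.executeBlockMutex.Lock()
	defer m.executeBlockMutex.Unlock()

	if b.Height != m.height+1 {
		return nil // stale or out-of-order block; ignore here
	}
	// ... execute the block against the app and commit the store ...
	m.height = b.Height
	return nil
}
```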


For Author:

  • Targeted PR against the correct branch
  • Included the correct type prefix in the PR title
  • Linked to a GitHub issue with discussion and accepted design
  • Targets only one GitHub issue
  • Wrote unit and integration tests
  • All CI checks have passed
  • Added relevant godoc comments

For Reviewer:

  • Confirmed the correct type prefix in the PR title
  • Reviewers assigned
  • Confirmed all author checklist items have been addressed

After reviewer approval:

  • If the PR targets the main branch, it should be squashed and merged.
  • If the PR targets a release branch, it should be rebased.

@mtsitrin linked an issue Apr 17, 2024 that may be closed by this pull request
block/retriever.go: fixed alert
@mtsitrin marked this pull request as ready for review April 17, 2024 08:27
@mtsitrin requested a review from a team as a code owner April 17, 2024 08:27
@mtsitrin requested review from zale144 and danwt April 17, 2024 08:27
@danwt (Contributor) left a comment:

Hey @mtsitrin, I think it makes sense, but these things are always tricky.

I added some comments, could you please check if they make sense?

Also:

From my analysis, maybe we can combine the produce and execute mutexes into one mutex and rename it to something that makes sense for both cases?

Original problem description:
    since store and ABCI access is not atomic, races can occur
    if two goroutines try to apply a block

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

submitBatchMutex      sync.Mutex
    on handleSubmissionTrigger, try to lock and unlock
        calls produceBlock

produceBlockMutex     sync.Mutex
    on handleSubmissionTrigger
    in ProduceBlockLoop
        on produceBlock, lock and unlock
            calls applyBlock

executeBlockMutex     sync.Mutex
    on applyBlockCallback, lock and unlock
        calls applyBlock

    on initialSync
    in retrieveLoop
        on syncUntilTarget, lock and unlock
            calls applyBlock

My reasoning is that, depending on whether the node runs as an aggregator or not, it never uses both mutexes (see the sketch below).
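A rough standalone sketch of the combine-into-one suggestion, under assumed names (blockMutex, produceBlock, applyBlock); not the actual dymint API:

```go
package block

import "sync"

// Manager holds a single lock shared by the aggregator's produce path and
// the full node's apply path, since a given node only ever uses one of them.
type Manager struct {
	blockMutex sync.Mutex // assumed name; replaces produceBlockMutex + executeBlockMutex
}

// Aggregator path: ProduceBlockLoop / handleSubmissionTrigger end up here.
func (m *Manager) produceBlock() error {
	m.blockMutex.Lock()
	defer m.blockMutex.Unlock()
	// ... build the block and apply it while holding the lock ...
	return nil
}

// Full-node path: the gossip callback and syncUntilTarget end up here.
func (m *Manager) applyBlock() error {
	m.blockMutex.Lock()
	defer m.blockMutex.Unlock()
	// ... apply the gossiped/retrieved block while holding the lock ...
	return nil
}
```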

@mtsitrin (PR author) replied:

@danwt you're right.
I prefer not to do this change now. We have a few more things we can refactor/abstract between the aggregator and non-aggregator paths; otherwise I think it will be more confusing in the current state.

One thing at a time.

@danwt self-requested a review April 17, 2024 16:17
danwt previously approved these changes Apr 17, 2024
@mtsitrin requested a review from danwt April 17, 2024 16:29
@@ -42,6 +42,9 @@ func (m *Manager) RetriveLoop(ctx context.Context) {
// It fetches the batches from the settlement, gets the DA height and gets
// the actual blocks from the DA.
func (m *Manager) syncUntilTarget(ctx context.Context, syncTarget uint64) error {
	m.executeBlockMutex.Lock()
@omritoptix (Contributor) commented Apr 17, 2024:

This process (syncUntilTarget) can be long, as it needs to actually fetch the data from the DA upon a new sync target (which can take dozens of seconds or even longer).

It seems to me that during this time no gossiped blocks will be applied, as the lock will block them (so basically RPC nodes won't provide "real-time" responses while syncing from the DA).

I think the lock should be more fine-grained, on the actual execution of the blocks, and not include the fetching from the DA.

Why are we not just putting it directly on the applyBlock/executeBlock function to only allow one caller access?
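A standalone sketch of the fine-grained variant being suggested here, with hypothetical names and elided bodies (fetchBatch, the apply step); the slow DA fetch runs unlocked and only block execution is serialized:

```go
package block

import (
	"context"
	"sync"
)

type Block struct{ Height uint64 }

type Manager struct {
	executeBlockMutex sync.Mutex
}

// fetchBatch stands in for the (potentially slow) DA retrieval; it runs
// without holding any lock, so gossiped blocks can still be applied meanwhile.
func (m *Manager) fetchBatch(ctx context.Context, daHeight uint64) ([]*Block, error) {
	// ... network round trips to the DA layer, possibly tens of seconds ...
	return nil, nil
}

func (m *Manager) syncUntilTarget(ctx context.Context, syncTarget uint64) error {
	blocks, err := m.fetchBatch(ctx, syncTarget) // unlocked: slow I/O
	if err != nil {
		return err
	}

	// Only the actual execution is serialized with the gossip path.
	m.executeBlockMutex.Lock()
	defer m.executeBlockMutex.Unlock()
	for _, b := range blocks {
		_ = b // ... apply b against the app and commit the store ...
	}
	return nil
}
```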

@mtsitrin (PR author) replied:

Agreed. Fetching data won't be mutex-locked.

> during this time no gossiped blocks will be applied

A gossiped block will be applied if its height is correct, which rather conflicts with the syncing process. I think it's fine for a gossiped block to wait while syncing is in progress;
in the happy flow there's no syncing anyway (as blocks are gossiped).

> directly on the applyBlock/executeBlock function

The lock is more about protecting store.Height(), and not only the execution.

A reviewer (Contributor) replied:

Still not sure I understand why executeBlockMutex can't live inside applyBlock.

The only operation I see being protected by the current implementation is m.store.NextHeight().

If that's indeed the case, I suggest making nextHeight (or height access in general) atomic, thereby simplifying the lock and only putting it on the applyBlock function.
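A standalone sketch of the atomic-height suggestion, assuming Go 1.19+ sync/atomic types and hypothetical names; height reads become lock-free and the mutex can live inside applyBlock alone:

```go
package block

import (
	"sync"
	"sync/atomic"
)

type Block struct{ Height uint64 }

type Manager struct {
	height            atomic.Uint64 // read lock-free from any goroutine
	executeBlockMutex sync.Mutex    // only taken around the actual apply
}

// NextHeight can be called from the gossip and retriever threads without a lock.
func (m *Manager) NextHeight() uint64 {
	return m.height.Load() + 1
}

func (m *Manager) applyBlock(b *Block) error {
	m.executeBlockMutex.Lock()
	defer m.executeBlockMutex.Unlock()

	// Re-check under the lock: another goroutine may have applied this
	// height while we were waiting.
	if b.Height != m.NextHeight() {
		return nil
	}
	// ... execute and commit ...
	m.height.Store(b.Height)
	return nil
}
```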

}
// check for cached blocks
err := m.attemptApplyCachedBlocks(ctx)
A reviewer (Contributor) commented:

This shouldn't be here; it should be in the applyBlockCallback function. Not sure when it was removed from there.

The purpose of this is the following (see the sketch after this list):

  1. The full node is at height x.
  2. The full node gets gossiped blocks at heights x+2, x+3, ..., x+n.
  3. The full node can't apply them until it has received x+1, so it keeps those blocks in the cache.
  4. Once the full node gets x+1, it applies all the rest from the cache.

Not sure why this was moved here and what its purpose is here.
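A standalone sketch of the caching behaviour described in steps 1-4, with assumed names (blockCache, onGossipedBlock, attemptApplyCachedBlocks); out-of-order gossiped blocks are parked until the missing height arrives:

```go
package block

type Block struct{ Height uint64 }

type Manager struct {
	height     uint64
	blockCache map[uint64]*Block // gossiped blocks we cannot apply yet
}

func NewManager() *Manager {
	return &Manager{blockCache: make(map[uint64]*Block)}
}

// onGossipedBlock caches anything that is not immediately applicable.
func (m *Manager) onGossipedBlock(b *Block) error {
	if b.Height <= m.height {
		return nil // already applied
	}
	if b.Height > m.height+1 {
		m.blockCache[b.Height] = b // park x+2, x+3, ..., x+n
		return nil
	}
	if err := m.applyBlock(b); err != nil {
		return err
	}
	// x+1 just landed: drain everything consecutive from the cache.
	return m.attemptApplyCachedBlocks()
}

func (m *Manager) attemptApplyCachedBlocks() error {
	for {
		next, ok := m.blockCache[m.height+1]
		if !ok {
			return nil
		}
		if err := m.applyBlock(next); err != nil {
			return err
		}
		delete(m.blockCache, next.Height)
	}
}

func (m *Manager) applyBlock(b *Block) error {
	// ... execute and commit; simplified ...
	m.height = b.Height
	return nil
}
```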

@mtsitrin (PR author) replied:

I think it's important for it to be here:

  1. The full node missed x.
  2. The full node got x+1 through x+3000.
  3. It got a sync target of, say, 2000 from the SL.

It makes sense to go over the cache and apply whatever is possible; otherwise you'll wait for the next gossiped block.

@danwt (Contributor) left a comment:

Please request my review again after the conflicts and Omri's questions are resolved.

@@ -140,13 +141,17 @@ func (m *Manager) attemptApplyCachedBlocks() error {
m.logger.Debug("applied cached block", "height", expectedHeight)
}

return nil
@mtsitrin (PR author) commented Apr 18, 2024:

I removed the pruning from the handling of each gossiped block, as it's not efficient: it goes over all the cached blocks.
It will be called when syncing the node instead.

block/block.go Outdated
}

// pruneCache prunes the cache of gossiped blocks.
func (m *Manager) pruneCache() {
A reviewer (Contributor) commented:

Not sure why we need this pruneCache vs. just deleting each cached block after we apply it.

@mtsitrin (PR author) replied:

I guess it was to handle the case where blocks weren't applied from the cache but from the sync target (see the sketch below).
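A standalone sketch of that case, under assumed names: when the height was advanced via the sync target rather than via the cache, entries at or below the current height are stale, and pruneCache sweeps them out:

```go
package block

type Block struct{ Height uint64 }

type Manager struct {
	height     uint64
	blockCache map[uint64]*Block
}

// pruneCache drops cached gossiped blocks that are no longer needed because
// the node's height already moved past them (e.g. after syncing to a DA/SL
// sync target rather than applying them via attemptApplyCachedBlocks).
func (m *Manager) pruneCache() {
	for h := range m.blockCache {
		if h <= m.height {
			delete(m.blockCache, h)
		}
	}
}
```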

block/manager.go (outdated, resolved)
block/block.go Outdated
func (m *Manager) attemptApplyCachedBlocks() error {
	m.applyCachedBlockMutex.Lock()
	defer m.applyCachedBlockMutex.Unlock()
	m.executeBlockMutex.Lock()
A reviewer (Contributor) commented:

As mentioned before, I think it's best to have this lock on the ApplyBlock function (and change its name accordingly): even if you hit the race condition on NextHeight because a block at that height is currently being applied, ApplyBlock has a sanity check on the correct height and the block won't be applied.

I think it makes the code much more elegant, simplifies reading, and removes the need to deal with future applyBlock callers.
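A standalone sketch of this proposal, under assumed names: the lock lives inside ApplyBlock itself, and the height sanity check performed under the lock makes a lost race harmless:

```go
package block

import (
	"fmt"
	"sync"
)

type Block struct{ Height uint64 }

type Manager struct {
	applyBlockMutex sync.Mutex
	height          uint64
}

// ApplyBlock takes the lock itself, so callers (gossip callback, retriever,
// cache drain) need no external synchronization.
func (m *Manager) ApplyBlock(b *Block) error {
	m.applyBlockMutex.Lock()
	defer m.applyBlockMutex.Unlock()

	// Sanity check under the lock: if another caller already applied this
	// height, the block is rejected here instead of corrupting state.
	if b.Height != m.height+1 {
		return fmt.Errorf("unexpected height: got %d, want %d", b.Height, m.height+1)
	}
	// ... execute and commit ...
	m.height = b.Height
	return nil
}
```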

@mtsitrin (PR author) replied:

This mutex is not only for applyBlock; it's a mutex between the retriever thread and the gossip thread.
There are multiple parameters that can be accessed concurrently and need protection (e.g. blockCache, store height, "state" (apply block)).

I can change it if you prefer, but IMO it's not hermetic enough.

@omritoptix merged commit 7290af6 into main Apr 21, 2024
4 checks passed
@omritoptix deleted the mtsitrin/658-applying-blocks-concurrently-can-lead-to-unexpected-errors branch April 21, 2024 09:44
Successfully merging this pull request may close these issues.

Applying blocks concurrently can lead to unexpected errors
3 participants