Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix(block): da unavailability fixes #1215

Closed
wants to merge 3 commits into from
Closed

Conversation

mtsitrin
Copy link
Contributor

@mtsitrin mtsitrin commented Nov 11, 2024

  • if ApplyBatchFromSL fails, don't go unhealthy (might be caused due to DA unavailability)
  • avoid being stuck in an infinite loop in case of missing DA data

@mtsitrin mtsitrin requested a review from a team as a code owner November 11, 2024 09:40
@mtsitrin mtsitrin changed the title fix(block): allowing for DA unavalability. checking for DA data missing fix(block): da unavailability fixes Nov 11, 2024
srene
srene previously approved these changes Nov 11, 2024
Copy link
Contributor

@srene srene left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not sure why removing the error log when applying from local

block/sync.go Outdated
@@ -75,7 +70,14 @@ func (m *Manager) SettlementSyncLoop(ctx context.Context) error {

err = m.ApplyBatchFromSL(settlementBatch.Batch)
if err != nil {
return fmt.Errorf("process next DA batch. err:%w", err)
m.logger.Error("Apply batch from SL", "err", err)
Copy link
Contributor

@omritoptix omritoptix Nov 11, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the problem I see with logging and breaking here is that as a node operator you'll never know if you have an issue.

imagine u use a wrong celestia rpc or the rpc is down and you need to change it.
in that case you simply break and the node operator has no idea that he needs to change it. that's why we prefer to emit health issues.

the da rpc retry loop should be long enough to indicate there is an rpc problem and now the operator should do something about it (the node operator doesn't look at the logs but have alerts to health status)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok makes sense
I'll revert

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i think that what we need to do here is only log and break in case the error is a da error (ErrRetrieval or ErrBlobNotfound), in the other cases return error as usual. This way if da fails the node will not stop and retry in the next state update but fail in case applyblock actually fails and is not da related.

block/sync.go Outdated

// if height havent been updated, we are stuck
// this covers the scenario where no applicable blocks were found in the DA
if m.State.NextHeight() == currH {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can u elaborate how this can happen? afaiu we got into the loop because we have new settlement height, which means the blocks are in the da. assuming no fraud (as it's handled by blocks unavaliable in the DA), when will this case happen?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

which means the blocks are in the da

how u know that? if the data is not there, u'll be stuck

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants