fix(block): da unavailability fixes #1215
Conversation
Not sure why the error log is removed when applying from local.
block/sync.go (Outdated)

```diff
@@ -75,7 +70,14 @@ func (m *Manager) SettlementSyncLoop(ctx context.Context) error {
 	err = m.ApplyBatchFromSL(settlementBatch.Batch)
 	if err != nil {
-		return fmt.Errorf("process next DA batch. err:%w", err)
+		m.logger.Error("Apply batch from SL", "err", err)
```
The problem I see with logging and breaking here is that as a node operator you'll never know you have an issue.
Imagine you use a wrong Celestia RPC, or the RPC is down and you need to change it: in that case the loop simply breaks and the node operator has no idea they need to change it. That's why we prefer to emit health issues.
The DA RPC retry loop should be long enough to indicate there is an RPC problem so the operator can do something about it (node operators don't watch the logs, but they do have alerts on health status).
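A minimal sketch of the reviewer's point, assuming a hypothetical HealthEvent type and Publisher bus (dymint's real health-event names and fields may differ): on a DA-related failure the node both logs and publishes an unhealthy status, so operators who alert on health status rather than scrape logs learn that, for example, their Celestia RPC needs attention.

```go
package sync

import (
	"context"
	"log/slog"
)

// HealthEvent is a hypothetical stand-in for the node's health-status event type.
type HealthEvent struct {
	Healthy bool
	Err     error
}

// Publisher is a hypothetical stand-in for the node's event bus.
type Publisher interface {
	Publish(ctx context.Context, ev HealthEvent)
}

// reportDAFailure logs the apply error and emits an unhealthy event, so the
// failure is visible to health-status alerting, not only in the logs.
func reportDAFailure(ctx context.Context, logger *slog.Logger, bus Publisher, err error) {
	logger.Error("apply batch from SL", "err", err)
	bus.Publish(ctx, HealthEvent{Healthy: false, Err: err})
}
```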
Ok, makes sense. I'll revert.
I think what we need to do here is only log and break when the error is a DA error (ErrRetrieval or ErrBlobNotfound); in the other cases, return the error as usual. This way, if the DA fails, the node will not stop and will retry on the next state update, but it will still fail if applyblock actually fails for a reason that is not DA related. A sketch of this is shown below.
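A hedged sketch of that proposal: ErrRetrieval and ErrBlobNotFound are defined locally here as stand-ins for the DA package sentinels mentioned above, and the apply step is passed in as a closure. DA errors are logged and signal "retry on the next state update"; any other apply error is returned as usual.

```go
package sync

import (
	"errors"
	"fmt"
)

// Hypothetical DA error sentinels, standing in for the da package errors
// referenced in the review (ErrRetrieval, ErrBlobNotfound).
var (
	ErrRetrieval    = errors.New("da: retrieval failed")
	ErrBlobNotFound = errors.New("da: blob not found")
)

// applyOrBreak treats DA-related errors as transient: it logs them and reports
// that the caller should break out and retry on the next state update.
// Any other error from the apply step is propagated as usual.
func applyOrBreak(apply func() error, logErr func(error)) (retryLater bool, err error) {
	err = apply()
	if err == nil {
		return false, nil
	}
	if errors.Is(err, ErrRetrieval) || errors.Is(err, ErrBlobNotFound) {
		logErr(err)      // surface the DA problem
		return true, nil // stop this iteration; retry on the next state update
	}
	return false, fmt.Errorf("apply batch from SL: %w", err)
}
```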
block/sync.go (Outdated)

```go
	// if height havent been updated, we are stuck
	// this covers the scenario where no applicable blocks were found in the DA
	if m.State.NextHeight() == currH {
```
Can you elaborate on how this can happen? AFAIU we got into the loop because we have a new settlement height, which means the blocks are in the DA. Assuming no fraud (that case is handled by the blocks being unavailable in the DA), when will this case happen?
> which means the blocks are in the DA

How do you know that? If the data is not there, you'll be stuck.
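For illustration, a self-contained sketch of the stuck check under discussion, with State and errStuck as hypothetical stand-ins for the manager's state and error handling: the next height is recorded before the batch is processed, and if it has not advanced afterwards (for example because the data could not be retrieved from the DA), the loop reports that it is stuck instead of silently continuing.

```go
package sync

import (
	"errors"
	"fmt"
)

// State is a minimal stand-in for the manager state in block/sync.go;
// only NextHeight matters for this sketch.
type State struct{ next uint64 }

func (s *State) NextHeight() uint64 { return s.next }

// errStuck is a hypothetical sentinel for the "no progress" case: the
// settlement layer announced a new height, but nothing applicable was found
// (or retrievable) in the DA, so the height did not advance.
var errStuck = errors.New("sync: no progress, DA data may be unavailable")

// syncOnce records the next height before processing and checks afterwards
// whether it advanced; if not, the node is stuck and should surface that.
func syncOnce(state *State, processBatch func() error) error {
	currH := state.NextHeight()
	if err := processBatch(); err != nil {
		return fmt.Errorf("process batch: %w", err)
	}
	if state.NextHeight() == currH {
		return errStuck
	}
	return nil
}
```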
If ApplyBatchFromSL fails, don't go unhealthy (it might be caused by DA unavailability).