-
Notifications
You must be signed in to change notification settings - Fork 1.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix: (log_poller backfill) exit on non-rpc err #14114
Conversation
14597fb
to
389875c
Compare
var rpcErr *client.JsonError | ||
if !pkgerrors.As(err, &rpcErr) || rpcErr.Code != jsonRpcLimitExceeded { | ||
lp.lggr.Errorw("Unable to query for logs", "err", err, "from", from, "to", to) | ||
return err |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I initially tried to make this change in Dec 2023, but during review it became clear that this is a somewhat risky change and there was more to investigate about what the right way is to fix it. I re-opened it a few weeks ago, and got a more complex fix working that actually detects the error correctly (at least on some rpc servers.) This by itself will not, because it's comparing against a concrete type that will only match during testing, not in production.
I thought that was ready to merge last week, but unfortunately during review it again became clear that things are more complicated. There are so many different error codes and string formats returned from different rpc servers for the same condition we'll have to rely mostly on string parsing to differentiate them.
Leaving it as-is at least errs on the side of caution because, even though the logic is flawed, it's just taking preventative measures when it doesn't actually need to in most cases... and reporting a critical error in some cases when it shouldn't. Not taking those preventative measures, or not reporting a critical condition is more risky in production.
Is there a way to fix the flaky test without making this change?
Here is the current PR as it stands:
#11654
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See comment
Quality Gate passedIssues Measures |
What
Properly detect error type and exit on non-rpc error
Why
On graceful shutdown backfill method reacts on
context canceled
error as rpc batch size error, which leads tocrit
log and failed testExample log cl-node-54af8721.log
Resolves Dependencies
#13647