Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: (log_poller backfill) exit on non-rpc err #14114

Closed
wants to merge 4 commits into from

Conversation

bukata-sa
Copy link
Contributor

@bukata-sa bukata-sa commented Aug 14, 2024

What

Properly detect error type and exit on non-rpc error

Why

On graceful shutdown backfill method reacts on context canceled error as rpc batch size error, which leads to crit log and failed test
Example log cl-node-54af8721.log

Resolves Dependencies

#13647

@bukata-sa bukata-sa requested a review from a team as a code owner August 14, 2024 12:02
@bukata-sa bukata-sa requested review from reductionista, andrevmatos, jmank88 and a team and removed request for a team and jmank88 August 14, 2024 12:03
var rpcErr *client.JsonError
if !pkgerrors.As(err, &rpcErr) || rpcErr.Code != jsonRpcLimitExceeded {
lp.lggr.Errorw("Unable to query for logs", "err", err, "from", from, "to", to)
return err
Copy link
Contributor

@reductionista reductionista Aug 14, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I initially tried to make this change in Dec 2023, but during review it became clear that this is a somewhat risky change and there was more to investigate about what the right way is to fix it. I re-opened it a few weeks ago, and got a more complex fix working that actually detects the error correctly (at least on some rpc servers.) This by itself will not, because it's comparing against a concrete type that will only match during testing, not in production.

I thought that was ready to merge last week, but unfortunately during review it again became clear that things are more complicated. There are so many different error codes and string formats returned from different rpc servers for the same condition we'll have to rely mostly on string parsing to differentiate them.

Leaving it as-is at least errs on the side of caution because, even though the logic is flawed, it's just taking preventative measures when it doesn't actually need to in most cases... and reporting a critical error in some cases when it shouldn't. Not taking those preventative measures, or not reporting a critical condition is more risky in production.

Is there a way to fix the flaky test without making this change?

Here is the current PR as it stands:
#11654

Copy link
Contributor

@reductionista reductionista left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See comment

@bukata-sa bukata-sa closed this Aug 23, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants