fix: health endpoint now detects stopped block production #2800
base: main
Conversation
This commit fixes issue #2643, where the health endpoint still reports OK when a node has stopped producing blocks.

Changes:
- Updated HealthServer to accept store, config, and logger dependencies
- Implemented block production monitoring in the Livez endpoint:
  - For aggregator nodes, checks whether LastBlockTime is recent
  - Returns WARN if block production is slow (> 3x block time)
  - Returns FAIL if block production has stopped (> 5x block time)
  - Uses LazyBlockInterval for lazy mode aggregators
  - Non-aggregator nodes continue to return PASS
- Added constants for health check thresholds:
  - healthCheckWarnMultiplier = 3
  - healthCheckFailMultiplier = 5
- Added comprehensive unit tests covering all scenarios:
  - Server tests (pkg/rpc/server/server_test.go): non-aggregator nodes; aggregator with no blocks; aggregator with recent blocks (PASS); aggregator with slow production (WARN); aggregator with stopped production (FAIL); lazy aggregator with correct thresholds; error handling
  - Client tests (pkg/rpc/client/client_test.go): non-aggregator returns PASS; aggregator with recent blocks returns PASS; aggregator with slow block production returns WARN; aggregator with stopped block production returns FAIL
- Updated setupTestServer to pass new dependencies
- Added createCustomTestServer helper for testing with custom configs

The thresholds are configurable based on the node's BlockTime or LazyBlockInterval settings, making the health check adaptive to different node configurations (see the sketch below).

Fixes #2643
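To make the threshold logic above concrete, here is a minimal sketch of the PASS/WARN/FAIL classification it describes. This is an illustration only: the names (classifyLiveness, Status) and the free-standing function shape are assumptions, not the PR's actual HealthServer code.

```go
package health

import "time"

const (
	healthCheckWarnMultiplier = 3 // WARN if no block for > 3x the expected interval
	healthCheckFailMultiplier = 5 // FAIL if no block for > 5x the expected interval
)

type Status string

const (
	StatusPass Status = "PASS"
	StatusWarn Status = "WARN"
	StatusFail Status = "FAIL"
)

// classifyLiveness maps the time since the last produced block onto
// PASS/WARN/FAIL for an aggregator. expectedInterval stands in for BlockTime,
// or LazyBlockInterval when the aggregator runs in lazy mode.
func classifyLiveness(isAggregator bool, lastBlockTime time.Time, expectedInterval time.Duration, now time.Time) Status {
	if !isAggregator {
		// Non-aggregators do not produce blocks, so they always report PASS.
		return StatusPass
	}
	elapsed := now.Sub(lastBlockTime)
	switch {
	case elapsed > healthCheckFailMultiplier*expectedInterval:
		return StatusFail
	case elapsed > healthCheckWarnMultiplier*expectedInterval:
		return StatusWarn
	default:
		return StatusPass
	}
}
```

In the real endpoint the expected interval would come from the node's configuration, using LazyBlockInterval for lazy-mode aggregators as described above.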
The latest Buf updates on your PR. Results from workflow CI and Release / buf-check (pull_request).
Claude finished @randygrok's task: Comprehensive Code Review ✓

I've completed a thorough review of this PR. Here are my findings addressing the health check issue and overall code quality.

Critical Issue: Ready Endpoint Logic Flaw

The issue reported by @auricom is valid and represents a significant flaw in the ready endpoint logic.

Problem Analysis (pkg/rpc/server/http.go:67-77)

The ready endpoint currently checks whether non-aggregator nodes have peers, but that check happens before verifying that the node has synced any blocks. This creates a scenario where:
The logic flow is:

    // Line 56-78: Current flow
    if pm != nil {
        // Check P2P listening
        // Check peers for non-aggregators
    }
    // Line 80: GetState check happens AFTER peer check
    state, err := s.GetState(r.Context())
    // Line 87: Block height check
    if localHeight == 0 {
        return UNREADY
    }

Root Cause

The GetState and block-height checks run only after the peer checks.

Recommended Fix

Reorder the checks so block production/sync verification happens before P2P checks:

    // 1. First verify basic state access
    state, err := s.GetState(r.Context())
    if err != nil {
        return UNREADY
    }

    // 2. Then verify node has synced at least one block
    if state.LastBlockHeight == 0 {
        return UNREADY: no blocks yet
    }

    // 3. THEN check P2P (only for nodes that have blocks)
    if pm != nil {
        // ... P2P checks
    }

Health Check Coverage Assessment

Based on @auricom's requirements, here's the coverage status:
Missing: RPC Request Validation

Neither endpoint validates that the RPC server can actually process client requests per the specifications. Consider:
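One hypothetical shape such validation could take (an illustration only, not part of this PR and not a quote from the review) is a lightweight self-query inside the readiness handler that is treated as unready on failure or timeout:

```go
package health

import (
	"context"
	"fmt"
	"time"
)

// State is a placeholder stand-in for the node state type used by the server.
type State struct {
	LastBlockHeight uint64
}

// stateQuerier is an illustrative interface for "the RPC server can answer a
// basic query"; the real server's API may differ.
type stateQuerier interface {
	GetState(ctx context.Context) (State, error)
}

// checkRPCResponsive treats a failed or slow basic query as "not ready".
func checkRPCResponsive(ctx context.Context, s stateQuerier) error {
	ctx, cancel := context.WithTimeout(ctx, 500*time.Millisecond)
	defer cancel()
	if _, err := s.GetState(ctx); err != nil {
		return fmt.Errorf("basic state query failed: %w", err)
	}
	return nil
}
```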
Code Quality Issues

1. Inconsistent Error Responses (http.go:53-127)

The ready endpoint uses inconsistent error message formats:
Recommendation: Standardize error messages with a consistent structure that includes context about what failed and why.

2. Magic Number Without Explanation (http.go:94)

    maxAllowedDelay := 5 * cfg.Node.BlockTime.Duration

The multiplier of 5 is an unexplained magic number. Recommendation:
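As one illustration of that recommendation (hypothetical names, not the PR's code), the multiplier could be lifted into a documented constant:

```go
package server

import "time"

// readinessFailMultiplier is the number of expected block intervals the ready
// endpoint tolerates before reporting UNREADY. (Hypothetical name; the PR's
// own constants may differ.)
const readinessFailMultiplier = 5

// maxAllowedDelay derives the readiness threshold from the configured block time.
func maxAllowedDelay(blockTime time.Duration) time.Duration {
	return readinessFailMultiplier * blockTime
}
```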
3. Missing Context in Logs (http.go:92-100)

When the aggregator health check fails, there's no logging. This makes debugging production issues difficult.

Recommendation: Add structured logging similar to the gemini-code-assist suggestion, but at Error level:

    if timeSinceLastBlock > maxAllowedDelay {
        logger.Error().
            Dur("time_since_last_block", timeSinceLastBlock).
            Dur("max_allowed_delay", maxAllowedDelay).
            Uint64("last_block_height", state.LastBlockHeight).
            Time("last_block_time", state.LastBlockTime).
            Msg("Aggregator failed readiness check: not producing blocks at expected rate")
        http.Error(w, "UNREADY: aggregator not producing blocks at expected rate", http.StatusServiceUnavailable)
        return
    }

4. Missing Nil Check (http.go:102)

    if bestKnownHeightProvider == nil {
        http.Error(w, "UNREADY: best-known height unavailable", http.StatusServiceUnavailable)
        return
    }

This check exists, but if it's required for readiness, why is it optional in the function signature? This suggests an architectural issue. Recommendation: Either:
Test Quality Issues

5. Flaky Integration Test (single_sequencer_integration_test.go:422-463)

The test relies on a fixed-duration sleep:

    time.Sleep(time.Duration(config.Node.MaxPendingHeadersAndData+2) * config.Node.BlockTime.Duration)

Impact: This test can be flaky in CI environments with variable timing.

Recommendation: Follow gemini-code-assist's suggestion to use polling with require.Eventually:

    require.Eventually(t, func() bool {
        resp, err := http.Get("http://" + config.RPC.Address + "/health/ready")
        if err != nil {
            return false
        }
        defer resp.Body.Close()
        return resp.StatusCode == http.StatusServiceUnavailable
    }, 10*time.Second, 100*time.Millisecond, "Readiness should become UNREADY after block production stops")

6. Missing Test Coverage (server_test.go)

Missing test scenarios:
Architecture & Design

7. Separation of Concerns (http.go:36-127)

The ready handler currently mixes several unrelated checks in a single function.

Recommendation: Refactor into separate, testable health check functions:

    type HealthCheck interface {
        Check(ctx context.Context) error
    }

    type P2PHealthCheck struct { /* ... */ }
    func (c *P2PHealthCheck) Check(ctx context.Context) error { /* ... */ }

    type BlockProductionHealthCheck struct { /* ... */ }
    func (c *BlockProductionHealthCheck) Check(ctx context.Context) error { /* ... */ }

    type SyncStatusHealthCheck struct { /* ... */ }
    func (c *SyncStatusHealthCheck) Check(ctx context.Context) error { /* ... */ }

This would:
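Building on that interface, a hypothetical way the ready handler could compose such checks (illustrative sketch only, not code from this PR):

```go
package health

import (
	"context"
	"net/http"
)

// HealthCheck restates the interface sketched above for self-containment.
type HealthCheck interface {
	Check(ctx context.Context) error
}

// readyHandler runs each check in order and reports the first failure as 503.
func readyHandler(checks []HealthCheck) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		for _, c := range checks {
			if err := c.Check(r.Context()); err != nil {
				http.Error(w, "UNREADY: "+err.Error(), http.StatusServiceUnavailable)
				return
			}
		}
		w.WriteHeader(http.StatusOK)
		_, _ = w.Write([]byte("READY"))
	}
}
```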
8. Configuration Validation (defaults.go:42-50)

The fallback logic is:

    if blockTime == 0 {
        return 30 // fallback to safe default if blockTime is not set
    }

Issue: If blockTime is zero, that almost certainly indicates a misconfiguration rather than a case to paper over silently.

Recommendation: Return an error or panic during config validation rather than silently using a fallback that might not be appropriate.

Security & Performance

9. No Rate Limiting on Health Endpoints

Health endpoints are typically exposed without authentication for load balancers. However, there's no rate limiting, which could enable:
Recommendation: Consider adding lightweight rate limiting or request throttling for health endpoints.

10. Blocking Store Operations (http.go:25, 80)

Both health endpoints perform blocking store operations:

If the store is slow or deadlocked, health check requests will hang.

Recommendation: Add timeouts:

    ctx, cancel := context.WithTimeout(r.Context(), 100*time.Millisecond)
    defer cancel()
    state, err := s.GetState(ctx)

Documentation Issues

11. Incomplete Documentation (docs/learn/config.md:610-637)

The documentation mentions the checks performed but doesn't explain:
Recommendation: Expand documentation with:
Breaking Changes Concern

12. Protobuf Deletion (CI Failure)

The CI shows a breaking-change failure: the PR deletes several protobuf files:

Impact: This is a breaking API change that will affect clients using the health RPC endpoints.

Recommendation:
Summary & Recommendations

High Priority (Must Fix)
Medium Priority (Should Fix)
Low Priority (Nice to Have)
Positive Aspects

✅ Good test coverage - integration tests verify the behavior end-to-end.

Overall Assessment: This PR makes important improvements to health checking, but it has a critical logic flaw in the ready endpoint that allows nodes stuck at genesis to report as ready. The integration test confirms the aggregator scenario works, but it doesn't catch the non-aggregator stuck-at-genesis issue reported by @auricom.
Summary of Changes

Hello @randygrok, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed.

This pull request resolves issue #2643 by significantly improving the accuracy of the health endpoint for aggregator nodes. Previously, the endpoint might incorrectly report a healthy status even when a node had ceased block production. The updated logic now checks the time since the last block was produced against dynamically calculated warning and failure thresholds, derived from the node's configured block time. This ensures that operational issues related to block generation are promptly detected and reflected in the health status, providing better visibility into node performance and stability.

Highlights
Code Review
This pull request effectively addresses the issue of the health endpoint not detecting stopped block production for aggregator nodes. The core logic is sound, and the changes are accompanied by a comprehensive set of unit tests and a new integration test. My feedback includes suggestions to improve the robustness of the integration test by replacing fixed-duration sleeps with polling, refactoring duplicated test setup code for better maintainability, and adjusting a log level to better reflect the severity of a health check failure.
Force-pushed from 39dc439 to 819b015.
Codecov Report

❌ Patch coverage is

Additional details and impacted files

    @@            Coverage Diff             @@
    ##             main    #2800      +/-   ##
    ==========================================
    + Coverage   64.54%   64.69%   +0.14%
    ==========================================
      Files          80       80
      Lines        7176     7177       +1
    ==========================================
    + Hits         4632     4643      +11
    + Misses       2008     1999       -9
    + Partials      536      535       -1
Remove createCustomTestServer function, which was redundant after making setupTestServer accept an optional config parameter. This eliminates 36 lines of duplicated server setup code.

Changes:
- Make setupTestServer accept a variadic config parameter
- Update all test cases to use setupTestServer directly
- Remove createCustomTestServer function entirely

Result: -21 net lines of code, improved maintainability, cleaner API.
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
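The commit above folds createCustomTestServer into setupTestServer via an optional parameter. A minimal sketch of that variadic-config pattern, using hypothetical stand-in types rather than the repository's real ones:

```go
package server_test

import "testing"

// Hypothetical stand-ins for the real test types; names and fields are
// illustrative, not the repository's actual code.
type testConfig struct{ Aggregator bool }
type testServer struct{ cfg testConfig }

func defaultTestConfig() testConfig { return testConfig{} }

// setupTestServer takes an optional config: zero arguments means "use the
// defaults", one argument overrides them, which removes the need for a
// separate createCustomTestServer helper.
func setupTestServer(t *testing.T, cfgs ...testConfig) *testServer {
	t.Helper()
	cfg := defaultTestConfig()
	if len(cfgs) > 0 {
		cfg = cfgs[0]
	}
	return &testServer{cfg: cfg}
}
```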
…n test

Replace fixed-duration time.Sleep calls with require.Eventually polling to make the health endpoint integration test more robust and less flaky.

Changes:
- Use require.Eventually to poll for health state transitions
- Poll every 100ms instead of fixed sleeps
- Generous timeouts (5s and 10s) that terminate early on success
- Better error handling during polling

Benefits:
- More resilient to timing variations in CI/CD environments
- Faster test execution (completes as soon as conditions are met)
- Eliminates magic numbers (1700ms compensation)
- Expresses intent clearly (wait until condition is met)
- Non-flaky (tested 3x consecutively)
tac0turtle left a comment
utACK, we should make sure this covers @auricom's needs before merging
A healthcheck endpoint is used by infrastructure automation to determine if a process can respond to client requests. AFAIK, based on this PR, the existing endpoint did not perform substantive validation. For a node to be considered healthy, the following conditions should be met:

1- The RPC server is operational and responds correctly to basic client queries per specifications

I cannot verify whether this change effectively validates items 1 and 4. And I may be overly cautious, so I would appreciate your expertise on these points 🧑‍💻
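For context on how infrastructure automation might consume the endpoints this PR adds, here is a minimal probe sketch against /health/live and /health/ready; the base URL and output handling are assumptions for illustration, not part of the PR:

```go
package main

import (
	"fmt"
	"net/http"
	"time"
)

// probe issues a bounded GET against one health path and returns the status code.
func probe(baseURL, path string) (int, error) {
	client := &http.Client{Timeout: 2 * time.Second}
	resp, err := client.Get(baseURL + path)
	if err != nil {
		return 0, err
	}
	defer resp.Body.Close()
	return resp.StatusCode, nil
}

func main() {
	base := "http://localhost:7331" // example RPC address
	for _, path := range []string{"/health/live", "/health/ready"} {
		code, err := probe(base, path)
		if err != nil {
			fmt.Printf("%s: request failed: %v\n", path, err)
			continue
		}
		// 200 means the check passed; 503 means the corresponding check failed.
		fmt.Printf("%s: HTTP %d\n", path, code)
	}
}
```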
dismissing so we can cover all paths that claude mentioned
thanks @auricom, let me check those points. By your definition only point 2 is covered
… HTTP endpoint

- Removed the HealthService and its related proto definitions.
- Implemented a new HTTP health check endpoint at `/health/live`.
- Updated the Client to use the new HTTP health check instead of the gRPC call.
- Enhanced health check logic to return PASS, WARN, or FAIL based on block production status.
- Modified tests to validate the new health check endpoint and its responses.
- Updated server to register custom HTTP endpoints, including the new health check.
…status checks; deprecate legacy gRPC endpoint in favor of HTTP
…; recommend migration to HTTP
…dpoints; migrate to HTTP health checks
…d clarity and accuracy
…nts in config and server files
… HTTP health endpoints
…or health endpoints
…client

feat(health): add `readyz()` and `is_ready()` methods to Rust `HealthClient`
feat(health): introduce `ReadinessStatus` enum for Rust client
refactor: update Rust client example to demonstrate liveness and readiness checks
…ks, including E2E test updates
    if !cfg.Node.Aggregator {
        peers, err := pm.GetPeers()
        if err != nil {
            http.Error(w, "UNREADY: failed to query peers", http.StatusServiceUnavailable)
            return
        }
        if len(peers) == 0 {
            http.Error(w, "UNREADY: no peers connected", http.StatusServiceUnavailable)
            return
        }
    }
I don't get this change?
This was already there before.
We don't consider a non-aggregator node that has no peers to be ready.
@auricom do you think this is correct?
That totally makes sense to me.
Let's imagine this situation:
Gateway is running 10 full nodes behind its https://rpc.eden.gateway.io URL.
One of the full nodes loses connection to other peers => it will experience latency and won't be able to forward transactions.
So this node needs to be booted from the full-node pool so that users won't interact with an unready node.
tac0turtle left a comment
This is a good start but doesn't solve the issue at hand.
…g endpoint modifications and test updates
…s endpoints, update documentation and remove deprecated health client
…eLive for lazy mode scenarios
… remove unused RPC client code and adjust e2e test parameters
@auricom take a look specifically at the server endpoints
The included documentation does seem to be what we want to achieve, thanks. But I'm not sure the ready endpoint is working as intended.

I ran a fullnode on Eden testnet using

The node was configured without a p2p peer and with wrong namespaces (so it wasn't able to DA sync). The node was stuck on block 0 with 0 peers, but the ready endpoint was considered

curl -v http://localhost:7331/health/ready
Thanks for the message, let me check, this is helpful.
Turns out those were issues coming from my workstation. I can confirm that the unready node was detected as unready on the endpoint.
This commit fixes issue #2643 where the health endpoint still reports OK when a node has stopped producing blocks.
Closes #2643
Overview
Needs merge and tag of tastora PR celestiaorg/tastora#149