
Conversation

@randygrok
Contributor

@randygrok randygrok commented Nov 2, 2025

This commit fixes issue #2643 where the health endpoint still reports OK when a node has stopped producing blocks.

Closes #2643

Overview

Needs merge and tag of tastora PR celestiaorg/tastora#149

Changes:
- Updated HealthServer to accept store, config, and logger dependencies
- Implemented block production monitoring in the Livez endpoint:
  * For aggregator nodes, checks if LastBlockTime is recent
  * Returns WARN if block production is slow (> 3x block time)
  * Returns FAIL if block production has stopped (> 5x block time)
  * Uses LazyBlockInterval for lazy mode aggregators
  * Non-aggregator nodes continue to return PASS
- Added constants for health check thresholds:
  * healthCheckWarnMultiplier = 3
  * healthCheckFailMultiplier = 5
- Added comprehensive unit tests covering all scenarios:
  Server tests (pkg/rpc/server/server_test.go):
  * Non-aggregator nodes
  * Aggregator with no blocks
  * Aggregator with recent blocks (PASS)
  * Aggregator with slow production (WARN)
  * Aggregator with stopped production (FAIL)
  * Lazy aggregator with correct thresholds
  * Error handling
  Client tests (pkg/rpc/client/client_test.go):
  * Non-aggregator returns PASS
  * Aggregator with recent blocks returns PASS
  * Aggregator with slow block production returns WARN
  * Aggregator with stopped block production returns FAIL
- Updated setupTestServer to pass new dependencies
- Added createCustomTestServer helper for testing with custom configs

The thresholds scale with the node's BlockTime or LazyBlockInterval
settings, making the health check adaptive to different node
configurations.
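
For illustration, a simplified sketch of that logic in Go (only the multiplier constants match the implementation; the function and parameter names are illustrative):

const (
    healthCheckWarnMultiplier = 3
    healthCheckFailMultiplier = 5
)

// healthStatus returns PASS, WARN, or FAIL based on how long ago the last block was produced.
// blockInterval is BlockTime for normal aggregators and LazyBlockInterval in lazy mode.
func healthStatus(isAggregator bool, blockInterval time.Duration, lastBlockTime, now time.Time) string {
    if !isAggregator {
        return "PASS" // non-aggregator nodes always pass the liveness check
    }
    elapsed := now.Sub(lastBlockTime)
    switch {
    case elapsed > healthCheckFailMultiplier*blockInterval:
        return "FAIL" // block production appears to have stopped
    case elapsed > healthCheckWarnMultiplier*blockInterval:
        return "WARN" // block production is slower than expected
    default:
        return "PASS"
    }
}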

Fixes #2643
@github-actions
Contributor

github-actions bot commented Nov 2, 2025

The latest Buf updates on your PR. Results from workflow CI and Release / buf-check (pull_request).

Build     | Format     | Lint      | Breaking      | Updated (UTC)
✅ passed | ⏩ skipped | ✅ passed | ❌ failed (1) | Nov 7, 2025, 2:04 PM

@claude
Contributor

claude bot commented Nov 2, 2025

Claude finished @randygrok's task.


Comprehensive Code Review ✓

I've completed a thorough review of this PR. Here are my findings addressing the health check issue and overall code quality:


Critical Issue: Ready Endpoint Logic Flaw

The issue reported by @auricom is valid and represents a significant flaw in the ready endpoint logic.

Problem Analysis (pkg/rpc/server/http.go:67-77)

The ready endpoint currently checks if non-aggregator nodes have peers, but this check happens before verifying that the node has synced any blocks. This creates a scenario where:

  1. Node starts with P2P listening ✅
  2. Node has 0 peers but passes the listening check initially
  3. Eventually gets 1 peer ✅
  4. Node is stuck at block 0 due to wrong DA namespace/config
  5. Ready endpoint returns READY ❌ (incorrect!)

The logic flow is:

// Line 56-78: Current flow
if pm != nil {
    // Check P2P listening
    // Check peers for non-aggregators
}
// Line 80: GetState check happens AFTER peer check
state, err := s.GetState(r.Context())
// Line 87: Block height check
if localHeight == 0 {
    return UNREADY
}

Root Cause

The GetState() and localHeight == 0 checks occur after the P2P peer checks. When a node has 1 peer but is stuck at genesis (height 0) due to misconfiguration, it can briefly pass the peer check before the state check executes.

Recommended Fix

Reorder the checks so block production/sync verification happens before P2P checks:

// 1. First verify basic state access
state, err := s.GetState(r.Context())
if err != nil {
    http.Error(w, "UNREADY: state unavailable", http.StatusServiceUnavailable)
    return
}

// 2. Then verify the node has synced at least one block
if state.LastBlockHeight == 0 {
    http.Error(w, "UNREADY: no blocks yet", http.StatusServiceUnavailable)
    return
}

// 3. THEN check P2P (only for nodes that have blocks)
if pm != nil {
    // ... P2P checks
}

Health Check Coverage Assessment

Based on @auricom's requirements, here's the coverage status:

Requirement                    | Status         | Implementation
1. RPC server operational      | ✅ Partial     | /health/live checks store access, but doesn't validate RPC request handling
2. Aggregator producing blocks | ✅ Implemented | Lines 92-100: checks block production within 5x block time
3. Full node syncing           | ✅ Implemented | Lines 107-123: checks sync status against best known height
4. P2P network ready           | ⚠️ Partial     | Lines 56-77: checks listening + peers, but order is wrong

Missing: RPC Request Validation

Neither endpoint validates that the RPC server can actually process client requests per specifications. Consider:

  • Can it parse and respond to a basic query?
  • Are the RPC handlers properly initialized?
  • Can it serialize/deserialize proto messages?

The current /health/live only checks store access, not RPC functionality.


Code Quality Issues

1. Inconsistent Error Responses (http.go:53-127)

The ready endpoint uses inconsistent error message formats:

  • "UNREADY: failed to query P2P network info" (descriptive)
  • "UNREADY: state unavailable" (generic)
  • "UNREADY: no blocks yet" (descriptive)
  • "UNREADY: best-known height unavailable" (descriptive)

Recommendation: Standardize error messages with consistent structure including context about what failed and why.
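
For example, a small helper (the name is illustrative, not part of this PR) would enforce a consistent "UNREADY: <check>: <reason>" shape:

// unready writes a consistently structured readiness failure.
func unready(w http.ResponseWriter, check, reason string) {
    http.Error(w, fmt.Sprintf("UNREADY: %s: %s", check, reason), http.StatusServiceUnavailable)
}

// e.g. unready(w, "p2p", "failed to query network info")
//      unready(w, "state", "no blocks yet")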

2. Magic Number Without Explanation (http.go:94)

maxAllowedDelay := 5 * cfg.Node.BlockTime.Duration

The 5x multiplier is hardcoded without justification. Based on reviewer comments, @auricom suggested this might be too aggressive for chains with fast block times (250ms).

Recommendation:

  • Extract to a named constant with clear documentation: const aggregatorBlockProductionFailureThreshold = 5
  • Or make it configurable: cfg.Node.AggregatorHealthCheckFailureMultiplier
  • Document why 5x was chosen and for what block time ranges it's appropriate

3. Missing Context in Logs (http.go:92-100)

When the aggregator health check fails, there's no logging. This makes debugging production issues difficult.

Recommendation: Add structured logging similar to the gemini-code-assist suggestion but at Error level:

if timeSinceLastBlock > maxAllowedDelay {
    logger.Error().
        Dur("time_since_last_block", timeSinceLastBlock).
        Dur("max_allowed_delay", maxAllowedDelay).
        Uint64("last_block_height", state.LastBlockHeight).
        Time("last_block_time", state.LastBlockTime).
        Msg("Aggregator failed readiness check: not producing blocks at expected rate")
    http.Error(w, "UNREADY: aggregator not producing blocks at expected rate", http.StatusServiceUnavailable)
    return
}

4. Missing Nil Check (http.go:102)

if bestKnownHeightProvider == nil {
    http.Error(w, "UNREADY: best-known height unavailable", http.StatusServiceUnavailable)
    return
}

This check exists, but if it's required for readiness, why is it optional in the function signature? This suggests an architectural issue.

Recommendation: Either:

  • Make bestKnownHeightProvider required (non-nil) in RegisterCustomHTTPEndpoints
  • Or document why it might be nil and when that's acceptable

Test Quality Issues

5. Flaky Integration Test (single_sequencer_integration_test.go:422-463)

The TestReadinessEndpointWhenBlockProductionStops test uses fixed sleeps:

time.Sleep(time.Duration(config.Node.MaxPendingHeadersAndData+2) * config.Node.BlockTime.Duration)

Impact: This test can be flaky in CI environments with variable timing.

Recommendation: Follow gemini-code-assist's suggestion to use polling with require.Eventually:

require.Eventually(t, func() bool {
    resp, err := http.Get("http://" + config.RPC.Address + "/health/ready")
    if err != nil {
        return false
    }
    defer resp.Body.Close()
    return resp.StatusCode == http.StatusServiceUnavailable
}, 10*time.Second, 100*time.Millisecond, "Readiness should become UNREADY after block production stops")

6. Missing Test Coverage (server_test.go)

Missing test scenarios:

  • Aggregator with block time < 1s (to test if 5x multiplier is appropriate)
  • Node with bestKnownHeightProvider = nil
  • Race conditions between state updates and health checks
  • Transition from READY → UNREADY → READY (recovery scenarios; see the polling sketch below)
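
A rough shape for the recovery scenario, assuming the test can stop and resume block production (the stop/resume hooks are placeholders; the URL and polling pattern mirror the integration test above):

waitForReadiness := func(wantCode int, msg string) {
    require.Eventually(t, func() bool {
        resp, err := http.Get("http://" + config.RPC.Address + "/health/ready")
        if err != nil {
            return false
        }
        defer resp.Body.Close()
        return resp.StatusCode == wantCode
    }, 10*time.Second, 100*time.Millisecond, msg)
}

waitForReadiness(http.StatusOK, "node should start READY")
stopBlockProduction() // placeholder: however the test halts the aggregator
waitForReadiness(http.StatusServiceUnavailable, "node should become UNREADY")
resumeBlockProduction() // placeholder: however the test resumes block production
waitForReadiness(http.StatusOK, "node should recover to READY")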

Architecture & Design

7. Separation of Concerns (http.go:36-127)

The /health/ready endpoint has grown to 70+ lines with multiple responsibilities:

  • P2P validation
  • Block production validation (aggregator-specific)
  • Sync status validation
  • Configuration-based thresholds

Recommendation: Refactor into separate, testable health check functions:

type HealthCheck interface {
    Check(ctx context.Context) error
}

type P2PHealthCheck struct { /* ... */ }
func (c *P2PHealthCheck) Check(ctx context.Context) error { /* ... */ }

type BlockProductionHealthCheck struct { /* ... */ }
func (c *BlockProductionHealthCheck) Check(ctx context.Context) error { /* ... */ }

type SyncStatusHealthCheck struct { /* ... */ }
func (c *SyncStatusHealthCheck) Check(ctx context.Context) error { /* ... */ }

This would:

  • Enable unit testing of individual checks
  • Make it easier to add/remove checks
  • Improve code readability
  • Support dependency injection for testing
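
The ready handler could then simply compose the configured checks; a sketch using the interface above (construction of each check omitted, not the PR's code):

checks := []HealthCheck{
    &P2PHealthCheck{ /* ... */ },
    &BlockProductionHealthCheck{ /* ... */ },
    &SyncStatusHealthCheck{ /* ... */ },
}

for _, c := range checks {
    if err := c.Check(r.Context()); err != nil {
        http.Error(w, "UNREADY: "+err.Error(), http.StatusServiceUnavailable)
        return
    }
}
w.WriteHeader(http.StatusOK)
w.Write([]byte("READY"))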

8. Configuration Validation (defaults.go:42-50)

The calculateReadinessMaxBlocksBehind function has fallback defaults:

if blockTime == 0 {
    return 30 // fallback to safe default if blockTime is not set
}

Issue: If blockTime is 0, this indicates a configuration error, not a case for silent fallback.

Recommendation: Return an error or panic during config validation rather than silently using a fallback that might not be appropriate.
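
A fail-fast sketch of that validation (function name illustrative, not the PR's code):

func validateReadinessConfig(blockTime time.Duration) error {
    if blockTime <= 0 {
        return fmt.Errorf("readiness check requires a positive block time, got %s", blockTime)
    }
    return nil
}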


Security & Performance

9. No Rate Limiting on Health Endpoints

Health endpoints are typically exposed without authentication for load balancers. However, there's no rate limiting, which could enable:

  • Denial of service through health check spam
  • Resource exhaustion from concurrent requests

Recommendation: Consider adding lightweight rate limiting or request throttling for health endpoints.
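
One lightweight option is a token-bucket middleware around the health handlers, e.g. using golang.org/x/time/rate (the limits shown are arbitrary examples, not a recommendation for specific values):

// uses golang.org/x/time/rate
var healthLimiter = rate.NewLimiter(rate.Limit(10), 20) // sustained 10 req/s, bursts of 20

func withRateLimit(next http.Handler) http.Handler {
    return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
        if !healthLimiter.Allow() {
            http.Error(w, "too many requests", http.StatusTooManyRequests)
            return
        }
        next.ServeHTTP(w, r)
    })
}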

10. Blocking Store Operations (http.go:25, 80)

Both health endpoints perform blocking store operations:

  • s.Height(r.Context())
  • s.GetState(r.Context())

If the store is slow or deadlocked, health check requests will hang.

Recommendation: Add timeouts:

ctx, cancel := context.WithTimeout(r.Context(), 100*time.Millisecond)
defer cancel()
state, err := s.GetState(ctx)

Documentation Issues

11. Incomplete Documentation (docs/learn/config.md:610-637)

The documentation mentions checks performed but doesn't explain:

  • What "expected rate" means for aggregators (5x block time)
  • Why 5x was chosen
  • How to tune for different block times
  • What happens during health check failures

Recommendation: Expand documentation with:

  • Specific threshold values
  • Tuning guidance for different deployment scenarios
  • Troubleshooting guide for common health check failures

Breaking Changes Concern

12. Protobuf Deletion (CI Failure)

The CI shows a breaking change failure:

Breaking: ❌ failed (1)

The PR deletes several protobuf files:

  • proto/evnode/v1/health.proto
  • types/pb/evnode/v1/health.pb.go
  • types/pb/evnode/v1/v1connect/health.connect.go

Impact: This is a breaking API change that will affect clients using the health RPC endpoints.

Recommendation:

  • If this is intentional (migrating from gRPC to HTTP-only health checks), update the CHANGELOG to clearly mark this as a breaking change
  • Provide migration guide for existing clients
  • Consider deprecation period rather than immediate removal

Summary & Recommendations

High Priority (Must Fix)

  1. Fix ready endpoint logic order - Move block height check before P2P checks (http.go:67-88)
  2. Add logging to aggregator health check - Critical for debugging production issues
  3. Fix integration test flakiness - Use polling instead of fixed sleeps
  4. Address breaking changes - Document or provide migration path for proto deletions

Medium Priority (Should Fix)

  1. Make 5x multiplier configurable or document rationale - Current hardcoded value may not suit all chains
  2. Add timeout to store operations - Prevent health check hangs
  3. Refactor health checks - Improve testability and maintainability
  4. Expand test coverage - Add edge cases and race condition tests

Low Priority (Nice to Have)

  1. Add rate limiting - Protect against health check abuse
  2. Standardize error messages - Improve debugging experience
  3. Improve documentation - Add troubleshooting guide and tuning recommendations

Positive Aspects

  • Good test coverage - Integration tests verify the behavior end-to-end
  • Clear separation - /health/live (liveness) vs /health/ready (readiness) follows Kubernetes best practices
  • Comprehensive checks - Covers P2P, block production, and sync status
  • Configuration flexibility - ReadinessMaxBlocksBehind is configurable
  • Documentation - Added health endpoint docs to config.md


Overall Assessment: This PR makes important improvements to health checking, but has a critical logic flaw in the ready endpoint that allows nodes stuck at genesis to report as ready. The integration test confirms the aggregator scenario works, but doesn't catch the non-aggregator stuck-at-genesis issue reported by @auricom.

@gemini-code-assist
Contributor

Summary of Changes

Hello @randygrok, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request resolves issue #2643 by significantly improving the accuracy of the health endpoint for aggregator nodes. Previously, the endpoint might incorrectly report a healthy status even when a node had ceased block production. The updated logic now checks the time since the last block was produced against dynamically calculated warning and failure thresholds, derived from the node's configured block time. This ensures that operational issues related to block generation are promptly detected and reflected in the health status, providing better visibility into node performance and stability.

Highlights

  • Enhanced Health Endpoint Logic: The health endpoint ('/livez') now actively monitors block production for aggregator nodes, reporting 'WARN' or 'FAIL' statuses if blocks are not produced within configurable time thresholds.
  • Configurable Thresholds: Introduced 'healthCheckWarnMultiplier' (3x block time) and 'healthCheckFailMultiplier' (5x block time) constants to define when a node's block production is considered slow or stopped.
  • Comprehensive Testing: Added new integration tests and expanded unit tests to cover various scenarios for the health endpoint, including non-aggregator nodes, aggregator nodes with recent blocks, slow block production, stopped block production, and lazy mode.

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request effectively addresses the issue of the health endpoint not detecting stopped block production for aggregator nodes. The core logic is sound, and the changes are accompanied by a comprehensive set of unit tests and a new integration test. My feedback includes suggestions to improve the robustness of the integration test by replacing fixed-duration sleeps with polling, refactoring duplicated test setup code for better maintainability, and adjusting a log level to better reflect the severity of a health check failure.

@randygrok randygrok force-pushed the health-endpoint-block-check branch from 39dc439 to 819b015 on November 2, 2025 10:04
@codecov

codecov bot commented Nov 2, 2025

Codecov Report

❌ Patch coverage is 85.71429% with 4 lines in your changes missing coverage. Please review.
✅ Project coverage is 64.69%. Comparing base (12c2574) to head (c37cc3c).

Files with missing lines | Patch % | Lines
pkg/rpc/server/http.go   | 85.18%  | 3 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2800      +/-   ##
==========================================
+ Coverage   64.54%   64.69%   +0.14%     
==========================================
  Files          80       80              
  Lines        7176     7177       +1     
==========================================
+ Hits         4632     4643      +11     
+ Misses       2008     1999       -9     
+ Partials      536      535       -1     
Flag     | Coverage Δ
combined | 64.69% <85.71%> (+0.14%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.

randygrok and others added 3 commits November 2, 2025 11:18
Remove createCustomTestServer function which was redundant after making
setupTestServer accept optional config parameter. This eliminates 36 lines
of duplicated server setup code.

Changes:
- Make setupTestServer accept variadic config parameter
- Update all test cases to use setupTestServer directly
- Remove createCustomTestServer function entirely

Result: -21 net lines of code, improved maintainability, cleaner API.
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
…n test

Replace fixed duration time.Sleep calls with require.Eventually polling
to make the health endpoint integration test more robust and less flaky.

Changes:
- Use require.Eventually to poll for health state transitions
- Poll every 100ms instead of fixed sleeps
- Generous timeouts (5s and 10s) that terminate early on success
- Better error handling during polling

Benefits:
- More resilient to timing variations in CI/CD environments
- Faster test execution (completes as soon as conditions are met)
- Eliminates magic numbers (1700ms compensation)
- Expresses intent clearly (wait until condition is met)
- Non-flaky (tested 3x consecutively)
@tac0turtle tac0turtle previously approved these changes Nov 3, 2025
Contributor

@tac0turtle tac0turtle left a comment


utACK, we should make sure this covers @auricom's needs before merging

@randygrok randygrok added this pull request to the merge queue Nov 3, 2025
@tac0turtle tac0turtle removed this pull request from the merge queue due to a manual request Nov 3, 2025
@auricom
Contributor

auricom commented Nov 4, 2025

A healthcheck endpoint is used by infrastructure automation to determine if a process can respond to client requests.

AFAIK based on this PR, the existing endpoint did not perform substantive validation.

For a node to be considered healthy, the following conditions should be met:

1- The RPC server is operational and responds correctly to basic client queries per specifications
2- The aggregator is producing blocks at the expected rate
3- The full node is synchronizing with the network
4- The P2P network is ready to accept incoming connections (if enabled)

I cannot verify whether this change effectively validates items 1 and 4. And I may be overly cautious, so I would appreciate your expertise on these points 🧑‍💻

@tac0turtle tac0turtle dismissed their stale review November 4, 2025 15:54

dismissing so we can cover all paths that claude mentioned

@randygrok
Contributor Author

thanks @auricom, let me check those points.

By your definition, only point 2 is covered.

randygrok and others added 6 commits November 4, 2025 20:14
… HTTP endpoint

- Removed the HealthService and its related proto definitions.
- Implemented a new HTTP health check endpoint at `/health/live`.
- Updated the Client to use the new HTTP health check instead of the gRPC call.
- Enhanced health check logic to return PASS, WARN, or FAIL based on block production status.
- Modified tests to validate the new health check endpoint and its responses.
- Updated server to register custom HTTP endpoints including the new health check.
…status checks; deprecate legacy gRPC endpoint in favor of HTTP
@github-actions
Contributor

github-actions bot commented Nov 5, 2025

PR Preview Action v1.6.2

🚀 View preview at
https://evstack.github.io/docs-preview/pr-2800/

Built to branch main at 2025-11-07 14:04 UTC.
Preview will be ready when the GitHub Pages deployment is complete.

randygrok and others added 5 commits November 5, 2025 17:17
…client

feat(health): add `readyz()` and `is_ready()` methods to Rust `HealthClient`
feat(health): introduce `ReadinessStatus` enum for Rust client
refactor: update Rust client example to demonstrate liveness and readiness checks
Comment on lines +72 to +82
if !cfg.Node.Aggregator {
    peers, err := pm.GetPeers()
    if err != nil {
        http.Error(w, "UNREADY: failed to query peers", http.StatusServiceUnavailable)
        return
    }
    if len(peers) == 0 {
        http.Error(w, "UNREADY: no peers connected", http.StatusServiceUnavailable)
        return
    }
}
Contributor


I don't get this change?

Contributor Author


this was already there before.

We don't consider a non-aggregator node that has no peers to be ready.

@auricom do you think this is correct?

Contributor

@auricom auricom Nov 7, 2025


That totally makes sense to me.

Let's imagine this situation:

A gateway is running 10 fullnodes behind its https://rpc.eden.gateway.io URL.
One of the fullnodes loses connection to other peers => it will experience latency and won't be able to forward transactions.
So this node needs to be booted from the fullnode pool so that users won't interact with an unready node.

Contributor

@tac0turtle tac0turtle left a comment


this is a good start but doesn't solve the issue at hand

@randygrok randygrok marked this pull request as draft November 6, 2025 09:37
…s endpoints, update documentation and remove deprecated health client
… remove unused RPC client code and adjust e2e test parameters
@randygrok randygrok requested a review from tac0turtle November 6, 2025 17:43
@randygrok randygrok marked this pull request as ready for review November 6, 2025 17:43
@randygrok
Contributor Author

@auricom take a look specifically at the server endpoints

@auricom
Contributor

auricom commented Nov 7, 2025

the included documentation does seem to be what we want to achieve, thanks

but i'm not sure the ready endpoint is working as intended

ran a fullnode on eden testnet using ghcr.io/evstack/ev-node-evm-single:pr-2800

evm-single start  --evm.jwt-secret-file /root/jwt/jwt.hex --evm.genesis-hash 0x915128c096aac37917bd6b987190ee22d9f1c8421b990c5896a6b300795fab90 --evm.engine-url http://localhost:18551 --evm.eth-url http://localhost:18555 --evnode.da.address http://100.123.142.81:36658 --home=/root/.evm-single --evnode.rpc.address=0.0.0.0:7341 --evnode.instrumentation.prometheus --evnode.instrumentation.prometheus_listen_addr=:26670

node was configured without P2P peers and with the wrong namespaces (so it wasn't able to DA sync)

ev-node-1   | 1:03PM INF initialized syncer state chain_id=edennet-2 component=syncer da_height=7970386 height=0
ev-node-1   | 1:03PM INF syncer started component=syncer
ev-node-1   | 1:03PM INF starting process loop component=syncer
ev-node-1   | 1:03PM INF starting sync loop component=syncer
ev-node-1   | 1:03PM INF starting DA inclusion processing loop component=submitter

node was stuck on block 0 with 0 peers, but the ready endpoint reported READY

curl -v http://localhost:7331/health/ready

* Host localhost:7331 was resolved.
* IPv6: ::1
* IPv4: 127.0.0.1
*   Trying [::1]:7331...
* Established connection to localhost (::1 port 7331) from ::1 port 39896
* using HTTP/1.x
> GET /health/ready HTTP/1.1
> Host: localhost:7331
> User-Agent: curl/8.17.0
> Accept: */*
>
* Request completely sent off
< HTTP/1.1 200 OK
< Content-Type: text/plain
< Date: Fri, 07 Nov 2025 13:07:19 GMT
< Content-Length: 6
<
READY
* Connection #0 to host localhost:7331 left intact

@randygrok
Contributor Author

thanks for the message, let me check, this is helpful

@auricom
Contributor

auricom commented Nov 7, 2025

turns out those were issues coming from my workstation. Can confirm that the unready node was detected as unready by the endpoint
