
Conversation

@randygrok
Contributor

@randygrok randygrok commented Nov 2, 2025

This commit fixes issue #2643 where the health endpoint still reports OK when a node has stopped producing blocks.

Closes #2643

Overview

Needs merge and tag of tastora PR celestiaorg/tastora#149

Changes:
- Updated HealthServer to accept store, config, and logger dependencies
- Implemented block production monitoring in the Livez endpoint:
  * For aggregator nodes, checks if LastBlockTime is recent
  * Returns WARN if block production is slow (> 3x block time)
  * Returns FAIL if block production has stopped (> 5x block time)
  * Uses LazyBlockInterval for lazy mode aggregators
  * Non-aggregator nodes continue to return PASS
- Added constants for health check thresholds:
  * healthCheckWarnMultiplier = 3
  * healthCheckFailMultiplier = 5
- Added comprehensive unit tests covering all scenarios:
  Server tests (pkg/rpc/server/server_test.go):
  * Non-aggregator nodes
  * Aggregator with no blocks
  * Aggregator with recent blocks (PASS)
  * Aggregator with slow production (WARN)
  * Aggregator with stopped production (FAIL)
  * Lazy aggregator with correct thresholds
  * Error handling
  Client tests (pkg/rpc/client/client_test.go):
  * Non-aggregator returns PASS
  * Aggregator with recent blocks returns PASS
  * Aggregator with slow block production returns WARN
  * Aggregator with stopped block production returns FAIL
- Updated setupTestServer to pass new dependencies
- Added createCustomTestServer helper for testing with custom configs

The thresholds scale with the node's BlockTime or LazyBlockInterval
settings, making the health check adaptive to different node
configurations.
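
For illustration, a simplified sketch of that logic in Go (only the multiplier constants match the implementation; the function and parameter names are illustrative):

const (
    healthCheckWarnMultiplier = 3
    healthCheckFailMultiplier = 5
)

// healthStatus returns PASS, WARN, or FAIL based on how long ago the last block was produced.
// blockInterval is BlockTime for normal aggregators and LazyBlockInterval in lazy mode.
func healthStatus(isAggregator bool, blockInterval time.Duration, lastBlockTime, now time.Time) string {
    if !isAggregator {
        return "PASS" // non-aggregator nodes always pass the liveness check
    }
    elapsed := now.Sub(lastBlockTime)
    switch {
    case elapsed > healthCheckFailMultiplier*blockInterval:
        return "FAIL" // block production appears to have stopped
    case elapsed > healthCheckWarnMultiplier*blockInterval:
        return "WARN" // block production is slower than expected
    default:
        return "PASS"
    }
}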

Fixes #2643
@github-actions
Contributor

github-actions bot commented Nov 2, 2025

The latest Buf updates on your PR. Results from workflow CI and Release / buf-check (pull_request).

Build     | Format     | Lint      | Breaking      | Updated (UTC)
✅ passed | ⏩ skipped | ✅ passed | ❌ failed (1) | Nov 7, 2025, 2:04 PM

@claude
Contributor

claude bot commented Nov 2, 2025

Claude finished @randygrok's task.


Comprehensive Code Review ✓

I've completed a thorough review of this PR. Here are my findings addressing the health check issue and overall code quality:


Critical Issue: Ready Endpoint Logic Flaw

The issue reported by @auricom is valid and represents a significant flaw in the ready endpoint logic.

Problem Analysis (pkg/rpc/server/http.go:67-77)

The ready endpoint currently checks if non-aggregator nodes have peers, but this check happens before verifying that the node has synced any blocks. This creates a scenario where:

  1. Node starts with P2P listening ✅
  2. Node has 0 peers but passes the listening check initially
  3. Eventually gets 1 peer ✅
  4. Node is stuck at block 0 due to wrong DA namespace/config
  5. Ready endpoint returns READY ❌ (incorrect!)

The logic flow is:

// Line 56-78: Current flow
if pm != nil {
    // Check P2P listening
    // Check peers for non-aggregators
}
// Line 80: GetState check happens AFTER peer check
state, err := s.GetState(r.Context())
// Line 87: Block height check
if localHeight == 0 {
    return UNREADY
}

Root Cause

The GetState() and localHeight == 0 checks occur after the P2P peer checks. When a node has 1 peer but is stuck at genesis (height 0) due to misconfiguration, it can briefly pass the peer check before the state check executes.

Recommended Fix

Reorder the checks so block production/sync verification happens before P2P checks:

// 1. First verify basic state access
state, err := s.GetState(r.Context())
if err != nil {
    http.Error(w, "UNREADY: state unavailable", http.StatusServiceUnavailable)
    return
}

// 2. Then verify the node has synced at least one block
if state.LastBlockHeight == 0 {
    http.Error(w, "UNREADY: no blocks yet", http.StatusServiceUnavailable)
    return
}

// 3. THEN check P2P (only for nodes that have blocks)
if pm != nil {
    // ... P2P checks
}

Health Check Coverage Assessment

Based on @auricom's requirements, here's the coverage status:

Requirement                    | Status         | Implementation
1. RPC server operational      | ✅ Partial     | /health/live checks store access, but doesn't validate RPC request handling
2. Aggregator producing blocks | ✅ Implemented | Lines 92-100: checks block production within 5x block time
3. Full node syncing           | ✅ Implemented | Lines 107-123: checks sync status against best known height
4. P2P network ready           | ⚠️ Partial     | Lines 56-77: checks listening + peers, but order is wrong

Missing: RPC Request Validation

Neither endpoint validates that the RPC server can actually process client requests per specifications. Consider:

  • Can it parse and respond to a basic query?
  • Are the RPC handlers properly initialized?
  • Can it serialize/deserialize proto messages?

The current /health/live only checks store access, not RPC functionality.


Code Quality Issues

1. Inconsistent Error Responses (http.go:53-127)

The ready endpoint uses inconsistent error message formats:

  • "UNREADY: failed to query P2P network info" (descriptive)
  • "UNREADY: state unavailable" (generic)
  • "UNREADY: no blocks yet" (descriptive)
  • "UNREADY: best-known height unavailable" (descriptive)

Recommendation: Standardize error messages with consistent structure including context about what failed and why.
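
For example, a small helper (the name is illustrative, not part of this PR) would enforce a consistent "UNREADY: <check>: <reason>" shape:

// unready writes a consistently structured readiness failure.
func unready(w http.ResponseWriter, check, reason string) {
    http.Error(w, fmt.Sprintf("UNREADY: %s: %s", check, reason), http.StatusServiceUnavailable)
}

// e.g. unready(w, "p2p", "failed to query network info")
//      unready(w, "state", "no blocks yet")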

2. Magic Number Without Explanation (http.go:94)

maxAllowedDelay := 5 * cfg.Node.BlockTime.Duration

The 5x multiplier is hardcoded without justification. Based on reviewer comments, @auricom suggested this might be too aggressive for chains with fast block times (250ms).

Recommendation:

  • Extract to a named constant with clear documentation: const aggregatorBlockProductionFailureThreshold = 5
  • Or make it configurable: cfg.Node.AggregatorHealthCheckFailureMultiplier
  • Document why 5x was chosen and for what block time ranges it's appropriate

3. Missing Context in Logs (http.go:92-100)

When the aggregator health check fails, there's no logging. This makes debugging production issues difficult.

Recommendation: Add structured logging similar to the gemini-code-assist suggestion but at Error level:

if timeSinceLastBlock > maxAllowedDelay {
    logger.Error().
        Dur("time_since_last_block", timeSinceLastBlock).
        Dur("max_allowed_delay", maxAllowedDelay).
        Uint64("last_block_height", state.LastBlockHeight).
        Time("last_block_time", state.LastBlockTime).
        Msg("Aggregator failed readiness check: not producing blocks at expected rate")
    http.Error(w, "UNREADY: aggregator not producing blocks at expected rate", http.StatusServiceUnavailable)
    return
}

4. Missing Nil Check (http.go:102)

if bestKnownHeightProvider == nil {
    http.Error(w, "UNREADY: best-known height unavailable", http.StatusServiceUnavailable)
    return
}

This check exists, but if it's required for readiness, why is it optional in the function signature? This suggests an architectural issue.

Recommendation: Either:

  • Make bestKnownHeightProvider required (non-nil) in RegisterCustomHTTPEndpoints
  • Or document why it might be nil and when that's acceptable

Test Quality Issues

5. Flaky Integration Test (single_sequencer_integration_test.go:422-463)

The TestReadinessEndpointWhenBlockProductionStops test uses fixed sleeps:

time.Sleep(time.Duration(config.Node.MaxPendingHeadersAndData+2) * config.Node.BlockTime.Duration)

Impact: This test can be flaky in CI environments with variable timing.

Recommendation: Follow gemini-code-assist's suggestion to use polling with require.Eventually:

require.Eventually(t, func() bool {
    resp, err := http.Get("http://" + config.RPC.Address + "/health/ready")
    if err != nil {
        return false
    }
    defer resp.Body.Close()
    return resp.StatusCode == http.StatusServiceUnavailable
}, 10*time.Second, 100*time.Millisecond, "Readiness should become UNREADY after block production stops")

6. Missing Test Coverage (server_test.go)

Missing test scenarios:

  • Aggregator with block time < 1s (to test if 5x multiplier is appropriate)
  • Node with bestKnownHeightProvider = nil
  • Race conditions between state updates and health checks
  • Transition from READY → UNREADY → READY (recovery scenarios; see the polling sketch below)
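
A rough shape for the recovery scenario, assuming the test can stop and resume block production (the stop/resume hooks are placeholders; the URL and polling pattern mirror the integration test above):

waitForReadiness := func(wantCode int, msg string) {
    require.Eventually(t, func() bool {
        resp, err := http.Get("http://" + config.RPC.Address + "/health/ready")
        if err != nil {
            return false
        }
        defer resp.Body.Close()
        return resp.StatusCode == wantCode
    }, 10*time.Second, 100*time.Millisecond, msg)
}

waitForReadiness(http.StatusOK, "node should start READY")
stopBlockProduction() // placeholder: however the test halts the aggregator
waitForReadiness(http.StatusServiceUnavailable, "node should become UNREADY")
resumeBlockProduction() // placeholder: however the test resumes block production
waitForReadiness(http.StatusOK, "node should recover to READY")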

Architecture & Design

7. Separation of Concerns (http.go:36-127)

The /health/ready endpoint has grown to 70+ lines with multiple responsibilities:

  • P2P validation
  • Block production validation (aggregator-specific)
  • Sync status validation
  • Configuration-based thresholds

Recommendation: Refactor into separate, testable health check functions:

type HealthCheck interface {
    Check(ctx context.Context) error
}

type P2PHealthCheck struct { /* ... */ }
func (c *P2PHealthCheck) Check(ctx context.Context) error { /* ... */ }

type BlockProductionHealthCheck struct { /* ... */ }
func (c *BlockProductionHealthCheck) Check(ctx context.Context) error { /* ... */ }

type SyncStatusHealthCheck struct { /* ... */ }
func (c *SyncStatusHealthCheck) Check(ctx context.Context) error { /* ... */ }

This would:

  • Enable unit testing of individual checks
  • Make it easier to add/remove checks
  • Improve code readability
  • Support dependency injection for testing
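
The ready handler could then simply compose the configured checks; a sketch using the interface above (construction of each check omitted, not the PR's code):

checks := []HealthCheck{
    &P2PHealthCheck{ /* ... */ },
    &BlockProductionHealthCheck{ /* ... */ },
    &SyncStatusHealthCheck{ /* ... */ },
}

for _, c := range checks {
    if err := c.Check(r.Context()); err != nil {
        http.Error(w, "UNREADY: "+err.Error(), http.StatusServiceUnavailable)
        return
    }
}
w.WriteHeader(http.StatusOK)
w.Write([]byte("READY"))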

8. Configuration Validation (defaults.go:42-50)

The calculateReadinessMaxBlocksBehind function has fallback defaults:

if blockTime == 0 {
    return 30 // fallback to safe default if blockTime is not set
}

Issue: If blockTime is 0, this indicates a configuration error, not a case for silent fallback.

Recommendation: Return an error or panic during config validation rather than silently using a fallback that might not be appropriate.
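
A fail-fast sketch of that validation (function name illustrative, not the PR's code):

func validateReadinessConfig(blockTime time.Duration) error {
    if blockTime <= 0 {
        return fmt.Errorf("readiness check requires a positive block time, got %s", blockTime)
    }
    return nil
}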


Security & Performance

9. No Rate Limiting on Health Endpoints

Health endpoints are typically exposed without authentication for load balancers. However, there's no rate limiting, which could enable:

  • Denial of service through health check spam
  • Resource exhaustion from concurrent requests

Recommendation: Consider adding lightweight rate limiting or request throttling for health endpoints.
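
One lightweight option is a token-bucket middleware around the health handlers, e.g. using golang.org/x/time/rate (the limits shown are arbitrary examples, not a recommendation for specific values):

// uses golang.org/x/time/rate
var healthLimiter = rate.NewLimiter(rate.Limit(10), 20) // sustained 10 req/s, bursts of 20

func withRateLimit(next http.Handler) http.Handler {
    return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
        if !healthLimiter.Allow() {
            http.Error(w, "too many requests", http.StatusTooManyRequests)
            return
        }
        next.ServeHTTP(w, r)
    })
}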

10. Blocking Store Operations (http.go:25, 80)

Both health endpoints perform blocking store operations:

  • s.Height(r.Context())
  • s.GetState(r.Context())

If the store is slow or deadlocked, health check requests will hang.

Recommendation: Add timeouts:

ctx, cancel := context.WithTimeout(r.Context(), 100*time.Millisecond)
defer cancel()
state, err := s.GetState(ctx)

Documentation Issues

11. Incomplete Documentation (docs/learn/config.md:610-637)

The documentation mentions checks performed but doesn't explain:

  • What "expected rate" means for aggregators (5x block time)
  • Why 5x was chosen
  • How to tune for different block times
  • What happens during health check failures

Recommendation: Expand documentation with:

  • Specific threshold values
  • Tuning guidance for different deployment scenarios
  • Troubleshooting guide for common health check failures

Breaking Changes Concern

12. Protobuf Deletion (CI Failure)

The CI shows a breaking change failure:

Breaking: ❌ failed (1)

The PR deletes several protobuf files:

  • proto/evnode/v1/health.proto
  • types/pb/evnode/v1/health.pb.go
  • types/pb/evnode/v1/v1connect/health.connect.go

Impact: This is a breaking API change that will affect clients using the health RPC endpoints.

Recommendation:

  • If this is intentional (migrating from gRPC to HTTP-only health checks), update the CHANGELOG to clearly mark this as a breaking change
  • Provide migration guide for existing clients
  • Consider deprecation period rather than immediate removal

Summary & Recommendations

High Priority (Must Fix)

  1. Fix ready endpoint logic order - Move block height check before P2P checks (http.go:67-88)
  2. Add logging to aggregator health check - Critical for debugging production issues
  3. Fix integration test flakiness - Use polling instead of fixed sleeps
  4. Address breaking changes - Document or provide migration path for proto deletions

Medium Priority (Should Fix)

  1. Make 5x multiplier configurable or document rationale - Current hardcoded value may not suit all chains
  2. Add timeout to store operations - Prevent health check hangs
  3. Refactor health checks - Improve testability and maintainability
  4. Expand test coverage - Add edge cases and race condition tests

Low Priority (Nice to Have)

  1. Add rate limiting - Protect against health check abuse
  2. Standardize error messages - Improve debugging experience
  3. Improve documentation - Add troubleshooting guide and tuning recommendations

Positive Aspects

  • Good test coverage - Integration tests verify the behavior end-to-end
  • Clear separation - /health/live (liveness) vs /health/ready (readiness) follows Kubernetes best practices
  • Comprehensive checks - Covers P2P, block production, and sync status
  • Configuration flexibility - ReadinessMaxBlocksBehind is configurable
  • Documentation - Added health endpoint docs to config.md


Overall Assessment: This PR makes important improvements to health checking, but has a critical logic flaw in the ready endpoint that allows nodes stuck at genesis to report as ready. The integration test confirms the aggregator scenario works, but doesn't catch the non-aggregator stuck-at-genesis issue reported by @auricom.

@gemini-code-assist
Contributor

Summary of Changes

Hello @randygrok, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request resolves issue #2643 by significantly improving the accuracy of the health endpoint for aggregator nodes. Previously, the endpoint might incorrectly report a healthy status even when a node had ceased block production. The updated logic now checks the time since the last block was produced against dynamically calculated warning and failure thresholds, derived from the node's configured block time. This ensures that operational issues related to block generation are promptly detected and reflected in the health status, providing better visibility into node performance and stability.

Highlights

  • Enhanced Health Endpoint Logic: The health endpoint ('/livez') now actively monitors block production for aggregator nodes, reporting 'WARN' or 'FAIL' statuses if blocks are not produced within configurable time thresholds.
  • Configurable Thresholds: Introduced 'healthCheckWarnMultiplier' (3x block time) and 'healthCheckFailMultiplier' (5x block time) constants to define when a node's block production is considered slow or stopped.
  • Comprehensive Testing: Added new integration tests and expanded unit tests to cover various scenarios for the health endpoint, including non-aggregator nodes, aggregator nodes with recent blocks, slow block production, stopped block production, and lazy mode.

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request effectively addresses the issue of the health endpoint not detecting stopped block production for aggregator nodes. The core logic is sound, and the changes are accompanied by a comprehensive set of unit tests and a new integration test. My feedback includes suggestions to improve the robustness of the integration test by replacing fixed-duration sleeps with polling, refactoring duplicated test setup code for better maintainability, and adjusting a log level to better reflect the severity of a health check failure.

@randygrok randygrok force-pushed the health-endpoint-block-check branch from 39dc439 to 819b015 on November 2, 2025 10:04
@codecov

codecov bot commented Nov 2, 2025

Codecov Report

❌ Patch coverage is 85.71429% with 4 lines in your changes missing coverage. Please review.
✅ Project coverage is 64.69%. Comparing base (12c2574) to head (c37cc3c).

Files with missing lines | Patch % | Lines
pkg/rpc/server/http.go   | 85.18%  | 3 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2800      +/-   ##
==========================================
+ Coverage   64.54%   64.69%   +0.14%     
==========================================
  Files          80       80              
  Lines        7176     7177       +1     
==========================================
+ Hits         4632     4643      +11     
+ Misses       2008     1999       -9     
+ Partials      536      535       -1     
Flag     | Coverage Δ
combined | 64.69% <85.71%> (+0.14%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.

randygrok and others added 3 commits November 2, 2025 11:18
Remove createCustomTestServer function which was redundant after making
setupTestServer accept optional config parameter. This eliminates 36 lines
of duplicated server setup code.

Changes:
- Make setupTestServer accept variadic config parameter
- Update all test cases to use setupTestServer directly
- Remove createCustomTestServer function entirely

Result: -21 net lines of code, improved maintainability, cleaner API.
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
…n test

Replace fixed duration time.Sleep calls with require.Eventually polling
to make the health endpoint integration test more robust and less flaky.

Changes:
- Use require.Eventually to poll for health state transitions
- Poll every 100ms instead of fixed sleeps
- Generous timeouts (5s and 10s) that terminate early on success
- Better error handling during polling

Benefits:
- More resilient to timing variations in CI/CD environments
- Faster test execution (completes as soon as conditions are met)
- Eliminates magic numbers (1700ms compensation)
- Expresses intent clearly (wait until condition is met)
- Non-flaky (tested 3x consecutively)
@tac0turtle tac0turtle previously approved these changes Nov 3, 2025
Contributor

@tac0turtle tac0turtle left a comment


utACK, we should make sure this covers @auricom's needs before merging

@randygrok randygrok added this pull request to the merge queue Nov 3, 2025
@tac0turtle tac0turtle removed this pull request from the merge queue due to a manual request Nov 3, 2025
@auricom
Contributor

auricom commented Nov 4, 2025

A healthcheck endpoint is used by infrastructure automation to determine if a process can respond to client requests.

AFAIK based on this PR, the existing endpoint did not perform substantive validation.

For a node to be considered healthy, the following conditions should be met:

1- The RPC server is operational and responds correctly to basic client queries per specifications
2- The aggregator is producing blocks at the expected rate
3- The full node is synchronizing with the network
4- The P2P network is ready to accept incoming connections (if enabled)

I cannot verify whether this change effectively validates items 1 and 4. And I may be overly cautious, so I would appreciate your expertise on these points 🧑‍💻

@tac0turtle tac0turtle dismissed their stale review November 4, 2025 15:54

dismissing so we can cover all paths that claude mentioned

@randygrok
Contributor Author

thanks @auricom, let me check those points.

By your definition, only point 2 is covered.

randygrok and others added 6 commits November 4, 2025 20:14
… HTTP endpoint

- Removed the HealthService and its related proto definitions.
- Implemented a new HTTP health check endpoint at `/health/live`.
- Updated the Client to use the new HTTP health check instead of the gRPC call.
- Enhanced health check logic to return PASS, WARN, or FAIL based on block production status.
- Modified tests to validate the new health check endpoint and its responses.
- Updated server to register custom HTTP endpoints including the new health check.
…status checks; deprecate legacy gRPC endpoint in favor of HTTP
@github-actions
Contributor

github-actions bot commented Nov 5, 2025

PR Preview Action v1.6.2

🚀 View preview at
https://evstack.github.io/docs-preview/pr-2800/

Built to branch main at 2025-11-07 14:04 UTC.
Preview will be ready when the GitHub Pages deployment is complete.

randygrok and others added 5 commits November 5, 2025 17:17
…client

feat(health): add `readyz()` and `is_ready()` methods to Rust `HealthClient`
feat(health): introduce `ReadinessStatus` enum for Rust client
refactor: update Rust client example to demonstrate liveness and readiness checks
Comment on lines +72 to +82
if !cfg.Node.Aggregator {
    peers, err := pm.GetPeers()
    if err != nil {
        http.Error(w, "UNREADY: failed to query peers", http.StatusServiceUnavailable)
        return
    }
    if len(peers) == 0 {
        http.Error(w, "UNREADY: no peers connected", http.StatusServiceUnavailable)
        return
    }
}
Contributor


I don't get this change?

Contributor Author


this was already there before.

We don't consider a non-aggregator node that has no peers to be ready.

@auricom do you think this is correct?

Contributor

@auricom auricom Nov 7, 2025


That totally makes sense to me.

Let's imagine this situation:

A gateway is running 10 fullnodes behind its https://rpc.eden.gateway.io URL.
One of the fullnodes loses connection to other peers => it will experience latency and won't be able to forward transactions.
So this node needs to be booted from the fullnode pool so that users won't interact with an unready node.

Contributor

@tac0turtle tac0turtle left a comment


this is a good start but doesn't solve the issue at hand

@randygrok randygrok marked this pull request as draft November 6, 2025 09:37
…s endpoints, update documentation and remove deprecated health client
… remove unused RPC client code and adjust e2e test parameters
@randygrok randygrok requested a review from tac0turtle November 6, 2025 17:43
@randygrok randygrok marked this pull request as ready for review November 6, 2025 17:43
@randygrok
Contributor Author

@auricom take a look specifically at the server endpoints

@auricom
Contributor

auricom commented Nov 7, 2025

the included documentation does seem to be what we want to achieve, thanks

but i'm not sure the ready endpoint is working as intended

ran a fullnode on eden testnet using ghcr.io/evstack/ev-node-evm-single:pr-2800

evm-single start  --evm.jwt-secret-file /root/jwt/jwt.hex --evm.genesis-hash 0x915128c096aac37917bd6b987190ee22d9f1c8421b990c5896a6b300795fab90 --evm.engine-url http://localhost:18551 --evm.eth-url http://localhost:18555 --evnode.da.address http://100.123.142.81:36658 --home=/root/.evm-single --evnode.rpc.address=0.0.0.0:7341 --evnode.instrumentation.prometheus --evnode.instrumentation.prometheus_listen_addr=:26670

node was configured without P2P peers and with the wrong namespaces (so it wasn't able to DA sync)

ev-node-1   | 1:03PM INF initialized syncer state chain_id=edennet-2 component=syncer da_height=7970386 height=0
ev-node-1   | 1:03PM INF syncer started component=syncer
ev-node-1   | 1:03PM INF starting process loop component=syncer
ev-node-1   | 1:03PM INF starting sync loop component=syncer
ev-node-1   | 1:03PM INF starting DA inclusion processing loop component=submitter

node was stuck on block 0 with 0 peers, but the ready endpoint reported READY

curl -v http://localhost:7331/health/ready

* Host localhost:7331 was resolved.
* IPv6: ::1
* IPv4: 127.0.0.1
*   Trying [::1]:7331...
* Established connection to localhost (::1 port 7331) from ::1 port 39896
* using HTTP/1.x
> GET /health/ready HTTP/1.1
> Host: localhost:7331
> User-Agent: curl/8.17.0
> Accept: */*
>
* Request completely sent off
< HTTP/1.1 200 OK
< Content-Type: text/plain
< Date: Fri, 07 Nov 2025 13:07:19 GMT
< Content-Length: 6
<
READY
* Connection #0 to host localhost:7331 left intact

@randygrok
Contributor Author

thanks for the message, let me check, this is helpful

@auricom
Contributor

auricom commented Nov 7, 2025

turns out those were issues coming from my workstation. Can confirm that the unready node was detected as unready by the endpoint
