
Conversation


@alpe (Contributor) commented Oct 10, 2025

🚧 WIP - early version

E2E test: test/e2e/failover_e2e_test.go

Overview


github-actions bot commented Oct 10, 2025

The latest Buf updates on your PR. Results from workflow CI and Release / buf-check (pull_request).

Build | Format | Lint | Breaking | Updated (UTC)
--- | --- | --- | --- | ---
✅ passed | ⏩ skipped | ✅ passed | ✅ passed | Oct 13, 2025, 11:42 AM


claude bot commented Oct 10, 2025

Claude finished @alpe's task.


Comprehensive Code Review: HA Failover with Raft

I've completed a detailed review of the HA failover implementation. Below are my findings organized by focus area.


1. Code Quality & Architecture ⚠️

Strengths

  • Clean separation of concerns: Raft logic isolated in pkg/raft/, failover orchestration in node/failover.go, and syncing integration in block/internal/syncing/
  • Interface-based design: Good use of interfaces like leaderElection, clusterClient, and RaftNode for testability
  • Modular failover state: failoverState struct cleanly encapsulates components that need reset during mode switches

Issues & Recommendations

Critical: Inconsistent nil checks for RaftNode

Location: block/internal/syncing/syncer.go:112-114

if raftNode != nil && !reflect.ValueOf(raftNode).IsNil() {
    s.raftRetriever = newRaftRetriever(raftNode, genesis, logger, eventProcessorFn(s.pipeEvent))
}

Problem: Using reflection for nil checks is a code smell. Since raftNode is an interface, the double check suggests callers sometimes pass a typed nil: a nil concrete pointer wrapped in a non-nil interface value.

Recommendation:

  • Make RaftNode a concrete type pointer or use a sentinel value pattern
  • If it must be an interface, document why double nil-checking is necessary
  • Consider: have the constructor return a plain nil interface when no Raft node is configured, or assert to the concrete type before checking; the sketch below illustrates the typed-nil pitfall

Similar issue in: block/internal/syncing/syncer.go:177 and other locations
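
For context, a minimal, self-contained illustration of the typed-nil pitfall the reflection check works around (names here are illustrative, not from the PR): a nil concrete pointer stored in an interface makes the interface itself non-nil, so a plain raftNode != nil check passes unexpectedly.

package main

import "fmt"

// RaftNode stands in for the interface the syncer receives; illustrative only.
type RaftNode interface{ IsLeader() bool }

type node struct{}

func (n *node) IsLeader() bool { return n != nil }

func main() {
	var concrete *node           // typed nil
	var viaIface RaftNode = concrete
	fmt.Println(viaIface == nil) // false: the interface carries a (*node)(nil)

	var none RaftNode            // plain nil interface
	fmt.Println(none == nil)     // true: this is what a simple nil check expects
}

Having constructors return the untyped nil interface in the "no raft" case lets callers keep the simple check and drop the reflection.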


Major: JSON serialization for production consensus

Location: pkg/raft/types.go:3-10 and pkg/raft/node.go:177-180

// todo: refactor to use proto
type RaftBlockState struct {
    Height    uint64
    Hash      []byte
    Timestamp uint64
    Header    []byte
    Data      []byte
}

Problem:

  • JSON is inefficient for binary data (base64 encoding overhead)
  • No schema evolution strategy
  • Performance impact on Raft consensus
  • Author's comment acknowledges this should use protobuf

Recommendation:

  1. Priority: Switch to protobuf before merging (as intended)
  2. Add protobuf definitions in types/pb/
  3. Benefits: smaller message size, schema evolution, type safety
  4. Consider: Add benchmark comparing JSON vs protobuf serialization time
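
As a seed for item 4, a rough benchmark sketch for the JSON side only; the protobuf half can be added once the generated type exists, and the payload sizes below are placeholders rather than measured block sizes.

package raft

import (
	"encoding/json"
	"testing"
)

// BenchmarkRaftBlockStateJSON measures the current JSON encoding cost of a
// block state; compare against the protobuf encoding after the migration.
func BenchmarkRaftBlockStateJSON(b *testing.B) {
	state := &RaftBlockState{
		Height:    42,
		Hash:      make([]byte, 32),
		Timestamp: 1_700_000_000,
		Header:    make([]byte, 1<<10),   // placeholder header size
		Data:      make([]byte, 256<<10), // placeholder data size
	}
	b.ReportAllocs()
	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		if _, err := json.Marshal(state); err != nil {
			b.Fatal(err)
		}
	}
}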

Major: Hardcoded HTTP endpoints for maintenance

Location: pkg/rpc/server/http.go:92-174

mux.HandleFunc("/raft/join", func(w http.ResponseWriter, r *http.Request) {
    // ...
    var rsp struct {
        NodeID  string `json:"id"`
        Address string `json:"address"`
    }
    // ...
})

Author's note: "quick solution to add some maintenance function for the raft cluster. Probably proto makes more sense"

Issues:

  • No authentication/authorization on cluster management endpoints
  • Inconsistent with rest of API architecture
  • Manual JSON marshaling instead of using typed handlers
  • Commented-out leader check at line 130-133

Recommendations:

  1. Security: Add authentication before production (see Security section)
  2. Architecture: Use gRPC/protobuf for consistency
  3. Validation: Uncomment and enforce leader-only operations where appropriate
  4. Consider: Rate limiting to prevent DoS on cluster operations
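
As a stopgap for items 1 and 4, a hedged sketch of a pre-shared-token guard that could wrap the /raft/* handlers until a proper auth story lands; the header scheme, secret source, and package name are assumptions, and this does not replace the mutual TLS discussed in the Security section.

package server

import (
	"crypto/subtle"
	"net/http"
	"strings"
)

// requireClusterToken is a hypothetical guard for the /raft/* handlers: it
// rejects requests whose bearer token does not match a pre-shared cluster
// secret. A stopgap only; transport security still needs mutual TLS.
func requireClusterToken(secret string, next http.HandlerFunc) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		const prefix = "Bearer "
		auth := r.Header.Get("Authorization")
		token := strings.TrimPrefix(auth, prefix)
		if !strings.HasPrefix(auth, prefix) ||
			subtle.ConstantTimeCompare([]byte(token), []byte(secret)) != 1 {
			http.Error(w, "unauthorized", http.StatusUnauthorized)
			return
		}
		next(w, r)
	}
}

// Usage (handler and config names are assumptions):
// mux.HandleFunc("/raft/join", requireClusterToken(cfg.ClusterSecret, joinHandler))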

Moderate: Election timing constants need review

Location: pkg/raft/election.go:44

ticker := time.NewTicker(300 * time.Millisecond)

Issue: Hardcoded polling interval for leader election detection

Location: pkg/raft/node.go:63

raftConfig.LeaderLeaseTimeout = cfg.HeartbeatTimeout / 2

Issue: Hardcoded ratio without justification

Recommendations:

  • Make polling interval configurable
  • Document why HeartbeatTimeout / 2 is chosen for lease timeout
  • Consider: Test these values under network partition scenarios

Minor: Goroutine cleanup pattern inconsistency

Location: pkg/raft/election.go:22-26

defer func() {
    workerCancel()
    wg.Wait()
    close(errCh)
}()

✅ Good: Proper cleanup in RunWithElection

But: block/internal/syncing/raft_retriever.go:71-80 has different pattern

func (r *raftRetriever) Stop() {
    r.mtx.Lock()
    if r.cancel != nil {
        r.cancel()
        r.cancel = nil
    }
    r.mtx.Unlock()
    r.wg.Wait()
}

Issue: Unlocking before wg.Wait() could allow a race if Start() is called concurrently

Recommendation: Hold the lock until cleanup completes, or use an atomic state machine; one possible shape is sketched below
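
One possible shape of that fix, assuming Start() acquires the same mutex and the worker goroutines never take it themselves (otherwise holding the lock across Wait() would deadlock):

// Sketch only: keeps the mutex held through Wait so Start cannot interleave.
func (r *raftRetriever) Stop() {
	r.mtx.Lock()
	defer r.mtx.Unlock()
	if r.cancel == nil {
		return // never started or already stopped
	}
	r.cancel()
	r.cancel = nil
	r.wg.Wait() // safe only if workers never acquire r.mtx
}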


2. Error Handling & Resilience ⚠️

Issues

Critical: Silent failure in FSM.Apply

Location: pkg/raft/node.go:254-258

if f.applyCh != nil {
    select {
    case f.applyCh <- RaftApplyMsg{Index: log.Index, State: &state}:
    default:
        f.logger.Warn().Msg("apply channel full, dropping message")
    }
}

Problem:

  • Dropping Raft consensus messages silently breaks consistency guarantees
  • Follower nodes may miss critical state updates
  • No backpressure mechanism

Recommendations:

  1. Critical: Block until the channel accepts (Raft applies entries sequentially, so ordering is preserved); see the sketch after this list
  2. Or: Return error from Apply() to signal backpressure to Raft
  3. Add metric for dropped messages (if dropping is acceptable)
  4. Increase channel buffer size based on expected throughput
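
A sketch of recommendation 1, bounded by an FSM shutdown signal so a wedged consumer still has an explicit escape hatch; the shutdownCh field is an assumption, not something present in the PR.

if f.applyCh != nil {
	msg := RaftApplyMsg{Index: log.Index, State: &state}
	select {
	case f.applyCh <- msg:
	case <-f.shutdownCh: // hypothetical field closed when the FSM stops
		// Only bail out when the FSM itself is shutting down, so consensus
		// messages are never silently dropped under normal operation.
		f.logger.Warn().Uint64("index", log.Index).Msg("fsm shutting down, apply message not delivered")
	}
}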

Major: Leadership loss kills entire node

Location: pkg/raft/election.go:60-63

} else if !isLeader && isCurrentlyLeader { // lost leadership
    workerCancel()
    n.logger.Info().Msg("lost leadership")
    return ErrLeadershipLost
}

Author's note: "The node is shutdown in the downstream code to ensure clean state"

Problem:

  • Forces node restart on every leadership change
  • Increases downtime during failover
  • No graceful degradation to follower mode

Context from code: This seems intentional but aggressive

Recommendations:

  1. Consider: Transition to follower mode instead of exiting
  2. If restart is necessary, document why in code comments (state corruption risk?)
  3. Add health check to verify state consistency before rejoining
  4. Test: What happens under frequent leadership changes (network flapping)?

Major: Missing context propagation in failover

Location: node/failover.go:165-230

func (f *failoverState) Run(ctx context.Context) (multiErr error) {
    // ...
    defer f.rpcServer.Shutdown(context.Background()) // nolint: errcheck
    // ...
    defer func() {
        if err := f.headerSyncService.Stop(context.Background()); err != nil && !errors.Is(err, context.Canceled) {

Problem: Using context.Background() in cleanup ignores parent cancellation timeout

Recommendation:

shutdownCtx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
defer cancel()
defer f.rpcServer.Shutdown(shutdownCtx)

Moderate: Retry logic without exponential backoff

Location: block/internal/syncing/syncer.go:530-556

for attempt := 1; attempt <= common.MaxRetriesBeforeHalt; attempt++ {
    // ...
    select {
    case <-time.After(common.MaxRetriesTimeout):
        continue
    // ...
}

Issue: Fixed retry timeout may not be optimal

Recommendation: Implement exponential backoff for transient failures
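
A sketch of what that could look like, reusing the existing loop bounds; the base delay, cap, and jitter are illustrative and math/rand is assumed for the jitter.

// Exponential backoff with jitter instead of a fixed common.MaxRetriesTimeout.
backoff := 500 * time.Millisecond
const maxBackoff = 30 * time.Second
for attempt := 1; attempt <= common.MaxRetriesBeforeHalt; attempt++ {
	// ... attempt the operation, break on success ...
	delay := backoff + time.Duration(rand.Int63n(int64(backoff/2))) // add jitter
	select {
	case <-ctx.Done():
		return ctx.Err()
	case <-time.After(delay):
	}
	if backoff *= 2; backoff > maxBackoff {
		backoff = maxBackoff
	}
}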


Minor: Error wrapping inconsistency

Location: Mixed use of fmt.Errorf and errors.Join

Recommendation: Establish pattern in CLAUDE.md for consistency


3. Security 🚨

Critical Issues

Authentication missing on cluster management endpoints

Location: pkg/rpc/server/http.go:111-150

mux.HandleFunc("/raft/join", func(w http.ResponseWriter, r *http.Request) {
    // No authentication!
    // ...
    if err := raftNode.AddPeer(rsp.NodeID, rsp.Address); err != nil {

Risk:

  • Severity: CRITICAL
  • Any network actor can add/remove Raft peers
  • Byzantine actors can join cluster and corrupt consensus
  • Cluster membership attacks

Recommendations:

  1. MUST: Add mutual TLS for Raft communication
  2. MUST: Require authentication token for /raft/* endpoints
  3. MUST: Validate peer certificates against allowlist
  4. Consider: Pre-shared cluster secret for bootstrap
  5. Reference: Consul's gossip encryption + ACLs

Raft transport uses TCP without encryption

Location: pkg/raft/node.go:90-93

transport, err := raft.NewTCPTransport(cfg.RaftAddr, addr, 3, 10*time.Second, os.Stderr)
if err != nil {
    return nil, fmt.Errorf("create transport: %w", err)
}

Risk:

  • Severity: HIGH
  • Raft messages (including block data) transmitted in plaintext
  • Man-in-the-middle attacks possible
  • Eavesdropping on consensus

Recommendation:

  1. Wrap the Raft transport in TLS via a custom raft.StreamLayer passed to raft.NewNetworkTransport (hashicorp/raft does not ship a TLS transport out of the box); see the sketch after this list
  2. Mutual TLS between all Raft peers
  3. Rotate certificates regularly
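
A sketch of such a stream layer, under the assumption that certificates are provisioned out of band; hraft aliases github.com/hashicorp/raft to avoid clashing with the local package name.

package raft

import (
	"crypto/tls"
	"net"
	"time"

	hraft "github.com/hashicorp/raft"
)

// tlsStreamLayer wraps Raft peer connections in TLS; peers dial and accept
// over crypto/tls using a shared mutual-TLS configuration.
type tlsStreamLayer struct {
	listener net.Listener // created with tls.Listen
	config   *tls.Config  // mutual-TLS config shared by all peers
}

func (t *tlsStreamLayer) Dial(address hraft.ServerAddress, timeout time.Duration) (net.Conn, error) {
	dialer := &net.Dialer{Timeout: timeout}
	return tls.DialWithDialer(dialer, "tcp", string(address), t.config)
}

func (t *tlsStreamLayer) Accept() (net.Conn, error) { return t.listener.Accept() }
func (t *tlsStreamLayer) Close() error              { return t.listener.Close() }
func (t *tlsStreamLayer) Addr() net.Addr            { return t.listener.Addr() }

// newTLSTransport shows how the layer would replace raft.NewTCPTransport.
func newTLSTransport(bindAddr string, tlsCfg *tls.Config) (*hraft.NetworkTransport, error) {
	ln, err := tls.Listen("tcp", bindAddr, tlsCfg)
	if err != nil {
		return nil, err
	}
	layer := &tlsStreamLayer{listener: ln, config: tlsCfg}
	return hraft.NewNetworkTransport(layer, 3, 10*time.Second, nil), nil
}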

Input validation gaps

Location: pkg/rpc/server/http.go:126-129

if rsp.NodeID == "" || rsp.Address == "" {
    http.Error(w, "id and address are required", http.StatusBadRequest)
    return
}

Issue: No validation of address format, could inject malicious addresses

Recommendation:

// Validate address format
if _, err := net.ResolveTCPAddr("tcp", rsp.Address); err != nil {
    http.Error(w, "invalid address format", http.StatusBadRequest)
    return
}
// Validate nodeID format (alphanumeric, length limits)
if !isValidNodeID(rsp.NodeID) {
    http.Error(w, "invalid node ID", http.StatusBadRequest)
    return
}

Hardcoded log level exposure

Location: pkg/raft/node.go:61

raftConfig.LogLevel = "INFO"

Minor: Should be configurable, avoid leaking internal state in production


4. Performance & Resource Efficiency ⚙️

Issues

Moderate: JSON marshaling in hot path

Location: pkg/raft/node.go:177-186

func (n *Node) Broadcast(ctx context.Context, state *RaftBlockState) error {
    data, err := json.Marshal(state) // Hot path for every block!

Impact:

  • JSON serialization on every block broadcast
  • CPU overhead scales with block size
  • Already noted as TODO for protobuf

Measurement needed: Profile block production under load


Moderate: Unbounded channel could cause memory pressure

Location: block/internal/syncing/syncer.go:108

heightInCh:  make(chan common.DAHeightEvent, 10_000),

Issue: Large buffer (10k events) could consume significant memory

Calculation: If average event is 100KB, buffer could use 1GB RAM

Recommendation:

  • Monitor channel depth via metrics (sketch below)
  • Consider backpressure mechanism
  • Document memory assumptions
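
A sketch of the channel-depth metric using github.com/prometheus/client_golang/prometheus; the metric names and registration wiring are assumptions.

// len() on a channel is safe to call concurrently and reports the number of
// buffered elements, which makes a GaugeFunc a cheap way to watch depth.
heightInDepth := prometheus.NewGaugeFunc(prometheus.GaugeOpts{
	Subsystem: "syncer",
	Name:      "height_in_channel_depth",
	Help:      "Number of DA height events buffered in heightInCh.",
}, func() float64 {
	return float64(len(s.heightInCh))
})
prometheus.MustRegister(heightInDepth)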

Minor: Snapshot store writes to stderr

Location: pkg/raft/node.go:80

snapshotStore, err := raft.NewFileSnapshotStore(cfg.RaftDir, int(cfg.SnapCount), os.Stderr)

Issue: Errors written to stderr instead of logger

Recommendation: Pass an io.Writer adapter backed by the node's logger, or io.Discard


Minor: Ticker not stopped in all paths

Location: pkg/raft/election.go:69-72

case <-ticker.C:
    if isStarted {
        ticker.Stop()
        continue
    }

Issue: Once ticker.Stop() is called, ticker.C never fires again, so the continue only skips the rest of this case body

Recommendation: Review the branch; either the continue is dead weight, or the intent is to stop polling once started, in which case a comment or an early return would make that clearer


5. Testing & Reliability 🧪

Strengths

  • ✅ E2E test covers basic failover scenario (test/e2e/failover_e2e_test.go)
  • ✅ Tests verify transaction propagation across nodes
  • ✅ Tests verify DA height progression after failover

Gaps & Recommendations

Critical: No unit tests for Raft components

Missing:

  • pkg/raft/node_test.go
  • pkg/raft/election_test.go
  • node/failover_test.go

Should test:

  1. Bootstrap vs join cluster scenarios
  2. Snapshot/restore correctness
  3. Leader election edge cases (split brain, network partition)
  4. Concurrent access to FSM state
  5. Error paths in Apply(), Snapshot(), Restore()
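
As a first seed for the missing pkg/raft tests, a small self-contained example that only exercises the exported RaftBlockState type; the FSM, snapshot, and election cases above will additionally need access to the package's unexported internals.

package raft

import (
	"bytes"
	"encoding/json"
	"testing"
)

// TestRaftBlockStateRoundTrip checks the current JSON encoding survives a
// marshal/unmarshal cycle; an equivalent test should follow the protobuf move.
func TestRaftBlockStateRoundTrip(t *testing.T) {
	in := RaftBlockState{
		Height:    7,
		Hash:      []byte{0x01, 0x02},
		Timestamp: 1_700_000_000,
		Header:    []byte("header-bytes"),
		Data:      []byte("data-bytes"),
	}
	raw, err := json.Marshal(in)
	if err != nil {
		t.Fatalf("marshal: %v", err)
	}
	var out RaftBlockState
	if err := json.Unmarshal(raw, &out); err != nil {
		t.Fatalf("unmarshal: %v", err)
	}
	if out.Height != in.Height || out.Timestamp != in.Timestamp ||
		!bytes.Equal(out.Hash, in.Hash) || !bytes.Equal(out.Header, in.Header) ||
		!bytes.Equal(out.Data, in.Data) {
		t.Fatalf("round trip mismatch: got %+v want %+v", out, in)
	}
}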

Major: E2E test only covers single failover

Location: test/e2e/failover_e2e_test.go:126-150

Missing scenarios:

  1. Multiple consecutive failovers: Kill leader twice in succession
  2. Network partition: Isolate leader, verify cluster survives
  3. Follower failure: What happens when non-leader dies?
  4. Concurrent leader kills: Race conditions during election
  5. Snapshot recovery: Kill all nodes, restart from snapshots
  6. Clock skew: Nodes with different HeartbeatTimeout values
  7. Byzantine behavior: Node sending invalid Raft messages

Recommendation: Expand E2E suite before merging


Major: No chaos/reliability testing

Needed:

  • Random node kills during block production
  • Network latency injection
  • Disk I/O failures during snapshot
  • Memory pressure scenarios

Moderate: Mock generation for new interfaces

Check: Do new interfaces have mocks?

  • leaderElection interface - needs mock
  • clusterClient interface - needs mock
  • eventProcessor interface - needs mock

Recommendation: Run make mock-gen and add .mockery.yaml entries


Minor: E2E test log storage

Location: test/e2e/sut_helper.go:170
Author's note: "I am storing logs to help tracing down issues. This works with non temporary work dirs only."

Recommendation: Make log storage path configurable, preserve logs on failure


6. Documentation & Developer Experience 📚

Gaps

Critical: No ADR document

Expected: adr023_ha-failover_raft mentioned in branch name

Should document:

  1. Why Raft chosen over other consensus (Paxos, etcd/raft comparison)
  2. Leadership loss → node restart decision rationale
  3. JSON → protobuf migration plan
  4. Security threat model
  5. Performance benchmarks/targets

Major: Incomplete package documentation

Missing:

  • pkg/raft/doc.go - package overview
  • Raft configuration guide (heartbeat tuning, snapshot frequency)
  • Failover behavior documentation
  • Operator runbook for cluster management

Major: Configuration documentation gaps

Location: pkg/config/config.go:255-266

Issues:

  • SendTimeout not explained (timeout for what operation?)
  • HeartbeatTimeout needs guidance on tuning (network latency considerations)
  • SnapCount - how to choose value? (based on block size/frequency)

Recommendation: Add detailed comments with examples
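
A sketch of the level of field documentation being asked for; the struct name is a stand-in for whatever pkg/config/config.go actually uses, and the SendTimeout/SnapCount semantics are deliberately left as TODOs rather than guessed.

// RaftConfig is illustrative; align names with the real config struct.
type RaftConfig struct {
	// HeartbeatTimeout is how long a follower waits without contact from the
	// leader before starting an election (hashicorp/raft semantics). Keep it
	// several times the peer round-trip time; lower values fail over faster
	// but risk spurious elections on jittery networks.
	HeartbeatTimeout time.Duration

	// SendTimeout: TODO: state which operation this bounds and how it should
	// relate to block size and peer latency.
	SendTimeout time.Duration

	// SnapCount: TODO: explain how to choose this value (entries between
	// snapshots; trade-off between snapshot overhead and log replay time).
	SnapCount uint64
}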


Moderate: Error messages could be more actionable

Example: pkg/raft/node.go:174

return fmt.Errorf("not leader")

Better:

return fmt.Errorf("not leader: current leader is %s, this node is %s", n.raft.Leader(), n.config.NodeID)

Moderate: CLAUDE.md needs update

Add section:

## High Availability with Raft

### Running a Raft Cluster
- Bootstrap first node with `--raft-bootstrap`
- Join additional nodes with `--raft-peers`
- Monitor cluster health via `/raft/node` endpoint

### Debugging Raft Issues
- Check Raft logs in `<raft-dir>/raft-log.db`
- Verify snapshot directory permissions
- Common errors and solutions

### Testing Failover
- Use `test/e2e/failover_e2e_test.go` as reference

Summary & Recommendations

Must Fix Before Merge (Blockers)

  1. 🚨 Security: Add authentication to Raft endpoints + TLS transport
  2. ⚠️ Data Integrity: Fix silent message dropping in FSM.Apply channel
  3. ⚠️ Architecture: Migrate JSON → Protobuf for Raft messages
  4. 📝 Documentation: Write ADR document explaining design decisions
  5. 🧪 Testing: Add unit tests for pkg/raft package

Should Fix Before Merge (High Priority)

  1. Fix nil interface reflection checks with better pattern
  2. Add authentication/rate limiting to cluster management endpoints
  3. Expand E2E tests to cover multiple failover scenarios
  4. Document leadership loss → restart rationale
  5. Review goroutine cleanup patterns for consistency

Can Address Post-Merge (Technical Debt)

  1. Refactor hardcoded timing constants to config
  2. Implement exponential backoff in retry logic
  3. Add chaos testing framework
  4. Create operator runbook documentation
  5. Add metrics for Raft health monitoring

Questions for Author

  1. Leadership restart: Is there a specific reason nodes must restart on leadership loss rather than transitioning to follower? (line: pkg/raft/election.go:63)
  2. JSON timeline: Is protobuf migration planned before merge or post-merge?
  3. Security model: What's the intended deployment environment? (trusted network vs hostile?)
  4. Performance targets: What's acceptable failover time? (current test shows ~200ms-5s)

Overall Assessment

This is a solid foundation for HA failover, with clean architecture and good separation of concerns. The E2E test demonstrates it works in the happy path. However, security and reliability gaps need addressing before production use. The WIP status is appropriate - this needs hardening before merge.

Recommend: Address must-fix items, expand testing, complete protobuf migration, then re-review.

Branch: alex/adr023_ha-failover_raft3

)

// failoverState collect the components to reset when switching modes.
type failoverState struct {
alpe (Author) commented:

This type contains all stateful services that are reset when switching from sync to aggregator mode.

})

// optional Raft node details
if raftNode != nil {
alpe (Author) commented:

A quick solution to add some maintenance functions for the raft cluster. Probably proto makes more sense.

@@ -0,0 +1,17 @@
package raft

// todo: refactor to use proto
alpe (Author) commented:

RaftBlockState is currently serialized/deserialized via JSON for transport over raft.

} else if !isLeader && isCurrentlyLeader { // lost leadership
workerCancel()
n.logger.Info().Msg("lost leadership")
return ErrLeadershipLost
alpe (Author) commented:

The node is shutdown in the downstream code to ensure clean state.

logfile, err := os.Create(logfileName)
require.NoError(s.t, err)
errReader = io.NopCloser(io.TeeReader(errReader, logfile))
outReader = io.NopCloser(io.TeeReader(outReader, logfile))
alpe (Author) commented:

I am storing logs to help trace down issues. This works with non-temporary work dirs only.
