
Conversation

randygrok
Contributor

Overview

Closes: #2641

- Added Backup method to Store interface and DefaultStore implementation to stream a Badger backup of the datastore.
- Introduced BackupRequest and BackupResponse messages in the state_rpc.proto file to handle backup requests and responses.
- Implemented backup streaming logic in StoreServer, including metadata handling for current and target heights.
- Created a backupStreamWriter to manage chunked writing of backup data.
- Updated client tests to validate the Backup functionality.
- Enhanced mock store to support Backup method for testing.
- Added unit tests for Backup functionality in the store package.
Contributor

github-actions bot commented Oct 14, 2025

The latest Buf updates on your PR. Results from workflow CI and Release / buf-check (pull_request).

| Build | Format | Lint | Breaking | Updated (UTC) |
| --- | --- | --- | --- | --- |
| ✅ passed | ⏩ skipped | ✅ passed | ✅ passed | Oct 16, 2025, 10:39 AM |

Contributor

claude bot commented Oct 14, 2025

Claude finished @randygrok's task


Code Review: Live Backups EVM and ev-node

I've completed a comprehensive review of this PR implementing live backup and restore functionality. Here are my findings organized by focus area:


1. Code Quality & Architecture

Strengths:

  • Clean separation of concerns: Backup/restore logic is properly layered (store → RPC server → client → CLI)
  • Interface-driven design: The Backup interface in pkg/store/types.go:75-85 is well-designed
  • Streaming implementation: The backup streaming approach using backupStreamWriter (server.go:272-340) is excellent for memory efficiency
  • Good test coverage: Both unit tests and integration tests are present

⚠️ Issues & Recommendations:

pkg/store/types.go:75-85 - Interface Design Issue

type Backup interface {
    Backup(ctx context.Context, writer io.Writer, since uint64) (uint64, error)
    Restore(ctx context.Context, reader io.Reader) error
    Close() error  // ❌ DUPLICATE: Already in Store interface
}
  • Problem: The Close() method appears in both Backup and Store interfaces, creating redundancy
  • Reviewer feedback confirmed: @julienrbrt noted "It shouldn't be added to the Rollback interface" (actually referring to the Backup interface based on context)
  • Recommendation: Remove Close() from the Backup interface since Store already embeds it

pkg/store/backup.go:20-36 - Type Assertion Pattern

// Try direct badger4 cast first
if badgerDatastore, ok := s.db.(*badger4.Datastore); ok {
    return backupBadger(badgerDatastore, writer, since)
}

// Try to unwrap one level (e.g., PrefixTransform wrapper)
if shim, ok := s.db.(ds.Shim); ok {
    children := shim.Children()
    if len(children) > 0 {
        if badgerDatastore, ok := children[0].(*badger4.Datastore); ok {
            return backupBadger(badgerDatastore, writer, since)
        }
    }
}
  • Good: Handles wrapped datastores
  • Concern: Only unwraps one level - what if there are multiple wrappers?
  • Recommendation: Consider iterative unwrapping or document the limitation

pkg/store/restore.go:21-66 - Restore Loop Logic

visited := make(map[ds.Datastore]struct{})
for {
    // ... unwrapping logic
    if _, seen := visited[current]; seen {
        break
    }
    visited[current] = struct{}{}
    // ...
}
  • Good: Cycle detection with visited map prevents infinite loops
  • Good: Iterative unwrapping (better than backup's single-level unwrap)
  • Recommendation: Apply the same iterative unwrapping pattern to backup.go for consistency

2. Error Handling & Resilience

Strengths:

  • Comprehensive error wrapping: All errors include context (e.g., fmt.Errorf("failed to execute backup: %w", err))
  • Context cancellation: Properly checks ctx.Err() at entry points
  • Cleanup on failure: Backup command removes partial files on error (backup.go:100-116)

⚠️ Issues & Recommendations:

pkg/cmd/backup.go:100-116 - File Cleanup Pattern

if backupErr != nil {
    _ = writer.Flush()
    _ = file.Close()
    _ = os.Remove(outputPath)
    return fmt.Errorf("backup failed: %w", backupErr)
}
  • Good: Cleans up partial backups on failure
  • Issue: Silent error suppression with _ assignments could hide issues
  • Recommendation: Log cleanup errors at debug/warning level:
if backupErr != nil {
    if flushErr := writer.Flush(); flushErr != nil {
        cmd.PrintErrf("Warning: failed to flush during cleanup: %v\n", flushErr)
    }
    if closeErr := file.Close(); closeErr != nil {
        cmd.PrintErrf("Warning: failed to close file during cleanup: %v\n", closeErr)
    }
    if removeErr := os.Remove(outputPath); removeErr != nil {
        cmd.PrintErrf("Warning: failed to remove partial backup: %v\n", removeErr)
    }
    return fmt.Errorf("backup failed: %w", backupErr)
}

pkg/cmd/restore.go:86 - No Rollback on Restore Failure

defer kvStore.Close()

evStore := store.New(kvStore)
// ...
if err := evStore.Restore(ctx, reader); err != nil {
    return fmt.Errorf("restore failed: %w", err)
}
  • Critical Issue: If restore fails midway, the database may be in an inconsistent state
  • Problem: No cleanup or rollback mechanism
  • Recommendation: Either:
    1. Document that users should use --force flag workflow (backup existing, then restore)
    2. Add validation that database is empty before restore
    3. Implement transaction/rollback on restore failure

pkg/rpc/server/server.go:193-265 - Stream Error Handling

if err := stream.Send(&pb.BackupResponse{...}); err != nil {
    return err
}
  • Issue: Raw errors returned without proper gRPC error codes in some paths
  • Inconsistency: Some paths use connect.NewError() (lines 223-233), but initial metadata send doesn't (line 216)
  • Recommendation: Wrap all stream errors consistently:
if err := stream.Send(&pb.BackupResponse{...}); err != nil {
    return connect.NewError(connect.CodeInternal, fmt.Errorf("failed to send initial metadata: %w", err))
}

3. Security

Strengths:

  • Restrictive file permissions: Backup files created with 0o600 (backup.go:74)
  • No credential exposure: No secrets or credentials in backup logic
  • RPC authentication: Uses existing RPC authentication mechanisms

⚠️ Issues & Recommendations:

pkg/cmd/backup.go:37-40 - URL Construction

baseURL := rpcAddress
if !strings.HasPrefix(baseURL, "http://") && !strings.HasPrefix(baseURL, "https://") {
    baseURL = fmt.Sprintf("http://%s", baseURL)
}
  • Security Concern: Defaults to HTTP instead of HTTPS
  • Risk: Backup data transmitted in cleartext over network
  • Recommendation:
    1. Add a --tls or --insecure flag (defaulting to requiring TLS)
    2. Warn users when using HTTP
    3. Consider defaulting to HTTPS:
if !strings.HasPrefix(baseURL, "http://") && !strings.HasPrefix(baseURL, "https://") {
    baseURL = fmt.Sprintf("https://%s", baseURL)
}

pkg/cmd/restore.go:61-65 - Directory Removal

if force {
    if err := os.RemoveAll(dbPath); err != nil {
        return fmt.Errorf("failed to remove existing datastore: %w", err)
    }
}
  • Risk: os.RemoveAll() is destructive and irreversible
  • Missing: No confirmation prompt or backup recommendation
  • Recommendation: Add confirmation:
if force {
    cmd.Printf("WARNING: This will permanently delete the existing datastore at %s\n", dbPath)
    cmd.Printf("Continue? (yes/no): ")
    var answer string
    // Sketch of the confirmation step: read from the command's input stream
    // so tests can drive it; abort unless the user types "yes".
    if _, err := fmt.Fscanln(cmd.InOrStdin(), &answer); err != nil || answer != "yes" {
        return fmt.Errorf("restore aborted: datastore removal not confirmed")
    }
    if err := os.RemoveAll(dbPath); err != nil {
        return fmt.Errorf("failed to remove existing datastore: %w", err)
    }
}

No Input Validation on Backup File

  • Risk: Restore command doesn't validate backup file format/integrity before attempting restore
  • Recommendation: Add magic number/header validation to detect corrupt/malicious backup files

4. Performance & Resource Efficiency

Strengths:

  • Buffered I/O: 1 MiB buffers for both read and write (backup.go:80, restore.go:97)
  • Chunked streaming: 128 KiB chunks prevent memory exhaustion (server.go:268)
  • Online backups: Uses Badger's DB.Backup() which doesn't block writes
  • No goroutine leaks: All resources properly cleaned up with defer

⚠️ Issues & Recommendations:

pkg/rpc/server/server.go:290-315 - Buffer Management

func (w *backupStreamWriter) Write(p []byte) (int, error) {
    written := 0
    for len(p) > 0 {
        space := w.chunkSize - len(w.buf)
        if space == 0 {
            if err := w.flush(); err != nil {
                return written, err
            }
            space = w.chunkSize - len(w.buf)
        }
        if space > len(p) {
            space = len(p)
        }
        w.buf = append(w.buf, p[:space]...)  // ❌ Potential repeated allocations
        // ...
    }
}
  • Performance Issue: append() may cause multiple allocations if buffer grows
  • Recommendation: Pre-allocate buffer in constructor:
func newBackupStreamWriter(stream *connect.ServerStream[pb.BackupResponse], chunkSize int) *backupStreamWriter {
    if chunkSize <= 0 {
        chunkSize = defaultBackupChunkSize
    }
    return &backupStreamWriter{
        stream:    stream,
        buf:       make([]byte, 0, chunkSize),  // ✅ Pre-allocated capacity
        chunkSize: chunkSize,
    }
}

pkg/cmd/backup.go:80-82 - Duplicate Buffering

writer := bufio.NewWriterSize(file, 1<<20) // 1 MiB buffer
bytesCount := &countingWriter{}
streamWriter := io.MultiWriter(writer, bytesCount)
  • Issue: Both client-side (1 MiB) and server-side (128 KiB chunks) buffering
  • Not a bug: This is actually reasonable for network I/O
  • Observation: Could document why both layers of buffering exist

No Rate Limiting or Backpressure

  • Risk: Large backups could saturate network bandwidth
  • Recommendation: Consider adding configurable rate limiting for production use

pkg/store/restore.go:42 - Hardcoded Thread Count

if err := badgerDatastore.DB.Load(reader, 16); err != nil {
  • Issue: Hardcoded to 16 threads
  • Recommendation: Make configurable or use runtime.NumCPU()

5. Testing & Reliability

Strengths:

  • Comprehensive unit tests: pkg/cmd/backup_test.go covers success and error cases
  • Integration tests: Full client → server → store flow tested (client_test.go:179-204)
  • Mock-based testing: Proper use of mockery for dependencies
  • Error path testing: store_test.go has excellent error case coverage

⚠️ Issues & Recommendations:

Missing Test Coverage (Per Codecov Report):

  • pkg/rpc/server/server.go: 110 missing lines (0% coverage for backup RPC)
  • pkg/cmd/restore.go: 84 missing lines (0% coverage)
  • pkg/store/restore.go: 37 missing lines (0% coverage)

Recommendations:

  1. Add test for restore.go restore command execution
  2. Add test for restore.go error paths (invalid backup file, existing DB without --force)
  3. Add test for store/restore.go with shimmed datastores
  4. Add test for RPC streaming errors (network failures, context cancellation)

Missing End-to-End Test

  • Gap: No test that exercises full backup → restore → verify data integrity workflow
  • Recommendation: Add e2e test:
func TestBackupRestoreE2E(t *testing.T) {
    // 1. Create store with test data
    // 2. Perform backup
    // 3. Create new store
    // 4. Restore from backup
    // 5. Verify all data matches
}

Test Quality Issues:

pkg/cmd/backup_test.go:25-83 - Missing edge cases:

  • No test for --since-version with incremental backups
  • No test for network timeout/cancellation
  • No test for large backup files

6. Documentation & Developer Experience

Strengths:

  • Good inline comments: Explains non-obvious logic (e.g., backup.go:39)
  • Command help text: Cobra commands have clear descriptions
  • Proto documentation: Protobuf messages well-documented

⚠️ Issues & Recommendations:

Missing CLI Documentation:

  • No examples in command help text
  • Recommendation: Add usage examples:
cmd := &cobra.Command{
    Use:   "backup",
    Short: "Stream a datastore backup to a local file via RPC",
    Long: `Stream a datastore backup to a local file via RPC.

Examples:
  # Create a full backup
  evnode backup --output /backups/full-backup.badger

  # Create an incremental backup since version 1000
  evnode backup --output /backups/incremental.badger --since-version 1000

  # Overwrite existing backup file
  evnode backup --output /backups/backup.badger --force
`,
    // ...
}

Missing High-Level Documentation:

  • No README or docs explaining backup/restore workflow
  • No documentation on backup file format, versioning, or compatibility
  • Recommendation: Add documentation covering:
    • When to use full vs incremental backups
    • How to verify backup integrity
    • Disaster recovery procedures
    • Backup file compatibility between versions

Missing Godoc Comments:

  • backupStreamWriter (server.go:272) has no doc comment
  • countingWriter (backup.go:135) has no doc comment

Critical Issues Summary

Must Fix Before Merge:

  1. pkg/store/types.go:84 - Remove duplicate Close() method from Backup interface (confirmed by reviewer @julienrbrt)

  2. pkg/cmd/restore.go:107 - Add protection against partial restore failures leaving database in inconsistent state

  3. Security: pkg/cmd/backup.go:38-40 - Don't default to HTTP; require explicit --insecure flag or default to HTTPS

Should Fix:

  1. Test Coverage - Add tests for restore command and RPC streaming (current coverage: 17.66%)

  2. pkg/cmd/restore.go:61-65 - Add confirmation prompt for --force flag before destructive os.RemoveAll()

  3. pkg/rpc/server/server.go:216 - Consistent error wrapping for all stream.Send() calls

Nice to Have:

  1. Apply iterative unwrapping pattern from restore.go to backup.go for consistency

  2. Add CLI usage examples and high-level documentation

  3. Add end-to-end test for full backup → restore → verify workflow

  4. Consider making restore thread count configurable (currently hardcoded to 16)


Positive Highlights

  1. Excellent streaming implementation - Memory-efficient chunked streaming is production-ready
  2. Strong error handling foundation - Consistent error wrapping and context propagation
  3. Good test structure - Well-organized tests with proper mocking
  4. Clean architecture - Proper separation of concerns across layers
  5. Reviewer feedback addressed - Scripts moved to ev-reth repository as requested

Recommendation

Status: Conditionally approve after addressing critical issues

This is a solid implementation of a critical feature. The architecture is sound, and the code quality is generally high. However, the following must be addressed before merge:

  1. Fix the interface design issue (duplicate Close())
  2. Improve test coverage (especially restore.go and RPC server)
  3. Add safeguards for restore failures
  4. Fix security concern with HTTP default

Once these are addressed, this will be a production-ready feature.



codecov bot commented Oct 14, 2025

Codecov Report

❌ Patch coverage is 17.66234% with 317 lines in your changes missing coverage. Please review.
✅ Project coverage is 59.55%. Comparing base (de40783) to head (7a31192).

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| pkg/rpc/server/server.go | 0.00% | 110 Missing ⚠️ |
| pkg/cmd/restore.go | 0.00% | 84 Missing ⚠️ |
| pkg/cmd/backup.go | 48.48% | 37 Missing and 14 partials ⚠️ |
| pkg/store/restore.go | 0.00% | 37 Missing ⚠️ |
| pkg/rpc/client/client.go | 35.29% | 15 Missing and 7 partials ⚠️ |
| pkg/store/backup.go | 38.09% | 11 Missing and 2 partials ⚠️ |
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2758      +/-   ##
==========================================
- Coverage   61.42%   59.55%   -1.88%     
==========================================
  Files          81       85       +4     
  Lines        8622     9007     +385     
==========================================
+ Hits         5296     5364      +68     
- Misses       2828     3122     +294     
- Partials      498      521      +23     
| Flag | Coverage Δ |
| --- | --- |
| combined | 59.55% <17.66%> (-1.88%) ⬇️ |

Flags with carried forward coverage won't be shown.


@randygrok randygrok marked this pull request as ready for review October 15, 2025 09:10
Member

@julienrbrt julienrbrt left a comment


small nits. haven't checked the core logic then.

@randygrok randygrok requested a review from julienrbrt October 16, 2025 10:37


Development

Successfully merging this pull request may close these issues.

[FEATURE] Copy live dbs

2 participants