- 
          
- 
                Notifications
    You must be signed in to change notification settings 
- Fork 106
feat: implement transaction atomicity with parent-child relationship for sub-operations #2009
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
| @claude can you review this PR | 
| Claude finished @iduartgomez's task —— View job Code Review Complete ✅I've completed a comprehensive review of this PR. Overall, this is a well-designed and solid implementation that addresses the atomicity issues described in #1975. The solution is architecturally sound with proper parent-child transaction tracking. Below are my findings: Strengths 👍
 Critical Issues 🔴1. Memory Leak: Orphaned Child Tracking When Parent Fails EarlyLocation:  When a child operation fails via  
 This can accumulate over time causing memory leaks. Fix: Add cleanup in the else branch: } else {
    tracing::warn!(...);
    self.failed_parents.insert(parent_tx);
    // ADD THIS:
    self.cleanup_parent_tracking(parent_tx); // Clean up all tracking
}2. Race Condition: expect_sub_operation vs register_sub_operationLocation:  The  
 However, if the child completes extremely fast (before  
 Fix: Ensure atomicity by registering the child before spawning: pub fn expect_and_register_sub_operation(&self, parent: Transaction, child: Transaction) {
    self.expected_sub_operations
        .entry(parent)
        .and_modify(|count| *count += 1)
        .or_insert(1);
    
    self.sub_operations
        .entry(parent)
        .or_insert_with(HashSet::new)
        .insert(child);
    
    self.parent_of.insert(child, parent);
}Then call this before spawning the task in  High Priority Issues 🟠3. Missing Timeout Handling for Sub-OperationsLocation:  When a child operation times out, the parent is never notified. The garbage collection task (line 577-658) removes timed-out operations but doesn't check if they're children or propagate failures to parents. Recommendation: In the garbage cleanup task, check if a timed-out operation is a child and call  4. Potential Duplicate Client ResponsesLocation:  The logic checks  Recommendation: Add additional guard in  Medium Priority Issues 🟡5. Inconsistent State: Failed Parent Can Still Complete SuccessfullyLocation:  In PUT's  Recommendation: Add check: if subscribe {
    if !op_manager.failed_parents.contains(&id) {
        let _child_tx = super::start_subscription_request(op_manager, *id, key);
    } else {
        tracing::warn!("Not starting subscription for failed parent {}", id);
    }
}6. Test Gap: No Multi-Level Nesting TestsLocation:  The PR mentions "Support for nested sub-operations (arbitrary depth)" as future work, but there's no test validating that the current implementation even supports 2 levels (parent → child → grandchild). Recommendation: Add a test case with nested operations to verify current behavior. 7. Missing Error Context in Sub-Operation FailuresLocation:  The error message formatting in  Recommendation: Consider accepting  Low Priority / Style Issues 🔵8. Unclear Naming:  | 
| @claude fix clippy issues and also review the changes since your last review (#2009 (comment)) to see if they have been properly addressed | 
| Claude finished @iduartgomez's task —— View job Review Complete ✅I've analyzed the changes since my last review and fixed the clippy issue. Here's my comprehensive assessment: Clippy Issues FixedFixed in commit 5eb1348: 
 Review of Changes Since Last ReviewI reviewed the latest commit 2645541 ("fix pr suggestions") which addressed my previous critical issues. Here's the status: ✅ Critical Issue #1: Memory Leak - FIXEDPrevious Issue: When a child operation failed before the parent entered  Fix Applied (lines 573-576 in op_state_manager.rs): } else {
    tracing::warn!(...);
    self.failed_parents.insert(parent_tx);
    // Clean up tracking to prevent memory leak
    self.cleanup_parent_tracking(parent_tx);
    // Mark parent as completed to prevent duplicate responses
    self.completed(parent_tx);
}Status: ✅ RESOLVED - The else branch now properly cleans up tracking and marks the parent as completed. ✅ Critical Issue #2: Race Condition - FIXEDPrevious Issue: Race condition between  Fix Applied (lines 489-505 in op_state_manager.rs): /// Atomically registers both expected count and parent-child relationship.
/// This prevents race conditions where children complete before registration.
pub fn expect_and_register_sub_operation(&self, parent: Transaction, child: Transaction) {
    // Increment expected count
    self.expected_sub_operations
        .entry(parent)
        .and_modify(|count| *count += 1)
        .or_insert(1);
    // Register parent-child relationship
    self.sub_operations
        .entry(parent)
        .or_insert_with(HashSet::new)
        .insert(child);
    self.parent_of.insert(child, parent);
}Usage (line 346 in operations/mod.rs): let child_tx = Transaction::new_child_of::<subscribe::SubscribeMsg>(&parent_tx);
op_manager.expect_and_register_sub_operation(parent_tx, child_tx);Status: ✅ RESOLVED - The atomic  ✅ High Priority Issue #3: Timeout Handling - FIXEDPrevious Issue: When a child operation timed out, the parent was never notified. Fix Applied (lines 655-672 and 703-720 in op_state_manager.rs): The garbage cleanup task now checks if timed-out operations are children and propagates failures: // Check if this is a child operation and propagate timeout to parent
if let Some(parent_entry) = parent_of.get(&tx) {
    let parent_tx = *parent_entry;
    drop(parent_entry);
    tracing::warn!(
        child_tx = %tx,
        parent_tx = %parent_tx,
        "Child operation timed out, propagating failure to parent"
    );
    let error_result = Err(freenet_stdlib::client_api::ErrorKind::OperationError {
        cause: format!("Sub-operation {} timed out", tx).into(),
    }.into());
    let _ = result_router_tx.send((parent_tx, error_result)).await;
}Status: ✅ RESOLVED - Timeouts are now properly propagated to parents. ✅ Medium Priority Issue #5: Failed Parent Check - FIXEDPrevious Issue: PUT's  Fix Applied (lines 609-623 in operations/put.rs): if subscribe {
    // Check if this parent has already failed due to a previous child failure
    if !op_manager.failed_parents.contains(id) {
        tracing::debug!(
            tx = %id,
            %key,
            "starting child subscription for PUT operation"
        );
        let _child_tx =
            super::start_subscription_request(op_manager, *id, key);
    } else {
        tracing::warn!(
            tx = %id,
            "not starting subscription for failed parent operation"
        );
    }
}Status: ✅ RESOLVED - Parent failure is now checked before starting subscriptions. 
 | 
      
        
              This comment was marked as resolved.
        
        
      
    
  This comment was marked as resolved.
      
        
              This comment was marked as resolved.
        
        
      
    
  This comment was marked as resolved.
      
        
              This comment was marked as resolved.
        
        
      
    
  This comment was marked as resolved.
      
        
              This comment was marked as resolved.
        
        
      
    
  This comment was marked as resolved.
      
        
              This comment was marked as resolved.
        
        
      
    
  This comment was marked as resolved.
      
        
              This comment was marked as resolved.
        
        
      
    
  This comment was marked as resolved.
| /// Parent-to-children mapping for sub-operation tracking. | ||
| sub_operations: Arc<DashMap<Transaction, HashSet<Transaction>>>, | ||
| /// Root operations awaiting sub-operation completion before client notification. | ||
| /// Ensures atomicity: clients receive success only when all sub-operations succeed. | ||
| pub(crate) root_ops_awaiting_sub_ops: Arc<DashMap<Transaction, OpEnum>>, | ||
| /// Child-to-parent index for O(1) parent lookups. | ||
| parent_of: Arc<DashMap<Transaction, Transaction>>, | ||
| /// Expected sub-operation count per root operation. Pre-registered before spawning | ||
| /// to prevent race conditions where children complete before parent registration. | ||
| expected_sub_operations: Arc<DashMap<Transaction, usize>>, | ||
| /// Root operations with at least one failed sub-operation. | ||
| pub(crate) failed_parents: Arc<DashSet<Transaction>>, | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
let's refactor all this into a new type for handling suboperations; this type can be module private (at this level), and we can move it to a new module to keep it isolated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
once refactored we can probably add quite a few unit tests around it
| let notifier = self.to_event_listener.clone(); | ||
| GlobalExecutor::spawn(async move { | ||
| if let Err(err) = router_tx.send((tx, host_result)).await { | ||
| tracing::error!( | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think can be warn too to reduce the noise; anyway if this fails there will be more errors and the node will be likely shutting down
        
          
                crates/core/src/operations/get.rs
              
                Outdated
          
        
      | new_skip_list.insert(sender.peer.clone()); | ||
|  | ||
| super::start_subscription_request(op_manager, key).await; | ||
| let _ = super::start_subscription_request(op_manager, id, key); | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why is this unhandled?
| @claude are we exercising the new code added and assertign it works in any integration test in the repo? analyze | 
| Claude finished @iduartgomez's task —— View job Test Coverage Analysis for Transaction Atomicity Features ✅I've completed a thorough analysis of test coverage for the new parent-child transaction atomicity code. Here are my findings: SummaryThe new atomicity code is MINIMALLY tested. Only one integration test exercises the new features, and there are NO unit tests for the core atomicity infrastructure. What's Currently TestedIntegration Tests1.  This is the only test that exercises the new atomicity features. It tests: ✅ PUT with subscribe=true creates a parent-child relationship Lines tested: 
 Unit TestsSubscribe operation tests (crates/core/src/operations/subscribe/tests.rs) These are NOT related to atomicity. They test subscription routing logic with skip lists, not parent-child relationships. What's NOT Tested (Critical Gaps)1. Sub-Operation Failure Propagation 🔴 CRITICALThe core feature of this PR - error propagation from child to parent - has ZERO test coverage. Untested code paths: 
 Missing tests: 
 2. Timeout Propagation 🔴 CRITICALTimeout handling for child operations was added in response to my previous review, but has ZERO test coverage. Untested code paths: 
 Missing tests: 
 3. Deferred Completion Logic 🟠 HIGH PRIORITYThe core atomicity mechanism - deferring parent completion until children complete - is only tested in the happy path. Partially tested code paths: 
 Missing tests: 
 4. Sub-Operation Registry 🟠 HIGH PRIORITYThe tracking infrastructure has ZERO dedicated tests. Untested code paths: 
 Missing tests: 
 5. Transaction Parent Field 🟡 MEDIUM PRIORITYThe new  Missing tests: 
 6. Edge Cases 🟡 MEDIUM PRIORITYMissing tests: 
 Test Coverage by File
 RecommendationsImmediate Actions (Before Merge)
 Follow-Up Work (Post-Merge)
 Verdict❌ Test coverage is insufficient for production. The PR adds 611 lines of new code but only ~15 lines of test code (in the modified  Risk Assessment: 
 Recommendation: Add at minimum the two critical tests (#1 and #2 above) before merging. The sub-operation failure and timeout propagation are the core value of this PR, and they should not ship without test coverage. Positive NotesThe existing test ( 
 It serves as a good template for the missing failure/timeout tests. | 
…ns to solve #1975 # Conflicts: # crates/core/src/operations/put.rs # crates/core/tests/operations.rs # Conflicts: # crates/core/src/operations/subscribe.rs # Conflicts: # crates/core/src/operations/get.rs
Co-authored-by: nacho.d.g <[email protected]>
- Removed unused methods expect_sub_operation and register_sub_operation - Changed or_insert_with(HashSet::new) to or_default() for cleaner code Co-authored-by: nacho.d.g <[email protected]>
Co-authored-by: nacho.d.g <[email protected]>
Co-authored-by: nacho.d.g <[email protected]>
ce07fdf    to
    20053eb      
    Compare
  
    
Fixes #1975
Summary
This PR implements atomic operation guarantees for operations with network-based sub-operations by introducing a parent-child transaction tracking infrastructure.
This directly addresses issue #1975 where parent operations (PUT, GET, UPDATE) could complete and respond to clients while their constituent sub-operations (e.g., SUBSCRIBE) were still running, had failed, or hadn't started, breaking fundamental ACID guarantees.
Problem
Freenet's high-level operations are not atomic when they spawn network-based sub-operations. A parent operation can complete and respond to the client while its constituent sub-operations are still running or have failed, breaking fundamental ACID guarantees and creating inconsistent distributed state.
Key Issues Addressed (from #1975)
As documented in issue #1975, the following architectural gaps are resolved:
parent: Option<Ulid>to Transactionsub_operationsandparent_ofmapsSolution Architecture
This PR implements the solution architecture proposed in issue #1975, following the phased approach:
1. Parent-Child Transaction Relationships
Added
parent: Option<Ulid>field toTransactiontype to enable hierarchical tracking:2. Sub-Operation Registry in OpManager
Added tracking infrastructure in
OpManager:sub_operations: DashMap<Transaction, HashSet<Transaction>>- Parent→children mappingparent_of: DashMap<Transaction, Transaction>- Reverse index for O(1) parent lookuppending_finalization: DashMap<Transaction, (OpEnum, DateTime)>- Operations awaiting children3. Deferred Completion Logic
Modified operation finalization to defer parent completion until all children complete:
4. Error Propagation
Sub-operation failures now propagate to parent and client:
Changes
Core Infrastructure
crates/core/src/message.rs: ExtendedTransactionwith parent trackingcrates/core/src/node/op_state_manager.rs: Added sub-operation registry and atomicity logiccrates/core/src/operations/mod.rs: Modified finalization to defer parent completionOperation Updates
crates/core/src/operations/put.rs: Updated PUT to track SUBSCRIBE sub-operations atomicallycrates/core/src/operations/subscribe.rs: Modified to support child transaction creationTesting
crates/core/tests/operations.rs: Improvedtest_put_with_subscribe_flagtest to ensure put atomicityProtocol
crates/core/src/node/network_bridge/p2p_protoc.rs: Updated to handle new transaction fields✅ PUT operations with
subscribe: truenow complete atomically✅ Sub-operation failures propagate to parent and client
✅ Transaction hierarchies are properly tracked
✅ No regression in existing functionality
Future Work
As outlined in #1975, future enhancements include: