feat(satp-hermes): add crash recovery & rollback protocol #3491

Open
wants to merge 1 commit into
base: satp-dev

Yogesh01000100
Contributor

@Yogesh01000100 Yogesh01000100 commented Aug 20, 2024

Description

This PR addresses issue #3114 by implementing core components for crash recovery and rollback protocols. The changes enhance fault tolerance and ensure consistent recovery during failures.


Key Changes

1. CrashManager

Introduced a CrashManager class responsible for managing crash detection, recovery, and rollback processes.
Key functionalities include:

  • Session Management: Tracks and maintains SATP sessions.
  • Recovery Initiation: Detects crashes and triggers recovery logic.
  • Rollback Execution: Handles rollback processes for failed recovery attempts.
  • Cron Job Integration: Added scheduled crash detection using node-schedule.
    • Ensures jobs pause during rollback to prevent conflicts.
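The pause-during-rollback behaviour described above can be sketched roughly as follows. This is a minimal illustration, not the PR's code: `PausableCrashCheckJob`, `tick`, and `runRollback` are hypothetical names, and the node-schedule wiring is left out so that the example stays self-contained.

```typescript
// Hypothetical sketch: class and method names are illustrative, not the PR's API.
class PausableCrashCheckJob {
  private paused = false;
  private runs = 0;
  private readonly check: () => void;

  constructor(check: () => void) {
    this.check = check;
  }

  // Called by the scheduler on every tick (for example from a cron job callback).
  // Returns false when the tick was skipped because a rollback is in progress.
  tick(): boolean {
    if (this.paused) {
      return false;
    }
    this.runs += 1;
    this.check();
    return true;
  }

  getRunCount(): number {
    return this.runs;
  }

  // Pauses crash checks for the duration of the rollback and always resumes,
  // even when the rollback throws. Synchronous for brevity; a real rollback
  // would most likely be asynchronous.
  runRollback<T>(rollback: () => T): T {
    this.paused = true;
    try {
      return rollback();
    } finally {
      this.paused = false;
    }
  }
}
```

A scheduler callback would only need to call `tick()`; any tick that fires while `runRollback` is executing is skipped rather than queued.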

2. Protocol Services

Updated crash_recovery.proto to define:

  • RecoverMessage, RecoverUpdateMessage, and RecoverSuccessMessage for crash recovery.
  • RollbackMessage and RollbackAckMessage for rollback processes.

3. Recovery & Rollback Strategies

Implemented recovery & rollback strategies for all SATP protocol stages, ensuring the ability to revert to a consistent state upon failure.

  • Added RollbackStrategyFactory to centralize strategy selection.
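As an illustration of centralizing strategy selection, here is a minimal factory sketch. All names (`RollbackStrategySketch`, `RollbackStrategyFactorySketch`, the per-stage classes) are hypothetical, and the compensating actions listed per stage are placeholder assumptions, not the behaviour implemented in this PR.

```typescript
// Hypothetical sketch: names and compensating actions are placeholders.
interface RollbackStrategySketch {
  readonly stage: number;
  // Lists the compensating actions this strategy would run, in order.
  plan(): string[];
}

class Stage1RollbackSketch implements RollbackStrategySketch {
  readonly stage = 1;
  plan(): string[] {
    return []; // assumption: no ledger-side effects to undo this early
  }
}

class Stage2RollbackSketch implements RollbackStrategySketch {
  readonly stage = 2;
  plan(): string[] {
    return ["unlock asset on source ledger"]; // assumption
  }
}

class Stage3RollbackSketch implements RollbackStrategySketch {
  readonly stage = 3;
  plan(): string[] {
    // assumption: undo destination-side effects first, then the source lock
    return ["burn asset on destination ledger", "unlock asset on source ledger"];
  }
}

class RollbackStrategyFactorySketch {
  // One builder per stage keeps the selection logic in a single place.
  private readonly builders: { [stage: number]: () => RollbackStrategySketch } = {
    1: () => new Stage1RollbackSketch(),
    2: () => new Stage2RollbackSketch(),
    3: () => new Stage3RollbackSketch(),
  };

  create(stage: number): RollbackStrategySketch {
    const build = this.builders[stage];
    if (build === undefined) {
      throw new Error("No rollback strategy registered for stage " + stage);
    }
    return build();
  }
}
```

Callers ask the factory for a strategy by stage number and never reference the concrete classes, so supporting another stage only means registering one more builder.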

4. Crash Detection and Handling

Added mechanisms to:

  • Detect incomplete operations (via logs).
  • Compare timestamps against session timeouts to trigger recovery/rollback.
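One possible reading of the timestamp comparison is sketched below. It assumes that an incomplete operation is retried through recovery while it is still within the session timeout and rolled back once the timeout has elapsed; that policy, the `"done"` marker, and all names here are assumptions for illustration only.

```typescript
// Hypothetical sketch: the verdict names and the "done" marker are assumptions.
type CrashVerdictSketch = "IDLE" | "RECOVER" | "ROLLBACK";

interface LastLogSketch {
  operation: string; // assumption: "done" marks a completed step
  timestampMs: number; // time the log entry was written, in milliseconds
}

function classifySessionSketch(
  lastLog: LastLogSketch,
  nowMs: number,
  sessionTimeoutMs: number,
): CrashVerdictSketch {
  // A completed operation needs no action.
  if (lastLog.operation === "done") {
    return "IDLE";
  }
  // An incomplete operation is recovered while within the timeout
  // and rolled back once the timeout has been exceeded.
  const elapsedMs = nowMs - lastLog.timestampMs;
  return elapsedMs > sessionTimeoutMs ? "ROLLBACK" : "RECOVER";
}
```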

@RafaelAPB
Contributor

I will review this PR

@Yogesh01000100 Yogesh01000100 force-pushed the feature/crash-recovery-improvements branch from f9014b0 to 0de9744 Compare August 21, 2024 20:08
@RafaelAPB
Contributor

@Yogesh01000100 please rebase with satp-dev (should not have conflicts)

@Yogesh01000100 Yogesh01000100 force-pushed the feature/crash-recovery-improvements branch from 0de9744 to 4c0124d Compare August 23, 2024 17:19
@Yogesh01000100 Yogesh01000100 changed the title feat: add crash recovery and knex config for production feat(recovery): add crash recovery implementation Aug 25, 2024
@RafaelAPB
Contributor

@Yogesh01000100 please include documentation and tests, and update the description, as discussed.

@Yogesh01000100 Yogesh01000100 force-pushed the feature/crash-recovery-improvements branch from ce9a179 to 24b8eaf Compare September 8, 2024 07:44
@Yogesh01000100 Yogesh01000100 force-pushed the feature/crash-recovery-improvements branch from 24b8eaf to 728e7cb Compare September 16, 2024 18:56
@RafaelAPB
Contributor

@Yogesh01000100 could you please squash the commits and rebase with latest version of satp-dev, prior to merge?

@Yogesh01000100 Yogesh01000100 force-pushed the feature/crash-recovery-improvements branch 2 times, most recently from 1a55673 to 21ad772 Compare September 17, 2024 10:11
Contributor

@RafaelAPB RafaelAPB left a comment

Generally looks very good, but there are some changes to be done prior to merging.
Summarizing my comments:

  1. Add other authors to the commit
  2. Incorporate feedback from the logging process (namely un-hardcoding logs and adding more information)
  3. Implement RollbackState (for example, it should state how many more steps remain to be rolled back at any moment, what has already been rolled back, the estimated time to completion, etc.)
  4. Please add tests that support the new feature
  5. Please add comprehensive documentation on this feature. Example: The readme of SATP should have a section on how to run the docker compose with several examples of configurations.
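To make the RollbackState request in item 3 concrete, here is one possible shape. It is only a sketch: the field names, the immutable update helper, and the naive time estimate (average time per completed step multiplied by the steps remaining) are all assumptions, not something the PR defines.

```typescript
// Hypothetical sketch of a RollbackState; field names are assumptions.
interface RollbackStateSketch {
  sessionId: string;
  startedAtMs: number;
  stepsRolledBack: string[]; // what was rolled back already
  stepsRemaining: string[]; // what is still to be rolled back
}

// Returns a new state with the next pending step moved to the completed list.
function markNextStepRolledBack(state: RollbackStateSketch): RollbackStateSketch {
  if (state.stepsRemaining.length === 0) {
    return state;
  }
  return {
    sessionId: state.sessionId,
    startedAtMs: state.startedAtMs,
    stepsRolledBack: state.stepsRolledBack.concat([state.stepsRemaining[0]]),
    stepsRemaining: state.stepsRemaining.slice(1),
  };
}

// Naive estimate: average duration of completed steps times steps remaining.
// Returns undefined until at least one step has completed.
function estimateRemainingMs(state: RollbackStateSketch, nowMs: number): number | undefined {
  const done = state.stepsRolledBack.length;
  if (done === 0) {
    return undefined;
  }
  const averageMsPerStep = (nowMs - state.startedAtMs) / done;
  return averageMsPerStep * state.stepsRemaining.length;
}
```

With a state like this, "how many more steps" is `stepsRemaining.length`, and the estimate improves as more steps complete.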

@Yogesh01000100 Yogesh01000100 force-pushed the feature/crash-recovery-improvements branch 2 times, most recently from 49e1135 to fb703b4 Compare October 16, 2024 19:57
@Yogesh01000100 Yogesh01000100 force-pushed the feature/crash-recovery-improvements branch from fb703b4 to b30ccb5 Compare November 3, 2024 19:16
Contributor

@LordKubaya LordKubaya left a comment

Review how sessionData is being used, and take a look at the Stage 3 question.
Please document the new code as well. The rest is being documented in this PR:
https://github.com/hyperledger/cacti/pull/3619


private async checkCrash(session: SATPSession): Promise<CrashStatus> {
const fnTag = `${this.className}#checkCrash()`;
const sessionData = session.hasClientSessionData()

Why does this prioritize the client session?
In this implementation, the gateway can be a client and a server at the same time, so in that case we may fail to detect some crashes.
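One way to address this concern is to inspect both sides of the session instead of picking the client side first. The sketch below uses hypothetical types (`DualSessionSketch`, `SideDataSketch`) rather than the real SATPSession API, and its stall criterion (incomplete and older than a timeout) is an assumption.

```typescript
// Hypothetical sketch: these types stand in for the real session data.
type GatewayRoleSketch = "client" | "server";

interface SideDataSketch {
  completed: boolean;
  lastUpdateMs: number;
}

interface DualSessionSketch {
  client?: SideDataSketch;
  server?: SideDataSketch;
}

// Returns every role whose session data looks stalled, so a gateway acting
// as both client and server has both sides checked.
function stalledSides(
  session: DualSessionSketch,
  nowMs: number,
  timeoutMs: number,
): GatewayRoleSketch[] {
  const roles: GatewayRoleSketch[] = ["client", "server"];
  return roles.filter((role) => {
    const side = session[role];
    return (
      side !== undefined &&
      !side.completed &&
      nowMs - side.lastUpdateMs > timeoutMs
    );
  });
}
```

With a client-first selection, a session whose client side is healthy but whose server side is stalled would be reported as fine; here it yields `["server"]`.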

public async checkAndResolveCrash(session: SATPSession): Promise<void> {
const fnTag = `${this.className}#checkAndResolveCrash()`;

const sessionData = session.hasClientSessionData()

Same here

public async handleRecovery(session: SATPSession): Promise<boolean> {
const fnTag = `${this.className}#handleRecovery()`;

const sessionData = session.hasClientSessionData()

Same here.

throw new Error(`${fnTag}, session data is not correctly initialized`);
}
const sessionData = session.hasClientSessionData()
? session.getClientSessionData()

Here too


this.log.info(`${fnTag} Asset Id: ${assetId} amount: ${amount}`);

await bridgeManager.burnAsset(assetId, Number(amount));

In the Stage 3 rollback, the rollback is only feasible if it occurs before the asset is minted on the receiver chain. Once minting happens, if the gateway encounters an issue and fails to assign the minted amount to the recipient, a rollback can no longer be initiated. Is this the reason why a rollback isn’t considered in such cases?

Wouldn't it make sense, then, for the minted amount on the receiver chain to be burned and re-minted on the source chain too?
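The compensation proposed in this question can be illustrated with a toy model. This is not the PR's bridge code: `InMemoryLedgerSketch` and `compensateAfterMintSketch` are hypothetical, each ledger is reduced to a single supply counter, and real ledgers would need atomicity and failure handling that this sketch ignores.

```typescript
// Hypothetical toy ledger: a single supply counter, nothing more.
class InMemoryLedgerSketch {
  private supply: number;

  constructor(initialSupply: number) {
    this.supply = initialSupply;
  }

  mint(amount: number): void {
    this.supply += amount;
  }

  burn(amount: number): void {
    if (amount > this.supply) {
      throw new Error("cannot burn more than the current supply");
    }
    this.supply -= amount;
  }

  getSupply(): number {
    return this.supply;
  }
}

// Compensation as proposed in the review comment: burn what was minted on the
// receiver chain, then restore the same amount on the source chain.
function compensateAfterMintSketch(
  source: InMemoryLedgerSketch,
  destination: InMemoryLedgerSketch,
  amount: number,
): void {
  destination.burn(amount); // throws before touching the source if the burn fails
  source.mint(amount);
}
```

Burning on the destination first means a failed burn leaves the source untouched, so the asset is never restored on one chain while still existing on the other.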

@RafaelAPB RafaelAPB force-pushed the satp-dev branch 2 times, most recently from 13e0302 to 2896426 Compare November 13, 2024 15:33
@Yogesh01000100 Yogesh01000100 force-pushed the feature/crash-recovery-improvements branch from f0e50ef to cb24d53 Compare November 15, 2024 13:46
@Yogesh01000100 Yogesh01000100 force-pushed the feature/crash-recovery-improvements branch from cb24d53 to d14f178 Compare November 18, 2024 16:23
@Yogesh01000100 Yogesh01000100 force-pushed the feature/crash-recovery-improvements branch from d14f178 to 4eef528 Compare November 26, 2024 20:35
string senderSignature = 6;
}

message LocalLog {

Please rename to "LogEntry" - this schema should be imported and used by persistLogEntry

@Yogesh01000100 Yogesh01000100 force-pushed the feature/crash-recovery-improvements branch 2 times, most recently from d6ffbca to 1405923 Compare December 11, 2024 22:12
@Yogesh01000100 Yogesh01000100 changed the title feat(recovery): add crash recovery implementation feat(satp-hermes): add crash recovery & rollback protocol Dec 11, 2024
@Yogesh01000100 Yogesh01000100 force-pushed the feature/crash-recovery-improvements branch 2 times, most recently from 98a3846 to d73a5eb Compare December 13, 2024 17:00
@Yogesh01000100 Yogesh01000100 marked this pull request as ready for review December 13, 2024 17:03
@Yogesh01000100 Yogesh01000100 force-pushed the feature/crash-recovery-improvements branch from d73a5eb to e16e84d Compare December 13, 2024 22:11
@RafaelAPB RafaelAPB self-requested a review December 14, 2024 13:31
Contributor

@RafaelAPB RafaelAPB left a comment

I'll leave some comments:

As discussed 3 months ago: @Yogesh01000100 please include documentation and tests, and update the description, as discussed.
Add other authors to the commit

Contributor

@RafaelAPB RafaelAPB left a comment

Please include CrashStatus and LocalLog types in the open api spec and import them where needed

this.crashManager = new CrashManager(crashOptions);
}

if (this.config.enableMigration) {

Is this about enabling the database connection? If so, please change the variable names and

@@ -97,6 +104,9 @@ export class SATPGateway implements IPluginWebService, ICactusPlugin {
public localRepository?: ILocalLogRepository;
public remoteRepository?: IRemoteLogRepository;
private readonly shutdownHooks: ShutdownHook[];
private readonly crashManager?: CrashManager;

please set up this instance in the constructor, as it is not being used

import { SATP_VERSION } from "../../../main/typescript/core/constants";
import { SATPSession } from "../../../main/typescript/core/satp-session";
import { getSatpLogKey } from "../../../main/typescript/gateway-utils";
import { TokenType } from "../../../main/typescript/core/stage-services/satp-bridge/types/asset";

Please add some tests:

  • Crash recovery for stage 1 (feel free to provoke a crash at stage 1, phase 1.1)
  • Crash recovery for stages 2.1, 2.3A, 2.5, and 2.6
  • Crash recovery for stages 3.2A, 3.3, 3.6A, and 3.7
  • Rollback for the stages and steps above

Please create test utility functions to help reduce code duplication. You have a good example here: packages/cactus-plugin-satp-hermes/src/test/typescript/test-utils.ts. Please add to this file.
Reference: https://github.com/ietf-satp/draft-ietf-satp-core/blob/main/figures/message%20flow%20diagram/gateway-message-flow-asset-transfer-v20PNG.png

@Yogesh01000100
Contributor Author

> Please include CrashStatus and LocalLog types in the open api spec and import them where needed

About the OpenAPI spec: it has both a tpl.json and a .json file, and I'm unsure which one to update.

@Yogesh01000100 Yogesh01000100 force-pushed the feature/crash-recovery-improvements branch 2 times, most recently from b094409 to 222d088 Compare December 16, 2024 09:15
Contributor

@RafaelAPB RafaelAPB left a comment

We need to consider this carefully. If sessionData remains as it is, we must handle it with care and clearly differentiate between the client and server sides of the gateway. I designed the sessionData this way to ensure that a gateway can act as both a client and server to itself.

@yogesh please address this concern

Contributor

@RafaelAPB RafaelAPB left a comment

> > Please include CrashStatus and LocalLog types in the open api spec and import them where needed
>
> About the OpenAPI spec: it has both a tpl.json and a .json file, and I'm unsure which one to update.

Please check the package.json to see which file is used for generation and for what purpose.

Contributor

@RafaelAPB RafaelAPB left a comment

@RafaelAPB

Yogesh, can you confirm this has been addressed?

1. Implemented recovery & rollback using RPC-based message handlers.
2. Added rollback strategies for all SATP stages.
3. Integrated database log management for recovery and rollback.
4. Added cron jobs for scheduled crash detection and recovery initiation.

Co-authored-by: Rafael Belchior <[email protected]>
Co-authored-by: Carlos Amaro <[email protected]>
Signed-off-by: Yogesh01000100 <[email protected]>

chore(satp-hermes): improve DB management

Signed-off-by: Rafael Belchior <[email protected]>

chore(satp-hermes): crash recovery architecture

Signed-off-by: Rafael Belchior <[email protected]>

fix(recovery): enhance crash recovery and rollback implementation

Signed-off-by: Yogesh01000100 <[email protected]>

refactor(recovery): consolidate logic and improve SATP message handling

Signed-off-by: Yogesh01000100 <[email protected]>

feat(recovery): add rollback implementations

Signed-off-by: Yogesh01000100 <[email protected]>

fix: correct return types and inits

Signed-off-by: Yogesh01000100 <[email protected]>

fix: add unit tests and resolve rollbackstate

Signed-off-by: Yogesh01000100 <[email protected]>

feat: add function processing logs from g2

Signed-off-by: Yogesh01000100 <[email protected]>

feat: add cron schedule for periodic crash checks

Signed-off-by: Yogesh01000100 <[email protected]>

fix: resolve rollback condition and add tests

Signed-off-by: Yogesh01000100 <[email protected]>

feat: add orchestrator communication layer using connect-RPC

Signed-off-by: Yogesh01000100 <[email protected]>

feat: add rollback protocol rpc

Signed-off-by: Yogesh01000100 <[email protected]>

fix: handle server log synchronization

Signed-off-by: Yogesh01000100 <[email protected]>

fix: resolve gol errors, add unit tests

Signed-off-by: Yogesh01000100 <[email protected]>

fix: handle server-side rollback

Signed-off-by: Yogesh01000100 <[email protected]>

fix: resolve networkId in rollback strategies

Signed-off-by: Yogesh01000100 <[email protected]>
@Yogesh01000100 Yogesh01000100 force-pushed the feature/crash-recovery-improvements branch from 222d088 to 503658c Compare December 17, 2024 16:27
4 participants