feat(satp-hermes): add crash recovery & rollback protocol #3491
base: satp-dev
I will review this PR
@Yogesh01000100 please rebase with satp-dev (should not have conflicts)
@Yogesh01000100 please include documentation and tests, and update the description, as discussed.
@Yogesh01000100 could you please squash the commits and rebase onto the latest version of satp-dev prior to merge?
Generally looks very good, but some changes are needed prior to merging.
Summarizing my comments:
- Add other authors to the commit
- Incorporate feedback from the logging process (namely un-hardcoding logs and adding more information)
- Implement RollbackState (for example, it should state how many more steps remain to be rolled back at any moment, what has already been rolled back, the estimated time to completion, etc.)
- Please add tests that support the new feature
- Please add comprehensive documentation on this feature. Example: the SATP README should have a section on how to run Docker Compose with several example configurations.
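A RollbackState along the lines requested in the list above might look like the following minimal sketch. The field names (`rolledBackSteps`, `stepsRemaining`, `estimatedTimeToCompletionMs`) and the helpers `createRollbackState` and `recordRolledBackStep` are hypothetical and are not taken from the PR; the time estimate simply assumes a fixed average duration per step.

```typescript
// Hypothetical progress report for a rollback; not the PR's actual type.
interface RollbackState {
  sessionId: string;
  currentStage: string;
  rolledBackSteps: string[]; // what has already been rolled back
  stepsRemaining: number; // how many more steps are to be rolled back
  estimatedTimeToCompletionMs: number; // naive estimate: remaining steps * average step time
  status: "IN_PROGRESS" | "COMPLETED";
}

// Build the initial state from the list of steps that are planned to be undone.
function createRollbackState(
  sessionId: string,
  currentStage: string,
  plannedSteps: string[],
  avgStepMs: number,
): RollbackState {
  return {
    sessionId,
    currentStage,
    rolledBackSteps: [],
    stepsRemaining: plannedSteps.length,
    estimatedTimeToCompletionMs: plannedSteps.length * avgStepMs,
    status: plannedSteps.length === 0 ? "COMPLETED" : "IN_PROGRESS",
  };
}

// Return a new state after one step has been undone (the input is not mutated).
function recordRolledBackStep(
  state: RollbackState,
  step: string,
  avgStepMs: number,
): RollbackState {
  const stepsRemaining = Math.max(0, state.stepsRemaining - 1);
  return {
    ...state,
    rolledBackSteps: [...state.rolledBackSteps, step],
    stepsRemaining,
    estimatedTimeToCompletionMs: stepsRemaining * avgStepMs,
    status: stepsRemaining === 0 ? "COMPLETED" : "IN_PROGRESS",
  };
}
```

Keeping the state immutable makes it straightforward to persist each intermediate snapshot as a log entry, which is one possible way to make the rollback itself recoverable.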
packages/cactus-plugin-satp-hermes/src/main/typescript/core/recovery/crash-manager.ts (outdated)
packages/cactus-plugin-satp-hermes/src/main/typescript/core/recovery/crash-recovery-handler.ts (outdated, 2 threads)
...us-plugin-satp-hermes/src/main/typescript/core/recovery/rollback/stage0-rollback-strategy.ts (outdated)
...us-plugin-satp-hermes/src/main/typescript/core/recovery/rollback/stage1-rollback-strategy.ts (outdated)
...us-plugin-satp-hermes/src/main/typescript/core/recovery/rollback/stage2-rollback-strategy.ts (outdated)
...us-plugin-satp-hermes/src/main/typescript/core/recovery/rollback/stage3-rollback-strategy.ts (outdated)
...us-plugin-satp-hermes/src/main/typescript/generated/proto/cacti/satp/v02/common/health_pb.ts (outdated)
Review how sessionData is being used, and take a look at the Stage 3 question.
Please document the new code as well. The rest is being documented in this PR:
https://github.com/hyperledger/cacti/pull/3619
private async checkCrash(session: SATPSession): Promise<CrashStatus> {
  const fnTag = `${this.className}#checkCrash()`;
  const sessionData = session.hasClientSessionData()
Why does this prioritize the client session?
In this implementation, the gateway can be a client and a server at the same time, so in that case we may not be detecting some crashes.
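One way to address this concern is to inspect every side of the session that is present, rather than stopping at the client side. The sketch below is only illustrative: `SessionLike` and `SessionData` are minimal stand-ins for the real SATP types, `hasServerSessionData()` and `getServerSessionData()` are assumed to mirror the client-side accessors seen in the diff, and the `role` and `lastActivityMs` fields are hypothetical.

```typescript
// Minimal stand-ins for the real SATP types; names and fields are assumptions.
interface SessionData {
  id: string;
  role: "client" | "server";
  lastActivityMs: number; // hypothetical timestamp of the last processed message
}

interface SessionLike {
  hasClientSessionData(): boolean;
  hasServerSessionData(): boolean;
  getClientSessionData(): SessionData;
  getServerSessionData(): SessionData;
}

// Collect every side that exists instead of short-circuiting on the client side.
function collectSessionData(session: SessionLike): SessionData[] {
  const sides: SessionData[] = [];
  if (session.hasClientSessionData()) {
    sides.push(session.getClientSessionData());
  }
  if (session.hasServerSessionData()) {
    sides.push(session.getServerSessionData());
  }
  if (sides.length === 0) {
    throw new Error("session data is not correctly initialized");
  }
  return sides;
}

// Report each side whose last activity is older than the timeout.
function findTimedOutSides(
  session: SessionLike,
  nowMs: number,
  timeoutMs: number,
): SessionData[] {
  return collectSessionData(session).filter(
    (data) => nowMs - data.lastActivityMs > timeoutMs,
  );
}
```

With this shape, a gateway acting as client and server to itself gets both sides checked, so a stalled server side is not masked by a healthy client side.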
public async checkAndResolveCrash(session: SATPSession): Promise<void> {
  const fnTag = `${this.className}#checkAndResolveCrash()`;

  const sessionData = session.hasClientSessionData()
Same here
public async handleRecovery(session: SATPSession): Promise<boolean> {
  const fnTag = `${this.className}#handleRecovery()`;

  const sessionData = session.hasClientSessionData()
Same here.
    throw new Error(`${fnTag}, session data is not correctly initialized`);
  }
  const sessionData = session.hasClientSessionData()
    ? session.getClientSessionData()
Here too
this.log.info(`${fnTag} Asset Id: ${assetId} amount: ${amount}`);

await bridgeManager.burnAsset(assetId, Number(amount));
In the Stage 3 rollback, the rollback is only feasible if it occurs before the asset is minted on the receiver chain. Once minting happens, if the gateway encounters an issue and fails to assign the minted amount to the recipient, a rollback can no longer be initiated. Is this the reason why a rollback isn’t considered in such cases?
Wouldn't it make sense, then, for the minted amount on the receiver chain to be burned and re-minted on the source chain too?
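To make the question concrete, the compensation being suggested can be sketched as a pure planning function. This is an illustration of the reviewer's proposal only, not what the PR implements and not a statement of what the SATP specification permits: the `Stage3Progress` flags, the `CompensatingAction` type, and `planStage3Rollback` are all hypothetical names.

```typescript
// Hypothetical snapshot of how far Stage 3 progressed before the failure.
interface Stage3Progress {
  burnedOnSource: boolean;
  mintedOnReceiver: boolean;
  assignedToRecipient: boolean;
}

// One compensating operation to run on one of the two chains.
interface CompensatingAction {
  chain: "source" | "receiver";
  operation: "burn" | "mint";
  assetId: string;
  amount: number;
}

// Decide which compensating actions would undo a partially completed Stage 3.
function planStage3Rollback(
  progress: Stage3Progress,
  assetId: string,
  amount: number,
): CompensatingAction[] {
  // Assumption of this sketch: once assigned, the transfer is treated as final.
  if (progress.assignedToRecipient) {
    return [];
  }
  const actions: CompensatingAction[] = [];
  if (progress.mintedOnReceiver) {
    // Undo the mint on the receiver chain first.
    actions.push({ chain: "receiver", operation: "burn", assetId, amount });
  }
  if (progress.burnedOnSource) {
    // Then restore the asset that was burned on the source chain.
    actions.push({ chain: "source", operation: "mint", assetId, amount });
  }
  return actions;
}
```

Separating the plan from its execution keeps the decision logic testable without a ledger, and the resulting list could be logged before any bridge call is made.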
packages/cactus-plugin-satp-hermes/src/test/typescript/unit/recovery/logging.test.ts (outdated)
packages/cactus-plugin-satp-hermes/src/test/typescript/unit/recovery/services.test.ts (outdated)
packages/cactus-plugin-satp-hermes/src/main/typescript/gol/gateway-orchestrator.ts (outdated)
packages/cactus-plugin-satp-hermes/src/main/typescript/core/recovery/crash-manager.ts (outdated)
  string senderSignature = 6;
}

message LocalLog {
Please rename to "LogEntry" - this schema should be imported and used by persistLogEntry
packages/cactus-plugin-satp-hermes/src/main/typescript/blo/dispatcher.ts (outdated)
I'll leave some comments:
As discussed 3 months ago: @Yogesh01000100 please include documentation and tests, and update the description, as discussed.
Add other authors to the commit
Please include the CrashStatus and LocalLog types in the OpenAPI spec and import them where needed
  this.crashManager = new CrashManager(crashOptions);
}

if (this.config.enableMigration) {
Is this about enabling the database connection? If so, please change the variable names and
@@ -97,6 +104,9 @@ export class SATPGateway implements IPluginWebService, ICactusPlugin {
  public localRepository?: ILocalLogRepository;
  public remoteRepository?: IRemoteLogRepository;
  private readonly shutdownHooks: ShutdownHook[];
  private readonly crashManager?: CrashManager;
Please set up this instance in the constructor, as it is not being used.
import { SATP_VERSION } from "../../../main/typescript/core/constants";
import { SATPSession } from "../../../main/typescript/core/satp-session";
import { getSatpLogKey } from "../../../main/typescript/gateway-utils";
import { TokenType } from "../../../main/typescript/core/stage-services/satp-bridge/types/asset";
Please add some tests:
- crash recovery for stage 1 (feel free to provoke a crash at stage 1, phase 1.1)
- crash recovery for stages 2.1, 2.3A, 2.5, and 2.6
- crash recovery for stages 3.2A, 3.3, 3.6A, and 3.7
- rollback for the stages and steps above
Please create test utility functions to help reduce code duplication. You have a good example here: packages/cactus-plugin-satp-hermes/src/test/typescript/test-utils.ts. Please add to this file.
Reference: https://github.com/ietf-satp/draft-ietf-satp-core/blob/main/figures/message%20flow%20diagram/gateway-message-flow-asset-transfer-v20PNG.png
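A test utility of the kind requested here could start from a table of crash points, so that each recovery and rollback scenario is generated rather than hand-written. The following is a minimal sketch under stated assumptions: the step labels come from the comment above, while `CRASH_POINTS`, `CrashTestCase`, and `buildCrashTestCases` are hypothetical names that do not exist in test-utils.ts.

```typescript
// Crash points named in the review comment; everything else here is hypothetical.
const CRASH_POINTS = [
  "1.1",
  "2.1",
  "2.3A",
  "2.5",
  "2.6",
  "3.2A",
  "3.3",
  "3.6A",
  "3.7",
] as const;

type CrashPoint = (typeof CRASH_POINTS)[number];
type Scenario = "recovery" | "rollback";

interface CrashTestCase {
  name: string;
  crashPoint: CrashPoint;
  stage: number; // SATP stage, derived from the part before the dot
  scenario: Scenario;
}

// Build one test case per (crash point, scenario) pair.
function buildCrashTestCases(): CrashTestCase[] {
  const scenarios: Scenario[] = ["recovery", "rollback"];
  const cases: CrashTestCase[] = [];
  for (const crashPoint of CRASH_POINTS) {
    for (const scenario of scenarios) {
      cases.push({
        name: `${scenario} after crash at step ${crashPoint}`,
        crashPoint,
        stage: Number(crashPoint.split(".")[0]),
        scenario,
      });
    }
  }
  return cases;
}
```

The resulting array could be fed to a table-driven runner such as Jest's `test.each`, with a shared helper that sets up the gateways and provokes the crash at the given step.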
About the OpenAPI spec part: it has both a tpl.json and a .json file, and I'm a bit unsure which one to update.
We need to consider this carefully. If sessionData remains as it is, we must handle it with care and clearly differentiate between the client and server sides of the gateway. I designed the sessionData this way to ensure that a gateway can act as both a client and server to itself.
@yogesh please address this concern
Please include the CrashStatus and LocalLog types in the OpenAPI spec and import them where needed
About the OpenAPI spec part: it has both a tpl.json and a .json file, and I'm a bit unsure which one to update.
Please check the package.json to see which one is used for generation and for which purpose
Yogesh, can you confirm this has been addressed?
1. Implemented recovery & rollback using RPC-based message handlers.
2. Added rollback strategies for all SATP stages.
3. Integrated database log management for recovery and rollback.
4. Added cron jobs for scheduled crash detection and recovery initiation.

Co-authored-by: Rafael Belchior <[email protected]>
Co-authored-by: Carlos Amaro <[email protected]>
Signed-off-by: Yogesh01000100 <[email protected]>

chore(satp-hermes): improve DB management
Signed-off-by: Rafael Belchior <[email protected]>

chore(satp-hermes): crash recovery architecture
Signed-off-by: Rafael Belchior <[email protected]>

fix(recovery): enhance crash recovery and rollback implementation
Signed-off-by: Yogesh01000100 <[email protected]>

refactor(recovery): consolidate logic and improve SATP message handling
Signed-off-by: Yogesh01000100 <[email protected]>

feat(recovery): add rollback implementations
Signed-off-by: Yogesh01000100 <[email protected]>

fix: correct return types and inits
Signed-off-by: Yogesh01000100 <[email protected]>

fix: add unit tests and resolve rollbackstate
Signed-off-by: Yogesh01000100 <[email protected]>

feat: add function processing logs from g2
Signed-off-by: Yogesh01000100 <[email protected]>

feat: add cron schedule for periodic crash checks
Signed-off-by: Yogesh01000100 <[email protected]>

fix: resolve rollback condition and add tests
Signed-off-by: Yogesh01000100 <[email protected]>

feat: add orchestrator communication layer using connect-RPC
Signed-off-by: Yogesh01000100 <[email protected]>

feat: add rollback protocol rpc
Signed-off-by: Yogesh01000100 <[email protected]>

fix: handle server log synchronization
Signed-off-by: Yogesh01000100 <[email protected]>

fix: resolve gol errors, add unit tests
Signed-off-by: Yogesh01000100 <[email protected]>

fix: handle server-side rollback
Signed-off-by: Yogesh01000100 <[email protected]>

fix: resolve networkId in rollback strategies
Signed-off-by: Yogesh01000100 <[email protected]>
Description
This PR addresses issue #3114 by implementing core components for crash recovery and rollback protocols. The changes enhance fault tolerance and ensure consistent recovery during failures.
Key Changes
1. CrashManager
Introduced a CrashManager class responsible for managing crash detection, recovery, and rollback processes.
Key functionalities include:
[...] node-schedule.
2. Protocol Services
Updated crash_recovery.proto to define:
- RecoverMessage, RecoverUpdateMessage, and RecoverSuccessMessage for crash recovery.
- RollbackMessage and RollbackAckMessage for rollback processes.
3. Recovery & Rollback Strategies
Implemented recovery & rollback strategies for all SATP protocol stages, ensuring the ability to revert to a consistent state upon failure.
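The per-stage strategy files touched by this PR (stage0-rollback-strategy.ts through stage3-rollback-strategy.ts) suggest a strategy-pattern layout. The sketch below shows that general pattern only; `RollbackStrategyLike`, `makeStageStrategy`, and `selectRollbackStrategy` are hypothetical names, the strategies are synchronous for brevity (real ones would likely be asynchronous), and the bodies are placeholders rather than the PR's logic.

```typescript
// Hypothetical strategy interface; the PR's real interface may differ.
interface RollbackStrategyLike {
  stage: number;
  execute(sessionId: string): string;
}

// Placeholder strategy: a real one would undo the ledger operations of its stage.
function makeStageStrategy(stage: number): RollbackStrategyLike {
  return {
    stage,
    execute: (sessionId: string) =>
      `rolled back stage ${stage} for session ${sessionId}`,
  };
}

// Registry keyed by SATP stage number (stages 0 to 3).
const rollbackStrategies = new Map<number, RollbackStrategyLike>();
for (const stage of [0, 1, 2, 3]) {
  rollbackStrategies.set(stage, makeStageStrategy(stage));
}

// Pick the strategy for the stage at which the session stopped.
function selectRollbackStrategy(stage: number): RollbackStrategyLike {
  const strategy = rollbackStrategies.get(stage);
  if (strategy === undefined) {
    throw new Error(`no rollback strategy registered for stage ${stage}`);
  }
  return strategy;
}
```

Keeping selection separate from execution means a caller only needs the session's last completed stage to obtain the right strategy, and an unknown stage fails loudly instead of silently skipping the rollback.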
4. Crash Detection and Handling
Added mechanisms to: