[Feature] Support Cluster Snapshot Backup: deletion control (part4) #54980

Merged
merged 9 commits into from
Jan 17, 2025

Contributor

@srlch srlch commented Jan 13, 2025

What I'm doing:

This is part 4 of Support Cluster Snapshot Backup.
This PR adds deletion control so that the files on the BE/CN side that the current
cluster snapshot still needs are kept. The basic idea is as follows:

  1. Introduce a valid deletion timestamp, ValidDeletionTimeMs, which is the creation
    time of the previous automated cluster snapshot.
  2. Use ValidDeletionTimeMs for vacuum.
  3. Use ValidDeletionTimeMs to filter partitions/tables in the recycle bin.
  4. Use ValidDeletionTimeMs to filter alter jobs.
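
As a rough illustration of the recycle-bin step above, here is a minimal sketch, not the code in this PR. The class `RecycleBinFilterSketch` and the method `erasableIds` are hypothetical names; the only thing taken from the diff is the direction of the comparison (recycle time `<` valid deletion time).

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: filters recycle-bin entries by a valid deletion timestamp.
final class RecycleBinFilterSketch {

    // Returns the ids whose recycle time is strictly earlier than validDeletionTimeMs.
    // Entries recycled at or after that time are kept, on the assumption that the
    // snapshot used for control may still reference their files.
    static List<Long> erasableIds(Map<Long, Long> idToRecycleTime, long validDeletionTimeMs) {
        List<Long> result = new ArrayList<>();
        for (Map.Entry<Long, Long> entry : idToRecycleTime.entrySet()) {
            if (entry.getValue() < validDeletionTimeMs) {
                result.add(entry.getKey());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        Map<Long, Long> idToRecycleTime = new LinkedHashMap<>();
        idToRecycleTime.put(1L, 1000L); // recycled before the timestamp
        idToRecycleTime.put(2L, 2000L); // recycled after the timestamp
        System.out.println(erasableIds(idToRecycleTime, 1500L));
    }
}
```

The same comparison would apply to an alter job's finished time in step 4; how vacuum consumes the timestamp on the BE/CN side is not shown in this thread.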

Fixes #53867

What type of PR is this:

  • BugFix
  • Feature
  • Enhancement
  • Refactor
  • UT
  • Doc
  • Tool

Does this PR entail a change in behavior?

  • Yes, this PR will result in a change in behavior.
  • No, this PR will not result in a change in behavior.

If yes, please specify the type of change:

  • Interface/UI changes: syntax, type conversion, expression evaluation, display information
  • Parameter changes: default values, similar parameters but with different default values
  • Policy changes: use new policy to replace old one, functionality automatically enabled
  • Feature removed
  • Miscellaneous: upgrade & downgrade compatibility, etc.

Checklist:

  • I have added test cases for my bug fix or my new feature
  • This pr needs user documentation (for new or modified features or behaviors)
    • I have added documentation for my new feature or new function
  • This is a backport pr

Bugfix cherry-pick branch check:

  • I have checked the version labels which the pr will be auto-backported to the target branch
    • 3.4
    • 3.3
    • 3.2
    • 3.1
    • 3.0

@srlch srlch requested a review from a team as a code owner January 13, 2025 02:42
@mergify mergify bot assigned srlch Jan 13, 2025
}
return valid;
}

public TClusterSnapshotJobsResponse getAllJobsInfo() {
TClusterSnapshotJobsResponse response = new TClusterSnapshotJobsResponse();
for (Map.Entry<Long, ClusterSnapshotJob> entry : historyAutomatedSnapshotJobs.entrySet()) {

The most risky bug in this code is:
Incorrect conversion of time units leading to potential logic errors.

You can modify the code like this:

valid = (alterJob.getFinishedTimeMs() < getValidDeletionTimeSecByAutomatedSnapshot() * 1000L);

Explanation: The multiplication of getValidDeletionTimeSecByAutomatedSnapshot() by 1000 should explicitly use a long literal (1000L) to ensure the calculation handles large numbers properly, matching the type of getFinishedTimeMs() which likely returns milliseconds as a long.
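
Whether the `1000L` literal matters depends on the operand type, which this thread does not show. The sketch below uses hypothetical method names to illustrate the general Java rule: `int * int` can overflow before widening, while a `long` operand already forces 64-bit arithmetic. The code shown later in this PR exposes `getValidDeletionTimeMsByAutomatedSnapshot()`, a millisecond value, so no conversion may be needed there at all.

```java
// Hypothetical sketch of seconds-to-milliseconds conversion and integer promotion.
final class TimeUnitSketch {

    // int * int is evaluated in 32 bits and may overflow before the result is widened.
    static long secToMsIntArithmetic(int seconds) {
        return seconds * 1000;
    }

    // The long literal promotes the multiplication to 64 bits.
    static long secToMsLongLiteral(int seconds) {
        return seconds * 1000L;
    }

    // If the operand is already a long, a plain 1000 is promoted automatically.
    static long secToMsLongOperand(long seconds) {
        return seconds * 1000;
    }
}
```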

@@ -635,7 +651,7 @@ private synchronized void disableRecoverPartitionWithSameName(long dbId, long ta
continue;
}
partitionInfo.setRecoverable(false);
- idToRecycleTime.replace(partitionInfo.getPartition().getId(), 0L);
+ idToRecycleTime.replace(partitionInfo.getPartition().getId(), System.currentTimeMillis());
break;
}
}

The most risky bug in this code is:
Changing the recycle time to System.currentTimeMillis() may prevent immediate deletion of non-recoverable tables or partitions, which should be set to zero.

You can modify the code like this:

// Keep original behavior for non-recoverable items by setting their recycle time to 0.
idToRecycleTime.put(table.getId(), !recoverable ? 0 : System.currentTimeMillis());
// Update other relevant parts where non-recoverable state should set time to 0 similarly.

@wanpengfei-git wanpengfei-git requested a review from a team January 13, 2025 03:45
Comment on lines 317 to 321
if (idToRecycleTableInfo.get(id) != null) {
recoverable = idToRecycleTableInfo.get(id).isRecoverable();
} else if (idToPartition.get(id) != null) {
recoverable = idToPartition.get(id).isRecoverable();
}
Contributor

@xiangguangyxg xiangguangyxg Jan 13, 2025

get() is called twice for the same key.

Contributor Author

Updated
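
One way to address the duplicated lookup is to store each `get()` result in a local variable. The sketch below is only an illustration: `Info` is a hypothetical stand-in for the recycle-bin info classes, and `defaultValue` is an assumption, since the initial value of `recoverable` is not visible in the quoted lines.

```java
import java.util.Map;

// Hypothetical sketch: each map is looked up at most once per call.
final class SingleLookupSketch {

    // Stand-in for the table/partition info objects held by the recycle bin.
    static final class Info {
        private final boolean recoverable;

        Info(boolean recoverable) {
            this.recoverable = recoverable;
        }

        boolean isRecoverable() {
            return recoverable;
        }
    }

    static boolean isRecoverable(long id, Map<Long, Info> tables, Map<Long, Info> partitions,
                                 boolean defaultValue) {
        Info tableInfo = tables.get(id);
        if (tableInfo != null) {
            return tableInfo.isRecoverable();
        }
        Info partitionInfo = partitions.get(id);
        if (partitionInfo != null) {
            return partitionInfo.isRecoverable();
        }
        return defaultValue; // id is in neither map
    }
}
```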

@@ -52,6 +53,8 @@ public class ClusterSnapshotMgr implements GsonPostProcessable {
@SerializedName(value = "historyAutomatedSnapshotJobs")
private TreeMap<Long, ClusterSnapshotJob> historyAutomatedSnapshotJobs = new TreeMap<>();

private long previousAutomatedSnapshotCreatedTimsMs = 0;
Contributor

Is this field needed? The snapshot already has a createdTimeMs.


boolean findFirstSuccess = false;
long previousAutomatedSnapshotCreatedTimsMs = 0;
for (Long key : historyAutomatedSnapshotJobs.descendingKeySet()) {
Contributor

Would descendingMap be better?

Contributor Author

updated

Contributor Author

Updated
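
For context on the `descendingMap` suggestion: `TreeMap.descendingMap()` yields key and value together from the largest key down, so no second `get()` per key is needed. The sketch below is a simplification under stated assumptions. `Job` is a hypothetical stand-in for `ClusterSnapshotJob`, and the loop simply returns the newest finished job's creation time, whereas the PR's loop tracks a `findFirstSuccess` flag and may select a different job.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch: scan the job history from newest to oldest in one pass.
final class LatestFinishedSnapshotSketch {

    // Stand-in for a snapshot job with a creation time and a finished flag.
    static final class Job {
        final long createdTimeMs;
        final boolean finished;

        Job(long createdTimeMs, boolean finished) {
            this.createdTimeMs = createdTimeMs;
            this.finished = finished;
        }
    }

    // Returns the creation time of the finished job with the largest id,
    // or fallback when no job in the history is finished.
    static long latestFinishedCreatedTimeMs(TreeMap<Long, Job> history, long fallback) {
        for (Map.Entry<Long, Job> entry : history.descendingMap().entrySet()) {
            Job job = entry.getValue();
            if (job.finished) {
                return job.createdTimeMs;
            }
        }
        return fallback;
    }
}
```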

@srlch srlch force-pushed the cluster_snapshot_part_4 branch from 21bf08a to 4683647 Compare January 13, 2025 10:35
@srlch srlch force-pushed the cluster_snapshot_part_4 branch 3 times, most recently from 7109613 to 3ec6546 Compare January 14, 2025 06:23
Comment on lines 303 to 306
if (!idToRecycleTime.containsKey(id)) {
return true;
}
return idToRecycleTime.get(id) < GlobalStateMgr.getCurrentState()
Contributor

containsKey() followed by get() looks up the same key twice.

Contributor Author

updated

@@ -105,6 +105,10 @@ public boolean isUnFinishedState() {
state == ClusterSnapshotJobState.FINISHED;
}

public boolean isSuccess() {
Contributor

Suggested change
- public boolean isSuccess() {
+ public boolean isFinished() {

Contributor Author

updated

return 0;
}

return idToRecycleTime.get(id);
Contributor

If an id is not in idToRecycleTime, will this throw an exception?

Contributor Author

In the original logic, the caller ensures that the key exists.

Contributor

Could we simply handle this case? Would just returning 0 be OK?
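
A minimal sketch of the "just return 0" option, assuming `idToRecycleTime` is a `Map<Long, Long>` (the `replace(..., 0L)` call earlier in the diff suggests boxed `Long` values). With a `long` return type, `get()` on a missing key returns `null` and the unboxing throws a `NullPointerException`; `Map.getOrDefault` avoids that. The class and method names here are hypothetical.

```java
import java.util.Map;

// Hypothetical sketch: a missing id yields 0 instead of a NullPointerException.
final class RecycleTimeLookupSketch {

    static long getRecycleTime(Map<Long, Long> idToRecycleTime, long id) {
        // Covers absent keys only; a key explicitly mapped to null would still fail on unboxing.
        return idToRecycleTime.getOrDefault(id, 0L);
    }
}
```

Whether 0 is the right default depends on how callers interpret it: under the `<` comparison used elsewhere in this PR, a recycle time of 0 compares as earlier than any positive deletion timestamp.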

@srlch srlch force-pushed the cluster_snapshot_part_4 branch 2 times, most recently from c66d6ee to 3f4cbce Compare January 15, 2025 10:55
srlch added 7 commits January 17, 2025 10:24
Signed-off-by: srlch <[email protected]>
Signed-off-by: srlch <[email protected]>
Signed-off-by: srlch <[email protected]>
Signed-off-by: srlch <[email protected]>
Signed-off-by: srlch <[email protected]>
Signed-off-by: srlch <[email protected]>
@srlch srlch force-pushed the cluster_snapshot_part_4 branch from 3f4cbce to 676ade4 Compare January 17, 2025 02:24
srlch added 2 commits January 17, 2025 10:32
Signed-off-by: srlch <[email protected]>
Signed-off-by: srlch <[email protected]>
[Java-Extensions Incremental Coverage Report]

pass : 0 / 0 (0%)

[FE Incremental Coverage Report]

pass : 82 / 86 (95.35%)

file detail

| path | covered_line | new_line | coverage | not_covered_line_detail |
|---|---|---|---|---|
| 🔵 com/starrocks/catalog/CatalogRecycleBin.java | 27 | 31 | 87.10% | [305, 348, 364, 401] |
| 🔵 com/starrocks/lake/snapshot/ClusterSnapshotMgr.java | 30 | 30 | 100.00% | [] |
| 🔵 com/starrocks/lake/snapshot/ClusterSnapshotJob.java | 8 | 8 | 100.00% | [] |
| 🔵 com/starrocks/alter/AlterHandler.java | 3 | 3 | 100.00% | [] |
| 🔵 com/starrocks/lake/vacuum/AutovacuumDaemon.java | 3 | 3 | 100.00% | [] |
| 🔵 com/starrocks/lake/snapshot/ClusterSnapshot.java | 8 | 8 | 100.00% | [] |
| 🔵 com/starrocks/lake/StarMgrMetaSyncer.java | 3 | 3 | 100.00% | [] |

[BE Incremental Coverage Report]

pass : 0 / 0 (0%)

@wyb wyb merged commit 5cc4e6a into StarRocks:main Jan 17, 2025
49 checks passed
historyAutomatedSnapshotJobs.pollFirstEntry();
}
historyAutomatedSnapshotJobs.put(job.getId(), job);
}

public synchronized long getValidDeletionTimeMsByAutomatedSnapshot() {
Contributor

Suggested change
- public synchronized long getValidDeletionTimeMsByAutomatedSnapshot() {
+ public synchronized long getSafeDeletionTimeMs() {

Contributor

It would be better to add another function:
boolean isDeletionSafeToExecute(long deletionCreatedTimeMs)

Contributor Author

Will fix in the next PR.
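
The helper proposed in this thread was deferred to a later PR, so the following is only a guess at its shape. `SafeDeletionSketch` is a hypothetical class, and the strict `<` comparison is an assumption carried over from the other checks quoted in this review.

```java
// Hypothetical sketch of a manager that exposes a predicate instead of a raw timestamp.
final class SafeDeletionSketch {
    private final long safeDeletionTimeMs;

    SafeDeletionSketch(long safeDeletionTimeMs) {
        this.safeDeletionTimeMs = safeDeletionTimeMs;
    }

    synchronized long getSafeDeletionTimeMs() {
        return safeDeletionTimeMs;
    }

    // True when the deletion was created strictly before the safe deletion time.
    synchronized boolean isDeletionSafeToExecute(long deletionCreatedTimeMs) {
        return deletionCreatedTimeMs < safeDeletionTimeMs;
    }
}
```

A predicate like this keeps the direction of the comparison in one place, so callers such as the recycle bin and the alter handler would not each repeat it.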

return previousAutomatedSnapshotCreatedTimsMs;
}

public synchronized boolean checkValidDeletionForTableFromAlterJob(long tableId) {
Contributor

Suggested change
- public synchronized boolean checkValidDeletionForTableFromAlterJob(long tableId) {
+ public synchronized boolean isTableSafeToDeleteTablet(long tableId) { ?

Contributor Author

srlch commented Jan 17, 2025

@Mergifyio backport branch-3.4

Contributor

mergify bot commented Jan 17, 2025

backport branch-3.4

✅ Backports have been created

Development

Successfully merging this pull request may close these issues.

shared-data mode support backup restore though automated snapshot
5 participants