Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DataMove Should Decide BulkLoading After Old DataMove Actor Has Been Cleared #11947

Open
wants to merge 2 commits into
base: main
Choose a base branch
from

Conversation

kakaiu
Copy link
Member

@kakaiu kakaiu commented Feb 13, 2025

This PR fixes a bulkload bug causing DD restarts.

In the bulkLoad datamove mechanism, when a bulkload is triggered, this task is registered on a map. For any following data move on the same range, the data move is converted to a "bulkload" data move, which triggers SS to load data. When the bulkload data move starts. it updates the bulkload metadata. At this time, it checks if the metadata shows correct phase (the metadata should not indicate the move has been completed). If the task has been marked as completed, the data move triggers DD restart.

The problem happens when a previous data move for the bulkload task is cancelled by the new data move for the same task. The issue happens when the previous data move marks the bulkload task as completed, before the new data move starts the same task. As a result, when the new data move tries to start the task, it finds that the bulkload metadata has been marked as completed which is unexpected and triggers DD restarts.

The fix is that the new data move should decide to do bulk loading after the cancellation of the old data move. So, after the cancellation of the old data move, if the bulkload task metadata has been marked as completed, the new data move does not do bulk loading.

100K bulkload tests:
20250213-174727-zhewang-aa06c0d4d7bf86be compressed=True data_size=36849755 duration=8769146 ended=100000 fail=2 fail_fast=10 max_runs=100000 pass=99998 priority=100 remaining=0 runtime=1:11:07 sanity=False started=100000 stopped=20250213-185834 submitted=20250213-174727 timeout=5400 username=zhewang

100K bulkdump tests:
20250213-190617-zhewang-a7491d9a121a5380 compressed=True data_size=36849352 duration=11450458 ended=100000 fail=1 fail_fast=10 max_runs=100000 pass=99999 priority=100 remaining=0 runtime=1:22:49 sanity=False started=100000 stopped=20250213-202906 submitted=20250213-190617 timeout=5400 username=zhewang

100K correctness tests:
20250213-213207-zhewang-0e91c2ddace17da4 compressed=True data_size=36812798 fail_fast=10 max_runs=100000 priority=100 sanity=False submitted=20250213-213207 timeout=5400 username=zhewang

Code-Reviewer Section

The general pull request guidelines can be found here.

Please check each of the following things and check all boxes before accepting a PR.

  • The PR has a description, explaining both the problem and the solution.
  • The description mentions which forms of testing were done and the testing seems reasonable.
  • Every function/class/actor that was touched is reasonably well documented.

For Release-Branches

If this PR is made against a release-branch, please also check the following:

  • This change/bugfix is a cherry-pick from the next younger branch (younger release-branch or main if this is the youngest branch)
  • There is a good reason why this PR needs to go into a release branch and this reason is documented (either in the description above or in a linked GitHub issue)

@kakaiu kakaiu requested review from saintstack and jzhou77 February 13, 2025 17:43
@foundationdb-ci
Copy link
Contributor

Result of foundationdb-pr-clang-ide on Linux CentOS 7

  • Commit ID: cbcc3af
  • Duration 0:15:53
  • Result: ❌ FAILED
  • Error: Error while executing command: ninja -v -C build_output -j ${NPROC} all. Reason: exit status 1
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci
Copy link
Contributor

Result of foundationdb-pr-macos-m1 on macOS Ventura 13.x

  • Commit ID: cbcc3af
  • Duration 0:39:29
  • Result: ✅ SUCCEEDED
  • Error: N/A
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci
Copy link
Contributor

Result of foundationdb-pr-clang-ide on Linux CentOS 7

  • Commit ID: 0d36d2a
  • Duration 0:22:54
  • Result: ✅ SUCCEEDED
  • Error: N/A
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci
Copy link
Contributor

Result of foundationdb-pr-clang-arm on Linux CentOS 7

  • Commit ID: cbcc3af
  • Duration 0:49:54
  • Result: ✅ SUCCEEDED
  • Error: N/A
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci
Copy link
Contributor

Result of foundationdb-pr-cluster-tests on Linux CentOS 7

  • Commit ID: cbcc3af
  • Duration 0:56:36
  • Result: ✅ SUCCEEDED
  • Error: N/A
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)
  • Cluster Test Logs zip file of the test logs (available for 30 days)

@foundationdb-ci
Copy link
Contributor

Result of foundationdb-pr-macos on macOS Ventura 13.x

  • Commit ID: cbcc3af
  • Duration 1:05:19
  • Result: ✅ SUCCEEDED
  • Error: N/A
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci
Copy link
Contributor

Result of foundationdb-pr-clang on Linux CentOS 7

  • Commit ID: cbcc3af
  • Duration 1:11:13
  • Result: ❌ FAILED
  • Error: Error while executing command: if python3 -m joshua.joshua list --stopped | grep ${ENSEMBLE_ID} | grep -q 'pass=10[0-9][0-9][0-9]'; then echo PASS; else echo FAIL && exit 1; fi. Reason: exit status 1
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci
Copy link
Contributor

Result of foundationdb-pr-clang-arm on Linux CentOS 7

  • Commit ID: 0d36d2a
  • Duration 0:50:00
  • Result: ✅ SUCCEEDED
  • Error: N/A
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci
Copy link
Contributor

Result of foundationdb-pr on Linux CentOS 7

  • Commit ID: cbcc3af
  • Duration 1:12:33
  • Result: ✅ SUCCEEDED
  • Error: N/A
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci
Copy link
Contributor

Result of foundationdb-pr-clang on Linux CentOS 7

  • Commit ID: 0d36d2a
  • Duration 0:54:45
  • Result: ✅ SUCCEEDED
  • Error: N/A
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci
Copy link
Contributor

Result of foundationdb-pr-cluster-tests on Linux CentOS 7

  • Commit ID: 0d36d2a
  • Duration 1:02:22
  • Result: ✅ SUCCEEDED
  • Error: N/A
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)
  • Cluster Test Logs zip file of the test logs (available for 30 days)

@foundationdb-ci
Copy link
Contributor

Result of foundationdb-pr on Linux CentOS 7

  • Commit ID: 0d36d2a
  • Duration 1:03:38
  • Result: ❌ FAILED
  • Error: Error while executing command: if python3 -m joshua.joshua list --stopped | grep ${ENSEMBLE_ID} | grep -q 'pass=10[0-9][0-9][0-9]'; then echo PASS; else echo FAIL && exit 1; fi. Reason: exit status 1
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@kakaiu kakaiu closed this Feb 13, 2025
@kakaiu kakaiu reopened this Feb 13, 2025
@foundationdb-ci
Copy link
Contributor

Result of foundationdb-pr-clang-ide on Linux CentOS 7

  • Commit ID: 0d36d2a
  • Duration 0:23:08
  • Result: ✅ SUCCEEDED
  • Error: N/A
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci
Copy link
Contributor

Result of foundationdb-pr-macos-m1 on macOS Ventura 13.x

  • Commit ID: 0d36d2a
  • Duration 0:39:12
  • Result: ✅ SUCCEEDED
  • Error: N/A
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci
Copy link
Contributor

Result of foundationdb-pr-clang-arm on Linux CentOS 7

  • Commit ID: 0d36d2a
  • Duration 0:50:11
  • Result: ✅ SUCCEEDED
  • Error: N/A
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci
Copy link
Contributor

Result of foundationdb-pr-clang on Linux CentOS 7

  • Commit ID: 0d36d2a
  • Duration 0:53:30
  • Result: ✅ SUCCEEDED
  • Error: N/A
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci
Copy link
Contributor

Result of foundationdb-pr on Linux CentOS 7

  • Commit ID: 0d36d2a
  • Duration 0:56:42
  • Result: ✅ SUCCEEDED
  • Error: N/A
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci
Copy link
Contributor

Result of foundationdb-pr-cluster-tests on Linux CentOS 7

  • Commit ID: 0d36d2a
  • Duration 0:57:31
  • Result: ✅ SUCCEEDED
  • Error: N/A
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)
  • Cluster Test Logs zip file of the test logs (available for 30 days)

@foundationdb-ci
Copy link
Contributor

Result of foundationdb-pr-macos on macOS Ventura 13.x

  • Commit ID: 0d36d2a
  • Duration 1:05:37
  • Result: ✅ SUCCEEDED
  • Error: N/A
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

Copy link
Contributor

@saintstack saintstack left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems fine to me (though I'm the first to admit I'm not too clear on how this all works)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants