Fix helix-lock regression #2698

Merged
merged 7 commits into from
Jan 31, 2024

MarkGaox
Contributor

@MarkGaox MarkGaox commented Nov 15, 2023

Issues

  • My PR addresses the following Helix issues and references them in the PR description:
  • This PR mainly addresses the recent behavior regression of helix-lock

(#200 - Link your issue number here: You can write "Fixes #XXX". Please use the proper keyword so that the issue gets closed automatically. See https://docs.github.com/en/github/managing-your-work-on-github/linking-a-pull-request-to-an-issue
Any of the following keywords can be used: close, closes, closed, fix, fixes, fixed, resolve, resolves, resolved)

Description

  • Here are some details about my PR, including screenshots of any UI changes:
  • The incompatibility between old and new helix-lock versions was introduced by the last update to helix-lock in Dec 2020: Implement Helix lock priority and notification #1564, which added priority and notification support to Helix locks.

This PR addresses two separate regressions:

  1. When a lock request is made from the new helix-lock version to a lock path currently locked by the old helix-lock version, the following happens:
  • The lock request sees that the current lock priority is -1 (because the priority field is not present in the lock ZNode), which is lower than the priority 0 of the lock being requested. It therefore tries to preempt the current lock owner by writing its own user id, priority and waiting timeout to the requestor fields in the lock ZNode. The pending timeout of the lock request is set to -1, since the current lock cleanup timeout is -1 (because the cleanup timeout field is not present in the lock ZNode). The lock status of the lock request now becomes PENDING.
  • The lock request waits on a CountDownLatch for the pending timeout, which is -1, so the wait returns immediately.
  • The current lock owner won’t clean up and release the lock, since it’s using the old helix-lock version, which doesn’t react to lock requests with higher priority.
  • Since the lock status of the lock request is still PENDING and the lock request is not forceful by default, an exception is thrown saying “Cleanup has not been finished by lock owner”, which breaks clients' workflow.
  2. A non-lock owner is able to unlock() a lock if its priority is higher than the priority recorded in the lock. This should be prevented as well. If a non-lock owner wants to acquire a lock, it should only call tryLock().
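The first regression can be reproduced in isolation with a minimal sketch. It is not the helix-lock code: the class name, the field keys ("OWNER", "PRIORITY", "CLEANUP_TIMEOUT") and the helper methods are hypothetical, and the lock ZNode is modelled as a plain map. Only `CountDownLatch.await(long, TimeUnit)` is a real API; it does not wait at all when the timeout is zero or negative.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

public class LegacyLockRegressionSketch {
  // Assumption: a field missing from the lock ZNode is read back as -1.
  static final int MISSING_PRIORITY = -1;
  static final long MISSING_TIMEOUT_MS = -1L;

  // Hypothetical reader: returns -1 when the old version never wrote a priority.
  public static int readPriority(Map<String, String> znodeFields) {
    String raw = znodeFields.get("PRIORITY");
    return raw == null ? MISSING_PRIORITY : Integer.parseInt(raw);
  }

  // Hypothetical reader: returns -1 when the old version never wrote a cleanup timeout.
  public static long readCleanupTimeout(Map<String, String> znodeFields) {
    String raw = znodeFields.get("CLEANUP_TIMEOUT");
    return raw == null ? MISSING_TIMEOUT_MS : Long.parseLong(raw);
  }

  // Waits for the owner's cleanup; returns true only if cleanup was signalled in time.
  // Nothing ever counts the latch down here, just as an old-version owner never reacts.
  public static boolean waitForCleanup(long timeoutMs) {
    CountDownLatch cleanupDone = new CountDownLatch(1);
    try {
      return cleanupDone.await(timeoutMs, TimeUnit.MILLISECONDS);
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      return false;
    }
  }

  public static void main(String[] args) {
    // A lock ZNode written by the old version: owner only, no priority or timeout fields.
    Map<String, String> legacyZnode = new HashMap<>();
    legacyZnode.put("OWNER", "old-client");

    int requestPriority = 0;
    boolean preempts = requestPriority > readPriority(legacyZnode);
    boolean cleaned = waitForCleanup(readCleanupTimeout(legacyZnode));

    // prints "preempts=true, cleaned=false"
    System.out.println("preempts=" + preempts + ", cleaned=" + cleaned);
  }
}
```

The request decides to preempt, the wait returns at once without any cleanup, and the request is left in a pending state, which corresponds to the exception clients saw.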

To resolve the two issues discussed above:

  1. Change the default value of priority from -1 to 0. This way, a lock request made from the new helix-lock version with the default priority won't be able to acquire a lock held by the old helix-lock version.
  2. Throw an exception when update() handles an unlock request but the requestor is not the current owner of the lock.
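The effect of the first change can be illustrated with a small comparison. This is a sketch under assumptions, not the actual helix-lock logic: the class, constants and `wouldPreempt` helper are hypothetical, and a `null` stored priority stands for a lock ZNode written by the old version.

```java
public class DefaultPrioritySketch {
  // Hypothetical constants for the default before and after this PR.
  public static final int OLD_DEFAULT_PRIORITY = -1;
  public static final int NEW_DEFAULT_PRIORITY = 0;

  // A request preempts only if its priority is strictly higher than the owner's.
  // storedPriority is null when the lock ZNode has no priority field.
  public static boolean wouldPreempt(int requestPriority, Integer storedPriority,
      int defaultPriority) {
    int currentPriority = storedPriority == null ? defaultPriority : storedPriority;
    return requestPriority > currentPriority;
  }

  public static void main(String[] args) {
    int requestPriority = 0; // default priority of a new-version request

    // prints "old default preempts: true"
    System.out.println("old default preempts: "
        + wouldPreempt(requestPriority, null, OLD_DEFAULT_PRIORITY));
    // prints "new default preempts: false"
    System.out.println("new default preempts: "
        + wouldPreempt(requestPriority, null, NEW_DEFAULT_PRIORITY));
  }
}
```

With the default at 0, a default-priority request ties with the old-version owner instead of outranking it, so no preemption is attempted.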

(Write a concise description including what, why, how)

Tests

mvn test -Dtest=TestZKHelixNonblockingLock,TestZKHelixNonblockingLockWithPriority -pl helix-lock

  • The following tests are written for this issue:
[INFO] Tests run: 14, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 54.524 s - in TestSuite
[INFO] 
[INFO] Results:
[INFO] 
[INFO] Tests run: 14, Failures: 0, Errors: 0, Skipped: 0
[INFO] 
[INFO] 
[INFO] --- jacoco:0.8.6:report (generate-code-coverage-report) @ helix-lock ---
[INFO] Loading execution data file /Users/xiaxgao/IdeaProjects/helix_ps/helix-lock/target/jacoco.exec
[INFO] Analyzed bundle 'Apache Helix :: Distributed Lock' with 13 classes
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time:  57.140 s
[INFO] Finished at: 2024-01-29T17:59:36-08:00
[INFO] ------------------------------------------------------------------------

(List the names of added unit/integration tests)

  • The following is the result of the "mvn test" command on the appropriate module:

(If CI test fails due to known issue, please specify the issue and test PR locally. Then copy & paste the result of "mvn test" to here.)

Changes that Break Backward Compatibility (Optional)

  • My PR contains changes that break backward compatibility or previous assumptions for certain methods or API. They include:

(Consider including all behavior changes for public methods or API. Also include these changes in merge description so that other developers are aware of these changes. This allows them to make relevant code changes in feature branches accounting for the new method/API behavior.)

Documentation (Optional)

  • In case of new functionality, my PR adds documentation in the following wiki page:

(Link the GitHub wiki you added)

Commits

  • My commits all reference appropriate Apache Helix GitHub issues in their subject lines. In addition, my commits follow the guidelines from "How to write a good git commit message":
    1. Subject is separated from body by a blank line
    2. Subject is limited to 50 characters (not including Jira issue reference)
    3. Subject does not end with a period
    4. Subject uses the imperative mood ("add", not "adding")
    5. Body wraps at 72 characters
    6. Body explains "what" and "why", not "how"

Code Quality

  • My diff has been formatted using helix-style.xml
    (helix-style-intellij.xml if IntelliJ IDE is used)

Contributor

@desaikomal desaikomal left a comment

Great analysis @MarkGaox - Thanks for fixing the old regression (since Dec 2020).
I am reviewing the change, but wanted to provide an initial comment.

@MarkGaox
Contributor Author

This PR is approved by @desaikomal and is ready to merge.
Final commit message:
Fix helix-lock regression by changing the default lock priority to 0

// higher priority lock request will try to preempt current lock owner
LockInfo unlockOrLockRequestLockInfo = new LockInfo(_record);
// Any unlock request from non-lock owners is blocked.
if (unlockOrLockRequestLockInfo.getOwner().equals(LockConstants.DEFAULT_USER_ID)) {
Contributor Author

In the current codebase, requestor.unlock() returns true when its priority is higher than the current lock owner's. However, what actually happens is that the lock is not released even though unlock() returns true: the current code only changes the pendingTimeOut of the requestor's lock, and the requestor still has to call tryLock() to acquire the lock, which is not efficient. Based on an offline discussion with @xyuanlu, the ideal behavior is for unlock() to return false when the lock requestor is not the lock owner; requestors should call tryLock() to acquire the lock.
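The intended contract can be sketched with an in-memory stand-in. This is an assumption-laden toy, not the ZooKeeper-backed implementation: `OwnerOnlyUnlockSketch` and its methods are hypothetical, priority-based preemption is deliberately left out, and the non-owner case is reported as `false` from `unlock()` as described above.

```java
public class OwnerOnlyUnlockSketch {
  // Current owner's user id; null means the lock is free.
  private String owner;

  // Acquires the lock if it is free, or confirms it if the caller already owns it.
  public synchronized boolean tryLock(String userId) {
    if (owner == null) {
      owner = userId;
      return true;
    }
    return owner.equals(userId);
  }

  // Only the current owner may release the lock; anyone else gets false
  // and the lock state is left untouched.
  public synchronized boolean unlock(String userId) {
    if (!userId.equals(owner)) {
      return false;
    }
    owner = null;
    return true;
  }

  public static void main(String[] args) {
    OwnerOnlyUnlockSketch lock = new OwnerOnlyUnlockSketch();
    System.out.println(lock.tryLock("owner"));     // prints "true"
    System.out.println(lock.unlock("requestor"));  // prints "false"
    System.out.println(lock.tryLock("requestor")); // prints "false"
    System.out.println(lock.unlock("owner"));      // prints "true"
    System.out.println(lock.tryLock("requestor")); // prints "true"
  }
}
```

A requestor that wants the lock goes through tryLock() only; unlock() never changes state on behalf of a non-owner.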

@xyuanlu xyuanlu merged commit 43e8db2 into apache:master Jan 31, 2024
3 checks passed