fix(defrag): handle no space left error #18822
Hi @ghouscht. Thanks for your PR. I'm waiting for a etcd-io member to verify that this patch is reasonable to test. If it is, they should reply with Once the patch is verified, the new status will be reflected by the I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
Codecov ReportAttention: Patch coverage is
... and 19 files with indirect coverage changes

@@            Coverage Diff             @@
##             main   #18822      +/-   ##
==========================================
- Coverage   68.76%   68.75%   -0.01%
==========================================
  Files         420      420
  Lines       35523    35525       +2
==========================================
- Hits        24426    24425       -1
- Misses       9665     9678      +13
+ Partials     1432     1422      -10

Continue to review full report in Codecov by Sentry.
The e2e test looks good. The proposed solution is to restore the environment (i.e. reopen the bbolt db) when defragmentation fails for any reason, and to panic if the restore itself fails. If bbolt fails to open, etcdserver can't serve any requests, so it makes sense to panic. cc @fuweid @ivanvc @jmhbnz @serathius @tjungblu
I added a second commit that contains a working implementation of a possible restore operation. I did some manual testing with the failpoint and the e2e test, and it seems to work. However, this opens up a whole lot of other possible problems. I highlighted some of them with |
/retest |
/ok-to-test |
server/storage/backend/backend.go
Outdated
b.batchTx.unsafeCommit(true)
b.batchTx.tx = nil
We can even move these two lines to right before defragdb, i.e. if we fail to open the temp bbolt db, we don't need to reopen the existing working bbolt db.
etcd/server/storage/backend/backend.go
Line 519 in bd88963
err = defragdb(b.db, tmpdb, defragLimit)
Of course, yes 🙂, will move it.
Now, in case defragdb fails and returns an error, we could potentially end up in the same state as described in the issue (b.batchTx.tx == nil and a nil ptr panic). Should this error be handled differently (e.g. by restarting a transaction), or is this something where etcd should stop (e.g. panic)?
Please see #18822 (comment)
I added a second commit for that, please let me know if that is ok.
Signed-off-by: Thomas Gosteli <[email protected]>
I think we still need to handle the error if any during
etcd/server/storage/backend/backend.go Lines 520 to 526 in bd88963
|
Note we need to resolve #18822 (comment) in a separate PR. Could you please raise a new issue to track it? Thanks. |
|
Overall looks good now. Please sign off the second commit. Refer to https://github.com/etcd-io/etcd/pull/18822/checks?check_run_id=32589384806
Signed-off-by: Thomas Gosteli <[email protected]>
LGTM
Thank you!
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: ahrtr, ghouscht, serathius The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
@ghouscht can you please backport this PR to 3.5 and 3.4? |
This PR contains an e2e test, a gofail failpoint, and a fix for the issue described in #18810.
Without the fix the test triggers a nil ptr panic in etcd as described in the linked issue:
I think from here on we can discuss potential solutions for the problem. @ahrtr already suggested two possible options in the linked issue. As mentioned in #18822 (comment), the PR now restores the environment and lets etcd continue to run.
Please read https://github.com/etcd-io/etcd/blob/main/CONTRIBUTING.md#contribution-flow.