
fix(defrag): handle no space left error #18822

Merged
2 commits merged into etcd-io:main on Nov 6, 2024
Conversation

ghouscht
Contributor

@ghouscht ghouscht commented Nov 1, 2024

The PR contains an e2e test, a gofail failpoint, and a fix for the issue described in #18810.

Without the fix, the test triggers a nil pointer panic in etcd, as described in the linked issue:

panic: runtime error: invalid memory address or nil pointer dereference [recovered]
	panic: execute job failed
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x787ad8]

goroutine 136 [running]:
go.uber.org/zap/zapcore.CheckWriteAction.OnWrite(0x80?, 0x2?, {0xf?, 0x0?, 0x0?})
	go.uber.org/[email protected]/zapcore/entry.go:196 +0x78
go.uber.org/zap/zapcore.(*CheckedEntry).Write(0x40005d81a0, {0x400058a780, 0x2, 0x2})
	go.uber.org/[email protected]/zapcore/entry.go:262 +0x1c4
go.uber.org/zap.(*Logger).Panic(0xcd8746?, {0xce7440?, 0xb54f60?}, {0x400058a780, 0x2, 0x2})
	go.uber.org/[email protected]/logger.go:285 +0x54
go.etcd.io/etcd/pkg/v3/schedule.(*fifo).executeJob.func1()
	go.etcd.io/etcd/pkg/[email protected]/schedule/schedule.go:202 +0x24c
panic({0xb54f60?, 0x1681ec0?})
	runtime/panic.go:785 +0x124
go.etcd.io/bbolt.(*Tx).Bucket(...)
	go.etcd.io/[email protected]/tx.go:112
go.etcd.io/etcd/server/v3/storage/backend.unsafeForEach(0x0, {0xe9b988?, 0x1695e60?}, 0x4000580460)
	go.etcd.io/etcd/server/v3/storage/backend/batch_tx.go:235 +0x38
go.etcd.io/etcd/server/v3/storage/backend.(*batchTx).UnsafeForEach(...)
	go.etcd.io/etcd/server/v3/storage/backend/batch_tx.go:231
go.etcd.io/etcd/server/v3/storage/backend.unsafeVerifyTxConsistency({0xea55e0, 0x40000bafc0}, {0xe9b988, 0x1695e60})
	go.etcd.io/etcd/server/v3/storage/backend/verify.go:97 +0xa4
go.etcd.io/etcd/server/v3/etcdserver.(*EtcdServer).applyAll.VerifyBackendConsistency.func1()
	go.etcd.io/etcd/server/v3/storage/backend/verify.go:90 +0x244
go.etcd.io/etcd/client/pkg/v3/verify.Verify(0x40000efdf8)
	go.etcd.io/etcd/client/pkg/[email protected]/verify/verify.go:71 +0x3c
go.etcd.io/etcd/server/v3/storage/backend.VerifyBackendConsistency(...)
	go.etcd.io/etcd/server/v3/storage/backend/verify.go:75
go.etcd.io/etcd/server/v3/etcdserver.(*EtcdServer).applyAll(0x40003a3c08, 0x4000172180, 0x40005da000)
	go.etcd.io/etcd/server/v3/etcdserver/server.go:972 +0xcc
go.etcd.io/etcd/server/v3/etcdserver.(*EtcdServer).run.func6({0x4000196850?, 0x4000494a80?})
	go.etcd.io/etcd/server/v3/etcdserver/server.go:847 +0x28
go.etcd.io/etcd/pkg/v3/schedule.job.Do(...)
	go.etcd.io/etcd/pkg/[email protected]/schedule/schedule.go:41
go.etcd.io/etcd/pkg/v3/schedule.(*fifo).executeJob(0x40000eff70?, {0xe943f8?, 0x40005b6330?}, 0x0?)
	go.etcd.io/etcd/pkg/[email protected]/schedule/schedule.go:206 +0x78
go.etcd.io/etcd/pkg/v3/schedule.(*fifo).run(0x400039a0e0)
	go.etcd.io/etcd/pkg/[email protected]/schedule/schedule.go:187 +0x15c
created by go.etcd.io/etcd/pkg/v3/schedule.NewFIFOScheduler in goroutine 155
	go.etcd.io/etcd/pkg/[email protected]/schedule/schedule.go:101 +0x178

I think from here on we can discuss potential solutions for the problem. @ahrtr already suggested two possible options in the linked issue.

As mentioned in #18822 (comment), the PR now restores the environment and lets etcd continue to run.
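
To make the failure mode concrete, here is a small, self-contained Go sketch of the sequence (a toy illustration only, not the actual etcd code; the field names loosely mirror the backend snippets quoted later in this thread):

```go
// Toy reproduction of the failure mode: a failed defrag leaves the
// backend's write transaction nil, and the next code path that
// dereferences it panics, as in the trace above.
package main

import (
	"errors"
	"fmt"
)

type boltTx struct{ buckets map[string][]byte } // stands in for *bolt.Tx

type batchTx struct{ tx *boltTx }

type backend struct{ batchTx *batchTx }

func (b *backend) defrag(copyFails bool) error {
	// defrag commits and clears the current write transaction before
	// copying the keyspace into a temporary db...
	b.batchTx.tx = nil
	if copyFails {
		// ...and, before the fix, returned here on error without
		// re-beginning the transaction.
		return errors.New("defragdb: no space left on device")
	}
	b.batchTx.tx = &boltTx{buckets: map[string][]byte{}} // fresh transaction
	return nil
}

func main() {
	b := &backend{batchTx: &batchTx{tx: &boltTx{}}}
	_ = b.defrag(true)
	// Mirrors the UnsafeForEach call in the panic trace: the nil
	// transaction is dereferenced and the process panics.
	fmt.Println(len(b.batchTx.tx.buckets))
}
```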

Please read https://github.com/etcd-io/etcd/blob/main/CONTRIBUTING.md#contribution-flow.

@k8s-ci-robot

Hi @ghouscht. Thanks for your PR.

I'm waiting for an etcd-io member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@codecov-commenter

codecov-commenter commented Nov 1, 2024

⚠️ Please install the Codecov GitHub app to ensure uploads and comments are reliably processed by Codecov.

Codecov Report

Attention: Patch coverage is 50.00000% with 2 lines in your changes missing coverage. Please review.

Project coverage is 68.75%. Comparing base (3de0018) to head (8369f07).
Report is 14 commits behind head on main.

Current head 8369f07 differs from pull request most recent head 04c042c

Please upload reports for the commit 04c042c to get more accurate results.

Files with missing lines | Patch % | Lines
server/storage/backend/backend.go | 50.00% | 2 Missing ⚠️

❗ Your organization needs to install the Codecov GitHub app to enable full functionality.

Additional details and impacted files
Files with missing lines | Coverage Δ
server/storage/backend/backend.go | 82.82% <50.00%> (-0.52%) ⬇️

... and 19 files with indirect coverage changes

@@            Coverage Diff             @@
##             main   #18822      +/-   ##
==========================================
- Coverage   68.76%   68.75%   -0.01%     
==========================================
  Files         420      420              
  Lines       35523    35525       +2     
==========================================
- Hits        24426    24425       -1     
- Misses       9665     9678      +13     
+ Partials     1432     1422      -10     

Continue to review full report in Codecov by Sentry.

Legend
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 3de0018...04c042c. Read the comment docs.

@ahrtr
Member

ahrtr commented Nov 3, 2024

The e2e test looks good.

The proposed solution is to restore the environment (i.e. reopen the bbolt db) when defragmentation somehow fails, and to panic if the restore fails again. If the bbolt db fails to be opened, then etcdserver can't serve any requests, so it makes sense to panic. cc @fuweid @ivanvc @jmhbnz @serathius @tjungblu
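
A minimal sketch of that reopen-or-panic idea, assuming a bolt.Open-based restore inside defrag (the names dbp and boltOpenOptions and the exact placement are assumptions for illustration, not the final shape of the fix):

```go
// Sketch only: if defragmentation fails, try to bring the backend back
// by reopening the original bbolt file; if even that fails, the server
// cannot serve requests anymore, so panicking is acceptable.
if err := defragdb(b.db, tmpdb, defragLimit); err != nil {
	tmpdb.Close()
	if rmErr := os.RemoveAll(tmpdb.Path()); rmErr != nil {
		b.lg.Error("failed to remove db.tmp after defragmentation completed", zap.Error(rmErr))
	}
	db, berr := bolt.Open(dbp, 0600, boltOpenOptions)
	if berr != nil {
		b.lg.Panic("failed to reopen backend after failed defragmentation", zap.Error(berr))
	}
	b.db = db
	// (the backend's transactions would also need to be re-begun on the new handle)
	return err
}
```

For comparison, the diff suggested later in this thread re-begins the transactions on the existing handle rather than reopening the file.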

@k8s-ci-robot k8s-ci-robot added size/L and removed size/M labels Nov 4, 2024
@ghouscht
Contributor Author

ghouscht commented Nov 4, 2024

The e2e test looks good.

The proposed solution is to restore the environment (i.e. reopen the bbolt db) when defragmentation somehow fails, and to panic if the restore fails again. If the bbolt db fails to be opened, then etcdserver can't serve any requests, so it makes sense to panic. cc @fuweid @ivanvc @jmhbnz @serathius @tjungblu

I added a second commit that contains a working implementation of a possible restore operation. I did some manual testing with the failpoint and the e2e test, and it seems to work. However, this opens up a whole lot of other possible problems. I highlighted some of them with TODO in the code - feedback appreciated 🙂

@k8s-ci-robot k8s-ci-robot added size/M and removed size/L labels Nov 5, 2024
@ghouscht ghouscht changed the title chore: e2e test defrag no space fix(defrag): handle no space left error Nov 5, 2024
@serathius
Member

/retest

@serathius
Member

/ok-to-test

server/storage/backend/backend.go (outdated)
Comment on lines 490 to 491
b.batchTx.unsafeCommit(true)
b.batchTx.tx = nil
Member

We can even move these two lines to right before defragdb; i.e., if we fail to open the temp bbolt db, we don't need to reopen the existing working bbolt db.

err = defragdb(b.db, tmpdb, defragLimit)
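
A sketch of that suggested ordering (names such as tdbp and tmpopts for the temporary db path and options are illustrative, surrounding code elided):

```go
// Sketch: open the temporary db first; only once that has succeeded,
// stop the current transactions right before the copy starts.
tmpdb, err := bolt.Open(tdbp, 0600, tmpopts)
if err != nil {
	return err // the working db was never touched, so nothing to restore
}

// moved here from earlier in defrag():
b.batchTx.unsafeCommit(true)
b.batchTx.tx = nil

err = defragdb(b.db, tmpdb, defragLimit)
```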

Contributor Author

Of course, yes 🙂, will move it.

Contributor Author

@ghouscht ghouscht Nov 6, 2024

Now, in case defragdb fails and returns an error, we could potentially end up in the same state as described in the issue (b.batchTx.tx == nil and a nil pointer panic). Should this error be handled differently (e.g. by restarting a transaction), or is this something where etcd should stop (e.g. panic)?

Member

Please see #18822 (comment)

Contributor Author

I added a second commit for that, please let me know if that is ok.

@ahrtr
Member

ahrtr commented Nov 6, 2024

I think we still need to handle the error, if any, during defragdb, something like below; also add one more failpoint and a new [sub]test case.

$ git diff -l10
diff --git a/server/storage/backend/backend.go b/server/storage/backend/backend.go
index 95f5cf96f..5a9361ae8 100644
--- a/server/storage/backend/backend.go
+++ b/server/storage/backend/backend.go
@@ -522,6 +522,11 @@ func (b *backend) defrag() error {
                if rmErr := os.RemoveAll(tmpdb.Path()); rmErr != nil {
                        b.lg.Error("failed to remove db.tmp after defragmentation completed", zap.Error(rmErr))
                }
+
+               // restore the bbolt transactions if defragmentation fails
+               b.batchTx.tx = b.unsafeBegin(true)
+               b.readTx.tx = b.unsafeBegin(false)
+
                return err
        }

if err != nil {
	tmpdb.Close()
	if rmErr := os.RemoveAll(tmpdb.Path()); rmErr != nil {
		b.lg.Error("failed to remove db.tmp after defragmentation completed", zap.Error(rmErr))
	}
	return err
}
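
For the extra failpoint and [sub]test case, here is a hedged sketch of how a gofail failpoint could force the defragdb error path (the failpoint name defragdbFail and its placement are illustrative, not necessarily what the PR ended up adding):

```go
// Sketch: when enabled, gofail injects the commented statements below,
// overriding err so the error-handling branch (removal of db.tmp plus
// restore of the bbolt transactions) is exercised.
err = defragdb(b.db, tmpdb, defragLimit)
// gofail: var defragdbFail string
// err = errors.New(defragdbFail)
if err != nil {
	tmpdb.Close()
	if rmErr := os.RemoveAll(tmpdb.Path()); rmErr != nil {
		b.lg.Error("failed to remove db.tmp after defragmentation completed", zap.Error(rmErr))
	}
	// restore the bbolt transactions if defragmentation fails
	b.batchTx.tx = b.unsafeBegin(true)
	b.readTx.tx = b.unsafeBegin(false)
	return err
}
```

An e2e test can then enable the failpoint (for example via the gofail HTTP endpoint or the GOFAIL_FAILPOINTS environment variable with a term like return("no space left on device")), trigger a defrag, and assert that the member keeps serving requests instead of panicking.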

@ahrtr
Member

ahrtr commented Nov 6, 2024

Note we need to resolve #18822 (comment) in a separate PR. Could you please raise a new issue to track it? Thanks.

@ghouscht
Contributor Author

ghouscht commented Nov 6, 2024

Note we need to resolve #18822 (comment) in a separate PR. Could you please raise a new issue to track it? Thanks.

#18841

@ahrtr
Member

ahrtr commented Nov 6, 2024

Overall looks good now. Please sign off the second commit. Refer to https://github.com/etcd-io/etcd/pull/18822/checks?check_run_id=32589384806

Member

@ahrtr ahrtr left a comment

LGTM

Thank you!

@k8s-ci-robot

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: ahrtr, ghouscht, serathius

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@ahrtr ahrtr merged commit 9250982 into etcd-io:main Nov 6, 2024
32 checks passed
@ahrtr
Member

ahrtr commented Nov 6, 2024

@ghouscht can you please backport this PR to 3.5 and 3.4?

@ahrtr
Member

ahrtr commented Nov 6, 2024

@ghouscht Please also update the changelog for 3.5 and 3.4. Thanks
