
fix(defrag): handle no space left error #18822

Merged
2 commits merged into etcd-io:main on Nov 6, 2024
Conversation

ghouscht
Contributor

@ghouscht ghouscht commented Nov 1, 2024

The PR contains an e2e test, a gofail failpoint, and a fix for the issue described in #18810.

Without the fix, the test triggers a nil pointer panic in etcd, as described in the linked issue:

panic: runtime error: invalid memory address or nil pointer dereference [recovered]
	panic: execute job failed
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x787ad8]

goroutine 136 [running]:
go.uber.org/zap/zapcore.CheckWriteAction.OnWrite(0x80?, 0x2?, {0xf?, 0x0?, 0x0?})
	go.uber.org/[email protected]/zapcore/entry.go:196 +0x78
go.uber.org/zap/zapcore.(*CheckedEntry).Write(0x40005d81a0, {0x400058a780, 0x2, 0x2})
	go.uber.org/[email protected]/zapcore/entry.go:262 +0x1c4
go.uber.org/zap.(*Logger).Panic(0xcd8746?, {0xce7440?, 0xb54f60?}, {0x400058a780, 0x2, 0x2})
	go.uber.org/[email protected]/logger.go:285 +0x54
go.etcd.io/etcd/pkg/v3/schedule.(*fifo).executeJob.func1()
	go.etcd.io/etcd/pkg/[email protected]/schedule/schedule.go:202 +0x24c
panic({0xb54f60?, 0x1681ec0?})
	runtime/panic.go:785 +0x124
go.etcd.io/bbolt.(*Tx).Bucket(...)
	go.etcd.io/[email protected]/tx.go:112
go.etcd.io/etcd/server/v3/storage/backend.unsafeForEach(0x0, {0xe9b988?, 0x1695e60?}, 0x4000580460)
	go.etcd.io/etcd/server/v3/storage/backend/batch_tx.go:235 +0x38
go.etcd.io/etcd/server/v3/storage/backend.(*batchTx).UnsafeForEach(...)
	go.etcd.io/etcd/server/v3/storage/backend/batch_tx.go:231
go.etcd.io/etcd/server/v3/storage/backend.unsafeVerifyTxConsistency({0xea55e0, 0x40000bafc0}, {0xe9b988, 0x1695e60})
	go.etcd.io/etcd/server/v3/storage/backend/verify.go:97 +0xa4
go.etcd.io/etcd/server/v3/etcdserver.(*EtcdServer).applyAll.VerifyBackendConsistency.func1()
	go.etcd.io/etcd/server/v3/storage/backend/verify.go:90 +0x244
go.etcd.io/etcd/client/pkg/v3/verify.Verify(0x40000efdf8)
	go.etcd.io/etcd/client/pkg/[email protected]/verify/verify.go:71 +0x3c
go.etcd.io/etcd/server/v3/storage/backend.VerifyBackendConsistency(...)
	go.etcd.io/etcd/server/v3/storage/backend/verify.go:75
go.etcd.io/etcd/server/v3/etcdserver.(*EtcdServer).applyAll(0x40003a3c08, 0x4000172180, 0x40005da000)
	go.etcd.io/etcd/server/v3/etcdserver/server.go:972 +0xcc
go.etcd.io/etcd/server/v3/etcdserver.(*EtcdServer).run.func6({0x4000196850?, 0x4000494a80?})
	go.etcd.io/etcd/server/v3/etcdserver/server.go:847 +0x28
go.etcd.io/etcd/pkg/v3/schedule.job.Do(...)
	go.etcd.io/etcd/pkg/[email protected]/schedule/schedule.go:41
go.etcd.io/etcd/pkg/v3/schedule.(*fifo).executeJob(0x40000eff70?, {0xe943f8?, 0x40005b6330?}, 0x0?)
	go.etcd.io/etcd/pkg/[email protected]/schedule/schedule.go:206 +0x78
go.etcd.io/etcd/pkg/v3/schedule.(*fifo).run(0x400039a0e0)
	go.etcd.io/etcd/pkg/[email protected]/schedule/schedule.go:187 +0x15c
created by go.etcd.io/etcd/pkg/v3/schedule.NewFIFOScheduler in goroutine 155
	go.etcd.io/etcd/pkg/[email protected]/schedule/schedule.go:101 +0x178

I think from here on we can discuss potential solutions for the problem. @ahrtr already suggested two possible options in the linked issue.

As mentioned in #18822 (comment), the PR now restores the environment and lets etcd continue to run.
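
To make the failure mode concrete, here is a small, self-contained Go sketch of the sequence (a toy illustration only, not the actual etcd code; the field names loosely mirror the backend snippets quoted later in this thread):

```go
// Toy reproduction of the failure mode: a failed defrag leaves the
// backend's write transaction nil, and the next code path that
// dereferences it panics, as in the trace above.
package main

import (
	"errors"
	"fmt"
)

type boltTx struct{ buckets map[string][]byte } // stands in for *bolt.Tx

type batchTx struct{ tx *boltTx }

type backend struct{ batchTx *batchTx }

func (b *backend) defrag(copyFails bool) error {
	// defrag commits and clears the current write transaction before
	// copying the keyspace into a temporary db...
	b.batchTx.tx = nil
	if copyFails {
		// ...and, before the fix, returned here on error without
		// re-beginning the transaction.
		return errors.New("defragdb: no space left on device")
	}
	b.batchTx.tx = &boltTx{buckets: map[string][]byte{}} // fresh transaction
	return nil
}

func main() {
	b := &backend{batchTx: &batchTx{tx: &boltTx{}}}
	_ = b.defrag(true)
	// Mirrors the UnsafeForEach call in the panic trace: the nil
	// transaction is dereferenced and the process panics.
	fmt.Println(len(b.batchTx.tx.buckets))
}
```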

Please read https://github.com/etcd-io/etcd/blob/main/CONTRIBUTING.md#contribution-flow.

@k8s-ci-robot

Hi @ghouscht. Thanks for your PR.

I'm waiting for an etcd-io member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@codecov-commenter

codecov-commenter commented Nov 1, 2024

⚠️ Please install the Codecov GitHub app to ensure uploads and comments are reliably processed by Codecov.

Codecov Report

Attention: Patch coverage is 50.00000% with 2 lines in your changes missing coverage. Please review.

Project coverage is 68.75%. Comparing base (3de0018) to head (8369f07).
Report is 14 commits behind head on main.

Current head 8369f07 differs from pull request most recent head 04c042c

Please upload reports for the commit 04c042c to get more accurate results.

Files with missing lines | Patch % | Lines
server/storage/backend/backend.go | 50.00% | 2 Missing ⚠️

❗ Your organization needs to install the Codecov GitHub app to enable full functionality.

Additional details and impacted files
Files with missing lines | Coverage Δ
server/storage/backend/backend.go | 82.82% <50.00%> (-0.52%) ⬇️

... and 19 files with indirect coverage changes

@@            Coverage Diff             @@
##             main   #18822      +/-   ##
==========================================
- Coverage   68.76%   68.75%   -0.01%     
==========================================
  Files         420      420              
  Lines       35523    35525       +2     
==========================================
- Hits        24426    24425       -1     
- Misses       9665     9678      +13     
+ Partials     1432     1422      -10     

Continue to review full report in Codecov by Sentry.

Legend
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 3de0018...04c042c. Read the comment docs.

@ahrtr
Member

ahrtr commented Nov 3, 2024

The e2e test looks good.

The proposed solution is to restore the environment (i.e. reopen the bbolt db) when defragmentation somehow fails, and to panic if the restore fails again. If the bbolt db fails to be opened, then etcdserver can't serve any requests, so it makes sense to panic. cc @fuweid @ivanvc @jmhbnz @serathius @tjungblu
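
A minimal sketch of that reopen-or-panic idea, assuming a bolt.Open-based restore inside defrag (the names dbp and boltOpenOptions and the exact placement are assumptions for illustration, not the final shape of the fix):

```go
// Sketch only: if defragmentation fails, try to bring the backend back
// by reopening the original bbolt file; if even that fails, the server
// cannot serve requests anymore, so panicking is acceptable.
if err := defragdb(b.db, tmpdb, defragLimit); err != nil {
	tmpdb.Close()
	if rmErr := os.RemoveAll(tmpdb.Path()); rmErr != nil {
		b.lg.Error("failed to remove db.tmp after defragmentation completed", zap.Error(rmErr))
	}
	db, berr := bolt.Open(dbp, 0600, boltOpenOptions)
	if berr != nil {
		b.lg.Panic("failed to reopen backend after failed defragmentation", zap.Error(berr))
	}
	b.db = db
	// (the backend's transactions would also need to be re-begun on the new handle)
	return err
}
```

For comparison, the diff suggested later in this thread re-begins the transactions on the existing handle rather than reopening the file.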

@k8s-ci-robot k8s-ci-robot added size/L and removed size/M labels Nov 4, 2024
@ghouscht
Contributor Author

ghouscht commented Nov 4, 2024

The e2e test looks good.

The proposed solution is to restore the environment (i.e. reopen the bbolt db) when defragmentation somehow fails, and to panic if the restore fails again. If the bbolt db fails to be opened, then etcdserver can't serve any requests, so it makes sense to panic. cc @fuweid @ivanvc @jmhbnz @serathius @tjungblu

I added a second commit that contains a working implementation of a possible restore operation. I did some manual testing with the failpoint and the e2e test, and it seems to work. However, this opens up a whole lot of other possible problems. I highlighted some of them with TODO in the code - feedback appreciated 🙂

@k8s-ci-robot k8s-ci-robot added size/M and removed size/L labels Nov 5, 2024
@ghouscht ghouscht changed the title chore: e2e test defrag no space fix(defrag): handle no space left error Nov 5, 2024
@serathius
Member

/retest

@serathius
Member

/ok-to-test

server/storage/backend/backend.go (outdated)
Comment on lines 490 to 491
b.batchTx.unsafeCommit(true)
b.batchTx.tx = nil
Member

We can even move these two lines to right before defragdb; i.e., if we fail to open the temp bbolt db, we don't need to reopen the existing working bbolt db.

err = defragdb(b.db, tmpdb, defragLimit)
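
A sketch of that suggested ordering (names such as tdbp and tmpopts for the temporary db path and options are illustrative, surrounding code elided):

```go
// Sketch: open the temporary db first; only once that has succeeded,
// stop the current transactions right before the copy starts.
tmpdb, err := bolt.Open(tdbp, 0600, tmpopts)
if err != nil {
	return err // the working db was never touched, so nothing to restore
}

// moved here from earlier in defrag():
b.batchTx.unsafeCommit(true)
b.batchTx.tx = nil

err = defragdb(b.db, tmpdb, defragLimit)
```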

Contributor Author

Of course, yes 🙂, will move it.

Contributor Author

@ghouscht ghouscht Nov 6, 2024

Now, in case defragdb fails and returns an error, we could potentially end up in the same state as described in the issue (b.batchTx.tx == nil and a nil pointer panic). Should this error be handled differently (e.g. by restarting a transaction), or is this something where etcd should stop (e.g. panic)?

Member

Please see #18822 (comment)

Contributor Author

I added a second commit for that, please let me know if that is ok.

@ahrtr
Member

ahrtr commented Nov 6, 2024

I think we still need to handle the error, if any, during defragdb, something like below; also add one more failpoint and a new [sub]test case.

$ git diff -l10
diff --git a/server/storage/backend/backend.go b/server/storage/backend/backend.go
index 95f5cf96f..5a9361ae8 100644
--- a/server/storage/backend/backend.go
+++ b/server/storage/backend/backend.go
@@ -522,6 +522,11 @@ func (b *backend) defrag() error {
                if rmErr := os.RemoveAll(tmpdb.Path()); rmErr != nil {
                        b.lg.Error("failed to remove db.tmp after defragmentation completed", zap.Error(rmErr))
                }
+
+               // restore the bbolt transactions if defragmentation fails
+               b.batchTx.tx = b.unsafeBegin(true)
+               b.readTx.tx = b.unsafeBegin(false)
+
                return err
        }

if err != nil {
	tmpdb.Close()
	if rmErr := os.RemoveAll(tmpdb.Path()); rmErr != nil {
		b.lg.Error("failed to remove db.tmp after defragmentation completed", zap.Error(rmErr))
	}
	return err
}
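
For the extra failpoint and [sub]test case, here is a hedged sketch of how a gofail failpoint could force the defragdb error path (the failpoint name defragdbFail and its placement are illustrative, not necessarily what the PR ended up adding):

```go
// Sketch: when enabled, gofail injects the commented statements below,
// overriding err so the error-handling branch (removal of db.tmp plus
// restore of the bbolt transactions) is exercised.
err = defragdb(b.db, tmpdb, defragLimit)
// gofail: var defragdbFail string
// err = errors.New(defragdbFail)
if err != nil {
	tmpdb.Close()
	if rmErr := os.RemoveAll(tmpdb.Path()); rmErr != nil {
		b.lg.Error("failed to remove db.tmp after defragmentation completed", zap.Error(rmErr))
	}
	// restore the bbolt transactions if defragmentation fails
	b.batchTx.tx = b.unsafeBegin(true)
	b.readTx.tx = b.unsafeBegin(false)
	return err
}
```

An e2e test can then enable the failpoint (for example via the gofail HTTP endpoint or the GOFAIL_FAILPOINTS environment variable with a term like return("no space left on device")), trigger a defrag, and assert that the member keeps serving requests instead of panicking.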

@ahrtr
Member

ahrtr commented Nov 6, 2024

Note we need to resolve #18822 (comment) in a separate PR. Could you please raise a new issue to track it? Thanks.

@ghouscht
Contributor Author

ghouscht commented Nov 6, 2024

Note we need to resolve #18822 (comment) in a separate PR. Could you please raise a new issue to track it? Thanks.

#18841

@ahrtr
Member

ahrtr commented Nov 6, 2024

Overall looks good now. Please sign off the second commit. Refer to https://github.com/etcd-io/etcd/pull/18822/checks?check_run_id=32589384806

Member

@ahrtr ahrtr left a comment

LGTM

Thank you!

@k8s-ci-robot

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: ahrtr, ghouscht, serathius

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@ahrtr ahrtr merged commit 9250982 into etcd-io:main Nov 6, 2024
32 checks passed
@ahrtr
Member

ahrtr commented Nov 6, 2024

@ghouscht can you please backport this PR to 3.5 and 3.4?

@ahrtr
Member

ahrtr commented Nov 6, 2024

@ghouscht Please also update the changelog for 3.5 and 3.4. Thanks
