Simplify tracking of deleted objects #64

Closed
joamaki wants to merge 3 commits

Conversation

@joamaki joamaki commented Nov 4, 2024

This reworks how deleted objects are stored by removing the separate graveyard indexes and instead storing the deleted objects in the normal primary and revision indexes. This simplifies the code a fair bit, especially for Changes(), and allows for more efficient cleanup of deleted objects as part of Commit() rather than in a separate graveyard worker.

benchmarks: Add benchmark for deletion

    Add a benchmark for Delete() to test impact of the graveyard index removal.

Reimplement the deleted object tracking

    The use of separate graveyard primary and revision indexes and a background
    GC job to remove observed objects from them was unnecessarily complicated.

    We can implement this in a much simpler way by keeping the
    "soft-deleted" objects in the normal primary & revision indexes, but just
    marking them as deleted.

    The garbage collection of observed dead objects can be performed as part of
    the Commit(), which has the benefit that we can operate on already cloned
    mutated nodes, allowing in-place modification of the trees. This way we also
    don't need any background jobs in StateDB. This does have the semantic
    difference that observed deleted objects are only dropped from indexes on a
    subsequent WriteTxn, but I don't think we'd have an issue with the delayed
    deletions in practice. If needed we can add back a background job to
    essentially do a 'WriteTxn(allTables...).Commit()' periodically.

    There's not a huge difference in benchmarks in terms of time, since we're
    essentially saving a lookup of the graveyard tree during inserts, but this
    does allow for more compact indexing: merging the indexes yields fewer and
    better-packed radix tree nodes. The code for iterating over changes is
    also a bit simpler and more efficient, as it's now just an iteration over
    a single index.
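
To make the scheme above concrete, here is a minimal, self-contained Go sketch with a plain map standing in for the radix-tree indexes; all names (object, table, softDelete, gc) are hypothetical rather than the actual StateDB code:

    package main

    import "fmt"

    // object mirrors the soft-delete representation: deleted objects stay
    // in the index, marked with a flag, instead of moving to a graveyard.
    type object struct {
        revision uint64
        data     any
        deleted  bool
    }

    // table stands in for a StateDB table; a map replaces the primary index.
    type table struct {
        objects map[string]object
    }

    // softDelete marks the object deleted at the given revision instead of
    // removing it, so change observers can still see the deletion.
    func (t *table) softDelete(key string, rev uint64) {
        if obj, ok := t.objects[key]; ok {
            obj.deleted = true
            obj.revision = rev
            t.objects[key] = obj
        }
    }

    // gc drops soft-deleted objects whose deletion every observer has seen,
    // i.e. whose revision is at or below the low watermark. In the PR this
    // runs as part of Commit(), on already cloned tree nodes.
    func (t *table) gc(lowWatermark uint64) {
        for key, obj := range t.objects {
            if obj.deleted && obj.revision <= lowWatermark {
                delete(t.objects, key)
            }
        }
    }

    func main() {
        t := &table{objects: map[string]object{"a": {revision: 1, data: "x"}}}
        t.softDelete("a", 2)
        t.gc(2) // every observer has seen revision 2
        fmt.Println(len(t.objects)) // 0
    }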

@joamaki joamaki requested a review from bimmlerd November 4, 2024 16:25
@joamaki joamaki requested a review from a team as a code owner November 4, 2024 16:25
DeleteTrackerCountVar: newMap("delete_tracker_count"),
RevisionVar: newMap("revision"),
LockContentionVar: newMap("lock_contention"),
GraveyardLowWatermarkVar: newMap("graveyard_low_watermark"),

@joamaki (Contributor, Author):

I think keeping the graveyard naming makes sense here as deleted_object_count would be rather confusing.

@bimmlerd (Member):

I guess it would be objects_pending_deletion_count, but yeah, graveyard here is more concise.

github-actions bot commented Nov 4, 2024

$ make test
go: downloading golang.org/x/sys v0.17.0
go: downloading golang.org/x/tools v0.17.0
go: downloading golang.org/x/exp v0.0.0-20240119083558-1b970713d09a
go: downloading github.com/spf13/cast v1.6.0
go: downloading github.com/fsnotify/fsnotify v1.7.0
go: downloading github.com/sagikazarmark/slog-shim v0.1.0
go: downloading github.com/spf13/afero v1.11.0
go: downloading github.com/subosito/gotenv v1.6.0
go: downloading github.com/hashicorp/hcl v1.0.0
go: downloading gopkg.in/ini.v1 v1.67.0
go: downloading github.com/magiconair/properties v1.8.7
go: downloading github.com/pelletier/go-toml/v2 v2.1.0
go: downloading golang.org/x/text v0.14.0
	github.com/cilium/statedb/reconciler/benchmark		coverage: 0.0% of statements
	github.com/cilium/statedb/reconciler/example		coverage: 0.0% of statements
ok  	github.com/cilium/statedb	27.437s	coverage: 81.8% of statements
ok  	github.com/cilium/statedb/index	0.005s	coverage: 28.7% of statements
ok  	github.com/cilium/statedb/internal	0.011s	coverage: 46.7% of statements
ok  	github.com/cilium/statedb/part	2.861s	coverage: 82.9% of statements
ok  	github.com/cilium/statedb/reconciler	0.320s	coverage: 88.5% of statements
-----
$ make bench
go test ./... -bench . -benchmem -test.run xxx
goos: linux
goarch: amd64
pkg: github.com/cilium/statedb
cpu: AMD EPYC 7763 64-Core Processor                
BenchmarkDB_WriteTxn_1-4                    	  431449	      2715 ns/op	    368352 objects/sec	    2864 B/op	      34 allocs/op
BenchmarkDB_WriteTxn_10-4                   	 1214542	       983.3 ns/op	   1016970 objects/sec	     748 B/op	      10 allocs/op
BenchmarkDB_WriteTxn_100-4                  	 1570803	       764.7 ns/op	   1307753 objects/sec	     599 B/op	       7 allocs/op
BenchmarkDB_WriteTxn_1000-4                 	 1501476	       850.4 ns/op	   1175927 objects/sec	     550 B/op	       7 allocs/op
BenchmarkDB_WriteTxn_100_SecondaryIndex-4   	  502753	      2370 ns/op	    421932 objects/sec	    1508 B/op	      37 allocs/op
BenchmarkDB_Delete-4                        	    1234	    949720 ns/op	   1052942 insert+delete/sec	  560316 B/op	   11252 allocs/op
BenchmarkDB_Delete_With_Changes-4           	     710	   1699283 ns/op	    588484 insert+delete/sec	 1099926 B/op	   14615 allocs/op
BenchmarkDB_Modify-4                        	    1291	    984861 ns/op	   1015373 objects/sec	  770960 B/op	    8461 allocs/op
BenchmarkDB_GetInsert-4                     	    1186	   1025977 ns/op	    974682 objects/sec	  760195 B/op	    8466 allocs/op
BenchmarkDB_RandomInsert-4                  	    2503	    471622 ns/op	   2120343 objects/sec	  401346 B/op	    7094 allocs/op
BenchmarkDB_RandomReplace-4                 	     318	   3765911 ns/op	    265540 objects/sec	 2348257 B/op	   48567 allocs/op
BenchmarkDB_SequentialInsert-4              	    1557	    801356 ns/op	   1247884 objects/sec	  552083 B/op	    7291 allocs/op
BenchmarkDB_Changes_Baseline-4              	    1242	    960727 ns/op	   1040879 objects/sec	  560322 B/op	   11251 allocs/op
BenchmarkDB_Changes-4                       	     700	   1720150 ns/op	    581345 objects/sec	 1101948 B/op	   14620 allocs/op
BenchmarkDB_RandomLookup-4                  	   21844	     56126 ns/op	  17817245 objects/sec	     160 B/op	       1 allocs/op
BenchmarkDB_SequentialLookup-4              	   27080	     44005 ns/op	  22724869 objects/sec	       0 B/op	       0 allocs/op
BenchmarkDB_Prefix_SecondaryIndex-4         	    7833	    148107 ns/op	   6751875 objects/sec	   93898 B/op	    1044 allocs/op
BenchmarkDB_FullIteration_All-4             	     776	   1581976 ns/op	  63212226 objects/sec	     480 B/op	      12 allocs/op
BenchmarkDB_FullIteration_Get-4             	     214	   5614484 ns/op	  17811111 objects/sec	       0 B/op	       0 allocs/op
BenchmarkDB_PropagationDelay-4              	  504532	      2350 ns/op	        20.00 50th_µs	        23.00 90th_µs	        94.00 99th_µs	    1603 B/op	      25 allocs/op
PASS
ok  	github.com/cilium/statedb	31.445s
PASS
ok  	github.com/cilium/statedb/index	0.004s
PASS
ok  	github.com/cilium/statedb/internal	0.003s
goos: linux
goarch: amd64
pkg: github.com/cilium/statedb/part
cpu: AMD EPYC 7763 64-Core Processor                
Benchmark_Insert_RootOnlyWatch-4    	    8581	    132410 ns/op	   7552284 objects/sec	  104162 B/op	    2041 allocs/op
Benchmark_Insert-4                  	    6043	    181193 ns/op	   5518966 objects/sec	  219063 B/op	    3064 allocs/op
Benchmark_Modify-4                  	    8450	    143081 ns/op	   6989063 objects/sec	  212424 B/op	    1205 allocs/op
Benchmark_GetInsert-4               	    6681	    175261 ns/op	   5705791 objects/sec	  212551 B/op	    1204 allocs/op
Benchmark_Replace-4                 	27105801	        44.80 ns/op	  22321032 objects/sec	       0 B/op	       0 allocs/op
Benchmark_Replace_RootOnlyWatch-4   	27151072	        44.56 ns/op	  22439192 objects/sec	       0 B/op	       0 allocs/op
Benchmark_txn_1-4                   	 3038158	       391.4 ns/op	   2554717 objects/sec	     448 B/op	       7 allocs/op
Benchmark_txn_10-4                  	 7592256	       157.7 ns/op	   6341349 objects/sec	     154 B/op	       2 allocs/op
Benchmark_txn_100-4                 	 8409177	       142.8 ns/op	   7004554 objects/sec	     224 B/op	       2 allocs/op
Benchmark_txn_1000-4                	 7396270	       162.2 ns/op	   6164156 objects/sec	     216 B/op	       2 allocs/op
Benchmark_txn_delete_1-4            	 3161295	       380.3 ns/op	   2629720 objects/sec	     856 B/op	       6 allocs/op
Benchmark_txn_delete_10-4           	 8344952	       142.5 ns/op	   7017432 objects/sec	     132 B/op	       1 allocs/op
Benchmark_txn_delete_100-4          	10394166	       115.4 ns/op	   8666159 objects/sec	      60 B/op	       1 allocs/op
Benchmark_txn_delete_1000-4         	10986669	       108.2 ns/op	   9243087 objects/sec	      26 B/op	       1 allocs/op
Benchmark_Get-4                     	   39673	     30624 ns/op	  32654625 objects/sec	       0 B/op	       0 allocs/op
Benchmark_Iterate-4                 	  174397	      6965 ns/op	 143575614 objects/sec	      80 B/op	       3 allocs/op
Benchmark_Hashmap_Insert-4          	   15781	     75273 ns/op	  13285026 objects/sec	   86544 B/op	      64 allocs/op
Benchmark_Hashmap_Get_Uint64-4      	  153133	      7814 ns/op	 127977092 objects/sec	       0 B/op	       0 allocs/op
Benchmark_Hashmap_Get_Bytes-4       	  149127	      8099 ns/op	 123465232 objects/sec	       0 B/op	       0 allocs/op
Benchmark_Uint64Map_Random-4        	    1368	    860169 ns/op	   1162563 items/sec	 2702478 B/op	    9027 allocs/op
Benchmark_Uint64Map_Sequential-4    	    1495	    797045 ns/op	   1254636 items/sec	 2492407 B/op	    9749 allocs/op
PASS
ok  	github.com/cilium/statedb/part	28.309s
PASS
ok  	github.com/cilium/statedb/reconciler	0.004s
?   	github.com/cilium/statedb/reconciler/benchmark	[no test files]
?   	github.com/cilium/statedb/reconciler/example	[no test files]
go run ./reconciler/benchmark -quiet
1000000 objects reconciled in 2.91 seconds (batch size 1000)
Throughput 343150.56 objects per second
Allocated 6011283 objects, 424769kB bytes, 542008kB bytes still in use

txn.go (resolved)
@bimmlerd bimmlerd left a comment


A couple of nits, but I haven't spotted bugs so far. It seems pretty easy to miss an if obj.deleted somewhere, but I'd hope tests catch it.

)

// object is the format in which data is stored in the tables.
type object struct {
	revision uint64
	data     any
	deleted  bool

@bimmlerd (Member):

fieldalignment would probably enjoy having data first, so that the GC doesn't have to scan further (as there can't be more pointers after it)

In general, can we avoid this bool in the struct somehow?
Alternatively, do you envision it makes sense to make this a bitmask for future flags on all objs?

@joamaki (Contributor, Author):

I played a bit with having data point to a struct deleted { data any }, but that just caused a further allocation, so it was no good. If we mandated that T is always a pointer then we could store data as an unsafe.Pointer here and use the bytes saved from not having the type information around, but that's probably a bit annoying to work with.

But yeah, probably worth checking the field alignments, especially with the object embedded into a part node. Will take a look.

@bimmlerd (Member):

could also halve the revision space and use the bit 😇
but I think you mentioned somewhere that the size didn't increase? that's surprising 😅

@joamaki (Contributor, Author):

Hmm, we could do that, but we'd need to make sure the "deleted bit" is the least significant one, which is a bit annoying (shift left by one and set). But yeah, since the size didn't increase (most likely because there was padding in the node structs), it's not worth optimizing.
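
For illustration, a sketch of the bit-packing idea being discussed here, assuming the deleted flag occupies the least significant bit so that ordering by packed key still follows revision order (hypothetical; this was not implemented):

    // packedRev packs a revision and a deleted flag into one uint64,
    // halving the usable revision space.
    type packedRev uint64

    func pack(rev uint64, deleted bool) packedRev {
        p := packedRev(rev << 1) // shift left by one...
        if deleted {
            p |= 1 // ...and set the least significant bit
        }
        return p
    }

    func (p packedRev) revision() uint64 { return uint64(p) >> 1 }
    func (p packedRev) deleted() bool    { return p&1 == 1 }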

@@ -440,28 +435,44 @@ type indexEntry struct {
	unique bool
}

type revisionRange struct {

@bimmlerd (Member):

comment on whether this is a half-open interval?

@joamaki (Contributor, Author):

ah but it's not half-open, the end is included. commenting :D
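
Spelling out the agreed semantics as a sketch (field names are assumptions, not the PR's code):

    // revisionRange is a closed interval: both ends are included.
    type revisionRange struct {
        low  uint64 // revision of the oldest deleted object (inclusive)
        high uint64 // revision of the newest deleted object (inclusive)
    }

    func (r revisionRange) contains(rev uint64) bool {
        return rev >= r.low && rev <= r.high
    }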

types.go (outdated)
// deleted objects.
deletedRange revisionRange

numDeletedObjects int

@bimmlerd (Member):

could be uint, but probably makes arithmetic annoying because of casts

types.go (outdated)
// deletedRange holds the revisions of the oldest and
// newest deleted objects. Used to short-cut GCing of
// deleted objects.
deletedRange revisionRange

@bimmlerd (Member):

deletedRange is somehow hard to understand, but I'm struggling to come up with a better name. gcRange?

@joamaki (Contributor, Author):

I like gcRange
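
A hedged sketch of how such a gcRange could short-cut the Commit-time GC; the names and logic here are assumptions, not the PR's code:

    // shouldScan reports whether a GC scan can collect anything: only when
    // there are pending deletions at all and the oldest of them has been
    // observed by every change tracker (is at or below the low watermark).
    func shouldScan(oldestDeleted, newestDeleted, lowWatermark uint64) bool {
        if newestDeleted == 0 {
            return false // no pending deletions, nothing to scan
        }
        return lowWatermark >= oldestDeleted
    }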

iterator.go (resolved)
iterator.go (outdated)
-	updateIter := &iterator[Obj]{indexTxn.LowerBound(index.Uint64(it.revision + 1))}
-	deleteIter := it.dt.deleted(itxn, it.deleteRevision+1)
-	it.iter = NewDualIterator(deleteIter, updateIter)
+	it.iter = indexTxn.LowerBound(index.Uint64(it.revision + 1))

// It is enough to watch the revision index and not the graveyard since

@bimmlerd (Member):

no more graveyard, so the comment can go
generally, probably grep for graveyard; there are some more comment references to it which are stale

@joamaki (Contributor, Author):

ah good point. bunch of outdated docs in type RWTable...

iterator.go (outdated)
if obj.deleted {
if rev <= it.startRevision {
// Ignore objects that were marked deleted before this
// changge iterator was created.

@bimmlerd (Member):

changge typo

It's somehow asymmetric to me that the iterator has a startRevision, but only cares about it for deleted objects. It seems correct, but irks me for some reason

@joamaki (Contributor, Author):

Yeah let me rename that!
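
For context, a small sketch of the filtering rule this thread discusses (names assumed; startRevision was about to be renamed): deletions that happened before the iterator existed are skipped, since the caller never saw those objects alive.

    // emit reports whether a change should be yielded by the change
    // iterator: updates always are, deletions only if they happened after
    // the iterator was created.
    func emit(deleted bool, rev, iteratorCreatedAt uint64) bool {
        if deleted && rev <= iteratorCreatedAt {
            return false // object died before we started watching; skip
        }
        return true
    }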


@@ -388,7 +391,7 @@ func (t *genTable[Obj]) ListWatch(txn ReadTxn, q Query[Obj]) (iter.Seq2[Obj, Rev
	// Doing a Get() is more efficient than constructing an iterator.
	value, watch, ok := indexTxn.Get(q.key)
	seq := func(yield func(Obj, Revision) bool) {
-		if ok {
+		if ok && !value.deleted {
			yield(value.data.(Obj), value.revision)
		}
	}

@bimmlerd (Member):

Also in the CompareAndSwap impl; should the compare not first check that the compared-to object is not deleted?

@joamaki (Contributor, Author):

Not sure what you mean. We can't look into value unless ok is true.

@bimmlerd (Member):

uhh, I think I was trying to comment on CompareAndDelete below but couldn't because of the GitHub UI

but the inner delete handles the deleted objects, so I think this is just fine
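
As an illustration of why this works out, a sketch of a compare-and-delete over the soft-deleted representation (hypothetical names and semantics, not the actual StateDB implementation): an object already marked deleted is treated as absent.

    type object struct {
        revision uint64
        data     any
        deleted  bool
    }

    // compareAndDelete soft-deletes the object only if it is live and its
    // revision matches the caller's expectation.
    func compareAndDelete(objs map[string]object, key string, expectedRev, newRev uint64) bool {
        obj, ok := objs[key]
        if !ok || obj.deleted {
            return false // absent or already soft-deleted: treat as gone
        }
        if obj.revision != expectedRev {
            return false // revision mismatch: the caller's view is stale
        }
        obj.deleted = true
        obj.revision = newRev // the deletion gets its own revision
        objs[key] = obj
        return true
    }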

txn.go (outdated)
if !hadOld {
panic("BUG: Object to be deleted not found from primary index")
}
table.numDeletedObjects--

@bimmlerd (Member):

I guess you could call it numObjsPendingDeletion instead, which would look more intuitive here, but it's fine either way

Test the memory usage overhead per object to make sure changes we make don't
drastically increase our memory consumption.

Signed-off-by: Jussi Maki <[email protected]>
@joamaki joamaki force-pushed the pr/joamaki/no-graveyard branch from c127d1b to ac41f4c Compare November 6, 2024 11:36
@joamaki joamaki force-pushed the pr/joamaki/no-graveyard branch from 624c07d to 5ee5d23 Compare November 7, 2024 15:12
joamaki commented Nov 8, 2024

While I would've guessed that this approach would've performed better and used less memory, the statistics from the load-balancer benchmark show otherwise:

With this change:

Memory statistics from N=10 iterations:
Min: Allocated 815937kB in total, 3162627 objects / 208585kB still reachable (per service:  63 objs, 16710B alloc,  4271B in-use)
Avg: Allocated 825660kB in total, 3183026 objects / 216537kB still reachable (per service:  63 objs, 16909B alloc,  4434B in-use)
Max: Allocated 900403kB in total, 3332232 objects / 277196kB still reachable (per service:  66 objs, 18440B alloc,  5676B in-use)

Before this:

Memory statistics from N=10 iterations:
Min: Allocated 818803kB in total, 2294293 objects / 127964kB still reachable (per service:  45 objs, 16769B alloc,  2620B in-use)
Avg: Allocated 833869kB in total, 2586240 objects / 163081kB still reachable (per service:  51 objs, 17077B alloc,  3339B in-use)
Max: Allocated 902984kB in total, 3329752 objects / 276950kB still reachable (per service:  66 objs, 18493B alloc,  5671B in-use)

Because of that I won't move forward with this, at least in this form, and will close this PR for now.

@joamaki joamaki closed this Nov 8, 2024