sync2: implement multi-peer synchronization #6358

Open
wants to merge 11 commits into
base: develop

@ivan4th ivan4th commented Sep 30, 2024

Motivation

syncv2 must ensure that the network is in sync by periodically syncing against multiple peers, including when starting a fresh or stale node.
When a lot of data needs to be transferred during sync, it is better to spread the load across multiple peers to avoid costly ax/1-like requests.

Description

#6404 needs to be merged before this one.

This adds multi-peer synchronization support.
When the local set differs so much from the remote sets that pairwise sync would degrade to transferring the whole set, a torrent-style "split sync" is attempted: the set is split into subranges and each subrange is synced against a separate peer. Otherwise, a full sync is done, syncing the whole set against each of the synchronization peers.
A full sync is also done after each split sync run.
The local set can be considered synchronized after the specified number of full syncs has happened.
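
The choice between the two modes described above can be sketched roughly as follows. This is only an illustration, not the PR's code: `probeResult` borrows the `Count` and `Sim` field names from the `rangesync.ProbeResult` literal used in the tests quoted later in this thread, while `chooseMode`, `minSim`, and the rule "split if any probed peer looks too different" are assumptions made for the example.

```go
package main

import "fmt"

// probeResult is a simplified stand-in for rangesync.ProbeResult.
type probeResult struct {
	Count int     // number of items in the peer's set
	Sim   float64 // estimated similarity to the local set, in [0, 1]
}

type syncMode int

const (
	fullSync syncMode = iota
	splitSync
)

// chooseMode returns splitSync when any probed peer is less similar to the
// local set than minSim (a hypothetical threshold), and fullSync otherwise.
func chooseMode(probes []probeResult, minSim float64) syncMode {
	for _, p := range probes {
		if p.Sim < minSim {
			return splitSync
		}
	}
	return fullSync
}

func main() {
	close := []probeResult{{Count: 100, Sim: 0.99}}
	far := []probeResult{{Count: 100, Sim: 0.99}, {Count: 5000, Sim: 0.10}}
	fmt.Println(chooseMode(close, 0.9) == fullSync)
	fmt.Println(chooseMode(far, 0.9) == splitSync)
}
```

Whatever the real threshold logic looks like, a split sync run is followed by a full sync, so the split path only reduces the amount of data each peer has to serve.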

The approach is loosely based on the paper "SREP: Out-Of-Band Sync of Transaction Pools for Large-Scale Blockchains" (https://people.bu.edu/staro/2023-ICBC-Novak.pdf) by Novak Boškov, Sevval Simsek, Ari Trachtenberg, and David Starobinski.

codecov bot commented Sep 30, 2024

Codecov Report

Attention: Patch coverage is 83.90663% with 131 lines in your changes missing coverage. Please review.

Project coverage is 79.8%. Comparing base (d32ffaa) to head (cf46587).
Report is 3 commits behind head on develop.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| sync2/multipeer/split_sync.go | 70.7% | 29 Missing and 9 partials ⚠️ |
| sync2/multipeer/multipeer.go | 89.0% | 22 Missing and 6 partials ⚠️ |
| sync2/p2p.go | 67.0% | 21 Missing and 5 partials ⚠️ |
| sync2/multipeer/setsyncbase.go | 78.7% | 15 Missing and 9 partials ⚠️ |
| sync2/multipeer/sync_queue.go | 86.8% | 5 Missing and 3 partials ⚠️ |
| sync2/rangesync/rangesync.go | 88.7% | 4 Missing and 3 partials ⚠️ |
Additional details and impacted files
@@            Coverage Diff            @@
##           develop   #6358     +/-   ##
=========================================
+ Coverage     79.7%   79.8%   +0.1%     
=========================================
  Files          328     335      +7     
  Lines        42977   43656    +679     
=========================================
+ Hits         34256   34860    +604     
- Misses        6782    6831     +49     
- Partials      1939    1965     +26     


@ivan4th ivan4th requested a review from jellonek as a code owner October 9, 2024 17:31
@spacemesh-bors spacemesh-bors bot changed the base branch from syncv2/pairwise to develop October 17, 2024 05:40
@ivan4th ivan4th force-pushed the sync2/multipeer branch 3 times, most recently from b8ea626 to 82b5a72 Compare October 21, 2024 03:54
Given that, after recent item sync is done (if it's needed at all), the
range set reconciliation algorithm no longer depends on newly received
items being added to the set, we can save memory by not adding the
received items during reconciliation.

During real sync, the received items will be sent to the respective
handlers and after the corresponding data are fetched and validated,
they will be added to the database, without the need to add them to
cloned OrderedSets which are used to sync against particular peers.
@ivan4th ivan4th changed the base branch from develop to sync2/rangesync-recent October 23, 2024 10:29
ivan4th commented Oct 23, 2024

Given that no review comments were added yet, I've rebased the PR on top of #6404, squashing commits one more time

@ivan4th ivan4th mentioned this pull request Oct 23, 2024
This adds multi-peer synchronization support.
When the local set differs too much from the remote sets,
"torrent-style" "split sync" is attempted which splits the set into
subranges and syncs each sub-range against a separate peer.
Otherwise, the full sync is done, syncing the whole set against
each of the synchronization peers.
Full sync is also done after each split sync run.
The local set can be considered synchronized after the specified
number of full syncs has happened.

The approach is loosely based on [SREP: Out-Of-Band Sync of
Transaction Pools for Large-Scale
Blockchains](https://people.bu.edu/staro/2023-ICBC-Novak.pdf) paper by
Novak Boškov, Sevval Simsek, Ari Trachtenberg, and David Starobinski.
"github.com/spacemeshos/go-spacemesh/sync2/rangesync"
)

func getDelimiters(numPeers, keyLen, maxDepth int) (h []rangesync.KeyBytes) {
Contributor

not sure I understand the context of the word delimiters here. It isn't really clear what this function does

Contributor Author

Added a comment describing the purpose of this function
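
For readers following the thread, one plausible reading of "delimiters" is the set of boundary keys that cut the key space into one subrange per peer. The sketch below illustrates that reading only; it is not the function from the PR. `keyBytes` stands in for `rangesync.KeyBytes` (assumed here to be a byte slice), and the interpretation of `maxDepth` as "only the top `maxDepth` bits of a boundary may be non-zero" is an assumption.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// keyBytes stands in for rangesync.KeyBytes (assumed to be a byte slice).
type keyBytes []byte

// delimiters returns numPeers-1 keys that cut the key space into numPeers
// roughly equal subranges. Only the top maxDepth bits of each key may be
// non-zero. Assumes keyLen >= 8 and 0 < maxDepth <= 64.
func delimiters(numPeers, keyLen, maxDepth int) []keyBytes {
	if numPeers < 2 {
		return nil
	}
	mask := ^uint64(0) << (64 - maxDepth)
	step := ^uint64(0)/uint64(numPeers) + 1
	out := make([]keyBytes, 0, numPeers-1)
	for i := 1; i < numPeers; i++ {
		k := make(keyBytes, keyLen)
		// Only the first 8 bytes carry the boundary; the rest stay zero.
		binary.BigEndian.PutUint64(k, (uint64(i)*step)&mask)
		out = append(out, k)
	}
	return out
}

func main() {
	for _, k := range delimiters(4, 8, 64) {
		fmt.Printf("%x\n", []byte(k))
	}
}
```

With four peers this yields three boundaries, so peer 0 would take everything below the first boundary, peer 1 the range between the first and second, and so on.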

sync2/multipeer/dumbset.go Outdated (resolved)
// It extends rangesync.OrderedSet with methods which are needed for multi-peer
// reconciliation.
type OrderedSet interface {
	rangesync.OrderedSet
Contributor

do we really need to have these multiple interfaces that share the same name in a different package? why can't we just have one OrderedSet interface? also the inheritance syntax is confusing in this context. if this is strictly needed and we can't do without it, consider adding to the interface naming something that suggests that it has to do with syncing (iiuc from the comment).

Contributor Author

I should probably do it indeed, this kind of splitting for this interface was motivated by the need to split the whole syncv2 thing into multiple PRs, but at this point it's already not needed that much

Contributor Author

Merged multipeer.OrderedSet interface into rangesync.OrderedSet

	Has(rangesync.KeyBytes) (bool, error)
	// Release releases the resources associated with the set.
	// Calling Release on a set that is already released is a no-op.
	Release() error
Contributor

consider using io.Closer here

Contributor Author

On one hand this seems to make sense, but OrderedSet is not really an I/O primitive, so I'm somewhat in doubt here.
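
One way to see the trade-off: a set can keep a domain-specific `Release` method with the documented no-op-on-second-call behavior, and a small adapter can still hand it to code that expects an `io.Closer`. The sketch below is illustrative only; `memSet`, `releaser`, and `closerAdapter` are hypothetical names, not types from the PR.

```go
package main

import (
	"fmt"
	"io"
	"sync"
)

// releaser mirrors the Release method from the interface quoted above.
type releaser interface {
	Release() error
}

// memSet is a hypothetical in-memory set used only for illustration.
type memSet struct {
	once     sync.Once
	items    map[string]struct{}
	released int // counts how many times cleanup actually ran
}

// Release frees the set's resources. Only the first call does any work;
// later calls are no-ops and return nil.
func (s *memSet) Release() error {
	s.once.Do(func() {
		s.items = nil
		s.released++
	})
	return nil
}

// closerAdapter exposes a releaser as an io.Closer for callers that want one.
type closerAdapter struct{ r releaser }

func (c closerAdapter) Close() error { return c.r.Release() }

var _ io.Closer = closerAdapter{}

func main() {
	s := &memSet{items: map[string]struct{}{"a": {}}}
	var c io.Closer = closerAdapter{r: s}
	_ = c.Close()
	_ = s.Release() // second release is a no-op
	fmt.Println("cleanup runs:", s.released)
}
```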

}

// Syncer is a synchronization interface for a single peer.
type Syncer interface {
Contributor

nit: PeerSyncer?

Contributor Author

Makes sense, will rename

Contributor Author

Renamed

return sr
}

func newSyncQueue(numPeers, keyLen, maxDepth int) syncQueue {
Contributor

I wonder why not to return a pointer here? also all the methods have pointer semantics.

Contributor Author

syncQueue is just a slice, not a struct, so the pointer is only needed for the methods that modify it, as there's not much copying involved
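
The point about slice types can be shown with a small stand-alone example. `queue` below is a hypothetical stand-in for `syncQueue`, not its actual definition: returning it by value copies only the slice header, read-only methods can use value receivers, and only the methods that change the length need pointer receivers.

```go
package main

import "fmt"

// queue is a hypothetical named slice type, standing in for syncQueue.
type queue []int

// newQueue returns the slice by value; only the slice header is copied.
func newQueue(n int) queue {
	q := make(queue, 0, n)
	for i := 0; i < n; i++ {
		q = append(q, i)
	}
	return q
}

// Len only reads the slice, so a value receiver is enough.
func (q queue) Len() int { return len(q) }

// Push changes the slice length, so it needs a pointer receiver.
func (q *queue) Push(v int) { *q = append(*q, v) }

// Pop removes and returns the last element; the queue must not be empty.
func (q *queue) Pop() int {
	old := *q
	v := old[len(old)-1]
	*q = old[:len(old)-1]
	return v
}

func main() {
	q := newQueue(3)
	q.Push(7) // the compiler takes &q automatically for an addressable variable
	fmt.Println(q.Len(), q.Pop(), q.Len())
}
```

This is the same shape that `container/heap` users typically end up with: `Len`, `Less`, and `Swap` on the value, `Push` and `Pop` on the pointer.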

for sl.syncs.Len() != 0 {
	el := sl.syncs.Back()
	if t.After(el.Value.(time.Time)) {
		sl.syncs.Remove(el)
Contributor

can this not cause a memory leak? i.e. if a series of doubly linked list items gets cut off from the rest, it means they might just continue living in memory because they reference each other (won't show up in the gc mark-and-sweep runs).

Contributor Author

Go's double-linked implementation unlinks the list element properly: https://github.com/golang/go/blob/go1.23.2/src/container/list/list.go#L108-L115

sync2/p2p.go Outdated
"github.com/spacemeshos/go-spacemesh/sync2/rangesync"
)

type Dispatcher = rangesync.Dispatcher
Contributor

why is the type alias needed?

Contributor Author

My idea was for sync2 package to serve as a facade that hides all the implementation details of sync itself beneath it

Contributor

That's fine, I'm still not sure I understand the type-aliasing narrative though. Not sure how the two are related. Unless you're decorating the original type with more functionality I don't see why this should be necessary. It just adds more indirection to an already quite large package. If you need to leak stuff out of the package, maybe better to do it through interfaces instead of type aliasing (that's just my opinion though).

Contributor Author

Ok it was a remnant from an older iteration where Dispatcher did not use the constructor, etc.
Given that rangesync.OrderedSet etc. is needed anyway, I removed this type alias.

sync2/p2p.go Outdated
s.reconciler = multipeer.NewMultiPeerReconciler(
	s.syncBase, peers, keyLen, maxDepth,
	multipeer.WithLogger(logger),
	multipeer.WithSyncPeerCount(cfg.SyncPeerCount),
Contributor

seems like a good candidate for a Config type. many function calls that can be avoided

Contributor Author

Switched to config type (also did that for rangesync package)
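
For context, a config-type constructor usually looks something like the sketch below. It is not the code that landed in the PR: only the `SyncPeerCount` name comes from the snippet above, while `SyncInterval`, `DefaultConfig`, `Validate`, `newReconciler`, and the default values are hypothetical and chosen arbitrarily for the example.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// Config groups the reconciler's tunables in one place.
type Config struct {
	SyncPeerCount int           // name taken from cfg.SyncPeerCount above
	SyncInterval  time.Duration // hypothetical field
}

// DefaultConfig returns arbitrary example defaults.
func DefaultConfig() Config {
	return Config{SyncPeerCount: 20, SyncInterval: 5 * time.Minute}
}

// Validate rejects values the reconciler could not work with.
func (c Config) Validate() error {
	if c.SyncPeerCount <= 0 {
		return errors.New("SyncPeerCount must be positive")
	}
	if c.SyncInterval <= 0 {
		return errors.New("SyncInterval must be positive")
	}
	return nil
}

type reconciler struct{ cfg Config }

// newReconciler takes the whole config instead of one option call per field.
func newReconciler(cfg Config) (*reconciler, error) {
	if err := cfg.Validate(); err != nil {
		return nil, err
	}
	return &reconciler{cfg: cfg}, nil
}

func main() {
	cfg := DefaultConfig()
	cfg.SyncPeerCount = 8 // override a single field
	r, err := newReconciler(cfg)
	fmt.Println(r.cfg.SyncPeerCount, err)
}
```

Callers start from the defaults and override only what they need, and all validation lives in one method instead of being spread across option functions.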

	return
}
s.running.Store(true)
s.start.Do(func() {
Contributor

not sure why Once is needed? can it be that it will be called from multiple places?

Contributor Author

The idea was for the Start method to be idempotent

Member

I don't quite understand the reasoning - Start can be called any amount of times - but only the first time it actually does something? Start - Stop - Start causes running to be true but the component is actually in the stopped state?

Contributor Author

The idea is that there's no harm in invoking Start multiple times on a P2PHashSync, but after you Stop() it, you throw it away (it is non-restartable)
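
The lifecycle described here (idempotent `Start`, no restart after `Stop`) can be sketched with `sync.Once` as below. `worker` is a hypothetical type, not `P2PHashSync`; it leaves out the `running` flag and the errgroup from the PR, and the background loop is reduced to waiting for cancellation.

```go
package main

import (
	"context"
	"fmt"
	"sync"
	"sync/atomic"
)

// worker is a hypothetical component with an idempotent, one-shot Start.
type worker struct {
	start  sync.Once
	cancel context.CancelFunc
	done   chan struct{}
	runs   atomic.Int32 // how many times the loop was actually launched
}

func newWorker() *worker { return &worker{done: make(chan struct{})} }

// Start launches the background loop on the first call only. Later calls,
// including calls made after Stop, do nothing.
func (w *worker) Start() {
	w.start.Do(func() {
		ctx, cancel := context.WithCancel(context.Background())
		w.cancel = cancel
		w.runs.Add(1)
		go func() {
			defer close(w.done)
			<-ctx.Done() // stand-in for the real sync loop
		}()
	})
}

// Stop cancels the loop and waits for it to exit. Stop before Start consumes
// the Once, so the worker can never be started afterwards.
func (w *worker) Stop() {
	w.start.Do(func() { close(w.done) })
	if w.cancel != nil {
		w.cancel()
	}
	<-w.done
}

func main() {
	w := newWorker()
	w.Start()
	w.Start() // no-op
	w.Stop()
	w.Start() // still a no-op: the worker is not restartable
	fmt.Println("loop launches:", w.runs.Load())
}
```

A design like this trades restartability for simplicity: there is no state machine to get wrong, but callers must create a new instance if they want to run again.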

@spacemesh-bors spacemesh-bors bot changed the base branch from sync2/rangesync-recent to develop October 28, 2024 12:15
Comment on lines +125 to +130
s.eg.Go(func() error {
	defer s.running.Store(false)
	var ctx context.Context
	ctx, s.cancel = context.WithCancel(context.Background())
	return s.reconciler.Run(ctx)
})
Member

Start just serves as a wrapper around the Run method here. Would it make sense to instead of having the Start, Stop and s.running methods/fields to just have a Run method that passes along the context to s.reconciler? This would also get rid of the need for s.cancel.

Contributor Author

P2PHashSync has active sync enable/disable flag as its config option. That's part of P2PHashSync logic.
When cfg.EnableActiveSync is false, Start / Stop are noops. In this case, P2PHashSync only serves requests received from a p2p Server via a Dispatcher.

Comment on lines 221 to 239
var ctx context.Context
for i := 0; i < numSyncs; i++ {
	pl := mt.expectProbe(6, rangesync.ProbeResult{
		FP:    "foo",
		Count: 100,
		Sim:   0.99, // high enough for full sync
	})
	mt.expectFullSync(pl, 6, 0)
	mt.syncBase.EXPECT().Wait()
	if i == 0 {
		//nolint:fatcontext
		ctx = mt.start()
	} else {
		// first full sync happens immediately
		mt.clock.Advance(time.Minute)
	}
	mt.clock.BlockUntilContext(ctx, 1)
	mt.satisfy()
}
Member

If the first loop behaves differently would it make sense to indicate it as such more clearly? This should also get rid of the linter warning.

Suggested change
pl := mt.expectProbe(6, rangesync.ProbeResult{
	FP:    "foo",
	Count: 100,
	Sim:   0.99, // high enough for full sync
})
mt.expectFullSync(pl, 6, 0)
mt.syncBase.EXPECT().Wait()
mt.clock.Advance(time.Minute) // first sync happens immediately
mt.clock.BlockUntilContext(context.Background(), 1)
mt.satisfy()
for i := 1; i < numSyncs; i++ {
	pl := mt.expectProbe(6, rangesync.ProbeResult{
		FP:    "foo",
		Count: 100,
		Sim:   0.99, // high enough for full sync
	})
	mt.expectFullSync(pl, 6, 0)
	mt.syncBase.EXPECT().Wait()
	ctx := mt.start()
	mt.clock.BlockUntilContext(ctx, 1)
	mt.satisfy()
}

Contributor Author

It's the other way around, we need to do mt.start() in the initial iteration and advance the clocks in the following ones. But otherwise that's probably the right idea, so I updated the code, except that I wrapped expectations in the nested func expect to highlight the fact that the expectations are the same right after startup and when it's time to do more syncs

sync2/p2p.go Outdated
Comment on lines 81 to 84
peer, found := server.ContextPeerID(ctx)
if !found {
	panic("BUG: no peer ID found in the handler")
}
Member

Would it make sense to pass the peer explicitly instead of putting it into the context and then panicing when it isn't there? Afaik this is the only place in our codebase at the moment where we put values in the context at all. A missing value should not lead to a panic or this feels like a misuse of context.WithValue to me 🤔

Contributor Author

That entailed quite a few changes in unrelated code as most of the existing p2p.Server use cases don't care about the peer ID, but I still updated the p2p.Server and got rid of that context key

3 participants