Waku Sync 2.0 #102

Open

SionoiS opened this issue Sep 12, 2024 · 14 comments

SionoiS commented Sep 12, 2024

Waku Sync 2.0

Waku Sync 1.0 wraps the Negentropy (RBSR-based) protocol. I believe that designing a new RBSR protocol specifically for Waku would increase its usefulness and efficiency.

I will describe version 2.0 below. Keep in mind that I will not explain range-based set reconciliation in this post.

Why range-based set reconciliation?

The flexibility of RBSR in terms of storage technology, its stateless syncing sessions and its overall simplicity make it a good fit for Waku. The ability to generate membership proofs could also be added as a separate system if needed.

Prolly trees are capable of membership proofs (they are Merkle trees after all) but would require the implementation of a garbage collector, and the tree state cannot be changed while a sync is ongoing. The rigid structure that gives Prolly trees their usefulness is also a detriment: the physical storage cannot be adapted to the data structure, whereas with RBSR the data structure can be adapted to the physical storage.

Overview

The ultimate goal is that a syncing pair of nodes ends up with the same set of messages. To achieve this we will use two protocols concurrently: one to transfer messages, the other to inform the first as to which messages to transfer.

Reconciliation

This protocol is responsible for set reconciliation and is inspired by Negentropy, but with modifications, because knowing the nature of the data stored (Waku messages) allows us to make informed decisions about the fingerprinting function to use, the attack vectors and the storage data structure. We will also use specific sync ranges to optimize for Waku use cases (light push, filter, relay).

Transfer

This is the simplest of the two systems; its main purpose is to send messages to the other node in the sync pair. Its only input will be message hashes, and it will rely on storage for message retrieval. To lower latency, no buffer or batching will be done so that messages are integrated into the set of the recipient node as fast as possible. This way, if the recipient node is also syncing with other nodes at the same time, these new messages may also be synced and transferred.

Reconciliation Protocol

The protocol operates on sets of elements that can be totally ordered. In our case the (timestamp, hash) pair is used as the total order.
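
To make the ordering concrete, here is a minimal sketch in Go (the SyncID name and field layout are illustrative, not the actual nwaku types):

```go
package sketch

import "bytes"

// SyncID is the element identifier assumed in this post: a message timestamp
// plus the message hash as a tie-breaker.
type SyncID struct {
	Timestamp int64    // message timestamp
	Hash      [32]byte // Waku message hash
}

// Less reports whether a orders before b under the (timestamp, hash) total order.
func Less(a, b SyncID) bool {
	if a.Timestamp != b.Timestamp {
		return a.Timestamp < b.Timestamp
	}
	return bytes.Compare(a.Hash[:], b.Hash[:]) < 0
}
```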

Range Choice

Ranges can be chosen based on the specific use case to reduce the number of round trips in the best case. For example, if the context of the sync requires fast propagation (Relay), smaller ranges in recent history would be best (an RLN epoch). Conversely, for syncing archives, choosing large ranges in the oldest part of the set would be preferable.

Nodes can use any strategy; the protocol stays the same.

Fingerprinting Function

The fingerprint of a range would be the XOR operation applied to every hash in that range. Waku messages are hashed and then RLN-verified when received; reusing this hash and the included timestamp makes the most sense.
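
As a concrete illustration (a sketch only, assuming 32-byte message hashes rather than any finalized format), the fingerprint is an XOR fold over the hashes in the range:

```go
package sketch

// Fingerprint XOR-folds every message hash in a range into a single value.
// The empty range fingerprints to all zeroes.
func Fingerprint(hashes [][32]byte) (fp [32]byte) {
	for _, h := range hashes {
		for i := range fp {
			fp[i] ^= h[i]
		}
	}
	return fp
}
```

A useful property of the XOR fold is that a cached fingerprint can be updated incrementally: adding or removing one element is a single XOR.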

Attack Vectors

The sync process could be perturbed by carefully crafting Waku messages that, combined with others, lead to a range fingerprint identical to another one, so as to prevent other nodes from syncing that range. While possible in theory, it would be impractical. The first problem is that messages are encrypted; it is not possible to associate users with messages. Second, the very specific case where the sync process skips a range due to an attack has low impact: when syncing at a later moment, the ranges would differ and the targeted message would still be synced. Sending messages also has a cost via RLN, which acts as a deterrent. Finally, if an attacker controlled enough nodes they could mount far more effective attacks than temporarily disturbing the sync process.

Bidirectionality

Negentropy assumes a client-server model; we don't. In their case only the client knows all the differences, whereas we want both nodes to know the differences for each range. When receiving an item set we will answer with our own item set and indicate that the sync for that range is done.

Storage

The goal of any storage implementation is to make the range fingerprinting as efficient as possible for that specific use case.

Queues

Light clients should sync very small ranges so as to keep computation and stored state reasonably low. In-memory queues should be enough to hold the state, and fingerprints can be computed by simple iteration.

Trees

Tree structures will be needed when linearly iterating to compute fingerprints becomes too inefficient. Instead of O(n), they can compute fingerprints in O(log n) of the data set size. For queries that include content topics, k-d trees could also be used. The problem is that inserting and removing elements of the tree also has a cost. We know that insertions will be frequent, and deletions can be too in cases where only the latest subset is kept rather than the whole set. Balancing these properties presents an interesting engineering challenge.
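
To illustrate the idea only (balancing, deletions and range queries are omitted, and none of these names come from nwaku), an ordered tree can cache the XOR fingerprint of each subtree so that an insert refreshes only the cached values along one root-to-leaf path:

```go
package sketch

type Hash = [32]byte

type node struct {
	key         int64 // timestamp used as the ordering key in this sketch
	hash        Hash  // message hash stored at this node
	subtreeFP   Hash  // XOR of every hash in this subtree (cached)
	left, right *node
}

// xorInto folds h into dst in place.
func xorInto(dst *Hash, h Hash) {
	for i := range dst {
		dst[i] ^= h[i]
	}
}

// insert adds (key, hash) and refreshes the cached fingerprint of every node
// on the path from the root to the new leaf, which is O(log n) when balanced.
func insert(n *node, key int64, h Hash) *node {
	if n == nil {
		return &node{key: key, hash: h, subtreeFP: h}
	}
	xorInto(&n.subtreeFP, h) // the new element lands somewhere below this node
	if key < n.key {
		n.left = insert(n.left, key, h)
	} else {
		n.right = insert(n.right, key, h)
	}
	return n
}
```

A range fingerprint query can then combine whole-subtree fingerprints instead of visiting every element, which is where the O(log n) behaviour comes from.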

Transfer Protocol

Messages will be sent on a best-effort basis without retries, responses or error handling. The assumption is that any message that fails to reach its destination will be synchronized and sent later. On the receiving end, every message will be hashed and its RLN proof verified, then it will be added to the local storage. Nodes will only accept incoming connections from nodes they are synchronizing with.

The Waku message protocol buffers will be reused as the wire format.

Connection & Scoring

Who to sync with may not be part of the protocol but is an important detail nonetheless. In theory choosing random peers may be best, but in practice we may want to keep track of latency, online status and the amount of messages synced so that the local node can curate a high quality list of peers. Another strategy we could implement is optimistic tit-for-tat (à la BitTorrent), since we can easily count how many missed messages we sent and received, and from whom.
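
One possible shape for that accounting, purely as a sketch (the counters, names and scoring formula are placeholders, not a committed design):

```go
package sketch

// PeerStats tracks how useful a sync peer has been to us and vice versa.
type PeerStats struct {
	Sent     int // messages we transferred to this peer
	Received int // messages this peer transferred to us
}

// Scorer keeps per-peer counters for an optimistic tit-for-tat style policy.
type Scorer struct {
	stats map[string]*PeerStats
}

func NewScorer() *Scorer {
	return &Scorer{stats: make(map[string]*PeerStats)}
}

// Record updates the counters after a sync session with peerID.
func (s *Scorer) Record(peerID string, sent, received int) {
	st, ok := s.stats[peerID]
	if !ok {
		st = &PeerStats{}
		s.stats[peerID] = st
	}
	st.Sent += sent
	st.Received += received
}

// Score favours peers that supplied us with more missing messages than we
// supplied them; the exact formula is a placeholder.
func (s *Scorer) Score(peerID string) float64 {
	st, ok := s.stats[peerID]
	if !ok {
		return 0
	}
	return float64(st.Received+1) / float64(st.Sent+1)
}
```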

SionoiS self-assigned this Sep 12, 2024

SionoiS commented Sep 17, 2024

Technical Stuff

Protocol Parameters

  • T -> Item set threshold. If a range's length is <= T, all its items are sent. A higher T sends more items, which means a higher chance of duplicates but fewer round trips overall.
  • B -> Partitioning count. When recursively splitting a range, it is split into B sub-ranges. A higher B reduces round trips at the cost of computing more fingerprints.

When the recursion is seen as a tree:

The parameter t influences the height of the tree. For t = 1, the protocol recurses as far as possible. For t = b, the last level of recursion is cut off, for t = b² the last two levels, and so on. Overall, the height of the tree is reduced by ⌊log_b(t)⌋.

N.B. both syncing nodes can have different parameters.
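
A toy sketch of how T and B drive one step of the recursion (the payload shapes are invented for illustration, and Fingerprint is the XOR fold from the earlier sketch):

```go
package sketch

const (
	T = 8 // item-set threshold
	B = 4 // partitioning count
)

// reconcileStep decides what to send for one range of locally held, sorted
// items: either the full item set (small range) or B sub-range fingerprints
// for the peer to compare against its own.
func reconcileStep(items [][32]byte) (itemSet [][32]byte, fingerprints [][32]byte) {
	if len(items) <= T {
		return items, nil // small enough: send all items
	}
	step := (len(items) + B - 1) / B // ceil(len/B) items per sub-range
	for start := 0; start < len(items); start += step {
		end := start + step
		if end > len(items) {
			end = len(items)
		}
		fingerprints = append(fingerprints, Fingerprint(items[start:end]))
	}
	return nil, fingerprints
}
```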

Code Architecture

  • If ranges can be processed independently we can design a parallel pipeline (see the sketch below):
    IO <-> Sync engine <-> Range processing <-> Fingerprinting <-> Storage
  • We may need frame size limits to efficiently implement or optimize this pipeline.
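
A very rough sketch of that pipeline idea using channels; the stage and type names are invented for illustration:

```go
package sketch

// Range is a slice of the totally ordered set to be fingerprinted.
type Range struct {
	Start, End int64 // timestamp bounds
}

// RangeResult pairs a range with its computed fingerprint.
type RangeResult struct {
	Range Range
	FP    [32]byte
}

// fingerprintStage consumes ranges and emits their fingerprints concurrently
// with whatever stage feeds it, so IO, range processing and fingerprinting
// can overlap.
func fingerprintStage(in <-chan Range, fingerprint func(Range) [32]byte) <-chan RangeResult {
	out := make(chan RangeResult)
	go func() {
		defer close(out)
		for r := range in {
			out <- RangeResult{Range: r, FP: fingerprint(r)}
		}
	}()
	return out
}
```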

Optimization

Possible optimizations are described in the research paper. I note the ones that fit our use case here for completeness.

Range encoding

We will delta-encode range bounds like Negentropy does and truncate hashes the same way; this reduces bandwidth with minimal downside.
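
For illustration only (varint framing and the exact truncation rule are out of scope, so this is not the planned wire format):

```go
package sketch

// deltaEncode turns sorted range-bound timestamps into differences from the
// previous bound, which are smaller numbers and cheaper to encode.
func deltaEncode(bounds []int64) []int64 {
	out := make([]int64, len(bounds))
	prev := int64(0)
	for i, b := range bounds {
		out[i] = b - prev
		prev = b
	}
	return out
}

// truncate keeps only the first n bytes of each hash sent in an item set.
func truncate(hashes [][32]byte, n int) [][]byte {
	out := make([][]byte, 0, len(hashes))
	for _, h := range hashes {
		out = append(out, append([]byte(nil), h[:n]...))
	}
	return out
}
```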

Range "Path" reuse

When computing multiple ranges, if they are sorted, the upper bound of one may match the lower bound of the next. In that case, a tree-based storage can reuse each traversal "path" when computing fingerprints.

Exotic Partitioning

Partitioning does not have to produce equal ranges. We could design a partitioning scheme that increases the size of ranges for older elements, on the assumption that it is unlikely that messages would still be missing after all this time.
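
One possible shape for such a scheme, as a sketch only: the sub-range width doubles going back in time, so recent history is partitioned finely and older history coarsely.

```go
package sketch

import "time"

// partition returns range boundaries from newest to oldest: the newest
// sub-range spans `base`, and each older sub-range is twice as wide as the
// previous one.
func partition(oldest, newest time.Time, base time.Duration) []time.Time {
	bounds := []time.Time{newest}
	width := base
	for cursor := newest; cursor.After(oldest); {
		cursor = cursor.Add(-width)
		if cursor.Before(oldest) {
			cursor = oldest
		}
		bounds = append(bounds, cursor)
		width *= 2
	}
	return bounds
}
```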

Subset Checks

When receiving a range fingerprint that does not match, a subset of the items might still match. With XOR we can find such a subset in O(n³), avoiding further round trips at the cost of more computation.
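
One way to implement such a check is Gaussian elimination over GF(2): a subset with a matching XOR exists exactly when the fingerprint difference lies in the GF(2) span of our hashes. Below is a compact incremental form (an "XOR basis"), with hashes truncated to uint64 purely to keep the sketch short; recovering which subset matches would need extra bookkeeping.

```go
package sketch

import "math/bits"

// subsetExists reports whether some subset of vals XORs to target, i.e.
// whether target lies in the GF(2) linear span of vals.
func subsetExists(vals []uint64, target uint64) bool {
	var basis [64]uint64 // basis[i] holds a vector whose highest set bit is i
	for _, v := range vals {
		for v != 0 {
			i := 63 - bits.LeadingZeros64(v) // index of v's highest set bit
			if basis[i] == 0 {
				basis[i] = v
				break
			}
			v ^= basis[i]
		}
	}
	// target is representable iff it reduces to zero against the basis.
	for target != 0 {
		i := 63 - bits.LeadingZeros64(target)
		if basis[i] == 0 {
			return false
		}
		target ^= basis[i]
	}
	return true
}
```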


SionoiS commented Sep 24, 2024

Roadmap

Let's project ourselves into the future, shall we?

Waku Sync 2.0

The first step would be to replace 1.0 with the new version. The core protocol should be mostly final at this stage, but only basic storage mechanisms and optimizations would be implemented.

A big part of this work will be engineering the code base so as to design the proper abstractions and also finalize the protocol design:

  • storage interface
  • fingerprinting function
  • range processing
  • requests handling
  • encoding/decoding

Light Protocols Replacement

Light push and filter could optionally be replaced by sync. The first step is to try different storage and range-choice strategies for nodes with limited compute and memory.

We may decide to do a JS implementation at this stage if the Nim experiments are conclusive.

Another possibility is to only augment the current light protocols with sync, for reliability.

Store Replacement

Store could be replaced by sync with a proper storage implementation. At this stage ranges would become queries that could include content and/or pubsub topics or other parameters. This step would require protocol changes and a good amount of engineering effort to implement k-dimensional trees for storage.

Relay Replacement

Sync can in theory replace GossipSub, but replacing a battle-tested and mature protocol is a big task. We would have to study attack vectors, run experiments to measure latency, and we may even have to design and implement a scoring system similar to GossipSub's to gain enough performance.

SionoiS changed the title from Waku Sync v2 to Waku Sync 2.0 on Sep 25, 2024
jm-clius (Contributor) commented:

Thanks for this detailed post! I also like how well the Roadmap lays out what I think would be the steps needed to get as much value from Sync as soon as possible. Some comments/questions:

 stateless syncing sessions

Could this be a disadvantage when we have multiple diversified sync sessions? In other words, "stateless" seems to me to imply that we dynamically create and destroy state for each sync session - for each sync peer we fingerprint a unique sync range matching each sync request; for each light peer we apply a content topic filter and create + fingerprint a sync range over the filtered messages. We've spoken about this before, but I still wonder if the resulting overhead may be the major unknown to test in a real environment. Perhaps one of the first deliverables could be to build a basic PoC that proves that the overhead due to dynamic "state" management is reasonable.

The ability to generate membership proofs could also be added

You mean proving that a specific message hash is kept by the store node? Do we have rough idea how this might work, say if we need a consensus mechanism for messages (not that we necessarily should)? For Prolly trees, proving that you know the root and a path to the root should be enough. Here, however, it seems that ranges are dynamic by definition and it will be difficult to build a structure suitable for consensus?

Tree structures will be needed when linearly iterating to compute fingerprints becomes too inefficient.

I would perhaps need more background to understand this point about trees. Do you mean that dynamically computing the fingerprint over every sync range might become too expensive, so we may have to impose a more rigid hierarchical structure in the storage backend, where subranges can be "statically" fingerprinted? In such a model, only the affected leaves' and branches' fingerprints will need updating after an insert. However, if this understanding is correct, this seems very similar to the Prolly Tree structure?

can curate a high quality list of peers

Agree that we'll need at least a basic peer management mechanism here, perhaps descoring misbehaving sync peers according to a simple heuristic. We'll need to limit the number of peers we initiate sync with and that we allow to sync with us to maintain scalability (think of GRAFTs in gossipsub being based on mutual consensus). It would be nice to have Sync behaviour eventually feed into gossipsub scoring, so that also on a Relay level we encourage stronger connectivity to peers that prove themselves to be more useful in terms of message delivery.

One more general question:

Do you think there are any other major risks/unknowns that could be showstoppers here? To restate one of my concerns above, I'm wondering if it's possible that the overhead for state management could become impractical. This may force us to consider other data structures. If such unknowns exist, could we derisk them by building a PoC that targets these questions first?


SionoiS commented Sep 25, 2024

Could this be a disadvantage when we have multiple diversified sync sessions? In other words, "stateless" seems to me to imply that we dynamically create and destroy state for each sync session - for each sync peer we fingerprint a unique sync range matching each sync request; for each light peer we apply a content topic filter and create + fingerprint a sync range over the filtered messages. We've spoken about this before, but I still wonder if the resulting overhead may be the major unknown to test in a real environment. Perhaps one of the first deliverables could be to build a basic PoC that proves that the overhead due to dynamic "state" management is reasonable.

Only the storage has state; requests only read from storage, they do not write, but the storage must also be updated with new messages. I'll build a PoC to explore this area.

You mean proving that a specific message hash is kept by the store node? Do we have rough idea how this might work, say if we need a consensus mechanism for messages (not that we necessarily should)? For Prolly trees, proving that you know the root and a path to the root should be enough. Here, however, it seems that ranges are dynamic by definition and it will be difficult to build a structure suitable for consensus?

A Merkle tree could be kept for membership proofs, either as part of the storage or separate, like an add-on. We could even sync Prolly trees if needed.

I would perhaps need more background to understand this point about trees. Do you mean that dynamically computing the fingerprint over every sync range might become too expensive, so we may have to impose a more rigid hierarchical structure in the storage backend, where subranges can be "statically" fingerprinted? In such a model, only the affected leaves' and branches' fingerprints will need updating after an insert. However, if this understanding is correct, this seems very similar to the Prolly Tree structure?

Similar, yes; they are both trees, but Prolly trees are probabilistic and their leaves have no limit on the number of elements. Our tree would simply have fixed-size leaves, which is easier to manage.

Agree that we'll need at least a basic peer management mechanism here, perhaps descoring misbehaving sync peers according to a simple heuristic. We'll need to limit the number of peers we initiate sync with and that we allow to sync with us to maintain scalability (think of GRAFTs in gossipsub being based on mutual consensus). It would be nice to have Sync behaviour eventually feed into gossipsub scoring, so that also on a Relay level we encourage stronger connectivity to peers that prove themselves to be more useful in terms of message delivery.

💯

Do you think there are any other major risks/unknowns that could be showstoppers here? To restate one of my concerns above, I'm wondering if it's possible that the overhead for state management could become impractical. This may force us to consider other data structures. If such unknowns exist, could we derisk them by building a PoC that targets these questions first?

Of course there's a limit on how much range processing a node can do per second. It's hard to say what that number is without building something. How far we can optimize a highly concurrent tree structure is the engineering question we need to answer. As for other risks, I can't think of anything else; only unknown unknowns remain.


SionoiS commented Sep 26, 2024

With the simplest implementation on my computer and a 100k-element cache I can:

  • fingerprint all elements in ~2-3 ms
  • do a random insert in 1.5 ms (I tried the average of 1k random inserts and it's more efficient, so 🤷)
  • trim the storage in negligible time

With a 10M-element cache:

  • 230 ms to fingerprint all elements
  • an average of 12 ms for 1k random inserts
  • 21 ms to trim the lower half of the cache

chaitanyaprem commented:

Thanks for the detailed write-up!

A few questions/points that come to mind:

  1. Can this new Sync protocol be used to do an initial sync? I.e. a new store node that joins an existing network will have to sync all the previously persisted data, based on the network config, in order to actually provide the store service. If so, I would like to know how that may work.
  2. Apart from the pruning delay (of 10 minutes for 10k entries), what other inefficiencies have been noticed w.r.t. Sync 1.0, which is based on Negentropy? It would help to get an idea and compare the new Sync 2.0 with the existing Sync if we have this list.
  3. In case a running store node goes offline for a few hours and comes back online, can we use this new Sync protocol to sync with existing nodes? This is similar to point 1 above, but I'm wondering if verifying the RLN proof of each message in this case might not be optimal and we may have to think of an alternative way to assert that the messages being synced are valid.

no buffer or batching will be done so that messages are integrated into the set of the recipient node as fast as possible

Shouldn't this be based on the use case of the user of the sync protocol? In some scenarios batching may make this more efficient because latency may not be an issue for the user. This may be more of an implementation detail, but leaving it as a configurable option for users may benefit more use cases.

Will give it another read, diving into the technical details as well, and come back with more questions.


SionoiS commented Sep 30, 2024

1. Can this new Sync protocol be used to do an initial sync? i.e a new store node that joins an existing network will have to sync all the previous persisted data based on network config in order to actually provide store service. If so, would like to know how that may work?

It can be used this way, but are nodes going to accept sending that many messages? I would not want my node to do that, except if I'm syncing with my own nodes. What I'm thinking is that this use case might require special trust assumptions.

2. Apart from pruning delay(of 10 minutes for 10K entries), what other inefficiencies have been noticed wrt Sync-1.0 which is based on negentropy? It would help get an idea and compare the new Sync-2.0 with existing Sync if we have this list.

Good idea, I will measure and make a table.

3. In case a running store node goes offline for few hours and comes back online, can we use this new Sync protocol to sync with existing nodes? This is similar to point-1 above, but wondering if verifying RLN proof of each message in this case might not be optimal and we may have to think of an alternative way to assert messages being synced are valid.

I'm not sure what other ways we have. I'm open to suggestions.

shouldn't this be based on the use-case of the user of sync protocol? what if in some scenarios batching may make this more efficient because latency may not be an issue for the user. this may be more of an implementation detail, but leaving this as a configurable option for users may benefit more use-cases.

We could add more options to the transfer protocol such as batching and ACKs.


chaitanyaprem commented Sep 30, 2024

  1. In case a running store node goes offline for few hours and comes back online, can we use this new Sync protocol to sync with existing nodes? This is similar to point-1 above, but wondering if verifying RLN proof of each message in this case might not be optimal and we may have to think of an alternative way to assert messages being synced are valid.

I'm not sure what other ways we have. I'm open to suggestion.

I remember initiating a discussion with @rymnc about this topic when we were working on store-sync a few months ago. It involves generating some sort of recursive aggregate proofs; maybe this can be an approach.
ref discussion https://discord.com/channels/1110799176264056863/1219876846209077313


SionoiS commented Sep 30, 2024

I remember initiating a discussion with @rymnc about this topic when we were working on store-sync a few months ago. It involves generating some sort of recursive proofs; maybe this can be an approach. Ref discussion: https://discord.com/channels/1110799176264056863/1219876846209077313

This assumes a flow where we wait for the whole set diff to be known. Not a problem, just a bit different from what I had in mind, where diffs are sent as soon as they are found.

For DoS protection, would it not be better if the requester did the most work? Proof generation is expensive and verification is not (usually?), but for batches the proof generator is the sender.

We can simply disconnect from nodes that send invalid proofs, as we know they are malicious.

I feel like in the case where batching would be better, latency is not a concern (then why batch?) but it is a concern when trying to speed up consistency.


chaitanyaprem commented Oct 1, 2024

This assumes a flow where we wait for the whole set diff to be known, not a problem just a bit different than what I had in mind where diff found are sent right away.

what do you mean by whole set diff? same logic can be applied to be done in batches as well right once we identify a batch that is being synced proof gets generated for it. maybe i am missing something.

For DOS protection, would it not be better if the requester do the most work? Proof generation is expensive and verification not (usually?), but for batches the proof generator is the sender.

Valid point, but here the requester is being protected from being DoS'ed by the sender with which they are syncing. So it makes sense that proof generation is costly, so that spam is not sent to the requester.
DoS protection from requesters can be a simple per-protocol rate limit, I guess.

We can simply disconnect from nodes that sends invalid proofs as we know they are malicious.

True, but it would be costly to verify all the old proofs, wouldn't it? The requester would have to store many Merkle roots going back in time for this to happen.

I feel like in the case where batching would be better, latency is not a concern (then why batch?) but it is a concern when trying to speed up consistency.

Yep, agreed.


SionoiS commented Oct 1, 2024

what do you mean by whole set diff? same logic can be applied to be done in batches as well right once we identify a batch that is being synced proof gets generated for it. maybe i am missing something.

I don't understand. Do you mean generating proofs for messages in a range?

In my mind, batches would contain either a subset of or the whole set's differences. The messages and the proof would be sent to the other node to verify.

valid point, but here requester is getting protected from being DOS'ed by the sender with which they are syncing. so, this makes sense that proof geenration is costly so that spam is not sent to the requester.
DOS protection from requester's can be a simple per protocol rate-limit itself i guess.

I would have to think more deeply about who is more vulnerable in this scenario. On one side, long syncs like this are costly to do, but on the other hand you can also get spammed with fake messages.

My instinct would be to always have the initiator of any computation pay the highest cost.

true, but this would be costly in nature to verify all old proofs isn't it? requester will have to store many merkle roots back in time for this to happen.

Yes, that is a cost, but with 32-byte proofs we can store a lot of them before it becomes a problem.


SionoiS commented Oct 1, 2024

  1. Apart from pruning delay(of 10 minutes for 10K entries), what other inefficiencies have been noticed wrt Sync-1.0 which is based on negentropy? It would help get an idea and compare the new Sync-2.0 with existing Sync if we have this list.

Good idea, I will measure and make a table.

I can't really measure the Negentropy internals, it seems. I'll try to compare them once I have a working PoC of 2.0.


chaitanyaprem commented Oct 3, 2024

what do you mean by whole set diff? same logic can be applied to be done in batches as well right once we identify a batch that is being synced proof gets generated for it. maybe i am missing something.

I don't understand. Do you mean generating proofs for messages in a range?

In my mind, batches would contain either a sub-set or the whole sets differences. Messages and the proof would be sent to the other node to verify.

Sorry, looks like my sentence was quite jumbled.
What I meant was: let's say nodeA is syncing with nodeB and finds a set diff of n msgs, which are then synced over in m batches. In this case the proof can be generated and verified for each batch that is being synced, rather than for the whole set diff.

My instinct would be to always have the initiator of any computation pay the highest cost.

Agreed, but maybe that would be taken care of by incentivization anyway, rather than being built into the protocol?

Yes that is a cost but with 32 bytes proofs we can store a lot of them before it becomes a problems.

True, it may not be a lot of storage, and anyway this would be part of some persistent store.


rymnc commented Oct 3, 2024

Hi, just my 2c on this: if both peers know each other, then you may not need to use RLN for rate limiting at all; perhaps PoW might be more beneficial.
