Sync store - How old messages can be requested for Sync #64

Open
ABresting opened this issue Nov 20, 2023 · 7 comments

@ABresting

In the Waku Store Sync protocol, a node that has been inactive or offline for some time can come back online and request the messages it missed during that period; this is the Sync mechanism. The foundational aspects of Sync store are covered in #62.

The following questions about message sync policy need answers:

  1. In general, a Waku Store node keeps the messages of the last 30 days and deletes anything older. But for a node that wants to Sync with its peers, how far back can it Sync?
  2. Should there be a time threshold for a message to be eligible to be part of a Sync request?
  3. Should we only have one type of Sync mechanism, based on a single time threshold?

@SionoiS

SionoiS commented Nov 20, 2023

I have a different perspective here:

> 1. In general, a Waku Store node keeps the messages of the last 30 days and deletes anything older. But for a node that wants to Sync with its peers, how far back can it Sync?

This "general" case will not be common at all and should not be designed for. The fleets we have right now are not a good starting point when thinking about these concepts.

> 2. Should there be a time threshold for a message to be eligible to be part of a Sync request?

That should not be up to us. We only design the tools, not how to use them.

> 3. Should we only have one type of Sync mechanism, based on a single time threshold?

Query to find the correct hashes, then ask for messages from that hash list.
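
A minimal sketch of that two-step flow (first compare hashes, then fetch only the missing messages), assuming a toy in-memory store. `StoreNode`, `query_hashes`, `get_messages` and `sync_from` are hypothetical names for illustration, not the Waku API:

```python
from typing import Dict, List, Set


class StoreNode:
    """Toy in-memory store keyed by message hash (hypothetical, not the Waku API)."""

    def __init__(self, messages: Dict[str, bytes]) -> None:
        self.messages = dict(messages)

    def query_hashes(self) -> Set[str]:
        # Step 1: expose only the hashes, which are cheap to transfer.
        return set(self.messages)

    def get_messages(self, hashes: List[str]) -> Dict[str, bytes]:
        # Step 2: return full messages for the requested hashes only.
        return {h: self.messages[h] for h in hashes if h in self.messages}


def sync_from(local: StoreNode, peer: StoreNode) -> List[str]:
    """Fetch from `peer` every message that `local` is missing."""
    missing = sorted(peer.query_hashes() - local.query_hashes())
    local.messages.update(peer.get_messages(missing))
    return missing
```

A real implementation would presumably restrict the hash query to a range instead of exchanging the full set; the set difference here only illustrates the "hashes first, messages second" split.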

@ABresting
Author

> I have a different perspective here:
>
> > 1. In general, a Waku Store node keeps the messages of the last 30 days and deletes anything older. But for a node that wants to Sync with its peers, how far back can it Sync?
>
> This "general" case will not be common at all and should not be designed for. The fleets we have right now are not a good starting point when thinking about these concepts.

That makes sense, thanks.

> > 2. Should there be a time threshold for a message to be eligible to be part of a Sync request?
>
> That should not be up to us. We only design the tools, not how to use them.

Yeah, totally. We need to make it configurable so that clients/apps can choose depending on their needs. But then we need to think about different solutions for different use cases: a Sync method might work for smaller data but not for bulk data.

> > 3. Should we only have one type of Sync mechanism, based on a single time threshold?
>
> Query to find the correct hashes, then ask for messages from that hash list.

Let me rephrase the question: depending on how far back we aim to Sync, the implementation may differ. Should we start thinking from that flexibility PoV?

@ABresting ABresting self-assigned this Nov 20, 2023
@SionoiS

SionoiS commented Nov 20, 2023

> Let me rephrase the question: depending on how far back we aim to Sync, the implementation may differ. Should we start thinking from that flexibility PoV?

When you say "older messages", what do you mean?

If you mean timestamp-wise, I don't think it matters. The protocol should not limit the range of queries. Implementations should limit requests and/or respond with multiple chunks of data, but that's a detail.
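
As a rough sketch of the "respond with multiple chunks" detail, assuming a simple offset cursor: the query range itself is unbounded, but each response is capped. `MAX_PAGE_SIZE`, `query_page` and `query_all` are hypothetical names, and the limit is an assumed implementation choice, not a protocol constant:

```python
from typing import List, Optional, Tuple

MAX_PAGE_SIZE = 100  # assumed implementation limit, not a protocol constant


def query_page(
    hashes: List[str], cursor: int = 0, page_size: int = MAX_PAGE_SIZE
) -> Tuple[List[str], Optional[int]]:
    """Return one chunk of results plus the next cursor, or None when done."""
    size = max(1, min(page_size, MAX_PAGE_SIZE))  # clamp the requested size
    page = hashes[cursor:cursor + size]
    next_cursor = cursor + len(page)
    return page, (next_cursor if next_cursor < len(hashes) else None)


def query_all(hashes: List[str], page_size: int = MAX_PAGE_SIZE) -> List[str]:
    """Client side: keep requesting chunks until the responder signals the end."""
    out: List[str] = []
    cursor: Optional[int] = 0
    while cursor is not None:
        page, cursor = query_page(hashes, cursor, page_size)
        out.extend(page)
    return out
```

The requester can ask for any range; the responder just never returns more than its own cap per response.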

If you mean older versions of messages, then I think we could support any version if we treat messages as data blobs. Only the indexes would be different. As long as we can hash a message deterministically, the number of indexes that point to a message can vary by version.
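
To illustrate what "hash a message deterministically" could look like, here is a sketch that assumes a message has only a content topic, a payload and a timestamp. The field set and encoding are assumptions for illustration and are not the actual Waku message hash definition:

```python
import hashlib
import struct


def message_hash(content_topic: str, payload: bytes, timestamp_ns: int) -> str:
    """Hash a message as an opaque blob with a fixed, unambiguous field encoding."""
    h = hashlib.sha256()
    for field in (content_topic.encode("utf-8"), payload):
        # Length-prefix each variable-size field so field boundaries cannot shift.
        h.update(struct.pack(">I", len(field)))
        h.update(field)
    h.update(struct.pack(">q", timestamp_ns))  # fixed-width signed 64-bit timestamp
    return h.hexdigest()
```

Any index (by time, by topic, by version) can then store this hash as its value, while the blob itself is stored once.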

> a Sync method might work for smaller data but not for bulk data.

Why would it not work?

@ABresting
Author

ABresting commented Nov 20, 2023

> > Let me rephrase the question: depending on how far back we aim to Sync, the implementation may differ. Should we start thinking from that flexibility PoV?
>
> When you say "older messages", what do you mean?
>
> If you mean timestamp-wise, I don't think it matters. The protocol should not limit the range of queries. Implementations should limit requests and/or respond with multiple chunks of data, but that's a detail.

Yeah, this one. Got it!

> If you mean older versions of messages, then I think we could support any version if we treat messages as data blobs. Only the indexes would be different. As long as we can hash a message deterministically, the number of indexes that point to a message can vary by version.
>
> > a Sync method might work for smaller data but not for bulk data.
>
> Why would it not work?

I mean it would work, but maybe not optimally, so we need to consider the tradeoffs. If we build a Prolly tree on top of all the messageHashes in the DB, it might become huge. For example, if 90% of Sync requests are for the last 30 minutes of data, why build a Prolly tree over all the data?

@SionoiS

SionoiS commented Nov 20, 2023

> If we build a Prolly tree on top of all the messageHashes in the DB, it might become huge.

Yes, the trees would be as big as the number of messages in the DB.

> If 90% of Sync requests are for the last 30 minutes of data, why build a Prolly tree over all the data?

Otherwise, how would you search for a specific message?

Prolly trees are very efficient for random reads and writes.

@ABresting
Author

> > If 90% of Sync requests are for the last 30 minutes of data, why build a Prolly tree over all the data?
>
> Otherwise, how would you search for a specific message?

How about having two trees, one with, say, the past hour of activity and the other with the rest? That way we can serve hot data faster, since it is more likely to be missed. We can also define priorities based on that, since hot data is what real-time messaging use cases will be interested in, for instance.

@SionoiS

SionoiS commented Nov 21, 2023

> How about having two trees, one with, say, the past hour of activity and the other with the rest? That way we can serve hot data faster, since it is more likely to be missed. We can also define priorities based on that, since hot data is what real-time messaging use cases will be interested in, for instance.

Prolly trees are ordered, so there is no need to split them. You would just iterate in reverse to search the latest messages.

The time index would be a tree with timestamps as keys and hashes as values.
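
A small sketch of that idea, with a sorted list standing in for the ordered tree (this is not a Prolly tree, only an ordered timestamp -> hash index; `TimeIndex` and `latest_since` are hypothetical names):

```python
import bisect
from typing import Iterator, List, Tuple


class TimeIndex:
    """Sorted list standing in for an ordered tree (timestamp -> message hash)."""

    def __init__(self) -> None:
        self._entries: List[Tuple[int, str]] = []

    def insert(self, timestamp: int, msg_hash: str) -> None:
        # Keep entries ordered by (timestamp, hash) on every insert.
        bisect.insort(self._entries, (timestamp, msg_hash))

    def latest_since(self, cutoff: int) -> Iterator[Tuple[int, str]]:
        """Yield entries newest-first, stopping at the first one older than `cutoff`."""
        for timestamp, msg_hash in reversed(self._entries):
            if timestamp < cutoff:
                break
            yield timestamp, msg_hash
```

Because the structure is ordered, reading the "hot" tail costs time proportional to the number of recent entries, regardless of how much older data sits in the same index.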
