feat(lightPush): improve peer usage and improve readability #2155

Merged

@weboko merged 12 commits into master from weboko/light-push on Oct 4, 2024

@weboko (Collaborator) commented Oct 2, 2024

Problem

This PR addresses a couple of problems.

LightPush usage of peers

LightPush currently uses the renewal mechanics from BaseProtocol, consuming peers from a selected pool.
But as the status-desktop approach proved, we don't need any hassle with managing peers or connections for LightPush: it is sufficient to use any peer that is connected at the moment.

Another problem that exists right now (partially solved in #2145) is a large number of errors reported as "No stream available". After investigating extensively, I found what actually happens: when we push to a set of peers from the connectedPeers pool maintained by BaseProtocolSDK, and connectedPeers is out of sync with the actual libp2p connections, stream creation fails and that error is returned.

const sendPromises = this.connectedPeers.map((peer) =>
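
A minimal sketch of the alternative: read open connections from libp2p at send time and keep only peers that advertise the LightPush codec, so a peer whose connection is already gone never reaches stream creation. `ConnectionLike`, `PeerStoreLike` and `Libp2pLike` are hypothetical structural stand-ins loosely modelled on libp2p's `getConnections()` and `peerStore.get()`, and `getConnectedPeersForCodec` is an illustrative name; this is not the code merged in this PR.

```typescript
// Hypothetical stand-ins, loosely modelled on libp2p's Connection / PeerStore.
// Real libp2p uses PeerId objects rather than plain strings.
interface ConnectionLike {
  remotePeer: string;
  status: "open" | "closing" | "closed";
}

interface PeerStoreLike {
  get(peerId: string): Promise<{ protocols: string[] }>;
}

interface Libp2pLike {
  getConnections(): ConnectionLike[];
  peerStore: PeerStoreLike;
}

/**
 * Returns ids of peers that have an open connection right now AND advertise
 * `codec`. Nothing is cached, so the result cannot drift out of sync with
 * the live connection set.
 */
async function getConnectedPeersForCodec(
  codec: string,
  libp2p: Libp2pLike
): Promise<string[]> {
  // Deduplicate: there may be several connections to the same peer.
  const openPeerIds = Array.from(
    new Set(
      libp2p
        .getConnections()
        .filter((conn) => conn.status === "open")
        .map((conn) => conn.remotePeer)
    )
  );

  const result: string[] = [];
  for (const peerId of openPeerIds) {
    try {
      const peer = await libp2p.peerStore.get(peerId);
      if (peer.protocols.includes(codec)) result.push(peerId);
    } catch {
      // Peer missing from the store: skip it instead of failing the whole lookup.
    }
  }
  return result;
}
```

Because it only needs a codec and a connection source, a helper of this shape does not depend on BaseProtocol internals.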

waitForRemotePeer improvements

I noticed this while working on the change and included it in the same PR because it was difficult to decouple.
In waitForRemotePeer we were spawning multiple promises for MetadataService when some connections were already present. This can be done in a single place, so it was moved into waitForMetadata.
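
A rough sketch of the consolidation, assuming the only job is to confirm metadata with at least one already-connected peer: a single hypothetical `waitForMetadata(connectedPeers, queryMetadata)` queries each peer once and resolves on the first success, rather than each caller spawning its own promises. The `queryMetadata` callback and string peer ids are assumptions; the real helper also has to handle peers that connect later, which this sketch omits.

```typescript
/** Hypothetical metadata query: resolves on success, rejects if the peer fails. */
type MetadataQuery = (peerId: string) => Promise<void>;

/**
 * Resolves with the first peer whose metadata query succeeds. Rejects only
 * after every query has failed, or immediately if there are no peers.
 * A hand-rolled equivalent of Promise.any, so no ES2021 lib is required.
 */
function waitForMetadata(
  connectedPeers: string[],
  queryMetadata: MetadataQuery
): Promise<string> {
  return new Promise<string>((resolve, reject) => {
    if (connectedPeers.length === 0) {
      reject(new Error("No connected peers to query for metadata"));
      return;
    }
    let failures = 0;
    for (const peerId of connectedPeers) {
      queryMetadata(peerId).then(
        () => resolve(peerId), // later successes are ignored: a promise settles once
        () => {
          failures += 1;
          if (failures === connectedPeers.length) {
            reject(new Error("Metadata query failed for all connected peers"));
          }
        }
      );
    }
  });
}
```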

Some debt and refactoring

In neighboring areas I reshaped the code a bit to reduce nesting and improve readability.
Also, LightPushSDK was renamed to LightPush, as we discussed a while ago.

Solution

Implement a very straightforward mechanism for LightPush to get the peers it uses.
I also kept the call to this.reliabilityMonitor.attemptRetriesOrRenew in case a peer that was used fails and needs to be renewed or dropped.
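
A minimal sketch of the mechanism described above, assuming hypothetical `sendToPeer` and `attemptRetriesOrRenew` callbacks (the latter stands in for `this.reliabilityMonitor.attemptRetriesOrRenew`, whose real signature may differ): push to every currently connected peer and, for each failure, start a retry or renewal in the background without blocking the caller.

```typescript
interface SendResult {
  successes: string[];
  failures: { peerId: string; error: string }[];
}

// Hypothetical stand-in for a LightPush send to a single peer.
type SendToPeer = (peerId: string) => Promise<void>;
// Hypothetical stand-in for reliabilityMonitor.attemptRetriesOrRenew(peerId, retryFn).
type AttemptRetriesOrRenew = (peerId: string, retry: () => Promise<void>) => Promise<void>;

async function pushToConnectedPeers(
  connectedPeers: string[],
  sendToPeer: SendToPeer,
  attemptRetriesOrRenew: AttemptRetriesOrRenew
): Promise<SendResult> {
  const result: SendResult = { successes: [], failures: [] };

  // Send to all peers in parallel; turn each rejection into a value.
  const outcomes = await Promise.all(
    connectedPeers.map((peerId) =>
      sendToPeer(peerId).then(
        () => ({ peerId, error: undefined as string | undefined }),
        (err: unknown) => ({ peerId, error: String(err) })
      )
    )
  );

  for (const { peerId, error } of outcomes) {
    if (error === undefined) {
      result.successes.push(peerId);
      continue;
    }
    result.failures.push({ peerId, error });
    // Fire-and-forget: the caller gets its result now, while retries or
    // renewal of the failing peer happen in the background.
    void attemptRetriesOrRenew(peerId, () => sendToPeer(peerId)).catch(() => undefined);
  }
  return result;
}
```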

Notes

@weboko requested a review from a team as a code owner October 2, 2024 20:24

@weboko (Collaborator Author) commented Oct 2, 2024

@waku-org/js-waku I didn't include unit tests because that would make the PR too large: it requires integrating test runners into the sdk package. I will follow up with tests for the changes in this PR.

github-actions bot commented Oct 2, 2024

size-limit report 📦

| Path | Size | Loading time (3g) | Running time (snapdragon) | Total time |
| --- | --- | --- | --- | --- |
| Waku node | 83.62 KB (+0.01% 🔺) | 1.7 s (+0.01% 🔺) | 317 ms (+117.15% 🔺) | 2 s |
| Waku Simple Light Node | 135.49 KB (+0.17% 🔺) | 2.8 s (+0.17% 🔺) | 231 ms (+8.74% 🔺) | 3 s |
| ECIES encryption | 22.94 KB (0%) | 459 ms (0%) | 167 ms (+99.36% 🔺) | 626 ms |
| Symmetric encryption | 22.39 KB (0%) | 448 ms (0%) | 244 ms (+323.71% 🔺) | 692 ms |
| DNS discovery | 72.28 KB (0%) | 1.5 s (0%) | 258 ms (+30.21% 🔺) | 1.8 s |
| Peer Exchange discovery | 73.88 KB (+0.05% 🔺) | 1.5 s (+0.05% 🔺) | 197 ms (+34.05% 🔺) | 1.7 s |
| Local Peer Cache Discovery | 67.63 KB (0%) | 1.4 s (0%) | 235 ms (+48.33% 🔺) | 1.6 s |
| Privacy preserving protocols | 74.82 KB (+0.05% 🔺) | 1.5 s (+0.05% 🔺) | 254 ms (+70.58% 🔺) | 1.8 s |
| Waku Filter | 78.67 KB (-0.03% 🔽) | 1.6 s (-0.03% 🔽) | 266 ms (+64.61% 🔺) | 1.9 s |
| Waku LightPush | 76.89 KB (+0.01% 🔺) | 1.6 s (+0.01% 🔺) | 170 ms (+17.58% 🔺) | 1.8 s |
| History retrieval protocols | 75.97 KB (-0.09% 🔽) | 1.6 s (-0.09% 🔽) | 263 ms (+56.15% 🔺) | 1.8 s |
| Deterministic Message Hashing | 7.38 KB (0%) | 148 ms (0%) | 50 ms (+93.25% 🔺) | 198 ms |

packages/interfaces/src/light_push.ts
Comment on lines +108 to +110
void this.reliabilityMonitor.attemptRetriesOrRenew(
connectedPeer.id,
() => this.protocol.send(encoder, message, connectedPeer)
Collaborator

Did we test this?
It seems we no longer rely on PeerManager to retrieve the peers used by the protocols: we moved from hasPeers(), which relies on PeerManager, to getConnectedPeers(), which returns all available connections. I'm curious how renewing peers would affect peer management. Wdyt?

Collaborator Author

This logic was already there, so I assume it is covered by existing tests.

As for connections: yes, we move away, but not from hasPeers; rather from BaseProtocolSDK.connectedPeers, which proved to be out of sync quite often.

The reasons for this are:

  • to simplify the process for LightPush
  • to align with the status-go usage of LightPush, which proved to be reliable

Collaborator

Yes, I believe hasPeers would work with the tests. I'm considering whether we'd need to double-check against getConnectedPeers, especially since we do renewals and so on.

Collaborator

BaseProtocolSDK.connectedPeers() was being inconsistent because of race conditions in shared peer management, which isn't the case with #2137

packages/sdk/src/protocols/light_push/light_push.ts (outdated)
Comment on lines +97 to +98
codec: string,
libp2p: Libp2p
Collaborator

why change this?

Collaborator Author

So that we decouple protocol specifics and operate on connections and codecs; this way we can change BaseProtocol without changing this code.

@danisharora099 (Collaborator) left a comment

Also makes me think: if we are indeed going in this direction, where we don't do peer management for LightPush, that leaves Filter as the only protocol doing it, since Store also doesn't rely on peer management.

Couple of questions:

  • Should ReliabilityMonitor be rethought? Do we still need to do any renewals if LightPush fails for a node? (With the improvements in feat(filter): enhancing protocol peer management with mutex locks #2137, renewal simply removes the peer instead of disconnecting from it.) This matters because, especially with LightPush, RLN limits are often hit, which signifies that there is no need to disconnect from the node; it simply shouldn't be actively used for a while and can be rotated back in later. Perhaps round-robin between all available LP nodes could also make sense here.
  • How does this affect peer management in general, if only one protocol is using it?
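
The round-robin idea from the first bullet could look roughly like this: a hypothetical selector that cycles through whichever LightPush peers are available at call time, so load (and RLN rate-limit pressure) is spread across nodes instead of always hitting the same one. `RoundRobinSelector` is an illustrative name only; nothing like it was added in this PR.

```typescript
/**
 * Hypothetical round-robin picker. The peer list is passed on every call,
 * so peers that connect or disconnect between calls are handled naturally.
 */
class RoundRobinSelector {
  private cursor = 0;

  next(availablePeers: readonly string[]): string | undefined {
    if (availablePeers.length === 0) return undefined;
    // Modulo keeps the index valid even if the list shrank since the last call.
    const index = this.cursor % availablePeers.length;
    this.cursor = index + 1;
    return availablePeers[index];
  }
}
```

Passing the list on every call keeps the selector independent of how peers are tracked, which fits the "use whatever is connected" approach of this PR.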

@weboko (Collaborator Author) commented Oct 3, 2024

Should ReliabilityMonitor be re-thought

I think so too, but I didn't make any changes there so as not to grow this PR further.

Do we still need to do any renewals if LightPush fails for a node?

As you mentioned, there are still things that can make LightPush fail, in which case we should renew the peer. This aligns with the spec. One important thing: if we renew because of LightPush, we should take into account whether the peer being renewed is also used by Filter.

How does this affect peer management in general, if only one protocol is using it?

We still use it in LightPush: if a node is renewed, the connection to it must be dropped, so a new LightPush attempt won't use it. There is room for improvement regarding injection of peers, but I deliberately didn't add it, as it would complicate things with #2137.

@weboko merged commit 1d68526 into master Oct 4, 2024
10 of 11 checks passed
@weboko deleted the weboko/light-push branch October 4, 2024 08:51
@weboko mentioned this pull request Oct 4, 2024
@danisharora099 (Collaborator) commented Oct 4, 2024

One important thing is that if we renew because of LightPush we should be considerate if a peer that is going to be renewed is used by Filter.

Do you have any ideas as to how we can tackle that?

As you mentioned, there are still things that can make LightPush fail and we should renew it. It aligns with the spec.

What would renewal mean in this case? Do we disconnect from the peer, or do we just not use it for the time being (considering RLN limits are hit and will reset)?
I guess it would be the former, since there is no persistent peer management for LightPush anymore and thus we can't control which peers are used. I would ideally prefer the latter.
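
A sketch of the latter option, under stated assumptions: instead of disconnecting, a failed LightPush peer (for example one that hit its RLN rate limit) is put on a temporary cooldown and skipped until it expires. `PeerCooldown`, `handleLightPushFailure`, the "soft"/"hard" split and the `isUsedByFilter` hook are all hypothetical, the last one reflecting the concern above about renewing a peer Filter still relies on. The clock is injected so behaviour is deterministic; none of these names exist in js-waku.

```typescript
type Clock = () => number; // milliseconds

class PeerCooldown {
  private readonly until = new Map<string, number>();
  private readonly cooldownMs: number;
  private readonly now: Clock;

  constructor(cooldownMs: number, now: Clock = () => Date.now()) {
    this.cooldownMs = cooldownMs;
    this.now = now;
  }

  /** Mark a peer as temporarily unusable, e.g. after an RLN rate-limit rejection. */
  markFailed(peerId: string): void {
    this.until.set(peerId, this.now() + this.cooldownMs);
  }

  /** Drop peers still cooling down; expired entries are cleared lazily. */
  usable(peers: readonly string[]): string[] {
    const t = this.now();
    return peers.filter((peerId) => {
      const expiry = this.until.get(peerId);
      if (expiry === undefined) return true;
      if (expiry <= t) {
        this.until.delete(peerId);
        return true;
      }
      return false;
    });
  }
}

/**
 * Hypothetical policy: soft failures only cool the peer down; hard failures
 * drop the connection unless Filter still depends on that peer.
 */
function handleLightPushFailure(
  peerId: string,
  kind: "soft" | "hard",
  cooldown: PeerCooldown,
  isUsedByFilter: (peerId: string) => boolean,
  disconnect: (peerId: string) => void
): "cooldown" | "disconnected" {
  cooldown.markFailed(peerId);
  if (kind === "soft" || isUsedByFilter(peerId)) return "cooldown";
  disconnect(peerId);
  return "disconnected";
}
```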

@@ -49,11 +49,6 @@ class LightPushSDK extends BaseProtocolSDK implements ILightPushSDK {
message: IMessage,
_options?: ProtocolUseOptions
Collaborator

options are not being accounted for in send anymore

Collaborator Author

Exactly! As I mentioned here, it will be a follow-up:

#2137 (comment)

@danisharora099 (Collaborator) commented Oct 4, 2024

@weboko
I think this breaks a few things when seen with #2137:

  • LightPushSDK.send() doesn't use/account for options anymore
    • This makes the peer renewal E2E test fail as it uses forceUseAllPeers, a property on options
  • PeerManager was abstracted away into an entity to be used by LightPush and Filter
    • With this change, where LightPush doesn't use any peer management, PeerManager is only used by Filter, which I'm not sure is the right design decision
  • HealthManager updates were moved to be updated from within PeerManager, but now with LightPush not using PeerManager, we'll have to change it to be done from within the Filter class, or one of its derivatives

I'm not certain how some of these tests passed on this PR while they now fail on #2137. I'm moving my PR to draft until we work through some of these design decisions and how to approach them. Let me know your thoughts.

@weboko (Collaborator Author) commented Oct 4, 2024

LightPushSDK.send() doesn't use/account for options anymore

Agreed, as mentioned here: #2137 (comment)

PeerManager was abstracted away into an entity to be used by LightPush and Filter

PeerManager doesn't exist yet, and compared to what was there before, direct access to connections proved to work better. Since you are creating PeerManager in your PR and it doesn't have the same problem (the state of populated peers going out of sync), we can add it back to LightPush in a follow-up PR.

From a design point of view, I think PeerManager is a needed entity and a good direction.

And I want to add that LightPush doesn't have the same need as Filter to keep particular peers. We can literally use any open LightPush connection and, if it fails, try another one. I don't think it should be more complex than this (and as proven in Status, it shouldn't).
Where these connections come from doesn't matter that much, I think: PeerManager or libp2p directly (with the clarification that if a connection is outright bad for LightPush, it shouldn't be used, or should be terminated right away, as we agreed in the spec).
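
The "use any open LightPush connection and, if it fails, try another one" approach reduces to a short fallback loop. This sketch assumes a hypothetical per-peer `trySend` callback; where the peer list comes from (PeerManager or libp2p directly) is deliberately left to the caller, matching the point above. `pushWithFallback` is an illustrative name, not js-waku API.

```typescript
type TrySend = (peerId: string) => Promise<void>;

interface FallbackResult {
  usedPeer?: string; // undefined when no peer accepted the message
  errors: { peerId: string; error: string }[];
}

/** Try peers one at a time and stop at the first successful push. */
async function pushWithFallback(
  peers: readonly string[],
  trySend: TrySend
): Promise<FallbackResult> {
  const errors: FallbackResult["errors"] = [];
  for (const peerId of peers) {
    try {
      await trySend(peerId);
      return { usedPeer: peerId, errors };
    } catch (err) {
      // Record the failure and move on to the next open connection.
      errors.push({ peerId, error: String(err) });
    }
  }
  return { errors };
}
```

Compared with sending to every connected peer in parallel, this sends one successful copy at the cost of higher latency when the first peers fail.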

@danisharora099 (Collaborator)

agree with that as mentioned here #2137 (comment)

do you want to follow up to address this?

I don't think it should be more complex than this (and as proven in Status - it shouldn't).

Ok cool, I agree

Development

Successfully merging this pull request may close these issues.

chore: make waitForRemotePeer part of the waku interface
3 participants