bug: Future operation cancelled flooding logs #2780

Closed
AlbertoSoutullo opened this issue Jun 5, 2024 · 4 comments
Labels: bug (Something isn't working)


Problem

From DST, we are testing version v0.28, and we noticed a "regression" in how quickly our simulations become stable.

As a summary of the simulation setup: we use 45 nWaku nodes as bootstrap nodes, plus 50 normal nWaku nodes. Each normal nWaku node is given 3 --discv5-bootstrap-node entries chosen at random. They also use --relay-peer-exchange.
Before we inject traffic into the simulation, we make sure that every node is "healthy". We obtain this information from the libp2p_gossipsub_healthy_peers_topics metric, which returns the number of topics whose peer count is in the range d_low < number of peers < d_high (in our case, just 1 topic).
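For reference, this is roughly how we read that metric from a node's Prometheus endpoint; the node address is a placeholder and 8008 is assumed to be the metrics port (adjust if --metrics-server-port is set differently):

```bash
# Sketch: read how many topics a node currently reports as healthy.
# <node-ip> is a placeholder; 8008 is assumed to be the metrics port.
curl -s "http://<node-ip>:8008/metrics" \
  | grep '^libp2p_gossipsub_healthy_peers_topics' \
  | awk '{print $2}'
# We consider the node healthy when this value reaches 1 (our single topic).
```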

In version v0.27, everything is fine: the scenario above consistently reaches a healthy state in less than 1 minute.
In version v0.28, this behavior is very inconsistent: it can vary from a couple of minutes to more than 10 minutes (at which point we cancel the run).
The only difference between these 2 cases is the nWaku version; everything else is exactly the same.

What we have found so far is that the following line appears very frequently in the nodes that cannot reach a healthy state:
Dialing canceled topics="libp2p dialer" tid=7 file=dialer.nim:67 err="Future operation cancelled!"

Impact

It considerably slows down the setup time of the simulations for DST. Also, since this did not happen in the previous version, we assume there is unexpected behavior / a possible bug that would be worth looking into.

To reproduce

  1. Instantiate ~40 nWaku nodes that will act as bootstrap nodes.
  2. Use 3 ENRs from these nodes in another deployment of ~50 nWaku nodes.
  3. Look at the DEBUG logs for the line mentioned above.
  4. Watch how the libp2p_gossipsub_healthy_peers_topics metric varies (a rough sketch of steps 3 and 4 follows this list).
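A minimal shell sketch of steps 3 and 4, assuming the default metrics port 8008 and using placeholder names for the node address and log file:

```bash
# Placeholder: the node's address and the (assumed default) metrics port 8008.
NODE="<node-ip>:8008"

# Step 3: count how often the cancelled-dial line shows up in a node's DEBUG log
# (the log file name is a placeholder for wherever the deployment writes it).
grep -c 'err="Future operation cancelled!"' node_waku.log

# Step 4: poll the health metric until the node reports its (single) topic as healthy.
until [ "$(curl -s "http://${NODE}/metrics" \
        | awk '/^libp2p_gossipsub_healthy_peers_topics/ {print int($2)}')" -ge 1 ] 2>/dev/null; do
  sleep 10
done
```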

Expected behavior

No connections should get cancelled if there is no interruption.

Screenshots/logs

This log is an example of v0.28:
nodes-11_waku.log

nwaku version/commit hash

v0.28.0-rc.0

@AlbertoSoutullo AlbertoSoutullo added the bug Something isn't working label Jun 5, 2024
@Ivansete-status Ivansete-status self-assigned this Jun 6, 2024

AlbertoSoutullo (Author) commented Jun 7, 2024

Hello @Ivansete-status, I'm coming back with more news about this.

I just tested this again. Maybe this information helps:

  • This issue happens on v0.26 and v0.28.0-rc.0. It happens so often that we literally cannot even run simulations.
  • Version v0.27 runs perfectly fine, with no problems whatsoever; if this log line appears at all, it is barely noticeable.
  • In all v0.26, v0.27 and v0.28 simulations we used the same parameters (combined into a single launch command in the sketch after this list):
    • --relay-peer-exchange=true
    • --relay=true
    • --max-connections=500
    • --rest=true
    • --rest-admin=true
    • --rest-private=true
    • --rest-address=0.0.0.0
    • --discv5-discovery=true
    • --discv5-enr-auto-update=True
    • --log-level=DEBUG
    • --metrics-server=True
    • --metrics-server-address=0.0.0.0
    • --discv5-bootstrap-node=$ENR1
    • --discv5-bootstrap-node=$ENR2
    • --discv5-bootstrap-node=$ENR3
    • --nat=extip:${IP}
    • --pubsub-topic="/waku/2/default-waku/proto"
  • For bootstrap and midstrap nodes, the only change is --max-connections=2000. We increased this limit to allow peers to hold more connections, but it doesn't help. v0.27 works with --max-connections=50 on normal nodes and --max-connections=200 on bootstrap and midstrap nodes.
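For clarity, this is roughly the launch command those flags add up to for a normal node; the wakunode2 binary path and the $ENR/$IP variables are placeholders for whatever our deployment injects:

```bash
# Sketch of how a "normal" node is launched with the flags listed above.
# Binary path, $ENR1..$ENR3 and $IP are deployment-specific placeholders.
./wakunode2 \
  --relay=true \
  --relay-peer-exchange=true \
  --max-connections=500 \
  --rest=true \
  --rest-admin=true \
  --rest-private=true \
  --rest-address=0.0.0.0 \
  --discv5-discovery=true \
  --discv5-enr-auto-update=True \
  --log-level=DEBUG \
  --metrics-server=True \
  --metrics-server-address=0.0.0.0 \
  --discv5-bootstrap-node="$ENR1" \
  --discv5-bootstrap-node="$ENR2" \
  --discv5-bootstrap-node="$ENR3" \
  --nat="extip:${IP}" \
  --pubsub-topic="/waku/2/default-waku/proto"
```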

If you can provide more information about why this log line appears, it would be helpful.
Also, if you have ideas on how to test this further, let us know and we will run them.


Ivansete-status (Collaborator) commented Jun 7, 2024

...
If you can provide more information about why this log line appears, it would be helpful. Also, if you have ideas on how to test this further, let us know and we will run them.

Thanks for the update! We need to pinpoint the actual commit that caused this issue.
IMO the most relevant part is the different "stable time" across the cases.

AlbertoSoutullo (Author) commented:

After some debugging, thanks to @Ivansete-status, we found out that:

  • Using cluster-id with its default value, together with the also-default topic "/waku/2/default-waku/proto", was causing connectivity issues.
  • Ivan traced the issue back to this specific commit: 6e6cb29
  • Changing the cluster-id to 2 (not 1, to avoid RLN) and using "/waku/2/rs/2/0" as the pubsub topic fixed the issue (see the flag sketch after this list).
  • The root cause of the problem is still not clear.
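In terms of flags, the change that made the difference looks roughly like this; the rest of the launch command stays exactly as before, and --cluster-id is assumed to be the relevant option name:

```bash
# Before (problematic): default cluster-id with the default named topic.
--pubsub-topic="/waku/2/default-waku/proto"

# After (stable): dedicated cluster 2 with its static-sharding topic.
--cluster-id=2
--pubsub-topic="/waku/2/rs/2/0"
```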


gabrielmer (Contributor) commented Jun 20, 2024

After verifying with @AlbertoSoutullo, it seems that merging #2823 and #2824 fixed the issue 🥳

Closing it :))

The fixes increased the time to get a stable mesh, which makes sense as we used to establish connections in a very aggressive way.

We need to calibrate how many max-connections to use, and maybe which value to give to ConnectivityLoopInterval. From the simulations run so far, it seems that the most important factor is the max-connections value.
