bug: Future operation cancelled flooding logs #2780

Closed
AlbertoSoutullo opened this issue Jun 5, 2024 · 4 comments
Labels: bug (Something isn't working)


Problem

From DST, we are testing version v0.28, and we noticed a "regression" in how quickly our simulations become stable.

As a summary of the simulation setup: we use 45 nWaku nodes as bootstrap nodes, plus 50 normal nWaku nodes. Each normal nWaku node is given 3 --discv5-bootstrap-node entries chosen at random. They also use --relay-peer-exchange.
Before we inject traffic into the simulation, we make sure that every node is "healthy". We obtain this information from the libp2p_gossipsub_healthy_peers_topics metric, which returns the number of topics whose peer count is in the range d_low < number of peers < d_high (in our case, just 1 topic).
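For reference, this is roughly how we read that metric from a node's Prometheus endpoint; the node address is a placeholder and 8008 is assumed to be the metrics port (adjust if --metrics-server-port is set differently):

```bash
# Sketch: read how many topics a node currently reports as healthy.
# <node-ip> is a placeholder; 8008 is assumed to be the metrics port.
curl -s "http://<node-ip>:8008/metrics" \
  | grep '^libp2p_gossipsub_healthy_peers_topics' \
  | awk '{print $2}'
# We consider the node healthy when this value reaches 1 (our single topic).
```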

In version v0.27, everything is fine: the scenario above consistently reaches a healthy state in less than 1 minute.
In version v0.28, this behavior is very inconsistent: it can vary from a couple of minutes to more than 10 minutes (at which point we cancel the run).
The only difference between these 2 cases is the nWaku version; everything else is exactly the same.

What we have found so far is that the following line appears very frequently in the nodes that cannot reach a healthy state:
Dialing canceled topics="libp2p dialer" tid=7 file=dialer.nim:67 err="Future operation cancelled!"

Impact

It considerably slows down the setup time of the simulations for DST. Also, since this did not happen in the previous version, we assume there is unexpected behavior / a possible bug that would be worth looking into.

To reproduce

  1. Instantiate ~40 nWaku nodes that will act as bootstrap nodes.
  2. Use 3 ENRs from these nodes in another deployment of ~50 nWaku nodes.
  3. Look at the DEBUG logs for the line mentioned above.
  4. Watch how the libp2p_gossipsub_healthy_peers_topics metric varies (a rough sketch of steps 3 and 4 follows this list).
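A minimal shell sketch of steps 3 and 4, assuming the default metrics port 8008 and using placeholder names for the node address and log file:

```bash
# Placeholder: the node's address and the (assumed default) metrics port 8008.
NODE="<node-ip>:8008"

# Step 3: count how often the cancelled-dial line shows up in a node's DEBUG log
# (the log file name is a placeholder for wherever the deployment writes it).
grep -c 'err="Future operation cancelled!"' node_waku.log

# Step 4: poll the health metric until the node reports its (single) topic as healthy.
until [ "$(curl -s "http://${NODE}/metrics" \
        | awk '/^libp2p_gossipsub_healthy_peers_topics/ {print int($2)}')" -ge 1 ] 2>/dev/null; do
  sleep 10
done
```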

Expected behavior

No connections should get cancelled if there is no interruption.

Screenshots/logs

This log is an example of v0.28:
nodes-11_waku.log

nwaku version/commit hash

v0.28.0-rc.0

@AlbertoSoutullo AlbertoSoutullo added the bug Something isn't working label Jun 5, 2024
@Ivansete-status Ivansete-status self-assigned this Jun 6, 2024

AlbertoSoutullo (Author) commented Jun 7, 2024

Hello @Ivansete-status, I'm coming back with more news about this.

I just tested this again. Maybe this information helps:

  • This issue happens on v0.26 and v0.28.0-rc.0. It happens so often that we literally cannot even run simulations.
  • Version v0.27 runs perfectly fine, with no problems whatsoever; if this log line appears at all, it is barely noticeable.
  • In all v0.26, v0.27 and v0.28 simulations we used the same parameters (combined into a single launch command in the sketch after this list):
    • --relay-peer-exchange=true
    • --relay=true
    • --max-connections=500
    • --rest=true
    • --rest-admin=true
    • --rest-private=true
    • --rest-address=0.0.0.0
    • --discv5-discovery=true
    • --discv5-enr-auto-update=True
    • --log-level=DEBUG
    • --metrics-server=True
    • --metrics-server-address=0.0.0.0
    • --discv5-bootstrap-node=$ENR1
    • --discv5-bootstrap-node=$ENR2
    • --discv5-bootstrap-node=$ENR3
    • --nat=extip:${IP}
    • --pubsub-topic="/waku/2/default-waku/proto"
  • For bootstrap and midstrap nodes, the only change is --max-connections=2000. We increased this limit to allow peers to hold more connections, but it doesn't help. v0.27 works with --max-connections=50 on normal nodes and --max-connections=200 on bootstrap and midstrap nodes.
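For clarity, this is roughly the launch command those flags add up to for a normal node; the wakunode2 binary path and the $ENR/$IP variables are placeholders for whatever our deployment injects:

```bash
# Sketch of how a "normal" node is launched with the flags listed above.
# Binary path, $ENR1..$ENR3 and $IP are deployment-specific placeholders.
./wakunode2 \
  --relay=true \
  --relay-peer-exchange=true \
  --max-connections=500 \
  --rest=true \
  --rest-admin=true \
  --rest-private=true \
  --rest-address=0.0.0.0 \
  --discv5-discovery=true \
  --discv5-enr-auto-update=True \
  --log-level=DEBUG \
  --metrics-server=True \
  --metrics-server-address=0.0.0.0 \
  --discv5-bootstrap-node="$ENR1" \
  --discv5-bootstrap-node="$ENR2" \
  --discv5-bootstrap-node="$ENR3" \
  --nat="extip:${IP}" \
  --pubsub-topic="/waku/2/default-waku/proto"
```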

If you can provide more information about why this log line appears, it would be helpful.
Also, if you have ideas on how to test this further, let us know and we will run them.


Ivansete-status (Collaborator) commented Jun 7, 2024

...
If you can provide more information about why this log line appears, it would be helpful. Also, if you have ideas on how to test this further, let us know and we will run them.

Thanks for the update! We need to pinpoint the actual commit that caused this issue.
IMO the most relevant part is the different "stable time" across the cases.

AlbertoSoutullo (Author) commented:

After some debugging, thanks to @Ivansete-status, we found out that:

  • Using cluster-id with its default value, together with the also-default topic "/waku/2/default-waku/proto", was causing connectivity issues.
  • Ivan traced the issue back to this specific commit: 6e6cb29
  • Changing the cluster-id to 2 (not 1, to avoid RLN) and using "/waku/2/rs/2/0" as the pubsub topic fixed the issue (see the flag sketch after this list).
  • The root cause of the problem is still not clear.
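In terms of flags, the change that made the difference looks roughly like this; the rest of the launch command stays exactly as before, and --cluster-id is assumed to be the relevant option name:

```bash
# Before (problematic): default cluster-id with the default named topic.
--pubsub-topic="/waku/2/default-waku/proto"

# After (stable): dedicated cluster 2 with its static-sharding topic.
--cluster-id=2
--pubsub-topic="/waku/2/rs/2/0"
```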


gabrielmer (Contributor) commented Jun 20, 2024

After verifying with @AlbertoSoutullo, it seems that merging #2823 and #2824 fixed the issue 🥳

Closing it :))

The fixes increased the time to get a stable mesh, which makes sense as we used to establish connections in a very aggressive way.

We need to calibrate how many max-connections to use, and maybe which value to give to ConnectivityLoopInterval. From the simulations run so far, it seems that the most important factor is the max-connections value.
