
feat: Added proper per shard bandwidth metric calculation #2851

Merged
2 commits merged into master from feat-bandwidth-metrics on Jun 28, 2024

Conversation

NagyZoltanPeter
Contributor

@NagyZoltanPeter NagyZoltanPeter commented Jun 26, 2024

Description

Calculation of relay network traffic per served shard, so that operators can better understand network usage.

As part of this, DST log-analytics support logs are also added, since the two solutions need to converge: the solution implemented in #2832 is therefore applied here as well.
This helps testing with lite-protocol-tester and waku-simulator.

Changes

  • Changed rate-limit metrics for the dashboard
  • Updated the monitoring dashboard for bandwidth (bw) and rate metrics
  • Added proper per-shard bandwidth metric calculation (see the sketch after this list)
  • Added logging of in/out messages
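As a rough sketch of what per-shard bandwidth accounting can look like with the nim-metrics API used across nwaku (the metric and proc names below are illustrative assumptions, not necessarily the identifiers merged in this PR):

```nim
import metrics

# Hypothetical metric name; labels carry the shard's pubsub topic
# and the traffic direction.
declarePublicCounter waku_relay_network_bytes,
  "total relay traffic per pubsub topic", ["topic", "direction"]

proc trackIncoming(pubsubTopic: string, msgBytes: int) =
  # Gross "in": counted for every payload received on the topic.
  waku_relay_network_bytes.inc(msgBytes.int64,
                               labelValues = [pubsubTopic, "in"])

proc trackOutgoing(pubsubTopic: string, msgBytes: int) =
  # "out": counted when publishing/forwarding on the topic.
  waku_relay_network_bytes.inc(msgBytes.int64,
                               labelValues = [pubsubTopic, "out"])
```

Labelling by pubsub topic keeps one time series per shard, which is what the per-shard dashboard panels aggregate over.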

Related monitoring changes for nwaku-compose:

waku-org/nwaku-compose#99

Monitoring panel added for relay/shard metrics:

[screenshot: monitoring panel for relay/shard metrics]

Issue

#1945

Commits:

  • feat: Added proper per shard bandwidth metric calculation and proper logging of in/out messages
  • Changed rate limit metrics for dashboard; updated monitoring dashboard for bw and rate metrics

github-actions bot commented Jun 26, 2024

You can find the image built from this PR at

quay.io/wakuorg/nwaku-pr:2851

Built from 2fb161f

Collaborator

@Ivansete-status Ivansete-status left a comment


Great PR! Thanks for it! You are solving a very complex issue wrt how to properly track the "sent" messages 🥳

I'm not approving yet because we need to discuss a little what the best approach is to track the "in" traffic.

If we merge it as it is now, we will get metrics for all the gross "in" traffic. In other words, we would track all the "in" traffic, even for topics "my peer" is not interested in, or messages that don't pass the message validations. I think this metric sounds interesting, to see how loaded "my peer" is, but it is something we can add in a separate PR.

On the other hand, I suggest having the "in" traffic metrics logic within the following else statement, so that we only count the valid messages "my peer" is interested in:

else:
  return handler(pubsubTopic, decMsg.get())

As additional context, the current approach tracks the "in" traffic from: https://github.com/vacp2p/nim-libp2p/blob/d0af3fbe8559f69195657a360c3dd4ac4552c811/libp2p/protocols/pubsub/gossipsub.nim#L498

whereas the suggested approach would track the "in" traffic from: https://github.com/vacp2p/nim-libp2p/blob/d0af3fbe8559f69195657a360c3dd4ac4552c811/libp2p/protocols/pubsub/gossipsub.nim#L445
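To make the two hook points concrete, here is a minimal sketch of the topic-handler wrapper this discussion revolves around (hypothetical Nim, loosely modeled on waku/waku_relay/protocol.nim; the types, imports, and metric name are assumptions, not the exact PR code):

```nim
import chronos, chronicles, metrics
import waku/waku_core  # assumed: provides WakuMessage and its decode

declarePublicCounter waku_relay_network_bytes,
  "relay traffic per pubsub topic", ["topic", "direction"]

type
  TopicHandler = proc(topic: string, data: seq[byte]): Future[void] {.gcsafe.}
  WakuRelayHandler = proc(topic: string, msg: WakuMessage): Future[void] {.gcsafe.}

proc wrapTopicHandler(handler: WakuRelayHandler): TopicHandler =
  proc wrapped(pubsubTopic: string, data: seq[byte]): Future[void] {.gcsafe.} =
    # Current approach: count every received payload as "in" traffic
    # before decoding/validation, i.e. the gross "in" bandwidth.
    waku_relay_network_bytes.inc(data.len.int64,
                                 labelValues = [pubsubTopic, "in"])

    let decMsg = WakuMessage.decode(data)
    if decMsg.isErr():
      error "failed to decode relay message", topic = pubsubTopic
      let fut = newFuture[void]()
      fut.complete()
      return fut
    else:
      # Suggested alternative: counting here instead would record only
      # messages that decode and reach the application handler.
      return handler(pubsubTopic, decMsg.get())

  return wrapped
```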

4 review threads on waku/waku_relay/protocol.nim (outdated, resolved)
@NagyZoltanPeter
Contributor Author

> If we merge it as it is now, we will get metrics for all the gross "in" traffic. In other words, we would track all the "in" traffic, even for topics "my peer" is not interested in, or messages that don't pass the message validations. […]
>
> On the other hand, I suggest having the "in" traffic metrics logic within the following else statement, so that we only count the valid messages "my peer" is interested in. […]

@Ivansete-status
Yes, you see it very well, but all of this was intentional; I played with it a lot before arriving at this solution.
The problem is that if we count only the filtered messages as IN, we will lie to ourselves about the bandwidth stress on the node, so we may not find a specific shard that spams us.
This measurement has a follow-up task: dynamic subscribe/unsubscribe to shards based on bandwidth requirements later on.
But I agree that, as a measurement, we may differentiate between relay/shard/in and relay/shard/valid-in (valid in the sense that we try to propagate it); a sketch follows below.

BTW, normally we should only get messages via relay on topics we are subscribed to... if that is not the case, we are being spammed somehow.

As we already discussed with @gabrielmer, we need to prepare a libp2p PR that adds better-fitting hooks into the relay/publish process, so we can learn about the errors that come up.
Currently, we can know very little about what went wrong.
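The differentiation between gross and valid "in" traffic could, hypothetically, be expressed as two counters (metric names are illustrative, following the relay/shard/in vs relay/shard/valid-in idea; this is not the merged code):

```nim
import metrics

# Gross "in": every payload received on a shard's topic, including
# topics we don't serve and messages that later fail validation.
# Useful for spotting which shard is spamming the node.
declarePublicCounter waku_relay_bytes_in,
  "gross incoming relay traffic per shard", ["topic"]

# Valid "in": only messages that decoded and passed validation,
# i.e. traffic the node actually tries to propagate.
declarePublicCounter waku_relay_bytes_valid_in,
  "validated incoming relay traffic per shard", ["topic"]
```

The gap between the two series would directly expose the gross-to-net ratio discussed further down the thread.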


Contributor

@gabrielmer gabrielmer left a comment


Love it! Thanks so much! 😍

If you can, please attach a screenshot of how the Grafana dashboard looks after the changes, as it's hard to review otherwise.

Really like this solution :))

@Ivansete-status
Collaborator

> […]
> As we already discussed with @gabrielmer, we need to prepare a libp2p PR that adds better-fitting hooks into the relay/publish process, so we can learn about the errors that come up. Currently, we can know very little about what went wrong.

Thanks for the explanation @NagyZoltanPeter! Fair enough, what you mention makes a lot of sense. Nevertheless, it is still relevant to also add statistics and metrics for the "net in" traffic, so that we can have a ratio between the gross and the net traffic. Maybe something to consider in upcoming PRs, but it would be very relevant IMHO ;P

@NagyZoltanPeter
Contributor Author

> Nevertheless, it is still relevant to also add statistics and metrics for the "net in" traffic, so that we can have a ratio between the gross and the net traffic. […]

Sure, I will extend it in a separate PR. Thanks!

@NagyZoltanPeter NagyZoltanPeter merged commit 8f14c04 into master Jun 28, 2024
9 of 10 checks passed
@NagyZoltanPeter NagyZoltanPeter deleted the feat-bandwidth-metrics branch June 28, 2024 00:48
gabrielmer pushed a commit that referenced this pull request Jul 9, 2024
* Added proper per shard bandwidth metric calculation and proper logging of in/out messages
Changed rate limit metrics for dashboard
Updated monitoring dashboard for bw and rate metrics