
feat: Added proper per shard bandwidth metric calculation #2851

Merged
2 commits merged into master from feat-bandwidth-metrics on Jun 28, 2024

Conversation

NagyZoltanPeter
Contributor

@NagyZoltanPeter NagyZoltanPeter commented Jun 26, 2024

Description

Calculation of relay network traffic per served shard, so that operators can better understand network usage.

As part of this, DST log-analytics support logs are also added, since the two solutions need to converge: the solution implemented in #2832 is therefore applied here as well.
This helps testing with lite-protocol-tester and waku-simulator.

Changes

  • Changed rate-limit metrics for the dashboard
  • Updated the monitoring dashboard for bandwidth (bw) and rate metrics
  • Added proper per-shard bandwidth metric calculation (see the sketch after this list)
  • Added logging of in/out messages
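As a rough sketch of what per-shard bandwidth accounting can look like with the nim-metrics API used across nwaku (the metric and proc names below are illustrative assumptions, not necessarily the identifiers merged in this PR):

```nim
import metrics

# Hypothetical metric name; labels carry the shard's pubsub topic
# and the traffic direction.
declarePublicCounter waku_relay_network_bytes,
  "total relay traffic per pubsub topic", ["topic", "direction"]

proc trackIncoming(pubsubTopic: string, msgBytes: int) =
  # Gross "in": counted for every payload received on the topic.
  waku_relay_network_bytes.inc(msgBytes.int64,
                               labelValues = [pubsubTopic, "in"])

proc trackOutgoing(pubsubTopic: string, msgBytes: int) =
  # "out": counted when publishing/forwarding on the topic.
  waku_relay_network_bytes.inc(msgBytes.int64,
                               labelValues = [pubsubTopic, "out"])
```

Labelling by pubsub topic keeps one time series per shard, which is what the per-shard dashboard panels aggregate over.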

Related monitoring changes for nwaku-compose:

waku-org/nwaku-compose#99

Monitoring panel added for relay/shard metrics:

[screenshot: monitoring panel for relay/shard metrics]

Issue

#1945

Commits:

  • feat: Added proper per shard bandwidth metric calculation and proper logging of in/out messages
  • Changed rate limit metrics for dashboard; updated monitoring dashboard for bw and rate metrics

github-actions bot commented Jun 26, 2024

You can find the image built from this PR at

quay.io/wakuorg/nwaku-pr:2851

Built from 2fb161f

Collaborator

@Ivansete-status Ivansete-status left a comment


Great PR! Thanks for it! You are solving a very complex issue wrt how to properly track the "sent" messages 🥳

I'm not approving yet because we need to discuss a little what the best approach is to track the "in" traffic.

If we merge it as it is now, we will get metrics for all the gross "in" traffic. In other words, we would track all the "in" traffic, even for topics "my peer" is not interested in, or messages that don't pass the message validations. I think this metric sounds interesting, to see how loaded "my peer" is, but it is something we can add in a separate PR.

On the other hand, I suggest having the "in" traffic metrics logic within the following else statement, so that we only count the valid messages "my peer" is interested in:

else:
  return handler(pubsubTopic, decMsg.get())

As additional context, the current approach tracks the "in" traffic from: https://github.com/vacp2p/nim-libp2p/blob/d0af3fbe8559f69195657a360c3dd4ac4552c811/libp2p/protocols/pubsub/gossipsub.nim#L498

whereas the suggested approach would track the "in" traffic from: https://github.com/vacp2p/nim-libp2p/blob/d0af3fbe8559f69195657a360c3dd4ac4552c811/libp2p/protocols/pubsub/gossipsub.nim#L445
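To make the two hook points concrete, here is a minimal sketch of the topic-handler wrapper this discussion revolves around (hypothetical Nim, loosely modeled on waku/waku_relay/protocol.nim; the types, imports, and metric name are assumptions, not the exact PR code):

```nim
import chronos, chronicles, metrics
import waku/waku_core  # assumed: provides WakuMessage and its decode

declarePublicCounter waku_relay_network_bytes,
  "relay traffic per pubsub topic", ["topic", "direction"]

type
  TopicHandler = proc(topic: string, data: seq[byte]): Future[void] {.gcsafe.}
  WakuRelayHandler = proc(topic: string, msg: WakuMessage): Future[void] {.gcsafe.}

proc wrapTopicHandler(handler: WakuRelayHandler): TopicHandler =
  proc wrapped(pubsubTopic: string, data: seq[byte]): Future[void] {.gcsafe.} =
    # Current approach: count every received payload as "in" traffic
    # before decoding/validation, i.e. the gross "in" bandwidth.
    waku_relay_network_bytes.inc(data.len.int64,
                                 labelValues = [pubsubTopic, "in"])

    let decMsg = WakuMessage.decode(data)
    if decMsg.isErr():
      error "failed to decode relay message", topic = pubsubTopic
      let fut = newFuture[void]()
      fut.complete()
      return fut
    else:
      # Suggested alternative: counting here instead would record only
      # messages that decode and reach the application handler.
      return handler(pubsubTopic, decMsg.get())

  return wrapped
```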

4 review threads on waku/waku_relay/protocol.nim (outdated, resolved)
@NagyZoltanPeter
Contributor Author

> If we merge it as it is now, we will get metrics for all the gross "in" traffic. In other words, we would track all the "in" traffic, even for topics "my peer" is not interested in, or messages that don't pass the message validations. […]
>
> On the other hand, I suggest having the "in" traffic metrics logic within the following else statement, so that we only count the valid messages "my peer" is interested in. […]

@Ivansete-status
Yes, you see it very well, but all of this was intentional; I played with it a lot before arriving at this solution.
The problem is that if we count only the filtered messages as IN, we will lie to ourselves about the bandwidth stress on the node, so we may not find a specific shard that spams us.
This measurement has a follow-up task: dynamic subscribe/unsubscribe to shards based on bandwidth requirements later on.
But I agree that, as a measurement, we may differentiate between relay/shard/in and relay/shard/valid-in (valid in the sense that we try to propagate it); a sketch follows below.

BTW, normally we should only get messages via relay on topics we are subscribed to... if that is not the case, we are being spammed somehow.

As we already discussed with @gabrielmer, we need to prepare a libp2p PR that adds better-fitting hooks into the relay/publish process, so we can learn about the errors that come up.
Currently, we can know very little about what went wrong.
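The differentiation between gross and valid "in" traffic could, hypothetically, be expressed as two counters (metric names are illustrative, following the relay/shard/in vs relay/shard/valid-in idea; this is not the merged code):

```nim
import metrics

# Gross "in": every payload received on a shard's topic, including
# topics we don't serve and messages that later fail validation.
# Useful for spotting which shard is spamming the node.
declarePublicCounter waku_relay_bytes_in,
  "gross incoming relay traffic per shard", ["topic"]

# Valid "in": only messages that decoded and passed validation,
# i.e. traffic the node actually tries to propagate.
declarePublicCounter waku_relay_bytes_valid_in,
  "validated incoming relay traffic per shard", ["topic"]
```

The gap between the two series would directly expose the gross-to-net ratio discussed further down the thread.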


Contributor

@gabrielmer gabrielmer left a comment


Love it! Thanks so much! 😍

If you can, please attach a screenshot of how the Grafana dashboard looks after the changes, as it's hard to review otherwise.

Really like this solution :))

@Ivansete-status
Collaborator

> […]
> As we already discussed with @gabrielmer, we need to prepare a libp2p PR that adds better-fitting hooks into the relay/publish process, so we can learn about the errors that come up. Currently, we can know very little about what went wrong.

Thanks for the explanation @NagyZoltanPeter! Fair enough, what you mention makes a lot of sense. Nevertheless, it is still relevant to also add statistics and metrics for the "net in" traffic, so that we can have a ratio between the gross and the net traffic. Maybe something to consider in upcoming PRs, but it would be very relevant IMHO ;P

@NagyZoltanPeter
Contributor Author

> Nevertheless, it is still relevant to also add statistics and metrics for the "net in" traffic, so that we can have a ratio between the gross and the net traffic. […]

Sure, I will extend it in a separate PR. Thanks!

@NagyZoltanPeter NagyZoltanPeter merged commit 8f14c04 into master Jun 28, 2024
9 of 10 checks passed
@NagyZoltanPeter NagyZoltanPeter deleted the feat-bandwidth-metrics branch June 28, 2024 00:48
gabrielmer pushed a commit that referenced this pull request Jul 9, 2024
* Added proper per shard bandwidth metric calculation and proper logging of in/out messages
Changed rate limit metrics for dashboard
Updated monitoring dashboard for bw and rate metrics