disable in-packet parallel processing for gossip #4838

Draft · wants to merge 1 commit into base: master
Conversation

alexpyattaev

Problem

Verifying signatures for individual CRDS values within a packet in parallel is not very useful, as we do not have unlimited CPU cores and there is already parallel processing at the per-packet level.

Summary of Changes

  • Disable parallel processing within a gossip packet
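The proposed shape can be sketched with std scoped threads standing in for rayon's `par_iter` (a minimal sketch; `CrdsValue` and its `verify` method here are hypothetical stand-ins for the real agave types, not the actual API):

```rust
use std::thread;

// Hypothetical stand-ins for the real CRDS value type and its sigverify.
struct CrdsValue {
    valid: bool,
}

impl CrdsValue {
    fn verify(&self) -> bool {
        self.valid
    }
}

// Per-packet parallelism only: each packet goes to one worker thread, and
// the values inside a packet are verified serially, short-circuiting on
// the first failure (the shape of the change this PR proposes).
fn verify_packets(packets: &[Vec<CrdsValue>]) -> Vec<bool> {
    thread::scope(|s| {
        let handles: Vec<_> = packets
            .iter()
            .map(|pkt| s.spawn(move || pkt.iter().all(CrdsValue::verify)))
            .collect();
        handles.into_iter().map(|h| h.join().unwrap()).collect()
    })
}

fn main() {
    let packets = vec![
        vec![CrdsValue { valid: true }, CrdsValue { valid: true }],
        vec![CrdsValue { valid: true }, CrdsValue { valid: false }],
    ];
    assert_eq!(verify_packets(&packets), vec![true, false]);
    println!("ok");
}
```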

@alexpyattaev
Author

@behzadnouri do you think it is a good idea? It was discussed on discord in context of "why would we use threads if someone sends a packet full of bogus CRDS values", i.e. it would be cheaper to verify them serially and exit on first failure.

@behzadnouri

> @behzadnouri do you think it is a good idea? It was discussed on discord in context of "why would we use threads if someone sends a packet full of bogus CRDS values", i.e. it would be cheaper to verify them serially and exit on first failure.

When doing #4259 I actually tested removing the inner par_iter on a staked testnet node, and there was a slight regression. So I left it as is.

@alexpyattaev
Author

What kind of regression did you observe? i.e. which metrics degraded?

@behzadnouri

behzadnouri commented Feb 7, 2025

> What kind of regression did you observe? i.e. which metrics degraded?

https://github.com/anza-xyz/agave/blob/f36a62dce/gossip/src/cluster_info.rs#L2288

> "why would we use threads if someone sends a packet full of bogus CRDS values", i.e. it would be cheaper to verify them serially and exit on first failure.

I think the counterargument is that, if we verify CrdsValue's serially, the spammer can easily send a Vec<CrdsValue> in which every value sigverifies except the very last one. So with serial verification we will still sigverify more CrdsValues, not fewer.
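The adversarial case can be demonstrated concretely (a minimal sketch; `Val` and its counter-instrumented `verify` are hypothetical, standing in for a real signature check so we can count how much work is done):

```rust
use std::cell::Cell;

// Hypothetical value whose "signature check" bumps a shared counter so we
// can observe how much work serial short-circuit verification does.
struct Val<'a> {
    ok: bool,
    work: &'a Cell<usize>,
}

impl Val<'_> {
    fn verify(&self) -> bool {
        self.work.set(self.work.get() + 1);
        self.ok
    }
}

fn main() {
    let work = Cell::new(0);
    // Adversarial packet: every value verifies except the very last one.
    let packet: Vec<Val> = (0..64).map(|i| Val { ok: i != 63, work: &work }).collect();
    // Serial verification with early exit on the first failure...
    let all_ok = packet.iter().all(Val::verify);
    // ...still performs all 64 signature checks: short-circuiting saves
    // nothing when only the final value in the packet is bogus.
    assert!(!all_ok);
    assert_eq!(work.get(), 64);
    println!("checks performed: {}", work.get());
}
```

The early exit only helps when failures appear early; a spammer controls where the failure appears, so the worst case stays at one full packet's worth of sigverifies.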

@alexpyattaev
Author

That is true. In the end, however, we are receiving > 20K packets per second, so there are obviously enough of them to make parallelizing within a packet irrelevant. And unnecessary parallelization is pure overhead.

@behzadnouri

> we are getting > 20K packets per second, there is obviously enough of them to make it irrelevant to parallelize within a packet.

But the run_socket_consume function runs a lot more often than just once every second.
So it does not necessarily have a lot of packets to process each time it runs.

> And parallelizing unnecessarily is pure overhead.

That was my thinking too, and like I said, I tried removing the inner par_iter in #4259 for that exact same reason. But the measurements showed a slight regression.
So I intuitively agree with you, but I guess this is one of those cases where test results do not match intuition.

@alexpyattaev
Author

This is the distribution of the number of packets that are available for processing (MNB data, a few huge spikes filtered out).
(plot: test2 — distribution of packet queue sizes)

So the vast majority of the time we are dealing with fewer than 100 packets in that queue.

(plot: test3 — zoomed-in view of the same distribution)

Beyond that, we are looking at a median of maybe 35 packets, 40 packets on average, and 100000 at the maximum. This is enough to spread across the 8 threads that gossip currently uses for sigverify, with several whole packets per thread. Under low load we might win a few microseconds of latency, but under high load, making the code simpler and working in larger units should be beneficial. I'll have a working gossip traffic generator done soon, so I'll be able to test what happens under high load.
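The back-of-envelope arithmetic behind that claim can be spelled out (a sketch using the batch sizes and thread count quoted above; `packets_per_thread` is a hypothetical helper, not agave code):

```rust
// With a batch of N packets split evenly over T sigverify threads, each
// thread gets ceil(N / T) whole packets. At the observed median (~35) and
// mean (~40) batch sizes with 8 threads, every thread already has several
// packets to chew on without splitting any single packet further.
fn packets_per_thread(batch: usize, threads: usize) -> usize {
    (batch + threads - 1) / threads // ceiling division
}

fn main() {
    assert_eq!(packets_per_thread(35, 8), 5); // median batch
    assert_eq!(packets_per_thread(40, 8), 5); // average batch
    assert_eq!(packets_per_thread(100, 8), 13); // "large" batch
    println!("ok");
}
```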

@behzadnouri

behzadnouri commented Feb 8, 2025

> This is the distribution of the number of packets that are available for processing (MNB data, a few huge spikes filtered out).
> So vast majority of the time we are dealing with less than 100 packets in that queue.

The distribution depends a lot on the node's stake.
There also seems to be a current issue where unstaked nodes do not get enough push messages, so they receive tons of pull responses every so often when they send out a pull request. That causes much spikier traffic.

> Under low load, maybe we could win a few microseconds of latency, but under high load making code simpler and working in larger units should be benefitial.

I repeated the test again on a staked node on testnet.
Like before, if we remove the inner par_iter, there is a 5%+ regression in verify_gossip_packets_time.
The left side is master, the right is this code.

(chart: verify_gossip_packets_time — master vs this branch)
