[PoC] TPU receiver prometheus metrics #76
base: prometheus
@@ -109,6 +109,7 @@ use {
        time::{Duration, Instant},
    },
};
+use solana_prometheus::collector::PrometheusCollector;

const MAX_COMPLETED_DATA_SETS_IN_CHANNEL: usize = 100_000;
const WAIT_FOR_SUPERMAJORITY_THRESHOLD_PERCENT: u64 = 80;

@@ -379,6 +380,7 @@ impl Validator {
        socket_addr_space: SocketAddrSpace,
        use_quic: bool,
        tpu_connection_pool_size: usize,
+       prometheus_collector: Option<PrometheusCollector>,
    ) -> Self {
        let id = identity_keypair.pubkey();
        assert_eq!(id, node.info.id);

Review comment on the added prometheus_collector parameter: Although it makes sense to put new arguments at the end of the function parameters, to avoid conflicts it's better to put new arguments in the middle of the other arguments; that makes rebasing easier if something changes in the arguments.

@@ -740,6 +742,7 @@ impl Validator {
            connection_cache.clone(),
            max_complete_transaction_status_slot,
            config.vote_accounts_to_monitor.clone(),
+           prometheus_collector.clone(),
        )),
        if !config.rpc_config.full_api {
            None

@@ -996,6 +999,7 @@ impl Validator {
            &cost_model,
            &connection_cache,
            &identity_keypair,
+           prometheus_collector,
            config.enable_quic_servers,
        );
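The hunks above thread an optional, shared collector handle from Validator::new down into the RPC service and the TPU. A minimal sketch of that plumbing pattern, using invented stand-in types rather than Solana's real constructors and field names:

```rust
use std::sync::{Arc, Mutex};

// Stand-ins for the real types; only the plumbing pattern is shown here.
pub struct MetricsCollector;
pub type PrometheusCollector = Arc<Mutex<MetricsCollector>>;

struct JsonRpc {
    collector: Option<PrometheusCollector>,
}

struct Tpu {
    collector: Option<PrometheusCollector>,
}

struct Validator {
    rpc: JsonRpc,
    tpu: Tpu,
}

impl Validator {
    fn new(prometheus_collector: Option<PrometheusCollector>) -> Self {
        Self {
            // Cloning an Option<Arc<...>> only bumps a reference count, so the
            // RPC service and the TPU share one Mutex-protected collector.
            rpc: JsonRpc {
                collector: prometheus_collector.clone(),
            },
            tpu: Tpu {
                collector: prometheus_collector,
            },
        }
    }
}

fn main() {
    let collector: PrometheusCollector = Arc::new(Mutex::new(MetricsCollector));
    let _validator = Validator::new(Some(collector));
}
```

The second diff below adds the collector module itself.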
@@ -0,0 +1,42 @@
use crate::utils::{write_metric, Metric, MetricFamily};
use solana_streamer::streamer::StreamerReceiveStatsTotal;
use std::io;
use std::sync::{Arc, Mutex};

pub type PrometheusCollector = Arc<Mutex<MetricsCollector>>;

pub struct MetricsCollector {
    tpu_receiver_stats: Option<StreamerReceiveStatsTotal>,
}

impl MetricsCollector {
    pub fn new() -> Self {
        Self {
            tpu_receiver_stats: None,
        }
    }

    pub fn save_tpu_receiver_stats(&mut self, stats: StreamerReceiveStatsTotal) {
        self.tpu_receiver_stats = Some(stats)
    }

    pub fn write_metrics<W: io::Write>(&self, out: &mut W) -> io::Result<()> {
        if self.tpu_receiver_stats.is_none() {
            return Ok(());
        }

        let tpu_metrics = self.tpu_receiver_stats.as_ref().unwrap();

        write_metric(
            out,
            &MetricFamily {
                name: "solana_validator_tpu_packets_count_total",
                help: "Packets received by Transaction Processing Unit",
                type_: "counter",
                metrics: vec![Metric::new(tpu_metrics.packets_count_total as u64)],
            },
        )?;

        Ok(())
    }
}
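To make the intended data flow concrete: the streamer side periodically pushes a totals snapshot into the shared collector, and the /metrics handler serializes whatever snapshot is current. A rough, self-contained sketch of that flow, using stand-in types in place of the real StreamerReceiveStatsTotal and MetricsCollector (only the one field the diff reads is mirrored):

```rust
use std::sync::{Arc, Mutex};
use std::thread;
use std::time::Duration;

// Stand-in for StreamerReceiveStatsTotal; only packets_count_total is mirrored.
struct StatsTotal {
    packets_count_total: usize,
}

// Stand-in for MetricsCollector, reduced to the snapshot slot used above.
#[derive(Default)]
struct Collector {
    tpu_receiver_stats: Option<StatsTotal>,
}

fn main() {
    // Shared handle, analogous to PrometheusCollector = Arc<Mutex<MetricsCollector>>.
    let collector = Arc::new(Mutex::new(Collector::default()));

    // "Push" side: a reporting loop (the streamer's periodic stats cycle in the
    // real code) stores the latest running totals into the shared collector.
    let pusher = Arc::clone(&collector);
    thread::spawn(move || {
        let mut total = 0usize;
        loop {
            total += 123; // pretend some packets arrived this interval
            pusher.lock().unwrap().tpu_receiver_stats =
                Some(StatsTotal { packets_count_total: total });
            thread::sleep(Duration::from_secs(1));
        }
    });

    // "Pull" side: the /metrics handler serializes whatever snapshot is current,
    // which is why a scrape can lag behind by up to one push interval.
    thread::sleep(Duration::from_millis(1500));
    if let Some(stats) = &collector.lock().unwrap().tpu_receiver_stats {
        println!(
            "solana_validator_tpu_packets_count_total {}",
            stats.packets_count_total
        );
    }
}
```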
If I understand correctly, this will keep one copy of the metrics alongside the Prometheus config, periodically “push” to it, and then the endpoint serves whatever latest value it had.

This creates a polling delay: the value that we serve on /metrics can be outdated by the cycle interval of this loop. It looks like that interval is 1 second, so it’s not too bad, but it would be nicer if we can get the actual current values. If we can get the StreamerReceiveStats into the Prometheus endpoint, it could do the atomic reads there, but it means we would need to make them totals. We can handle the resetting here at this point then. Something like this:
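A sketch of that idea (the field and method names here are hypothetical; the real StreamerReceiveStats has more counters): the streamer keeps monotonically growing atomic totals, the Prometheus endpoint reads them at scrape time, and the existing one-second reporter derives its interval deltas from the totals instead of resetting counters.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;

// Hypothetical running totals kept by the streamer. They are never reset,
// so the Prometheus endpoint can expose them as counters directly.
#[derive(Default)]
pub struct StreamerReceiveTotals {
    pub packets_count_total: AtomicUsize,
}

impl StreamerReceiveTotals {
    // Called from the receive path.
    pub fn add_packets(&self, num_packets: usize) {
        self.packets_count_total
            .fetch_add(num_packets, Ordering::Relaxed);
    }

    // Called from the /metrics handler, which holds an Arc to the same
    // struct: no push loop, no stale snapshot.
    pub fn total_packets(&self) -> u64 {
        self.packets_count_total.load(Ordering::Relaxed) as u64
    }

    // The existing one-second reporter can still log interval deltas by
    // remembering the previous total instead of resetting a counter.
    pub fn interval_delta(&self, prev_total: &mut u64) -> u64 {
        let now = self.total_packets();
        let delta = now - *prev_total;
        *prev_total = now;
        delta
    }
}

// The streamer and the endpoint would share this via Arc.
pub type SharedReceiveTotals = Arc<StreamerReceiveTotals>;
```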
Not sure if this is feasible in practice though, it looks like there is something going on with accumulation over multiple threads? Also, having the separate fields as atomic ints can lead to surprising results. For example, if you are trying to compute an error rate, and you have counters num_requests and num_errors, where it should be the case that num_errors <= num_requests, then if we load those counters with Ordering::Relaxed, that inequality could be violated. It requires some thought both at the place where the counters are read and the place where they are incremented to get it right. I’m not sure Solana does this, since they prioritize speed over correctness, and Ordering::Relaxed does give the compiler and CPU more freedom to optimize than some of the other orderings ...
Hmm, I see, so that would basically result in changing how the logic of the current Solana metrics is implemented. I would be a bit hesitant to change this Solana metrics logic because, to make it consistent, we would likely need to do it in all places this pattern is followed (which seems to be a lot).

Yes, that is what is happening.

Yeah, there are definitely possible inconsistencies, but I guess that is really common in metrics. Using the example of num_requests and num_errors in some other application, num_requests is likely to be incremented at the beginning of handling the request while num_errors, for obvious reasons, can only be incremented later, so it won't be an atomic operation.

P.S. I now want to see this whole application for teleporting goats 🤣
That by itself is not an issue, since you still couldn’t get a >100% error rate. If you increment the error counter before the request counter, then you could get a >100% error rate, which is really counter-intuitive. (Or if you read the request counter before the error counter.)
You might enjoy these :-)
https://crbug.com/31482
https://codereview.chromium.org/543045
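To make the ordering hazard above concrete, here is a small sketch with hypothetical num_requests/num_errors counters. With the increment order shown (errors before requests), or with Relaxed operations that the compiler and hardware are free to reorder across separate atomics, a reader can momentarily observe num_errors > num_requests, i.e. a >100% error rate:

```rust
use std::sync::atomic::{AtomicU64, Ordering};

static NUM_REQUESTS: AtomicU64 = AtomicU64::new(0);
static NUM_ERRORS: AtomicU64 = AtomicU64::new(0);

// Writer side: bumping the error counter first means a concurrent reader can
// see the error before it sees the corresponding request. Even with the
// "right" program order, Relaxed gives no cross-thread ordering guarantee
// between two separate atomics.
fn record_failed_request() {
    NUM_ERRORS.fetch_add(1, Ordering::Relaxed);
    NUM_REQUESTS.fetch_add(1, Ordering::Relaxed);
}

// Reader side: given the writer above, errors > requests is a possible
// observation, so the computed rate can exceed 1.0.
fn error_rate() -> f64 {
    let errors = NUM_ERRORS.load(Ordering::Relaxed) as f64;
    let requests = NUM_REQUESTS.load(Ordering::Relaxed) as f64;
    if requests == 0.0 {
        0.0
    } else {
        errors / requests
    }
}

fn main() {
    record_failed_request();
    println!("error rate: {}", error_rate());
}
```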