Skip to content

Add per-client queue length metrics #2754

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Draft
wants to merge 6 commits into
base: master
Choose a base branch
from

Conversation

gefjon
Copy link
Contributor

@gefjon gefjon commented May 19, 2025

Description of Changes

This PR adds two new metrics, spacetime_client_connection_incoming_queue_length and spacetime_client_connection_outgoing_queue_length, which track the lengths of the per-client incoming and outgoing message queues. We expect to use these metrics in testing, incl. bot testing, but not to merge them, as we expect the cardinality of per-client metrics to be too high. I will open a separate PR with per-database versions of these metrics which we can merge.

It's a metric that tracks the size of the incoming per-client `message_queue`.
We've been concerned about not having visibility into this queue's length,
as currently we only have the length of the per-database reducer queue,
but each client can only have a single reducer in that queue,
and may have additional messages waiting in its per-client queue.

The new metric, `spacetime_client_connection_incoming_queue_length`,
is an `IntGaugeVec` with the labels:
`db: Identity, client_identity: Identity, connection_id: ConnectionId`.
My theory is that in our viewer we can inspect the average and the sum per database,
and it also may be interesting to be able to look at individual clients,
as e.g. the BitCraft mob monitor may be a notable outlier.
This is not the same pattern as most of our other metrics, though,
which tend to only offer per-database granularity.
It's possible that this new metric should also have the last two labels removed,
and be labeled only on `db: Identity`.
@gefjon gefjon requested review from jsdt and joshua-spacetime May 19, 2025 15:01
Copy link
Collaborator

@joshua-spacetime joshua-spacetime left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These are indeed the labels that we want for right now.

Comment on lines +234 to +237
scopeguard::defer!(
if let Err(e) = WORKER_METRICS
.client_connection_incoming_queue_length
.remove_label_values(&addr, &client_identity, &connection_id) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In lieu of a test, I'd just add a comment here about resetting the gauge when the connection ends since that's something that has plagued us before.


#[name = spacetime_client_connection_incoming_queue_length]
#[help = "The number of client -> server WebSocket messages waiting in a client connection's incoming queue"]
#[labels(db: Identity, client_identity: Identity, connection_id: ConnectionId)]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should avoid adding any metrics with per-client or per-connection labels. I'm not sure how big the impact would be on our metrics collection/storage/querying performance, but the cardinality of a stat like this can blow up very quickly.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As discussed out-of-band, I'll alter this to be a per-database metric, which I suppose will be the sum of the length of all clients' queues. I expect to open a separate PR for that, since we want to use this PR in testing.

@bfops bfops added the release-any To be landed in any release window label May 19, 2025
@gefjon gefjon marked this pull request as draft May 20, 2025 17:52
Like the previous metric added in this branch, it's per-client,
so we'll use it for testing, but likely not merge it into master.
I'll follow up in a separate PR with a version that's per-database instead.
@gefjon gefjon changed the title Add client_connection_incoming_queue_length metric Add per-client queue length metrics May 20, 2025
gefjon added a commit that referenced this pull request May 21, 2025
This commit alters the message queue length metrics
introduced by #2754
to be per-database, rather than per-client.
This should limit the cardinality of these metrics,
and better lines up with the labels of our other metrics.

Because the metrics are now per-database rather than per-client,
it's no longer correct to just drop the label when the client disconnects.
Instead, care must be taken to decrement the metric
by the number of messages which were waiting in the queue
at the time of the disconnection.
I've added comments to call attention to this complexity.
gefjon and others added 3 commits May 22, 2025 11:36
Included resolving merge conflicts in client_connection.rs,
moving the sendtx queue length metric into the `ClientConnectionMetrics`.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
release-any To be landed in any release window
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants