Per-database incoming and outgoing queue length metrics #2773

gefjon · 2025-05-22T15:52:17Z

Description of Changes

Add two new metrics, `spacetime_total_incoming_queue_length` and `spacetime_total_outgoing_queue_length`. These are similar to the metrics added in #2754, except that they are per-database rather than per-client. This reduces cardinality at the cost of less granular reporting.

Because the metrics are per-database rather than per-client, there's some tricky code related to cleaning up when a client disconnects.

API and ABI breaking changes

N/A

Expected complexity level and risk

1 - new metrics with low cardinality and infrequent with_label_values calls.

Testing

None - I am not sure how to test metrics.

It's a metric that tracks the size of the incoming per-client `message_queue`. We've been concerned about not having visibility into this queue's length, as currently we only have the length of the per-database reducer queue, but each client can only have a single reducer in that queue, and may have additional messages waiting in its per-client queue. The new metric, `spacetime_client_connection_incoming_queue_length`, is an `IntGaugeVec` with the labels: `db: Identity, client_identity: Identity, connection_id: ConnectionId`. My theory is that in our viewer we can inspect the average and the sum per database, and it also may be interesting to be able to look at individual clients, as e.g. the BitCraft mob monitor may be a notable outlier. This is not the same pattern as most of our other metrics, though, which tend to only offer per-database granularity. It's possible that this new metric should also have the last two labels removed, and be labeled only on `db: Identity`.
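To make the labeling trade-off concrete, here is a minimal sketch using only the standard library. The `Sample` struct, its field types, and the example values are hypothetical stand-ins, not the actual `IntGaugeVec` or the real `Identity` and `ConnectionId` types. It counts how many time series the three-label scheme produces and shows the per-database sums that a `db`-only metric would report directly.

```rust
use std::collections::{BTreeMap, BTreeSet};

/// Hypothetical snapshot of one client's incoming queue length.
/// The real labels are `Identity` / `ConnectionId`; plain strings and
/// integers are used here only to keep the sketch self-contained.
struct Sample {
    db: &'static str,
    client_identity: &'static str,
    connection_id: u32,
    queue_len: i64,
}

/// Number of distinct series when labeling by (db, client_identity, connection_id).
fn per_client_series(samples: &[Sample]) -> usize {
    samples
        .iter()
        .map(|s| (s.db, s.client_identity, s.connection_id))
        .collect::<BTreeSet<_>>()
        .len()
}

/// What a metric labeled only on `db` would report: one total per database.
fn per_database_totals(samples: &[Sample]) -> BTreeMap<&'static str, i64> {
    let mut totals = BTreeMap::new();
    for s in samples {
        *totals.entry(s.db).or_insert(0) += s.queue_len;
    }
    totals
}

/// Made-up data: two clients on `db_a`, one client on `db_b`.
fn example_samples() -> Vec<Sample> {
    vec![
        Sample { db: "db_a", client_identity: "alice", connection_id: 1, queue_len: 5 },
        Sample { db: "db_a", client_identity: "bob", connection_id: 2, queue_len: 2 },
        Sample { db: "db_b", client_identity: "carol", connection_id: 3, queue_len: 1 },
    ]
}
```

With per-client labels the series count grows with the number of connections, while the per-database form stays at one series per database but can no longer single out an outlier client.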

This commit alters the message queue length metrics introduced by #2754 to be per-database rather than per-client. This should limit the cardinality of these metrics and aligns them better with the labels of our other metrics. Because the metrics are now per-database rather than per-client, it's no longer correct to just drop the label when the client disconnects. Instead, care must be taken to decrement the metric by the number of messages that were waiting in the queue at the time of the disconnection. I've added comments to call attention to this complexity.
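The disconnect bookkeeping described above can be sketched as follows. This is an illustration under stated assumptions, not the code from this PR: `DbQueueLengthGauge` is a hypothetical atomic counter standing in for one labeled child of the per-database gauge, `ClientQueue` is a hypothetical wrapper around the per-client queue, and doing the cleanup in `Drop` is one possible design rather than necessarily what `client_connection.rs` does.

```rust
use std::collections::VecDeque;
use std::sync::atomic::{AtomicI64, Ordering};
use std::sync::Arc;

/// Hypothetical stand-in for the gauge child shared by all clients of one database.
#[derive(Default)]
struct DbQueueLengthGauge(AtomicI64);

impl DbQueueLengthGauge {
    fn add(&self, n: i64) {
        self.0.fetch_add(n, Ordering::Relaxed);
    }
    fn sub(&self, n: i64) {
        self.0.fetch_sub(n, Ordering::Relaxed);
    }
    fn get(&self) -> i64 {
        self.0.load(Ordering::Relaxed)
    }
}

/// Hypothetical per-client queue that reports into its database's shared gauge.
struct ClientQueue {
    messages: VecDeque<String>,
    gauge: Arc<DbQueueLengthGauge>,
}

impl ClientQueue {
    fn new(gauge: Arc<DbQueueLengthGauge>) -> Self {
        Self { messages: VecDeque::new(), gauge }
    }

    /// Enqueue a message and count it in the per-database total.
    fn push(&mut self, msg: String) {
        self.messages.push_back(msg);
        self.gauge.add(1);
    }

    /// Dequeue a message; only decrement if something was actually removed.
    fn pop(&mut self) -> Option<String> {
        let msg = self.messages.pop_front();
        if msg.is_some() {
            self.gauge.sub(1);
        }
        msg
    }
}

impl Drop for ClientQueue {
    /// On disconnect, subtract the messages still waiting in this client's queue.
    /// The shared gauge cannot simply be removed, because other clients of the
    /// same database still contribute to it.
    fn drop(&mut self) {
        self.gauge.sub(self.messages.len() as i64);
    }
}

/// Two clients share one database gauge; returns the gauge value
/// before and after the first client disconnects.
fn example_disconnect() -> (i64, i64) {
    let gauge = Arc::new(DbQueueLengthGauge::default());
    let mut a = ClientQueue::new(gauge.clone());
    let mut b = ClientQueue::new(gauge.clone());
    a.push("m1".into());
    a.push("m2".into());
    b.push("m3".into());
    a.pop();
    let before = gauge.get();
    drop(a); // client `a` disconnects with one message still queued
    let after = gauge.get();
    (before, after)
}
```

Forgetting the subtraction in `drop` would leave the per-database total permanently inflated by every message that was pending when a client disconnected.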

gefjon · 2025-05-22T15:52:54Z

Do I need to remove_label_values when the database is deleted somehow? Where would I do that?

joshua-spacetime · 2025-05-27T17:17:29Z

> Do I need to remove_label_values when the database is deleted somehow? Where would I do that?

I'm guessing in `delete_database`, although I'm not sure if anything extra is needed for cloud. Perhaps @kim would know.

kim · 2025-05-27T18:07:28Z

When a database is deleted, the ws_client_actor loop should exit (eventually), so I would think that nothing more needs to be done?

joshua-spacetime

@gefjon we should be removing these label values when a database is deleted, but currently we don't do that for any of our metrics, so I think it's fine to just leave this a TODO for now. Ultimately the logic will have to be duplicated in delete_database for both standalone and cloud.

Can you create a ticket to track?
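As a rough sketch of what that TODO might eventually look like: the code below uses a hypothetical `GaugeVecStandIn` (a mutex-protected map keyed on the `db` label) in place of the real Prometheus gauge vector, and `cleanup_on_delete_database` is an invented helper name. Neither is taken from the SpacetimeDB codebase, and where the real hook would live in `delete_database` is left open, as in the discussion.

```rust
use std::collections::HashMap;
use std::sync::Mutex;

/// Hypothetical stand-in for a gauge vector labeled only on `db`.
#[derive(Default)]
struct GaugeVecStandIn {
    series: Mutex<HashMap<String, i64>>,
}

impl GaugeVecStandIn {
    /// Add `n` to the series for `db`, creating it if needed.
    fn add(&self, db: &str, n: i64) {
        *self.series.lock().unwrap().entry(db.to_string()).or_insert(0) += n;
    }

    /// Drop the series for `db`; returns whether a series existed.
    fn remove_label_values(&self, db: &str) -> bool {
        self.series.lock().unwrap().remove(db).is_some()
    }

    fn series_count(&self) -> usize {
        self.series.lock().unwrap().len()
    }
}

/// The two per-database queue length metrics from this PR, modeled with the stand-in.
#[derive(Default)]
struct QueueMetrics {
    incoming: GaugeVecStandIn,
    outgoing: GaugeVecStandIn,
}

/// Hypothetical cleanup hook to call once a database has been deleted.
fn cleanup_on_delete_database(metrics: &QueueMetrics, db: &str) {
    let _ = metrics.incoming.remove_label_values(db);
    let _ = metrics.outgoing.remove_label_values(db);
}

/// Returns (incoming series before, incoming series after, outgoing series after).
fn example_delete() -> (usize, usize, usize) {
    let metrics = QueueMetrics::default();
    metrics.incoming.add("db_a", 2);
    metrics.incoming.add("db_b", 1);
    metrics.outgoing.add("db_a", 1);
    let before = metrics.incoming.series_count();
    cleanup_on_delete_database(&metrics, "db_a");
    (before, metrics.incoming.series_count(), metrics.outgoing.series_count())
}
```

Without such a hook, the series for a deleted database would keep being exported at its last value until the process restarts.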

gefjon · 2025-05-28T16:17:43Z

> @gefjon we should be removing these label values when a database is deleted, but currently we don't do that for any of our metrics, so I think it's fine to just leave this a TODO for now. Ultimately the logic will have to be duplicated in delete_database for both standalone and cloud.
>
> Can you create a ticket to track?

Opened #2807.

gefjon added 4 commits May 19, 2025 10:55

Add client_connection_outgoing_queue_length metric

4ec6c25

Like the previous metric added in this branch, it's per-client, so we'll use it for testing, but likely not merge it into master. I'll follow up in a separate PR with a version that's per-database instead.

Merge branch 'master' into phoebe/metric-per-database-client-queue-le…

fa266a2

…ngth Incl. resolving conflicts in client_connection.rs, moving the new metric into the `ClientConnectionMetrics` struct.

gefjon requested review from joshua-spacetime and jsdt May 22, 2025 15:52

joshua-spacetime linked an issue May 23, 2025 that may be closed by this pull request

Track the size of a client's send_message queue #2594

Closed

bfops added the release-any label (to be landed in any release window) May 27, 2025

joshua-spacetime approved these changes May 28, 2025

gefjon mentioned this pull request May 28, 2025

delete_database: clean up per-database metrics #2807

Open

gefjon added this pull request to the merge queue May 28, 2025

Merged via the queue into master with commit ac18790 May 28, 2025
22 checks passed

joshua-spacetime added a commit that referenced this pull request May 28, 2025

Revert "Per-database incoming and outgoing queue length metrics (#2773)"

e261ec2

This reverts commit ac18790.

joshua-spacetime added a commit that referenced this pull request May 28, 2025

Revert "Per-database incoming and outgoing queue length metrics (#2773)"

f79f21c

This reverts commit ac18790.
