Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

sled table rcgen has higher-than-expected numbers #5207

Closed
jgallagher opened this issue Mar 6, 2024 · 5 comments
Closed

sled table rcgen has higher-than-expected numbers #5207

jgallagher opened this issue Mar 6, 2024 · 5 comments
Assignees

Comments

@jgallagher
Copy link
Contributor

While testing #5124, we noticed some surprisingly high rcgen counts. madrid had just been wiped and gone through RSS about 90 minutes prior to running this query:

root@[fd00:1122:3344:103::3]:32221/omicron> select id,time_created,time_modified,rcgen,reservoir_size from sled;
                   id                  |         time_created          |         time_modified         | rcgen | reservoir_size
---------------------------------------+-------------------------------+-------------------------------+-------+-----------------
  446cc3c8-188c-484b-bf20-512faa2ac5ee | 2024-03-06 16:28:54.028114+00 | 2024-03-06 16:28:54.028114+00 |    39 |   869286281216
  bf3e8067-56b1-4d94-be13-d600b4ff8635 | 2024-03-06 16:28:53.387421+00 | 2024-03-06 16:28:53.387421+00 |    41 |   869286281216
  fad6e957-e63b-4962-8a08-5a0b300b0b55 | 2024-03-06 16:28:53.663936+00 | 2024-03-06 16:28:53.663936+00 |    42 |   869286281216
(3 rows)

When we added a new sled, its initial entry in the table also had a surprisingly high rcgen:

root@[fd00:1122:3344:103::3]:32221/omicron> select id,time_created,now()-time_created,time_modified,rcgen,reservoir_size from sled;
                   id                  |         time_created          |    ?column?     |         time_modified         | rcgen | reservoir_size
---------------------------------------+-------------------------------+-----------------+-------------------------------+-------+-----------------
  446cc3c8-188c-484b-bf20-512faa2ac5ee | 2024-03-06 16:28:54.028114+00 | 01:46:22.489399 | 2024-03-06 16:28:54.028114+00 |    39 |   869286281216
  bf3e8067-56b1-4d94-be13-d600b4ff8635 | 2024-03-06 16:28:53.387421+00 | 01:46:23.130092 | 2024-03-06 16:28:53.387421+00 |    41 |   869286281216
  cad2d871-967f-4820-82b6-afac96b406b2 | 2024-03-06 18:13:47.457219+00 | 00:01:29.060294 | 2024-03-06 18:13:47.457219+00 |    21 |              0
  fad6e957-e63b-4962-8a08-5a0b300b0b55 | 2024-03-06 16:28:53.663936+00 | 01:46:22.853577 | 2024-03-06 16:28:53.663936+00 |    42 |   869286281216
(4 rows)

When the reservoir size was updated (the whole point of #5124), we saw the new sled's rcgen get bumped by 1 as expected:

root@[fd00:1122:3344:103::3]:32221/omicron> select id,time_created,now()-time_created,time_modified,rcgen,reservoir_size from sled;
                   id                  |         time_created          |    ?column?     |         time_modified         | rcgen | reservoir_size
---------------------------------------+-------------------------------+-----------------+-------------------------------+-------+-----------------
  446cc3c8-188c-484b-bf20-512faa2ac5ee | 2024-03-06 16:28:54.028114+00 | 01:56:13.631615 | 2024-03-06 16:28:54.028114+00 |    39 |   869286281216
  bf3e8067-56b1-4d94-be13-d600b4ff8635 | 2024-03-06 16:28:53.387421+00 | 01:56:14.272308 | 2024-03-06 16:28:53.387421+00 |    41 |   869286281216
  cad2d871-967f-4820-82b6-afac96b406b2 | 2024-03-06 18:13:47.457219+00 | 00:11:20.20251  | 2024-03-06 18:15:50.843231+00 |    22 |   869286281216
  fad6e957-e63b-4962-8a08-5a0b300b0b55 | 2024-03-06 16:28:53.663936+00 | 01:56:13.995793 | 2024-03-06 16:28:53.663936+00 |    42 |   869286281216
(4 rows)

For comparison and to rule out #5124, we checked dogfood, and see very high rcgen counts there:

root@[fd00:1122:3344:105::3]:32221/omicron> select id,time_created,time_modified,rcgen,reservoir_size from sled;
                   id                  |         time_created          |         time_modified         | rcgen | reservoir_size
---------------------------------------+-------------------------------+-------------------------------+-------+-----------------
  0c7011f7-a4bf-4daf-90cc-1c2410103300 | 2023-08-30 18:57:23.548684+00 | 2024-03-05 16:21:56.628329+00 |   856 |   869286281216
  2707b587-9c7f-4fb0-a7af-37c3b7a9a0fa | 2023-08-30 18:57:23.584978+00 | 2024-03-05 16:22:17.043409+00 |   855 |   869286281216
  5f6720b8-8a31-45f8-8c94-8e699218f28b | 2023-08-30 18:57:22.55883+00  | 2024-03-05 16:23:51.653518+00 |   829 |   869286281216
  71def415-55ad-46b4-ba88-3ca55d7fb287 | 2023-08-30 18:57:23.870158+00 | 2024-03-05 16:24:24.33035+00  |   997 |   869286281216
  7b862eb6-7f50-4c2f-b9a6-0d12ac913d3c | 2023-08-30 18:57:23.910298+00 | 2024-03-05 16:24:05.101916+00 |   853 |   769642201088
  87c2c4fc-b0c7-4fef-a305-78f0ed265bbc | 2023-08-30 18:57:22.686369+00 | 2024-03-05 16:23:43.293687+00 |   854 |   869286281216
  a2adea92-b56e-44fc-8a0d-7d63b5fd3b93 | 2023-08-30 18:57:22.979457+00 | 2024-03-05 16:22:57.394274+00 |   878 |   869286281216
  b886b58a-1e3f-4be1-b9f2-0c2e66c6bc88 | 2023-08-30 18:57:24.302065+00 | 2024-03-05 16:23:40.617043+00 |   853 |   869286281216
  db183874-65b5-4263-a1c1-ddb2737ae0e9 | 2023-08-30 18:57:24.216094+00 | 2024-03-05 16:22:18.427762+00 |   854 |   869286281216
  dd83e75a-1edf-4aa1-89a0-cd8b2091a7cd | 2023-08-30 18:57:23.641123+00 | 2024-03-05 16:21:54.187908+00 |   830 |   659692716032
  f15774c1-b8e5-434f-a493-ec43f96cba06 | 2023-08-30 18:57:22.203013+00 | 2024-03-05 16:21:54.760708+00 |   830 |   869286281216
(11 rows)

There might not (?) be an issue here? But @andrewjstone is going to investigate.

@jgallagher
Copy link
Contributor Author

With a hat tip to @davepacheco for teaching me about AS OF SYSTEM TIME, we can watch the rcgen counts go up shortly after the sled table was populated. Here are the first 5 seconds as sleds show up:

root@[fd00:1122:3344:103::3]:32221/omicron> select id,time_created,time_modified,rcgen from sled as of system time '2024-03-06 16:28:53';
  id | time_created | time_modified | rcgen
-----+--------------+---------------+--------
(0 rows)


Time: 2ms total (execution 2ms / network 0ms)

root@[fd00:1122:3344:103::3]:32221/omicron> select id,time_created,time_modified,rcgen from sled as of system time '2024-03-06 16:28:54';
                   id                  |         time_created          |         time_modified         | rcgen
---------------------------------------+-------------------------------+-------------------------------+--------
  bf3e8067-56b1-4d94-be13-d600b4ff8635 | 2024-03-06 16:28:53.387421+00 | 2024-03-06 16:28:53.387421+00 |     3
  fad6e957-e63b-4962-8a08-5a0b300b0b55 | 2024-03-06 16:28:53.663936+00 | 2024-03-06 16:28:53.663936+00 |     1
(2 rows)


Time: 3ms total (execution 2ms / network 0ms)

root@[fd00:1122:3344:103::3]:32221/omicron> select id,time_created,time_modified,rcgen from sled as of system time '2024-03-06 16:28:55';
                   id                  |         time_created          |         time_modified         | rcgen
---------------------------------------+-------------------------------+-------------------------------+--------
  446cc3c8-188c-484b-bf20-512faa2ac5ee | 2024-03-06 16:28:54.028114+00 | 2024-03-06 16:28:54.028114+00 |     1
  bf3e8067-56b1-4d94-be13-d600b4ff8635 | 2024-03-06 16:28:53.387421+00 | 2024-03-06 16:28:53.387421+00 |     3
  fad6e957-e63b-4962-8a08-5a0b300b0b55 | 2024-03-06 16:28:53.663936+00 | 2024-03-06 16:28:53.663936+00 |     1
(3 rows)


Time: 2ms total (execution 2ms / network 0ms)

root@[fd00:1122:3344:103::3]:32221/omicron> select id,time_created,time_modified,rcgen from sled as of system time '2024-03-06 16:28:56';
                   id                  |         time_created          |         time_modified         | rcgen
---------------------------------------+-------------------------------+-------------------------------+--------
  446cc3c8-188c-484b-bf20-512faa2ac5ee | 2024-03-06 16:28:54.028114+00 | 2024-03-06 16:28:54.028114+00 |     1
  bf3e8067-56b1-4d94-be13-d600b4ff8635 | 2024-03-06 16:28:53.387421+00 | 2024-03-06 16:28:53.387421+00 |     3
  fad6e957-e63b-4962-8a08-5a0b300b0b55 | 2024-03-06 16:28:53.663936+00 | 2024-03-06 16:28:53.663936+00 |     3
(3 rows)


Time: 2ms total (execution 2ms / network 0ms)

root@[fd00:1122:3344:103::3]:32221/omicron> select id,time_created,time_modified,rcgen from sled as of system time '2024-03-06 16:28:57';
                   id                  |         time_created          |         time_modified         | rcgen
---------------------------------------+-------------------------------+-------------------------------+--------
  446cc3c8-188c-484b-bf20-512faa2ac5ee | 2024-03-06 16:28:54.028114+00 | 2024-03-06 16:28:54.028114+00 |     1
  bf3e8067-56b1-4d94-be13-d600b4ff8635 | 2024-03-06 16:28:53.387421+00 | 2024-03-06 16:28:53.387421+00 |     3
  fad6e957-e63b-4962-8a08-5a0b300b0b55 | 2024-03-06 16:28:53.663936+00 | 2024-03-06 16:28:53.663936+00 |     4
(3 rows)


Time: 3ms total (execution 2ms / network 0ms)

It looks like they continued to increase until about 40 seconds after they were created:

root@[fd00:1122:3344:103::3]:32221/omicron> select id,time_created,time_modified,rcgen from sled as of system time '2024-03-06 16:29:10';
                   id                  |         time_created          |         time_modified         | rcgen
---------------------------------------+-------------------------------+-------------------------------+--------
  446cc3c8-188c-484b-bf20-512faa2ac5ee | 2024-03-06 16:28:54.028114+00 | 2024-03-06 16:28:54.028114+00 |    15
  bf3e8067-56b1-4d94-be13-d600b4ff8635 | 2024-03-06 16:28:53.387421+00 | 2024-03-06 16:28:53.387421+00 |    22
  fad6e957-e63b-4962-8a08-5a0b300b0b55 | 2024-03-06 16:28:53.663936+00 | 2024-03-06 16:28:53.663936+00 |    23
(3 rows)


Time: 2ms total (execution 2ms / network 0ms)

root@[fd00:1122:3344:103::3]:32221/omicron> select id,time_created,time_modified,rcgen from sled as of system time '2024-03-06 16:29:20';
                   id                  |         time_created          |         time_modified         | rcgen
---------------------------------------+-------------------------------+-------------------------------+--------
  446cc3c8-188c-484b-bf20-512faa2ac5ee | 2024-03-06 16:28:54.028114+00 | 2024-03-06 16:28:54.028114+00 |    21
  bf3e8067-56b1-4d94-be13-d600b4ff8635 | 2024-03-06 16:28:53.387421+00 | 2024-03-06 16:28:53.387421+00 |    25
  fad6e957-e63b-4962-8a08-5a0b300b0b55 | 2024-03-06 16:28:53.663936+00 | 2024-03-06 16:28:53.663936+00 |    25
(3 rows)


Time: 3ms total (execution 2ms / network 0ms)

root@[fd00:1122:3344:103::3]:32221/omicron> select id,time_created,time_modified,rcgen from sled as of system time '2024-03-06 16:29:30';
                   id                  |         time_created          |         time_modified         | rcgen
---------------------------------------+-------------------------------+-------------------------------+--------
  446cc3c8-188c-484b-bf20-512faa2ac5ee | 2024-03-06 16:28:54.028114+00 | 2024-03-06 16:28:54.028114+00 |    22
  bf3e8067-56b1-4d94-be13-d600b4ff8635 | 2024-03-06 16:28:53.387421+00 | 2024-03-06 16:28:53.387421+00 |    25
  fad6e957-e63b-4962-8a08-5a0b300b0b55 | 2024-03-06 16:28:53.663936+00 | 2024-03-06 16:28:53.663936+00 |    25
(3 rows)


root@[fd00:1122:3344:103::3]:32221/omicron> select id,time_created,time_modified,rcgen from sled as of system time '2024-03-06 16:29:35';
                   id                  |         time_created          |         time_modified         | rcgen
---------------------------------------+-------------------------------+-------------------------------+--------
  446cc3c8-188c-484b-bf20-512faa2ac5ee | 2024-03-06 16:28:54.028114+00 | 2024-03-06 16:28:54.028114+00 |    39
  bf3e8067-56b1-4d94-be13-d600b4ff8635 | 2024-03-06 16:28:53.387421+00 | 2024-03-06 16:28:53.387421+00 |    41
  fad6e957-e63b-4962-8a08-5a0b300b0b55 | 2024-03-06 16:28:53.663936+00 | 2024-03-06 16:28:53.663936+00 |    42
(3 rows)

@andrewjstone
Copy link
Contributor

andrewjstone commented Mar 6, 2024

As @jgallagher and I discovered, the unexpected bumps in rcgen are due to the to the DatastoreCollectionConfig extension trait used for every child resource of a collection to prevent the "dueling administrators" problem from RFD 192. The justifications in there make sense, and the implementation seems correct with regard to that description. In other words, this doesn't appear to be a "bug" in the strict sense of the word.

On the other hand, I found the implicit behavior here confusing, despite having reviewed RFD 192 (2+ years ago). I think the extension trait makes sense for getting this right, and not having to re-implement the wheel for every child collection. I don't actually believe we should change any of this after thinking about it for a while.

I do wonder though if we are using rcgen in some places where we shouldn't be. For instance, my change in #5124 bumps the generation number on every upsert (in a safe OCC manner - since it now checks rcgen and re-reads on any failures). But this is for modifications to the actual sled by sled-agent and not children. Therefore maybe it should be a separate sled-agent-gen column. What do you think @davepacheco?

@andrewjstone
Copy link
Contributor

andrewjstone commented Mar 6, 2024

One other thing about this, is that @sunshowers has talked about changing the time_deleted, to time_decommissioned field. Since this is not a type-checked thing, we have to remember to update the DatastoreCollectionConfig for sled if and when that happens. It's the behind the scenes magic where rust-analyzer cannot help which had me confused. Again, this is not to say we should necessarily change any of this. It's just less than ideal.

@andrewjstone
Copy link
Contributor

andrewjstone commented Mar 6, 2024

One other thing about this, is that @sunshowers has talked about changing the time_deleted, to time_decommissioned field. Since this is not a type-checked thing, we have to remember to update the DatastoreCollectionConfig for sled if and when that happens. It's the behind the scenes magic where rust-analyzer cannot help which had me confused. Again, this is not to say we should necessarily change any of this. It's just less than ideal.

@jgallagher pointed out that as long as time_deleted is renamed and we don't add a new field for time_decommissioned, then this is type checked, because dsl::time_deleted wouldn't compile. We just have to make sure it's a rename :)

@andrewjstone
Copy link
Contributor

I'm going to close this as not a bug, as we got it sorted. I'm also going to change #5124 to use a separate generation number.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants