
[reconfigurator] Retrieve keeper lgif information #6549

Merged · 20 commits into oxidecomputer:main from keeper-cluster-membership on Sep 19, 2024

Conversation

@karencfv (Contributor) commented Sep 10, 2024

Overview

This commit introduces a new clickhouse-admin API endpoint: /keeper/lgif.

This endpoint uses the ClickHouse CLI internally to retrieve and parse the logically grouped information file from the ClickHouse keepers.

Purpose

Reconfigurator will need this information to reliably manage and operate a ClickHouse replicated cluster. Additional endpoints to retrieve other information from ClickHouse servers or keepers will be added in follow up PRs.

Testing

In addition to the unit tests, I have manually tested with the following results:

$ cargo run --bin=clickhouse-admin -- run -c ./smf/clickhouse-admin/config.toml -a [::1]:8888 -l [::1]:20001 -b /Users/karcar/src/omicron/out/clickhouse/clickhouse
   Compiling omicron-clickhouse-admin v0.1.0 (/Users/karcar/src/omicron/clickhouse-admin)
    Finished `dev` profile [unoptimized + debuginfo] target(s) in 2.46s
     Running `target/debug/clickhouse-admin run -c ./smf/clickhouse-admin/config.toml -a '[::1]:8888' -l '[::1]:20001' -b /Users/karcar/src/omicron/out/clickhouse/clickhouse`
note: configured to log to "/dev/stdout"
{"msg":"listening","v":0,"name":"clickhouse-admin","level":30,"time":"2024-09-12T02:37:19.383597Z","hostname":"ixchel","pid":3115,"local_addr":"[::1]:8888","component":"dropshot","file":"/Users/karcar/.cargo/git/checkouts/dropshot-a4a923d29dccc492/06c8dab/dropshot/src/server.rs:205"}
{"msg":"accepted connection","v":0,"name":"clickhouse-admin","level":30,"time":"2024-09-12T02:37:23.843325Z","hostname":"ixchel","pid":3115,"local_addr":"[::1]:8888","component":"dropshot","file":"/Users/karcar/.cargo/git/checkouts/dropshot-a4a923d29dccc492/06c8dab/dropshot/src/server.rs:775","remote_addr":"[::1]:54455"}
{"msg":"request completed","v":0,"name":"clickhouse-admin","level":30,"time":"2024-09-12T02:37:24.302588Z","hostname":"ixchel","pid":3115,"uri":"/keeper/lgif","method":"GET","req_id":"64b232d0-d6ac-4cae-8f0a-f14cf6d1dfba","remote_addr":"[::1]:54455","local_addr":"[::1]:8888","component":"dropshot","file":"/Users/karcar/.cargo/git/checkouts/dropshot-a4a923d29dccc492/06c8dab/dropshot/src/server.rs:914","latency_us":458301,"response_code":"200"}
$ curl http://[::1]:8888/keeper/lgif
{"first_log_idx":1,"first_log_term":1,"last_log_idx":11717,"last_log_term":20,"last_committed_log_idx":11717,"leader_committed_log_idx":11717,"target_committed_log_idx":11717,"last_snapshot_idx":9465}
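One plausible reading of the fields in that response, sketched as a std-only Rust snippet. The struct and the caught-up rule here are illustrative, not code from the PR; only the field names are taken from the JSON above:

```rust
// Field names mirror the JSON response above; the struct and the
// caught-up rule are illustrative, not the PR's actual types.
struct KeeperLgif {
    last_log_idx: u64,
    last_committed_log_idx: u64,
    leader_committed_log_idx: u64,
}

// One reading: a keeper is fully caught up when everything in its log is
// committed and it has seen everything the leader has committed.
fn is_caught_up(l: &KeeperLgif) -> bool {
    l.last_committed_log_idx == l.last_log_idx
        && l.last_committed_log_idx >= l.leader_committed_log_idx
}
```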

Related: #5999

@karencfv karencfv marked this pull request as ready for review September 12, 2024 02:48
@karencfv karencfv changed the title [reconfigurator] WIP: Retrieve keeper information [reconfigurator] Retrieve keeper information Sep 12, 2024
@karencfv karencfv changed the title [reconfigurator] Retrieve keeper information [reconfigurator] Retrieve keeper lgif information Sep 12, 2024
@andrewjstone (Contributor) left a comment


Hey Karen, I took a quick look but I have to run. More feedback coming later.

clickhouse-admin/src/clickhouse_cli.rs (resolved)
clickhouse-admin/src/clickhouse_cli.rs (resolved)
clickhouse-admin/types/src/lib.rs (resolved)
clickhouse-admin/types/src/lib.rs (resolved)
clickhouse-admin/types/src/lib.rs (resolved)
@andrewjstone (Contributor) left a comment


Ugh, sorry @karencfv. I totally did this review on Friday and forgot to send it! My goal was to get this out for you before the weekend.

@@ -50,4 +52,15 @@ pub trait ClickhouseAdminApi {
rqctx: RequestContext<Self::Context>,
body: TypedBody<KeeperConfigurableSettings>,
) -> Result<HttpResponseCreated<KeeperConfig>, HttpError>;

/// Retrieve a logically grouped information file from a keeper node.
Contributor


I think this command may be good for debugging/introspection aside from the inventory we need for reconfiguration and we should keep it.

I just want to point out that for inventory we'll likely not use this API command directly, but rather use the underlying keeper request along with another request to /keeper/config to get the current committed configuration. That will eliminate an extra network round trip.

Unfortunately, the keeper doesn't seem to provide a way to get the configuration and its log index together in one call. If it did, we could use that for inventory. I think that would be a good patch to make to the keeper and the keeper CLI, but until then we'll need to make two separate calls. And because of the inherent race condition between two calls, we'll likely have to call at least one endpoint twice to verify that the config hasn't changed while polling. We'd want to do that local to the sled-agent, without a round trip from Nexus.
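The "read config, read lgif, re-read config" stability check described here can be sketched generically. Everything below is hypothetical: the closures stand in for the two HTTP calls, and none of these names come from the PR:

```rust
// Hypothetical sketch of the double-read check: take an inventory snapshot
// only if the keeper config was identical before and after reading lgif,
// retrying a bounded number of times if a config change raced with us.
fn inventory_snapshot<C: Eq + Clone, L>(
    get_config: impl Fn() -> C,
    get_lgif: impl Fn() -> L,
    max_attempts: usize,
) -> Option<(C, L)> {
    for _ in 0..max_attempts {
        let before = get_config();
        let lgif = get_lgif();
        let after = get_config();
        // Only trust the pair if the config was stable across both reads.
        if before == after {
            return Some((after, lgif));
        }
    }
    None
}
```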

Contributor Author


Ah, yeah, definitely! I realise there would be another endpoint to collect all the information we need for inventory itself. My reasoning was that the lgif command is useful for many things, and in the spirit of keeping PRs as small and easily digestible as possible, I was going to make a separate PR with that endpoint. In hindsight, I should have written that in the PR's description. Sorry about that!

let args: Vec<&OsStr> = command.as_std().get_args().collect();
let args_parsed: Vec<String> = args
.iter()
.map(|&os_str| os_str.to_str().unwrap().to_owned())
Contributor


Sorry if I wasn't clear. I don't think we should panic here if the string isn't UTF-8. Instead we should return an error, especially since this function already returns a Result.

Contributor


Sorry again. For some reason your comments didn't show up on GitHub last I looked. I realize this is unwrapping now because it's arguments we pass. Nonetheless, I'd rather be on the safe side and return an error if possible.
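A minimal sketch of the error-returning variant being suggested at this point in the thread, instead of `to_str().unwrap()`. The function name is illustrative and the error type is simplified to a `String` rather than the PR's `ClickhouseCliError`:

```rust
use std::ffi::OsStr;

// Convert command arguments to Strings, returning an error instead of
// panicking if any argument is not valid UTF-8.
fn args_to_strings<'a>(
    args: impl IntoIterator<Item = &'a OsStr>,
) -> Result<Vec<String>, String> {
    args.into_iter()
        .map(|os| {
            os.to_str()
                .map(|s| s.to_string())
                .ok_or_else(|| format!("non-UTF-8 argument: {:?}", os))
        })
        .collect()
}
```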

Contributor Author


So, to_str doesn't really return an error, it returns an Option; is the intention to return an error on None? The only purpose of args_parsed is to provide information for the ClickhouseCliError::Run error. That's why I thought to_string_lossy made sense: that way we can see in the error exactly what arguments are being passed. So, in a way, I was already returning an error?

Maybe I'm misunderstanding, but it feels a bit strange to return an error because of a malformed error message. It seems simpler to collect all the arguments, however they were passed, and return a single error with that information, like it was in the beginning with to_string_lossy. WDYT?

Contributor Author


TBH, we don't even need args_parsed at all. I just thought it'd be useful to have them as part of the error message when returning an error, so a user could see exactly what arguments had been passed when running the command. The more I think about it, the more I think it's best to use to_string_lossy so we see exactly what's being passed, even if a non-Unicode character somehow made its way in.

Unless this information isn't useful as part of the error message? In that case I could remove them altogether 🤔

@andrewjstone (Contributor) commented Sep 17, 2024


Ok, looks like I totally misinterpreted what was happening here. Sorry about that @karencfv! I didn't realize this was just for the error message. You were right to use to_string_lossy in the first place. It's useful information to have, even if some weird non-UTF-8 char gets excluded, which is doubtful anyway. That's better than panicking on unwrap. Sorry for the hassle.
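For reference, a sketch of the `to_string_lossy` approach the thread settles on, where the arguments only feed an error message and lossy conversion is acceptable. The function name is illustrative, not the PR's code:

```rust
use std::ffi::OsStr;

// Join command arguments into a single displayable string for an error
// message. Non-UTF-8 bytes are replaced (U+FFFD) rather than causing a
// panic or an error, since this string is only informational.
fn args_for_error_message<'a>(args: impl IntoIterator<Item = &'a OsStr>) -> String {
    args.into_iter()
        .map(|os| os.to_string_lossy().into_owned())
        .collect::<Vec<_>>()
        .join(" ")
}
```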

Contributor Author


😄 nw! I've changed the names of the variables so it's clearer what they're for.

clickhouse-admin/src/clickhouse_cli.rs (resolved)
}
}

pub fn parse(log: &Logger, data: &[u8]) -> Result<Self> {
Contributor


This implementation is very robust :)

However, I'm not sure how necessary it is to use a macro and all this work to keep track. Since all the fields are Options, you can initialize the defaults, parse each line, and match on the key strings, ignoring ones that don't match. You can set a boolean for "any fields changed", or just test them all at the end.

It may be worthwhile instead to simplify this down to an in-order expected parsing, and then just run the command against an actual keeper to see what it returns. That way, if the keeper's output changes on upgrade, we'll know before we break our deployments. This will catch things like key name changes, which we'll want to fix. You should be able to spin up a keeper for this type of testing via clickward, as we do in the oximeter db integration tests.

FWIW, I don't expect much to change about the output, as it's pretty simple in ClickHouse, but we'd rather catch actual changes than treat missing fields as None in production. If you know exactly what names to expect, you can also just make every field a u64 and get rid of the Options altogether.
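A minimal sketch of the simpler scheme being suggested: fixed u64 fields, match on key strings, and error out on anything missing. The key names follow the lgif output shown in the PR description (trimmed to two fields here), but the code itself is illustrative, not the PR's implementation:

```rust
// Illustrative parse of the keeper's `key<TAB>value` lgif output into a
// struct with plain u64 fields; missing or malformed entries are errors.
#[derive(Debug, PartialEq)]
struct Lgif {
    first_log_idx: u64,
    last_log_idx: u64,
}

fn parse_lgif(data: &str) -> Result<Lgif, String> {
    let mut first_log_idx = None;
    let mut last_log_idx = None;
    for line in data.lines() {
        // Skip lines that don't look like "key\tvalue".
        let Some((key, value)) = line.split_once('\t') else { continue };
        let parsed: u64 = value
            .trim()
            .parse()
            .map_err(|e| format!("bad value for {key}: {e}"))?;
        match key {
            "first_log_idx" => first_log_idx = Some(parsed),
            "last_log_idx" => last_log_idx = Some(parsed),
            _ => (), // ignore keys this sketch doesn't track
        }
    }
    Ok(Lgif {
        first_log_idx: first_log_idx.ok_or("missing first_log_idx")?,
        last_log_idx: last_log_idx.ok_or("missing last_log_idx")?,
    })
}
```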

Contributor Author


Hm, ok, let me play around with this a bit and simplify it :D

Contributor


Thank you :)

Contributor Author


I changed the code to return errors instead of trying to fill in as much of the response as it could.

I tried to get rid of the macro, but hear me out 😄. The moment I pass the raw output through .lines(), the order of the items gets completely changed, which means in-order expected parsing could be error prone. On the other hand, the macro gives us assurance that a field of the struct will only be populated if the struct field name is exactly the same as the key from the command output. I can't really think of a safer way to do this, but open to suggestions!

Contributor


The moment I pass the raw output through .lines() the order of the items gets completely changed

This is not how lines() works. It's a lazy iterator like all others and returns one line at a time in the current order. Are you sure the change in order isn't because you're inserting into a HashMap to collect the lines? There's also a sort below that will change the order. To keep the same order you probably want to collect into a Vec instead.
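A small demonstration of the point about lines(): it yields lines in source order, and any reordering comes from the collection the results are put into. The helper name is illustrative:

```rust
// Extract the keys from "key\tvalue" lines; lines() yields them lazily,
// in source order, and collecting into a Vec preserves that order.
// (Collecting into a HashMap would lose it.)
fn keys_in_order(raw: &str) -> Vec<&str> {
    raw.lines()
        .filter_map(|line| line.split_once('\t').map(|(key, _)| key))
        .collect()
}
```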

@karencfv (Contributor, Author) left a comment


Thanks for the thorough review! I think I've addressed your concerns; let me know what you think!


@karencfv (Contributor, Author) left a comment


Thanks for taking the time to chat with me about this PR @andrewjstone :) I've added the integration test as agreed.

@karencfv (Contributor, Author) left a comment


Thanks for taking the time to catch up, @andrewjstone! I've made the changes we talked about :) Please let me know if there's anything missing.

@andrewjstone (Contributor) left a comment


Thanks for all the cleanup @karencfv! Looks great.

The failing test looks unrelated, but we should probably open an issue for it. I'm going to take a closer look and will do that.

@andrewjstone (Contributor)

The test failure appears to be a known flake: #4949

@karencfv (Contributor, Author)

The test failure appears to be a known flake: #4949

Ah! Good to know, thanks for looking into that.

@karencfv karencfv merged commit c7f5a11 into oxidecomputer:main Sep 19, 2024
18 checks passed
@karencfv karencfv deleted the keeper-cluster-membership branch September 19, 2024 04:36