
[reconfigurator] Retrieve keeper lgif information #6549

Merged · 20 commits into oxidecomputer:main from keeper-cluster-membership on Sep 19, 2024

Conversation

@karencfv (Contributor) commented Sep 10, 2024

Overview

This commit introduces a new clickhouse-admin API endpoint: /keeper/lgif.

This endpoint uses the ClickHouse CLI internally to retrieve and parse the logically grouped information file from the ClickHouse keepers.

Purpose

Reconfigurator will need this information to reliably manage and operate a ClickHouse replicated cluster. Additional endpoints to retrieve other information from ClickHouse servers or keepers will be added in follow up PRs.

Testing

In addition to the unit tests, I have manually tested with the following results:

$ cargo run --bin=clickhouse-admin -- run -c ./smf/clickhouse-admin/config.toml -a [::1]:8888 -l [::1]:20001 -b /Users/karcar/src/omicron/out/clickhouse/clickhouse
   Compiling omicron-clickhouse-admin v0.1.0 (/Users/karcar/src/omicron/clickhouse-admin)
    Finished `dev` profile [unoptimized + debuginfo] target(s) in 2.46s
     Running `target/debug/clickhouse-admin run -c ./smf/clickhouse-admin/config.toml -a '[::1]:8888' -l '[::1]:20001' -b /Users/karcar/src/omicron/out/clickhouse/clickhouse`
note: configured to log to "/dev/stdout"
{"msg":"listening","v":0,"name":"clickhouse-admin","level":30,"time":"2024-09-12T02:37:19.383597Z","hostname":"ixchel","pid":3115,"local_addr":"[::1]:8888","component":"dropshot","file":"/Users/karcar/.cargo/git/checkouts/dropshot-a4a923d29dccc492/06c8dab/dropshot/src/server.rs:205"}
{"msg":"accepted connection","v":0,"name":"clickhouse-admin","level":30,"time":"2024-09-12T02:37:23.843325Z","hostname":"ixchel","pid":3115,"local_addr":"[::1]:8888","component":"dropshot","file":"/Users/karcar/.cargo/git/checkouts/dropshot-a4a923d29dccc492/06c8dab/dropshot/src/server.rs:775","remote_addr":"[::1]:54455"}
{"msg":"request completed","v":0,"name":"clickhouse-admin","level":30,"time":"2024-09-12T02:37:24.302588Z","hostname":"ixchel","pid":3115,"uri":"/keeper/lgif","method":"GET","req_id":"64b232d0-d6ac-4cae-8f0a-f14cf6d1dfba","remote_addr":"[::1]:54455","local_addr":"[::1]:8888","component":"dropshot","file":"/Users/karcar/.cargo/git/checkouts/dropshot-a4a923d29dccc492/06c8dab/dropshot/src/server.rs:914","latency_us":458301,"response_code":"200"}
$ curl http://[::1]:8888/keeper/lgif
{"first_log_idx":1,"first_log_term":1,"last_log_idx":11717,"last_log_term":20,"last_committed_log_idx":11717,"leader_committed_log_idx":11717,"target_committed_log_idx":11717,"last_snapshot_idx":9465}
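One plausible reading of the fields in that response, sketched as a std-only Rust snippet. The struct and the caught-up rule here are illustrative, not code from the PR; only the field names are taken from the JSON above:

```rust
// Field names mirror the JSON response above; the struct and the
// caught-up rule are illustrative, not the PR's actual types.
struct KeeperLgif {
    last_log_idx: u64,
    last_committed_log_idx: u64,
    leader_committed_log_idx: u64,
}

// One reading: a keeper is fully caught up when everything in its log is
// committed and it has seen everything the leader has committed.
fn is_caught_up(l: &KeeperLgif) -> bool {
    l.last_committed_log_idx == l.last_log_idx
        && l.last_committed_log_idx >= l.leader_committed_log_idx
}
```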

Related: #5999

@karencfv karencfv marked this pull request as ready for review September 12, 2024 02:48
@karencfv karencfv changed the title [reconfigurator] WIP: Retrieve keeper information [reconfigurator] Retrieve keeper information Sep 12, 2024
@karencfv karencfv changed the title [reconfigurator] Retrieve keeper information [reconfigurator] Retrieve keeper lgif information Sep 12, 2024
@andrewjstone (Contributor) left a comment


Hey Karen, I took a quick look but I have to run. More feedback coming later.

clickhouse-admin/src/clickhouse_cli.rs (resolved)
clickhouse-admin/src/clickhouse_cli.rs (resolved)
clickhouse-admin/types/src/lib.rs (resolved)
clickhouse-admin/types/src/lib.rs (resolved)
clickhouse-admin/types/src/lib.rs (resolved)
@andrewjstone (Contributor) left a comment


Ugh, sorry @karencfv. I totally did this review on Friday and forgot to send it! My goal was to get this out for you before the weekend.

@@ -50,4 +52,15 @@ pub trait ClickhouseAdminApi {
rqctx: RequestContext<Self::Context>,
body: TypedBody<KeeperConfigurableSettings>,
) -> Result<HttpResponseCreated<KeeperConfig>, HttpError>;

/// Retrieve a logically grouped information file from a keeper node.
Contributor


I think this command may be good for debugging/introspection aside from the inventory we need for reconfiguration and we should keep it.

I just want to point out that for inventory we'll likely not use this API command directly, but rather use the underlying keeper request along with another request to /keeper/config to get the current committed configuration. That will eliminate an extra network round trip.

Unfortunately, the keeper doesn't seem to provide a way to get the configuration and its log index together in one call. If it did, we could use that for inventory. I think that would be a good patch to make to the keeper and the keeper CLI, but until then we'll need to make two separate calls. And because of the inherent race condition between two calls, we'll likely have to call at least one endpoint twice to verify that the config hasn't changed while polling. We'd want to do that local to the sled-agent, without a round trip from Nexus.
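The "read config, read lgif, re-read config" stability check described here can be sketched generically. Everything below is hypothetical: the closures stand in for the two HTTP calls, and none of these names come from the PR:

```rust
// Hypothetical sketch of the double-read check: take an inventory snapshot
// only if the keeper config was identical before and after reading lgif,
// retrying a bounded number of times if a config change raced with us.
fn inventory_snapshot<C: Eq + Clone, L>(
    get_config: impl Fn() -> C,
    get_lgif: impl Fn() -> L,
    max_attempts: usize,
) -> Option<(C, L)> {
    for _ in 0..max_attempts {
        let before = get_config();
        let lgif = get_lgif();
        let after = get_config();
        // Only trust the pair if the config was stable across both reads.
        if before == after {
            return Some((after, lgif));
        }
    }
    None
}
```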

Contributor Author


Ah, yeah, definitely! I realise there would be another endpoint to collect all the information we need for inventory itself. My reasoning was that the lgif command is useful for many things, and in the spirit of keeping PRs as small and easily digestible as possible, I was going to make a separate PR with that endpoint. In hindsight, I should have written that in the PR's description. Sorry about that!

let args: Vec<&OsStr> = command.as_std().get_args().collect();
let args_parsed: Vec<String> = args
.iter()
.map(|&os_str| os_str.to_str().unwrap().to_owned())
Contributor


Sorry if I wasn't clear. I don't think we should panic here if the string isn't UTF-8. Instead we should return an error, especially since this function already returns a Result.

Contributor


Sorry again. For some reason your comments didn't show up on GitHub last I looked. I realize this is unwrapping now because it's arguments we pass. Nonetheless, I'd rather be on the safe side and return an error if possible.
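A minimal sketch of the error-returning variant being suggested at this point in the thread, instead of `to_str().unwrap()`. The function name is illustrative and the error type is simplified to a `String` rather than the PR's `ClickhouseCliError`:

```rust
use std::ffi::OsStr;

// Convert command arguments to Strings, returning an error instead of
// panicking if any argument is not valid UTF-8.
fn args_to_strings<'a>(
    args: impl IntoIterator<Item = &'a OsStr>,
) -> Result<Vec<String>, String> {
    args.into_iter()
        .map(|os| {
            os.to_str()
                .map(|s| s.to_string())
                .ok_or_else(|| format!("non-UTF-8 argument: {:?}", os))
        })
        .collect()
}
```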

Contributor Author


So, to_str doesn't really return an error, it returns an Option; is the intention to return an error on None? The only purpose of args_parsed is to provide information for the ClickhouseCliError::Run error. That's why I thought to_string_lossy made sense: that way we can see in the error exactly what arguments are being passed. So, in a way, I was already returning an error?

Maybe I'm misunderstanding, but it feels a bit strange to return an error because of a malformed error message. It seems simpler to collect all the arguments, however they were passed, and return a single error with that information, like it was in the beginning with to_string_lossy. WDYT?

Contributor Author


TBH, we don't even need args_parsed at all. I just thought it'd be useful to have them as part of the error message when returning an error, so a user could see exactly what arguments had been passed when running the command. The more I think about it, the more I think it's best to use to_string_lossy so we see exactly what's being passed, even if a non-Unicode character somehow made its way in.

Unless this information isn't useful as part of the error message? In that case I could remove them altogether 🤔

@andrewjstone (Contributor) commented Sep 17, 2024


Ok, looks like I totally misinterpreted what was happening here. Sorry about that @karencfv! I didn't realize this was just for the error message. You were right to use to_string_lossy in the first place. It's useful information to have, even if some weird non-UTF-8 char gets excluded, which is doubtful anyway. That's better than panicking on unwrap. Sorry for the hassle.
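For reference, a sketch of the `to_string_lossy` approach the thread settles on, where the arguments only feed an error message and lossy conversion is acceptable. The function name is illustrative, not the PR's code:

```rust
use std::ffi::OsStr;

// Join command arguments into a single displayable string for an error
// message. Non-UTF-8 bytes are replaced (U+FFFD) rather than causing a
// panic or an error, since this string is only informational.
fn args_for_error_message<'a>(args: impl IntoIterator<Item = &'a OsStr>) -> String {
    args.into_iter()
        .map(|os| os.to_string_lossy().into_owned())
        .collect::<Vec<_>>()
        .join(" ")
}
```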

Contributor Author


😄 nw! I've changed the names of the variables so it's clearer what they're for.

clickhouse-admin/src/clickhouse_cli.rs (resolved)
}
}

pub fn parse(log: &Logger, data: &[u8]) -> Result<Self> {
Contributor


This implementation is very robust :)

However, I'm not sure how necessary it is to use a macro and all this work to keep track. Since all the fields are Options, you can initialize the defaults, parse each line, and match on the key strings, ignoring ones that don't match. You can set a boolean for "any fields changed", or just test them all at the end.

It may be worthwhile instead to simplify this down to an in-order expected parsing, and then just run the command against an actual keeper to see what it returns. That way, if the keeper's output changes on upgrade, we'll know before we break our deployments. This will catch things like key name changes, which we'll want to fix. You should be able to spin up a keeper for this type of testing via clickward, as we do in the oximeter db integration tests.

FWIW, I don't expect much to change about the output, as it's pretty simple in ClickHouse, but we'd rather catch actual changes than treat missing fields as None in production. If you know exactly what names to expect, you can also just make every field a u64 and get rid of the Options altogether.
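A minimal sketch of the simpler scheme being suggested: fixed u64 fields, match on key strings, and error out on anything missing. The key names follow the lgif output shown in the PR description (trimmed to two fields here), but the code itself is illustrative, not the PR's implementation:

```rust
// Illustrative parse of the keeper's `key<TAB>value` lgif output into a
// struct with plain u64 fields; missing or malformed entries are errors.
#[derive(Debug, PartialEq)]
struct Lgif {
    first_log_idx: u64,
    last_log_idx: u64,
}

fn parse_lgif(data: &str) -> Result<Lgif, String> {
    let mut first_log_idx = None;
    let mut last_log_idx = None;
    for line in data.lines() {
        // Skip lines that don't look like "key\tvalue".
        let Some((key, value)) = line.split_once('\t') else { continue };
        let parsed: u64 = value
            .trim()
            .parse()
            .map_err(|e| format!("bad value for {key}: {e}"))?;
        match key {
            "first_log_idx" => first_log_idx = Some(parsed),
            "last_log_idx" => last_log_idx = Some(parsed),
            _ => (), // ignore keys this sketch doesn't track
        }
    }
    Ok(Lgif {
        first_log_idx: first_log_idx.ok_or("missing first_log_idx")?,
        last_log_idx: last_log_idx.ok_or("missing last_log_idx")?,
    })
}
```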

Contributor Author


Hm, ok, let me play around with this a bit and simplify it :D

Contributor


Thank you :)

Contributor Author


I changed the code to return errors instead of trying to fill in as much of the response as it could.

I tried to get rid of the macro, but hear me out 😄. The moment I pass the raw output through .lines(), the order of the items gets completely changed, which means in-order expected parsing could be error prone. On the other hand, the macro gives us assurance that a field of the struct will only be populated if the struct field name is exactly the same as the key from the command output. I can't really think of a safer way to do this, but open to suggestions!

Contributor


The moment I pass the raw output through .lines() the order of the items gets completely changed

This is not how lines() works. It's a lazy iterator like all others and returns one line at a time in the current order. Are you sure the change in order isn't because you're inserting into a HashMap to collect the lines? There's also a sort below that will change the order. To keep the same order you probably want to collect into a Vec instead.
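A small demonstration of the point about lines(): it yields lines in source order, and any reordering comes from the collection the results are put into. The helper name is illustrative:

```rust
// Extract the keys from "key\tvalue" lines; lines() yields them lazily,
// in source order, and collecting into a Vec preserves that order.
// (Collecting into a HashMap would lose it.)
fn keys_in_order(raw: &str) -> Vec<&str> {
    raw.lines()
        .filter_map(|line| line.split_once('\t').map(|(key, _)| key))
        .collect()
}
```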

@karencfv (Contributor, Author) left a comment


Thanks for the thorough review! I think I've addressed your concerns; let me know what you think!


@karencfv (Contributor, Author) left a comment


Thanks for taking the time to chat with me about this PR @andrewjstone :) I've added the integration test as agreed.

@karencfv (Contributor, Author) left a comment


Thanks for taking the time to catch up, @andrewjstone! I've made the changes we talked about :) Please let me know if there's anything missing.

@andrewjstone (Contributor) left a comment


Thanks for all the cleanup @karencfv! Looks great.

The failing test looks unrelated, but we should probably open an issue for it. I'm going to take a closer look and will do that.

@andrewjstone (Contributor)

The test failure appears to be a known flake: #4949

@karencfv (Contributor, Author)

The test failure appears to be a known flake: #4949

Ah! Good to know, thanks for looking into that.

@karencfv karencfv merged commit c7f5a11 into oxidecomputer:main Sep 19, 2024
18 checks passed
@karencfv karencfv deleted the keeper-cluster-membership branch September 19, 2024 04:36