[reconfigurator] Retrieve keeper lgif information #6549
Conversation
Hey Karen, I took a quick look but I have to run. More feedback coming later.
Ugh, sorry @karencfv. I totally did this review on Friday and forgot to send it! My goal was to get this out for you before the weekend.
@@ -50,4 +52,15 @@ pub trait ClickhouseAdminApi {
        rqctx: RequestContext<Self::Context>,
        body: TypedBody<KeeperConfigurableSettings>,
    ) -> Result<HttpResponseCreated<KeeperConfig>, HttpError>;

    /// Retrieve a logically grouped information file from a keeper node.
I think this command may be good for debugging/introspection aside from the inventory we need for reconfiguration and we should keep it.
I just want to point out that for inventory we'll likely not use this API command directly, but rather use the underlying keeper request along with another request to /keeper/config
to get the current committed configuration. That will eliminate an extra network round trip.
Unfortunately, keeper doesn't seem to provide a way to get the configuration and its log index together in one call. If it did that we could use that for inventory. I think that would be a good patch to make to keeper and the keeper cli, but until then we'll need to make two separate calls. And because of the inherent race condition of two calls, we'll likely have to call at least one endpoint twice to see that the config hasn't changed while polling. We would want to do that local to the sled-agent without a round trip from nexus.
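The two-call inventory flow described above could be sketched roughly as follows. This is a hypothetical outline, not the actual sled-agent code: the helper names are made up and stubbed with fixed values, standing in for the lgif request and the `/keeper/config` request.

```rust
// Stand-ins for the real keeper calls (hypothetical names, fixed values).
fn committed_log_idx() -> u64 {
    42 // would come from the lgif output
}

fn keeper_config() -> Vec<String> {
    vec!["node1".to_string(), "node2".to_string()] // would come from /keeper/config
}

/// Read the committed log index before and after fetching the config;
/// if it changed in between, the config may be stale, so retry.
fn config_with_index(max_retries: usize) -> Option<(u64, Vec<String>)> {
    for _ in 0..max_retries {
        let before = committed_log_idx();
        let config = keeper_config();
        let after = committed_log_idx();
        if before == after {
            // The configuration did not change while we were reading it.
            return Some((after, config));
        }
    }
    None
}

fn main() {
    let (idx, config) = config_with_index(3).expect("stable read");
    println!("config at log index {idx}: {config:?}");
}
```

Doing this loop locally on the sled-agent, as suggested, keeps the retries off the Nexus round trip.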
Ah, yeah definitely! I realise there would be another endpoint to collect all the information we need for inventory itself. My reasoning was that the lgif command is useful for many things, and in the spirit of keeping PRs as small and easily digestible as possible, I was going to make a separate PR with that endpoint. In hindsight I should have written that in the PR's description. Sorry about that!
let args: Vec<&OsStr> = command.as_std().get_args().collect();
let args_parsed: Vec<String> = args
    .iter()
    .map(|&os_str| os_str.to_str().unwrap().to_owned())
Sorry if I wasn't clear. I don't think we should panic here if the string isn't UTF-8. Instead we should return an error, especially since this function already returns a `Result`.
Sorry again. For some reason your comments didn't show up on GitHub last I looked. I realize this is unwrapping arguments we pass ourselves. Nonetheless I'd rather be on the safe side and return an error if possible.
So, `to_str` doesn't really return an error, it's an `Option`. Is the intention to return an error on `None`? The only purpose of `args_parsed` is to provide information for the `ClickhouseCliError::Run` error. That's why I thought `to_string_lossy` made sense: that way we can see in the error exactly what arguments are being passed. So, in a way, I was already returning an error?
Maybe I'm misunderstanding, but it feels a bit strange to return an error because of a malformed error. It seems simpler to collect all the arguments, however they were passed, and return a single error with that information, like it was in the beginning with `to_string_lossy`. WDYT?
TBH, we don't even need these `args_parsed` at all. I just thought it'd be useful to have them as part of the error message when returning an error, so a user could see exactly what arguments had been passed when running the command. The more I think about it, the more I think it's best to have `to_string_lossy` so we can see exactly what's being passed, even if a non-unicode character somehow made its way in.
Unless this information isn't useful as part of the error message? In that case I could remove it altogether 🤔
Ok, looks like I totally misinterpreted what was happening here. Sorry about that @karencfv! I didn't realize this was just for the error message. You were right to use `to_string_lossy` in the first place. It's useful information to have, even if some weird non-UTF-8 char gets excluded, which is doubtful anyway. That's better than panicking on unwrap. Sorry for the hassle.
😄 nw! I've changed the names of the variables so it's clearer what they're for.
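A minimal sketch of the lossy approach settled on above, using `std::process::Command` directly rather than the tokio wrapper in the diff. The `echo` command and its arguments are placeholders, not the real ClickHouse invocation:

```rust
use std::process::Command;

fn main() {
    // Placeholder command; the real code invokes the ClickHouse CLI.
    let mut command = Command::new("echo");
    command.args(["lgif", "--some-flag"]);

    // Collect the arguments lossily so a non-UTF-8 byte can't panic:
    // invalid sequences are replaced with U+FFFD instead of unwrapping.
    let args_for_error_msg: Vec<String> = command
        .get_args()
        .map(|os_str| os_str.to_string_lossy().into_owned())
        .collect();

    // This string is only ever used to enrich an error message.
    println!("command failed with args: {:?}", args_for_error_msg);
}
```

The trade-off discussed in the thread: `to_str().unwrap()` would panic on non-UTF-8 input, while `to_string_lossy` degrades gracefully, which is the right behavior for error-reporting context.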
}
}

pub fn parse(log: &Logger, data: &[u8]) -> Result<Self> {
This implementation is very robust :)
However, I'm not sure how necessary it is to use a macro and all this work to keep track. Since all fields are options, you can initialize the default, parse each line, and match on the key strings, ignoring ones that don't match. You can set a boolean for "any fields changed", or just test them all at the end.
It may be worthwhile instead to simplify this down to even in-order expected parsing and then just run the command against the actual keeper to see what it returns. That way we always know that if the keeper changes on upgrade we won't break our deployments. This will catch things like key name changes, which we'll want to fix.
You should be able to spin up a keeper to do this type of testing via `clickward`, as we do in the oximeter db integration tests.
FWIW, I don't expect much to change about the output, as it's pretty simple in clickhouse, but we'd rather catch actual changes than treat missing fields as `None` in production. If you know exactly what names to expect you can also just make every field a `u64` and get rid of the options altogether.
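The "initialize the default and match on key strings" suggestion might look something like the sketch below. The two field names and the tab separator are assumptions about the lgif output for illustration, not verified against a real keeper; the real struct would cover every key:

```rust
#[derive(Debug, Default, PartialEq)]
struct Lgif {
    // Two example fields; the real struct would list every lgif key.
    first_log_idx: Option<u64>,
    last_log_idx: Option<u64>,
}

fn parse(data: &str) -> Lgif {
    let mut out = Lgif::default();
    for line in data.lines() {
        // Assume "key<TAB>value" lines; skip anything else.
        let Some((key, val)) = line.split_once('\t') else { continue };
        let val = val.trim().parse::<u64>().ok();
        match key.trim() {
            "first_log_idx" => out.first_log_idx = val,
            "last_log_idx" => out.last_log_idx = val,
            _ => {} // unknown keys are ignored
        }
    }
    out
}

fn main() {
    let parsed = parse("first_log_idx\t1\nlast_log_idx\t10\n");
    println!("{parsed:?}");
}
```

Checking `parsed == Lgif::default()` at the end (or a per-field check) would cover the "any fields changed" test mentioned above.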
Hm, ok, let me play around with this a bit and simplify it :D
Thank you :)
I changed the code to return errors instead of trying to fill in as much of the response as it could.
I tried to get rid of the macro, but hear me out 😄. The moment I pass the raw output through `.lines()`, the order of the items gets completely changed, which means in-order expected parsing could be error prone. On the other hand, the macro gives us assurance that a field of the struct will only be populated if the struct field name is exactly the same as the key from the command output. I can't really think of a safer way to do this, but open to suggestions!
> The moment I pass the raw output through .lines() the order of the items gets completely changed

This is not how `lines` works. It's a lazy iterator like all others and returns one line at a time in the current order. Are you sure the change in order isn't because you are inserting into a `HashMap` to collect the lines? There's also a sort below that will change the order. To keep the same order you probably want to collect into a `Vec` instead.
Thanks for the thorough review! I think I've addressed your concerns, let me know what you think!
Thanks for taking the time to chat with me about this PR @andrewjstone :) I've added the integration test as agreed.
Thanks for taking the time to catch up, @andrewjstone! I've made the changes we talked about :) Please let me know if there's anything missing.
Thanks for all the cleanup @karencfv! Looks great.
The failing test looks unrelated, but we should probably open an issue for it. I'm going to take a closer look and will do that.
The test failure appears to be a known flake: #4949
Ah! Good to know, thanks for looking into that.
Overview
This commit introduces a new `clickhouse-admin` API endpoint: `/keeper/lgif`. This endpoint uses the ClickHouse CLI internally to retrieve and parse the logically grouped information file from the ClickHouse keepers.
Purpose
Reconfigurator will need this information to reliably manage and operate a ClickHouse replicated cluster. Additional endpoints to retrieve other information from ClickHouse servers or keepers will be added in follow-up PRs.
Testing
In addition to the unit tests, I have manually tested with the following results:
Related: #5999