Coordinator crashes after upgrading to 29.0.0 #15942
Comments
@jreyeshdez - There is an explanation in the ticket you linked, and it has a workaround too. I am going to close this bug, as there is nothing we can do from the Druid side. If a new client library becomes available with the fix, we can upgrade to that version.
Thanks @abhishekagarwal87, however the workaround provided in the link did not work in my case, as described above: I followed the instructions in the comments and the number of errors increased across all nodes. Other users were also impacted and commented in the Slack thread we started: https://apachedruidworkspace.slack.com/archives/C0309C9L90D/p1708510329253319 Is there any documentation on Druid's release page describing what steps need to be taken by users upgrading to 29.0.0 who encounter this error? I think the guidance should come from Druid, rather than from a comment on some other PR or issue.
The issue is caused by a change in the date format of the leader-election config maps that are stored in Kubernetes. The steps to fix it are as follows (a hedged kubectl sketch of these steps is shown after the list):
1. You should see two config maps of the form:
2. Verify that the date from your error message appears in the annotations:
3. Remove these by:
These will be recreated by the services, which will then start successfully.
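A minimal sketch of those steps with kubectl; the `druid` namespace and the `<cluster-id>-leaderelection-*` config map names are assumptions, so match them against your own cluster before deleting anything:

```sh
# Sketch only: the namespace and config map names below are placeholders/assumptions.

# 1. List the leader-election config maps created by the Druid k8s extension
#    (you should see two, one per elected role, e.g. Coordinator and Overlord).
kubectl -n druid get configmaps | grep -i leaderelection

# 2. Inspect the annotations and confirm the date from your error message appears there.
kubectl -n druid get configmap <cluster-id>-leaderelection-coordinator -o yaml

# 3. Delete both config maps; the services recreate them in the new format on startup.
kubectl -n druid delete configmap \
  <cluster-id>-leaderelection-coordinator \
  <cluster-id>-leaderelection-overlord
```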
Likely introduced by #15449
Thanks @m17kea for the explanation. I actually attempted your solution a while ago and unfortunately it did not work either. Worth saying I did it through k9s, but somehow the config maps were not recreated with an updated time. Link to the conversation: https://apachedruidworkspace.slack.com/archives/C0309C9L90D/p1712231718374279?thread_ts=1708510329.253319&cid=C0309C9L90D In case you can't see the Slack link above, this is what the message says:
and
I deploy Druid via Helm, so I am not sure if there is a different procedure.
The code that creates the config map lives in a downstream Kubernetes Java client library that was updated in #15449. We use the druid-operator to deploy our clusters, but this issue is not related to how Druid is deployed; it is controlled exclusively by the k8s extension. If the config maps get recreated in the old format again, then you must still have a version 28 instance of one or more of the processes running somewhere. If you have 3 replicas, perhaps the Helm rolling update replaces them one at a time, leaving old code running. The other thing you could try is manually editing the date in the config map to the new format, which would move the error to the older versions while the new ones roll out. If you want a hand, reach out to me.
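If you want to try the manual-edit route, a rough sketch is below. The config map name is a placeholder, and the leader-election record is typically kept in the `control-plane.alpha.kubernetes.io/leader` annotation, but check your own config map and the exact date format from the error message before changing anything:

```sh
# Sketch only: the config map name is a placeholder and the annotation layout may differ.
# Open the config map and change the acquireTime/renewTime values inside the
# control-plane.alpha.kubernetes.io/leader annotation to the format the new client expects.
kubectl -n druid edit configmap <cluster-id>-leaderelection-coordinator
```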
In that EKS cluster there is only one Druid cluster running; it is isolated in an AWS dev account and EKS. However, there are replicas for the Broker (3), Coordinator (3) and Router (3), and only one Historical. Also worth noting it is running MM-less and ZK-less, not sure if that overcomplicates things. I think it might be related to the rolling update: if the newer incoming pod fails to start, the older pods with the old config won't be terminated. So I'm not sure what the best approach is here, especially when I have to do it on the production cluster.
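One quick way to check the "old code still running" theory is to list the image of every Druid pod and confirm nothing is still on 28.x; a sketch follows, where the namespace is an assumption to adjust for your deployment:

```sh
# Sketch only: the namespace is a placeholder — adjust for your deployment.
# Show each Druid pod alongside the container image it is actually running.
kubectl -n druid get pods \
  -o custom-columns=NAME:.metadata.name,IMAGE:.spec.containers[*].image
```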
Description
Upgraded a Druid cluster from 28.0.1 to 29.0.0, configured MM-less and ZK-less. The Coordinator fails to start up.
Went ahead and deleted the Coordinator endpoint; it got re-created, but the issue became worse since after that all nodes started failing.
The Historical log is very similar to the above.