Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Coordinator crashes after upgrading to 29.0.0 #15942

Closed
jreyeshdez opened this issue Feb 22, 2024 · 7 comments · May be fixed by #17529
Closed

Coordinator crashes after upgrading to 29.0.0 #15942

jreyeshdez opened this issue Feb 22, 2024 · 7 comments · May be fixed by #17529

Comments

@jreyeshdez
Copy link

Description

Upgraded Druid cluster from 28.0.1 to 29.0.0 configured with mm-less and zk-less. Coordinator fails to start up.

  • Cluster size: Dev cluster: 3xBrokers, 3xCoordinators, 3xHistorical
  • Configurations in use: mm-less and zk-less set up.
  • Steps to reproduce the problem: upgraded from 28.0.1 to 29.0.0
  • The error message or stack traces encountered: In coordinator:
2024-02-21T10:10:36,125 INFO [main] org.apache.druid.client.HttpServerInventoryView - Started executor[HttpServerInventoryView].
2024-02-21T10:10:36,126 INFO [main] org.apache.druid.client.CoordinatorServerView - CoordinatorServerView waiting for initialization.
2024-02-21T10:10:36,521 ERROR [leader-elector-scheduled-worker-1] io.kubernetes.client.extended.leaderelection.LeaderElector - Unexpected error on acquiring or renewing the lease
java.util.concurrent.ExecutionException: java.time.format.DateTimeParseException: Text '20240220T125204.849Z' could not be parsed at index 0
    at java.util.concurrent.FutureTask.report(FutureTask.java:122) ~[?:?]
    at java.util.concurrent.FutureTask.get(FutureTask.java:205) ~[?:?]
    at io.kubernetes.client.extended.leaderelection.LeaderElector.lambda$acquire$1(LeaderElector.java:187) ~[?:?]
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:539) ~[?:?]
    at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:305) ~[?:?]
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:305) ~[?:?]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136) ~[?:?]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) ~[?:?]
    at java.lang.Thread.run(Thread.java:840) ~[?:?]
Caused by: java.time.format.DateTimeParseException: Text '20240220T125204.849Z' could not be parsed at index 0
    at java.time.format.DateTimeFormatter.parseResolved0(DateTimeFormatter.java:2052) ~[?:?]
    at java.time.format.DateTimeFormatter.parse(DateTimeFormatter.java:1880) ~[?:?]
    at io.kubernetes.client.openapi.JSON$DateTypeAdapter.read(JSON.java:405) ~[?:?]
    at io.kubernetes.client.openapi.JSON$DateTypeAdapter.read(JSON.java:363) ~[?:?]
    at com.google.gson.internal.bind.ReflectiveTypeAdapterFactory$1.readIntoField(ReflectiveTypeAdapterFactory.java:212) ~[gson-2.10.jar:?]
    at com.google.gson.internal.bind.ReflectiveTypeAdapterFactory$FieldReflectionAdapter.readField(ReflectiveTypeAdapterFactory.java:431) ~[gson-2.10.jar:?]
    at com.google.gson.internal.bind.ReflectiveTypeAdapterFactory$Adapter.read(ReflectiveTypeAdapterFactory.java:391) ~[gson-2.10.jar:?]
    at com.google.gson.Gson.fromJson(Gson.java:1214) ~[gson-2.10.jar:?]
    at com.google.gson.Gson.fromJson(Gson.java:1124) ~[gson-2.10.jar:?]
    at com.google.gson.Gson.fromJson(Gson.java:1034) ~[gson-2.10.jar:?]
    at com.google.gson.Gson.fromJson(Gson.java:1001) ~[gson-2.10.jar:?]
    at io.kubernetes.client.openapi.JSON.deserialize(JSON.java:168) ~[?:?]
    at io.kubernetes.client.extended.leaderelection.resourcelock.ConfigMapLock.get(ConfigMapLock.java:78) ~[?:?]
    at io.kubernetes.client.extended.leaderelection.LeaderElector.tryAcquireOrRenew(LeaderElector.java:291) ~[?:?]
    at java.util.concurrent.FutureTask.run(FutureTask.java:264) ~[?:?]
    ... 3 more

Went ahead an deleted coordinator endpoint, it got re-created but issue became worse since after that all nodes started failing.

  • Broker log:
2024-02-21T10:46:44,297 INFO [main] org.apache.druid.k8s.discovery.K8sDruidNodeDiscoveryProvider - Created NodeRoleWatcher for nodeRole [COORDINATOR].
    at org.eclipse.jetty.servlet.FilterHolder.doFilter(FilterHolder.java:193)
    at org.eclipse.jetty.servlet.ServletHandler$Chain.doFilter(ServletHandler.java:1626)
    at org.apache.druid.server.security.PreResponseAuthorizationCheckFilter.doFilter(PreResponseAuthorizationCheckFilter.java:84)
    at org.eclipse.jetty.servlet.FilterHolder.doFilter(FilterHolder.java:193)
    at org.eclipse.jetty.servlet.ServletHandler$Chain.doFilter(ServletHandler.java:1626)
    at org.apache.druid.server.initialization.jetty.StandardResponseHeaderFilterHolder$StandardResponseHeaderFilter.doFilter(StandardResponseHeaderFilterHolder.java:164)
    at org.eclipse.jetty.servlet.FilterHolder.doFilter(FilterHolder.java:193)
    at org.eclipse.jetty.servlet.ServletHandler$Chain.doFilter(ServletHandler.java:1626)
    at org.apache.druid.server.security.AllowHttpMethodsResourceFilter.doFilter(AllowHttpMethodsResourceFilter.java:78)
    at org.eclipse.jetty.servlet.FilterHolder.doFilter(FilterHolder.java:193)
    at org.eclipse.jetty.servlet.ServletHandler$Chain.doFilter(ServletHandler.java:1626)
    at org.apache.druid.server.security.AllowOptionsResourceFilter.doFilter(AllowOptionsResourceFilter.java:74)
    at org.eclipse.jetty.servlet.FilterHolder.doFilter(FilterHolder.java:193)
    at org.eclipse.jetty.servlet.ServletHandler$Chain.doFilter(ServletHandler.java:1626)
    at org.apache.druid.security.basic.authentication.BasicHTTPAuthenticator$BasicHTTPAuthenticationFilter.doFilter(BasicHTTPAuthenticator.java:220)
    at org.apache.druid.server.security.AuthenticationWrappingFilter.doFilter(AuthenticationWrappingFilter.java:59)
    at org.eclipse.jetty.servlet.FilterHolder.doFilter(FilterHolder.java:193)
    at org.eclipse.jetty.servlet.ServletHandler$Chain.doFilter(ServletHandler.java:1626)
    at org.apache.druid.server.security.SecuritySanityCheckFilter.doFilter(SecuritySanityCheckFilter.java:77)
    at org.eclipse.jetty.servlet.FilterHolder.doFilter(FilterHolder.java:193)
    at org.eclipse.jetty.servlet.ServletHandler$Chain.doFilter(ServletHandler.java:1626)
    at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:552)
    at org.eclipse.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:233)
    at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:1624)
    at org.eclipse.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:233)
    at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1440)
    at org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:188)
    at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:505)
    at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:1594)
    at org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:186)
    at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1355)
    at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)
    at org.eclipse.jetty.server.handler.gzip.GzipHandler.handle(GzipHandler.java:772)
    at org.eclipse.jetty.server.handler.HandlerList.handle(HandlerList.java:59)
    at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:127)
    at org.eclipse.jetty.server.Server.handle(Server.java:516)
    at org.eclipse.jetty.server.HttpChannel.lambda$handle$1(HttpChannel.java:487)
    at org.eclipse.jetty.server.HttpChannel.dispatch(HttpChannel.java:732)
    at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:479)
    at org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:277)
    at org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:311)
    at org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:105)
    at org.eclipse.jetty.io.ChannelEndPoint$1.run(ChannelEndPoint.java:104)
    at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.runTask(EatWhatYouKill.java:338)
    at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.doProduce(EatWhatYouKill.java:315)
    at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.tryProduce(EatWhatYouKill.java:173)
    at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.run(EatWhatYouKill.java:131)
    at org.eclipse.jetty.util.thread.ReservedThreadExecutor$ReservedThread.run(ReservedThreadExecutor.java:409)
    at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:883)
    at org.eclipse.jetty.util.thread.QueuedThreadPool$Runner.run(QueuedThreadPool.java:1034)
    at java.base/java.lang.Thread.run(Thread.java:840)
Caused by: java.lang.RuntimeException: java.time.format.DateTimeParseException: Text '20240221T104140.624Z' could not be parsed at index 0
    at org.apache.druid.k8s.discovery.K8sDruidLeaderSelector.getCurrentLeader(K8sDruidLeaderSelector.java:112)
    at org.apache.druid.server.coordinator.DruidCoordinator.getCurrentLeader(DruidCoordinator.java:315)
    at org.apache.druid.server.http.CoordinatorRedirectInfo.getRedirectURL(CoordinatorRedirectInfo.java:57)
    ... 53 more
Caused by: java.time.format.DateTimeParseException: Text '20240221T104140.624Z' could not be parsed at index 0
    at java.base/java.time.format.DateTimeFormatter.parseResolved0(DateTimeFormatter.java:2052)
    at java.base/java.time.format.DateTimeFormatter.parse(DateTimeFormatter.java:1880)
    at io.kubernetes.client.openapi.JSON$DateTypeAdapter.read(JSON.java:405)
    at io.kubernetes.client.openapi.JSON$DateTypeAdapter.read(JSON.java:363)
    at com.google.gson.internal.bind.ReflectiveTypeAdapterFactory$1.readIntoField(ReflectiveTypeAdapterFactory.java:212)
    at com.google.gson.internal.bind.ReflectiveTypeAdapterFactory$FieldReflectionAdapter.readField(ReflectiveTypeAdapterFactory.java:431)
    at com.google.gson.internal.bind.ReflectiveTypeAdapterFactory$Adapter.read(ReflectiveTypeAdapterFactory.java:391)
    at com.google.gson.Gson.fromJson(Gson.java:1214)
    at com.google.gson.Gson.fromJson(Gson.java:1124)
    at com.google.gson.Gson.fromJson(Gson.java:1034)
    at com.google.gson.Gson.fromJson(Gson.java:1001)
    at io.kubernetes.client.openapi.JSON.deserialize(JSON.java:168)
    at io.kubernetes.client.extended.leaderelection.resourcelock.ConfigMapLock.get(ConfigMapLock.java:78)
    at org.apache.druid.k8s.discovery.DefaultK8sLeaderElectorFactory$1.getCurrentLeader(DefaultK8sLeaderElectorFactory.java:67)
    at org.apache.druid.k8s.discovery.LeaderElectorAsyncWrapper.getCurrentLeader(LeaderElectorAsyncWrapper.java:117)
    at org.apache.druid.k8s.discovery.K8sDruidLeaderSelector.getCurrentLeader(K8sDruidLeaderSelector.java:109)
    ... 55 more
</pre>

</body>
</html>
]
    at org.apache.druid.query.lookup.LookupReferencesManager.tryGetLookupListFromCoordinator(LookupReferencesManager.java:459) ~[druid-server-29.0.0.jar:29.0.0]
    at org.apache.druid.query.lookup.LookupReferencesManager.lambda$getLookupListFromCoordinator$4(LookupReferencesManager.java:428) ~[druid-server-29.0.0.jar:29.0.0]
    at org.apache.druid.java.util.common.RetryUtils.retry(RetryUtils.java:129) ~[druid-processing-29.0.0.jar:29.0.0]
    at org.apache.druid.java.util.common.RetryUtils.retry(RetryUtils.java:81) ~[druid-processing-29.0.0.jar:29.0.0]
    at org.apache.druid.java.util.common.RetryUtils.retry(RetryUtils.java:163) ~[druid-processing-29.0.0.jar:29.0.0]
    at org.apache.druid.java.util.common.RetryUtils.retry(RetryUtils.java:153) ~[druid-processing-29.0.0.jar:29.0.0]
    at org.apache.druid.query.lookup.LookupReferencesManager.getLookupListFromCoordinator(LookupReferencesManager.java:418) ~[druid-server-29.0.0.jar:29.0.0]
    at org.apache.druid.query.lookup.LookupReferencesManager.getLookupsList(LookupReferencesManager.java:395) ~[druid-server-29.0.0.jar:29.0.0]
    at org.apache.druid.query.lookup.LookupReferencesManager.loadAllLookupsAndInitStateRef(LookupReferencesManager.java:378) ~[druid-server-29.0.0.jar:29.0.0]
    at org.apache.druid.query.lookup.LookupReferencesManager.start(LookupReferencesManager.java:170) ~[druid-server-29.0.0.jar:29.0.0]
    at jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method) ~[?:?]
    at jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:77) ~[?:?]
    at jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ~[?:?]
    at java.lang.reflect.Method.invoke(Method.java:568) ~[?:?]
    at org.apache.druid.java.util.common.lifecycle.Lifecycle$AnnotationBasedHandler.start(Lifecycle.java:446) ~[druid-processing-29.0.0.jar:29.0.0]
    at org.apache.druid.java.util.common.lifecycle.Lifecycle.start(Lifecycle.java:341) ~[druid-processing-29.0.0.jar:29.0.0]
    at org.apache.druid.guice.LifecycleModule$2.start(LifecycleModule.java:152) ~[druid-processing-29.0.0.jar:29.0.0]
    at org.apache.druid.cli.GuiceRunnable.initLifecycle(GuiceRunnable.java:136) ~[druid-services-29.0.0.jar:29.0.0]
    at org.apache.druid.cli.GuiceRunnable.initLifecycle(GuiceRunnable.java:94) ~[druid-services-29.0.0.jar:29.0.0]
    at org.apache.druid.cli.ServerRunnable.run(ServerRunnable.java:63) ~[druid-services-29.0.0.jar:29.0.0]
    at org.apache.druid.cli.Main.main(Main.java:112) ~[druid-services-29.0.0.jar:29.0.0]
  • Router logs:
2024-02-21T10:48:30,378 WARN [main] org.apache.druid.java.util.common.RetryUtils - Retrying (1 of 9) in 1,123ms.
com.fasterxml.jackson.core.JsonParseException: Input does not start with Smile format header (first byte = 0x3c) and parser has REQUIRE_HEADER enabled: can not parse
 at [Source: (byte[])"<html>
<head>
<meta http-equiv="Content-Type" content="text/html;charset=ISO-8859-1"/>
<title>Error 500 java.lang.RuntimeException: java.lang.RuntimeException: java.time.format.DateTimeParseException: Text &apos;20240221T104140.624Z&apos; could not be parsed at index 0</title>
</head>
<body><h2>HTTP ERROR 500 java.lang.RuntimeException: java.lang.RuntimeException: java.time.format.DateTimeParseException: Text &apos;20240221T104140.624Z&apos; could not be parsed at index 0</h2>
<table>
<tr><th>UR"[truncated 7950 bytes]; line: -1, column: 0]
    at com.fasterxml.jackson.dataformat.smile.SmileParserBootstrapper.constructParser(SmileParserBootstrapper.java:133) ~[jackson-dataformat-smile-2.12.7.jar:2.12.7]
    at com.fasterxml.jackson.dataformat.smile.SmileFactory._createParser(SmileFactory.java:458) ~[jackson-dataformat-smile-2.12.7.jar:2.12.7]
    at com.fasterxml.jackson.dataformat.smile.SmileFactory.createParser(SmileFactory.java:364) ~[jackson-dataformat-smile-2.12.7.jar:2.12.7]
    at com.fasterxml.jackson.dataformat.smile.SmileFactory.createParser(SmileFactory.java:351) ~[jackson-dataformat-smile-2.12.7.jar:2.12.7]
    at com.fasterxml.jackson.dataformat.smile.SmileFactory.createParser(SmileFactory.java:29) ~[jackson-dataformat-smile-2.12.7.jar:2.12.7]
    at com.fasterxml.jackson.databind.ObjectMapper.readValue(ObjectMapper.java:3626) ~[jackson-databind-2.12.7.1.jar:2.12.7.1]
    at org.apache.druid.security.basic.authentication.db.cache.CoordinatorPollingBasicAuthenticatorCacheManager.tryFetchUserMapFromCoordinator(CoordinatorPollingBasicAuthenticatorCacheManager.java:262) ~[?:?]
    at org.apache.druid.security.basic.authentication.db.cache.CoordinatorPollingBasicAuthenticatorCacheManager.lambda$fetchUserMapFromCoordinator$1(CoordinatorPollingBasicAuthenticatorCacheManager.java:192) ~[?:?]
    at org.apache.druid.java.util.common.RetryUtils.retry(RetryUtils.java:129) ~[druid-processing-29.0.0.jar:29.0.0]
    at org.apache.druid.java.util.common.RetryUtils.retry(RetryUtils.java:81) ~[druid-processing-29.0.0.jar:29.0.0]
    at org.apache.druid.java.util.common.RetryUtils.retry(RetryUtils.java:163) ~[druid-processing-29.0.0.jar:29.0.0]
    at org.apache.druid.java.util.common.RetryUtils.retry(RetryUtils.java:153) ~[druid-processing-29.0.0.jar:29.0.0]
    at org.apache.druid.security.basic.authentication.db.cache.CoordinatorPollingBasicAuthenticatorCacheManager.fetchUserMapFromCoordinator(CoordinatorPollingBasicAuthenticatorCacheManager.java:190) ~[?:?]
    at org.apache.druid.security.basic.authentication.db.cache.CoordinatorPollingBasicAuthenticatorCacheManager.initUserMaps(CoordinatorPollingBasicAuthenticatorCacheManager.java:289) ~[?:?]
    at org.apache.druid.security.basic.authentication.db.cache.CoordinatorPollingBasicAuthenticatorCacheManager.start(CoordinatorPollingBasicAuthenticatorCacheManager.java:108) ~[?:?]
    at jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method) ~[?:?]
    at jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:77) ~[?:?]
    at jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ~[?:?]
    at java.lang.reflect.Method.invoke(Method.java:568) ~[?:?]
    at org.apache.druid.java.util.common.lifecycle.Lifecycle$AnnotationBasedHandler.start(Lifecycle.java:446) ~[druid-processing-29.0.0.jar:29.0.0]
    at org.apache.druid.java.util.common.lifecycle.Lifecycle.start(Lifecycle.java:341) ~[druid-processing-29.0.0.jar:29.0.0]
    at org.apache.druid.guice.LifecycleModule$2.start(LifecycleModule.java:152) ~[druid-processing-29.0.0.jar:29.0.0]
    at org.apache.druid.cli.GuiceRunnable.initLifecycle(GuiceRunnable.java:136) ~[druid-services-29.0.0.jar:29.0.0]
    at org.apache.druid.cli.GuiceRunnable.initLifecycle(GuiceRunnable.java:94) ~[druid-services-29.0.0.jar:29.0.0]
    at org.apache.druid.cli.ServerRunnable.run(ServerRunnable.java:63) ~[druid-services-29.0.0.jar:29.0.0]
    at org.apache.druid.cli.Main.main(Main.java:112) ~[druid-services-29.0.0.jar:29.0.0]
2024-02-21T10:48:31,605 WARN [main] org.apache.druid.java.util.common.RetryUtils - Retrying (2 of 9) in 2,098ms.
com.fasterxml.jackson.core.JsonParseException: Input does not start with Smile format header (first byte = 0x3c) and parser has REQUIRE_HEADER enabled: can not parse
 at [Source: (byte[])"<html>
<head>
<meta http-equiv="Content-Type" content="text/html;charset=ISO-8859-1"/>
<title>Error 500 java.lang.RuntimeException: java.lang.RuntimeException: java.time.format.DateTimeParseException: Text &apos;20240221T104140.624Z&apos; could not be parsed at index 0</title>
</head>
<body><h2>HTTP ERROR 500 java.lang.RuntimeException: java.lang.RuntimeException: java.time.format.DateTimeParseException: Text &apos;20240221T104140.624Z&apos; could not be parsed at index 0</h2>
<table>
<tr><th>UR"[truncated 7950 bytes]; line: -1, column: 0]
    at com.fasterxml.jackson.dataformat.smile.SmileParserBootstrapper.constructParser(SmileParserBootstrapper.java:133) ~[jackson-dataformat-smile-2.12.7.jar:2.12.7]
    at com.fasterxml.jackson.dataformat.smile.SmileFactory._createParser(SmileFactory.java:458) ~[jackson-dataformat-smile-2.12.7.jar:2.12.7]
    at com.fasterxml.jackson.dataformat.smile.SmileFactory.createParser(SmileFactory.java:364) ~[jackson-dataformat-smile-2.12.7.jar:2.12.7]
    at com.fasterxml.jackson.dataformat.smile.SmileFactory.createParser(SmileFactory.java:351) ~[jackson-dataformat-smile-2.12.7.jar:2.12.7]
    at com.fasterxml.jackson.dataformat.smile.SmileFactory.createParser(SmileFactory.java:29) ~[jackson-dataformat-smile-2.12.7.jar:2.12.7]
    at com.fasterxml.jackson.databind.ObjectMapper.readValue(ObjectMapper.java:3626) ~[jackson-databind-2.12.7.1.jar:2.12.7.1]
    at org.apache.druid.security.basic.authentication.db.cache.CoordinatorPollingBasicAuthenticatorCacheManager.tryFetchUserMapFromCoordinator(CoordinatorPollingBasicAuthenticatorCacheManager.java:262) ~[?:?]
    at org.apache.druid.security.basic.authentication.db.cache.CoordinatorPollingBasicAuthenticatorCacheManager.lambda$fetchUserMapFromCoordinator$1(CoordinatorPollingBasicAuthenticatorCacheManager.java:192) ~[?:?]
    at org.apache.druid.java.util.common.RetryUtils.retry(RetryUtils.java:129) ~[druid-processing-29.0.0.jar:29.0.0]
    at org.apache.druid.java.util.common.RetryUtils.retry(RetryUtils.java:81) ~[druid-processing-29.0.0.jar:29.0.0]
    at org.apache.druid.java.util.common.RetryUtils.retry(RetryUtils.java:163) ~[druid-processing-29.0.0.jar:29.0.0]
    at org.apache.druid.java.util.common.RetryUtils.retry(RetryUtils.java:153) ~[druid-processing-29.0.0.jar:29.0.0]
    at org.apache.druid.security.basic.authentication.db.cache.CoordinatorPollingBasicAuthenticatorCacheManager.fetchUserMapFromCoordinator(CoordinatorPollingBasicAuthenticatorCacheManager.java:190) ~[?:?]
    at org.apache.druid.security.basic.authentication.db.cache.CoordinatorPollingBasicAuthenticatorCacheManager.initUserMaps(CoordinatorPollingBasicAuthenticatorCacheManager.java:289) ~[?:?]
    at org.apache.druid.security.basic.authentication.db.cache.CoordinatorPollingBasicAuthenticatorCacheManager.start(CoordinatorPollingBasicAuthenticatorCacheManager.java:108) ~[?:?]
    at jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method) ~[?:?]
    at jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:77) ~[?:?]
    at jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ~[?:?]
    at java.lang.reflect.Method.invoke(Method.java:568) ~[?:?]
    at org.apache.druid.java.util.common.lifecycle.Lifecycle$AnnotationBasedHandler.start(Lifecycle.java:446) ~[druid-processing-29.0.0.jar:29.0.0]
    at org.apache.druid.java.util.common.lifecycle.Lifecycle.start(Lifecycle.java:341) ~[druid-processing-29.0.0.jar:29.0.0]
    at org.apache.druid.guice.LifecycleModule$2.start(LifecycleModule.java:152) ~[druid-processing-29.0.0.jar:29.0.0]
    at org.apache.druid.cli.GuiceRunnable.initLifecycle(GuiceRunnable.java:136) ~[druid-services-29.0.0.jar:29.0.0]
    at org.apache.druid.cli.GuiceRunnable.initLifecycle(GuiceRunnable.java:94) ~[druid-services-29.0.0.jar:29.0.0]
    at org.apache.druid.cli.ServerRunnable.run(ServerRunnable.java:63) ~[druid-services-29.0.0.jar:29.0.0]
    at org.apache.druid.cli.Main.main(Main.java:112) ~[druid-services-29.0.0.jar:29.0.0]
  • Coordinator logs:
2024-02-21T10:50:05,825 DEBUG [qtp778350106-150] org.apache.druid.jetty.RequestLog - 10.2.22.225 POST //10.2.14.73:8081/druid/coordinator/v1/metadata/dataSourceInformation HTTP/1.1 500
2024-02-21T10:50:06,031 WARN [qtp778350106-157] org.eclipse.jetty.server.HttpChannel - /druid/coordinator/v1/metadata/dataSourceInformation
java.lang.RuntimeException: java.lang.RuntimeException: java.time.format.DateTimeParseException: Text '20240221T104140.624Z' could not be parsed at index 0
    at org.apache.druid.server.http.CoordinatorRedirectInfo.getRedirectURL(CoordinatorRedirectInfo.java:71) ~[druid-server-29.0.0.jar:29.0.0]
    at org.apache.druid.cli.CoordinatorOverlordRedirectInfo.getRedirectURL(CoordinatorOverlordRedirectInfo.java:58) ~[druid-services-29.0.0.jar:29.0.0]
    at org.apache.druid.server.http.RedirectFilter.doFilter(RedirectFilter.java:75) ~[druid-server-29.0.0.jar:29.0.0]
    at org.eclipse.jetty.servlet.FilterHolder.doFilter(FilterHolder.java:193) ~[jetty-servlet-9.4.53.v20231009.jar:9.4.53.v20231009]
    at org.eclipse.jetty.servlet.ServletHandler$Chain.doFilter(ServletHandler.java:1626) ~[jetty-servlet-9.4.53.v20231009.jar:9.4.53.v20231009]
    at org.apache.druid.server.security.PreResponseAuthorizationCheckFilter.doFilter(PreResponseAuthorizationCheckFilter.java:84) ~[druid-server-29.0.0.jar:29.0.0]
    at org.eclipse.jetty.servlet.FilterHolder.doFilter(FilterHolder.java:193) ~[jetty-servlet-9.4.53.v20231009.jar:9.4.53.v20231009]
    at org.eclipse.jetty.servlet.ServletHandler$Chain.doFilter(ServletHandler.java:1626) ~[jetty-servlet-9.4.53.v20231009.jar:9.4.53.v20231009]
    at org.apache.druid.server.initialization.jetty.StandardResponseHeaderFilterHolder$StandardResponseHeaderFilter.doFilter(StandardResponseHeaderFilterHolder.java:164) ~[druid-server-29.0.0.jar:29.0.0]
    at org.eclipse.jetty.servlet.FilterHolder.doFilter(FilterHolder.java:193) ~[jetty-servlet-9.4.53.v20231009.jar:9.4.53.v20231009]
    at org.eclipse.jetty.servlet.ServletHandler$Chain.doFilter(ServletHandler.java:1626) ~[jetty-servlet-9.4.53.v20231009.jar:9.4.53.v20231009]
    at org.apache.druid.server.security.AllowHttpMethodsResourceFilter.doFilter(AllowHttpMethodsResourceFilter.java:78) ~[druid-server-29.0.0.jar:29.0.0]
    at org.eclipse.jetty.servlet.FilterHolder.doFilter(FilterHolder.java:193) ~[jetty-servlet-9.4.53.v20231009.jar:9.4.53.v20231009]
    at org.eclipse.jetty.servlet.ServletHandler$Chain.doFilter(ServletHandler.java:1626) ~[jetty-servlet-9.4.53.v20231009.jar:9.4.53.v20231009]
    at org.apache.druid.server.security.AllowOptionsResourceFilter.doFilter(AllowOptionsResourceFilter.java:74) ~[druid-server-29.0.0.jar:29.0.0]
    at org.eclipse.jetty.servlet.FilterHolder.doFilter(FilterHolder.java:193) ~[jetty-servlet-9.4.53.v20231009.jar:9.4.53.v20231009]
    at org.eclipse.jetty.servlet.ServletHandler$Chain.doFilter(ServletHandler.java:1626) ~[jetty-servlet-9.4.53.v20231009.jar:9.4.53.v20231009]
    at org.apache.druid.security.basic.authentication.BasicHTTPAuthenticator$BasicHTTPAuthenticationFilter.doFilter(BasicHTTPAuthenticator.java:220) ~[?:?]
    at org.apache.druid.server.security.AuthenticationWrappingFilter.doFilter(AuthenticationWrappingFilter.java:59) ~[druid-server-29.0.0.jar:29.0.0]
    at org.eclipse.jetty.servlet.FilterHolder.doFilter(FilterHolder.java:193) ~[jetty-servlet-9.4.53.v20231009.jar:9.4.53.v20231009]
    at org.eclipse.jetty.servlet.ServletHandler$Chain.doFilter(ServletHandler.java:1626) ~[jetty-servlet-9.4.53.v20231009.jar:9.4.53.v20231009]
    at org.apache.druid.server.security.SecuritySanityCheckFilter.doFilter(SecuritySanityCheckFilter.java:77) ~[druid-server-29.0.0.jar:29.0.0]
    at org.eclipse.jetty.servlet.FilterHolder.doFilter(FilterHolder.java:193) ~[jetty-servlet-9.4.53.v20231009.jar:9.4.53.v20231009]
    at org.eclipse.jetty.servlet.ServletHandler$Chain.doFilter(ServletHandler.java:1626) ~[jetty-servlet-9.4.53.v20231009.jar:9.4.53.v20231009]
    at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:552) ~[jetty-servlet-9.4.53.v20231009.jar:9.4.53.v20231009]
    at org.eclipse.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:233) ~[jetty-server-9.4.53.v20231009.jar:9.4.53.v20231009]
    at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:1624) ~[jetty-server-9.4.53.v20231009.jar:9.4.53.v20231009]
    at org.eclipse.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:233) ~[jetty-server-9.4.53.v20231009.jar:9.4.53.v20231009]
    at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1440) ~[jetty-server-9.4.53.v20231009.jar:9.4.53.v20231009]
    at org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:188) ~[jetty-server-9.4.53.v20231009.jar:9.4.53.v20231009]
    at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:505) ~[jetty-servlet-9.4.53.v20231009.jar:9.4.53.v20231009]
    at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:1594) ~[jetty-server-9.4.53.v20231009.jar:9.4.53.v20231009]
    at org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:186) ~[jetty-server-9.4.53.v20231009.jar:9.4.53.v20231009]
    at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1355) ~[jetty-server-9.4.53.v20231009.jar:9.4.53.v20231009]
    at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141) ~[jetty-server-9.4.53.v20231009.jar:9.4.53.v20231009]
    at org.eclipse.jetty.server.handler.gzip.GzipHandler.handle(GzipHandler.java:772) ~[jetty-server-9.4.53.v20231009.jar:9.4.53.v20231009]
    at org.eclipse.jetty.server.handler.HandlerList.handle(HandlerList.java:59) ~[jetty-server-9.4.53.v20231009.jar:9.4.53.v20231009]
    at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:127) ~[jetty-server-9.4.53.v20231009.jar:9.4.53.v20231009]
    at org.eclipse.jetty.server.Server.handle(Server.java:516) ~[jetty-server-9.4.53.v20231009.jar:9.4.53.v20231009]
    at org.eclipse.jetty.server.HttpChannel.lambda$handle$1(HttpChannel.java:487) ~[jetty-server-9.4.53.v20231009.jar:9.4.53.v20231009]
    at org.eclipse.jetty.server.HttpChannel.dispatch(HttpChannel.java:732) ~[jetty-server-9.4.53.v20231009.jar:9.4.53.v20231009]
    at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:479) ~[jetty-server-9.4.53.v20231009.jar:9.4.53.v20231009]
    at org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:277) ~[jetty-server-9.4.53.v20231009.jar:9.4.53.v20231009]
    at org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:311) ~[jetty-io-9.4.53.v20231009.jar:9.4.53.v20231009]
    at org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:105) ~[jetty-io-9.4.53.v20231009.jar:9.4.53.v20231009]
    at org.eclipse.jetty.io.ChannelEndPoint$1.run(ChannelEndPoint.java:104) ~[jetty-io-9.4.53.v20231009.jar:9.4.53.v20231009]
    at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.runTask(EatWhatYouKill.java:338) ~[jetty-util-9.4.53.v20231009.jar:9.4.53.v20231009]
    at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.doProduce(EatWhatYouKill.java:315) ~[jetty-util-9.4.53.v20231009.jar:9.4.53.v20231009]
    at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.tryProduce(EatWhatYouKill.java:173) ~[jetty-util-9.4.53.v20231009.jar:9.4.53.v20231009]
    at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.run(EatWhatYouKill.java:131) ~[jetty-util-9.4.53.v20231009.jar:9.4.53.v20231009]
    at org.eclipse.jetty.util.thread.ReservedThreadExecutor$ReservedThread.run(ReservedThreadExecutor.java:409) ~[jetty-util-9.4.53.v20231009.jar:9.4.53.v20231009]
    at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:883) ~[jetty-util-9.4.53.v20231009.jar:9.4.53.v20231009]
    at org.eclipse.jetty.util.thread.QueuedThreadPool$Runner.run(QueuedThreadPool.java:1034) ~[jetty-util-9.4.53.v20231009.jar:9.4.53.v20231009]
    at java.lang.Thread.run(Thread.java:840) ~[?:?]
Caused by: java.lang.RuntimeException: java.time.format.DateTimeParseException: Text '20240221T104140.624Z' could not be parsed at index 0
    at org.apache.druid.k8s.discovery.K8sDruidLeaderSelector.getCurrentLeader(K8sDruidLeaderSelector.java:112) ~[?:?]
    at org.apache.druid.server.coordinator.DruidCoordinator.getCurrentLeader(DruidCoordinator.java:315) ~[druid-server-29.0.0.jar:29.0.0]
    at org.apache.druid.server.http.CoordinatorRedirectInfo.getRedirectURL(CoordinatorRedirectInfo.java:57) ~[druid-server-29.0.0.jar:29.0.0]
    ... 53 more
Caused by: java.time.format.DateTimeParseException: Text '20240221T104140.624Z' could not be parsed at index 0
    at java.time.format.DateTimeFormatter.parseResolved0(DateTimeFormatter.java:2052) ~[?:?]
    at java.time.format.DateTimeFormatter.parse(DateTimeFormatter.java:1880) ~[?:?]
    at io.kubernetes.client.openapi.JSON$DateTypeAdapter.read(JSON.java:405) ~[?:?]
    at io.kubernetes.client.openapi.JSON$DateTypeAdapter.read(JSON.java:363) ~[?:?]
    at com.google.gson.internal.bind.ReflectiveTypeAdapterFactory$1.readIntoField(ReflectiveTypeAdapterFactory.java:212) ~[gson-2.10.jar:?]
    at com.google.gson.internal.bind.ReflectiveTypeAdapterFactory$FieldReflectionAdapter.readField(ReflectiveTypeAdapterFactory.java:431) ~[gson-2.10.jar:?]
    at com.google.gson.internal.bind.ReflectiveTypeAdapterFactory$Adapter.read(ReflectiveTypeAdapterFactory.java:391) ~[gson-2.10.jar:?]
    at com.google.gson.Gson.fromJson(Gson.java:1214) ~[gson-2.10.jar:?]
    at com.google.gson.Gson.fromJson(Gson.java:1124) ~[gson-2.10.jar:?]
    at com.google.gson.Gson.fromJson(Gson.java:1034) ~[gson-2.10.jar:?]
    at com.google.gson.Gson.fromJson(Gson.java:1001) ~[gson-2.10.jar:?]
    at io.kubernetes.client.openapi.JSON.deserialize(JSON.java:168) ~[?:?]
    at io.kubernetes.client.extended.leaderelection.resourcelock.ConfigMapLock.get(ConfigMapLock.java:78) ~[?:?]
    at org.apache.druid.k8s.discovery.DefaultK8sLeaderElectorFactory$1.getCurrentLeader(DefaultK8sLeaderElectorFactory.java:67) ~[?:?]
    at org.apache.druid.k8s.discovery.LeaderElectorAsyncWrapper.getCurrentLeader(LeaderElectorAsyncWrapper.java:117) ~[?:?]
    at org.apache.druid.k8s.discovery.K8sDruidLeaderSelector.getCurrentLeader(K8sDruidLeaderSelector.java:109) ~[?:?]
    at org.apache.druid.server.coordinator.DruidCoordinator.getCurrentLeader(DruidCoordinator.java:315) ~[druid-server-29.0.0.jar:29.0.0]
    at org.apache.druid.server.http.CoordinatorRedirectInfo.getRedirectURL(CoordinatorRedirectInfo.java:57) ~[druid-server-29.0.0.jar:29.0.0]
    ... 53 more

Historical log very similar to above.

@abhishekagarwal87
Copy link
Contributor

@jreyeshdez - There is an explanation in the ticket you linked, and it has a workaround too. I am going to close this bug as there is nothing we can do from Druid side. If there is a new client library available with the fix, we can upgrade to that version.

@abhishekagarwal87 abhishekagarwal87 closed this as not planned Won't fix, can't repro, duplicate, stale Mar 1, 2024
@jreyeshdez
Copy link
Author

@jreyeshdez - There is an explanation in the ticket you linked, and it has a workaround too. I am going to close this bug as there is nothing we can do from Druid side. If there is a new client library available with the fix, we can upgrade to that version.

thanks @abhishekagarwal87, however the workaround provided in the link did not work in my case tho, as described above, I followed instructions in the comments and the amount of errors increased across all nodes. There were also other users impacted that commented in the slack thread we started https://apachedruidworkspace.slack.com/archives/C0309C9L90D/p1708510329253319

Is there any documentation in druid's release page describing what steps needs to be done for users upgrading to 29.0.0 that encounter that error? I think the guidance should come from Druid, not going to somewhere else PR/Issue's comment to fix it.

@m17kea
Copy link

m17kea commented Dec 2, 2024

The issue is caused because of a change in date format in the leader election config maps that are stored in kubernetes. The steps to fix are as follows:

kubectl get cm -n <druid namespace>

you should see 2 config maps of the form:

<cluster-identifier>-leaderelection-coordinator
<cluster-identifier>-leaderelection-overlord

verify the date from your error message appears in the annotations:

kubectl describe cm <cluster-identifier>-leaderelection-coordinator -n <druid namespace>
kubectl describe cm <cluster-identifier>-leaderelection-overlord -n <druid namespace>

remove these by:

kubectl delete cm <cluster-identifier>-leaderelection-coordinator -n <druid namespace>
kubectl delete cm <cluster-identifier>-leaderelection-overlord -n <druid namespace>

These will be recreated by the services whcih will then successfully start.

@m17kea
Copy link

m17kea commented Dec 2, 2024

Likely introduced by #15449

@jreyeshdez
Copy link
Author

The issue is caused because of a change in date format in the leader election config maps that are stored in kubernetes. The steps to fix are as follows:

kubectl get cm -n <druid namespace>

you should see 2 config maps of the form:

<cluster-identifier>-leaderelection-coordinator
<cluster-identifier>-leaderelection-overlord

verify the date from your error message appears in the annotations:

kubectl describe cm <cluster-identifier>-leaderelection-coordinator -n <druid namespace>
kubectl describe cm <cluster-identifier>-leaderelection-overlord -n <druid namespace>

remove these by:

kubectl delete cm <cluster-identifier>-leaderelection-coordinator -n <druid namespace>
kubectl delete cm <cluster-identifier>-leaderelection-overlord -n <druid namespace>

These will be recreated by the services whcih will then successfully start.

Thanks @m17kea for the explanation.

I actually attempted your solution awhile ago and unfortunately did not work either. Worth saying I did it through k9s but somehow the config maps were not recreated with updated time.

Link of the conversation https://apachedruidworkspace.slack.com/archives/C0309C9L90D/p1712231718374279?thread_ts=1708510329.253319&cid=C0309C9L90D

In case you cant see above slack link, this is what the message says:

just checked and see druid-deployment-dev-leaderelection-coordinator and druid-deployment-dev-leaderelection-overlord  both have "acquireTime":"20240228T111729.078Z" and "acquireTime":"20240228T111857.334Z"

and

well unfortunately it does not work, I upgraded and applied the changes, it failed with same error but I deleted the configmaps but they got re-created back with same acquireTime format even in 29.0.1

I deploy druid via Helm so not sure if there is a different procedure.

@m17kea
Copy link

m17kea commented Dec 3, 2024

The code that creates the config map is in a downstream kubernetes java library that has been updated in #15449. We use the druid-operator to deploy our clusters but this issue is not related to how it is deployed, it's controlled exclusively by the k8s extension. If the config maps get recreated with the old format again then you must have a version 28 version of one or more of the processes running somewhere. If you have 3 replicas perhaps the helm rolling update does them one at a time leaving old code running. The other thing you could try is manually editing the date in the config map to the new format which would move the error to the older versions whilst the new ones rollout. If you want a hand, reach out to me

@jreyeshdez
Copy link
Author

jreyeshdez commented Dec 3, 2024

The code that creates the config map is in a downstream kubernetes java library that has been updated in #15449. We use the druid-operator to deploy our clusters but this issue is not related to how it is deployed, it's controlled exclusively by the k8s extension. If the config maps get recreated with the old format again then you must have a version 28 version of one or more of the processes running somewhere. If you have 3 replicas perhaps the helm rolling update does them one at a time leaving old code running. The other thing you could try is manually editing the date in the config map to the new format which would move the error to the older versions whilst the new ones rollout. If you want a hand, reach out to me

In that EKS cluster, there is only 1 Druid cluster running, it is isolated in a AWS dev account and EKS. However there are replicas for broker (3), coordinator (3) and router (3). Only 1 historical. Also worth noting it is running in mm-less and zk-less, not sure if that over complicates it.

I think it might be related to the rolling update where if the newer upcoming pod fails to start then it won't terminate the other ones with old config. So not sure what best approach I can take here, specially when I have to do it to the production cluster.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants