Cassandra pod stuck instead of CrashLoopBackOff when Medusa config loading fails #805

c3-clement · 2024-09-18T11:27:24Z

See k8ssandra-operator issue: k8ssandra/k8ssandra-operator#1406

What happened?

I deployed a K8ssandraCluster with 96 replicas and Medusa enabled, and one of the pods never passed its readiness probe:

k get sts cs-95f5cdf50d-cs-95f5cdf50d-default-sts -n platform
NAME                                      READY   AGE
cs-95f5cdf50d-cs-95f5cdf50d-default-sts   95/96

I identified the faulty pod: it was failing its readiness probe because of the medusa container.
The Medusa gRPC server did not start because load_config() failed (see logs below).
Since the gRPC server was not running, the readiness probe never succeeded.

The medusa container stayed "blocked" and made no attempt to restart the gRPC server.
I restarted the pod manually by deleting it, after which the Medusa gRPC server started successfully.

Did you expect to see something different?

I expect the pod to restart and enter CrashLoopBackOff if an uncaught exception is raised by the Medusa Python process, instead of blocking indefinitely.

I believe this behavior was introduced by the following change: #731
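
To illustrate the expected behavior, here is a minimal Python sketch of a wrapper that returns its child's exit code unchanged, so that a supervisor such as the kubelet can see the failure and restart the container. This is not the Medusa docker-entrypoint and not the change made in #731 or #806; `run_and_propagate` is a hypothetical name, and the failing child is a stand-in for the gRPC server crashing at startup.

```python
import subprocess
import sys


def run_and_propagate(cmd: list[str]) -> int:
    """Run cmd to completion and return its exit code unchanged."""
    completed = subprocess.run(cmd)
    return completed.returncode


# Hypothetical stand-in for the Medusa process dying on an uncaught exception.
failing_child = [sys.executable, "-c", "raise RuntimeError('config loading failed')"]

code = run_and_propagate(failing_child)
print(f"child exited with code {code}")

# A real entrypoint would end with sys.exit(code): a non-zero exit terminates
# the container, and repeated failures show up as CrashLoopBackOff.
```

If the wrapper instead swallows the exit code and keeps running, the container stays up with no server inside it, which matches the stuck pod described above.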

How to reproduce it (as minimally and precisely as possible):
Start the medusa container with an invalid configuration

Environment

K8ssandra Operator version: 1.18
Medusa version: 0.21
Kubernetes version information: 1.29
Kubernetes cluster kind: GKE

Medusa logs

MEDUSA_MODE = GRPC
sleeping for 0 sec
Starting Medusa gRPC service
WARNING:root:The CQL_USERNAME environment variable is deprecated and has been replaced by the MEDUSA_CQL_USERNAME variable
WARNING:root:The CQL_PASSWORD environment variable is deprecated and has been replaced by the MEDUSA_CQL_PASSWORD variable
Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "/home/cassandra/.venv/lib/python3.11/site-packages/medusa/service/grpc/server.py", line 424, in <module>
    asyncio.run(main())
  File "/usr/lib/python3.11/asyncio/runners.py", line 190, in run
    return runner.run(main)
           ^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/asyncio/runners.py", line 118, in run
    return self._loop.run_until_complete(task)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/asyncio/base_events.py", line 654, in run_until_complete
    return future.result()
           ^^^^^^^^^^^^^^^
  File "/home/cassandra/.venv/lib/python3.11/site-packages/medusa/service/grpc/server.py", line 419, in main
    server = Server(config_file_path)
             ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/cassandra/.venv/lib/python3.11/site-packages/medusa/service/grpc/server.py", line 53, in __init__
    self.medusa_config = self.create_config()
                         ^^^^^^^^^^^^^^^^^^^^
  File "/home/cassandra/.venv/lib/python3.11/site-packages/medusa/service/grpc/server.py", line 88, in create_config
    return load_config(args, config_file)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/cassandra/.venv/lib/python3.11/site-packages/medusa/config.py", line 315, in load_config
    config = parse_config(args, config_file)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/cassandra/.venv/lib/python3.11/site-packages/medusa/config.py", line 280, in parse_config
    config.set('storage', 'fqdn', hostname_resolver.resolve_fqdn())
                                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/cassandra/.venv/lib/python3.11/site-packages/medusa/network/hostname_resolver.py", line 48, in resolve_fqdn
    hostname = self.compute_k8s_hostname(ip_address_to_resolve)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/cassandra/.venv/lib/python3.11/site-packages/medusa/network/hostname_resolver.py", line 56, in compute_k8s_hostname
    fqdns = dns.resolver.resolve(reverse_name, 'PTR')
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/cassandra/.venv/lib/python3.11/site-packages/dns/resolver.py", line 1565, in resolve
    return get_default_resolver().resolve(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/cassandra/.venv/lib/python3.11/site-packages/dns/resolver.py", line 1307, in resolve
    (request, answer) = resolution.next_request()
                        ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/cassandra/.venv/lib/python3.11/site-packages/dns/resolver.py", line 749, in next_request
    raise NXDOMAIN(qnames=self.qnames_to_try, responses=self.nxdomain_responses)
dns.resolver.NXDOMAIN: The DNS query name does not exist: 92.49.20.172.in-addr.arpa.
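
For anyone triaging a similar trace: the query name in the NXDOMAIN error is the reverse (PTR) form of the IP address Medusa tried to resolve. A small sketch, using only the Python standard library, that turns it back into a dotted address; `ip_from_reverse_name` is a hypothetical helper, not part of Medusa or dnspython.

```python
import ipaddress


def ip_from_reverse_name(qname: str) -> str:
    """Convert an IPv4 in-addr.arpa query name back to a dotted address."""
    suffix = ".in-addr.arpa"
    name = qname.rstrip(".")
    if not name.endswith(suffix):
        raise ValueError(f"not an IPv4 reverse name: {qname!r}")
    octets = name[: -len(suffix)].split(".")
    # ip_address() validates that the reversed octets form a real IPv4 address.
    return str(ipaddress.ip_address(".".join(reversed(octets))))


pod_ip = ip_from_reverse_name("92.49.20.172.in-addr.arpa.")
print(pod_ip)  # 172.20.49.92
```

The resulting address can then be compared with the pod IPs listed by `kubectl get pods -o wide` to confirm which pod had no PTR record at startup.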

┆Issue is synchronized with this Jira Story by Unito
┆Reviewer: Alexander Dejanovski
┆Fix Versions: 2024-10
┆Issue Number: MED-97

c3-clement · 2024-09-18T11:28:35Z

Hello @adejanovski @rzvoncek

FYI, I'm about to submit a PR to address this issue.

c3-clement · 2024-09-23T16:21:59Z

Thanks @adejanovski.
When can we expect a release?

rzvoncek · 2024-10-03T11:29:24Z

@c3-clement Medusa 0.22.3 is out with this patch in it. The k8ssandra-operator release should follow, probably next week.

c3-clement mentioned this issue Sep 18, 2024

Propagate Medusa process exit code in k8s docker-entrypoint #806

Merged

4 tasks

adejanovski closed this as completed in #806 Sep 23, 2024

c3-clement mentioned this issue Sep 23, 2024

Cassandra pod stuck instead of CrashLoopBackOff when Medusa config loading fails k8ssandra/k8ssandra-operator#1406

Closed
