Cassandra pod stuck instead of CrashLoopBackOff when Medusa config loading fails #805

Closed
c3-clement opened this issue Sep 18, 2024 · 3 comments · Fixed by #806

c3-clement commented Sep 18, 2024

Project board link

See k8ssandra-operator issue: k8ssandra/k8ssandra-operator#1406

What happened?

I deployed a K8ssandraCluster with 96 replicas and Medusa enabled, and one of the pods never passed its readiness probe:

k get sts cs-95f5cdf50d-cs-95f5cdf50d-default-sts -n platform
NAME                                      READY   AGE
cs-95f5cdf50d-cs-95f5cdf50d-default-sts   95/96

I identified the faulty pod: it was failing its readiness probe because of the medusa container.
The Medusa gRPC server did not start because load_config() raised an exception (see logs below).
Since the gRPC server never started, the readiness probe could not succeed.
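
To check by hand whether the gRPC server ever came up in a pod like this, a plain TCP connect against the server port is enough to tell the two states apart. This is only a sketch: the host and port are placeholders (the actual probe definition and port are not shown in this issue), and `port_is_listening` is a hypothetical helper, not part of Medusa.

```python
import socket


def port_is_listening(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Connection refused, timeout, unreachable host: nothing is serving there.
        return False
```

If the Python process died before binding its port, this returns False for as long as the container stays up, which matches the pod staying not ready.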

The medusa container stayed "blocked": it neither exited nor attempted to restart the gRPC server.
I restarted the pod manually by deleting it, and the Medusa gRPC server then started successfully.

Did you expect to see something different?

I expect the pod to restart and end up in CrashLoopBackOff if an uncaught exception is raised by the Medusa Python process, instead of blocking indefinitely.
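
The expected behavior boils down to one rule: whatever launches the Medusa Python process must exit with that process's status instead of waiting forever, so that the kubelet sees the container terminate and restarts it. A minimal sketch of that rule follows, assuming the container is started through some wrapper; `run_and_propagate` is a hypothetical name and the child command is a stand-in, not Medusa's actual entrypoint.

```python
import subprocess
import sys


def run_and_propagate(cmd: list[str]) -> int:
    """Run cmd to completion and return its exit status unchanged."""
    completed = subprocess.run(cmd)
    return completed.returncode


# Stand-in for a server whose startup raises an uncaught exception.
failing_server = [sys.executable, "-c", "raise RuntimeError('config loading failed')"]
```

A wrapper would end with `sys.exit(run_and_propagate(failing_server))`. An uncaught Python exception makes the interpreter exit with status 1, and a container that keeps exiting non-zero is what Kubernetes reports as CrashLoopBackOff.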

I believe this behavior was introduced by the following change: #731

How to reproduce it (as minimally and precisely as possible):
Start the medusa container with an invalid configuration.

Environment

  • K8ssandra Operator version: 1.18
  • Medusa version: 0.21
  • Kubernetes version information: 1.29
  • Kubernetes cluster kind: GKE

Medusa logs

MEDUSA_MODE = GRPC
sleeping for 0 sec
Starting Medusa gRPC service
WARNING:root:The CQL_USERNAME environment variable is deprecated and has been replaced by the MEDUSA_CQL_USERNAME variable
WARNING:root:The CQL_PASSWORD environment variable is deprecated and has been replaced by the MEDUSA_CQL_PASSWORD variable
Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "/home/cassandra/.venv/lib/python3.11/site-packages/medusa/service/grpc/server.py", line 424, in <module>
    asyncio.run(main())
  File "/usr/lib/python3.11/asyncio/runners.py", line 190, in run
    return runner.run(main)
           ^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/asyncio/runners.py", line 118, in run
    return self._loop.run_until_complete(task)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/asyncio/base_events.py", line 654, in run_until_complete
    return future.result()
           ^^^^^^^^^^^^^^^
  File "/home/cassandra/.venv/lib/python3.11/site-packages/medusa/service/grpc/server.py", line 419, in main
    server = Server(config_file_path)
             ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/cassandra/.venv/lib/python3.11/site-packages/medusa/service/grpc/server.py", line 53, in __init__
    self.medusa_config = self.create_config()
                         ^^^^^^^^^^^^^^^^^^^^
  File "/home/cassandra/.venv/lib/python3.11/site-packages/medusa/service/grpc/server.py", line 88, in create_config
    return load_config(args, config_file)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/cassandra/.venv/lib/python3.11/site-packages/medusa/config.py", line 315, in load_config
    config = parse_config(args, config_file)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/cassandra/.venv/lib/python3.11/site-packages/medusa/config.py", line 280, in parse_config
    config.set('storage', 'fqdn', hostname_resolver.resolve_fqdn())
                                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/cassandra/.venv/lib/python3.11/site-packages/medusa/network/hostname_resolver.py", line 48, in resolve_fqdn
    hostname = self.compute_k8s_hostname(ip_address_to_resolve)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/cassandra/.venv/lib/python3.11/site-packages/medusa/network/hostname_resolver.py", line 56, in compute_k8s_hostname
    fqdns = dns.resolver.resolve(reverse_name, 'PTR')
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/cassandra/.venv/lib/python3.11/site-packages/dns/resolver.py", line 1565, in resolve
    return get_default_resolver().resolve(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/cassandra/.venv/lib/python3.11/site-packages/dns/resolver.py", line 1307, in resolve
    (request, answer) = resolution.next_request()
                        ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/cassandra/.venv/lib/python3.11/site-packages/dns/resolver.py", line 749, in next_request
    raise NXDOMAIN(qnames=self.qnames_to_try, responses=self.nxdomain_responses)
dns.resolver.NXDOMAIN: The DNS query name does not exist: 92.49.20.172.in-addr.arpa.
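
The last line of the traceback shows a reverse (PTR) lookup failing for the pod IP 172.20.49.92. If the missing record was only transient (an assumption; the issue does not confirm it), retrying the lookup a few times before giving up is one possible mitigation. The sketch below is an illustration, not the fix that was merged: `reverse_name`, `retry` and their parameters are hypothetical, and the `lookup` callable stands in for the `dns.resolver.resolve(reverse_name, 'PTR')` call seen above.

```python
import ipaddress
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def reverse_name(ip: str) -> str:
    """Build the fully qualified PTR query name for an IP address."""
    return ipaddress.ip_address(ip).reverse_pointer + "."


def retry(lookup: Callable[[], T], attempts: int = 3, delay: float = 1.0,
          retry_on: Tuple[Type[BaseException], ...] = (Exception,)) -> T:
    """Call lookup() up to `attempts` times, re-raising the last failure."""
    for attempt in range(1, attempts + 1):
        try:
            return lookup()
        except retry_on:
            if attempt == attempts:
                raise  # out of attempts: let the process fail loudly
            time.sleep(delay)
    raise ValueError("attempts must be at least 1")
```

With dnspython, `retry_on` would be `(dns.resolver.NXDOMAIN,)`. Once the attempts are exhausted the exception still propagates, so the process exits non-zero rather than hiding the failure.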

┆Issue is synchronized with this Jira Story by Unito
┆Reviewer: Alexander Dejanovski
┆Fix Versions: 2024-10
┆Issue Number: MED-97

@c3-clement
Contributor Author

Hello @adejanovski @rzvoncek

FYI, I'm about to submit a PR to address this issue.

@c3-clement
Contributor Author

Thanks @adejanovski.
When can we expect a release?

@rzvoncek
Contributor

rzvoncek commented Oct 3, 2024

@c3-clement Medusa 0.22.3 is out with this patch in it. The k8ssandra-operator release should follow next week, it seems.
