
K8SPS-204: don't depend on orchestrator for topology discovery #339

Closed
wants to merge 23 commits into from

Conversation

pooknull
Contributor

@pooknull pooknull commented Mar 29, 2023

K8SPS-204

https://jira.percona.com/browse/K8SPS-204

DESCRIPTION

Problem:
We don't want to depend on the orchestrator for topology discovery, as it would be a single point of failure.

Solution:
We should improve the pkg/mysql/topology package and use it instead of sending requests to the orchestrator.
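To illustrate the idea behind direct topology discovery, here is a minimal sketch in Go. The type and function names are hypothetical and do not reflect the actual `pkg/mysql/topology` API: it only shows how the primary and replicas can be inferred from each host's own replication status, with no orchestrator in the loop.

```go
package main

import "fmt"

// ReplicationStatus is a hypothetical, simplified view of what a
// "SHOW REPLICA STATUS" query returns for one MySQL host.
type ReplicationStatus struct {
	Host   string
	Source string // empty if the host replicates from no one, i.e. it is the primary
}

// DiscoverTopology infers the primary and its replicas directly from the
// hosts' replication status, without asking an orchestrator.
func DiscoverTopology(statuses []ReplicationStatus) (primary string, replicas []string) {
	for _, s := range statuses {
		if s.Source == "" {
			primary = s.Host
		} else {
			replicas = append(replicas, s.Host)
		}
	}
	return primary, replicas
}

func main() {
	p, r := DiscoverTopology([]ReplicationStatus{
		{Host: "mysql-0", Source: ""},
		{Host: "mysql-1", Source: "mysql-0"},
		{Host: "mysql-2", Source: "mysql-0"},
	})
	fmt.Println(p, r)
}
```

In the real package each status would come from a SQL query against the pod itself, which is exactly what removes the orchestrator as a discovery dependency.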

Changes:

  1. Removed build/haproxy_check_primary.sh and build/haproxy_check_replicas.sh; added the cmd/haproxy-check/main.go binary, which uses the new topology package.
  2. cmd/bootstrap now uses the new topology package.
  3. Added the TOPOLOGY_EXPERIMENTAL environment variable to the operator. Setting it to true forces the operator to discover the topology without the orchestrator.
  4. Added the self-healing-chaos and gr-self-healing-chaos tests.

CHECKLIST

Jira

  • Is the Jira ticket created and referenced properly?
  • Does the Jira ticket have the proper statuses for documentation (Needs Doc) and QA (Needs QA)?
  • Does the Jira ticket link to the proper milestone (Fix Version field)?

Tests

  • Is an E2E test/test case added for the new feature/change?
  • Are unit tests added where appropriate?

Config/Logging/Testability

  • Are all needed new/changed options added to default YAML files?
  • Are the manifests (crd/bundle) regenerated if needed?
  • Did we add proper logging messages for operator actions?
  • Did we ensure compatibility with the previous version or cluster upgrade process?
  • Does the change support oldest and newest supported PS version?
  • Does the change support oldest and newest supported Kubernetes version?

@pull-request-size pull-request-size bot added the size/XL 500-999 lines label Mar 29, 2023
@pooknull pooknull marked this pull request as ready for review April 4, 2023 07:44
@pooknull pooknull force-pushed the dev/K8SPS-204-improve-topology branch from b4a5330 to cc640a9 on July 10, 2023 13:27
@pooknull pooknull force-pushed the dev/K8SPS-204-improve-topology branch from e2320fe to a6e616a on July 19, 2023 19:24
@pull-request-size pull-request-size bot added size/XXL 1000+ lines and removed size/XL 500-999 lines labels Jul 19, 2023
e2e-tests/functions (review thread: outdated, resolved)
@egegunes
Contributor

  • The idea of TOPOLOGY_EXPERIMENTAL is: if it's not enabled, we should preserve the old behavior. As I see it, we use the new experimental topology manager in bootstrap and in the HAProxy checks even when it's not enabled.

  • We shouldn't delete orc-handler. The idea is to not depend on the orchestrator for discovery; it's still the orchestrator that manages the async topology, and it should be the orchestrator that labels the primary pod.

  • I see the new tests are failing; you probably need to drop finalizers as the last step, as we do in other tests.

@egegunes
Contributor

egegunes commented Aug 4, 2023

  • Please fix conflicts.
  • orc-handler should come back (right now ps-init-entrypoint.sh fails because we still try to install it even though we removed it from the Dockerfile).
  • bootstrap fails on async replication.

The first pod goes into CrashLoopBackOff with the following error:

2023/08/04 07:32:54 Peers: [3237353861636563.cluster2-mysql-unready.ps]
2023/08/04 07:32:54 FQDN: cluster2-mysql-0.cluster2-mysql.ps
2023/08/04 07:32:54 Primary: cluster2-mysql-0.cluster2-mysql.ps Replicas: []
2023/08/04 07:32:54 lookup cluster2-mysql-0 [10.28.0.8]
2023/08/04 07:32:54 PodIP: 10.28.0.8
2023/08/04 07:32:54 bootstrap finished in 0.039863 seconds
2023/08/04 07:32:54 bootstrap failed: get primary IP: lookup cluster2-mysql-0.cluster2-mysql.ps: lookup cluster2-mysql-0.cluster2-mysql.ps on 10.24.112.10:53: no such host
2023/08/04 07:33:14 Peers: [3237353861636563.cluster2-mysql-unready.ps]
2023/08/04 07:33:14 FQDN: cluster2-mysql-0.cluster2-mysql.ps
2023/08/04 07:33:14 Primary: cluster2-mysql-0.cluster2-mysql.ps Replicas: []
2023/08/04 07:33:14 lookup cluster2-mysql-0 [10.28.0.8]
2023/08/04 07:33:14 PodIP: 10.28.0.8
2023/08/04 07:33:14 bootstrap finished in 0.071393 seconds
2023/08/04 07:33:14 bootstrap failed: get primary IP: lookup cluster2-mysql-0.cluster2-mysql.ps: lookup cluster2-mysql-0.cluster2-mysql.ps on 10.24.112.10:53: no such host

P.S.: During bootstrap the pod is not ready, therefore cluster2-mysql-0.cluster2-mysql.ps is not resolvable.

@JNKPercona
Collaborator

Test name Status
async-ignore-annotations passed
auto-config passed
config passed
config-router passed
demand-backup passed
gr-bootstrap passed
gr-demand-backup passed
gr-haproxy failure
gr-ignore-annotations passed
gr-init-deploy passed
gr-one-pod passed
gr-scaling passed
gr-tls-cert-manager passed
gr-self-healing-chaos failure
haproxy passed
init-deploy passed
limits passed
monitoring passed
one-pod passed
scaling passed
service-per-pod passed
sidecars passed
tls-cert-manager passed
users passed
version-service passed
self-healing-chaos passed
We ran 26 out of 26 tests.

commit: 3a8483b
image: perconalab/percona-server-mysql-operator:PR-339-3a8483b

@tplavcic tplavcic mentioned this pull request Aug 17, 2023
@hors hors closed this Feb 20, 2024