[DPE-3684] Implement DA139 #663

Open
wants to merge 42 commits into
base: dpe-3684-reinitialise-raft

@dragomirp dragomirp commented Oct 27, 2024

Implement DA139:

  • Change the promote-to-primary action to promote units and reinitialise RAFT
  • Add status messages for stuck RAFT

codecov bot commented Oct 27, 2024

Codecov Report

Attention: Patch coverage is 56.36364% with 24 lines in your changes missing coverage. Please review.

Project coverage is 71.78%. Comparing base (7d49b47) to head (9c00278).

| Files with missing lines | Patch % | Lines |
|---|---|---|
| src/cluster.py | 38.09% | 13 Missing ⚠️ |
| src/charm.py | 66.66% | 6 Missing and 5 partials ⚠️ |
Additional details and impacted files
@@                      Coverage Diff                       @@
##           dpe-3684-reinitialise-raft     #663      +/-   ##
==============================================================
- Coverage                       72.15%   71.78%   -0.38%     
==============================================================
  Files                              15       15              
  Lines                            3426     3466      +40     
  Branches                          528      536       +8     
==============================================================
+ Hits                             2472     2488      +16     
- Misses                            827      846      +19     
- Partials                          127      132       +5     


@dragomirp dragomirp changed the title from "[DPE-3684] Three units scenarios" to "[DPE-3684] Implement DA139" Dec 24, 2024
Comment on lines -109 to -111
self.framework.observe(
    self.charm.on.promote_to_primary_action, self._on_promote_to_primary
)

Moved to the main charm code, since it's no longer used only for async promotion.

Comment on lines -865 to -874
try:
    health_status = self.get_patroni_health()
except Exception:
    logger.warning("Remove raft member: Unable to get health status")
    health_status = {}
if health_status.get("role") in ("leader", "master") or health_status.get(
    "sync_standby"
):
    logger.info(f"{self.charm.unit.name} is raft candidate")
    data_flags["raft_candidate"] = "True"

The unit no longer marks itself as a raft candidate here; reinitialisation now waits for the action to start it.

@@ -746,15 +747,18 @@ def stop_patroni(self) -> bool:
logger.exception(error_message, exc_info=e)
return False

-    def switchover(self) -> None:
+    def switchover(self, candidate: str | None = None) -> None:

Pass a candidate when promoting a specific unit.
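
For readers unfamiliar with the optional candidate, here is a minimal sketch of how a request body for Patroni's `POST /switchover` endpoint could be built. Patroni's REST API takes a mandatory `leader` and an optional `candidate` in that body. The function name `build_switchover_payload` and the member names are hypothetical, and this is not the charm's actual implementation.

```python
from __future__ import annotations


def build_switchover_payload(leader: str, candidate: str | None = None) -> dict[str, str]:
    """Build a JSON-serialisable body for a Patroni switchover request.

    Hypothetical helper for illustration only. When no candidate is given,
    the key is omitted so that Patroni picks the new leader itself.
    """
    payload = {"leader": leader}
    if candidate is not None:
        # Only name a target when a specific unit is being promoted.
        payload["candidate"] = candidate
    return payload
```

Calling it with only a leader leaves the choice of the new primary to Patroni, while passing a candidate asks for that specific member to be promoted.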

for unit in units:
logger.info(f"Stopping unit {unit}")
await stop_machine(ops_test, await get_machine_from_unit(ops_test, unit))
await sleep(15)

Sleep to give Juju leadership time to move to another unit.
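
If the fixed delay ever proves flaky, one possible alternative is to poll for a condition with a timeout instead. The sketch below is a generic helper and not part of this PR; the name `wait_until` is hypothetical, and the predicate (for example, a check that a surviving unit now holds leadership) is left to the caller.

```python
import asyncio
import time
from typing import Awaitable, Callable


async def wait_until(
    predicate: Callable[[], Awaitable[bool]],
    timeout: float = 60.0,
    interval: float = 1.0,
) -> bool:
    """Poll an async predicate until it returns True or the timeout expires.

    Hypothetical helper for illustration. Returns True as soon as the
    predicate succeeds, or False once the deadline has passed.
    """
    deadline = time.monotonic() + timeout
    while True:
        if await predicate():
            return True
        if time.monotonic() >= deadline:
            return False
        # Yield to the event loop between checks.
        await asyncio.sleep(interval)
```

The trade-off is that polling returns as soon as the condition holds, but it needs a reliable way to observe the condition, which a plain sleep does not.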

Comment on lines +109 to +114
# Check if Patroni self healed
assert (
    left_unit.workload_status == "active"
    and left_unit.workload_status_message == "Primary"
)
logger.warning(f"Patroni self-healed without raft reinitialisation for roles {roles}")

Sometimes, when the primary and the async replica are removed, Patroni manages to survive, so I added an exception for this case. Should I narrow it down further?

@dragomirp dragomirp marked this pull request as ready for review January 24, 2025 01:43
@dragomirp dragomirp requested review from a team, taurus-forever, marceloneppel and lucasgameiroborges and removed request for a team January 24, 2025 01:43