Fix units with stuck reconciliation #367

Merged
merged 7 commits into from
Sep 11, 2024
2 changes: 2 additions & 0 deletions docs/reference/cos.md
@@ -19,6 +19,8 @@ The "GitHub Self-Hosted Runner Metrics" metrics dashboard presents the following
- Runner idle duration
- Charm reconciliation duration
- Job queue duration - how long a job waits in the queue before a runner picks it up
+ - Max job queue duration by application: Similar to "Job queue duration" panel, but shows maximum durations by charm application.
+ - Average reconciliation interval: Shows the average time between reconciliation events, broken down by charm application.
- Jobs: Displays certain metrics about the jobs executed by the runners. These metrics can be displayed per repository by specifying a
regular expression on the `Repository` variable. The following metrics are displayed:
- Proportion charts: Share of jobs by completion status, job conclusion, application, repo policy check failure http codes and github events over time.
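The "Average reconciliation interval" panel added in `docs/reference/cos.md` reports the mean time between reconciliation events. As a rough illustration of that quantity only (this is not how the dashboard computes it; the function name `average_interval` and the list of timestamps are hypothetical), the mean gap between consecutive events can be sketched as:

```python
from datetime import datetime, timedelta
from typing import Sequence


def average_interval(timestamps: Sequence[datetime]) -> timedelta:
    """Hypothetical helper: mean time between consecutive events.

    The mean of the consecutive gaps equals (last - first) / (n - 1),
    because the intermediate timestamps cancel out in the sum.
    """
    if len(timestamps) < 2:
        raise ValueError("need at least two events to compute an interval")
    ordered = sorted(timestamps)
    return (ordered[-1] - ordered[0]) / (len(ordered) - 1)


# Usage: three reconciliation events at minute 0, 10 and 30 (made-up data).
start = datetime(2024, 9, 11, 12, 0)
events = [start, start + timedelta(minutes=10), start + timedelta(minutes=30)]
mean_gap = average_interval(events)
```

A stuck reconciliation would show up here as one unusually large gap pulling the average up, which is presumably why the panel is broken down by charm application.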
11 changes: 6 additions & 5 deletions src-docs/charm.md
@@ -15,12 +15,13 @@ Charm for creating and managing GitHub self-hosted runner instances.
- **RECONCILE_INTERVAL_CONFIG_NAME**
- **TEST_MODE_CONFIG_NAME**
- **TOKEN_CONFIG_NAME**
+ - **RECONCILIATION_INTERVAL_TIMEOUT_FACTOR**
- **RECONCILE_RUNNERS_EVENT**
- **REACTIVE_MQ_DB_NAME**

---

- <a href="../src/charm.py#L113"><img align="right" style="float:right;" src="https://img.shields.io/badge/-source-cccccc?style=flat-square"></a>
+ <a href="../src/charm.py#L117"><img align="right" style="float:right;" src="https://img.shields.io/badge/-source-cccccc?style=flat-square"></a>

## <kbd>function</kbd> `catch_charm_errors`

@@ -46,7 +47,7 @@ Catch common errors in charm.

---

- <a href="../src/charm.py#L154"><img align="right" style="float:right;" src="https://img.shields.io/badge/-source-cccccc?style=flat-square"></a>
+ <a href="../src/charm.py#L158"><img align="right" style="float:right;" src="https://img.shields.io/badge/-source-cccccc?style=flat-square"></a>

## <kbd>function</kbd> `catch_action_errors`

@@ -72,7 +73,7 @@ Catch common errors in actions.

---

- <a href="../src/charm.py#L106"><img align="right" style="float:right;" src="https://img.shields.io/badge/-source-cccccc?style=flat-square"></a>
+ <a href="../src/charm.py#L110"><img align="right" style="float:right;" src="https://img.shields.io/badge/-source-cccccc?style=flat-square"></a>

## <kbd>class</kbd> `ReconcileRunnersEvent`
Event representing a periodic check to ensure runners are ok.
@@ -83,7 +84,7 @@ Event representing a periodic check to ensure runners are ok.

---

- <a href="../src/charm.py#L192"><img align="right" style="float:right;" src="https://img.shields.io/badge/-source-cccccc?style=flat-square"></a>
+ <a href="../src/charm.py#L196"><img align="right" style="float:right;" src="https://img.shields.io/badge/-source-cccccc?style=flat-square"></a>

## <kbd>class</kbd> `GithubRunnerCharm`
Charm for managing GitHub self-hosted runners.
@@ -100,7 +101,7 @@ Charm for managing GitHub self-hosted runners.
- <b>`ram_pool_path`</b>: The path to memdisk storage.
- <b>`kernel_module_path`</b>: The path to kernel modules.

- <a href="../src/charm.py#L215"><img align="right" style="float:right;" src="https://img.shields.io/badge/-source-cccccc?style=flat-square"></a>
+ <a href="../src/charm.py#L219"><img align="right" style="float:right;" src="https://img.shields.io/badge/-source-cccccc?style=flat-square"></a>

### <kbd>method</kbd> `__init__`

8 changes: 2 additions & 6 deletions src-docs/event_timer.md
@@ -109,7 +109,7 @@ Construct the timer manager.

---

- <a href="../src/event_timer.py#L151"><img align="right" style="float:right;" src="https://img.shields.io/badge/-source-cccccc?style=flat-square"></a>
+ <a href="../src/event_timer.py#L146"><img align="right" style="float:right;" src="https://img.shields.io/badge/-source-cccccc?style=flat-square"></a>

### <kbd>method</kbd> `disable_event_timer`

@@ -138,11 +138,7 @@ Disable the systemd timer for the given event.
### <kbd>method</kbd> `ensure_event_timer`

```python
- ensure_event_timer(
-     event_name: str,
-     interval: int,
-     timeout: Optional[int] = None
- ) → None
+ ensure_event_timer(event_name: str, interval: int, timeout: int) → None
```

Ensure that a systemd service and timer are registered to dispatch the given event.
7 changes: 6 additions & 1 deletion src/charm.py
@@ -94,6 +94,10 @@
from runner_manager import LXDRunnerManager, LXDRunnerManagerConfig
from runner_manager_type import LXDFlushMode

+ # We assume a stuck reconcile event when it takes longer
+ # than 10 times a normal interval. Currently, we are only aware of
+ # https://bugs.launchpad.net/juju/+bug/2055184 causing a stuck reconcile event.
+ RECONCILIATION_INTERVAL_TIMEOUT_FACTOR = 10
RECONCILE_RUNNERS_EVENT = "reconcile-runners"

# This is currently hardcoded and may be moved to a config option in the future.
@@ -555,7 +559,8 @@ def _set_reconcile_timer(self) -> None:
         self._event_timer.ensure_event_timer(
             event_name="reconcile-runners",
             interval=int(self.config[RECONCILE_INTERVAL_CONFIG_NAME]),
-            timeout=int(self.config[RECONCILE_INTERVAL_CONFIG_NAME]) - 1,
+            timeout=RECONCILIATION_INTERVAL_TIMEOUT_FACTOR
+            * int(self.config[RECONCILE_INTERVAL_CONFIG_NAME]),
         )

def _ensure_reconcile_timer_is_active(self) -> None:
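The `src/charm.py` change replaces a timeout of one minute less than the reconcile interval with a multiple of that interval. A minimal sketch of the arithmetic, assuming the interval is expressed in minutes (as the `ensure_event_timer` docstring states); the helper names `reconcile_timeout_minutes` and `previous_timeout_minutes` are hypothetical and do not exist in the charm:

```python
# Value taken from the diff; everything else in this block is illustrative.
RECONCILIATION_INTERVAL_TIMEOUT_FACTOR = 10


def reconcile_timeout_minutes(interval_minutes: int) -> int:
    """Hypothetical helper: timeout after which a reconcile event counts as stuck."""
    if interval_minutes <= 0:
        raise ValueError("interval must be a positive number of minutes")
    return RECONCILIATION_INTERVAL_TIMEOUT_FACTOR * interval_minutes


def previous_timeout_minutes(interval_minutes: int) -> int:
    """Hypothetical helper mirroring the removed line: interval minus one minute."""
    return interval_minutes - 1


# Usage: with a 10-minute interval the timeout moves from 9 to 100 minutes.
new_timeout = reconcile_timeout_minutes(10)
old_timeout = previous_timeout_minutes(10)
```

Under the new rule the timeout is always longer than the interval, so it appears intended only to catch runs that are stuck rather than to cap ordinary reconciliation time.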
11 changes: 3 additions & 8 deletions src/event_timer.py
@@ -5,7 +5,7 @@
import logging
import subprocess # nosec B404
from pathlib import Path
- from typing import Optional, TypedDict
+ from typing import TypedDict

import jinja2

@@ -107,9 +107,7 @@ def is_active(self, event_name: str) -> bool:

         return ret_code == 0

-    def ensure_event_timer(
-        self, event_name: str, interval: int, timeout: Optional[int] = None
-    ) -> None:
+    def ensure_event_timer(self, event_name: str, interval: int, timeout: int) -> None:
"""Ensure that a systemd service and timer are registered to dispatch the given event.

The interval is how frequently, in minutes, the event should be dispatched.
@@ -125,10 +123,7 @@ def ensure_event_timer(
         Raises:
             TimerEnableError: Timer cannot be started. Events will be not emitted.
         """
-        if timeout is not None:
-            timeout_in_secs = timeout * 60
-        else:
-            timeout_in_secs = interval * 30
+        timeout_in_secs = timeout * 60

context: EventConfig = {
"event": event_name,
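The `src/event_timer.py` change makes `timeout` a required argument and drops the fallback of half the interval. A minimal sketch of the before and after conversion to seconds, lifted out of the method into standalone functions; the names `timeout_in_secs_before` and `timeout_in_secs_after` are hypothetical and used here only for comparison:

```python
from typing import Optional


def timeout_in_secs_before(interval: int, timeout: Optional[int] = None) -> int:
    """Removed logic: an explicit timeout in minutes, else half the interval."""
    if timeout is not None:
        return timeout * 60
    # interval is in minutes, so interval * 30 seconds is half the interval.
    return interval * 30


def timeout_in_secs_after(timeout: int) -> int:
    """New logic: the caller must always supply a timeout in minutes."""
    return timeout * 60


# Usage with a 10-minute interval.
fallback_secs = timeout_in_secs_before(interval=10)            # half the interval
explicit_secs = timeout_in_secs_before(interval=10, timeout=9)  # old charm call
required_secs = timeout_in_secs_after(timeout=100)              # new charm call
```

Requiring the argument removes an implicit default, so every caller has to choose a timeout deliberately.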