Fix units with stuck reconciliation (#367)
* add timeout command

* Check-in panel with average actual reconciliation interval

* introduce constant

* update dashboard

* rename panel using term "Application"

* update docs

* remove Optional
cbartz committed Sep 11, 2024
1 parent 0292abb commit 0e891cc
Showing 7 changed files with 181 additions and 61 deletions.
2 changes: 2 additions & 0 deletions docs/reference/cos.md
@@ -19,6 +19,8 @@ The "GitHub Self-Hosted Runner Metrics" metrics dashboard presents the following
 - Runner idle duration
 - Charm reconciliation duration
 - Job queue duration - how long a job waits in the queue before a runner picks it up
+- Max job queue duration by application: Similar to "Job queue duration" panel, but shows maximum durations by charm application.
+- Average reconciliation interval: Shows the average time between reconciliation events, broken down by charm application.
 - Jobs: Displays certain metrics about the jobs executed by the runners. These metrics can be displayed per repository by specifying a
 regular expression on the `Repository` variable. The following metrics are displayed:
 - Proportion charts: Share of jobs by completion status, job conclusion, application, repo policy check failure http codes and github events over time.
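The "Average reconciliation interval" panel described above reports the mean time between consecutive reconciliation events. The dashboard query itself is not part of the hunk shown here, so the following is only a sketch of the quantity being plotted; the function name `average_interval` and the sample timestamps are hypothetical.

```python
from statistics import mean
from typing import Sequence


def average_interval(timestamps: Sequence[float]) -> float:
    """Return the mean gap between consecutive event timestamps.

    The result is in the same unit as the input (for example seconds).
    Hypothetical helper for illustration; not taken from the charm.
    """
    if len(timestamps) < 2:
        raise ValueError("need at least two events to compute an interval")
    ordered = sorted(timestamps)
    # Pair each event with its successor and measure the gap between them.
    gaps = [later - earlier for earlier, later in zip(ordered, ordered[1:])]
    return mean(gaps)


if __name__ == "__main__":
    # Three reconciliation events, 600 s and 700 s apart.
    print(average_interval([0, 600, 1300]))
```

A unit whose reconciliation is stuck would show up as a growing gap between events, which is what the panel is meant to make visible per application.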
11 changes: 6 additions & 5 deletions src-docs/charm.md
@@ -15,12 +15,13 @@ Charm for creating and managing GitHub self-hosted runner instances.
 - **RECONCILE_INTERVAL_CONFIG_NAME**
 - **TEST_MODE_CONFIG_NAME**
 - **TOKEN_CONFIG_NAME**
+- **RECONCILIATION_INTERVAL_TIMEOUT_FACTOR**
 - **RECONCILE_RUNNERS_EVENT**
 - **REACTIVE_MQ_DB_NAME**

 ---

-<a href="../src/charm.py#L113"><img align="right" style="float:right;" src="https://img.shields.io/badge/-source-cccccc?style=flat-square"></a>
+<a href="../src/charm.py#L117"><img align="right" style="float:right;" src="https://img.shields.io/badge/-source-cccccc?style=flat-square"></a>

 ## <kbd>function</kbd> `catch_charm_errors`

@@ -46,7 +47,7 @@ Catch common errors in charm.

 ---

-<a href="../src/charm.py#L154"><img align="right" style="float:right;" src="https://img.shields.io/badge/-source-cccccc?style=flat-square"></a>
+<a href="../src/charm.py#L158"><img align="right" style="float:right;" src="https://img.shields.io/badge/-source-cccccc?style=flat-square"></a>

 ## <kbd>function</kbd> `catch_action_errors`

@@ -72,7 +73,7 @@ Catch common errors in actions.

 ---

-<a href="../src/charm.py#L106"><img align="right" style="float:right;" src="https://img.shields.io/badge/-source-cccccc?style=flat-square"></a>
+<a href="../src/charm.py#L110"><img align="right" style="float:right;" src="https://img.shields.io/badge/-source-cccccc?style=flat-square"></a>

 ## <kbd>class</kbd> `ReconcileRunnersEvent`
 Event representing a periodic check to ensure runners are ok.
@@ -83,7 +84,7 @@ Event representing a periodic check to ensure runners are ok.

 ---

-<a href="../src/charm.py#L192"><img align="right" style="float:right;" src="https://img.shields.io/badge/-source-cccccc?style=flat-square"></a>
+<a href="../src/charm.py#L196"><img align="right" style="float:right;" src="https://img.shields.io/badge/-source-cccccc?style=flat-square"></a>

 ## <kbd>class</kbd> `GithubRunnerCharm`
 Charm for managing GitHub self-hosted runners.
@@ -100,7 +101,7 @@ Charm for managing GitHub self-hosted runners.
 - <b>`ram_pool_path`</b>: The path to memdisk storage.
 - <b>`kernel_module_path`</b>: The path to kernel modules.

-<a href="../src/charm.py#L215"><img align="right" style="float:right;" src="https://img.shields.io/badge/-source-cccccc?style=flat-square"></a>
+<a href="../src/charm.py#L219"><img align="right" style="float:right;" src="https://img.shields.io/badge/-source-cccccc?style=flat-square"></a>

 ### <kbd>method</kbd> `__init__`

8 changes: 2 additions & 6 deletions src-docs/event_timer.md
@@ -109,7 +109,7 @@ Construct the timer manager.

 ---

-<a href="../src/event_timer.py#L151"><img align="right" style="float:right;" src="https://img.shields.io/badge/-source-cccccc?style=flat-square"></a>
+<a href="../src/event_timer.py#L146"><img align="right" style="float:right;" src="https://img.shields.io/badge/-source-cccccc?style=flat-square"></a>

 ### <kbd>method</kbd> `disable_event_timer`

@@ -138,11 +138,7 @@ Disable the systemd timer for the given event.
 ### <kbd>method</kbd> `ensure_event_timer`

 ```python
-ensure_event_timer(
-    event_name: str,
-    interval: int,
-    timeout: Optional[int] = None
-) → None
+ensure_event_timer(event_name: str, interval: int, timeout: int) → None
 ```

 Ensure that a systemd service and timer are registered to dispatch the given event.
7 changes: 6 additions & 1 deletion src/charm.py
@@ -94,6 +94,10 @@
 from runner_manager import LXDRunnerManager, LXDRunnerManagerConfig
 from runner_manager_type import LXDFlushMode

+# We assume a stuck reconcile event when it takes longer
+# than 10 times a normal interval. Currently, we are only aware of
+# https://bugs.launchpad.net/juju/+bug/2055184 causing a stuck reconcile event.
+RECONCILIATION_INTERVAL_TIMEOUT_FACTOR = 10
 RECONCILE_RUNNERS_EVENT = "reconcile-runners"

 # This is currently hardcoded and may be moved to a config option in the future.
@@ -555,7 +559,8 @@ def _set_reconcile_timer(self) -> None:
         self._event_timer.ensure_event_timer(
             event_name="reconcile-runners",
             interval=int(self.config[RECONCILE_INTERVAL_CONFIG_NAME]),
-            timeout=int(self.config[RECONCILE_INTERVAL_CONFIG_NAME]) - 1,
+            timeout=RECONCILIATION_INTERVAL_TIMEOUT_FACTOR
+            * int(self.config[RECONCILE_INTERVAL_CONFIG_NAME]),
         )

     def _ensure_reconcile_timer_is_active(self) -> None:
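As a rough sketch of the arithmetic in the `_set_reconcile_timer` hunk above: the timeout passed to `ensure_event_timer` changes from the interval minus one minute to the interval multiplied by `RECONCILIATION_INTERVAL_TIMEOUT_FACTOR`. The helper names `old_timeout_minutes` and `new_timeout_minutes` are hypothetical and do not exist in the charm; only the constant and the two expressions come from the diff.

```python
# Value taken from the diff; the helper functions below are illustrative only.
RECONCILIATION_INTERVAL_TIMEOUT_FACTOR = 10


def old_timeout_minutes(interval_minutes: int) -> int:
    """Timeout before this commit: one minute less than the interval."""
    return interval_minutes - 1


def new_timeout_minutes(interval_minutes: int) -> int:
    """Timeout after this commit: a fixed multiple of the interval."""
    return RECONCILIATION_INTERVAL_TIMEOUT_FACTOR * interval_minutes


if __name__ == "__main__":
    interval = 10  # assumed example value for the reconcile interval, in minutes
    print(old_timeout_minutes(interval), new_timeout_minutes(interval))
```

With an assumed interval of 10 minutes, the old expression gives 9 minutes and the new one gives 100 minutes, so only a reconciliation that runs far beyond its normal duration is treated as stuck.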
11 changes: 3 additions & 8 deletions src/event_timer.py
@@ -5,7 +5,7 @@
 import logging
 import subprocess  # nosec B404
 from pathlib import Path
-from typing import Optional, TypedDict
+from typing import TypedDict

 import jinja2

@@ -107,9 +107,7 @@ def is_active(self, event_name: str) -> bool:

         return ret_code == 0

-    def ensure_event_timer(
-        self, event_name: str, interval: int, timeout: Optional[int] = None
-    ) -> None:
+    def ensure_event_timer(self, event_name: str, interval: int, timeout: int) -> None:
         """Ensure that a systemd service and timer are registered to dispatch the given event.

         The interval is how frequently, in minutes, the event should be dispatched.
@@ -125,10 +123,7 @@ def ensure_event_timer(
         Raises:
             TimerEnableError: Timer cannot be started. Events will be not emitted.
         """
-        if timeout is not None:
-            timeout_in_secs = timeout * 60
-        else:
-            timeout_in_secs = interval * 30
+        timeout_in_secs = timeout * 60

         context: EventConfig = {
             "event": event_name,
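The last hunk removes the fallback branch from `ensure_event_timer`, so the caller must now always pass a timeout. A minimal sketch of the minutes-to-seconds conversion before and after, written as hypothetical standalone functions (in the source this logic sits inside the method, not in separate helpers):

```python
from typing import Optional


def timeout_in_secs_before(interval: int, timeout: Optional[int] = None) -> int:
    """Old behaviour: fall back to half the interval when no timeout is given."""
    if timeout is not None:
        return timeout * 60
    # interval is in minutes, so interval * 30 is half the interval in seconds.
    return interval * 30


def timeout_in_secs_after(timeout: int) -> int:
    """New behaviour: the timeout (in minutes) is mandatory and converted directly."""
    return timeout * 60


if __name__ == "__main__":
    print(timeout_in_secs_before(interval=10))             # fallback path
    print(timeout_in_secs_before(interval=10, timeout=9))  # explicit timeout
    print(timeout_in_secs_after(timeout=100))
```

Making `timeout` a required `int` removes the implicit half-interval default, which leaves the choice of timeout entirely to the call site in `src/charm.py`.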