Skip to content

Commit

Permalink
feat: Adjustments for COS integration for reactive mode (#390)
Browse files Browse the repository at this point in the history
* pin requirements.txt

* issue active_runners in lxd runner_manager

* create and use dedicated user

* only run test_reactive - REVERT ME

* some adaptions

* check metrics

* create home dir

* create home dir

* WIP checkin

* pin dev branch

* use tmate

* fix test

* add move_to_quarantine to shared_fs

* outcomment other tests

* pass in RUNNER_MANAGER_USER

* fix setup RUNNER_MANAGER_USER

* add a reconcile before metric extraction

* re-add the other reactive integration tests

* fix shared_fs

* run all tests

* fix test runner_manager_openstack

* fix test runner_manager_openstack

* reconcile for runner_installed_metric

* fix keypath

* outcomment integration tests

* call ops.main

* fix keypath

* outcomment patching

* add sleep

* remove charm metrics tests as they do pass

* remove test_charm_fork_repo

* make test_charm_runner less brittle

* add sleep to test_reactive to see if it fixes things

* catch FileNotFoundError in test_runner_manager_openstack.py

* remove test_charm_runner

* remove wrong use of usefixtures

* replace sleep by wait_for

* test if sleep fixes issues

* outcomment test reactive

* replace sleep by wait_for

* check for None

* add assert

* set log-level to DEBUG

* change wait_for condition

* remove check for metrics

* remove outcommented code

* restore log level

* refactor test_reactive

* run all tests

* fix changed health check

* increase timeout for reactive

* only run test_reactive

* update dashboard with active runners

* Revert "only run test_reactive"

This reverts commit 5a51184.

* update changelog

* make test_runner_manager_openstack.py less brittle

* pin branch

* remove tmate-debug

* re-add e2e test

* lint

* remove TODO

* fix case where pre_job metric stats are empty

* move StorageManager instantiation into functions

* move staticmethod to module level

* pin correct commit
  • Loading branch information
cbartz authored Oct 16, 2024
1 parent d960bd5 commit 5a31f10
Show file tree
Hide file tree
Showing 18 changed files with 304 additions and 70 deletions.
5 changes: 5 additions & 0 deletions docs/changelog.md
Original file line number Diff line number Diff line change
@@ -1,4 +1,9 @@
# Changelog
### 2024-10-11

- Added support for COS integration with reactive runners.
- The charm now creates a dedicated user which is used for running the reactive process and
storing metrics and ssh keys (also for non-reactive mode).

### 2024-10-07

Expand Down
2 changes: 1 addition & 1 deletion docs/reference/cos.md
Original file line number Diff line number Diff line change
Expand Up @@ -13,7 +13,7 @@ The "GitHub Self-Hosted Runner Metrics" metrics dashboard presents the following
- General: Displays general metrics about the charm and runners, such as:
- Lifecycle counters: Tracks the frequency of Runner initialisation, start, stop, and crash events.
- Available runners: A horizontal bar graph showing the number of runners available during the last reconciliation event. Note: This data is updated after each reconciliation event and is not real-time.
- Idle runners after reconciliation: A time series graph showing the number of runners marked as idle during the last reconciliation event over time. Note: This data is updated after each reconciliation event and is not real-time.
- Runners after reconciliation: A time series graph showing the number of runners marked as active/idle during the last reconciliation event over time. Note: This data is updated after each reconciliation event and is not real-time.
- Duration observations: Each data point aggregates the last hour and shows the 50th, 90th, 95th percentile and maximum durations for:
- Runner installation
- Runner idle duration
Expand Down
2 changes: 1 addition & 1 deletion requirements.txt
Original file line number Diff line number Diff line change
Expand Up @@ -12,4 +12,4 @@ cosl ==0.0.15
# juju 3.1.2.0 depends on pyyaml<=6.0 and >=5.1.2
PyYAML ==6.0.*
pyOpenSSL==24.2.1
github_runner_manager @ git+https://github.com/canonical/github-runner-manager.git@33bbaff42d7cc0f250006fdd08d24659cef364c9
github_runner_manager @ git+https://github.com/canonical/github-runner-manager.git@601dbb70a983b2902566503e462a8567f80758e7
9 changes: 6 additions & 3 deletions src-docs/charm.py.md
Original file line number Diff line number Diff line change
Expand Up @@ -19,10 +19,13 @@ Charm for creating and managing GitHub self-hosted runner instances.
- **RECONCILE_RUNNERS_EVENT**
- **REACTIVE_MQ_DB_NAME**
- **GITHUB_SELF_HOSTED_ARCH_LABELS**
- **ROOT_USER**
- **RUNNER_MANAGER_USER**
- **RUNNER_MANAGER_GROUP**

---

<a href="../src/charm.py#L123"><img align="right" style="float:right;" src="https://img.shields.io/badge/-source-cccccc?style=flat-square"></a>
<a href="../src/charm.py#L127"><img align="right" style="float:right;" src="https://img.shields.io/badge/-source-cccccc?style=flat-square"></a>

## <kbd>function</kbd> `catch_charm_errors`

Expand All @@ -48,7 +51,7 @@ Catch common errors in charm.

---

<a href="../src/charm.py#L164"><img align="right" style="float:right;" src="https://img.shields.io/badge/-source-cccccc?style=flat-square"></a>
<a href="../src/charm.py#L168"><img align="right" style="float:right;" src="https://img.shields.io/badge/-source-cccccc?style=flat-square"></a>

## <kbd>function</kbd> `catch_action_errors`

Expand Down Expand Up @@ -89,7 +92,7 @@ Charm for managing GitHub self-hosted runners.
- <b>`ram_pool_path`</b>: The path to memdisk storage.
- <b>`kernel_module_path`</b>: The path to kernel modules.

<a href="../src/charm.py#L225"><img align="right" style="float:right;" src="https://img.shields.io/badge/-source-cccccc?style=flat-square"></a>
<a href="../src/charm.py#L229"><img align="right" style="float:right;" src="https://img.shields.io/badge/-source-cccccc?style=flat-square"></a>

### <kbd>function</kbd> `__init__`

Expand Down
10 changes: 5 additions & 5 deletions src-docs/runner_manager.py.md
Original file line number Diff line number Diff line change
Expand Up @@ -50,7 +50,7 @@ Construct RunnerManager object for creating and managing runners.

---

<a href="../.tox/src-docs/lib/python3.10/site-packages/github_runner_manager/utilities.py#L797"><img align="right" style="float:right;" src="https://img.shields.io/badge/-source-cccccc?style=flat-square"></a>
<a href="../.tox/src-docs/lib/python3.10/site-packages/github_runner_manager/utilities.py#L805"><img align="right" style="float:right;" src="https://img.shields.io/badge/-source-cccccc?style=flat-square"></a>

### <kbd>function</kbd> `build_runner_image`

Expand Down Expand Up @@ -87,7 +87,7 @@ Check if runner binary exists.

---

<a href="../src/runner_manager.py#L608"><img align="right" style="float:right;" src="https://img.shields.io/badge/-source-cccccc?style=flat-square"></a>
<a href="../src/runner_manager.py#L616"><img align="right" style="float:right;" src="https://img.shields.io/badge/-source-cccccc?style=flat-square"></a>

### <kbd>function</kbd> `flush`

Expand Down Expand Up @@ -164,7 +164,7 @@ The runner binary URL changes when a new version is available.

---

<a href="../src/runner_manager.py#L789"><img align="right" style="float:right;" src="https://img.shields.io/badge/-source-cccccc?style=flat-square"></a>
<a href="../src/runner_manager.py#L797"><img align="right" style="float:right;" src="https://img.shields.io/badge/-source-cccccc?style=flat-square"></a>

### <kbd>function</kbd> `has_runner_image`

Expand All @@ -181,7 +181,7 @@ Check if the runner image exists.

---

<a href="../src/runner_manager.py#L526"><img align="right" style="float:right;" src="https://img.shields.io/badge/-source-cccccc?style=flat-square"></a>
<a href="../src/runner_manager.py#L534"><img align="right" style="float:right;" src="https://img.shields.io/badge/-source-cccccc?style=flat-square"></a>

### <kbd>function</kbd> `reconcile`

Expand All @@ -205,7 +205,7 @@ Bring runners in line with target.

---

<a href="../src/runner_manager.py#L812"><img align="right" style="float:right;" src="https://img.shields.io/badge/-source-cccccc?style=flat-square"></a>
<a href="../src/runner_manager.py#L820"><img align="right" style="float:right;" src="https://img.shields.io/badge/-source-cccccc?style=flat-square"></a>

### <kbd>function</kbd> `schedule_build_runner_image`

Expand Down
27 changes: 23 additions & 4 deletions src-docs/shared_fs.py.md
Original file line number Diff line number Diff line change
Expand Up @@ -13,7 +13,7 @@ Classes and functions to operate on the shared filesystem between the charm and

---

<a href="../src/shared_fs.py#L34"><img align="right" style="float:right;" src="https://img.shields.io/badge/-source-cccccc?style=flat-square"></a>
<a href="../src/shared_fs.py#L40"><img align="right" style="float:right;" src="https://img.shields.io/badge/-source-cccccc?style=flat-square"></a>

## <kbd>function</kbd> `create`

Expand Down Expand Up @@ -45,7 +45,7 @@ The method is not idempotent and will raise an exception if the shared filesyste

---

<a href="../src/shared_fs.py#L72"><img align="right" style="float:right;" src="https://img.shields.io/badge/-source-cccccc?style=flat-square"></a>
<a href="../src/shared_fs.py#L80"><img align="right" style="float:right;" src="https://img.shields.io/badge/-source-cccccc?style=flat-square"></a>

## <kbd>function</kbd> `list_all`

Expand All @@ -63,7 +63,7 @@ List all the metric storages.

---

<a href="../src/shared_fs.py#L88"><img align="right" style="float:right;" src="https://img.shields.io/badge/-source-cccccc?style=flat-square"></a>
<a href="../src/shared_fs.py#L98"><img align="right" style="float:right;" src="https://img.shields.io/badge/-source-cccccc?style=flat-square"></a>

## <kbd>function</kbd> `get`

Expand Down Expand Up @@ -95,7 +95,7 @@ Mounts the filesystem if it is not currently mounted.

---

<a href="../src/shared_fs.py#L128"><img align="right" style="float:right;" src="https://img.shields.io/badge/-source-cccccc?style=flat-square"></a>
<a href="../src/shared_fs.py#L140"><img align="right" style="float:right;" src="https://img.shields.io/badge/-source-cccccc?style=flat-square"></a>

## <kbd>function</kbd> `delete`

Expand All @@ -118,3 +118,22 @@ Delete the shared filesystem for the runner.
- <b>`DeleteMetricsStorageError`</b>: If the shared filesystem could not be deleted.


---

<a href="../src/shared_fs.py#L184"><img align="right" style="float:right;" src="https://img.shields.io/badge/-source-cccccc?style=flat-square"></a>

## <kbd>function</kbd> `move_to_quarantine`

```python
move_to_quarantine(runner_name: str) → None
```

Archive the mshared filesystem for the runner and delete it.



**Args:**

- <b>`runner_name`</b>: The name of the runner.


45 changes: 43 additions & 2 deletions src/charm.py
Original file line number Diff line number Diff line change
Expand Up @@ -48,6 +48,7 @@
)
from github_runner_manager.reactive.types_ import QueueConfig as ReactiveQueueConfig
from github_runner_manager.reactive.types_ import RunnerConfig as ReactiveRunnerConfig
from github_runner_manager.types_ import SystemUserConfig
from github_runner_manager.types_.github import GitHubPath, GitHubRunnerStatus, parse_github_path
from ops.charm import (
ActionEvent,
Expand All @@ -61,7 +62,6 @@
UpgradeCharmEvent,
)
from ops.framework import StoredState
from ops.main import main
from ops.model import ActiveStatus, BlockedStatus, MaintenanceStatus, WaitingStatus

import logrotate
Expand Down Expand Up @@ -110,6 +110,10 @@

GITHUB_SELF_HOSTED_ARCH_LABELS = {"x64", "arm64"}

ROOT_USER = "root"
RUNNER_MANAGER_USER = "runner-manager"
RUNNER_MANAGER_GROUP = "runner-manager"

logger = logging.getLogger(__name__)


Expand Down Expand Up @@ -454,6 +458,12 @@ def _common_install_code(self, state: CharmState) -> bool: # noqa: C901
logger.error("Failed to install charm dependencies")
raise

try:
_setup_runner_manager_user()
except SubprocessError:
logger.error("Failed to setup runner manager user")
raise

try:
logrotate.setup()
except LogrotateSetupError:
Expand Down Expand Up @@ -1295,6 +1305,7 @@ def _get_runner_scaler(
cloud_runner_manager=openstack_runner_manager_config,
github_token=token,
supported_labels=supported_labels,
system_user=SystemUserConfig(user=RUNNER_MANAGER_USER, group=RUNNER_MANAGER_GROUP),
)
return RunnerScaler(
runner_manager=runner_manager, reactive_runner_config=reactive_runner_config
Expand Down Expand Up @@ -1363,9 +1374,39 @@ def _create_openstack_runner_manager_config(
server_config=server_config,
runner_config=runner_config,
service_config=service_config,
system_user_config=SystemUserConfig(
user=RUNNER_MANAGER_USER, group=RUNNER_MANAGER_GROUP
),
)
return openstack_runner_manager_config


def _setup_runner_manager_user() -> None:
"""Create the user and required directories for the runner manager."""
# check if runner_manager user is already existing
_, retcode = execute_command(["/usr/bin/id", RUNNER_MANAGER_USER], check_exit=False)
if retcode != 0:
logger.info("Creating user %s", RUNNER_MANAGER_USER)
execute_command(
[
"/usr/sbin/useradd",
"--system",
"--create-home",
"--user-group",
RUNNER_MANAGER_USER,
],
)
execute_command(["/usr/bin/mkdir", "-p", f"/home/{RUNNER_MANAGER_USER}/.ssh"])
execute_command(
[
"/usr/bin/chown",
"-R",
f"{RUNNER_MANAGER_USER}:{RUNNER_MANAGER_USER}",
f"/home/{RUNNER_MANAGER_USER}/.ssh",
]
)
execute_command(["/usr/bin/chmod", "700", f"/home/{RUNNER_MANAGER_USER}/.ssh"])


if __name__ == "__main__":
main(GithubRunnerCharm)
ops.main(GithubRunnerCharm)
27 changes: 19 additions & 8 deletions src/grafana_dashboards/metrics.json
Original file line number Diff line number Diff line change
Expand Up @@ -24,7 +24,7 @@
"editable": true,
"fiscalYearStartMonth": 0,
"graphTooltip": 0,
"id": 567,
"id": 313,
"links": [],
"liveNow": false,
"panels": [
Expand Down Expand Up @@ -635,7 +635,7 @@
"type": "loki",
"uid": "${lokids}"
},
"description": "This panel totals the number of idle runners reported by the charm units after a reconciliation event. Note that there may be active runners in the time between reconciliations which will not be shown in this panel.",
"description": "This panel totals the number of active/idle runners reported by charm units after reconciliation. Note that this is not a real-time metric and should only be used to identify trends or to investigate problems. A given point in time shows the aggregation of the last reported values from the reconciliation. This means, for example, that if the graph shows 10 idle runners at a particular time, it doesn't mean that there are 10 idle runners at that time, just that this is the most recently reported value. Also note that runners who are reported as idle at the time of reconciliation may become active immediately afterwards. And active runners may become offline/inactive.",
"fieldConfig": {
"defaults": {
"color": {
Expand Down Expand Up @@ -702,7 +702,7 @@
"calcs": [],
"displayMode": "list",
"placement": "bottom",
"showLegend": false
"showLegend": true
},
"tooltip": {
"mode": "single",
Expand All @@ -721,9 +721,21 @@
"legendFormat": "Idle",
"queryType": "range",
"refId": "D"
},
{
"datasource": {
"type": "loki",
"uid": "${lokids}"
},
"editorMode": "code",
"expr": "sum by(filename)(last_over_time({filename=\"/var/log/github-runner-metrics.log\", juju_application=~\"$juju_application\", juju_model=~\"$juju_model\", juju_model_uuid=~\"$juju_model_uuid\", juju_unit=~\"$juju_unit\"} | json event=\"event\",active_runners=\"active_runners\",flavor=\"flavor\" | event=\"reconciliation\" | flavor=~\"$flavor\" | unwrap active_runners[60m]))",
"hide": false,
"legendFormat": "Active",
"queryType": "range",
"refId": "A"
}
],
"title": "Idle Runners after Reconciliation",
"title": "Runners after Reconciliation",
"transformations": [],
"type": "timeseries"
},
Expand Down Expand Up @@ -958,7 +970,7 @@
"uid": "${lokids}"
},
"editorMode": "code",
"expr": "1 / sum by (flavor) (rate({filename=\"/var/log/github-runner-metrics.log\", juju_application=~\"$juju_application\", juju_model=~\"$juju_model\", juju_model_uuid=~\"$juju_model_uuid\", juju_unit=~\"$juju_unit\"} | json event=\"event\", flavor=\"flavor\" | event=\"reconciliation\" [$__range]))",
"expr": "(1 / sum by(flavor)(rate({filename=\"/var/log/github-runner-metrics.log\", juju_application=~\"$juju_application\", juju_model=~\"$juju_model\", juju_model_uuid=~\"$juju_model_uuid\", juju_unit=~\"$juju_unit\"} | json event=\"event\",flavor=\"flavor\" | event=\"reconciliation\"[$__range])))",
"key": "Q-9302bc4d-cce0-4674-bad5-353257fdd2f4-0",
"legendFormat": "",
"queryType": "instant",
Expand Down Expand Up @@ -1028,8 +1040,7 @@
"mode": "absolute",
"steps": [
{
"color": "green",
"value": null
"color": "green"
},
{
"color": "red",
Expand Down Expand Up @@ -2045,6 +2056,6 @@
"timepicker": {},
"timezone": "",
"title": "GitHub Self-Hosted Runner Metrics",
"version": 15,
"version": 16,
"weekStart": ""
}
22 changes: 15 additions & 7 deletions src/runner_manager.py
Original file line number Diff line number Diff line change
Expand Up @@ -331,14 +331,21 @@ def _issue_runner_metrics(self) -> IssuedMetricEventsStats:
for extracted_metrics in runner_metrics.extract(
metrics_storage_manager=shared_fs, runners=set(runner_states.healthy)
):
try:
job_metrics = github_metrics.job(
github_client=self._clients.github,
pre_job_metrics=extracted_metrics.pre_job,
runner_name=extracted_metrics.runner_name,
if extracted_metrics.pre_job:
try:
job_metrics = github_metrics.job(
github_client=self._clients.github,
pre_job_metrics=extracted_metrics.pre_job,
runner_name=extracted_metrics.runner_name,
)
except GithubMetricsError:
logger.exception("Failed to calculate job metrics")
job_metrics = None
else:
logger.debug(
"No pre-job metrics found for %s, will not calculate job metrics.",
extracted_metrics.runner_name,
)
except GithubMetricsError:
logger.exception("Failed to calculate job metrics")
job_metrics = None

issued_events = runner_metrics.issue_events(
Expand Down Expand Up @@ -390,6 +397,7 @@ def _issue_reconciliation_metric(
crashed_runners=metric_stats.get(metric_events.RunnerStart, 0)
- metric_stats.get(metric_events.RunnerStop, 0),
idle_runners=idle_online_count + idle_offline_count,
active_runners=active_count,
duration=reconciliation_end_ts - reconciliation_start_ts,
)
)
Expand Down
Loading

0 comments on commit 5a31f10

Please sign in to comment.