Using ramdisk as LXD storage for the runners #69

Closed
37 commits
22c1af8
Add using juju storage as storage for runner
yhaliaw Jun 5, 2023
d193ccb
Add attach juju storage to deploy script
yhaliaw Jun 7, 2023
ecba393
Fix linting issues
yhaliaw Jun 7, 2023
386de13
Fix issue with missing LXD storage pool path
yhaliaw Jun 7, 2023
470401e
Add stub for LxdStoragePoolManager class
yhaliaw Jun 7, 2023
258667e
Fix get the path of storage pool
yhaliaw Jun 8, 2023
265d6dd
Fix typo
yhaliaw Jun 8, 2023
05032a2
Fix unit test for Runner and RunnerManager
yhaliaw Jun 8, 2023
d198b49
Fix charm unit test
yhaliaw Jun 8, 2023
65474ae
Change config for debug
yhaliaw Jun 8, 2023
4716423
Testing ramdisk as LXD storage
yhaliaw Jun 15, 2023
9f7c7b8
Add existence checks for ramdisk
yhaliaw Jun 16, 2023
ee4a550
Add unit conversion for byte units
yhaliaw Jun 16, 2023
7e66bd0
Fix typo
yhaliaw Jun 16, 2023
a8959f6
Remove ramdisk config
yhaliaw Jun 19, 2023
7236c6f
Fix download block size
yhaliaw Jun 19, 2023
a165546
Remove juju storage
yhaliaw Jun 19, 2023
794302b
Set ram disk to 1TiB
yhaliaw Jun 19, 2023
8a5ce65
Switch back to tmpfs with dir driver
yhaliaw Jun 20, 2023
9acbf5b
Set apparmor to complain mode
yhaliaw Jun 20, 2023
5ce9574
Add installation of apparmor-utils
yhaliaw Jun 20, 2023
614de9f
Fix missing string to int
yhaliaw Jun 20, 2023
109acbc
Create missing dir for ram disk
yhaliaw Jun 20, 2023
d91d107
Fix creation of tmpfs
yhaliaw Jun 20, 2023
a9c823d
Fix unit of tmpfs size option
yhaliaw Jun 20, 2023
766e708
Fix setting complain mode for apparmor profile
yhaliaw Jun 20, 2023
888ada5
Attempt to fix LXD storage pool with dir driver
yhaliaw Jun 20, 2023
a68bdc4
Fix unit test
yhaliaw Jun 20, 2023
60d4947
Change ram disk path to under LXD snap path
yhaliaw Jun 20, 2023
81e5c0f
Rename ram lxd storage pool
yhaliaw Jun 20, 2023
bee04c5
Change path for ram pool
yhaliaw Jun 20, 2023
8faec4d
Fix path of ram pool
yhaliaw Jun 20, 2023
892428a
Refactor enabling complain mode for apparmor
yhaliaw Jun 20, 2023
ad4e3cc
Revert to using brd with LVM for LXD storage
yhaliaw Jun 20, 2023
e44b212
Fix brd size.
yhaliaw Jun 20, 2023
8126595
Remove debug code
yhaliaw Jun 20, 2023
aac67ac
Remove extra file
yhaliaw Jun 20, 2023
5 changes: 3 additions & 2 deletions config.yaml
Expand Up @@ -33,13 +33,14 @@ options:
type: string
default: 7GiB
description: >
Amount of memory to allocate per virtual machine runner. Positive integers with GiB suffix.
Amount of memory to allocate per virtual machine runner. Positive integers with KiB, MiB, GiB,
TiB, PiB, EiB suffix.
vm-disk:
type: string
default: 10GiB
description: >
Amount of disk space to allocate to root disk for virtual machine runner. Positive integers
with GiB suffix.
with KiB, MiB, GiB, TiB, PiB, EiB suffix.
reconcile-interval:
type: int
default: 10
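
The updated descriptions for `vm-memory` and `vm-disk` accept a positive integer followed by a binary suffix from KiB to EiB. A minimal sketch of how such a value could be converted to bytes is given here; `parse_binary_size` and `_BINARY_UNITS` are hypothetical names chosen for illustration and are not taken from this PR.

```python
# Hypothetical helper, not the PR's implementation: converts strings such as
# "7GiB" into a number of bytes using the binary suffixes listed in config.yaml.
_BINARY_UNITS = {
    "KiB": 1024**1,
    "MiB": 1024**2,
    "GiB": 1024**3,
    "TiB": 1024**4,
    "PiB": 1024**5,
    "EiB": 1024**6,
}


def parse_binary_size(value: str) -> int:
    """Return the size in bytes for a value like '10GiB'.

    Raises:
        ValueError: If the suffix is unknown or the number is not a positive integer.
    """
    for suffix, multiplier in _BINARY_UNITS.items():
        if value.endswith(suffix):
            number = value[: -len(suffix)]
            # Accept plain ASCII digits only, and reject zero.
            if not (number.isascii() and number.isdigit()) or int(number) == 0:
                raise ValueError(f"Expected a positive integer before {suffix}: {value!r}")
            return int(number) * multiplier
    raise ValueError(f"Unknown or missing binary suffix: {value!r}")


if __name__ == "__main__":
    print(parse_binary_size("7GiB"))
    print(parse_binary_size("10GiB"))
```

Rejecting decimal suffixes such as `GB` keeps the configuration unambiguous, since the descriptions only list binary units.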
19 changes: 9 additions & 10 deletions metadata.yaml
Expand Up @@ -10,14 +10,13 @@ issues: https://github.com/canonical/github-runner-operator/issues
source: https://github.com/canonical/github-runner-operator
summary: Creates a cluster of self-hosted github runners.
description: |
A [Juju](https://juju.is/) [charm](https://juju.is/docs/olm/charmed-operators)
deploying self-hosted GitHub runners.

Each unit of this charm will start a configurable number of LXD based containers
and virtual machines to host them. Each runner performs only one job, after which
it unregisters from GitHub to ensure that each job runs in a clean environment.
The charm will periodically check the number of idle runners and spawn or destroy them as
necessary to maintain the configured number of runners. Both the reconciliation interval and the
number of runners to maintain are configurable.
A [Juju](https://juju.is/) [charm](https://juju.is/docs/olm/charmed-operators) deploying
self-hosted GitHub runners.

Each unit of this charm will start a configurable number of LXD based virtual machines to host
them. Each runner performs only one job, after which it unregisters from GitHub to ensure that
each job runs in a clean environment. The charm will periodically check the number of idle runners
and spawn or destroy them as necessary to maintain the configured number of runners. Both the
reconciliation interval and the number of runners to maintain are configurable.
series:
- jammy
- jammy
2 changes: 1 addition & 1 deletion script/deploy_runner.sh
Expand Up @@ -30,5 +30,5 @@ unzip -p github_runner.zip > github-runner.charm
rm github_runner.zip

# Deploy the charm.
juju deploy ./github-runner.charm --series=jammy e2e-runner
juju deploy ./github-runner.charm --series=jammy --constraints="cores=4 mem=32G" e2e-runner
juju config e2e-runner token="$GITHUB_TOKEN" path=canonical/github-runner-operator virtual-machines=1
105 changes: 75 additions & 30 deletions src/charm.py
Expand Up @@ -27,12 +27,12 @@
from ops.main import main
from ops.model import ActiveStatus, BlockedStatus, MaintenanceStatus

from errors import RunnerError, SubprocessError
from errors import MissingConfigurationError, RunnerError, SubprocessError
from event_timer import EventTimer, TimerDisableError, TimerEnableError
from github_type import GitHubRunnerStatus
from runner_manager import RunnerManager, RunnerManagerConfig
from runner_type import GitHubOrg, GitHubRepo, ProxySetting, VirtualMachineResources
from utilities import execute_command, get_env_var, retry
from utilities import execute_command, get_env_var, retry, secure_run_subprocess

if TYPE_CHECKING:
from ops.model import JsonObject # pragma: no cover
Expand All @@ -52,9 +52,7 @@ class UpdateRunnerBinEvent(EventBase):
EventT = TypeVar("EventT")


def catch_unexpected_charm_errors(
func: Callable[[CharmT, EventT], None]
) -> Callable[[CharmT, EventT], None]:
def catch_charm_errors(func: Callable[[CharmT, EventT], None]) -> Callable[[CharmT, EventT], None]:
"""Catch unexpected errors in charm.

This decorator is for unrecoverable errors and sets the charm to
Expand All @@ -72,14 +70,16 @@ def func_with_catch_unexpected_errors(self, event: EventT) -> None:
# Safe guard against unexpected error.
try:
func(self, event)
except Exception as err: # pylint: disable=broad-exception-caught
logger.exception(err)
self.unit.status = BlockedStatus(str(err))
except MissingConfigurationError as err:
logger.exception("Missing required charm configuration")
self.unit.status = BlockedStatus(
f"Missing required charm configuration: {err.configs}"
)

return func_with_catch_unexpected_errors


def catch_unexpected_action_errors(
def catch_action_errors(
func: Callable[[CharmT, ActionEvent], None]
) -> Callable[[CharmT, ActionEvent], None]:
"""Catch unexpected errors in actions.
Expand All @@ -92,15 +92,15 @@ def catch_unexpected_action_errors(
"""

@functools.wraps(func)
def func_with_catch_unexpected_errors(self, event: ActionEvent) -> None:
def func_with_catch_errors(self, event: ActionEvent) -> None:
# Safe guard against unexpected error.
try:
func(self, event)
except Exception as err: # pylint: disable=broad-exception-caught
logger.exception(err)
event.fail(f"Failed to get runner info: {err}")
except MissingConfigurationError as err:
logger.exception("Missing required charm configuration")
event.fail(f"Missing required charm configuration: {err.configs}")

return func_with_catch_unexpected_errors
return func_with_catch_errors
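
One consequence of narrowing the `except` clause from `Exception` to `MissingConfigurationError` is that any other exception now propagates out of the handler instead of being turned into a blocked status. The sketch below illustrates that behaviour without the `ops` framework; `StubCharm`, `DemoCharm`, `catch_missing_config`, and the string-valued `status` attribute are stand-ins invented for this example.

```python
import functools
import logging
from typing import Any, Callable, List

logger = logging.getLogger(__name__)


class MissingConfigurationError(Exception):
    """Simplified copy of the error class added in src/errors.py."""

    def __init__(self, configs: List[str]) -> None:
        super().__init__(f"Missing required charm configuration: {configs}")
        self.configs = configs


class StubCharm:
    """Hypothetical stand-in for the charm; status is a plain string here."""

    def __init__(self) -> None:
        self.status = "active"


def catch_missing_config(func: Callable[[Any, Any], None]) -> Callable[[Any, Any], None]:
    """Turn MissingConfigurationError into a blocked status; let other errors propagate."""

    @functools.wraps(func)
    def wrapper(self: Any, event: Any) -> None:
        try:
            func(self, event)
        except MissingConfigurationError as err:
            logger.exception("Missing required charm configuration")
            self.status = f"blocked: Missing required charm configuration: {err.configs}"

    return wrapper


class DemoCharm(StubCharm):
    @catch_missing_config
    def on_config_changed(self, event: Any) -> None:
        raise MissingConfigurationError(["token"])

    @catch_missing_config
    def on_install(self, event: Any) -> None:
        # Not a MissingConfigurationError, so the decorator does not catch it.
        raise RuntimeError("unexpected failure")
```

Calling `DemoCharm().on_config_changed(None)` sets the blocked status, while `on_install` raises `RuntimeError` to the caller.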


class GithubRunnerCharm(CharmBase):
Expand All @@ -112,6 +112,7 @@ class GithubRunnerCharm(CharmBase):
repo_check_web_service_path = Path("/home/ubuntu/repo_policy_compliance_service")
repo_check_web_service_script = Path("src/repo_policy_compliance_service.py")
repo_check_systemd_service = Path("/etc/systemd/system/repo-policy-compliance.service")
lvm_vg_name = "ramdisk_pool"

def __init__(self, *args, **kargs) -> None:
"""Construct the charm.
Expand Down Expand Up @@ -155,9 +156,24 @@ def __init__(self, *args, **kargs) -> None:
self.framework.observe(self.on.flush_runners_action, self._on_flush_runners_action)
self.framework.observe(self.on.update_runner_bin_action, self._on_update_runner_bin)

def _ensure_ramdisk_volume_group(self) -> None:
"""Ensure ramdisk LVM volume group exists."""
# Check if ramdisk at /dev/ram0 exists.
result = secure_run_subprocess(["test", "-e", "/dev/ram0"])
if result.returncode != 0:
# The block ram disk is set to 1 TiB size, as a way to not limit it.
# Block ram disk does not pre-allocate the memory.
# Each LXD instance memory usage is restricted through the LXD profile.
execute_command(["modprobe", "brd", "rd_size=1073741824", "rd_nr=1"])

# Check if volume group exists.
result = secure_run_subprocess(["vgdisplay", self.lvm_vg_name])
if result.returncode != 0:
execute_command(["vgcreate", self.lvm_vg_name, "/dev/ram0"])

def _get_runner_manager(
self, token: Optional[str] = None, path: Optional[str] = None
) -> Optional[RunnerManager]:
) -> RunnerManager:
"""Get a RunnerManager instance; raise MissingConfigurationError if config is missing.

Args:
Expand All @@ -166,15 +182,22 @@ def _get_runner_manager(
name.

Returns:
A instance of RunnerManager if the token and path configuration can be found.
An instance of RunnerManager.
"""
if token is None:
token = self.config["token"]
if path is None:
path = self.config["path"]

if not token or not path:
return None
missing_configs = []
if not token:
missing_configs.append("token")
if not path:
missing_configs.append("path")
if missing_configs:
raise MissingConfigurationError(missing_configs)

self._ensure_ramdisk_volume_group()

if self.service_token is None:
self.service_token = self._get_service_token()
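
The `rd_size=1073741824` argument passed to `modprobe brd` in `_ensure_ramdisk_volume_group` is easier to check once the unit is spelled out: the `brd` module takes `rd_size` in KiB, so 1073741824 KiB is 2^30 × 2^10 bytes, or 1 TiB. The sketch below reproduces that arithmetic; `brd_size_kib` and `modprobe_brd_command` are hypothetical helpers for illustration, and the sketch only builds the argument list without running `modprobe`.

```python
from typing import List

KIB = 1024
TIB = 1024**4


def brd_size_kib(size_bytes: int) -> int:
    """Convert a size in bytes to the KiB value expected by brd's rd_size parameter."""
    if size_bytes <= 0 or size_bytes % KIB != 0:
        raise ValueError("Size must be a positive multiple of 1 KiB")
    return size_bytes // KIB


def modprobe_brd_command(size_bytes: int, count: int = 1) -> List[str]:
    """Build the modprobe argument list for loading brd (does not execute it)."""
    return ["modprobe", "brd", f"rd_size={brd_size_kib(size_bytes)}", f"rd_nr={count}"]


if __name__ == "__main__":
    # Reproduces the command used in the diff for a 1 TiB block ram disk.
    print(modprobe_brd_command(TIB))
```

Deriving the value from a named constant, rather than writing the literal, would make later size changes less error-prone; commit e44b212 ("Fix brd size.") suggests the literal was easy to get wrong.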
Expand All @@ -194,11 +217,11 @@ def _get_runner_manager(
return RunnerManager(
app_name,
unit,
RunnerManagerConfig(path, token, "jammy", self.service_token),
RunnerManagerConfig(path, token, "jammy", self.service_token, self.lvm_vg_name),
proxies=self.proxies,
)

@catch_unexpected_charm_errors
@catch_charm_errors
def _on_install(self, _event: InstallEvent) -> None:
"""Handle the installation of charm.

Expand Down Expand Up @@ -244,7 +267,7 @@ def _on_install(self, _event: InstallEvent) -> None:
else:
self.unit.status = BlockedStatus("Missing token or org/repo path config")

@catch_unexpected_charm_errors
@catch_charm_errors
def _on_upgrade_charm(self, _event: UpgradeCharmEvent) -> None:
"""Handle the update of charm.

Expand All @@ -263,7 +286,7 @@ def _on_upgrade_charm(self, _event: UpgradeCharmEvent) -> None:
runner_manager.flush()
self._reconcile_runners(runner_manager)

@catch_unexpected_charm_errors
@catch_charm_errors
def _on_config_changed(self, _event: ConfigChangedEvent) -> None:
"""Handle the configuration change.

Expand Down Expand Up @@ -298,7 +321,7 @@ def _on_config_changed(self, _event: ConfigChangedEvent) -> None:
else:
self.unit.status = BlockedStatus("Missing token or org/repo path config")

@catch_unexpected_charm_errors
@catch_charm_errors
def _on_update_runner_bin(self, _event: UpdateRunnerBinEvent) -> None:
"""Handle checking update of runner binary event.

Expand Down Expand Up @@ -337,7 +360,7 @@ def _on_update_runner_bin(self, _event: UpdateRunnerBinEvent) -> None:

self.unit.status = ActiveStatus()

@catch_unexpected_charm_errors
@catch_charm_errors
def _on_reconcile_runners(self, _event: ReconcileRunnersEvent) -> None:
"""Handle the reconciliation of runners.

Expand All @@ -363,7 +386,7 @@ def _on_reconcile_runners(self, _event: ReconcileRunnersEvent) -> None:

self.unit.status = ActiveStatus()

@catch_unexpected_action_errors
@catch_action_errors
def _on_check_runners_action(self, event: ActionEvent) -> None:
"""Handle the action of checking of runner state.

Expand Down Expand Up @@ -403,7 +426,7 @@ def _on_check_runners_action(self, event: ActionEvent) -> None:
}
)

@catch_unexpected_action_errors
@catch_action_errors
def _on_reconcile_runners_action(self, event: ActionEvent) -> None:
"""Handle the action of reconcile of runner state.

Expand All @@ -420,7 +443,7 @@ def _on_reconcile_runners_action(self, event: ActionEvent) -> None:
self._on_check_runners_action(event)
event.set_results(delta)

@catch_unexpected_action_errors
@catch_action_errors
def _on_flush_runners_action(self, event: ActionEvent) -> None:
"""Handle the action of flushing all runner and reconciling afterwards.

Expand All @@ -438,7 +461,7 @@ def _on_flush_runners_action(self, event: ActionEvent) -> None:
self._on_check_runners_action(event)
event.set_results(delta)

@catch_unexpected_charm_errors
@catch_charm_errors
def _on_stop(self, _: StopEvent) -> None:
"""Handle the stopping of the charm.

Expand Down Expand Up @@ -483,7 +506,7 @@ def _reconcile_runners(self, runner_manager: RunnerManager) -> Dict[str, "JsonOb
return {"delta": {"virtual-machines": delta_virtual_machines}}
# Safe guard against transient unexpected error.
except Exception as err: # pylint: disable=broad-exception-caught
logger.exception("Failed to update runner binary")
logger.exception("Failed to reconcile runners.")
# Failure to reconcile runners is a transient error.
# The charm automatically reconciles runners on a schedule.
self.unit.status = MaintenanceStatus(f"Failed to reconcile runners: {err}")
Expand Down Expand Up @@ -511,7 +534,6 @@ def _install_deps(self) -> None:
[
"/usr/bin/pip",
"install",
"flask",
"git+https://github.com/canonical/repo-policy-compliance@main",
],
env=env,
Expand All @@ -528,6 +550,7 @@ def _install_deps(self) -> None:
"cpu-checker",
"libvirt-clients",
"libvirt-daemon-driver-qemu",
"apparmor-utils",
],
)
execute_command(["/usr/bin/snap", "install", "lxd", "--channel=latest/stable"])
Expand All @@ -538,6 +561,28 @@ def _install_deps(self) -> None:
execute_command(["/snap/bin/lxc", "network", "set", "lxdbr0", "ipv6.address", "none"])
logger.info("Finished installing charm dependencies.")

logger.info("Setting apparmor to complain mode.")
self._apparmor_complain_mode(
[
"/etc/apparmor.d/lsb_release",
"/etc/apparmor.d/nvidia_modprobe",
"/etc/apparmor.d/sbin.dhclient",
"/etc/apparmor.d/usr.bin.man",
"/etc/apparmor.d/usr.bin.tcpdump",
"/etc/apparmor.d/usr.lib.snapd.snap-confine.real",
"/etc/apparmor.d/usr.sbin.rsyslogd",
]
)

def _apparmor_complain_mode(self, profiles: list[str]) -> None:
"""Enable complain mode for the apparmor profile.

Args:
profiles: Profiles to enable complain mode.
"""
for profile in profiles:
execute_command(["aa-complain", profile])

@retry(tries=10, delay=15, max_delay=60, backoff=1.5)
def _start_services(self) -> None:
"""Start services."""
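
The `@retry(tries=10, delay=15, max_delay=60, backoff=1.5)` decorator on `_start_services` implies a bounded wait schedule. The implementation of `retry` in `utilities` is not shown in this diff, so the sketch below assumes the conventional semantics: sleep `delay` seconds after the first failure, multiply the delay by `backoff` after each further failure, cap it at `max_delay`, and do not sleep after the final try. `backoff_delays` is a hypothetical name.

```python
from typing import List


def backoff_delays(tries: int, delay: float, max_delay: float, backoff: float) -> List[float]:
    """Return the sleep durations between tries under the assumed retry semantics."""
    delays = []
    current = delay
    # There is one sleep between each pair of consecutive tries.
    for _ in range(tries - 1):
        delays.append(min(current, max_delay))
        current *= backoff
    return delays


if __name__ == "__main__":
    schedule = backoff_delays(tries=10, delay=15, max_delay=60, backoff=1.5)
    print(schedule)
    print(sum(schedule))
```

Under these assumptions the delays are 15, 22.5, 33.75, and 50.625 seconds, then 60 seconds for each of the remaining five waits, for a total of 421.875 seconds (about seven minutes) before the final failure is raised.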
18 changes: 18 additions & 0 deletions src/errors.py
Expand Up @@ -35,6 +35,24 @@ class RunnerBinaryError(RunnerError):
"""Error of getting runner binary."""


class MissingConfigurationError(Exception):
"""Error for missing juju configuration.

Attrs:
configs: The missing configurations.
"""

def __init__(self, configs: list[str]):
"""Construct the MissingConfigurationError.

Args:
configs: The missing configurations.
"""
super().__init__(f"Missing required charm configuration: {configs}")

self.configs = configs


class LxdError(Exception):
"""Error for executing LXD actions."""
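
The check in `_get_runner_manager` collects every missing option before raising, so the operator sees all problems in one status message instead of fixing them one at a time. A minimal sketch of that pattern follows; `require_configs` is a hypothetical helper, and the exception class is a simplified copy of the one added in this file.

```python
from typing import List, Mapping, Optional, Sequence


class MissingConfigurationError(Exception):
    """Simplified copy of the error class added in this diff."""

    def __init__(self, configs: List[str]) -> None:
        super().__init__(f"Missing required charm configuration: {configs}")
        self.configs = configs


def require_configs(config: Mapping[str, Optional[str]], names: Sequence[str]) -> None:
    """Raise a single error listing every required option that is unset or empty."""
    missing = [name for name in names if not config.get(name)]
    if missing:
        raise MissingConfigurationError(missing)


if __name__ == "__main__":
    try:
        require_configs({"token": "", "path": None}, ["token", "path"])
    except MissingConfigurationError as err:
        print(err.configs)
```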
