Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Remove race condition between restart repo-policy-compliance and request of the service #166

Merged
merged 48 commits into from
Feb 1, 2024
Merged
Show file tree
Hide file tree
Changes from 16 commits
Commits
Show all changes
48 commits
Select commit Hold shift + click to select a range
41fa8e7
WIP
yhaliaw Dec 11, 2023
3297896
Merge branch 'main' into bug/update-dep-race-condition
yhaliaw Dec 11, 2023
73826e2
Add waiting of runner that has taken a job
yhaliaw Dec 11, 2023
a5973db
Merge branch 'main' into bug/update-dep-race-condition
yhaliaw Dec 12, 2023
7aa40a6
Fix import
yhaliaw Dec 12, 2023
a386fec
Draft for integration test
yhaliaw Dec 12, 2023
bd9f19f
Fix import
yhaliaw Dec 12, 2023
e92b08f
Fix import order
yhaliaw Dec 12, 2023
a13b03e
Merge branch 'main' into bug/update-dep-race-condition
yhaliaw Jan 15, 2024
d784374
[no ci] Fix linting
yhaliaw Jan 15, 2024
84d5ce0
Merge branch 'main' into bug/update-dep-race-condition
yhaliaw Jan 16, 2024
dca0cb6
Flush non-busy runner
yhaliaw Jan 16, 2024
77638fe
Fix lint and unit test
yhaliaw Jan 22, 2024
377b47f
Merge branch 'main' into bug/update-dep-race-condition
yhaliaw Jan 22, 2024
91c1f43
Fix merge issues
yhaliaw Jan 22, 2024
a7f096f
Merge branch 'main' into bug/update-dep-race-condition
yhaliaw Jan 22, 2024
04bde3b
Fix the integration test
yhaliaw Jan 23, 2024
200185c
Merge branch 'main' into bug/update-dep-race-condition
yhaliaw Jan 23, 2024
5e593fa
Fix code comment
yhaliaw Jan 23, 2024
3b9126a
Fix integration test
yhaliaw Jan 23, 2024
6220bba
Change repo used for integration test
yhaliaw Jan 24, 2024
9945a5b
Merge branch 'main' into bug/update-dep-race-condition
yhaliaw Jan 24, 2024
e66c17b
Fix waiting for busy runner
yhaliaw Jan 24, 2024
7efee67
Use enum as RunnerManager.flush argument
yhaliaw Jan 24, 2024
d8bbf61
Fix integration test
yhaliaw Jan 24, 2024
510fa28
Increase the timeout
yhaliaw Jan 26, 2024
9b6f3f7
Merge branch 'main' into bug/update-dep-race-condition
yhaliaw Jan 26, 2024
4e4457b
Fix lint
yhaliaw Jan 26, 2024
fe52afa
Assert workflow was successfully dispatched
yhaliaw Jan 26, 2024
2db55e2
Fix workflow input type
yhaliaw Jan 26, 2024
8f16e7e
Fix github_client usage
yhaliaw Jan 26, 2024
87fe221
Fix integration test
yhaliaw Jan 26, 2024
b9adcc0
Fix integration test
yhaliaw Jan 26, 2024
987f5c7
Fix timezone format
yhaliaw Jan 26, 2024
c9d08ac
Update enum docstring
yhaliaw Jan 27, 2024
4ab75c9
Fix code comments
yhaliaw Jan 29, 2024
33fc0a4
Merge branch 'main' into bug/update-dep-race-condition
yhaliaw Jan 29, 2024
e48b355
Merge branch 'main' into bug/update-dep-race-condition
yhaliaw Jan 29, 2024
6fdc6ce
Refactor test to use helper functions
yhaliaw Jan 30, 2024
7b7bdbe
Merge branch 'main' into bug/update-dep-race-condition
yanksyoon Jan 30, 2024
5955379
Fix exception thrown in tests
yhaliaw Jan 31, 2024
0e029a4
Disable pylint due to fastcore lib issues
yhaliaw Jan 31, 2024
fff0d70
Test fix
yhaliaw Jan 31, 2024
14f6625
Resolve the race condition
yhaliaw Jan 31, 2024
020bc6e
Resolve the race condition between restart service and flushing runner
yhaliaw Jan 31, 2024
9bea9e2
Fix lint issues
yhaliaw Jan 31, 2024
2053928
Fix timeout
yhaliaw Feb 1, 2024
abfc217
Merge branch 'main' into bug/update-dep-race-condition
yhaliaw Feb 1, 2024
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
2 changes: 1 addition & 1 deletion .github/workflows/integration_test.yaml
Original file line number Diff line number Diff line change
Expand Up @@ -14,7 +14,7 @@ jobs:
provider: lxd
test-tox-env: integration-juju2.9
modules: '["test_charm_fork_repo", "test_charm_no_runner", "test_charm_scheduled_events", "test_charm_one_runner", "test_charm_metrics_success", "test_charm_metrics_failure", "test_self_hosted_runner", "test_charm_with_proxy", "test_charm_with_juju_storage", "test_debug_ssh"]'
integration-tests-juju3:
integration-tests:
name: Integration test with juju 3.1
uses: canonical/operator-workflows/.github/workflows/integration_test.yaml@main
secrets: inherit
Expand Down
29 changes: 29 additions & 0 deletions .github/workflows/workflow_dispatch_wait_test.yaml
Original file line number Diff line number Diff line change
@@ -0,0 +1,29 @@
name: Workflow Dispatch Wait Tests

on:
# Manually dispatched workflow action
workflow_dispatch:
inputs:
runner:
description: 'Self hosted gh runner'
required: true
minutes:
description: 'Number of minutes to wait'
# Number type not supported in workflow dispatch: https://github.com/orgs/community/discussions/67182
# Seems to be by design: https://github.blog/changelog/2021-11-10-github-actions-input-types-for-manual-workflows/
default: '2'


jobs:
workflow-dispatch-tests:
runs-on: [self-hosted, linux, x64, "${{ inputs.runner }}"]
steps:
- name: Echo input variable and message
run: |
echo "Hello, runner: ${{ inputs.runner }}"
- name: Wait
run: |
sleep ${{ inputs.minutes }}m
- name: Always echo a message
if: always()
run: echo "Should not echo if pre-job script failed"
23 changes: 12 additions & 11 deletions src-docs/runner_manager.py.md
Original file line number Diff line number Diff line change
Expand Up @@ -17,7 +17,7 @@ Runner Manager manages the runners on LXD and GitHub.
## <kbd>class</kbd> `RunnerManager`
Manage a group of runners according to configuration.

<a href="../src/runner_manager.py#L55"><img align="right" style="float:right;" src="https://img.shields.io/badge/-source-cccccc?style=flat-square"></a>
<a href="../src/runner_manager.py#L56"><img align="right" style="float:right;" src="https://img.shields.io/badge/-source-cccccc?style=flat-square"></a>

### <kbd>function</kbd> `__init__`

Expand Down Expand Up @@ -46,7 +46,7 @@ Construct RunnerManager object for creating and managing runners.

---

<a href="../src/runner_manager.py#L660"><img align="right" style="float:right;" src="https://img.shields.io/badge/-source-cccccc?style=flat-square"></a>
<a href="../src/runner_manager.py#L710"><img align="right" style="float:right;" src="https://img.shields.io/badge/-source-cccccc?style=flat-square"></a>

### <kbd>function</kbd> `build_runner_image`

Expand All @@ -66,7 +66,7 @@ Build container image in test mode, else virtual machine image.

---

<a href="../src/runner_manager.py#L116"><img align="right" style="float:right;" src="https://img.shields.io/badge/-source-cccccc?style=flat-square"></a>
<a href="../src/runner_manager.py#L117"><img align="right" style="float:right;" src="https://img.shields.io/badge/-source-cccccc?style=flat-square"></a>

### <kbd>function</kbd> `check_runner_bin`

Expand All @@ -83,12 +83,12 @@ Check if runner binary exists.

---

<a href="../src/runner_manager.py#L526"><img align="right" style="float:right;" src="https://img.shields.io/badge/-source-cccccc?style=flat-square"></a>
<a href="../src/runner_manager.py#L552"><img align="right" style="float:right;" src="https://img.shields.io/badge/-source-cccccc?style=flat-square"></a>

### <kbd>function</kbd> `flush`

```python
flush(flush_busy: bool = True) → int
flush(force_flush_busy: bool = False, wait_repo_check: bool = False) → int
```

Remove existing runners.
Expand All @@ -97,7 +97,8 @@ Remove existing runners.

**Args:**

- <b>`flush_busy`</b>: Whether to flush busy runners as well.
- <b>`force_flush_busy`</b>: Whether to flush busy runners immediately.
- <b>`wait_repo_check`</b>: Whether to flush busy runners after waiting for repo-policy-compliance check to complete.



Expand All @@ -106,7 +107,7 @@ Remove existing runners.

---

<a href="../src/runner_manager.py#L217"><img align="right" style="float:right;" src="https://img.shields.io/badge/-source-cccccc?style=flat-square"></a>
<a href="../src/runner_manager.py#L218"><img align="right" style="float:right;" src="https://img.shields.io/badge/-source-cccccc?style=flat-square"></a>

### <kbd>function</kbd> `get_github_info`

Expand All @@ -123,7 +124,7 @@ Get information on the runners from GitHub.

---

<a href="../src/utilities.py#L124"><img align="right" style="float:right;" src="https://img.shields.io/badge/-source-cccccc?style=flat-square"></a>
<a href="../src/utilities.py#L125"><img align="right" style="float:right;" src="https://img.shields.io/badge/-source-cccccc?style=flat-square"></a>

### <kbd>function</kbd> `get_latest_runner_bin_url`

Expand All @@ -148,7 +149,7 @@ The runner binary URL changes when a new version is available.

---

<a href="../src/runner_manager.py#L449"><img align="right" style="float:right;" src="https://img.shields.io/badge/-source-cccccc?style=flat-square"></a>
<a href="../src/runner_manager.py#L450"><img align="right" style="float:right;" src="https://img.shields.io/badge/-source-cccccc?style=flat-square"></a>

### <kbd>function</kbd> `reconcile`

Expand All @@ -172,7 +173,7 @@ Bring runners in line with target.

---

<a href="../src/runner_manager.py#L670"><img align="right" style="float:right;" src="https://img.shields.io/badge/-source-cccccc?style=flat-square"></a>
<a href="../src/runner_manager.py#L720"><img align="right" style="float:right;" src="https://img.shields.io/badge/-source-cccccc?style=flat-square"></a>

### <kbd>function</kbd> `schedule_build_runner_image`

Expand All @@ -184,7 +185,7 @@ Install cron job for building runner image.

---

<a href="../src/utilities.py#L150"><img align="right" style="float:right;" src="https://img.shields.io/badge/-source-cccccc?style=flat-square"></a>
<a href="../src/utilities.py#L151"><img align="right" style="float:right;" src="https://img.shields.io/badge/-source-cccccc?style=flat-square"></a>

### <kbd>function</kbd> `update_runner_bin`

Expand Down
16 changes: 8 additions & 8 deletions src/charm.py
Original file line number Diff line number Diff line change
Expand Up @@ -429,7 +429,7 @@ def _on_start(self, _event: StartEvent) -> None:

self.unit.status = MaintenanceStatus("Starting runners")
try:
runner_manager.flush(flush_busy=False)
runner_manager.flush(force_flush_busy=False)
self._reconcile_runners(runner_manager)
except RunnerError as err:
logger.exception("Failed to start runners")
Expand Down Expand Up @@ -485,7 +485,7 @@ def _on_upgrade_charm(self, _event: UpgradeCharmEvent) -> None:
if not runner_manager:
return

runner_manager.flush()
runner_manager.flush(wait_repo_check=True)
yhaliaw marked this conversation as resolved.
Show resolved Hide resolved
self._reconcile_runners(runner_manager)

@catch_charm_errors
Expand Down Expand Up @@ -519,7 +519,7 @@ def _on_config_changed(self, _event: ConfigChangedEvent) -> None:
) # Casting for mypy checks.
if prev_runner_manager:
self.unit.status = MaintenanceStatus("Removing runners from old org/repo")
prev_runner_manager.flush(flush_busy=False)
prev_runner_manager.flush(wait_repo_check=True)
self._stored.path = self.config["path"]

runner_manager = self._get_runner_manager()
Expand All @@ -530,7 +530,7 @@ def _on_config_changed(self, _event: ConfigChangedEvent) -> None:
self.unit.status = BlockedStatus("Missing token or org/repo path config")

if self.config["token"] != self._stored.token:
runner_manager.flush(flush_busy=False)
runner_manager.flush(wait_repo_check=True)
self._stored.token = self.config["token"]

def _check_and_update_dependencies(self) -> bool:
Expand Down Expand Up @@ -583,8 +583,8 @@ def _check_and_update_dependencies(self) -> bool:

self.unit.status = MaintenanceStatus("Flushing runners due to updated deps")

runner_manager.flush(wait_repo_check=True)
yhaliaw marked this conversation as resolved.
Show resolved Hide resolved
self._start_services()
runner_manager.flush(flush_busy=False)

self.unit.status = ActiveStatus()
return service_updated or runner_bin_updated
Expand Down Expand Up @@ -673,7 +673,7 @@ def _on_flush_runners_action(self, event: ActionEvent) -> None:
"""
runner_manager = self._get_runner_manager()

runner_manager.flush()
runner_manager.flush(wait_repo_check=True)
delta = self._reconcile_runners(runner_manager)

self._on_check_runners_action(event)
Expand Down Expand Up @@ -704,7 +704,7 @@ def _on_stop(self, _: StopEvent) -> None:
self.unit.status = BlockedStatus(f"Failed to stop charm event timer: {ex}")

runner_manager = self._get_runner_manager()
runner_manager.flush()
runner_manager.flush(force_flush_busy=True)

def _reconcile_runners(self, runner_manager: RunnerManager) -> Dict[str, Any]:
"""Reconcile the current runners state and intended runner state.
Expand Down Expand Up @@ -931,7 +931,7 @@ def _apt_install(self, packages: Sequence[str]) -> None:
def _on_debug_ssh_relation_changed(self, _: ops.RelationChangedEvent) -> None:
"""Handle debug ssh relation changed event."""
runner_manager = self._get_runner_manager()
runner_manager.flush(flush_busy=False)
runner_manager.flush(force_flush_busy=False)
self._reconcile_runners(runner_manager)


Expand Down
76 changes: 63 additions & 13 deletions src/runner_manager.py
Original file line number Diff line number Diff line change
Expand Up @@ -10,6 +10,7 @@
import tarfile
import time
import urllib.request
from datetime import datetime, timedelta, timezone
from pathlib import Path
from typing import Dict, Iterator, Optional, Type

Expand Down Expand Up @@ -523,33 +524,82 @@ def reconcile(self, quantity: int, resources: VirtualMachineResources) -> int:
)
return delta

def flush(self, flush_busy: bool = True) -> int:
def _runners_in_pre_job(self) -> bool:
"""Check there exist runners in the pre-job script stage.

If a runner has taken a job for 1 minute or more, it is assumed to exit the pre-job script.

Returns:
Whether there are runners that has taken a job and run for less than 1 minute.
"""
now = datetime.now(timezone.utc)
busy_runners = [
runner for runner in self._get_runners() if runner.status.exist and runner.status.busy
]
for runner in busy_runners:
# Check if `_work` directory exists, if it exists the runner has started a job.
exit_code, stdout, _ = runner.instance.execute(
["/usr/bin/stat", "-c", "'%w'", "/home/ubuntu/github-runner/_work"]
jdkandersson marked this conversation as resolved.
Show resolved Hide resolved
)
if exit_code != 0:
return False
date_str, time_str, timezone_str = stdout.read().decode("utf-8").split(" ")
job_start_time = datetime.fromisoformat(f"{date_str}T{time_str[:12]}{timezone_str}")
if job_start_time + timedelta(minutes=1) > now:
return False
return True

def flush(self, force_flush_busy: bool = False, wait_repo_check: bool = False) -> int:
"""Remove existing runners.

Args:
flush_busy: Whether to flush busy runners as well.
force_flush_busy: Whether to flush busy runners immediately.
wait_repo_check: Whether to flush busy runners after waiting for
repo-policy-compliance check to complete.

Returns:
Number of runners removed.
"""
if flush_busy:
runners = [runner for runner in self._get_runners() if runner.status.exist]
else:
runners = [
runner
for runner in self._get_runners()
if runner.status.exist and not runner.status.busy
]
remove_token = self._clients.github.get_runner_remove_token(self.config.path)

logger.info("Removing existing %i local runners", len(runners))
# Removing non-busy runners
runners = [
runner
for runner in self._get_runners()
if runner.status.exist and not runner.status.busy
]

remove_token = self._clients.github.get_runner_remove_token(self.config.path)
logger.info("Removing existing %i non-busy local runners", len(runners))

remove_count = len(runners)
for runner in runners:
runner.remove(remove_token)
logger.info(REMOVED_RUNNER_LOG_STR, runner.config.name)

return len(runners)
if wait_repo_check:
jdkandersson marked this conversation as resolved.
Show resolved Hide resolved
for _ in range(5):
jdkandersson marked this conversation as resolved.
Show resolved Hide resolved
if not self._runners_in_pre_job():
break
time.sleep(30)
else:
logger.warning(
(
"Proceed with flush runner after timeout waiting on runner in setup "
"stage, pre-job script might fail in currently running jobs"
)
)

if force_flush_busy or wait_repo_check:
busy_runners = [runner for runner in self._get_runners() if runner.status.exist]

logger.info("Removing existing %i busy local runners", len(runners))

remove_count += len(busy_runners)
for runner in busy_runners:
runner.remove(remove_token)
logger.info(REMOVED_RUNNER_LOG_STR, runner.config.name)

return remove_count

def _generate_runner_name(self) -> str:
"""Generate a runner name based on charm name.
Expand Down
6 changes: 6 additions & 0 deletions tests/integration/conftest.py
Original file line number Diff line number Diff line change
Expand Up @@ -21,6 +21,7 @@
from juju.model import Model
from pytest_operator.plugin import OpsTest

from github_client import GithubClient
from tests.integration.helpers import (
deploy_github_runner_charm,
ensure_charm_has_runner,
Expand Down Expand Up @@ -141,6 +142,11 @@ def model(ops_test: OpsTest) -> Model:
return ops_test.model


@pytest.fixture(scope="module")
def runner_manager_github_client(token: str) -> GithubClient:
return GithubClient(token=token)


@pytest_asyncio.fixture(scope="module")
async def app_no_runner(
model: Model,
Expand Down
1 change: 1 addition & 0 deletions tests/integration/helpers.py
Original file line number Diff line number Diff line change
Expand Up @@ -32,6 +32,7 @@
DISPATCH_TEST_WORKFLOW_FILENAME = "workflow_dispatch_test.yaml"
DISPATCH_CRASH_TEST_WORKFLOW_FILENAME = "workflow_dispatch_crash_test.yaml"
DISPATCH_FAILURE_TEST_WORKFLOW_FILENAME = "workflow_dispatch_failure_test.yaml"
DISPATCH_WAIT_TEST_WORKFLOW_FILENAME = "workflow_dispatch_wait_test.yaml"

JOB_LOG_START_MSG_TEMPLATE = "Job is about to start running on the runner: {runner_name}"

Expand Down
Loading
Loading