Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Remove race condition between restart repo-policy-compliance and request of the service #166

Merged
merged 48 commits into from
Feb 1, 2024
Merged
Show file tree
Hide file tree
Changes from 11 commits
Commits
Show all changes
48 commits
Select commit Hold shift + click to select a range
41fa8e7
WIP
yhaliaw Dec 11, 2023
3297896
Merge branch 'main' into bug/update-dep-race-condition
yhaliaw Dec 11, 2023
73826e2
Add waiting of runner that has taken a job
yhaliaw Dec 11, 2023
a5973db
Merge branch 'main' into bug/update-dep-race-condition
yhaliaw Dec 12, 2023
7aa40a6
Fix import
yhaliaw Dec 12, 2023
a386fec
Draft for integration test
yhaliaw Dec 12, 2023
bd9f19f
Fix import
yhaliaw Dec 12, 2023
e92b08f
Fix import order
yhaliaw Dec 12, 2023
a13b03e
Merge branch 'main' into bug/update-dep-race-condition
yhaliaw Jan 15, 2024
d784374
[no ci] Fix linting
yhaliaw Jan 15, 2024
84d5ce0
Merge branch 'main' into bug/update-dep-race-condition
yhaliaw Jan 16, 2024
dca0cb6
Flush non-busy runner
yhaliaw Jan 16, 2024
77638fe
Fix lint and unit test
yhaliaw Jan 22, 2024
377b47f
Merge branch 'main' into bug/update-dep-race-condition
yhaliaw Jan 22, 2024
91c1f43
Fix merge issues
yhaliaw Jan 22, 2024
a7f096f
Merge branch 'main' into bug/update-dep-race-condition
yhaliaw Jan 22, 2024
04bde3b
Fix the integration test
yhaliaw Jan 23, 2024
200185c
Merge branch 'main' into bug/update-dep-race-condition
yhaliaw Jan 23, 2024
5e593fa
Fix code comment
yhaliaw Jan 23, 2024
3b9126a
Fix integration test
yhaliaw Jan 23, 2024
6220bba
Change repo used for integration test
yhaliaw Jan 24, 2024
9945a5b
Merge branch 'main' into bug/update-dep-race-condition
yhaliaw Jan 24, 2024
e66c17b
Fix waiting for busy runner
yhaliaw Jan 24, 2024
7efee67
Use enum as RunnerManager.flush argument
yhaliaw Jan 24, 2024
d8bbf61
Fix integration test
yhaliaw Jan 24, 2024
510fa28
Increase the timeout
yhaliaw Jan 26, 2024
9b6f3f7
Merge branch 'main' into bug/update-dep-race-condition
yhaliaw Jan 26, 2024
4e4457b
Fix lint
yhaliaw Jan 26, 2024
fe52afa
Assert workflow was successfully dispatched
yhaliaw Jan 26, 2024
2db55e2
Fix workflow input type
yhaliaw Jan 26, 2024
8f16e7e
Fix github_client usage
yhaliaw Jan 26, 2024
87fe221
Fix integration test
yhaliaw Jan 26, 2024
b9adcc0
Fix integration test
yhaliaw Jan 26, 2024
987f5c7
Fix timezone format
yhaliaw Jan 26, 2024
c9d08ac
Update enum docstring
yhaliaw Jan 27, 2024
4ab75c9
Fix code comments
yhaliaw Jan 29, 2024
33fc0a4
Merge branch 'main' into bug/update-dep-race-condition
yhaliaw Jan 29, 2024
e48b355
Merge branch 'main' into bug/update-dep-race-condition
yhaliaw Jan 29, 2024
6fdc6ce
Refactor test to use helper functions
yhaliaw Jan 30, 2024
7b7bdbe
Merge branch 'main' into bug/update-dep-race-condition
yanksyoon Jan 30, 2024
5955379
Fix exception thrown in tests
yhaliaw Jan 31, 2024
0e029a4
Disable pylint due to fastcore lib issues
yhaliaw Jan 31, 2024
fff0d70
Test fix
yhaliaw Jan 31, 2024
14f6625
Resolve the race condition
yhaliaw Jan 31, 2024
020bc6e
Resolve the race condition between restart service and flushing runner
yhaliaw Jan 31, 2024
9bea9e2
Fix lint issues
yhaliaw Jan 31, 2024
2053928
Fix timeout
yhaliaw Feb 1, 2024
abfc217
Merge branch 'main' into bug/update-dep-race-condition
yhaliaw Feb 1, 2024
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
2 changes: 1 addition & 1 deletion .github/workflows/integration_test.yaml
Original file line number Diff line number Diff line change
Expand Up @@ -14,7 +14,7 @@ jobs:
provider: lxd
test-tox-env: integration-juju2.9
modules: '["test_charm_fork_repo", "test_charm_no_runner", "test_charm_scheduled_events", "test_charm_one_runner", "test_charm_metrics", "test_self_hosted_runner", "test_charm_with_proxy", "test_charm_with_juju_storage"]'
integration-tests-juju3:
integration-tests:
name: Integration test with juju 3.1
uses: canonical/operator-workflows/.github/workflows/integration_test.yaml@main
secrets: inherit
Expand Down
29 changes: 29 additions & 0 deletions .github/workflows/workflow_dispatch_wait_test.yaml
Original file line number Diff line number Diff line change
@@ -0,0 +1,29 @@
name: Workflow Dispatch Wait Tests

on:
# Manually dispatched workflow action
workflow_dispatch:
inputs:
runner:
description: 'Self hosted gh runner'
required: true
minutes:
description: 'Number of minutes to wait'
# Number type not supported in workflow dispatch: https://github.com/orgs/community/discussions/67182
# Seems to be by design: https://github.blog/changelog/2021-11-10-github-actions-input-types-for-manual-workflows/
default: '2'


jobs:
workflow-dispatch-tests:
runs-on: [self-hosted, linux, x64, "${{ inputs.runner }}"]
steps:
- name: Echo input variable and message
run: |
echo "Hello, runner: ${{ inputs.runner }}"
- name: Wait
run: |
sleep ${{ inputs.minutes }}m
- name: Always echo a message
if: always()
run: echo "Should not echo if pre-job script failed"
23 changes: 12 additions & 11 deletions src-docs/runner_manager.py.md
Original file line number Diff line number Diff line change
Expand Up @@ -17,7 +17,7 @@ Runner Manager manages the runners on LXD and GitHub.
## <kbd>class</kbd> `RunnerManager`
Manage a group of runners according to configuration.

<a href="../src/runner_manager.py#L55"><img align="right" style="float:right;" src="https://img.shields.io/badge/-source-cccccc?style=flat-square"></a>
<a href="../src/runner_manager.py#L56"><img align="right" style="float:right;" src="https://img.shields.io/badge/-source-cccccc?style=flat-square"></a>

### <kbd>function</kbd> `__init__`

Expand Down Expand Up @@ -46,7 +46,7 @@ Construct RunnerManager object for creating and managing runners.

---

<a href="../src/runner_manager.py#L658"><img align="right" style="float:right;" src="https://img.shields.io/badge/-source-cccccc?style=flat-square"></a>
<a href="../src/runner_manager.py#L698"><img align="right" style="float:right;" src="https://img.shields.io/badge/-source-cccccc?style=flat-square"></a>

### <kbd>function</kbd> `build_runner_image`

Expand All @@ -66,7 +66,7 @@ Build container image in test mode, else virtual machine image.

---

<a href="../src/runner_manager.py#L116"><img align="right" style="float:right;" src="https://img.shields.io/badge/-source-cccccc?style=flat-square"></a>
<a href="../src/runner_manager.py#L117"><img align="right" style="float:right;" src="https://img.shields.io/badge/-source-cccccc?style=flat-square"></a>

### <kbd>function</kbd> `check_runner_bin`

Expand All @@ -83,12 +83,12 @@ Check if runner binary exists.

---

<a href="../src/runner_manager.py#L525"><img align="right" style="float:right;" src="https://img.shields.io/badge/-source-cccccc?style=flat-square"></a>
<a href="../src/runner_manager.py#L550"><img align="right" style="float:right;" src="https://img.shields.io/badge/-source-cccccc?style=flat-square"></a>

### <kbd>function</kbd> `flush`

```python
flush(flush_busy: bool = True) → int
flush(flush_busy: bool = True, wait_repo_check: bool = False) → int
```

Remove existing runners.
Expand All @@ -97,7 +97,8 @@ Remove existing runners.

**Args:**

- <b>`flush_busy`</b>: Whether to flush busy runners as well.
- <b>`flush_busy`</b>: Whether to flush busy runners.
- <b>`wait_repo_check`</b>: Whether to wait for busy runner to complete repo-policy-compliance check before flushing the runners.



Expand All @@ -106,7 +107,7 @@ Remove existing runners.

---

<a href="../src/runner_manager.py#L217"><img align="right" style="float:right;" src="https://img.shields.io/badge/-source-cccccc?style=flat-square"></a>
<a href="../src/runner_manager.py#L218"><img align="right" style="float:right;" src="https://img.shields.io/badge/-source-cccccc?style=flat-square"></a>

### <kbd>function</kbd> `get_github_info`

Expand All @@ -123,7 +124,7 @@ Get information on the runners from GitHub.

---

<a href="../src/utilities.py#L124"><img align="right" style="float:right;" src="https://img.shields.io/badge/-source-cccccc?style=flat-square"></a>
<a href="../src/utilities.py#L125"><img align="right" style="float:right;" src="https://img.shields.io/badge/-source-cccccc?style=flat-square"></a>

### <kbd>function</kbd> `get_latest_runner_bin_url`

Expand All @@ -148,7 +149,7 @@ The runner binary URL changes when a new version is available.

---

<a href="../src/runner_manager.py#L448"><img align="right" style="float:right;" src="https://img.shields.io/badge/-source-cccccc?style=flat-square"></a>
<a href="../src/runner_manager.py#L449"><img align="right" style="float:right;" src="https://img.shields.io/badge/-source-cccccc?style=flat-square"></a>

### <kbd>function</kbd> `reconcile`

Expand All @@ -172,7 +173,7 @@ Bring runners in line with target.

---

<a href="../src/runner_manager.py#L668"><img align="right" style="float:right;" src="https://img.shields.io/badge/-source-cccccc?style=flat-square"></a>
<a href="../src/runner_manager.py#L708"><img align="right" style="float:right;" src="https://img.shields.io/badge/-source-cccccc?style=flat-square"></a>

### <kbd>function</kbd> `schedule_build_runner_image`

Expand All @@ -184,7 +185,7 @@ Install cron job for building runner image.

---

<a href="../src/utilities.py#L150"><img align="right" style="float:right;" src="https://img.shields.io/badge/-source-cccccc?style=flat-square"></a>
<a href="../src/utilities.py#L151"><img align="right" style="float:right;" src="https://img.shields.io/badge/-source-cccccc?style=flat-square"></a>

### <kbd>function</kbd> `update_runner_bin`

Expand Down
2 changes: 1 addition & 1 deletion src/charm.py
Original file line number Diff line number Diff line change
Expand Up @@ -578,8 +578,8 @@ def _check_and_update_dependencies(self) -> bool:

self.unit.status = MaintenanceStatus("Flushing runners due to updated deps")

runner_manager.flush(flush_busy=False, wait_repo_check=True)
self._start_services()
runner_manager.flush(flush_busy=False)

self.unit.status = ActiveStatus()
return service_updated or runner_bin_updated
Expand Down
44 changes: 42 additions & 2 deletions src/runner_manager.py
Original file line number Diff line number Diff line change
Expand Up @@ -10,6 +10,7 @@
import tarfile
import time
import urllib.request
from datetime import datetime, timedelta, timezone
from pathlib import Path
from typing import Dict, Iterator, Optional, Type

Expand Down Expand Up @@ -522,15 +523,54 @@ def reconcile(self, quantity: int, resources: VirtualMachineResources) -> int:
)
return delta

def flush(self, flush_busy: bool = True) -> int:
def _runners_in_pre_job(self) -> bool:
"""Check there exist runners in the pre-job script stage.

If a runner has taken a job for 1 minute or more, it is assumed to exit the pre-job script.

Returns:
Whether there are runners that has taken a job and run for less than 1 minute.
"""
now = datetime.now(timezone.utc)
busy_runners = [
runner for runner in self._get_runners() if runner.status.exist and runner.status.busy
]
for runner in busy_runners:
exit_code, stdout, _ = runner.instance.execute(
["/usr/bin/stat", "-c", "'%w'", "/home/ubuntu/github-runner/_work"]
jdkandersson marked this conversation as resolved.
Show resolved Hide resolved
)
if exit_code != 0:
return False
date_str, time_str, timezone_str = stdout.read().decode("utf-8").split(" ")
job_start_time = datetime.fromisoformat(f"{date_str}T{time_str[:12]}{timezone_str}")
if job_start_time + timedelta(minutes=1) > now:
return False
return True

def flush(self, flush_busy: bool = True, wait_repo_check: bool = False) -> int:
"""Remove existing runners.

Args:
flush_busy: Whether to flush busy runners as well.
flush_busy: Whether to flush busy runners.
wait_repo_check: Whether to wait for busy runner to complete
repo-policy-compliance check before flushing the runners.

Returns:
Number of runners removed.
"""
if wait_repo_check:
jdkandersson marked this conversation as resolved.
Show resolved Hide resolved
for _ in range(5):
jdkandersson marked this conversation as resolved.
Show resolved Hide resolved
if not self._runners_in_pre_job():
break
time.sleep(30)
else:
logger.warning(
(
"Proceed with flush runner after timeout waiting on runner in setup "
"stage, pre-job script might fail in currently running jobs"
)
)

if flush_busy:
runners = [runner for runner in self._get_runners() if runner.status.exist]
else:
Expand Down
6 changes: 6 additions & 0 deletions tests/integration/conftest.py
Original file line number Diff line number Diff line change
Expand Up @@ -19,6 +19,7 @@
from juju.model import Model
from pytest_operator.plugin import OpsTest

from github_client import GithubClient
from tests.integration.helpers import (
deploy_github_runner_charm,
ensure_charm_has_runner,
Expand Down Expand Up @@ -137,6 +138,11 @@ def model(ops_test: OpsTest) -> Model:
return ops_test.model


@pytest.fixture(scope="module")
def runner_manager_github_client(token: str) -> GithubClient:
return GithubClient(token=token)


@pytest_asyncio.fixture(scope="module")
async def app_no_runner(
model: Model,
Expand Down
1 change: 1 addition & 0 deletions tests/integration/helpers.py
Original file line number Diff line number Diff line change
Expand Up @@ -22,6 +22,7 @@
DISPATCH_TEST_WORKFLOW_FILENAME = "workflow_dispatch_test.yaml"
DISPATCH_CRASH_TEST_WORKFLOW_FILENAME = "workflow_dispatch_crash_test.yaml"
DISPATCH_FAILURE_TEST_WORKFLOW_FILENAME = "workflow_dispatch_failure_test.yaml"
DISPATCH_WAIT_TEST_WORKFLOW_FILENAME = "workflow_dispatch_wait_test.yaml"


async def check_runner_binary_exists(unit: Unit) -> bool:
Expand Down
72 changes: 72 additions & 0 deletions tests/integration/test_charm_one_runner.py
Original file line number Diff line number Diff line change
Expand Up @@ -3,12 +3,20 @@

"""Integration tests for github-runner charm."""

from datetime import datetime, timezone
from time import sleep

import github
import pytest
import requests
from github.Repository import Repository
from juju.application import Application
from juju.model import Model

from charm import GithubRunnerCharm
from github_client import GithubClient
from tests.integration.helpers import (
DISPATCH_WAIT_TEST_WORKFLOW_FILENAME,
assert_resource_lxd_profile,
get_runner_names,
reconcile,
Expand Down Expand Up @@ -189,6 +197,70 @@ async def test_reconcile_runners_with_lxd_storage_pool_failure(

@pytest.mark.asyncio
@pytest.mark.abort_on_fail
yhaliaw marked this conversation as resolved.
Show resolved Hide resolved
async def test_wait_on_busy_runner_repo_check(
model: Model,
app_runner: Application,
github_repository: Repository,
runner_manager_github_client: GithubClient,
) -> None:
"""
arrange: A working application with no runners.
act:
1. Set dockerhub-mirror config and spawn one runner.
2. Dispatch a workflow.
yhaliaw marked this conversation as resolved.
Show resolved Hide resolved
assert:
1. registry-mirrors is setup in /etc/docker/daemon.json of runner.
2. Message about dockerhub_mirror appears in logs.
yhaliaw marked this conversation as resolved.
Show resolved Hide resolved
"""

unit = app_runner.units[0]

names = await get_runner_names(unit)
assert len(names) == 1

runner_to_be_used = names[0]

main_branch = github_repository.get_branch(github_repository.default_branch)
workflow = github_repository.get_workflow(id_or_file_name=DISPATCH_WAIT_TEST_WORKFLOW_FILENAME)

workflow.create_dispatch(main_branch, {"runner": app_runner.name, "minutes": 30})

# Wait until runner is busy.
for _ in range(30):
all_runners = runner_manager_github_client.get_runner_github_info(
f"{github_repository.owner}/{github_repository.name}"
)
runners = [runner for runner in all_runners if runner.name == runner_to_be_used]
assert len(runners) == 1, "Should not occur GitHub should enforce unique naming"
runner = runners[0]
if runner["busy"]:
start_time = datetime.now(timezone.utc)
break

sleep(10)
else:
assert False, "Timeout while waiting for workflow to complete"

# Unable to find the run id of the workflow that was dispatched.
# Therefore, all runs after this test start should pass the conditions.
for run in workflow.get_runs(created=f">={start_time.isoformat()}"):
if start_time > run.created_at:
continue

try:
logs_url = run.jobs()[0].logs_url()
logs = requests.get(logs_url).content.decode("utf-8")
except github.GithubException.GithubException:
continue

if f"Job is about to start running on the runner: {app_runner.name}-" in logs:
assert run.jobs()[0].conclusion == "success"
assert (
"A private docker registry is setup as a dockerhub mirror for this self-hosted"
" runner."
) in logs


yhaliaw marked this conversation as resolved.
Show resolved Hide resolved
async def test_change_runner_storage(model: Model, app: Application) -> None:
"""
arrange: A working application with one runners using memory as disk.
Expand Down
Loading