Coordinator is not able to access runner VM #5

Open
chainzero opened this issue Feb 11, 2024 · 0 comments
Labels
bug Something isn't working


Describe the bug
When a GitHub job is invoked, it hangs with the status "Job is about to start running on the runner: gha-runner-coordinator_0 (repository)".

The GitHub Actions self-hosted runners are created and registered properly. When a workflow is triggered by a push event, the workflow starts and the gha-runner-coordinator-auto-spawned0 VM is created in GCP properly. The job appears to be about to run on the spawned runner VM but never proceeds past the "Job is about to start ..." message. It appears that, even though the runner VM is spawned, the coordinator has trouble accessing or controlling it. Eventually the runner VM is torn down and a new runner VM is created. This cycle repeats for approximately 5 runner VM creations, and then the job fails after approximately 10 minutes with the message "VM starter exited with non-zero exit code: 1".

The runner/worker logs on the coordinator show the VM creation and then "Attempting to connect ...." messages, but nothing else of interest.

This looks like the coordinator cannot access or control the spawned runner VM. That seems odd, though, since the coordinator is able to create the VM and is using a GCP service account that should have the necessary rights.
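One way to narrow this down is to check, from the coordinator, whether a TCP port on the runner VM is reachable at all. The following is only a minimal sketch: the function name `can_connect` is made up for illustration, and port 22 (SSH) is an assumption, since nothing here confirms which port or protocol the coordinator actually uses to reach the runner VM. The address 10.0.0.3 is the one exported in the workflow log.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


# Example call from the coordinator (port 22 is an assumption):
#   can_connect("10.0.0.3", 22)
```

A `False` result would be consistent with a network or firewall problem between the coordinator and the runner subnet, while `True` would suggest looking further up the stack. Neither outcome is conclusive on its own.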

The GitHub self-hosted runners are created properly and become active when the job is initiated.

The GitHub Actions workflow logs show the runner VM creation, but the connection attempts are not successful and are retried:

16:23:48 | Attempting to spawn a machine... (PID: 11852)
16:23:48 | Instance name:	 gha-runner-coordinator-auto-spawned0
16:23:48 | Instance type:	 n2-standard-4
16:23:48 | Disk type:	 pd-ssd
16:23:48 | LABELS: [{'github_job_full': 'centos', 'github_sha': 'a5f32cd5c2aa5f970515cb8fee60f255bf4a340c', 'github_run_id': '7863228087'}]
16:23:48 | Preemptible: False
16:23:49 | Attempting to spawn a machine in us-central1-b
16:23:57 | Machine spawned in 8.02 seconds.
16:23:58 | export gha-runner-coordinator-auto-spawned0=10.0.0.3
16:23:58 | Connecting... [1/15] 
16:24:14 | Connecting... [2/15] 
16:24:16 | Machine ready
16:26:16 |
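
To make the timing in the log above easier to read, the gaps between selected log lines can be computed with a short script. This is just a sketch: the timestamps and messages are transcribed by hand from the log, and the helper name `gaps` is made up for illustration.

```python
from datetime import datetime

# Timestamps and messages transcribed from the workflow log above.
LOG = [
    ("16:23:48", "Attempting to spawn a machine..."),
    ("16:23:57", "Machine spawned in 8.02 seconds."),
    ("16:23:58", "Connecting... [1/15]"),
    ("16:24:14", "Connecting... [2/15]"),
    ("16:24:16", "Machine ready"),
    ("16:26:16", ""),
]


def gaps(entries):
    """Return (seconds since the previous entry, message) for each entry after the first."""
    times = [datetime.strptime(ts, "%H:%M:%S") for ts, _ in entries]
    return [
        (int((cur - prev).total_seconds()), msg)
        for prev, cur, (_, msg) in zip(times, times[1:], entries[1:])
    ]


for seconds, message in gaps(LOG):
    print(f"+{seconds:>4}s  {message!r}")
```

The final, empty log line comes exactly 120 seconds after "Machine ready". That might point to a fixed timeout somewhere after the machine is reported ready, but this is a guess and not something the logs confirm.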

The spec used for the runner VM is:

{
  "gcp": {
    "type": "n2-standard-4",
    "allowed_machine_types": ["n2-standard-4", "n2-highmem-4"],
    "disk_type": "pd-ssd",
    "subnet": "gha-runner-net",
    "image": "scalenode-3b2a489--x86",
    "sa": "gha-runner-coordinator-sa@e2e-integration-413420.iam.gserviceaccount.com"
  },
  "machine": {
    "disk": 30,
    "preemptible": false
  }
}
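
To rule out an obviously malformed spec, the JSON above can be run through a small sanity check. This is a sketch only: the field names come from the spec, but the checks themselves (and the helper name `check_spec`) are assumptions about what a sane spec looks like, not rules taken from the coordinator.

```python
import json

# The runner VM spec from above, copied verbatim.
SPEC = json.loads("""
{
  "gcp": {
    "type": "n2-standard-4",
    "allowed_machine_types": ["n2-standard-4", "n2-highmem-4"],
    "disk_type": "pd-ssd",
    "subnet": "gha-runner-net",
    "image": "scalenode-3b2a489--x86",
    "sa": "gha-runner-coordinator-sa@e2e-integration-413420.iam.gserviceaccount.com"
  },
  "machine": {
    "disk": 30,
    "preemptible": false
  }
}
""")


def check_spec(spec):
    """Return a list of problems found; an empty list means all checks passed."""
    problems = []
    gcp = spec.get("gcp", {})
    machine = spec.get("machine", {})

    # Assumed rule: the default machine type should be one of the allowed types.
    if gcp.get("type") not in gcp.get("allowed_machine_types", []):
        problems.append("gcp.type is not listed in gcp.allowed_machine_types")

    # Assumed rule: these string fields should be present and non-empty.
    for key in ("disk_type", "subnet", "image", "sa"):
        if not gcp.get(key):
            problems.append(f"gcp.{key} is missing or empty")

    # Assumed rule: disk is a positive integer (bool is excluded explicitly).
    disk = machine.get("disk")
    if isinstance(disk, bool) or not isinstance(disk, int) or disk <= 0:
        problems.append("machine.disk is not a positive integer")

    if not isinstance(machine.get("preemptible"), bool):
        problems.append("machine.preemptible is not a boolean")

    return problems


print(check_spec(SPEC))
```

The spec passes these checks, so if the spec is involved at all, the cause is more likely in the values themselves (for example the subnet or image) than in the structure.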

To Reproduce
Steps to reproduce the behavior:
1. Use the runner VM JSON spec shared above; the issue should be reproducible with it.
2. Initiate the GitHub Actions workflow and observe that the runner VM is created but the workflow steps never progress, because the connection to the runner VM is not successful.

Expected behavior

The coordinator should connect to the runner VMs and complete the job steps.

Runner Version and Platform

Using the latest version and the VM image created during the install steps provided in the README.

Please let me know if I could provide any further info that would be helpful.

chainzero added the bug label on Feb 11, 2024