Hung or stuck instances not torn down #38

Open
MaxDiOrio opened this issue Sep 27, 2024 · 1 comment

@MaxDiOrio

When a build fails and the EC2 instance never runs the shutdown script, the instance appears never to be cleaned up. In the run below, the job timed out while waiting for the self-hosted runner to register:

Ec2 spot instance strategy is set to none
Starting instance with none strategy
AWS EC2 instance i-01967a62320981c42 is up and running
Waiting 30s before polling for runner
Polling for runner every 10s
Waiting...
Waiting...
...
Waiting...
Error: The operation was canceled.

And the instance remained up.
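
For illustration, here is a minimal sketch of how a poll loop like the one in the log could clean up after itself when the runner never registers. It is not the action's actual code: `isRunnerRegistered`, `terminateInstance`, `PollOptions` and the timeout value are hypothetical stand-ins, and only the 30s initial wait and 10s poll interval come from the log above.

```typescript
// Hypothetical sketch: `isRunnerRegistered` and `terminateInstance` are injected
// stand-ins, not functions from the action or from the AWS SDK.
interface PollOptions {
  initialDelayMs: number; // wait before the first poll (30s in the log above)
  intervalMs: number;     // delay between polls (10s in the log above)
  timeoutMs: number;      // assumed overall polling budget
}

const sleep = (ms: number) =>
  new Promise<void>((resolve) => setTimeout(resolve, ms));

async function waitForRunnerOrTerminate(
  isRunnerRegistered: () => Promise<boolean>,
  terminateInstance: () => Promise<void>,
  opts: PollOptions,
): Promise<boolean> {
  let registered = false;
  try {
    await sleep(opts.initialDelayMs);
    const deadline = Date.now() + opts.timeoutMs;
    while (Date.now() < deadline) {
      if (await isRunnerRegistered()) {
        registered = true;
        break;
      }
      await sleep(opts.intervalMs);
    }
    return registered;
  } finally {
    // Runs on timeout and on thrown errors, so the instance is not left behind.
    // It does NOT run if the process itself is killed outright.
    if (!registered) {
      await terminateInstance();
    }
  }
}
```

With the values from the log this would be called with `initialDelayMs: 30_000` and `intervalMs: 10_000`. A `finally` block cannot help if the workflow process is killed during cancellation, so an instance-side TTL would still be needed as a backstop.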

@mahdi-torabi
Contributor

      `echo "shutdown -P +1" > $CURRENT_PATH/shutdown_script.sh`,
      "chmod +x $CURRENT_PATH/shutdown_script.sh",
      `echo "./config.sh remove --token ${runnerRegistrationToken.token} || true" > $CURRENT_PATH/shutdown_now_script.sh`,
      `echo "shutdown -h now" >> $CURRENT_PATH/shutdown_now_script.sh`,
      "chmod +x $CURRENT_PATH/shutdown_now_script.sh",
      "export ACTIONS_RUNNER_HOOK_JOB_COMPLETED=$CURRENT_PATH/shutdown_script.sh",
  • The code above creates a shutdown script and uses ACTIONS_RUNNER_HOOK_JOB_COMPLETED to make sure it runs once a job finishes.
  • We also have github_job_start_ttl_seconds, which defines how long an instance is allowed to stay idle before a job starts.
  • Finally, we have the instance TTL, which takes effect if both of the options above fail for any reason.
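
To make the layering concrete, here is a small model of which safeguard would be expected to stop the instance first. This is only a sketch under stated assumptions, not the action's implementation: the type and field names are made up, both TTLs are assumed to be measured from instance launch, and the 60-second delay is taken from the `shutdown -P +1` line above.

```typescript
// Illustrative model only: names and the "seconds since launch" convention are assumptions.
type Safeguard = "job-completed-hook" | "job-start-ttl" | "instance-ttl";

interface InstanceTimeline {
  jobStartedAt?: number;   // seconds since launch; undefined if no job ever started
  jobCompletedAt?: number; // seconds since launch; undefined if the hook never fired
}

interface Ttls {
  jobStartTtlSeconds: number; // stands in for github_job_start_ttl_seconds
  instanceTtlSeconds: number; // stands in for the instance TTL
}

interface Shutdown {
  safeguard: Safeguard;
  atSeconds: number;
}

// `shutdown -P +1` powers off one minute after the hook runs.
const HOOK_SHUTDOWN_DELAY_SECONDS = 60;

function expectedShutdown(timeline: InstanceTimeline, ttls: Ttls): Shutdown {
  // The instance TTL always applies as the last resort.
  const candidates: Shutdown[] = [
    { safeguard: "instance-ttl", atSeconds: ttls.instanceTtlSeconds },
  ];
  // The job-completed hook only applies if a job actually finished.
  if (timeline.jobCompletedAt !== undefined) {
    candidates.push({
      safeguard: "job-completed-hook",
      atSeconds: timeline.jobCompletedAt + HOOK_SHUTDOWN_DELAY_SECONDS,
    });
  }
  // The job-start TTL only applies while no job has started.
  if (timeline.jobStartedAt === undefined) {
    candidates.push({
      safeguard: "job-start-ttl",
      atSeconds: ttls.jobStartTtlSeconds,
    });
  }
  // Whichever safeguard fires earliest wins.
  return candidates.reduce((a, b) => (b.atSeconds < a.atSeconds ? b : a));
}
```

Under this model, a case like the one reported (the runner never registers, so no job starts and the hook never fires) would have to be caught by the job-start TTL or the instance TTL, which is why it matters whether those fire when the runner itself never comes up.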

I just tested with a job that had an error intentionally introduced to make it fail. Exactly 1 minute after the failure the instance was terminated, which matches the one-minute delay of shutdown -P +1 in the hook script.

Do you have an example of a workflow that could trigger a different type of failure?
