How to reboot during provisioning? Request for docs or feature #11190

Open
robotrapta opened this issue Aug 5, 2021 · 12 comments
Labels
docs enhancement reconnect sync to jira For issues that need to be imported to Packer internal JIRA backlog
Milestone

Comments

@robotrapta

Description

Howdy y'all! I need to restart my machine during provisioning. I'm new to packer coming from systems like ansible and chef. I've read a bunch of docs on this, and am still confused. So I think this is at least a doc bug, and perhaps a full feature request.

I know this issue has been discussed in #1983 and that there was a proposal to build a native provisioner in #4555 - a native feature makes a ton of sense to me.

I know the recommended way to do this is with a shell provisioner. This seems to rely on the retry mechanism, treating the reboot as a kinda-expected error and then getting the system to recover from it appropriately. Maybe this is useful for other reasons, but it feels like an ugly hack. A hack would be okay if it was clearly documented and worked reliably. But it's not clearly documented - there is no example that I can find showing how to do this. My first few attempts to get it to work after reading the docs were unreliable - sometimes it failed, and sometimes it re-ran things unnecessarily. So it would be great to just tell people the standard way to do this if there isn't a built-in way to do it.

I think a good way to do it is this:

  provisioner "shell" {
    expect_disconnect = true
    inline = [
      "sudo reboot now",
    ]
    pause_after  = "10s"
  }

I believe the pause_after is important to reduce the risk of a race condition when issuing the next provisioning command, though I'm not certain. If so, that seems to me like a pretty strong argument for making this a native feature.

If that's in fact correct, putting that example code in the docs would be awesome. Thanks!

Use Case(s)

I'm trying to install nvidia drivers to use CUDA, which generally requires a reboot.

@azr
Contributor

azr commented Aug 9, 2021

Hello @robotrapta! Thanks for opening this. Yeah, I agree with you that this feels like a hack. The thing is that Packer sort of expects to log in to an instance once, at the beginning, and there is no internal/native way to 'reconnect' that feels great. So we would like to introduce a new feature soon that lets you 'connect' in the middle of a build. This would allow changing SSH settings or rebooting after an installation. But that one will not come straight away, as we have quite a large and growing to-do list. Making a docs page about this would be a good idea, I think. I'll bring it up with the team.

One thing that comes to mind is that you could install the drivers at the end of your provisioning steps and just shut down/save the machine. Upon next boot, things should be configured. If you have more things to install/configure, you could, for example, start another build?

With that said, if that does not work out, do you mind sharing your build file and your logs? Maybe we can help you better/differently from there.

@azr azr added the question label Aug 9, 2021
@keviiin38

Hello! I'm trying to achieve something similar:

  • Provision using Ansible to install Nvidia CUDA drivers and other stuff
  • Need a reboot of the instance to reload kernel, etc...
  • Finally running a last shell with Serverspec and checking that everything is ok
{
    "provisioners": [
        {
            "type": "ansible",
            "playbook_file": "playbook.yml"
        },
        {
            "type": "shell",
            "inline": [ "reboot now" ],
            "expect_disconnect": true
        },
        {
            "type": "file",
            "source": "serverspec/",
            "destination": "/tmp",
            "pause_before": "30s"
        },
        {
            "type": "shell",
            "script": "serverspec.sh"
        }
    ]
}

I'm using pause_before in the next provisioner instead of pause_after, but I don't know which one is better.
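For comparison, here is a sketch of the same template in HCL2. It is a config fragment for your existing build block. The `sudo` on the reboot command is an assumption, so drop it if the build connects as root. As far as I can tell, `pause_after` on the reboot step and `pause_before` on the next step both insert a fixed sleep between the two provisioners, so they produce the same delay. Neither one waits for the machine to actually come back. `pause_before` has the small advantage of staying attached to the step that needs the wait if you reorder provisioners.

```hcl
provisioner "ansible" {
  playbook_file = "playbook.yml"
}

provisioner "shell" {
  inline            = ["sudo reboot now"] # sudo is an assumption; drop it if you connect as root
  expect_disconnect = true                # the dropped connection is not treated as an error
  # pause_after     = "30s"               # option A: sleep after the reboot step...
}

provisioner "file" {
  source       = "serverspec/"
  destination  = "/tmp"
  pause_before = "30s" # ...or option B: sleep before the next step (same net delay)
}

provisioner "shell" {
  script = "serverspec.sh"
}
```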

@robotrapta
Author

Hi @azr, thanks for the suggestion of installing the drivers at the end. Unfortunately, that doesn't work for me. There's a bunch of software I need to install that depends on CUDA, and some of those installations will fail if they can't confirm that NVIDIA hardware/drivers are present.
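Here is a sketch of that ordering as a config fragment, assuming an Ubuntu image. The driver install command (`ubuntu-drivers autoinstall`) and the `./scripts/install_cuda_apps.sh` script are placeholders and do not come from this thread. The idea is to place a cheap check right after the reboot: `nvidia-smi` fails if it cannot talk to a loaded driver. That way the build stops at that point instead of failing partway through the CUDA-dependent installs.

```hcl
# Install the NVIDIA driver (assumed command for Ubuntu).
provisioner "shell" {
  inline = ["sudo ubuntu-drivers autoinstall"]
}

# Reboot so the new kernel module gets loaded.
provisioner "shell" {
  inline            = ["sudo reboot"]
  expect_disconnect = true
}

# Gate the rest of the build on the driver actually being live.
provisioner "shell" {
  pause_before        = "30s" # give the reboot time to start
  start_retry_timeout = "10m" # keep retrying the start while the machine boots
  inline              = ["nvidia-smi"]
}

# Hypothetical script that installs software needing working drivers.
provisioner "shell" {
  script = "./scripts/install_cuda_apps.sh"
}
```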

@azr azr added the reconnect label Aug 13, 2021
@azr azr assigned azr and unassigned azr Aug 13, 2021
@azr azr added this to the 1.7.8 milestone Oct 6, 2021
@nywilken nywilken modified the milestones: 1.7.8, 1.7.9 Oct 6, 2021
@ccrvlh

ccrvlh commented Apr 26, 2022

Hi, similar use case here, and found the results to be sort of inconsistent.
My current template is something like:

  provisioner "shell" {
    pause_before = "10s"
    script = "./scripts/updates.sh"
  }

  provisioner "ansible-local" {
    role_paths      = ["./roles"]
    playbook_file   = "./roles/ubuntu/ubuntu.yml"
    command         = "sudo ansible-playbook -i localhost -e 'ansible_python_interpreter=/usr/bin/python3'"
  }

  provisioner "shell" {
    script = "./scripts/reboot.sh"
    expect_disconnect = true
  }

  provisioner "shell" {
    pause_before = "120s"
    script = "./scripts/finish.sh"
    max_retries = 3
  }

What I've found is that I can always see the message that reboot.sh prints (Rebooting to apply updates), and sometimes I see the next message (Pausing 2m0s before the next provisioner, or something similar; I can't remember exactly). But sometimes the build just fails after the reboot message. This seems strange: I've been watching the VMs, a reboot usually takes around 30-60 seconds, and I do have the retry mechanism on the finish provisioner. I couldn't find a reliable way to do this. I ran about 10-20 builds today, and it's completely hit or miss.

I understand the solution proposed by @azr, but just like @robotrapta, I also want to perform actions after the reboot. There may be a workaround, sure, but rebooting is just the natural way to go for us.

The impression I get is that Packer "crashes" (probably not the right wording, pardon me) after the reboot, even with expect_disconnect, and doesn't understand the next task to perform (waiting to reconnect).

Maybe I could run a few more tests with debug logs enabled.
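One pattern worth trying is sketched below. It is not a confirmed fix for the failures described above. The reboot step schedules the reboot a few seconds out and exits right away, so the provisioner gets a clean exit status before sshd goes down. The next step then gets enough `pause_before` and `start_retry_timeout` to outlast the boot. The 5-second delay and both timeout values are assumptions to tune, and passwordless `sudo` is assumed.

```hcl
provisioner "shell" {
  # Detach the reboot from this SSH session and return immediately;
  # redirecting output keeps the session from waiting on the background job.
  inline = [
    "echo 'Rebooting to apply updates'",
    "sudo nohup sh -c 'sleep 5 && reboot' >/dev/null 2>&1 &",
  ]
  expect_disconnect = true
}

provisioner "shell" {
  pause_before        = "60s" # well past the 5s delay, so the reboot has started
  start_retry_timeout = "10m" # keep retrying the start while the VM boots
  max_retries         = 3
  script              = "./scripts/finish.sh"
}
```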

@nywilken nywilken added sync to jira For issues that need to be imported to Packer internal JIRA backlog and removed track-internal question sync to jira For issues that need to be imported to Packer internal JIRA backlog labels Sep 30, 2022
@github-actions

This issue has been synced to JIRA for planning.

JIRA ID: HPR-770

@spuder
Contributor

spuder commented Oct 27, 2022

A request on how to reboot came up in the discuss group https://discuss.hashicorp.com/t/how-to-reboot-vm-with-packer/46083/2

You are already using pause_before and pause_after. Try also adding ssh_read_write_timeout = 5m.

e.g.

source "amazon-ebs" "ubuntu-bionic" {
  ami_name      = "ubuntu-bionic-18.04-hvm-ebs-{{timestamp}}"
  instance_type = "t2.micro"
  region        = "us-west-2"
  source_ami_filter {
    filters = {
      name                = "ubuntu/images/*ubuntu-bionic-18.04-amd64-server-*"
      root-device-type    = "ebs"
      virtualization-type = "hvm"
    }
    most_recent = true
    owners      = ["099720109477"]
  }
  ssh_username    = "ubuntu"
  ssh_read_write_timeout = "5m" # Allow reboots
}

@brianjmurrell

Or simply because you need to reboot into a newer kernel installed during provisioning, so that you can remove the one that is currently booted.
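Here is a sketch of that flow as a config fragment. It assumes a Debian/Ubuntu image, so the package commands are specific to that distro family: upgrade, reboot into the new kernel, then purge. As I understand it, Ubuntu's apt autoremove protects the running kernel. That is why the purge only works after the reboot.

```hcl
# Upgrade packages; this may pull in a newer kernel.
provisioner "shell" {
  inline = [
    "sudo apt-get update",
    "sudo DEBIAN_FRONTEND=noninteractive apt-get -y dist-upgrade",
  ]
}

# Reboot into the newly installed kernel.
provisioner "shell" {
  inline            = ["sudo reboot"]
  expect_disconnect = true
}

# The old kernel is no longer the running one, so it can now be removed.
provisioner "shell" {
  pause_before        = "30s"
  start_retry_timeout = "10m"
  inline = [
    "uname -r",
    "sudo DEBIAN_FRONTEND=noninteractive apt-get -y autoremove --purge",
  ]
}
```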

@ghost

ghost commented Feb 16, 2023

Is there any recommended way to reboot during provisioning?

@RaJiska

RaJiska commented Nov 21, 2023

The documentation actually mentions this.

First, set expect_disconnect in the provisioner that reboots your machine. Second, set start_retry_timeout in the subsequent shell provisioner.

@tenthirtyam
Contributor

Agree that this is already available with use of pause_before, expect_disconnect and start_retry_timeout in the shell provisioner.

pause_before (duration) - Sleep for duration before execution.

expect_disconnect (boolean) - Defaults to false. When true, allow the server to disconnect from Packer without throwing an error. A disconnect might happen if you restart the SSH server or reboot the host.

start_retry_timeout (string) - The amount of time to attempt to start the remote process. By default this is 5m or 5 minutes. This setting exists in order to deal with times when SSH may restart, such as a system reboot. Set this to a higher value if reboots take a longer amount of time.
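For a Linux build, the three options might fit together as in the config fragment below. The timing values are assumptions to adjust for how long your machine takes to reboot, and `uptime` is just a cheap command to confirm that the reconnect worked.

```hcl
provisioner "shell" {
  inline            = ["sudo reboot"]
  expect_disconnect = true # the SSH drop caused by the reboot is not an error
}

provisioner "shell" {
  pause_before        = "30s" # make sure the reboot is under way before reconnecting
  start_retry_timeout = "10m" # default is 5m; raise it if reboots are slow
  inline              = ["uptime"]
}
```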

Additionally, on Windows this is available through the windows-restart provisioner.

provisioner "windows-restart" {
  pause_before          = "30s"
  restart_check_command = "powershell -command \"& {Write-Output 'restarted.'}\""
  restart_timeout       = "10m"
}

cc @nywilken @lbajolet-hashicorp

@lbajolet-hashicorp
Contributor

Thanks for the update here @tenthirtyam,

This is an old issue, we do have documentation on those options, but maybe the workflow isn't intuitive, or the documentation is lacking.

That said, this hasn't been updated for a while, and most of the updates point to community resources or share examples of how the problem was fixed or worked around.

I'm tempted to close this issue now, but I'd like to hear from others that commented on this issue before: are you still experiencing the problem? Do you have suggestions on how we can improve Packer or the docs that would've helped you solve that issue?

@jaysoffian

jaysoffian commented Jun 21, 2024

None of the suggestions work reliably, at least, not in combination with the AWS session-manager-plugin. I've tried adding pause_after to the step that reboots, pause_before to the step following the reboot. I've tried adding an interim shell-local step. I've tried adding retries. I've tried setting ssh_read_write_timeout to something low like 1m, but that timeout doesn't seem to apply until after the connection has started.

What seems to be happening is that the session-manager-plugin itself does not reliably notice that the remote end has gone away, and we cannot adjust its timeout, which is apparently 1 hour. By the time it finally times out, it's too late.

Here's an example of what it looks like when it works:
     1	2024/06/21 14:45:35 ui: 2024-06-21T14:45:35-04:00:     amazon-ebs.builder: + shutdown -r +1s
     2	2024/06/21 14:45:35 ui: 2024-06-21T14:45:35-04:00:     amazon-ebs.builder: shutdown: [pid 1018]
     3	2024/06/21 14:45:35 ui: 2024-06-21T14:45:35-04:00:     amazon-ebs.builder: Shutdown at Fri Jun 21 18:45:36 2024.
     4	2024/06/21 14:45:35 ui: 2024-06-21T14:45:35-04:00:     amazon-ebs.builder: shutdown: can't detach from console
     5	2024/06/21 14:45:36 packer-plugin-amazon_v1.3.3-dev_x5.0_darwin_arm64 plugin: 2024/06/21 14:45:36 [INFO] RPC endpoint: Communicator ended with: 0
     6	2024/06/21 14:45:36 [INFO] 1639 bytes written for 'stdout'
     7	2024/06/21 14:45:36 [INFO] 0 bytes written for 'stderr'
     8	2024/06/21 14:45:36 [INFO] RPC client: Communicator ended with: 0
     9	2024/06/21 14:45:36 [INFO] RPC endpoint: Communicator ended with: 0
    10	2024/06/21 14:45:36 ui: 2024-06-21T14:45:36-04:00:     amazon-ebs.builder: Shutdown at Fri Jun 21 18:45:36 2024.
    11	2024/06/21 14:45:36 packer-provisioner-shell plugin: [INFO] 1639 bytes written for 'stdout'
    12	2024/06/21 14:45:36 packer-provisioner-shell plugin: [INFO] 0 bytes written for 'stderr'
    13	2024/06/21 14:45:36 packer-provisioner-shell plugin: [INFO] RPC client: Communicator ended with: 0
    14	2024/06/21 14:45:36 ui: 2024-06-21T14:45:36-04:00:     amazon-ebs.builder:
    15	2024/06/21 14:45:36 ui: 2024-06-21T14:45:36-04:00:     amazon-ebs.builder: System shutdown time has arrived
    16	2024/06/21 14:45:36 ui: 2024-06-21T14:45:36-04:00: ==> amazon-ebs.builder: Pausing 1m0s after this provisioner...
    17	2024/06/21 14:46:25 ui: 2024-06-21T14:46:25-04:00:     amazon-ebs.builder: SessionId: user.XXXXXXXX : document process failed unexpectedly: document worker timed out , check [ssm-document-worker]/[ssm-session-worker] log for crash reason
    18	2024/06/21 14:46:25 packer-plugin-amazon_v1.3.3-dev_x5.0_darwin_arm64 plugin: 2024/06/21 14:46:25 ssm: Starting PortForwarding session to instance i-XXXXXXXX
    19	2024/06/21 14:46:25 packer-plugin-amazon_v1.3.3-dev_x5.0_darwin_arm64 plugin: 2024/06/21 14:46:25 ssm: Terminating PortForwarding session "user.XXXXXXXX"
    20	2024/06/21 14:46:26 ui: 2024-06-21T14:46:26-04:00:     amazon-ebs.builder: Starting portForwarding session "user.XXXXXXXX".
    21	2024/06/21 14:46:26 packer-plugin-amazon_v1.3.3-dev_x5.0_darwin_arm64 plugin: 2024/06/21 14:46:26 Executing: /opt/aws/sessionmanagerplugin/bin/session-manager-plugin [{"SessionId":"XXXXXXXX","StreamUrl":"wss://XXXXXXXX"} XXXXXXXX StartSession  {"DocumentName":"AWS-StartPortForwardingSession","Parameters":{"localPortNumber":["8920"],"portNumber":["22"]},"Reason":null,"Target":"i-XXXXXXXX"} wss://XXXXXXXX]
    22	2024/06/21 14:46:26 ui: 2024-06-21T14:46:26-04:00:     amazon-ebs.builder: Starting session with SessionId: user.XXXXXXXX
    23	2024/06/21 14:46:33 ui: 2024-06-21T14:46:33-04:00:     amazon-ebs.builder: Port 8920 opened for sessionId user.XXXXXXXX.
    24	2024/06/21 14:46:33 ui: 2024-06-21T14:46:33-04:00:     amazon-ebs.builder: Waiting for connections...
    25	2024/06/21 14:46:36 [INFO] (telemetry) ending shell
    26	2024/06/21 14:46:36 [INFO] (telemetry) Starting provisioner shell
    27	2024/06/21 14:46:36 ui: 2024-06-21T14:46:36-04:00: ==> amazon-ebs.builder: Provisioning with shell script: ./stage2.setup_ami.sh
    28	2024/06/21 14:46:36 packer-provisioner-shell plugin: Opening ./stage2.setup_ami.sh for reading
    29	2024/06/21 14:46:36 packer-plugin-amazon_v1.3.3-dev_x5.0_darwin_arm64 plugin: 2024/06/21 14:46:36 [DEBUG] Opening new ssh session
    30	2024/06/21 14:46:36 packer-plugin-amazon_v1.3.3-dev_x5.0_darwin_arm64 plugin: 2024/06/21 14:46:36 [ERROR] ssh session open error: 'EOF', attempting reconnect
    31	2024/06/21 14:46:36 packer-plugin-amazon_v1.3.3-dev_x5.0_darwin_arm64 plugin: 2024/06/21 14:46:36 [DEBUG] reconnecting to TCP connection for SSH
    32	2024/06/21 14:46:36 packer-plugin-amazon_v1.3.3-dev_x5.0_darwin_arm64 plugin: 2024/06/21 14:46:36 [DEBUG] handshaking with SSH
    33	2024/06/21 14:46:36 ui: 2024-06-21T14:46:36-04:00:     amazon-ebs.builder: Connection accepted for session [user.XXXXXXXX]
    34	2024/06/21 14:46:36 packer-provisioner-shell plugin: [INFO] 18451 bytes written for 'uploadData'
    35	2024/06/21 14:46:36 [INFO] 18451 bytes written for 'uploadData'
    36	2024/06/21 14:46:37 packer-plugin-amazon_v1.3.3-dev_x5.0_darwin_arm64 plugin: 2024/06/21 14:46:37 [DEBUG] handshake complete!
    37	2024/06/21 14:46:37 packer-plugin-amazon_v1.3.3-dev_x5.0_darwin_arm64 plugin: 2024/06/21 14:46:37 [DEBUG] Opening new ssh session
    38	2024/06/21 14:46:37 packer-plugin-amazon_v1.3.3-dev_x5.0_darwin_arm64 plugin: 2024/06/21 14:46:37 [INFO] agent forwarding enabled
    39	2024/06/21 14:46:37 packer-plugin-amazon_v1.3.3-dev_x5.0_darwin_arm64 plugin: 2024/06/21 14:46:37 [DEBUG] Starting remote scp process:  scp -vt /tmp
    40	2024/06/21 14:46:38 packer-plugin-amazon_v1.3.3-dev_x5.0_darwin_arm64 plugin: 2024/06/21 14:46:38 [DEBUG] Started SCP session, beginning transfers...
    41	2024/06/21 14:46:38 packer-plugin-amazon_v1.3.3-dev_x5.0_darwin_arm64 plugin: 2024/06/21 14:46:38 [DEBUG] Copying input data into temporary file so we can read the length
    42	2024/06/21 14:46:38 packer-plugin-amazon_v1.3.3-dev_x5.0_darwin_arm64 plugin: 2024/06/21 14:46:38 [DEBUG] scp: Uploading script_5175.sh: perms=C0644 size=18451
    43	2024/06/21 14:46:38 packer-plugin-amazon_v1.3.3-dev_x5.0_darwin_arm64 plugin: 2024/06/21 14:46:38 [DEBUG] SCP session complete, closing stdin pipe.
    44	2024/06/21 14:46:38 packer-plugin-amazon_v1.3.3-dev_x5.0_darwin_arm64 plugin: 2024/06/21 14:46:38 [DEBUG] Waiting for SSH session to complete.
    45	2024/06/21 14:46:38 packer-plugin-amazon_v1.3.3-dev_x5.0_darwin_arm64 plugin: 2024/06/21 14:46:38 [DEBUG] scp stderr (length 72): Sink: C0644 18451 script_5175.sh
    46	2024/06/21 14:46:38 packer-plugin-amazon_v1.3.3-dev_x5.0_darwin_arm64 plugin: scp: debug1: fd 0 clearing O_NONBLOCK
    47	2024/06/21 14:46:38 packer-plugin-amazon_v1.3.3-dev_x5.0_darwin_arm64 plugin: 2024/06/21 14:46:38 [DEBUG] Opening new ssh session
    48	2024/06/21 14:46:38 packer-plugin-amazon_v1.3.3-dev_x5.0_darwin_arm64 plugin: 2024/06/21 14:46:38 [DEBUG] starting remote command: chmod 0755 /tmp/script_5175.sh
    49	2024/06/21 14:46:38 packer-plugin-amazon_v1.3.3-dev_x5.0_darwin_arm64 plugin: 2024/06/21 14:46:38 [INFO] RPC endpoint: Communicator ended with: 0
    50	2024/06/21 14:46:38 [INFO] RPC client: Communicator ended with: 0
    51	2024/06/21 14:46:38 [INFO] RPC endpoint: Communicator ended with: 0
    52	2024/06/21 14:46:38 packer-provisioner-shell plugin: [INFO] RPC client: Communicator ended with: 0
    53	2024/06/21 14:46:38 packer-plugin-amazon_v1.3.3-dev_x5.0_darwin_arm64 plugin: 2024/06/21 14:46:38 [DEBUG] Opening new ssh session
    54	2024/06/21 14:46:38 packer-plugin-amazon_v1.3.3-dev_x5.0_darwin_arm64 plugin: 2024/06/21 14:46:38 [DEBUG] starting remote command: sudo /tmp/script_5175.sh setup2
    55	2024/06/21 14:46:38 ui: 2024-06-21T14:46:38-04:00:     amazon-ebs.builder: + uptime

The important line to notice is line 17, where the AWS session-manager-plugin disconnects.

Now here's the same configuration failing:
     1	2024/06/21 15:21:00 ui: 2024-06-21T15:21:00-04:00:     amazon-ebs.builder: + shutdown -r +1s
     2	2024/06/21 15:21:00 ui: 2024-06-21T15:21:00-04:00:     amazon-ebs.builder: shutdown: [pid 1005]
     3	2024/06/21 15:21:00 ui: 2024-06-21T15:21:00-04:00:     amazon-ebs.builder: Shutdown at Fri Jun 21 19:21:01 2024.
     4	2024/06/21 15:21:00 ui: 2024-06-21T15:21:00-04:00:     amazon-ebs.builder: shutdown: can't detach from console
     5	2024/06/21 15:21:01 packer-plugin-amazon_v1.3.3-dev_x5.0_darwin_arm64 plugin: 2024/06/21 15:21:01 [INFO] RPC endpoint: Communicator ended with: 0
     6	2024/06/21 15:21:01 [INFO] 0 bytes written for 'stderr'
     7	2024/06/21 15:21:01 [INFO] 1639 bytes written for 'stdout'
     8	2024/06/21 15:21:01 [INFO] RPC client: Communicator ended with: 0
     9	2024/06/21 15:21:01 [INFO] RPC endpoint: Communicator ended with: 0
    10	2024/06/21 15:21:01 packer-provisioner-shell plugin: [INFO] 1639 bytes written for 'stdout'
    11	2024/06/21 15:21:01 packer-provisioner-shell plugin: [INFO] 0 bytes written for 'stderr'
    12	2024/06/21 15:21:01 packer-provisioner-shell plugin: [INFO] RPC client: Communicator ended with: 0
    13	2024/06/21 15:21:01 ui: 2024-06-21T15:21:01-04:00:     amazon-ebs.builder: Shutdown at Fri Jun 21 19:21:01 2024.
    14	2024/06/21 15:21:01 ui: 2024-06-21T15:21:01-04:00:     amazon-ebs.builder:
    15	2024/06/21 15:21:01 ui: 2024-06-21T15:21:01-04:00:     amazon-ebs.builder: System shutdown time has arrived
    16	2024/06/21 15:21:01 ui: 2024-06-21T15:21:01-04:00: ==> amazon-ebs.builder: Pausing 1m0s after this provisioner...
    17	2024/06/21 15:22:01 [INFO] (telemetry) ending shell
    18	2024/06/21 15:22:01 [INFO] (telemetry) Starting provisioner shell
    19	2024/06/21 15:22:01 ui: 2024-06-21T15:22:01-04:00: ==> amazon-ebs.builder: Provisioning with shell script: ./stage2.setup_ami.sh
    20	2024/06/21 15:22:01 packer-provisioner-shell plugin: Opening ./stage2.setup_ami.sh for reading
    21	2024/06/21 15:22:01 packer-plugin-amazon_v1.3.3-dev_x5.0_darwin_arm64 plugin: 2024/06/21 15:22:01 [DEBUG] Opening new ssh session
    22	2024/06/21 15:22:01 packer-plugin-amazon_v1.3.3-dev_x5.0_darwin_arm64 plugin: 2024/06/21 15:22:01 [ERROR] ssh session open error: 'EOF', attempting reconnect
    23	2024/06/21 15:22:01 packer-plugin-amazon_v1.3.3-dev_x5.0_darwin_arm64 plugin: 2024/06/21 15:22:01 [DEBUG] reconnecting to TCP connection for SSH
    24	2024/06/21 15:22:01 packer-provisioner-shell plugin: [INFO] 18451 bytes written for 'uploadData'
    25	2024/06/21 15:22:01 [INFO] 18451 bytes written for 'uploadData'
    26	2024/06/21 15:22:01 packer-plugin-amazon_v1.3.3-dev_x5.0_darwin_arm64 plugin: 2024/06/21 15:22:01 [DEBUG] handshaking with SSH
    27	2024/06/21 15:23:01 packer-provisioner-shell plugin: Retryable error: Error uploading script: Timeout during SSH handshake
    28	2024/06/21 15:23:03 packer-plugin-amazon_v1.3.3-dev_x5.0_darwin_arm64 plugin: 2024/06/21 15:23:03 [DEBUG] Opening new ssh session
    29	2024/06/21 15:23:03 packer-plugin-amazon_v1.3.3-dev_x5.0_darwin_arm64 plugin: 2024/06/21 15:23:03 [ERROR] ssh session open error: 'client not available', attempting reconnect
    30	2024/06/21 15:23:03 packer-plugin-amazon_v1.3.3-dev_x5.0_darwin_arm64 plugin: 2024/06/21 15:23:03 [DEBUG] reconnecting to TCP connection for SSH
    31	2024/06/21 15:23:03 packer-plugin-amazon_v1.3.3-dev_x5.0_darwin_arm64 plugin: 2024/06/21 15:23:03 [DEBUG] handshaking with SSH
    32	2024/06/21 15:23:03 packer-provisioner-shell plugin: [INFO] 18451 bytes written for 'uploadData'
    33	2024/06/21 15:23:03 [INFO] 18451 bytes written for 'uploadData'
    34	2024/06/21 15:24:03 packer-provisioner-shell plugin: Retryable error: Error uploading script: Timeout during SSH handshake
    35	2024/06/21 15:24:05 packer-plugin-amazon_v1.3.3-dev_x5.0_darwin_arm64 plugin: 2024/06/21 15:24:05 [DEBUG] Opening new ssh session
    36	2024/06/21 15:24:05 packer-plugin-amazon_v1.3.3-dev_x5.0_darwin_arm64 plugin: 2024/06/21 15:24:05 [ERROR] ssh session open error: 'client not available', attempting reconnect
    37	2024/06/21 15:24:05 packer-plugin-amazon_v1.3.3-dev_x5.0_darwin_arm64 plugin: 2024/06/21 15:24:05 [DEBUG] reconnecting to TCP connection for SSH
    38	2024/06/21 15:24:05 packer-plugin-amazon_v1.3.3-dev_x5.0_darwin_arm64 plugin: 2024/06/21 15:24:05 [DEBUG] handshaking with SSH
    39	2024/06/21 15:24:05 packer-provisioner-shell plugin: [INFO] 18451 bytes written for 'uploadData'
    40	2024/06/21 15:24:05 [INFO] 18451 bytes written for 'uploadData'
    41	2024/06/21 15:25:05 packer-provisioner-shell plugin: Retryable error: Error uploading script: Timeout during SSH handshake
    42	2024/06/21 15:25:07 packer-plugin-amazon_v1.3.3-dev_x5.0_darwin_arm64 plugin: 2024/06/21 15:25:07 [DEBUG] Opening new ssh session
    43	2024/06/21 15:25:07 packer-plugin-amazon_v1.3.3-dev_x5.0_darwin_arm64 plugin: 2024/06/21 15:25:07 [ERROR] ssh session open error: 'client not available', attempting reconnect
    44	2024/06/21 15:25:07 packer-plugin-amazon_v1.3.3-dev_x5.0_darwin_arm64 plugin: 2024/06/21 15:25:07 [DEBUG] reconnecting to TCP connection for SSH
    45	2024/06/21 15:25:07 packer-provisioner-shell plugin: [INFO] 18451 bytes written for 'uploadData'
    46	2024/06/21 15:25:07 [INFO] 18451 bytes written for 'uploadData'
    47	2024/06/21 15:25:07 packer-plugin-amazon_v1.3.3-dev_x5.0_darwin_arm64 plugin: 2024/06/21 15:25:07 [DEBUG] handshaking with SSH
    48	2024/06/21 15:26:07 packer-provisioner-shell plugin: Retryable error: Error uploading script: Timeout during SSH handshake
    49	2024/06/21 15:26:09 packer-plugin-amazon_v1.3.3-dev_x5.0_darwin_arm64 plugin: 2024/06/21 15:26:09 [DEBUG] Opening new ssh session
    50	2024/06/21 15:26:09 packer-plugin-amazon_v1.3.3-dev_x5.0_darwin_arm64 plugin: 2024/06/21 15:26:09 [ERROR] ssh session open error: 'client not available', attempting reconnect
    51	2024/06/21 15:26:09 packer-plugin-amazon_v1.3.3-dev_x5.0_darwin_arm64 plugin: 2024/06/21 15:26:09 [DEBUG] reconnecting to TCP connection for SSH
    52	2024/06/21 15:26:09 packer-provisioner-shell plugin: [INFO] 18451 bytes written for 'uploadData'
    53	2024/06/21 15:26:09 packer-plugin-amazon_v1.3.3-dev_x5.0_darwin_arm64 plugin: 2024/06/21 15:26:09 [DEBUG] handshaking with SSH
    54	2024/06/21 15:26:09 [INFO] 18451 bytes written for 'uploadData'
    55	2024/06/21 15:27:09 packer-provisioner-shell plugin: Retryable error: Error uploading script: Timeout during SSH handshake
    56	2024/06/21 15:27:09 [INFO] (telemetry) ending shell
    57	2024/06/21 15:27:09 ui error: 2024-06-21T15:27:09-04:00: ==> amazon-ebs.builder: Error uploading script: Timeout during SSH handshake
    58	2024/06/21 15:27:09 ui: 2024-06-21T15:27:09-04:00: ==> amazon-ebs.builder: Step "StepProvision" failed
    59	2024/06/21 15:27:09 ui: ask: ==> amazon-ebs.builder: [c] Clean up and exit, [a] abort without cleanup, or [r] retry step (build may fail even if retry succeeds)?
    60	2024/06/21 15:31:33 ui: 2024-06-21T15:31:33-04:00: ==> amazon-ebs.builder: Provisioning step had errors: Running the cleanup provisioner, if present...
    61	2024/06/21 15:31:33 ui: 2024-06-21T15:31:33-04:00: ==> amazon-ebs.builder: Terminating the source AWS instance...
    62	2024/06/21 15:31:33 packer-plugin-amazon_v1.3.3-dev_x5.0_darwin_arm64 plugin: 2024/06/21 15:31:33 ssm: Terminating PortForwarding session "user.XXXXXXXX"
    63	2024/06/21 15:40:24 ui: 2024-06-21T15:40:24-04:00: ==> amazon-ebs.builder: Cleaning up any extra volumes...
    64	2024/06/21 15:40:25 ui: 2024-06-21T15:40:25-04:00: ==> amazon-ebs.builder: No volumes to clean up, skipping
    65	2024/06/21 15:40:25 ui: 2024-06-21T15:40:25-04:00: ==> amazon-ebs.builder: Deleting temporary security group...
    66	2024/06/21 15:40:26 [INFO] (telemetry) ending amazon-ebs.builder
    67	2024/06/21 15:40:26 ui error: 2024-06-21T15:40:26-04:00: Build 'amazon-ebs.builder' errored after 28 minutes 18 seconds: Error uploading script: Timeout during SSH handshake
    68	2024/06/21 15:40:26 ui:
    69	==> Wait completed after 28 minutes 18 seconds
    70	2024/06/21 15:40:26 machine readable: error-count []string{"1"}
    71	2024/06/21 15:40:26 ui error:
    72	==> Some builds didn't complete successfully and had errors:
    73	2024/06/21 15:40:26 machine readable: amazon-ebs.builder,error []string{"Error uploading script: Timeout during SSH handshake"}
    74	2024/06/21 15:40:26 ui error: --> amazon-ebs.builder: Error uploading script: Timeout during SSH handshake
    75	2024/06/21 15:40:26 ui:
    76	==> Builds finished but no artifacts were created.

I was using the following settings both times:

  ssh_read_write_timeout = "3m"

  provisioner "shell" {
    script            = "./stage2.setup_ami.sh"
    execute_command   = "sudo {{ .Path }} reboot"
    expect_disconnect = true
    skip_clean        = true
    pause_after       = "1m"
  }

  provisioner "shell" {
    script            = "./stage2.setup_ami.sh"
    execute_command   = "sudo {{ .Path }} setup2"
    max_retries       = 5
  }

I think what we need is perhaps a new option, force_disconnect, instead of expect_disconnect, at least when using the AWS session-manager-plugin.

Edit to add: This works reliably for me with the aws session-manager-plugin:

  provisioner "shell" {
    script            = "./stage2.setup_ami.sh"
    execute_command   = "sudo {{ .Path }} reboot"
    expect_disconnect = true
    skip_clean        = true
  }

  # Force kill the session-manager-plugin since it doesn't always notice the
  # remote end going away. Packer will restart it. This seems to be the only
  # reliable way to handle reboots.
  provisioner "shell-local" {
    inline = ["pkill -g 0 session-manager-plugin"]
  }

  provisioner "shell" {
    pause_before = "10s"
    inline       = ["uptime"]
    max_retries  = 10
  }
Output looks like:
2024-06-21T17:18:44-04:00:     amazon-ebs.builder: shutdown: [pid 1007]
2024-06-21T17:18:44-04:00:     amazon-ebs.builder: Shutdown at Fri Jun 21 21:18:45 2024.
2024-06-21T17:18:44-04:00:     amazon-ebs.builder: shutdown: can't detach from console
2024-06-21T17:18:45-04:00:     amazon-ebs.builder: Shutdown at Fri Jun 21 21:18:45 2024.
2024-06-21T17:18:45-04:00:     amazon-ebs.builder:
2024-06-21T17:18:45-04:00:     amazon-ebs.builder: System shutdown time has arrived
2024-06-21T17:18:46-04:00: ==> amazon-ebs.builder: Running local shell script: /tmp/packer-shell1139810214
2024-06-21T17:18:46-04:00: ==> amazon-ebs.builder: Pausing 10s before the next provisioner...
2024-06-21T17:18:46-04:00: ==> amazon-ebs.builder: Bad exit status: -1
2024-06-21T17:18:56-04:00: ==> amazon-ebs.builder: Provisioning with shell script: /tmp/packer-shell2636857501
2024-06-21T17:19:42-04:00:     amazon-ebs.builder: Starting portForwarding session "user.XXXXXXXX".
2024-06-21T17:19:42-04:00:     amazon-ebs.builder: Starting session with SessionId: user.XXXXXXXX
2024-06-21T17:19:46-04:00:     amazon-ebs.builder: Port 8116 opened for sessionId user.XXXXXXXX.
2024-06-21T17:19:46-04:00:     amazon-ebs.builder: Waiting for connections...
2024-06-21T17:19:48-04:00:     amazon-ebs.builder: Connection accepted for session [user.XXXXXXXX]
2024-06-21T17:19:52-04:00:     amazon-ebs.builder: 21:19  up 48 secs, 1 user, load averages: 5.13 1.37 0.51

That's obviously a hack. The amazon plugin specifically needs better ssm handling, but generally, what I think is needed is a way to tell packer that the remote machine is positively going away between steps and it should do whatever it has to to drop and re-establish the connection.
