Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: Ensure SSH connections are closed after each command execution #859

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

jfanals
Copy link

@jfanals jfanals commented Jun 24, 2024

This commit addresses an issue where SSH connections were not being properly closed after each command execution, leading to timeout errors during the kamal deploy process.

Problem

When executing commands using the execute method in SSHKit::Runner::Parallel, SSH connections were left open, causing subsequent commands to fail with timeout errors. This issue was particularly evident when running kamal build deliver after kamal registry login during the normal kamal deploy process

Solution

The execute method in the SSHKit::Runner::Parallel::CompleteAll module has been modified to ensure that all SSH connections are closed after each command execution. This is achieved by adding an ensure block to the thread creation logic, which calls SSHKit::Backend::Netssh.pool.close_connections after each command, regardless of whether an exception occurs.

Changes

  • Added an ensure block to the execute method in SSHKit::Runner::Parallel::CompleteAll to close SSH connections after each command execution.

Impact

This change ensures that SSH connections are properly closed, preventing timeout errors and improving the reliability of the kamal deploy process.

Testing

Tested the changes by running kamal deploy commands. Verified that the timeout errors no longer occur and the deployment process completes successfully.

Closes #857

This commit addresses an issue where SSH connections were not being properly closed after each command execution, leading to timeout errors during the `kamal deploy` process.
@jfanals
Copy link
Author

jfanals commented Jun 25, 2024

For some strange reason the test on Ruby 3.1 failed, it looks like it might have been just a glitch in the test procedures as the code should not affect only one particular version of ruby but all versions.

        ensure
          SSHKit::Backend::Netssh.pool.close_connections

Would it be possible to run the tests again?

@djmb
Copy link
Collaborator

djmb commented Jun 25, 2024

I've kicked that off. I'm not sure though about this change. Creating the SSH connections can be expensive especially with large numbers of servers to deploy to.

We configure keepalives with an interval of 30 seconds on the connections, so that generally should stop them from timing out. Maybe you could try reducing the keepalive interval and see if that makes any difference?

@jfanals
Copy link
Author

jfanals commented Jun 25, 2024

Thanks for the suggestion, I did try to decrease the .ssh/config ServerAliveInterval to 10, unfortunately that did not fix the issue.

@ruyrocha
Copy link

👍 on this one, as it's pretty much similar to delano/rye#63 I faced a while ago, and I think delano/rye#38 has more context on this

cc @djmb @jfanals

@djmb
Copy link
Collaborator

djmb commented Jul 16, 2024

@jfanals - we overwrite the keepalive_interval in the Kamal config, so I'm not sure it will pick up that from .ssh/config.

Could you try:

ssh:
  keepalive_interval: 10
  log_level: debug

Setting the log_level to debug might also give us some useful feedback.

@plattenschieber
Copy link

@jfanals - we overwrite the keepalive_interval in the Kamal config, so I'm not sure it will pick up that from .ssh/config.

Could you try:

ssh:
  keepalive_interval: 10
  log_level: debug

Setting the log_level to debug might also give us some useful feedback.

I had a similar problem, which brought me to this issue. Unfortunately setting the keepalive_interval and keepalive parameters is not possible from the configuration file, as they are hardcoded.

❯ bin/kamal config 
  ERROR (Kamal::ConfigurationError): ssh: unknown key: keepalive_interval

The fix for me was to increase the ufw imposed limit of 6 per 30 second.

@jericopulvera
Copy link

jericopulvera commented Oct 14, 2024

@jfanals - we overwrite the keepalive_interval in the Kamal config, so I'm not sure it will pick up that from .ssh/config.
Could you try:

ssh:
  keepalive_interval: 10
  log_level: debug

Setting the log_level to debug might also give us some useful feedback.

I had a similar problem, which brought me to this issue. Unfortunately setting the keepalive_interval and keepalive parameters is not possible from the configuration file, as they are hardcoded.

❯ bin/kamal config 
  ERROR (Kamal::ConfigurationError): ssh: unknown key: keepalive_interval

The fix for me was to increase the ufw imposed limit of 6 per 30 second.

@plattenschieber how did you apply your fix? Is it through the server?

@plattenschieber
Copy link

Yes, just sshed into the machine and updated the firewall

@jericopulvera
Copy link

Yes, just sshed into the machine and updated the firewall

Mine does not have ufw active, so it might not be the same issue.

@jericopulvera
Copy link

@jfanals - we overwrite the keepalive_interval in the Kamal config, so I'm not sure it will pick up that from .ssh/config.

Could you try:

ssh:
  keepalive_interval: 10
  log_level: debug

Setting the log_level to debug might also give us some useful feedback.

@jfanals - we overwrite the keepalive_interval in the Kamal config, so I'm not sure it will pick up that from .ssh/config.

Could you try:

ssh:
  keepalive_interval: 10
  log_level: debug

Setting the log_level to debug might also give us some useful feedback.

I tried setting the log_level to debug and keepalive_interval to 10 by manually updating the ssh.rb at /opt/homebrew/lib/ruby/gems/3.3.0/gems/kamal-2.2.2/lib/kamal/configuration, but it's still not working. It looks like when my NextJS app is building for a long time while it streams the logs of the build, the keepalive ping is not running.

 INFO [b2d3a646] Finished in 339.110 seconds with exit status 0 (successful).
D, [2024-10-28T07:17:02.043992 #98262] DEBUG -- socket[71c]: using encrypt-then-mac
D, [2024-10-28T07:17:02.046520 #98262] DEBUG -- socket[71c]: queueing packet nr 17 type 90 len 32
D, [2024-10-28T07:17:02.046768 #98262] DEBUG -- socket[71c]: read 84 bytes
D, [2024-10-28T07:17:02.046836 #98262] DEBUG -- socket[71c]: received packet nr 24 type 82 len 16
I, [2024-10-28T07:17:02.046886 #98262]  INFO -- net.ssh.connection.session[7bc]: global request failure
D, [2024-10-28T07:17:02.046914 #98262] DEBUG -- net.ssh.connection.keepalive[7f8]: keepalive response successful. Missed 0 keepalives
D, [2024-10-28T07:17:02.047055 #98262] DEBUG -- socket[71c]: sent 100 bytes
I, [2024-10-28T07:17:12.054207 #98262]  INFO -- net.ssh.connection.keepalive[7f8]: sending keepalive 0
I, [2024-10-28T07:17:12.054477 #98262]  INFO -- net.ssh.connection.session[7bc]: sending global request [email protected]
D, [2024-10-28T07:17:12.054771 #98262] DEBUG -- socket[71c]: using encrypt-then-mac
D, [2024-10-28T07:17:12.055230 #98262] DEBUG -- socket[71c]: queueing packet nr 18 type 80 len 32
D, [2024-10-28T07:17:12.055601 #98262] DEBUG -- socket[71c]: sent 100 bytes
I, [2024-10-28T07:17:22.059028 #98262]  INFO -- net.ssh.connection.keepalive[7f8]: sending keepalive 1
I, [2024-10-28T07:17:22.059350 #98262]  INFO -- net.ssh.connection.session[7bc]: sending global request [email protected]
D, [2024-10-28T07:17:22.059645 #98262] DEBUG -- socket[71c]: using encrypt-then-mac
D, [2024-10-28T07:17:22.060097 #98262] DEBUG -- socket[71c]: queueing packet nr 19 type 80 len 32
D, [2024-10-28T07:17:22.060466 #98262] DEBUG -- socket[71c]: sent 100 bytes
I, [2024-10-28T07:17:32.064610 #98262]  INFO -- net.ssh.connection.keepalive[7f8]: sending keepalive 2
I, [2024-10-28T07:17:32.064725 #98262]  INFO -- net.ssh.connection.session[7bc]: sending global request [email protected]
D, [2024-10-28T07:17:32.064872 #98262] DEBUG -- socket[71c]: using encrypt-then-mac
D, [2024-10-28T07:17:32.065100 #98262] DEBUG -- socket[71c]: queueing packet nr 20 type 80 len 32
D, [2024-10-28T07:17:32.065300 #98262] DEBUG -- socket[71c]: sent 100 bytes
I, [2024-10-28T07:17:42.069792 #98262]  INFO -- net.ssh.connection.keepalive[7f8]: sending keepalive 3
I, [2024-10-28T07:17:42.070080 #98262]  INFO -- net.ssh.connection.session[7bc]: sending global request [email protected]
D, [2024-10-28T07:17:42.070405 #98262] DEBUG -- socket[71c]: using encrypt-then-mac
D, [2024-10-28T07:17:42.070955 #98262] DEBUG -- socket[71c]: queueing packet nr 21 type 80 len 32
D, [2024-10-28T07:17:42.071385 #98262] DEBUG -- socket[71c]: sent 100 bytes
I, [2024-10-28T07:17:52.075686 #98262]  INFO -- net.ssh.connection.keepalive[7f8]: sending keepalive 4
I, [2024-10-28T07:17:52.075939 #98262]  INFO -- net.ssh.connection.session[7bc]: sending global request [email protected]
D, [2024-10-28T07:17:52.076247 #98262] DEBUG -- socket[71c]: using encrypt-then-mac
D, [2024-10-28T07:17:52.076688 #98262] DEBUG -- socket[71c]: queueing packet nr 22 type 80 len 32
D, [2024-10-28T07:17:52.077079 #98262] DEBUG -- socket[71c]: sent 100 bytes
I, [2024-10-28T07:18:02.078444 #98262]  INFO -- net.ssh.connection.keepalive[7f8]: sending keepalive 5
I, [2024-10-28T07:18:02.078565 #98262]  INFO -- net.ssh.connection.session[7bc]: sending global request [email protected]
D, [2024-10-28T07:18:02.078688 #98262] DEBUG -- socket[71c]: using encrypt-then-mac
D, [2024-10-28T07:18:02.078861 #98262] DEBUG -- socket[71c]: queueing packet nr 23 type 80 len 32
D, [2024-10-28T07:18:02.078970 #98262] DEBUG -- socket[71c]: sent 100 bytes
D, [2024-10-28T07:18:08.075602 #98262] DEBUG -- socket[71c]: read 0 bytes
D, [2024-10-28T07:18:08.076116 #98262] DEBUG -- socket[71c]: read 0 bytes
  Finished all in 412.9 seconds
  ERROR (IO::TimeoutError): Exception while executing on host 54.66.xxx.xx: Blocking operation timed out!

@jericopulvera
Copy link

This PR actually fixed my issue. It might be worth considering merging this.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

SSH Connection Timeouts During Full Deployment
5 participants