Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Decrease _DEFAULT_SAFE_OPEN_INTERVAL (to what?) #6599

Open
wants to merge 3 commits into
base: main
Choose a base branch
from

Conversation

GeigerJ2
Copy link
Contributor

@GeigerJ2 GeigerJ2 commented Nov 4, 2024

A possible quick fix to not spend more of our time (at least for now...) on:

While this is quite a minor change, as it has rather far-reaching consequences, we should bikeshed this at least for a week or so :D
Any thoughts on the duration, @sphuber, @mbercx, @unkcpz, @t-reents? @giovannipizzi proposed 3s, but I'm worried this is a bit short, so I put 5s for now. Will add some benchmarks here later on.

Copy link

codecov bot commented Nov 4, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 77.87%. Comparing base (ef60b66) to head (d8d8712).
Report is 127 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #6599      +/-   ##
==========================================
+ Coverage   77.51%   77.87%   +0.37%     
==========================================
  Files         560      567       +7     
  Lines       41444    42121     +677     
==========================================
+ Hits        32120    32797     +677     
  Misses       9324     9324              

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@sphuber
Copy link
Contributor

sphuber commented Nov 4, 2024

Why change this default? If I understand correctly, the main problem is with the SSH plugin, right? The LocalTransport already changes this default as well, since it doesn't really need one. So wouldn't it make more sense to limit the scope and change the default on SshTransport. That being said, the whole point is that this is a safe default, it can be configured by a user to something lower. So why not just keep that? They can simply run verdi computer configure core.ssh -n --safe-interval 5.

@GeigerJ2
Copy link
Contributor Author

GeigerJ2 commented Nov 5, 2024

Yes, good point, it should be changed in the SshTransport instead.

I think most users, especially new ones, will just be running with the default value. Especially if they go through verdi computer setup and then get

Report: Note: before the computer can be used, it has to be configured with the command:
Report:   verdi -p <profile> computer configure core.ssh <computer>

they'll likely just copy-paste this command. Instead, if they run verdi computer configure core.ssh --help they will probably be overwhelmed with the amount of options and overlook it (or not even read through them at all, due to laziness). Then, running short test jobs like the ArithmeticAdd will take longer than they expect, and they might think they did something wrong during their setup.

The question is, is 30s really necessary? And if it's overkill, why not reduce it.

@sphuber
Copy link
Contributor

sphuber commented Nov 5, 2024

The question is, is 30s really necessary? And if it's overkill, why not reduce it.

It is put at that value to prevent users being banned from compute centers for opening too many SSH connections. Admittedly, this is more likely to happen when you are running high-throughput with multiple daemon workers, and I am not sure if that is the main use case anymore for AiiDA users. So we can lower it, with the risk that some users may get warnings or banned. If @giovannipizzi also signed off on changing the default to favoring casual users over protecting heavy high-throughput users by default, then it is fine by me as well.

@giovannipizzi
Copy link
Member

In practice, as also the tests by @khsrali (i think in a different issue?) show, even with a very low value (like 0.1s, that I would not recommend) the number of open connections remains every low. This is because while a connection is not being opened, other tasks pile up and then reuse the same connection (@khsrali can you point to your tests?). For me 5s (or maybe even 3s) is OK, in any case a supercomputer center should not ban you if you connect every 5 seconds every now and then (e.g I could do it myself opening a few ssh shells). If I'm running mid throughput, requests will pile up and reuse the connection. If I'm really running high thoughput, probably I'll need to tweak a few options anyway. So I'd probably just check if 3s is too much once more and then change the default, and update the doc page on how to run high throughput, mentioning what this value is and that it might need to be modified to avoid being banned

@GeigerJ2
Copy link
Contributor Author

GeigerJ2 commented Nov 5, 2024

The benchmarks by @khsrali can be found here: #6544 (comment)

@@ -64,6 +64,9 @@ def convert_to_bool(string):
class SshTransport(Transport):
"""Support connection, command execution and data transfer to remote computers via SSH+SFTP."""

# Reduced from the 30s set in the Transport abstract base class
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
# Reduced from the 30s set in the Transport abstract base class
# Reduced from the 5s set in the Transport abstract base class

@khsrali
Copy link
Contributor

khsrali commented Dec 18, 2024

@GeigerJ2 , do you have a moment to apply the changes?
Thanks a lot!

@GeigerJ2
Copy link
Contributor Author

Hi @khsrali, thanks for the ping here! Before we actually merge this change, I'd like to do some proper benchmarking (e.g., with Thor). That's why this has been on hold for a while now. Maybe we can do it together after the holidays? I remember you already did some benchmarks in the past 😉

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants