Add comments for ~/.ssh/sky-cluster-key (#41)

Tested (run the relevant ones):

- [ ] Code formatting: `bash format.sh`
- [ ] Any manual or new tests for this PR (please specify below)
- [ ] All smoke tests: `pytest tests/test_smoke.py`
- [ ] Relevant individual smoke tests: `pytest tests/test_smoke.py::test_fill_in_the_name`
- [ ] Backward compatibility tests: `conda deactivate; bash -i tests/backward_compatibility_tests.sh`
yika-luo authored Dec 3, 2024
1 parent 497c240 commit aae6ae5
Showing 13 changed files with 13 additions and 14 deletions.
3 changes: 1 addition & 2 deletions sky/templates/aws-ray.yml.j2
@@ -165,7 +165,7 @@ file_mounts: {
# Increment the following for catching performance bugs easier:
# current num items (num SSH connections): 1
setup_commands:
# Create ~/.ssh/config file in case the file does not exist in the custom image.
# Add ~/.ssh/sky-cluster-key to SSH config to allow nodes within a cluster to connect to each other
# We set auto_activate_base to be false for pre-installed conda.
# This also kills the service that is holding the lock on dpkg (problem only exists on aws/azure, not gcp)
# Line "conda config --remove channels": remove the default channel set in the default AWS image as it cannot be accessed.
@@ -174,7 +174,6 @@ setup_commands:
# Line 'sudo grep ..': set the number of threads per process to unlimited to avoid ray job submit stucking issue when the number of running ray jobs increase.
# Line 'mkdir -p ..': disable host key check
# Line 'python3 -c ..': patch the buggy ray files and enable `-o allow_other` option for `goofys`
# Line 'mkdir -p ~/.ssh ...': adding the key in the ssh config to allow interconnection for nodes in the cluster
- mkdir -p ~/.ssh; touch ~/.ssh/config;
{%- for initial_setup_command in initial_setup_commands %}
{{ initial_setup_command }}
2 changes: 1 addition & 1 deletion sky/templates/azure-ray.yml.j2
@@ -109,7 +109,7 @@ file_mounts: {
# Increment the following for catching performance bugs easier:
# current num items (num SSH connections): 1
setup_commands:
# Create ~/.ssh/config file in case the file does not exist in the image.
# Add ~/.ssh/sky-cluster-key to SSH config to allow nodes within a cluster to connect to each other
# Line 'sudo bash ..': set the ulimit as suggested by ray docs for performance. https://docs.ray.io/en/latest/cluster/vms/user-guides/large-cluster-best-practices.html#system-configuration
# Line 'sudo grep ..': set the number of threads per process to unlimited to avoid ray job submit stucking issue when the number of running ray jobs increase.
# Line 'sudo systemctl stop jupyter ..': stop jupyter service to avoid port conflict on 8888
2 changes: 1 addition & 1 deletion sky/templates/cudo-ray.yml.j2
@@ -49,7 +49,7 @@ initialization_commands: [ ]
# current num items (num SSH connections): 1
setup_commands:
# Disable `unattended-upgrades` to prevent apt-get from hanging. It should be called at the beginning before the process started to avoid being blocked. (This is a temporary fix.)
# Create ~/.ssh/config file in case the file does not exist in the image.
# Add ~/.ssh/sky-cluster-key to SSH config to allow nodes within a cluster to connect to each other
# Line 'rm ..': there is another installation of pip.
# Line 'sudo bash ..': set the ulimit as suggested by ray docs for performance. https://docs.ray.io/en/latest/cluster/vms/user-guides/large-cluster-best-practices.html#system-configuration
# Line 'sudo grep ..': set the number of threads per process to unlimited to avoid ray job submit stucking issue when the number of running ray jobs increase.
2 changes: 1 addition & 1 deletion sky/templates/fluidstack-ray.yml.j2
@@ -50,7 +50,7 @@ initialization_commands: []
# current num items (num SSH connections): 1
setup_commands:
# Disable `unattended-upgrades` to prevent apt-get from hanging. It should be called at the beginning before the process started to avoid being blocked. (This is a temporary fix.)
# Create ~/.ssh/config file in case the file does not exist in the image.
# Add ~/.ssh/sky-cluster-key to SSH config to allow nodes within a cluster to connect to each other
# Line 'rm ..': there is another installation of pip.
# Line 'sudo bash ..': set the ulimit as suggested by ray docs for performance. https://docs.ray.io/en/latest/cluster/vms/user-guides/large-cluster-best-practices.html#system-configuration
# Line 'sudo grep ..': set the number of threads per process to unlimited to avoid ray job submit stucking issue when the number of running ray jobs increase.
2 changes: 1 addition & 1 deletion sky/templates/gcp-ray.yml.j2
@@ -181,8 +181,8 @@ file_mounts: {
# Increment the following for catching performance bugs easier:
# current num items (num SSH connections): 1 (+1 if tpu_vm)
setup_commands:
# Add ~/.ssh/sky-cluster-key to SSH config to allow nodes within a cluster to connect to each other
# Disable `unattended-upgrades` to prevent apt-get from hanging. It should be called at the beginning before the process started to avoid being blocked. (This is a temporary fix.)
# Line 'mkdir -p ..': Create ~/.ssh/config file in case the file does not exist in the custom image.
# Line 'which conda ..': some images (TPU VM) do not install conda by
# default. 'source ~/.bashrc' is needed so conda takes effect for the next
# commands.
2 changes: 1 addition & 1 deletion sky/templates/ibm-ray.yml.j2
@@ -85,7 +85,7 @@ initialization_commands: []
# Increment the following for catching performance bugs easier:
# current num items (num SSH connections): 1
setup_commands:
# Create ~/.ssh/config file in case the file does not exist in the custom image.
# Add ~/.ssh/sky-cluster-key to SSH config to allow nodes within a cluster to connect to each other
# We set auto_activate_base to be false for pre-installed conda.
# This also kills the service that is holding the lock on dpkg (problem only exists on aws/azure, not gcp)
# Line 'sudo bash ..': set the ulimit as suggested by ray docs for performance. https://docs.ray.io/en/latest/cluster/vms/user-guides/large-cluster-best-practices.html#system-configuration
2 changes: 1 addition & 1 deletion sky/templates/kubernetes-ray.yml.j2
@@ -436,7 +436,7 @@ available_node_types:

setup_commands:
# Disable `unattended-upgrades` to prevent apt-get from hanging. It should be called at the beginning before the process started to avoid being blocked. (This is a temporary fix.)
# Create ~/.ssh/config file in case the file does not exist in the image.
# Add ~/.ssh/sky-cluster-key to SSH config to allow nodes within a cluster to connect to each other
# Line 'sudo bash ..': set the ulimit as suggested by ray docs for performance. https://docs.ray.io/en/latest/cluster/vms/user-guides/large-cluster-best-practices.html#system-configuration
# Line 'sudo grep ..': set the number of threads per process to unlimited to avoid ray job submit stucking issue when the number of running ray jobs increase.
# Line 'mkdir -p ..': disable host key check
2 changes: 1 addition & 1 deletion sky/templates/lambda-ray.yml.j2
@@ -71,7 +71,7 @@ initialization_commands: []
# current num items (num SSH connections): 1
setup_commands:
# Disable `unattended-upgrades` to prevent apt-get from hanging. It should be called at the beginning before the process started to avoid being blocked. (This is a temporary fix.)
# Create ~/.ssh/config file in case the file does not exist in the image.
# Add ~/.ssh/sky-cluster-key to SSH config to allow nodes within a cluster to connect to each other
# Line 'rm ..': there is another installation of pip.
# Line 'sudo bash ..': set the ulimit as suggested by ray docs for performance. https://docs.ray.io/en/latest/cluster/vms/user-guides/large-cluster-best-practices.html#system-configuration
# Line 'sudo grep ..': set the number of threads per process to unlimited to avoid ray job submit stucking issue when the number of running ray jobs increase.
2 changes: 1 addition & 1 deletion sky/templates/oci-ray.yml.j2
@@ -62,7 +62,7 @@ file_mounts: {
# current num items (num SSH connections): 1
setup_commands:
# Disable `unattended-upgrades` to prevent apt-get from hanging. It should be called at the beginning before the process started to avoid being blocked. (This is a temporary fix.)
# Create ~/.ssh/config file in case the file does not exist in the image.
# Add ~/.ssh/sky-cluster-key to SSH config to allow nodes within a cluster to connect to each other
# Line 'rm ..': there is another installation of pip.
# Line 'sudo bash ..': set the ulimit as suggested by ray docs for performance. https://docs.ray.io/en/latest/cluster/vms/user-guides/large-cluster-best-practices.html#system-configuration
# Line 'sudo grep ..': set the number of threads per process to unlimited to avoid ray job submit stucking issue when the number of running ray jobs increase.
2 changes: 1 addition & 1 deletion sky/templates/paperspace-ray.yml.j2
@@ -68,7 +68,7 @@ initialization_commands: []
# current num items (num SSH connections): 1
setup_commands:
# Disable `unattended-upgrades` to prevent apt-get from hanging. It should be called at the beginning before the process started to avoid being blocked. (This is a temporary fix.)
# Create ~/.ssh/config file in case the file does not exist in the image.
# Add ~/.ssh/sky-cluster-key to SSH config to allow nodes within a cluster to connect to each other
# Line 'rm ..': there is another installation of pip.
# Line 'sudo bash ..': set the ulimit as suggested by ray docs for performance. https://docs.ray.io/en/latest/cluster/vms/user-guides/large-cluster-best-practices.html#system-configuration
# Line 'sudo grep ..': set the number of threads per process to unlimited to avoid ray job submit stucking issue when the number of running ray jobs increase.
2 changes: 1 addition & 1 deletion sky/templates/runpod-ray.yml.j2
@@ -50,7 +50,7 @@ initialization_commands: []
# current num items (num SSH connections): 1
setup_commands:
# Disable `unattended-upgrades` to prevent apt-get from hanging. It should be called at the beginning before the process started to avoid being blocked. (This is a temporary fix.)
# Create ~/.ssh/config file in case the file does not exist in the image.
# Add ~/.ssh/sky-cluster-key to SSH config to allow nodes within a cluster to connect to each other
# Line 'rm ..': there is another installation of pip.
# Line 'sudo bash ..': set the ulimit as suggested by ray docs for performance. https://docs.ray.io/en/latest/cluster/vms/user-guides/large-cluster-best-practices.html#system-configuration
# Line 'sudo grep ..': set the number of threads per process to unlimited to avoid ray job submit stucking issue when the number of running ray jobs increase.
2 changes: 1 addition & 1 deletion sky/templates/scp-ray.yml.j2
@@ -62,8 +62,8 @@ initialization_commands: []
# Increment the following for catching performance bugs easier:
# current num items (num SSH connections): 1
setup_commands:
# Add ~/.ssh/sky-cluster-key to SSH config to allow nodes within a cluster to connect to each other
# Disable `unattended-upgrades` to prevent apt-get from hanging. It should be called at the beginning before the process started to avoid being blocked. (This is a temporary fix.)
# Create ~/.ssh/config file in case the file does not exist in the custom image.
# We set auto_activate_base to be false for pre-installed conda.
# This also kills the service that is holding the lock on dpkg (problem only exists on aws/azure, not gcp)
# Line 'sudo bash ..': set the ulimit as suggested by ray docs for performance. https://docs.ray.io/en/latest/cluster/vms/user-guides/large-cluster-best-practices.html#system-configuration
2 changes: 1 addition & 1 deletion sky/templates/vsphere-ray.yml.j2
@@ -47,7 +47,7 @@ initialization_commands: []
# current num items (num SSH connections): 1
setup_commands:
# Disable `unattended-upgrades` to prevent apt-get from hanging. It should be called at the beginning before the process started to avoid being blocked. (This is a temporary fix.)
# Create ~/.ssh/config file in case the file does not exist in the image.
# Add ~/.ssh/sky-cluster-key to SSH config to allow nodes within a cluster to connect to each other
# Line 'rm ..': there is another installation of pip.
# Line 'sudo bash ..': set the ulimit as suggested by ray docs for performance. https://docs.ray.io/en/latest/cluster/vms/user-guides/large-cluster-best-practices.html#system-configuration
# Line 'sudo grep ..': set the number of threads per process to unlimited to avoid ray job submit stucking issue when the number of running ray jobs increase.
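
For readers unfamiliar with what the new comment refers to, the following is a minimal sketch of a setup step that creates `~/.ssh/config` if it is missing and registers `~/.ssh/sky-cluster-key` so nodes can reach each other. It is not the command the templates actually run: the function name `add_cluster_key_to_ssh_config`, the optional directory argument, the exact `Host *` stanza, and the idempotency check are all assumptions made for illustration.

```shell
# Hypothetical helper (not part of SkyPilot); illustrates the idea only.
# Usage: add_cluster_key_to_ssh_config [ssh_dir]   (defaults to ~/.ssh)
add_cluster_key_to_ssh_config() {
  ssh_dir="${1:-$HOME/.ssh}"
  config="$ssh_dir/config"

  # Some images ship without ~/.ssh or ~/.ssh/config, so create both.
  mkdir -p "$ssh_dir"
  touch "$config"

  # Append the stanza only once so repeated setup runs do not duplicate it.
  if ! grep -qF "IdentityFile ~/.ssh/sky-cluster-key" "$config"; then
    # Assumed stanza: disable host key checking and offer the cluster key.
    printf 'Host *\n  StrictHostKeyChecking no\n  IdentityFile ~/.ssh/sky-cluster-key\n' >> "$config"
  fi
}
```

Calling the function twice against the same directory leaves a single stanza in the config file, which matters because setup commands may be re-run on an existing node.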