-
Notifications
You must be signed in to change notification settings - Fork 547
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[GCP] Support private IPs for GCP #2819
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Awesome @Michaelvll! Finished a pass except for docs/. Some comments.
sky/skypilot_config.py
Outdated
@@ -119,6 +119,19 @@ def set_nested(keys: Iterable[str], value: Any) -> Dict[str, Any]: | |||
return to_return | |||
|
|||
|
|||
def overwrite_config_file(config: dict) -> None: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it safe? It feels hard to reason about. Any ways to avoid this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When the controller is launching a task, it should use the task-specific skypilot_config
decided by SKYPILOT_CONFIG
env var, and we have to update the proxy command setting in that config.
Added a docstr to for calling this with cautious.
If needed we can save the new config
in a new config_file and reset the os.environ['SKYPILOT_CONFIG']
to the new config_file and set _dict
to the new config
.
I don't think this will make much difference than directly overwrite the original config file. Wdyt?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- What's the reason for replacing the previous way? I.e., directly write out the properly modified config to a local temp file and upload it. It seems like local client does know where the controller is/is going to be launched?
- If there's a strong reason to do it this way, how about renaming this to
unsafe_overwrite_config_file_on_controller
or something like that?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What's the reason for replacing the previous way? I.e., directly write out the properly modified config to a local temp file and upload it. It seems like local client does know where the controller is/is going to be launched?
The previous way was done before the controller is actually UP (failover may happen), which means we don't know which cloud the controller will use when we run sky.launch
for the controller. The current way replaces the config on the controller, so it has the latest information about the controller itself.
If there's a strong reason to do it this way, how about renaming this to
unsafe_overwrite_config_file_on_controller
or something like that?
Yes, renaming it sounds good to me.
sky/utils/controller_utils.py
Outdated
config_dict = skypilot_config.set_nested(proxy_command_key, | ||
ssh_proxy_command) | ||
|
||
skypilot_config.overwrite_config_file(config_dict) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See comment in skypilot_config.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, a question about overwriting the config file. Other parts LGTM
sky/skypilot_config.py
Outdated
@@ -119,6 +119,19 @@ def set_nested(keys: Iterable[str], value: Any) -> Dict[str, Any]: | |||
return to_return | |||
|
|||
|
|||
def overwrite_config_file(config: dict) -> None: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- What's the reason for replacing the previous way? I.e., directly write out the properly modified config to a local temp file and upload it. It seems like local client does know where the controller is/is going to be launched?
- If there's a strong reason to do it this way, how about renaming this to
unsafe_overwrite_config_file_on_controller
or something like that?
Co-authored-by: Zongheng Yang <[email protected]>
According to the request, I refactored the way we reset the config file being uploaded to the controller, so we don't have to overwrite the config file on the remote cluster. @concretevitamin PTAL. : ) Tested:
|
@@ -3215,6 +3218,8 @@ def _sync_file_mounts( | |||
storage_mounts: Optional[Dict[Path, storage_lib.Storage]], | |||
) -> None: | |||
"""Mounts all user files to the remote nodes.""" | |||
controller_utils.replace_skypilot_config_path_in_file_mounts( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It doesn't look quite clean to do a controller_utils call here.
Can we move this invocation to controller_utils.get_controller_resources(), perhaps renaming it back to controller_utils.skypilot_config_setup()?
This way, we can also remove the hard-coded magic constant skypilot:local_skypilot_config_path
. The two controllers j2 files can just take a {{local config path}} like before, and we can pass in a tmp file there.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, when we call control_utils.get_controller_resources()
, it is not possible to know where the controller will be launched, i.e. not be able to get the correct skypilot config. We have to call the function after the controller is provisioned, i.e. after the _provision
step is finished and the cloud
is decided for the controller.
Basically, we have to decide the content of the skypilot config to be uploaded after the controller actually goes through failover and provisioned. I think the current way is clean if we move this replace_skypilot_config_path_in_file_mounts
from controller_utils
to backend_utils
and rename it to replace_predefined_placeholder_in_file_mounts
, as it can be used for filling other predefined file mounts values that needs to be decided after the VM is provisioned.
It can also be useful in the future when we want to decide the location of the mounted bucket based on the cloud a normal VM is launched (currently we just default to the first cloud that is enabled).
os.path.abspath(os.path.expanduser(source))): | ||
if (not os.path.exists( | ||
os.path.abspath(os.path.expanduser(source))) and | ||
not source.startswith('skypilot:')): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ditto
sky/utils/controller_utils.py
Outdated
|
||
|
||
def setup_proxy_command_on_controller(): | ||
def _get_skypilot_config_for_controller_task( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
setup_proxy_command_on_controller() seems more accurate?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
After discussion, this comment is still relevant? Also, maybe arg rename: cloud -> controller_launched_cloud?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, thanks @Michaelvll for shipping this!
|
||
|
||
def setup_proxy_command_on_controller(): | ||
def _get_skypilot_config_for_controller_task( | ||
cloud: 'clouds.Cloud') -> Dict[str, Any]: | ||
"""Sets up proxy command on the controller. | ||
|
||
This function should be called on the controller (remote cluster), which |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems outdated now.
sky/utils/controller_utils.py
Outdated
|
||
|
||
def setup_proxy_command_on_controller(): | ||
def _get_skypilot_config_for_controller_task( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
After discussion, this comment is still relevant? Also, maybe arg rename: cloud -> controller_launched_cloud?
* Add internal IPs and proxy command * Fix schemas * Add config * vpc_name refactor * format * remove remnant * fix API * filter by proxy command * fix region field * Fix tpu pod * Fix internal IP fetching for TPU VM * remove k80 from the resources unordered tests * Address comments * add comment * Add comment * format * Update docs/source/cloud-setup/cloud-permissions/gcp.rst Co-authored-by: Zongheng Yang <[email protected]> * Address comments * improve docs * refactor skypilot_config upload * fix merge issue * fix * Adopt test fixes from skypilot-org#2681 * fix test for quota * longer timeout * Address comments * format --------- Co-authored-by: Zongheng Yang <[email protected]>
Fixes #2818.
Use Internal IP
For security reason, users may only want to use internal IPs for SkyPilot instances.
To do so, you can use SkyPilot's global config file
~/.sky/config.yaml
to specify thegcp.use_internal_ips
andgcp.ssh_proxy_command
fields (to see the detailed syntax, see :ref:config-yaml
):The
gcp.ssh_proxy_command
field is optional. If SkyPilot is run on a machine that can directly access the internal IPs of the instances, it can be omitted. Otherwise, it should be set to a command that can be used to proxy SSH connections to the internal IPs of the instances.Cloud NAT Setup
Instances created with internal IPs only on GCP cannot access public internet by default. To make sure SkyPilot can install the dependencies correctly on the instances,
cloud NAT needs to be setup for the VPC.
Cloud NAT is a regional resource, so it will need to be created in each region that SkyPilot will be used in.
To limit SkyPilot to use some specific regions only, you can specify the
gcp.ssh_proxy_command
to be a dict mapping from region to the SSH proxy command for that region (see :ref:config-yaml
for details):If proxy is not needed, but the regions need to be limited, you can set the
gcp.ssh_proxy_command
to be a dict mapping from region tonull
:With the setup above, we should be able to create clusters with internal IPs only using SkyPilot and the
~/.sky/config.yaml
above.sky launch -c test-private --cloud gcp --cpus 2
Tested (run the relevant ones):
bash format.sh
sky launch -c test-private --cloud gcp --cpus 2
sky launch -c test-private --cloud gcp --gpus L4 --num-nodes 2 echo hi
sky launch -y -c test-tpu-pod examples/tpu/tpuvm_mnist.yaml --gpus tpu-v2-32 --use-spot --zone europe-west4-a
sky spot launch -n test-private --cloud gcp --gpus L4 --num-nodes 2 sleep 1000000
with spot controller on AWS (We have to share the~/.ssh/sky-key
between the local and controller to make this work, as the spot controller needs to be connected with the jump server)sky spot launch -n test-private --cloud gcp --gpus L4 --num-nodes 2 sleep 1000000
with spot controller on GCPsky launch -c test-no-proxy --cloud gcp --cpus 2 echo hi
pytest tests/test_smoke.py --gcp
(except for the docker and serve (no public access) related tests)pytest tests/test_smoke.py::test_fill_in_the_name
bash tests/backward_comaptibility_tests.sh