Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[GCP] Support private IPs for GCP #2819

Merged
merged 28 commits into from
Dec 1, 2023
Merged

[GCP] Support private IPs for GCP #2819

merged 28 commits into from
Dec 1, 2023

Conversation

Michaelvll
Copy link
Collaborator

@Michaelvll Michaelvll commented Nov 24, 2023

Fixes #2818.

Use Internal IP

For security reason, users may only want to use internal IPs for SkyPilot instances.
To do so, you can use SkyPilot's global config file ~/.sky/config.yaml to specify the gcp.use_internal_ips and gcp.ssh_proxy_command fields (to see the detailed syntax, see :ref:config-yaml):

    gcp:
      use_internal_ips: true
      # VPC with NAT setup, see below
      vpc_name: my-vpc-name
      ssh_proxy_command: ssh -W %h:%p -o StrictHostKeyChecking=no [email protected]      

The gcp.ssh_proxy_command field is optional. If SkyPilot is run on a machine that can directly access the internal IPs of the instances, it can be omitted. Otherwise, it should be set to a command that can be used to proxy SSH connections to the internal IPs of the instances.

Cloud NAT Setup

Instances created with internal IPs only on GCP cannot access public internet by default. To make sure SkyPilot can install the dependencies correctly on the instances,
cloud NAT needs to be setup for the VPC.

Cloud NAT is a regional resource, so it will need to be created in each region that SkyPilot will be used in.
Screenshot 2023-11-29 at 6 24 27 AM

To limit SkyPilot to use some specific regions only, you can specify the gcp.ssh_proxy_command to be a dict mapping from region to the SSH proxy command for that region (see :ref:config-yaml for details):

    gcp:
      use_internal_ips: true
      vpc_name: my-vpc-name
      ssh_proxy_command:
        us-west1: ssh -W %h:%p -o StrictHostKeyChecking=no [email protected]
        us-east1: ssh -W %h:%p -o StrictHostKeyChecking=no [email protected]

If proxy is not needed, but the regions need to be limited, you can set the gcp.ssh_proxy_command to be a dict mapping from region to null:

    gcp:
      use_internal_ips: true
      vpc_name: my-vpc-name
      ssh_proxy_command:
        us-west1: null
        us-east1: null

With the setup above, we should be able to create clusters with internal IPs only using SkyPilot and the ~/.sky/config.yaml above.
sky launch -c test-private --cloud gcp --cpus 2

Tested (run the relevant ones):

  • Code formatting: bash format.sh
  • Any manual or new tests for this PR (please specify below)
    • Creating a cluster in a region with internal IPs only: sky launch -c test-private --cloud gcp --cpus 2
    • sky launch -c test-private --cloud gcp --gpus L4 --num-nodes 2 echo hi
    • sky launch -y -c test-tpu-pod examples/tpu/tpuvm_mnist.yaml --gpus tpu-v2-32 --use-spot --zone europe-west4-a
    • sky spot launch -n test-private --cloud gcp --gpus L4 --num-nodes 2 sleep 1000000 with spot controller on AWS (We have to share the ~/.ssh/sky-key between the local and controller to make this work, as the spot controller needs to be connected with the jump server)
    • sky spot launch -n test-private --cloud gcp --gpus L4 --num-nodes 2 sleep 1000000 with spot controller on GCP
    • sky launch -c test-no-proxy --cloud gcp --cpus 2 echo hi
      gcp:
          vpc_name: skypilot-vpc
          use_internal_ips: true
  • All smoke tests: pytest tests/test_smoke.py --gcp (except for the docker and serve (no public access) related tests)
  • Relevant individual smoke tests: pytest tests/test_smoke.py::test_fill_in_the_name
  • Backward compatibility tests: bash tests/backward_comaptibility_tests.sh
    • gcp
    • aws

Copy link
Member

@concretevitamin concretevitamin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Awesome @Michaelvll! Finished a pass except for docs/. Some comments.

sky/utils/cluster_yaml_utils.py Outdated Show resolved Hide resolved
sky/utils/cluster_yaml_utils.py Outdated Show resolved Hide resolved
sky/utils/cluster_yaml_utils.py Outdated Show resolved Hide resolved
sky/utils/controller_utils.py Outdated Show resolved Hide resolved
@@ -119,6 +119,19 @@ def set_nested(keys: Iterable[str], value: Any) -> Dict[str, Any]:
return to_return


def overwrite_config_file(config: dict) -> None:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it safe? It feels hard to reason about. Any ways to avoid this?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When the controller is launching a task, it should use the task-specific skypilot_config decided by SKYPILOT_CONFIG env var, and we have to update the proxy command setting in that config.
Added a docstr to for calling this with cautious.

If needed we can save the new config in a new config_file and reset the os.environ['SKYPILOT_CONFIG'] to the new config_file and set _dict to the new config.

I don't think this will make much difference than directly overwrite the original config file. Wdyt?

Copy link
Member

@concretevitamin concretevitamin Nov 28, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. What's the reason for replacing the previous way? I.e., directly write out the properly modified config to a local temp file and upload it. It seems like local client does know where the controller is/is going to be launched?
  2. If there's a strong reason to do it this way, how about renaming this to unsafe_overwrite_config_file_on_controller or something like that?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's the reason for replacing the previous way? I.e., directly write out the properly modified config to a local temp file and upload it. It seems like local client does know where the controller is/is going to be launched?

The previous way was done before the controller is actually UP (failover may happen), which means we don't know which cloud the controller will use when we run sky.launch for the controller. The current way replaces the config on the controller, so it has the latest information about the controller itself.

If there's a strong reason to do it this way, how about renaming this to unsafe_overwrite_config_file_on_controller or something like that?

Yes, renaming it sounds good to me.

sky/backends/cloud_vm_ray_backend.py Outdated Show resolved Hide resolved
sky/backends/cloud_vm_ray_backend.py Outdated Show resolved Hide resolved
sky/backends/cloud_vm_ray_backend.py Outdated Show resolved Hide resolved
sky/backends/cloud_vm_ray_backend.py Show resolved Hide resolved
config_dict = skypilot_config.set_nested(proxy_command_key,
ssh_proxy_command)

skypilot_config.overwrite_config_file(config_dict)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See comment in skypilot_config.

Copy link
Member

@concretevitamin concretevitamin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, a question about overwriting the config file. Other parts LGTM

sky/backends/cloud_vm_ray_backend.py Outdated Show resolved Hide resolved
sky/backends/cloud_vm_ray_backend.py Show resolved Hide resolved
sky/backends/cloud_vm_ray_backend.py Show resolved Hide resolved
@@ -119,6 +119,19 @@ def set_nested(keys: Iterable[str], value: Any) -> Dict[str, Any]:
return to_return


def overwrite_config_file(config: dict) -> None:
Copy link
Member

@concretevitamin concretevitamin Nov 28, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. What's the reason for replacing the previous way? I.e., directly write out the properly modified config to a local temp file and upload it. It seems like local client does know where the controller is/is going to be launched?
  2. If there's a strong reason to do it this way, how about renaming this to unsafe_overwrite_config_file_on_controller or something like that?

docs/source/reference/config.rst Outdated Show resolved Hide resolved
docs/source/cloud-setup/cloud-permissions/gcp.rst Outdated Show resolved Hide resolved
docs/source/cloud-setup/cloud-permissions/gcp.rst Outdated Show resolved Hide resolved
@Michaelvll
Copy link
Collaborator Author

Michaelvll commented Nov 29, 2023

According to the request, I refactored the way we reset the config file being uploaded to the controller, so we don't have to overwrite the config file on the remote cluster. @concretevitamin PTAL. : )

Tested:

  • AWS spot controller, and check the config file uploaded to the controller:
    spot:
      controller:
        resources:
          cloud: aws
    
    gcp:
      use_internal_ips: true
      vpc_name: skypilot-vpc
      ssh_proxy_command: 
        us-west1: ssh -W %h:%p -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o IdentitiesOnly=yes [email protected] -i ~/.ssh/sky-key
    
    aws:
      use_internal_ips: true
      vpc_name: skypilot-test-vpc
      ssh_proxy_command:
        us-east-1: ssh -W %h:%p -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o IdentitiesOnly=yes -i ~/.ssh/sky-key [email protected]
    • Launch spot job on GCP: sky spot launch --cloud gcp
    • Launch spot job on AWS: sky spot launch --cloud aws
  • GCP spot controller, and check the config file uploaded to the controller:
    spot:
      controller:
        resources:
          cloud: gcp
    
    gcp:
      use_internal_ips: true
      vpc_name: skypilot-vpc
      ssh_proxy_command: 
        us-west1: ssh -W %h:%p -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o IdentitiesOnly=yes [email protected] -i ~/.ssh/sky-key
    
    aws:
      use_internal_ips: true
      vpc_name: skypilot-test-vpc
      ssh_proxy_command:
        us-east-1: ssh -W %h:%p -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o IdentitiesOnly=yes -i ~/.ssh/sky-key [email protected]
  • pytest tests/test_smoke.py --gcp (except docker and test_gcp_http_server_with_custom_ports as they don't support ssh_proxy_command)

@@ -3215,6 +3218,8 @@ def _sync_file_mounts(
storage_mounts: Optional[Dict[Path, storage_lib.Storage]],
) -> None:
"""Mounts all user files to the remote nodes."""
controller_utils.replace_skypilot_config_path_in_file_mounts(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It doesn't look quite clean to do a controller_utils call here.

Can we move this invocation to controller_utils.get_controller_resources(), perhaps renaming it back to controller_utils.skypilot_config_setup()?

This way, we can also remove the hard-coded magic constant skypilot:local_skypilot_config_path. The two controllers j2 files can just take a {{local config path}} like before, and we can pass in a tmp file there.

Copy link
Collaborator Author

@Michaelvll Michaelvll Nov 29, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, when we call control_utils.get_controller_resources(), it is not possible to know where the controller will be launched, i.e. not be able to get the correct skypilot config. We have to call the function after the controller is provisioned, i.e. after the _provision step is finished and the cloud is decided for the controller.

Basically, we have to decide the content of the skypilot config to be uploaded after the controller actually goes through failover and provisioned. I think the current way is clean if we move this replace_skypilot_config_path_in_file_mounts from controller_utils to backend_utils and rename it to replace_predefined_placeholder_in_file_mounts, as it can be used for filling other predefined file mounts values that needs to be decided after the VM is provisioned.
It can also be useful in the future when we want to decide the location of the mounted bucket based on the cloud a normal VM is launched (currently we just default to the first cloud that is enabled).

sky/utils/controller_utils.py Show resolved Hide resolved
os.path.abspath(os.path.expanduser(source))):
if (not os.path.exists(
os.path.abspath(os.path.expanduser(source))) and
not source.startswith('skypilot:')):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ditto



def setup_proxy_command_on_controller():
def _get_skypilot_config_for_controller_task(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

setup_proxy_command_on_controller() seems more accurate?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After discussion, this comment is still relevant? Also, maybe arg rename: cloud -> controller_launched_cloud?

Copy link
Member

@concretevitamin concretevitamin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks @Michaelvll for shipping this!



def setup_proxy_command_on_controller():
def _get_skypilot_config_for_controller_task(
cloud: 'clouds.Cloud') -> Dict[str, Any]:
"""Sets up proxy command on the controller.

This function should be called on the controller (remote cluster), which
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems outdated now.



def setup_proxy_command_on_controller():
def _get_skypilot_config_for_controller_task(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After discussion, this comment is still relevant? Also, maybe arg rename: cloud -> controller_launched_cloud?

@Michaelvll Michaelvll merged commit a1b0bd3 into master Dec 1, 2023
19 checks passed
@Michaelvll Michaelvll deleted the gcp-internal-ips branch December 1, 2023 03:29
remyleone pushed a commit to remyleone/skypilot that referenced this pull request Dec 26, 2023
* Add internal IPs and proxy command

* Fix schemas

* Add config

* vpc_name refactor

* format

* remove remnant

* fix API

* filter by proxy command

* fix region field

* Fix tpu pod

* Fix internal IP fetching for TPU VM

* remove k80 from the resources unordered tests

* Address comments

* add comment

* Add comment

* format

* Update docs/source/cloud-setup/cloud-permissions/gcp.rst

Co-authored-by: Zongheng Yang <[email protected]>

* Address comments

* improve docs

* refactor skypilot_config upload

* fix merge issue

* fix

* Adopt test fixes from skypilot-org#2681

* fix test for quota

* longer timeout

* Address comments

* format

---------

Co-authored-by: Zongheng Yang <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[GCP] Support internal IPs for GCP
2 participants