Skip to content

Commit

Permalink
[GCP] Support private IPs for GCP (#2819)
Browse files Browse the repository at this point in the history
* Add internal IPs and proxy command

* Fix schemas

* Add config

* vpc_name refactor

* format

* remove remnant

* fix API

* filter by proxy command

* fix region field

* Fix tpu pod

* Fix internal IP fetching for TPU VM

* remove k80 from the resources unordered tests

* Address comments

* add comment

* Add comment

* format

* Update docs/source/cloud-setup/cloud-permissions/gcp.rst

Co-authored-by: Zongheng Yang <[email protected]>

* Address comments

* improve docs

* refactor skypilot_config upload

* fix merge issue

* fix

* Adopt test fixes from #2681

* fix test for quota

* longer timeout

* Address comments

* format

---------

Co-authored-by: Zongheng Yang <[email protected]>
  • Loading branch information
Michaelvll and concretevitamin authored Dec 1, 2023
1 parent 6612301 commit a1b0bd3
Show file tree
Hide file tree
Showing 22 changed files with 381 additions and 238 deletions.
57 changes: 57 additions & 0 deletions docs/source/cloud-setup/cloud-permissions/gcp.rst
Original file line number Diff line number Diff line change
Expand Up @@ -267,3 +267,60 @@ See details in :ref:`config-yaml`. Example use cases include using a private VP
VPC with fine-grained constraints, typically created via Terraform or manually.

The custom VPC should contain the :ref:`required firewall rules <gcp-minimum-firewall-rules>`.


.. _gcp-use-internal-ips:


Using Internal IPs
-----------------------
For security reason, users may only want to use internal IPs for SkyPilot instances.
To do so, you can use SkyPilot's global config file ``~/.sky/config.yaml`` to specify the ``gcp.use_internal_ips`` and ``gcp.ssh_proxy_command`` fields (to see the detailed syntax, see :ref:`config-yaml`):

.. code-block:: yaml
gcp:
use_internal_ips: true
# VPC with NAT setup, see below
vpc_name: my-vpc-name
ssh_proxy_command: ssh -W %h:%p -o StrictHostKeyChecking=no [email protected]
The ``gcp.ssh_proxy_command`` field is optional. If SkyPilot is run on a machine that can directly access the internal IPs of the instances, it can be omitted. Otherwise, it should be set to a command that can be used to proxy SSH connections to the internal IPs of the instances.


Cloud NAT Setup
~~~~~~~~~~~~~~~~

Instances created with internal IPs only on GCP cannot access public internet by default. To make sure SkyPilot can install the dependencies correctly on the instances,
cloud NAT needs to be setup for the VPC (see `GCP's documentation <https://cloud.google.com/nat/docs/overview>`__ for details).


Cloud NAT is a regional resource, so it will need to be created in each region that SkyPilot will be used in.


.. image:: ../../images/screenshots/gcp/cloud-nat.png
:width: 80%
:align: center
:alt: GCP Cloud NAT

To limit SkyPilot to use some specific regions only, you can specify the ``gcp.ssh_proxy_command`` to be a dict mapping from region to the SSH proxy command for that region (see :ref:`config-yaml` for details):

.. code-block:: yaml
gcp:
use_internal_ips: true
vpc_name: my-vpc-name
ssh_proxy_command:
us-west1: ssh -W %h:%p -o StrictHostKeyChecking=no [email protected]
us-east1: ssh -W %h:%p -o StrictHostKeyChecking=no [email protected]
If proxy is not needed, but the regions need to be limited, you can set the ``gcp.ssh_proxy_command`` to be a dict mapping from region to ``null``:

.. code-block:: yaml
gcp:
use_internal_ips: true
vpc_name: my-vpc-name
ssh_proxy_command:
us-west1: null
us-east1: null
Binary file added docs/source/images/screenshots/gcp/cloud-nat.png
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
25 changes: 25 additions & 0 deletions docs/source/reference/config.rst
Original file line number Diff line number Diff line change
Expand Up @@ -121,6 +121,31 @@ Available fields and semantics:
# will be added.
vpc_name: skypilot-vpc
# Should instances be assigned private IPs only? (optional)
#
# Set to true to use private IPs to communicate between the local client and
# any SkyPilot nodes. This requires the networking stack be properly set up.
#
# This flag is typically set together with 'vpc_name' above and
# 'ssh_proxy_command' below.
#
# Default: false.
use_internal_ips: true
# SSH proxy command (optional).
#
# Please refer to the aws.ssh_proxy_command section above for more details.
### Format 1 ###
# A string; the same proxy command is used for all regions.
ssh_proxy_command: ssh -W %h:%p -i ~/.ssh/sky-key -o StrictHostKeyChecking=no gcpuser@<jump server public ip>
### Format 2 ###
# A dict mapping region names to region-specific proxy commands.
# NOTE: This restricts SkyPilot's search space for this cloud to only use
# the specified regions and not any other regions in this cloud.
ssh_proxy_command:
us-central1: ssh -W %h:%p -p 1234 -o StrictHostKeyChecking=no [email protected]
us-west1: ssh -W %h:%p -i ~/.ssh/sky-key -o StrictHostKeyChecking=no gcpuser@<jump server public ip>
# Reserved capacity (optional).
#
# The specific reservation to be considered when provisioning clusters on GCP.
Expand Down
2 changes: 1 addition & 1 deletion examples/job_queue/job.yaml
Original file line number Diff line number Diff line change
Expand Up @@ -17,7 +17,7 @@ setup: |
run: |
timestamp=$(date +%s)
conda env list
for i in {1..120}; do
for i in {1..140}; do
echo "$timestamp $i"
sleep 1
done
20 changes: 9 additions & 11 deletions sky/backends/backend_utils.py
Original file line number Diff line number Diff line change
Expand Up @@ -48,6 +48,7 @@
from sky.utils import common_utils
from sky.utils import controller_utils
from sky.utils import env_options
from sky.utils import remote_cluster_yaml_utils
from sky.utils import rich_utils
from sky.utils import subprocess_utils
from sky.utils import timeline
Expand All @@ -64,7 +65,6 @@

# NOTE: keep in sync with the cluster template 'file_mounts'.
SKY_REMOTE_APP_DIR = '~/.sky/sky_app'
SKY_RAY_YAML_REMOTE_PATH = '~/.sky/sky_ray.yml'
# Exclude subnet mask from IP address regex.
IP_ADDR_REGEX = r'\b\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}(?!/\d{1,2})\b'
SKY_REMOTE_PATH = '~/.sky/wheels'
Expand Down Expand Up @@ -1013,18 +1013,17 @@ def write_cluster_config(
# If the current code is run by controller, propagate the real
# calling user which should've been passed in as the
# SKYPILOT_USER env var (see
# execution.py::_shared_controller_env_vars).
# controller_utils.shared_controller_vars_to_fill().
'user': get_cleaned_username(
os.environ.get(constants.USER_ENV_VAR, '')),

# AWS only:
'aws_vpc_name': skypilot_config.get_nested(('aws', 'vpc_name'),
None),
# Networking configs
'use_internal_ips': skypilot_config.get_nested(
('aws', 'use_internal_ips'), False),
# Not exactly AWS only, but we only test it's supported on AWS
# for now:
(str(cloud).lower(), 'use_internal_ips'), False),
'ssh_proxy_command': ssh_proxy_command,
'vpc_name': skypilot_config.get_nested(
(str(cloud).lower(), 'vpc_name'), None),

# User-supplied instance tags.
'instance_tags': instance_tags,

Expand All @@ -1033,8 +1032,6 @@ def write_cluster_config(
'resource_group': f'{cluster_name}-{region_name}',

# GCP only:
'gcp_vpc_name': skypilot_config.get_nested(('gcp', 'vpc_name'),
None),
'gcp_project_id': gcp_project_id,
'specific_reservations': filtered_specific_reservations,
'num_specific_reserved_workers': num_specific_reserved_workers,
Expand All @@ -1057,7 +1054,8 @@ def write_cluster_config(
'sky_remote_path': SKY_REMOTE_PATH,
'sky_local_path': str(local_wheel_path),
# Add yaml file path to the template variables.
'sky_ray_yaml_remote_path': SKY_RAY_YAML_REMOTE_PATH,
'sky_ray_yaml_remote_path':
remote_cluster_yaml_utils.SKY_CLUSTER_YAML_REMOTE_PATH,
'sky_ray_yaml_local_path':
tmp_yaml_path
if not isinstance(cloud, clouds.Local) else yaml_path,
Expand Down
66 changes: 41 additions & 25 deletions sky/backends/cloud_vm_ray_backend.py
Original file line number Diff line number Diff line change
Expand Up @@ -17,7 +17,7 @@
import threading
import time
import typing
from typing import Dict, Iterable, List, Optional, Set, Tuple, Union
from typing import Any, Dict, Iterable, List, Optional, Set, Tuple, Union

import colorama
import filelock
Expand All @@ -33,7 +33,6 @@
from sky import resources as resources_lib
from sky import serve as serve_lib
from sky import sky_logging
from sky import skypilot_config
from sky import spot as spot_lib
from sky import status_lib
from sky import task as task_lib
Expand Down Expand Up @@ -1460,7 +1459,7 @@ def _retry_zones(
cloud_user_identity: Optional[List[str]],
prev_cluster_status: Optional[status_lib.ClusterStatus],
prev_handle: Optional['CloudVmRayResourceHandle'],
):
) -> Dict[str, Any]:
"""The provision retry loop."""
style = colorama.Style
fore = colorama.Fore
Expand Down Expand Up @@ -2180,7 +2179,7 @@ def provision_with_retries(
to_provision_config: ToProvisionConfig,
dryrun: bool,
stream_logs: bool,
):
) -> Dict[str, Any]:
"""Provision with retries for all launchable resources."""
cluster_name = to_provision_config.cluster_name
to_provision = to_provision_config.resources
Expand Down Expand Up @@ -2218,7 +2217,7 @@ def provision_with_retries(
prev_cluster_status=prev_cluster_status,
prev_handle=prev_handle)
if dryrun:
return
return config_dict
except (exceptions.InvalidClusterNameError,
exceptions.NotSupportedError,
exceptions.CloudUserIdentityError) as e:
Expand Down Expand Up @@ -2343,8 +2342,9 @@ def __init__(self,
self.cluster_name_on_cloud = cluster_name_on_cloud
self._cluster_yaml = cluster_yaml.replace(os.path.expanduser('~'), '~',
1)
# List of (internal_ip, external_ip) tuples for all the nodes
# in the cluster, sorted by the external ips.
# List of (internal_ip, feasible_ip) tuples for all the nodes in the
# cluster, sorted by the feasible ips. The feasible ips can be either
# internal or external ips, depending on the use_internal_ips flag.
self.stable_internal_external_ips = stable_internal_external_ips
self.stable_ssh_ports = stable_ssh_ports
self.launched_nodes = launched_nodes
Expand Down Expand Up @@ -2374,6 +2374,14 @@ def __repr__(self):
def get_cluster_name(self):
return self.cluster_name

def _use_internal_ips(self):
"""Returns whether to use internal IPs for SSH connections."""
# Directly load the `use_internal_ips` flag from the cluster yaml
# instead of `skypilot_config` as the latter can be changed after the
# cluster is UP.
return common_utils.read_yaml(self.cluster_yaml).get(
'provider', {}).get('use_internal_ips', False)

def _maybe_make_local_handle(self):
"""Adds local handle for the local cloud case.
Expand Down Expand Up @@ -2481,40 +2489,43 @@ def is_provided_ips_valid(ips: Optional[List[Optional[str]]]) -> bool:
return (ips is not None and len(ips) == self.num_node_ips and
all(ip is not None for ip in ips))

use_internal_ips = self._use_internal_ips()

# cluster_feasible_ips is the list of IPs of the nodes in the cluster
# which can be used to connect to the cluster. It is a list of external
# IPs if the cluster is assigned public IPs, otherwise it is a list of
# internal IPs.
cluster_feasible_ips: List[str]
if is_provided_ips_valid(external_ips):
logger.debug(f'Using provided external IPs: {external_ips}')
cluster_external_ips = typing.cast(List[str], external_ips)
cluster_feasible_ips = typing.cast(List[str], external_ips)
else:
cluster_external_ips = backend_utils.get_node_ips(
cluster_feasible_ips = backend_utils.get_node_ips(
self.cluster_yaml,
self.launched_nodes,
handle=self,
head_ip_max_attempts=max_attempts,
worker_ip_max_attempts=max_attempts,
get_internal_ips=False)
get_internal_ips=use_internal_ips)

if self.cached_external_ips == cluster_external_ips:
if self.cached_external_ips == cluster_feasible_ips:
logger.debug('Skipping the fetching of internal IPs as the cached '
'external IPs matches the newly fetched ones.')
# Optimization: If the cached external IPs are the same as the
# retrieved external IPs, then we can skip retrieving internal
# retrieved feasible IPs, then we can skip retrieving internal
# IPs since the cached IPs are up-to-date.
return
logger.debug(
'Cached external IPs do not match with the newly fetched ones: '
f'cached ({self.cached_external_ips}), new ({cluster_external_ips})'
f'cached ({self.cached_external_ips}), new ({cluster_feasible_ips})'
)

is_cluster_aws = (self.launched_resources is not None and
isinstance(self.launched_resources.cloud, clouds.AWS))
if is_cluster_aws and skypilot_config.get_nested(
keys=('aws', 'use_internal_ips'), default_value=False):
if use_internal_ips:
# Optimization: if we know use_internal_ips is True (currently
# only exposed for AWS), then our AWS NodeProvider is
# guaranteed to pick subnets that will not assign public IPs,
# thus the first list of IPs returned above are already private
# IPs. So skip the second query.
cluster_internal_ips = list(cluster_external_ips)
# only exposed for AWS and GCP), then our provisioner is guaranteed
# to not assign public IPs, thus the first list of IPs returned
# above are already private IPs. So skip the second query.
cluster_internal_ips = list(cluster_feasible_ips)
elif is_provided_ips_valid(internal_ips):
logger.debug(f'Using provided internal IPs: {internal_ips}')
cluster_internal_ips = typing.cast(List[str], internal_ips)
Expand All @@ -2527,13 +2538,16 @@ def is_provided_ips_valid(ips: Optional[List[Optional[str]]]) -> bool:
worker_ip_max_attempts=max_attempts,
get_internal_ips=True)

assert len(cluster_external_ips) == len(cluster_internal_ips), (
assert len(cluster_feasible_ips) == len(cluster_internal_ips), (
f'Cluster {self.cluster_name!r}:'
f'Expected same number of internal IPs {cluster_internal_ips}'
f' and external IPs {cluster_external_ips}.')
f' and external IPs {cluster_feasible_ips}.')

# List of (internal_ip, feasible_ip) tuples for all the nodes in the
# cluster, sorted by the feasible ips. The feasible ips can be either
# internal or external ips, depending on the use_internal_ips flag.
internal_external_ips: List[Tuple[str, str]] = list(
zip(cluster_internal_ips, cluster_external_ips))
zip(cluster_internal_ips, cluster_feasible_ips))

# Ensure head node is the first element, then sort based on the
# external IPs for stableness
Expand Down Expand Up @@ -3204,6 +3218,8 @@ def _sync_file_mounts(
storage_mounts: Optional[Dict[Path, storage_lib.Storage]],
) -> None:
"""Mounts all user files to the remote nodes."""
controller_utils.replace_skypilot_config_path_in_file_mounts(
handle.launched_resources.cloud, all_file_mounts)
self._execute_file_mounts(handle, all_file_mounts)
self._execute_storage_mounts(handle, storage_mounts)
self._set_storage_mounts_metadata(handle.cluster_name, storage_mounts)
Expand Down
23 changes: 6 additions & 17 deletions sky/execution.py
Original file line number Diff line number Diff line change
Expand Up @@ -22,7 +22,6 @@
from sky.backends import backend_utils
from sky.skylet import constants
from sky.usage import usage_lib
from sky.utils import common_utils
from sky.utils import controller_utils
from sky.utils import dag_utils
from sky.utils import env_options
Expand Down Expand Up @@ -666,29 +665,19 @@ def spot_launch(
prefix = spot.SPOT_TASK_YAML_PREFIX
remote_user_yaml_path = f'{prefix}/{dag.name}-{dag_uuid}.yaml'
remote_user_config_path = f'{prefix}/{dag.name}-{dag_uuid}.config_yaml'
extra_vars, controller_resources_config = (
controller_utils.skypilot_config_setup(
controller_type='spot',
controller_resources_config=spot.constants.CONTROLLER_RESOURCES,
remote_user_config_path=remote_user_config_path))
try:
controller_resources = sky.Resources.from_yaml_config(
controller_resources_config)
except ValueError as e:
with ux_utils.print_exception_no_traceback():
raise ValueError(
controller_utils.CONTROLLER_RESOURCES_NOT_VALID_MESSAGE.
format(controller_type='spot',
err=common_utils.format_exception(
e, use_bracket=True))) from e
controller_resources = (controller_utils.get_controller_resources(
controller_type='spot',
controller_resources_config=spot.constants.CONTROLLER_RESOURCES))

vars_to_fill = {
'remote_user_yaml_path': remote_user_yaml_path,
'user_yaml_path': f.name,
'spot_controller': controller_name,
# Note: actual spot cluster name will be <task.name>-<spot job ID>
'dag_name': dag.name,
'retry_until_up': retry_until_up,
**extra_vars,
'remote_user_config_path': remote_user_config_path,
**controller_utils.shared_controller_vars_to_fill('spot'),
}

yaml_path = os.path.join(spot.SPOT_CONTROLLER_YAML_PREFIX,
Expand Down
23 changes: 6 additions & 17 deletions sky/serve/core.py
Original file line number Diff line number Diff line change
Expand Up @@ -83,28 +83,17 @@ def up(
serve_utils.generate_remote_config_yaml_file_name(service_name))
controller_log_file = (
serve_utils.generate_remote_controller_log_file_name(service_name))
extra_vars, controller_resources_config = (
controller_utils.skypilot_config_setup(
controller_type='serve',
controller_resources_config=serve_constants.
CONTROLLER_RESOURCES,
remote_user_config_path=remote_config_yaml_path))
try:
controller_resources = sky.Resources.from_yaml_config(
controller_resources_config)
except ValueError as e:
with ux_utils.print_exception_no_traceback():
raise ValueError(
controller_utils.CONTROLLER_RESOURCES_NOT_VALID_MESSAGE.
format(controller_type='serve',
err=common_utils.format_exception(
e, use_bracket=True))) from e
controller_resources = (controller_utils.get_controller_resources(
controller_type='serve',
controller_resources_config=serve_constants.CONTROLLER_RESOURCES))

vars_to_fill = {
'remote_task_yaml_path': remote_tmp_task_yaml_path,
'local_task_yaml_path': service_file.name,
'service_name': service_name,
'controller_log_file': controller_log_file,
**extra_vars,
'remote_user_config_path': remote_config_yaml_path,
**controller_utils.shared_controller_vars_to_fill('serve'),
}
backend_utils.fill_template(serve_constants.CONTROLLER_TEMPLATE,
vars_to_fill,
Expand Down
Loading

0 comments on commit a1b0bd3

Please sign in to comment.