[Provisioner] Update ports for UP cluster (#2485)
* AWS is working

* GCP finished. TODO: find a way to write cluster yaml in backends_utils

* simplify ports when repr

* generate new sg for aws

* fix aws new sg no permission for ssh, format

* remove finished TODO

* write cluster config in backend_utils; deprecate code in config.py

* remove redundant

* minor

* format

* nit

* nit

* Azure finished

* fix

* lint

* change port type to List[str]

* add cli option

* minor

* add doc

* add doc

* add ports doc

* Update docs/source/reference/yaml-spec.rst

Co-authored-by: Romil Bhardwaj <[email protected]>

* add port doc

* apply suggestions from code review

* Apply suggestions from code review

Co-authored-by: Zhanghao Wu <[email protected]>

* fix

* upd docs

* apply suggestions from code review

* change api to all ports, fix bug, check ports in resource_utils

* minor

* gracefully handle gcp ports

* Update docs/source/reference/yaml-spec.rst

Co-authored-by: Zhanghao Wu <[email protected]>

* upd doc

* remove ports argument & move ports-specific variable to make_deploy_variables

* restore ToProvisionConfig

* remove new provisioner api check

* add checking for whether need open ports

* nits & fix multi-node cluster in aws

* add backward compatibility for gcp ports

* merge get_vpc_name and create_or_update_firewall_rule

* fix __setstate__

* get rid of DEFAULT_AWS_SG_NAME and fix a bug

* nits

* nits

* constant for aws default sg

* move ports delta calculation into _update_after_cluster_provisioned

* move sg / firewall name generation into make_deploy_variables

* nit

* fix

* nits

* update doc

* fix a bug

* Apply suggestions from code review

Co-authored-by: Romil Bhardwaj <[email protected]>

* add minimal rule for aws

* add minimal for gcp

* fix

---------

Co-authored-by: Romil Bhardwaj <[email protected]>
Co-authored-by: Zhanghao Wu <[email protected]>
3 people authored Sep 18, 2023
1 parent 6f9ad6b commit e5e400b
Showing 36 changed files with 891 additions and 295 deletions.
21 changes: 17 additions & 4 deletions docs/source/cloud-setup/cloud-permissions/aws.rst
@@ -117,22 +117,35 @@ AWS accounts can be attached with a policy that limits the permissions of the ac
"Resource": "*"
}
5. Click **Next: Tags** and follow the instructions to finish creating the policy. You can give the policy a descriptive name, such as ``minimal-skypilot-policy``.
6. Go back to the previous window and click on the refresh button, and you can now search for the policy you just created.
5. **Optional**: To enable opening ports on an AWS cluster, add the following permissions to the policy above as well.

.. code-block:: json

    {
        "Effect": "Allow",
        "Action": [
            "ec2:DeleteSecurityGroup",
            "ec2:ModifyInstanceAttribute"
        ],
        "Resource": "arn:aws:ec2:*:<account-ID-without-hyphens>:*"
    }

6. Click **Next: Tags** and follow the instructions to finish creating the policy. You can give the policy a descriptive name, such as ``minimal-skypilot-policy``.
7. Go back to the previous window and click on the refresh button, and you can now search for the policy you just created.

.. image:: ../../images/screenshots/aws/aws-add-policy.png
:width: 80%
:align: center
:alt: AWS Add Policy

7. **Optional**: If you would like to have your users access S3 buckets: You can additionally attach S3 access, such as the "AmazonS3FullAccess" policy.
8. **Optional**: If you would like to have your users access S3 buckets: You can additionally attach S3 access, such as the "AmazonS3FullAccess" policy.

.. image:: ../../images/screenshots/aws/aws-s3-policy.png
:width: 80%
:align: center
:alt: AWS Add S3 Policy

8. Click on **Next** and follow the instructions to create the user.
9. Click on **Next** and follow the instructions to create the user.

With the steps above, the users in your organization are almost ready to use SkyPilot with minimal permissions.
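The optional port-opening statement above contains an `<account-ID-without-hyphens>` placeholder. As a minimal sketch, the statement could be generated in Python so that the account ID is always normalized the same way. The helper name `ports_policy_statement` is hypothetical and is not part of SkyPilot or any AWS SDK; the account ID in the example is made up.

```python
import json


def ports_policy_statement(account_id: str) -> dict:
    """Build the optional port-opening statement for a given AWS account ID.

    Hypothetical helper for illustration only.
    """
    # Account IDs are sometimes written as 1234-5678-9012; the ARN needs
    # the 12 digits without hyphens.
    digits = account_id.replace("-", "")
    if len(digits) != 12 or not (digits.isascii() and digits.isdigit()):
        raise ValueError(f"Expected a 12-digit AWS account ID, got {account_id!r}")
    return {
        "Effect": "Allow",
        "Action": [
            "ec2:DeleteSecurityGroup",
            "ec2:ModifyInstanceAttribute",
        ],
        "Resource": f"arn:aws:ec2:*:{digits}:*",
    }


if __name__ == "__main__":
    # Print the statement so it can be pasted into the policy's Statement list.
    print(json.dumps(ports_policy_statement("1234-5678-9012"), indent=4))
```

The printed JSON object is meant to be added to the `Statement` list of the policy created in the earlier steps.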

15 changes: 11 additions & 4 deletions docs/source/cloud-setup/cloud-permissions/gcp.rst
@@ -126,17 +126,24 @@ User
compute.images.get
compute.images.delete
7. Click **Create** to create the role.
8. Go back to the "IAM" tab and click on **GRANT ACCESS**.
9. Fill in the email address of the user in the “Add principals” section, and select ``minimal-skypilot-role`` in the “Assign roles” section. Click **Save**.
7. **Optional**: To enable opening ports on a GCP cluster, add the following permissions to the role as well:

.. code-block:: text

    compute.firewalls.list
    compute.firewalls.update

8. Click **Create** to create the role.
9. Go back to the "IAM" tab and click on **GRANT ACCESS**.
10. Fill in the email address of the user in the “Add principals” section, and select ``minimal-skypilot-role`` in the “Assign roles” section. Click **Save**.


.. image:: ../../images/screenshots/gcp/create-iam.png
:width: 80%
:align: center
:alt: GCP Grant Access

10. The user should receive an invitation to the project and should be able to set up SkyPilot by following the instructions in :ref:`Installation <installation-gcp>`.
11. The user should receive an invitation to the project and should be able to set up SkyPilot by following the instructions in :ref:`Installation <installation-gcp>`.

.. note::

72 changes: 72 additions & 0 deletions docs/source/examples/ports.rst
@@ -0,0 +1,72 @@
.. _ports:

Opening Ports
=============

At times, it might be crucial to expose specific ports on your cluster to the public internet. For example:

- **Exposing Development Tools**: If you're working with tools like Jupyter Notebook or Ray, you'll need to expose their ports to access the interface or dashboard from your browser.
- **Creating Web Services**: Whether you're setting up a web server, database, or another service, they all communicate via specific ports that need to be accessible.
- **Collaborative Tools**: Some tools and platforms may require port openings to enable collaboration with teammates or to integrate with other services.

Opening Ports for SkyPilot cluster
----------------------------------

To open a port on a SkyPilot cluster, specify :code:`ports` in the :code:`resources` section of your task. For example, here is a YAML configuration to expose a Jupyter Lab server:

.. code-block:: yaml

    # jupyter_lab.yaml
    resources:
      ports: 8888

    setup: pip install jupyter

    run: jupyter lab --port 8888 --no-browser --ip=0.0.0.0

In this example, the :code:`run` command starts the Jupyter Lab server on port 8888. By specifying :code:`ports: 8888`, SkyPilot exposes port 8888 on the cluster, making the Jupyter server publicly accessible. To launch and access the server, run:

.. code-block:: bash

    $ sky launch -c jupyter jupyter_lab.yaml

and look in the logs for output like:

.. code-block:: bash

    Jupyter Server 2.7.0 is running at:
    http://127.0.0.1:8888/lab?token=<token>

To get the public IP address of the head node of the cluster, run :code:`sky status --ip jupyter`:

.. code-block:: bash

    $ sky status --ip jupyter
    35.223.97.21

In the Jupyter server URL, replace :code:`127.0.0.1` with the public IP from :code:`sky status --ip jupyter` and open the URL in your browser.
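The URL substitution described above can also be scripted. The following is a minimal sketch using only the Python standard library; the function name `with_public_ip` and the token value are hypothetical, and the IP is the sample address shown earlier.

```python
from urllib.parse import urlsplit, urlunsplit


def with_public_ip(local_url: str, public_ip: str) -> str:
    """Return local_url with its host swapped for public_ip.

    The scheme, port, path, and query string (including the token) are kept.
    """
    parts = urlsplit(local_url)
    # Rebuild the network location, keeping the port if one was present.
    netloc = public_ip if parts.port is None else f"{public_ip}:{parts.port}"
    return urlunsplit(parts._replace(netloc=netloc))


if __name__ == "__main__":
    # "abc123" is a made-up token standing in for the one printed in the logs.
    print(with_public_ip("http://127.0.0.1:8888/lab?token=abc123", "35.223.97.21"))
```

Parsing the URL instead of doing a plain string replacement avoids accidentally rewriting an occurrence of `127.0.0.1` elsewhere in the URL.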

If you want to expose multiple ports, you can specify a list of ports or port ranges in the :code:`resources` section:

.. code-block:: yaml

    resources:
      ports:
        - 8888
        - 10020-10040
        - 20000-20010

SkyPilot also supports opening ports through the CLI:

.. code-block:: bash

    $ sky launch -c jupyter --ports 8888 jupyter_lab.yaml

Security and Lifecycle Considerations
-------------------------------------

Before you start opening ports, there are a few things you need to bear in mind:

- **Public Accessibility**: Ports you open are exposed to the public internet. This means anyone who knows your VM's IP address and the opened port can access your service. Use security measures, such as authentication, to protect your services.
- **Lifecycle Management**: All opened ports stay open, even after individual tasks have finished. Ports are closed automatically only at cluster shutdown: at that point, all ports opened during the cluster's lifespan are closed, and the firewall rules and security groups associated with them are cleaned up.
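Because opened ports accumulate over a cluster's lifetime, a relaunch only needs to open the ports that are not open yet. The sketch below illustrates that bookkeeping for the port formats accepted by :code:`ports` (a single port, a range such as ``10020-10040``, or a list mixing both). It is an assumption-laden illustration, not SkyPilot's actual implementation; `expand_ports` and `ports_to_open` are hypothetical names.

```python
from typing import Iterable, Set, Union

PortSpec = Union[int, str]


def expand_ports(spec: Union[PortSpec, Iterable[PortSpec]]) -> Set[int]:
    """Expand a port, a 'low-high' range, or a list of both into a set of ints."""
    items = [spec] if isinstance(spec, (int, str)) else list(spec)
    ports: Set[int] = set()
    for item in items:
        text = str(item).strip()
        if "-" in text:
            low_text, high_text = text.split("-", 1)
            low, high = int(low_text), int(high_text)
        else:
            low = high = int(text)
        # Valid TCP ports are 1..65535, and a range must be ascending.
        if not 1 <= low <= high <= 65535:
            raise ValueError(f"Invalid port or port range: {item!r}")
        ports.update(range(low, high + 1))
    return ports


def ports_to_open(requested, already_open) -> Set[int]:
    """Ports that a relaunch would still need to open (hypothetical helper)."""
    return expand_ports(requested) - expand_ports(already_open)


if __name__ == "__main__":
    # Port 8888 is already open, so only the range remains to be opened.
    print(sorted(ports_to_open([8888, "10020-10022"], [8888])))
```

Nothing is ever subtracted from the set of open ports in this sketch, which mirrors the rule that ports are only closed at cluster shutdown.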
1 change: 1 addition & 0 deletions docs/source/index.rst
@@ -125,6 +125,7 @@ Documentation
:caption: User Guides

examples/docker-containers
examples/ports
examples/iterative-dev-project
reference/interactive-nodes
reference/faq
17 changes: 14 additions & 3 deletions docs/source/reference/yaml-spec.rst
@@ -86,9 +86,20 @@ Available fields:
# these ports. Applies to all VMs of a cluster created with this field set.
# Currently only TCP protocol is supported.
# Could be an integer or a range.
ports:
- 8080
- 10022-10040
# Ports Lifecycle:
# A cluster's ports will be updated whenever `sky launch` is executed. When launching an
# existing cluster, any new ports specified will be opened for the cluster, and the firewall
# rules for old ports will not be removed until the cluster is terminated.
# The following three ways are valid for specifying ports for a cluster:
# To specify a single port:
# ports: 8081
# To specify a port range:
# ports: 10052-10100
# To specify multiple ports / port ranges:
# ports:
# - 8080
# - 10022-10040
ports: 8081
# Additional accelerator metadata (optional); only used for TPU node
# and TPU VM.
42 changes: 40 additions & 2 deletions sky/adaptors/azure.py
@@ -1,14 +1,16 @@
"""Azure cli adaptor"""

# pylint: disable=import-outside-toplevel
from functools import wraps
import functools
import threading

azure = None
_session_creation_lock = threading.RLock()


def import_package(func):

@wraps(func)
@functools.wraps(func)
def wrapper(*args, **kwargs):
global azure
if azure is None:
@@ -35,3 +37,39 @@ def get_current_account_user() -> str:
"""Get the default account user."""
from azure.common import credentials
return credentials.get_cli_profile().get_current_account_user()


@import_package
def http_error_exception():
    """HttpError exception."""
    from azure.core import exceptions
    return exceptions.HttpResponseError


@functools.lru_cache()
@import_package
def get_client(name: str, subscription_id: str):
    # Sky only supports Azure CLI credential for now.
    # Increase the timeout to fix the Azure get-access-token timeout issue.
    # Tracked in
    # https://github.com/Azure/azure-cli/issues/20404#issuecomment-1249575110
    from azure.identity import AzureCliCredential
    from azure.mgmt.network import NetworkManagementClient
    from azure.mgmt.resource import ResourceManagementClient
    with _session_creation_lock:
        credential = AzureCliCredential(process_timeout=30)
        if name == 'compute':
            from azure.mgmt.compute import ComputeManagementClient
            return ComputeManagementClient(credential, subscription_id)
        elif name == 'network':
            return NetworkManagementClient(credential, subscription_id)
        elif name == 'resource':
            return ResourceManagementClient(credential, subscription_id)
        else:
            raise ValueError(f'Client not supported: "{name}"')


@import_package
def create_security_rule(**kwargs):
    from azure.mgmt.network.models import SecurityRule
    return SecurityRule(**kwargs)
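The `get_client` function above combines a cache, a lock, and a name-based dispatch. The pattern can be tried in isolation without the Azure SDK. The sketch below is a stand-alone illustration under stated assumptions: `FakeClient` is a hypothetical stand-in for the real management clients, and credential creation is omitted entirely.

```python
import functools
import threading

_creation_lock = threading.RLock()


class FakeClient:
    """Hypothetical stand-in for a cloud SDK management client."""

    def __init__(self, kind: str, subscription_id: str):
        self.kind = kind
        self.subscription_id = subscription_id


@functools.lru_cache()
def get_client(name: str, subscription_id: str) -> FakeClient:
    # Serialize construction so concurrent callers do not build clients
    # at the same time.
    with _creation_lock:
        if name in ('compute', 'network', 'resource'):
            return FakeClient(name, subscription_id)
        raise ValueError(f'Client not supported: "{name}"')


if __name__ == '__main__':
    first = get_client('network', 'sub-1')
    second = get_client('network', 'sub-1')
    # The second call is served from the cache, so both names refer to
    # the same object.
    print(first is second)
```

Since `lru_cache` keys on the arguments, each `(name, subscription_id)` pair gets its own cached client, and a call that raises is not cached.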
29 changes: 11 additions & 18 deletions sky/backends/backend_utils.py
@@ -134,8 +134,13 @@
# - UserData: The UserData field of the old yaml may be outdated, and we want to
# use the new yaml's UserData field, which contains the authorized key setup as
# well as the disabling of the auto-update with apt-get.
# - firewall_rule: This is a newly added section for gcp in provider section.
# - security_group: In #2485 we introduced changes to the security group, so we
# should take the latest security group name.
_RAY_YAML_KEYS_TO_RESTORE_EXCEPTIONS = [
('provider', 'availability_zone'),
('provider', 'firewall_rule'),
('provider', 'security_group', 'GroupName'),
('available_node_types', 'ray.head.default', 'node_config', 'UserData'),
('available_node_types', 'ray.worker.default', 'node_config', 'UserData'),
]
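Each entry in the exceptions list above is a key path into the cluster YAML whose new value should win over the old one. As a rough sketch of how such key paths can be honored while restoring an old config, consider the following. This is not the repository's `_restore_block` logic; `restore_old_values` and the sample dictionaries are hypothetical, and the sketch assumes that every key not listed as an exception is restored from the old config.

```python
from typing import Any, Dict, List, Tuple

KeyPath = Tuple[str, ...]


def restore_old_values(new: Dict[str, Any],
                       old: Dict[str, Any],
                       exceptions: List[KeyPath],
                       _prefix: KeyPath = ()) -> None:
    """Copy values from old into new in place, skipping exception key paths."""
    for key, old_value in old.items():
        path = _prefix + (key,)
        # Keep the new value for exception paths; ignore keys the new
        # config no longer has.
        if path in exceptions or key not in new:
            continue
        if isinstance(old_value, dict) and isinstance(new[key], dict):
            restore_old_values(new[key], old_value, exceptions, path)
        else:
            new[key] = old_value


if __name__ == '__main__':
    new_cfg = {'provider': {'project_id': 'new-project',
                            'security_group': {'GroupName': 'sky-sg-new'}}}
    old_cfg = {'provider': {'project_id': 'old-project',
                            'security_group': {'GroupName': 'sky-sg-old'}}}
    restore_old_values(new_cfg, old_cfg,
                       [('provider', 'security_group', 'GroupName')])
    # project_id is restored from the old config; GroupName keeps the new value.
    print(new_cfg)
```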
@@ -867,7 +872,6 @@ def _restore_block(new_block: Dict[str, Any], old_block: Dict[str, Any]):
def write_cluster_config(
to_provision: 'resources.Resources',
num_nodes: int,
ports: Optional[List[Union[int, str]]],
cluster_config_template: str,
cluster_name: str,
local_wheel_path: pathlib.Path,
@@ -893,6 +897,10 @@ def write_cluster_config(
# is running a job with less resources than the cluster has.
cloud = to_provision.cloud
assert cloud is not None, to_provision

cluster_name_on_cloud = common_utils.make_cluster_name_on_cloud(
cluster_name, max_length=cloud.max_cluster_name_length())

# This can raise a ResourcesUnavailableError when:
# * The region/zones requested does not appear in the catalog. It can be
# triggered if the user changed the catalog file while there is a cluster
@@ -905,7 +913,8 @@ def write_cluster_config(
# move the check out of this function, i.e. the caller should be responsible
# for the validation.
# TODO(tian): Move more cloud agnostic vars to resources.py.
resources_vars = to_provision.make_deploy_variables(region, zones)
resources_vars = to_provision.make_deploy_variables(cluster_name_on_cloud,
region, zones)
config_dict = {}

azure_subscription_id = None
@@ -986,14 +995,6 @@ def write_cluster_config(
f'open(os.path.expanduser("{constants.SKY_REMOTE_RAY_PORT_FILE}"), "w"))\''
)

cluster_name_on_cloud = common_utils.make_cluster_name_on_cloud(
cluster_name, max_length=cloud.max_cluster_name_length())

# Only using new security group names for clusters with ports specified.
default_aws_sg_name = f'sky-sg-{common_utils.user_and_hostname_hash()}'
if ports is not None:
default_aws_sg_name = f'sky-sg-{cluster_name_on_cloud}'

# Use a tmp file path to avoid incomplete YAML file being re-used in the
# future.
tmp_yaml_path = yaml_path + '.tmp'
@@ -1004,7 +1005,6 @@ def write_cluster_config(
**{
'cluster_name_on_cloud': cluster_name_on_cloud,
'num_nodes': num_nodes,
'ports': ports,
'disk_size': to_provision.disk_size,
# If the current code is run by controller, propagate the real
# calling user which should've been passed in as the
@@ -1013,13 +1013,6 @@ def write_cluster_config(
'SKYPILOT_USER', '')),

# AWS only:
# Temporary measure, as deleting per-cluster SGs is too slow.
# See https://github.com/skypilot-org/skypilot/pull/742.
# Generate the name of the security group we're looking for.
# (username, last 4 chars of hash of hostname): for uniquefying
# users on shared-account scenarios.
'security_group': skypilot_config.get_nested(
('aws', 'security_group_name'), default_aws_sg_name),
'vpc_name': skypilot_config.get_nested(('aws', 'vpc_name'),
None),
'use_internal_ips': skypilot_config.get_nested(