New provisioner for RunPod #2829

Merged
merged 88 commits into from
Jan 13, 2024
Changes from 79 commits
fda9c83
init
suquark Oct 5, 2023
a972b83
remove ray
suquark Oct 5, 2023
95f470e
update config
suquark Oct 6, 2023
8dd39e7
update
suquark Oct 6, 2023
45ef17a
update
suquark Oct 6, 2023
5d07647
update
suquark Oct 6, 2023
b45c09f
complete bootstrapping
suquark Oct 6, 2023
ee7a924
add start instance
suquark Oct 8, 2023
70ede43
fix
suquark Oct 8, 2023
f8fd06d
fix
suquark Oct 18, 2023
83880d2
fix
suquark Oct 19, 2023
75b23ac
update
suquark Oct 19, 2023
26149b7
wait stopping instances
suquark Oct 19, 2023
03ff947
support normal gcp tpus first
suquark Oct 19, 2023
aea9d1a
fix gcp
suquark Oct 19, 2023
0135eea
support get cluster info
suquark Oct 20, 2023
a5b7537
fix
suquark Oct 20, 2023
2bb7438
update
suquark Oct 20, 2023
e525f06
wait for instance starting
suquark Oct 28, 2023
0ae66fc
rename
suquark Nov 3, 2023
084170b
hide gcp package import
suquark Nov 3, 2023
6940640
fix
suquark Nov 3, 2023
45f00c1
fix
suquark Nov 3, 2023
9b5428f
update constants
suquark Nov 15, 2023
c4bed46
fix comments
suquark Nov 15, 2023
4f7fb16
remove unused methods
suquark Nov 15, 2023
0647233
fix comments
suquark Nov 16, 2023
a4dbcb0
sync 'config' & 'constants' with upstream, Nov 16
suquark Nov 16, 2023
43bf2e3
sync 'instace_utils' with the upstream, Nov 16
suquark Nov 16, 2023
0a867ee
fix typing
suquark Nov 17, 2023
9ee76de
parallelize provisioning
suquark Nov 17, 2023
8e584d4
Fix TPU node
Michaelvll Nov 17, 2023
d713618
Fix TPU NAME env for tpu node
Michaelvll Nov 17, 2023
aebb197
implement bulk provision
suquark Nov 18, 2023
b5b0246
refactor selflink
Michaelvll Nov 20, 2023
b02e70d
format
Michaelvll Nov 20, 2023
01ba45f
reduce the sleep time for autostop
Michaelvll Nov 21, 2023
d90c22b
provisioner version refactoring
Michaelvll Nov 26, 2023
2df5d39
refactor
Michaelvll Nov 27, 2023
06fbad6
Add logging
Michaelvll Nov 27, 2023
aee06dd
avoid saving the provisioner version
Michaelvll Nov 27, 2023
918a3c6
format
Michaelvll Nov 27, 2023
61bec6a
format
Michaelvll Nov 27, 2023
0f07964
Fix scheduling field in config
Michaelvll Nov 27, 2023
279b301
format
Michaelvll Nov 27, 2023
625c510
fix public key content
Michaelvll Nov 27, 2023
ebe92f5
Fix provisioner version for azure
Michaelvll Nov 27, 2023
54a87ef
Use ray port from head node for workers
Michaelvll Nov 27, 2023
b1896ac
format
Michaelvll Nov 27, 2023
15e193d
fix ray_port
Michaelvll Nov 27, 2023
504b7e7
fix smoke tests
Michaelvll Nov 27, 2023
b6ba235
shorter sleep time
Michaelvll Nov 27, 2023
497c438
refactor status refresh version
Michaelvll Nov 27, 2023
061e924
Merge branch 'master' of github.com:skypilot-org/skypilot into new_pr…
Michaelvll Dec 1, 2023
21a3692
Use new provisioner to launch runpod to avoid issue with ray autoscal…
Michaelvll Dec 1, 2023
533afb0
Add wait for the instances to be ready
Michaelvll Dec 1, 2023
389bd21
fix setup
Michaelvll Dec 1, 2023
b797e3b
Retry and give for getting internal IP
Michaelvll Dec 1, 2023
4ef8a59
comment
Michaelvll Dec 1, 2023
184c1de
Remove internal IP
Michaelvll Dec 1, 2023
8162e84
use external IP
Michaelvll Dec 1, 2023
eba6787
fix ssh port
Michaelvll Dec 1, 2023
9bbf3c2
Unsupported feature
Michaelvll Dec 1, 2023
472f1f6
typo
Michaelvll Dec 1, 2023
9f29cbf
Merge branch 'master' of github.com:skypilot-org/skypilot into runpod…
Michaelvll Jan 1, 2024
0b582ae
fix ssh ports
Michaelvll Jan 1, 2024
106eefa
rename var
Michaelvll Jan 1, 2024
8e8501f
format
Michaelvll Jan 2, 2024
1e92cfd
Fix cloud unsupported resources
Michaelvll Jan 2, 2024
fa07c72
Runpod update name mapping (#2945)
landscapepainter Jan 5, 2024
3489df5
Avoid using GpuInfo
Michaelvll Jan 7, 2024
5bbfc3d
Merge branch 'master' of github.com:skypilot-org/skypilot into runpod…
Michaelvll Jan 7, 2024
1859101
fix all_regions
Michaelvll Jan 7, 2024
ed955df
Fix runpod list accelerators
Michaelvll Jan 7, 2024
045bab6
format
Michaelvll Jan 7, 2024
9630832
revert to GpuInfo
Michaelvll Jan 8, 2024
97165a4
Fix get_feasible_launchable_resources
Michaelvll Jan 8, 2024
f527545
Add error
Michaelvll Jan 8, 2024
680beca
Fix optimizer random_dag for feature check
Michaelvll Jan 8, 2024
497039c
Merge branch 'master' of github.com:skypilot-org/skypilot into runpod…
Michaelvll Jan 12, 2024
8193931
address comments
Michaelvll Jan 13, 2024
e5631f3
remove test code
Michaelvll Jan 13, 2024
7399d53
format
Michaelvll Jan 13, 2024
9342d20
Add type hints
Michaelvll Jan 13, 2024
1955380
format
Michaelvll Jan 13, 2024
8b48f32
format
Michaelvll Jan 13, 2024
07498d9
fix keyerror
Michaelvll Jan 13, 2024
b11446e
Address comments
Michaelvll Jan 13, 2024
2 changes: 2 additions & 0 deletions sky/__init__.py
@@ -82,6 +82,7 @@ def get_git_commit():
Local = clouds.Local
Kubernetes = clouds.Kubernetes
OCI = clouds.OCI
RunPod = clouds.RunPod
optimize = Optimizer.optimize

__all__ = [
@@ -94,6 +95,7 @@ def get_git_commit():
    'Lambda',
    'Local',
    'OCI',
    'RunPod',
    'SCP',
    'Optimizer',
    'OptimizeTarget',
29 changes: 29 additions & 0 deletions sky/adaptors/runpod.py
@@ -0,0 +1,29 @@
"""RunPod cloud adaptor."""

import functools

_runpod_sdk = None


def import_package(func):

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        global _runpod_sdk
        if _runpod_sdk is None:
            try:
                import runpod as _runpod  # pylint: disable=import-outside-toplevel
                _runpod_sdk = _runpod
            except ImportError:
                raise ImportError(
                    'Failed to import dependencies for runpod. '
                    'Try pip install "skypilot[runpod]"') from None
        return func(*args, **kwargs)

    return wrapper


@import_package
def runpod():
    """Return the runpod package."""
    return _runpod_sdk
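The adaptor above defers importing the optional SDK until the first call and caches the module afterwards. A minimal, self-contained sketch of the same pattern follows. It is an illustration only: `json` stands in for the optional dependency, and the parameterized `import_package(module_name)` form and the `_sdk` / `sdk` names are assumptions that are not part of this PR.

```python
import functools
import importlib

_sdk = None  # cached module object; stays None until the first call


def import_package(module_name):
    """Build a decorator that imports `module_name` on first use."""

    def decorator(func):

        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            global _sdk
            if _sdk is None:
                try:
                    _sdk = importlib.import_module(module_name)
                except ImportError:
                    # `from None` hides the original traceback so the user
                    # only sees the actionable message.
                    raise ImportError(
                        f'Failed to import {module_name}.') from None
            return func(*args, **kwargs)

        return wrapper

    return decorator


@import_package('json')  # stand-in for an optional SDK
def sdk():
    """Return the lazily imported module."""
    return _sdk
```

Callers always go through `sdk()`, so importing the adaptor module itself never fails when the optional package is missing; the error only surfaces once the cloud is actually used.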
15 changes: 15 additions & 0 deletions sky/authentication.py
@@ -41,6 +41,7 @@
from sky import skypilot_config
from sky.adaptors import gcp
from sky.adaptors import ibm
from sky.adaptors import runpod
from sky.clouds.utils import lambda_utils
from sky.utils import common_utils
from sky.utils import kubernetes_enums
@@ -449,3 +450,17 @@ def setup_kubernetes_authentication(config: Dict[str, Any]) -> Dict[str, Any]:
    config['auth']['ssh_proxy_command'] = ssh_proxy_cmd

    return config


# ---------------------------------- RunPod ---------------------------------- #
def setup_runpod_authentication(config: Dict[str, Any]) -> Dict[str, Any]:
    """Sets up SSH authentication for RunPod.
    - Generates a new SSH key pair if one does not exist.
    - Adds the public SSH key to the user's RunPod account.
    """
    _, public_key_path = get_or_generate_keys()
    with open(public_key_path, 'r', encoding='UTF-8') as pub_key_file:
        public_key = pub_key_file.read().strip()
        runpod.runpod().cli.groups.ssh.functions.add_ssh_key(public_key)

    return configure_ssh_info(config)
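The function above reads the public key, strips the trailing newline, and hands the result to the provider. The sketch below shows that flow in isolation so it can be run without a RunPod account. The `register_public_key` helper and its `add_key` callback are hypothetical names introduced here; the real code calls the RunPod SDK directly.

```python
import pathlib
import tempfile
from typing import Callable


def register_public_key(public_key_path: str,
                        add_key: Callable[[str], None]) -> str:
    """Read a public key file and pass its stripped contents to `add_key`."""
    key = pathlib.Path(public_key_path).expanduser().read_text(
        encoding='utf-8').strip()
    add_key(key)  # hypothetical registrar; stands in for the provider call
    return key


# Usage with a fake registrar that simply records what it receives.
registered = []
with tempfile.TemporaryDirectory() as tmp:
    key_file = pathlib.Path(tmp) / 'sky-key.pub'
    key_file.write_text('ssh-ed25519 AAAAexample user@host\n',
                        encoding='utf-8')
    returned = register_public_key(str(key_file), registered.append)
```

Passing the registrar in as a callable keeps the file handling testable; the key material shown is a dummy value.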
2 changes: 2 additions & 0 deletions sky/backends/backend_utils.py
@@ -1172,6 +1172,8 @@ def _add_auth_to_cluster_config(cloud: clouds.Cloud, cluster_config_file: str):
        config = auth.setup_kubernetes_authentication(config)
    elif isinstance(cloud, clouds.IBM):
        config = auth.setup_ibm_authentication(config)
    elif isinstance(cloud, clouds.RunPod):
        config = auth.setup_runpod_authentication(config)
    else:
        assert isinstance(cloud, clouds.Local), cloud
        # Local cluster case, authentication is already filled by the user
10 changes: 10 additions & 0 deletions sky/backends/cloud_vm_ray_backend.py
@@ -147,6 +147,7 @@ def _get_cluster_config_template(cloud):
        clouds.Local: 'local-ray.yml.j2',
        clouds.SCP: 'scp-ray.yml.j2',
        clouds.OCI: 'oci-ray.yml.j2',
        clouds.RunPod: 'runpod-ray.yml.j2',
        clouds.Kubernetes: 'kubernetes-ray.yml.j2',
    }
    return cloud_to_template[type(cloud)]
@@ -2343,6 +2344,15 @@ def update_ssh_ports(self, max_attempts: int = 1) -> None:
        Use this method to use any cloud-specific port fetching logic.
        """
        del max_attempts  # Unused.
        if isinstance(self.launched_resources.cloud, clouds.RunPod):
            cluster_info = provision_lib.get_cluster_info(
                str(self.launched_resources.cloud).lower(),
                region=self.launched_resources.region,
                cluster_name_on_cloud=self.cluster_name_on_cloud,
                provider_config=None)
            self.stable_ssh_ports = cluster_info.get_ssh_ports()
            return

        head_ssh_port = 22
        self.stable_ssh_ports = (
            [head_ssh_port] + [22] *
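The `update_ssh_ports` hunk special-cases RunPod: instead of assuming port 22 on every node, it takes the ports reported by the provisioner's cluster info. A small sketch of that decision follows, assuming a simplified `NodeInfo` record and a `resolve_ssh_ports` helper; both are hypothetical and are not SkyPilot's real classes.

```python
import dataclasses
from typing import List, Optional


@dataclasses.dataclass
class NodeInfo:
    """Hypothetical per-node record, not SkyPilot's real class."""
    instance_id: str
    ssh_port: int = 22


def resolve_ssh_ports(num_nodes: int,
                      nodes: Optional[List[NodeInfo]] = None) -> List[int]:
    """Return one SSH port per node.

    Without provider-reported nodes, every node is assumed to listen on 22.
    """
    if nodes is None:
        return [22] * num_nodes
    return [node.ssh_port for node in nodes]


# A VM-style cloud: two nodes, default port on each.
vm_ports = resolve_ssh_ports(num_nodes=2)
# A pod-style cloud: the port is whatever the provider mapped for the pod.
pod_ports = resolve_ssh_ports(1, [NodeInfo('pod-a', ssh_port=10422)])
```

The port number 10422 is an arbitrary example value.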
2 changes: 2 additions & 0 deletions sky/clouds/__init__.py
@@ -17,6 +17,7 @@
from sky.clouds.lambda_cloud import Lambda
from sky.clouds.local import Local
from sky.clouds.oci import OCI
from sky.clouds.runpod import RunPod
from sky.clouds.scp import SCP

__all__ = [
@@ -28,6 +29,7 @@
    'Lambda',
    'Local',
    'SCP',
    'RunPod',
    'OCI',
    'Kubernetes',
    'CloudImplementationFeatures',
275 changes: 275 additions & 0 deletions sky/clouds/runpod.py
@@ -0,0 +1,275 @@
""" RunPod Cloud. """

import json
import typing
from typing import Dict, Iterator, List, Optional, Tuple

from sky import clouds
from sky.clouds import service_catalog

if typing.TYPE_CHECKING:
    from sky import resources as resources_lib

_CREDENTIAL_FILES = [
    'config.toml',
]


@clouds.CLOUD_REGISTRY.register
class RunPod(clouds.Cloud):
    """ RunPod GPU Cloud

    _REPR | The string representation for the RunPod GPU cloud object.
    """
    _REPR = 'RunPod'
    _CLOUD_UNSUPPORTED_FEATURES = {
        clouds.CloudImplementationFeatures.STOP: 'Stopping not supported.',
        clouds.CloudImplementationFeatures.SPOT_INSTANCE:
            ('Spot is not supported, as runpod API does not implement spot.'),
        clouds.CloudImplementationFeatures.MULTI_NODE:
            ('Multi-node not supported yet, as the interconnection among nodes '
             'is non-trivial on RunPod.'),
    }
    _MAX_CLUSTER_NAME_LEN_LIMIT = 120
    _regions: List[clouds.Region] = []

    PROVISIONER_VERSION = clouds.ProvisionerVersion.SKYPILOT
    STATUS_VERSION = clouds.StatusVersion.SKYPILOT

    @classmethod
    def _unsupported_features_for_resources(
        cls, resources: 'resources_lib.Resources'
    ) -> Dict[clouds.CloudImplementationFeatures, str]:
        """The features not supported based on the resources provided.

        This method is used by check_features_are_supported() to check if the
        cloud implementation supports all the requested features.

        Returns:
            A dict of {feature: reason} for the features not supported by the
            cloud implementation.
        """
        del resources  # unused
        return cls._CLOUD_UNSUPPORTED_FEATURES

    @classmethod
    def _max_cluster_name_length(cls) -> Optional[int]:
        return cls._MAX_CLUSTER_NAME_LEN_LIMIT

    @classmethod
    def regions_with_offering(cls, instance_type: str,
                              accelerators: Optional[Dict[str, int]],
                              use_spot: bool, region: Optional[str],
                              zone: Optional[str]) -> List[clouds.Region]:
        assert zone is None, 'RunPod does not support zones.'
        del accelerators, zone  # unused
        if use_spot:
            return []
        else:
            regions = service_catalog.get_region_zones_for_instance_type(
                instance_type, use_spot, 'runpod')

        if region is not None:
            regions = [r for r in regions if r.name == region]
        return regions

    @classmethod
    def get_vcpus_mem_from_instance_type(
        cls,
        instance_type: str,
    ) -> Tuple[Optional[float], Optional[float]]:
        return service_catalog.get_vcpus_mem_from_instance_type(instance_type,
                                                                clouds='runpod')

    @classmethod
    def zones_provision_loop(
        cls,
        *,
        region: str,
        num_nodes: int,
        instance_type: str,
        accelerators: Optional[Dict[str, int]] = None,
        use_spot: bool = False,
    ) -> Iterator[None]:
        del num_nodes  # unused
        regions = cls.regions_with_offering(instance_type,
                                            accelerators,
                                            use_spot,
                                            region=region,
                                            zone=None)
        for r in regions:
            assert r.zones is None, r
            yield r.zones

    def instance_type_to_hourly_cost(self,
                                     instance_type: str,
                                     use_spot: bool,
                                     region: Optional[str] = None,
                                     zone: Optional[str] = None) -> float:
        return service_catalog.get_hourly_cost(instance_type,
                                               use_spot=use_spot,
                                               region=region,
                                               zone=zone,
                                               clouds='runpod')

    def accelerators_to_hourly_cost(self,
                                    accelerators: Dict[str, int],
                                    use_spot: bool,
                                    region: Optional[str] = None,
                                    zone: Optional[str] = None) -> float:
        """Returns the hourly cost of the accelerators, in dollars/hour."""
        del accelerators, use_spot, region, zone  # unused
        return 0.0  # RunPod includes accelerators in the hourly cost.

    def get_egress_cost(self, num_gigabytes: float) -> float:
        return 0.0

    def __repr__(self):
        return 'RunPod'

    def is_same_cloud(self, other: clouds.Cloud) -> bool:
        # Returns true if the two clouds are the same cloud type.
        return isinstance(other, RunPod)

    @classmethod
    def get_default_instance_type(
            cls,
            cpus: Optional[str] = None,
            memory: Optional[str] = None,
            disk_tier: Optional[str] = None) -> Optional[str]:
        """Returns the default instance type for RunPod."""
        return service_catalog.get_default_instance_type(cpus=cpus,
                                                         memory=memory,
                                                         disk_tier=disk_tier,
                                                         clouds='runpod')

    @classmethod
    def get_accelerators_from_instance_type(
            cls, instance_type: str) -> Optional[Dict[str, int]]:
        return service_catalog.get_accelerators_from_instance_type(
            instance_type, clouds='runpod')

    @classmethod
    def get_zone_shell_cmd(cls) -> Optional[str]:
        return None

    def make_deploy_resources_variables(
            self, resources: 'resources_lib.Resources',
            cluster_name_on_cloud: str, region: 'clouds.Region',
            zones: Optional[List['clouds.Zone']]) -> Dict[str, Optional[str]]:
        del zones

        r = resources
        acc_dict = self.get_accelerators_from_instance_type(r.instance_type)
        if acc_dict is not None:
            custom_resources = json.dumps(acc_dict, separators=(',', ':'))
        else:
            custom_resources = None

        return {
            'instance_type': resources.instance_type,
            'custom_resources': custom_resources,
            'region': region.name,
        }

    def _get_feasible_launchable_resources(
            self, resources: 'resources_lib.Resources'):
        """Returns a list of feasible resources for the given resources."""
        if resources.use_spot:
            return ([], [])
        if resources.instance_type is not None:
            assert resources.is_launchable(), resources
            resources = resources.copy(accelerators=None)
            return ([resources], [])

        def _make(instance_list):
            resource_list = []
            for instance_type in instance_list:
                r = resources.copy(
                    cloud=RunPod(),
                    instance_type=instance_type,
                    accelerators=None,
                    cpus=None,
                )
                resource_list.append(r)
            return resource_list

        # Currently, handle a filter on accelerators only.
        accelerators = resources.accelerators
        if accelerators is None:
            # Return a default instance type
            default_instance_type = RunPod.get_default_instance_type(
                cpus=resources.cpus,
                memory=resources.memory,
                disk_tier=resources.disk_tier)
            if default_instance_type is None:
                return ([], [])
            else:
                return (_make([default_instance_type]), [])

        assert len(accelerators) == 1, resources
        acc, acc_count = list(accelerators.items())[0]
        (instance_list, fuzzy_candidate_list
        ) = service_catalog.get_instance_type_for_accelerator(
            acc,
            acc_count,
            use_spot=resources.use_spot,
            cpus=resources.cpus,
            region=resources.region,
            zone=resources.zone,
            clouds='runpod')
        if instance_list is None:
            return ([], fuzzy_candidate_list)
        return (_make(instance_list), fuzzy_candidate_list)

    @classmethod
    def check_credentials(cls) -> Tuple[bool, Optional[str]]:
        """ Verify that the user has valid credentials for RunPod. """
        try:
            import runpod  # pylint: disable=import-outside-toplevel
            valid, error = runpod.check_credentials()

            if not valid:
                return False, (
                    f'{error} \n'  # First line is indented by 4 spaces
                    ' Credentials can be set up by running: \n'
                    f' $ pip install runpod \n'
                    f' $ runpod store_api_key <YOUR_RUNPOD_API_KEY> \n'
                    ' For more information, see https://docs.runpod.io/docs/skypilot'  # pylint: disable=line-too-long
                )

            return True, None

        except ImportError:
            return False, (
                'Failed to import runpod. '
                'To install, run: "pip install runpod" or "pip install skypilot[runpod]"'  # pylint: disable=line-too-long
            )

    def get_credential_file_mounts(self) -> Dict[str, str]:
        return {
            f'~/.runpod/{filename}': f'~/.runpod/{filename}'
            for filename in _CREDENTIAL_FILES
        }

    @classmethod
    def get_current_user_identity(cls) -> Optional[List[str]]:
        # NOTE: used for very advanced SkyPilot functionality
        # Can implement later if desired
        return None

    def instance_type_exists(self, instance_type: str) -> bool:
        return service_catalog.instance_type_exists(instance_type, 'runpod')

    def validate_region_zone(self, region: Optional[str], zone: Optional[str]):
        return service_catalog.validate_region_zone(region,
                                                    zone,
                                                    clouds='runpod')

    def accelerator_in_region_or_zone(self,
                                      accelerator: str,
                                      acc_count: int,
                                      region: Optional[str] = None,
                                      zone: Optional[str] = None) -> bool:
        return service_catalog.accelerator_in_region_or_zone(
            accelerator, acc_count, region, zone, 'runpod')
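The class above declares its unsupported features as a `{feature: reason}` table, which its docstring says is consumed by `check_features_are_supported()`. The sketch below shows one way such a table could be checked against a set of requested features. It is an assumption-laden illustration: the `Feature` enum, `NotSupportedError`, and this standalone `check_features_are_supported` are hypothetical stand-ins, not SkyPilot's actual implementation.

```python
import enum
from typing import Dict, Set


class Feature(enum.Enum):
    """Hypothetical stand-in for clouds.CloudImplementationFeatures."""
    STOP = 'stop'
    SPOT_INSTANCE = 'spot_instance'
    MULTI_NODE = 'multi_node'
    OPEN_PORTS = 'open_ports'


class NotSupportedError(Exception):
    """Raised when a requested feature is in the unsupported table."""


UNSUPPORTED: Dict[Feature, str] = {
    Feature.STOP: 'Stopping not supported.',
    Feature.MULTI_NODE: 'Multi-node not supported yet.',
}


def check_features_are_supported(requested: Set[Feature],
                                 unsupported: Dict[Feature, str]) -> None:
    """Raise if any requested feature appears in the unsupported table."""
    reasons = {f: unsupported[f] for f in requested if f in unsupported}
    if reasons:
        # Sort by feature name so the message is deterministic.
        details = '; '.join(
            f'{f.name}: {reason}' for f, reason in sorted(
                reasons.items(), key=lambda item: item[0].name))
        raise NotSupportedError(f'Unsupported features: {details}')
```

Keeping the reason next to each feature lets the caller explain why a cloud was skipped rather than only reporting that it was.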