diff --git a/docs/source/cloud-setup/cloud-permissions/aws.rst b/docs/source/cloud-setup/cloud-permissions/aws.rst
index e34499df3b4..89510331988 100644
--- a/docs/source/cloud-setup/cloud-permissions/aws.rst
+++ b/docs/source/cloud-setup/cloud-permissions/aws.rst
@@ -148,7 +148,7 @@ AWS accounts can be attached with a policy that limits the permissions of the ac
    :align: center
    :alt: AWS Add Policy

-8. **Optional**: If you would like to have your users access S3 buckets: You can additionally attach S3 access, such as the "AmazonS3FullAccess" policy.
+8. **Optional**: If you would like to have your users access S3 buckets, you can additionally attach S3 access, such as the "AmazonS3FullAccess" policy. Note that enabling S3 access is required to use :ref:`managed-jobs` with ``workdir`` or ``file_mounts`` for now.

 .. image:: ../../images/screenshots/aws/aws-s3-policy.png
    :width: 80%
diff --git a/docs/source/examples/interactive-development.rst b/docs/source/examples/interactive-development.rst
index cc50f8e6ea8..40920934597 100644
--- a/docs/source/examples/interactive-development.rst
+++ b/docs/source/examples/interactive-development.rst
@@ -110,7 +110,7 @@ This is supported by simply connecting VSCode to the cluster with the cluster na

 For more details, please refer to the `VSCode documentation `__.

-.. image:: https://imgur.com/8mKfsET.gif
+.. image:: https://i.imgur.com/8mKfsET.gif
     :align: center
     :alt: Connect to the cluster with VSCode

diff --git a/docs/source/getting-started/installation.rst b/docs/source/getting-started/installation.rst
index be7ae1ff327..9f251a5aafe 100644
--- a/docs/source/getting-started/installation.rst
+++ b/docs/source/getting-started/installation.rst
@@ -301,13 +301,12 @@ RunPod

 Fluidstack
 ~~~~~~~~~~~~~~~~~~

-`Fluidstack `__ is a cloud provider offering low-cost GPUs. To configure Fluidstack access, go to the `Home `__ page on your Fluidstack console to generate an API key and then add the :code:`API key` to :code:`~/.fluidstack/api_key` and the :code:`API token` to :code:`~/.fluidstack/api_token`:
-
+`Fluidstack `__ is a cloud provider offering low-cost GPUs. To configure Fluidstack access, go to the `Home `__ page on your Fluidstack console to generate an API key and then add the :code:`API key` to :code:`~/.fluidstack/api_key`:

 .. code-block:: shell

   mkdir -p ~/.fluidstack
   echo "your_api_key_here" > ~/.fluidstack/api_key
-  echo "your_api_token_here" > ~/.fluidstack/api_token
+


 Cudo Compute
diff --git a/llm/codellama/README.md b/llm/codellama/README.md
index 8e5025d22b5..f145fd062ff 100644
--- a/llm/codellama/README.md
+++ b/llm/codellama/README.md
@@ -10,14 +10,14 @@ The followings are the demos of Code Llama 70B hosted by SkyPilot Serve (aka Sky
 ## Demos
- +
Coding Assistant: Connect to hosted Code Llama with Tabby in VS Code
- +
Chat: Connect to hosted Code Llama with FastChat
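A recurring fix across the docs in this patch is rewriting `https://imgur.com/...` image URLs to `https://i.imgur.com/...`, since the bare `imgur.com` form serves an HTML page rather than the raw image and therefore breaks `<img>` tags and `.. image::` directives. A minimal sketch of a checker one could run to catch any remaining bare links (a hypothetical helper, not part of this patch):

```python
# check_imgur_links.py -- illustrative only; not part of this patch.
import pathlib
import re

# Matches bare imgur page links, but not direct i.imgur.com image links.
BARE_IMGUR = re.compile(r'https://imgur\.com/\S+')

def find_bare_imgur_links(root: str) -> None:
    for path in pathlib.Path(root).rglob('*'):
        if path.suffix not in {'.md', '.rst'}:
            continue
        text = path.read_text(encoding='utf-8', errors='ignore')
        for lineno, line in enumerate(text.splitlines(), start=1):
            if BARE_IMGUR.search(line):
                print(f'{path}:{lineno}: {line.strip()}')

if __name__ == '__main__':
    find_bare_imgur_links('.')
```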
diff --git a/llm/falcon/README.md b/llm/falcon/README.md index 837e93f5558..6eb480d9ea8 100644 --- a/llm/falcon/README.md +++ b/llm/falcon/README.md @@ -50,7 +50,7 @@ sky launch -c falcon -s falcon.yaml --no-use-spot For reference, below is a loss graph you may expect to see, and the amount of time and the approximate cost of fine-tuning each of the models over 500 epochs (assuming a spot instance A100 GPU rate at $1.1 / hour and a A100-80GB rate of $1.61 / hour): -image +image 1. `ybelkada/falcon-7b-sharded-bf16`: 2.5 to 3 hours using 1 A100 spot GPU; total cost ≈ $3.3. diff --git a/llm/gpt-2/README.md b/llm/gpt-2/README.md index bc9893fec5b..10fa2cf6998 100644 --- a/llm/gpt-2/README.md +++ b/llm/gpt-2/README.md @@ -28,14 +28,14 @@ Run the following command to start GPT-2 (124M) training on a GPU VM with 8 A100 sky launch -c gpt2 gpt2.yaml ``` -![GPT-2 training with 8 A100 GPUs](https://imgur.com/v8SGpsF.png) +![GPT-2 training with 8 A100 GPUs](https://i.imgur.com/v8SGpsF.png) Or, you can train the model with a single A100, by adding `--gpus A100`: ```bash sky launch -c gpt2 gpt2.yaml --gpus A100 ``` -![GPT-2 training with a single A100](https://imgur.com/hN65g4r.png) +![GPT-2 training with a single A100](https://i.imgur.com/hN65g4r.png) It is also possible to speed up the training of the model on 8 H100 (2.3x more tok/s than 8x A100s): @@ -43,7 +43,7 @@ It is also possible to speed up the training of the model on 8 H100 (2.3x more t sky launch -c gpt2 gpt2.yaml --gpus H100:8 ``` -![GPT-2 training with 8 H100](https://imgur.com/STbi80b.png) +![GPT-2 training with 8 H100](https://i.imgur.com/STbi80b.png) ### Download logs and visualizations @@ -54,7 +54,7 @@ scp -r gpt2:~/llm.c/log124M . We can visualize the training progress with the notebook provided in [llm.c](https://github.com/karpathy/llm.c/blob/master/dev/vislog.ipynb). (Note: we cut off the training after 10K steps, which already achieve similar validation loss as OpenAI GPT-2 checkpoint.)
- +
> Yes! We are able to reproduce the training of GPT-2 (124M) on any cloud with SkyPilot. diff --git a/llm/llama-2/README.md b/llm/llama-2/README.md index d8f8151572e..4f1a8f60cae 100644 --- a/llm/llama-2/README.md +++ b/llm/llama-2/README.md @@ -94,6 +94,6 @@ You can also host the official FAIR model without using huggingface and gradio. ``` 3. Open http://localhost:7681 in your browser and start chatting! -LLaMA chatbot running on the cloud via SkyPilot +LLaMA chatbot running on the cloud via SkyPilot diff --git a/llm/llama-3/README.md b/llm/llama-3/README.md index d0c28dc93c6..ef19d94b5c0 100644 --- a/llm/llama-3/README.md +++ b/llm/llama-3/README.md @@ -5,7 +5,7 @@

-Llama-3 x SkyPilot +Llama-3 x SkyPilot

[Llama-3](https://github.com/meta-llama/llama3) is the latest top open-source LLM from Meta. It has been released with a license that authorizes commercial use. You can deploy a private Llama-3 chatbot with SkyPilot in your own cloud with just one simple command. @@ -248,7 +248,7 @@ To use the Gradio UI, open the URL shown in the logs:

-Gradio UI serving Llama-3 +Gradio UI serving Llama-3

To stop the instance: diff --git a/llm/llama-3_1-finetuning/readme.md b/llm/llama-3_1-finetuning/readme.md index 836f3bf1b3b..935dccde84e 100644 --- a/llm/llama-3_1-finetuning/readme.md +++ b/llm/llama-3_1-finetuning/readme.md @@ -135,7 +135,7 @@ sky launch -c llama31 lora.yaml \
- +
Training Loss of LoRA finetuning Llama 3.1
@@ -218,10 +218,10 @@ run: | ## Appendix: Preparation 1. Request the access to [Llama 3.1 weights on huggingface](https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct) (Click on the blue box and follow the steps): -![](https://imgur.com/snIQhr9.png) +![](https://i.imgur.com/snIQhr9.png) 2. Get your [huggingface access token](https://huggingface.co/settings/tokens): -![](https://imgur.com/3idBgHn.png) +![](https://i.imgur.com/3idBgHn.png) 3. Add huggingface token to your environment variable: diff --git a/llm/lorax/README.md b/llm/lorax/README.md index 2fe548c92a8..6cc44cf1134 100644 --- a/llm/lorax/README.md +++ b/llm/lorax/README.md @@ -4,7 +4,7 @@

- LoRAX + LoRAX

[LoRAX](https://github.com/predibase/lorax) (LoRA eXchange) is a framework that allows users to serve thousands of fine-tuned LLMs on a single GPU, dramatically reducing the cost of serving without compromising on throughput or latency. It works by dynamically loading multiple fine-tuned "adapters" (LoRAs, etc.) on top of a single base model at runtime. Concurrent requests for different adapters can be processed together in a single batch, allowing LoRAX to maintain near linear throughput scaling as the number of adapters increases. diff --git a/llm/vicuna-llama-2/README.md b/llm/vicuna-llama-2/README.md index 899792c299d..24caa525a56 100644 --- a/llm/vicuna-llama-2/README.md +++ b/llm/vicuna-llama-2/README.md @@ -1,6 +1,6 @@ # Train Your Own Vicuna on Llama-2 -![Vicuna-Llama-2](https://imgur.com/McZWg6z.gif "Result model in action, trained using this guide. From the SkyPilot and Vicuna teams.") +![Vicuna-Llama-2](https://i.imgur.com/McZWg6z.gif "Result model in action, trained using this guide. From the SkyPilot and Vicuna teams.") Meta released [Llama 2](https://ai.meta.com/llama/) two weeks ago and has made a big wave in the AI community. In our opinion, its biggest impact is that the model is now released under a [permissive license](https://github.com/facebookresearch/llama/blob/main/LICENSE) that **allows the model weights to be used commercially**[^1]. This differs from Llama 1 which cannot be used commercially. @@ -106,7 +106,7 @@ sky launch --no-use-spot ...

- Optimizer + Optimizer

**Optional**: Try out the training for the 13B model: @@ -139,7 +139,7 @@ sky launch -c serve serve.yaml --env MODEL_CKPT=/chatbot/ ``` In [serve.yaml](https://github.com/skypilot-org/skypilot/tree/master/llm/vicuna-llama-2/serve.yaml), we specified launching a Gradio server that serves the model checkpoint at `/chatbot/7b`. -![Vicuna-Llama-2](https://imgur.com/McZWg6z.gif "Serving the resulting model with Gradio.") +![Vicuna-Llama-2](https://i.imgur.com/McZWg6z.gif "Serving the resulting model with Gradio.") > **Tip**: You can also switch to a cheaper accelerator, such as L4, to save costs, by adding `--gpus L4` to the above command. diff --git a/llm/vllm/README.md b/llm/vllm/README.md index e3a2befbecc..9fb3c0c1364 100644 --- a/llm/vllm/README.md +++ b/llm/vllm/README.md @@ -4,7 +4,7 @@

- vLLM + vLLM

This README contains instructions to run a demo for vLLM, an open-source library for fast LLM inference and serving, which improves the throughput compared to HuggingFace by **up to 24x**.
diff --git a/sky/clouds/fluidstack.py b/sky/clouds/fluidstack.py
index d292ace02f8..ef397d4c55e 100644
--- a/sky/clouds/fluidstack.py
+++ b/sky/clouds/fluidstack.py
@@ -15,8 +15,7 @@

 _CREDENTIAL_FILES = [
     # credential files for FluidStack,
-    fluidstack_utils.FLUIDSTACK_API_KEY_PATH,
-    fluidstack_utils.FLUIDSTACK_API_TOKEN_PATH,
+    fluidstack_utils.FLUIDSTACK_API_KEY_PATH
 ]

 if typing.TYPE_CHECKING:
     # Renaming to avoid shadowing variables.
@@ -189,20 +188,12 @@ def make_deploy_resources_variables(
             custom_resources = json.dumps(acc_dict, separators=(',', ':'))
         else:
             custom_resources = None
-        cuda_installation_commands = """
-        sudo wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64/cuda-keyring_1.1-1_all.deb -O /usr/local/cuda-keyring_1.1-1_all.deb;
-        sudo dpkg -i /usr/local/cuda-keyring_1.1-1_all.deb;
-        sudo apt-get update;
-        sudo apt-get -y install cuda-toolkit-12-3;
-        sudo apt-get install -y cuda-drivers;
-        sudo apt-get install -y python3-pip;
-        nvidia-smi || sudo reboot;"""
+
         return {
             'instance_type': resources.instance_type,
             'custom_resources': custom_resources,
             'region': region.name,
-            'fluidstack_username': self.default_username(region.name),
-            'cuda_installation_commands': cuda_installation_commands,
+            'fluidstack_username': 'ubuntu',
         }

     def _get_feasible_launchable_resources(
@@ -270,17 +261,26 @@ def check_credentials(cls) -> Tuple[bool, Optional[str]]:
         try:
             assert os.path.exists(
                 os.path.expanduser(fluidstack_utils.FLUIDSTACK_API_KEY_PATH))
-            assert os.path.exists(
-                os.path.expanduser(fluidstack_utils.FLUIDSTACK_API_TOKEN_PATH))
+
+            with open(os.path.expanduser(
+                    fluidstack_utils.FLUIDSTACK_API_KEY_PATH),
+                      encoding='UTF-8') as f:
+                api_key = f.read().strip()
+                if not api_key.startswith('api_key'):
+                    return False, ('Invalid FluidStack API key format. '
+                                   'To configure credentials, go to:\n    '
+                                   '  https://dashboard.fluidstack.io \n    '
+                                   'to obtain an API key, '
+                                   'then save the contents '
+                                   'to ~/.fluidstack/api_key \n')
         except AssertionError:
-            return False, (
-                'Failed to access FluidStack Cloud'
-                ' with credentials. '
-                'To configure credentials, go to:\n    '
-                '  https://console.fluidstack.io \n    '
-                'to obtain an API key and API Token, '
-                'then add save the contents '
-                'to ~/.fluidstack/api_key and ~/.fluidstack/api_token \n')
+            return False, ('Failed to access FluidStack Cloud'
+                           ' with credentials. '
+                           'To configure credentials, go to:\n    '
+                           '  https://dashboard.fluidstack.io \n    '
+                           'to obtain an API key, '
+                           'then save the contents '
+                           'to ~/.fluidstack/api_key \n')
        except requests.exceptions.ConnectionError:
            return False, ('Failed to verify FluidStack Cloud credentials.
' 'Check your network connection ' @@ -303,21 +303,6 @@ def validate_region_zone(self, region: Optional[str], zone: Optional[str]): zone, clouds='fluidstack') - @classmethod - def default_username(cls, region: str) -> str: - return { - 'norway_2_eu': 'ubuntu', - 'calgary_1_canada': 'ubuntu', - 'norway_3_eu': 'ubuntu', - 'norway_4_eu': 'ubuntu', - 'india_2': 'root', - 'nevada_1_usa': 'fsuser', - 'generic_1_canada': 'ubuntu', - 'iceland_1_eu': 'ubuntu', - 'new_york_1_usa': 'fsuser', - 'illinois_1_usa': 'fsuser' - }.get(region, 'ubuntu') - @classmethod def query_status( cls, diff --git a/sky/clouds/service_catalog/data_fetchers/fetch_azure.py b/sky/clouds/service_catalog/data_fetchers/fetch_azure.py index 9a7b2a90bee..bbd337e23aa 100644 --- a/sky/clouds/service_catalog/data_fetchers/fetch_azure.py +++ b/sky/clouds/service_catalog/data_fetchers/fetch_azure.py @@ -140,8 +140,12 @@ def get_pricing_df(region: Optional[str] = None) -> 'pd.DataFrame': print(f'Done fetching pricing {region}') df = pd.DataFrame(all_items) assert 'productName' in df.columns, (region, df.columns) - return df[(~df['productName'].str.contains(' Windows')) & - (df['unitPrice'] > 0)] + # Filter out the cloud services and windows products. + # Some H100 series use ' Win' instead of ' Windows', e.g. + # Virtual Machines NCCadsv5 Srs Win + return df[ + (~df['productName'].str.contains(' Win| Cloud Services| CloudServices')) + & (df['unitPrice'] > 0)] def get_sku_df(region_set: Set[str]) -> 'pd.DataFrame': diff --git a/sky/clouds/service_catalog/data_fetchers/fetch_fluidstack.py b/sky/clouds/service_catalog/data_fetchers/fetch_fluidstack.py index 5d50399ab89..cf943541e08 100644 --- a/sky/clouds/service_catalog/data_fetchers/fetch_fluidstack.py +++ b/sky/clouds/service_catalog/data_fetchers/fetch_fluidstack.py @@ -11,10 +11,140 @@ import requests -ENDPOINT = 'https://api.fluidstack.io/v1/plans' +ENDPOINT = 'https://platform.fluidstack.io/list_available_configurations' DEFAULT_FLUIDSTACK_API_KEY_PATH = os.path.expanduser('~/.fluidstack/api_key') -DEFAULT_FLUIDSTACK_API_TOKEN_PATH = os.path.expanduser( - '~/.fluidstack/api_token') + +plan_vcpus_memory = [{ + 'gpu_type': 'RTX_A6000_48GB', + 'gpu_count': 2, + 'min_cpu_count': 12, + 'min_memory': 110.0 +}, { + 'gpu_type': 'RTX_A6000_48GB', + 'gpu_count': 4, + 'min_cpu_count': 24, + 'min_memory': 220.0 +}, { + 'gpu_type': 'A100_NVLINK_80GB', + 'gpu_count': 8, + 'min_cpu_count': 252, + 'min_memory': 960.0 +}, { + 'gpu_type': 'H100_PCIE_80GB', + 'gpu_count': 8, + 'min_cpu_count': 252, + 'min_memory': 1440.0 +}, { + 'gpu_type': 'RTX_A4000_16GB', + 'gpu_count': 2, + 'min_cpu_count': 12, + 'min_memory': 48.0 +}, { + 'gpu_type': 'H100_PCIE_80GB', + 'gpu_count': 2, + 'min_cpu_count': 60, + 'min_memory': 360.0 +}, { + 'gpu_type': 'RTX_A6000_48GB', + 'gpu_count': 8, + 'min_cpu_count': 252, + 'min_memory': 464.0 +}, { + 'gpu_type': 'H100_NVLINK_80GB', + 'gpu_count': 8, + 'min_cpu_count': 252, + 'min_memory': 1440.0 +}, { + 'gpu_type': 'H100_PCIE_80GB', + 'gpu_count': 1, + 'min_cpu_count': 28, + 'min_memory': 180.0 +}, { + 'gpu_type': 'RTX_A5000_24GB', + 'gpu_count': 1, + 'min_cpu_count': 8, + 'min_memory': 30.0 +}, { + 'gpu_type': 'RTX_A5000_24GB', + 'gpu_count': 2, + 'min_cpu_count': 16, + 'min_memory': 60.0 +}, { + 'gpu_type': 'L40_48GB', + 'gpu_count': 2, + 'min_cpu_count': 64, + 'min_memory': 120.0 +}, { + 'gpu_type': 'RTX_A4000_16GB', + 'gpu_count': 8, + 'min_cpu_count': 48, + 'min_memory': 192.0 +}, { + 'gpu_type': 'RTX_A4000_16GB', + 'gpu_count': 1, + 'min_cpu_count': 6, + 'min_memory': 
24.0 +}, { + 'gpu_type': 'RTX_A4000_16GB', + 'gpu_count': 4, + 'min_cpu_count': 24, + 'min_memory': 96.0 +}, { + 'gpu_type': 'A100_PCIE_80GB', + 'gpu_count': 4, + 'min_cpu_count': 124, + 'min_memory': 480.0 +}, { + 'gpu_type': 'H100_PCIE_80GB', + 'gpu_count': 4, + 'min_cpu_count': 124, + 'min_memory': 720.0 +}, { + 'gpu_type': 'L40_48GB', + 'gpu_count': 8, + 'min_cpu_count': 252, + 'min_memory': 480.0 +}, { + 'gpu_type': 'RTX_A5000_24GB', + 'gpu_count': 8, + 'min_cpu_count': 64, + 'min_memory': 240.0 +}, { + 'gpu_type': 'L40_48GB', + 'gpu_count': 1, + 'min_cpu_count': 32, + 'min_memory': 60.0 +}, { + 'gpu_type': 'RTX_A6000_48GB', + 'gpu_count': 1, + 'min_cpu_count': 6, + 'min_memory': 55.0 +}, { + 'gpu_type': 'L40_48GB', + 'gpu_count': 4, + 'min_cpu_count': 126, + 'min_memory': 240.0 +}, { + 'gpu_type': 'A100_PCIE_80GB', + 'gpu_count': 1, + 'min_cpu_count': 28, + 'min_memory': 120.0 +}, { + 'gpu_type': 'A100_PCIE_80GB', + 'gpu_count': 8, + 'min_cpu_count': 252, + 'min_memory': 1440.0 +}, { + 'gpu_type': 'A100_PCIE_80GB', + 'gpu_count': 2, + 'min_cpu_count': 60, + 'min_memory': 240.0 +}, { + 'gpu_type': 'RTX_A5000_24GB', + 'gpu_count': 4, + 'min_cpu_count': 32, + 'min_memory': 120.0 +}] GPU_MAP = { 'H100_PCIE_80GB': 'H100', @@ -47,19 +177,15 @@ def get_regions(plans: List) -> dict: regions = {} for plan in plans: for region in plan.get('regions', []): - regions[region['id']] = region['id'] + regions[region] = region return regions def create_catalog(output_dir: str) -> None: - response = requests.get(ENDPOINT) + with open(DEFAULT_FLUIDSTACK_API_KEY_PATH, 'r', encoding='UTF-8') as f: + api_key = f.read().strip() + response = requests.get(ENDPOINT, headers={'api-key': api_key}) plans = response.json() - #plans = [plan for plan in plans if len(plan['regions']) > 0] - plans = [ - plan for plan in plans if plan['minimum_commitment'] == 'hourly' and - plan['type'] in ['preconfigured'] and - plan['gpu_type'] not in ['NO GPU', 'RTX_3080_10GB', 'RTX_3090_24GB'] - ] with open(os.path.join(output_dir, 'vms.csv'), mode='w', encoding='utf-8') as f: @@ -81,39 +207,45 @@ def create_catalog(output_dir: str) -> None: except KeyError: #print(f'Could not map {plan["gpu_type"]}') continue - gpu_memory = int( - str(plan['configuration']['gpu_memory']).replace('GB', - '')) * 1024 - gpu_cnt = int(plan['configuration']['gpu_count']) - vcpus = float(plan['configuration']['core_count']) - mem = float(plan['configuration']['ram']) - price = float(plan['price']['hourly']) * gpu_cnt - gpuinfo = { - 'Gpus': [{ - 'Name': gpu, - 'Manufacturer': 'NVIDIA', - 'Count': gpu_cnt, - 'MemoryInfo': { - 'SizeInMiB': int(gpu_memory) - }, - }], - 'TotalGpuMemoryInMiB': int(gpu_memory * gpu_cnt), - } - gpuinfo = json.dumps(gpuinfo).replace('"', "'") # pylint: disable=invalid-string-quote - for r in plan.get('regions', []): - if r['id'] == 'india_2': + for gpu_cnt in plan['gpu_counts']: + gpu_memory = float(plan['gpu_type'].split('_')[-1].replace( + 'GB', '')) * 1024 + try: + vcpus_mem = [ + x for x in plan_vcpus_memory + if x['gpu_type'] == plan['gpu_type'] and + x['gpu_count'] == gpu_cnt + ][0] + vcpus = vcpus_mem['min_cpu_count'] + mem = vcpus_mem['min_memory'] + except IndexError: continue - writer.writerow([ - plan['plan_id'], - gpu, - gpu_cnt, - vcpus, - mem, - price, - r['id'], - gpuinfo, - '', - ]) + price = float(plan['price_per_gpu_hr']) * gpu_cnt + gpuinfo = { + 'Gpus': [{ + 'Name': gpu, + 'Manufacturer': 'NVIDIA', + 'Count': gpu_cnt, + 'MemoryInfo': { + 'SizeInMiB': int(gpu_memory) + }, + }], + 'TotalGpuMemoryInMiB': 
int(gpu_memory * gpu_cnt), + } + gpuinfo = json.dumps(gpuinfo).replace('"', "'") # pylint: disable=invalid-string-quote + instance_type = f'{plan["gpu_type"]}::{gpu_cnt}' + for region in plan.get('regions', []): + writer.writerow([ + instance_type, + gpu, + gpu_cnt, + vcpus, + mem, + price, + region, + gpuinfo, + '', + ]) if __name__ == '__main__': diff --git a/sky/clouds/service_catalog/data_fetchers/fetch_gcp.py b/sky/clouds/service_catalog/data_fetchers/fetch_gcp.py index 9578196b4eb..5b680500c75 100644 --- a/sky/clouds/service_catalog/data_fetchers/fetch_gcp.py +++ b/sky/clouds/service_catalog/data_fetchers/fetch_gcp.py @@ -54,6 +54,110 @@ ,tpu-v3-1024,1,,,tpu-v3-1024,1024.0,307.2,us-east1,us-east1-d ,tpu-v3-2048,1,,,tpu-v3-2048,2048.0,614.4,us-east1,us-east1-d """))) + +# TPU V5 is not visible in specific zones. We hardcode the missing zones here. +# NOTE(dev): Keep the zones and the df in sync. +TPU_V5_MISSING_ZONES_DF = { + 'europe-west4-b': pd.read_csv( + io.StringIO( + textwrap.dedent("""\ + AcceleratorName,AcceleratorCount,Region,AvailabilityZone + tpu-v5p-8,1,europe-west4,europe-west4-b + tpu-v5p-16,1,europe-west4,europe-west4-b + tpu-v5p-32,1,europe-west4,europe-west4-b + tpu-v5p-64,1,europe-west4,europe-west4-b + tpu-v5p-128,1,europe-west4,europe-west4-b + tpu-v5p-256,1,europe-west4,europe-west4-b + tpu-v5p-384,1,europe-west4,europe-west4-b + tpu-v5p-512,1,europe-west4,europe-west4-b + tpu-v5p-640,1,europe-west4,europe-west4-b + tpu-v5p-768,1,europe-west4,europe-west4-b + tpu-v5p-896,1,europe-west4,europe-west4-b + tpu-v5p-1024,1,europe-west4,europe-west4-b + tpu-v5p-1152,1,europe-west4,europe-west4-b + tpu-v5p-1280,1,europe-west4,europe-west4-b + tpu-v5p-1408,1,europe-west4,europe-west4-b + tpu-v5p-1536,1,europe-west4,europe-west4-b + tpu-v5p-1664,1,europe-west4,europe-west4-b + tpu-v5p-1792,1,europe-west4,europe-west4-b + tpu-v5p-1920,1,europe-west4,europe-west4-b + tpu-v5p-2048,1,europe-west4,europe-west4-b + tpu-v5p-2176,1,europe-west4,europe-west4-b + tpu-v5p-2304,1,europe-west4,europe-west4-b + tpu-v5p-2432,1,europe-west4,europe-west4-b + tpu-v5p-2560,1,europe-west4,europe-west4-b + tpu-v5p-2688,1,europe-west4,europe-west4-b + tpu-v5p-2816,1,europe-west4,europe-west4-b + tpu-v5p-2944,1,europe-west4,europe-west4-b + tpu-v5p-3072,1,europe-west4,europe-west4-b + tpu-v5p-3200,1,europe-west4,europe-west4-b + tpu-v5p-3328,1,europe-west4,europe-west4-b + tpu-v5p-3456,1,europe-west4,europe-west4-b + tpu-v5p-3584,1,europe-west4,europe-west4-b + tpu-v5p-3712,1,europe-west4,europe-west4-b + tpu-v5p-3840,1,europe-west4,europe-west4-b + tpu-v5p-3968,1,europe-west4,europe-west4-b + tpu-v5p-4096,1,europe-west4,europe-west4-b + tpu-v5p-4224,1,europe-west4,europe-west4-b + tpu-v5p-4352,1,europe-west4,europe-west4-b + tpu-v5p-4480,1,europe-west4,europe-west4-b + tpu-v5p-4608,1,europe-west4,europe-west4-b + tpu-v5p-4736,1,europe-west4,europe-west4-b + tpu-v5p-4864,1,europe-west4,europe-west4-b + tpu-v5p-4992,1,europe-west4,europe-west4-b + tpu-v5p-5120,1,europe-west4,europe-west4-b + tpu-v5p-5248,1,europe-west4,europe-west4-b + tpu-v5p-5376,1,europe-west4,europe-west4-b + tpu-v5p-5504,1,europe-west4,europe-west4-b + tpu-v5p-5632,1,europe-west4,europe-west4-b + tpu-v5p-5760,1,europe-west4,europe-west4-b + tpu-v5p-5888,1,europe-west4,europe-west4-b + tpu-v5p-6016,1,europe-west4,europe-west4-b + tpu-v5p-6144,1,europe-west4,europe-west4-b + tpu-v5p-6272,1,europe-west4,europe-west4-b + tpu-v5p-6400,1,europe-west4,europe-west4-b + tpu-v5p-6528,1,europe-west4,europe-west4-b + 
tpu-v5p-6656,1,europe-west4,europe-west4-b + tpu-v5p-6784,1,europe-west4,europe-west4-b + tpu-v5p-6912,1,europe-west4,europe-west4-b + tpu-v5p-7040,1,europe-west4,europe-west4-b + tpu-v5p-7168,1,europe-west4,europe-west4-b + tpu-v5p-7296,1,europe-west4,europe-west4-b + tpu-v5p-7424,1,europe-west4,europe-west4-b + tpu-v5p-7552,1,europe-west4,europe-west4-b + tpu-v5p-7680,1,europe-west4,europe-west4-b + tpu-v5p-7808,1,europe-west4,europe-west4-b + tpu-v5p-7936,1,europe-west4,europe-west4-b + tpu-v5p-8064,1,europe-west4,europe-west4-b + tpu-v5p-8192,1,europe-west4,europe-west4-b + tpu-v5p-8320,1,europe-west4,europe-west4-b + tpu-v5p-8448,1,europe-west4,europe-west4-b + tpu-v5p-8704,1,europe-west4,europe-west4-b + tpu-v5p-8832,1,europe-west4,europe-west4-b + tpu-v5p-8960,1,europe-west4,europe-west4-b + tpu-v5p-9216,1,europe-west4,europe-west4-b + tpu-v5p-9472,1,europe-west4,europe-west4-b + tpu-v5p-9600,1,europe-west4,europe-west4-b + tpu-v5p-9728,1,europe-west4,europe-west4-b + tpu-v5p-9856,1,europe-west4,europe-west4-b + tpu-v5p-9984,1,europe-west4,europe-west4-b + tpu-v5p-10240,1,europe-west4,europe-west4-b + tpu-v5p-10368,1,europe-west4,europe-west4-b + tpu-v5p-10496,1,europe-west4,europe-west4-b + tpu-v5p-10752,1,europe-west4,europe-west4-b + tpu-v5p-10880,1,europe-west4,europe-west4-b + tpu-v5p-11008,1,europe-west4,europe-west4-b + tpu-v5p-11136,1,europe-west4,europe-west4-b + tpu-v5p-11264,1,europe-west4,europe-west4-b + tpu-v5p-11520,1,europe-west4,europe-west4-b + tpu-v5p-11648,1,europe-west4,europe-west4-b + tpu-v5p-11776,1,europe-west4,europe-west4-b + tpu-v5p-11904,1,europe-west4,europe-west4-b + tpu-v5p-12032,1,europe-west4,europe-west4-b + tpu-v5p-12160,1,europe-west4,europe-west4-b + tpu-v5p-12288,1,europe-west4,europe-west4-b + """))) +} # FIXME(woosuk): Remove this once the bug is fixed. # See https://github.com/skypilot-org/skypilot/issues/1759#issue-1619614345 TPU_V4_HOST_DF = pd.read_csv( @@ -415,6 +519,12 @@ def get_gpu_price(row: pd.Series, spot: bool) -> Optional[float]: def _get_tpu_for_zone(zone: str) -> 'pd.DataFrame': + # Use hardcoded TPU V5 data as it is invisible in some zones. + missing_tpus_df = pd.DataFrame(columns=[ + 'AcceleratorName', 'AcceleratorCount', 'Region', 'AvailabilityZone' + ]) + if zone in TPU_V5_MISSING_ZONES_DF: + missing_tpus_df = TPU_V5_MISSING_ZONES_DF[zone] tpus = [] parent = f'projects/{project_id}/locations/{zone}' tpus_request = tpu_client.projects().locations().acceleratorTypes().list( @@ -432,16 +542,14 @@ def _get_tpu_for_zone(zone: str) -> 'pd.DataFrame': new_tpus = [] for tpu in tpus: tpu_name = tpu['type'] - # skip tpu v5 as we currently don't support it - if 'v5' in tpu_name: - continue new_tpus.append({ 'AcceleratorName': f'tpu-{tpu_name}', 'AcceleratorCount': 1, 'Region': zone.rpartition('-')[0], 'AvailabilityZone': zone, }) - return pd.DataFrame(new_tpus).reset_index(drop=True) + new_tpu_df = pd.DataFrame(new_tpus).reset_index(drop=True) + return pd.concat([new_tpu_df, missing_tpus_df]) def _get_tpus() -> 'pd.DataFrame': @@ -458,11 +566,22 @@ def _get_tpus() -> 'pd.DataFrame': # TODO: the TPUs fetched fails to contain us-east1 -def get_tpu_df(skus: List[Dict[str, Any]]) -> 'pd.DataFrame': +def get_tpu_df(gce_skus: List[Dict[str, Any]], + tpu_skus: List[Dict[str, Any]]) -> 'pd.DataFrame': df = _get_tpus() if df.empty: return df + def _get_tpu_description_str(tpu_version: str) -> str: + # TPU V5 has a different naming convention since it is contained in + # the GCE SKUs. v5p -> TpuV5p, v5litepod -> TpuV5e. 
+        if tpu_version.startswith('v5'):
+            if tpu_version == 'v5p':
+                return 'TpuV5p'
+            assert tpu_version == 'v5litepod', tpu_version
+            return 'TpuV5e'
+        return f'Tpu-{tpu_version}'
+
     def get_tpu_price(row: pd.Series, spot: bool) -> Optional[float]:
         assert row['AcceleratorCount'] == 1, row
         tpu_price = None
@@ -475,9 +594,12 @@ def get_tpu_price(row: pd.Series, spot: bool) -> Optional[float]:
         # whether the TPU is a single device or a pod.
         # For TPU-v4, the pricing is uniform, and thus the pricing API
         # only provides the price of TPU-v4 pods.
-        is_pod = num_cores > 8 or tpu_version == 'v4'
+        # The price shown for v5 TPUs is per chip hour, so there is no 'Pod'
+        # keyword in the description.
+        is_pod = ((num_cores > 8 or tpu_version == 'v4') and
+                  not tpu_version.startswith('v5'))

-        for sku in skus:
+        for sku in gce_skus + tpu_skus:
             if tpu_region not in sku['serviceRegions']:
                 continue
             description = sku['description']
@@ -489,7 +611,7 @@
             if 'Preemptible' in description:
                 continue

-            if f'Tpu-{tpu_version}' not in description:
+            if _get_tpu_description_str(tpu_version) not in description:
                 continue
             if is_pod:
                 if 'Pod' not in description:
@@ -500,7 +622,15 @@
             unit_price = _get_unit_price(sku)
             tpu_device_price = unit_price
-            tpu_core_price = tpu_device_price / 8
+            # The v5p naming convention is v$VERSION_NUMBERp-$CORES_COUNT,
+            # while v5e is v$VERSION_NUMBER-$CHIP_COUNT. At the same time, the
+            # v5 price is shown as a per-chip price, where a chip is 2 cores
+            # for v5p and 1 core for v5e. Reference:
+            # https://cloud.google.com/tpu/docs/v5p#using-accelerator-type
+            # https://cloud.google.com/tpu/docs/v5e#tpu-v5e-config
+            core_per_sku = (1 if tpu_version == 'v5litepod' else
+                            2 if tpu_version == 'v5p' else 8)
+            tpu_core_price = tpu_device_price / core_per_sku
             tpu_price = num_cores * tpu_core_price
             break
@@ -546,7 +676,8 @@ def get_catalog_df(region_prefix: str) -> 'pd.DataFrame':
                                region_prefix)] if not gpu_df.empty else gpu_df

     gcp_tpu_skus = get_skus(TPU_SERVICE_ID)
-    tpu_df = get_tpu_df(gcp_tpu_skus)
+    # TPU V5 SKU is not included in the TPU SKUs but in the GCE SKUs.
+    tpu_df = get_tpu_df(gcp_skus, gcp_tpu_skus)

     # Merge the dataframes.
     df = pd.concat([vm_df, gpu_df, tpu_df, TPU_V4_HOST_DF])
diff --git a/sky/data/storage.py b/sky/data/storage.py
index f09d79ea48e..b915d1c6d54 100644
--- a/sky/data/storage.py
+++ b/sky/data/storage.py
@@ -1,5 +1,6 @@
 """Storage and Store Classes for Sky Data."""
 import enum
+import hashlib
 import os
 import re
 import shlex
@@ -1942,8 +1943,15 @@ class AzureBlobStore(AbstractStore):
     """Represents the backend for Azure Blob Storage Container."""

     _ACCESS_DENIED_MESSAGE = 'Access Denied'
-    DEFAULT_STORAGE_ACCOUNT_NAME = 'sky{region}{user_hash}'
     DEFAULT_RESOURCE_GROUP_NAME = 'sky{user_hash}'
+    # Unlike resource group names, which only need to be unique within the
+    # subscription, storage account names must be globally unique across all
+    # Azure users. Hence, the storage account name includes the subscription
+    # hash as well to ensure its uniqueness.
+    DEFAULT_STORAGE_ACCOUNT_NAME = (
+        'sky{region_hash}{user_hash}{subscription_hash}')
+    _SUBSCRIPTION_HASH_LENGTH = 4
+    _REGION_HASH_LENGTH = 4

     class AzureBlobStoreMetadata(AbstractStore.StoreMetadata):
         """A pickle-able representation of Azure Blob Store.
@@ -1977,7 +1985,7 @@ def __init__(self,
                  name: str,
                  source: str,
                  storage_account_name: str = '',
-                 region: Optional[str] = None,
+                 region: Optional[str] = 'eastus',
                  is_sky_managed: Optional[bool] = None,
                  sync_on_reconstruction: bool = True):
         self.storage_client: 'storage.Client'
@@ -2156,6 +2164,41 @@ def initialize(self):
             # If is_sky_managed is specified, then we take no action.
             self.is_sky_managed = is_new_bucket

+    @staticmethod
+    def get_default_storage_account_name(region: Optional[str]) -> str:
+        """Generates a unique default storage account name.
+
+        The subscription ID is included to avoid conflicts when the user
+        switches subscriptions. The lengths of region_hash, user_hash, and
+        subscription_hash are adjusted to ensure the storage account name
+        adheres to the 24-character limit, as some region names can be very
+        long. Using a 4-character hash for the region helps keep the name
+        concise and prevents potential conflicts.
+        Reference: https://learn.microsoft.com/en-us/azure/azure-resource-manager/management/resource-name-rules#microsoftstorage # pylint: disable=line-too-long
+
+        Args:
+            region: Name of the region to create the storage account/container.
+
+        Returns:
+            Name of the default storage account.
+        """
+        assert region is not None
+        subscription_id = azure.get_subscription_id()
+        subscription_hash_obj = hashlib.md5(subscription_id.encode('utf-8'))
+        subscription_hash = subscription_hash_obj.hexdigest(
+        )[:AzureBlobStore._SUBSCRIPTION_HASH_LENGTH]
+        region_hash_obj = hashlib.md5(region.encode('utf-8'))
+        region_hash = region_hash_obj.hexdigest()[:AzureBlobStore.
+                                                  _REGION_HASH_LENGTH]
+
+        storage_account_name = (
+            AzureBlobStore.DEFAULT_STORAGE_ACCOUNT_NAME.format(
+                region_hash=region_hash,
+                user_hash=common_utils.get_user_hash(),
+                subscription_hash=subscription_hash))
+
+        return storage_account_name
+
     def _get_storage_account_and_resource_group(
             self) -> Tuple[str, Optional[str]]:
         """Get storage account and resource group to be used for AzureBlobStore
@@ -2239,10 +2282,8 @@ def _get_storage_account_and_resource_group(
             else:
                 # If storage account name is not provided from config, then
                 # use default resource group and storage account names.
-                storage_account_name = (
-                    self.DEFAULT_STORAGE_ACCOUNT_NAME.format(
-                        region=self.region,
-                        user_hash=common_utils.get_user_hash()))
+                storage_account_name = self.get_default_storage_account_name(
+                    self.region)
                 resource_group_name = (self.DEFAULT_RESOURCE_GROUP_NAME.format(
                     user_hash=common_utils.get_user_hash()))
             try:
diff --git a/sky/provision/azure/config.py b/sky/provision/azure/config.py
index 146deaa6781..7b50c3d8c0f 100644
--- a/sky/provision/azure/config.py
+++ b/sky/provision/azure/config.py
@@ -3,6 +3,7 @@
 Creates the resource group and deploys the configuration template to Azure
 for a cluster to be launched.
""" +import hashlib import json import logging from pathlib import Path @@ -15,6 +16,7 @@ logger = logging.getLogger(__name__) +UNIQUE_ID_LEN = 4 _DEPLOYMENT_NAME = 'skypilot-config' _LEGACY_DEPLOYMENT_NAME = 'ray-config' _RESOURCE_GROUP_WAIT_FOR_DELETION_TIMEOUT = 480 # 8 minutes @@ -103,10 +105,12 @@ def bootstrap_instances( logger.info(f'Using cluster name: {cluster_name_on_cloud}') + hasher = hashlib.md5(provider_config['resource_group'].encode('utf-8')) + unique_id = hasher.hexdigest()[:UNIQUE_ID_LEN] subnet_mask = provider_config.get('subnet_mask') if subnet_mask is None: # choose a random subnet, skipping most common value of 0 - random.seed(cluster_name_on_cloud) + random.seed(unique_id) subnet_mask = f'10.{random.randint(1, 254)}.0.0/16' logger.info(f'Using subnet mask: {subnet_mask}') @@ -119,10 +123,10 @@ def bootstrap_instances( 'value': subnet_mask }, 'clusterId': { - # We use the cluster name as the unique ID for the cluster, - # as we have already appended the user hash to the cluster - # name. - 'value': cluster_name_on_cloud + # We use the cluster name + resource group hash as the + # unique ID for the cluster, as we need to make sure that + # the deployments have unique names during failover. + 'value': f'{cluster_name_on_cloud}-{unique_id}' }, }, } diff --git a/sky/provision/docker_utils.py b/sky/provision/docker_utils.py index aa29a3666a3..e989fbc085a 100644 --- a/sky/provision/docker_utils.py +++ b/sky/provision/docker_utils.py @@ -381,7 +381,7 @@ def _configure_runtime(self, run_options: List[str]) -> List[str]: if 'nvidia-container-runtime' in runtime_output: try: self._run('nvidia-smi', log_err_when_fail=False) - return run_options + ['--runtime=nvidia'] + return run_options + ['--runtime=nvidia', '--gpus all'] except Exception as e: # pylint: disable=broad-except logger.debug( 'Nvidia Container Runtime is present in the docker image' diff --git a/sky/provision/fluidstack/fluidstack_utils.py b/sky/provision/fluidstack/fluidstack_utils.py index ebc616c0bfc..a9efb865a3c 100644 --- a/sky/provision/fluidstack/fluidstack_utils.py +++ b/sky/provision/fluidstack/fluidstack_utils.py @@ -3,7 +3,8 @@ import functools import json import os -from typing import Any, Dict, List, Optional +import time +from typing import Any, Dict, List import uuid import requests @@ -13,9 +14,8 @@ def get_key_suffix(): return str(uuid.uuid4()).replace('-', '')[:8] -ENDPOINT = 'https://api.fluidstack.io/v1/' +ENDPOINT = 'https://platform.fluidstack.io/' FLUIDSTACK_API_KEY_PATH = '~/.fluidstack/api_key' -FLUIDSTACK_API_TOKEN_PATH = '~/.fluidstack/api_token' def read_contents(path: str) -> str: @@ -46,109 +46,76 @@ def raise_fluidstack_error(response: requests.Response) -> None: raise FluidstackAPIError(f'{message}', status_code) -@functools.lru_cache() -def with_nvidia_drivers(region: str): - if region in ['norway_4_eu', 'generic_1_canada']: - return False - client = FluidstackClient() - plans = client.get_plans() - for plan in plans: - if region in [r['id'] for r in plan['regions']]: - if 'Ubuntu 20.04 LTS (Nvidia)' in plan['os_options']: - return True - return False - - class FluidstackClient: """FluidStack API Client""" def __init__(self): self.api_key = read_contents( - os.path.expanduser(FLUIDSTACK_API_KEY_PATH)) - self.api_token = read_contents( - os.path.expanduser(FLUIDSTACK_API_TOKEN_PATH)) + os.path.expanduser(FLUIDSTACK_API_KEY_PATH)).strip() def get_plans(self): - response = requests.get(ENDPOINT + 'plans') + response = requests.get(ENDPOINT + 'list_available_configurations', + 
headers={'api-key': self.api_key}) raise_fluidstack_error(response) plans = response.json() - plans = [ - plan for plan in plans - if plan['minimum_commitment'] == 'hourly' and plan['type'] in - ['preconfigured', 'custom'] and plan['gpu_type'] != 'NO GPU' - ] return plans - def list_instances( - self, - tag_filters: Optional[Dict[str, - str]] = None) -> List[Dict[str, Any]]: + def list_instances(self) -> List[Dict[str, Any]]: response = requests.get( - ENDPOINT + 'servers', - auth=(self.api_key, self.api_token), + ENDPOINT + 'instances', + headers={'api-key': self.api_key}, ) raise_fluidstack_error(response) instances = response.json() - filtered_instances = [] - - for instance in instances: - if isinstance(instance['tags'], str): - instance['tags'] = json.loads(instance['tags']) - if not instance['tags']: - instance['tags'] = {} - if tag_filters: - for key in tag_filters: - if instance['tags'].get(key, None) != tag_filters[key]: - break - else: - filtered_instances.append(instance) - else: - filtered_instances.append(instance) - - return filtered_instances + return instances def create_instance( self, instance_type: str = '', - hostname: str = '', + name: str = '', region: str = '', ssh_pub_key: str = '', count: int = 1, ) -> List[str]: """Launch new instances.""" - config: Dict[str, Any] = {} plans = self.get_plans() regions = self.list_regions() + gpu_type, gpu_count = instance_type.split('::') + gpu_count = int(gpu_count) + plans = [ - plan for plan in plans if plan['plan_id'] == instance_type and - region in [r['id'] for r in plan['regions']] + plan for plan in plans if plan['gpu_type'] == gpu_type and + gpu_count in plan['gpu_counts'] and region in plan['regions'] ] if not plans: raise FluidstackAPIError( f'Plan {instance_type} out of stock in region {region}') ssh_key = self.get_or_add_ssh_key(ssh_pub_key) - os_id = 'Ubuntu 20.04 LTS' - body = dict(plan=None if config else instance_type, - region=regions[region], - os=os_id, - hostname=hostname, - ssh_keys=[ssh_key['id']], - multiplicity=count, - config=config) - - response = requests.post(ENDPOINT + 'server', - auth=(self.api_key, self.api_token), - json=body) - raise_fluidstack_error(response) - instance_ids = response.json().get('multiple') - assert all(id is not None for id in instance_ids), instance_ids + default_operating_system = 'ubuntu_22_04_lts_nvidia' + instance_ids = [] + for _ in range(count): + body = dict(gpu_type=gpu_type, + gpu_count=gpu_count, + region=regions[region], + operating_system_label=default_operating_system, + name=name, + ssh_key=ssh_key['name']) + + response = requests.post(ENDPOINT + 'instances', + headers={'api-key': self.api_key}, + json=body) + raise_fluidstack_error(response) + instance_id = response.json().get('id') + instance_ids.append(instance_id) + time.sleep(1) + return instance_ids def list_ssh_keys(self): - response = requests.get(ENDPOINT + 'ssh', - auth=(self.api_key, self.api_token)) + response = requests.get(ENDPOINT + 'ssh_keys', + headers={'api-key': self.api_key}) raise_fluidstack_error(response) return response.json() @@ -156,86 +123,50 @@ def get_or_add_ssh_key(self, ssh_pub_key: str = '') -> Dict[str, str]: """Add ssh key if not already added.""" ssh_keys = self.list_ssh_keys() for key in ssh_keys: - if key['public_key'].strip() == ssh_pub_key.strip(): - return { - 'id': key['id'], - 'name': key['name'], - 'ssh_key': ssh_pub_key - } + if key['public_key'].strip().split()[:2] == ssh_pub_key.strip( + ).split()[:2]: + return {'name': key['name'], 'ssh_key': ssh_pub_key} ssh_key_name 
= 'skypilot-' + get_key_suffix() response = requests.post( - ENDPOINT + 'ssh', - auth=(self.api_key, self.api_token), + ENDPOINT + 'ssh_keys', + headers={'api-key': self.api_key}, json=dict(name=ssh_key_name, public_key=ssh_pub_key), ) raise_fluidstack_error(response) - key_id = response.json()['id'] - return {'id': key_id, 'name': ssh_key_name, 'ssh_key': ssh_pub_key} + return {'name': ssh_key_name, 'ssh_key': ssh_pub_key} @functools.lru_cache() def list_regions(self): - response = requests.get(ENDPOINT + 'plans') - raise_fluidstack_error(response) - plans = response.json() - plans = [ - plan for plan in plans - if plan['minimum_commitment'] == 'hourly' and plan['type'] in - ['preconfigured', 'custom'] and plan['gpu_type'] != 'NO GPU' - ] + plans = self.get_plans() def get_regions(plans: List) -> dict: """Return a list of regions where the plan is available.""" regions = {} for plan in plans: for region in plan.get('regions', []): - regions[region['id']] = region['id'] + regions[region] = region return regions regions = get_regions(plans) return regions def delete(self, instance_id: str): - response = requests.delete(ENDPOINT + 'server/' + instance_id, - auth=(self.api_key, self.api_token)) + response = requests.delete(ENDPOINT + 'instances/' + instance_id, + headers={'api-key': self.api_key}) raise_fluidstack_error(response) return response.json() def stop(self, instance_id: str): - response = requests.put(ENDPOINT + 'server/' + instance_id + '/stop', - auth=(self.api_key, self.api_token)) - raise_fluidstack_error(response) - return response.json() - - def restart(self, instance_id: str): - response = requests.post(ENDPOINT + 'server/' + instance_id + '/reboot', - auth=(self.api_key, self.api_token)) - raise_fluidstack_error(response) - return response.json() - - def info(self, instance_id: str): - response = requests.get(ENDPOINT + f'server/{instance_id}', - auth=(self.api_key, self.api_token)) - raise_fluidstack_error(response) - return response.json() - - def status(self, instance_id: str): - response = self.info(instance_id) - return response['status'] - - def add_tags(self, instance_id: str, tags: Dict[str, str]) -> str: - response = requests.patch( - ENDPOINT + f'server/{instance_id}/tag', - auth=(self.api_key, self.api_token), - json=dict(tags=json.dumps(tags)), - ) + response = requests.put(ENDPOINT + 'instances/' + instance_id + '/stop', + headers={'api-key': self.api_key}) raise_fluidstack_error(response) return response.json() - def rename(self, instance_id: str, hostname: str) -> str: - response = requests.patch( - ENDPOINT + f'server/{instance_id}/rename', - auth=(self.api_key, self.api_token), - json=dict(name=hostname), + def rename(self, instance_id: str, name: str) -> str: + response = requests.put( + ENDPOINT + f'instances/{instance_id}/rename', + headers={'api-key': self.api_key}, + json=dict(new_instance_name=name), ) raise_fluidstack_error(response) return response.json() diff --git a/sky/provision/fluidstack/instance.py b/sky/provision/fluidstack/instance.py index e870ff15e0c..538aafc8887 100644 --- a/sky/provision/fluidstack/instance.py +++ b/sky/provision/fluidstack/instance.py @@ -27,7 +27,7 @@ def get_internal_ip(node_info: Dict[str, Any]) -> None: node_info['internal_ip'] = node_info['ip_address'] runner = command_runner.SSHCommandRunner( (node_info['ip_address'], 22), - ssh_user=node_info['capabilities']['default_user_name'], + ssh_user='ubuntu', ssh_private_key=auth.PRIVATE_SSH_KEY_PATH) result = runner.run(_GET_INTERNAL_IP_CMD, require_outputs=True, @@ -61,7 
+61,7 @@ def _filter_instances( if (include_instances is not None and instance['id'] not in include_instances): continue - if instance.get('hostname') in possible_names: + if instance.get('name') in possible_names: filtered_instances[instance['id']] = instance return filtered_instances @@ -69,7 +69,7 @@ def _filter_instances( def _get_head_instance_id(instances: Dict[str, Any]) -> Optional[str]: head_instance_id = None for inst_id, inst in instances.items(): - if inst['hostname'].endswith('-head'): + if inst['name'].endswith('-head'): head_instance_id = inst_id break return head_instance_id @@ -80,16 +80,7 @@ def run_instances(region: str, cluster_name_on_cloud: str, """Runs instances for the given cluster.""" pending_status = [ - 'create', - 'requesting', - 'provisioning', - 'customizing', - 'starting', - 'stopping', - 'start', - 'stop', - 'reboot', - 'rebooting', + 'pending', ] while True: instances = _filter_instances(cluster_name_on_cloud, pending_status) @@ -127,7 +118,7 @@ def rename(instance_id: str, new_name: str) -> None: f'{instance_name}') rename(instance_id, instance_name) if (instance_id != head_instance_id and - instance['hostname'].endswith('-head')): + instance['name'].endswith('-head')): # Multiple head instances exist. # This is a rare case when the instance name was manually modified # on the cloud or some unexpected behavior happened. @@ -167,7 +158,7 @@ def rename(instance_id: str, new_name: str) -> None: node_type = 'head' if head_instance_id is None else 'worker' try: instance_ids = utils.FluidstackClient().create_instance( - hostname=f'{cluster_name_on_cloud}-{node_type}', + name=f'{cluster_name_on_cloud}-{node_type}', instance_type=config.node_config['InstanceType'], ssh_pub_key=config.node_config['AuthorizedKey'], region=region) @@ -184,9 +175,6 @@ def rename(instance_id: str, new_name: str) -> None: instances = _filter_instances(cluster_name_on_cloud, pending_status + ['running']) if len(instances) < config.count: - # Some of pending instances have been convert to a state that will - # not convert to `running` status. This can be due to resource - # availability issue. 
all_instances = _filter_instances( cluster_name_on_cloud, status_filters=None, @@ -253,15 +241,11 @@ def terminate_instances( instances = _filter_instances(cluster_name_on_cloud, None) for inst_id, inst in instances.items(): logger.debug(f'Terminating instance {inst_id}: {inst}') - if worker_only and inst['hostname'].endswith('-head'): + if worker_only and inst['name'].endswith('-head'): continue try: utils.FluidstackClient().delete(inst_id) except Exception as e: # pylint: disable=broad-except - if (isinstance(e, utils.FluidstackAPIError) and - 'Machine is already terminated' in str(e)): - logger.debug(f'Instance {inst_id} is already terminated.') - continue with ux_utils.print_exception_no_traceback(): raise RuntimeError( f'Failed to terminate instance {inst_id}: ' @@ -291,7 +275,7 @@ def get_cluster_info( tags={}, ) ] - if instance_info['hostname'].endswith('-head'): + if instance_info['name'].endswith('-head'): head_instance_id = instance_id return common.ClusterInfo(instances=instances, @@ -311,22 +295,10 @@ def query_instances( instances = _filter_instances(cluster_name_on_cloud, None) instances = _filter_instances(cluster_name_on_cloud, None) status_map = { - 'provisioning': status_lib.ClusterStatus.INIT, - 'requesting': status_lib.ClusterStatus.INIT, - 'create': status_lib.ClusterStatus.INIT, - 'customizing': status_lib.ClusterStatus.INIT, - 'stopping': status_lib.ClusterStatus.STOPPED, - 'stop': status_lib.ClusterStatus.STOPPED, - 'start': status_lib.ClusterStatus.INIT, - 'reboot': status_lib.ClusterStatus.STOPPED, - 'rebooting': status_lib.ClusterStatus.STOPPED, + 'pending': status_lib.ClusterStatus.INIT, 'stopped': status_lib.ClusterStatus.STOPPED, - 'starting': status_lib.ClusterStatus.INIT, 'running': status_lib.ClusterStatus.UP, - 'failed to create': status_lib.ClusterStatus.INIT, - 'timeout error': status_lib.ClusterStatus.INIT, - 'out of stock': status_lib.ClusterStatus.INIT, - 'terminating': None, + 'unhealthy': status_lib.ClusterStatus.INIT, 'terminated': None, } statuses: Dict[str, Optional[status_lib.ClusterStatus]] = {} diff --git a/sky/provision/kubernetes/manifests/smarter-device-manager-daemonset.yaml b/sky/provision/kubernetes/manifests/smarter-device-manager-daemonset.yaml index 664fd69a8c8..2f8abf00550 100644 --- a/sky/provision/kubernetes/manifests/smarter-device-manager-daemonset.yaml +++ b/sky/provision/kubernetes/manifests/smarter-device-manager-daemonset.yaml @@ -26,6 +26,9 @@ spec: hostname: smarter-device-management hostNetwork: true dnsPolicy: ClusterFirstWithHostNet + tolerations: + - effect: NoSchedule + operator: Exists containers: - name: smarter-device-manager image: us-central1-docker.pkg.dev/skypilot-375900/skypilotk8s/smarter-device-manager:v1.1.2 diff --git a/sky/resources.py b/sky/resources.py index f0cb1abda1e..2f19cd1aa01 100644 --- a/sky/resources.py +++ b/sky/resources.py @@ -578,10 +578,17 @@ def _set_accelerators( 'Cannot specify instance type' f' (got "{self.instance_type}") for TPU VM.') if 'runtime_version' not in accelerator_args: - if use_tpu_vm: - accelerator_args['runtime_version'] = 'tpu-vm-base' - else: - accelerator_args['runtime_version'] = '2.12.0' + + def _get_default_runtime_version() -> str: + if not use_tpu_vm: + return '2.12.0' + # TPU V5 requires a newer runtime version. 
+ if acc.startswith('tpu-v5'): + return 'v2-alpha-tpuv5' + return 'tpu-vm-base' + + accelerator_args['runtime_version'] = ( + _get_default_runtime_version()) logger.info( 'Missing runtime_version in accelerator_args, using' f' default ({accelerator_args["runtime_version"]})') diff --git a/sky/templates/aws-ray.yml.j2 b/sky/templates/aws-ray.yml.j2 index 6e3dc76750c..7e9dfccdaf1 100644 --- a/sky/templates/aws-ray.yml.j2 +++ b/sky/templates/aws-ray.yml.j2 @@ -11,9 +11,6 @@ docker: container_name: {{docker_container_name}} run_options: - --ulimit nofile=1048576:1048576 - {%- if custom_resources is not none %} - --gpus all - {%- endif %} {%- for run_option in docker_run_options %} - {{run_option}} {%- endfor %} diff --git a/sky/templates/azure-ray.yml.j2 b/sky/templates/azure-ray.yml.j2 index 39672a976b8..65d500fc677 100644 --- a/sky/templates/azure-ray.yml.j2 +++ b/sky/templates/azure-ray.yml.j2 @@ -11,9 +11,6 @@ docker: container_name: {{docker_container_name}} run_options: - --ulimit nofile=1048576:1048576 - {%- if custom_resources is not none %} - --gpus all - {%- endif %} {%- for run_option in docker_run_options %} - {{run_option}} {%- endfor %} diff --git a/sky/templates/fluidstack-ray.yml.j2 b/sky/templates/fluidstack-ray.yml.j2 index 309a5393828..3eb277ec6d9 100644 --- a/sky/templates/fluidstack-ray.yml.j2 +++ b/sky/templates/fluidstack-ray.yml.j2 @@ -65,7 +65,6 @@ setup_commands: sudo pkill -9 apt-get; sudo pkill -9 dpkg; sudo dpkg --configure -a; - {{ cuda_installation_commands }} mkdir -p ~/.ssh; touch ~/.ssh/config; {{ conda_installation_commands }} {{ ray_skypilot_installation_commands }} diff --git a/sky/templates/gcp-ray.yml.j2 b/sky/templates/gcp-ray.yml.j2 index d7e787953d9..bcc16bac531 100644 --- a/sky/templates/gcp-ray.yml.j2 +++ b/sky/templates/gcp-ray.yml.j2 @@ -12,9 +12,6 @@ docker: container_name: {{docker_container_name}} run_options: - --ulimit nofile=1048576:1048576 - {%- if gpu is not none %} - --gpus all - {%- endif %} {%- for run_option in docker_run_options %} - {{run_option}} {%- endfor %} diff --git a/sky/templates/paperspace-ray.yml.j2 b/sky/templates/paperspace-ray.yml.j2 index 400714978b9..8eea5ac4f8a 100644 --- a/sky/templates/paperspace-ray.yml.j2 +++ b/sky/templates/paperspace-ray.yml.j2 @@ -11,9 +11,6 @@ docker: container_name: {{docker_container_name}} run_options: - --ulimit nofile=1048576:1048576 - {%- if custom_resources is not none %} - --gpus all - {%- endif %} {%- for run_option in docker_run_options %} - {{run_option}} {%- endfor %} diff --git a/tests/skyserve/update/new_autoscaler_after.yaml b/tests/skyserve/update/new_autoscaler_after.yaml index f5a2e552f67..2d12d3ef109 100644 --- a/tests/skyserve/update/new_autoscaler_after.yaml +++ b/tests/skyserve/update/new_autoscaler_after.yaml @@ -8,7 +8,6 @@ service: base_ondemand_fallback_replicas: 1 resources: - cloud: gcp ports: 8081 use_spot: true cpus: 2+ @@ -22,4 +21,4 @@ run: | # blue-green update. 
sleep 120 fi - python3 server.py + python3 server.py --port 8081 diff --git a/tests/skyserve/update/new_autoscaler_before.yaml b/tests/skyserve/update/new_autoscaler_before.yaml index a91c3cd230a..793221080ae 100644 --- a/tests/skyserve/update/new_autoscaler_before.yaml +++ b/tests/skyserve/update/new_autoscaler_before.yaml @@ -5,10 +5,9 @@ service: replicas: 2 resources: - cloud: gcp ports: 8081 cpus: 2+ workdir: examples/serve/http_server -run: python3 server.py +run: python3 server.py --port 8081 diff --git a/tests/test_smoke.py b/tests/test_smoke.py index 33280e1fd4c..4d2c26103a7 100644 --- a/tests/test_smoke.py +++ b/tests/test_smoke.py @@ -48,6 +48,7 @@ from sky import jobs from sky import serve from sky import skypilot_config +from sky.adaptors import azure from sky.adaptors import cloudflare from sky.adaptors import ibm from sky.clouds import AWS @@ -839,6 +840,7 @@ def test_image_no_conda(): run_one_test(test) +@pytest.mark.no_fluidstack # FluidStack does not support stopping instances in SkyPilot implementation @pytest.mark.no_kubernetes # Kubernetes does not support stopping instances def test_custom_default_conda_env(generic_cloud: str): name = _get_cluster_name() @@ -1103,9 +1105,8 @@ def test_azure_storage_mounts_with_stop(): cloud = 'azure' storage_name = f'sky-test-{int(time.time())}' default_region = 'eastus' - storage_account_name = ( - storage_lib.AzureBlobStore.DEFAULT_STORAGE_ACCOUNT_NAME.format( - region=default_region, user_hash=common_utils.get_user_hash())) + storage_account_name = (storage_lib.AzureBlobStore. + get_default_storage_account_name(default_region)) storage_account_key = data_utils.get_az_storage_account_key( storage_account_name) template_str = pathlib.Path( @@ -1549,6 +1550,7 @@ def test_job_queue_multinode(generic_cloud: str): run_one_test(test) +@pytest.mark.no_fluidstack # No FluidStack VM has 8 CPUs @pytest.mark.no_lambda_cloud # No Lambda Cloud VM has 8 CPUs def test_large_job_queue(generic_cloud: str): name = _get_cluster_name() @@ -1592,6 +1594,7 @@ def test_large_job_queue(generic_cloud: str): run_one_test(test) +@pytest.mark.no_fluidstack # No FluidStack VM has 8 CPUs @pytest.mark.no_lambda_cloud # No Lambda Cloud VM has 8 CPUs def test_fast_large_job_queue(generic_cloud: str): # This is to test the jobs can be scheduled quickly when there are many jobs in the queue. @@ -1699,6 +1702,7 @@ def test_multi_echo(generic_cloud: str): # ---------- Task: 1 node training. ---------- +@pytest.mark.no_fluidstack # Fluidstack does not have T4 gpus for now @pytest.mark.no_lambda_cloud # Lambda Cloud does not have V100 gpus @pytest.mark.no_ibm # IBM cloud currently doesn't provide public image with CUDA @pytest.mark.no_scp # SCP does not have V100 (16GB) GPUs. Run test_scp_huggingface instead. 
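The `no_fluidstack` markers added throughout `tests/test_smoke.py` follow the same convention as the existing `no_lambda_cloud` / `no_kubernetes` markers: each excludes a test when the suite targets that cloud. A minimal sketch of how such markers are typically honored in a `conftest.py` (an assumed illustration of the convention, not the repo's actual conftest; marker registration is omitted):

```python
# conftest.py sketch -- assumed wiring for `no_<cloud>` markers.
import pytest

def pytest_addoption(parser):
    parser.addoption('--generic-cloud', default='aws',
                     help='Cloud that generic tests run against.')

def pytest_collection_modifyitems(config, items):
    cloud = config.getoption('--generic-cloud')
    skip_marker = pytest.mark.skip(reason=f'Excluded on {cloud}.')
    for item in items:
        # A test marked `no_fluidstack` is skipped when
        # --generic-cloud=fluidstack, and so on for other clouds.
        if item.get_closest_marker(f'no_{cloud}') is not None:
            item.add_marker(skip_marker)
```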
@@ -2325,6 +2329,7 @@ def test_cancel_azure():
     run_one_test(test)


+@pytest.mark.no_fluidstack # Fluidstack does not support V100 gpus for now
 @pytest.mark.no_lambda_cloud # Lambda Cloud does not have V100 gpus
 @pytest.mark.no_ibm # IBM cloud currently doesn't provide public image with CUDA
 @pytest.mark.no_paperspace # Paperspace has `gnome-shell` on nvidia-smi
@@ -2990,8 +2995,7 @@ def test_managed_jobs_storage(generic_cloud: str):
         region = 'westus2'
         region_flag = f' --region {region}'
         storage_account_name = (
-            storage_lib.AzureBlobStore.DEFAULT_STORAGE_ACCOUNT_NAME.format(
-                region=region, user_hash=common_utils.get_user_hash()))
+            storage_lib.AzureBlobStore.get_default_storage_account_name(region))
         region_cmd = TestStorageWithCredentials.cli_region_cmd(
             storage_lib.StoreType.AZURE,
             storage_account_name=storage_account_name)
@@ -3501,6 +3505,7 @@ def test_skyserve_kubernetes_http():
     run_one_test(test)


+@pytest.mark.no_fluidstack # Fluidstack does not support T4 gpus for now
 @pytest.mark.serve
 def test_skyserve_llm(generic_cloud: str):
     """Test skyserve with real LLM usecase"""
@@ -3558,6 +3563,7 @@ def test_skyserve_spot_recovery():
     run_one_test(test)


+@pytest.mark.no_fluidstack # Fluidstack does not support spot instances
 @pytest.mark.serve
 @pytest.mark.no_kubernetes
 def test_skyserve_base_ondemand_fallback(generic_cloud: str):
@@ -3622,6 +3628,8 @@ def test_skyserve_dynamic_ondemand_fallback():
     run_one_test(test)


+# TODO: fluidstack does not support `--cpus 2`, but the check for services in this test is based on CPUs
+@pytest.mark.no_fluidstack
 @pytest.mark.serve
 def test_skyserve_user_bug_restart(generic_cloud: str):
     """Tests that we restart the service after user bug."""
@@ -3806,6 +3814,8 @@ def test_skyserve_large_readiness_timeout(generic_cloud: str):
     run_one_test(test)


+# TODO: fluidstack does not support `--cpus 2`, but the check for services in this test is based on CPUs
+@pytest.mark.no_fluidstack
 @pytest.mark.serve
 def test_skyserve_update(generic_cloud: str):
     """Test skyserve with update"""
@@ -3834,6 +3844,8 @@
     run_one_test(test)


+# TODO: fluidstack does not support `--cpus 2`, but the check for services in this test is based on CPUs
+@pytest.mark.no_fluidstack
 @pytest.mark.serve
 def test_skyserve_rolling_update(generic_cloud: str):
     """Test skyserve with rolling update"""
@@ -3870,6 +3882,7 @@
     run_one_test(test)


+@pytest.mark.no_fluidstack
 @pytest.mark.serve
 def test_skyserve_fast_update(generic_cloud: str):
     """Test skyserve with fast update (Increment version of old replicas)"""
@@ -3947,12 +3960,13 @@ def test_skyserve_update_autoscale(generic_cloud: str):
     run_one_test(test)


+@pytest.mark.no_fluidstack # Spot instances are not supported by Fluidstack
 @pytest.mark.serve
 @pytest.mark.no_kubernetes # Spot instances are not supported in Kubernetes
 @pytest.mark.parametrize('mode', ['rolling', 'blue_green'])
 def test_skyserve_new_autoscaler_update(mode: str, generic_cloud: str):
     """Test skyserve with update that changes autoscaler"""
-    name = _get_service_name() + mode
+    name = f'{_get_service_name()}-{mode}'

     wait_until_no_pending = (
         f's=$(sky serve status {name}); echo "$s"; '
@@ -3982,7 +3996,7 @@ def test_skyserve_new_autoscaler_update(mode: str, generic_cloud: str):
             _check_service_version(name, "1"),
     ]
     test = Test(
-        'test-skyserve-new-autoscaler-update',
+        f'test-skyserve-new-autoscaler-update-{mode}',
         [
             f'sky serve up -n {name} --cloud {generic_cloud} -y 
tests/skyserve/update/new_autoscaler_before.yaml', _SERVE_WAIT_UNTIL_READY.format(name=name, replica_num=2) + @@ -4010,6 +4024,8 @@ def test_skyserve_new_autoscaler_update(mode: str, generic_cloud: str): run_one_test(test) +# TODO: fluidstack does not support `--cpus 2`, but the check for services in this test is based on CPUs +@pytest.mark.no_fluidstack @pytest.mark.serve def test_skyserve_failures(generic_cloud: str): """Test replica failure statuses""" @@ -4287,9 +4303,8 @@ def cli_delete_cmd(store_type, if store_type == storage_lib.StoreType.AZURE: default_region = 'eastus' storage_account_name = ( - storage_lib.AzureBlobStore.DEFAULT_STORAGE_ACCOUNT_NAME.format( - region=default_region, - user_hash=common_utils.get_user_hash())) + storage_lib.AzureBlobStore.get_default_storage_account_name( + default_region)) storage_account_key = data_utils.get_az_storage_account_key( storage_account_name) return ('az storage container delete ' @@ -4324,11 +4339,9 @@ def cli_ls_cmd(store_type, bucket_name, suffix=''): config_storage_account = skypilot_config.get_nested( ('azure', 'storage_account'), None) storage_account_name = config_storage_account if ( - config_storage_account is not None - ) else ( - storage_lib.AzureBlobStore.DEFAULT_STORAGE_ACCOUNT_NAME.format( - region=default_region, - user_hash=common_utils.get_user_hash())) + config_storage_account is not None) else ( + storage_lib.AzureBlobStore.get_default_storage_account_name( + default_region)) storage_account_key = data_utils.get_az_storage_account_key( storage_account_name) list_cmd = ('az storage blob list ' @@ -4390,9 +4403,8 @@ def cli_count_name_in_bucket(store_type, if storage_account_name is None: default_region = 'eastus' storage_account_name = ( - storage_lib.AzureBlobStore.DEFAULT_STORAGE_ACCOUNT_NAME. - format(region=default_region, - user_hash=common_utils.get_user_hash())) + storage_lib.AzureBlobStore.get_default_storage_account_name( + default_region)) storage_account_key = data_utils.get_az_storage_account_key( storage_account_name) return ('az storage blob list ' @@ -4418,9 +4430,8 @@ def cli_count_file_in_bucket(store_type, bucket_name): elif store_type == storage_lib.StoreType.AZURE: default_region = 'eastus' storage_account_name = ( - storage_lib.AzureBlobStore.DEFAULT_STORAGE_ACCOUNT_NAME.format( - region=default_region, - user_hash=common_utils.get_user_hash())) + storage_lib.AzureBlobStore.get_default_storage_account_name( + default_region)) storage_account_key = data_utils.get_az_storage_account_key( storage_account_name) return ('az storage blob list ' @@ -4622,8 +4633,8 @@ def tmp_az_bucket(self, tmp_bucket_name): # Creates a temporary bucket using gsutil default_region = 'eastus' storage_account_name = ( - storage_lib.AzureBlobStore.DEFAULT_STORAGE_ACCOUNT_NAME.format( - region=default_region, user_hash=common_utils.get_user_hash())) + storage_lib.AzureBlobStore.get_default_storage_account_name( + default_region)) storage_account_key = data_utils.get_az_storage_account_key( storage_account_name) bucket_uri = data_utils.AZURE_CONTAINER_URL.format( @@ -4859,9 +4870,8 @@ def test_nonexistent_bucket(self, nonexist_bucket_url): elif nonexist_bucket_url.startswith('https'): default_region = 'eastus' storage_account_name = ( - storage_lib.AzureBlobStore.DEFAULT_STORAGE_ACCOUNT_NAME. 
- format(region=default_region, - user_hash=common_utils.get_user_hash())) + storage_lib.AzureBlobStore.get_default_storage_account_name( + default_region)) storage_account_key = data_utils.get_az_storage_account_key( storage_account_name) command = f'az storage container exists --account-name {storage_account_name} --account-key {storage_account_key} --name {nonexist_bucket_name}'
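For reference, the default Azure storage account name introduced above (`sky{region_hash}{user_hash}{subscription_hash}`) can be reproduced standalone. A minimal sketch under the patch's stated assumptions (md5 hashes truncated to 4 characters for region and subscription); the `user_hash` and `subscription_id` inputs below are placeholders for what SkyPilot obtains from `common_utils.get_user_hash()` and `azure.get_subscription_id()`:

```python
# Standalone sketch of the naming scheme from sky/data/storage.py above.
import hashlib

_REGION_HASH_LENGTH = 4
_SUBSCRIPTION_HASH_LENGTH = 4

def default_storage_account_name(region: str, user_hash: str,
                                 subscription_id: str) -> str:
    # Storage account names must be 3-24 lowercase alphanumeric characters
    # and globally unique across Azure; short hashes keep the name within
    # the limit while staying deterministic per (region, user, subscription).
    region_hash = hashlib.md5(
        region.encode('utf-8')).hexdigest()[:_REGION_HASH_LENGTH]
    subscription_hash = hashlib.md5(
        subscription_id.encode('utf-8')).hexdigest()[:_SUBSCRIPTION_HASH_LENGTH]
    return f'sky{region_hash}{user_hash}{subscription_hash}'

# Example: an 8-character user hash yields a 19-character account name.
print(default_storage_account_name(
    'eastus', 'ab12cd34', '00000000-0000-0000-0000-000000000000'))
```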