[Lambda] Lambda Cloud SkyPilot provisioner #3865

kmushegi · 2024-08-22T22:42:42Z

This PR implements the SkyPilot provisioner for Lambda Cloud.

Tested (run the relevant ones):

- Code formatting: bash format.sh
- Any manual or new tests for this PR (please specify below)
  - Ran multiple tests against Lambda Cloud.
- All smoke tests: pytest tests/test_smoke.py
- Relevant individual smoke tests: pytest tests/test_smoke.py::test_fill_in_the_name
- Backward compatibility tests: conda deactivate; bash -i tests/backward_compatibility_tests.sh

cblmemo

Thanks for this amazing PR @kmushegi ! 🚀 It would be really useful to move Lambda to the new provisioner and speed up provisioning a lot. Left some comments to discuss!

sky/provision/lambda_cloud/instance.py

sky/provision/lambda_cloud/lambda_utils.py

romilbhardwaj · 2024-08-27T23:17:26Z

Thanks @kmushegi!

Trying this out, ran into this error when trying to launch

$ sky launch -c lamb --num-nodes 2 --cloud lambda -- echo hi
...
I 08-27 16:16:02 provisioner.py:65] Launching on Lambda us-east-1 (all zones)
E 08-27 16:16:02 provisioner.py:80] Failed to configure 'lamb' on Lambda Region(name='us-east-1') (all zones) with the following error:
E 08-27 16:16:02 provisioner.py:80] AssertionError: Unknown provider: lambda
D 08-27 16:16:02 provisioner.py:171] Failed to provision 'lamb' on Lambda (all zones).
D 08-27 16:16:02 provisioner.py:173] bulk_provision for 'lamb' failed. Stacktrace:
D 08-27 16:16:02 provisioner.py:173] Traceback (most recent call last):
D 08-27 16:16:02 provisioner.py:173]   File "/Users/romilb/Romil/Berkeley/Research/sky-experiments/sky/provision/provisioner.py", line 165, in bulk_provision
D 08-27 16:16:02 provisioner.py:173]     return _bulk_provision(cloud, region, zones, cluster_name,
D 08-27 16:16:02 provisioner.py:173]   File "/Users/romilb/Romil/Berkeley/Research/sky-experiments/sky/provision/provisioner.py", line 76, in _bulk_provision
D 08-27 16:16:02 provisioner.py:173]     config = provision.bootstrap_instances(provider_name, region_name,
D 08-27 16:16:02 provisioner.py:173]   File "/Users/romilb/Romil/Berkeley/Research/sky-experiments/sky/provision/__init__.py", line 44, in _wrapper
D 08-27 16:16:02 provisioner.py:173]     assert module is not None, f'Unknown provider: {module_name}'
D 08-27 16:16:02 provisioner.py:173] AssertionError: Unknown provider: lambda
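
The assertion in the traceback fires when the provider name does not resolve to a provisioner module. A minimal sketch of that kind of name-to-module dispatch follows; `_PROVIDER_MODULES`, `register_provider`, and `resolve_provider` are hypothetical names for illustration and are not SkyPilot's actual code.

```python
import types
from typing import Dict

# Hypothetical registry mapping provider names to provisioner modules.
_PROVIDER_MODULES: Dict[str, types.ModuleType] = {}


def register_provider(name: str, module: types.ModuleType) -> None:
    """Make a provisioner module discoverable under the given provider name."""
    _PROVIDER_MODULES[name.lower()] = module


def resolve_provider(name: str) -> types.ModuleType:
    """Look up the module for a provider, failing loudly if it is missing."""
    module = _PROVIDER_MODULES.get(name.lower())
    assert module is not None, f'Unknown provider: {name}'
    return module


# Before the provider is registered, the lookup fails with the same
# message as in the log above.
try:
    resolve_provider('lambda')
except AssertionError as e:
    error_message = str(e)

# Once a module is registered (a stub here), the lookup succeeds.
register_provider('lambda', types.ModuleType('lambda_cloud_stub'))
resolved = resolve_provider('lambda')
```

Under this assumption, the error would indicate that the piece mapping `lambda` to its provisioner module was not present in the tested commit.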

kmushegi · 2024-08-29T17:47:07Z

thanks for the reviews folks, will try to address asap

kmushegi · 2024-08-30T19:02:47Z

Fixed the `Unknown provider: lambda` error reported above; I had initially missed committing a change.

~~Moving onto some testing~~

Up/down works, but Ray is failing to start on the worker node; I'll keep debugging. The error:

RuntimeError: Failed to start ray on the worker node (exit code 1).
Detailed Error:
===== stdout =====
2024-08-30 19:21:56,699	INFO scripts.py:1163 -- Did not find any active Ray processes.
2024-08-30 19:21:57,524	INFO scripts.py:926 -- Local node IP: 127.0.0.1
Traceback (most recent call last):
  File "/home/ubuntu/skypilot-runtime/bin/ray", line 8, in <module>
    sys.exit(main())
  File "/home/ubuntu/skypilot-runtime/lib/python3.10/site-packages/ray/scripts/scripts.py", line 2498, in main
    return cli()
  File "/home/ubuntu/skypilot-runtime/lib/python3.10/site-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
  File "/home/ubuntu/skypilot-runtime/lib/python3.10/site-packages/click/core.py", line 1078, in main
    rv = self.invoke(ctx)
  File "/home/ubuntu/skypilot-runtime/lib/python3.10/site-packages/click/core.py", line 1688, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/ubuntu/skypilot-runtime/lib/python3.10/site-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/ubuntu/skypilot-runtime/lib/python3.10/site-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
  File "/home/ubuntu/skypilot-runtime/lib/python3.10/site-packages/ray/autoscaler/_private/cli_logger.py", line 856, in wrapper
    return f(*args, **kwargs)
  File "/home/ubuntu/skypilot-runtime/lib/python3.10/site-packages/ray/scripts/scripts.py", line 928, in start
    node = ray._private.node.Node(
  File "/home/ubuntu/skypilot-runtime/lib/python3.10/site-packages/ray/_private/node.py", line 153, in __init__
    self._init_gcs_client()
  File "/home/ubuntu/skypilot-runtime/lib/python3.10/site-packages/ray/_private/node.py", line 730, in _init_gcs_client
    raise RuntimeError(
RuntimeError: Failed to connect to GCS.

update: single node works, multi-node still struggling but root-caused

update: multi-node fixed as well

cblmemo · 2024-09-11T21:20:54Z

Hi @kmushegi, I'm trying this today and encountered the following error. What does `quantity: Input should be less than or equal to 1` mean here? Can we have a more informative error message?

sky launch --cloud lambda --num-nodes 3 -c lmd-3node
I 09-11 14:16:56 optimizer.py:719] == Optimizer ==
I 09-11 14:16:56 optimizer.py:730] Target: minimizing cost
I 09-11 14:16:56 optimizer.py:742] Estimated cost: $2.2 / hour
I 09-11 14:16:56 optimizer.py:742] 
I 09-11 14:16:56 optimizer.py:867] Considered resources (3 nodes):
I 09-11 14:16:56 optimizer.py:937] ------------------------------------------------------------------------------------------
I 09-11 14:16:56 optimizer.py:937]  CLOUD    INSTANCE     vCPUs   Mem(GB)   ACCELERATORS   REGION/ZONE   COST ($)   CHOSEN   
I 09-11 14:16:56 optimizer.py:937] ------------------------------------------------------------------------------------------
I 09-11 14:16:56 optimizer.py:937]  Lambda   gpu_1x_a10   30      200       A10:1          us-east-1     2.25          ✔     
I 09-11 14:16:56 optimizer.py:937] ------------------------------------------------------------------------------------------
I 09-11 14:16:56 optimizer.py:937] 
Launching a new cluster 'lmd-3node'. Proceed? [Y/n]: 
I 09-11 14:16:57 cloud_vm_ray_backend.py:4397] Creating a new cluster: 'lmd-3node' [3x Lambda(gpu_1x_a10, {'A10': 1})].
I 09-11 14:16:57 cloud_vm_ray_backend.py:4397] Tip: to reuse an existing cluster, specify --cluster (-c). Run `sky status` to see existing clusters.
I 09-11 14:16:57 cloud_vm_ray_backend.py:1314] To view detailed progress: tail -n100 -f /home/memory/sky_logs/sky-2024-09-11-14-16-56-536636/provision.log
I 09-11 14:16:58 provisioner.py:65] Launching on Lambda us-east-1 (all zones)
W 09-11 14:17:02 instance.py:117] run_instances error: global/invalid-parameters: quantity: Input should be less than or equal to 1
W 09-11 14:17:05 cloud_vm_ray_backend.py:2003] sky.exceptions.ResourcesUnavailableError: Failed to acquire resources in all zones in us-east-1. Try changing resource requirements or use another region.
W 09-11 14:17:05 cloud_vm_ray_backend.py:2012] 
W 09-11 14:17:05 cloud_vm_ray_backend.py:2012] Provision failed for 3x Lambda(gpu_1x_a10, {'A10': 1}) in us-east-1. Trying other locations (if any).
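
The `quantity: Input should be less than or equal to 1` message suggests that a single launch request was limited to one instance when this log was captured. One way a provisioner could work within such a limit is to issue one request per node. The sketch below assumes a hypothetical `launch_one` callable standing in for the real API call; it is not the code from this PR.

```python
from typing import Callable, List


def launch_instances(launch_one: Callable[[str], str],
                     names: List[str]) -> List[str]:
    """Launch one instance per request and collect the returned IDs."""
    instance_ids = []
    for name in names:
        # Each call requests exactly one instance (quantity == 1).
        instance_ids.append(launch_one(name))
    return instance_ids


def fake_launch_one(name: str) -> str:
    """Stand-in for a real single-instance launch call (hypothetical)."""
    return f'id-{name}'


ids = launch_instances(
    fake_launch_one,
    ['lmd-3node-head', 'lmd-3node-worker-1', 'lmd-3node-worker-2'])
```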

cblmemo

Thanks for the fix @kmushegi ! I tested this PR and all of launch/terminate/multinode works well. Left some nits and after that it should be ready to go!

sky/provision/lambda_cloud/instance.py

sky/provision/lambda_cloud/lambda_utils.py

sky/provision/lambda_cloud/instance.py

cblmemo · 2024-10-11T16:39:29Z

Hi @kmushegi , just checking the status of this PR - Is this still in your schedule? If not we can take it from here :))

kmushegi · 2024-10-11T21:31:19Z

Apologies for the delay; a bunch of things came up. I just addressed all reviews and am waiting for CI to pass. Once CI passes, this should be good for final testing.

cblmemo

Thanks for the prompt fix @kmushegi ! I tested launch/execution/termination and all works smoothly. It mostly looks good to me! cc @Michaelvll for a final check here

sky/provision/lambda_cloud/instance.py

cblmemo · 2024-10-11T23:25:56Z

sky/provision/lambda_cloud/instance.py

+
+    assert head_instance_id is not None, 'head_instance_id should not be None'
+
+    worker_node_count = to_start_count - 1


Suggested change

worker_node_count = to_start_count - 1

worker_node_count = to_start_count - 1

Shouldn't this subtract one only if the head instance ID is None?

Hmm, don't we assert above that the head instance ID is not None?

Sorry for the confusion. I mean this

Logically, it is possible that the head instance is already provisioned and this time we only want to add some workers. In that case, to_start_count should equal worker_node_count. In practice this won't occur in our system today, but we should handle it for future expansion.

So are you suggesting something like:

    if head_instance_id is None:
        worker_node_count = to_start_count - 1
    else:
        worker_node_count = to_start_count

See this comment: #3865 (comment)
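
The rule discussed in this thread can be written as a small standalone function. This is a sketch of the logic as described by the reviewers, not the code that was merged; the function name `worker_nodes_to_start` is made up for illustration.

```python
from typing import Optional


def worker_nodes_to_start(to_start_count: int,
                          head_instance_id: Optional[str]) -> int:
    """Return how many of the nodes still to be started are workers.

    If no head node exists yet, one of the nodes to start becomes the head,
    so only the remainder are workers. If a head already exists, every node
    to start is a worker.
    """
    if to_start_count <= 0:
        return 0
    if head_instance_id is None:
        return to_start_count - 1
    return to_start_count
```

For a fresh 3-node cluster this gives 2 workers; when adding 2 nodes to a cluster that already has a head, it gives 2 workers.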

sky/provision/lambda_cloud/instance.py

Michaelvll

Thanks again for adding this @kmushegi!

Michaelvll · 2024-10-14T04:01:30Z

sky/provision/lambda_cloud/instance.py

+        if head_instance_id is None:
+            raise RuntimeError(
+                f'Cluster {cluster_name_on_cloud} has no head node.')


nit: Instead of erroring out, can we just patch one of the nodes to make it the head?

I don't have the bandwidth this week to implement this. If someone wants to take a stab at it, please do, because I like the idea.

@Michaelvll I tried to take a stab at this but realized that the worker and head nodes may have different runtimes (e.g. Ray config), so patching would require complex detection of the runtime on each node. Also, the only scenario I can think of that triggers this case is a user manually changing the instance name in the console, which is rare, and RunPod's implementation takes a similar approach of raising directly. Do you think we could leave this to another PR?

skypilot/sky/provision/runpod/instance.py

Lines 68 to 70 in a4e2fcd

        if head_instance_id is None:
            raise RuntimeError(
                f'Cluster {cluster_name_on_cloud} has no head node.')

Just filed an issue for that: #4087
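
To make the "raise if no head node" behavior concrete, here is a minimal sketch. It assumes that instances are given as a mapping from instance ID to instance name and that the head node is named with a `-head` suffix; both the data shape and the helper name `find_head_instance_id` are assumptions for illustration.

```python
from typing import Dict


def find_head_instance_id(instances: Dict[str, str],
                          cluster_name_on_cloud: str) -> str:
    """Return the ID of the instance named '<cluster>-head'.

    `instances` maps instance ID -> instance name (assumed shape).
    Raises RuntimeError if no instance carries the head name, e.g. because
    it was renamed manually in the cloud console.
    """
    head_name = f'{cluster_name_on_cloud}-head'  # assumed naming scheme
    for instance_id, name in instances.items():
        if name == head_name:
            return instance_id
    raise RuntimeError(f'Cluster {cluster_name_on_cloud} has no head node.')
```

Promoting a worker to head instead of raising would additionally need the runtime differences mentioned above to be reconciled, which is why it was deferred to #4087.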

sky/provision/lambda_cloud/instance.py

cblmemo

Thanks for the awesome work @kmushegi! I tested it and it all runs smoothly. There is only one minor issue: for a Lambda Cloud cluster launched before this PR, running `sky status --refresh` puts it in the INIT state. Besides that, it should be ready to go! Could you help take a look at this?

(sky) ➜  skypilot git:(master) sky launch --cloud lambda -c lmd-old --gpus A100 nvidia-smi
Task from command: nvidia-smi
Considered resources (1 node):
------------------------------------------------------------------------------------------------
 CLOUD    INSTANCE           vCPUs   Mem(GB)   ACCELERATORS   REGION/ZONE   COST ($)   CHOSEN   
------------------------------------------------------------------------------------------------
 Lambda   gpu_1x_a100_sxm4   30      200       A100:1         us-east-1     1.29          ✔     
------------------------------------------------------------------------------------------------
Multiple Lambda instances satisfy A100:1. The cheapest Lambda(gpu_1x_a100_sxm4, {'A100': 1}) is considered among:
['gpu_1x_a100_sxm4', 'gpu_1x_a100'].
To list more details, run: sky show-gpus A100

Launching a new cluster 'lmd-old'. Proceed? [Y/n]: 
⚙︎ Launching on Lambda us-east-1.
  Head VM is up.
✓ Cluster launched: 'lmd-old'.  View logs at: ~/sky_logs/sky-2024-10-16-11-59-04-228003/provision.log
⚙︎ Job submitted, ID: 1
├── Waiting for task resources on 1 node.
└── Job started. Streaming logs... (Ctrl-C to exit log streaming; job will not be killed)
(sky-cmd, pid=4130) Wed Oct 16 19:04:24 2024       
(sky-cmd, pid=4130) +---------------------------------------------------------------------------------------+
(sky-cmd, pid=4130) | NVIDIA-SMI 535.129.03             Driver Version: 535.129.03   CUDA Version: 12.2     |
(sky-cmd, pid=4130) |-----------------------------------------+----------------------+----------------------+
(sky-cmd, pid=4130) | GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
(sky-cmd, pid=4130) | Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
(sky-cmd, pid=4130) |                                         |                      |               MIG M. |
(sky-cmd, pid=4130) |=========================================+======================+======================|
(sky-cmd, pid=4130) |   0  NVIDIA A100-SXM4-40GB          On  | 00000000:07:00.0 Off |                    0 |
(sky-cmd, pid=4130) | N/A   32C    P0              44W / 400W |      4MiB / 40960MiB |      0%      Default |
(sky-cmd, pid=4130) |                                         |                      |             Disabled |
(sky-cmd, pid=4130) +-----------------------------------------+----------------------+----------------------+
(sky-cmd, pid=4130)                                                                                          
(sky-cmd, pid=4130) +---------------------------------------------------------------------------------------+
(sky-cmd, pid=4130) | Processes:                                                                            |
(sky-cmd, pid=4130) |  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
(sky-cmd, pid=4130) |        ID   ID                                                             Usage      |
(sky-cmd, pid=4130) |=======================================================================================|
(sky-cmd, pid=4130) |  No running processes found                                                           |
(sky-cmd, pid=4130) +---------------------------------------------------------------------------------------+
✓ Job finished (status: SUCCEEDED).
Shared connection to 150.136.118.84 closed.

📋 Useful Commands
Job ID: 1
├── To cancel the job:          sky cancel lmd-old 1
├── To stream job logs:         sky logs lmd-old 1
└── To view job queue:          sky queue lmd-old

Cluster name: lmd-old
├── To log into the head VM:    ssh lmd-old
├── To submit a job:            sky exec lmd-old yaml_file
├── To stop the cluster:        sky stop lmd-old
└── To teardown the cluster:    sky down lmd-old

(sky) ➜  skypilot git:(master) gsw feat/oss-lambda-cloud-new-provisioner
Switched to branch 'feat/oss-lambda-cloud-new-provisioner'
(sky) ➜  skypilot git:(feat/oss-lambda-cloud-new-provisioner) sst -r                                                     
Clusters
Refreshing status for 2 clusters ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━   0% -:--:--instance_id: 25d1811e8e004be7975979a6b3dd4923, status: ClusterStatus.UP
NAME                           LAUNCHED     RESOURCES                                                 STATUS   AUTOSTOP  COMMAND                          
lmd-old                        36 secs ago  1x Lambda(gpu_1x_a100_sxm4, {'A100': 1})                  INIT     -         sky launch --cloud lambda -c...  
sky-serve-controller-402b1bba  1 hr ago     1x AWS(m6i.xlarge, disk_size=200, ports=['30001-30020'])  STOPPED  10m       sky serve up examples/ser...     

Managed jobs
No in-progress managed jobs. (See: sky jobs -h)

Services
No live services.

cblmemo

Thanks for contributing this awesome work @kmushegi ! I filed an issue to keep track of the issue above. Merging it now to unlock it first!

kmushegi changed the title ~~feat: Lambda Cloud SkyPilot provisioner~~ [Lambda] Lambda Cloud SkyPilot provisioner Aug 22, 2024

kmushegi marked this pull request as ready for review August 22, 2024 23:18

kmushegi force-pushed the feat/oss-lambda-cloud-new-provisioner branch 2 times, most recently from 3b00b53 to 4048b32 Compare August 22, 2024 23:39

Michaelvll requested a review from cblmemo August 27, 2024 06:15

cblmemo reviewed Aug 27, 2024

kmushegi force-pushed the feat/oss-lambda-cloud-new-provisioner branch from cbbb07c to baf0951 Compare August 30, 2024 22:45

Michaelvll requested a review from cblmemo September 10, 2024 01:51

kmushegi force-pushed the feat/oss-lambda-cloud-new-provisioner branch from 6897ab9 to 2de3d04 Compare September 11, 2024 20:16

cblmemo reviewed Sep 11, 2024

cblmemo reviewed Oct 11, 2024

Michaelvll reviewed Oct 14, 2024

kmushegi added 11 commits October 14, 2024 15:14

feat: lambda cloud new provisioner

22d2a65

feat: address cblmemo reviews and other reviews + make multi-node work again

9c0e204

fix: quotes

eca0df1

fix: address some reviews

f47bf31

chore: rm unused option

eb2c76d

chore: update typedef

86d6a3f

feat: use lists directly

099a67b

fix: formatting

a2e31a5

chore: address reviews

ec3e815

fix: formatting

0bc4509

chore: rm query ports since default impl per review

c612df8

kmushegi force-pushed the feat/oss-lambda-cloud-new-provisioner branch from 9e1e819 to c612df8 Compare October 14, 2024 22:14

kmushegi added 4 commits October 14, 2024 15:16

feat: add back query ports

0025a3a

fix: formatting

f8eb44c

chore: add newline at eof

21a3475

feat: try removing query ports again

b1dd794

This was referenced Oct 15, 2024

[Provisioner] Patch when head node not found for launching on existing cluster #4087

Open

[UX][Bug] Abnormal dim in the new UX #4093

Closed

cblmemo reviewed Oct 16, 2024

cblmemo mentioned this pull request Oct 17, 2024

[Provisioner] Backward compatibility for status refreshing on Lambda New Provisioner #4103

Open

cblmemo approved these changes Oct 17, 2024

cblmemo added this pull request to the merge queue Oct 17, 2024

Merged via the queue into skypilot-org:master with commit c2e12af Oct 17, 2024
20 checks passed

Michaelvll mentioned this pull request Oct 22, 2024

[Lambda] Fix internal IP regex #4072

Closed

5 tasks

romilbhardwaj mentioned this pull request Jan 21, 2025

[Lambda] Remove local_ray dependency for lambda #4601

Open
