-
Notifications
You must be signed in to change notification settings - Fork 547
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
GCP: Fix subnet filtering. #2854
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great catch @concretevitamin ! Thanks for fixing this. LGTM.
sky/skylet/providers/gcp/config.py
Outdated
matched_items = [] | ||
for item in items: | ||
# 'https://www.googleapis.com/compute/v1/projects/<project_id>/global/networks/<network_name>' | ||
if network == item["network"].split("/")[-1]: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we use _network_interface_to_vpc_name
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch, done!
) | ||
.execute() | ||
) | ||
|
||
return response["items"] if "items" in response else [] | ||
items = response["items"] if "items" in response else [] | ||
if network is None: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just curious, is it possible to do the network filtering with the filter field? If not, maybe we can have a comment here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added a comment:
# Filter by network (VPC) name.
#
# Note we do not directly use the filter (network=<...>) arg of the list()
# call above, because it'd involve constructing a long URL of the following
# format and passing it as the filter value:
# 'https://www.googleapis.com/compute/v1/projects/<project_id>/global/networks/<network_name>'
* Fix GCP _list_subnets() filtering. * Format * updates
* init * remove ray * update config * update * update * update * complete bootstrapping * add start instance * fix * fix * fix * update * wait stopping instances * support normal gcp tpus first * fix gcp * support get cluster info * fix * update * wait for instance starting * rename * hide gcp package import * fix * fix * update constants * fix comments * remove unused methods * fix comments * sync 'config' & 'constants' with upstream, Nov 16 * sync 'instace_utils' with the upstream, Nov 16 * fix typing * parallelize provisioning * Fix TPU node * Fix TPU NAME env for tpu node * implement bulk provision * refactor selflink * format * reduce the sleep time for autostop * provisioner version refactoring * refactor * Add logging * avoid saving the provisioner version * format * format * Fix scheduling field in config * format * fix public key content * Fix provisioner version for azure * Use ray port from head node for workers * format * fix ray_port * fix smoke tests * shorter sleep time * refactor status refresh version * [Provisioner] Support reserved instances in GCP (#2824) * Support reserved instances * remove min max count * remove unecessary fields * Add todo * Add todo * remove unused reseravation config * Fix config.yaml tests * format * sync with the upstream (Dec 05, 23) * set timeout and retries * handle GCP creation errors * Fix provisioning errors and improve error handling * update blocklist for GCP * refactor code for linting issues * fix * show instance status during assertion error * Refactor error handling for failover * adopt changes in #2854 * format * retry for wait operation * format * fix typo * fix interface * more robust zone to region * Fix tpu vm external IP setup * Fix get node * format * revert for TPU VM pod * Fix get_cluster_info call * fix tab * Fix timeout case * remvoe \t * GCP query statuses with new provisioner * format * fix import * refactor query status * fix stopped status * Fix stopped status * Add head ray start command * Add back keys * add workers * Fix non stopped states * Add more logs for autostop * format * increase job_docker job time * better logging * shorter time for recovering * fix conflicting var * change to V1 * fix comments * refactor constants * refactoring * typo * Fix max retry * longer sleep time for job * add detach setup * revert --detach-setup * shorter time for recovering * more retries * Update sky/provision/instance_setup.py Co-authored-by: Zongheng Yang <[email protected]> * Update sky/backends/cloud_vm_ray_backend.py Co-authored-by: Zongheng Yang <[email protected]> * format * [Provisioner] New provisioner for GCP TPU VM (#2898) * init * test * test ins_type * fix * format.. * wip * remove TPU config * fix node ips * Fix TPU VM pod * format * use TPU VM as default * Fix example for TPU VM * format * fix optimizer random dag * set TPU-VM * accelerator_args False * backward compatibility * add tpu filter for tests * fix * Fix * fix status refresh for tpu VM pod * Support autodown for TPU VM pod * Allow multi-node TPU VM pod * Allow multi-node TPU VM pod * fix * add execute for operation * avoid from * Wait for pending before set_labels * format * refactor constants * Fix for API changes * remove GCP failover handler v1 * format * remove TPU VM pod specific codes as they have been moved to new provisioner * Add error handling for TPU pod case * fix * fix multiple node calculation * refactor tpu_utils to gcp_utils * shorter time for recovering * format --------- Co-authored-by: Wei-Lin Chiang <[email protected]> * better error logging * Fix logging for TPU VM * Fix logging * Add insufficientCapacity to error handler * Avoid adding duplicated resources to blocked_resources * Fix blocked resources * address comment * add comment * Add comments * format * Fix num_node_ips * format * fix smoke test for preinstalled package * shorter wait time for recovering * Fix TPU VM pod stop * format * Update sky/provision/gcp/instance_utils.py Co-authored-by: Zongheng Yang <[email protected]> * update * format * Add debug message * revert version for handle * disable tpu name set --------- Co-authored-by: Zhanghao Wu <[email protected]> Co-authored-by: Zongheng Yang <[email protected]> Co-authored-by: Wei-Lin Chiang <[email protected]>
Previously we were filtering by subnet name (
name
), but rather we should be filtering by vpc name (network
). We did not hit this bug because our subnets were named the same as the VPC:while a user hit this bug because they are named differently:
Tested (run the relevant ones):
bash format.sh
gcp.vpc_name
config:sky launch -c no-gcp-config --cpus 2 --use-spot --cloud gcp -i0 --down -y
gcp.vpc_name
config:sky launch -c skypilot-vpc-gcp-config --cpus 2 --use-spot --cloud gcp -i0 --down -y
pytest tests/test_smoke.py
pytest tests/test_smoke.py::test_fill_in_the_name
bash tests/backward_comaptibility_tests.sh