[Release] Release 0.7.1 #4438

Open
wants to merge 19 commits into base: releases/0.7.1_pure

Conversation

zpoint (Collaborator) commented Dec 4, 2024

Based on releases/0.7.0, this PR cherry-picks all commits for 0.7.1, with some manual changes:

  • Changes only to smoke_tests.py, to ensure more smoke tests pass and Buildkite works.
  • A version bump.

The release should include version 0.7.1 along with the manual changes:

  • This PR currently contains only the 0.7.1 updates based on version 0.7.0, and is open for review.
  • The manual changes have been submitted separately in this PR to facilitate easier review.

The code used to run the tests below includes version 0.7.1 along with the manual changes.


Smoke tests:

Use Buildkite CI to run the following tests:

  • pytest tests/test_smoke.py --aws
  • pytest tests/test_smoke.py --gcp
  • pytest tests/test_smoke.py --azure
  • pytest tests/test_smoke.py --kubernetes

All tests pass except for the following failures:

pytest tests/test_smoke.py::test_tpu_vm_pod --gcp --- setup fail, env error, fixed by other PR on master
pytest tests/test_smoke.py::test_tpu_vm --gcp --- setup fail, env error, fixed by other PR on master
pytest tests/test_smoke.py::TestStorageWithCredentials::test_gcs_regions --azure --- credential issue, Permission denied by location policies.
pytest tests/test_smoke.py::test_gcp_force_enable_external_ips --gcp --- ssh fail on provision, even on master
pytest tests/test_smoke.py::test_managed_jobs_storage --azure --- ? Fail to provision ?
pytest tests/test_smoke.py::test_azure_best_tier_failover --azure --- ResourcesUnavailableError
pytest tests/test_smoke.py::test_file_mounts --azure --- Failed to run command before rsync ?
pytest tests/test_smoke.py::test_skyserve_new_autoscaler_update --azure --- Got FAILED_INITIAL_DELAY instead of FAILED
pytest tests/test_smoke.py::test_azure_disk_tier --azure --- ResourcesUnavailableError
pytest tests/test_smoke.py::test_kubernetes_context_failover --kubernetes --- Resource limit, no h100
pytest tests/test_smoke.py::TestStorageWithCredentials::test_gcs_regions --aws --- Permission denied by location policies

You can view the details by clicking on the failure in Buildkite.

Manual tests:

  • Locally build the docs, open docs/build/index.html, and scroll through “CLI Reference” (ideally, every page) to check for missing sections (we once caught the CLI page completely missing due to an import error, and another time it displayed odd blockquotes).
  • Check sky -v
  • backward_compatibility_tests.sh against 0.7.0 on AWS, run by Buildkite
  • Run manual stress tests (see the items below)
    • Run the following script:
      sky jobs launch --gpus A100:8 --cloud aws echo hi -y
      # Check we are properly failing over the zones:
      sky jobs logs --controller

    • Run the following script (failed due to resources being unavailable):
      sky launch -c dbg --cloud aws --num-nodes 16 --gpus T4 --down --use-spot
      sky down dbg

    • sky launch --num-nodes=75 -c dbg --cpus 2+ --use-spot --down --cloud aws -y
    • Launch many jobs with the script below:
# Launching many jobs on a cluster
sky launch -c test-many-jobs --cloud aws --cpus 16 --region us-east-1
python3 -c "
import subprocess
from multiprocessing.pool import ThreadPool

def run_task(task):
    print(f'Running task {task}')
    subprocess.run(f'sky exec test-many-jobs -d \"echo hi {task}; sleep 60\"', shell=True)

pool = ThreadPool(8)
pool.map(run_task, range(1000))
"
# Test the job queue on cluster is correct
sky queue test-many-jobs
  • sky show-gpus manual tests

  • Run a 24-hour+ spot job and ensure it doesn’t OOM
    sky spot launch -n test-oom --cloud aws --cpus 2 sleep 1000000000000000

Michaelvll and others added 9 commits December 4, 2024 18:34
…ing (skypilot-org#4264)

* fix race condition for setting job status to FAILED during INIT

* Fix

* fix

* format

* Add smoke tests

* revert pending submit

* remove update entirely for the job schedule step

* wait for job 32 to finish

* fix smoke

* move and rename

* Add comment

* minor
* Avoid job schedule race condition

* format

* format

* Avoid race for cancel
…ounts are specified (skypilot-org#4317)

do file mounts if storage is specified
* avoid catching ValueError during failover

If the cloud api raises ValueError or a subclass of ValueError during instance
termination, we will assume the cluster was downed. Fix this by introducing a
new exception ClusterDoesNotExist that we can catch instead of the more general
ValueError.

* add unit test

* lint
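
For context, a minimal sketch of the pattern this commit message describes. Only the ClusterDoesNotExist name comes from the commit message; every other function here is a hypothetical stand-in, not the actual SkyPilot code:

class ClusterDoesNotExist(ValueError):
    """Raised when an operation targets a cluster that no longer exists."""


def terminate_instances(cluster_name: str) -> None:
    # Stand-in for a cloud API call. Real cloud SDKs can raise ValueError for
    # unrelated reasons (e.g. a malformed request), which is why catching the
    # broad ValueError was unsafe.
    raise ValueError(f'unexpected API error while terminating {cluster_name}')


def down_cluster(cluster_name: str) -> None:
    try:
        terminate_instances(cluster_name)
    except ClusterDoesNotExist:
        # Only in this case do we assume the cluster was already downed and
        # clean up its local record.
        print(f'{cluster_name} is already gone; cleaning up local state.')
        return
    print(f'{cluster_name} terminated normally.')


if __name__ == '__main__':
    try:
        down_cluster('dbg')
    except ValueError as e:
        # An unrelated ValueError now propagates instead of being swallowed.
        print(f'Propagated as a real failure: {e}')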
@zpoint zpoint changed the title [Release] Release 0.7.0 [Release] Release 0.7.1 Dec 4, 2024
@zpoint zpoint changed the base branch from releases/0.7.1 to releases/0.7.1_pure December 4, 2024 10:51
cg505 and others added 3 commits December 9, 2024 10:58
…g#4443)

* if a newly-created cluster is missing from the cloud, wait before deleting

Addresses skypilot-org#4431.

* confirm cluster actually terminates before deleting from the db

* avoid deleting cluster data outside the primary provision loop

* tweaks

* Apply suggestions from code review

Co-authored-by: Zhanghao Wu <[email protected]>

* use usage_intervals for new cluster detection

get_cluster_duration will include the total duration of the cluster since its
initial launch, while launched_at may be reset by sky launch on an existing
cluster. So this is a more accurate method to check.

* fix terminating/stopping state for Lambda and Paperspace

* Revert "use usage_intervals for new cluster detection"

This reverts commit aa6d2e9.

* check cloud.STATUS_VERSION before calling query_instances

* avoid try/catch when querying instances

* update comments

---------

Co-authored-by: Zhanghao Wu <[email protected]>

A minimal sketch of the "wait before deleting" idea in the commit message above; all names here are hypothetical illustrations rather than the actual SkyPilot internals (see the standalone example below).
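import time


def cloud_lists_instances(cluster_name: str) -> bool:
    """Stand-in for querying the cloud provider for the cluster's instances."""
    return False  # pretend the newly created instances are not visible yet


def confirmed_missing(cluster_name: str,
                      attempts: int = 3,
                      wait_seconds: float = 5.0) -> bool:
    # Poll a few times before concluding the cluster is really gone, so a
    # newly created cluster that the cloud API has not listed yet is not
    # deleted from the local database prematurely.
    for attempt in range(attempts):
        if cloud_lists_instances(cluster_name):
            return False  # instances showed up; keep the cluster record
        if attempt < attempts - 1:
            time.sleep(wait_seconds)
    return True  # still missing after several checks


if __name__ == '__main__':
    print(confirmed_missing('dbg', attempts=2, wait_seconds=0.1))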
* smoke tests support storage mount only

* fix verify command

* rename to only_mount
@zpoint zpoint requested a review from Michaelvll December 10, 2024 04:30
@romilbhardwaj romilbhardwaj self-requested a review December 10, 2024 18:25
@@ -1144,7 +1144,7 @@ def test_gcp_stale_job_manual_restart():
     # Ensure the skylet updated the stale job status.
     _get_cmd_wait_until_job_status_contains_without_matching_job(
         cluster_name=name,
-        job_status=[JobStatus.FAILED.value],
+        job_status=[JobStatus.FAILED],
Collaborator:

For this kind of hot fix, should we include it in master and cherry-pick it?

Collaborator Author:

It's due to a merge conflict. The value on master is FAILED_DRIVER, which does not exist in version 0.7.1 (it is correct on master).

Michaelvll (Collaborator) commented Dec 19, 2024

Looking at the test failures (checked ones should be fine):

  • pytest tests/test_smoke.py::test_tpu_vm_pod --gcp --- setup fail, env error, fixed by other PR on master
  • pytest tests/test_smoke.py::test_tpu_vm --gcp --- setup fail, env error, fixed by other PR on master
  • pytest tests/test_smoke.py::TestStorageWithCredentials::test_gcs_regions --azure --- credential issue, Permission denied by location policies for me-central2
  • pytest tests/test_smoke.py::TestStorageWithCredentials::test_gcs_regions --aws --- Permission denied by location policies
  • pytest tests/test_smoke.py::test_gcp_force_enable_external_ips --gcp --- ssh fail on provision, even on master
    TODO: we should add a skip for this smoke test, as it will only work when running on a GCP instance.

The following do not fail on release/0.7.0; we should fix them:

  • pytest tests/test_smoke.py::test_file_mounts --azure --- Failed to run command before rsync ?
    Seems the GCP credential needs reauth on the agent? Should we switch to service account?
  • pytest tests/test_smoke.py::test_skyserve_new_autoscaler_update --azure --- Got FAILED_INITIAL_DELAY instead of FAILED
    TODO: @cblmemo do you know what the reason is here?
  • pytest tests/test_smoke.py::test_kubernetes_context_failover --kubernetes --- Resource limit, no h100
    It should pass with the setup below. Can we try to set this up in Buildkite?
    def test_kubernetes_context_failover():
        """Test if the kubernetes context failover works.

        This test requires two kubernetes clusters:
        - kind-skypilot: the local cluster with mock labels for 8 H100 GPUs.
        - another accessible cluster: with enough CPUs

        To start the first cluster, run:
          sky local up
          # Add mock label for accelerator
          kubectl label node --overwrite skypilot-control-plane skypilot.co/accelerator=h100 --context kind-skypilot
          # Get the token for the cluster in context kind-skypilot
          TOKEN=$(kubectl config view --minify --context kind-skypilot -o jsonpath='{.users[0].user.token}')
          # Get the API URL for the cluster in context kind-skypilot
          API_URL=$(kubectl config view --minify --context kind-skypilot -o jsonpath='{.clusters[0].cluster.server}')
          # Add mock capacity for GPU
          curl --header "Content-Type: application/json-patch+json" --header "Authorization: Bearer $TOKEN" --request PATCH --data '[{"op": "add", "path": "/status/capacity/nvidia.com~1gpu", "value": "8"}]' "$API_URL/api/v1/nodes/skypilot-control-plane/status"
          # Add a new namespace to test the handling of namespaces
          kubectl create namespace test-namespace --context kind-skypilot
          # Set the namespace to test-namespace
          kubectl config set-context kind-skypilot --namespace=test-namespace --context kind-skypilot
        """
  • pytest tests/test_smoke.py::test_managed_jobs_storage --azure --- ? Fail to provision ?
    TODO: try to change the region to eastus2 and fix it on master @zpoint
  • pytest tests/test_smoke.py::test_azure_best_tier_failover --azure --- ResourcesUnavailableError
    TODO: try changing the region to eastus2 and fix it on master @zpoint
  • pytest tests/test_smoke.py::test_azure_disk_tier --azure --- ResourcesUnavailableError
    TODO: try to change to eastus2 and fix it on master @zpoint

cblmemo (Collaborator) commented Dec 19, 2024

pytest tests/test_smoke.py::test_skyserve_new_autoscaler_update --azure --- Got FAILED_INITIAL_DELAY instead of FAILED
TODO: @cblmemo do you know what the reason is here?

Does this issue persist? Since Azure provisioning is relatively slow, it is possible that sometimes it passes the initial delay and sometimes not.

Also, I'm a little bit confused: why is there an expected FAILED status?

zpoint (Collaborator Author) commented Dec 20, 2024

pytest tests/test_smoke.py::test_skyserve_new_autoscaler_update --azure --- Got FAILED_INITIAL_DELAY instead of FAILED
TODO: @cblmemo do you know what the reason is here?

Does this issue persist? Since Azure provisioning is relatively slow, it is possible that sometimes it passes the initial delay and sometimes not.

I've tried many times with no luck. Even if it's flaky, the failure rate is high. Could we fix the flakiness?

zpoint (Collaborator Author) commented Dec 20, 2024

pytest tests/test_smoke.py::test_managed_jobs_storage --azure --- ? Fail to provision ?
TODO: try to change the region to eastus2 and fix it on master @zpoint

After changing the region, I found that this test case needs to be run on the AWS controller. If we don't have a controller running, sky launches an Azure controller, which then fails due to missing AWS credentials. Is this a bug? @Michaelvll

(t-managed-jobs-storage-8b, pid=2429) Traceback (most recent call last):
(t-managed-jobs-storage-8b, pid=2429)   File "/home/azureuser/miniconda3/envs/skypilot-runtime/lib/python3.10/runpy.py", line 196, in _run_module_as_main
(t-managed-jobs-storage-8b, pid=2429)     return _run_code(code, main_globals, None,
(t-managed-jobs-storage-8b, pid=2429)   File "/home/azureuser/miniconda3/envs/skypilot-runtime/lib/python3.10/runpy.py", line 86, in _run_code
(t-managed-jobs-storage-8b, pid=2429)     exec(code, run_globals)
(t-managed-jobs-storage-8b, pid=2429)   File "/home/azureuser/skypilot-runtime/lib/python3.10/site-packages/sky/jobs/controller.py", line 583, in <module>
(t-managed-jobs-storage-8b, pid=2429)     start(args.job_id, args.dag_yaml, args.retry_until_up)
(t-managed-jobs-storage-8b, pid=2429)   File "/home/azureuser/skypilot-runtime/lib/python3.10/site-packages/sky/jobs/controller.py", line 541, in start
(t-managed-jobs-storage-8b, pid=2429)     _cleanup(job_id, dag_yaml=dag_yaml)
(t-managed-jobs-storage-8b, pid=2429)   File "/home/azureuser/skypilot-runtime/lib/python3.10/site-packages/sky/jobs/controller.py", line 480, in _cleanup
(t-managed-jobs-storage-8b, pid=2429)     dag, _ = _get_dag_and_name(dag_yaml)
(t-managed-jobs-storage-8b, pid=2429)   File "/home/azureuser/skypilot-runtime/lib/python3.10/site-packages/sky/jobs/controller.py", line 40, in _get_dag_and_name
(t-managed-jobs-storage-8b, pid=2429)     dag = dag_utils.load_chain_dag_from_yaml(dag_yaml)
(t-managed-jobs-storage-8b, pid=2429)   File "/home/azureuser/skypilot-runtime/lib/python3.10/site-packages/sky/utils/dag_utils.py", line 101, in load_chain_dag_from_yaml
(t-managed-jobs-storage-8b, pid=2429)     task = task_lib.Task.from_yaml_config(task_config, env_overrides)
(t-managed-jobs-storage-8b, pid=2429)   File "/home/azureuser/skypilot-runtime/lib/python3.10/site-packages/sky/task.py", line 438, in from_yaml_config
(t-managed-jobs-storage-8b, pid=2429)     storage_obj = storage_lib.Storage.from_yaml_config(storage[1])
(t-managed-jobs-storage-8b, pid=2429)   File "/home/azureuser/skypilot-runtime/lib/python3.10/site-packages/sky/data/storage.py", line 1043, in from_yaml_config
(t-managed-jobs-storage-8b, pid=2429)     storage_obj = cls(name=name,
(t-managed-jobs-storage-8b, pid=2429)   File "/home/azureuser/skypilot-runtime/lib/python3.10/site-packages/sky/data/storage.py", line 556, in __init__
(t-managed-jobs-storage-8b, pid=2429)     self.add_store(StoreType.S3)
(t-managed-jobs-storage-8b, pid=2429)   File "/home/azureuser/skypilot-runtime/lib/python3.10/site-packages/sky/data/storage.py", line 894, in add_store
(t-managed-jobs-storage-8b, pid=2429)     store = store_cls(
(t-managed-jobs-storage-8b, pid=2429)   File "/home/azureuser/skypilot-runtime/lib/python3.10/site-packages/sky/data/storage.py", line 1110, in __init__
(t-managed-jobs-storage-8b, pid=2429)     super().__init__(name, source, region, is_sky_managed,
(t-managed-jobs-storage-8b, pid=2429)   File "/home/azureuser/skypilot-runtime/lib/python3.10/site-packages/sky/data/storage.py", line 261, in __init__
(t-managed-jobs-storage-8b, pid=2429)     self._validate()
(t-managed-jobs-storage-8b, pid=2429)   File "/home/azureuser/skypilot-runtime/lib/python3.10/site-packages/sky/data/storage.py", line 1156, in _validate
(t-managed-jobs-storage-8b, pid=2429)     raise exceptions.ResourcesUnavailableError(
(t-managed-jobs-storage-8b, pid=2429) sky.exceptions.ResourcesUnavailableError: Storage 'store: s3' specified, but AWS access is disabled. To fix, enable AWS by running `sky check`. More info: https://docs.skypilot.co/en/latest/getting-started/installation.html.

pytest tests/test_smoke.py::test_file_mounts --azure --- Failed to run command before rsync ?
Seems the GCP credential needs reauth on the agent? Should we switch to service account?

It's an AWS sync error, not GCP, and it's 100% reproducible. @Michaelvll

E 12-20 16:05:02 subprocess_utils.py:141] Successfully installed PyYAML-6.0.2 awscli-1.36.26 botocore-1.35.85 colorama-0.4.6 docutils-0.16 jmespath-1.0.1 pyasn1-0.6.1 rsa-4.7.2 s3transfer-0.10.4
E 12-20 16:05:02 subprocess_utils.py:141] fatal error: Unable to locate credentials
E 12-20 16:05:02 subprocess_utils.py:141] 

Traceback (most recent call last):
  File "/Users/zepingguo/miniconda3/envs/sky/bin/sky", line 8, in <module>
    sys.exit(cli())
  File "/Users/zepingguo/.local/lib/python3.10/site-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
  File "/Users/zepingguo/.local/lib/python3.10/site-packages/click/core.py", line 1078, in main
    rv = self.invoke(ctx)
  File "/Users/zepingguo/Desktop/skypilot/sky/utils/common_utils.py", line 366, in _record
    return f(*args, **kwargs)
  File "/Users/zepingguo/Desktop/skypilot/sky/cli.py", line 838, in invoke
    return super().invoke(ctx)
  File "/Users/zepingguo/.local/lib/python3.10/site-packages/click/core.py", line 1688, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/Users/zepingguo/.local/lib/python3.10/site-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/Users/zepingguo/.local/lib/python3.10/site-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
  File "/Users/zepingguo/Desktop/skypilot/sky/utils/common_utils.py", line 386, in _record
    return f(*args, **kwargs)
  File "/Users/zepingguo/Desktop/skypilot/sky/cli.py", line 1159, in launch
    _launch_with_confirm(task,
  File "/Users/zepingguo/Desktop/skypilot/sky/cli.py", line 628, in _launch_with_confirm
    sky.launch(
  File "/Users/zepingguo/Desktop/skypilot/sky/utils/common_utils.py", line 386, in _record
    return f(*args, **kwargs)
  File "/Users/zepingguo/Desktop/skypilot/sky/utils/common_utils.py", line 386, in _record
    return f(*args, **kwargs)
  File "/Users/zepingguo/Desktop/skypilot/sky/execution.py", line 529, in launch
    return _execute(
  File "/Users/zepingguo/Desktop/skypilot/sky/execution.py", line 329, in _execute
    backend.sync_file_mounts(handle, task.file_mounts,
  File "/Users/zepingguo/Desktop/skypilot/sky/utils/common_utils.py", line 386, in _record
    return f(*args, **kwargs)
  File "/Users/zepingguo/Desktop/skypilot/sky/utils/common_utils.py", line 366, in _record
    return f(*args, **kwargs)
  File "/Users/zepingguo/Desktop/skypilot/sky/backends/backend.py", line 101, in sync_file_mounts
    return self._sync_file_mounts(handle, all_file_mounts, storage_mounts)
  File "/Users/zepingguo/Desktop/skypilot/sky/backends/cloud_vm_ray_backend.py", line 3174, in _sync_file_mounts
    self._execute_file_mounts(handle, all_file_mounts)
  File "/Users/zepingguo/Desktop/skypilot/sky/backends/cloud_vm_ray_backend.py", line 4634, in _execute_file_mounts
    backend_utils.parallel_data_transfer_to_nodes(
  File "/Users/zepingguo/Desktop/skypilot/sky/backends/backend_utils.py", line 1440, in parallel_data_transfer_to_nodes
    subprocess_utils.run_in_parallel(_sync_node, runners, num_threads)
  File "/Users/zepingguo/Desktop/skypilot/sky/utils/subprocess_utils.py", line 121, in run_in_parallel
    return list(p.imap(func, args))
  File "/Users/zepingguo/miniconda3/envs/sky/lib/python3.10/multiprocessing/pool.py", line 873, in next
    raise value
  File "/Users/zepingguo/miniconda3/envs/sky/lib/python3.10/multiprocessing/pool.py", line 125, in worker
    result = (True, func(*args, **kwds))
  File "/Users/zepingguo/Desktop/skypilot/sky/backends/backend_utils.py", line 1418, in _sync_node
    subprocess_utils.handle_returncode(rc,
  File "/Users/zepingguo/Desktop/skypilot/sky/utils/subprocess_utils.py", line 148, in handle_returncode
    raise exceptions.CommandError(returncode, command, format_err_msg,
sky.exceptions.CommandError: Command mkdir -p ~/.sky/file_mounts/s3-data-test && aws --version >/dev/null 2>&1 || pip3 install awscli && aws s3 sync --no-follow-symlinks s3://fah-public-data-covid19-cryptic-pockets/human/il6/PROJ14534/RUN999/CLONE0/results0 ~/.sky/file_mounts/s3-data-test failed with return code 1.
Failed to run command before rsync s3://fah-public-data-covid19-cryptic-pockets/human/il6/PROJ14534/RUN999/CLONE0/results0 -> /s3-data-test. Ensure that the network is stable, then retry. mkdir -p ~/.sky/file_mounts/s3-data-test && aws --version >/dev/null 2>&1 || pip3 install awscli && aws s3 sync --no-follow-symlinks s3://fah-public-data-covid19-cryptic-pockets/human/il6/PROJ14534/RUN999/CLONE0/results0 ~/.sky/file_mounts/s3-data-test See logs in ~/sky_logs/sky-2024-12-20-15-58-06-704254/file_mounts.log
D 12-20 16:05:02 skypilot_config.py:228] Using config path: /Users/zepingguo/.sky/config.yaml
D 12-20 16:05:02 skypilot_config.py:233] Config loaded:

@zpoint zpoint mentioned this pull request Dec 20, 2024
cblmemo (Collaborator) commented Dec 21, 2024

pytest tests/test_smoke.py::test_skyserve_new_autoscaler_update --azure --- Got FAILED_INITIAL_DELAY instead of FAILED
TODO: @cblmemo do you know what the reason is here?

Does this issue persist? Since Azure provisioning is relatively slow, it is possible that sometimes it passes the initial delay and sometimes not.

I've tried many times with no luck. Even if it's flaky, the failure rate is high. Could we fix the flakiness?

Does increasing the initial delay work for you?
