[Core] make per-cloud catalog lookup parallel #4483

aylei · 2024-12-18T17:47:07Z

Part of #3159

This is a draft PR with a known issue to be solved.

The tests were run locally, timing the interval from launch to the confirmation prompt (aylei@e662df0). On my laptop, with 7 clouds enabled, this PR improves performance over main (f0ebf13) in 4 test cases:

sky launch --gpus H100:8, from 2.38s to 2.06s
rm -rf ~/.sky/catalogs/** && sky launch --gpus H100:8 (cold start), from 12.15s to 6.29s
sky launch test.yaml, ordered resources with non-canonical names, from 5.75s to 5.07s
rm -rf ~/.sky/catalogs/** && sky launch test.yaml, same as above, but cold start, from 15.75s to 10.17s

test.yaml:

resources:
  cpus: 8+
  ordered:
    - accelerators: L40S:1
    - accelerators: L40:1
    - accelerators: A10g:1
    - accelerators: A10:1
    - accelerators: A100:1
    - accelerators: H100:1

run: |
  echo 'test done'

The cold-start and test.yaml cases were chosen deliberately, because for the "--gpus H100:8" case the improvement was smaller than I expected (I expected a ~1s improvement based on earlier analysis).

The reason is the GIL. Through profiling (source: gil.svg), I found that each cloud spends 10~100+ ms on init, which cannot be parallelized because of the GIL.

In fact, in my setup only K8S and AWS issue network I/O: K8S ListNode costs ~0.5s and AWS get-caller-identity costs ~1s. So I also tested the cold-start case, which shows that in scenarios with more I/O calls, the gain from concurrency is more significant.
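The difference between the GIL-bound init work and the network calls can be illustrated with a small stand-alone sketch. It assumes the blocking calls behave like time.sleep, which releases the GIL while waiting; fake_io_call and the delays are made up for illustration and are not measurements from this PR.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from typing import List


def fake_io_call(delay_seconds: float) -> float:
    """Hypothetical stand-in for a blocking network call.

    time.sleep releases the GIL, so several of these can wait at once.
    """
    time.sleep(delay_seconds)
    return delay_seconds


def run_serial(delays: List[float]) -> float:
    """Run the calls one after another and return the elapsed time."""
    start = time.perf_counter()
    for delay in delays:
        fake_io_call(delay)
    return time.perf_counter() - start


def run_threaded(delays: List[float]) -> float:
    """Run the calls in a thread pool and return the elapsed time."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=len(delays)) as pool:
        list(pool.map(fake_io_call, delays))
    return time.perf_counter() - start


delays = [0.1, 0.1, 0.1, 0.1]
serial_elapsed = run_serial(delays)
threaded_elapsed = run_threaded(delays)
print(f'serial: {serial_elapsed:.2f}s, threaded: {threaded_elapsed:.2f}s')
```

Pure-Python CPU work would not overlap this way under the GIL, which matches the smaller gain seen in the warm-cache case.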

For test.yaml, I deliberately included non-canonical GPU names to cover the non-short-circuit path, where the I/O calls happen earlier, in get_accelerators.

I thought carefully about whether this PR is a premature optimization. My conclusion is that processing each cloud concurrently in multi-cloud scenarios is necessary for scalability, since SkyPilot is a multi-cloud solution at its core (IIUC). This is why I parallelized all catalog lookup calls in _map_clouds_catalog rather than only the single get_accelerators function. @Michaelvll What do you think?

Note: since sub-thread status is still silent, sky launch will no longer show catalog progress once this PR is merged. Follow-up: #4491

Tested (run the relevant ones):

Code formatting: bash format.sh
Any manual or new tests for this PR (see above)
All smoke tests: pytest tests/test_smoke.py
Relevant individual smoke tests: pytest tests/test_smoke.py::test_fill_in_the_name
Backward compatibility tests: conda deactivate; bash -i tests/backward_compatibility_tests.sh

Signed-off-by: Aylei <[email protected]>

Michaelvll

Thanks @aylei! Adding the parallelization for all cloud methods makes sense to me. Left some minor comments mainly for interfaces.

Investigating AWS and kubernetes latency should be a great next step. IIRC, trying to fetch the mapping for AWS zone names can cause a large latency even when we don't have AWS enabled.

sky/utils/subprocess_utils.py

sky/optimizer.py

aylei · 2024-12-19T02:58:48Z

Thanks @aylei! Adding the parallelization for all cloud methods makes sense to me. Left some minor comments mainly for interfaces.

Investigating AWS and kubernetes latency should be a great next step. IIRC, trying to fetch the mapping for AWS zone names can cause a large latency even when we don't have AWS enabled.

@Michaelvll Thanks for your suggestions❤️. I reproduced the AWS zone name issue by clearing the catalog cache, excluding aws from allowed_clouds in ~/.sky/config.yaml, and running sky launch --gpus L40:8, which uses a non-canonical GPU name and therefore enters the slow path. I think service_catalog should also honor the allowed_clouds configuration, but this seems to involve a semantic change, since allowed_clouds is documented as:

This field is used to restrict the clouds that SkyPilot will check and use
when running sky check.

My idea is to apply the same semantics to other commands like launch, wdyt? @Michaelvll
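A minimal sketch of what honoring allowed_clouds during catalog lookup could look like. The helper name filter_allowed_clouds, the case-insensitive matching, and treating a missing setting as "no restriction" are all assumptions for illustration, not SkyPilot's actual behavior.

```python
from typing import Iterable, List, Optional


def filter_allowed_clouds(clouds: Iterable[str],
                          allowed_clouds: Optional[List[str]]) -> List[str]:
    """Keep only the clouds named in allowed_clouds (hypothetical helper).

    Assumption: None means the option is unset, so nothing is filtered.
    """
    if allowed_clouds is None:
        return list(clouds)
    allowed = {name.lower() for name in allowed_clouds}
    return [cloud for cloud in clouds if cloud.lower() in allowed]


# With aws excluded, a catalog lookup would never touch the AWS catalog,
# so the zone-mapping fetch would be skipped entirely.
catalog_clouds = ['aws', 'gcp', 'kubernetes']
print(filter_allowed_clouds(catalog_clouds, ['GCP', 'kubernetes']))
```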

As for normal cases, the zone mapping is usually cached with the AWS user ID as the cache key (

skypilot/sky/clouds/service_catalog/aws_catalog.py

Line 79 in c19e94e

filename = f'aws/az_mappings-{aws_user_hash}.csv'

), so the only relevant RPC is get-caller-identity in daily usage, with the assumption that the AWS account used rarely changes.
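The caching scheme described here can be sketched roughly as follows. Only the filename pattern comes from the line quoted above; the hashing scheme (a truncated SHA-256 of the user ID) and the helper name az_mapping_cache_path are assumptions made for this example.

```python
import hashlib
import os


def az_mapping_cache_path(cache_root: str, aws_user_id: str) -> str:
    """Build a per-account cache path for the AZ mapping (hypothetical helper).

    Assumption: the hash is a truncated SHA-256 of the user ID; the real
    code may derive aws_user_hash differently.
    """
    aws_user_hash = hashlib.sha256(aws_user_id.encode('utf-8')).hexdigest()[:8]
    filename = f'aws/az_mappings-{aws_user_hash}.csv'
    return os.path.join(cache_root, filename)


# The same account always maps to the same file (cache hit), while a
# different account gets a different file (cache miss, slow fetch).
root = os.path.expanduser('~/.sky/catalogs')
path_a = az_mapping_cache_path(root, 'user-a')
path_b = az_mapping_cache_path(root, 'user-b')
print(path_a != path_b)
```

Because the key depends on the caller identity, resolving that identity is the one RPC that cannot be skipped on a cache hit.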

Therefore, for the mainstream scenarios, I currently think there may be no way to optimize the AWS get-caller-identity RPC and the K8S ListNode RPC without adjusting the UX. That is, caching these calls requires extra user attention and should be carefully designed. I will evaluate whether the benefits would pay off.

Signed-off-by: Aylei <[email protected]>

aylei · 2024-12-19T12:32:38Z

@Michaelvll I've addressed the comments and added sub thread status support. Please kindly take a look. I will move further work on #3159 to separate PRs so each of them can be focused and coherent.

Current look when sub thread status is involved:

PS: fetching the zone mapping is indeed slow on a cache miss😂

Michaelvll

Thanks @aylei! It looks good to me. Just some minor nits. Let's proceed with the optimization of the second step. : )

Michaelvll · 2024-12-19T19:15:21Z

sky/optimizer.py

+        feasible_list = subprocess_utils.run_in_parallel(
+            lambda cloud, r=resources, n=task.num_nodes:
+            (cloud, cloud.get_feasible_launchable_resources(r, n)),
+            clouds_list)
+        for cloud, feasible_resources in feasible_list:


very minor nit: run_in_parallel keeps the original order of the arguments, so there is no need to include cloud in the output; zip(clouds_list, feasible_list) should be fine.

Good catch. Even with this new knowledge, I personally still lean toward loosening the dependency on run_in_parallel's ordering behavior for maintainability, wdyt?
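The two options discussed here can be compared with the standard library. The sketch assumes a thread pool whose map call behaves like concurrent.futures.Executor.map, which yields results in input order even when later tasks finish first; whether run_in_parallel is built this way is not shown in this thread, and slow_square is a made-up task.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def slow_square(x: int) -> int:
    """Made-up task: larger inputs sleep less, so they finish first."""
    time.sleep(0.05 / (x + 1))
    return x * x


inputs = [0, 1, 2, 3]
with ThreadPoolExecutor(max_workers=len(inputs)) as pool:
    # Option 1: rely on the pool returning results in input order.
    outputs = list(pool.map(slow_square, inputs))
    # Option 2: return the key with the result, so the caller does not
    # depend on any ordering guarantee.
    keyed = list(pool.map(lambda x: (x, slow_square(x)), inputs))

pairs = list(zip(inputs, outputs))
print(pairs == keyed)
```

Both forms give the same pairs here; the second simply keeps working if the helper's ordering behavior ever changes.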

Michaelvll · 2024-12-20T00:05:38Z

sky/utils/rich_utils.py

@@ -1,16 +1,21 @@
 """Rich status spinner utils."""


This is great! However, since this can affect many other parts of the code, let's move this to another PR and keep this PR clean, with only the parallelism changes.

Sure, but since sub-thread status is silent on master, there would be a UX degradation in between. Is that okay?

Michaelvll · 2024-12-20T00:09:52Z

sky/utils/subprocess_utils.py

@@ -113,6 +114,12 @@ def run_in_parallel(func: Callable,
      A list of the return values of the function func, in the same order as the
      arguments.
    """
+    if isinstance(args, collections.abc.Sized):


nit: how about we change the input args to be List[Any] and change the invocations to use lists to avoid this collections.abc.Sized?

Sounds good, I will try this~
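A rough sketch of the suggested interface, with args typed as List[Any] so its length is available without an isinstance check. The name run_in_parallel_sketch, the num_threads parameter, and the fallback to one thread per argument are assumptions for illustration, not the code that was merged.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Any, Callable, List, Optional


def run_in_parallel_sketch(func: Callable[[Any], Any],
                           args: List[Any],
                           num_threads: Optional[int] = None) -> List[Any]:
    """Apply func to every item of args in a thread pool (hypothetical).

    Returns the results in the same order as args.
    """
    if not args:
        # Nothing to do; also avoids creating a pool with zero workers.
        return []
    if len(args) == 1:
        # A single call gains nothing from a thread pool.
        return [func(args[0])]
    # Assumption: default to one thread per argument when unspecified.
    workers = num_threads if num_threads is not None else len(args)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(func, args))


print(run_in_parallel_sketch(str.upper, ['aws', 'gcp'], num_threads=2))
```

Callers holding a set or a generator would convert it with list(...) at the call site.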

Signed-off-by: Aylei <[email protected]>

aylei · 2024-12-20T09:19:28Z

@Michaelvll ready for the next review

Signed-off-by: Aylei <[email protected]>

Michaelvll

Thanks @aylei! LGTM. Could you share a bit about what the UX looks like without the thread-based handling for the spinner?

Michaelvll · 2024-12-21T01:47:14Z

sky/clouds/service_catalog/__init__.py

+        return method(*args, **kwargs)
+
+    results = subprocess_utils.run_in_parallel(_execute_catalog_method,
+                                               list(clouds), len(clouds))


style nit: use kwargs for non-obvious argument meaning : )
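The nit can be illustrated with a simplified version of the per-cloud dispatch. Everything below is a stand-in: CATALOGS, map_clouds_catalog_sketch, and the parameter names args and num_threads are assumptions, and the real catalog modules are replaced by tiny fake objects.

```python
from concurrent.futures import ThreadPoolExecutor
from types import SimpleNamespace
from typing import Any, Callable, List


def run_in_parallel(func: Callable[[Any], Any], args: List[Any],
                    num_threads: int) -> List[Any]:
    """Hypothetical helper: map func over args in a thread pool."""
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        return list(pool.map(func, args))


# Fake per-cloud catalog modules, for illustration only.
CATALOGS = {
    'aws': SimpleNamespace(list_accelerators=lambda name: [f'aws/{name}']),
    'gcp': SimpleNamespace(list_accelerators=lambda name: [f'gcp/{name}']),
}


def map_clouds_catalog_sketch(clouds: List[str], method_name: str, *args,
                              **kwargs) -> List[Any]:
    """Call the same catalog method on every cloud concurrently."""
    if not clouds:
        return []

    def _execute_catalog_method(cloud: str) -> Any:
        method = getattr(CATALOGS[cloud], method_name)
        return method(*args, **kwargs)

    # Keyword arguments make it clear which value is the work list and
    # which is the thread count.
    return run_in_parallel(_execute_catalog_method,
                           args=list(clouds),
                           num_threads=len(clouds))


print(map_clouds_catalog_sketch(['aws', 'gcp'], 'list_accelerators', 'H100'))
```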

sky/utils/subprocess_utils.py

Co-authored-by: Zhanghao Wu <[email protected]>

aylei · 2024-12-21T13:18:05Z

@Michaelvll

For canonical GPU:

For non-canonical GPU:

Fine when no catalog refresh is involved, otherwise it looks a little confusing since Optimizing... or List accelerators... would last several seconds with no detailed message.

Signed-off-by: Aylei <[email protected]>

Michaelvll · 2025-01-13T20:16:55Z

For non-canonical GPU:

Fine when no catalog refresh is involved, otherwise it looks a little confusing since Optimizing... or List accelerators... would last several seconds with no detailed message.

Nice finding @aylei! It seems we can solve the issue with the detailed spinner you showed above. Let's do it in another PR. Merging this PR for now.

aylei added 2 commits December 19, 2024 00:07

feat: make per-cloud catalog lookup parallel

26e9067

Signed-off-by: Aylei <[email protected]>

lint and test

c9a2ce3

Signed-off-by: Aylei <[email protected]>

Michaelvll reviewed Dec 18, 2024

aylei added 5 commits December 19, 2024 11:00

address review comments

fb6c8ac

Signed-off-by: Aylei <[email protected]>

fix lint

3571b55

Signed-off-by: Aylei <[email protected]>

feat: support sub thread status attaching

e8e13bb

Signed-off-by: Aylei <[email protected]>

fix lint

dad3a65

Signed-off-by: Aylei <[email protected]>

Merge branch 'master' into issue-3159/catalog-parallel

0f21c7c

aylei marked this pull request as ready for review December 19, 2024 12:31

aylei requested a review from Michaelvll December 19, 2024 12:42

Michaelvll approved these changes Dec 20, 2024

aylei mentioned this pull request Dec 20, 2024

sky launch takes ~5s to print out optimizer table, which is slow #3159

Open

aylei added 2 commits December 20, 2024 16:27

revert sub thread status

99a1edb

Signed-off-by: Aylei <[email protected]>

address review comments

2ff4517

Signed-off-by: Aylei <[email protected]>

aylei mentioned this pull request Dec 20, 2024

[UX] sky launch does not show cloud catalog progress in multi-cloud setup #4491

Open

aylei requested a review from Michaelvll December 20, 2024 09:18

revert debug change

4619413

Signed-off-by: Aylei <[email protected]>

This was referenced Dec 20, 2024

[UX]: support sub thread status attaching #4493

Draft

[Core] honor allowed_cloud config in catalog lookup #4496

Open

Michaelvll approved these changes Dec 21, 2024

Update sky/utils/subprocess_utils.py

9bb757e

Co-authored-by: Zhanghao Wu <[email protected]>

address review comments

a9026db

Signed-off-by: Aylei <[email protected]>

Michaelvll merged commit 051b5fa into skypilot-org:master Jan 13, 2025
19 checks passed
