Add device memory #2565
update tf version
…g#2604) * [GCP] Avoid dumping cachetools for backward compatibilty * Add comment
update tpu runtime version for doc
…pilot-org#2606) * Add debug dockerfile * Add docker
…#2610) * Fix aws catalog fetcher for removed offerings * lint
This is a minor update to use `sky status --ip` to simplify the fetching of IP.
* News: Tweak width. * update * tweak * LRU, tweaks * tweaks * updates * updates
* Minor: Update CONTRIBUTING.md * Update CONTRIBUTING.md Co-authored-by: Romil Bhardwaj <[email protected]> --------- Co-authored-by: Romil Bhardwaj <[email protected]>
* add aws, gcp * fix * fix * add IBM message * Update sky/clouds/gcp.py Co-authored-by: Zongheng Yang <[email protected]> * Update sky/clouds/aws.py Co-authored-by: Zongheng Yang <[email protected]> * apply suggestion from code review * add newline * add blank line * add comments for constants --------- Co-authored-by: Zongheng Yang <[email protected]>
* new provisioner with AWS support fix sky status refresh update setup update metadata per-instance logs improve ux update comments update logging reduce retries of provisioning to match the original number fix Ray compat issue update the config with the latest changes sync events with the latest changes sync with upstream resolve conflict * cleanup unused functions * lint * update * fix * fix * Fix network configuration and authentication * Fix creation cluster name * fix logging * format * format * slight improvement for logging * minor fix logging * fix autostop * fix autostop * Fix ray ports used * fix ray start * Fix logging for ray start and skylet * fix ports for ray dashboard * add head / worker in name * More logging information for launching * fix cluster name in progress bar * Fix ssh * format * grammar * Fix message * fix ssh_proxy_commands * fix merge error * Get rid of ray logging * use info * change logger to use contextmanager * format * [major] remove dependency for ray on aws * minor fix * allow warning to be printed * adopt changes * rename * Better spinner * format * update logging * rich_utils in provision_utils * Fix instance log path * Log head node to provision.log * remove unused var * fix ssh method * Fix ray for worker node * Fix ray cluster on worker * fix ray cluster * Fix logging and ip fetching * show cluster name * fix wait after cluster launch * Add logging * Add continue * Change back to sleep before wait * use sleep 1 instead * clean * Avoid check skylet twice * skip docker * remove get_feasible_ips * fix comments * minor fix for logs * minor fix * Fix the ip fetching logic * Add quote for `cluster_name!r` * fix return * fix test * fix comments * fix comments * fix comments * Smoke tests changes to use a private VPC region. 
* refactor get_ips * minor debug info fix * fix * UX fix * Fix proxy command to be None * Address comments * format * Fix quote * update logging * compatibility for python 3.7 * format * format * change launched to provisioned * clean up ray yaml * format * [Provisioner] Add docker support back (skypilot-org#2507) * Add docker support back * Fix * Works! * format * Add docker back to the supported features * fix docker_cmd * Fix docker cmd * fix DockerLoginConfig * Move docker login config * Fix backward compatibility for docker * Fix docker login * Fix docker login config new line issue * Fix string process in ray yaml * Update sky/provision/docker_utils.py Co-authored-by: Tian Xia <[email protected]> * Update sky/provision/instance_setup.py Co-authored-by: Tian Xia <[email protected]> * Update sky/provision/docker_utils.py Co-authored-by: Tian Xia <[email protected]> * Update sky/provision/instance_setup.py Co-authored-by: Tian Xia <[email protected]> * address comments * format * format * Add comment * Address comments * format --------- Co-authored-by: Tian Xia <[email protected]> * Wording for SSH connection * Fix ray status check for backward compatibility * remnant * stronger backward compatibility tests * remove unused tag * Revert "remove unused tag" This reverts commit a12df8e. 
* Add default value for `docker_user` * fix botocore config * lint * Fix mypy * remove unused variable * minor fix for logging * expose bootstrap error * format * minor change for `wait_instances` API * lint * use `command_runner.ssh_options_list` * fix ssh command * reword * Dim error for bootstrapping * fix user known file * Update sky/setup_files/setup.py Co-authored-by: Zongheng Yang <[email protected]> * Add ray back, will fix the dependency issue skypilot-org#2625 * move dependency to local ray * Address renaming comments * renamed to provisioner * refactoring for logging to reduce confusion * format * rename back to meta data for metadata utils * format * move provisioner to `provision/` * Do not propagate to provisioner logger * Minor changes * Fix color for error of provisioner * Remove dimmer * Add back missing handler for `provision_logger` * add comments * Print error message for the failed ssh command * Make the error yellow --------- Co-authored-by: Zhanghao Wu <[email protected]> Co-authored-by: Zongheng Yang <[email protected]> Co-authored-by: Tian Xia <[email protected]>
* UX: Fix spot launch hint. * lint
* Add falcon to list of LLMs * Add falcon to list of LLMs
* fix nits * update
…#2625) * new provisioner with AWS support fix sky status refresh update setup update metadata per-instance logs improve ux update comments update logging reduce retries of provisioning to match the original number fix Ray compat issue update the config with the latest changes sync events with the latest changes sync with upstream resolve conflict * cleanup unused functions * lint * update * fix * fix * Fix network configuration and authentication * Fix creation cluster name * fix logging * format * format * slight improvement for logging * minor fix logging * fix autostop * fix autostop * Fix ray ports used * fix ray start * Fix logging for ray start and skylet * fix ports for ray dashboard * add head / worker in name * More logging information for launching * fix cluster name in progress bar * Fix ssh * format * grammar * Fix message * fix ssh_proxy_commands * fix merge error * Get rid of ray logging * use info * change logger to use contextmanager * format * [major] remove dependency for ray on aws * minor fix * allow warning to be printed * adopt changes * rename * Better spinner * format * update logging * rich_utils in provision_utils * Fix instance log path * Log head node to provision.log * remove unused var * fix ssh method * Fix ray for worker node * Fix ray cluster on worker * fix ray cluster * Fix logging and ip fetching * show cluster name * fix wait after cluster launch * Add logging * Add continue * Change back to sleep before wait * use sleep 1 instead * clean * Avoid check skylet twice * skip docker * remove get_feasible_ips * fix comments * minor fix for logs * minor fix * Fix the ip fetching logic * Add quote for `cluster_name!r` * fix return * fix test * fix comments * fix comments * fix comments * Smoke tests changes to use a private VPC region. 
* refactor get_ips * minor debug info fix * fix * UX fix * Fix proxy command to be None * Address comments * format * Fix quote * update logging * compatibility for python 3.7 * format * format * change launched to provisioned * clean up ray yaml * format * [Provisioner] Add docker support back (skypilot-org#2507) * Add docker support back * Fix * Works! * format * Add docker back to the supported features * fix docker_cmd * Fix docker cmd * fix DockerLoginConfig * Move docker login config * Fix backward compatibility for docker * Fix docker login * Fix docker login config new line issue * Fix string process in ray yaml * Update sky/provision/docker_utils.py Co-authored-by: Tian Xia <[email protected]> * Update sky/provision/instance_setup.py Co-authored-by: Tian Xia <[email protected]> * Update sky/provision/docker_utils.py Co-authored-by: Tian Xia <[email protected]> * Update sky/provision/instance_setup.py Co-authored-by: Tian Xia <[email protected]> * address comments * format * format * Add comment * Address comments * format --------- Co-authored-by: Tian Xia <[email protected]> * Wording for SSH connection * Fix ray status check for backward compatibility * remnant * stronger backward compatibility tests * remove unused tag * Revert "remove unused tag" This reverts commit a12df8e. * Add default value for `docker_user` * fix botocore config * lint * Fix mypy * remove unused variable * minor fix for logging * expose bootstrap error * format * minor change for `wait_instances` API * lint * use `command_runner.ssh_options_list` * fix ssh command * reword * Dim error for bootstrapping * fix user known file * avoid ray dependency for aws * fix setup.py * format * remove unused file * fix setup.py * __init__ for utils * format * format * Address comments * Elaborate readme --------- Co-authored-by: Siyuan <[email protected]> Co-authored-by: Zongheng Yang <[email protected]> Co-authored-by: Tian Xia <[email protected]>
…2642) install typin_extensions for all versions
* README: update news and LLM list. * updates
* fix * apply suggestions from code review * Apply suggestions from code review Co-authored-by: Zhanghao Wu <[email protected]> * lint --------- Co-authored-by: Zhanghao Wu <[email protected]>
…org#2650) * feat(update): Adding CoreWeave label to identify GPUs in K8s * feat(fix): Remove broken line * feat(fix): Fixing label_formatter detection * feat(code): Lint
…sible (skypilot-org#2657) * Fix optimizer for dag when some of the resources provided are invalid * format * format * address comments * better output * spacing * Fix inconsistent repr * use str instead of repr * more robust replace for region, zone
skip v5 catalog
* Add traceback for exceptions raised and detailed reasons for CommandError * format * Add exception column to make exception easier to see * remove color for exception * Fix docstr * revert cli and fix comment
* Initial multi-node support * Add pod anti-affinity * Fix concurrent SSH for Kubernetes * lint * comments * update readme * remove lsof dependency * newline * Update roadmap in readme
* fix * lint * Update sky/cli.py Co-authored-by: Zhanghao Wu <[email protected]> * fix * var name * var name --------- Co-authored-by: Zhanghao Wu <[email protected]>
* UX: Allow infering cloud from region or zone. * format * minor fix * Remove Local cloud from registry. * UX for 1-cloud cases * Format * Fix test fixtures. * isort
* Pin remote pydantic for ray * remote to be the second requirement * Add remote for k8s yml * format * format
…ommands (skypilot-org#2667) * fix * change position
…ypilot-org#2669) * Add retry for flaky error during launching GCP clusters * handle error * format * Do not log out stderr * Add retry for gcloud crash * fix retry return code
* Add sky show-gpus support for Kubernetes * Update sky/clouds/service_catalog/kubernetes_catalog.py Co-authored-by: Romil Bhardwaj <[email protected]> * PR feedback * PR feedback part 2 * Format fix * PR feedback part 3 * Fix bug with checking enabled clouds in k8s list_accelerators * Pylint fixes * Pylint fixes part 2 * Pylint fixes part 3 * Pylint fixes part 4 --------- Co-authored-by: Romil Bhardwaj <[email protected]>
* Fix the NoOp for rich status * fix shell completion installation * pin pandas * revert dependency
* Fix caching error for aws resources * Add todo for local cache * backward compatibility * Use lru cache instead * format * add unit test for aws resources memory leakage * pytest larger memory limit * Fix unit test * fix comment * add requirement for memory profiler * install memory profiler in pytest * fix ci * Update sky/provision/aws/instance.py Co-authored-by: Zongheng Yang <[email protected]> * Address comments * Add new line * backward compatibility * Use thread_local LRU instead * init * Update sky/adaptors/aws.py Co-authored-by: Zongheng Yang <[email protected]> * make private * Add comments * fail early for memory exceeding * Less frequent memory test * shorter period * Update sky/provision/aws/instance.py * refactor * Address comments * fix test * Add comment to modules --------- Co-authored-by: Zongheng Yang <[email protected]>
…org#2594) * refactor: 💡 update faq and add more detailed error message * fix
…ypilot-org#2682) * import cloud provisioners in advance instead of importing it online * Fix lower * format * bump skylet version * Add comment * Add a comment for skylet version
* Add kapa * fix color and logo * height * z-index fix for rtd flyout * newline * newline
@jc9123 Thanks! The GPU memory is super useful info for the catalog. Left a few comments.
'V100': 16,
'P100': 16,
'K80': 12,
'': ''
Sorry, why is this `''` here?
This was a mistake on my end; I will remove it.
@@ -513,6 +513,23 @@ def get_catalog_df(region_prefix: str) -> pd.DataFrame:
# Round the prices.
df['Price'] = df['Price'].round(PRICE_ROUNDING)
df['SpotPrice'] = df['SpotPrice'].round(PRICE_ROUNDING)
gpu_map = {
Could you add a reference link in a comment above this?
def create_gpu_map(df):
    # Map of Azure's GPU machines to their corresponding device memory.
    # The result is hard-coded since Azure's API does not return such info;
    # it may become outdated, so it needs to be maintained.
Could you add a reference link showing where this information was found? Also, how did we make sure we cover all the instance types?
I looked through the Azure documentation to ensure that all instance types have been included, and I also ran the script with --all-regions; the result looked fine to me.
However, I think the approach you mentioned below makes more sense: map instance type -> GPU name first, then GPU name -> GPU memory. There is already a mapping from instance type -> GPU name in the script. Assuming this mapping is complete, we can easily map each GPU name to its corresponding memory.
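The completeness assumption above could be checked mechanically instead of by reading the docs. Below is a minimal sketch of such a check. The function name `missing_gpu_names` and both sample dictionaries are hypothetical and are not taken from `fetch_azure.py`; the memory values repeat the per-GPU numbers quoted earlier in this review.

```python
from typing import Dict, List


def missing_gpu_names(instance_to_gpu: Dict[str, str],
                      gpu_memory: Dict[str, int]) -> List[str]:
    """Return GPU names that appear in the catalog but have no memory entry."""
    used = set(instance_to_gpu.values())
    return sorted(name for name in used if name not in gpu_memory)


# Hypothetical sample data, for illustration only.
sample_instance_to_gpu = {
    'Standard_ND6s': 'P40',
    'Standard_ND12s': 'P40',
    'Standard_NC6': 'K80',
}
sample_gpu_memory = {'V100': 16, 'P100': 16, 'K80': 12}

# 'P40' is used by an instance type but is missing from the memory map.
gaps = missing_gpu_names(sample_instance_to_gpu, sample_gpu_memory)
```

Failing the fetch script (or at least logging a warning) when `gaps` is non-empty would surface new GPU types as Azure adds them.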
'Standard_ND6s': 24,
'Standard_ND12s': 48,
'Standard_ND24s': 96,
'Standard_ND24rs*': 96,
Why is there a `*`?
It's one of the instance types offered by Azure:
https://learn.microsoft.com/en-us/azure/virtual-machines/nd-series
# Map of Azure's GPU machines to their corresponding device memory.
# The result is hard-coded since Azure's API does not return such info;
# it may become outdated, so it needs to be maintained.
gpu_map = {
Could we try to map instance type -> GPU name first and then calculate the resulting device memory afterwards? This two-level approach might be cleaner.
I agree this approach is much cleaner, as it needs much less hard-coding and utilizes already-fetched info. I will change the script to use this approach.
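A minimal sketch of the two-level lookup discussed here, under stated assumptions: all names (`GPU_MEMORY`, `INSTANCE_TO_GPU`, `device_memory`) are hypothetical and do not come from `fetch_azure.py`. The V100, P100, and K80 values are the per-GPU numbers quoted in this review. The P40 entry and the per-instance GPU counts are assumptions, chosen so the totals match the hard-coded ND-series values shown above (24, 48, 96).

```python
from typing import Dict, Optional, Tuple

# Level 2: GPU name -> memory per GPU (same unit as the PR's hard-coded map).
GPU_MEMORY: Dict[str, int] = {'V100': 16, 'P100': 16, 'K80': 12, 'P40': 24}

# Level 1: instance type -> (GPU name, GPU count). Hypothetical stand-in for
# the mapping the fetch script already builds.
INSTANCE_TO_GPU: Dict[str, Tuple[str, int]] = {
    'Standard_ND6s': ('P40', 1),
    'Standard_ND12s': ('P40', 2),
    'Standard_ND24s': ('P40', 4),
}


def device_memory(instance_type: str) -> Optional[int]:
    """Total GPU memory for an instance type, or None if it is unknown."""
    gpu = INSTANCE_TO_GPU.get(instance_type)
    if gpu is None:
        return None  # Not a GPU instance, or not in the mapping.
    name, count = gpu
    per_gpu = GPU_MEMORY.get(name)
    if per_gpu is None:
        return None  # GPU name has no memory entry yet.
    return per_gpu * count
```

With this layout only the small GPU-name table needs manual upkeep; supporting a new instance type requires no extra entry as long as its GPU name is already listed.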
This PR is stale because it has been open 120 days with no activity. Remove stale label or comment or this will be closed in 10 days.
This PR was closed because it has been stalled for 10 days with no activity.
Modified fetch_azure.py so that the catalog now contains the device memory for each GPU.
Tested (run the relevant ones):
bash format.sh
pytest tests/test_smoke.py
pytest tests/test_smoke.py::test_fill_in_the_name
bash tests/backward_comaptibility_tests.sh