CLI: Deprecate cpunode/gpunode/tpunode, hide admin (#2800)
* CLI: deprecate + hide interactive node commands and `admin`

* Purge interactive node mentions in docs.

* Update docs/source/examples/gpu-jupyter.rst

Co-authored-by: Zhanghao Wu <[email protected]>

* updates

* add TODO

---------

Co-authored-by: Zhanghao Wu <[email protected]>
concretevitamin and Michaelvll authored Nov 18, 2023
1 parent 77a32d2 commit 3a7c858
Showing 10 changed files with 79 additions and 213 deletions.
43 changes: 9 additions & 34 deletions docs/source/examples/auto-failover.rst
@@ -16,11 +9,9 @@ searching for regions (or clouds) that can provide the requested resources.

.. tip::

No action is required to use this feature.

Auto-failover is automatically enabled whenever a new cluster is to be
provisioned, such as during :code:`sky launch` or the :ref:`interactive node
commands <interactive-nodes>` :code:`sky {gpunode,cpunode,tpunode}`.
No action is required to use this feature. Auto-failover is automatically
enabled whenever a new cluster is to be provisioned, such as during :code:`sky
launch`.

If specific :code:`cloud`, ``region``, or ``zone`` are requested for a
task, auto-failover retries only within the specified location.
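For example, pinning a request to a specific cloud and region (the values below are purely illustrative) restricts failover to that region's zones:

.. code-block:: bash

   # Only zones within AWS us-east-1 are tried for this request.
   sky launch -c gpu --gpus V100 --cloud aws --region us-east-1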
@@ -36,16 +34,8 @@ provisioner handles such a request:

.. code-block::
$ sky gpunode -c gpu --gpus V100
I 02-11 21:17:43 optimizer.py:211] Defaulting estimated time to 1 hr. Call Task.set_time_estimator() to override.
I 02-11 21:17:43 optimizer.py:317] Optimizer - plan minimizing cost (~$3.0):
I 02-11 21:17:43 optimizer.py:332]
I 02-11 21:17:43 optimizer.py:332] TASK BEST_RESOURCE
I 02-11 21:17:43 optimizer.py:332] gpunode GCP(n1-highmem-8, {'V100': 1.0})
I 02-11 21:17:43 optimizer.py:332]
I 02-11 21:17:43 optimizer.py:285] Considered resources -> cost
I 02-11 21:17:43 optimizer.py:286] {AWS(p3.2xlarge): 3.06, GCP(n1-highmem-8, {'V100': 1.0}): 2.953212}
I 02-11 21:17:43 optimizer.py:286]
$ sky launch -c gpu --gpus V100
... # optimizer output
I 02-11 21:17:43 cloud_vm_ray_backend.py:1034] Creating a new cluster: "gpu" [1x GCP(n1-highmem-8, {'V100': 1.0})].
I 02-11 21:17:43 cloud_vm_ray_backend.py:1034] Tip: to reuse an existing cluster, specify --cluster-name (-c) in the CLI or use sky.launch(.., cluster_name=..) in the Python API. Run `sky status` to see existing clusters.
I 02-11 21:17:43 cloud_vm_ray_backend.py:614] To view detailed progress: tail -n100 -f sky_logs/sky-2022-02-11-21-17-43-171661/provision.log
@@ -78,17 +68,9 @@ AWS, where it succeeded after two regions:

.. code-block::
$ sky gpunode --gpus V100:8
I 02-23 16:39:59 optimizer.py:213] Defaulting estimated time to 1 hr. Call Task.set_time_estimator() to override.
I 02-23 16:39:59 optimizer.py:323] Optimizer - plan minimizing cost (~$20.3):
I 02-23 16:39:59 optimizer.py:337]
I 02-23 16:39:59 optimizer.py:337] TASK BEST_RESOURCE
I 02-23 16:39:59 optimizer.py:337] gpunode GCP(n1-highmem-8, {'V100': 8.0})
I 02-23 16:39:59 optimizer.py:337]
I 02-23 16:39:59 optimizer.py:290] Considered resources -> cost
I 02-23 16:39:59 optimizer.py:292] {GCP(n1-highmem-8, {'V100': 8.0}): 20.313212, AWS(p3.16xlarge): 24.48}
I 02-23 16:39:59 optimizer.py:292]
I 02-23 16:39:59 cloud_vm_ray_backend.py:1010] Creating a new cluster: "sky-gpunode-zongheng" [1x GCP(n1-highmem-8, {'V100': 8.0})].
$ sky launch -c v100-8 --gpus V100:8
... # optimizer output
I 02-23 16:39:59 cloud_vm_ray_backend.py:1010] Creating a new cluster: "v100-8" [1x GCP(n1-highmem-8, {'V100': 8.0})].
I 02-23 16:39:59 cloud_vm_ray_backend.py:1010] Tip: to reuse an existing cluster, specify --cluster-name (-c) in the CLI or use sky.launch(.., cluster_name=..) in the Python API. Run `sky status` to see existing clusters.
I 02-23 16:39:59 cloud_vm_ray_backend.py:658] To view detailed progress: tail -n100 -f sky_logs/sky-2022-02-23-16-39-58-577551/provision.log
I 02-23 16:39:59 cloud_vm_ray_backend.py:668]
@@ -112,14 +94,7 @@ AWS, where it succeeded after two regions:
E 02-23 16:41:50 cloud_vm_ray_backend.py:746] Failed to acquire resources in all regions/zones (requested GCP(n1-highmem-8, {'V100': 8.0})). Try changing resource requirements or use another cloud.
W 02-23 16:41:50 cloud_vm_ray_backend.py:891]
W 02-23 16:41:50 cloud_vm_ray_backend.py:891] Provision failed for GCP(n1-highmem-8, {'V100': 8.0}). Trying other launchable resources (if any)...
I 02-23 16:41:50 optimizer.py:213] Defaulting estimated time to 1 hr. Call Task.set_time_estimator() to override.
I 02-23 16:41:50 optimizer.py:323] Optimizer - plan minimizing cost (~$24.5):
I 02-23 16:41:50 optimizer.py:337]
I 02-23 16:41:50 optimizer.py:337] TASK BEST_RESOURCE
I 02-23 16:41:50 optimizer.py:337] gpunode AWS(p3.16xlarge)
I 02-23 16:41:50 optimizer.py:337]
I 02-23 16:41:50 cloud_vm_ray_backend.py:658] To view detailed progress: tail -n100 -f sky_logs/sky-2022-02-23-16-39-58-577551/provision.log
I 02-23 16:41:50 cloud_vm_ray_backend.py:668]
...
I 02-23 16:41:50 cloud_vm_ray_backend.py:668] Launching on AWS us-east-1 (us-east-1a,us-east-1b,us-east-1c,us-east-1d,us-east-1e,us-east-1f)
W 02-23 16:42:15 cloud_vm_ray_backend.py:477] Got error(s) in all zones of us-east-1:
W 02-23 16:42:15 cloud_vm_ray_backend.py:479] create_instances: Attempt failed with An error occurred (InsufficientInstanceCapacity) when calling the RunInstances operation (reached max retries: 0): We currently do not have sufficient p3.16xlarge capacity in the Availability Zone you requested (us-east-1a). Our system will be working on provisioning additional capacity. You can currently get p3.16xlarge capacity by not specifying an Availability Zone in your request or choosing us-east-1b, us-east-1d, us-east-1f., retrying.
12 changes: 5 additions & 7 deletions docs/source/examples/gpu-jupyter.rst
@@ -5,22 +5,20 @@ Jupyter notebooks are a useful tool for interactive development, debugging, and
visualization. SkyPilot makes the process of running a GPU-backed Jupyter notebook
simple by automatically managing provisioning and port forwarding.

To get a machine with a GPU attached, we recommend using an interactive **GPU node**.
You can read more about interactive nodes :ref:`here <interactive-nodes>`.
To get a machine with a GPU attached, use:

.. code-block:: bash
# Launch a VM with 1 NVIDIA GPU and forward port 8888 to localhost
sky gpunode -p 8888 -c jupyter-vm --gpus K80:1
sky launch -c jupyter-vm --gpus K80:1
ssh -L 8888:localhost:8888 jupyter-vm
.. note::

View the supported GPUs with the :code:`sky show-gpus` command.
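For instance, which clouds and instance types offer a given accelerator can be checked before launching (the accelerator name below is just an example):

.. code-block:: bash

   # Clouds and instance types offering K80 GPUs
   sky show-gpus K80

   # All supported accelerators
   sky show-gpus --all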


The above command will automatically log in to the cluster once the cluster is provisioned (or re-use an existing one).

Inside the VM, you can run the following commands to start a Jupyter session:
Use ``ssh jupyter-vm`` to SSH into the VM. Inside the VM, you can run the
following commands to start a Jupyter session:

.. code-block:: bash
17 changes: 15 additions & 2 deletions docs/source/getting-started/quickstart.rst
@@ -123,7 +123,7 @@ This may show multiple clusters, if you have created several:
.. code-block::
NAME LAUNCHED RESOURCES COMMAND STATUS
gcp 1 day ago 1x GCP(n1-highmem-8) sky cpunode -c gcp --cloud gcp STOPPED
mygcp 1 day ago 1x GCP(n1-highmem-8) sky launch -c mygcp --cloud gcp STOPPED
mycluster 4 mins ago 1x AWS(p3.2xlarge) sky exec mycluster hello_sky.yaml UP
@@ -152,6 +152,9 @@ Simply run :code:`ssh <cluster_name>` to log into a cluster:
The above are achieved by adding appropriate entries to ``~/.ssh/config``.

Because SkyPilot exposes SSH access to clusters, clusters can be used directly inside
tools such as `Visual Studio Code Remote <https://code.visualstudio.com/docs/remote/remote-overview>`_.
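For illustration, the generated entry is roughly of the following form (the host alias, address, user, and key path here are hypothetical and vary by cluster and cloud):

.. code-block::

   Host mycluster
     HostName 3.84.123.45
     User ubuntu
     IdentityFile ~/.ssh/sky-key
     IdentitiesOnly yes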

Transfer files
===============

@@ -178,6 +181,16 @@ To terminate a cluster instead, run :code:`sky down`:
$ sky down mycluster
.. note::

Stopping a cluster preserves the data on its attached disks (billing for the
instances stops, but the disks are still charged). The disks are reattached
when the cluster is restarted.

Terminating a cluster will delete all associated resources (all billing
stops), and any data on the attached disks will be lost. Terminated
clusters cannot be restarted.
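For illustration, the full lifecycle on the CLI (the cluster name is just an example):

.. code-block:: bash

   sky stop mycluster    # Stop instances; disks persist (and continue to bill).
   sky start mycluster   # Restart the cluster with its disks reattached.
   sky down mycluster    # Terminate; all resources and disk data are deleted.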

Find more commands that manage the lifecycle of clusters in the :ref:`CLI reference <cli>`.

Scaling out
@@ -186,7 +199,7 @@
So far, we have used SkyPilot's CLI to submit work to and interact with a single cluster.
When you are ready to scale out (e.g., run 10s or 100s of jobs), SkyPilot supports two options:

- Queue jobs on one or more clusters with ``sky exec`` (see :ref:`Job Queue <job-queue>`); or
- Queue many jobs on your cluster(s) with ``sky exec`` (see :ref:`Job Queue <job-queue>`);
- Use :ref:`Managed Spot Jobs <spot-jobs>` to run on auto-managed spot instances
(users need not interact with the underlying clusters)
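For illustration, both options from the CLI (cluster and YAML file names are examples):

.. code-block:: bash

   # Queue several jobs on an existing cluster; they are scheduled as
   # resources free up.
   sky exec mycluster job1.yaml
   sky exec mycluster job2.yaml

   # Run a job on auto-managed spot instances instead.
   sky spot launch job1.yaml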

1 change: 0 additions & 1 deletion docs/source/index.rst
@@ -132,7 +132,6 @@ Documentation
examples/docker-containers
examples/ports
reference/tpu
reference/interactive-nodes
reference/logging
reference/faq

17 changes: 0 additions & 17 deletions docs/source/reference/cli.rst
@@ -69,23 +69,6 @@ Managed Spot Jobs CLI
:prog: sky spot logs
:nested: full

Interactive Node CLI
-----------------------

.. click:: sky.cli:cpunode
:prog: sky cpunode
:nested: full

.. _sky-gpunode:
.. click:: sky.cli:gpunode
:prog: sky gpunode
:nested: full

.. click:: sky.cli:tpunode
:prog: sky tpunode
:nested: full


Storage CLI
------------

128 changes: 0 additions & 128 deletions docs/source/reference/interactive-nodes.rst

This file was deleted.

18 changes: 10 additions & 8 deletions docs/source/reference/tpu.rst
@@ -16,15 +16,17 @@ ML researchers and students are encouraged to apply for free TPU access through
Getting TPUs in one command
===========================

Like :ref:`GPUs <interactive-nodes>`, SkyPilot provides a simple command to quickly get TPUs for development:
Use one command to quickly get TPU nodes for development:

.. code-block:: bash
sky tpunode # By default TPU v2-8 is used
sky tpunode --use-spot # Preemptible TPUs
sky tpunode --tpus tpu-v3-8 # Change TPU type to tpu-v3-8
sky tpunode --instance-type n1-highmem-16 # Change the host VM type to n1-highmem-16
sky tpunode --tpu-vm # Use TPU VM (instead of TPU Node)
sky launch --gpus tpu-v2-8
# Preemptible TPUs:
sky launch --gpus tpu-v2-8 --use-spot
# Change TPU type to tpu-v3-8:
sky launch --gpus tpu-v3-8
# Change the host VM type to n1-highmem-16:
sky launch --gpus tpu-v3-8 -t n1-highmem-16
After the command finishes, you will be dropped into a TPU host VM and can start developing code right away.

@@ -48,7 +50,7 @@ More details can be found on GCP `documentation <https://cloud.google.com/tpu/do
TPU VMs
-------

To use TPU VMs, set the following in a task YAML's ``resources`` field:
To use TPU VMs, set the following in a task YAML's ``resources`` field:

.. code-block:: yaml
@@ -223,7 +225,7 @@ To use a TPU Pod, simply change the ``accelerators`` field in the task YAML (e.
:emphasize-lines: 2-2
resources:
accelerators: tpu-v2-32 # Pods have > 8 cores (the last number)
accelerators: tpu-v2-32 # Pods have > 8 cores (the last number)
accelerator_args:
runtime_version: tpu-vm-base
tpu_vm: True
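For illustration, a task YAML configured this way is launched like any other task (the file and cluster names here are hypothetical):

.. code-block:: bash

   sky launch -c tpu-pod task.yaml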