Releases: dstackai/dstack
0.18.1
On-prem servers
Now you can add your own servers as pool instances:
dstack pool add-ssh -i ~/.ssh/id_rsa [email protected]
Note
The server should be pre-installed with CUDA 12.1 and NVIDIA Docker.
Configuration
All .dstack/profiles.yml properties can now be specified via run configurations:
type: dev-environment
ide: vscode
spot_policy: auto
backends: ["aws"]
regions: ["eu-west-1", "eu-west-2"]
instance_types: ["p3.8xlarge", "p3.16xlarge"]
max_price: 2.0
max_duration: 1d
New examples 🔥🔥
Thanks to the contribution from @deep-diver, we got two new examples:
Other
- Configuring VPCs using their IDs (via vpc_ids in server/config.yml; see the sketch after this list)
- Support for global profiles (via ~/.dstack/profiles.yml)
- Updated the default environment variables (DSTACK_RUN_NAME, DSTACK_GPUS_NUM, DSTACK_NODES_NUM, DSTACK_NODE_RANK, and DSTACK_MASTER_NODE_IP)
- It's now possible to use NVIDIA A10 GPU on Azure
- More granular permissions for Azure
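For illustration, here is a minimal sketch of what vpc_ids could look like in server/config.yml, assuming an AWS backend; the regions and VPC IDs below are placeholders rather than values from this release:
projects:
- name: main
  backends:
  - type: aws
    creds:
      type: default
    vpc_ids:
      us-east-1: vpc-0123456789abcdef0  # placeholder VPC ID for this region
      eu-west-1: vpc-0fedcba9876543210  # placeholder VPC ID for this region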
What's changed
- Fix server freeze on terminate instance by @jvstme in #1132
- Support profile params in run configurations by @r4victor in #1131
- Support global .dstack/profiles.yml by @r4victor in #1134
- Fix No such profile: None when missing .dstack/profiles.yml by @r4victor in #1135
- Make Azure permissions more granular by @r4victor in #1139
- Validate min disk size by @r4victor in #1146
- Fix unexpected error if system Python version is unknown by @r4victor in #1147
- Add request timeouts to prevent code freezes by @jvstme in #1140
- Refactor backends to wait for instance IP address outside run_job/create_instance by @r4victor in #1149
- Fix provisioning Azure instances with A10 GPU by @jvstme in #1150
- [Internal] Move packer -> scripts/packer by @jvstme in #1153
- Added the ability to add your own instances by @TheBits in #1115
- Fixed an issue with the executor_error check being falsely positive by @TheBits in #1160
- Make user project quota configurable by @r4victor in #1161
- Configure CORS headers on gateway by @r4victor in #1166
- Allow to configure AWS vpc_ids by @r4victor in #1170
- [Internal] Show dstack version in Sentry issues by @jvstme in #1167
- Fix KeyError: 'IpPermissions' when using AWS by @jvstme in #1169
- Create public SSH key if it does not exist in dstack pool add-ssh by @TheBits in #1173
- Fixed the environment file upload by @TheBits in #1175
- Updated shim status processing by @TheBits in #1174
- Fix bugs in dstack pool add-ssh by @TheBits in #1178
- Fix Cudo Create VM response error by @Bihan in #1179
- Implement API for configuring backends via yaml by @r4victor in #1181
- Allow running gated models with HUGGING_FACE_HUB_TOKEN by @r4victor in #1184
- Pass all dstack runner envs as DSTACK_* by @r4victor in #1185
- Improve the retries in get_host_info and get_shim_healthcheck by @TheBits in #1183
- Example: h4 alignment handbook by @deep-diver in #1180
- The deploy is now launched in a ThreadPoolExecutor by @TheBits in #1186
Full Changelog: 0.18.0...0.18.1rc2
0.18.0
RunPod
The update adds the long-awaited integration with RunPod, a distributed GPU cloud that offers GPUs at affordable prices.
To use RunPod, specify your RunPod API key in ~/.dstack/server/config.yml:
projects:
- name: main
  backends:
  - type: runpod
    creds:
      type: api_key
      api_key: US9XTPDIV8AR42MMINY8TCKRB8S4E7LNRQ6CAUQ9
Once the server is restarted, go ahead and run workloads.
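For example, assuming a minimal dev environment configuration saved as .dstack.yml (the IDE and GPU size here are illustrative, not taken from the release notes):
type: dev-environment
ide: vscode
resources:
  gpu: 24GB
It could then be submitted with:
dstack run . -f .dstack.yml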
Clusters
Another major change with the update is the ability to run multi-node tasks over an interconnected cluster of instances.
type: task
nodes: 2
commands:
- git clone https://github.com/r4victor/pytorch-distributed-resnet.git
- cd pytorch-distributed-resnet
- mkdir -p data
- cd data
- wget -c --quiet https://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz
- tar -xvzf cifar-10-python.tar.gz
- cd ..
- pip3 install -r requirements.txt torch
- mkdir -p saved_models
- torchrun --nproc_per_node=$DSTACK_GPUS_PER_NODE
    --node_rank=$DSTACK_NODE_RANK
    --nnodes=$DSTACK_NODES_NUM
    --master_addr=$DSTACK_MASTER_NODE_IP
    --master_port=8008 resnet_ddp.py
    --num_epochs 20
resources:
  gpu: 1
Currently supported providers for this feature include AWS, GCP, and Azure.
Other
- The commands property is no longer required for tasks and services if you use an image that has a default entrypoint configured.
- The permissions required for using dstack with GCP are now more granular.
What's changed
- Add username filter to /api/runs/list by @r4victor in #1068
- Inherit core models from DualBaseModel by @r4victor in #967
- Fixed the YAML schema validation for replicas by @peterschmidt85 in #1055
- Improve the server/config.yml reference documentation by @peterschmidt85 in #1077
- Add the runpod backend by @Bihan in #1063
- Support JSON log handler by @TheBits in #1085
- Added lock to the terminate_idle_instance by @TheBits in #1081
- dstack init doesn't work with a remote Git repo by @peterschmidt85 in #1090
- Minor improvements to dstack server output by @peterschmidt85 in #1088
- Return error information from dstack-shim by @TheBits in #1061
- Replace RetryPolicy.limit with RetryPolicy.duration by @TheBits in #1074
- Make dstack version configurable when deploying docs by @peterschmidt85 in #1095
- dstack init doesn't work with a local Git repo by @peterschmidt85 in #1096
- Fix infinite create_instance() on the cudo provider by @r4victor in #1082
- Do not update the latest Docker image and YAML scheme for pre-release builds by @peterschmidt85 in #1099
- Support multi-node tasks by @r4victor in #1103
- Make commands optional in run configurations by @jvstme in #1104
- Allow the cudo backend to use non-GPU instances by @Bihan in #1092
- Make GCP permissions more granular by @r4victor in #1107
Full changelog: 0.17.0...0.18.0
0.17.0
Service auto-scaling
Previously, dstack always served services as single replicas. While this is suitable for development, in production, the service must automatically scale based on the load.
That's why in 0.17.0, we extended dstack with the capability to configure replicas (the number of replicas) as well as scaling (the auto-scaling policy).
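As a rough sketch (the port, command, replica range, and scaling target below are illustrative rather than taken from the release notes), a service configuration using these properties might look like this:
type: service
port: 8000
commands:
  - python -m http.server 8000
replicas: 1..4
scaling:
  metric: rps
  target: 10
With a replica range, dstack can scale the service between the lower and upper bounds based on the configured target load.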
Regions and instance types
The update brings support for specifying regions and instance types (in dstack run and .dstack/profiles.yml).
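As a minimal sketch (the profile name, regions, and instance types are placeholders, not from the release notes), these could be set in .dstack/profiles.yml:
profiles:
  - name: my-profile
    regions: ["us-east-1", "us-east-2"]
    instance_types: ["g5.xlarge", "g5.2xlarge"]
    default: true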
Environment variables
Firstly, it's now possible to configure an environment variable in the configuration without hardcoding its value. Secondly, dstack run now inherits environment variables from the current process.
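For illustration (the variable names below are examples only), an environment variable can be listed without a value, in which case dstack run takes it from the current process:
type: task
env:
  - HUGGING_FACE_HUB_TOKEN      # no value here; inherited from the shell that runs dstack run
  - MODEL_NAME=my-org/my-model  # hardcoded value (placeholder)
commands:
  - python train.py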
For more details on these new features, check the changelog.
What's changed
- Support running multiple replicas for a service by @Egor-S in #986 and #1015
- Allow to specify instance_type via CLI and profiles by @r4victor in #1023
- Allow to specify regions via CLI and profiles by @r4victor in #947
- Allow specifying required env variables by @spott in #1003
- Allow configuring CA for gateways by @jvstme in #1022
- Support Python 3.12 by @peterschmidt85 in #1031
- The shm_size property in resources doesn't take effect by @peterschmidt85 in #1007
- Sometimes, runs get stuck at pulling by @TheBits in #1035
- vastai doesn't show any offers since 0.16.0 by @iRohith in #959
- It's not possible to configure projects other than main by @peterschmidt85 in #992
- Spot instances don't work on GCP by @peterschmidt85 in #996
New contributors
Full changelog: 0.16.5...0.17.0
0.16.5
0.16.4
CUDO Compute
The 0.16.4 update introduces the cudo backend, which allows running workloads with CUDO Compute, a cloud GPU marketplace.
To configure the cudo backend, you simply need to specify your CUDO Compute project ID and API key:
projects:
- name: main
  backends:
  - type: cudo
    project_id: my-cudo-project
    creds:
      type: api_key
      api_key: 7487240a466624b48de22865589
Once it's done, you can restart the dstack server and use the dstack CLI or API to run workloads.
Note
Limitations
- The dstack gateway feature is not yet compatible with cudo, but it is expected to be supported in version 0.17.0, planned for release within a week.
- The cudo backend cannot yet be used with dstack Sky, but it will also be enabled within a week.
Full changelog: 0.16.3...0.16.4
0.16.3
Bug-fixes
- [Bug] The shm_size property in resources doesn't take effect #1006
- [Bug] It's not possible to configure projects other than main via ~/.dstack/server/config.yml #991
- [Bug] Spot instances don't work on GCP if the username has upper case letters #975
Full changelog: 0.16.2...0.16.3
0.16.1
Improvements to dstack pool
- Change default idle duration for dstack pool add to 72h #964
- Set the default spot policy in dstack pool add to on-demand #962
- Add pool support for lambda, azure, and tensordock #923
- Allow to pass idle duration and spot policy in dstack pool add #918
- dstack run does not respect pool-related profiles.yml parameters #949
Bug-fixes
- Runs submitted via Python API have no termination policy #955
- The vastai backend doesn't show any offers since 0.16.0 #958
- Handle permission error when adding Include to ~/.ssh/config #937
- The SSH tunnel fails because of a messy ~/.ssh/config #933
- The PATH is overridden when logging in via SSH #930
- The SSH tunnel fails with Too many authentication failures #927
We've also updated our guide on how to add new backends. It's now available here.
New contributors
- @iRohith made their first contribution in #959
- @spott made their first contribution in #934
- @KevKibe made their first contribution in #917
Full Changelog: 0.16.0...0.16.1
0.16.0
Pools
The 0.16.0 release is the next major update which, in addition to many bug fixes, introduces pools, a new feature that enables a more efficient way to manage instance lifecycles and reuse instances across runs.
dstack run
Previously, when running a dev environment, task, or service, dstack provisioned an instance in a configured backend, and upon completion of the run, deleted the instance.
Now, when using the dstack run command, it tries to reuse an instance from a pool. If no ready instance meets the requirements, dstack automatically provisions a new one and adds it to the pool.
Once the workload finishes, the instance is marked as idle. If the instance remains idle for the configured duration, dstack tears it down.
dstack pool
The dstack pool command allows for managing instances within pools.
To manually add an instance to a pool, use dstack pool add:
dstack pool add --gpu 80GB --idle-duration 1d
The dstack pool add command allows specifying resource requirements, along with the spot policy, idle duration, max price, retry policy, and other policies.
If no idle duration is configured, dstack sets it to 72h by default. To override it, use the --idle-duration DURATION argument.
To learn more about pools, refer to the official documentation. To learn more about 0.16.0, refer to the changelog.
What's changed
- Add dstack pool by @TheBits in #880
- Pools: fix failed instance status by @Egor-S in #889
- Add columns to dstack pool show by @TheBits in #898
- Add submit stop by @TheBits in #895
- Add kubernetes logo by @plutov in #900
- Handle exceptions from backend.compute().get_offers by @r4victor in #904
- Fix process_finished_jobs parsing None job_model.job_provisioning_data by @r4victor in #905
- Validate run_name by @r4victor in #906
- Filter out private subnets when provisioning in custom aws vpc by @r4victor in #909
- Issue 894 rework failed instance status by @TheBits in #899
- Handle unexpected exceptions from run_job by @r4victor in #911
- Request GPU in docker with --gpus=all by @Egor-S in #913
- Issue 918 fix CLI arguments for dstack pool add by @TheBits in #919
- Added router tests for pools by @TheBits in #916
- Fix #921 by @TheBits in #922
New contributors
Full changelog: 0.15.1...0.16.0
0.15.2rc2
Bug-fixes
- Exclude private subnets when provisioning in AWS #908
- Ollama doesn't detect the GPU (requires --gpus=all instead of --runtime=nvidia) #910
Full changelog: 0.15.1...0.15.2rc2
0.15.1
Kubernetes
With the latest update, it's now possible to configure a Kubernetes backend. In this case, if you run a workload, dstack will provision infrastructure within your Kubernetes cluster. This may work with both self-managed and managed clusters.
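As a rough sketch only (the kubeconfig path, SSH host, and port below are assumptions for illustration, not values documented in this release), the Kubernetes backend might be configured in ~/.dstack/server/config.yml along these lines:
projects:
- name: main
  backends:
  - type: kubernetes
    kubeconfig:
      filename: ~/.kube/config   # assumed path to the cluster's kubeconfig
    networking:
      ssh_host: localhost        # assumption: address where dstack can reach the cluster over SSH
      ssh_port: 32000            # assumption: port exposed for SSH access to the cluster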
Specifying a custom VPC for AWS
If you're using dstack with AWS, it's now possible to configure a vpc_name via ~/.dstack/server/config.yml.
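A minimal sketch of what this could look like (the VPC name and credentials type below are placeholders):
projects:
- name: main
  backends:
  - type: aws
    creds:
      type: default
    vpc_name: my-vpc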
**Learn more about the new features in detail on the changelog page.**
What's changed
- Print total offers count in run plan by @Egor-S in #862
- Add OpenAPI reference to the docs by @Egor-S in #863
- Fixes #864 by pinning the APScheduler dep to < 4 by @tleyden in #867
- Support gateway creation for Kubernetes by @r4victor in #870
- Improve get_latest_runner_build by @Egor-S in #871
- Added ruff by @TheBits in #850
- Handle ResourceNotExistsError instead of 404 by @r4victor in #875
- Simplify Kubernetes backend config by @r4victor in #879
- Add SSH keys to GCP metadata by @Egor-S in #881
- Allow to configure VPC for an AWS backend by @r4victor in #883
New contributors
Full Changelog: 0.15.0...0.15.1