From 9b4bc21bece276afbb74228c0b731f8f8746581f Mon Sep 17 00:00:00 2001 From: peterschmidt85 Date: Mon, 26 Jan 2026 16:58:06 +0100 Subject: [PATCH 1/3] [Docs] Improved documentation structure (WIP) - [x] Introduced `More` under `Concepts` - [x] Moved `Metrics` to `Concepts` - [x] Improved `Installation` (removed SSH fleets - only keep it in `Backends`; moved `Configure` after `Set up the server`) - [x] Mention `server restart is required after updating server/config.yml` in `Backends` - [x] Improved `Distributed tasks` (structure; links to `Fleets` and `Examples`) --- docs/blog/posts/gpu-health-checks.md | 4 ++-- docs/blog/posts/metrics-ui.md | 2 +- docs/blog/posts/prometheus.md | 4 ++-- docs/docs/concepts/backends.md | 7 +++--- docs/docs/concepts/fleets.md | 2 +- docs/docs/{guides => concepts}/metrics.md | 0 docs/docs/concepts/tasks.md | 24 ++++++++------------ docs/docs/guides/protips.md | 2 +- docs/docs/guides/troubleshooting.md | 2 +- docs/docs/installation/index.md | 27 ++++++++++------------- docs/docs/quickstart.md | 4 ++-- docs/overrides/home.html | 6 ----- mkdocs.yml | 25 +++++++++------------ 13 files changed, 46 insertions(+), 63 deletions(-) rename docs/docs/{guides => concepts}/metrics.md (100%) diff --git a/docs/blog/posts/gpu-health-checks.md b/docs/blog/posts/gpu-health-checks.md index c10557e753..b864e77855 100644 --- a/docs/blog/posts/gpu-health-checks.md +++ b/docs/blog/posts/gpu-health-checks.md @@ -12,7 +12,7 @@ categories: In large-scale training, a single bad GPU can derail progress. Sometimes the failure is obvious — jobs crash outright. Other times it’s subtle: correctable memory errors, intermittent instability, or thermal throttling that quietly drags down throughput. In big experiments, these issues can go unnoticed for hours or days, wasting compute and delaying results. -`dstack` already supports GPU telemetry monitoring through NVIDIA DCGM [metrics](../../docs/guides/metrics.md), covering utilization, memory, and temperature. This release extends that capability with passive hardware health checks powered by DCGM [background health checks](https://docs.nvidia.com/datacenter/dcgm/latest/user-guide/feature-overview.html#background-health-checks). With these, `dstack` continuously evaluates fleet GPUs for hardware reliability and displays their status before scheduling workloads. +`dstack` already supports GPU telemetry monitoring through NVIDIA DCGM [metrics](../../docs/concepts/metrics.md), covering utilization, memory, and temperature. This release extends that capability with passive hardware health checks powered by DCGM [background health checks](https://docs.nvidia.com/datacenter/dcgm/latest/user-guide/feature-overview.html#background-health-checks). With these, `dstack` continuously evaluates fleet GPUs for hardware reliability and displays their status before scheduling workloads. @@ -69,5 +69,5 @@ If you have experience with GPU reliability or ideas for automated recovery, joi !!! info "What's next?" 1. Check [Quickstart](../../docs/quickstart.md) 2. Explore the [clusters](../../docs/guides/clusters.md) guide - 3. Learn more about [metrics](../../docs/guides/metrics.md) + 3. Learn more about [metrics](../../docs/concepts/metrics.md) 4. 
Join [Discord](https://discord.gg/u8SmfwPpMd) diff --git a/docs/blog/posts/metrics-ui.md b/docs/blog/posts/metrics-ui.md index db21cf019a..877ae9fca8 100644 --- a/docs/blog/posts/metrics-ui.md +++ b/docs/blog/posts/metrics-ui.md @@ -53,6 +53,6 @@ For persistent storage and long-term access to metrics, we still recommend setti metrics from `dstack`. !!! info "What's next?" - 1. See [Metrics](../../docs/guides/metrics.md) + 1. See [Metrics](../../docs/concepts/metrics.md) 2. Check [dev environments](../../docs/concepts/dev-environments.md), [tasks](../../docs/concepts/tasks.md), [services](../../docs/concepts/services.md), and [fleets](../../docs/concepts/fleets.md) 3. Join [Discord](https://discord.gg/u8SmfwPpMd) diff --git a/docs/blog/posts/prometheus.md b/docs/blog/posts/prometheus.md index 8a4d579c04..08aecb4cf5 100644 --- a/docs/blog/posts/prometheus.md +++ b/docs/blog/posts/prometheus.md @@ -45,7 +45,7 @@ Overall, `dstack` collects three groups of metrics: | **Runs** | Run metrics include run counters for each user in each project. | | **Jobs** | A run consists of one or more jobs, each mapped to a container. Job metrics offer insights into execution time, cost, GPU model, NVIDIA DCGM telemetry, and more. | -For a full list of available metrics and labels, check out [Metrics](../../docs/guides/metrics.md). +For a full list of available metrics and labels, check out [Metrics](../../docs/concepts/metrics.md). ??? info "NVIDIA" NVIDIA DCGM metrics are automatically collected for `aws`, `azure`, `gcp`, and `oci` backends, @@ -59,7 +59,7 @@ For a full list of available metrics and labels, check out [Metrics](../../docs/ only accessible through the UI and the [`dstack metrics`](dstack-metrics.md) CLI. !!! info "What's next?" - 1. See [Metrics](../../docs/guides/metrics.md) + 1. See [Metrics](../../docs/concepts/metrics.md) 1. Check [dev environments](../../docs/concepts/dev-environments.md), [tasks](../../docs/concepts/tasks.md), [services](../../docs/concepts/services.md), and [fleets](../../docs/concepts/fleets.md) diff --git a/docs/docs/concepts/backends.md b/docs/docs/concepts/backends.md index 9a1c90ec5c..4d3617988c 100644 --- a/docs/docs/concepts/backends.md +++ b/docs/docs/concepts/backends.md @@ -7,15 +7,16 @@ Backends allow `dstack` to provision fleets across cloud providers or Kubernetes * [VM-based](#vm-based) – use `dstack`'s native integration with cloud providers to provision VMs, manage clusters, and orchestrate container-based runs. * [Container-based](#container-based) – use either `dstack`'s native integration with cloud providers or Kubernetes to orchestrate container-based runs; provisioning in this case is delegated to the cloud provider or Kubernetes. -??? info "SSH fleets" +!!! info "SSH fleets" When using `dstack` with on-prem servers, backend configuration isn’t required. Simply create [SSH fleets](../concepts/fleets.md#ssh-fleets) once the server is up. Backends can be configured via `~/.dstack/server/config.yml` or through the [project settings page](../concepts/projects.md#backends) in the UI. See the examples of backend configuration below. +> If you update `~/.dstack/server/config.yml`, you have to restart the server. + ## VM-based -VM-based backends allow `dstack` users to manage clusters and orchestrate container-based runs across a wide range of cloud providers. -Under the hood, `dstack` uses native integrations with these providers to provision clusters on demand. 
+VM-based backends allow `dstack` users to manage clusters and orchestrate container-based runs across a wide range of cloud providers. Under the hood, `dstack` uses native integrations with these providers to provision clusters on demand. Compared to [container-based](#container-based) backends, this approach offers finer-grained, simpler control over cluster provisioning and eliminates the dependency on a Kubernetes layer. diff --git a/docs/docs/concepts/fleets.md b/docs/docs/concepts/fleets.md index 99912cd75b..4ab066cb9f 100644 --- a/docs/docs/concepts/fleets.md +++ b/docs/docs/concepts/fleets.md @@ -298,7 +298,7 @@ Define a fleet configuration as a YAML file in your project directory. The file -??? info "Requirements" +??? info "Host requirements" 1. Hosts must be pre-installed with Docker. === "NVIDIA" diff --git a/docs/docs/guides/metrics.md b/docs/docs/concepts/metrics.md similarity index 100% rename from docs/docs/guides/metrics.md rename to docs/docs/concepts/metrics.md diff --git a/docs/docs/concepts/tasks.md b/docs/docs/concepts/tasks.md index ac94415d4d..35662b9cc0 100644 --- a/docs/docs/concepts/tasks.md +++ b/docs/docs/concepts/tasks.md @@ -135,18 +135,17 @@ resources: -Nodes can communicate using their private IP addresses. -Use `DSTACK_MASTER_NODE_IP`, `DSTACK_NODES_IPS`, `DSTACK_NODE_RANK`, and other -[System environment variables](#system-environment-variables) for inter-node communication. +!!! info "Cluster placement" + To submit a distributed task, you must create at least one fleet with a [cluster placement](fleets.md#backend-placement). + -`dstack` is easy to use with `accelerate`, `torchrun`, Ray, Spark, and any other distributed frameworks. +Jobs on each node communicate using their private IP addresses. Use `DSTACK_MASTER_NODE_IP`, `DSTACK_NODES_IPS`, `DSTACK_NODE_RANK`, and other [system environment variables](#system-environment-variables) for inter-node communication. - -!!! info "MPI" - If want to use MPI, you can set `startup_order` to `workers-first` and `stop_criteria` to `master-done`, and use `DSTACK_MPI_HOSTFILE`. - See the [NCCL/RCCL tests](../../examples/clusters/nccl-rccl-tests/index.md) examples. - -> For detailed examples, see [distributed training](../../examples.md#distributed-training) examples. +!!! info "Examples" + `dstack` is easy to use with `accelerate`, `torchrun`, Ray, Spark, and any other distributed frameworks. + + For detailed examples, see the [distributed training](../../examples.md#distributed-training) + and [clusters](../../examples.md#clusters) examples. ??? info "Network interface" Distributed frameworks usually detect the correct network interface automatically, @@ -172,11 +171,6 @@ Use `DSTACK_MASTER_NODE_IP`, `DSTACK_NODES_IPS`, `DSTACK_NODE_RANK`, and other For convenience, `~/.ssh/config` is preconfigured with these options, so a simple `ssh ` is enough. For a list of nodes IPs check the `DSTACK_NODES_IPS` environment variable. -!!! info "Cluster fleets" - To run distributed tasks, you need to create a fleet with [`placement: cluster`](fleets.md#cloud-placement). - -> See the [Clusters](../guides/clusters.md) guide for more details on how to use `dstack` on clusters. 
- ### Resources When you specify a resource value like `cpu` or `memory`, diff --git a/docs/docs/guides/protips.md b/docs/docs/guides/protips.md index 167b8f1b4b..dfb7abf0b6 100644 --- a/docs/docs/guides/protips.md +++ b/docs/docs/guides/protips.md @@ -482,7 +482,7 @@ The `offer` command allows you to filter and group offers with various [advanced ## Metrics -`dstack` tracks essential metrics accessible via the CLI and UI. To access advanced metrics like DCGM, configure the server to export metrics to Prometheus. See [Metrics](metrics.md) for details. +`dstack` tracks essential metrics accessible via the CLI and UI. To access advanced metrics like DCGM, configure the server to export metrics to Prometheus. See [Metrics](../concepts/metrics.md) for details. ## Service quotas diff --git a/docs/docs/guides/troubleshooting.md b/docs/docs/guides/troubleshooting.md index 5d17b894d0..a0dab1e9f7 100644 --- a/docs/docs/guides/troubleshooting.md +++ b/docs/docs/guides/troubleshooting.md @@ -119,7 +119,7 @@ one of these features, `dstack` will only select offers from the backends that s [Instance volumes](../concepts/volumes.md#instance-volumes), and [Privileged containers](../reference/dstack.yml/dev-environment.md#privileged) are supported by all backends except `runpod`, `vastai`, and `kubernetes`. -- [Clusters](../concepts/fleets.md#cloud-placement) +- [Clusters](../concepts/fleets.md#backend-placement) and [distributed tasks](../concepts/tasks.md#distributed-tasks) are only supported by the `aws`, `azure`, `gcp`, `nebius`, `oci`, and `vultr` backends, as well as SSH fleets. diff --git a/docs/docs/installation/index.md b/docs/docs/installation/index.md index aad8741b66..f76b5869db 100644 --- a/docs/docs/installation/index.md +++ b/docs/docs/installation/index.md @@ -6,15 +6,6 @@ ## Set up the server -### Configure backends - -To orchestrate compute across cloud providers or Kubernetes clusters, you need to configure [backends](../concepts/backends.md). - -??? info "SSH fleets" - When using `dstack` with on-prem servers, backend configuration isn’t required. Simply create [SSH fleets](../concepts/fleets.md#ssh-fleets) once the server is up. - -### Start the server - The server can run on your laptop or any environment with access to the cloud and on-prem clusters you plan to use. === "uv" @@ -72,10 +63,11 @@ The server can run on your laptop or any environment with access to the cloud an -To verify that backends are properly configured, use the [`dstack offer`](../reference/cli/dstack/offer.md#list-gpu-offers) command to list available GPU offers. +For more details on server deployment options, see the [Server deployment](../guides/server-deployment.md) guide. -!!! info "Server deployment" - For more details on server deployment options, see the [Server deployment](../guides/server-deployment.md) guide. +### Configure backends + +To orchestrate compute across cloud providers or Kubernetes clusters, you need to configure [backends](../concepts/backends.md). ## Set up the CLI @@ -112,6 +104,8 @@ Once the server is up, you can access it via the `dstack` CLI. (or `Use Git and optional Unix tools from the Command Prompt`), and `Use bundled OpenSSH`. +### Configure the default project + To point the CLI to the `dstack` server, configure it with the server address, user token, and project name: @@ -130,6 +124,10 @@ Configuration is updated at ~/.dstack/config.yml This configuration is stored in `~/.dstack/config.yml`. 
+### Check offers + +To verify that both the server and CLI are properly configured, use the [`dstack offer`](../reference/cli/dstack/offer.md#list-gpu-offers) command to list available GPU offers. If you don't see valid offers, ensure you've set up [backends](../concepts/backends.md). + ??? info "Shell autocompletion" `dstack` supports shell autocompletion for `bash` and `zsh`. @@ -195,11 +193,10 @@ This configuration is stored in `~/.dstack/config.yml`. > If you get an error similar to `2: command not found: compdef`, then add the following line to the beginning of your `~/.zshrc` file: > `autoload -Uz compinit && compinit`. - !!! info "What's next?" - 1. Follow [Quickstart](../quickstart.md) - 2. See [Backends](../concepts/backends.md) + 1. See [Backends](../concepts/backends.md) + 2. Follow [Quickstart](../quickstart.md) 3. Check the [server deployment](../guides/server-deployment.md) guide 4. Browse [examples](../../examples.md) 5. Join the community via [Discord](https://discord.gg/u8SmfwPpMd) diff --git a/docs/docs/quickstart.md b/docs/docs/quickstart.md index 759ec1b573..506ebec31f 100644 --- a/docs/docs/quickstart.md +++ b/docs/docs/quickstart.md @@ -1,11 +1,11 @@ # Quickstart -??? info "Prerequsites" +!!! info "Prerequisites" Before using `dstack`, ensure you've [installed](installation/index.md) the server and the CLI. ## Create a fleet -Before you can submit your first run, you have to create a [fleet](concepts/fleets.md). +> Before submitting runs, you must create a [fleet](concepts/fleets.md). === "Backend fleet" If you're using cloud providers or Kubernetes clusters and have configured the corresponding [backends](concepts/backends.md), create a fleet as follows: diff --git a/docs/overrides/home.html b/docs/overrides/home.html index ced53fb1e8..c1b51e87d2 100644 --- a/docs/overrides/home.html +++ b/docs/overrides/home.html @@ -296,12 +296,6 @@

Single-node & distributed tasks

Tasks - - - Clusters - -

diff --git a/mkdocs.yml b/mkdocs.yml index 07eed5f3b7..5dcdd69999 100644 --- a/mkdocs.yml +++ b/mkdocs.yml @@ -74,15 +74,13 @@ plugins: - docs/concepts/tasks.md: How to run tasks - for training or fine-tuning, including distributed tasks - docs/concepts/services.md: How to deploy services - for model inference or web apps - docs/concepts/volumes.md: How to manage volumes - for persistent storage or caching + - docs/concepts/gateways.md: How to manage gateways - enabling auto-scaling, rate limits, and custom domains - docs/concepts/secrets.md: How to manage secrets - for API keys or other sensitive data - docs/concepts/projects.md: How to manage projects - for managing separate teams - - docs/concepts/gateways.md: How to manage gateways - enabling auto-scaling, rate limits, and custom domains + - docs/concepts/metrics.md: How to monitor metrics Guides: - - docs/guides/clusters.md: How to work with clusters - for distributed tasks - - docs/guides/kubernetes.md: How to work with Kubernetes - docs/guides/server-deployment.md: Detailed guide on how to deploy the dstack server - docs/guides/troubleshooting.md: Common issues and how to troubleshoot them - - docs/guides/metrics.md: How to monitor metrics - docs/guides/protips.md: Pro tips - tips and tricks to use dstack more efficiently Examples: - examples/single-node-training/trl/index.md: TRL @@ -149,8 +147,8 @@ plugins: "blog/data-centers-and-private-clouds.md": "blog/posts/gpu-blocks-and-proxy-jump.md" "blog/distributed-training-with-aws-efa.md": "examples/clusters/aws/index.md" "blog/dstack-stats.md": "blog/posts/dstack-metrics.md" - "docs/concepts/metrics.md": "docs/guides/metrics.md" - "docs/guides/monitoring.md": "docs/guides/metrics.md" + "docs/guides/metrics.md": "docs/concepts/metrics.md" + "docs/guides/monitoring.md": "docs/concepts/metrics.md" "blog/nvidia-and-amd-on-vultr.md.md": "blog/posts/nvidia-and-amd-on-vultr.md" "examples/misc/nccl-tests/index.md": "examples/clusters/nccl-rccl-tests/index.md" "examples/misc/a3high-clusters/index.md": "examples/clusters/gcp/index.md" @@ -268,19 +266,18 @@ nav: - Tasks: docs/concepts/tasks.md - Services: docs/concepts/services.md - Volumes: docs/concepts/volumes.md - - Secrets: docs/concepts/secrets.md - - Projects: docs/concepts/projects.md - - Gateways: docs/concepts/gateways.md + - More: + - Gateways: docs/concepts/gateways.md + - Secrets: docs/concepts/secrets.md + - Projects: docs/concepts/projects.md + - Metrics: docs/concepts/metrics.md - Guides: - - Clusters: docs/guides/clusters.md - - Kubernetes: docs/guides/kubernetes.md - Server deployment: docs/guides/server-deployment.md - Troubleshooting: docs/guides/troubleshooting.md - - Metrics: docs/guides/metrics.md - Protips: docs/guides/protips.md - Upgrade: docs/guides/upgrade.md - - Migration: - - Slurm: docs/guides/migration/slurm.md + - Migration: + - Slurm: docs/guides/migration/slurm.md - Reference: - .dstack.yml: - dev-environment: docs/reference/dstack.yml/dev-environment.md From 768386c4524534555884542de2065da22290c706 Mon Sep 17 00:00:00 2001 From: peterschmidt85 Date: Mon, 26 Jan 2026 23:42:26 +0100 Subject: [PATCH 2/3] [Docs] Documentation improvements - [x] Improved `Fleets` documentation - [x] Minor improvements of the `Tasks` page under `Concepts` - [x] Minor improvements on the home page --- docs/docs/concepts/backends.md | 12 +- docs/docs/concepts/fleets.md | 487 ++++++++++++++------------------ docs/docs/concepts/tasks.md | 11 +- docs/docs/installation/index.md | 2 
+- docs/overrides/home.html | 37 ++- 5 files changed, 242 insertions(+), 307 deletions(-) diff --git a/docs/docs/concepts/backends.md b/docs/docs/concepts/backends.md index 4d3617988c..f9be6c4703 100644 --- a/docs/docs/concepts/backends.md +++ b/docs/docs/concepts/backends.md @@ -1037,9 +1037,13 @@ projects: No additional setup is required — `dstack` configures and manages the proxy automatically. -??? info "NVIDIA GPU Operator" - For `dstack` to correctly detect GPUs in your Kubernetes cluster, the cluster must have the - [NVIDIA GPU Operator](https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/index.html) pre-installed. +??? info "Required operators" + === "NVIDIA" + For `dstack` to correctly detect GPUs in your Kubernetes cluster, the cluster must have the + [NVIDIA GPU Operator](https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/index.html) pre-installed. + === "AMD" + For `dstack` to correctly detect GPUs in your Kubernetes cluster, the cluster must have the + [AMD GPU Operator](https://github.com/ROCm/gpu-operator) pre-installed. 100% +
- FLEET INSTANCE BACKEND GPU PRICE STATUS CREATED - my-fleet - - - - - - -``` + ```shell + $ dstack apply -f fleet.dstack.yml + + # BACKEND REGION RESOURCES SPOT PRICE + 1 gcp us-west4 2xCPU, 8GB, 100GB (disk) yes $0.010052 + 2 azure westeurope 2xCPU, 8GB, 100GB (disk) yes $0.0132 + 3 gcp europe-central2 2xCPU, 8GB, 100GB (disk) yes $0.013248 -
+ Create the fleet? [y/n]: y -If `nodes` is a range that starts above `0`, `dstack` pre-creates the initial number of instances up front, while any additional ones are created on demand. + FLEET INSTANCE BACKEND GPU PRICE STATUS CREATED + my-fleet 0 gcp (europe-west-1) L4:24GB (spot) $0.1624 idle 3 mins ago + 1 gcp (europe-west-1) L4:24GB (spot) $0.1624 idle 3 mins ago + ``` -> Setting the `nodes` range to start above `0` is supported only for [VM-based backends](backends.md#vm-based). + -??? info "Target number of nodes" + If the `nodes` range starts with `0`, `dstack apply` creates only a template. Actual instances are provisioned when you submit runs. - If `nodes` is defined as a range, you can start with more than the minimum number of instances by using the `target` parameter when creating the fleet. +=== "SSH fleet" + If you have a group of on-prem servers accessible via SSH, you can create an SSH fleet as follows:
- + ```yaml type: fleet - name: my-fleet + + # Uncomment if instances are interconnected + #placement: cluster - nodes: - min: 0 - max: 2 - - # Provision 2 instances initially - target: 2 - - # Deprovision instances above the minimum if they remain idle - idle_duration: 1h + ssh_config: + user: ubuntu + identity_file: ~/.ssh/id_rsa + hosts: + - 3.255.177.51 + - 3.255.177.52 ``` - +
-By default, when you submit a [dev environment](dev-environments.md), [task](tasks.md), or [service](services.md), `dstack` tries all available fleets. However, you can explicitly specify the [`fleets`](../reference/dstack.yml/dev-environment.md#fleets) in your run configuration -or via [`--fleet`](../reference/cli/dstack/apply.md#fleet) with `dstack apply`. + Pass the fleet configuration to `dstack apply`: -### Configuration options +
-#### Placement { #backend-placement } + ```shell + $ dstack apply -f fleet.dstack.yml + + Provisioning... + ---> 100% -To ensure instances are interconnected (e.g., for -[distributed tasks](tasks.md#distributed-tasks)), set `placement` to `cluster`. -This ensures all instances are provisioned with optimal inter-node connectivity. + FLEET INSTANCE BACKEND GPU PRICE STATUS CREATED + my-fleet 0 ssh (remote) L4:24GB $0 idle 3 mins ago + 1 ssh (remote) L4:24GB $0 idle 3 mins ago + ``` -??? info "AWS" - When you create a fleet with AWS, [Elastic Fabric Adapter networking](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/efa.html) is automatically configured if it’s supported for the corresponding instance type. - Note, EFA requires the `public_ips` to be set to `false` in the `aws` backend configuration. - Otherwise, instances are only connected by the default VPC subnet. +
- Refer to the [AWS](../../examples/clusters/aws/index.md) example for more details. + `dstack apply` automatically connects to on-prem servers, installs the required dependencies, and adds them to the created fleet. -??? info "GCP" - When you create a fleet with GCP, `dstack` automatically configures [GPUDirect-TCPXO and GPUDirect-TCPX](https://cloud.google.com/kubernetes-engine/docs/how-to/gpu-bandwidth-gpudirect-tcpx-autopilot) networking for the A3 Mega and A3 High instance types, as well as RoCE networking for the A4 instance type. + ??? info "Host requirements" + 1. Hosts must be pre-installed with Docker. - !!! info "Backend configuration" - You may need to configure `extra_vpcs` and `roce_vpcs` in the `gcp` backend configuration. - Refer to the [GCP](../../examples/clusters/gcp/index.md) examples for more details. - -??? info "Nebius" - When you create a fleet with Nebius, [InfiniBand networking](https://docs.nebius.com/compute/clusters/gpu) is automatically configured if it’s supported for the corresponding instance type. - Otherwise, instances are only connected by the default VPC subnet. + === "NVIDIA" + 2. Hosts with NVIDIA GPUs must also be pre-installed with CUDA 12.1 and + [NVIDIA Container Toolkit](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html). - An InfiniBand fabric for the cluster is selected automatically. If you prefer to use some specific fabrics, configure them in the - [backend settings](../reference/server/config.yml.md#nebius). + === "AMD" + 2. Hosts with AMD GPUs must also be pre-installed with AMDGPU-DKMS kernel driver (e.g. via + [native package manager](https://rocm.docs.amd.com/projects/install-on-linux/en/latest/install/native-install/index.html) + or [AMDGPU installer](https://rocm.docs.amd.com/projects/install-on-linux/en/latest/install/amdgpu-install.html).) -The `cluster` placement is supported for `aws`, `azure`, `gcp`, `nebius`, `oci`, and `vultr` -backends. + === "Intel Gaudi" + 2. Hosts with Intel Gaudi accelerators must be pre-installed with [Gaudi software and drivers](https://docs.habana.ai/en/latest/Installation_Guide/Driver_Installation.html#driver-installation). + This must include the drivers, `hl-smi`, and Habana Container Runtime. -> For more details on optimal inter-node connectivity, read the [Clusters](../guides/clusters.md) guide. + === "Tenstorrent" + 2. Hosts with Tenstorrent accelerators must be pre-installed with [Tenstorrent software](https://docs.tenstorrent.com/getting-started/README.html#software-installation). + This must include the drivers, `tt-smi`, and HugePages. - + 3. The user specified must have passwordless `sudo` access. -#### Resources + 4. The SSH server must be running and configured with `AllowTcpForwarding yes` in `/etc/ssh/sshd_config`. -When you specify a resource value like `cpu` or `memory`, -you can either use an exact value (e.g. `24GB`) or a -range (e.g. `24GB..`, or `24GB..80GB`, or `..80GB`). + 5. The firewall must allow SSH and should forbid any other connections from external networks. For `placement: cluster` fleets, it should also allow any communication between fleet nodes. -
+> Once the fleet is created, you can run [dev environments](dev-environments.md), [tasks](tasks.md), and [services](services.md). -```yaml -type: fleet -# The name is optional, if not specified, generated randomly -name: my-fleet +## Configuration options -nodes: 2 +Backend fleets support [many options](../reference/dstack.yml/fleet.md); see some major configuration examples below. -resources: - # 200GB or more RAM - memory: 200GB.. - # 4 GPUs from 40GB to 80GB - gpu: 40GB..80GB:4 - # Disk size - disk: 500GB -``` +### Cluster placement -
+Both [backend fleets](#backend-fleet) and [SSH fleets](#ssh-fleet) allow the `placement` property to be set to `cluster`. -The `gpu` property allows specifying not only memory size but also GPU vendor, names -and their quantity. Examples: `nvidia` (one NVIDIA GPU), `A100` (one A100), `A10G,A100` (either A10G or A100), -`A100:80GB` (one A100 of 80GB), `A100:2` (two A100), `24GB..40GB:2` (two GPUs between 24GB and 40GB), -`A100:40GB:2` (two A100 GPUs of 40GB). +This property ensures that instances are interconnected. This is required for running [distributed tasks](tasks.md#distributed-tasks). -??? info "Google Cloud TPU" - To use TPUs, specify its architecture via the `gpu` property. +=== "Backend fleet" + Backend fleets allow to provision interconnected clusters across supported backends. +
+ ```yaml type: fleet - # The name is optional, if not specified, generated randomly name: my-fleet nodes: 2 - + placement: cluster + resources: - gpu: v2-8 + gpu: H100:8 ``` + +
- Currently, only 8 TPU cores can be specified, supporting single TPU device workloads. Multi-TPU support is coming soon. - -> If you’re unsure which offers (hardware configurations) are available from the configured backends, use the -> [`dstack offer`](../reference/cli/dstack/offer.md#list-gpu-offers) command to list them. - -#### Blocks { #backend-blocks } - -For backend fleets, `blocks` function the same way as in SSH fleets. -See the [`Blocks`](#ssh-blocks) section under SSH fleets for details on the blocks concept. - -
- -```yaml -type: fleet - -name: my-fleet - -resources: - gpu: NVIDIA:80GB:8 + For backend fleets, fast interconnect is currently supported only on the `aws`, `gcp`, `nebius`, and `runpod` backends. -# Split into 4 blocks, each with 2 GPUs -blocks: 4 -``` + === "AWS" + EFA requires the `public_ips` to be set to `false` in the `aws` backend configuration. + Refer to the [AWS](../../examples/clusters/aws/index.md) example for more details. -
+ === "GCP" + You may need to configure `extra_vpcs` and `roce_vpcs` in the `gcp` backend configuration. + Refer to the [GCP](../../examples/clusters/gcp/index.md) examples for more details. -#### Idle duration + === "Nebius" + When you create a cloud fleet with Nebius, [InfiniBand](https://docs.nebius.com/compute/clusters/gpu) networking is automatically configured if it’s supported for the corresponding instance type. -By default, fleet instances stay `idle` for 3 days and can be reused within that time. -If an instance is not reused within this period, it is automatically terminated. + === "Runpod" + When you run multinode tasks in a cluster cloud fleet with Runpod, `dstack` provisions [Runpod Instant Clusters](https://docs.runpod.io/instant-clusters) with InfiniBand networking configured. + + > See the [Clusters](../../examples.md#clusters) examples. -To change the default idle duration, set -[`idle_duration`](../reference/dstack.yml/fleet.md#idle_duration) in the fleet configuration (e.g., `0s`, `1m`, or `off` for -unlimited). +=== "SSH fleets" + If the hosts in the SSH fleet have interconnect configured, you only need to set `placement` to `cluster`. -
- +
+ ```yaml type: fleet - # The name is optional, if not specified, generated randomly name: my-fleet - - nodes: 2 - # Terminate instances idle for more than 1 hour - idle_duration: 1h - - resources: - gpu: 24GB - ``` - -
+ placement: cluster -#### Spot policy + ssh_config: + user: ubuntu + identity_file: ~/.ssh/id_rsa + hosts: + - 3.255.177.51 + - 3.255.177.52 + ``` + +
-By default, `dstack` uses on-demand instances. However, you can change that -via the [`spot_policy`](../reference/dstack.yml/fleet.md#spot_policy) property. It accepts `spot`, `on-demand`, and `auto`. + !!! info "Network" + By default, `dstack` automatically detects the network shared by the hosts. However, it's possible to configure it explicitly via the [`network`](../reference/dstack.yml/fleet.md#network) property. -#### Retry policy + -By default, if `dstack` fails to provision an instance or an instance is interrupted, no retry is attempted. +### Nodes -If you'd like `dstack` to do it, configure the -[retry](../reference/dstack.yml/fleet.md#retry) property accordingly: +The `nodes` property is supported only by backend fleets and specifies how many nodes `dstack` must provision or may provision. -
+
```yaml type: fleet -# The name is optional, if not specified, generated randomly name: my-fleet -nodes: 1 +# Allow to provision of up to 2 instances +nodes: 0..2 -resources: - gpu: 24GB +# Uncomment to ensure instances are inter-connected +#placement: cluster -retry: - # Retry on specific events - on_events: [no-capacity, interruption] - # Retry for up to 1 hour - duration: 1h +# Deprovision instances above the minimum if they remain idle +idle_duration: 1h + +resources: + # Allow to provision up to 8 GPUs + gpu: 0..8 ```
-!!! info "Reference" - Backend fleets support many more configuration options, - incl. [`backends`](../reference/dstack.yml/fleet.md#backends), - [`regions`](../reference/dstack.yml/fleet.md#regions), - [`max_price`](../reference/dstack.yml/fleet.md#max_price), and - among [others](../reference/dstack.yml/fleet.md). - -## SSH fleets +If `nodes` is a range that starts above `0`, `dstack` pre-creates the initial number of instances up front, while any additional ones are created on demand. -If you have a group of on-prem servers accessible via SSH, you can create an SSH fleet. +> Setting the `nodes` range to start above `0` is supported only for [VM-based backends](backends.md#vm-based). -### Apply a configuration +??? info "Target number" + If `nodes` is defined as a range, you can start with more than the minimum number of instances by using the `target` parameter when creating the fleet. -Define a fleet configuration as a YAML file in your project directory. The file must have a -`.dstack.yml` extension (e.g. `.dstack.yml` or `fleet.dstack.yml`). +
-
- ```yaml type: fleet - # The name is optional, if not specified, generated randomly name: my-fleet - # Uncomment if instances are interconnected - #placement: cluster + nodes: + min: 0 + max: 2 + target: 2 - # SSH credentials for the on-prem servers - ssh_config: - user: ubuntu - identity_file: ~/.ssh/id_rsa - hosts: - - 3.255.177.51 - - 3.255.177.52 + # Deprovision instances above the minimum if they remain idle + idle_duration: 1h ``` - -
-??? info "Host requirements" - 1. Hosts must be pre-installed with Docker. +
- === "NVIDIA" - 2. Hosts with NVIDIA GPUs must also be pre-installed with CUDA 12.1 and - [NVIDIA Container Toolkit](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html). +### Resources - === "AMD" - 2. Hosts with AMD GPUs must also be pre-installed with AMDGPU-DKMS kernel driver (e.g. via - [native package manager](https://rocm.docs.amd.com/projects/install-on-linux/en/latest/install/native-install/index.html) - or [AMDGPU installer](https://rocm.docs.amd.com/projects/install-on-linux/en/latest/install/amdgpu-install.html).) +Backend fleets allow you to specify the resource requirements for the instances to be provisioned. The `resources` property syntax is the same as for [run configurations](dev-environments.md#resources). - === "Intel Gaudi" - 2. Hosts with Intel Gaudi accelerators must be pre-installed with [Gaudi software and drivers](https://docs.habana.ai/en/latest/Installation_Guide/Driver_Installation.html#driver-installation). - This must include the drivers, `hl-smi`, and Habana Container Runtime. +> Not directly related, but in addition to `resources`, you can specify [`spot_policy`](../reference/dstack.yml/fleet.md#instance_types), [`instance_types`](../reference/dstack.yml/fleet.md#instance_types), [`max_price`](../reference/dstack.yml/fleet.md#max_price), [`region`](../reference/dstack.yml/fleet.md#max_price), and other [options](../reference/dstack.yml/fleet.md#). - === "Tenstorrent" - 2. Hosts with Tenstorrent accelerators must be pre-installed with [Tenstorrent software](https://docs.tenstorrent.com/getting-started/README.html#software-installation). - This must include the drivers, `tt-smi`, and HugePages. + - 3. The user specified must have passwordless `sudo` access. +### Backends - 4. The SSH server must be running and configured with `AllowTcpForwarding yes` in `/etc/ssh/sshd_config`. +### Idle duration - 5. The firewall must allow SSH and should forbid any other connections from external networks. For `placement: cluster` fleets, it should also allow any communication between fleet nodes. +By default, instances of a backend fleet stay `idle` for 3 days and can be reused within that time. +If an instance is not reused within this period, it is automatically terminated. -To create or update the fleet, pass the fleet configuration to [`dstack apply`](../reference/cli/dstack/apply.md): +To change the default idle duration, set +[`idle_duration`](../reference/dstack.yml/fleet.md#idle_duration) in the fleet configuration (e.g., `0s`, `1m`, or `off` for +unlimited). -
+
+ +```yaml +type: fleet +name: my-fleet -```shell -$ dstack apply -f examples/misc/fleets/.dstack.yml +nodes: 2 -Provisioning... ----> 100% +# Terminate instances idle for more than 1 hour +idle_duration: 1h - FLEET INSTANCE GPU PRICE STATUS CREATED - my-fleet 0 L4:24GB (spot) $0 idle 3 mins ago - 1 L4:24GB (spot) $0 idle 3 mins ago +resources: + gpu: 24GB ```
-When you apply, `dstack` connects to the specified hosts using the provided SSH credentials, -installs the dependencies, and configures these hosts as a fleet. - -Once the status of instances changes to `idle`, they can be used by dev environments, tasks, and services. - -### Configuration options +### Blocks -#### Placement { #ssh-placement } +By default, a job uses the entire instance—e.g., all 8 GPUs. To allow multiple jobs on the same instance, set the `blocks` property to divide the instance. Each job can then use one or more blocks, up to the full instance. -If the hosts are interconnected (i.e. share the same network), set `placement` to `cluster`. -This is required if you'd like to use the fleet for [distributed tasks](tasks.md#distributed-tasks). +=== "Backend fleet" +
-??? info "Network" - By default, `dstack` automatically detects the network shared by the hosts. - However, it's possible to configure it explicitly via - the [`network`](../reference/dstack.yml/fleet.md#network) property. + ```yaml + type: fleet + name: my-fleet - [//]: # (TODO: Provide an example and more detail) + nodes: 0..2 -> For more details on optimal inter-node connectivity, read the [Clusters](../guides/clusters.md) guide. + resources: + gpu: H100:8 -#### Blocks { #ssh-blocks } + # Split into 4 blocks, each with 2 GPUs + blocks: 4 + ``` -By default, a job uses the entire instance—e.g., all 8 GPUs. To allow multiple jobs on the same instance, set the `blocks` property to divide the instance. Each job can then use one or more blocks, up to the full instance. +
-
+=== "SSH fleet" +
```yaml type: fleet @@ -386,7 +326,7 @@ By default, a job uses the entire instance—e.g., all 8 GPUs. To allow multiple blocks: 1 ``` -
+
All resources (GPU, CPU, memory) are split evenly across blocks, while disk is shared. @@ -396,37 +336,16 @@ Set `blocks` to `auto` to match the number of blocks to the number of GPUs. !!! info "Distributed tasks" Distributed tasks require exclusive access to all host resources and therefore must use all blocks on each node. - -#### Environment variables - -If needed, you can specify environment variables that will be used by `dstack-shim` and passed to containers. - -[//]: # (TODO: Explain what dstack-shim is) - -For example, these variables can be used to configure a proxy: - -```yaml -type: fleet -name: my-fleet -env: - - HTTP_PROXY=http://proxy.example.com:80 - - HTTPS_PROXY=http://proxy.example.com:80 - - NO_PROXY=localhost,127.0.0.1 +### SSH config -ssh_config: - user: ubuntu - identity_file: ~/.ssh/id_rsa - hosts: - - 3.255.177.51 - - 3.255.177.52 -``` + #### Proxy jump -If fleet hosts are behind a head node (aka "login node"), configure [`proxy_jump`](../reference/dstack.yml/fleet.md#proxy_jump): +If hosts are behind a head node (aka "login node"), configure [`proxy_jump`](../reference/dstack.yml/fleet.md#proxy_jump): -
+
```yaml type: fleet @@ -446,8 +365,7 @@ If fleet hosts are behind a head node (aka "login node"), configure [`proxy_jump
-To be able to attach to runs, both explicitly with `dstack attach` and implicitly with `dstack apply`, you must either -add a front node key (`~/.ssh/head_node_key`) to an SSH agent or configure a key path in `~/.ssh/config`: +To be able to attach to runs, both explicitly with `dstack attach` and implicitly with `dstack apply`, you must either add a front node key (`~/.ssh/head_node_key`) to an SSH agent or configure a key path in `~/.ssh/config`:
@@ -458,22 +376,33 @@ add a front node key (`~/.ssh/head_node_key`) to an SSH agent or configure a key
-where `Host` must match `ssh_config.proxy_jump.hostname` or `ssh_config.hosts[n].proxy_jump.hostname` if you configure head nodes -on a per-worker basis. +where `Host` must match `ssh_config.proxy_jump.hostname` or `ssh_config.hosts[n].proxy_jump.hostname` if you configure head nodes on a per-worker basis. -!!! info "Reference" - For all SSH fleet configuration options, refer to the [reference](../reference/dstack.yml/fleet.md). +### Environment variables -#### Troubleshooting +If needed, you can specify environment variables that will be automatically passed to any jobs running on this fleet. + +For example, these variables can be used to configure a proxy: + +```yaml +type: fleet +name: my-fleet -!!! info "Resources" - Once the fleet is created, double-check that the GPU, memory, and disk are detected correctly. +env: + - HTTP_PROXY=http://proxy.example.com:80 + - HTTPS_PROXY=http://proxy.example.com:80 + - NO_PROXY=localhost,127.0.0.1 -If the status does not change to `idle` after a few minutes or the resources are not displayed correctly, ensure that -all host requirements are satisfied. +ssh_config: + user: ubuntu + identity_file: ~/.ssh/id_rsa + hosts: + - 3.255.177.51 + - 3.255.177.52 +``` -If the requirements are met but the fleet still fails to be created correctly, check the logs at -`/root/.dstack/shim.log` on the hosts for error details. +!!! info "Reference" + The fleet configuration file supports many more options. See the [reference](../reference/dstack.yml/fleet.md). ## Manage fleets @@ -513,4 +442,6 @@ To terminate and delete specific instances from a fleet, pass `-i INSTANCE_NUM`. !!! info "What's next?" 1. Check [dev environments](dev-environments.md), [tasks](tasks.md), and [services](services.md) - 2. Read the [Clusters](../guides/clusters.md) guide + 2. Read the [Backends](backends.md) guide + 3. Explore the [`.dstack.yml` reference](../reference/dstack.yml/fleet.md) + 4. See the [Clusters](../../examples.md#clusters) examples diff --git a/docs/docs/concepts/tasks.md b/docs/docs/concepts/tasks.md index 35662b9cc0..6f3f2fabb7 100644 --- a/docs/docs/concepts/tasks.md +++ b/docs/docs/concepts/tasks.md @@ -136,16 +136,17 @@
!!! info "Cluster placement" - To submit a distributed task, you must create at least one fleet with a [cluster placement](fleets.md#backend-placement). + To submit a distributed task, you must create at least one fleet with a [cluster placement](fleets.md#cluster-placement). Jobs on each node communicate using their private IP addresses. Use `DSTACK_MASTER_NODE_IP`, `DSTACK_NODES_IPS`, `DSTACK_NODE_RANK`, and other [system environment variables](#system-environment-variables) for inter-node communication. -!!! info "Examples" - `dstack` is easy to use with `accelerate`, `torchrun`, Ray, Spark, and any other distributed frameworks. + + +`dstack` is easy to use with `accelerate`, `torchrun`, Ray, Spark, and any other distributed frameworks. - For detailed examples, see the [distributed training](../../examples.md#distributed-training) - and [clusters](../../examples.md#clusters) examples. +> For detailed examples, see the [distributed training](../../examples.md#distributed-training) + and [clusters](../../examples.md#clusters) examples. ??? info "Network interface" Distributed frameworks usually detect the correct network interface automatically, diff --git a/docs/docs/installation/index.md b/docs/docs/installation/index.md index f76b5869db..c1dff911d7 100644 --- a/docs/docs/installation/index.md +++ b/docs/docs/installation/index.md @@ -73,7 +73,7 @@ To orchestrate compute across cloud providers or Kubernetes clusters, you need t Once the server is up, you can access it via the `dstack` CLI. -> The CLI can be set up via `pip` or `uv` on Linux, macOS, and Windows. It requires Git and OpenSSH. +> The CLI can be used on Linux, macOS, and Windows. It requires Git and OpenSSH. === "uv" diff --git a/docs/overrides/home.html b/docs/overrides/home.html index c1b51e87d2..7cebed7b6a 100644 --- a/docs/overrides/home.html +++ b/docs/overrides/home.html @@ -190,15 +190,6 @@

Native integration with GPU clouds

fill-rule="nonzero" fill="currentColor" class="fill-main"> - - - Kubernetes - - - -

@@ -217,17 +208,17 @@

Easy to use with on-prem clusters

- - Kubernetes + + SSH fleets - - - SSH fleets + + + Kubernetes Single-node & distributed tasks

@@ -590,10 +587,12 @@

Get started in minutes

- + + Installation + +

From 86a90121b22e447aedd4a4d9943b51bfaacf7e13 Mon Sep 17 00:00:00 2001 From: peterschmidt85 Date: Tue, 27 Jan 2026 13:55:36 +0100 Subject: [PATCH 3/3] [Docs] Minor updates to `README.md`, `Overview`, `Fleets`, `Quickstart`, and examples --- README.md | 13 ++-- docs/docs/concepts/backends.md | 2 +- docs/docs/concepts/fleets.md | 68 +++++++++++++------ docs/docs/guides/troubleshooting.md | 2 +- docs/docs/index.md | 8 +-- docs/docs/installation/index.md | 4 +- docs/docs/quickstart.md | 6 +- examples/clusters/nccl-rccl-tests/README.md | 2 +- .../distributed-training/axolotl/README.md | 2 +- .../distributed-training/ray-ragen/README.md | 2 +- examples/distributed-training/trl/README.md | 2 +- 11 files changed, 68 insertions(+), 43 deletions(-) diff --git a/README.md b/README.md index 71d8d1a8b7..bbba8e136a 100644 --- a/README.md +++ b/README.md @@ -18,7 +18,7 @@ It streamlines development, training, and inference, and is compatible with any hardware, open-source tools, and frameworks. -#### Hardware +#### Accelerators `dstack` supports `NVIDIA`, `AMD`, `Google TPU`, `Intel Gaudi`, and `Tenstorrent` accelerators out of the box. @@ -46,7 +46,7 @@ It streamlines development, training, and inference, and is compatible with any ##### Configure backends -To orchestrate compute across cloud providers or existing Kubernetes clusters, you need to configure backends. +To orchestrate compute across GPU clouds or Kubernetes clusters, you need to configure backends. Backends can be set up in `~/.dstack/server/config.yml` or through the [project settings page](https://dstack.ai/docs/concepts/projects#backends) in the UI. @@ -123,12 +123,11 @@ Configuration is updated at ~/.dstack/config.yml `dstack` supports the following configurations: -* [Dev environments](https://dstack.ai/docs/dev-environments) — for interactive development using a desktop IDE -* [Tasks](https://dstack.ai/docs/tasks) — for scheduling jobs (incl. distributed jobs) or running web apps -* [Services](https://dstack.ai/docs/services) — for deployment of models and web apps (with auto-scaling and authorization) -* [Fleets](https://dstack.ai/docs/fleets) — for managing cloud and on-prem clusters +* [Fleets](https://dstack.ai/docs/concepts/fleets) — for managing cloud and on-prem clusters +* [Dev environments](https://dstack.ai/docs/concepts/dev-environments) — for interactive development using a desktop IDE +* [Tasks](https://dstack.ai/docs/concepts/tasks) — for scheduling jobs (incl. distributed jobs) or running web apps +* [Services](https://dstack.ai/docs/concepts/services) — for deployment of models and web apps (with auto-scaling and authorization) * [Volumes](https://dstack.ai/docs/concepts/volumes) — for managing persisted volumes -* [Gateways](https://dstack.ai/docs/concepts/gateways) — for configuring the ingress traffic and public endpoints Configuration can be defined as YAML files within your repo. diff --git a/docs/docs/concepts/backends.md b/docs/docs/concepts/backends.md index f9be6c4703..572d4e0411 100644 --- a/docs/docs/concepts/backends.md +++ b/docs/docs/concepts/backends.md @@ -1,6 +1,6 @@ # Backends -Backends allow `dstack` to provision fleets across cloud providers or Kubernetes clusters. +Backends allow `dstack` to provision fleets across GPU clouds or Kubernetes clusters. 
`dstack` supports two types of backends: diff --git a/docs/docs/concepts/fleets.md b/docs/docs/concepts/fleets.md index 4fb1704913..4def218456 100644 --- a/docs/docs/concepts/fleets.md +++ b/docs/docs/concepts/fleets.md @@ -8,8 +8,8 @@ Before submitting runs, you must create a fleet. Fleets act as both pools of ins To create a fleet, define its configuration in a YAML file. The filename must end with `.dstack.yml` (e.g. `.dstack.yml` or `fleet.dstack.yml`), regardless of fleet type. -=== "Backend fleet" - If you're using cloud providers or Kubernetes clusters and have configured the corresponding [backends](backends.md), create a fleet as follows: +=== "Backend fleets" + If you're using cloud providers or Kubernetes clusters and have configured the corresponding [backends](backends.md), create a backend fleet as follows:
@@ -54,9 +54,9 @@ To create a fleet, define its configuration in a YAML file. The filename must en
- If the `nodes` range starts with `0`, `dstack apply` creates only a template. Actual instances are provisioned when you submit runs. + If the `nodes` range starts with `0`, `dstack apply` creates only a template. Instances are provisioned only when you submit runs. -=== "SSH fleet" +=== "SSH fleets" If you have a group of on-prem servers accessible via SSH, you can create an SSH fleet as follows:
@@ -135,7 +135,7 @@ Both [backend fleets](#backend-fleet) and [SSH fleets](#ssh-fleet) allow the `pl This property ensures that instances are interconnected. This is required for running [distributed tasks](tasks.md#distributed-tasks). -=== "Backend fleet" +=== "Backend fleets" Backend fleets allow to provision interconnected clusters across supported backends.
@@ -153,21 +153,27 @@ This property ensures that instances are interconnected. This is required for ru
- For backend fleets, fast interconnect is currently supported only on the `aws`, `gcp`, `nebius`, and `runpod` backends. + #### Backends + + Fast interconnect is supported on the `aws`, `gcp`, `nebius`, `kubernetes`, and `runpod` backends. Some backends may require additional configuration. === "AWS" - EFA requires the `public_ips` to be set to `false` in the `aws` backend configuration. + On AWS, `dstack` requires `public_ips` to be set to `false` in the backend configuration. Refer to the [AWS](../../examples/clusters/aws/index.md) example for more details. === "GCP" - You may need to configure `extra_vpcs` and `roce_vpcs` in the `gcp` backend configuration. + On GCP, you may need to configure `extra_vpcs` and `roce_vpcs` in the `gcp` backend configuration. Refer to the [GCP](../../examples/clusters/gcp/index.md) examples for more details. === "Nebius" - When you create a cloud fleet with Nebius, [InfiniBand](https://docs.nebius.com/compute/clusters/gpu) networking is automatically configured if it’s supported for the corresponding instance type. + On [Nebius](https://docs.nebius.com/compute/clusters/gpu), `dstack` automatically configures InfiniBand networking if it is supported by the selected instance type. + === "Kubernetes" + If the Kubernetes cluster has interconnect configured, `dstack` can use it without additional setup. + See the [Lambda](../../examples/clusters/lambda/index.md#kubernetes) or [Crusoe](../../examples/clusters/crusoe/index.md#kubernetes) examples. + === "Runpod" - When you run multinode tasks in a cluster cloud fleet with Runpod, `dstack` provisions [Runpod Instant Clusters](https://docs.runpod.io/instant-clusters) with InfiniBand networking configured. + On [Runpod](https://docs.runpod.io/instant-clusters), `dstack` automatically configures InfiniBand networking if it is supported by the selected instance type. > See the [Clusters](../../examples.md#clusters) examples. @@ -199,7 +205,7 @@ This property ensures that instances are interconnected. This is required for ru ### Nodes -The `nodes` property is supported only by backend fleets and specifies how many nodes `dstack` must provision or may provision. +The `nodes` property is supported only by backend fleets and specifies how many nodes `dstack` must or can provision.
@@ -223,12 +229,34 @@ resources:
-If `nodes` is a range that starts above `0`, `dstack` pre-creates the initial number of instances up front, while any additional ones are created on demand. +#### Pre-provisioning + +If the `nodes` range starts with `0`, `dstack apply` creates only a template, and instances are provisioned when you submit runs. + +To provision instances up front, set the `nodes` range to start above `0`. This pre-creates the initial number of instances; additional instances (if any) are provisioned on demand. + + +
+ + ```yaml + type: fleet + name: my-fleet + + nodes: 2..10 + + # Uncomment to ensure instances are inter-connected + #placement: cluster + + resources: + gpu: H100:8 + ``` + +
-> Setting the `nodes` range to start above `0` is supported only for [VM-based backends](backends.md#vm-based). +Pre-provisioning is supported only for [VM-based backends](backends.md#vm-based). ??? info "Target number" - If `nodes` is defined as a range, you can start with more than the minimum number of instances by using the `target` parameter when creating the fleet. + To pre-provision more than the minimum number of instances, set the `target` parameter.
@@ -237,9 +265,9 @@ If `nodes` is a range that starts above `0`, `dstack` pre-creates the initial nu name: my-fleet nodes: - min: 0 - max: 2 - target: 2 + min: 2 + max: 10 + target: 6 # Deprovision instances above the minimum if they remain idle idle_duration: 1h @@ -247,6 +275,8 @@ If `nodes` is a range that starts above `0`, `dstack` pre-creates the initial nu
+ `dstack apply` pre-provisions up to `target` and scales back to `min` after `idle_duration`. + ### Resources Backend fleets allow you to specify the resource requirements for the instances to be provisioned. The `resources` property syntax is the same as for [run configurations](dev-environments.md#resources). @@ -287,7 +317,7 @@ resources: By default, a job uses the entire instance—e.g., all 8 GPUs. To allow multiple jobs on the same instance, set the `blocks` property to divide the instance. Each job can then use one or more blocks, up to the full instance. -=== "Backend fleet" +=== "Backend fleets"
```yaml @@ -305,7 +335,7 @@ By default, a job uses the entire instance—e.g., all 8 GPUs. To allow multiple
-=== "SSH fleet" +=== "SSH fleets"
```yaml diff --git a/docs/docs/guides/troubleshooting.md b/docs/docs/guides/troubleshooting.md index a0dab1e9f7..2b4356fb5c 100644 --- a/docs/docs/guides/troubleshooting.md +++ b/docs/docs/guides/troubleshooting.md @@ -119,7 +119,7 @@ one of these features, `dstack` will only select offers from the backends that s [Instance volumes](../concepts/volumes.md#instance-volumes), and [Privileged containers](../reference/dstack.yml/dev-environment.md#privileged) are supported by all backends except `runpod`, `vastai`, and `kubernetes`. -- [Clusters](../concepts/fleets.md#backend-placement) +- [Clusters](../concepts/fleets.md#cluster-placement) and [distributed tasks](../concepts/tasks.md#distributed-tasks) are only supported by the `aws`, `azure`, `gcp`, `nebius`, `oci`, and `vultr` backends, as well as SSH fleets. diff --git a/docs/docs/index.md b/docs/docs/index.md index b0228fb2c9..121a379150 100644 --- a/docs/docs/index.md +++ b/docs/docs/index.md @@ -4,9 +4,8 @@ It streamlines development, training, and inference, and is compatible with any hardware, open-source tools, and frameworks. -#### Hardware - -`dstack` supports `NVIDIA`, `AMD`, `TPU`, `Intel Gaudi`, and `Tenstorrent` accelerators out of the box. +!!! info "Accelerators" + `dstack` supports `NVIDIA`, `AMD`, `TPU`, `Intel Gaudi`, and `Tenstorrent` accelerators out of the box. ## How does it work? @@ -20,12 +19,11 @@ It streamlines development, training, and inference, and is compatible with any `dstack` supports the following configurations: +* [Fleets](concepts/fleets.md) — for managing cloud and on-prem clusters * [Dev environments](concepts/dev-environments.md) — for interactive development using a desktop IDE * [Tasks](concepts/tasks.md) — for scheduling jobs, incl. distributed ones (or running web apps) * [Services](concepts/services.md) — for deploying models (or web apps) -* [Fleets](concepts/fleets.md) — for managing cloud and on-prem clusters * [Volumes](concepts/volumes.md) — for managing network volumes (to persist data) -* [Gateways](concepts/gateways.md) — for publishing services with a custom domain and HTTPS Configuration can be defined as YAML files within your repo. diff --git a/docs/docs/installation/index.md b/docs/docs/installation/index.md index c1dff911d7..e179a8e663 100644 --- a/docs/docs/installation/index.md +++ b/docs/docs/installation/index.md @@ -67,7 +67,7 @@ For more details on server deployment options, see the [Server deployment](../gu ### Configure backends -To orchestrate compute across cloud providers or Kubernetes clusters, you need to configure [backends](../concepts/backends.md). +> To orchestrate compute across GPU clouds or Kubernetes clusters, you need to configure [backends](../concepts/backends.md). ## Set up the CLI @@ -97,7 +97,7 @@ Once the server is up, you can access it via the `dstack` CLI. ??? info "Windows" To use the CLI on Windows, ensure you've installed Git and OpenSSH via - [Git for Windows:material-arrow-top-right-thin:{ .external }](https://git-scm.com/download/win). + [Git for Windows](https://git-scm.com/download/win). When installing it, ensure you've checked `Git from the command line and also from 3-rd party software` diff --git a/docs/docs/quickstart.md b/docs/docs/quickstart.md index 506ebec31f..2a8d3f0610 100644 --- a/docs/docs/quickstart.md +++ b/docs/docs/quickstart.md @@ -49,11 +49,9 @@
- If `nodes` is a range that starts above `0`, `dstack` pre-creates the initial number of instances up front, while any additional ones are created on demand. - - > Setting the `nodes` range to start above `0` is supported only for [VM-based backends](concepts/backends.md#vm-based). + If the `nodes` range starts with `0`, `dstack apply` creates only a template. Instances are provisioned only when you submit runs. - If the fleet needs to be a cluster, the [placement](concepts/fleets.md#backend-placement) property must be set to `cluster`. + If the fleet needs to be a cluster, the [placement](concepts/fleets.md#cluster-placement) property must be set to `cluster`. === "SSH fleet" If you have a group of on-prem servers accessible via SSH, you can create an SSH fleet as follows: diff --git a/examples/clusters/nccl-rccl-tests/README.md b/examples/clusters/nccl-rccl-tests/README.md index 7248a4f422..0a2a138b1a 100644 --- a/examples/clusters/nccl-rccl-tests/README.md +++ b/examples/clusters/nccl-rccl-tests/README.md @@ -3,7 +3,7 @@ This example shows how to run [NCCL](https://github.com/NVIDIA/nccl-tests) or [RCCL](https://github.com/ROCm/rccl-tests) tests on a cluster using [distributed tasks](https://dstack.ai/docs/concepts/tasks#distributed-tasks). !!! info "Prerequisites" - Before running a distributed task, make sure to create a fleet with `placement` set to `cluster` (can be a [managed fleet](https://dstack.ai/docs/concepts/fleets#backend-placement) or an [SSH fleet](https://dstack.ai/docs/concepts/fleets#ssh-placement)). + Before running a distributed task, make sure to create a fleet with `placement` set to `cluster` (can be a [managed fleet](https://dstack.ai/docs/concepts/fleets#cluster-placement) or an [SSH fleet](https://dstack.ai/docs/concepts/fleets#ssh-placement)). ## Running as a task diff --git a/examples/distributed-training/axolotl/README.md b/examples/distributed-training/axolotl/README.md index 9ddd77a363..1454732ad5 100644 --- a/examples/distributed-training/axolotl/README.md +++ b/examples/distributed-training/axolotl/README.md @@ -3,7 +3,7 @@ This example walks you through how to run distributed fine-tune using [Axolotl](https://github.com/axolotl-ai-cloud/axolotl) and [distributed tasks](https://dstack.ai/docs/concepts/tasks#distributed-tasks). !!! info "Prerequisites" - Before running a distributed task, make sure to create a fleet with `placement` set to `cluster` (can be a [managed fleet](https://dstack.ai/docs/concepts/fleets#backend-placement) or an [SSH fleet](https://dstack.ai/docs/concepts/fleets#ssh-placement)). + Before running a distributed task, make sure to create a fleet with `placement` set to `cluster` (can be a [managed fleet](https://dstack.ai/docs/concepts/fleets#cluster-placement) or an [SSH fleet](https://dstack.ai/docs/concepts/fleets#ssh-placement)). ## Define a configuration diff --git a/examples/distributed-training/ray-ragen/README.md b/examples/distributed-training/ray-ragen/README.md index e79f27f788..32ce0173fd 100644 --- a/examples/distributed-training/ray-ragen/README.md +++ b/examples/distributed-training/ray-ragen/README.md @@ -6,7 +6,7 @@ to fine-tune an agent on multiple nodes. Under the hood `RAGEN` uses [verl](https://github.com/volcengine/verl) for Reinforcement Learning and [Ray](https://docs.ray.io/en/latest/) for distributed training. !!! 
info "Prerequisites" - Before running a distributed task, make sure to create a fleet with `placement` set to `cluster` (can be a [managed fleet](https://dstack.ai/docs/concepts/fleets#backend-placement) or an [SSH fleet](https://dstack.ai/docs/concepts/fleets#ssh-placement)). + Before running a distributed task, make sure to create a fleet with `placement` set to `cluster` (can be a [managed fleet](https://dstack.ai/docs/concepts/fleets#cluster-placement) or an [SSH fleet](https://dstack.ai/docs/concepts/fleets#ssh-placement)). ## Run a Ray cluster diff --git a/examples/distributed-training/trl/README.md b/examples/distributed-training/trl/README.md index c6231e5170..9df482da52 100644 --- a/examples/distributed-training/trl/README.md +++ b/examples/distributed-training/trl/README.md @@ -3,7 +3,7 @@ This example walks you through how to run distributed fine-tune using [TRL](https://github.com/huggingface/trl), [Accelerate](https://github.com/huggingface/accelerate) and [Deepspeed](https://github.com/deepspeedai/DeepSpeed). !!! info "Prerequisites" - Before running a distributed task, make sure to create a fleet with `placement` set to `cluster` (can be a [managed fleet](https://dstack.ai/docs/concepts/fleets#backend-placement) or an [SSH fleet](https://dstack.ai/docs/concepts/fleets#ssh-placement)). + Before running a distributed task, make sure to create a fleet with `placement` set to `cluster` (can be a [managed fleet](https://dstack.ai/docs/concepts/fleets#cluster-placement) or an [SSH fleet](https://dstack.ai/docs/concepts/fleets#ssh-placement)). ## Define a configuration