13 changes: 6 additions & 7 deletions README.md
@@ -18,7 +18,7 @@

It streamlines development, training, and inference, and is compatible with any hardware, open-source tools, and frameworks.

#### Hardware
#### Accelerators

`dstack` supports `NVIDIA`, `AMD`, `Google TPU`, `Intel Gaudi`, and `Tenstorrent` accelerators out of the box.

@@ -46,7 +46,7 @@ It streamlines development, training, and inference, and is compatible with any

##### Configure backends

To orchestrate compute across cloud providers or existing Kubernetes clusters, you need to configure backends.
To orchestrate compute across GPU clouds or Kubernetes clusters, you need to configure backends.

Backends can be set up in `~/.dstack/server/config.yml` or through the [project settings page](https://dstack.ai/docs/concepts/projects#backends) in the UI.
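
For illustration, a minimal `~/.dstack/server/config.yml` could look like the sketch below; the `aws` backend, project name, and region are example values only, and the full schema is covered in the backends documentation.

```yaml
# ~/.dstack/server/config.yml -- minimal sketch with example values
projects:
  - name: main
    backends:
      - type: aws
        regions: [us-east-1]  # optionally restrict provisioning to specific regions
        creds:
          type: default       # use the default AWS credential chain
```

As noted in the backends documentation, the server must be restarted after this file is updated.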

Expand Down Expand Up @@ -123,12 +123,11 @@ Configuration is updated at ~/.dstack/config.yml

`dstack` supports the following configurations:

* [Dev environments](https://dstack.ai/docs/dev-environments) — for interactive development using a desktop IDE
* [Tasks](https://dstack.ai/docs/tasks) — for scheduling jobs (incl. distributed jobs) or running web apps
* [Services](https://dstack.ai/docs/services) — for deployment of models and web apps (with auto-scaling and authorization)
* [Fleets](https://dstack.ai/docs/fleets) — for managing cloud and on-prem clusters
* [Fleets](https://dstack.ai/docs/concepts/fleets) — for managing cloud and on-prem clusters
* [Dev environments](https://dstack.ai/docs/concepts/dev-environments) — for interactive development using a desktop IDE
* [Tasks](https://dstack.ai/docs/concepts/tasks) — for scheduling jobs (incl. distributed jobs) or running web apps
* [Services](https://dstack.ai/docs/concepts/services) — for deployment of models and web apps (with auto-scaling and authorization)
* [Volumes](https://dstack.ai/docs/concepts/volumes) — for managing persisted volumes
* [Gateways](https://dstack.ai/docs/concepts/gateways) — for configuring the ingress traffic and public endpoints

Configuration can be defined as YAML files within your repo.
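
As an illustration, a minimal task configuration might look roughly like the sketch below; the Python version, commands, and GPU spec are placeholders rather than a prescribed setup.

```yaml
# .dstack.yml -- hypothetical task sketch
type: task
name: train

python: "3.12"
commands:
  - pip install -r requirements.txt
  - python train.py

resources:
  gpu: 24GB  # request any GPU with at least 24 GB of memory
```

A configuration like this is typically submitted with `dstack apply`, which provisions matching compute and runs the task.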

4 changes: 2 additions & 2 deletions docs/blog/posts/gpu-health-checks.md
@@ -12,7 +12,7 @@ categories:

In large-scale training, a single bad GPU can derail progress. Sometimes the failure is obvious — jobs crash outright. Other times it’s subtle: correctable memory errors, intermittent instability, or thermal throttling that quietly drags down throughput. In big experiments, these issues can go unnoticed for hours or days, wasting compute and delaying results.

`dstack` already supports GPU telemetry monitoring through NVIDIA DCGM [metrics](../../docs/guides/metrics.md), covering utilization, memory, and temperature. This release extends that capability with passive hardware health checks powered by DCGM [background health checks](https://docs.nvidia.com/datacenter/dcgm/latest/user-guide/feature-overview.html#background-health-checks). With these, `dstack` continuously evaluates fleet GPUs for hardware reliability and displays their status before scheduling workloads.
`dstack` already supports GPU telemetry monitoring through NVIDIA DCGM [metrics](../../docs/concepts/metrics.md), covering utilization, memory, and temperature. This release extends that capability with passive hardware health checks powered by DCGM [background health checks](https://docs.nvidia.com/datacenter/dcgm/latest/user-guide/feature-overview.html#background-health-checks). With these, `dstack` continuously evaluates fleet GPUs for hardware reliability and displays their status before scheduling workloads.

<img src="https://dstack.ai/static-assets/static-assets/images/gpu-health-checks.png" width="630"/>

@@ -69,5 +69,5 @@ If you have experience with GPU reliability or ideas for automated recovery, joi
!!! info "What's next?"
1. Check [Quickstart](../../docs/quickstart.md)
2. Explore the [clusters](../../docs/guides/clusters.md) guide
3. Learn more about [metrics](../../docs/guides/metrics.md)
3. Learn more about [metrics](../../docs/concepts/metrics.md)
4. Join [Discord](https://discord.gg/u8SmfwPpMd)
2 changes: 1 addition & 1 deletion docs/blog/posts/metrics-ui.md
@@ -53,6 +53,6 @@ For persistent storage and long-term access to metrics, we still recommend setti
metrics from `dstack`.

!!! info "What's next?"
1. See [Metrics](../../docs/guides/metrics.md)
1. See [Metrics](../../docs/concepts/metrics.md)
2. Check [dev environments](../../docs/concepts/dev-environments.md), [tasks](../../docs/concepts/tasks.md), [services](../../docs/concepts/services.md), and [fleets](../../docs/concepts/fleets.md)
3. Join [Discord](https://discord.gg/u8SmfwPpMd)
4 changes: 2 additions & 2 deletions docs/blog/posts/prometheus.md
@@ -45,7 +45,7 @@ Overall, `dstack` collects three groups of metrics:
| **Runs** | Run metrics include run counters for each user in each project. |
| **Jobs** | A run consists of one or more jobs, each mapped to a container. Job metrics offer insights into execution time, cost, GPU model, NVIDIA DCGM telemetry, and more. |

For a full list of available metrics and labels, check out [Metrics](../../docs/guides/metrics.md).
For a full list of available metrics and labels, check out [Metrics](../../docs/concepts/metrics.md).
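
As a rough sketch, scraping these metrics with Prometheus could be configured along the following lines; the `/metrics` path, port `3000`, and job name are assumptions, so check the Metrics reference for the exact endpoint exposed by your `dstack` server.

```yaml
# prometheus.yml -- hypothetical scrape config for a dstack server
scrape_configs:
  - job_name: dstack                 # assumed job name
    metrics_path: /metrics           # assumed metrics endpoint
    static_configs:
      - targets: ["localhost:3000"]  # assumed dstack server host:port
```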

??? info "NVIDIA"
NVIDIA DCGM metrics are automatically collected for `aws`, `azure`, `gcp`, and `oci` backends,
@@ -59,7 +59,7 @@ For a full list of available metrics and labels, check out [Metrics](../../docs/
only accessible through the UI and the [`dstack metrics`](dstack-metrics.md) CLI.

!!! info "What's next?"
1. See [Metrics](../../docs/guides/metrics.md)
1. See [Metrics](../../docs/concepts/metrics.md)
1. Check [dev environments](../../docs/concepts/dev-environments.md),
[tasks](../../docs/concepts/tasks.md), [services](../../docs/concepts/services.md),
and [fleets](../../docs/concepts/fleets.md)
21 changes: 13 additions & 8 deletions docs/docs/concepts/backends.md
@@ -1,21 +1,22 @@
# Backends

Backends allow `dstack` to provision fleets across cloud providers or Kubernetes clusters.
Backends allow `dstack` to provision fleets across GPU clouds or Kubernetes clusters.

`dstack` supports two types of backends:

* [VM-based](#vm-based) – use `dstack`'s native integration with cloud providers to provision VMs, manage clusters, and orchestrate container-based runs.
* [Container-based](#container-based) – use either `dstack`'s native integration with cloud providers or Kubernetes to orchestrate container-based runs; provisioning in this case is delegated to the cloud provider or Kubernetes.

??? info "SSH fleets"
!!! info "SSH fleets"
When using `dstack` with on-prem servers, backend configuration isn’t required. Simply create [SSH fleets](../concepts/fleets.md#ssh-fleets) once the servers are up.

Backends can be configured via `~/.dstack/server/config.yml` or through the [project settings page](../concepts/projects.md#backends) in the UI. See the examples of backend configuration below.

> If you update `~/.dstack/server/config.yml`, you have to restart the server.

## VM-based

VM-based backends allow `dstack` users to manage clusters and orchestrate container-based runs across a wide range of cloud providers.
Under the hood, `dstack` uses native integrations with these providers to provision clusters on demand.
VM-based backends allow `dstack` users to manage clusters and orchestrate container-based runs across a wide range of cloud providers. Under the hood, `dstack` uses native integrations with these providers to provision clusters on demand.

Compared to [container-based](#container-based) backends, this approach offers finer-grained, simpler control over cluster provisioning and eliminates the dependency on a Kubernetes layer.

@@ -1036,9 +1037,13 @@ projects:

No additional setup is required — `dstack` configures and manages the proxy automatically.

??? info "NVIDIA GPU Operator"
For `dstack` to correctly detect GPUs in your Kubernetes cluster, the cluster must have the
[NVIDIA GPU Operator](https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/index.html) pre-installed.
??? info "Required operators"
=== "NVIDIA"
For `dstack` to correctly detect GPUs in your Kubernetes cluster, the cluster must have the
[NVIDIA GPU Operator](https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/index.html) pre-installed.
=== "AMD"
For `dstack` to correctly detect GPUs in your Kubernetes cluster, the cluster must have the
[AMD GPU Operator](https://github.com/ROCm/gpu-operator) pre-installed.

<!-- ??? info "Managed Kubernetes"
While `dstack` supports both managed and on-prem Kubernetes clusters, it can only run on pre-provisioned nodes.
@@ -1071,7 +1076,7 @@

Ensure you've created a ClusterRoleBinding to grant the role to the user or the service account you're using.
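
For example, such a binding could be created with a manifest along these lines; the binding name, service account, and role below are placeholders rather than the exact names `dstack` expects.

```yaml
# clusterrolebinding.yaml -- illustrative placeholder names
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: dstack-binding
subjects:
  - kind: ServiceAccount
    name: dstack            # service account used by the backend (placeholder)
    namespace: default
roleRef:
  kind: ClusterRole
  name: cluster-admin       # role granting the required permissions (placeholder)
  apiGroup: rbac.authorization.k8s.io
```

Applying it with `kubectl apply -f clusterrolebinding.yaml` grants the referenced role cluster-wide.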

> To learn more, see the [Kubernetes](../guides/kubernetes.md) guide.
> To learn more, see the [Lambda](../../examples/clusters/lambda/#kubernetes) and [Crusoe](../../examples/clusters/crusoe/#kubernetes) examples.

### RunPod
