13 changes: 6 additions & 7 deletions README.md
@@ -18,7 +18,7 @@

It streamlines development, training, and inference, and is compatible with any hardware, open-source tools, and frameworks.

#### Hardware
#### Accelerators

`dstack` supports `NVIDIA`, `AMD`, `Google TPU`, `Intel Gaudi`, and `Tenstorrent` accelerators out of the box.

@@ -46,7 +46,7 @@ It streamlines development, training, and inference, and is compatible with any

##### Configure backends

To orchestrate compute across cloud providers or existing Kubernetes clusters, you need to configure backends.
To orchestrate compute across GPU clouds or Kubernetes clusters, you need to configure backends.

Backends can be set up in `~/.dstack/server/config.yml` or through the [project settings page](https://dstack.ai/docs/concepts/projects#backends) in the UI.
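
For illustration, a minimal `~/.dstack/server/config.yml` could look like the sketch below; the `aws` backend, project name, and region are example values only, and the full schema is covered in the backends documentation.

```yaml
# ~/.dstack/server/config.yml -- minimal sketch with example values
projects:
  - name: main
    backends:
      - type: aws
        regions: [us-east-1]  # optionally restrict provisioning to specific regions
        creds:
          type: default       # use the default AWS credential chain
```

As noted in the backends documentation, the server must be restarted after this file is updated.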

Expand Down Expand Up @@ -123,12 +123,11 @@ Configuration is updated at ~/.dstack/config.yml

`dstack` supports the following configurations:

* [Dev environments](https://dstack.ai/docs/dev-environments) — for interactive development using a desktop IDE
* [Tasks](https://dstack.ai/docs/tasks) — for scheduling jobs (incl. distributed jobs) or running web apps
* [Services](https://dstack.ai/docs/services) — for deployment of models and web apps (with auto-scaling and authorization)
* [Fleets](https://dstack.ai/docs/fleets) — for managing cloud and on-prem clusters
* [Fleets](https://dstack.ai/docs/concepts/fleets) — for managing cloud and on-prem clusters
* [Dev environments](https://dstack.ai/docs/concepts/dev-environments) — for interactive development using a desktop IDE
* [Tasks](https://dstack.ai/docs/concepts/tasks) — for scheduling jobs (incl. distributed jobs) or running web apps
* [Services](https://dstack.ai/docs/concepts/services) — for deployment of models and web apps (with auto-scaling and authorization)
* [Volumes](https://dstack.ai/docs/concepts/volumes) — for managing persisted volumes
* [Gateways](https://dstack.ai/docs/concepts/gateways) — for configuring the ingress traffic and public endpoints

Configuration can be defined as YAML files within your repo.
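
As an illustration, a minimal task configuration might look roughly like the sketch below; the Python version, commands, and GPU spec are placeholders rather than a prescribed setup.

```yaml
# .dstack.yml -- hypothetical task sketch
type: task
name: train

python: "3.12"
commands:
  - pip install -r requirements.txt
  - python train.py

resources:
  gpu: 24GB  # request any GPU with at least 24 GB of memory
```

A configuration like this is typically submitted with `dstack apply`, which provisions matching compute and runs the task.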

4 changes: 2 additions & 2 deletions docs/blog/posts/gpu-health-checks.md
@@ -12,7 +12,7 @@ categories:

In large-scale training, a single bad GPU can derail progress. Sometimes the failure is obvious — jobs crash outright. Other times it’s subtle: correctable memory errors, intermittent instability, or thermal throttling that quietly drags down throughput. In big experiments, these issues can go unnoticed for hours or days, wasting compute and delaying results.

`dstack` already supports GPU telemetry monitoring through NVIDIA DCGM [metrics](../../docs/guides/metrics.md), covering utilization, memory, and temperature. This release extends that capability with passive hardware health checks powered by DCGM [background health checks](https://docs.nvidia.com/datacenter/dcgm/latest/user-guide/feature-overview.html#background-health-checks). With these, `dstack` continuously evaluates fleet GPUs for hardware reliability and displays their status before scheduling workloads.
`dstack` already supports GPU telemetry monitoring through NVIDIA DCGM [metrics](../../docs/concepts/metrics.md), covering utilization, memory, and temperature. This release extends that capability with passive hardware health checks powered by DCGM [background health checks](https://docs.nvidia.com/datacenter/dcgm/latest/user-guide/feature-overview.html#background-health-checks). With these, `dstack` continuously evaluates fleet GPUs for hardware reliability and displays their status before scheduling workloads.

<img src="https://dstack.ai/static-assets/static-assets/images/gpu-health-checks.png" width="630"/>

@@ -69,5 +69,5 @@ If you have experience with GPU reliability or ideas for automated recovery, joi
!!! info "What's next?"
1. Check [Quickstart](../../docs/quickstart.md)
2. Explore the [clusters](../../docs/guides/clusters.md) guide
3. Learn more about [metrics](../../docs/guides/metrics.md)
3. Learn more about [metrics](../../docs/concepts/metrics.md)
4. Join [Discord](https://discord.gg/u8SmfwPpMd)
2 changes: 1 addition & 1 deletion docs/blog/posts/metrics-ui.md
@@ -53,6 +53,6 @@ For persistent storage and long-term access to metrics, we still recommend setti
metrics from `dstack`.

!!! info "What's next?"
1. See [Metrics](../../docs/guides/metrics.md)
1. See [Metrics](../../docs/concepts/metrics.md)
2. Check [dev environments](../../docs/concepts/dev-environments.md), [tasks](../../docs/concepts/tasks.md), [services](../../docs/concepts/services.md), and [fleets](../../docs/concepts/fleets.md)
3. Join [Discord](https://discord.gg/u8SmfwPpMd)
4 changes: 2 additions & 2 deletions docs/blog/posts/prometheus.md
@@ -45,7 +45,7 @@ Overall, `dstack` collects three groups of metrics:
| **Runs** | Run metrics include run counters for each user in each project. |
| **Jobs** | A run consists of one or more jobs, each mapped to a container. Job metrics offer insights into execution time, cost, GPU model, NVIDIA DCGM telemetry, and more. |

For a full list of available metrics and labels, check out [Metrics](../../docs/guides/metrics.md).
For a full list of available metrics and labels, check out [Metrics](../../docs/concepts/metrics.md).
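
As a rough sketch, scraping these metrics with Prometheus could be configured along the following lines; the `/metrics` path, port `3000`, and job name are assumptions, so check the Metrics reference for the exact endpoint exposed by your `dstack` server.

```yaml
# prometheus.yml -- hypothetical scrape config for a dstack server
scrape_configs:
  - job_name: dstack                 # assumed job name
    metrics_path: /metrics           # assumed metrics endpoint
    static_configs:
      - targets: ["localhost:3000"]  # assumed dstack server host:port
```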

??? info "NVIDIA"
NVIDIA DCGM metrics are automatically collected for `aws`, `azure`, `gcp`, and `oci` backends,
@@ -59,7 +59,7 @@ For a full list of available metrics and labels, check out [Metrics](../../docs/
only accessible through the UI and the [`dstack metrics`](dstack-metrics.md) CLI.

!!! info "What's next?"
1. See [Metrics](../../docs/guides/metrics.md)
1. See [Metrics](../../docs/concepts/metrics.md)
1. Check [dev environments](../../docs/concepts/dev-environments.md),
[tasks](../../docs/concepts/tasks.md), [services](../../docs/concepts/services.md),
and [fleets](../../docs/concepts/fleets.md)
21 changes: 13 additions & 8 deletions docs/docs/concepts/backends.md
@@ -1,21 +1,22 @@
# Backends

Backends allow `dstack` to provision fleets across cloud providers or Kubernetes clusters.
Backends allow `dstack` to provision fleets across GPU clouds or Kubernetes clusters.

`dstack` supports two types of backends:

* [VM-based](#vm-based) – use `dstack`'s native integration with cloud providers to provision VMs, manage clusters, and orchestrate container-based runs.
* [Container-based](#container-based) – use either `dstack`'s native integration with cloud providers or Kubernetes to orchestrate container-based runs; provisioning in this case is delegated to the cloud provider or Kubernetes.

??? info "SSH fleets"
!!! info "SSH fleets"
When using `dstack` with on-prem servers, backend configuration isn’t required. Simply create [SSH fleets](../concepts/fleets.md#ssh-fleets) once the servers are up.

Backends can be configured via `~/.dstack/server/config.yml` or through the [project settings page](../concepts/projects.md#backends) in the UI. See the examples of backend configuration below.

> If you update `~/.dstack/server/config.yml`, you have to restart the server.

## VM-based

VM-based backends allow `dstack` users to manage clusters and orchestrate container-based runs across a wide range of cloud providers.
Under the hood, `dstack` uses native integrations with these providers to provision clusters on demand.
VM-based backends allow `dstack` users to manage clusters and orchestrate container-based runs across a wide range of cloud providers. Under the hood, `dstack` uses native integrations with these providers to provision clusters on demand.

Compared to [container-based](#container-based) backends, this approach offers finer-grained, simpler control over cluster provisioning and eliminates the dependency on a Kubernetes layer.

@@ -1036,9 +1037,13 @@ projects:

No additional setup is required — `dstack` configures and manages the proxy automatically.

??? info "NVIDIA GPU Operator"
For `dstack` to correctly detect GPUs in your Kubernetes cluster, the cluster must have the
[NVIDIA GPU Operator](https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/index.html) pre-installed.
??? info "Required operators"
=== "NVIDIA"
For `dstack` to correctly detect GPUs in your Kubernetes cluster, the cluster must have the
[NVIDIA GPU Operator](https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/index.html) pre-installed.
=== "AMD"
For `dstack` to correctly detect GPUs in your Kubernetes cluster, the cluster must have the
[AMD GPU Operator](https://github.com/ROCm/gpu-operator) pre-installed.

<!-- ??? info "Managed Kubernetes"
While `dstack` supports both managed and on-prem Kubernetes clusters, it can only run on pre-provisioned nodes.
@@ -1071,7 +1076,7 @@

Ensure you've created a ClusterRoleBinding to grant the role to the user or the service account you're using.
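
For example, such a binding could be created with a manifest along these lines; the binding name, service account, and role below are placeholders rather than the exact names `dstack` expects.

```yaml
# clusterrolebinding.yaml -- illustrative placeholder names
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: dstack-binding
subjects:
  - kind: ServiceAccount
    name: dstack            # service account used by the backend (placeholder)
    namespace: default
roleRef:
  kind: ClusterRole
  name: cluster-admin       # role granting the required permissions (placeholder)
  apiGroup: rbac.authorization.k8s.io
```

Applying it with `kubectl apply -f clusterrolebinding.yaml` grants the referenced role cluster-wide.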

> To learn more, see the [Kubernetes](../guides/kubernetes.md) guide.
> To learn more, see the [Lambda](../../examples/clusters/lambda/#kubernetes) and [Crusoe](../../examples/clusters/crusoe/#kubernetes) examples.

### RunPod
