GPU cluster orchestration on Crusoe with dstack
Crusoe now integrates natively with dstack, giving ML teams a single declarative workflow to provision GPU clusters, run distributed training, and deploy inference endpoints on Crusoe infrastructure.

Scaling distributed GPU workloads on cloud infrastructure involves two compounding friction points: provisioning interconnected clusters fast enough to keep training pipelines moving, and maintaining consistency across development, training, and inference without accumulating configuration debt.
To address this, Crusoe now natively integrates with dstack, an open-source control plane for GPU orchestration. With this integration, teams can provision Crusoe GPU clusters, run multi-node training jobs, and deploy inference endpoints through a single declarative workflow, all without managing Kubernetes or writing custom provisioning scripts.
Why dstack?
dstack is an open-source control plane for GPU provisioning and orchestration across GPU clouds, Kubernetes, and on-prem clusters. On Crusoe, it provides a VM-first path to provision interconnected GPU clusters and run dev environments, training and batch jobs, and model inference through a single control plane.
If you are new to dstack, start with the repository and docs:
- GitHub: https://github.com/dstackai/dstack
- Docs: https://dstack.ai/docs
Why orchestration on Crusoe?
Teams usually face two different bottlenecks. First, manual VM provisioning and ad-hoc commands create configuration drift and slow iteration. Second, Kubernetes-heavy workflows can add operational complexity when the immediate goal is to provision GPU clusters quickly and run workloads consistently.
With dstack on Crusoe, you define infrastructure and workloads declaratively:
- Provisioning via backends and fleets
- Scheduling across development, training or batch jobs, and model inference
What the Crusoe integration does
For Crusoe VMs, dstack natively integrates with the crusoe backend so you can:
- Provision interconnected GPU clusters directly from dstack
- Reuse the same fleet for interactive development and production workloads
For cluster-style workloads, this is the key operational split:
- Backends define where dstack can provision compute: Backends (Crusoe)
- Fleets define how that compute pool is provisioned and reused: Fleets
Prerequisites and installation
You need:
- A Crusoe account, API key, and project ID
- The dstack server and CLI installed
Minimal installation path:

```shell
uv tool install "dstack[all]" -U
dstack server
```

Then configure your local CLI profile to point to your server (see installation docs for full details): Installation | Quickstart
Configure the Crusoe backend
In ~/.dstack/server/config.yml:
```yaml
projects:
- name: main
  backends:
  - type: crusoe
    project_id: your-project-id
    creds:
      type: access_key
      access_key: your-access-key
      secret_key: your-secret-key
    regions:
    - us-east1-a
    - us-southcentral1-a
```

`regions` is optional. If omitted, dstack uses all available Crusoe regions for offer matching.
Create a Cluster Fleet on Crusoe
Define a fleet with cluster placement:
```yaml
type: fleet
name: crusoe-fleet
nodes: 2
placement: cluster
backends: [crusoe]
resources:
  gpu: H100:8
```

Apply it:

```shell
dstack apply -f crusoe-fleet.dstack.yml
```
Notes:
- `placement: cluster` is required for distributed multi-node tasks.
- If you want on-demand provisioning instead of pre-provisioning, use `nodes: 0..2`.
- Once created, the same fleet can run dev environments, tasks, and services.
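The on-demand variant mentioned in the notes can be sketched as follows (the fleet name is illustrative; only `nodes` changes relative to the pre-provisioned fleet above):

```yaml
type: fleet
name: crusoe-fleet-ondemand
# Provision between zero and two nodes on demand,
# instead of pre-provisioning all of them upfront
nodes: 0..2
placement: cluster
backends: [crusoe]
resources:
  gpu: H100:8
```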
Run workloads on the fleet
Distributed training on Crusoe clusters
Use a distributed task with `nodes` and `torchrun` to run multi-node training jobs.
```yaml
type: task
name: distributed-job
repos:
  - .  # Mount the current Git repo into the run working directory
nodes: 2
python: "3.12"
env:
  - NCCL_DEBUG=INFO
commands:
  - uv pip install -r requirements.txt
  - |
    torchrun \
      --nproc-per-node=$DSTACK_GPUS_PER_NODE \
      --node-rank "$DSTACK_NODE_RANK" \
      --nnodes "$DSTACK_NODES_NUM" \
      --master-addr "$DSTACK_MASTER_NODE_IP" \
      --master-port=12345 \
      train.py
resources:
  gpu: H100:1..8
  shm_size: 16GB
```

This model stays framework-agnostic while remaining practical. The same pattern works with PyTorch, Hugging Face TRL, Accelerate, DeepSpeed, and similar stacks.
Notes:
- For custom scripts, include `repos` (or `files`) so your training code is available inside the container.
- You can use any Docker image. If `image` is not specified, dstack uses its default image.
- Distributed tasks require a fleet with `placement: cluster`.
Reference: Distributed tasks
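The `DSTACK_*` values used in the `torchrun` command arrive as ordinary environment variables inside each node's container. As a minimal Python sketch (an illustrative helper, not part of dstack), a training script could collect them with single-node fallbacks for local testing:

```python
import os

def rendezvous_args(env=None):
    """Collect dstack-injected rendezvous variables, falling back to
    single-node defaults when they are absent (e.g. running locally)."""
    env = os.environ if env is None else env
    return {
        "nnodes": int(env.get("DSTACK_NODES_NUM", 1)),
        "node_rank": int(env.get("DSTACK_NODE_RANK", 0)),
        "nproc_per_node": int(env.get("DSTACK_GPUS_PER_NODE", 1)),
        "master_addr": env.get("DSTACK_MASTER_NODE_IP", "127.0.0.1"),
    }

# With no dstack variables set, the helper yields single-node defaults
print(rendezvous_args({}))
```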
Single-node tasks
The same task model also works for single-node jobs: simply omit `nodes` and keep the same operational flow (`dstack apply`, `dstack logs`, `dstack stop`).
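A single-node version of the distributed task above might look like this (the name and script are illustrative):

```yaml
type: task
name: single-node-job
repos:
  - .  # Mount the current Git repo into the run working directory
python: "3.12"
commands:
  - uv pip install -r requirements.txt
  - python train.py
resources:
  gpu: H100:1
```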
Validate interconnect with NCCL tests
On Crusoe VMs, HPC-X and NCCL topology files are pre-installed on the host image, which makes validation straightforward. By default, the dstack image already includes nccl-tests and mpirun, and you can also use a custom image when needed.
```yaml
type: task
name: nccl-tests
nodes: 2
startup_order: workers-first
stop_criteria: master-done
volumes:
  - /opt/hpcx:/opt/hpcx
  - /etc/crusoe/nccl_topo:/etc/crusoe/nccl_topo
commands:
  - . /opt/hpcx/hpcx-init.sh
  - hpcx_load
  - |
    if [ $DSTACK_NODE_RANK -eq 0 ]; then
      mpirun \
        --allow-run-as-root \
        --hostfile $DSTACK_MPI_HOSTFILE \
        -n $DSTACK_GPUS_NUM \
        -N $DSTACK_GPUS_PER_NODE \
        --bind-to none \
        -mca btl tcp,self \
        -mca coll_hcoll_enable 0 \
        -x PATH \
        -x LD_LIBRARY_PATH \
        -x CUDA_DEVICE_ORDER=PCI_BUS_ID \
        -x NCCL_SOCKET_NTHREADS=4 \
        -x NCCL_NSOCKS_PERTHREAD=8 \
        -x NCCL_TOPO_FILE=/etc/crusoe/nccl_topo/a100-80gb-sxm-ib-cloud-hypervisor.xml \
        -x NCCL_IB_MERGE_VFS=0 \
        -x NCCL_IB_HCA=^mlx5_0:1 \
        /opt/nccl-tests/build/all_reduce_perf -b 8 -e 2G -f 2 -t 1 -g 1 -c 1 -n 100
    else
      sleep infinity
    fi
backends: [crusoe]
resources:
  gpu: A100:80GB:8
  shm_size: 16GB
```

Apply it:

```shell
dstack apply -f crusoe-nccl-tests.dstack.yml
```

Make sure `NCCL_TOPO_FILE` matches your instance type.
Dev environments for interactive workflows
You can use the same fleet for remote IDE sessions:
```yaml
type: dev-environment
name: crusoe-dev
python: "3.12"
ide: cursor
resources:
  gpu: 1
```

Apply it:

```shell
dstack apply -f crusoe-dev.dstack.yml
```

```
Launching `crusoe-dev`...
---> 100%

To open in Cursor Desktop, use this link:
  cursor://vscode-remote/ssh-remote+crusoe-dev/workflow
```

Opening the generated cursor:// link launches the remote workspace directly in Cursor Desktop.
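If you prefer VS Code over Cursor, the same configuration works with the `ide` field swapped (a variant sketch; the name is illustrative):

```yaml
type: dev-environment
name: crusoe-dev-vscode
python: "3.12"
ide: vscode  # Opens in VS Code Desktop instead of Cursor
resources:
  gpu: 1
```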
Reference: Dev environments
Services and inference
Beyond development and training tasks, the same Crusoe-backed fleet can run dstack services for model inference as scalable endpoints. This includes features required for production workloads, such as autoscaling, built-in authentication, and gateway-based endpoint management. For advanced serving architectures, dstack also supports Prefill-Decode disaggregation with SGLang.
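A minimal service sketch on the same fleet might look like the following. The image, model name, and serving command are illustrative assumptions (any OpenAI-compatible server works); only the dstack fields (`type`, `port`, `resources`, etc.) come from the service spec:

```yaml
type: service
name: inference-endpoint
image: vllm/vllm-openai:latest  # illustrative serving image
env:
  - MODEL_ID=meta-llama/Llama-3.1-8B-Instruct  # illustrative model
commands:
  - vllm serve $MODEL_ID --port 8000
port: 8000
resources:
  gpu: H100:1
```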
Reference: Services
Kubernetes on Crusoe
If you want to run through Crusoe Managed Kubernetes instead of VM-based backend fleets, use the Kubernetes path here:
https://dstack.ai/examples/clusters/crusoe/#kubernetes
In this post, the VM path is intentionally primary because it provides the most direct provisioning and orchestration flow for cluster workloads.
Get started on Crusoe with dstack
Everything covered here (cluster provisioning, distributed training, dev environments, and inference) is available on Crusoe today. Start with the resources below:
- Start from the Crusoe example: https://dstack.ai/examples/clusters/crusoe/
- Configure backend: Backends (Crusoe)
- Create and manage fleets: Fleets
- Run training and batch jobs: Tasks
- Use Desktop IDE with Crusoe GPUs: Dev environments
- Deploy model inference: Services