GPU cluster orchestration on Crusoe with dstack
Crusoe now integrates natively with dstack, giving ML teams a single declarative workflow to provision GPU clusters, run distributed training, and deploy inference endpoints on Crusoe infrastructure.

Scaling distributed GPU workloads on cloud infrastructure involves two compounding friction points: provisioning interconnected clusters fast enough to keep training pipelines moving, and maintaining consistency across development, training, and inference without accumulating configuration debt.
To address this, Crusoe now natively integrates with dstack, an open-source control plane for GPU orchestration. With this integration, teams can provision Crusoe GPU clusters, run multi-node training jobs, and deploy inference endpoints through a single declarative workflow, all without managing Kubernetes or writing custom provisioning scripts.
Why dstack?
dstack is an open-source control plane for GPU provisioning and orchestration across GPU clouds, Kubernetes, and on-prem clusters. On Crusoe, it provides a VM-first path to provision interconnected GPU clusters and run dev environments, training and batch jobs, and model inference through a single control plane.
If you are new to dstack, start with the repository and docs:
- GitHub: https://github.com/dstackai/dstack
- Docs: https://dstack.ai/docs
Why orchestration on Crusoe?
Teams usually face two different bottlenecks. First, manual VM provisioning and ad-hoc commands create configuration drift and slow iteration. Second, Kubernetes-heavy workflows can add operational complexity when the immediate goal is to provision GPU clusters quickly and run workloads consistently.
With dstack on Crusoe, you define infrastructure and workloads declaratively:
- Provisioning via backends and fleets
- Scheduling across development, training or batch jobs, and model inference
What the Crusoe integration does
For Crusoe VMs, dstack natively integrates with the crusoe backend so you can:
- Provision interconnected GPU clusters directly from dstack
- Reuse the same fleet for interactive development and production workloads
For cluster-style workloads, this is the key operational split:
- Backends define where dstack can provision compute: Backends (Crusoe)
- Fleets define how that compute pool is provisioned and reused: Fleets
Prerequisites and installation
You need:
- A Crusoe account, API key, and project ID
- The dstack server and CLI installed
Minimal installation path:

```shell
uv tool install "dstack[all]" -U
dstack server
```

Then configure your local CLI profile to point to your server (see installation docs for full details): Installation | Quickstart
Configure the Crusoe backend
In ~/.dstack/server/config.yml:
```yaml
projects:
- name: main
  backends:
  - type: crusoe
    project_id: your-project-id
    creds:
      type: access_key
      access_key: your-access-key
      secret_key: your-secret-key
    regions:
    - us-east1-a
    - us-southcentral1-a
```

`regions` is optional. If omitted, dstack uses all available Crusoe regions for offer matching.
Create a Cluster Fleet on Crusoe
Define a fleet with cluster placement:
```yaml
type: fleet
name: crusoe-fleet
nodes: 2
placement: cluster
backends: [crusoe]
resources:
  gpu: H100:8
```

Apply it:

```shell
dstack apply -f crusoe-fleet.dstack.yml
```
Notes:
- `placement: cluster` is required for distributed multi-node tasks.
- If you want on-demand provisioning instead of pre-provisioning, use `nodes: 0..2`.
- Once created, the same fleet can run dev environments, tasks, and services.
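The on-demand variant mentioned in the notes can be sketched as follows (the fleet name is illustrative; only `nodes` changes relative to the pre-provisioned fleet above):

```yaml
type: fleet
name: crusoe-fleet-ondemand
# Provision between zero and two nodes on demand,
# instead of pre-provisioning all of them upfront
nodes: 0..2
placement: cluster
backends: [crusoe]
resources:
  gpu: H100:8
```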
Run workloads on the fleet
Distributed training on Crusoe clusters
Use a distributed task with `nodes` and `torchrun` to run multi-node training jobs.
```yaml
type: task
name: distributed-job
repos:
  - .  # Mount the current Git repo into the run working directory
nodes: 2
python: "3.12"
env:
  - NCCL_DEBUG=INFO
commands:
  - uv pip install -r requirements.txt
  - |
    torchrun \
      --nproc-per-node=$DSTACK_GPUS_PER_NODE \
      --node-rank "$DSTACK_NODE_RANK" \
      --nnodes "$DSTACK_NODES_NUM" \
      --master-addr "$DSTACK_MASTER_NODE_IP" \
      --master-port=12345 \
      train.py
resources:
  gpu: H100:1..8
  shm_size: 16GB
```

This model stays framework-agnostic while remaining practical. The same pattern works with PyTorch, Hugging Face TRL, Accelerate, DeepSpeed, and similar stacks.
Notes:
- For custom scripts, include `repos` (or `files`) so your training code is available inside the container.
- You can use any Docker image. If `image` is not specified, dstack uses its default image.
- Distributed tasks require a fleet with `placement: cluster`.
Reference: Distributed tasks
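The `DSTACK_*` values used in the `torchrun` command arrive as ordinary environment variables inside each node's container. As a minimal Python sketch (an illustrative helper, not part of dstack), a training script could collect them with single-node fallbacks for local testing:

```python
import os

def rendezvous_args(env=None):
    """Collect dstack-injected rendezvous variables, falling back to
    single-node defaults when they are absent (e.g. running locally)."""
    env = os.environ if env is None else env
    return {
        "nnodes": int(env.get("DSTACK_NODES_NUM", 1)),
        "node_rank": int(env.get("DSTACK_NODE_RANK", 0)),
        "nproc_per_node": int(env.get("DSTACK_GPUS_PER_NODE", 1)),
        "master_addr": env.get("DSTACK_MASTER_NODE_IP", "127.0.0.1"),
    }

# With no dstack variables set, the helper yields single-node defaults
print(rendezvous_args({}))
```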
Single-node tasks
The same task model also works for single-node jobs: simply omit `nodes` and keep the same operational flow (`dstack apply`, `dstack logs`, `dstack stop`).
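A single-node version of the distributed task above might look like this (the name and script are illustrative):

```yaml
type: task
name: single-node-job
repos:
  - .  # Mount the current Git repo into the run working directory
python: "3.12"
commands:
  - uv pip install -r requirements.txt
  - python train.py
resources:
  gpu: H100:1
```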
Validate interconnect with NCCL tests
On Crusoe VMs, HPC-X and NCCL topology files are pre-installed on the host image, which makes validation straightforward. By default, the dstack image already includes nccl-tests and mpirun, and you can also use a custom image when needed.
```yaml
type: task
name: nccl-tests
nodes: 2
startup_order: workers-first
stop_criteria: master-done
volumes:
  - /opt/hpcx:/opt/hpcx
  - /etc/crusoe/nccl_topo:/etc/crusoe/nccl_topo
commands:
  - . /opt/hpcx/hpcx-init.sh
  - hpcx_load
  - |
    if [ $DSTACK_NODE_RANK -eq 0 ]; then
      mpirun \
        --allow-run-as-root \
        --hostfile $DSTACK_MPI_HOSTFILE \
        -n $DSTACK_GPUS_NUM \
        -N $DSTACK_GPUS_PER_NODE \
        --bind-to none \
        -mca btl tcp,self \
        -mca coll_hcoll_enable 0 \
        -x PATH \
        -x LD_LIBRARY_PATH \
        -x CUDA_DEVICE_ORDER=PCI_BUS_ID \
        -x NCCL_SOCKET_NTHREADS=4 \
        -x NCCL_NSOCKS_PERTHREAD=8 \
        -x NCCL_TOPO_FILE=/etc/crusoe/nccl_topo/a100-80gb-sxm-ib-cloud-hypervisor.xml \
        -x NCCL_IB_MERGE_VFS=0 \
        -x NCCL_IB_HCA=^mlx5_0:1 \
        /opt/nccl-tests/build/all_reduce_perf -b 8 -e 2G -f 2 -t 1 -g 1 -c 1 -n 100
    else
      sleep infinity
    fi
backends: [crusoe]
resources:
  gpu: A100:80GB:8
  shm_size: 16GB
```

Apply it:

```shell
dstack apply -f crusoe-nccl-tests.dstack.yml
```

Make sure `NCCL_TOPO_FILE` matches your instance type.
Dev environments for interactive workflows
You can use the same fleet for remote IDE sessions:
```yaml
type: dev-environment
name: crusoe-dev
python: "3.12"
ide: cursor
resources:
  gpu: 1
```

Apply it:

```shell
dstack apply -f crusoe-dev.dstack.yml
```

```
Launching `crusoe-dev`...
---> 100%

To open in Cursor Desktop, use this link:
  cursor://vscode-remote/ssh-remote+crusoe-dev/workflow
```

Opening the generated cursor:// link launches the remote workspace directly in Cursor Desktop.
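If you prefer VS Code over Cursor, the same configuration works with the `ide` field swapped (a variant sketch; the name is illustrative):

```yaml
type: dev-environment
name: crusoe-dev-vscode
python: "3.12"
ide: vscode  # Opens in VS Code Desktop instead of Cursor
resources:
  gpu: 1
```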
Reference: Dev environments
Services and inference
Beyond development and training tasks, the same Crusoe-backed fleet can run dstack services for model inference as scalable endpoints. This includes features required for production workloads, such as autoscaling, built-in authentication, and gateway-based endpoint management. For advanced serving architectures, dstack also supports Prefill-Decode disaggregation with SGLang.
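A minimal service sketch on the same fleet might look like the following. The image, model name, and serving command are illustrative assumptions (any OpenAI-compatible server works); only the dstack fields (`type`, `port`, `resources`, etc.) come from the service spec:

```yaml
type: service
name: inference-endpoint
image: vllm/vllm-openai:latest  # illustrative serving image
env:
  - MODEL_ID=meta-llama/Llama-3.1-8B-Instruct  # illustrative model
commands:
  - vllm serve $MODEL_ID --port 8000
port: 8000
resources:
  gpu: H100:1
```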
Reference: Services
Kubernetes on Crusoe
If you want to run through Crusoe Managed Kubernetes instead of VM-based backend fleets, use the Kubernetes path here:
https://dstack.ai/examples/clusters/crusoe/#kubernetes
In this post, the VM path is intentionally primary because it provides the most direct provisioning and orchestration flow for cluster workloads.
Get started on Crusoe with dstack
Everything covered here (cluster provisioning, distributed training, dev environments, and inference) is available on Crusoe today. Start with the resources below:
- Start from the Crusoe example: https://dstack.ai/examples/clusters/crusoe/
- Configure backend: Backends (Crusoe)
- Create and manage fleets: Fleets
- Run training and batch jobs: Tasks
- Use Desktop IDE with Crusoe GPUs: Dev environments
- Deploy model inference: Services