Running AI workloads on AMD GPUs with SkyPilot
Learn how to orchestrate AMD Instinct™ MI300X clusters on Crusoe Managed Kubernetes using SkyPilot. This step-by-step technical guide covers installing the AMD GPU Operator, configuring ROCm drivers, and launching production-grade AI training jobs.

In the evolving AI landscape, staying ahead requires strategic versatility. Builders are increasingly leveraging GPU vendor diversity to optimize price-performance and ensure capacity availability for mission-critical workloads.
Using the AMD Instinct™ MI300X as an example, this post explores how to utilize SkyPilot — a multi-cloud deployment framework — to orchestrate AMD clusters seamlessly on Crusoe Managed Kubernetes (CMK), helping you build a more resilient, multi-vendor AI strategy.
This guide walks through the exact steps needed to:
- Enable AMD GPU support on Kubernetes
- Configure the AMD GPU operator
- Integrate SkyPilot with your cluster
- Launch GPU jobs on MI300X nodes
This is based on a full working setup on an MI300X node pool running AMD GPU Operator v1.2.2.
Prerequisites
Ensure that you have a supported Kubernetes cluster (CMK) with at least one AMD MI300X node pool.
If you don't have a CMK cluster yet, let's get one set up.
Step 1: Install the Crusoe CLI
First, install the Crusoe CLI by following the instructions in Crusoe's documentation. Then, run the following commands to create your cluster and an MI300X node pool:
crusoe kubernetes clusters create --name <cluster-name> \
--cluster-version <version> --location <location> \
--project-id <project-id>
crusoe kubernetes nodepools create --name <node-pool-name> \
--cluster-name <cluster-name> --count 1 \
--type mi300x-192gb-ib.8x
Once your cluster is provisioned with an MI300X node, fetch your credentials with this command:
crusoe clusters get-credentials <cluster-name> --project-id <project-id>
Step 2: Install SkyPilot with Kubernetes support
Next, install or upgrade the SkyPilot CLI with Kubernetes support. Make sure your kubeconfig is stored in $HOME/.kube/config.
pip install --upgrade "skypilot[kubernetes]"
Verify that SkyPilot can see your CMK cluster:
sky check kubernetes
Overview
In this tutorial, we’ll walk through:
- Installing the AMD GPU Operator
- Verifying AMD GPU recognition in Kubernetes
- Configuring SkyPilot to understand MI300X accelerators
- Running your first MI300X workload with SkyPilot
This setup results in a fully operational AMD GPU cluster with SkyPilot-powered scheduling and execution.
1. Installing the AMD GPU Operator
Typically we bundle the GPU Operator into our clusters for a functional out-of-the-box setup, but here we walk through each step explicitly for clarity.
Add the cert-manager Helm repo:
helm repo add jetstack https://charts.jetstack.io --force-update && helm repo update
Install cert-manager (AMD's instructions use v1.15.1):
helm install cert-manager jetstack/cert-manager \
--namespace cert-manager \
--create-namespace \
--version v1.15.1 \
--set crds.enabled=true
Add the ROCm Helm repo:
helm repo add rocm https://rocm.github.io/gpu-operator && helm repo update
Install the AMD GPU Operator (I used v1.2.2):
helm install amd-gpu-operator rocm/gpu-operator-charts \
--namespace kube-amd-gpu --create-namespace \
--version v1.2.2
Create a registry secret that the AMD GPU Operator can use to push and pull driver images (I used my personal Docker Hub account):
kubectl create secret docker-registry my-docker-secret -n kube-amd-gpu \
  --docker-username $YOUR_USERNAME --docker-email $YOUR_EMAIL --docker-password $YOUR_PASSWORD
Deploy the AMD DeviceConfig to kick off GPU discovery and driver installation:
apiVersion: amd.com/v1alpha1
kind: DeviceConfig
metadata:
  name: gpu-operator
  namespace: kube-amd-gpu
spec:
  driver:
    enable: true
    blacklist: true
    version: "6.4.1"
    image: docker.io/<username>/<your-amd-gpu-driver-repo>
    imageRegistrySecret:
      name: my-docker-secret
  devicePlugin:
    devicePluginImage: rocm/k8s-device-plugin:latest
    nodeLabellerImage: rocm/k8s-device-plugin:labeller-latest
    enableNodeLabeller: true
  metricsExporter:
    enable: false
  selector:
    feature.node.kubernetes.io/amd-gpu: "true"
  testRunner:
    enable: true
    logsLocation:
      mountPath: "/var/log/amd-test-runner"
      hostPath: "/var/log/amd-test-runner"
Save this manifest (e.g., as deviceconfig.yaml) and apply it:
kubectl apply -f deviceconfig.yaml
2. Verifying AMD GPU detection in Kubernetes
Now check that Kubernetes advertises the MI300X GPUs:
kubectl get nodes -o custom-columns=NAME:.metadata.name,GPUs:.status.capacity.'amd\.com/gpu'
Example output:
NAME                                        GPUs
np-bdd12851-1.us-east1-a.compute.internal   8
Success! Your cluster now recognizes all eight MI300X GPUs.
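Before moving on, you can optionally confirm that a pod can actually reach the GPUs, not just that the capacity is advertised. The manifest below is a minimal sketch: it requests a single amd.com/gpu resource and runs rocm-smi. The rocm/rocm-terminal image is an assumption here; any ROCm-enabled image with rocm-smi on its path will do.

```yaml
# gpu-smoke-test.yaml -- minimal pod that requests one AMD GPU and prints rocm-smi
apiVersion: v1
kind: Pod
metadata:
  name: amd-gpu-smoke-test
spec:
  restartPolicy: Never
  containers:
    - name: rocm-smi
      image: rocm/rocm-terminal:latest   # assumption: any ROCm image with rocm-smi works
      command: ["rocm-smi"]
      resources:
        limits:
          amd.com/gpu: 1   # resource name advertised by the AMD device plugin
```

Apply it with kubectl apply -f gpu-smoke-test.yaml, then check kubectl logs amd-gpu-smoke-test; you should see an MI300X listed. Clean up with kubectl delete pod amd-gpu-smoke-test.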
3. Integrating SkyPilot with AMD MI300X GPUs
In order for SkyPilot to recognize the AMD node, we must manually add the following label:
kubectl label nodes <NODE> skypilot.co/accelerator=mi300x --overwrite
Verify that SkyPilot is now fully AMD-aware:
sky show-gpus --infra k8s
Example output:
Kubernetes GPUs
Context: amdxskypilot
GPU     REQUESTABLE_QTY_PER_NODE  UTILIZATION
MI300X  1, 2, 4, 8                8 of 8 free

Kubernetes per-node GPU availability
CONTEXT       NODE                                       GPU     UTILIZATION
amdxskypilot  np-bdd12851-1.us-east1-a.compute.internal  MI300X  8 of 8 free
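Before launching the full training job in the next step, a quick smoke test confirms that SkyPilot can actually schedule onto the labeled node. The following task definition is a minimal sketch: it requests a single MI300X and just prints the GPU inventory with rocm-smi (the rocm/pytorch-training:v25.6 image is reused from the training example below).

```yaml
# smoke-test.yaml -- minimal SkyPilot task: one MI300X, print the GPU inventory
name: amd-smoke-test

resources:
  cloud: kubernetes
  image_id: docker:rocm/pytorch-training:v25.6
  accelerators: MI300X:1

run: |
  rocm-smi
```

Launch it with sky launch -y smoke-test.yaml --cluster smoke-test, and tear it down afterwards with sky down smoke-test.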
4. Running a SkyPilot Job on MI300X
For our example AI workload, let’s run one of SkyPilot’s DDP training examples from their GitHub repository. This will pull a large image, so expect to wait 5-10 minutes for the workload to start.
# test-job.yaml
name: amd-rocm-minGPT-ddp

resources:
  cloud: kubernetes
  image_id: docker:rocm/pytorch-training:v25.6
  accelerators: MI300X:4
  cpus: 128
  memory: 512+

setup: |
  echo "minGPT example derived from https://github.com/pytorch/examples"

run: |
  # AMD Docker images ship their own conda environment
  conda deactivate
  git clone https://github.com/pytorch/examples.git
  cd examples/distributed/minGPT-ddp
  # Install dependencies
  pip install -r requirements.txt
  sleep 5
  echo "Running PyTorch minGPT example..."
  sudo /bin/bash run_example.sh ./mingpt/main.py 4
  rocm-smi
Launch it:
sky launch -y test-job.yaml --cluster mingpt-test
sky status
sky logs mingpt-test
Example output:
pphelan@MBP-Patrick-Phelan ~ % sky logs sky-216e-pphelan
...
(amd-rocm-minGPT-ddp, pid=4314) Snapshot saved at epoch 9
(amd-rocm-minGPT-ddp, pid=4314) [RANK1] Epoch 9 | Iter 0 | Eval Loss 1.29033
(amd-rocm-minGPT-ddp, pid=4314) [RANK3] Epoch 9 | Iter 0 | Eval Loss 1.30004
(amd-rocm-minGPT-ddp, pid=4314) [RANK2] Epoch 9 | Iter 0 | Eval Loss
...
✓ Job finished (status: SUCCEEDED).
When you're done, tear down the SkyPilot cluster with sky down mingpt-test.
Sign up for Crusoe Cloud
By combining SkyPilot’s intuitive orchestration with the massive compute density of AMD’s GPUs, builders can achieve a streamlined, production-grade environment for their most ambitious models. At Crusoe, we are committed to making this next-gen hardware accessible on a platform purpose-built for high-performance, sustainable AI.
Ready to accelerate your breakthroughs? Check out Crusoe Cloud.


