
Running AI workloads on AMD GPUs with SkyPilot

Learn how to orchestrate AMD Instinct™ MI300X clusters on Crusoe Managed Kubernetes using SkyPilot. This step-by-step technical guide covers installing the AMD GPU Operator, configuring ROCm drivers, and launching production-grade AI training jobs.

Patrick Phelan
Software Engineer
January 13, 2026
[Figure: Orchestration across multiple GPU vendors and cloud environments]

In the evolving AI landscape, staying ahead requires strategic versatility. Builders are increasingly leveraging GPU vendor diversity to optimize price-performance and ensure capacity availability for mission-critical workloads.

Using the AMD Instinct™ MI300X as an example, this post explores how to utilize SkyPilot — a multi-cloud deployment framework — to orchestrate AMD clusters seamlessly on Crusoe Managed Kubernetes (CMK), helping you build a more resilient, multi-vendor AI strategy.

This guide walks through the exact steps needed to:

  • Enable AMD GPU support on Kubernetes
  • Configure the AMD GPU operator
  • Integrate SkyPilot with your cluster
  • Launch GPU jobs on MI300X nodes

This is based on a full working setup on an MI300X node pool running AMD GPU Operator v1.2.2.

Prerequisites

Ensure that you have a supported Kubernetes cluster (CMK) with at least one AMD MI300X node pool.

If you don't have a CMK cluster yet, let's get one set up.

Step 1: Create a cluster with the Crusoe CLI

With the Crusoe CLI installed and authenticated, run the following commands to create a cluster and an MI300X node pool:

crusoe kubernetes clusters create --name <cluster-name> \
--cluster-version <version> --location <location> \
--project-id <project-id>

crusoe kubernetes nodepools create --name <node-pool-name> \
--count 1 --cluster-name <cluster-name> \
--type mi300x-192gb-ib.8x

Once your cluster is provisioned with an MI300X node, fetch your credentials with this command:

crusoe clusters get-credentials <cluster-name> --project-id <project-id>
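Before moving on, it's worth a quick sanity check that the fetched credentials actually work. A minimal sketch, assuming kubectl is installed and the credentials landed in the default kubeconfig location:

```shell
# Quick sanity check that the fetched credentials work before moving on.
KUBECONFIG_FILE="${KUBECONFIG:-$HOME/.kube/config}"
if [ -f "$KUBECONFIG_FILE" ]; then
  echo "kubeconfig present at $KUBECONFIG_FILE"
else
  echo "kubeconfig missing -- rerun get-credentials" >&2
fi
# The MI300X node should be listed (and eventually Ready):
kubectl get nodes || echo "kubectl could not reach the cluster yet"
```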

Step 2: Install SkyPilot with Kubernetes support

Next, install or upgrade the SkyPilot CLI with Kubernetes support. Make sure your kubeconfig is stored in $HOME/.kube/config.

pip install --upgrade "skypilot[kubernetes]"

Verify that SkyPilot can see your CMK cluster:

sky check kubernetes

Overview

In this tutorial, we’ll walk through:

  1. Installing the AMD GPU Operator

  2. Verifying AMD GPU recognition in Kubernetes

  3. Configuring SkyPilot to understand MI300X accelerators

  4. Running your first MI300X workload with SkyPilot

This setup results in a fully operational AMD GPU cluster with SkyPilot-powered scheduling and execution.

1. Installing the AMD GPU Operator

We typically bundle the GPU Operator into our clusters for a functional out-of-the-box setup, but here we spell out each step for clarity.

Add cert-manager repo.

helm repo add jetstack https://charts.jetstack.io --force-update && helm repo update

Install cert-manager (AMD instructions use v1.15.1).

helm install cert-manager jetstack/cert-manager \
 --namespace cert-manager \
 --create-namespace \
 --version v1.15.1 \
 --set crds.enabled=true
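Before continuing, it's worth confirming that cert-manager came up cleanly, since the GPU Operator depends on its webhook. A quick check, assuming cluster access:

```shell
NS=cert-manager
# The cert-manager, cainjector, and webhook pods should all reach Running:
kubectl get pods -n "$NS" || echo "kubectl could not reach the cluster"
```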

Add ROCm repo.

helm repo add rocm https://rocm.github.io/gpu-operator && helm repo update

Install AMD GPU Operator (I used v1.2.2).

helm install amd-gpu-operator rocm/gpu-operator-charts \
  --namespace kube-amd-gpu --create-namespace \
  --version v1.2.2

Create a registry secret that the AMD GPU Operator can use to push and pull driver images (I used my personal Docker Hub account).

kubectl create secret docker-registry my-docker-secret -n kube-amd-gpu \
  --docker-username $YOUR_USERNAME \
  --docker-email $YOUR_EMAIL \
  --docker-password $YOUR_PASSWORD

Save the following AMD DeviceConfig manifest as deviceconfig.yaml and deploy it with kubectl apply -f deviceconfig.yaml to kickstart the discovery and driver installation process.

apiVersion: amd.com/v1alpha1
kind: DeviceConfig
metadata:
  name: gpu-operator
  namespace: kube-amd-gpu
spec:
  driver:
    enable: true
    blacklist: true
    version: "6.4.1"
    image: docker.io/<username>/<your-amd-gpu-driver-repo>
    imageRegistrySecret:
      name: my-docker-secret
  devicePlugin:
    devicePluginImage: rocm/k8s-device-plugin:latest
    nodeLabellerImage: rocm/k8s-device-plugin:labeller-latest
    enableNodeLabeller: true
  metricsExporter:
    enable: false
  selector:
    feature.node.kubernetes.io/amd-gpu: "true"
  testRunner:
    enable: true
    logsLocation:
      mountPath: "/var/log/amd-test-runner"
      hostPath: "/var/log/amd-test-runner"
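After applying the manifest, the driver build runs in-cluster and can take several minutes. One way to follow along (a sketch, assuming cluster access):

```shell
NS=kube-amd-gpu
# Operator workloads (driver build, node labeller, device plugin) appear here:
kubectl get pods -n "$NS" || echo "kubectl could not reach the cluster"
# The DeviceConfig status summarizes the driver rollout per node:
kubectl describe deviceconfig gpu-operator -n "$NS" || echo "cluster access required"
```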

2. Verifying AMD GPU detection in Kubernetes

Now check that Kubernetes advertises the MI300X GPUs:

kubectl get nodes -o custom-columns=NAME:.metadata.name,GPUs:.status.capacity.'amd\.com/gpu'

Example output:

NAME                                        GPUs
np-bdd12851-1.us-east1-a.compute.internal   8

Success! Your cluster now recognizes all eight MI300X GPUs.
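If you run multiple node pools, the same query can be totalled with awk to confirm overall capacity. A small sketch: the filter skips nodes that do not advertise the amd.com/gpu resource (custom-columns prints `<none>` for them).

```shell
# Total the amd.com/gpu capacity advertised across all nodes.
kubectl get nodes --no-headers \
    -o custom-columns=GPUS:".status.capacity.amd\.com/gpu" \
  | awk '$1 != "<none>" && $1 != "" {total += $1} END {print total+0 " AMD GPUs advertised"}'
```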

3. Integrating SkyPilot with AMD MI300X GPUs

For SkyPilot to recognize the AMD node, we must manually add the following label:

kubectl label nodes <NODE> skypilot.co/accelerator=mi300x --overwrite
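If the cluster has more than one AMD node, the detection query from the previous section can drive the labelling step instead of labelling nodes one at a time. A sketch, assuming cluster access:

```shell
# Label every node that advertises amd.com/gpu capacity.
kubectl get nodes --no-headers \
    -o custom-columns=NAME:.metadata.name,GPUS:".status.capacity.amd\.com/gpu" \
  | awk '$2 != "<none>" {print $1}' \
  | while read -r node; do
      kubectl label node "$node" skypilot.co/accelerator=mi300x --overwrite
    done
```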

Verify that SkyPilot is now fully AMD-aware:

sky show-gpus --infra k8s

Example output:

Kubernetes GPUs
Context: amdxskypilot
GPU     REQUESTABLE_QTY_PER_NODE  UTILIZATION  
MI300X  1, 2, 4, 8                8 of 8 free  
Kubernetes per-node GPU availability
CONTEXT       NODE                                       GPU     UTILIZATION  
amdxskypilot  np-bdd12851-1.us-east1-a.compute.internal  MI300X  8 of 8 free

4. Running a SkyPilot Job on MI300X

For our example AI workload, let's run one of SkyPilot's DDP training examples from their GitHub repository. This will pull a large image, so expect to wait 5 to 10 minutes for the workload to start.

# test-job.yaml
name: amd-rocm-minGPT-ddp

resources:
  cloud: kubernetes
  image_id: docker:rocm/pytorch-training:v25.6
  accelerators: MI300X:4
  cpus: 128
  memory: 512+

setup: |
  echo "minGPT example derived from https://github.com/pytorch/examples"

run: |
  # AMD images ship their own conda environment
  conda deactivate

  git clone https://github.com/pytorch/examples.git
  cd examples/distributed/minGPT-ddp
  # Install dependencies
  pip install -r requirements.txt
  sleep 5

  echo "Running PyTorch minGPT example..."
  sudo /bin/bash run_example.sh ./mingpt/main.py 4
  rocm-smi

Launch it:

sky launch -y test-job.yaml --cluster mingpt-test
sky status
sky logs mingpt-test

Example output:

pphelan@MBP-Patrick-Phelan ~ % sky logs mingpt-test
...
(amd-rocm-minGPT-ddp, pid=4314) Snapshot saved at epoch 9
(amd-rocm-minGPT-ddp, pid=4314) [RANK1] Epoch 9 | Iter 0 | Eval Loss 1.29033
(amd-rocm-minGPT-ddp, pid=4314) [RANK3] Epoch 9 | Iter 0 | Eval Loss 1.30004
(amd-rocm-minGPT-ddp, pid=4314) [RANK2] Epoch 9 | Iter 0 | Eval Loss 
...
✓ Job finished (status: SUCCEEDED).
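When you're done experimenting, the run can be torn down so the GPUs free up for other workloads. A sketch, assuming the cluster name used above:

```shell
CLUSTER=mingpt-test
# Job states for this SkyPilot cluster:
sky queue "$CLUSTER" || echo "sky queue requires the SkyPilot CLI"
# Release the MI300X node when finished:
sky down -y "$CLUSTER" || echo "sky down requires the SkyPilot CLI"
```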

Sign up for Crusoe Cloud

By combining SkyPilot’s intuitive orchestration with the massive compute density of AMD’s GPUs, builders can achieve a streamlined, production-grade environment for their most ambitious models. At Crusoe, we are committed to making this next-gen hardware accessible on a platform purpose-built for high-performance, sustainable AI.

Ready to accelerate your breakthroughs? Check out Crusoe Cloud.
