Up to 3X faster: Benchmarking Llama 3.1 fine-tuning on Crusoe Cloud with NVIDIA GB200 NVL72
Benchmark results: Llama 3.1 fine-tuning on NVIDIA GB200 NVL72 delivers up to 3X faster performance vs HGX H100. See how topology-aware scheduling unlocks Blackwell's potential.

The next era of AI isn't just about faster chips; it's about a more cohesive compute fabric. While the industry standard has long relied on PCIe Gen5, the NVIDIA GB200 NVL72 shatters those legacy bottlenecks by creating a unified 72-GPU rack-scale domain with 130 TB/s of aggregate NVLink bandwidth, roughly 14x faster than PCIe Gen5.
But with this massive leap in power comes a new requirement for builders: topology-aware scheduling. To unlock the full potential of NVIDIA Blackwell, workloads must be scheduled coherently across the rack's 72-GPU NVIDIA NVLink™ domain.
In this post, we’ll show you exactly how to orchestrate distributed training pods using Kubernetes Dynamic Resource Allocation (DRA). By aligning training runs to respect rack-scale NVLink domains, we achieved a 2–3x increase in throughput compared to the previous-generation NVIDIA Hopper. Here is how we did it (and how you can too).
How to get started
To benchmark training performance on GB200 NVL72, we teamed up with the AI engineers at Zoom to run a fine-tuning job for the Llama 3.1 8B parameter model. This allowed our team to validate that our GB200 NVL72 Kubernetes cluster was set up correctly, and to work directly with the Zoom team to validate GB200 NVL72 support for their particular setup.
Kubernetes setup
First, provision a Crusoe Managed Kubernetes (CMK) cluster in our eu-iceland1-a region with the following CMK add-ons: NVIDIA GPU Operator, Container Storage Interface (CSI), and nvidia_gb200_support. After the cluster is provisioned, create a GB200 node pool (via either the Crusoe CLI or the UI). Any CMK cluster created for GB200 NVL72 requires the nvidia_gb200_support add-on, which enables the NVIDIA DRA (Dynamic Resource Allocation) driver for GPUs. DRA is a Kubernetes concept introduced in 2022 that makes it easier to share devices, particularly accelerators like GPUs, among pods through a claim process similar to dynamic volume provisioning for storage.
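To make the claim analogy concrete, here is a minimal, generic DRA sketch (not one of the manifests used in this benchmark). It assumes a cluster serving the resource.k8s.io/v1beta1 API and the gpu.nvidia.com device class registered by the NVIDIA DRA driver; the resource names are illustrative only.

apiVersion: resource.k8s.io/v1beta1
kind: ResourceClaimTemplate
metadata:
  name: single-gpu                         # illustrative name
spec:
  spec:
    devices:
      requests:
      - name: gpu
        deviceClassName: gpu.nvidia.com    # device class published by the NVIDIA DRA driver
---
apiVersion: v1
kind: Pod
metadata:
  name: dra-example
spec:
  resourceClaims:
  - name: gpu
    resourceClaimTemplateName: single-gpu  # each pod gets its own claim from the template
  containers:
  - name: cuda
    image: nvidia/cuda:12.4.1-base-ubuntu22.04
    command: ["nvidia-smi"]
    resources:
      claims:
      - name: gpu                          # the container consumes the claimed GPU

Note that the pod never names a GPU directly: it references a claim, and the scheduler and driver resolve that claim to concrete devices, much like a PVC binding to a volume.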
The NVIDIA DRA driver, in particular, abstracts away the configuration of Multi-Node NVLink (MNNVL) on a rack-scale system like GB200 NVL72. Setting up the NVLink domains and channels that all the nodes in a GB200 rack need in order to communicate effectively over NVLink is complex, so the driver handles it: users create a ComputeDomain resource to specify the number of nodes, then claim it from the workload (a Deployment or training job) through a resource claim.
Other necessary Kubernetes add-ons include the Kubeflow Trainer operator and the Kubeflow SDK, which are used to launch the training job.
Data setup
Before creating a ComputeDomain and our workload on CMK, we need storage for the necessary data and framework. Crusoe provides a Shared Disk, an NFS-based distributed filesystem that can be mounted on multiple CMK worker nodes, so that data such as model weights can be accessed with high throughput and low latency.
The Shared Disk should hold the Llama 3.1 8B model weights (downloadable from Hugging Face), the LLaMA-Factory directory, a sub-directory for checkpoints written by the fine-tuning jobs, and the entrypoint script that executes the torchrun command.
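For reference, the training pods consume the Shared Disk through a PersistentVolumeClaim. The sketch below shows what the pytorchjob-data-pvc claim mounted later in this post might look like; the storage class name is a placeholder rather than a confirmed Crusoe value, and depending on how the Shared Disk was created you may also need a matching PersistentVolume (see the CSI example referenced in the next section).

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: pytorchjob-data-pvc              # name referenced by the PyTorchJob volumes
  namespace: default
spec:
  accessModes:
  - ReadWriteMany                        # NFS-backed Shared Disk can be mounted by many worker nodes
  storageClassName: crusoe-shared-disk   # placeholder: use the class exposed by your Crusoe CSI add-on
  resources:
    requests:
      storage: 1Ti                       # room for weights, LLaMA-Factory, and checkpoints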
Fine-tuning the Llama 3.1 model using LLaMA-Factory
Once everything is set up, it is time to deploy the workload that fine-tunes the Llama 3.1 model with LLaMA-Factory. Since the application pods need to communicate efficiently over the rack's multi-node NVLink, we need to create a ComputeDomain that the application can claim. For example:
apiVersion: resource.nvidia.com/v1beta1
kind: ComputeDomain
metadata:
  name: compute-domain
spec:
  numNodes: 0
  channel:
    resourceClaimTemplate:
      name: training-alpha-domain-channel
Then you submit a PyTorchJob, a custom resource supported by the Kubeflow Trainer for running distributed PyTorch jobs. It creates a master pod that coordinates the worker pods, which execute the actual fine-tuning. The Shared Disk volume is mounted as a PVC (Persistent Volume Claim) using the Crusoe CSI driver; an example of how to configure the underlying CRDs (Custom Resource Definitions) for an existing Shared Disk is available here. The manifest we used looks like this:
apiVersion: kubeflow.org/v1
kind: PyTorchJob
metadata:
  name: pytorchjob-crusoe-gb200-nvl72-young
  namespace: default
spec:
  pytorchReplicaSpecs:
    Master:
      replicas: 1
      restartPolicy: Never
      template:
        spec:
          affinity:
            nodeAffinity:
              requiredDuringSchedulingIgnoredDuringExecution:
                nodeSelectorTerms:
                - matchExpressions:
                  - key: nvidia.com/gpu.clique   # schedule only onto nodes that expose an NVLink clique label
                    operator: Exists
          containers:
          - command:
            - /bin/bash
            args:
            - -c
            - ./run.sh                           # entrypoint on the Shared Disk that invokes torchrun
            env:
            - name: GPU_NUM
              value: "4"                         # four Blackwell GPUs per GB200 node
            - name: NCCL_DEBUG_SUBSYS
              value: INIT,NET
            - name: NCCL_DEBUG
              value: INFO
            image: youngjeong46/llama-factory:latest
            imagePullPolicy: Always
            name: pytorch
            ports:
            - containerPort: 24456
              name: pytorchjob-port
              protocol: TCP
            resources:
              claims:
              - name: compute-domain-channel     # NVLink channel claimed from the ComputeDomain
              limits:
                nvidia.com/gpu: 4
            securityContext:
              capabilities:
                add:
                - IPC_LOCK
                - CAP_SYS_ADMIN
            volumeMounts:
            - mountPath: '/data'                 # Shared Disk with weights, LLaMA-Factory, and checkpoints
              name: data
            - mountPath: /dev/shm
              name: dshm
            - mountPath: /dev/nvidia-caps
              name: nvidia-caps
            workingDir: /code
          dnsPolicy: ClusterFirstWithHostNet
          hostNetwork: true
          initContainers:
          - command:
            - /bin/bash
            - -c
            - "sleep 10"
            image: nginx
            imagePullPolicy: Always
            name: init-k8s-job-container
          resourceClaims:
          - name: compute-domain-channel
            resourceClaimTemplateName: training-alpha-domain-channel
          securityContext:
            runAsGroup: 0
            runAsUser: 0
            supplementalGroups:
            - 22080
          volumes:
          - name: data
            persistentVolumeClaim:
              claimName: pytorchjob-data-pvc
              readOnly: false
          - emptyDir:
              medium: Memory
              sizeLimit: 64Gi
            name: dshm
          - hostPath:
              path: /dev/nvidia-caps
              type: Directory
            name: nvidia-caps
    Worker:
      replicas: 16
      restartPolicy: OnFailure
      template:
        spec:
          affinity:
            nodeAffinity:
              requiredDuringSchedulingIgnoredDuringExecution:
                nodeSelectorTerms:
                - matchExpressions:
                  - key: nvidia.com/gpu.clique
                    operator: Exists
          containers:
          - command:
            - /bin/bash
            args:
            - "-c"
            - ./run.sh
            env:
            - name: GPU_NUM
              value: "4"
            - name: NCCL_DEBUG_SUBSYS
              value: INIT,NET
            - name: NCCL_DEBUG
              value: INFO
            image: youngjeong46/llama-factory:latest
            imagePullPolicy: Always
            name: pytorch
            ports:
            - containerPort: 24456
              name: pytorchjob-port
              protocol: TCP
            resources:
              claims:
              - name: compute-domain-channel
              limits:
                nvidia.com/gpu: 4
            securityContext:
              capabilities:
                add:
                - IPC_LOCK
                - CAP_SYS_ADMIN
            volumeMounts:
            - mountPath: '/data'
              name: data
            - mountPath: /dev/shm
              name: dshm
            - mountPath: /dev/nvidia-caps
              name: nvidia-caps
            workingDir: /code
          dnsPolicy: ClusterFirstWithHostNet
          hostNetwork: true
          initContainers:
          - command:
            - /bin/bash
            - "-c"
            - "sleep 200"
            image: nginx
            imagePullPolicy: Always
            name: init-k8s-job-container
          resourceClaims:
          - name: compute-domain-channel
            resourceClaimTemplateName: training-alpha-domain-channel
          securityContext:
            runAsGroup: 0
            runAsUser: 0
            supplementalGroups:
            - 22080
          volumes:
          - hostPath:
              path: /dev/nvidia-caps
              type: Directory
            name: nvidia-caps
          - name: data
            persistentVolumeClaim:
              claimName: pytorchjob-data-pvc
              readOnly: false
          - emptyDir:
              medium: Memory
              sizeLimit: 32Gi
            name: dshm
Results
Comparing results for the Llama 3.1 8B fine-tuning job, Crusoe and Zoom observed the following improvements in throughput, measured in seconds per iteration (s/iter), when using the same number of GPUs and the same training framework settings on NVIDIA GB200 NVL72 and on NVIDIA HGX H100 (8 GPUs per instance, with nodes interconnected via NVIDIA Quantum InfiniBand):
Furthermore, as the team scaled the number of GPUs in the fine-tuning job (from 24 to 32, 48, and 64 GPUs), throughput on the GB200 NVL72 rack system scaled nearly linearly, whereas the scaling observed on HGX H100 was not linear. We attribute this to the fact that within a single GB200 NVL72 rack, all cross-node communication travels over MNNVL at much higher bandwidth (1.8 TB/s bidirectional per GPU) than the NVIDIA Quantum InfiniBand links connecting HGX H100 nodes (rated at 400 Gb/s per link).
The final comparison the team made was fine-tuning a much larger model to take advantage of the GB200 NVL72 system's larger GPU memory (186 GB per GPU vs. 80 GB on HGX H100). Using the Llama 3.1 70B model with ZeRO-1 parallelization, the job ran without hitting out-of-memory (OOM) errors and achieved close to a 300% improvement over H100.
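For readers who want to reproduce a run of this shape, LLaMA-Factory jobs are typically driven by a YAML config handed to its CLI (or to torchrun via the entrypoint script). The sketch below is illustrative only; the model path, dataset, output directory, and DeepSpeed ZeRO-1 JSON file are placeholders, not the exact settings Crusoe and Zoom used.

### model
model_name_or_path: /data/models/Llama-3.1-70B-Instruct   # weights staged on the Shared Disk
### method
stage: sft
do_train: true
finetuning_type: full
deepspeed: /data/configs/ds_z1_config.json   # placeholder path to a DeepSpeed ZeRO-1 config
### dataset
dataset: alpaca_en_demo                      # placeholder dataset name
template: llama3
cutoff_len: 4096
### output
output_dir: /data/checkpoints/llama31-70b-sft
### train
per_device_train_batch_size: 1
gradient_accumulation_steps: 8
learning_rate: 1.0e-5
num_train_epochs: 1.0
bf16: true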
The transition from NVIDIA HGX H100 to NVIDIA GB200 NVL72 is more than a hardware swap; it's an architectural evolution. As our benchmarks with Zoom show, the move to a unified memory pool and rack-scale NVLink doesn't just improve speed; it enables linear scalability and removes the out-of-memory (OOM) hurdles that stifle large-model development.
Ready to supercharge your workloads?
Whether you’re training foundation models or fine-tuning at scale, our team is standing by to help you navigate the transition to Arm and maximize the value of Blackwell's extreme compute density. Contact us today to run a Proof of Concept (POC) and see these performance gains on your own workloads.
Join Crusoe at the San Jose Convention Center for NVIDIA GTC from March 16-19. We look forward to meeting the builders shaping AI’s future.

