Up to 3X faster: Benchmarking Llama 3.1 fine-tuning on Crusoe Cloud with NVIDIA GB200 NVL72
Benchmark results: Llama 3.1 fine-tuning on NVIDIA GB200 NVL72 delivers up to 3X faster performance vs HGX H100. See how topology-aware scheduling unlocks Blackwell's potential.

The next era of AI isn't just about faster chips; it's about a more cohesive compute fabric. While the industry standard has long relied on PCIe Gen5, the NVIDIA GB200 NVL72 shatters those legacy bottlenecks by creating a unified 72-GPU rack-scale domain with 130 TB/s of aggregate NVLink bandwidth, roughly 14x faster than PCIe Gen5.
But with this massive leap in power comes a new requirement for builders: topology-aware scheduling. To unlock the full potential of NVIDIA Blackwell, workloads must be scheduled coherently across the rack's 72-GPU NVIDIA NVLink™ domain.
In this post, we’ll show you exactly how to orchestrate distributed training pods using Kubernetes Dynamic Resource Allocation (DRA). By aligning training runs to respect rack-scale NVLink domains, we achieved a 2–3x increase in throughput compared to the previous-generation NVIDIA Hopper. Here is how we did it (and how you can too).
How to get started
To benchmark training performance on GB200 NVL72, we teamed up with the AI engineers at Zoom to run a fine-tuning job for the Llama 3.1 8B parameter model. This allowed our team to validate that our GB200 NVL72 Kubernetes cluster was set up correctly, and to work directly with the Zoom team to validate GB200 NVL72 support for their particular setup.
Kubernetes setup
First, provision a Crusoe Managed Kubernetes (CMK) cluster in our eu-iceland1-a region with the following CMK add-ons: NVIDIA GPU Operator, Container Storage Interface (CSI), and nvidia_gb200_support. After the cluster is provisioned, create a GB200 node pool (via either the Crusoe CLI or the UI). Any CMK cluster created for GB200 NVL72 requires the nvidia_gb200_support add-on, which enables the NVIDIA DRA (Dynamic Resource Allocation) driver for GPUs. DRA is a Kubernetes concept introduced in 2022 that makes it easier to share devices, particularly accelerators like GPUs, among pods through a claim process similar to dynamic volume provisioning for storage.
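To make the claim analogy concrete, here is a minimal, generic DRA sketch (not one of the manifests used in this benchmark). It assumes a cluster serving the resource.k8s.io/v1beta1 API and the gpu.nvidia.com device class registered by the NVIDIA DRA driver; the resource names are illustrative only.

apiVersion: resource.k8s.io/v1beta1
kind: ResourceClaimTemplate
metadata:
  name: single-gpu                         # illustrative name
spec:
  spec:
    devices:
      requests:
      - name: gpu
        deviceClassName: gpu.nvidia.com    # device class published by the NVIDIA DRA driver
---
apiVersion: v1
kind: Pod
metadata:
  name: dra-example
spec:
  resourceClaims:
  - name: gpu
    resourceClaimTemplateName: single-gpu  # each pod gets its own claim from the template
  containers:
  - name: cuda
    image: nvidia/cuda:12.4.1-base-ubuntu22.04
    command: ["nvidia-smi"]
    resources:
      claims:
      - name: gpu                          # the container consumes the claimed GPU

Note that the pod never names a GPU directly: it references a claim, and the scheduler and driver resolve that claim to concrete devices, much like a PVC binding to a volume.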
The NVIDIA DRA driver, in particular, abstracts away the configuration of Multi-Node NVLink (MNNVL) on a rack-scale system like GB200 NVL72. Setting up the NVLink domains and channels that all the nodes in a GB200 rack need in order to communicate effectively over NVLink is complex, so the driver handles it: users create a ComputeDomain resource to specify the number of nodes, then claim it from the workload (a Deployment or training job) through a resource claim.
Other necessary Kubernetes add-ons include the Kubeflow Trainer operator and the Kubeflow SDK, which are used to launch the training job.
Data setup
Before creating a ComputeDomain and our workload on CMK, we need storage for the necessary data and framework. Crusoe provides a Shared Disk, an NFS-based distributed filesystem that can be mounted on multiple CMK worker nodes, so that data such as model weights can be accessed with high throughput and low latency.
The Shared Disk should hold the Llama 3.1 8B model weights (downloadable from Hugging Face), the LLaMA-Factory directory, a sub-directory for checkpoints written by the fine-tuning jobs, and the entrypoint script that executes the torchrun command.
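For reference, the training pods consume the Shared Disk through a PersistentVolumeClaim. The sketch below shows what the pytorchjob-data-pvc claim mounted later in this post might look like; the storage class name is a placeholder rather than a confirmed Crusoe value, and depending on how the Shared Disk was created you may also need a matching PersistentVolume (see the CSI example referenced in the next section).

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: pytorchjob-data-pvc              # name referenced by the PyTorchJob volumes
  namespace: default
spec:
  accessModes:
  - ReadWriteMany                        # NFS-backed Shared Disk can be mounted by many worker nodes
  storageClassName: crusoe-shared-disk   # placeholder: use the class exposed by your Crusoe CSI add-on
  resources:
    requests:
      storage: 1Ti                       # room for weights, LLaMA-Factory, and checkpoints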
Fine-tuning the Llama 3.1 model using LLaMA-Factory
Once everything is set up, it is time to deploy the workload that fine-tunes the Llama 3.1 model with LLaMA-Factory. Since the application pods need to communicate efficiently over the rack's multi-node NVLink, we need to create a ComputeDomain that the application can claim. For example:
apiVersion: resource.nvidia.com/v1beta1
kind: ComputeDomain
metadata:
  name: compute-domain
spec:
  numNodes: 0
  channel:
    resourceClaimTemplate:
      name: training-alpha-domain-channel
Then you submit a PyTorchJob, a custom resource supported by the Kubeflow Trainer for running distributed PyTorch jobs. It creates a master pod that coordinates the worker pods, which execute the actual fine-tuning. The Shared Disk volume is mounted as a PVC (Persistent Volume Claim) using the Crusoe CSI driver; an example of how to configure the underlying CRDs (Custom Resource Definitions) for an existing Shared Disk is available here. The manifest we used looks like this:
apiVersion: kubeflow.org/v1
kind: PyTorchJob
metadata:
  name: pytorchjob-crusoe-gb200-nvl72-young
  namespace: default
spec:
  pytorchReplicaSpecs:
    Master:
      replicas: 1
      restartPolicy: Never
      template:
        spec:
          affinity:
            nodeAffinity:
              requiredDuringSchedulingIgnoredDuringExecution:
                nodeSelectorTerms:
                - matchExpressions:
                  - key: nvidia.com/gpu.clique   # schedule only onto nodes that expose an NVLink clique label
                    operator: Exists
          containers:
          - command:
            - /bin/bash
            args:
            - -c
            - ./run.sh                           # entrypoint on the Shared Disk that invokes torchrun
            env:
            - name: GPU_NUM
              value: "4"                         # four Blackwell GPUs per GB200 node
            - name: NCCL_DEBUG_SUBSYS
              value: INIT,NET
            - name: NCCL_DEBUG
              value: INFO
            image: youngjeong46/llama-factory:latest
            imagePullPolicy: Always
            name: pytorch
            ports:
            - containerPort: 24456
              name: pytorchjob-port
              protocol: TCP
            resources:
              claims:
              - name: compute-domain-channel     # NVLink channel claimed from the ComputeDomain
              limits:
                nvidia.com/gpu: 4
            securityContext:
              capabilities:
                add:
                - IPC_LOCK
                - CAP_SYS_ADMIN
            volumeMounts:
            - mountPath: '/data'                 # Shared Disk with weights, LLaMA-Factory, and checkpoints
              name: data
            - mountPath: /dev/shm
              name: dshm
            - mountPath: /dev/nvidia-caps
              name: nvidia-caps
            workingDir: /code
          dnsPolicy: ClusterFirstWithHostNet
          hostNetwork: true
          initContainers:
          - command:
            - /bin/bash
            - -c
            - "sleep 10"
            image: nginx
            imagePullPolicy: Always
            name: init-k8s-job-container
          resourceClaims:
          - name: compute-domain-channel
            resourceClaimTemplateName: training-alpha-domain-channel
          securityContext:
            runAsGroup: 0
            runAsUser: 0
            supplementalGroups:
            - 22080
          volumes:
          - name: data
            persistentVolumeClaim:
              claimName: pytorchjob-data-pvc
              readOnly: false
          - emptyDir:
              medium: Memory
              sizeLimit: 64Gi
            name: dshm
          - hostPath:
              path: /dev/nvidia-caps
              type: Directory
            name: nvidia-caps
    Worker:
      replicas: 16
      restartPolicy: OnFailure
      template:
        spec:
          affinity:
            nodeAffinity:
              requiredDuringSchedulingIgnoredDuringExecution:
                nodeSelectorTerms:
                - matchExpressions:
                  - key: nvidia.com/gpu.clique
                    operator: Exists
          containers:
          - command:
            - /bin/bash
            args:
            - "-c"
            - ./run.sh
            env:
            - name: GPU_NUM
              value: "4"
            - name: NCCL_DEBUG_SUBSYS
              value: INIT,NET
            - name: NCCL_DEBUG
              value: INFO
            image: youngjeong46/llama-factory:latest
            imagePullPolicy: Always
            name: pytorch
            ports:
            - containerPort: 24456
              name: pytorchjob-port
              protocol: TCP
            resources:
              claims:
              - name: compute-domain-channel
              limits:
                nvidia.com/gpu: 4
            securityContext:
              capabilities:
                add:
                - IPC_LOCK
                - CAP_SYS_ADMIN
            volumeMounts:
            - mountPath: '/data'
              name: data
            - mountPath: /dev/shm
              name: dshm
            - mountPath: /dev/nvidia-caps
              name: nvidia-caps
            workingDir: /code
          dnsPolicy: ClusterFirstWithHostNet
          hostNetwork: true
          initContainers:
          - command:
            - /bin/bash
            - "-c"
            - "sleep 200"
            image: nginx
            imagePullPolicy: Always
            name: init-k8s-job-container
          resourceClaims:
          - name: compute-domain-channel
            resourceClaimTemplateName: training-alpha-domain-channel
          securityContext:
            runAsGroup: 0
            runAsUser: 0
            supplementalGroups:
            - 22080
          volumes:
          - hostPath:
              path: /dev/nvidia-caps
              type: Directory
            name: nvidia-caps
          - name: data
            persistentVolumeClaim:
              claimName: pytorchjob-data-pvc
              readOnly: false
          - emptyDir:
              medium: Memory
              sizeLimit: 32Gi
            name: dshm
Results
Comparing results for the Llama 3.1 8B fine-tuning job, Crusoe and Zoom observed the following improvements in throughput, measured in seconds per iteration (s/iter), when using the same number of GPUs and the same training framework settings on NVIDIA GB200 NVL72 and on NVIDIA HGX H100 (8 GPUs per instance, with nodes interconnected via NVIDIA Quantum InfiniBand):
Furthermore, as the team scaled the number of GPUs in the fine-tuning job (from 24 to 32, 48, and 64 GPUs), throughput on the GB200 NVL72 rack system scaled nearly linearly, whereas the scaling observed on HGX H100 was not linear. We attribute this to the fact that within a single GB200 NVL72 rack, all cross-node communication travels over MNNVL at much higher bandwidth (1.8 TB/s bidirectional per GPU) than the NVIDIA Quantum InfiniBand links connecting HGX H100 nodes (rated at 400 Gb/s per link).
The final comparison the team made was fine-tuning a much larger model to take advantage of the GB200 NVL72 system's larger GPU memory (186 GB per GPU vs. 80 GB on HGX H100). Using the Llama 3.1 70B model with ZeRO-1 parallelization, the job ran without hitting out-of-memory (OOM) errors and achieved close to a 300% improvement over H100.
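For readers who want to reproduce a run of this shape, LLaMA-Factory jobs are typically driven by a YAML config handed to its CLI (or to torchrun via the entrypoint script). The sketch below is illustrative only; the model path, dataset, output directory, and DeepSpeed ZeRO-1 JSON file are placeholders, not the exact settings Crusoe and Zoom used.

### model
model_name_or_path: /data/models/Llama-3.1-70B-Instruct   # weights staged on the Shared Disk
### method
stage: sft
do_train: true
finetuning_type: full
deepspeed: /data/configs/ds_z1_config.json   # placeholder path to a DeepSpeed ZeRO-1 config
### dataset
dataset: alpaca_en_demo                      # placeholder dataset name
template: llama3
cutoff_len: 4096
### output
output_dir: /data/checkpoints/llama31-70b-sft
### train
per_device_train_batch_size: 1
gradient_accumulation_steps: 8
learning_rate: 1.0e-5
num_train_epochs: 1.0
bf16: true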
The transition from NVIDIA HGX H100 to NVIDIA GB200 NVL72 is more than a hardware swap; it's an architectural evolution. As our benchmarks with Zoom show, the move to a unified memory pool and rack-scale NVLink doesn't just improve speed; it enables linear scalability and removes the out-of-memory (OOM) hurdles that stifle large-model development.
Ready to supercharge your workloads?
Whether you’re training foundation models or fine-tuning at scale, our team is standing by to help you navigate the transition to Arm and maximize the value of Blackwell's extreme compute density. Contact us today to run a Proof of Concept (POC) and see these performance gains on your own workloads.
Join Crusoe at the San Jose Convention Center for NVIDIA GTC from March 16-19. We look forward to meeting the builders shaping AI’s future.

