The AI engineer's checklist for optimal GPU performance
Is your GPU performance suffering? Uncover and fix the hidden bottlenecks in your AI workloads. This checklist helps you optimize data pipelines, network interconnects, and software for maximum throughput. Get the blueprint for high-performance AI.

It's one of the most frustrating scenarios in AI development: you've secured powerful, cutting-edge GPUs, but your training jobs are still crawling. You check your GPU metrics with nvidia-smi or NVIDIA DCGM and see the culprit: utilization hovering at a dismal 40%.
This isn't just a technical headache; it's a massive waste of time and money. The ultimate goal is to maximize throughput: the number of images, tokens, or data samples you can process per second. Every moment your GPU sits idle is a moment your throughput suffers, delaying iteration and progress.
The good news is that the GPU itself is rarely the problem. High performance isn't just about the chip; it's about the entire symphony of components working in harmony. By diagnosing the true source of the slowdown, you can fix low utilization, boost your throughput, and get full value from your hardware.
Here's a practical walkthrough to identify and resolve the silent performance killers.
1. Is your data pipeline the bottleneck?
Before data can be processed, it has to get to the GPU. A slow or inefficient data pipeline is the most common cause of GPU starvation. If your GPUs are processing data faster than your storage and data loaders can supply it, they'll be forced to wait, causing utilization to plummet.
What to check:
- Data loading: Are you using a high-performance, distributed file system? Avoid the "lots of small files" (LOSF) problem, which can cripple I/O performance. Pre-process and batch your data efficiently (see the sharding sketch after this list).
- CPU bottleneck: Data augmentation and pre-processing are often handled by the CPU. If your CPU cores are maxed out while your GPU is idle, it's a clear sign your data loaders can't keep up. Use tools like PyTorch's DataLoader with multiple workers to parallelize the load (see the DataLoader sketch after this list).
- Storage speed: Ensure your data is stored on a system that can handle the high-throughput demands of AI training. A single NVMe SSD, for example, can be an excellent upgrade from slow network-attached storage (NAS) and can keep storage from becoming a bottleneck.
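To make the LOSF point concrete, here is a minimal, hedged sketch of one common mitigation: packing many small files into larger tar shards (the same idea behind formats like WebDataset). The directory paths and shard size below are hypothetical placeholders, not a prescribed layout.

```python
import os
import tarfile

# Hypothetical paths and shard size -- adjust for your dataset.
SRC_DIR = "dataset/images"    # directory full of millions of small files
OUT_DIR = "dataset/shards"    # destination for a handful of large shards
FILES_PER_SHARD = 10_000

os.makedirs(OUT_DIR, exist_ok=True)
files = sorted(os.listdir(SRC_DIR))

# Pack small files into sequentially numbered tar shards so the file system
# serves a few large sequential reads instead of millions of tiny random ones.
for start in range(0, len(files), FILES_PER_SHARD):
    shard_path = os.path.join(OUT_DIR, f"shard-{start // FILES_PER_SHARD:06d}.tar")
    with tarfile.open(shard_path, "w") as tar:
        for name in files[start:start + FILES_PER_SHARD]:
            tar.add(os.path.join(SRC_DIR, name), arcname=name)
```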
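And for the CPU bottleneck item, here is a minimal PyTorch DataLoader sketch showing the knobs that usually matter when the CPU side can't keep up. The dataset, batch size, and worker count are placeholders; tune them against your own pipeline.

```python
import torch
from torch.utils.data import DataLoader, Dataset

class RandomImages(Dataset):
    """Placeholder dataset standing in for your real decode/augment pipeline."""
    def __len__(self):
        return 100_000

    def __getitem__(self, idx):
        # CPU-side work (decoding, augmentation) runs here in each worker process.
        return torch.randn(3, 224, 224), idx % 1000

loader = DataLoader(
    RandomImages(),
    batch_size=256,
    num_workers=8,            # parallelize CPU-side loading and augmentation
    pin_memory=True,          # page-locked host memory speeds up host-to-device copies
    prefetch_factor=4,        # each worker keeps batches queued ahead of the GPU
    persistent_workers=True,  # avoid respawning workers every epoch
)

for images, labels in loader:
    images = images.cuda(non_blocking=True)  # overlaps the copy with compute when pinned
    # ... forward/backward pass ...
    break
```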
Side note: Are you using NVIDIA GPUDirect® Storage to move data from storage directly into GPU memory, or are you waiting on the CPU to shuttle it to the GPUs? GPUDirect Storage (GDS) provides this capability and can be enabled with major parallel file systems.
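As a rough illustration (not a reference implementation), here is a hedged sketch using the RAPIDS KvikIO bindings for cuFile. The file path and buffer size are placeholders, and whether the read actually bypasses the CPU depends on your file system and driver stack; KvikIO quietly falls back to a compatibility mode (a CPU bounce buffer) when GDS isn't available.

```python
import cupy
import kvikio  # RAPIDS KvikIO: Python bindings for cuFile / GPUDirect Storage

# Hypothetical binary shard already staged on a GDS-capable file system;
# the buffer assumes the file holds at least 4 MiB of float32 data.
path = "dataset/shards/shard-000000.bin"
buf = cupy.empty(2**20, dtype=cupy.float32)  # destination buffer in GPU memory

f = kvikio.CuFile(path, "r")
nbytes = f.read(buf)  # DMA from storage into device memory when GDS is active
f.close()
print(f"read {nbytes} bytes directly into GPU memory")
```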
2. Are the nodes communicating efficiently?
For distributed training, the interconnect between GPUs is just as crucial as the GPUs themselves. A slow or misconfigured network link will directly cause low GPU utilization, as processors sit idle waiting for gradients and parameters from their peers.
What to check:
- Network interconnects: Are you using high-bandwidth, low-latency interconnects like NVIDIA NVLink™ and NVIDIA Quantum InfiniBand? Standard Ethernet is often insufficient for the intense communication patterns of large-scale training.
- Diagnostic tools: Use tools like NVIDIA Data Center GPU Manager (DCGM) to run comprehensive health checks. For network specifics, leverage NVIDIA NCCL tests to benchmark performance.
- Single-node communication: Run a single-node NCCL test to measure the bandwidth between GPUs within a single server. This confirms that on-server links, like NVLink, are performing as expected.
- Multi-node communication: Use multi-node NCCL tests (e.g., all_reduce_perf) to benchmark performance between servers across your network fabric. If the nodes are within an NVL72 rack, they'll be leveraging NVLink. Across racks, this test will quickly expose any issues with your InfiniBand or RoCE configuration (see the bandwidth sketch after this list).
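If you want a quick sanity check before (or alongside) building the official nccl-tests binaries, a rough equivalent of all_reduce_perf can be sketched with PyTorch's torch.distributed. This is not the official benchmark; the message size and iteration counts below are arbitrary placeholders, and the script is meant to be launched with torchrun on one node or several.

```python
import os
import time
import torch
import torch.distributed as dist

# Launch with, e.g.:
#   torchrun --nproc_per_node=8 allreduce_bench.py                  (single node)
#   torchrun --nnodes=2 --nproc_per_node=8 ... allreduce_bench.py   (multi-node)
dist.init_process_group(backend="nccl")
rank = dist.get_rank()
world = dist.get_world_size()
torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))

numel = 256 * 1024 * 1024                      # 1 GiB of float32 per rank (placeholder)
tensor = torch.randn(numel, device="cuda")

for _ in range(5):                             # warm-up iterations
    dist.all_reduce(tensor)
torch.cuda.synchronize()

iters = 20
start = time.time()
for _ in range(iters):
    dist.all_reduce(tensor)
torch.cuda.synchronize()
elapsed = (time.time() - start) / iters

# Algorithm bandwidth; nccl-tests also report "bus bandwidth", which scales
# this by 2 * (world - 1) / world for all-reduce.
size_bytes = tensor.numel() * tensor.element_size()
if rank == 0:
    print(f"all_reduce {size_bytes / 2**30:.1f} GiB: {size_bytes / elapsed / 1e9:.1f} GB/s algbw")

dist.destroy_process_group()
```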
3. Is your environment configured for performance?
Even with perfect hardware, software misconfigurations can leave performance on the table. An unoptimized environment creates friction that prevents your code from running at its full potential.
What to check:
- Drivers and libraries: Are you using the latest, hardware-accelerated drivers and libraries? Ensure your CUDA®, Fabric Manager, and NVIDIA Quantum InfiniBand drivers (like MLNX_OFED) are up to date and correctly configured, with features like NVIDIA GPUDirect® RDMA enabled (see the version-check sketch after this list).
- Code profiling: Use advanced diagnostic tools like DCGM and profilers like NVIDIA Nsight™ to get a detailed view of what's happening at the kernel level. This can help you identify the specific operations that are causing delays.
- Measure what matters: While utilization is a good indicator, throughput is the ultimate metric for success. Benchmark your model's performance in samples/second or tokens/second to get a true measure of its efficiency (the second sketch after this list shows a quick way to do both).
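The sketch below is one hedged way to run the version check from Python, assuming PyTorch and the pynvml bindings (nvidia-ml-py) are installed. It reports driver and library versions but does not by itself confirm that GPUDirect RDMA or Fabric Manager are correctly configured.

```python
import torch
import pynvml  # pip install nvidia-ml-py

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)

print("driver version :", pynvml.nvmlSystemGetDriverVersion())
print("device         :", pynvml.nvmlDeviceGetName(handle))
print("CUDA (PyTorch) :", torch.version.cuda)
print("cuDNN          :", torch.backends.cudnn.version())
print("NCCL           :", torch.cuda.nccl.version())

pynvml.nvmlShutdown()
```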
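For the last two items, the following is a minimal illustration rather than a full Nsight or DCGM workflow: it times a stand-in training step in samples/second, then wraps a few steps in the built-in PyTorch profiler for a kernel-level breakdown. The model and batch are placeholders for your own.

```python
import time
import torch
from torch.profiler import profile, ProfilerActivity

# Stand-in model and batch; substitute your real training step.
model = torch.nn.Linear(1024, 1024).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
batch = torch.randn(256, 1024, device="cuda")

def train_step():
    optimizer.zero_grad()
    loss = model(batch).square().mean()
    loss.backward()
    optimizer.step()

# Throughput: samples/second over a timed window, after warm-up.
for _ in range(10):
    train_step()
torch.cuda.synchronize()

steps = 50
start = time.time()
for _ in range(steps):
    train_step()
torch.cuda.synchronize()
print(f"{steps * batch.shape[0] / (time.time() - start):.0f} samples/s")

# Kernel-level view: where the time actually goes.
with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    for _ in range(5):
        train_step()
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))
```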
An infrastructure designed for AI excellence
Diagnosing and fixing these issues is possible, but it takes time and deep expertise. The ultimate solution is to build on an infrastructure that was designed from the ground up to prevent these bottlenecks from ever occurring.
At Crusoe, we've engineered our cloud to eliminate the friction between your ambition and your results.
- Optimized from day one: Our environments come pre-configured with the latest drivers and libraries, so you can skip the setup and get straight to building.
- Frictionless data and networking: We've built our infrastructure with high-speed NVIDIA Quantum InfiniBand for seamless multi-GPU communication and high-performance storage to ensure your data pipelines never run dry.
- Uncompromising reliability: We perform rigorous, multi-week burn-in testing on our clusters before they ever reach a customer, dramatically reducing hardware failures and ensuring the 99.98% uptime you need for mission-critical workloads.
Ready to stop diagnosing and start building? Download the full AI engineer's checklist for optimal GPU performance.