Your GPUs Are 95% Idle, and Buying More Won’t Fix It

Published by Akamas

AI is reshaping every industry, and agents are at the center of that shift. Many companies are racing to put them to work: building assistants that handle customer requests around the clock, pipelines that classify and route data without human input, or systems that reason across tools and act on their own. The appeal is obvious. What’s easy to miss in all the excitement is that every one of those agents, no matter how capable it looks from the outside, runs on a model, and that model runs on a GPU, the engine that keeps the whole thing moving.

Enterprises are pouring billions into GPUs to run LLMs and AI agents. Data-center build-outs are happening at a scale never seen before in computing, and the GPU portion alone can account for as much as 70% of total investment, according to recent industry estimates.

Then the first real utilization reports come in. The number is hard to ignore: GPU usage as low as 5%. You might be spending millions on accelerators while 95% of that capacity sits idle. And yet almost every platform and engineering leader we talk to is trying to do the same thing: buy more GPUs to support the GenAI use cases moving into production this year. The bottleneck, as it turns out, is usually not capacity but configuration. On a stack as deep as the GenAI one, the gap between a default setup and a properly tuned one can easily reach 2x throughput or more on the same hardware.

This is the central tension we explored in our recent webinar with Daniele Zonca and Roland Huß from Red Hat, who quite literally wrote the book on running generative AI on Kubernetes.

We’ve Already Seen This Pattern with CPUs

Here’s a number that should worry anyone treating GPUs the way they treated CPUs. At one of our customers, running several dozen Kubernetes clusters and around 60,000 CPUs, average utilization over a full day was 5%. At cloud prices, that adds up to roughly $15 million a year spent on resources that are almost entirely idle.

Those are CPUs. Now apply the same operational habits (over-provisioning, ignoring efficiency, assuming the answer to every capacity question is “add more”) to hardware that costs roughly 40x more per unit and is genuinely scarce. In much of Europe you simply can’t procure H100s on demand, no matter how big your budget is. If 5% utilization on cheap, abundant CPUs costs $15 million a year, the equivalent waste on GPUs is an order of magnitude worse. It can quietly block the ROI of an entire GenAI initiative.

Why GPUs Are So Easy to Waste

GPUs are nothing like CPUs, and that difference is exactly why they’re hard to keep busy.

A CPU is built for latency: most of its silicon is dedicated to control logic and caches that predict what comes next, so that a handful of threads can execute general-purpose work as fast as possible. A GPU, by contrast, is built for throughput: almost the entire chip is devoted to computation, with thousands of cores optimized for the matrix multiplications that LLM serving depends on. Think of a CPU as a fast motorbike and a GPU as a cargo ship.

The catch is feeding that cargo ship. Every computation requires data to move from high-bandwidth memory (HBM) into the cores, and that transfer frequently becomes the real bottleneck, which is the first reason GPU utilization ends up so low.

This gets sharper once you look at how inference actually runs, across two very different phases:

Prefill processes the entire input prompt in parallel to produce the first token. It is compute-bound and drives your time to first token (TTFT), which is the waiting time your user feels before anything appears on screen.
Decode generates tokens one at a time, each step reusing the KV cache built from all previous tokens. It is memory-bound and drives your time per output token (TPOT), which measures how fast the response streams out.

Because the same hardware handles two completely different access patterns, optimizing for one does not automatically improve the other. That mismatch is a core source of complexity in the whole stack.
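To make the two metrics concrete, here is a minimal sketch of how TTFT and TPOT could be derived from the arrival time of each streamed token. The function and field names are hypothetical, and the timestamps are invented for illustration; a real client would record them while reading the response stream.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class StreamLatency:
    ttft_s: float  # time to first token, in seconds
    tpot_s: float  # mean time per output token after the first, in seconds


def stream_latency(request_sent_s: float, token_times_s: Sequence[float]) -> StreamLatency:
    """Derive TTFT and TPOT from the arrival time of each streamed token."""
    if not token_times_s:
        raise ValueError("need at least one token timestamp")
    # Prefill shows up as the wait before the first token.
    ttft = token_times_s[0] - request_sent_s
    if len(token_times_s) == 1:
        return StreamLatency(ttft_s=ttft, tpot_s=0.0)
    # Decode time is spread over the tokens that follow the first one.
    decode_span = token_times_s[-1] - token_times_s[0]
    tpot = decode_span / (len(token_times_s) - 1)
    return StreamLatency(ttft_s=ttft, tpot_s=tpot)


# Invented timestamps: request sent at t=0, five tokens streamed back.
sample = stream_latency(0.0, [0.5, 0.75, 1.0, 1.25, 1.5])
print(sample)
```

Keeping the two numbers separate matters because a change that shortens one (for example, a larger batch that helps decode throughput) can lengthen the other.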

The Trade-off You’re Probably Losing On

NVIDIA’s Jensen Huang has popularized a chart worth understanding. When you plot total token throughput (your revenue proxy, since AI services are priced per token) against per-user interactivity (latency), you don’t get a single best point. You get a Pareto front: a curve of configurations where improving one dimension always costs you something on the other.

Push for maximum throughput and single-user latency suffers. Push for fast responses and you sacrifice a large share of GPU capacity.

Your use case fixes one constraint on that curve. A chatbot needs to emit tokens faster than human reading speed (around 200ms per token) or it feels sluggish and users leave. That means you have a fixed latency limit you cannot cross.

But under that limit sit hundreds of possible configurations, and most teams are unknowingly sitting on a suboptimal one. A single misconfiguration can cut effective throughput in half, leading teams to conclude they’re maxed out and order more GPUs, when in reality they’re using only half of what they already paid for.

With proper tuning, doubling the throughput from existing hardware is not unusual. That is the real opportunity.
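The selection problem described in this section can be sketched in a few lines: keep only the configurations on the Pareto front, then pick the highest-throughput one that still meets the latency limit. The configuration names and measurements below are invented for illustration, not benchmark results.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional


@dataclass(frozen=True)
class Config:
    name: str
    throughput_tok_s: float  # total tokens per second across all users
    tpot_ms: float           # per-user time per output token


def pareto_front(configs: Iterable[Config]) -> List[Config]:
    """Keep configs that no other config beats on both throughput and latency."""
    configs = list(configs)
    front = []
    for c in configs:
        dominated = any(
            o.throughput_tok_s >= c.throughput_tok_s
            and o.tpot_ms <= c.tpot_ms
            and (o.throughput_tok_s > c.throughput_tok_s or o.tpot_ms < c.tpot_ms)
            for o in configs
        )
        if not dominated:
            front.append(c)
    return sorted(front, key=lambda c: c.tpot_ms)


def best_under_limit(configs: Iterable[Config], tpot_limit_ms: float) -> Optional[Config]:
    """Highest-throughput config that respects the per-token latency limit."""
    feasible = [c for c in configs if c.tpot_ms <= tpot_limit_ms]
    return max(feasible, key=lambda c: c.throughput_tok_s, default=None)


# Hypothetical measurements from load tests on the same hardware.
measured = [
    Config("default", 1000, 150),
    Config("large-batch", 4000, 400),
    Config("tuned", 2100, 180),
    Config("low-latency", 600, 60),
    Config("misconfigured", 900, 190),
]

front = pareto_front(measured)
choice = best_under_limit(front, tpot_limit_ms=200)
print([c.name for c in front], "->", choice.name)
```

In this made-up data set the "misconfigured" point is strictly worse than "default", and the best point under a 200ms limit delivers roughly twice the throughput of the default one. The hard part in practice is producing the measurements, not the selection.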

The Playbook: Getting 2x from the GPUs You Already Have

The GenAI stack has hundreds of tuning knobs, from the model down to the kernel and Kubernetes itself. That depth is the real challenge. The skills needed to navigate it effectively are genuinely hard to find, which means most teams end up working with defaults that no one has had the time or expertise to question. Here are the levers with the highest payoff.

Right-size the model, then quantize it

Defaulting to the biggest model for every request is rarely the right call. Smaller and reasoning-tuned models can match quality on the right tasks, and a well-designed agent framework can maintain accuracy while running a smaller model. Quantization can compress a model to as little as 25–30% of its original size while preserving most of its accuracy, which is a significant reduction in resource use for almost no quality cost.
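The 25–30% figure follows from simple arithmetic on bits per weight. The sketch below assumes a hypothetical 70-billion-parameter model stored at 16 bits, quantized to 4 bits with one 16-bit scale shared by every group of 128 weights; the group size and scale width are illustrative assumptions, and real schemes differ in their overhead.

```python
def weight_footprint_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate weight memory in GB (10^9 bytes): parameters x bits / 8."""
    return params_billion * bits_per_weight / 8


def effective_bits(weight_bits: int, scale_bits: int = 16, group_size: int = 128) -> float:
    """Bits per weight once one shared scale per group is included (assumed scheme)."""
    return weight_bits + scale_bits / group_size


PARAMS_B = 70  # hypothetical model size, in billions of parameters

original_gb = weight_footprint_gb(PARAMS_B, 16)                  # 140.0
quantized_gb = weight_footprint_gb(PARAMS_B, effective_bits(4))  # 36.09375
ratio = quantized_gb / original_gb

print(f"{original_gb:.1f} GB -> {quantized_gb:.1f} GB ({ratio:.0%} of original)")
```

This counts weights only. The KV cache and activations still need memory on top, so the freed space is what lets you serve larger batches or longer contexts on the same card.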

Route intelligently and exploit cache locality

Pick the right model per request, but also send follow-up messages to the replica that already holds the relevant KV cache. Multi-turn chat and agentic flows send the whole growing conversation back and forth, so cache-aware routing is critical to avoid recomputing work you’ve already done.
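As a rough illustration of cache-aware routing, the toy router below sends a request to the replica that has already served the longest matching prompt prefix, and falls back to the least-loaded replica otherwise. It is a sketch under simplifying assumptions: it hashes fixed-size character blocks rather than token blocks, it never evicts entries, and the class and replica names are hypothetical. Production routers track the actual cache state of each replica.

```python
import hashlib
from typing import Dict, List


class CacheAwareRouter:
    """Toy router: prefer the replica that already saw the longest prompt prefix."""

    def __init__(self, replicas: List[str], block_chars: int = 64):
        self.replicas = replicas
        self.block_chars = block_chars
        self.block_owner: Dict[str, str] = {}  # chained prefix hash -> replica
        self.load: Dict[str, int] = {r: 0 for r in replicas}

    def _prefix_hashes(self, prompt: str) -> List[str]:
        # Chain the hashes so that entry i identifies the whole prefix up to block i.
        hashes, prev = [], b""
        for start in range(0, len(prompt) - self.block_chars + 1, self.block_chars):
            block = prompt[start:start + self.block_chars]
            prev = hashlib.sha256(prev + block.encode()).digest()
            hashes.append(prev.hex())
        return hashes

    def route(self, prompt: str) -> str:
        hashes = self._prefix_hashes(prompt)
        target = None
        # Walk from the longest prefix to the shortest: the first hit is the best match.
        for h in reversed(hashes):
            if h in self.block_owner:
                target = self.block_owner[h]
                break
        if target is None:
            # No known prefix: fall back to the least-loaded replica.
            target = min(self.replicas, key=lambda r: self.load[r])
        for h in hashes:
            self.block_owner[h] = target
        self.load[target] += 1
        return target


router = CacheAwareRouter(["replica-a", "replica-b"], block_chars=16)
turn1 = "system: you are a helpful assistant. user: hi"
first = router.route(turn1)
other = router.route("a completely different conversation starts here")
# The follow-up resends the whole conversation, so it shares turn1's prefix.
followup = router.route(turn1 + " assistant: hello! user: what is a KV cache?")
print(first, other, followup)
```

The follow-up lands on the same replica as the first turn even though the other replica is equally loaded, which is the behavior that avoids recomputing the prefill for the shared history.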

Tune the inference engine and the parallelism strategy

The engine layer around vLLM (or any other inference server) is evolving fast. It pays to match the parallelism strategy to your model:

Data parallelism (replicas behind a load balancer) scales concurrency but not single-request latency.
Pipeline parallelism splits a too-big model across nodes by layers.
Tensor parallelism splits each layer across the GPUs within a node, using NVLink/NVSwitch, and can actually improve single-request latency.

You can combine pipeline and tensor parallelism to get the best of both, but neither is free of trade-offs. Pipeline parallelism is constrained by inter-node network latency, which can be 25–50x slower than staying on a single node. Tensor parallelism introduces its own synchronization overhead across the NVLink fabric inside the node. The right combination always needs to be validated against your specific setup rather than assumed to work out of the box.
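A first guess at a layout can be derived from memory alone, as in the sketch below: keep tensor parallelism inside a node, add pipeline stages only when one node is not enough, and spend leftover GPUs on replicas. The helper names, the 60% weight budget per GPU, and the power-of-two rule for the tensor-parallel degree are assumptions for illustration. The two flags emitted at the end are vLLM's tensor- and pipeline-parallel size options; everything else is hypothetical, and the result is a starting point to validate, not a recommendation.

```python
import math
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Layout:
    tensor_parallel: int    # GPUs sharing each layer, kept inside one node
    pipeline_parallel: int  # nodes, each holding a slice of the layers
    replicas: int           # data-parallel copies behind the load balancer


def plan_layout(model_gb: float, gpu_mem_gb: float, gpus_per_node: int,
                nodes: int, weight_fraction: float = 0.6) -> Layout:
    """Naive first guess based only on whether the weights fit in memory."""
    # Assumption: only part of each GPU holds weights; the rest is KV cache etc.
    budget_gb = gpu_mem_gb * weight_fraction
    gpus_needed = math.ceil(model_gb / budget_gb)
    if gpus_needed <= gpus_per_node:
        tp = 1
        while tp < gpus_needed:  # assumed power-of-two tensor-parallel degree
            tp *= 2
        tp = min(tp, gpus_per_node)
        return Layout(tp, 1, (gpus_per_node // tp) * nodes)
    # One node is not enough: fill each node with TP, then chain nodes with PP.
    pp = math.ceil(gpus_needed / gpus_per_node)
    if pp > nodes:
        raise ValueError("model does not fit on the available nodes")
    return Layout(gpus_per_node, pp, nodes // pp)


def vllm_flags(layout: Layout) -> List[str]:
    """Parallelism flags for one replica of a vLLM server."""
    return ["--tensor-parallel-size", str(layout.tensor_parallel),
            "--pipeline-parallel-size", str(layout.pipeline_parallel)]


small = plan_layout(model_gb=140, gpu_mem_gb=80, gpus_per_node=8, nodes=2)
large = plan_layout(model_gb=700, gpu_mem_gb=80, gpus_per_node=8, nodes=2)
print(small, vllm_flags(small))
print(large, vllm_flags(large))
```

A memory-only plan ignores the latency costs described above, so the candidates it produces still need to be load tested against each other.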

Match hardware to the bottleneck

If your constraint is in the decode phase, the real problem is memory bandwidth. Buying the GPU with the most teraflops will not help. Disaggregated serving (separating prefill and decode pools, as projects like llm-d enable) lets you scale each phase independently and route large inputs to dedicated prefill instances. As inference scales out, expect training-grade networking such as RDMA and InfiniBand to start mattering for inference as well.
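A back-of-the-envelope bound shows why teraflops do not appear in the decode picture. In the sketch below, each decode step is assumed to read all weights once plus the KV cache of every sequence in the batch, so the step time cannot be shorter than bytes moved divided by memory bandwidth. The 70 GB of weights, 0.5 GB of KV cache per sequence, and 3,350 GB/s of bandwidth are illustrative assumptions, not measurements.

```python
def decode_ceiling_tok_s(weight_gb: float, kv_gb_per_seq: float,
                         batch: int, hbm_bandwidth_gb_s: float) -> float:
    """Upper bound on decode tokens/s when memory traffic is the only limit."""
    # One step reads the weights once and each sequence's KV cache once.
    bytes_moved_gb = weight_gb + kv_gb_per_seq * batch
    step_time_s = bytes_moved_gb / hbm_bandwidth_gb_s
    # Each step emits one token per sequence in the batch.
    return batch / step_time_s


WEIGHT_GB = 70.0      # assumed: weights resident on the GPU(s)
KV_GB_PER_SEQ = 0.5   # assumed: KV cache per active sequence
BANDWIDTH = 3350.0    # assumed: HBM bandwidth in GB/s

single = decode_ceiling_tok_s(WEIGHT_GB, KV_GB_PER_SEQ, 1, BANDWIDTH)
batched = decode_ceiling_tok_s(WEIGHT_GB, KV_GB_PER_SEQ, 32, BANDWIDTH)
print(f"batch 1: {single:.0f} tok/s ceiling, batch 32: {batched:.0f} tok/s ceiling")
```

Nothing in the formula depends on compute throughput: under these assumptions a card with twice the FLOP/s and the same bandwidth has exactly the same ceiling, while batching raises it because the weight reads are shared.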

Partition GPUs with care

If a GPU is allocated but underused, you may be able to host a second model on it, but GPUs have both a compute side and a bandwidth side. Partitioning can create a noisy-neighbor effect: spare capacity on paper does not guarantee you can add load without degrading both workloads. Always validate the impact before packing more onto a single card.

Measure what you optimize

None of these changes are safe to make without proper visibility. Track LLM-specific metrics (TTFT, TPOT), lean on the OpenTelemetry semantic conventions emerging for GenAI, and use tracing tools like MLflow to follow non-deterministic, multi-hop agentic requests. This also gives you the data for internal cost allocation across teams.

GPU-level metrics deserve special attention because they are significantly harder to interpret than application-level ones. Tools like NVIDIA’s DCGM give you utilization numbers, but those numbers are often too coarse to identify whether your actual bottleneck is compute, memory bandwidth, or something else. GPU memory bandwidth, for instance, cannot be directly measured from DCGM at all: vLLM can expose an estimation through the --enable-mfu-metrics flag, but it remains an approximation. To make things more confusing, “GPU utilization” means very different things depending on which layer of the stack you’re looking at (allocation, kernel execution, or model FLOP/s), and conflating them leads to wrong conclusions.
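The three meanings of "utilization" can diverge widely on the same deployment, as this small sketch illustrates. It uses the common approximation of about 2 FLOPs per parameter per generated token for the forward pass (attention cost ignored); the model size, token rate, kernel-activity reading, and peak figure of 989 TFLOP/s per GPU are all assumptions chosen for illustration.

```python
def allocation_utilization(gpus_requested: int, gpus_in_cluster: int) -> float:
    """Share of GPUs that are claimed by workloads, regardless of what they do."""
    return gpus_requested / gpus_in_cluster


def model_flops_utilization(params_billion: float, tokens_per_s: float,
                            peak_tflops: float) -> float:
    """Achieved / peak FLOP/s, using roughly 2 FLOPs per parameter per token."""
    achieved_tflops = 2 * params_billion * 1e9 * tokens_per_s / 1e12
    return achieved_tflops / peak_tflops


GPUS = 8
PEAK_TFLOPS_PER_GPU = 989.0  # assumed peak for the chosen precision

allocated = allocation_utilization(gpus_requested=GPUS, gpus_in_cluster=GPUS)
kernel_active = 0.92         # hypothetical reading from a GPU monitoring tool
mfu = model_flops_utilization(params_billion=70, tokens_per_s=1200,
                              peak_tflops=GPUS * PEAK_TFLOPS_PER_GPU)

print(f"allocated: {allocated:.0%}, kernel-active: {kernel_active:.0%}, MFU: {mfu:.1%}")
```

In this invented example the same eight GPUs look fully allocated, busy most of the time, and yet deliver only a few percent of their peak arithmetic, which is why it helps to state which of the three numbers a report refers to.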

Load test with your own traffic, not a vendor benchmark

Traditional tools like k6, JMeter, and Locust work at the application level, but purpose-built tools like GuideLLM (part of the vLLM project) let you build realistic, record-and-replay workloads that reflect how your system is actually used. The most important point here is to always build your test dataset from real traffic, because a vendor’s benchmark on a foundation model says nothing about how the system will behave under your specific workload. The optimization space for GenAI is too workload-dependent for generic results to transfer.
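Building a test set from real traffic can start with something as simple as the sketch below, which turns recorded requests into a replay schedule that preserves their original spacing. The JSON-lines log format and its field names (ts, prompt, max_tokens) are hypothetical; adapt them to whatever your gateway records, and feed the schedule to the load generator of your choice.

```python
import json
from typing import Dict, Iterable, List, Tuple


def build_replay_schedule(log_lines: Iterable[str],
                          speedup: float = 1.0) -> List[Tuple[float, Dict]]:
    """Turn recorded requests into (send_offset_seconds, request) pairs."""
    records = [json.loads(line) for line in log_lines if line.strip()]
    if not records:
        return []
    # Logs are not always ordered, so sort by the recorded timestamp first.
    records.sort(key=lambda r: r["ts"])
    t0 = records[0]["ts"]
    return [
        # A speedup above 1 compresses the gaps to simulate heavier traffic.
        ((r["ts"] - t0) / speedup,
         {"prompt": r["prompt"], "max_tokens": r["max_tokens"]})
        for r in records
    ]


# Invented log lines standing in for recorded production traffic.
recorded = [
    '{"ts": 100.0, "prompt": "Summarize this ticket", "max_tokens": 128}',
    '{"ts": 101.5, "prompt": "Classify this email", "max_tokens": 16}',
    '{"ts": 100.5, "prompt": "Draft a reply", "max_tokens": 256}',
]

schedule = build_replay_schedule(recorded, speedup=2.0)
for offset, request in schedule:
    print(f"t+{offset:.2f}s  {request['prompt']}")
```

Replaying real prompts keeps the input and output length distribution of your workload, which is the property generic benchmarks do not capture. Remember to scrub sensitive data before reusing production prompts.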

Too Many Knobs for a Human to Tune

With hundreds of knobs spread across three fundamentally different layers (the model, the inference server, and the GPU itself), millions of possible combinations, and a target that shifts with every new use case and every new model release, this is not a one-off exercise you do at launch. It is a continuous practice.

Each layer also requires different expertise. The people who understand model selection and quantization are rarely the same people who can tune vLLM internals or reason about GPU memory bandwidth, and finding someone who spans all three is genuinely uncommon. That puts the job well beyond what a team can manage by hand.

The emerging answer is to use AI to optimize AI systems: autonomous tuning that explores the configuration space and finds the settings that maximize throughput while staying within your latency and cost targets. This is exactly the problem Akamas was built to solve, first for traditional Kubernetes workloads and now for GenAI inference, and we have something significant coming on that front that we will be sharing very soon.

The Bottom Line

As enterprises move AI agents into production and pour significant capital into the GPU infrastructure to support them, the pressure to show returns on that investment is growing fast. But utilization tells a different story, and the 5% figure is not a hardware problem. It is what happens when a genuinely complex stack gets deployed with default settings and no one has the time to tune it properly.

The good news is that the same depth that makes the GenAI stack hard to configure also creates real headroom. The difference between where most teams are parked today and what their existing hardware can actually deliver is large enough to change the ROI equation entirely.

You do not need more GPUs to get there. You need visibility into your actual bottleneck, a realistic picture of your workload, and a systematic way to explore the configuration space. Everything else in this article follows from those three things.