$600 Billion Says You'll Keep Wasting Tokens

The hyperscaler capex buildout only pays off if token volume grows. Your agent's context replay is the business model.

[Illustration: an AI data center complex with golden tokens cascading into enterprise budgets, representing $600 billion in hyperscaler capex.]

The five largest cloud companies will spend more than $600 billion on capital expenditure in 2026 — a 36% increase over last year. Amazon alone is committing $200 billion. Google doubled its guidance to $175-185 billion. Meta is pushing $115-135 billion.

Roughly 75% of that — about $450 billion — is earmarked for AI infrastructure. Data centers. GPUs. Networking. Cooling. The physical substrate that turns electricity into tokens.

Here’s the question nobody on stage at re:Invent or Google Cloud Next is asking: Who pays for all of that?

You do. One token at a time.


The Business Model Is the Problem

Anthropic crossed $30 billion in annualized revenue in March 2026 — passing OpenAI’s $25 billion for the first time. Their APIs now process more than 15 billion tokens per minute. Enterprise clients spending over $1 million per year doubled from 500 to 1,000 in under two months.

This is not a criticism. It’s a description of incentive structure.

Model providers are paid per token. Hyperscalers are paid per compute cycle that processes those tokens. The entire $600 billion infrastructure buildout only generates returns if token volume grows. Every pricing page, every context window expansion, every “now with 1M tokens!” announcement serves the same function: increase the number of tokens flowing through the pipe.

Per-token prices have dropped — sometimes dramatically. Gemini 2.5 Flash is roughly 10x cheaper on input than competitors. NVIDIA’s Blackwell platform has reduced inference cost per token by up to 10x. But total spending keeps climbing, because the industry has discovered something important: cheaper tokens don’t reduce bills. They increase usage.

This is the inference cost paradox. Per-unit prices fall. Total consumption rises faster. Enterprise AI budgets have grown from $1.2 million per year in 2024 to $7 million in 2026. AI inference now represents 85% of the enterprise AI budget. The tokens are cheaper. The bill is bigger.

And agents make it worse.


Agents Are a Token Multiplier

A standard chatbot interaction is a single LLM call. An AI agent completing the same task triggers 10 to 20 calls — tool lookups, reasoning chains, memory retrieval, verification steps. Gartner’s March 2026 analysis found that agentic AI requires 5 to 30 times more tokens per task than a standard chatbot.

This is exactly what the infrastructure buildout is designed for. More calls per task. More tokens per call. More revenue per user.

But the math has a breaking point. One developer tracked 42 agent runs on a FastAPI codebase and found 70% waste — from re-reading files the model already processed, failed attempts that replayed full context, and verbose tool output that nobody needed. A broader audit pattern has emerged: 40-60% of enterprise AI inference spend is waste.

Not “could be optimized.” Waste. Tokens that the model already saw. Context that didn’t change. History replayed because no one built the architecture to do otherwise.

Gartner projects that 40% of agentic AI projects will fail by 2027 — not because the models aren’t good enough, but because the economics don’t survive contact with production.

The hyperscalers are spending $600 billion on the assumption that token volume will keep growing. The enterprises writing those checks are discovering that most of those tokens are noise.


The Structural Misalignment

Here’s the game theory.

Four players. Two sides.

On one side: hyperscalers and model providers. Their revenue scales with token volume. Amazon’s $244 billion backlog, Google Cloud’s 50% growth rate, Anthropic’s $30 billion ARR — all of it is denominated in tokens processed. Better context management means fewer tokens means less revenue. They are structurally incentivized to maximize token flow.

On the other side: agent builders and enterprises. Their success depends on completing tasks at sustainable cost. Every wasted token is margin erosion. Every context replay is latency they didn’t need. They are structurally incentivized to minimize token flow.

This isn’t a market inefficiency that will self-correct. It’s an antagonistic contradiction. The infrastructure providers profit from the same waste that kills their customers’ projects.

That’s why the model providers won’t build context compression. It’s not a technical limitation — it’s a business model conflict. You don’t optimize away your own revenue.

That’s why the hyperscalers won’t build it either. AWS doesn’t sell you “fewer tokens, same result.” They sell you reserved capacity for more tokens.

And that’s why the context layer has to come from outside the hyperscaler stack.


What Compression Actually Means at Scale

A 15-turn product support conversation replays ~1,259 tokens to the model on every turn when using the standard approach. With context compression — extracting structured facts and returning only what changed — that drops to ~221 tokens. Same agent. Same accuracy. 82% fewer tokens.
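The mechanism can be sketched in a few lines. This is a hypothetical illustration of the "extract structured facts, return only what changed" idea, not Nocturnus's actual implementation: the agent keeps a fact store between turns and sends the model only the delta plus the new message, instead of replaying the whole transcript.

```python
# Hypothetical sketch of delta-based context compression.
# Instead of replaying the full conversation history each turn,
# keep a structured fact store and send only what changed.

from typing import Dict


class CompressedContext:
    def __init__(self) -> None:
        # Canonical state extracted from the conversation so far.
        self.facts: Dict[str, str] = {}

    def update(self, new_facts: Dict[str, str]) -> Dict[str, str]:
        """Merge new facts; return only the keys whose values changed."""
        delta = {k: v for k, v in new_facts.items() if self.facts.get(k) != v}
        self.facts.update(delta)
        return delta

    def prompt_for_turn(self, user_message: str, new_facts: Dict[str, str]) -> str:
        """Build the per-turn prompt: changed facts plus the new message."""
        delta = self.update(new_facts)
        delta_lines = [f"{k}: {v}" for k, v in delta.items()]
        return "\n".join(["[changed facts]", *delta_lines, "[user]", user_message])


ctx = CompressedContext()
turn1 = ctx.prompt_for_turn(
    "My order hasn't arrived", {"order_id": "A-113", "status": "shipped"}
)
turn2 = ctx.prompt_for_turn("Any update?", {"order_id": "A-113", "status": "shipped"})
# Turn 2 carries no facts at all, because nothing changed since turn 1.
```

A real system would extract the facts with a model rather than receive them pre-structured, but the token economics are the same: unchanged state costs nothing to resend.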

At 1,000 requests per hour on Claude Opus 4, that’s the difference between $13,600/month and $2,400/month.
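The arithmetic behind those figures is easy to check. A back-of-envelope sketch, assuming Claude Opus 4's published input rate of $15 per million tokens and traffic running continuously through a 30-day month:

```python
# Back-of-envelope inference cost model for the example above.
# Assumptions: $15 per 1M input tokens (Claude Opus 4 input rate),
# 1,000 requests/hour sustained for a 30-day month. Output tokens
# are ignored; they are the same on both paths.

PRICE_PER_M_INPUT = 15.00               # USD per 1M input tokens
REQUESTS_PER_MONTH = 1_000 * 24 * 30    # 720,000 requests/month


def monthly_input_cost(tokens_per_request: int) -> float:
    """Monthly input-token spend at sustained request volume."""
    total_tokens = tokens_per_request * REQUESTS_PER_MONTH
    return total_tokens / 1_000_000 * PRICE_PER_M_INPUT


baseline = monthly_input_cost(1_259)    # full-history replay
compressed = monthly_input_cost(221)    # facts + delta only

print(f"baseline:   ${baseline:,.0f}/month")    # ≈ $13,600
print(f"compressed: ${compressed:,.0f}/month")  # ≈ $2,400
print(f"reduction:  {1 - 221 / 1_259:.0%}")     # 82%
```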

But the real impact isn’t the per-agent savings. It’s what happens when you multiply across the enterprise.

An organization running 50 agents at production scale isn’t saving $11,000 per agent per month. They’re saving enough to fund the next 50 agents. Context compression doesn’t just reduce cost — it changes the unit economics of the entire AI program. Projects that were killed at the POC stage because “the token costs don’t scale” suddenly have a viable production path.

This is the part the hyperscaler narrative misses. Efficiency doesn’t shrink the market. It expands it. The $600 billion infrastructure buildout doesn’t need every token to be wasted. It needs every enterprise to get past the POC stage and into production. Right now, 40% of them won’t.


The Parallel Ecosystem Is Already Forming

The signs are everywhere.

NVIDIA’s Blackwell push is about inference efficiency — not because NVIDIA wants fewer tokens, but because they want more customers running inference at all. The model routing pattern — directing 80% of simple queries to small, cheap models — is now standard at forward-thinking enterprises. Caching discounts of 90% from both OpenAI and Anthropic acknowledge that re-processing the same context is waste, even if they can’t eliminate it architecturally.
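The routing pattern itself is simple enough to sketch. The model names and the complexity heuristic below are placeholders — a production router would use a trained classifier or a cheap model as the judge — but the shape is the same: cheap path by default, escalation on demand.

```python
# Minimal sketch of the model-routing pattern: send simple queries
# to a small, cheap model and escalate the rest. Model names and
# the heuristic are illustrative, not a real provider API.

SMALL_MODEL = "small-fast-model"       # placeholder for a cheap model
LARGE_MODEL = "large-frontier-model"   # placeholder for a frontier model


def looks_simple(query: str) -> bool:
    """Crude stand-in for a real complexity classifier:
    short queries with no code blocks go to the cheap path."""
    return len(query.split()) < 30 and "```" not in query


def route(query: str) -> str:
    """Pick a model for this query. In the routing pattern, the
    bulk of traffic should resolve to the small model."""
    return SMALL_MODEL if looks_simple(query) else LARGE_MODEL
```

Even this naive version captures the economics: if 80% of traffic satisfies the cheap-path predicate, the blended cost per query drops toward the small model’s price.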

A parallel “efficient inference” ecosystem is coalescing alongside the hyperscaler stack, not against it. The components: intelligent routing, context compression, model cascading, inference-optimized hardware.

Context compression is the missing layer. Not because it’s the most technically complex — but because it sits at the exact point where the hyperscaler incentive breaks down. The model provider won’t compress your context (revenue loss). The framework won’t compress your context (provider neutrality). The hyperscaler won’t compress your context (capacity utilization). So the layer has to be independent.

Source-available. Provider-neutral. Self-hosted. Paid for intelligence, not volume.


The $600 Billion Question

The hyperscalers are making a bet: that token volume will grow fast enough to pay back $600 billion in infrastructure. They’re probably right — the market is expanding rapidly.

But within that market, there’s a fork.

One path: enterprises keep sending 1,259 tokens when 221 would do. Inference costs consume 85% of AI budgets. 40% of agentic projects fail. The token volume grows, but the customer base doesn’t.

The other path: a context layer emerges that makes agents economically viable at scale. More enterprises reach production. More agents get deployed. Token volume still grows — but because there are 10x more agents running, not because each agent wastes 5x more tokens than it needs to.

The hyperscalers get paid either way. But only one path builds a sustainable market.

$600 billion is a lot of infrastructure. It deserves better than wasted tokens.

— The Nocturnus team