finops series

FinOps for AI: costing the workload before it costs you

AI compute is the fastest-growing line on the cloud bill and the hardest to attribute. What FinOps for AI actually takes, from someone holding ~$4M a year in cloud spend flat.

Last updated July 4, 2026

The fastest-growing line on the cloud bill in 2026 is one most teams have never had to forecast. Compute you rent by the GPU-hour or pay for by the token behaves nothing like the EC2 fleet you sized to measured load two years ago. It scales with usage you don’t control, from users and agents making requests you didn’t write, and the meter runs whether the output was worth anything or not.

I’ve held ~$4M a year in cloud spend flat at Postscript while traffic grew, leading cost work across the org, and I run agentic AI as a daily part of the job: parallel sub-agents that inventory infrastructure and drive changes from plan to production. So I’ve watched AI compute cost from both ends, the bill I’m accountable for and the workload generating it.

The FinOps Foundation’s State of FinOps 2026 report put a number on how fast this arrived. 98% of the practitioners they surveyed now manage AI spend, up from 63% a year earlier and 31% the year before that, the steepest jump of any scope they track. AI cost management was the most-wanted skill in the same survey. A discipline that didn’t have a name two years ago is now the thing cost teams are hiring for.

Why AI spend resists the old playbook

The attribution problem I took apart earlier gets sharper here. A shared model endpoint is the new shared EKS cluster: one line item serving dozens of features, and perfect tagging on the endpoint tells you the whole cost belongs to “the AI platform,” which is accurate and worthless. To divide it you need per-request measurement: which feature, how many tokens, at what model tier. Almost nobody instruments that until the bill forces the question.
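As a rough sketch of what dividing that line item looks like, the Python below splits one endpoint bill across features in proportion to tier-weighted token usage. The record fields (`feature`, `tier`, `input_tokens`, `output_tokens`) and the tier weights are assumptions made for illustration, not any provider's schema or pricing.

```python
from collections import defaultdict

# Hypothetical relative weights per model tier (assumption: a large-tier token
# costs roughly 10x a small-tier token; substitute your provider's real ratio).
TIER_WEIGHT = {"small": 1.0, "large": 10.0}


def attribute_endpoint_cost(requests, total_cost):
    """Split one endpoint bill across features by tier-weighted token usage."""
    weighted = defaultdict(float)
    for r in requests:
        tokens = r["input_tokens"] + r["output_tokens"]
        weighted[r["feature"]] += tokens * TIER_WEIGHT[r["tier"]]
    total_weight = sum(weighted.values())
    if total_weight == 0:
        return {}
    return {f: total_cost * w / total_weight for f, w in weighted.items()}


# Illustrative per-request records; in practice these come from request logs.
requests = [
    {"feature": "search", "tier": "small", "input_tokens": 800, "output_tokens": 200},
    {"feature": "summarize", "tier": "large", "input_tokens": 300, "output_tokens": 100},
    {"feature": "search", "tier": "small", "input_tokens": 900, "output_tokens": 100},
]
shares = attribute_endpoint_cost(requests, total_cost=600.0)
```

With these made-up numbers, `search` sends five times the raw tokens but ends up with a third of the bill, because the single `summarize` call runs on the expensive tier. That is the information a tag on the endpoint can never give you.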

The provisioning model is inverted too. A classic fleet gets sized to measured load and then sits there. Inference scales with demand you often can’t predict, and one bad prompt or an agent caught in a retry loop can double a workload’s cost with no deploy going out. The GPU cluster reserved for training burns between jobs at a rate that would have gotten an ordinary instance deleted in its first week. The waste isn’t hiding in a corner of the bill. It’s in the hot path, moving.
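One way to bound the retry-loop case is a per-run token budget that every model call charges against. The sketch below is a minimal illustration under stated assumptions: `RunBudget`, `run_with_retries`, and the `call_model` callable returning `(ok, tokens_used)` are hypothetical names, not part of any agent framework.

```python
class BudgetExceeded(RuntimeError):
    """Raised when a single agent run spends more tokens than it was allowed."""


class RunBudget:
    """Caps the tokens one agent run may spend, retries included."""

    def __init__(self, max_tokens):
        self.max_tokens = max_tokens
        self.spent = 0

    def charge(self, tokens):
        self.spent += tokens
        if self.spent > self.max_tokens:
            raise BudgetExceeded(f"run spent {self.spent} of {self.max_tokens} tokens")


def run_with_retries(call_model, budget, max_attempts=5):
    """call_model is a hypothetical callable returning (ok, tokens_used)."""
    for attempt in range(1, max_attempts + 1):
        ok, tokens_used = call_model()
        budget.charge(tokens_used)  # charged after the call, so one overshoot is possible
        if ok:
            return attempt
    raise RuntimeError("gave up after max_attempts")


# Stand-in model call that always fails and burns 4,000 tokens each time.
def always_failing_call():
    return False, 4000


budget = RunBudget(max_tokens=10_000)
try:
    run_with_retries(always_failing_call, budget)
    stopped_by_budget = False
except BudgetExceeded:
    stopped_by_budget = True
```

Here the budget cuts the loop off on the third attempt instead of the fifth. The cap is measured in tokens rather than attempts because the thing you are protecting is spend, and a retry with a growing context costs more than the attempt before it.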

What buyers are actually asking for

The State of FinOps report is useful because it names what practitioners want and can’t get. Three requests come up again and again:

Granular AI spend monitoring, broken down by tokens, model requests, and GPU utilization rather than by account.
Shift-left costing: an estimate of what a design will cost before it ships, so the expensive architecture gets caught in review instead of in next month’s invoice.
A single pane across model APIs, GPU fleets, and ordinary cloud spend, because today they live in three different tools.
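The shift-left request in particular comes down to arithmetic you can do in a design review. A minimal sketch, assuming token-priced inference: the function name, the traffic figures, and the per-million-token prices below are all placeholders, not a real rate card.

```python
def estimate_monthly_cost(requests_per_day, input_tokens, output_tokens,
                          price_in_per_million, price_out_per_million, days=30):
    """Back-of-envelope monthly cost for a token-priced design."""
    per_request = (input_tokens * price_in_per_million
                   + output_tokens * price_out_per_million) / 1_000_000
    return per_request * requests_per_day * days


# Hypothetical design: 50,000 requests/day, 2,000 input and 500 output tokens each,
# priced at $3 and $15 per million tokens (placeholder prices).
baseline = estimate_monthly_cost(50_000, 2_000, 500, 3.0, 15.0)

# Same design with an extra 6,000 tokens of context stuffed into every request.
with_big_context = estimate_monthly_cost(50_000, 8_000, 500, 3.0, 15.0)
```

Under these assumptions the baseline comes to about $20,250 a month and the larger-context variant to about $47,250. The exact figures matter less than having the two numbers side by side while the design can still change.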

None of this is exotic. It’s the same instinct behind the standing thesis of this series, that the bill is design feedback and you should read it that way. The only new part is that the feedback loop on AI spend is faster and less forgiving, so the costing has to move earlier to keep up.

Where to start Monday

Pick one AI feature that’s already in production and answer a single question: what does one unit of it cost? Cost per conversation, per agent run, per thousand inferences, whatever the natural unit is. You will not have the instrumentation to answer cleanly, and building it is the work. Once you have a unit cost, the rest of FinOps applies unchanged: attribute it, watch it, and put a number on the design before the next one ships.
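If the natural unit is a conversation, a first pass at unit cost can be as simple as summing request cost by conversation ID. The sketch below assumes a request log that already carries a `conversation_id` and token counts, which is exactly the instrumentation you may have to build first. The field names and prices are hypothetical.

```python
from collections import defaultdict
from statistics import mean

# Placeholder prices in dollars per million tokens (assumption, not a real rate card).
PRICE_IN, PRICE_OUT = 3.0, 15.0


def cost_per_conversation(request_log):
    """Sum request cost by conversation_id and return {conversation_id: dollars}."""
    totals = defaultdict(float)
    for r in request_log:
        cost = (r["input_tokens"] * PRICE_IN
                + r["output_tokens"] * PRICE_OUT) / 1_000_000
        totals[r["conversation_id"]] += cost
    return dict(totals)


# Illustrative log rows; real ones would come from your request-level metering.
log = [
    {"conversation_id": "c1", "input_tokens": 1_000, "output_tokens": 200},
    {"conversation_id": "c1", "input_tokens": 3_000, "output_tokens": 400},
    {"conversation_id": "c2", "input_tokens": 500, "output_tokens": 100},
]
per_conversation = cost_per_conversation(log)
unit_cost = mean(per_conversation.values())  # average dollars per conversation
worst = max(per_conversation.values())       # the most expensive conversation
```

Tracking the worst case alongside the mean is a deliberate choice: an average can look healthy while a handful of long conversations carry most of the spend.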

Then find the idle GPU. There is almost always one, reserved for a workload that runs in bursts and billed for the hours in between, and it’s the closest thing to free money in this whole category. Nobody deletes it because nobody owns it, which is the same reason the debug log stream from 2024 is still running on the classic side of the bill.
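Finding it is mostly a matter of lining up billed hours against utilization. A minimal sketch, assuming you can export one average utilization value per billed hour from whatever monitoring you already run: the 5% idle threshold and the $4 hourly rate are illustrative assumptions.

```python
def idle_report(hourly_utilization, hourly_rate, idle_threshold=0.05):
    """hourly_utilization: one average GPU utilization value (0.0-1.0) per billed hour."""
    idle_hours = sum(1 for u in hourly_utilization if u < idle_threshold)
    total_hours = len(hourly_utilization)
    return {
        "idle_hours": idle_hours,
        "idle_fraction": idle_hours / total_hours if total_hours else 0.0,
        "idle_cost": idle_hours * hourly_rate,
    }


# Hypothetical day: a burst job runs for 6 hours, the reservation bills for 24.
day = [0.0] * 9 + [0.9] * 6 + [0.01] * 9
report = idle_report(day, hourly_rate=4.0)
```

For this made-up day the report shows 18 of 24 billed hours idle. Putting a dollar figure on those hours is usually what gets the GPU an owner.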

The teams that get ahead of this treat AI cost as an operational metric with an owner, a budget, and a review cadence, the same as latency or error rate. The ones that don’t will meet the number the way most teams meet it now: as a surprise, at the end of the month, attached to a workload nobody costed before they built it.

Questions this raises

How is AI cost management different from normal cloud cost work?

The unit changes and the control changes. You’re costing tokens, model requests, and GPU-hours instead of instances, and the spend scales with usage you don’t directly provision. A single inefficient prompt or an agent stuck retrying can multiply cost with no deploy going out, so the old “size to measured load” reflex doesn’t cover it. The attribution and controls discipline transfers; the metering has to be rebuilt at the request level.

How do you attribute the cost of a shared model endpoint?

By measured per-request usage, the same way you split a shared EKS cluster. The endpoint is one line item no matter how well it’s tagged, so you need per-request data (which feature, how many tokens, at what model tier) to divide it. Tag the endpoint and you’ve attributed 100% of the spend to “the AI platform,” which is true and useless.

What’s the single most useful thing to do first?

Put a cost estimate in the design review, before the workload ships. The FinOps Foundation calls it shift-left costing, and it’s the request practitioners most often can’t fulfill. Catching an expensive architecture in review costs an afternoon; catching it in next month’s bill costs a migration.