LLM Value Benchmark: Cost Per Successful Outcome

TL;DR

We ran 14 AI models — closed frontier and open weights — through 420 graded document-extraction runs and ranked them by Cost Per Successful Outcome (CPSO): total dollars spent ÷ outputs that passed a deterministic grader, failures included in the bill. Cost per success spread 3.5 orders of magnitude on identical work while the token price sheet shows only ~70x. An open-weight model won outright, no model at any price cleared 70%, and the priciest model scored below the winner at ~2,500x the cost. If your AI value statements are built from a rate card and a quality leaderboard, they are wrong — this article shows by how much.

Key takeaways

CPSO — cost per successful outcome — diverges from cost-per-token by a factor of 35 on this workload: the rate card shows a ~70x spread, outcomes show ~2,500x.
An open-weight model (DeepSeek V4 Flash) tied for the best pass rate (70%) and posted the lowest cost per success ($0.0002), with a cost confidence interval clear of every frontier model.
No model at any price exceeded 70% on the task set — spend above the cheapest model at the quality ceiling bought nothing.
Price tier and quality were uncorrelated: the most expensive model scored 7 points below the winner at ~2,500x the cost per success.
One model failed 100% of tasks on output formatting alone — invisible to quality leaderboards, caught immediately by outcome-level grading.
Failures are spend: AI bills you for wrong answers and again for retries. Any unit-economics model that drops failures from the numerator understates real cost.

Why we ran this now

Tokenomics went mainstream this week. The Linux Foundation announced the Tokenomics Foundation at FinOps X, in partnership with the FinOps Foundation, citing companies already 3x over their entire 2026 token budgets. And in the Day 1 keynote, SAP's Frederik Pohl and Maida Nazifi described running FinOps for AI at global scale through an AI cost control plane managed by cost per outcome — "because GPUs and LLMs don't behave quite like VMs."

That is a definition you can build on, and it deserves comparison data. So we built a benchmark around it: every model does the same real work, every answer is graded deterministically, and every dollar — including the wasted ones — lands in the numerator.

The metric: CPSO

CPSO = total dollars spent across all attempts (passes and failures) ÷ outputs that passed the grader.

Failures stay in the bill because that is how your invoice works. A cheap model that fails half its tasks is expensive per success. This is the LLM version of the unit-economics translation FinOps already performs for cloud: from rate card to cost per transaction, per customer, per order.
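As a minimal sketch of the definition above, assuming each attempt is recorded with an estimated cost and a pass/fail flag (the `Run` type and the example figures are hypothetical, not taken from the benchmark harness):

```python
import math
from dataclasses import dataclass


@dataclass
class Run:
    cost_usd: float  # estimated spend for this attempt, pass or fail
    passed: bool     # True if the deterministic grader accepted the output


def cpso(runs: list[Run]) -> float:
    """Total spend across all attempts divided by the number of passes."""
    total_cost = sum(r.cost_usd for r in runs)
    successes = sum(1 for r in runs if r.passed)
    # With no successes there is no finite cost per success.
    return math.inf if successes == 0 else total_cost / successes


# Hypothetical numbers: 30 attempts at $0.01 each, 15 of them passing.
example = [Run(0.01, i < 15) for i in range(30)]
print(round(cpso(example), 4))
```

In this made-up case the model costs $0.01 per attempt but $0.02 per success, because the 15 failed attempts stay in the numerator.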

The setup

14 models, spanning closed frontier (OpenAI, Anthropic, Google) and hosted open weights (DeepSeek, Qwen, GLM, Kimi), ran an identical document-extraction workload: 10 tasks (invoices, receipts, purchase orders, shipping notices, résumés), 3 runs each, temperature 0, one shared neutral prompt. That is 30 graded runs per model and 420 in total. Grading is fully deterministic: JSON schema validation plus normalized field comparison. There are no LLM judges and no human scoring, and the suite ran unattended overnight.
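A deterministic grader of this kind might look roughly like the sketch below. The field names, the all-strings schema, and the normalization rules are assumptions for illustration; the benchmark's actual schema and comparison logic are not shown in this article.

```python
import json

# Hypothetical required fields; the benchmark's real schema may differ.
REQUIRED_FIELDS = {"invoice_number": str, "total": str, "currency": str}


def normalize(value: str) -> str:
    """Assumed normalization: trim, lowercase, collapse internal whitespace."""
    return " ".join(value.strip().lower().split())


def grade(raw_output: str, expected: dict[str, str]) -> bool:
    """Pass only if the output is bare JSON, fits the schema, and every field agrees."""
    try:
        parsed = json.loads(raw_output)  # surrounding fences or prose fail here
    except json.JSONDecodeError:
        return False
    if not isinstance(parsed, dict) or set(parsed) != set(REQUIRED_FIELDS):
        return False
    for field, field_type in REQUIRED_FIELDS.items():
        if not isinstance(parsed[field], field_type):
            return False
        if normalize(parsed[field]) != normalize(expected[field]):
            return False
    return True


expected = {"invoice_number": "INV-001", "total": "120.50", "currency": "EUR"}
good = '{"invoice_number": " inv-001", "total": "120.50", "currency": "eur"}'
fence = "`" * 3  # built this way to avoid a literal fence inside the example
fenced = f"{fence}json\n{good}\n{fence}"
print(grade(good, expected), grade(fenced, expected))
```

The second call fails at the parsing step even though the extracted values are right, which is the kind of formatting-only failure an end-to-end grader surfaces.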

The ranking

#	Model	Est. CPSO	95% CI	Pass rate	Basis
1	DeepSeek V4 Flash	$0.0002	$0.0002–$0.0004	70%	open, hosted
2	Gemini 3.1 Flash-Lite	$0.0008	$0.0005–$0.0014	67%	closed
3	DeepSeek V4 Pro	$0.0053	$0.0037–$0.0105	60%	open, hosted
4	Gemini 3 Flash	$0.0059	$0.0041–$0.0099	70%	closed
5	Kimi K2.6	$0.0085	$0.0051–$0.0182	60%	open, hosted
6	GLM-5.1	$0.0086	$0.0055–$0.0170	63%	open, hosted
7	Qwen 3.5 397B	$0.0162	$0.0105–$0.0319	63%	open, hosted
8	Gemini 3.1 Pro	$0.0317	$0.0224–$0.0561	70%	closed
9	GPT-5.4	$0.0970	$0.0624–$0.2160	50%	closed
10	Sonnet 4.6	$0.1286	$0.0855–$0.2581	60%	closed
11	GPT-5.5	$0.1656	$0.1096–$0.3343	60%	closed
12	Opus 4.8	$0.3503	$0.2247–$0.8007	53%	closed
13	Fable 5	$0.5912	$0.4150–$1.1227	63%	closed
14	Haiku 4.5	∞	—	0%	closed (formatting failures)

Most adjacent ranks have overlapping confidence intervals and should not be over-read; the exceptions are ranks 1 and 2, 2 and 3, and 8 and 9, whose intervals do not overlap. The top-of-table versus frontier-tier gaps are CI-separated.
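The overlap between neighbouring rows can be checked directly from the published intervals. The sketch below copies the 95% CI bounds from the table and lists the adjacent pairs whose intervals do not overlap; it treats the intervals as closed ranges, which is an assumption.

```python
# (rank, model, ci_low, ci_high) copied from the ranking table.
# Haiku 4.5 is omitted because it has no interval.
INTERVALS = [
    (1, "DeepSeek V4 Flash", 0.0002, 0.0004),
    (2, "Gemini 3.1 Flash-Lite", 0.0005, 0.0014),
    (3, "DeepSeek V4 Pro", 0.0037, 0.0105),
    (4, "Gemini 3 Flash", 0.0041, 0.0099),
    (5, "Kimi K2.6", 0.0051, 0.0182),
    (6, "GLM-5.1", 0.0055, 0.0170),
    (7, "Qwen 3.5 397B", 0.0105, 0.0319),
    (8, "Gemini 3.1 Pro", 0.0224, 0.0561),
    (9, "GPT-5.4", 0.0624, 0.2160),
    (10, "Sonnet 4.6", 0.0855, 0.2581),
    (11, "GPT-5.5", 0.1096, 0.3343),
    (12, "Opus 4.8", 0.2247, 0.8007),
    (13, "Fable 5", 0.4150, 1.1227),
]


def overlaps(a: tuple, b: tuple) -> bool:
    """True if the two closed intervals share at least one point."""
    return a[2] <= b[3] and b[2] <= a[3]


separated = [
    (a[0], b[0])
    for a, b in zip(INTERVALS, INTERVALS[1:])
    if not overlaps(a, b)
]
print(separated)  # [(1, 2), (2, 3), (8, 9)]
```

Non-overlap is a conservative test: two rows with overlapping intervals may still differ, but the table alone cannot show it.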

What the numbers say

1. The rate card understates the economics by 35x

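Cost per success ranged from $0.0002 to $0.59 on identical work, about 3.5 orders of magnitude. The token price sheet for these same models spans roughly 70x. Part of the gap is failures: a model that misses more tasks bills you for every miss, and for every retry, before it delivers a success. But pass rates among the ranked models run only from 50% to 70%, which on its own moves cost per success by a factor of 1.4 at most. The rest has to come from spend per attempt, meaning how many tokens each model consumes on the same task, and the rate card does not show that either.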
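The spread can be reproduced, and split into its two parts, from the rounded table values. This sketch relies on the identity CPSO = cost per attempt ÷ pass rate, and it assumes the published pass rates correspond to 21 and 19 passes out of 30.

```python
import math

# (estimated CPSO in USD, passes out of 30), rounded as published in the table.
cheapest = (0.0002, 21)  # DeepSeek V4 Flash, 70%
priciest = (0.5912, 19)  # Fable 5, 63%

cpso_ratio = priciest[0] / cheapest[0]
orders = math.log10(cpso_ratio)

# CPSO = cost per attempt / pass rate, so cost per attempt = CPSO * pass rate.
per_attempt_ratio = (priciest[0] * priciest[1] / 30) / (cheapest[0] * cheapest[1] / 30)
pass_rate_factor = cpso_ratio / per_attempt_ratio  # reduces to 21 / 19

print(
    f"CPSO spread {cpso_ratio:.0f}x ({orders:.2f} orders of magnitude); "
    f"per-attempt spend {per_attempt_ratio:.0f}x; pass rate {pass_rate_factor:.2f}x"
)
```

With the rounded inputs this gives a spread near 2,956x rather than ~2,500x; the article's figure presumably uses unrounded costs, since $0.0002 is shown to a single significant digit. Either way, nearly all of the spread between these two rows sits in spend per attempt, with pass rate contributing about 1.1x.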

2. An open-weight model won outright

DeepSeek V4 Flash posted the field's lowest cost per success while tying for its best pass rate. This was not a "good enough for the price" result: no closed frontier model beat it on quality, and every one of them cost more per success.

3. There is a quality ceiling, and everyone is at it or below it

Three models tied at 70%; nothing beat it at any price. When no model clears the ceiling on your workload, the marginal dollar above the cheapest ceiling-level model has negative ROI. Finding that ceiling per workload is exactly the kind of measurement a FinOps team can own.
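Finding the cheapest model at the ceiling is a two-step selection over the results. A minimal sketch using the table's figures, with pass counts derived from the published percentages over 30 runs (an assumption, since the table reports percentages only):

```python
import math

# (model, estimated CPSO in USD, passes out of 30) from the ranking table.
RESULTS = [
    ("DeepSeek V4 Flash", 0.0002, 21),
    ("Gemini 3.1 Flash-Lite", 0.0008, 20),
    ("DeepSeek V4 Pro", 0.0053, 18),
    ("Gemini 3 Flash", 0.0059, 21),
    ("Kimi K2.6", 0.0085, 18),
    ("GLM-5.1", 0.0086, 19),
    ("Qwen 3.5 397B", 0.0162, 19),
    ("Gemini 3.1 Pro", 0.0317, 21),
    ("GPT-5.4", 0.0970, 15),
    ("Sonnet 4.6", 0.1286, 18),
    ("GPT-5.5", 0.1656, 18),
    ("Opus 4.8", 0.3503, 16),
    ("Fable 5", 0.5912, 19),
    ("Haiku 4.5", math.inf, 0),
]

# Step 1: the ceiling is the best pass count any model reached.
ceiling = max(passes for _, _, passes in RESULTS)

# Step 2: among models at the ceiling, take the lowest cost per success.
at_ceiling = [row for row in RESULTS if row[2] == ceiling]
best = min(at_ceiling, key=lambda row: row[1])

print(ceiling, [name for name, _, _ in at_ceiling], best[0])
```

On this table the ceiling is 21 of 30, three models reach it, and the cheapest of the three is DeepSeek V4 Flash. The same selection can be rerun whenever prices or models change.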

4. Price and quality were uncorrelated

The most expensive model in the field scored 7 points below the winner at ~2,500x the cost per success. Across all 14 models, paying more predicted nothing about output quality on this workload.

5. The failure mode leaderboards can't see

One model returned plausible extractions on every task — wrapped in markdown code fences the instructions explicitly forbade. 100% failure rate, zero intelligence problem. No quality leaderboard surfaces this; 30 graded runs caught it immediately. Output discipline is a separate axis from capability, and it only shows up when you grade outcomes end to end.

What this means for FinOps practitioners

Routing this workload to the value leader instead of a frontier model cuts cost per successful document by ~99.9% with zero quality loss. That is a governable decision — but only if someone in the room can read cost-per-outcome data, understands why it diverges from the rate card by orders of magnitude, and can explain a formatting-failure ∞ honestly to a CFO.

That someone is FinOps. Tokenomics is not cloud FinOps with a find-and-replace; it is a new measurement discipline. The practitioners who build outcome-level measurement now — before the standards bodies finish defining the vocabulary — will be the ones the business trusts on AI spend.

Frequently asked questions

What is Cost Per Successful Outcome (CPSO)?

CPSO is total spend across all attempts — passes and failures — divided by the number of outputs that passed a deterministic grader. It is the AI equivalent of cloud unit economics like cost per transaction, and it diverges sharply from cost-per-token because failed runs still bill you.

How were the costs in this benchmark calculated?

Provider-reported token counts multiplied by published list prices (official API rates as of 2026-06-09). Runs were executed via subscription plans, not invoice-reconciled — costs are labeled estimated throughout, and rows missing provider usage data are excluded from cost rather than guessed.
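The estimate described above amounts to the following sketch. The model name, the prices, and the token counts are hypothetical placeholders, and per-million-token pricing for input and output is an assumption about how the list prices are quoted.

```python
from typing import Optional

# Hypothetical list prices in USD per million tokens: (input, output).
PRICES = {"model-a": (0.50, 1.50)}


def estimate_cost(
    model: str,
    input_tokens: Optional[int],
    output_tokens: Optional[int],
) -> Optional[float]:
    """Estimated USD cost of one run, or None when usage data is missing."""
    if input_tokens is None or output_tokens is None:
        return None  # excluded from cost rather than guessed
    in_price, out_price = PRICES[model]
    return (input_tokens * in_price + output_tokens * out_price) / 1_000_000


# Two hypothetical rows: one with provider usage data, one without.
rows = [("model-a", 2000, 400), ("model-a", None, None)]
costs = [c for c in (estimate_cost(*row) for row in rows) if c is not None]
print(costs)
```

Only the first row contributes a cost; the second is dropped instead of being filled with a guess.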

Why do failures stay in the numerator?

Because the invoice keeps them there. A failed extraction consumes tokens and gets billed exactly like a successful one — then you pay again for the retry, and possibly for human rework. Dropping failures from a unit-cost metric systematically understates what an outcome really costs.

Does this mean open-weight models are always cheaper per outcome?

No — it means you cannot know without measuring your own workload. On this extraction task an open-weight model won both axes; on a different task family the ranking can invert. The transferable result is the method: grade outcomes deterministically, keep failures in the bill, and re-run as prices and models move.

Published on hybridfinops.com — an independent publication.
