
Most enterprise AI budgets were modeled on single-call chatbot assumptions. Agentic systems — where a model plans, acts, checks results, and retries — burn tokens in loops, not lines. Gartner confirmed in March 2026 that agentic tasks require 5–30× more processing than chatbot-era tools, and the Uber coding budget story is the most visible casualty so far. The fix isn't picking a cheaper model. It's rethinking loop architecture before you sign the next contract.
Uber's engineering team didn't get surprised by a vendor price hike. The per-token rates they were paying were exactly what the contracts said. What nobody had modeled was how many tokens an agentic coding workflow actually consumes per completed task, once you account for planning calls, tool invocations, context accumulation, and retry loops. By the time the budget alarm went off, four months of a twelve-month AI coding budget were gone.
This wasn't a procurement failure. It was a modeling failure. The budget was built on chatbot-era assumptions: one prompt in, one response out, multiply by volume. Agentic systems don't work that way, and the gap between those two mental models is where most mid-market AI budgets are quietly breaking right now.
The Uber situation is notable because the scale makes it visible. Most mid-market companies are replicating the same structural error at smaller volumes, with less observability, and won't notice until they're already over budget on a workflow they committed to scaling.
Gartner's March 2026 finding was direct: agentic systems require 5–30× more processing per task than chatbot-era tools. That range isn't vague hedging. It reflects a real architectural spread across agent types.
Simple tool-use agents, the kind that look up a record and return a formatted result, sit toward the low end of that range. Multi-step planning agents that reason about a goal, break it into subtasks, execute each one, evaluate the output, and retry on failure push toward the high end. The 30× figure isn't a worst-case outlier. It's a normal outcome for coding agents, research agents, and any workflow where the agent is making sequential decisions against uncertain tool responses.
The critical distinction most teams miss is the difference between token cost per call and token cost per completed task. Vendor pricing pages, demo environments, and ROI calculators are almost universally built around per-call framing. That's the number that looks reasonable in a spreadsheet. The per-task number, which is what actually appears on your invoice, is the per-call number multiplied by your loop depth. Most teams don't know their loop depth.
Take a concrete task: draft and send a sales follow-up email with a CRM lookup and calendar check. In a chatbot pattern, a user pastes context into the prompt, the model generates a draft, and the user reviews it and sends it manually. That's roughly 1,000 input tokens and 500 output tokens. One call. At GPT-5.5's published output pricing of $30 per million tokens, the output side comes to about 1.5 cents per task. At Claude Opus 4.7's $25 per million output tokens, about 1.3 cents. At DeepSeek V4-Pro's $1.76 per million output tokens, under a tenth of a cent.
The same task in an agentic pattern looks different. The agent receives the goal, generates a plan (call 1), queries the CRM tool (call 2), checks the calendar API (call 3), evaluates whether the retrieved data is sufficient (call 4), and drafts the email with accumulated context (call 5). If the CRM lookup fails or returns ambiguous data, there's a retry. Each call carries the growing context window from prior steps. Conservatively, that's five calls with context accumulation. At a 15× loop multiplier, a task that cost a cent or two on a frontier model now costs roughly 19 to 23 cents.
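To see why context accumulation matters, here is a minimal sketch that totals the tokens billed across a loop. It uses the 1,000-input / 500-output baseline from the example and assumes, purely for illustration, that each step adds 500 tokens to the context re-sent on the next call. The function name and parameters are hypothetical, not taken from any vendor SDK.

```python
from typing import Tuple


def agent_token_totals(base_input: int, tokens_added_per_step: int,
                       output_per_call: int, calls: int) -> Tuple[int, int]:
    """Return (total_input_tokens, total_output_tokens) billed across a loop.

    Each call re-sends the whole context so far, so input tokens grow
    with every step even if the per-call output stays flat.
    """
    total_input = 0
    total_output = 0
    context = base_input
    for _ in range(calls):
        total_input += context            # the full context is billed again
        total_output += output_per_call
        context += tokens_added_per_step  # assumed growth per step
    return total_input, total_output


chatbot = agent_token_totals(1000, 0, 500, calls=1)
agent = agent_token_totals(1000, 500, 500, calls=5)
print(chatbot, agent)
```

Under these assumptions the five-call agent bills 10,000 input and 2,500 output tokens against the chatbot's 1,000 and 500, roughly 8× the total before a single retry.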
The table below illustrates how agentic AI token costs scale across loop multipliers for the same base task. It counts output tokens only, scaling the 500-token chatbot baseline by each multiplier; input tokens would add to every cell. These are illustrative ranges derived from published pricing, not production benchmarks. Actual costs depend on context window size, tool call overhead, and your specific retry rate.
| Loop Multiplier | GPT-5.5 ($30/M output) | Claude Opus 4.7 ($25/M output) | DeepSeek V4-Pro ($1.76/M output) |
|---|---|---|---|
| 1× (chatbot baseline) | ~$0.015 / task | ~$0.013 / task | ~$0.001 / task |
| 5× (simple tool-use agent) | ~$0.075 / task | ~$0.063 / task | ~$0.004 / task |
| 15× (multi-step planning agent) | ~$0.225 / task | ~$0.188 / task | ~$0.013 / task |
| 30× (deep loop / high retry) | ~$0.450 / task | ~$0.375 / task | ~$0.026 / task |
The insight buried in that table: at a 5× loop multiplier, and even at 15×, DeepSeek V4-Pro still costs less per task than the chatbot baseline on GPT-5.5. At 30×, even DeepSeek becomes non-trivial at scale if you're running tens of thousands of tasks per day. The model choice conversation changes completely once loop depth is in the equation.
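The table's arithmetic can be reproduced with a few lines. This sketch uses only the output prices quoted above and the 500-token baseline; the daily task volume in the usage note is a hypothetical figure, not a measurement.

```python
# Output prices quoted in the article, in USD per million output tokens.
PRICE_PER_M_OUTPUT = {
    "GPT-5.5": 30.00,
    "Claude Opus 4.7": 25.00,
    "DeepSeek V4-Pro": 1.76,
}
BASELINE_OUTPUT_TOKENS = 500  # chatbot baseline from the example task
LOOP_MULTIPLIERS = (1, 5, 15, 30)


def output_cost_per_task(price_per_m: float, multiplier: int,
                         output_tokens: int = BASELINE_OUTPUT_TOKENS) -> float:
    """Output-token cost of one completed task at a given loop multiplier."""
    return output_tokens * multiplier * price_per_m / 1_000_000


def daily_cost(price_per_m: float, multiplier: int, tasks_per_day: int) -> float:
    """Daily output-token spend for an assumed task volume."""
    return output_cost_per_task(price_per_m, multiplier) * tasks_per_day


for multiplier in LOOP_MULTIPLIERS:
    cells = ", ".join(
        f"{model}: ${output_cost_per_task(price, multiplier):.3f}"
        for model, price in PRICE_PER_M_OUTPUT.items()
    )
    print(f"{multiplier}x -> {cells}")
```

As a usage example, `daily_cost(1.76, 30, 50_000)` comes to about $1,320 per day in output tokens alone, which is what "non-trivial at scale" looks like for the cheapest model in the table.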
Most enterprise AI procurement conversations in 2026 are about model selection: which frontier model, which pricing tier, which vendor contract. For agentic workloads, that's the second-order question. The first-order question is loop architecture.
The real cost levers are how many steps the agent takes per task, how large the context window is at each step, and how often it retries when tools fail or outputs get rejected. A team that optimizes all three of those variables will outperform a team that simply found a cheaper model, almost every time.
In one engagement, a team had switched from GPT-4 to a significantly cheaper model and was proud of the cost reduction on their API dashboard. When we instrumented their agentic loop at the step level, we found a 12× loop multiplier that nobody had measured. The model swap had saved them roughly 15% on token costs. The loop architecture was responsible for a cost multiple that dwarfed that saving by an order of magnitude.
The levers that actually move the number are prompt compression (reducing context window size at each step), early exit conditions (stopping the loop when confidence thresholds are met), tool call batching (combining multiple lookups into a single call where the API supports it), and caching repeated context that doesn't change between steps. None of those optimizations require a vendor negotiation. All of them require visibility into what the agent is actually doing at each step.
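Of those levers, the early exit condition is the easiest to sketch. The following is a minimal illustration, assuming a hypothetical `step` callable that returns a confidence score and the tokens it consumed; the threshold and step cap are placeholder values you would tune per workflow.

```python
from typing import Callable, List, Tuple

# A step takes its index and returns (confidence, tokens_used). Hypothetical shape.
StepFn = Callable[[int], Tuple[float, int]]


def run_with_early_exit(step: StepFn, confidence_threshold: float = 0.9,
                        max_steps: int = 8) -> Tuple[int, int, bool]:
    """Run the loop until confidence is met or the step cap is hit.

    Returns (steps_taken, total_tokens, met_threshold).
    """
    total_tokens = 0
    for index in range(max_steps):
        confidence, tokens = step(index)
        total_tokens += tokens
        if confidence >= confidence_threshold:
            return index + 1, total_tokens, True  # stop as soon as we're done
    return max_steps, total_tokens, False         # hard cap bounds the spend


def scripted_step(confidences: List[float], tokens_per_step: int) -> StepFn:
    """Stand-in for a real agent step, replaying fixed confidence scores."""
    def step(index: int) -> Tuple[float, int]:
        return confidences[min(index, len(confidences) - 1)], tokens_per_step
    return step


result = run_with_early_exit(scripted_step([0.4, 0.7, 0.92, 0.95], 1500))
print(result)  # (3, 4500, True)
```

The hard cap matters as much as the threshold: a loop that never reaches confidence still stops at a known, budgetable number of steps.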
This connects directly to the AI workflow automation category. The tools that matter most for cost control aren't the LLMs at the center of your agent. They're the orchestration and observability layers that let you instrument and constrain agent behavior before it burns budget at scale.
Most teams have observability on task success and failure but not on token consumption per step within a task. That's the gap. You need to know which step in the loop is expensive, not just whether the task finished. Honeycomb is well-suited for this work because it's built for high-cardinality, distributed event data, which is exactly what multi-step agent logs produce. Each agent step becomes a span; token counts become attributes on that span. You can then query across thousands of task runs to find which step types are consistently expensive.
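The shape of that data is simple. Here is a standard-library sketch, not Honeycomb's API: each `StepRecord` stands in for a span, its token counts stand in for span attributes, and the step names and numbers are invented sample data.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable


@dataclass
class StepRecord:
    """One agent step; in a tracing tool this would be a span."""
    task_id: str
    step_type: str
    input_tokens: int
    output_tokens: int


def tokens_by_step_type(records: Iterable[StepRecord]) -> Dict[str, int]:
    """Total tokens per step type, most expensive first."""
    totals: Dict[str, int] = defaultdict(int)
    for record in records:
        totals[record.step_type] += record.input_tokens + record.output_tokens
    return dict(sorted(totals.items(), key=lambda item: item[1], reverse=True))


# Invented sample data for two task runs.
records = [
    StepRecord("t1", "plan", 1000, 300),
    StepRecord("t1", "run_tests", 4000, 200),
    StepRecord("t1", "run_tests", 5200, 200),
    StepRecord("t2", "plan", 1000, 300),
    StepRecord("t2", "run_tests", 4100, 250),
]
step_totals = tokens_by_step_type(records)
print(step_totals)  # {'run_tests': 13950, 'plan': 2600}
```

Once records like these exist, "which step is expensive" becomes a group-by query rather than a guess.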
Group your workflows by type: research tasks, code generation, data extraction, customer-facing response generation. For each type, calculate the average loop depth and total token consumption per completed task, not per call. Grafana is useful here for building cost dashboards that surface per-workflow token burn over time rather than just aggregate API spend. The goal is a table that shows, for each workflow type, your real loop multiplier in production.
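A rough sketch of that per-workflow table, assuming your logs can be flattened into `(workflow_type, task_id, tokens)` rows with one row per model call. The row format and the sample values are assumptions for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

CallRow = Tuple[str, str, int]  # (workflow_type, task_id, tokens) per model call


def per_workflow_summary(calls: Iterable[CallRow]) -> Dict[str, Dict[str, float]]:
    """Average loop depth and tokens per completed task, by workflow type."""
    # (workflow, task_id) -> [call_count, total_tokens]
    per_task: Dict[Tuple[str, str], List[int]] = defaultdict(lambda: [0, 0])
    for workflow, task_id, tokens in calls:
        entry = per_task[(workflow, task_id)]
        entry[0] += 1
        entry[1] += tokens

    grouped: Dict[str, List[List[int]]] = defaultdict(list)
    for (workflow, _task_id), entry in per_task.items():
        grouped[workflow].append(entry)

    return {
        workflow: {
            "tasks": len(tasks),
            "avg_loop_depth": sum(t[0] for t in tasks) / len(tasks),
            "avg_tokens_per_task": sum(t[1] for t in tasks) / len(tasks),
        }
        for workflow, tasks in grouped.items()
    }


# Invented sample rows.
calls = [
    ("code_generation", "t1", 2000),
    ("code_generation", "t1", 3500),
    ("code_generation", "t1", 5000),
    ("code_generation", "t2", 2000),
    ("data_extraction", "t3", 1200),
]
summary = per_workflow_summary(calls)
print(summary)
```

Dividing `avg_tokens_per_task` by the tokens of a single-call baseline for the same workflow gives the loop multiplier to carry into the budget.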
High retry rates are a cost multiplier that almost never appears in standard API dashboards. A tool-call failure rate of 20% with automatic retry means about 25% more calls (1 / 0.8 = 1.25 expected attempts per call), and because each retry re-sends the failed attempt and its error in context, it can add roughly 40% to your token cost for that workflow before anything looks wrong on the surface. The retry cost is invisible until you're logging at the step level.
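The retry arithmetic can be sketched as follows. It assumes failures are independent and retried until success, and it introduces a hypothetical `retry_cost_ratio` for how much more a retry costs than a first attempt once the failed output is carried in context.

```python
def expected_attempts(failure_rate: float) -> float:
    """Expected attempts per successful call, retrying until success."""
    if not 0 <= failure_rate < 1:
        raise ValueError("failure_rate must be in [0, 1)")
    return 1 / (1 - failure_rate)


def retry_token_overhead(failure_rate: float, retry_cost_ratio: float = 1.0) -> float:
    """Fractional extra token cost caused by retries.

    retry_cost_ratio is the cost of a retry relative to a first attempt;
    values above 1.0 model the larger context a retry carries (assumed).
    """
    extra_attempts = expected_attempts(failure_rate) - 1
    return extra_attempts * retry_cost_ratio


print(retry_token_overhead(0.20))        # retries cost the same as first attempts
print(retry_token_overhead(0.20, 1.6))   # retries carry ~60% more context
```

At a 20% failure rate the overhead is about 25% when a retry costs the same as the first attempt, and about 40% when each retry is 1.6× as expensive, so the 40% figure implies retries that are noticeably heavier than first attempts.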
A mid-market SaaS company I worked with had deployed an AI coding agent they were reasonably happy with on quality. When we added step-level instrumentation, we found the agent was retrying failed test runs an average of 3.2 times per task. That retry behavior had never been visible in their dashboard. It was the single largest driver of their token spend, and it was fixable with a tighter exit condition on the test evaluation step.
Model choice isn't irrelevant to agentic AI token costs. It's just secondary to loop architecture. Once you've minimized unnecessary loop depth, model selection becomes a meaningful optimization. The right question isn't which model is cheapest per token. It's which model completes this specific task reliably in the fewest steps.
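A model that costs three times as much per token but completes a task in four steps instead of twelve is cheaper in practice. On flat per-step token counts the two would only break even (3 × 4 = 1 × 12); the pricier model wins because every step re-sends accumulated context, so steps five through twelve are the most expensive ones in the loop. That's the calculation most procurement teams aren't running, because they don't have per-task cost data. They have per-call cost data, which makes the cheaper model look obviously correct.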
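A small sketch makes the comparison concrete. The prices ($1 and $3 per million tokens), the 1,000-token starting context, and the 500 tokens added per step are all hypothetical values chosen only to show the shape of the calculation.

```python
def task_cost(price_per_m: float, steps: int, base_tokens: int = 1000,
              tokens_added_per_step: int = 500) -> float:
    """Cost of one task when every step re-sends the accumulated context."""
    total_tokens = 0
    context = base_tokens
    for _ in range(steps):
        total_tokens += context
        context += tokens_added_per_step
    return total_tokens * price_per_m / 1_000_000


cheap_model = task_cost(price_per_m=1.0, steps=12)   # low price, deep loop
pricey_model = task_cost(price_per_m=3.0, steps=4)   # 3x price, shallow loop
print(f"cheap model: ${cheap_model:.3f}  pricey model: ${pricey_model:.3f}")

# With no context growth the two are equal, which is the per-call view.
flat_cheap = task_cost(1.0, 12, tokens_added_per_step=0)
flat_pricey = task_cost(3.0, 4, tokens_added_per_step=0)
```

Under these assumptions the deep loop on the cheap model costs about $0.045 per task against $0.021 for the shallow loop on the expensive one, while the flat per-call view shows them tied at $0.012.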
DeepSeek V4-Pro at $1.76 per million output tokens is compelling for high-volume, well-structured agentic tasks where loop depth is controlled and the task doesn't require frontier-level reasoning. It's a poor fit for open-ended planning agents with high retry rates, where a more capable model's first-pass accuracy reduces loop depth enough to offset the price difference several times over.
For teams with the infrastructure capacity to self-host, Hugging Face is worth evaluating as a platform for open-weight models on agentic tasks. Self-hosting eliminates per-token costs entirely for certain workflow types, which changes the economics significantly for mid-market companies running high-volume, well-defined agentic loops. The tradeoff is infra overhead and the operational burden of model management.
AI coding tools deserve special mention here because coding agents are the highest-volume agentic workload for most engineering teams, and they're the most likely to hit the 30× multiplier in practice. The cycle of code generation, test execution, error interpretation, and retry is a deep loop by design. It's also a workflow where model first-pass accuracy has an outsized effect on total cost, because each failed test run triggers another full loop iteration.
Replace per-call cost estimates with per-task cost estimates, built from measured loop depth in staging or early production. That single change eliminates the structural error that burned Uber's coding budget and is quietly doing the same thing to mid-market teams at smaller scale.
Budget for retry overhead explicitly. Until you have production data showing otherwise, assume a baseline retry rate of 15–25% for tool-using agents. That assumption should be in your model as a line item, not buried in a contingency percentage.
Build in a loop multiplier range rather than a fixed number. A reasonable starting framework: budget at 5× for simple tool-use agents, 15× for multi-step planning agents, and treat anything approaching 30× as requiring architectural review before you commit to scaling it.
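Those three rules can be combined into a simple budget model. This is a sketch under stated assumptions: retry overhead is modeled as the retry rate times the base cost, the task volume and per-call cost are placeholders, and the 0.8 margin used to decide what counts as "approaching 30×" is an arbitrary illustrative choice.

```python
from typing import Dict, Tuple


def budget_range(tasks_per_month: int, per_call_cost: float,
                 multipliers: Tuple[float, float] = (5, 15),
                 retry_rates: Tuple[float, float] = (0.15, 0.25)
                 ) -> Dict[str, Dict[str, float]]:
    """Low and high monthly budget scenarios with retry as its own line item."""
    def scenario(multiplier: float, retry_rate: float) -> Dict[str, float]:
        base = tasks_per_month * per_call_cost * multiplier
        retry_overhead = base * retry_rate  # explicit line item, not contingency
        return {"base": base, "retry_overhead": retry_overhead,
                "total": base + retry_overhead}

    return {
        "low": scenario(multipliers[0], retry_rates[0]),
        "high": scenario(multipliers[1], retry_rates[1]),
    }


def needs_architecture_review(observed_multiplier: float, threshold: float = 30,
                              margin: float = 0.8) -> bool:
    """Flag workflows whose measured multiplier is approaching the threshold."""
    return observed_multiplier >= threshold * margin


budget = budget_range(tasks_per_month=100_000, per_call_cost=0.015)
print(budget["low"]["total"], budget["high"]["total"])
```

For 100,000 tasks a month at 1.5 cents per call, this gives a range of roughly $8,625 to $28,125, with the retry overhead visible as a separate number in each scenario.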
A finance team I worked with re-ran their annual AI spend projection using per-task cost modeling instead of per-call estimates, after we spent a week instrumenting their staging environment. Their Q3 estimate for an agentic workflow automation rollout had been off by a factor of four. The good news: they caught it before the rollout, not after. The instrumentation work took less time than the budget reforecast would have.
Run a 30-day instrumented pilot on any agentic workflow before committing budget at scale. The loop multiplier you observe in production is the number your budget should be built around, not the vendor's demo scenario, which is almost always a single-call happy path. PostHog is useful for teams that want to correlate agent cost with business outcome metrics rather than just API spend. It lets you track task completion rates and session-level token consumption alongside the product events that indicate whether the agent actually delivered value, which is the conversation your finance team will eventually want to have.
Adding step-level token instrumentation to every agentic workflow before the next budget planning cycle is the single highest-leverage change available to most mid-market companies today. Without that data, every conversation about model choice, vendor negotiation, or architecture redesign is informed guessing at best.
The companies that get agentic AI economics right won't be the ones that found the cheapest model. They'll be the ones that understood their loop depth first, built budgets around per-task costs instead of per-call costs, and treated retry rates as a first-class cost metric rather than an implementation detail.
Identify your three highest-volume agentic workflows, instrument them at the step level using Honeycomb or whatever distributed tracing tooling you already have, and run the per-task cost calculation for 30 days. That number, your actual loop multiplier in production, is the only budget input that matters for agentic workloads. Everything else is a chatbot-era assumption wearing an agentic label.
Independent consultant specializing in AI adoption for mid-market companies. Writes about practical implementation, ROI, and organizational change.