You Don't Have an LLM Provider Problem. You Have an Inference Layer Problem.

Most teams obsess over which model to use. The Saturday-morning incident is almost never about the model. It's about the layer between your app and the model — the layer nobody draws.

The page sat there for nineteen seconds doing nothing. Then it returned a 502. The next request did the same, and the next, and somewhere in the third minute a customer success manager pinged the on-call channel to ask why the chat-assist feature was down. By the time we figured out what was happening — a rate-limit threshold we did not know existed had quietly tightened on the model provider's side, and our application had no fallback — the incident was forty-three minutes old, the Slack channel was full of executives, and a quarter of the day's pipeline had already moved on to a competitor.

We had spent six weeks earlier in the year picking between three model providers. We had run the evals, written the migration plan, set up cost tracking. The decision document was forty pages long. None of those pages were about what to do when the provider became unavailable, because we had implicitly assumed that the model provider was the architecture, when in fact the architecture was something else, sitting in the layer between our application and the model, and we did not have one.

This is the most common production-readiness failure I see in 2026 in AI-using teams. The conversation about "which model" has absorbed all the strategic oxygen in the room, and the conversation about how the application talks to the model — what we call the inference layer — has been left to the engineer who got the ticket on Tuesday. Then production runs into the inference layer, and the inference layer is where everything breaks.

The three-layer stack nobody draws

If you draw your AI architecture honestly, it has three layers, not one. The model is at the bottom: Anthropic's Claude, OpenAI's GPT family, Cohere, Mistral, an open-weight model you self-host. The orchestration is at the top: your app code, your prompt templates, your retrieval logic, your function-calling schema. Between them sits the inference layer — the piece responsible for getting bytes from your app to the model and back, reliably, within budget, and within your latency target.

In most teams I have audited, this middle layer does not appear in the architecture diagram. It is an HTTP client wrapped in three lines of retry logic that an engineer wrote in a feature branch eighteen months ago. The diagram shows "App" with an arrow to "Claude," and the arrow does not have a label. The arrow is doing all the work.

The four production failures the inference layer owns

I keep a running list of the postmortems I have written or read in the last year that traced back to this layer. They cluster into four shapes.

Provider-side rate limiting. Every model provider has tier-based rate limits, organization-level limits, per-region limits, and throttling that kicks in based on burst patterns the documentation does not describe in detail. None of these limits are visible in your monitoring until you cross them. The first time you cross one, the symptom is 429s with retry-after headers your client may or may not respect. The second time, the symptom is silent latency increase as the provider applies backpressure invisibly. By the third time, you have a Saturday-morning incident.

Provider-side outages. Model providers are operationally good but not Tier-1-cloud good. The 99.9% claim in the SLA does not include the soft outages — the partial regional degradations, the model-version unavailability, the silent rollouts that change the model behind your existing model identifier. If your inference layer cannot detect and route around these, your application's apparent reliability is bounded by the provider's actual reliability, which is lower than the marketing page implies.

Cost drift. The per-token bill for production AI applications grows in three ways: prompt growth as features ship, retry storms when the inference layer's error handling is naive, and silent context-window inflation when retrieval pipelines start returning more chunks than the prompt template expected. Without an inference layer that meters and reports, the cost is invisible until the monthly invoice. With an inference layer that meters, you can attribute every dollar to a feature, an endpoint, or a customer segment, which is the only basis on which a finance lead will let you scale.

Latency variance. Production AI features live or die by p99 latency, not p50. The inference layer is where p99 happens — provider queueing, network hops, model cold starts, streaming buffer flushes. Teams that monitor only mean latency miss the variance entirely, and the variance is what makes the feature feel unusable to the bottom 10% of users on a busy afternoon.

The four shapes of an inference layer

Once you have decided that the inference layer is real and worth designing, you have four architectural choices. They correspond, roughly, to four classes of vendor in our catalog.

Direct provider, naive client. You call OpenAI or Anthropic directly with the SDK, you wrap it in your own retry logic, and you build everything else yourself. This is the cheapest layer to start and the most expensive to operate, because the operational work shifts onto your engineering team and gets re-discovered on every incident.

// The naive client most teams ship to prod
async function ask(prompt) {
  const res = await client.messages.create({
    model: 'claude-sonnet-4-6',
    messages: [{ role: 'user', content: prompt }]
  });
  return res.content[0].text;
}
// What's missing: timeouts, retries, fallbacks,
// circuit breakers, cost tracking, latency tracking,
// concurrency limits, auth rotation, region failover.

Managed routing. You sit a routing service between your app and the model providers. OpenRouter is the most-cited example in our catalog. The routing service handles provider selection, key management, and basic fallback, and gives you a single API surface across providers. The tradeoff is an extra hop in the latency budget, an extra trust boundary, and pricing that adds a margin to every call. The benefit is operational uniformity: when Claude is degraded, your app does not care.

Dedicated inference platforms. Vendors like Fireworks AI, Together AI, Replicate, and Groq sit one tier deeper. Instead of routing across providers, they own the inference stack themselves — typically optimized for open-weight models, with control over batching, quantization, hardware (Groq's LPUs being the obvious example), and per-request latency. The tradeoff is that you lose access to the closed frontier models. The benefit is that for the workloads you can run on open weights, the cost-per-token can be 5–10× lower and the latency p99 can be 3–5× lower than calling the same-sized model through a frontier provider. For high-volume workloads, this is the rational choice and most teams arrive at it eventually.

# Inference layer with explicit routing + failover
PRIMARY_PROVIDER = "fireworks"
FALLBACK_CHAIN = ["together", "openrouter:claude"]
RETRY_BUDGET_MS = 8000
PER_FEATURE_COST_CAP_USD = 0.04

async def infer(feature_id, prompt):
    for provider in [PRIMARY_PROVIDER, *FALLBACK_CHAIN]:
        deadline = monotonic() + (RETRY_BUDGET_MS / 1000)
        try:
            cost = estimate_cost(provider, prompt)
            if cost > PER_FEATURE_COST_CAP_USD:
                emit_metric("cost_cap_skip", feature_id, provider)
                continue
            return await call_with_circuit_breaker(provider, prompt, deadline)
        except (RateLimited, ProviderUnavailable, Timeout) as e:
            emit_metric("provider_failover", feature_id, provider, type(e).__name__)
            continue
    raise InferenceLayerExhausted(feature_id)

Self-hosted inference. You run open-weight models on your own GPUs (or a hyperscaler's GPUs that you reserve), behind your own serving layer — vLLM, Triton, or one of the managed-self-hosted offerings on Bedrock or Vertex AI. The tradeoff is operational scope: you now own a fleet of inference servers and the failure modes of GPU drivers, model loading, and tensor-parallel correctness. The benefit is full data sovereignty, predictable cost at high volume, and no provider-side rate limit that you cannot adjust by buying another node.

How to choose between the four

The choice maps cleanly to two questions and one constraint.

The first question is about the closed frontier. If you must use closed frontier models — because the eval delta on your task is large and your competition is using them — you need direct provider or managed routing. The other two layers do not host closed frontier weights. If your task can be served by open weights at a quality tier you have actually evaluated against, the second two options are open to you, and the cost math gets dramatically better.

The second question is about volume. Below roughly 100 million tokens per day, the operational complexity of dedicated or self-hosted inference rarely pays for itself. Above that, the per-token economics of frontier APIs become uncomfortable, and dedicated inference becomes both a cost and a reliability win. The crossover point is workload-specific but almost never below 10 million tokens per day for any serious production application.

The constraint is data residency. If you cannot send tokens across a regulatory boundary, the four-shape decision collapses to "self-hosted, in your region." This is the constraint that turns the inference layer conversation into a deployment-architecture conversation, and it is the one finance leads forget to mention until the security review.

What to evaluate, in order

The standard evaluation flow I recommend, in the order that catches the most problems with the least re-work.

Failure mode under provider unavailability. Pull the network plug. Does your app degrade gracefully, queue, or 502? The behavior here is the inference layer's actual contract.
Behavior under rate limiting. Saturate the provider with a load test. Watch for retry storms, watch for silent latency growth, watch for whether your monitoring even noticed.
Cost attribution per feature. Can you produce a dashboard tomorrow morning that says "feature X cost $Y last week"? If not, your inference layer is not yet capable of supporting a finance conversation, which is a soft failure that becomes a hard one at the next budget cycle.
Latency variance under realistic traffic. Mean is uninformative. Look at p50, p95, p99, p99.9. The shape of the distribution is the user experience.
Provider portability. Pick a feature, switch its inference provider in production, and see how long the change takes. Anything more than a config change is a tell that you have couplings you have not paid for yet.

The vendor map

For reference, the inference-layer-relevant products in our catalog cluster as follows:

Layer role	Vendors
Closed frontier providers	Anthropic, OpenAI, Cohere
Managed routing	OpenRouter
Dedicated inference (open-weight focus)	Fireworks AI, Together AI, Replicate, Groq
Hyperscaler-managed self-host	Bedrock, Vertex AI
Open-weight model hub	Hugging Face
Open-weight providers	Mistral AI

The 3am test

The fastest way to evaluate any inference-layer design is to imagine it is 3am on a Saturday and the model provider has just published a status-page incident saying "investigating elevated error rates in us-east-1." What happens to your application in the next nine minutes? If the answer is "it serves cached or degraded responses while routing traffic to a fallback provider, and the on-call gets one Slack message that a fallback was triggered," your inference layer is doing its job. If the answer is "it serves 502s until someone notices and rolls something back," your inference layer is the HTTP client an engineer wrote in a feature branch eighteen months ago, and the postmortem has already been written. You just have not had the incident yet.

The model providers are not your problem. They are everyone's problem, equally, on Saturday morning. The thing that distinguishes teams who ship reliable AI features from teams who ship demos is the layer in the middle, the layer nobody draws, the layer that only becomes visible at the moment it fails. Draw it on the architecture diagram. Give it a name. Make it someone's responsibility before it becomes everyone's incident.

You Don't Have an LLM Provider Problem. You Have an Inference Layer Problem.

The three-layer stack nobody draws

The four production failures the inference layer owns

The four shapes of an inference layer

How to choose between the four

What to evaluate, in order

The vendor map

The 3am test

Discussion

Author

Recent Posts

When the Panel Splits 4 Points: Stripe, Datadog, Figma & Perplexity

The Free-Tier Premium: Why Our Highest-Scored AI Tools Cost the Least

Why Hidden-Pricing Software Hits an 8.15 Ceiling on Our Review Panel

More from the Blog