One Governance Policy for All Your AI Agents Is Exactly How They Fail

June 20, 2026 · 15 min read · Industry Trends

Gartner projects that 40% of enterprises will demote or decommission autonomous AI agents by 2027, and this essay argues that uniform governance policies are a leading cause. Treating a read-only research agent with the same rules as one that can commit code, send messages, or move money isn't neutral. It's a failure mode baked in at the architecture level. The structural fix proposed here is autonomy-tiered governance.

Why Are So Many Enterprise AI Agents Getting Shut Down?

Gartner projects that 40% of enterprises will demote or decommission autonomous AI agents by 2027. That figure comes from a 2024 Gartner press release on agentic AI adoption, and it is worth sitting with before treating it as a simple indictment of the technology. The projection measures something specific: enterprises that deploy agents and then pull them back. It does not measure agents that never worked. It measures agents that were running, that someone decided were too risky or too unreliable to keep running, and that were then shut down.

That distinction matters enormously. A failure of the underlying model is one kind of problem. The model hallucinates, produces unreliable outputs, or simply cannot perform the task it was designed for. That is a technology problem, and it is largely solvable through better evaluation, better fine-tuning, and better model selection. But a governance policy failure is a different kind of problem entirely. Governance fails when constraints are either too loose, allowing an agent to cause real harm in the world, or too tight, making the agent so restricted that it cannot complete its purpose. Both failure modes produce the same outcome: the agent gets shut down. Only one of them is the model's fault.

The core problem driving that 40% figure is conceptual before it is technical. Most enterprises approach AI agent governance as a single policy question: what rules apply to AI agents? They write a document, apply it uniformly, and move on. The policy treats every agent as an equivalent risk object, subject to the same approval gates, the same logging requirements, the same escalation paths. But a read-only summarization agent that queries internal documents and surfaces a digest is not the same risk object as an agent with write access to production infrastructure. Governing them identically is not caution. It is a category error.

The argument this essay builds is straightforward: uniform governance is not a conservative default. It is an architectural mistake. The enterprises that avoid the decommissioning wave will be those that classify agents by what they can actually do in the world before they deploy them, and then build oversight proportional to that capability. The enterprises that don't will keep discovering, six months after deployment, that their governance policy was either strangling useful agents or quietly permitting dangerous ones.

What Does 'Uniform Governance' Actually Mean in Practice?

Uniform governance typically emerges from a familiar organizational sequence. Legal and compliance teams conduct a risk review after the first AI agent project surfaces, produce a policy document that covers the general category of "AI agents," and hand it to IT to implement as a blanket configuration. Every agent, regardless of what it can actually do in the world, then inherits the same approval gates before deployment, the same logging requirements during operation, and the same human-in-the-loop checkpoints for any output that influences a decision. The policy feels thorough because it is long. It feels safe because it is uniform.

The hidden cost appears over time, and it compounds. When a low-risk agent, say, one that reads internal Slack threads and produces weekly summaries, requires the same human review cycle as an agent that can send customer-facing messages via Twilio, the review queue fills with trivial approvals. Reviewers learn, rationally, that most approvals are rubber stamps. They start processing them faster. The cognitive investment per review drops. And then, when a genuinely high-risk agent action arrives in that same queue, it receives the same shallow attention as the weekly summary agent. Uniform governance does not distribute oversight evenly. It degrades oversight quality across the board by training humans to stop paying attention.

Uniform governance, designed to reduce risk, concentrates it by making all oversight equally shallow.

This is what governance theater looks like at scale. The appearance of oversight is present: logs exist, approvals are recorded, checkpoints are documented. The substance of oversight, the genuine human judgment applied proportionally to genuine risk, is absent. Evaluation tooling can hide this in a subtle way, because a passing test run looks like substantive oversight. Promptfoo, which the TopReviewed AI panel scored 8.5/10, is a capable platform for LLM evaluation and red-teaming. It can run test suites against agent outputs, check for policy violations, and flag anomalous behavior before promotion to production. But even the best evaluation tooling breaks down when the policy framework it enforces does not distinguish between agent types. If the policy says "all agents must pass safety evaluation suite A," and suite A was designed with a read-only summarization agent in mind, then an acting agent with write permissions will pass suite A and still be under-governed. The tooling is only as good as the policy architecture it implements.

The deeper issue is that caution is not a neutral default. Treating excessive restriction as the safe choice ignores the real cost of over-governing low-risk agents: wasted review capacity, degraded oversight quality for high-risk ones, and eventually, agents that are so constrained they provide no value and get abandoned. The decommissioning problem has two faces. One is the agent that caused harm. The other is the agent that was governed into uselessness.

How Do Read-Only and Acting Agents Actually Differ in Risk Profile?

The most useful way to classify agents for governance purposes is not by what model powers them, but by what they can do in the world. This produces a three-category taxonomy. Read-only agents retrieve, summarize, and classify. They can query a data warehouse like Snowflake, read documents, and surface information to a human. Advisory agents draft, recommend, and flag. Their outputs influence human decisions, but a human takes the action. Acting agents write to systems, send communications, execute transactions, and modify infrastructure. They take actions with real-world consequences without a human in the loop at the moment of execution.

The governing concept for understanding why this matters is blast radius: the scope of consequences if something goes wrong. A read-only agent querying a data warehouse has near-zero blast radius. If it retrieves the wrong records or produces a misleading summary, a human reviewing its output can catch the error before any downstream harm occurs. An agent that can provision cloud infrastructure via HashiCorp Terraform, or send customer-facing communications through an API, has a blast radius that can touch production systems, real customers, and real money, often before any human has seen what the agent did.

The risk gradient between these categories is not linear. Each permission layer does not add risk arithmetically. It multiplies it, because acting agents can chain consequences in ways that are difficult or impossible to reverse. An agent that can write to a database, send an email, and trigger a downstream workflow based on the email's response has created a consequence chain that may be three or four steps long before anyone notices something went wrong. Read-only agents cannot do this. Advisory agents can influence it but cannot initiate it. Acting agents can initiate, extend, and complete consequence chains autonomously.

Observability tools can surface what agents are doing in real time. Honeycomb's high-cardinality event tracking is well-suited to catching anomalous patterns in agent behavior, and Grafana dashboards can surface action rates against defined thresholds for acting agents. But observability is a detection mechanism. It tells you what happened, sometimes while it is still happening. It is not a prevention mechanism. Governance architecture must be upstream of observability, not downstream of it. By the time Honeycomb surfaces an anomalous pattern, an acting agent may have already executed a dozen actions that need to be unwound.

This is the connection back to the Gartner finding. The agents most likely to end up in that 40% decommissioning projection are acting agents that were governed like read-only ones. They passed the same evaluation suite, received the same approval, operated under the same logging requirements, and then did something consequential that no one had designed the governance framework to prevent.

What Should an Autonomy-Tiered Governance Framework Actually Look Like?

A tiered approach to AI agent governance matches oversight intensity to consequence severity. This is not a novel principle. It is how financial controls work, how medical device approval works, how aviation maintenance certification works. The insight that a higher-risk action requires more rigorous oversight is not controversial in any other risk management discipline. It is only controversial in AI governance because the field is young enough that enterprises are still reaching for the simplest possible policy rather than the most appropriate one.

Tier 1 covers read-only agents with no external side effects. These agents need logging and periodic output audits, but not per-action human approval. The audit should verify that the agent's outputs are accurate and that its access permissions haven't quietly expanded beyond their original scope. Over-governing Tier 1 agents is itself a governance failure, because it consumes review capacity that should be concentrated on higher tiers. The goal is not to eliminate oversight but to right-size it.

Tier 2 covers advisory agents whose outputs influence human decisions. The governance requirement here is structured review: a clear workflow that documents what the agent recommended and what the human decided, and that preserves the distinction between those two things. The risk in Tier 2 is not that the agent acts autonomously. The risk is that the human stops acting autonomously, that the agent's recommendation becomes the de facto decision without genuine human evaluation. Governance for Tier 2 is as much about protecting the quality of human judgment as it is about constraining the agent.
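One way to preserve the distinction between recommendation and decision is to store them as separate fields in a review record. The following is a minimal sketch, not a prescribed schema: the class `AdvisoryReview`, its fields, and the `agreement_rate` helper are all hypothetical names introduced for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass
class AdvisoryReview:
    """One Tier 2 review: what the agent recommended vs. what the human decided."""
    agent_id: str
    recommendation: str
    reviewer: Optional[str] = None
    decision: Optional[str] = None
    rationale: Optional[str] = None
    decided_at: Optional[datetime] = None

    def record_decision(self, reviewer: str, decision: str, rationale: str) -> None:
        # Require a written rationale so that accepting the recommendation
        # is still an explicit human act rather than a silent default.
        if not rationale.strip():
            raise ValueError("a rationale is required for every decision")
        self.reviewer = reviewer
        self.decision = decision
        self.rationale = rationale
        self.decided_at = datetime.now(timezone.utc)

    @property
    def followed_agent(self) -> Optional[bool]:
        # None until a human has decided; afterwards True only when the
        # decision text matches the recommendation exactly.
        if self.decision is None:
            return None
        return self.decision == self.recommendation


def agreement_rate(reviews: list[AdvisoryReview]) -> float:
    """Share of decided reviews in which the human followed the agent."""
    decided = [r for r in reviews if r.decision is not None]
    if not decided:
        return 0.0
    return sum(r.followed_agent for r in decided) / len(decided)
```

An agreement rate that stays near 1.0 over a long period does not prove that reviewers have stopped evaluating, but it may be a reasonable prompt to sample some of those decisions and check.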

Tier 3 covers acting agents with real-world write, execute, send, or spend permissions. Four controls are non-negotiable here: pre-authorization scoping, which defines exactly what the agent is permitted to do before it is deployed; hard permission boundaries, enforced at the infrastructure layer so that the agent is technically incapable of exceeding its authorized scope; reversibility requirements, which mean that any action the agent takes must either be reversible or must require explicit human authorization before it is executed; and real-time anomaly detection, which triggers review when the agent's action patterns deviate from its defined operating envelope. MLflow provides the audit trail that compliance teams need to reconstruct which version of an agent was running when a given action was taken, which is essential for Tier 3 post-incident review.
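The scoping and reversibility controls can be sketched as a check that runs before every action. This is an illustrative in-process gate only, with hypothetical names (`Tier3Gate`, `Action`, and both exception classes). It does not replace enforcement at the infrastructure layer, and it does not cover anomaly detection.

```python
from dataclasses import dataclass, field


class OutOfScopeError(Exception):
    """The agent attempted an action outside its pre-authorized scope."""


class NeedsHumanAuthorization(Exception):
    """The action is irreversible and has not been explicitly approved."""


@dataclass(frozen=True)
class Action:
    name: str         # hypothetical identifier, e.g. "crm.update_record"
    reversible: bool  # can the effect be undone after the fact?


@dataclass
class Tier3Gate:
    # Pre-authorization scope, fixed before deployment.
    allowed_actions: frozenset[str]
    # Irreversible actions that a human has explicitly signed off on.
    approved_irreversible: set[str] = field(default_factory=set)

    def check(self, action: Action) -> None:
        """Raise unless the action is in scope and either reversible or approved."""
        if action.name not in self.allowed_actions:
            raise OutOfScopeError(action.name)
        if not action.reversible and action.name not in self.approved_irreversible:
            raise NeedsHumanAuthorization(action.name)
```

Raising distinct exceptions for the two cases keeps an out-of-scope attempt distinguishable from an action that is merely waiting for sign-off.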

The infrastructure enforcement point deserves particular emphasis. A policy document that says a Tier 3 agent is not permitted to modify production databases is weaker than a permission boundary that makes it technically impossible for the agent to do so. HashiCorp Terraform can encode those boundaries at the infrastructure layer, so that governance is not merely a stated constraint but an architectural one. The agent cannot exceed its permissions because the permissions do not exist in the infrastructure, not because a policy document says it shouldn't. When Promptfoo is configured with tier-specific evaluation criteria, it can verify before deployment that a Tier 1 agent hasn't quietly accrued capabilities that push it into Tier 2 territory. The pattern is different test suites for different agent classes, enforced as a promotion gate.
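A tier-aware promotion gate can be as simple as a lookup from tier to required evaluation suites. The sketch below is tool-agnostic and makes no claim about Promptfoo's configuration format. The suite names and the `REQUIRED_SUITES` mapping are assumptions chosen for illustration.

```python
# Hypothetical suite names; each higher tier adds to the tier below it.
REQUIRED_SUITES: dict[int, set[str]] = {
    1: {"output_accuracy"},
    2: {"output_accuracy", "recommendation_quality"},
    3: {"output_accuracy", "recommendation_quality", "adversarial_tool_use"},
}


def missing_suites(tier: int, passed: set[str]) -> set[str]:
    """Suites the tier requires that the agent has not yet passed."""
    return REQUIRED_SUITES[tier] - passed


def can_promote(tier: int, passed: set[str]) -> bool:
    """An agent is promoted only when nothing required for its tier is missing."""
    return not missing_suites(tier, passed)
```

With this shape, an agent that has passed only the Tier 1 suite is promotable as Tier 1 but blocked as Tier 3, which is the distinction a single uniform suite cannot make.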

The organizational challenge is as significant as the technical one. Autonomy-tiered governance requires genuine cross-functional alignment. The AI team that builds the agent, the security team that manages endpoint and identity controls (where platforms like CrowdStrike operate), the legal team that owns compliance risk, and the business owner who defined the agent's purpose all need to agree on the tier classification before deployment. If only one team owns the classification decision, the framework fails. The AI team will underestimate risk to ship faster. The legal team will overestimate it to minimize liability. The classification needs to be a joint determination, documented and revisable as the agent's capabilities evolve.

Which Tools Actually Support Tiered Governance Today?

Most of the tools that enterprises use for AI agent management were not designed with autonomy-tiered governance in mind. They were designed to solve adjacent problems: evaluation, observability, error tracking, model management. The honest assessment is that none of them substitute for the upstream policy decision about which tier an agent belongs to. That classification work is human and organizational. But several tools can be configured to enforce tier-specific controls once the classification is made.

Promptfoo operates at the pre-deployment evaluation layer. It can run different test suites against different agent classes, which means a Tier 3 agent can be tested against a more adversarial suite that probes its behavior when tool-use permissions are pushed to their edges. This is particularly useful for catching capability drift, the pattern where a Tier 1 agent has been granted additional tools over time and has effectively become a Tier 2 or Tier 3 agent without anyone having made that decision explicitly. Promptfoo can surface that drift before the agent reaches production.

MLflow provides experiment tracking and model versioning that serves the audit trail function for Tier 3 agents. When an acting agent takes an action that causes a problem, the post-incident review requires knowing exactly which version of the agent was running, what its configuration was, and how its behavior had changed across versions. MLflow's tracking infrastructure makes that reconstruction possible. Without it, post-incident review is largely guesswork.

Honeycomb and Grafana serve the runtime observability function. Honeycomb's high-cardinality event model is well-matched to agent monitoring because agent behavior is highly contextual. A single action type, say, a database write, can be benign or anomalous depending on what preceded it, what parameters it carried, and how frequently it is occurring. Honeycomb can hold that context in a way that traditional metrics systems cannot. Grafana dashboards can surface Tier 3 action rates against defined thresholds, giving operations teams a real-time view of whether agents are operating within their expected envelope.
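The threshold logic behind an action-rate alert can be sketched as a sliding-window counter. This is an illustration of the idea only, not Grafana or Honeycomb configuration. The class name and parameters are hypothetical, and timestamps are plain seconds supplied by the caller.

```python
from collections import deque


class ActionRateMonitor:
    """Flags when more than max_actions occur within window_seconds."""

    def __init__(self, max_actions: int, window_seconds: float) -> None:
        self.max_actions = max_actions
        self.window_seconds = window_seconds
        self._timestamps: deque[float] = deque()

    def record(self, timestamp: float) -> bool:
        """Record one action; return True if the rate now exceeds the threshold."""
        self._timestamps.append(timestamp)
        # Drop events that have aged out of the window.
        cutoff = timestamp - self.window_seconds
        while self._timestamps and self._timestamps[0] <= cutoff:
            self._timestamps.popleft()
        return len(self._timestamps) > self.max_actions
```

By the time `record` returns True, the actions that pushed the rate over the threshold have already happened, so this check detects a problem without preventing it.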

Sentry addresses a specific and often overlooked signal: what happens when an acting agent fails mid-task. When a Tier 3 agent encounters an error, the error context can distinguish between a model failure and a permission boundary being hit. Those are different signals with different governance implications. A permission boundary hit means the agent attempted something outside its authorized scope, which is a governance event that should trigger review regardless of whether the attempt succeeded. Sentry's error context can surface that distinction in a way that generic logging often cannot.
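The distinction between the two signals is easiest to preserve when the agent runtime raises different exception types for them. The sketch below uses plain Python with hypothetical names and makes no calls to Sentry. The returned labels are the kind of context that could be attached to an error report in whichever tracker is in use.

```python
class PermissionBoundaryHit(Exception):
    """The agent attempted something outside its authorized scope."""


class ModelFailure(Exception):
    """The model produced unusable output or could not complete the step."""


def classify_failure(exc: Exception) -> dict[str, object]:
    """Map a mid-task failure to a label and a review requirement."""
    if isinstance(exc, PermissionBoundaryHit):
        # A governance event: review it whether or not the attempt succeeded.
        return {"kind": "governance_event", "requires_review": True}
    if isinstance(exc, ModelFailure):
        return {"kind": "model_failure", "requires_review": False}
    # Design choice in this sketch: unrecognized failures default to review.
    return {"kind": "unknown", "requires_review": True}
```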

The Anthropic Claude API, scored 8.3/10 by the TopReviewed AI panel, exposes system prompt constraints and tool-use controls that can be used to enforce tier boundaries at the model call level. An agent built on Claude can be configured so that its available tools are scoped to its tier classification, making the tier boundary a property of the model's operating context rather than only a policy document. This is not a complete governance solution, but it is a meaningful technical layer that complements the infrastructure-level controls that HashiCorp Terraform can enforce.
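Scoping tools to a tier can be sketched as a filter over a tool catalog, applied before any model call is made. The tool names and tier assignments below are hypothetical. The dict shape (`name`, `description`, `input_schema`) follows the tool-definition format of Anthropic's Messages API, and the filtered list is the kind of value that would be passed as its `tools` parameter. The sketch itself makes no API call.

```python
def _text_tool(name: str, description: str, arg: str) -> dict:
    """Build a tool definition that takes a single required string argument."""
    return {
        "name": name,
        "description": description,
        "input_schema": {
            "type": "object",
            "properties": {arg: {"type": "string"}},
            "required": [arg],
        },
    }


# (minimum tier, tool definition) pairs; the assignments are illustrative.
TOOL_CATALOG: list[tuple[int, dict]] = [
    (1, _text_tool("search_documents", "Search internal documents.", "query")),
    (2, _text_tool("draft_reply", "Draft a reply for a human to review.", "context")),
    (3, _text_tool("send_message", "Send a message to a customer.", "body")),
]


def tools_for_tier(tier: int) -> list[dict]:
    """Return only the tool definitions an agent of this tier may be given."""
    return [tool for min_tier, tool in TOOL_CATALOG if min_tier <= tier]
```

An agent classified as Tier 1 never receives the `send_message` definition under this scheme, so the model has no such tool to call.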

What this tooling stack cannot do is make the classification decision. No evaluation framework, observability platform, or model API can determine whether a given agent belongs in Tier 1, 2, or 3. That determination requires humans who understand what the agent can do, what it is connected to, and what the consequences of failure look like. The tools enforce the framework. They do not create it.

What Does Getting This Right Actually Require From Enterprise Leadership?

The deepest obstacle to tiered AI agent governance is not tooling. Most enterprises already have access to the tools described above. The obstacle is that most enterprises have not built the internal vocabulary or accountability structures to classify agents by autonomy level before deployment. Without that vocabulary, the classification conversation cannot happen. Without accountability structures, it does not happen even when the vocabulary exists.

The organizational pattern that produces the worst outcomes follows a consistent shape. A product team deploys what they describe internally as a "helpful assistant agent." Compliance reviews it based on that description, treats it as a low-risk read-only tool, and applies Tier 1 governance. Over the following six months, the team adds tools to the agent: first the ability to draft and send internal Slack messages, then the ability to update a CRM record, then the ability to trigger a downstream workflow. Each addition feels incremental. None of them individually triggers a governance review. At the end of six months, the agent is effectively a Tier 3 acting agent operating under Tier 1 governance, and no one made that decision explicitly. Someone just kept adding tools.

The concrete first step is not buying new tooling. It is conducting an autonomy audit of every agent currently in production. The audit asks four questions for each agent: what can it read, what can it write, what can it send, and what can it spend. The answers to those four questions determine the tier. The next question is whether the governance policy currently applied to that agent matches the tier its capabilities place it in. In most enterprises, a meaningful fraction of agents in production will fail that test, not because anyone was negligent, but because capability drift is gradual and governance reviews are episodic.
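The audit can be sketched as a small function over an agent inventory. All names here are hypothetical. The four questions alone do not separate Tier 1 from Tier 2, so the sketch adds an `advises_decisions` flag as an explicit assumption. It also treats any write, send, or spend capability as Tier 3, which is one reading of the tiers described above rather than a definitive rule.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AgentCapabilities:
    name: str
    can_read: bool           # recorded for the audit; does not raise the tier
    can_write: bool
    can_send: bool
    can_spend: bool
    advises_decisions: bool  # assumption: not one of the four audit questions
    applied_tier: int        # the tier whose governance is applied today


def derived_tier(agent: AgentCapabilities) -> int:
    """Tier implied by what the agent can actually do."""
    if agent.can_write or agent.can_send or agent.can_spend:
        return 3
    if agent.advises_decisions:
        return 2
    return 1


def audit(agents: list[AgentCapabilities]) -> list[tuple[str, int, int]]:
    """Return (name, applied_tier, derived_tier) for every mismatched agent."""
    return [
        (a.name, a.applied_tier, derived_tier(a))
        for a in agents
        if a.applied_tier != derived_tier(a)
    ]
```

Run against the "helpful assistant" pattern described earlier, an agent with send and write capabilities but Tier 1 governance would appear in the output as a mismatch of 1 against 3.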

The Gartner projection of 40% decommissioning is not inevitable. It describes the trajectory of enterprises that treat AI agent governance as a compliance checkbox applied after deployment rather than an architectural decision made before it. The enterprises that avoid that trajectory will be those that build the classification conversation into the deployment process itself, that require a tier assignment as a condition of production access, and that build review triggers into the tool-permission grant process so that capability drift cannot happen silently.
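A review trigger in the grant process can be sketched as a function that refuses any grant that would push an agent beyond its assigned tier unless a reviewed tier is supplied. The tool names, the `TOOL_TIERS` mapping, and the `grant_tool` signature are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Optional

# Hypothetical mapping from tool name to the lowest tier that may hold it.
TOOL_TIERS: dict[str, int] = {
    "read_docs": 1,
    "draft_message": 2,
    "send_slack_message": 3,
    "update_crm_record": 3,
}


class TierReviewRequired(Exception):
    """Granting this tool would exceed the agent's assigned tier."""


@dataclass
class Agent:
    name: str
    assigned_tier: int
    tools: set[str] = field(default_factory=set)


def grant_tool(agent: Agent, tool: str, reviewed_tier: Optional[int] = None) -> None:
    """Grant a tool, requiring an explicit re-classification if the tier rises."""
    required = TOOL_TIERS[tool]
    if required > agent.assigned_tier:
        if reviewed_tier is None or reviewed_tier < required:
            raise TierReviewRequired(
                f"{tool} requires tier {required}; "
                f"{agent.name} is assigned tier {agent.assigned_tier}"
            )
        # The joint review happened: record the new classification.
        agent.assigned_tier = reviewed_tier
    agent.tools.add(tool)
```

Under this design, the incremental additions that feel harmless one at a time each fail loudly at grant time, so the tier change becomes a decision someone has to make.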

The next time an AI agent is proposed for deployment in your organization, the first governance question should not be "what do our general AI policies say?" It should be: "what tier does this agent belong to, and what does that tier require?" That single reframe, applied consistently before deployment rather than reactively after something fails, is where the work of avoiding the next decommissioning wave actually begins.

AI agent governance · AI compliance · AI workflow automation · enterprise AI · autonomous agents
