How autonomous should an enterprise agent be at launch?

Start conservative. Let it act autonomously only on low-risk, reversible actions such as reading data or drafting responses, and gate anything destructive or externally visible behind human approval. As your eval data and production traces build confidence, you can widen autonomy deliberately, tier by tier, rather than all at once.

What is the most common cause of agent failures in production?

Malformed or incorrect tool calls, not hallucinated text. Models invent arguments, use wrong identifiers, or loop on the same tool. Strict argument validation, small non-overlapping tool sets, idempotent tools, and structured, recoverable error messages address the majority of these failures.

Do I always need an LLM-as-judge for evaluation?

No. Prefer deterministic checks — correct tool used, valid output schema, within step budget, no forbidden actions — because they are cheap, fast and unambiguous. Only reach for an LLM judge where correctness is genuinely subjective, and always calibrate that judge against human labels before trusting it.

AI Agents · 18 March 2026 · 8 min read

Designing Reliable AI Agents for Enterprise Work

Most agent projects die in the gap between a slick demo and a system an enterprise can actually depend on. Here is how we close it.

Priya Ramachandran

Staff AI Engineer

The demo always works. You wire up a model, give it three tools, ask it a well-behaved question, and it reasons its way to a tidy answer in front of the stakeholders. Everyone is delighted. Then you point it at real traffic — messy inputs, half-populated records, an API that returns a 502 every fiftieth call — and the illusion of competence evaporates. The distance between that demo and a system a bank or an insurer will run unattended is where almost all the engineering actually lives.

After shipping several agents into regulated Australian enterprises, I have stopped treating agent reliability as a prompting problem. It is a systems problem. The model is one unreliable component in a distributed system, and you engineer around it the same way you would engineer around a flaky third-party service: with contracts, retries, fallbacks, observability and hard limits. What follows is the checklist I now apply before anything goes near production.

Tool-calling is the reliability frontier

The single largest source of production failures I see is not hallucinated prose — it is malformed or wrong tool calls. The model invents an argument, passes a string where an integer belongs, calls the right tool with a plausible but incorrect ID, or loops calling the same tool because it did not register the result. Prompting alone will not fix this at scale.

Treat every tool as a public API with a strict contract. Validate arguments against a schema before execution, and return structured, actionable errors the model can recover from rather than a stack trace. A tool result that says "customer_id must be a 9-digit number; you passed 'ACME'" lets the model self-correct; a raw exception usually sends it into a doom loop.

Constrain tool arguments with a strict schema and reject anything that does not validate — do not let bad calls reach your systems.
Make tools idempotent where you can, so a retried call is safe. Assume the model will occasionally call twice.
Keep the tool surface small. Ten tools with clear, non-overlapping purposes beat thirty that the model confuses. Ambiguity in tool descriptions is ambiguity in behaviour.
Return errors as data, with a suggested remediation, not as failures that terminate the turn.
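
The points above can be sketched as a thin wrapper around each tool. This is a minimal illustration, not a definitive implementation: `LOOKUP_CUSTOMER_SCHEMA`, `validate_args`, `call_tool` and `lookup_customer` are hypothetical names, and the hand-rolled schema stands in for whatever schema library you already use.

```python
import re
from typing import Any, Callable

# Hypothetical contract for one tool: argument name -> (type, check, hint).
LOOKUP_CUSTOMER_SCHEMA = {
    "customer_id": (
        str,
        lambda v: re.fullmatch(r"\d{9}", v) is not None,
        "customer_id must be a 9-digit number",
    ),
}


def validate_args(schema: dict, args: dict) -> list[str]:
    """Return a list of readable problems; an empty list means the call is valid."""
    problems = []
    for name in sorted(args.keys() - schema.keys()):
        problems.append(f"unknown argument {name!r}; remove it")
    for name, (expected_type, check, hint) in schema.items():
        if name not in args:
            problems.append(f"missing argument {name!r}; {hint}")
        elif not isinstance(args[name], expected_type) or not check(args[name]):
            problems.append(f"{hint}; you passed {args[name]!r}")
    return problems


def call_tool(tool: Callable[..., Any], schema: dict, args: dict) -> dict:
    """Validate first, then execute. Errors come back as data, not exceptions."""
    problems = validate_args(schema, args)
    if problems:
        # The bad call never reaches the underlying system.
        return {"ok": False, "error": "invalid_arguments", "remediation": problems}
    try:
        return {"ok": True, "result": tool(**args)}
    except Exception as exc:
        # Keep the stack trace in your logs; give the model something it can act on.
        return {
            "ok": False,
            "error": type(exc).__name__,
            "remediation": ["retry later or escalate to a human"],
        }


def lookup_customer(customer_id: str) -> dict:
    # Stand-in for a real system call.
    return {"customer_id": customer_id, "status": "active"}
```

Calling `call_tool(lookup_customer, LOOKUP_CUSTOMER_SCHEMA, {"customer_id": "ACME"})` returns a result with `ok` set to `False` and a remediation message the model can read and act on, rather than raising.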

Guardrails and human-in-the-loop are design decisions, not add-ons

Decide up front which actions the agent may take autonomously and which require a human to approve. This is a risk-tiering exercise, and it belongs in the design document, not in a hurried patch after an incident. Reading data, drafting a response, or classifying a ticket can usually run unattended. Issuing a refund, modifying an entitlement, sending an external email, or writing to a system of record should pause for approval — at least until you have the evidence to earn more autonomy.

The pattern that has worked repeatedly: the agent proposes a fully-formed action with its reasoning attached, and a human approves, edits, or rejects it. This keeps a person accountable, generates labelled data for your evals, and gives you an audit trail regulators will ask for. Do not confuse a human-in-the-loop with a human rubber-stamp; if approvers are clicking yes on everything without reading, you have theatre, not a control.
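
One way to encode the tiering and the propose-then-approve pattern is a small routing function. The sketch below is illustrative only: the action names, `ACTION_TIERS`, `ProposedAction` and `route` are all assumptions, and the in-memory lists stand in for a real approval queue and audit store.

```python
from dataclasses import dataclass
from enum import Enum


class Tier(Enum):
    AUTONOMOUS = "autonomous"          # low-risk, reversible
    NEEDS_APPROVAL = "needs_approval"  # destructive or externally visible


# Hypothetical tiering table, agreed in the design document.
ACTION_TIERS = {
    "read_record": Tier.AUTONOMOUS,
    "draft_response": Tier.AUTONOMOUS,
    "classify_ticket": Tier.AUTONOMOUS,
    "issue_refund": Tier.NEEDS_APPROVAL,
    "send_external_email": Tier.NEEDS_APPROVAL,
}


@dataclass
class ProposedAction:
    name: str
    args: dict
    reasoning: str  # shown to the approver next to the action


def route(action: ProposedAction, approval_queue: list, audit_log: list) -> str:
    """Decide whether an action runs now or waits for a human."""
    # Anything not explicitly tiered falls into the stricter tier.
    tier = ACTION_TIERS.get(action.name, Tier.NEEDS_APPROVAL)
    audit_log.append(
        {
            "action": action.name,
            "args": action.args,
            "reasoning": action.reasoning,
            "tier": tier.value,
        }
    )
    if tier is Tier.AUTONOMOUS:
        return "execute"
    approval_queue.append(action)
    return "pending_approval"
```

Defaulting unknown actions to the approval tier means a newly added tool cannot run unattended until someone has made a deliberate decision about it.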

Build the evaluation harness before you tune anything

You cannot improve what you cannot measure, and vibes are not a metric. Before optimising prompts, build an offline eval set from real cases — successes, near-misses, and the ugly edge cases from production. Every time the agent fails in the wild, that case becomes a permanent regression test. This dataset is the most valuable artefact your team will build; guard it more carefully than your prompts.

Evaluate at two levels. First, deterministic checks: did it call the right tool, produce valid output, stay under the step budget, avoid the forbidden actions? These are cheap, fast and unambiguous. Second, where correctness is subjective, use an LLM-as-judge with a tightly-specified rubric — but calibrate the judge against human labels, because an uncalibrated judge just launders your assumptions.

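# Assumes agent, golden_set, FORBIDDEN (a set of tool names), BASELINE
# and pass_rate are defined elsewhere in the harness.
MAX_STEPS = 8  # hard step budget, shared by the run and the check

def eval_case(agent, case):
    """Run one golden case and return its deterministic checks."""
    trace = agent.run(case.input, max_steps=MAX_STEPS)
    tools_called = set(trace.tools_called)  # a set, so the intersection below is valid
    return {
        "used_tool": case.expected_tool in tools_called,
        "valid_output": case.schema.is_valid(trace.final),
        "within_budget": trace.steps <= MAX_STEPS,
        "no_forbidden": not (tools_called & FORBIDDEN),
    }

# Fail CI if the pass rate on the golden set regresses.
results = [eval_case(agent, case) for case in golden_set]
assert pass_rate(results) >= BASELINE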
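
The snippet above leaves `pass_rate` undefined. One plausible definition, offered as an assumption rather than the harness actually used here, counts a case as passing only when every deterministic check is true. The `failing_checks` helper is likewise hypothetical and shows where a regression comes from.

```python
def pass_rate(results: list[dict]) -> float:
    """Fraction of cases in which every deterministic check passed."""
    if not results:
        raise ValueError("golden set is empty; nothing to gate on")
    passed = sum(1 for checks in results if all(checks.values()))
    return passed / len(results)


def failing_checks(results: list[dict]) -> dict[str, int]:
    """Count failures per check name, to show what regressed."""
    counts: dict[str, int] = {}
    for checks in results:
        for name, ok in checks.items():
            if not ok:
                counts[name] = counts.get(name, 0) + 1
    return counts
```

Raising on an empty result list is deliberate: a CI gate that passes because no cases ran is worse than one that fails loudly.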

Deterministic fallbacks and step budgets

An agent should never be the only path to an outcome for anything that matters. For high-value or high-frequency flows, keep a deterministic implementation — a rules engine, a SQL query, a plain function — and let the agent handle the long tail. The agent earns its place by covering the cases that are uneconomic to code by hand, not by replacing code that already works reliably.

Impose a hard step or token budget on every run and define what happens when it is hit: escalate to a human, fall back to the deterministic path, or fail cleanly with a clear message. An agent with no budget will, on a bad day, burn your entire month's inference spend chasing an impossible task in a loop. I have seen it happen. Cap it.
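
A budgeted run with a deterministic fallback might look like the sketch below. It assumes a hypothetical `step` callable that advances the agent by one step, returns the final answer or `None`, and records its token usage in `state["last_step_tokens"]`. None of these names come from a real framework.

```python
from typing import Callable, Optional


class BudgetExceeded(Exception):
    """Raised when a run hits its step or token limit."""


def run_with_budget(
    step: Callable[[dict], Optional[str]],
    state: dict,
    max_steps: int,
    max_tokens: int,
) -> str:
    """Drive the agent one step at a time and stop hard at either limit."""
    tokens_used = 0
    for _ in range(max_steps):
        answer = step(state)
        tokens_used += state.get("last_step_tokens", 0)
        if answer is not None:
            return answer
        if tokens_used >= max_tokens:
            raise BudgetExceeded(f"token budget of {max_tokens} exhausted")
    raise BudgetExceeded(f"step budget of {max_steps} exhausted")


def answer_with_fallback(
    step: Callable[[dict], Optional[str]],
    state: dict,
    fallback: Callable[[dict], str],
    max_steps: int = 8,
    max_tokens: int = 20_000,
) -> dict:
    """Try the agent first; on a blown budget, take the deterministic path."""
    try:
        answer = run_with_budget(step, state, max_steps, max_tokens)
        return {"source": "agent", "answer": answer}
    except BudgetExceeded as exc:
        return {"source": "fallback", "answer": fallback(state), "reason": str(exc)}
```

The `fallback` could equally enqueue the case for a human or return a clean failure message; the point is that the behaviour at the limit is decided in code, ahead of time.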

Observability: you cannot debug what you cannot trace

Every agent run must emit a complete trace — the prompt, each model output, every tool call and its result, token counts, latency per step, and the final outcome. When an agent misbehaves in production, this trace is the difference between a ten-minute fix and a week of guessing. Structured tracing is not optional; it is the primary interface through which you understand a non-deterministic system.

Assign every run a correlation ID and propagate it through all tool calls and downstream services.
Log token usage and latency per step so you can see where cost and time actually go — the answer is often surprising.
Sample production traces into your eval set continuously; production is the best source of hard cases you will ever have.
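
As a rough illustration of the tracing described above, the sketch below records each step with a shared correlation ID and per-step latency using only the standard library. `RunTrace` and its field names are hypothetical; in practice you would likely emit the same data through your existing tracing stack.

```python
import json
import time
import uuid
from contextlib import contextmanager


class RunTrace:
    """Collects one structured record per model call or tool call."""

    def __init__(self, prompt: str):
        self.correlation_id = str(uuid.uuid4())  # pass this to downstream services
        self.prompt = prompt
        self.steps: list[dict] = []

    @contextmanager
    def step(self, kind: str, name: str):
        record = {"kind": kind, "name": name, "correlation_id": self.correlation_id}
        start = time.perf_counter()
        try:
            yield record  # the caller adds result, tokens and anything else useful
        except Exception as exc:
            record["error"] = repr(exc)
            raise
        finally:
            # Latency is recorded whether the step succeeded or failed.
            record["latency_ms"] = round((time.perf_counter() - start) * 1000, 2)
            self.steps.append(record)

    def to_json(self) -> str:
        return json.dumps(
            {
                "correlation_id": self.correlation_id,
                "prompt": self.prompt,
                "steps": self.steps,
            }
        )
```

A tool call is then wrapped as `with trace.step("tool", "lookup_order") as rec:` and the body sets `rec["result"]` and `rec["tokens"]` before the block exits.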

Cost and latency are product constraints

A multi-step agent can make a dozen model calls to answer one question, and each is billed and each adds latency. Prompt caching for the stable parts of your context, routing simple sub-tasks to a smaller model, and aggressively trimming what you resend on each turn all matter more than people expect. Reserve the largest model for the steps that genuinely need it — planning and hard reasoning — and use cheaper models for extraction, classification and formatting.
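
The routing idea can be made concrete with a lookup table and a cost estimate. Everything specific in this sketch is an assumption for illustration: the model names `large-model` and `small-model` are placeholders, and the prices are invented, not any vendor's rates.

```python
# Hypothetical routing table: task kind -> model tier.
ROUTES = {
    "planning": "large-model",
    "reasoning": "large-model",
    "extraction": "small-model",
    "classification": "small-model",
    "formatting": "small-model",
}

# Invented prices per 1,000 tokens, for illustration only.
PRICE_PER_1K = {"large-model": 0.010, "small-model": 0.001}


def pick_model(task_kind: str) -> str:
    """Route known cheap tasks to the small model; default to the large one."""
    return ROUTES.get(task_kind, "large-model")


def run_cost(steps: list[tuple[str, int]]) -> float:
    """Estimate the cost of a run given (task_kind, tokens) for each step."""
    return sum(PRICE_PER_1K[pick_model(kind)] * tokens / 1000 for kind, tokens in steps)
```

Unknown task kinds fall back to the large model here, which trades cost for quality. Whether that is the right default depends on which failure is more expensive for your product.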

Managing context deliberately

Context is a scarce resource, not a dumping ground. Stuffing entire documents and full conversation history into every turn hurts accuracy and inflates cost. Retrieve only what the current step needs, summarise long histories, and keep the system prompt lean and versioned. Treat prompts as code: reviewed, versioned, and tied to the eval run that justified the change. A prompt edit with no eval attached is a guess shipped to production.
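
A minimal sketch of assembling context under a budget follows. It assumes a hypothetical `summarise` callable supplied by the caller, and it counts whitespace-separated words as a crude stand-in for tokens, so swap in your model's tokenizer before relying on the numbers.

```python
from typing import Callable


def count_tokens(text: str) -> int:
    # Crude proxy: whitespace-separated words, not real model tokens.
    return len(text.split())


def build_context(
    system_prompt: str,
    history: list[str],
    retrieved: list[str],
    budget: int,
    summarise: Callable[[list[str]], str],
) -> list[str]:
    """Keep the newest messages that fit the budget and summarise the rest."""
    used = count_tokens(system_prompt) + sum(count_tokens(r) for r in retrieved)
    recent: list[str] = []
    for message in reversed(history):  # walk from newest to oldest
        cost = count_tokens(message)
        if used + cost > budget:
            break
        recent.append(message)
        used += cost
    recent.reverse()  # restore chronological order
    older = history[: len(history) - len(recent)]

    parts = [system_prompt]
    if older:
        # The summary is assumed to be short and is not counted against the budget.
        parts.append(summarise(older))
    parts.extend(retrieved)
    parts.extend(recent)
    return parts
```

Retrieved snippets are charged against the budget before any history, which reflects the priority in the paragraph above: what the current step needs comes first, and old conversation is the first thing to be compressed.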

When not to use an agent

The most senior decision you can make is to not build an agent. If the workflow is well-defined and deterministic, write the code — it will be faster, cheaper, testable and auditable. If the cost of a wrong answer is high and you cannot afford human review, an agent is probably the wrong tool. Agents earn their complexity when the task genuinely requires flexible reasoning over ambiguous inputs and open-ended tool use. Everywhere else, a workflow with a couple of well-placed model calls will beat an autonomous agent on every metric that matters.

Reliable agents are not built by finding the perfect prompt. They are built by surrounding a probabilistic component with deterministic scaffolding: strict contracts, real evals, hard budgets, full traces, and a human on the hook for anything that bites. Do that, and the demo and the production system finally start to look like the same thing.

LLM agents · evaluation · observability · guardrails · production