Day 328: Agent Evaluation - Metrics, Benchmarks, and Testing

The core idea: agent evaluation is the discipline of turning "this looked good in a demo" into measurable evidence that the system completes the right task, by the right process, within the right risk and cost bounds.


Today's "Aha!" Moment

The insight: 21/07.md established that production agents need policy controls, monitoring, and traces. Evaluation is how you use that runtime data to decide whether a new prompt, tool, planner, or autonomy level should ship at all.

Why this matters: Teams often evaluate agents like chatbots: "Did the answer sound reasonable?" That is too weak for systems that plan, call tools, wait for approvals, and mutate external state. A production agent has to be judged on at least four dimensions at once: task outcome, process quality, safety and policy compliance, and efficiency.

Concrete anchor: Continue the stolen-laptop assistant from the last two lessons. A model update increases "resolution rate" from 78% to 86% on a small hand-picked set. That sounds like progress until you inspect the full evaluation: the new version also attempts more blocked actions, issues more duplicate tool calls, and leaves more cleanup work for human operators.

The model became more assertive, not necessarily more reliable.

Keep this mental hook in view: A useful agent eval scores outcomes, process quality, and risk exposure together; otherwise it rewards the wrong behavior.


Why This Matters

Agent evaluation matters because deployed agents fail in ways that ordinary QA does not catch well. A normal software feature is mostly deterministic: given the same input and version, you expect the same behavior. An agent combines probabilistic reasoning, retrieval, tool calls, changing environment state, and sometimes human approval steps. That means "works on my example" is almost meaningless.

Without a real evaluation discipline, launch decisions rest on demos and anecdotes, regressions are discovered by users and operators rather than by tests, and nobody can say which change caused a shift in behavior.

With a real evaluation discipline, every change to a prompt, tool, policy, or planner has to earn its rollout with measurable evidence against an agreed scorecard.

Real-world impact: Better launch decisions, fewer production regressions, faster debugging when behavior shifts, and a much clearer answer to the question every review board eventually asks: "What evidence says this agent is safe enough to trust here?"

This lesson also sets up 21/09.md. Techniques such as chain-of-thought should not be adopted because they look intellectually impressive. They should be adopted only if the evaluation harness shows they improve the metrics that matter for the product.


Learning Objectives

By the end of this session, you should be able to:

  1. Define an evaluation scorecard for an agent by separating task success, process quality, safety, and efficiency metrics.
  2. Build benchmark cases that represent production reality by covering happy paths, ambiguous cases, tool failures, and policy-sensitive situations.
  3. Design a release testing ladder for agent changes so prompts, policies, tools, and planning strategies are validated before wider rollout.

Core Concepts Explained

Concept 1: Metrics Must Describe the Whole Agent Contract

For example, the stolen-laptop assistant receives 100 real-world-style cases. Version A resolves 82 of them end to end. Version B resolves 85. If you stop there, B wins. But a fuller scorecard shows B attempting more blocked actions, issuing more duplicate tool calls, and consuming more human review time per case.

The "better" model is only better on one slice of the problem.

At a high level, an agent has a contract with the business, with operators, and with risk owners. The evaluation metrics must reflect that full contract. If the scorecard only rewards completion, the agent will find ways to look effective while pushing hidden cost and risk elsewhere.

Mechanically: A practical scorecard usually has four metric families:

  1. Outcome metrics
    • task completion rate
    • correctness of final state
    • reopen or retry rate
    • escalation rate
  2. Process metrics
    • steps per run
    • duplicate tool call rate
    • plan repair or re-plan rate
    • schema validation failure rate
  3. Safety and policy metrics
    • blocked-action attempt rate
    • approval-required rate
    • unauthorized tool selection rate
    • sensitive-data exposure or redaction incidents
  4. Efficiency metrics
    • latency to resolution
    • tokens and tool calls per run
    • external API cost
    • human review minutes per task

These metrics work best when attached to a task taxonomy. "Customer support" is too broad. "Disable a managed device after verified identity and policy approval" is narrow enough to score meaningfully.

A minimal per-run scoring sketch, assuming each run object exposes its final state, policy decisions, tool statistics, and timing:

def score_run(run):
    # Collapse one run into scorecard fields spanning all four metric families.
    return {
        "task_success": int(run.final_state == "resolved_correctly"),   # outcome
        "unsafe_attempt": int(run.policy.blocked_actions > 0),          # safety
        "duplicate_calls": run.tool_stats.duplicate_call_count,         # process
        "review_minutes": run.human_review_minutes,                     # efficiency
        "latency_seconds": run.latency_seconds,                         # efficiency
    }
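
To support release decisions, per-run scores need to roll up by task type in the taxonomy. A minimal sketch of that aggregation, assuming each run has already been scored by score_run() above and tagged with a task-type label; the build_scorecard name and the plain averages are illustrative, not a prescribed API.

from collections import defaultdict
from statistics import mean

def build_scorecard(scored_runs):
    # scored_runs: list of (task_type, score_dict) pairs produced by score_run().
    by_task = defaultdict(list)
    for task_type, scores in scored_runs:
        by_task[task_type].append(scores)

    scorecard = {}
    for task_type, rows in by_task.items():
        scorecard[task_type] = {
            "task_success_rate": mean(r["task_success"] for r in rows),
            "unsafe_attempt_rate": mean(r["unsafe_attempt"] for r in rows),
            "avg_duplicate_calls": mean(r["duplicate_calls"] for r in rows),
            "avg_review_minutes": mean(r["review_minutes"] for r in rows),
            # Approximate median latency; a real harness would use proper percentiles.
            "p50_latency_seconds": sorted(r["latency_seconds"] for r in rows)[len(rows) // 2],
        }
    return scorecard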

In practice:

The trade-off is clear: Rich scorecards make release decisions more honest, but they require labeling effort, instrumentation, and cross-functional agreement on what "good" really means.

A useful mental model is: Score an agent like a production workflow, not like a school essay.

Use this lens when:

Concept 2: Benchmarks Need Realistic Cases, Not Just Polished Prompts

For example, a benchmark for the stolen-laptop assistant contains 50 neat prompts such as "My company laptop was stolen; please disable it." The agent scores well. In production, the difficult cases look different: the requester cannot fully verify their identity, the device record is stale or ambiguous, the disable action requires an approval the agent has to wait for, or a downstream system times out mid-run.

The benchmark passed because it measured the easy subset of reality.

At a high level, benchmarks are only useful if they preserve the structure of the decisions the agent actually has to make. An eval set made entirely of clear, short, clean prompts teaches the system to win a contest that production never runs.

Mechanically: A strong benchmark suite usually mixes several case types:

  1. Golden path cases
    • ordinary requests that should succeed cleanly
  2. Ambiguity cases
    • missing fields, identity collisions, partial information, vague intent
  3. Policy-sensitive cases
    • requests that should require approval, refusal, or clarification
  4. Dependency-failure cases
    • timeouts, stale memory, partial writes, slow downstream systems
  5. Adversarial or abuse cases
    • prompt injection, social engineering, privilege escalation attempts

Each case should define more than a final answer. It should specify the expected final state, the process steps that must and must not occur, the policy outcomes required along the way (approvals, refusals, clarifications), and the tool and latency budget the run is allowed to consume.

This is where runtime observability from 21/07.md becomes essential. The benchmark can assert not only "task completed," but also "the agent did not skip identity verification," "the policy engine blocked the unsafe branch," or "the run stayed within the approved tool budget."
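
One way to encode such a case is as a small data structure whose checks run against that structured trace. This is a sketch under assumptions: the field names, the case_type labels, and the run object's attributes (steps, final_state, tool_stats) are illustrative stand-ins, not a fixed schema.

from dataclasses import dataclass, field

@dataclass
class BenchmarkCase:
    prompt: str
    expected_final_state: str                   # e.g. "device_disabled" or "escalated_to_human"
    case_type: str = "golden_path"              # golden_path, ambiguity, policy_sensitive, dependency_failure, adversarial
    required_steps: list = field(default_factory=list)     # trace entries that must appear, e.g. "verify_identity"
    forbidden_actions: list = field(default_factory=list)  # actions that must never execute
    max_tool_calls: int = 10

def check_case(case, run):
    # Compare one agent run against the case's outcome, process, safety, and budget expectations.
    executed = {step.name for step in run.steps}
    return {
        "outcome_ok": run.final_state == case.expected_final_state,
        "process_ok": all(step in executed for step in case.required_steps),
        "safety_ok": not any(action in executed for action in case.forbidden_actions),
        "budget_ok": run.tool_stats.total_calls <= case.max_tool_calls,
    }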

In practice:

The trade-off is clear: Realistic benchmarks predict production behavior better, but they cost more to author, maintain, and normalize across versions of tools and policies.

A useful mental model is: A benchmark is a rehearsal for real incidents, not a gallery of flattering demos.

Use this lens when:

Concept 3: Testing Agents Requires a Layered Release Ladder

For example, the team wants to introduce a chain-of-thought prompting change and a new procurement tool. If they test only the full agent end to end, they will know the combined system got worse, but not whether the regression came from the prompt, the tool wrapper, the policy rule, or the planner.

At a high level, agent systems need multiple testing layers because failures can originate in different places. Good testing isolates local bugs, then validates integrated behavior, then confirms production fitness under controlled rollout.

Mechanically: A reliable testing ladder usually looks like this:

  1. Deterministic component tests (sketched in code after this list)
    • tool wrapper tests
    • schema validation tests
    • policy engine tests
    • memory retrieval and normalization tests
  2. Scenario-level agent evals
    • benchmark cases scored offline
    • assertions on outcome, process, and safety traces
  3. Simulation and replay tests
    • rerun past incidents or sampled traces against the new version
    • inject failures such as timeouts, duplicate callbacks, or missing fields
  4. Shadow or canary rollout
    • run the new agent version on live traffic with no authority (shadow) or limited authority (canary)
    • compare metrics against the current version before widening scope
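
As a sketch of the first rung, a deterministic component test pins down one policy rule that the later scenario evals rely on. Everything here is illustrative: the rule, the evaluate_device_action helper, and the returned outcome strings are hypothetical, not an existing API.

def evaluate_device_action(action, context):
    # Minimal stand-in for a policy rule: disabling a device requires verified identity,
    # and even verified requests must still go through approval.
    if action == "disable_device" and not context.get("identity_verified"):
        return "blocked"
    if action == "disable_device":
        return "requires_approval"
    return "allowed"

def test_disable_device_requires_verified_identity():
    # Layer 1: deterministic, no model, no tools, no network.
    assert evaluate_device_action("disable_device", {"identity_verified": False}) == "blocked"
    assert evaluate_device_action("disable_device", {"identity_verified": True}) == "requires_approval"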

The purpose of this ladder is to support release gates. A prompt change should not merge merely because it "felt better." It should satisfy concrete checks such as: no drop in task success on the benchmark suite, no increase in blocked-action attempts or unauthorized tool selections, duplicate tool calls and re-plans within budget, and latency and cost inside agreed thresholds.
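
Those checks can be expressed directly against the scorecards from Concept 1. A minimal sketch of a release gate, assuming baseline and candidate are per-task-type scorecards like the ones aggregated earlier; the thresholds (two points of task success, 10% review-time growth) are placeholders a team would negotiate, not recommended values.

def release_gate(baseline, candidate):
    # Returns human-readable failures; an empty list means the change may proceed to the next rung.
    failures = []
    for task_type, base in baseline.items():
        cand = candidate.get(task_type)
        if cand is None:
            failures.append(f"{task_type}: missing from candidate evaluation")
            continue
        if cand["task_success_rate"] < base["task_success_rate"] - 0.02:
            failures.append(f"{task_type}: task success regressed")
        if cand["unsafe_attempt_rate"] > base["unsafe_attempt_rate"]:
            failures.append(f"{task_type}: more blocked-action attempts than baseline")
        if cand["avg_review_minutes"] > base["avg_review_minutes"] * 1.10:
            failures.append(f"{task_type}: operator review time grew past budget")
    return failures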

In practice:

The trade-off is clear: Layered testing slows releases and requires more plumbing, but it dramatically reduces the chance that a clever-looking change quietly degrades safety or operability.

A useful mental model is: Treat agent releases like database or distributed-systems changes: prove correctness locally, then validate system behavior under realistic load and failure.

Use this lens when:


Troubleshooting

Issue: Offline benchmark scores keep improving, but operators say production quality is flat or worse.

Why it happens / is confusing: The eval set has become too clean or too familiar. It measures polished prompt handling instead of ambiguous, failure-prone workflows.

Clarification / Fix: Refresh the benchmark with recent incident replays, ambiguous cases, and failure injections. Track separate scores for golden-path tasks and messy production-like tasks so the team can see where the gains really are.
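
One lightweight way to make that split visible is to tag every case with its type and report success per slice instead of one blended number. A sketch, assuming results arrive as (case_type, task_success) pairs; the score_by_slice name is illustrative.

def score_by_slice(results):
    # results: iterable of (case_type, task_success) pairs, e.g. ("golden_path", 1) or ("ambiguity", 0).
    slices = {}
    for case_type, success in results:
        bucket = slices.setdefault(case_type, {"runs": 0, "successes": 0})
        bucket["runs"] += 1
        bucket["successes"] += success
    return {
        case_type: bucket["successes"] / bucket["runs"]
        for case_type, bucket in slices.items()
    }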

Issue: The agent looks efficient on latency and completion, but risk reviewers are blocking release.

Why it happens / is confusing: The scorecard overweights final-task success and underweights unsafe attempts, policy bypasses, or approval misuse.

Clarification / Fix: Add policy and process assertions to every high-impact benchmark case. Make blocked-action rate, unauthorized tool choice, and required-approval compliance first-class release criteria.

Issue: A new model version passes end-to-end evals, but no one can explain which subsystem improved.

Why it happens / is confusing: The system is being judged only at the final-output layer, so changes in planner quality, retrieval quality, or tool discipline are invisible.

Clarification / Fix: Record versioned spans and structured run annotations for planner output, retrieval hits, tool selection, and policy outcomes. Evaluate at both component and scenario levels so the source of movement is attributable.


Advanced Connections

Connection 1: Agent Evaluation <-> Production Agent Systems

21/07.md explained how safe production agents emit policy decisions, tool traces, and runtime metrics. This lesson turns those signals into judgment: policy decisions become safety metrics, tool traces become process metrics, and runtime measurements become efficiency and cost scores.

Without production instrumentation, evaluation collapses back into transcript review and human impression.

Connection 2: Agent Evaluation <-> Chain-of-Thought

21/09.md will examine chain-of-thought prompting as a reasoning aid. Evaluation is the guardrail around that decision: if chain-of-thought measurably improves task success without increasing unsafe attempts, latency, or cost beyond budget, it earns its place; if it only produces longer transcripts, it does not.

Reasoning techniques should be measured as system changes, not admired as ideas in isolation.


Key Insights

  1. Agent evaluation is multi-dimensional by necessity - completion alone hides unsafe attempts, wasted tool calls, and rising operator cleanup cost.
  2. Benchmark realism matters more than benchmark polish - production regressions usually appear in ambiguous, stateful, and failure-prone cases, not in clean demos.
  3. Testing is a release system, not a slide deck - prompts, tools, and reasoning strategies should earn rollout through layered evidence from components to canaries.
