Measure First, Optimize Last: My Approach to AI Evals

A pain-driven approach to AI evals: measure first, constrain risk, iterate fast, optimize last.
Categories: engineering, genai, evals

Author: Alex Guglielmone Nemi

Published: February 10, 2026

If you can’t measure it, you’re guessing. Here’s how I think about evals, with practical examples at ai-evals.io.

Start with pain, not tooling

My eval approach is pain-point driven:

  • I can’t compare what I can’t measure.
  • I can’t trust an AI system to run on its own if I can’t quantify failure.

If those two are unresolved, I am not “doing evals.” I am guessing.

Treat it like automation engineering

I frame agent work the same way as any other automation effort. Unlike traditional ML experimentation, automation engineering demands that you define acceptable behaviour upfront, not after deployment:

  • Can I describe exactly what I want? (intentionality)
  • What is the worst-case blast radius¹ if this fails?

That framing forces clarity early and keeps risk visible.

Build the smallest useful test loop

I treat early eval work like integration testing plus TDD² habits:

  • Skip big infra at the beginning.
  • Put the agent where users already work so it behaves like an extra set of hands.
  • Recreate real user stories and questions.
  • Use deterministic checks wherever possible; don’t default to LLM-as-a-judge for everything (see the sketch below).

The goal is not “more tests.” The goal is tests that maximize iteration speed and control.
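
As a concrete illustration, here is a minimal sketch of that kind of loop in Python. The `EvalCase` structure, the `agent` callable it scores, and the example prompts are all hypothetical, not from any particular framework; the point is that each check is a plain deterministic function, with LLM-as-a-judge reserved for whatever these can’t cover.

```python
import json
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class EvalCase:
    # A recreated user story: the prompt a real user would send,
    # plus deterministic checks on the agent's raw output.
    prompt: str
    checks: list[Callable[[str], bool]] = field(default_factory=list)


def must_contain(substring: str) -> Callable[[str], bool]:
    # Deterministic check: the output must mention a specific fact.
    return lambda output: substring.lower() in output.lower()


def is_valid_json(output: str) -> bool:
    # Deterministic check: the output must parse as JSON.
    try:
        json.loads(output)
        return True
    except json.JSONDecodeError:
        return False


def run_suite(agent: Callable[[str], str], cases: list[EvalCase]) -> float:
    # Fraction of cases where every deterministic check passed.
    passed = 0
    for case in cases:
        output = agent(case.prompt)  # call the system under test once per case
        if all(check(output) for check in case.checks):
            passed += 1
    return passed / len(cases)


# Hypothetical cases recreated from real user questions.
cases = [
    EvalCase("What is our refund window?", [must_contain("30 days")]),
    EvalCase("Export my last order as JSON.", [is_valid_json]),
]
```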

Optimize late

This space moves fast. Over-optimizing too early is usually waste.

I prefer to keep things:

  • Minimal
  • Easy to Change (ETC)

In practice that means:

  • Parameterized experiments
  • Easy comparison across runs, configs, and components
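
In code, that can be as small as a frozen config dataclass plus a parameter sweep. This sketch reuses the hypothetical `run_suite` and `cases` from the earlier example; the model names, temperatures, and prompt variants are placeholders, and `build_agent` is a stub you would wire to the real system.

```python
from dataclasses import dataclass
from itertools import product
from typing import Callable


@dataclass(frozen=True)
class ExperimentConfig:
    # Parameters worth comparing; extend with whatever you actually vary.
    model: str
    temperature: float
    prompt_variant: str


def build_agent(config: ExperimentConfig) -> Callable[[str], str]:
    # Placeholder factory: wire this to your real agent or LLM client.
    return lambda prompt: f"[{config.model}/{config.prompt_variant}] stub answer"


# Sweep the cross product of the parameters under comparison.
configs = [
    ExperimentConfig(model, temp, variant)
    for model, temp, variant in product(
        ["small-model", "large-model"],  # hypothetical model names
        [0.0, 0.7],
        ["v1", "v2"],
    )
]

# run_suite and cases come from the earlier sketch.
results = {config: run_suite(build_agent(config), cases) for config in configs}

# One flat, sorted table keeps runs, configs, and components easy to compare.
for config, score in sorted(results.items(), key=lambda item: -item[1]):
    print(f"{config.model:12s} T={config.temperature:.1f} "
          f"{config.prompt_variant}  pass_rate={score:.2f}")
```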

Once benchmarks are stable, optimize cost and latency:

  • Pick the cheapest model/config that still meets the bar.
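
A minimal sketch of that selection step, continuing from the `results` dictionary above; the quality bar and the per-model prices are illustrative assumptions, not real figures.

```python
QUALITY_BAR = 0.90  # minimum pass rate on the now-stable benchmark

cost_per_1k_requests = {
    "small-model": 0.40,  # hypothetical prices
    "large-model": 4.00,
}

# Keep only configs that clear the quality bar, then take the cheapest one.
qualifying = [
    (config, score) for config, score in results.items() if score >= QUALITY_BAR
]

if qualifying:
    best_config, best_score = min(
        qualifying, key=lambda item: cost_per_1k_requests[item[0].model]
    )
    print(f"Ship: {best_config} (pass_rate={best_score:.2f})")
else:
    print("No config meets the bar yet; keep iterating before optimizing cost.")
```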

That’s it: measure first, constrain risk, iterate fast, optimize last.

Footnotes

  1. Blast radius: the extent of system, data, or user impact caused by a component failure, security breach, or faulty code deployment. See the 2024 Falcon Sensor crash for an extreme example.

  2. TDD: Test Driven Development: The act of driving your code thinking intentionally about testability and what tests expose the behaviour you want. Dogmatically it can seem slow and unhelpful but thinking about testability and adding tests to catch prevent bugs from recurring are very useful practices to have. Otherwise, overbuilding is very easy.↩︎