Reducing Error Compounding in GenAI Systems

How chaining LLM calls compounds errors, and why replacing probabilistic steps with deterministic ones improves reliability.
engineering
genai
evals
Author

Alex Guglielmone Nemi

Published

January 28, 2026

GenAI is non-deterministic and can fail or produce different results for the same input.

A typical prompt-to-action flow involves many LLM calls. Each call is a chance for the model to misinterpret, hallucinate, or produce an unusable output.

The question isn’t if errors happen. It’s what happens when they do, and how many opportunities you give them to cascade.


How bad does it get?

Consider a simple case with just 4 steps (this is illustrative; your system might have 10, 20, or more LLM calls):

  • Step 1: 95% chance of being correct
  • Step 2: 95% chance of being correct
  • Step 3: 95% chance of being correct
  • Step 4: 95% chance of being correct

End-to-end: ~81% chance everything is correct.

Now compare:

  • Step 1 (LLM): user intent → structured call (95%)
  • Step 2 (deterministic tool): execute (98%)
  • Step 3 (deterministic validation): parse + check (97%)
  • Step 4 (LLM): result → response (96%)

End-to-end: ~87%

Same model. Different architecture. A five-point improvement.
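The arithmetic is easy to check yourself. A minimal sketch, using the illustrative per-step rates from above:

```python
from math import prod

# All-LLM pipeline: four probabilistic steps at 95% each
all_llm = prod([0.95, 0.95, 0.95, 0.95])   # ~0.815

# Mixed pipeline: LLM at the boundaries, deterministic core in between
mixed = prod([0.95, 0.98, 0.97, 0.96])     # ~0.867

print(f"all-LLM: {all_llm:.1%}  mixed: {mixed:.1%}")
```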


Two high-leverage approaches

  1. Remove one or many GenAI steps entirely: fewer chances to fail
  2. Replace GenAI steps with deterministic ones: lower error rate per step (both levers are compared numerically after the note below)
Note

These aren’t the only ways to reduce error (e.g. consensus systems, retries, etc.), but the same fundamentals apply everywhere, whether it’s one agent or a swarm.
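To see how the two levers compare numerically, here is a toy extension of the four-step example (the 98% figure for the deterministic replacement is an assumption, consistent with the rates above):

```python
from math import prod

baseline = prod([0.95] * 4)             # ~81.5%: four LLM steps
removed = prod([0.95] * 3)              # ~85.7%: one LLM step removed entirely
replaced = prod([0.95] * 3 + [0.98])    # ~84.0%: that step swapped for a 98% deterministic one

print(f"baseline {baseline:.1%} -> remove {removed:.1%}, replace {replaced:.1%}")
```

Removing a step outright buys slightly more than replacing it, because a removed step can no longer fail at all.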


What makes deterministic steps different

Deterministic steps still fail, but the failure characteristics differ from LLM failures:

  • Bounded: failures come from a finite set of causes (parse error, timeout, missing field), not open-ended misinterpretation (see the sketch after the note below)
  • Repeatable: same input, same failure, so you can reproduce and fix it
  • Non-semantic: a crashed process doesn’t convince the next step that “actually the user meant X”
Note

This doesn’t mean deterministic = reliable. It means that when things break, they break in less subtle ways, and there is a long history of software engineering practice behind making them robust.
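As a rough sketch of what “bounded” looks like in practice: the failure modes of a deterministic step can be enumerated and handled explicitly (execute_tool here is a hypothetical tool runner, not a real API):

```python
import json

def run_deterministic_step(tool_call: str) -> dict:
    """Every failure mode is a named, catchable case. execute_tool is hypothetical."""
    try:
        payload = json.loads(tool_call)     # parse error -> json.JSONDecodeError
        result = execute_tool(payload)      # timeout -> TimeoutError
        if "status" not in result:          # missing field -> KeyError
            raise KeyError("status")
        return result
    except (json.JSONDecodeError, TimeoutError, KeyError) as exc:
        # The finite set of causes: log it, retry, or fail fast. No semantic drift.
        raise RuntimeError(f"step failed: {exc!r}") from exc
```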


The design pattern

LLMs handle many of the hard parts (interpreting intent, choosing tools, dealing with syntax, reasoning through results, deciding what comes next).

In a simplified flow, the model might:

  1. Receive user intent (natural language)
  2. Decide which tool to call and with what parameters
  3. Receive structured output from the tool
  4. Decide: done, or call another tool?
  5. Repeat until ready to respond
  6. Translate the final result back to the user

A lot of reasoning and orchestration happens there, and the point isn’t to limit that but to give it good building blocks.

A human user is more effective with better building blocks (e.g. well-designed libraries or CLI tools), and so is an LLM.

What you want are building blocks that are:

  • reusable
  • composable
  • well tested
  • easy to change and maintain
  • cost effective

And a model that acts as a translation layer, not the tool running all the logic.
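Put together, the pattern might look something like this sketch. Every function name here is a hypothetical placeholder, not a specific framework’s API:

```python
def handle_request(user_message: str) -> str:
    # Probabilistic: natural language -> structured tool call
    call = llm_to_structured_call(user_message)      # hypothetical LLM wrapper

    # Deterministic: execute the tool and validate its output
    raw = execute_tool(call["tool"], call["args"])   # hypothetical tool runner
    result = validate_against_schema(raw)            # fail fast on malformed output

    # Probabilistic: structured result -> natural language response
    return llm_to_response(result)                   # hypothetical LLM wrapper
```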


Practical recommendations

Identify the deterministic core

If you are writing a Claude skill and a step can be expressed as code, ask yourself why you’re not expressing it as code.

The tradeoff is real:

Leaving logic in prose means:

  • Higher error rate at runtime
  • Relying on evals instead of unit tests (if you don’t know what either of those is, you’re definitely safer in the frozen-code world)
  • Paying the cost, and incurring the error rate, on every execution
  • Yes, it might improve as models improve, but you’re paying for that uncertainty every time

Moving logic to code means:

  • Lower error rate (deterministic execution)
  • Unit testable
  • Cheaper to run
  • Still easy to write with LLMs: have the model generate the code once instead of regenerating the logic from prose on every run
  • You can still ask LLMs to review or improve the code later if you want

The second option gives you confidence that things actually work. The first option defers that confidence in exchange for alleged convenience.

If the LLM can write code for you, why have it translate markdown to logic on every run? Make the translation once, freeze it as code, and test it properly.
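As a concrete (made-up) example: instead of prose instructions like “the order ID is ORD- followed by six digits”, freeze the rule as code once:

```python
import re

ORDER_ID = re.compile(r"\bORD-\d{6}\b")

def extract_order_id(text: str) -> str | None:
    """The rule lives in frozen, testable code rather than in a prompt."""
    match = ORDER_ID.search(text)
    return match.group(0) if match else None
```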

Force structure at boundaries

Don’t pass prose between steps. Use formats that are easy to serialize and deserialize, like JSON/YAML with schemas you can validate against.

Structure lets you validate, detect errors (and either course-correct or fail fast), diff, log, and evaluate deterministically.
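For instance, with a schema library such as pydantic (a sketch; ToolCall and its fields are made up for illustration), you can reject malformed LLM output at the boundary instead of letting it flow downstream:

```python
from pydantic import BaseModel, ValidationError

class ToolCall(BaseModel):
    tool: str
    args: dict

def parse_llm_output(raw_json: str) -> ToolCall:
    try:
        return ToolCall.model_validate_json(raw_json)  # pydantic v2 API
    except ValidationError as exc:
        # Deterministic, loggable failure: retry the LLM step or fail fast
        raise ValueError(f"LLM output failed schema validation: {exc}") from exc
```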

Note

This also saves money, time, and compute: with structured output you can make deterministic assertions in your evals instead of paying for LLM-as-a-Judge.

Test the building blocks

Write unit tests and integration tests for the core building blocks — same as you would’ve done before LLMs.
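A minimal sketch of what that looks like, reusing the hypothetical extract_order_id rule from earlier:

```python
import re

# The frozen rule from the earlier (hypothetical) example
ORDER_ID = re.compile(r"\bORD-\d{6}\b")

def extract_order_id(text: str) -> str | None:
    match = ORDER_ID.search(text)
    return match.group(0) if match else None

# Plain pytest-style unit tests: deterministic, fast, no LLM-as-a-Judge needed
def test_extracts_order_id():
    assert extract_order_id("Refund ORD-123456 please") == "ORD-123456"

def test_returns_none_when_absent():
    assert extract_order_id("no id here") is None

def test_rejects_short_ids():
    assert extract_order_id("ORD-12 is malformed") is None
```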


Closing

This isn’t about distrusting models; it’s about giving them good building blocks to use.

Use GenAI to translate intent. Use the building blocks to execute. Keep errors where you can measure them.

That is how automation becomes something you can trust, instead of something you manually test once and hope holds up.