Building Agent Skills: Intent, Determinism, and Stability

A mental model and decision tree for building agent skills incrementally: start with intent, add deterministic tools, then use tests and AI evals to reduce drift and risk.
Categories: genai, workflows, reliability

Author: Alex Guglielmone Nemi

Published: February 13, 2026

I want to offer a mental model and decision tree for building Agent Skills incrementally. It’s meant for anyone experimenting with them - not just software developers - and focuses on staying in control as complexity grows or as you start thinking about sharing and collaborating with others.

Note

It’s awesome to see increased adoption of Agent Skills to package workflows1, and I attribute a lot of their success to standardized contracts for central use and to managing context through progressive disclosure2 (more mature tools/MCPs and models definitely lowered friction as well).

Mental Model

You can think of Agents as assistants that can take load off you, and Agent Skills as the high-level instructions you might leave for them, in a standardized format.

You need to know what you want from them and the tradeoffs between micromanagement and agency - Intent. You want to offload mechanical work so they’re not reasoning about things a calculator or a spreadsheet could handle - Determinism. And you want the whole thing to hold up even if you swap one assistant for another - Stability.

Looking at the shape of an Agent Skill, you can mentally map it as:

  • Intent -> Markdown instructions
  • Determinism -> Tools and scripts
  • Stability -> Tests and AI evals

If your intent is clear enough, the built-in skill-builder in many agent CLIs may be sufficient, especially if you still validate outputs manually or stay in the loop for approvals. The more you want to change the skill without manually validating every run, or the more you worry about undesired or rogue behavior, the more determinism helps: scripts reduce error compounding, and unit tests plus AI evals reduce drift risk and make iteration safer and faster, especially when sharing or collaborating.
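For a concrete (and entirely hypothetical) example of that move, imagine a skill that asks the agent to total expense rows and flag anything over a limit. Rather than letting the model do the arithmetic, the skill can point at a small script that returns structured output for the agent to relay. The file name, columns, and the 500.00 threshold below are all invented for the sketch.

```python
# summarize_expenses.py - a hypothetical helper the skill's Markdown could tell
# the agent to run instead of reasoning about the numbers itself.
import csv
import json
import sys
from decimal import Decimal


def summarize(path: str, limit: Decimal = Decimal("500.00")) -> dict:
    """Total the 'amount' column of a CSV and flag rows above `limit`."""
    total = Decimal("0")
    flagged = []
    rows = 0
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            rows += 1
            amount = Decimal(row["amount"])
            total += amount
            if amount > limit:
                flagged.append({"description": row.get("description", ""), "amount": str(amount)})
    return {"row_count": rows, "total": str(total), "flagged": flagged}


if __name__ == "__main__":
    # Structured output on stdout: the agent reports this JSON instead of re-deriving it.
    print(json.dumps(summarize(sys.argv[1]), indent=2))
```

The skill’s instructions then only need to say “run summarize_expenses.py on the exported CSV and report the JSON”, which is easier to validate and much harder to drift.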

Decision Tree

```mermaid
flowchart TD
  A([Start: I have or am developing an Agent Skill]) --> Q1{Only for you<br/>and you're happy manually reviewing outputs?}

  Q1 -- Yes --> L0["Level 0: Intent (Markdown only might be enough for you)<br/>- Clear inputs/outputs help<br/>- Examples help<br/>- Define good enough"]
  Q1 -- No --> Q2{Need repeatable structure<br/>or mechanical consistency?}

  Q2 -- No --> L0
  Q2 -- Yes --> L1["Level 1: Determinism (Tools)<br/>- Move mechanical steps into tools/scripts<br/>- Use structured output<br/>- If possible, log tool inputs/outputs"]

  L1 --> Q3{Will others use it<br/>or will you modify it often<br/>without manual re-checking?}

  Q3 -- No --> L1
  Q3 -- Yes --> L2["Level 2: Stability (Tests + Evals)<br/>- Unit tests for tools/scripts<br/>- AI evals for behavior/user stories<br/>- Minimal golden cases + some edge cases"]

  L2 --> Q4{Can it access sensitive data<br/>or take impactful actions<br/>or run unattended?}

  Q4 -- No --> L2
  Q4 -- Yes --> L3["Level 3: Safety/Scale<br/>- Guardrails + least privilege<br/>- Human approval for high impact<br/>- Security-focused evals"]
```
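The Level 2 box compresses a lot, so here is a minimal sketch of what “unit tests plus a few golden cases” could look like for the hypothetical expense helper from earlier. The structure (pytest, a GOLDEN list, a simple containment check) is just one possible shape; a real AI eval would run your agent rather than the canned answer used here to keep the sketch self-contained.

```python
# test_skill.py - hypothetical tests for the expense skill (pytest style).
from summarize_expenses import summarize  # the deterministic tool from the earlier sketch


def test_summarize_flags_large_amounts(tmp_path):
    csv_file = tmp_path / "expenses.csv"
    csv_file.write_text("description,amount\nlunch,12.50\nlaptop,1299.00\n")
    result = summarize(str(csv_file))
    assert result["total"] == "1311.50"
    assert result["flagged"] == [{"description": "laptop", "amount": "1299.00"}]


# Golden cases for the behavioral side: each pairs a prompt with facts the
# final answer must contain. In a real eval, the answer would come from
# actually running the skill.
GOLDEN = [
    {"prompt": "Summarize January expenses", "must_contain": ["1311.50", "laptop"]},
]


def check_answer(answer: str, must_contain: list[str]) -> bool:
    return all(fact in answer for fact in must_contain)


def test_golden_cases():
    canned_agent_answer = "Total for January: 1311.50. Flagged over the limit: laptop (1299.00)."
    for case in GOLDEN:
        assert check_answer(canned_agent_answer, case["must_contain"])
```

Even a couple of cases like these make the skill cheaper to modify later: if a refactor or a model swap changes the behavior, the golden cases fail before a user notices.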

Note

Observability is a key point I omitted from the levels above. It lets you monitor cost (tokens spent), latency, tool selection, and more. Add it as soon as you feel you are missing that information; investing here is what lets you answer questions about what happened in a particular agent instruction-following loop.
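As a rough idea of the shape this can take, here is a minimal sketch that assumes nothing about your stack: a decorator that appends one JSON line per tool call with latency and a preview of inputs and outputs. The names (log_tool_call, observability.jsonl, lookup_exchange_rate) are made up for illustration, and token counts would come from whatever your agent framework reports.

```python
# observability.py - hypothetical per-tool-call logging; adapt to your framework.
import functools
import json
import time
from pathlib import Path

LOG_PATH = Path("observability.jsonl")


def log_tool_call(func):
    """Append one JSON line per call: tool name, latency, args, and a result preview."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        result = func(*args, **kwargs)
        record = {
            "tool": func.__name__,
            "latency_s": round(time.perf_counter() - start, 4),
            "args": repr(args),
            "kwargs": repr(kwargs),
            "result_preview": repr(result)[:200],
        }
        with LOG_PATH.open("a") as f:
            f.write(json.dumps(record) + "\n")
        return result
    return wrapper


@log_tool_call
def lookup_exchange_rate(currency: str) -> float:
    # Stand-in tool so the sketch runs end to end.
    return {"EUR": 1.08, "GBP": 1.27}.get(currency, 1.0)


if __name__ == "__main__":
    lookup_exchange_rate("EUR")
    print(LOG_PATH.read_text())
```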

Go deeper on each level:

  • Level 0 - Intent: Skills Spec
  • Level 1 - Determinism: Error compounding + determinism
  • Level 2 - Stability: Evals primer
  • Level 3 - Security: If you’re here, least privilege and human approval for high-impact actions are usually a good baseline (a minimal sketch follows this list). For each integration point, ask what the worst-case outcome is. It’s also worth understanding Prompt Injection.
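For the approval part of that baseline, here is a minimal sketch assuming a plain command-line confirmation; the action names and the HIGH_IMPACT set are invented for illustration, and a real setup would route the approval through whatever channel your team actually watches.

```python
# approval_gate.py - hypothetical guard: high-impact tool calls need a human "yes".
HIGH_IMPACT = {"delete_records", "send_payment", "email_all_customers"}


def require_approval(action: str, detail: str) -> bool:
    """Return True only if the action is low impact or a human explicitly approves it."""
    if action not in HIGH_IMPACT:
        return True
    answer = input(f"Agent wants to run '{action}' ({detail}). Approve? [y/N] ")
    return answer.strip().lower() == "y"


def send_payment(amount: float, recipient: str) -> str:
    if not require_approval("send_payment", f"{amount:.2f} to {recipient}"):
        return "blocked: not approved"
    # ...the real side effect would go here...
    return f"sent {amount:.2f} to {recipient}"


if __name__ == "__main__":
    print(send_payment(120.0, "vendor@example.com"))
```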

A note on experience

The levels are illustrative; they exist to keep you from overwhelming yourself and to avoid paralysis, whether from fear of breaking things or from too many choices at the start.

After you build a few skills, recognizing patterns becomes easier and you can decide where to invest based on your own pain points. You may start thinking about intent, determinism, stability, and safety from the beginning.

That does not mean implementing everything at once. It means being aware of more tradeoffs earlier.

Build only what you need, and keep it as simple as possible.

Practical Takeaway

Use skills to clarify intent. When a step stabilizes, move it into code or tools, not because the model can’t do it, but because you don’t want to rediscover the same approach on every run. That lowers cost, reduces drift risk, and keeps room for directed experiments.

You can build skills with your agent CLI of choice (Claude/Codex/OpenCode), or use frameworks that support the pattern, like Doug Trajano’s Agent Skills implementation for PydanticAI (docs).


Call to action

If you have a skill you want to take beyond Markdown with determinism or AI Evals, share it. We can discuss which steps are missing to move from one level to the next. It may be simpler than it looks, and we could use it as a public-facing example to help others see specific ways to improve.

Footnotes

  1. Turning a sequence of steps you’d otherwise repeat manually into a single, reusable instruction an agent can follow.↩︎

  2. Progressive disclosure means your agent doesn’t need every instruction in context at once. It can load what’s relevant when needed. See Doug Trajano’s PydanticAI Agent Skills implementation and docs.↩︎