Toward a Shared Vision for LLM Evaluation in the Airflow Ecosystem
Airflow Summit 2025 - Lightning Talk (5 min)
2025-09-09
Pain Points
- Overwhelming choice: MCPs, Agents
- We can’t compare what we can’t measure
- Can I trust this to run on its own?
Automation
This is not a new problem
- I need to:
  - Be able to describe what I want
  - Understand my risk profile
Integration / Unit Tests
f(input) -> desired outcome [✓]
f(model, prompt, system prompt, context) -> desired outcome [✓]
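A minimal sketch of that signature as a plain unit test, assuming the model can be wrapped as a callable. The names `Model`, `EvalCase`, `run_case`, and `stub_model` are hypothetical and not part of Airflow or any MCP library; the stub stands in for a real LLM call so the example stays deterministic.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical type: any callable mapping (system_prompt, prompt, context) to text.
Model = Callable[[str, str, str], str]


@dataclass(frozen=True)
class EvalCase:
    """One test case: the inputs besides the model, plus the desired outcome."""
    prompt: str
    system_prompt: str
    context: str
    expected: str


def run_case(model: Model, case: EvalCase) -> bool:
    """f(model, prompt, system prompt, context) -> does it match the desired outcome?"""
    answer = model(case.system_prompt, case.prompt, case.context)
    return answer.strip() == case.expected


def stub_model(system_prompt: str, prompt: str, context: str) -> str:
    # Stand-in for a real LLM call (assumption): it simply returns the context.
    return context


case = EvalCase(
    prompt="Is any dag paused?",
    system_prompt="Answer with a JSON list of dag ids.",
    context='["dag_a", "dag_c"]',
    expected='["dag_a", "dag_c"]',
)
passed = run_case(stub_model, case)
```

Swapping `stub_model` for a wrapper around a real model keeps the test unchanged, which is what makes two models or two prompts comparable.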
How do we write OUR benchmarks?
(Fast Iteration / Grounded Conversations)
- Is any dag paused?
  ["dag_a", "dag_c"]
- Has the backfill for X completed?
  'No, currently running tasks is ["a.task_x"]'
- How many SLA breaches have we had in the last 2 weeks?
  4
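One way the question and expected-answer pairs above could be kept as data and scored, sketched under assumptions: `BENCHMARK`, `matches`, `pass_rate`, and `canned_agent` are hypothetical names, and the comparison rules (order-insensitive lists, exact match for free text) are illustrative choices rather than an agreed standard.

```python
import json
from typing import Any, Callable, List, Tuple

# The three example cases from the talk: (question, expected answer).
BENCHMARK: List[Tuple[str, Any]] = [
    ("Is any dag paused?", ["dag_a", "dag_c"]),
    ("Has the backfill for X completed?",
     'No, currently running tasks is ["a.task_x"]'),
    ("How many SLA breaches have we had in the last 2 weeks?", 4),
]


def matches(expected: Any, answer: str) -> bool:
    """Compare a raw text answer with the expected value (illustrative rules)."""
    if isinstance(expected, str):
        # Free-text answers: exact match after trimming whitespace.
        return answer.strip() == expected
    try:
        parsed = json.loads(answer)
    except json.JSONDecodeError:
        return False
    if isinstance(expected, list):
        # Lists: same items, order ignored.
        return (isinstance(parsed, list)
                and sorted(map(str, parsed)) == sorted(map(str, expected)))
    # Numbers: equal value, but reject JSON booleans.
    return not isinstance(parsed, bool) and parsed == expected


def pass_rate(agent: Callable[[str], str],
              benchmark: List[Tuple[str, Any]] = BENCHMARK) -> float:
    """Fraction of benchmark questions the agent answers correctly."""
    results = [matches(expected, agent(question)) for question, expected in benchmark]
    return sum(results) / len(results)


# Canned answers stand in for a real agent or MCP server (assumption).
CANNED = {
    "Is any dag paused?": '["dag_c", "dag_a"]',
    "Has the backfill for X completed?":
        'No, currently running tasks is ["a.task_x"]',
    "How many SLA breaches have we had in the last 2 weeks?": "4",
}


def canned_agent(question: str) -> str:
    return CANNED[question]


rate = pass_rate(canned_agent)
```

Because the cases are plain data, the same list can be run against different agents, and the pass rate gives a single number to compare them by.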
Call to Action: Let’s start
(“Reproducibility” First)
Who’s interested?
(MCP pioneers, I’m looking at you)