Toward a Shared Vision for LLM Evaluation in the Airflow Ecosystem

Airflow Summit 2025 - Lightning Talk (5 min)

Alex Guglielmone Nemi

2025-09-09

Pain Points

  • Overwhelming choice: MCP servers, agents
  • We can’t compare what we can’t measure
  • Can I trust this to run on its own?

Automation

This is not a new problem


  • I need to:
    • Be able to describe what I want
    • Understand my risk profile

Integration / Unit Tests


f(input) -> desired outcome [✓]
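In the deterministic case this is just an ordinary unit test: one input, one expected outcome. A minimal sketch in Python (the helper function and data are illustrative, not Airflow APIs):

```python
# Classic unit test: a pure function maps an input to a single expected output.
def count_paused_dags(dag_states: dict) -> int:
    """Count DAGs whose paused flag is True (illustrative helper)."""
    return sum(1 for paused in dag_states.values() if paused)

def test_count_paused_dags():
    # Deterministic: the same input always yields the same outcome.
    assert count_paused_dags({"dag_a": True, "dag_b": False, "dag_c": True}) == 2
```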


f(model, prompt, system prompt, context) -> desired outcome [✓]
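With an LLM in the loop, the "function" under test takes the model, prompts, and context as inputs, and the grader decides whether the output matches the desired outcome. A hedged sketch, where `call_llm` is a placeholder for whatever client you actually use and exact-match grading is the simplest possible choice:

```python
# Sketch: an LLM eval as a function of (model, prompt, system prompt, context).
# `call_llm` is a placeholder, not a real client; field names are assumptions.
from dataclasses import dataclass

@dataclass
class EvalCase:
    system_prompt: str
    prompt: str
    context: str   # e.g. serialized Airflow metadata the agent can see
    expected: str  # desired outcome

def grade(case: EvalCase, answer: str) -> bool:
    # Simplest grader: exact match. Real suites may use structured-output
    # checks, semantic similarity, or LLM-as-judge instead.
    return answer.strip() == case.expected

def evaluate(model: str, case: EvalCase, call_llm) -> bool:
    answer = call_llm(model=model, system=case.system_prompt,
                      prompt=case.prompt, context=case.context)
    return grade(case, answer)
```

The key point: once the inputs are pinned down like this, an LLM eval looks like any other test case, which is what makes results comparable across models and prompts.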

Existing Tools & Frameworks


  • Promptfoo, TruLens, DeepEval, Ragas, Langfuse, your-custom-built-tool, and many, many more

How do we write OUR benchmarks

(Fast Iteration / Grounded Conversations)


  • “Is any dag paused?” → ["dag_a", "dag_c"]
  • “Has the backfill for X completed?” → 'No, currently running tasks are ["a.task_x"]'
  • “How many SLA breaches have we had in the last 2 weeks?” → 4
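The question/answer pairs above are already benchmark cases; all that is missing is a harness. A minimal sketch, assuming a flat list of cases and an exact-match scorer (field names and format are illustrative, not a proposed standard):

```python
# Sketch: the slide's Q&A pairs as data, plus the simplest possible scorer.
BENCHMARK = [
    {"question": "Is any dag paused?",
     "expected": ["dag_a", "dag_c"]},
    {"question": "Has the backfill for X completed?",
     "expected": 'No, currently running tasks are ["a.task_x"]'},
    {"question": "How many SLA breaches have we had in the last 2 weeks?",
     "expected": 4},
]

def score(run_agent, cases=BENCHMARK) -> float:
    """Fraction of cases where the agent's answer exactly matches expected."""
    hits = sum(1 for c in cases if run_agent(c["question"]) == c["expected"])
    return hits / len(cases)
```

Even a tiny shared set like this enables the fast iteration and grounded conversations the talk asks for: the same cases can be run against any MCP server or agent and the scores compared directly.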

Call to Action: Let’s start

(“Reproducibility” First)

Who’s interested?

(MCP pioneers, I’m looking at you)

Thank you!