Toward a Shared Vision for LLM Evaluation in the Airflow Ecosystem

Airflow Summit 2025 - Lightning Talk (5 min)

Alex Guglielmone Nemi

2025-09-09

Pain Points

  • Overwhelming choice: MCP servers, agents
  • We can’t compare what we can’t measure
  • Can I trust this to run on its own?

Automation

This is not a new problem


  • I need to:
    • Be able to describe what I want
    • Understand my risk profile

Integration / Unit Tests


f(input) -> desired outcome [✓]
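In the deterministic case this is just an ordinary unit test: one input, one expected outcome. A minimal sketch in Python (the helper function and data are illustrative, not Airflow APIs):

```python
# Classic unit test: a pure function maps an input to a single expected output.
def count_paused_dags(dag_states: dict) -> int:
    """Count DAGs whose paused flag is True (illustrative helper)."""
    return sum(1 for paused in dag_states.values() if paused)

def test_count_paused_dags():
    # Deterministic: the same input always yields the same outcome.
    assert count_paused_dags({"dag_a": True, "dag_b": False, "dag_c": True}) == 2
```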


f(model, prompt, system prompt, context) -> desired outcome [✓]
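With an LLM in the loop, the "function" under test takes the model, prompts, and context as inputs, and the grader decides whether the output matches the desired outcome. A hedged sketch, where `call_llm` is a placeholder for whatever client you actually use and exact-match grading is the simplest possible choice:

```python
# Sketch: an LLM eval as a function of (model, prompt, system prompt, context).
# `call_llm` is a placeholder, not a real client; field names are assumptions.
from dataclasses import dataclass

@dataclass
class EvalCase:
    system_prompt: str
    prompt: str
    context: str   # e.g. serialized Airflow metadata the agent can see
    expected: str  # desired outcome

def grade(case: EvalCase, answer: str) -> bool:
    # Simplest grader: exact match. Real suites may use structured-output
    # checks, semantic similarity, or LLM-as-judge instead.
    return answer.strip() == case.expected

def evaluate(model: str, case: EvalCase, call_llm) -> bool:
    answer = call_llm(model=model, system=case.system_prompt,
                      prompt=case.prompt, context=case.context)
    return grade(case, answer)
```

The key point: once the inputs are pinned down like this, an LLM eval looks like any other test case, which is what makes results comparable across models and prompts.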

Existing Tools & Frameworks


  • Promptfoo, TruLens, DeepEval, Ragas, Langfuse, your-custom-built-tool, and many, many more

How do we write OUR benchmarks

(Fast Iteration / Grounded Conversations)


  • “Is any dag paused?” → ["dag_a", "dag_c"]
  • “Has the backfill for X completed?” → 'No, currently running tasks are ["a.task_x"]'
  • “How many SLA breaches have we had in the last 2 weeks?” → 4
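The question/answer pairs above are already benchmark cases; all that is missing is a harness. A minimal sketch, assuming a flat list of cases and an exact-match scorer (field names and format are illustrative, not a proposed standard):

```python
# Sketch: the slide's Q&A pairs as data, plus the simplest possible scorer.
BENCHMARK = [
    {"question": "Is any dag paused?",
     "expected": ["dag_a", "dag_c"]},
    {"question": "Has the backfill for X completed?",
     "expected": 'No, currently running tasks are ["a.task_x"]'},
    {"question": "How many SLA breaches have we had in the last 2 weeks?",
     "expected": 4},
]

def score(run_agent, cases=BENCHMARK) -> float:
    """Fraction of cases where the agent's answer exactly matches expected."""
    hits = sum(1 for c in cases if run_agent(c["question"]) == c["expected"])
    return hits / len(cases)
```

Even a tiny shared set like this enables the fast iteration and grounded conversations the talk asks for: the same cases can be run against any MCP server or agent and the scores compared directly.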

Call to Action: Let’s start

(“Reproducibility” First)

Who’s interested?

(MCP pioneers, I’m looking at you)

Thank you!