A crash gets caught. A confident but wrong answer
gets forwarded to your boss.
For each retrieved chunk: relevant, or not?
"Looks good to me" names none of them
Expensive, slow — and it can trade recall away for precision
Change one thing at a time, then re-run your evals
@seldo.com on BlueSky 🦋
arize.com/docs/ax
These slides: