Stop Vibe Shipping:
Evaluate Your Retrieval

This is the right room for this talk
What are we talking about?
- Why retrieval fails quietly
- Most similar != most relevant
- The metrics that actually matter
- How to actually improve relevance
- Where LLM-as-judge helps — and where it lies
The vibe-shipping trap
You can't unit-test your way out
Why retrieval fails quietly
A crash gets caught. A confident but wrong answer gets forwarded to your boss.
Most similar ≠ most relevant
Similarity is a proxy
- Vector search ranks by similarity.
- Your user needs relevance.
- Those are not the same thing.
How similarity breaks
- Right topic, no answer
- Perfectly matched, stale data
- Near-duplicates crowding out the real answer
How do we measure retrieval effectively?
Retrieval vs. generation
- Retrieval: did we fetch the right context?
- Generation: did we use it well?
Relevance is the atomic unit
One question per chunk
For each retrieved chunk: relevant, or not?
Let an LLM judge relevance
- Send the query + the chunk to an LLM
- "Does this text help answer this question?"
- Get back: relevant / unrelated, plus an explanation
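A minimal sketch of that judge call, under stated assumptions: `call_llm` is a hypothetical stand-in for whatever model client you use (prompt string in, reply text out), and the prompt wording and reply format are illustrative, not a tested template.

```python
from typing import Callable

JUDGE_PROMPT = """You are grading search results.
Question: {query}
Text: {chunk}
Does this text help answer this question?
Reply with one word, "relevant" or "unrelated", then a one-sentence explanation on the next line."""


def judge_relevance(query: str, chunk: str, call_llm: Callable[[str], str]) -> dict:
    """Ask an LLM whether one chunk helps answer one query.

    call_llm is hypothetical: it takes a prompt and returns the model's reply.
    """
    reply = call_llm(JUDGE_PROMPT.format(query=query, chunk=chunk)).strip()
    first_line, _, rest = reply.partition("\n")
    label = first_line.strip().lower().strip('."')
    if label not in ("relevant", "unrelated"):
        # Do not guess on malformed output; flag it for review instead.
        label = "unparseable"
    return {"label": label, "explanation": rest.strip()}


# A fake model so the sketch runs without network access.
def fake_llm(prompt: str) -> str:
    return "relevant\nThe text states the refund window directly."


result = judge_relevance(
    "How long is the refund window?",
    "Refunds are accepted within 30 days.",
    fake_llm,
)
```

Keeping the explanation next to the label is what makes a disputed call reviewable later.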
The metrics that actually matter
Hit rate:
Did we retrieve anything relevant at all?
- Did at least one relevant chunk make the cut?
- The floor. If this fails, nothing downstream can work.
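As a sketch, hit rate over a batch of queries could be computed like this, assuming each query's retrieved chunks have already been judged and stored as a list of booleans (the `judged` data below is made up for illustration):

```python
def hit_rate(judged_queries: list[list[bool]]) -> float:
    """Fraction of queries where at least one retrieved chunk was relevant."""
    if not judged_queries:
        raise ValueError("need at least one query")
    hits = sum(1 for labels in judged_queries if any(labels))
    return hits / len(judged_queries)


# One inner list per query; True means the chunk was judged relevant.
judged = [
    [False, True, False],
    [False, False, False],  # a total miss
    [True, True, False],
    [False, False, True],
]
rate = hit_rate(judged)  # 3 of 4 queries had a hit: 0.75
```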
Precision@k:
How much of what we retrieved is junk?
- Of the top k chunks, what fraction are relevant?
- Junk dilutes and distracts the model
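A minimal sketch of precision@k for one query, assuming the retrieved chunks are in ranked order and already labeled. It divides by k even when fewer than k chunks came back, which is one common convention rather than the only one:

```python
def precision_at_k(labels: list[bool], k: int) -> float:
    """Fraction of the top k retrieved chunks that were judged relevant."""
    if k <= 0:
        raise ValueError("k must be positive")
    return sum(labels[:k]) / k


# Ranked judgments for one query (illustrative data).
labels = [True, False, True, False, False]
p_at_5 = precision_at_k(labels, 5)  # 2 relevant out of 5: 0.4
```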
Recall@k:
Did we leave the answer on the table?
- Of all the relevant chunks that exist, what fraction did we get?
- Low recall = the answer was out there, you just didn't fetch it
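Recall needs one thing precision does not: the full set of relevant chunks for the query, which in practice comes from hand-labeled data. A sketch, with made-up chunk IDs:

```python
def recall_at_k(retrieved_ids: list[str], relevant_ids: set[str], k: int) -> float:
    """Fraction of all known-relevant chunks that appear in the top k results."""
    if not relevant_ids:
        raise ValueError("recall is undefined when no chunk is relevant")
    found = set(retrieved_ids[:k]) & relevant_ids
    return len(found) / len(relevant_ids)


retrieved = ["c7", "c2", "c9", "c4"]  # ranked results for one query
relevant = {"c2", "c4", "c11"}        # labeled relevant; c11 was never fetched
r_at_4 = recall_at_k(retrieved, relevant, 4)  # found c2 and c4: 2/3
```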
NDCG and MRR
Is the good stuff near the top?
- Ranking metrics: relevant chunks should rank highest
- Models weight early context more heavily, so order matters
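A sketch of both ranking metrics with binary relevance labels. One assumption to note: this NDCG normalizes against the best possible ordering of the chunks that were actually retrieved, so it scores ordering only and says nothing about relevant chunks that were never fetched.

```python
import math


def reciprocal_rank(labels: list[bool]) -> float:
    """1 / rank of the first relevant chunk, or 0.0 if none is relevant."""
    for rank, is_relevant in enumerate(labels, start=1):
        if is_relevant:
            return 1.0 / rank
    return 0.0


def mrr(judged_queries: list[list[bool]]) -> float:
    """Mean reciprocal rank across queries."""
    return sum(reciprocal_rank(q) for q in judged_queries) / len(judged_queries)


def dcg(labels: list[bool]) -> float:
    """Discounted cumulative gain: relevant chunks count less the lower they rank."""
    return sum(rel / math.log2(rank + 1) for rank, rel in enumerate(labels, start=1))


def ndcg(labels: list[bool]) -> float:
    """DCG divided by the DCG of the ideal ordering of the same labels."""
    ideal = dcg(sorted(labels, reverse=True))
    return dcg(labels) / ideal if ideal > 0 else 0.0


perfect = ndcg([True, False, False])  # relevant chunk already on top: 1.0
buried = ndcg([False, False, True])   # same chunk in third place: 0.5
```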
Four metrics, four different failures
- Hit rate
- Precision@k
- Recall@k
- Ranking
"Looks good to me" names none of them
Golden datasets that survive reality
What a golden dataset is
- A set of real queries, each paired with the chunks you'd expect to be relevant
- The encoded judgment of the people who know your domain
Build it from reality, not your imagination
- Pull real queries from production traces
- Not five questions you invented at your desk
Label it, and be disciplined
- Read the chunks. Label each relevant / not.
- Specific criteria, not "this one feels right"
Keep it alive
- Your corpus drifts. Your users drift.
- Version it, refresh it, add new failures as you find them
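One possible shape for a golden dataset record, as a sketch. The field names, the JSONL storage format, and the example row are all assumptions for illustration; the point is that the query, the labels, the labeling criteria, and a version travel together.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class GoldenExample:
    query: str                     # a real user query, e.g. pulled from a trace
    relevant_chunk_ids: list[str]  # chunks a domain expert labeled relevant
    criteria: str                  # the written rule the labeler applied
    dataset_version: str           # bump whenever the set is refreshed


def to_jsonl(examples: list[GoldenExample]) -> str:
    """Serialize one example per line so diffs stay readable in version control."""
    return "\n".join(json.dumps(asdict(e), sort_keys=True) for e in examples)


def from_jsonl(text: str) -> list[GoldenExample]:
    """Load examples back, skipping blank lines."""
    return [GoldenExample(**json.loads(line)) for line in text.splitlines() if line.strip()]


# Illustrative row, not real data.
examples = [
    GoldenExample(
        query="How long is the refund window?",
        relevant_chunk_ids=["c2", "c4"],
        criteria="Chunk must state the refund period for current customers.",
        dataset_version="v1",
    )
]
round_tripped = from_jsonl(to_jsonl(examples))
```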
Where LLM-as-judge helps, and where it lies
Where LLM-as-judge helps
- Relevance judgments at machine speed
- With an explanation for every call
- Catches "similar but irrelevant" that cosine can't
Where LLM-as-judge lies
- Inherits the "plausible = relevant" trap
- Position and verbosity bias
- Self-preference if it grades its own model's output
Check the judge against the golden set
- Run the judge on your hand-labeled data
- Measure its precision and recall like any classifier
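Treating the judge as a binary classifier, with the hand labels as ground truth, could look like this sketch (the two label lists are made up and assumed to be aligned pair by pair):

```python
def judge_precision_recall(human: list[bool], judge: list[bool]) -> tuple[float, float]:
    """Score the judge's 'relevant' calls against human labels."""
    if len(human) != len(judge):
        raise ValueError("label lists must be the same length")
    tp = sum(h and j for h, j in zip(human, judge))        # both said relevant
    fp = sum((not h) and j for h, j in zip(human, judge))  # judge over-called
    fn = sum(h and (not j) for h, j in zip(human, judge))  # judge missed it
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


human_labels = [True, True, False, False, True]
judge_labels = [True, False, True, False, True]
precision, recall = judge_precision_recall(human_labels, judge_labels)  # 2/3 and 2/3
```

Low precision means the judge is calling plausible chunks relevant; low recall means it is rejecting chunks your experts accepted.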
Back to generation
Two checks on the answer
- Correctness: did it answer the question?
- Faithfulness: did it stick to the retrieved context or make things up?
So what do you actually fix?
- Low recall → the answer never came back
- Low precision / bad ranking → it came back, buried in junk
Low recall:
get the answer in the door
- Chunk on meaning, not token counts
- Add keyword search alongside vectors — hybrid search
- Rewrite the query before you search
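The slides don't say how to combine keyword and vector results. One common option is reciprocal rank fusion (RRF), sketched here as an assumption rather than the talk's method. It needs only the two ranked ID lists, not comparable scores, and `k=60` is the constant conventionally used with it.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge several ranked lists of chunk IDs into one ranking.

    Each list contributes 1 / (k + rank) to every ID it contains, so IDs that
    rank well in more than one list rise to the top.
    """
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; break ties by ID so output is deterministic.
    return sorted(scores, key=lambda chunk_id: (-scores[chunk_id], chunk_id))


vector_hits = ["a", "b", "c"]   # illustrative IDs from vector search
keyword_hits = ["c", "a", "d"]  # illustrative IDs from keyword search
fused = reciprocal_rank_fusion([vector_hits, keyword_hits])
```

Here `d` was invisible to vector search and still makes the merged list, which is the recall gain hybrid search is after.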
Low precision:
clean up what comes back
Reranking:
The single highest-leverage move
- Over-fetch ~20 candidates, then re-score them with a cross-encoder
- Catches the relevance your embeddings flattened away
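The over-fetch-then-rerank flow can be sketched as below. Both `search` and `score_pair` are hypothetical callables: in a real system `search` would be your vector store query and `score_pair` a cross-encoder scoring each (query, chunk) pair. The toy scorer here only counts shared words so the sketch runs on its own; it is a stand-in, not a cross-encoder.

```python
from typing import Callable


def retrieve_and_rerank(
    query: str,
    search: Callable[[str, int], list[str]],
    score_pair: Callable[[str, str], float],
    fetch_k: int = 20,
    keep_k: int = 5,
) -> list[str]:
    """Fetch more candidates than needed, re-score each one, keep the best."""
    candidates = search(query, fetch_k)
    scored = [(score_pair(query, chunk), chunk) for chunk in candidates]
    # Stable sort on score only, so ties keep their first-stage order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [chunk for _, chunk in scored[:keep_k]]


# Toy stand-ins so the sketch runs without a vector store or a model.
corpus = [
    "Our office dog is named Refund.",
    "Refunds are accepted within 30 days of purchase.",
    "Shipping takes 3 to 5 days.",
]


def toy_search(query: str, limit: int) -> list[str]:
    return corpus[:limit]


def toy_score(query: str, chunk: str) -> float:
    return float(len(set(query.lower().split()) & set(chunk.lower().split())))


top = retrieve_and_rerank("how many days for refunds", toy_search, toy_score, keep_k=1)
```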
Metadata filtering and dedup
- Filter on recency / source — kills the stale perfect match
- Dedupe — stops near-duplicates crowding out the answer
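A sketch of both cleanup steps applied to an already-ranked result list. The `Chunk` fields, the word-overlap (Jaccard) test for near-duplicates, and the 0.9 threshold are illustrative assumptions; real systems often filter inside the vector store query instead.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Chunk:
    text: str
    source: str
    updated: date


def jaccard(a: set[str], b: set[str]) -> float:
    """Word-set overlap: 1.0 for identical sets, 0.0 for disjoint ones."""
    return len(a & b) / len(a | b) if (a | b) else 1.0


def filter_and_dedupe(
    chunks: list[Chunk],
    not_before: date,
    allowed_sources: set[str],
    max_overlap: float = 0.9,
) -> list[Chunk]:
    """Drop stale or untrusted chunks, then drop near-duplicates of kept ones."""
    kept: list[Chunk] = []
    for chunk in chunks:  # assumed to be in ranked order, best first
        if chunk.updated < not_before or chunk.source not in allowed_sources:
            continue
        words = set(chunk.text.lower().split())
        if any(jaccard(words, set(k.text.lower().split())) >= max_overlap for k in kept):
            continue
        kept.append(chunk)
    return kept


# Toy data: one duplicate, one stale match, one untrusted source.
results = [
    Chunk("Refunds are accepted within 30 days.", "docs", date(2025, 3, 1)),
    Chunk("Refunds are accepted within 30 days.", "docs", date(2025, 3, 2)),
    Chunk("Refunds are accepted within 14 days.", "docs", date(2021, 1, 1)),
    Chunk("Refunds take 5 business days to process.", "forum", date(2025, 2, 1)),
    Chunk("Shipping takes 3 to 5 days.", "docs", date(2025, 1, 10)),
]
cleaned = filter_and_dedupe(results, date(2024, 1, 1), {"docs"})
```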
Don't fine-tune your embeddings first
Expensive, slow — and it can trade recall away for precision
One rule for all of it
Change one thing at a time, then re-run your evals
Start small
Don't ship retrieval on vibes
- Measure relevance, not similarity.
- Specify it.
- Measure it.
- Improve toward it.
Thank you!
@seldo.com on Bluesky 🦋
arize.com/docs/ax
These slides:
Stop Vibe Shipping: Evaluate Your Retrieval (Vector Space Day)
By Laurie Voss