Stop Vibe Shipping:

Evaluate Your Retrieval

This is the right room

for this talk

What are we talking about?

Why retrieval fails quietly
Most similar ≠ most relevant
The metrics that actually matter
How to actually improve relevance
Where LLM-as-judge helps — and where it lies

The vibe-shipping trap

You can't unit-test

your way out

Why retrieval fails quietly

A crash gets caught. A confident but wrong answer

gets forwarded to your boss.

Most similar ≠ most relevant

Similarity is a proxy

Vector search ranks by similarity.
Your user needs relevance.
Those are not the same thing.

How similarity breaks

Right topic, no answer
Perfectly matched, stale data
Near-duplicates crowding out the real answer

How do we measure

retrieval effectively?

Retrieval vs. generation

Retrieval: did we fetch the right context?
Generation: did we use it well?

Relevance is the atomic unit

One question per chunk

For each retrieved chunk: relevant, or not?

Let an LLM judge relevance

Send the query + the chunk to an LLM
"Does this text help answer this question?"
Get back: relevant / unrelated, plus an explanation
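
A minimal sketch of the judging step above, in Python. The prompt wording, the reply format, and the `ask_llm` callable are all assumptions for illustration; `ask_llm` stands in for whatever client you use to call a model and is not a real library function.

```python
from typing import Callable

# Hypothetical prompt; the exact wording is an assumption, not the talk's.
PROMPT = (
    "Question: {query}\n"
    "Text: {chunk}\n"
    "Does this text help answer this question? "
    "Reply with 'relevant' or 'unrelated' on the first line, "
    "then a one-sentence explanation on the next line."
)


def judge_relevance(
    query: str, chunk: str, ask_llm: Callable[[str], str]
) -> tuple[bool, str]:
    """Ask the judge about one (query, chunk) pair.

    Returns (is_relevant, explanation). Anything that does not start
    with 'relevant' is treated as not relevant.
    """
    reply = ask_llm(PROMPT.format(query=query, chunk=chunk)).strip()
    first_line, _, explanation = reply.partition("\n")
    is_relevant = first_line.strip().lower().startswith("relevant")
    return is_relevant, explanation.strip()
```

Passing the model call in as a function keeps the sketch independent of any vendor SDK and makes it easy to swap in a stub while testing.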

The metrics

that actually matter

Hit rate:

Did we retrieve anything relevant at all?

Did at least one relevant chunk make the cut?
The floor. If this fails, nothing downstream can work.
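
As a rough sketch, hit rate can be computed from per-chunk relevance labels. The input shape (one list of booleans per query, in retrieval order) is an assumption for illustration.

```python
def hit_rate(labels_per_query: list[list[bool]]) -> float:
    """Fraction of queries where at least one retrieved chunk is relevant."""
    if not labels_per_query:
        return 0.0
    hits = sum(1 for labels in labels_per_query if any(labels))
    return hits / len(labels_per_query)
```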

Precision@k:

How much of what we retrieved is junk?

Of the top k chunks, what fraction are relevant?
Junk dilutes and distracts the model
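
A sketch of precision@k over the same kind of labels. Dividing by k even when fewer than k chunks came back is one common convention and is an assumption here; some teams divide by the number actually returned.

```python
def precision_at_k(labels: list[bool], k: int) -> float:
    """Of the top k retrieved chunks, the fraction labeled relevant."""
    if k <= 0:
        raise ValueError("k must be positive")
    # Missing positions (fewer than k results) count as not relevant.
    return sum(labels[:k]) / k
```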

Recall@k:

Did we leave the answer

on the table?

Of all the relevant chunks that exist, how many did we get?
Low recall = the answer was out there, you just didn't fetch it
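
A sketch of recall@k. It assumes you already know the full set of relevant chunk IDs for the query, which in practice comes from a labeled dataset; the ID-based input shape is an assumption.

```python
def recall_at_k(retrieved_ids: list[str], relevant_ids: set[str], k: int) -> float:
    """Of all known relevant chunks, the fraction found in the top k."""
    if not relevant_ids:
        return 0.0
    found = set(retrieved_ids[:k]) & relevant_ids
    return len(found) / len(relevant_ids)
```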

NDCG and MRR

Is the good stuff near the top?

Ranking metrics: relevant chunks should rank highest
Models weight early context — order matters
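
A sketch of both ranking metrics with binary relevance labels. One simplification to note: the ideal ordering for NDCG is built only from the labels passed in, so relevant chunks that were never retrieved are not counted against the score.

```python
import math


def reciprocal_rank(labels: list[bool]) -> float:
    """1 / rank of the first relevant chunk, or 0.0 if none is relevant."""
    for rank, is_relevant in enumerate(labels, start=1):
        if is_relevant:
            return 1.0 / rank
    return 0.0


def mrr(labels_per_query: list[list[bool]]) -> float:
    """Mean reciprocal rank across queries."""
    if not labels_per_query:
        return 0.0
    return sum(reciprocal_rank(l) for l in labels_per_query) / len(labels_per_query)


def ndcg_at_k(labels: list[bool], k: int) -> float:
    """Binary-gain NDCG@k: DCG of the actual order over DCG of the best order."""
    gains = [1.0 if is_relevant else 0.0 for is_relevant in labels]

    def dcg(values: list[float]) -> float:
        # Position i (0-based) is discounted by log2(i + 2).
        return sum(v / math.log2(i + 2) for i, v in enumerate(values))

    ideal = dcg(sorted(gains, reverse=True)[:k])
    return dcg(gains[:k]) / ideal if ideal > 0 else 0.0
```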

Four metrics,

four different failures

Hit rate
Precision@k
Recall@k
Ranking

"Looks good to me" names none of them

Golden datasets

that survive reality

What a golden dataset is

A set of real queries, with the relevance you'd expect
The encoded judgment of the people who know your domain

Build it from reality,

not your imagination

Pull real queries from production traces
Not five questions you invented at your desk

Label it, and be disciplined

Read the chunks. Label each relevant / not.
Specific criteria, not "this one feels right"

Keep it alive

Your corpus drifts. Your users drift.
Version it, refresh it, add new failures as you find them
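
One possible shape for a golden dataset record, sketched in Python. The field names, the JSONL storage format, and the version string are all assumptions for illustration; the point is that each example carries the query, the expected relevant chunks, the labeling criteria, and a dataset version.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class GoldenExample:
    query: str                     # a real query, ideally from production traces
    relevant_chunk_ids: list[str]  # chunks a domain expert labeled relevant
    criteria: str                  # the specific rule used when labeling
    source: str = "production_trace"
    dataset_version: str = "v1"    # bump when the set is refreshed


def to_jsonl(examples: list[GoldenExample]) -> str:
    """Serialize one example per line so diffs stay readable in version control."""
    return "\n".join(json.dumps(asdict(e), sort_keys=True) for e in examples)


def from_jsonl(text: str) -> list[GoldenExample]:
    """Load examples back, skipping blank lines."""
    return [GoldenExample(**json.loads(line)) for line in text.splitlines() if line.strip()]
```

Keeping the file as plain text under version control makes it easy to add new failures as you find them and to see when the set changed.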

Where LLM-as-judge helps,

and where it lies

Where LLM-as-judge helps

Relevance judgments at machine speed
With an explanation for every call
Catches "similar but irrelevant" that cosine can't

Where LLM-as-judge lies

Inherits the "plausible = relevant" trap
Position and verbosity bias
Self-preference if it grades its own model's output

Check the judge

against the golden set

Run the judge on your hand-labeled data
Measure its precision and recall like any classifier
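
A sketch of that check, treating the human labels as ground truth and the judge as a binary classifier. The parallel-list input shape is an assumption.

```python
def judge_agreement(human: list[bool], judge: list[bool]) -> dict[str, float]:
    """Precision and recall of the judge's 'relevant' calls against human labels."""
    if len(human) != len(judge):
        raise ValueError("label lists must be the same length")
    tp = sum(1 for h, j in zip(human, judge) if h and j)
    fp = sum(1 for h, j in zip(human, judge) if not h and j)
    fn = sum(1 for h, j in zip(human, judge) if h and not j)
    return {
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }
```

Low precision here means the judge calls junk relevant; low recall means it rejects chunks your experts accepted.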

Back to generation

Two checks on the answer

Correctness: did it answer the question?
Faithfulness: did it stick to the retrieved context or make things up?

So what do you actually fix?

Low recall → the answer never came back
Low precision / bad ranking → it came back, buried in junk

Low recall:

get the answer in the door

Chunk on meaning, not token counts
Add keyword search alongside vectors — hybrid search
Rewrite the query before you search
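
The talk does not say how to combine keyword and vector results. One widely used option is reciprocal rank fusion (RRF), sketched here as an assumption rather than the speaker's method. Each input is a ranked list of chunk IDs, and k=60 is the conventional smoothing constant.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge several ranked lists; each appearance adds 1 / (k + rank)."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID for determinism.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))
```

A chunk that appears in both lists tends to outrank one that appears high in only one, which is what hybrid search is meant to reward.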

Low precision:

clean up what comes back

Reranking:

The single

highest-leverage move

Over-fetch ~20 candidates, then a cross-encoder re-scores them
Catches the relevance your embeddings flattened away
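
The over-fetch-then-rerank flow can be sketched as below. Both `search` and `score` are hypothetical callables: `search` stands in for your vector store query and `score` for a cross-encoder that rates a (query, chunk) pair. Neither is a real library API.

```python
from typing import Callable


def retrieve_and_rerank(
    query: str,
    search: Callable[[str, int], list[str]],  # hypothetical: returns candidate chunks
    score: Callable[[str, str], float],       # hypothetical: cross-encoder relevance score
    fetch_k: int = 20,
    top_k: int = 5,
) -> list[str]:
    """Fetch more candidates than needed, re-score each one, keep the best."""
    candidates = search(query, fetch_k)
    ranked = sorted(candidates, key=lambda chunk: score(query, chunk), reverse=True)
    return ranked[:top_k]
```

The first stage stays cheap and broad; the more expensive scorer only ever sees `fetch_k` chunks per query.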

Metadata filtering and dedup

Filter on recency / source — kills the stale perfect match
Dedupe — stops near-duplicates crowding out the answer
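
A sketch of both steps together. The `Chunk` fields are assumptions, and the dedup here only catches copies that match after lowercasing and collapsing whitespace; real near-duplicate detection usually needs a similarity threshold, which is left out.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Chunk:
    text: str
    source: str    # hypothetical metadata field
    updated: date  # hypothetical metadata field


def filter_and_dedupe(
    chunks: list[Chunk], allowed_sources: set[str], not_before: date
) -> list[Chunk]:
    """Drop stale or untrusted chunks, then drop repeats, keeping rank order."""
    seen: set[str] = set()
    kept: list[Chunk] = []
    for chunk in chunks:
        if chunk.source not in allowed_sources or chunk.updated < not_before:
            continue
        key = " ".join(chunk.text.lower().split())  # normalized text as dedup key
        if key in seen:
            continue
        seen.add(key)
        kept.append(chunk)
    return kept
```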

Don't fine-tune your embeddings first

Expensive, slow — and it can trade recall away for precision

One rule for all of it

Change one thing at a time, then re-run your evals
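
One way to make that rule concrete is to compare metric values before and after a single change. This is a sketch; the metric names and the tolerance are assumptions.

```python
def compare_runs(
    baseline: dict[str, float], candidate: dict[str, float], tolerance: float = 0.01
) -> dict[str, str]:
    """Label each shared metric as improved, regressed, or unchanged."""
    verdicts: dict[str, str] = {}
    for metric in sorted(baseline.keys() & candidate.keys()):
        delta = candidate[metric] - baseline[metric]
        if delta > tolerance:
            verdicts[metric] = "improved"
        elif delta < -tolerance:
            verdicts[metric] = "regressed"
        else:
            verdicts[metric] = "unchanged"
    return verdicts
```

With only one change between runs, a regression on any metric points at that change.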

Start small

Don't ship retrieval on vibes

Measure relevance, not similarity.
Specify it.
Measure it.
Improve toward it.

Thank you!

@seldo.com on Bluesky 🦋

arize.com/docs/ax

These slides:

slides.com/seldo/stop-vibe-shipping

Stop Vibe Shipping: Evaluate Your Retrieval (Vector Space Day)

By Laurie Voss
