Stop Vibe Shipping: Evaluate Your Retrieval

This is the right room for this talk

What are we talking about?

  • Why retrieval fails quietly
  • Most similar != most relevant
  • The metrics that actually matter
  • How to actually improve relevance
  • Where LLM-as-judge helps — and where it lies

The vibe-shipping trap

You can't unit-test your way out

Why retrieval fails quietly

A crash gets caught. A confident but wrong answer gets forwarded to your boss.

Most similar ≠ most relevant

Similarity is a proxy

  • Vector search ranks by similarity.
  • Your user needs relevance.
  • Those are not the same thing.

How similarity breaks

  • Right topic, no answer
  • Perfectly matched, stale data
  • Near-duplicates crowding out the real answer

How do we measure retrieval effectively?

Retrieval vs. generation

  • Retrieval: did we fetch the right context?
  • Generation: did we use it well?

Relevance is the atomic unit

One question per chunk

For each retrieved chunk: relevant, or not?

Let an LLM judge relevance

  • Send the query + the chunk to an LLM
  • "Does this text help answer this question?"
  • Get back: relevant / unrelated, plus an explanation
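The three steps above can be sketched as a small function. This is a minimal sketch, not a specific vendor's API: `call_llm` is a hypothetical callable standing in for whatever model client you use, and the prompt wording and two-line reply format are assumptions.

```python
from typing import Callable, NamedTuple


class Judgment(NamedTuple):
    relevant: bool
    explanation: str


# Assumed prompt format: label on the first line, explanation on the second.
PROMPT_TEMPLATE = (
    "Question: {query}\n"
    "Text: {chunk}\n"
    "Does this text help answer this question?\n"
    "Reply with 'relevant' or 'unrelated' on the first line, "
    "then a one-sentence explanation on the second line."
)


def judge_chunk(query: str, chunk: str, call_llm: Callable[[str], str]) -> Judgment:
    """Ask an LLM whether one chunk helps answer one query."""
    reply = call_llm(PROMPT_TEMPLATE.format(query=query, chunk=chunk))
    label, _, explanation = reply.strip().partition("\n")
    return Judgment(label.strip().lower() == "relevant", explanation.strip())


def fake_llm(prompt: str) -> str:
    """Hypothetical stand-in for a real model call so the sketch runs offline."""
    return "unrelated\nThe text covers the topic but never states the answer."


print(judge_chunk("How do I reset my password?",
                  "Passwords must be 12 characters long.", fake_llm))
```

Keeping the explanation next to the label is what makes judge mistakes debuggable later.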

The metrics that actually matter

Hit rate:

Did we retrieve anything relevant at all?

  • Did at least one relevant chunk make the cut?
  • The floor. If this fails, nothing downstream can work.
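A minimal sketch of hit rate, assuming each query's retrieved chunks have already been labeled relevant or not. The list-of-booleans input shape is an assumption for illustration.

```python
def hit_rate(judged_queries: list[list[bool]]) -> float:
    """Fraction of queries where at least one retrieved chunk was relevant."""
    if not judged_queries:
        return 0.0
    hits = sum(1 for labels in judged_queries if any(labels))
    return hits / len(judged_queries)


# One inner list per query, one boolean per retrieved chunk.
judged = [
    [False, True, False],
    [False, False, False],  # a total miss
    [True, True, False],
    [False, False, True],
]
print(hit_rate(judged))  # 0.75
```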

Precision@k:

How much of what we retrieved is junk?

  • Of the top k chunks, what fraction are relevant?
  • Junk dilutes and distracts the model
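A minimal sketch of precision@k over one query's labeled results, in rank order. It assumes the common convention of dividing by k, so a result list shorter than k is penalized.

```python
def precision_at_k(labels: list[bool], k: int) -> float:
    """Fraction of the top k retrieved chunks that are relevant."""
    if k <= 0:
        raise ValueError("k must be positive")
    # Dividing by k (not by len(labels)) treats missing results as junk.
    return sum(labels[:k]) / k


labels = [True, False, True, False, False]  # relevance of chunks, in rank order
print(precision_at_k(labels, 5))  # 0.4
print(precision_at_k(labels, 2))  # 0.5
```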

Recall@k:

Did we leave the answer on the table?

  • Of all the relevant chunks that exist, what fraction did we get?
  • Low recall = the answer was out there, you just didn't fetch it
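Unlike precision, recall needs to know every relevant chunk in the corpus, so it depends on labeled ground truth. A minimal sketch, assuming chunk IDs are strings and the set of relevant IDs comes from your labels:

```python
def recall_at_k(retrieved_ids: list[str], relevant_ids: set[str], k: int) -> float:
    """Fraction of all known-relevant chunks that appear in the top k results."""
    if not relevant_ids:
        return 0.0  # nothing to find; some teams skip such queries instead
    found = set(retrieved_ids[:k]) & relevant_ids
    return len(found) / len(relevant_ids)


retrieved = ["d7", "d2", "d9", "d4"]  # hypothetical IDs, in rank order
relevant = {"d2", "d4", "d5"}         # labeled ground truth for this query
print(recall_at_k(retrieved, relevant, k=4))
```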

NDCG and MRR

Is the good stuff near the top?

  • Ranking metrics: relevant chunks should rank highest
  • Models weight early context — order matters
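A minimal sketch of both ranking metrics. MRR uses binary labels, and NDCG accepts graded gains. One simplification to note: the ideal ordering here is built from the retrieved list only, while a stricter NDCG would build it from every known-relevant chunk.

```python
import math


def reciprocal_rank(labels: list[bool]) -> float:
    """1 / rank of the first relevant chunk, or 0.0 if there is none."""
    for rank, is_relevant in enumerate(labels, start=1):
        if is_relevant:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(judged_queries: list[list[bool]]) -> float:
    """Average reciprocal rank across queries."""
    if not judged_queries:
        return 0.0
    return sum(reciprocal_rank(q) for q in judged_queries) / len(judged_queries)


def dcg(gains: list[float]) -> float:
    """Discounted cumulative gain: later positions count for less."""
    return sum(g / math.log2(pos + 1) for pos, g in enumerate(gains, start=1))


def ndcg_at_k(gains: list[float], k: int) -> float:
    """DCG of the top k, normalized by the best ordering of the same gains."""
    ideal = dcg(sorted(gains, reverse=True)[:k])
    return dcg(gains[:k]) / ideal if ideal > 0 else 0.0


print(mean_reciprocal_rank([[False, True], [True], [False, False]]))  # 0.5
print(ndcg_at_k([1.0, 1.0, 0.0], k=3))  # 1.0, relevant chunks already on top
print(ndcg_at_k([0.0, 1.0, 1.0], k=3))  # below 1.0, junk ranked first
```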

Four metrics, four different failures

  • Hit rate
  • Precision@k
  • Recall@k
  • Ranking

"Looks good to me" names none of them

Golden datasets that survive reality

What a golden dataset is

  • A set of real queries, with the relevance you'd expect
  • The encoded judgment of the people who know your domain

Build it from reality, not your imagination

  • Pull real queries from production traces
  • Not five questions you invented at your desk

Label it, and be disciplined

  • Read the chunks. Label each relevant / not.
  • Specific criteria, not "this one feels right"

Keep it alive

  • Your corpus drifts. Your users drift.
  • Version it, refresh it, add new failures as you find them
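One possible shape for a versioned golden dataset, as a sketch only. The field names, the integer version, and the `add_failure` helper are all assumptions for illustration; a real setup might keep this in a file under source control instead.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class GoldenExample:
    query: str                     # a real query, ideally from a production trace
    relevant_chunk_ids: frozenset  # chunks a domain expert labeled relevant
    source: str = "production"     # where the query came from


@dataclass
class GoldenDataset:
    version: int = 1
    examples: list = field(default_factory=list)

    def add_failure(self, example: GoldenExample) -> None:
        """Record a newly found failure and bump the version."""
        self.examples.append(example)
        self.version += 1


dataset = GoldenDataset()
dataset.add_failure(
    GoldenExample("how do I rotate an API key?", frozenset({"chunk-42"}))
)
print(dataset.version, len(dataset.examples))  # 2 1
```

Bumping the version on every change means a metric can always be tied to the exact dataset it was measured on.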

Where LLM-as-judge helps, and where it lies

Where LLM-as-judge helps

  • Relevance judgments at machine speed
  • With an explanation for every call
  • Catches "similar but irrelevant" that cosine can't

Where LLM-as-judge lies

  • Inherits the "plausible = relevant" trap
  • Position and verbosity bias
  • Self-preference when it grades its own model's output

Check the judge against the golden set

  • Run the judge on your hand-labeled data
  • Measure its precision and recall like any classifier
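Treating the judge as a binary classifier, with the hand labels as ground truth, might look like this minimal sketch. The parallel-list input is an assumption: one entry per (query, chunk) pair.

```python
def judge_precision_recall(human: list[bool], judge: list[bool]) -> tuple[float, float]:
    """Score the judge's 'relevant' calls against hand labels."""
    if len(human) != len(judge):
        raise ValueError("label lists must be the same length")
    tp = sum(h and j for h, j in zip(human, judge))        # both say relevant
    fp = sum((not h) and j for h, j in zip(human, judge))  # judge over-calls
    fn = sum(h and (not j) for h, j in zip(human, judge))  # judge misses
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


human_labels = [True, True, False, False, True]
judge_labels = [True, False, True, False, True]
print(judge_precision_recall(human_labels, judge_labels))
```

Low precision means the judge calls junk relevant; low recall means it rejects chunks your experts would keep.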

Back to generation

Two checks on the answer

  • Correctness: did it answer the question?
  • Faithfulness: did it stick to the retrieved context or make things up?

So what do you actually fix?

  • Low recall → the answer never came back
  • Low precision / bad ranking → it came back, buried in junk

Low recall:

get the answer in the door

  • Chunk on meaning, not token counts
  • Add keyword search alongside vectors — hybrid search
  • Rewrite the query before you search
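Hybrid search needs a way to merge the keyword ranking with the vector ranking. The talk does not name a method; reciprocal rank fusion (RRF) is one common choice, sketched here. The constant `k=60` is the conventional default, and the document IDs are made up.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge several ranked ID lists; items ranked well in any list rise."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by fused score, breaking ties by ID so the output is deterministic.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


vector_hits = ["a", "b", "c"]   # hypothetical results from vector search
keyword_hits = ["c", "a", "d"]  # hypothetical results from keyword search
print(reciprocal_rank_fusion([vector_hits, keyword_hits]))  # ['a', 'c', 'b', 'd']
```

RRF only looks at ranks, so it sidesteps the problem that cosine scores and keyword scores are not on the same scale.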

Low precision:

clean up what comes back

Reranking:

The single highest-leverage move

  • Over-fetch ~20 candidates, then a cross-encoder re-scores them
  • Catches the relevance your embeddings flattened away
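The over-fetch-then-rescore pattern, as a minimal sketch. The `score` argument is where a cross-encoder would plug in; `overlap_score` is a toy stand-in (word overlap) used only so the example runs without a model, and is not how a cross-encoder works.

```python
from typing import Callable


def rerank(query: str, candidates: list[str],
           score: Callable[[str, str], float], top_k: int = 5) -> list[str]:
    """Re-score every over-fetched candidate against the query, keep the best."""
    ranked = sorted(candidates, key=lambda chunk: score(query, chunk), reverse=True)
    return ranked[:top_k]


def overlap_score(query: str, chunk: str) -> float:
    """Toy scorer: share of query words that appear in the chunk."""
    query_words = set(query.lower().split())
    chunk_words = set(chunk.lower().split())
    return len(query_words & chunk_words) / len(query_words) if query_words else 0.0


candidates = [  # in practice, ~20 chunks from the first-stage retriever
    "password policy overview",
    "how to reset your password",
    "billing faq",
]
print(rerank("reset password", candidates, overlap_score, top_k=2))
# ['how to reset your password', 'password policy overview']
```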

Metadata filtering and dedup

  • Filter on recency / source — kills the stale perfect match
  • Dedupe — stops near-duplicates crowding out the answer
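A minimal sketch of both cleanups. The `Chunk` fields are assumptions, and the dedupe step only catches copies that differ in case or whitespace; true near-duplicate detection would need fuzzy matching, which is out of scope here.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Chunk:
    text: str
    source: str
    updated: date


def filter_chunks(chunks: list[Chunk], cutoff: date, allowed_sources: set[str]) -> list[Chunk]:
    """Drop chunks that are stale or come from a source you do not trust."""
    return [c for c in chunks if c.updated >= cutoff and c.source in allowed_sources]


def dedupe(chunks: list[Chunk]) -> list[Chunk]:
    """Keep the first chunk for each normalized text, preserving rank order."""
    seen: set[str] = set()
    kept = []
    for chunk in chunks:
        key = " ".join(chunk.text.lower().split())  # normalize case and whitespace
        if key not in seen:
            seen.add(key)
            kept.append(chunk)
    return kept


chunks = [
    Chunk("Rate limit is 100 requests/min", "docs", date(2021, 3, 1)),
    Chunk("Rate limit is 500 requests/min", "docs", date(2024, 6, 1)),
    Chunk("rate limit is  500 requests/min", "docs", date(2024, 6, 2)),
]
fresh = filter_chunks(chunks, cutoff=date(2024, 1, 1), allowed_sources={"docs"})
print([c.text for c in dedupe(fresh)])  # ['Rate limit is 500 requests/min']
```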

Don't fine-tune your embeddings first

Expensive, slow — and it can trade recall away for precision

One rule for all of it

Change one thing at a time, then re-run your evals
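In practice this rule comes down to diffing two eval runs that differ by exactly one change. A minimal sketch; the metric names and numbers are made up for illustration, not results from the talk.

```python
def compare_runs(baseline: dict[str, float], candidate: dict[str, float]) -> dict[str, float]:
    """Per-metric change from the baseline run to the candidate run."""
    return {name: round(candidate[name] - baseline[name], 4) for name in baseline}


# Hypothetical numbers: the only change between runs was adding a reranker.
baseline = {"hit_rate": 0.90, "precision_at_5": 0.62, "recall_at_5": 0.60}
with_reranker = {"hit_rate": 0.90, "precision_at_5": 0.70, "recall_at_5": 0.55}

for name, delta in compare_runs(baseline, with_reranker).items():
    print(f"{name}: {delta:+.2f}")
```

With a single change per run, a drop in one metric can be traced to its cause instead of being averaged away by an unrelated gain.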

Start small

Don't ship retrieval on vibes

  • Measure relevance, not similarity.
  • Specify it.
  • Measure it.
  • Improve toward it.

Thank you!

@seldo.com on Bluesky 🦋

arize.com/docs/ax

Stop Vibe Shipping: Evaluate Your Retrieval (Vector Space Day)

By Laurie Voss
