How would you measure retrieval quality before judging generation quality?

Instruction: Describe the evaluation signals you would use to judge retrieval on its own.

Context: Checks whether the candidate can explain the core concept clearly and connect it to real production decisions.

Example Answer

The way I'd approach it in an interview is this: Before I judge generation, I want to know whether retrieval is surfacing the right evidence often enough to make a good answer possible. So I start with retrieval-specific evals like recall@k, MRR, or nDCG, and most importantly, coverage of the gold supporting documents or passages for real user queries.
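As a concrete reference point, the three rank metrics named above can each be computed from a ranked result list and a gold set of relevant passage IDs. This is a minimal sketch with made-up document IDs; nDCG here uses binary relevance, which is the common case when the gold labels are just "supporting / not supporting."

```python
import math

def recall_at_k(ranked_ids, gold_ids, k):
    """Fraction of gold passages that appear in the top-k results."""
    return len(set(ranked_ids[:k]) & set(gold_ids)) / len(gold_ids)

def mrr(ranked_ids, gold_ids):
    """Reciprocal rank of the first gold passage (0 if none retrieved)."""
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in gold_ids:
            return 1.0 / rank
    return 0.0

def ndcg_at_k(ranked_ids, gold_ids, k):
    """Binary-relevance nDCG: discounted gain over the ideal ordering."""
    dcg = sum(1.0 / math.log2(i + 2)
              for i, doc_id in enumerate(ranked_ids[:k])
              if doc_id in gold_ids)
    ideal = sum(1.0 / math.log2(i + 2)
                for i in range(min(len(gold_ids), k)))
    return dcg / ideal if ideal else 0.0

# Hypothetical query: gold supporting passages are d2 and d7.
ranked = ["d5", "d2", "d9", "d7", "d1"]
gold = {"d2", "d7"}
print(recall_at_k(ranked, gold, 3))  # 0.5: only d2 is in the top 3
print(mrr(ranked, gold))             # 0.5: first gold hit at rank 2
```

In practice you would average each metric over a query set rather than report it per query, and report recall at the same k your generator actually consumes.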

I also slice aggressively. A blended metric can look fine while the system fails on tables, mixed-language docs, policy versions, or multi-hop questions. I want to know where the retriever is strong, where it is weak, and whether the miss happened in candidate generation or ranking.
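The slicing idea is mechanically simple: tag each eval query with the slice it belongs to and average the per-query metric within each tag, so a weak slice can't hide inside a healthy blend. A minimal sketch, with hypothetical per-query recall scores and slice labels:

```python
from collections import defaultdict

def sliced_mean(per_query_scores, query_tags):
    """Average a per-query metric separately within each slice (query tag)."""
    by_tag = defaultdict(list)
    for qid, score in per_query_scores.items():
        by_tag[query_tags[qid]].append(score)
    return {tag: sum(scores) / len(scores) for tag, scores in by_tag.items()}

# Hypothetical per-query recall@5 and slice labels.
scores = {"q1": 1.0, "q2": 0.0, "q3": 1.0, "q4": 0.5}
tags = {"q1": "plain_text", "q2": "table", "q3": "plain_text", "q4": "table"}
print(sliced_mean(scores, tags))
# Blended mean is 0.625, but the table slice averages only 0.25.
```

The same grouping works for the candidate-generation vs. ranking question: compute recall over the full candidate pool and over the final ranked top-k, and compare them per slice to see which stage dropped the evidence.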

I also track answerability. Some queries should return "not found" or ask for clarification, and the eval set needs labeled examples of both. If the retrieved set is weak or conflicting, that matters as much as a clean hit. Once I trust the evidence pipeline, then I evaluate synthesis. Otherwise I end up arguing about prompts when the model never saw the right material.

Common Poor Answer

A weak answer jumps straight to answer quality and says you would judge retrieval by whether the final response looks good. That skips retrieval-specific signals and makes it impossible to tell whether the model ever saw the right evidence.

Related Questions