Design a benchmark harness for multi-hop and ambiguous RAG queries.

Instruction: Explain how you would build a benchmark that reflects the hard parts of retrieval-based assistants.

Context: Assesses whether the candidate can design a practical benchmark harness and explain the main tradeoffs.


I would build the harness around realistic failure structure, not just one gold answer per query. Multi-hop and ambiguous questions need gold evidence sets, answerability labels, and explicit expectations about when the assistant should clarify, answer partially, or refuse.

So each benchmark item would include the query, the...
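An item schema along these lines can be sketched as a small data structure, paired with an evidence-recall check against the gold evidence set. This is an illustrative sketch: the field names, the `ExpectedBehavior` labels, and the `evidence_recall` metric are assumptions, not a fixed spec.

```python
from dataclasses import dataclass, field
from enum import Enum

class ExpectedBehavior(Enum):
    # Illustrative labels for what the assistant should do for this item.
    ANSWER = "answer"    # fully supported by the corpus
    PARTIAL = "partial"  # only part of the question is supported
    CLARIFY = "clarify"  # ambiguous; the assistant should ask a follow-up
    REFUSE = "refuse"    # unanswerable; the assistant should decline

@dataclass
class BenchmarkItem:
    """One harness item (field names are hypothetical)."""
    query: str
    gold_evidence: set             # passage IDs a correct answer must rest on
    expected: ExpectedBehavior
    gold_answers: list = field(default_factory=list)  # empty for CLARIFY/REFUSE

def evidence_recall(retrieved, item):
    """Fraction of the item's gold evidence present in the retrieved passages."""
    if not item.gold_evidence:
        return 1.0
    return len(item.gold_evidence & set(retrieved)) / len(item.gold_evidence)

# A multi-hop item: answering requires combining two passages.
item = BenchmarkItem(
    query="Which CEO founded the company that acquired WidgetCo?",
    gold_evidence={"doc_17", "doc_42"},
    expected=ExpectedBehavior.ANSWER,
    gold_answers=["Jane Doe"],
)
print(evidence_recall(["doc_42", "doc_3", "doc_17"], item))  # → 1.0
```

Scoring retrieval against a gold evidence *set* rather than a single gold passage is what lets the harness detect the common multi-hop failure where the retriever finds one hop but misses the other.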
