Your tool-using agent passes answer-level evals yet still makes bad tool calls. How would you fix the blind spot?

Instruction: Explain how you would expand evaluation when answer-only scoring misses bad execution.

Context: Tests how the candidate diagnoses the problem, chooses the safest next step, and reasons through recovery.


If answer-level scoring says we are fine but users still suffer, the eval is too coarse. I would add checks...
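One concrete shape such checks can take is trajectory-level scoring: grade the recorded tool calls themselves, not just the final answer. Below is a minimal sketch under assumed conventions; the `trace` record format, the `schemas` table, and the check names are illustrative, not part of the official answer or any standard eval harness.

```python
# Sketch of trajectory-level checks that score the tool calls an agent
# made, independent of whether its final answer looked correct.
# The trace/schema formats here are illustrative assumptions.

def check_args_valid(call, schemas):
    """Each call must supply exactly the parameters its tool declares."""
    required = schemas[call["tool"]]
    return set(call["args"]) == set(required)

def check_no_duplicate_calls(trace):
    """Flag repeated identical calls -- a common sign of a flailing agent."""
    seen = set()
    for call in trace:
        key = (call["tool"], tuple(sorted(call["args"].items())))
        if key in seen:
            return False
        seen.add(key)
    return True

def score_trajectory(trace, schemas):
    """Return per-check pass/fail so failures are diagnosable, not just counted."""
    return {
        "args_valid": all(check_args_valid(c, schemas) for c in trace),
        "no_duplicates": check_no_duplicate_calls(trace),
    }

# Toy example: the answer might still be right, but the trace is wasteful.
schemas = {"search": ["query"], "fetch": ["url"]}
trace = [
    {"tool": "search", "args": {"query": "population of France"}},
    {"tool": "search", "args": {"query": "population of France"}},  # redundant
]
print(score_trajectory(trace, schemas))
# → {'args_valid': True, 'no_duplicates': False}
```

The point of returning a dict of named checks rather than a single score is that a regression shows up as a specific failing check, which makes the blind spot diagnosable instead of just visible.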
