Your tool-using agent passes answer-level evals and still makes bad tool calls. How would you fix the blind spot?

Instruction: Explain how you would expand evaluation when answer-only scoring misses bad execution.

Context: Tests how the candidate diagnoses the problem, chooses the safest next step, and reasons through recovery.


I would add tool-level evaluation and stop treating answer correctness as sufficient evidence that the workflow is healthy. If the agent makes bad tool calls but still passes answer checks, the benchmark is rewarding lucky outcomes and hiding operational risk: malformed arguments, redundant or failed calls, and unsafe tool choices all go unmeasured.
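As a minimal sketch of what tool-level scoring could look like: the trace format, tool registry, and check names below are assumptions for illustration, not any particular framework's API. The idea is to score each recorded tool call independently of the final answer, so a run that lands on the right answer via a broken call still fails.

```python
# Hypothetical tool-level evaluator. The trace schema ({"tool", "args",
# "error"}) and the ALLOWED_TOOLS registry are illustrative assumptions.

ALLOWED_TOOLS = {
    "search": {"query"},          # tool name -> required argument keys
    "calculator": {"expression"},
}

def score_tool_calls(trace):
    """Check every tool call in a trace, regardless of answer correctness.

    Returns (passed, failures), where failures is a list of human-readable
    reasons a call was unhealthy.
    """
    failures = []
    for i, call in enumerate(trace):
        tool = call.get("tool")
        if tool not in ALLOWED_TOOLS:
            failures.append(f"call {i}: unknown tool {tool!r}")
            continue
        missing = ALLOWED_TOOLS[tool] - set(call.get("args", {}))
        if missing:
            failures.append(f"call {i}: {tool} missing args {sorted(missing)}")
        if call.get("error"):
            failures.append(f"call {i}: {tool} errored: {call['error']}")
    return (not failures, failures)

# A run can produce a correct final answer and still fail these checks:
trace = [
    {"tool": "search", "args": {}, "error": None},          # missing 'query'
    {"tool": "calculator", "args": {"expression": "2+2"}},  # healthy call
]
passed, failures = score_tool_calls(trace)
```

Reporting the answer score and the tool-call score separately makes the gap visible: a benchmark row that reads "answer: pass, tools: fail" is exactly the lucky-outcome case the answer-only eval was hiding.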

I would evaluate...
