Instruction: Explain what you would look at beyond the final answer when evaluating an agent with tools.
Context: Checks whether the candidate can explain the core concept clearly and connect it to real production decisions.
The way I'd think about it is this: For tool-using agents, I care about more than whether the final answer looked right. I want to know whether the agent chose the right tool, used the right arguments, respected preconditions, recovered from failures sensibly, and stopped when confidence was too low to proceed safely.
So I usually look at step-level correctness, tool-call validity, unnecessary tool usage, latency across the workflow, error recovery behavior, approval handling, and end-to-end task completion. The right tool decision at the wrong time can still break the user experience.
I also separate visible success from hidden risk. An agent can get the answer right once while taking a path that is expensive, brittle, or unsafe. If I only score the surface output, I reward behavior that will hurt me later in production.
A weak answer grades only the final answer text. For agents, the path matters almost as much as the destination.
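The checks above can be sketched as a simple trajectory scorer. This is a minimal illustration, not a production harness: the step schema (`tool`, `args`, `error`, `recovered`) and the `ALLOWED_TOOLS` registry are hypothetical names invented for this example, and real evaluation frameworks define their own.

```python
# Hypothetical tool registry: tool name -> required argument names.
ALLOWED_TOOLS = {"search": {"query"}, "calculator": {"expression"}}

def score_trajectory(steps, final_answer_correct):
    """Score an agent run on more than the final answer.

    Each step is a dict with a hypothetical schema:
    {"tool": str, "args": dict, "error": bool, "recovered": bool}.
    """
    invalid_calls = 0      # tool unknown or argument set malformed
    redundant_calls = 0    # same tool called again with identical args
    recovered_errors = 0   # failed steps the agent sensibly retried
    seen = set()
    for step in steps:
        tool, args = step["tool"], step["args"]
        # Tool-call validity: right tool, right argument shape.
        if tool not in ALLOWED_TOOLS or set(args) != ALLOWED_TOOLS[tool]:
            invalid_calls += 1
        # Unnecessary tool usage: exact repeat of an earlier call.
        key = (tool, tuple(sorted(args.items())))
        if key in seen:
            redundant_calls += 1
        seen.add(key)
        # Error recovery behavior.
        if step.get("error") and step.get("recovered"):
            recovered_errors += 1
    return {
        "final_answer_correct": final_answer_correct,
        "tool_call_validity": 1 - invalid_calls / max(len(steps), 1),
        "redundant_calls": redundant_calls,
        "recovered_errors": recovered_errors,
    }
```

Scoring a run this way surfaces the hidden-risk cases: a trajectory can return `final_answer_correct=True` while showing low validity or heavy redundancy, which is exactly the behavior that will hurt later in production.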
easy