Instruction: Explain how you would benchmark an agent on realistic multi-step work.
Context: Assesses whether the candidate can design a practical architecture and explain the main tradeoffs.
I would benchmark the whole workflow, not just the final text. Each task should include initial state, allowed tools, approval conditions, expected end state, and failure labels for the major ways the agent can go wrong.
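As a minimal sketch of what one such task record could look like, assuming a plain Python dataclass: the class name `BenchmarkTask`, the helper `reached_end_state`, the refund scenario, and every tool and label name below are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class BenchmarkTask:
    """One benchmark task. All field names are illustrative assumptions."""
    task_id: str
    initial_state: dict[str, Any]        # world state before the agent acts
    allowed_tools: set[str]              # tools the agent may call
    approval_conditions: set[str]        # tools that need sign-off before use
    expected_end_state: dict[str, Any]   # state that counts as success
    failure_labels: set[str] = field(default_factory=set)


def reached_end_state(task: BenchmarkTask, final_state: dict[str, Any]) -> bool:
    """True if every expected key is present with the expected value.

    Keys in final_state that the task does not mention are ignored.
    """
    return all(
        key in final_state and final_state[key] == value
        for key, value in task.expected_end_state.items()
    )


# Hypothetical example task: refund a paid order.
task = BenchmarkTask(
    task_id="refund-001",
    initial_state={"order_status": "paid", "refund_issued": False},
    allowed_tools={"lookup_order", "issue_refund", "send_email"},
    approval_conditions={"issue_refund"},
    expected_end_state={"order_status": "refunded", "refund_issued": True},
    failure_labels={"wrong_tool", "skipped_approval", "wrong_order"},
)
```

Keeping the expected end state as data rather than as a reference transcript means two different but valid action sequences can both pass the same check.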
Then I would score across multiple levels: tool selection, step ordering, recovery...
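The first two levels named above, tool selection and step ordering, could be scored roughly as follows. This is a sketch under stated assumptions, not a standard metric: the function names are hypothetical, a trace is assumed to be a list of tool names, and step ordering is approximated by the longest common subsequence with one reference ordering. The remaining levels are cut off in the preview, so they are not modeled here.

```python
def tool_selection_score(calls: list[str], allowed: set[str]) -> float:
    """Fraction of tool calls that use an allowed tool.

    Assumption: an empty trace scores 1.0 because it contains no bad calls.
    """
    if not calls:
        return 1.0
    return sum(call in allowed for call in calls) / len(calls)


def step_ordering_score(calls: list[str], reference: list[str]) -> float:
    """Longest common subsequence with the reference order, over len(reference)."""
    if not reference:
        return 1.0
    # Row-by-row LCS dynamic program; prev[j] is the LCS length of the
    # calls seen so far against the first j reference steps.
    prev = [0] * (len(reference) + 1)
    for call in calls:
        cur = [0]
        for j, ref_step in enumerate(reference, start=1):
            if call == ref_step:
                cur.append(prev[j - 1] + 1)
            else:
                cur.append(max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1] / len(reference)


# Hypothetical trace: the agent emailed the customer before refunding.
reference = ["lookup_order", "issue_refund", "send_email"]
trace = ["lookup_order", "send_email", "issue_refund"]
allowed = set(reference)

selection = tool_selection_score(trace, allowed)   # every call is allowed
ordering = step_ordering_score(trace, reference)   # two of three steps in order
```

Reporting the two numbers separately, rather than as one blended score, keeps it visible whether a failure came from picking the wrong tool or from doing the right things in the wrong order.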
easy