Instruction: Explain how you would benchmark an agent on realistic multi-step work.
Context: Assesses whether the candidate can design a practical architecture and explain the main tradeoffs.
I would benchmark the full loop, not just the first answer. A useful harness needs to tell me whether the...
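One way to read "benchmark the full loop" is to record every step of an episode and score the final outcome, not only the first response. The sketch below is a minimal illustration under that assumption; `Step`, `EpisodeResult`, `run_episode`, `summarize`, the scripted agent, and the three metrics are all hypothetical names chosen for this example, not part of any specific framework or of the original answer.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional


@dataclass
class Step:
    action: str  # hypothetical label for what the agent did
    ok: bool     # whether that individual step succeeded


@dataclass
class EpisodeResult:
    task_id: str
    steps: List[Step] = field(default_factory=list)
    success: bool = False


# An agent sees the steps taken so far and returns the next one, or None to stop.
Agent = Callable[[List[Step]], Optional[Step]]
# A checker decides whether the whole trajectory solved the task.
Checker = Callable[[List[Step]], bool]


def run_episode(task_id: str, agent: Agent, check: Checker,
                max_steps: int = 10) -> EpisodeResult:
    """Drive the agent until it stops or the step budget runs out."""
    steps: List[Step] = []
    for _ in range(max_steps):
        step = agent(steps)
        if step is None:
            break
        steps.append(step)
    return EpisodeResult(task_id, steps, check(steps))


def summarize(results: List[EpisodeResult]) -> Dict[str, float]:
    """Aggregate end-to-end and per-step metrics across episodes."""
    n = len(results)
    total_steps = sum(len(r.steps) for r in results)
    failed_steps = sum(1 for r in results for s in r.steps if not s.ok)
    return {
        "task_success_rate": sum(r.success for r in results) / n if n else 0.0,
        "avg_steps": total_steps / n if n else 0.0,
        "step_error_rate": failed_steps / total_steps if total_steps else 0.0,
    }


# Toy usage: a scripted agent that fails one step, recovers, then submits.
def scripted_agent(history: List[Step]) -> Optional[Step]:
    plan = [Step("search", True), Step("edit", False),
            Step("edit", True), Step("submit", True)]
    return plan[len(history)] if len(history) < len(plan) else None


def submitted(steps: List[Step]) -> bool:
    return bool(steps) and steps[-1].action == "submit" and steps[-1].ok


results = [run_episode("demo-task", scripted_agent, submitted)]
report = summarize(results)
print(report)
```

Keeping the per-step error rate separate from the task success rate is a deliberate choice in this sketch: an agent that recovers from a failed step still counts as a success, but the recovery cost remains visible in the step metrics.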
easy