Design a benchmarking harness for multi-step agent tasks, not just single-turn prompts.

Instruction: Explain how you would benchmark an agent on realistic multi-step work.

Context: Assesses whether the candidate can design a practical architecture and explain the main tradeoffs.

I would benchmark the full loop, not just the first answer. A useful harness needs to tell me whether the...
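As a rough illustration of what "benchmarking the full loop" could mean, here is a minimal sketch, not the official answer's design. It assumes the agent is a plain callable from observed state to an action name, and that success is judged by checking the final environment state rather than the agent's first reply. Every name in it (`Task`, `EpisodeResult`, `run_episode`, `summarize`, and the toy counter task) is hypothetical and introduced only for this example.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

State = Dict[str, int]
Agent = Callable[[State], str]  # hypothetical interface: observed state -> action name


@dataclass
class Task:
    name: str
    initial_state: State
    apply: Callable[[State, str], State]  # environment transition for one action
    is_done: Callable[[State], bool]      # success check on state, not on output text
    max_steps: int = 10                   # step budget so a stuck agent terminates


@dataclass
class EpisodeResult:
    task: str
    success: bool
    steps: int
    actions: List[str]  # full trajectory, kept for later failure analysis


def run_episode(agent: Agent, task: Task) -> EpisodeResult:
    """Run the agent until the task is done or the step budget is spent."""
    state = dict(task.initial_state)
    actions: List[str] = []
    for _ in range(task.max_steps):
        if task.is_done(state):
            break
        action = agent(state)
        actions.append(action)
        state = task.apply(state, action)
    return EpisodeResult(task.name, task.is_done(state), len(actions), actions)


def summarize(results: List[EpisodeResult]) -> Dict[str, float]:
    """Aggregate per-episode results into headline metrics."""
    n = len(results)
    return {
        "success_rate": sum(r.success for r in results) / n,
        "mean_steps": sum(r.steps for r in results) / n,
    }


# Toy task (an assumption for illustration): drive a counter up to 3.
def counter_apply(state: State, action: str) -> State:
    new_state = dict(state)
    if action == "inc":
        new_state["count"] += 1
    return new_state


counter_task = Task(
    name="count_to_3",
    initial_state={"count": 0},
    apply=counter_apply,
    is_done=lambda s: s["count"] >= 3,
    max_steps=5,
)


def good_agent(state: State) -> str:
    return "inc"


def stuck_agent(state: State) -> str:
    return "noop"  # never makes progress, so it exhausts the step budget


results = [run_episode(good_agent, counter_task), run_episode(stuck_agent, counter_task)]
summary = summarize(results)
print(summary)
```

Recording the whole action trajectory per episode, rather than only a pass/fail flag, is what would let a harness like this distinguish an agent that fails on the first step from one that gets stuck late in the task.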