Design a benchmarking harness for multi-step agent tasks, not just single-turn prompts.

Instruction: Explain how you would benchmark an agent on realistic multi-step work.

Context: Assesses whether the candidate can design a practical evaluation architecture and explain the main tradeoffs.

I would benchmark the whole workflow, not just the final text. Each task should include initial state, allowed tools, approval conditions, expected end state, and failure labels for the major ways the agent can go wrong.
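The task specification above can be sketched as a small schema. This is a minimal illustration, not a prescribed format; all field and tool names (e.g. `crm.lookup`, `payments.refund`) are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class AgentTask:
    """One benchmark case for a multi-step agent workflow."""
    task_id: str
    initial_state: dict             # environment before the agent starts
    allowed_tools: list             # tools the agent may call
    approval_conditions: list       # actions that require human sign-off
    expected_end_state: dict        # what the environment should look like on success
    failure_labels: list            # taxonomy of known ways this task goes wrong

# Hypothetical example: a customer-refund task
task = AgentTask(
    task_id="refund-001",
    initial_state={"ticket_status": "open", "refund_issued": False},
    allowed_tools=["crm.lookup", "payments.refund", "email.send"],
    approval_conditions=["refund over $100 needs human approval"],
    expected_end_state={"ticket_status": "closed", "refund_issued": True},
    failure_labels=["wrong_tool", "skipped_approval", "bad_end_state"],
)
```

Making the expected end state and failure labels explicit is what lets the harness grade outcomes and categorize errors automatically, rather than eyeballing transcripts.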

Then I would score across multiple levels: tool selection, step ordering, recovery...
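Multi-level scoring along these lines might be sketched as below, checking tool selection and end-state correctness per run. The function name and trace format are assumptions for illustration, not a fixed API.

```python
def score_run(allowed_tools, expected_end_state, trace, end_state):
    """Score one agent run at multiple levels, not just the final text."""
    # Level 1: tool selection -- every call used a permitted tool
    tool_selection = all(step["tool"] in allowed_tools for step in trace)
    # Level 2: end state -- environment matches every expected key/value
    end_state_ok = all(end_state.get(k) == v
                       for k, v in expected_end_state.items())
    return {"tool_selection": tool_selection, "end_state": end_state_ok}

# Hypothetical run: two tool calls, then a final environment snapshot
trace = [{"tool": "crm.lookup"}, {"tool": "payments.refund"}]
result = score_run(
    allowed_tools={"crm.lookup", "payments.refund", "email.send"},
    expected_end_state={"refund_issued": True},
    trace=trace,
    end_state={"refund_issued": True, "ticket_status": "closed"},
)
# result -> {"tool_selection": True, "end_state": True}
```

Keeping each level as a separate boolean (rather than one aggregate score) makes it easy to attach failure labels and see which layer of the workflow broke.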
