An agent works in staging but fails in production after tool latency spikes. What would you change first?

Instruction: Explain the first production controls you would add when tool latency starts breaking an agent.

Context: Tests how the candidate diagnoses the problem, chooses the safest next step, and reasons through recovery.

Official answer available

I would start by treating latency as part of the workflow contract, not as background noise. If production latency spikes are enough to break the agent, then the recovery and timeout behavior were not designed for realistic operating conditions.
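As a rough illustration of what "latency as part of the workflow contract" could look like, the sketch below wraps a tool call in an explicit policy: a per-call timeout, a bounded number of attempts, and a structured failure result that the agent can act on instead of hanging. Every name here (`LatencyPolicy`, `ToolResult`, `call_with_policy`, `fast_tool`, `slow_tool`) is hypothetical and chosen for the example; it is one possible shape under those assumptions, not a specific framework's API.

```python
import asyncio
from dataclasses import dataclass
from typing import Awaitable, Callable, Optional


@dataclass(frozen=True)
class LatencyPolicy:
    """Hypothetical per-tool latency contract."""
    timeout_s: float    # maximum wait for a single attempt
    max_attempts: int   # total attempts, including the first
    backoff_s: float    # base delay, scaled linearly by attempt number


@dataclass
class ToolResult:
    """Structured outcome so the agent can branch on failure explicitly."""
    ok: bool
    value: Optional[str]
    attempts: int
    reason: str


async def call_with_policy(
    tool: Callable[[], Awaitable[str]],
    policy: LatencyPolicy,
) -> ToolResult:
    """Run `tool` under the policy; never wait longer than the policy allows."""
    for attempt in range(1, policy.max_attempts + 1):
        try:
            value = await asyncio.wait_for(tool(), timeout=policy.timeout_s)
            return ToolResult(True, value, attempt, "ok")
        except asyncio.TimeoutError:
            # Back off before retrying, but not after the final attempt.
            if attempt < policy.max_attempts:
                await asyncio.sleep(policy.backoff_s * attempt)
    # Attempts exhausted: report the failure instead of raising or hanging.
    return ToolResult(False, None, policy.max_attempts, "timeout")


# Illustrative stand-ins for real tools (assumptions for the example).
async def fast_tool() -> str:
    return "done"


async def slow_tool() -> str:
    await asyncio.sleep(0.2)  # simulates a production latency spike
    return "late"


if __name__ == "__main__":
    policy = LatencyPolicy(timeout_s=0.01, max_attempts=2, backoff_s=0.0)
    print(asyncio.run(call_with_policy(fast_tool, policy)))
    print(asyncio.run(call_with_policy(slow_tool, policy)))
```

The design choice worth noting is that a timeout becomes a normal return value rather than an exception the agent loop may not expect, so the orchestrator can decide whether to degrade, skip the tool, or escalate.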

My first changes would be around orchestration: clearer timeout policies, better...
