You see great average metrics, but a small set of long-tail tasks fail badly. How would you investigate?

Instruction: Explain how you would investigate hidden tail risk in an agent system.

Context: Tests how the candidate diagnoses the problem, chooses the safest next step, and reasons through recovery.

I would break the average apart immediately. Long-tail task failures are exactly the kind of thing blended metrics are good at hiding, especially when the common cases are easy and the rare cases are high stakes.
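A minimal sketch of what breaking the average apart can look like, assuming each evaluated task is logged with a slice label and a pass/fail flag. The slice names, the record layout, and the `success_by_slice` helper are illustrative assumptions, not part of any particular agent framework.

```python
from collections import defaultdict


def success_by_slice(records):
    """Return {slice_name: (n_tasks, success_rate)} from (slice_name, succeeded) pairs."""
    counts = defaultdict(lambda: [0, 0])  # slice_name -> [total, successes]
    for slice_name, succeeded in records:
        counts[slice_name][0] += 1
        counts[slice_name][1] += int(succeeded)
    return {name: (total, wins / total) for name, (total, wins) in counts.items()}


# Hypothetical evaluation log: 95 common tasks that all pass,
# plus 5 rare tasks of which only 1 passes.
records = [("common", True)] * 95 + [("long_tail", True)] + [("long_tail", False)] * 4

overall = sum(ok for _, ok in records) / len(records)
per_slice = success_by_slice(records)

print(f"overall success rate: {overall:.2f}")
# List the worst slice first so the tail is the first thing a reader sees.
for name, (n, rate) in sorted(per_slice.items(), key=lambda item: item[1][1]):
    print(f"{name}: n={n}, success rate={rate:.2f}")
```

Reporting the task count next to each rate matters here: a small slice barely moves the blended number, which is how a 20% success rate on rare tasks can sit behind a 96% headline figure.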

I would sample the failing slice, inspect traces, and ask whether the tail...
