You see great average metrics, but a small set of long-tail tasks fails badly. How would you investigate?

Instruction: Explain how you would investigate hidden tail risk in an agent system.

Context: Tests how the candidate diagnoses the problem, chooses the safest next step, and reasons through recovery.

Average metrics can hide the tasks that actually damage trust. I would segment the failures by complexity and workflow path...
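One way to picture the segmentation step is the minimal sketch below. It is not part of the official answer. It assumes each evaluation record carries hypothetical `complexity`, `workflow_path`, and `success` fields, and the sample data is invented purely for illustration. It groups results by segment and ranks segments by failure rate, so a small, badly failing slice is visible even when the overall success rate looks healthy.

```python
from collections import defaultdict


def failure_rates_by_segment(results):
    """Group task results by (complexity, workflow_path) and rank by failure rate.

    Field names are assumptions made for this sketch, not a real schema.
    """
    counts = defaultdict(lambda: [0, 0])  # segment -> [failures, total]
    for r in results:
        key = (r["complexity"], r["workflow_path"])
        counts[key][1] += 1
        if not r["success"]:
            counts[key][0] += 1

    rows = [
        {"segment": key, "failures": f, "total": t, "failure_rate": f / t}
        for key, (f, t) in counts.items()
    ]
    # Worst segments first, so the tail is at the top of the report.
    return sorted(rows, key=lambda row: row["failure_rate"], reverse=True)


# Invented example data: 95 easy tasks that pass, 5 hard tasks that mostly fail.
results = (
    [{"complexity": "low", "workflow_path": "single_tool", "success": True}] * 95
    + [{"complexity": "high", "workflow_path": "multi_step_handoff", "success": False}] * 4
    + [{"complexity": "high", "workflow_path": "multi_step_handoff", "success": True}] * 1
)

overall_success = sum(r["success"] for r in results) / len(results)
worst = failure_rates_by_segment(results)[0]

print(f"overall success rate: {overall_success:.2f}")
print(f"worst segment: {worst['segment']} "
      f"({worst['failures']}/{worst['total']} failed)")
```

In this invented data the aggregate success rate is high while one segment fails most of the time, which is the pattern the question describes. In practice the segment keys would come from whatever task metadata the evaluation harness actually records, and very small segments would need a minimum-sample check before drawing conclusions.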
