A team keeps adding benchmark cases after each incident, and the suite is getting noisy. How would you clean it up?

Instruction: Explain how you would keep an evaluation suite useful as it grows.

Context: Tests how the candidate diagnoses the problem, chooses the safest next step, and reasons through recovery.

I would refactor the suite around failure classes instead of preserving every incident as its own permanent monument. Incident cases are useful, but if they pile up without structure, the benchmark becomes redundant, hard to maintain, and hard to interpret.
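As a rough illustration of what organizing by failure class could look like, here is a minimal sketch. The record fields (`id`, `failure_class`, `incident`), the class labels, and the cap of two representatives per class are all assumptions made for the example, not part of any particular benchmark tooling.

```python
from collections import defaultdict

# Hypothetical case records. Field names and values are illustrative
# assumptions, not a schema taken from a real suite.
cases = [
    {"id": "c1", "failure_class": "timeout", "incident": "A"},
    {"id": "c2", "failure_class": "timeout", "incident": "B"},
    {"id": "c3", "failure_class": "timeout", "incident": "C"},
    {"id": "c4", "failure_class": "wrong_citation", "incident": "D"},
    {"id": "c5", "failure_class": "format_drift", "incident": "E"},
    {"id": "c6", "failure_class": "format_drift", "incident": "F"},
]


def group_by_failure_class(cases):
    """Bucket cases by their failure_class label, preserving input order."""
    groups = defaultdict(list)
    for case in cases:
        groups[case["failure_class"]].append(case)
    return dict(groups)


def select_representatives(groups, max_per_class=2):
    """Keep at most max_per_class cases from each failure class."""
    return {cls: members[:max_per_class] for cls, members in groups.items()}


groups = group_by_failure_class(cases)
kept = select_representatives(groups, max_per_class=2)

# Report how many cases each class holds and how many would be retained.
for cls in sorted(groups):
    print(f"{cls}: {len(groups[cls])} cases, keeping {len(kept[cls])}")
```

Taking the first few cases per class is only a placeholder for selection. In practice the choice of representatives would need a human review of which cases actually exercise distinct behavior.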

So I would cluster the...
