Instruction: Explain how you would keep an evaluation suite useful as it grows.
Context: Tests how the candidate diagnoses the problem, chooses the safest next step, and reasons through recovery.
I would refactor the suite around failure classes instead of preserving every incident as its own permanent monument. Incident cases are useful, but if they pile up without structure, the benchmark becomes redundant, hard to maintain, and hard to interpret.
So I would cluster the...
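As a rough illustration of organizing cases by failure class, here is a minimal Python sketch. It is not taken from the answer itself: the `EvalCase` fields, the example class names, the incident ids, and the `max_per_class` budget are all hypothetical, chosen only to show how grouping makes redundancy visible.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class EvalCase:
    case_id: str
    failure_class: str  # hypothetical label for the kind of failure exercised
    incident: str       # hypothetical id of the incident that produced the case


def group_by_failure_class(cases):
    """Map each failure class to the list of cases that exercise it."""
    groups = defaultdict(list)
    for case in cases:
        groups[case.failure_class].append(case)
    return dict(groups)


def crowded_classes(groups, max_per_class=2):
    """Return, sorted by name, the classes holding more cases than the assumed budget.

    These are candidates for review and consolidation, not automatic deletion.
    """
    return sorted(
        name for name, members in groups.items() if len(members) > max_per_class
    )


# Illustrative data only: three incidents that exercise the same failure class.
cases = [
    EvalCase("c1", "stale-context", "INC-1"),
    EvalCase("c2", "stale-context", "INC-2"),
    EvalCase("c3", "stale-context", "INC-3"),
    EvalCase("c4", "unsafe-retry", "INC-4"),
]

groups = group_by_failure_class(cases)
print({name: len(members) for name, members in groups.items()})
print(crowded_classes(groups))
```

Reporting results per failure class rather than per incident keeps the benchmark easier to interpret, and a per-class budget gives a simple signal for when near-duplicate incident cases deserve a second look.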
easy