Instruction: Describe how you would localize a model regression that appears only in long tool-using tasks.
Context: Tests how the candidate diagnoses the problem, chooses the safest next step, and reasons through recovery. Describe how you would localize a model regression that appears only in long tool-using tasks.
Official answer available
Preview the opening of the answer, then unlock the full walkthrough.
I would not accept a single aggregate score. I would replay the same multi-step tasks and inspect where the new...
easy
easy
easy
easy
easy
easy