Instruction: Describe how you would localize a model regression that appears only in long tool-using tasks.
Context: Tests how the candidate diagnoses the problem, chooses the safest next step, and reasons through recovery. Describe how you would localize a model regression that appears only in long tool-using tasks.
Official answer available
Preview the opening of the answer, then unlock the full walkthrough.
I would target the regression where it lives: tool-heavy, multi-step workflows. The fact that single-step tasks held up tells me the model can still answer, but something about planning, tool selection, or long-horizon state handling has changed.
I would compare traces before and after...
easy
easy
easy
easy
easy
easy