After a model upgrade, agent success drops only on multi-step tasks with tools. How would you evaluate the regression?

Instruction: Describe how you would localize a model regression that appears only in long tool-using tasks.

Context: Tests how the candidate diagnoses the problem, chooses the safest next step, and reasons through recovery. Describe how you would localize a model regression that appears only in long tool-using tasks.

Official answer available

Preview the opening of the answer, then unlock the full walkthrough.

I would not accept a single aggregate score. I would replay the same multi-step tasks and inspect where the new...

Upgrade to view official answer

After a model upgrade, agent success drops only on multi-step tasks with tools. How would you evaluate the regression?

Related Questions