A model swap looks neutral in offline tests and causes live cost blowups through longer outputs. How would you prevent that in the future?

Instruction: Explain how you would guard against cost regressions that appear only in production behavior.

Context: Tests how the candidate diagnoses the problem, chooses the safest next step, and reasons through recovery.

I would add output-length and cost-behavior checks to the evaluation process. Offline tests that judge only answer quality can easily miss that a new model is more verbose or more likely to call tools under the same prompt.
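As a rough illustration of such a check, the sketch below compares a baseline and a candidate model on output tokens, tool calls, and estimated cost per request, and flags any metric that grows past a threshold. Everything named here (`RunRecord`, `Pricing`, `summarize`, `cost_regressions`, the per-1k-token prices, and the 10% threshold) is a hypothetical assumption for illustration, not part of any particular evaluation framework.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class RunRecord:
    """One model response on one eval prompt (hypothetical schema)."""
    input_tokens: int
    output_tokens: int
    tool_calls: int


@dataclass(frozen=True)
class Pricing:
    """Assumed price per 1,000 tokens; real prices vary by model."""
    input_per_1k: float
    output_per_1k: float


def cost(record: RunRecord, pricing: Pricing) -> float:
    """Estimated cost of a single request under the assumed pricing."""
    return (record.input_tokens * pricing.input_per_1k
            + record.output_tokens * pricing.output_per_1k) / 1000


def summarize(records: list[RunRecord], pricing: Pricing) -> dict[str, float]:
    """Aggregate the cost-related metrics for one model's eval run."""
    out = sorted(r.output_tokens for r in records)
    # Nearest-rank style p95, clamped to the last element for small samples.
    p95 = out[min(len(out) - 1, int(0.95 * len(out)))]
    return {
        "mean_output_tokens": mean(out),
        "p95_output_tokens": p95,
        "mean_tool_calls": mean(r.tool_calls for r in records),
        "mean_cost": mean(cost(r, pricing) for r in records),
    }


def cost_regressions(baseline: dict[str, float],
                     candidate: dict[str, float],
                     max_ratio: float = 1.10) -> dict[str, float]:
    """Return each metric whose candidate/baseline ratio exceeds max_ratio."""
    failures = {}
    for name, base in baseline.items():
        cand = candidate[name]
        if base > 0:
            ratio = cand / base
        else:
            # Baseline of zero: any increase counts as an unbounded regression.
            ratio = float("inf") if cand > 0 else 1.0
        if ratio > max_ratio:
            failures[name] = ratio
    return failures
```

A gate like this could run on the same eval prompts used for quality scoring, and a non-empty result would block the swap or at least require sign-off. Because the regression in the question appeared only in production, the same comparison would plausibly also need to run on a sample of real traffic, where prompt mix and conversation length differ from the offline set.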

I would also...
