A feature demos well but fails on messy user inputs. How would you update the evals?

Instruction: Describe how you would respond when the benchmark is too clean compared with production traffic.

Context: Tests how the candidate diagnoses the problem, chooses the safest next step, and reasons through recovery.

I would stop letting the demo define the benchmark. If the system falls apart on messy traffic, the eval set is overrepresenting curated examples and underrepresenting the real variation users bring.
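To make that gap concrete before changing anything, I might compare the current eval inputs against a small sample of production traffic on a few crude "messiness" proxies. This is only a sketch under my own assumptions: the example prompts, the messiness_stats helper, and the proxies themselves (length, question marks, sentence count) are illustrative, not an existing tool.

```python
from statistics import mean

def messiness_stats(prompts: list[str]) -> dict[str, float]:
    """Crude proxies for how 'clean' a set of inputs looks."""
    return {
        "avg_chars": mean(len(p) for p in prompts),
        "avg_tokens": mean(len(p.split()) for p in prompts),
        "pct_no_question_mark": mean(0.0 if "?" in p else 1.0 for p in prompts),
        "pct_multi_clause": mean(1.0 if p.count(".") + p.count(",") > 1 else 0.0 for p in prompts),
    }

# Hypothetical inputs: a curated demo set vs. a sample of real traffic.
demo_set = ["What is the refund policy for annual plans?"]
production_sample = ["refund?? i paid twice last month also cant log in"]

print("demo      :", messiness_stats(demo_set))
print("production:", messiness_stats(production_sample))
```

If the two distributions look nothing alike, that is the evidence that the benchmark, not the feature, is what the demo was really measuring.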

So I would pull production-like inputs into the suite quickly: incomplete questions, mixed intents, typos,...
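As a rough illustration of what "pulling production-like inputs into the suite" could look like, the sketch below expands each curated case into noisier variants. The case format, the helper names (make_messy_variants, add_typos), and the perturbations are assumptions for illustration, not an existing API; real messy cases should still come primarily from sampled production inputs, with synthetic noise only filling the gaps.

```python
import random

def add_typos(text: str, rate: float = 0.05, seed: int = 0) -> str:
    """Randomly drop characters to mimic typos."""
    rng = random.Random(seed)
    return "".join(c for c in text if rng.random() > rate)

def truncate(text: str, keep: float = 0.6) -> str:
    """Cut the input short to mimic incomplete questions."""
    return text[: max(1, int(len(text) * keep))]

def mix_intents(text: str, extra: str = "also can you reset my password") -> str:
    """Append a second request to mimic mixed intents."""
    return f"{text} {extra}"

def make_messy_variants(case: dict) -> list[dict]:
    """Expand one curated eval case into production-like variants.
    Note: graders may need variant-specific expectations (e.g. mixed intents)."""
    base = case["input"]
    return [
        {**case, "input": add_typos(base), "tag": "typos"},
        {**case, "input": truncate(base), "tag": "incomplete"},
        {**case, "input": mix_intents(base), "tag": "mixed_intent"},
    ]

curated_case = {"input": "What is the refund policy for annual plans?", "expected": "30-day refund"}
eval_suite = [curated_case] + make_messy_variants(curated_case)
for c in eval_suite:
    print(c.get("tag", "curated"), "->", c["input"])
```

Tagging each variant also lets the suite report pass rates per failure mode, so regressions on messy traffic show up separately from regressions on the original curated cases.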
