Instruction: Explain the process, including how you would select the metrics, divide the audience, and evaluate the results.
Context: This question tests the candidate's understanding of deploying machine learning models and evaluating their performance in a real-world scenario.
I would start by defining the decision we are trying to make. An A/B test is not just "new model versus old model." It is a causal measurement of whether the new model improves the user or business outcome we care about without breaking important guardrails.
From there, I would make sure randomization happens at the right unit (usually the user rather than the session or request, so one person never sees both variants), that exposure is logged correctly, and that treatment assignment is stable enough to avoid contamination. I would choose one primary success metric, a small set of guardrail metrics (for example latency, error rate, and retention), and run a power analysis so we know the test can actually detect the minimum effect size we care about. For ML systems, I also want segment cuts, because models often help one cohort while hurting another.
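The two mechanics above can be sketched concretely. This is a minimal illustration, not a production framework: `assign_arm` shows stable, deterministic user-level bucketing via hashing, and `required_sample_size` is the standard normal-approximation formula for a two-proportion test. The function names and parameters are hypothetical choices for this sketch.

```python
import hashlib
import math
from statistics import NormalDist


def assign_arm(user_id: str, experiment: str, treatment_share: float = 0.5) -> str:
    """Deterministic assignment: the same user always lands in the same arm,
    which keeps treatment assignment stable across sessions and requests."""
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    bucket = int(digest[:8], 16) / 0xFFFFFFFF  # roughly uniform in [0, 1]
    return "treatment" if bucket < treatment_share else "control"


def required_sample_size(p_baseline: float, mde: float,
                         alpha: float = 0.05, power: float = 0.8) -> int:
    """Per-arm sample size to detect an absolute lift of `mde` over a
    baseline conversion rate, two-sided test (normal approximation)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    p_treat = p_baseline + mde
    p_bar = (p_baseline + p_treat) / 2
    var_null = 2 * p_bar * (1 - p_bar)                     # pooled, under H0
    var_alt = (p_baseline * (1 - p_baseline)
               + p_treat * (1 - p_treat))                  # under H1
    n = ((z_alpha * math.sqrt(var_null)
          + z_beta * math.sqrt(var_alt)) ** 2) / mde ** 2
    return math.ceil(n)
```

For a 10% baseline rate and a 1-point minimum detectable lift, this lands near fifteen thousand users per arm, which is exactly the kind of number that tells you up front whether the test is feasible for your traffic.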
Operationally, I would ramp exposure gradually, monitor data quality and serving health throughout the test, and check for novelty effects before declaring success. Good experimentation discipline matters as much as model quality here.
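For the evaluation step, a simple two-proportion z-test on the primary metric is a reasonable baseline readout (real platforms often layer on CUPED, sequential testing, or multiple-comparison corrections). The function below is an illustrative sketch; its signature and sample numbers are assumptions, not a prescribed API.

```python
import math
from statistics import NormalDist


def two_proportion_ztest(conv_c: int, n_c: int, conv_t: int, n_t: int):
    """Two-sided z-test comparing conversion rates between control and
    treatment. Returns (absolute_lift, z_statistic, p_value)."""
    p_c, p_t = conv_c / n_c, conv_t / n_t
    p_pool = (conv_c + conv_t) / (n_c + n_t)          # pooled rate under H0
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_c + 1 / n_t))
    z = (p_t - p_c) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return p_t - p_c, z, p_value


# Hypothetical readout: 10.0% control vs 11.5% treatment, 10k users each.
lift, z, p = two_proportion_ztest(1000, 10_000, 1150, 10_000)
```

I would run this same test per segment cut as well, since a flat overall result can hide a model that helps power users while hurting new ones.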
A weak answer treats A/B testing like a checkbox, without talking about randomization, logging, sample size, guardrails, or cohort-specific impact.
easy
medium
medium
hard
hard