Instruction: Describe the steps you would take to identify the root cause of a significant drop in model accuracy in a production system. Discuss how you would prioritize actions and what measures you might implement to rectify the issue.
Context: This question assesses the candidate's problem-solving skills and understanding of the complexities involved in maintaining ML model performance over time. Candidates should demonstrate their approach to troubleshooting, including potential areas of investigation (e.g., data quality, model drift, feature changes) and remediation strategies.
I would start by isolating whether the issue is model-related, data-related, or system-related. That means checking recent deployments, feature freshness, schema changes, label pipelines, upstream service behavior, and whether the drop is global or isolated to certain segments.
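The "global vs. isolated to certain segments" check can be sketched as a simple segment-wise accuracy breakdown. The record format here (dicts with `segment`, `prediction`, and `label` keys) is a hypothetical stand-in for whatever the production logging actually emits:

```python
from collections import defaultdict

def accuracy_by_segment(records):
    """Break overall accuracy down per segment so a drop confined
    to one slice (e.g. one platform or region) is visible instead
    of being averaged away in the global number."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for r in records:
        total[r["segment"]] += 1
        correct[r["segment"]] += int(r["prediction"] == r["label"])
    return {seg: correct[seg] / total[seg] for seg in total}

# Illustrative data: the "web" slice is healthy, "mobile" is degraded.
records = [
    {"segment": "web", "prediction": 1, "label": 1},
    {"segment": "web", "prediction": 0, "label": 0},
    {"segment": "mobile", "prediction": 1, "label": 0},
    {"segment": "mobile", "prediction": 0, "label": 1},
]
print(accuracy_by_segment(records))
```

A global accuracy of 50% here would suggest a model-wide problem; the per-segment view shows the failure is entirely in one slice, which points instead at a segment-specific feature or pipeline issue.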
Once the likely cause is clear, the response could be a rollback, traffic reduction, feature repair, retraining, or a temporary threshold change. The key is to avoid assuming the model itself is broken before ruling out instrumentation and data-pipeline failures.
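Part of ruling out a data-pipeline failure is checking whether a feature's production distribution has drifted from its training distribution. A minimal sketch using the Population Stability Index follows; the bin count and the common reading of PSI above roughly 0.25 as significant drift are illustrative conventions, not fixed standards:

```python
import math

def psi(expected, actual, bins=10):
    """Population Stability Index between a reference sample of a
    feature (e.g. training data) and a recent production sample.
    Bins are equal-width over the reference range; out-of-range
    production values are clamped into the edge bins."""
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / bins or 1.0

    def bin_fractions(sample):
        counts = [0] * bins
        for x in sample:
            i = min(max(int((x - lo) / width), 0), bins - 1)
            counts[i] += 1
        n = len(sample)
        # Floor each fraction to avoid log(0) on empty bins.
        return [max(c / n, 1e-6) for c in counts]

    e = bin_fractions(expected)
    a = bin_fractions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))

reference = [i / 100 for i in range(100)]
shifted = [x + 0.5 for x in reference]
print(psi(reference, reference))  # ~0: distributions match
print(psi(reference, shifted))    # large: clear drift
```

A high PSI on a key feature would push the remediation toward feature repair or retraining on fresh data, while a low PSI across features would shift suspicion back to deployment or instrumentation changes.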
What matters in an interview is not just listing diagnostic steps, but connecting each finding back to a concrete modeling, evaluation, or deployment decision in practice.
A weak answer jumps straight to "retrain the model" without first checking whether the accuracy drop came from labels, features, or deployment changes.