Instruction: Detail the process for performing a thorough analysis after an ML model fails or underperforms in a production setting.
Context: This question seeks to understand how the candidate learns from failures and applies those lessons to future ML deployments.
Official answer available
Preview the opening of the answer, then unlock the full walkthrough.
I would treat the post-mortem as a systems investigation, not a blame exercise. First I would establish the timeline: what changed, how the failure was detected, which users or segments were affected, and whether the issue came from data, code, configuration, infrastructure,...