How do you handle missing or corrupted data in a dataset?

Instruction: Describe the strategies you can use to deal with incomplete or corrupted data.

Context: This question assesses the candidate's ability to preprocess data, a critical step in building effective machine learning models.

Example Answer

The first thing I do is figure out why the data is missing or corrupted before I decide how to fix it. If I skip that step, I can end up treating a product bug, a logging failure, or a biased collection pattern as a harmless preprocessing issue. I want to know whether the problem is random or systematic, whether it started recently, and whether it is tied to a specific source.
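As a rough illustration of that diagnostic step, the sketch below breaks the missing rate down by source and by day using pandas. The table and its column names (`source`, `day`, `latency_ms`) are made up for the example; a real dataset would have its own fields and would usually need more than two cuts.

```python
import pandas as pd

NAN = float("nan")

# Hypothetical event log: the column names and values are illustrative only.
events = pd.DataFrame({
    "source": ["web", "web", "web", "ios", "ios", "ios", "android", "android"],
    "day": pd.to_datetime([
        "2024-01-01", "2024-01-02", "2024-01-03",
        "2024-01-01", "2024-01-02", "2024-01-03",
        "2024-01-02", "2024-01-03",
    ]),
    "latency_ms": [120.0, 135.0, 110.0, 98.0, NAN, NAN, 140.0, 150.0],
})

is_missing = events["latency_ms"].isna()

# Share of missing values per source: a spike in one source hints at a
# systematic cause (for example a logging bug) rather than random gaps.
missing_by_source = is_missing.groupby(events["source"]).mean()

# Share of missing values per day: a sudden rise hints at a recent change.
missing_by_day = is_missing.groupby(events["day"]).mean()

print(missing_by_source)
print(missing_by_day)
```

In this toy data every gap comes from the `ios` source and none appear on the first day, which is the kind of pattern that would send me to the logging pipeline before I touch any imputation.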

Once I understand that, I decide whether to repair, drop, impute, or explicitly model the missingness. In many real projects, I’ll add missing-value indicators, compare different imputation approaches, and make sure the handling used in training is applied identically in production. If the corruption is severe, I’d rather narrow the scope of the analysis than pretend the data is cleaner than it really is.
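One way to keep training and production handling identical is to learn the fill values once and reuse them everywhere. The sketch below assumes median imputation plus a 0/1 indicator column per feature; the helper names `fit_imputer` and `apply_imputer` and the `age` and `income` columns are hypothetical, and median is only one of the approaches worth comparing.

```python
import pandas as pd

NAN = float("nan")


def fit_imputer(train: pd.DataFrame, columns: list[str]) -> dict[str, float]:
    """Learn one fill value per column from the training data only."""
    return {col: float(train[col].median()) for col in columns}


def apply_imputer(df: pd.DataFrame, fill_values: dict[str, float]) -> pd.DataFrame:
    """Add a 0/1 missing-value indicator per column, then fill the gaps."""
    out = df.copy()
    for col, value in fill_values.items():
        # Record the missingness before it is overwritten by the fill value.
        out[f"{col}_was_missing"] = out[col].isna().astype(int)
        out[col] = out[col].fillna(value)
    return out


# Illustrative training data.
train = pd.DataFrame({
    "age": [25.0, NAN, 40.0, 31.0],
    "income": [50.0, 62.0, NAN, 58.0],
})
fill_values = fit_imputer(train, ["age", "income"])
train_clean = apply_imputer(train, fill_values)

# At serving time, reuse the stored fill values instead of recomputing them.
incoming = pd.DataFrame({"age": [NAN], "income": [70.0]})
incoming_clean = apply_imputer(incoming, fill_values)

print(train_clean)
print(incoming_clean)
```

The indicator columns let a model learn from the missingness itself, and storing `fill_values` alongside the model means a production row is filled with the same numbers the model saw during training.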

Common Poor Answer

A weak answer says, "I fill missing values with the mean and remove bad rows," without checking why the issue exists or whether the missingness itself carries information.
