Instruction: Discuss your strategies for integrating and cleaning data from diverse sources.
Context: This question evaluates the candidate's ability to handle complex data integration challenges, ensuring the reliability of their analyses.
I would start by treating the problem as a data-integration exercise, not just a collection exercise. The first questions I want answered are: what does each source measure, how fresh is it, how reliable is it, and where do its definitions differ from the others? Inconsistent data sources usually create more problems through mismatched semantics than through missing rows.
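In practice, those first questions can be answered with a quick profiling pass over each source before any merging happens. The sketch below is illustrative: the source names (`crm_rows`, `billing_rows`) and field names are assumptions, not a prescribed schema.

```python
from datetime import datetime

# Hypothetical raw records from two sources; names and fields are assumptions.
crm_rows = [
    {"customer_id": "C1", "email": "a@x.com", "updated_at": "2024-03-01T10:00:00"},
    {"customer_id": "C2", "email": None, "updated_at": "2024-03-05T09:30:00"},
]
billing_rows = [
    {"cust": "C1", "mail": "a@x.com", "ts": "2024-02-20T00:00:00"},
]

def profile(rows, ts_field):
    """First-pass questions: how many rows, how fresh, how complete."""
    freshest = max(datetime.fromisoformat(r[ts_field]) for r in rows)
    null_rate = {
        field: sum(r[field] is None for r in rows) / len(rows)
        for field in rows[0]
    }
    return {"rows": len(rows), "freshest": freshest, "null_rate": null_rate}

crm_profile = profile(crm_rows, "updated_at")
billing_profile = profile(billing_rows, "ts")
```

A profile like this surfaces the freshness gap and completeness differences between sources up front, which informs which source to trust later.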
From there, I would define a canonical schema, map each source into it, and document the transformations and assumptions. I would also validate keys, timestamps, units, entity definitions, and duplication rules before trusting the merged dataset. If two sources disagree, I do not pick one casually; I want a rule that is defensible, traceable, and stable enough for downstream modeling. Well-sourced data is not just collected data. It is reconciled data.
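The schema mapping and reconciliation steps can be sketched as follows. The canonical field names, source names, and the freshest-wins precedence rule are all illustrative assumptions; the point is that the mapping is explicit, lineage is kept, and the conflict rule is stated rather than implicit.

```python
from datetime import datetime

# Hypothetical mappings into a canonical schema (id, email, updated_at).
CANONICAL_MAPS = {
    "crm":     {"id": "customer_id", "email": "email", "updated_at": "updated_at"},
    "billing": {"id": "cust",        "email": "mail",  "updated_at": "ts"},
}

def to_canonical(source, row):
    """Map a raw row into the canonical schema and validate it."""
    mapped = {canon: row[raw] for canon, raw in CANONICAL_MAPS[source].items()}
    assert mapped["id"], f"missing key from {source}"
    mapped["updated_at"] = datetime.fromisoformat(mapped["updated_at"])
    mapped["_source"] = source  # keep lineage so the merge stays traceable
    return mapped

def reconcile(records):
    """On conflict, keep the freshest record per key: a documented, stable rule."""
    merged = {}
    for rec in sorted(records, key=lambda r: r["updated_at"]):
        merged[rec["id"]] = rec  # later (fresher) records overwrite earlier ones
    return merged

rows = [
    to_canonical("crm", {"customer_id": "C1", "email": "new@x.com",
                         "updated_at": "2024-03-01T10:00:00"}),
    to_canonical("billing", {"cust": "C1", "mail": "old@x.com",
                             "ts": "2024-02-20T00:00:00"}),
]
golden = reconcile(rows)
```

Because every merged record carries a `_source` tag and the precedence rule lives in one function, a disagreement between sources can always be traced and the rule revisited without rewriting the pipeline.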
A weak answer says "merge the sources and clean the duplicates" without addressing conflicting definitions, timestamp alignment, or which source should be trusted when they disagree.