How would you handle extremely large datasets that do not fit into memory?

Instruction: Describe the techniques you would use to train models on datasets too large to fit into memory.

Context: This question tests the candidate's ability to work with big data and their knowledge of scalable machine learning algorithms.

I would treat this as a systems problem first, not just a modeling problem. If the data does not fit into memory, I first decide whether to stream it, shard it, sample it, use out-of-core algorithms, or move the computation to distributed infrastructure. The answer depends...
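To make the out-of-core idea concrete, here is a minimal sketch of streaming training: the file is read in fixed-size chunks and a linear model is updated with mini-batch gradient steps, so only one chunk is ever in memory. It uses only the Python standard library; `make_dataset`, `stream_chunks`, and `sgd_fit` are hypothetical helper names, and the dataset is kept small here purely for illustration.

```python
import csv
import os
import random
import tempfile

def make_dataset(path, n_rows=10_000):
    """Write a synthetic CSV of (x, y) rows; stands in for a file too big to load."""
    random.seed(0)
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        for _ in range(n_rows):
            x = random.uniform(-1.0, 1.0)
            y = 3.0 * x + 0.5 + random.gauss(0.0, 0.1)  # true slope 3.0, intercept 0.5
            writer.writerow([x, y])

def stream_chunks(path, chunk_size=256):
    """Yield lists of (x, y) rows without ever loading the whole file."""
    with open(path, newline="") as f:
        chunk = []
        for row in csv.reader(f):
            chunk.append((float(row[0]), float(row[1])))
            if len(chunk) == chunk_size:
                yield chunk
                chunk = []
        if chunk:
            yield chunk

def sgd_fit(path, lr=0.1, epochs=10):
    """Out-of-core training of y = w*x + b: one mini-batch gradient step per chunk."""
    w, b = 0.0, 0.0
    for _ in range(epochs):
        for chunk in stream_chunks(path):
            grad_w = grad_b = 0.0
            for x, y in chunk:
                err = (w * x + b) - y
                grad_w += err * x
                grad_b += err
            n = len(chunk)
            w -= lr * grad_w / n  # average gradient over the chunk
            b -= lr * grad_b / n
    return w, b

path = os.path.join(tempfile.gettempdir(), "stream_demo.csv")
make_dataset(path)
w, b = sgd_fit(path)
print(f"w={w:.2f} b={b:.2f}")
```

The same chunked-update pattern is what library-level out-of-core APIs expose (for example, estimators that accept repeated partial fits); the point of the sketch is that memory use is bounded by the chunk size, not the dataset size.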
