Instruction: Define tokenization and explain its importance in natural language processing.
Context: This question aims to evaluate the candidate's familiarity with fundamental NLP concepts.
The way I'd explain it in an interview is this: Tokenization is the process of breaking text into units that an NLP system can work with, such as words, subwords, sentences, or characters. It is one of the first steps in most NLP pipelines because downstream models operate on discrete units rather than raw strings.
The exact tokenization strategy matters because it determines vocabulary size, how rare and out-of-vocabulary words are handled, multilingual behavior, and sequence length, which in turn affects model efficiency. Modern systems typically use subword tokenization (e.g., BPE, WordPiece, SentencePiece) rather than simple word splitting.
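To make the word-versus-subword distinction concrete, here is a minimal sketch contrasting naive word splitting with a greedy longest-match subword segmenter (similar in spirit to WordPiece's matching step, but with a hand-picked toy vocabulary; real systems learn their subword vocabularies from data):

```python
def word_tokenize(text):
    # Naive whitespace splitting; ignores punctuation and casing.
    return text.split()

def subword_tokenize(word, vocab):
    # Greedy longest-match segmentation against a known subword vocabulary;
    # falls back to a single character when no longer piece matches.
    tokens, i = [], 0
    while i < len(word):
        for j in range(len(word), i, -1):
            piece = word[i:j]
            if piece in vocab or j == i + 1:
                tokens.append(piece)
                i = j
                break
    return tokens

# Toy vocabulary, chosen by hand purely for illustration.
vocab = {"token", "ization", "un", "happi", "ness"}

print(word_tokenize("tokenization matters"))    # ['tokenization', 'matters']
print(subword_tokenize("tokenization", vocab))  # ['token', 'ization']
print(subword_tokenize("unhappiness", vocab))   # ['un', 'happi', 'ness']
```

The point to make in an interview: the word tokenizer treats "tokenization" as one opaque symbol (a problem if it is rare in training data), while the subword tokenizer decomposes it into pieces it has seen before, keeping the vocabulary small and open-ended.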
What matters in an interview is not only knowing the definition, but being able to connect it back to how it changes modeling, evaluation, or deployment decisions in practice.
A weak answer defines tokenization only as splitting text into words, ignoring sentence, character, and subword tokenization.