Instruction: Define tokenization and explain its importance in natural language processing.
Context: This question aims to evaluate the candidate's familiarity with fundamental NLP concepts.
The way I'd explain it in an interview is this: Tokenization is the process of breaking text into units that an NLP system can work with, such as words, subwords, sentences, or characters. It is one of the first steps in most NLP pipelines because downstream models operate on discrete units rather than raw strings.
The exact tokenization strategy matters because it determines vocabulary size, how rare and out-of-vocabulary words are handled, multilingual behavior, and sequence length, which in turn affects model efficiency. Modern systems typically use subword tokenization (e.g., BPE, WordPiece, SentencePiece) rather than simple word splitting.
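To make the word-versus-subword distinction concrete, here is a minimal sketch contrasting naive word splitting with a greedy longest-match subword segmenter (similar in spirit to WordPiece's matching step, but with a hand-picked toy vocabulary; real systems learn their subword vocabularies from data):

```python
def word_tokenize(text):
    # Naive whitespace splitting; ignores punctuation and casing.
    return text.split()

def subword_tokenize(word, vocab):
    # Greedy longest-match segmentation against a known subword vocabulary;
    # falls back to a single character when no longer piece matches.
    tokens, i = [], 0
    while i < len(word):
        for j in range(len(word), i, -1):
            piece = word[i:j]
            if piece in vocab or j == i + 1:
                tokens.append(piece)
                i = j
                break
    return tokens

# Toy vocabulary, chosen by hand purely for illustration.
vocab = {"token", "ization", "un", "happi", "ness"}

print(word_tokenize("tokenization matters"))    # ['tokenization', 'matters']
print(subword_tokenize("tokenization", vocab))  # ['token', 'ization']
print(subword_tokenize("unhappiness", vocab))   # ['un', 'happi', 'ness']
```

The point to make in an interview: the word tokenizer treats "tokenization" as one opaque symbol (a problem if it is rare in training data), while the subword tokenizer decomposes it into pieces it has seen before, keeping the vocabulary small and open-ended.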
What matters in an interview is not only knowing the definition, but being able to connect it back to how it changes modeling, evaluation, or deployment decisions in practice.
A weak answer defines tokenization only as splitting text into words, ignoring sentence, character, and subword tokenization.