Instruction: Explain the common reasons chunking looks reasonable in theory and fails in production.
Context: Checks whether the candidate can explain the core concept clearly and connect it to real production decisions.
The way I'd think about it is this: Chunking fails on real documents because the documents were not written for retrieval. You get repeated headers, tables split across pages, footnotes with important caveats, OCR noise, and effective dates far away from the actual rule. A naive fixed-size chunker ignores all of that.
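To make the failure concrete, here is a minimal sketch of a naive fixed-size chunker. The sample document and all names are hypothetical; the point is only that splitting on character count can separate a rule from the effective-date caveat that qualifies it.

```python
def fixed_size_chunks(text, size=80):
    """Split text into chunks of at most `size` characters, ignoring all structure."""
    return [text[i:i + size] for i in range(0, len(text), size)]

# Hypothetical policy document: the rule and its effective-date caveat
# live on adjacent lines, but the chunker does not know that.
doc = (
    "Section 4.2 Refund policy\n"
    "Customers may request a full refund within 30 days of purchase.\n"
    "Note: this policy is effective only for orders placed after 2023-01-01.\n"
)

chunks = fixed_size_chunks(doc, size=80)

# The chunk that states the refund rule no longer carries the effective date,
# so retrieval can return a "correct" but misleading passage.
rule_chunks = [c for c in chunks if "full refund" in c]
print(any("2023-01-01" in c for c in rule_chunks))  # → False
```

The retriever would score the rule chunk highly for a refund question, and the answer built from it would silently omit the caveat.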
The symptom is usually subtle. The system retrieves the correct document, but the passage is not self-sufficient enough to support the answer. Or it returns five nearly identical chunks because the structure was never normalized. Then people blame the model when the real problem started in preprocessing.
My first defense is to preserve structure aggressively: headings, lists, table labels, section ancestry, page references, version info, and captions. Then I test on the ugliest documents, not the cleanest ones. If chunking only looks good on tidy markdown, it is not really production-ready.
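A minimal sketch of the structure-preserving idea, assuming a markdown-style input: split at headings and attach the full heading ancestry to each chunk so every passage stays self-sufficient. The document, field names, and output format are all hypothetical.

```python
import re

def chunk_with_ancestry(markdown_text):
    """Yield chunks as {"ancestry": [...], "text": ...}, where ancestry is the
    list of headings leading to this section (e.g. ["Policies", "Refunds"])."""
    ancestry = []      # current heading path, indexed by heading depth
    body_lines = []
    chunks = []

    def flush():
        body = "\n".join(body_lines).strip()
        if body:
            chunks.append({"ancestry": list(ancestry), "text": body})
        body_lines.clear()

    for line in markdown_text.splitlines():
        m = re.match(r"^(#+)\s+(.*)", line)
        if m:
            flush()
            depth = len(m.group(1))
            del ancestry[depth - 1:]          # pop deeper and sibling headings
            ancestry.append(m.group(2).strip())
        else:
            body_lines.append(line)
    flush()
    return chunks

doc = """# Policies
## Refunds
Full refund within 30 days.
## Shipping
Ships in 5 business days.
"""

for c in chunk_with_ancestry(doc):
    print(" > ".join(c["ancestry"]), "|", c["text"])
```

Carrying the ancestry (and, in a real pipeline, page references and version info) means a retrieved chunk can say "Policies > Refunds" on its own, instead of relying on context that was lost at split time.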
A weak answer pins chunking failures solely on chunk size. Real failures usually come from ugly document structure, OCR noise, tables, footnotes, repeated headers, and lost metadata.
easy