Retrieval quality looks fine in English but fails badly on mixed-language docs. How would you debug it?

Instruction: Describe how you would investigate multilingual retrieval problems.

Context: Tests how the candidate diagnoses the problem, chooses the safest next step, and reasons through recovery.

I would debug this end to end as a multilingual retrieval problem, not just as a question of which embedding model to use. First I would check whether the corpus is normalized consistently across languages: encoding, OCR quality, tokenization artifacts, document metadata, and language labels. Mixed-language failures are often caused, at least in part, by preprocessing bugs.
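
As a rough illustration of that first normalization check, here is a minimal sketch using only the Python standard library. The function names (`script_of`, `audit_document`) and the report fields are hypothetical, and the script detection is a crude heuristic based on Unicode character names, not a real language identifier. It flags text that is not in NFC form, counts U+FFFD replacement characters (a common sign of a failed decode), looks for a few typical mojibake sequences, and tallies which scripts appear in a document.

```python
import unicodedata
from collections import Counter


def script_of(ch: str) -> str:
    """Rough script bucket: the first word of the Unicode character name."""
    name = unicodedata.name(ch, "")
    return name.split(" ")[0] if name else "UNKNOWN"


def audit_document(text: str) -> dict:
    """Return simple normalization and script statistics for one document."""
    nfc = unicodedata.normalize("NFC", text)
    # Count scripts over alphabetic characters only, after normalization.
    scripts = Counter(script_of(ch) for ch in nfc if ch.isalpha())
    return {
        # True when the stored text differs from its NFC form.
        "not_nfc": nfc != text,
        # U+FFFD usually means bytes were decoded with the wrong encoding.
        "replacement_chars": text.count("\ufffd"),
        # Crude hint only: typical UTF-8-read-as-cp1252 sequences.
        "mojibake_hint": any(m in text for m in ("Ã©", "Ã¨", "â€")),
        "scripts": dict(scripts),
        "mixed_script": len(scripts) > 1,
    }


if __name__ == "__main__":
    # Decomposed "é" (e + combining acute) followed by CJK text.
    print(audit_document("cafe\u0301 and 中文"))
```

Running an audit like this over both the indexed documents and a sample of failing queries would show whether the two sides are normalized the same way. Whether a mismatch there explains the retrieval failures is something to confirm against the actual pipeline.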

Then I would...
