A guardrail blocks the answer, but the unsafe content is actually in the retrieved document, not the user query. How would you investigate?

Instruction: Explain how you would debug a guardrail issue caused by retrieved evidence rather than user intent.

Context: Tests how the candidate diagnoses the problem, chooses the safest next step, and reasons through recovery.

I would inspect the retrieval path and content labeling, not just the user prompt. If the unsafe material came from retrieved context, the system is probably treating retrieved documents as inherently trustworthy, so harmful content enters the answer path unchecked until the guardrail blocks the final answer.
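One way to make that inspection concrete is to run the safety check on the user query and on each retrieved chunk separately, so the block can be attributed to a specific source. The following is a minimal sketch under stated assumptions, not a real guardrail: `is_unsafe`, `BLOCKLIST`, `Chunk`, `attribute_block`, and `filter_chunks` are hypothetical names, and the keyword match stands in for whatever classifier the system actually uses.

```python
from dataclasses import dataclass

# Hypothetical toy blocklist; a real system would call its own safety classifier.
BLOCKLIST = {"build a bomb", "synthesize nerve agent"}


def is_unsafe(text: str) -> bool:
    """Stand-in safety check (assumption): flag text containing a blocklisted phrase."""
    lowered = text.lower()
    return any(term in lowered for term in BLOCKLIST)


@dataclass
class Chunk:
    doc_id: str  # identifier of the source document
    text: str    # retrieved passage passed to the model


def attribute_block(query: str, chunks: list[Chunk]) -> dict:
    """Check the query and each chunk independently to locate the trigger."""
    flagged = [c.doc_id for c in chunks if is_unsafe(c.text)]
    query_flagged = is_unsafe(query)
    if query_flagged and flagged:
        source = "both"
    elif query_flagged:
        source = "user_query"
    elif flagged:
        source = "retrieved_context"
    else:
        source = "none"
    return {"source": source, "flagged_doc_ids": flagged}


def filter_chunks(chunks: list[Chunk]) -> list[Chunk]:
    """Drop flagged chunks so a benign query can still be answered."""
    return [c for c in chunks if not is_unsafe(c.text)]
```

If `attribute_block` reports `retrieved_context` for a benign query, the flagged document IDs point at what to quarantine or relabel in the index, and `filter_chunks` shows one possible recovery path: answer from the remaining evidence instead of refusing outright.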

I would compare...
