Instruction: Explain the major sources of latency outside raw model inference time.
Context: Checks whether the candidate can explain the core concept clearly and connect it to real production decisions.
The way I'd think about it is this: Latency in an LLM product is the sum of the whole request path, not just token generation. Prompt assembly, retrieval, routing, queueing, tool calls, policy checks, serialization, and client rendering can matter as much as the model itself.
That is why median model latency can look healthy while the product still feels slow. Users experience end-to-end latency, not just inference latency. If one stage is noisy or serial when it could be parallel, the whole workflow suffers.
I like to budget time by stage so the team can see where the latency actually accrues instead of blaming the model by default.
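A minimal sketch of that per-stage budgeting idea, assuming a simple timing wrapper around each step of the request path (the stage names and sleep durations here are purely illustrative, not a real service):

```python
import time
from contextlib import contextmanager

# Collected per-stage timings in milliseconds (hypothetical structure).
timings = {}

@contextmanager
def stage(name):
    """Time one stage of the request path and record it."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[name] = (time.perf_counter() - start) * 1000  # ms

# Simulated request path; sleeps stand in for real work.
with stage("retrieval"):
    time.sleep(0.005)
with stage("prompt_assembly"):
    time.sleep(0.001)
with stage("model_inference"):
    time.sleep(0.010)
with stage("policy_checks"):
    time.sleep(0.002)

# Report stages sorted by cost, so the dominant one is obvious.
total = sum(timings.values())
for name, ms in sorted(timings.items(), key=lambda kv: -kv[1]):
    print(f"{name:>16}: {ms:6.1f} ms ({ms / total:5.1%} of total)")
```

Even a crude breakdown like this makes the conversation concrete: if retrieval plus policy checks rival inference, the fix is in the workflow, not the model.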
A weak answer says LLM latency comes mostly from model size and token count. Those matter, but the workflow around the model often dominates the user experience.
easy