Instruction: Explain the major sources of latency outside raw model inference time.
Context: Checks whether the candidate can explain the core concept clearly and connect it to real production decisions.
The way I'd think about it is this: Latency in an LLM product is the sum of the whole request path, not just token generation. Prompt assembly, retrieval, routing, queueing, tool calls, policy checks, serialization, and client rendering can matter as much as the model itself.
That is why median model latency can look healthy while the product still feels slow. Users experience end-to-end latency, not just inference latency. If one stage is noisy or serial when it could be parallel, the whole workflow suffers.
I like to budget time by stage so the team can see where the latency actually accrues instead of blaming the model by default.
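A minimal sketch of that per-stage budgeting idea, assuming a simple timing wrapper around each step of the request path (the stage names and sleep durations here are purely illustrative, not a real service):

```python
import time
from contextlib import contextmanager

# Collected per-stage timings in milliseconds (hypothetical structure).
timings = {}

@contextmanager
def stage(name):
    """Time one stage of the request path and record it."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[name] = (time.perf_counter() - start) * 1000  # ms

# Simulated request path; sleeps stand in for real work.
with stage("retrieval"):
    time.sleep(0.005)
with stage("prompt_assembly"):
    time.sleep(0.001)
with stage("model_inference"):
    time.sleep(0.010)
with stage("policy_checks"):
    time.sleep(0.002)

# Report stages sorted by cost, so the dominant one is obvious.
total = sum(timings.values())
for name, ms in sorted(timings.items(), key=lambda kv: -kv[1]):
    print(f"{name:>16}: {ms:6.1f} ms ({ms / total:5.1%} of total)")
```

Even a crude breakdown like this makes the conversation concrete: if retrieval plus policy checks rival inference, the fix is in the workflow, not the model.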
A weak answer says LLM latency comes mostly from model size and token count. Those matter, but the workflow around the model often dominates the user experience.
easy