Instruction: Describe the difference between throughput and latency in a serving system.
Context: Checks whether the candidate can explain the core concept clearly and connect it to real production decisions.
The way I'd approach it in an interview is this: Latency is how long one request takes. Throughput is how much work the system can complete over time. The two are related, but optimizing one does not automatically optimize the other.
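To make the distinction concrete, here is a minimal sketch with invented numbers (the 100 ms service time and worker count are assumptions, not from any real system): adding concurrency leaves per-request latency unchanged while throughput scales.

```python
# Toy illustration: latency and throughput are different quantities.
# Assumes each request takes a fixed 100 ms of service time and workers
# process requests fully in parallel.

latency_s = 0.100   # time one request takes, start to finish
workers = 8         # concurrent workers (hypothetical)

# Per-request latency is unchanged by adding workers, but throughput
# (requests completed per second) scales with concurrency.
throughput_rps = workers / latency_s

print(f"per-request latency: {latency_s * 1000:.0f} ms")
print(f"throughput with {workers} workers: {throughput_rps:.0f} req/s")
```

The point of the sketch is that you could double throughput (16 workers) without touching latency, or halve latency (faster hardware) without touching concurrency; the two metrics move independently.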
You can improve throughput with batching and queueing while making individual requests wait longer. You can lower latency for one class of requests while reducing overall system efficiency. That is why serving decisions have to be tied to traffic shape and product expectations.
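The batching trade-off above can be sketched with a toy cost model (the overhead and per-item costs are made-up numbers for illustration, assuming each batch pays a fixed overhead plus a small incremental cost per request):

```python
# Sketch: batching raises throughput while raising per-request latency.
# Assumed cost model: processing a batch of size B costs a fixed overhead
# (e.g. one model forward pass) plus a small per-item cost.

overhead_s = 0.050   # fixed cost per batch (hypothetical)
per_item_s = 0.005   # incremental cost per request in the batch (hypothetical)

def batch_stats(batch_size):
    service_s = overhead_s + per_item_s * batch_size
    throughput_rps = batch_size / service_s
    # service_s is a lower bound on latency: requests also wait in the
    # queue for the batch to fill before processing even starts.
    return service_s, throughput_rps

for b in (1, 8, 32):
    service_s, tput = batch_stats(b)
    print(f"batch={b:>2}  latency >= {service_s * 1000:5.1f} ms  "
          f"throughput = {tput:6.1f} req/s")
```

Under this model, growing the batch from 1 to 32 roughly 8x's throughput, but every request in the batch now takes several times longer end to end; which side of that trade you want depends on traffic shape and what the product promises users.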
I usually explain it as serving economics versus user experience. Good systems balance both rather than pretending one metric captures the whole problem.
What I always try to avoid is giving a process answer that sounds clean in theory but falls apart once the data, users, or production constraints get messy.
A weak answer is saying throughput is just aggregate latency. They influence each other, but they are not the same thing.
easy