On this page

Defining the Core Metrics

Latency

Throughput

The Anatomy of an LLM Request

Phase 1: The Prefill (Compute-Bound)

Phase 2: The Decode (Memory-Bound)

Optimization Strategy: Batching

The Problem with Naive Batching

The Solution: Continuous Batching

The Memory Bottleneck: KV Caching

Memory Fragmentation

Advanced Optimization: PagedAttention

Designing the Architecture

1. The Gateway and Load Balancer

2. The Orchestrator (Scheduler)

3. The Model Worker

Metrics You Must Monitor

Conclusion

Designing LLM Inference Systems: Batching, Memory, and GPUs

Arslan Ahmad
Master LLM inference architecture. We explain continuous batching, vLLM, and KV caching to help you prepare for system design interviews.


Slow software is frustrating.

We have all experienced the irritation of typing a query into a chatbot and staring at a blinking cursor for seconds on end.

When the text finally appears, it might trickle out so slowly that you lose your train of thought. This sluggishness is a major user experience problem.

On the other side of the screen, the engineers running that chatbot face a different problem.

The hardware required to run Large Language Models (LLMs) is incredibly expensive. Graphics Processing Units (GPUs) with high memory bandwidth cost thousands of dollars.

If an engineer dedicates a whole GPU to a single user to make it fast, the operational costs become unsustainable.

This creates the central conflict of AI system design. You must balance latency (speed for the user) against throughput (cost efficiency for the system).

For junior developers and computer science students, understanding this trade-off is essential. It is the difference between a toy project and a scalable production system.

In this post, we will tear down the architecture of a high-performance inference platform. We will look at how modern tools like vLLM use clever memory management and scheduling to get the best of both worlds.

Defining the Core Metrics

Before we design the system, we need to define exactly what we are measuring.

In standard web development, we often look at simple response times.

In the world of Generative AI, metrics are more nuanced because the response is generated over time.

Latency

Latency is the speed of the system from the perspective of a single user. We break this down into two critical numbers.

Time to First Token (TTFT): This is the time elapsed between the user hitting "Enter" and the first character appearing on the screen. It measures how responsive the system feels. If TTFT is high, the user wonders if the app is broken.

Time Per Output Token (TPOT): Once the generation starts, this metric measures the time it takes to generate each subsequent token. A "token" is roughly 0.75 words. If TPOT is high, the text appears to stutter or stream slowly.
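Both numbers can be derived directly from per-token timestamps in a streamed response. Here is a minimal sketch; the function name and inputs are illustrative, not from any particular serving library:

```python
# Sketch: deriving TTFT and average TPOT from per-token arrival timestamps.
# All names are illustrative, not from any specific serving framework.

def latency_metrics(request_start: float, token_times: list[float]) -> tuple[float, float]:
    """Return (TTFT, average TPOT) in seconds for one streamed response."""
    ttft = token_times[0] - request_start  # wait until the first token appears
    if len(token_times) > 1:
        # Average gap between consecutive tokens after the first one.
        tpot = (token_times[-1] - token_times[0]) / (len(token_times) - 1)
    else:
        tpot = 0.0
    return ttft, tpot

ttft, tpot = latency_metrics(0.0, [0.40, 0.45, 0.50, 0.55])
print(f"TTFT={ttft:.2f}s, TPOT={tpot * 1000:.0f}ms/token")  # TTFT=0.40s, TPOT=50ms/token
```

In practice you would aggregate these per-request values into percentiles (p50, p99) rather than looking at single requests.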

Throughput

Throughput is the measure of total system capacity. We measure this in Tokens Per Second (TPS) across all concurrent users.

High throughput means your hardware is working hard. It means you are serving more customers with fewer GPUs.

This is the metric that business teams care about because it directly correlates to profit margins.

The Trade-off: Here is the hard truth. To get the highest throughput, you generally need to process many requests at once. But processing many requests at once usually slows down the specific response time for each individual user. Your job as a system architect is to find the "sweet spot" where the system is cheap enough to run but fast enough to use.

The Anatomy of an LLM Request

To understand why this trade-off exists, we have to look at how an LLM actually works under the hood. It is not a uniform process.

Every request goes through two distinct phases that place different demands on your hardware.

Phase 1: The Prefill (Compute-Bound)

When a request first arrives, the model receives your prompt. It processes all the input tokens in parallel. It does this to build the initial "understanding" or state of the conversation.

This phase requires a massive amount of raw calculation. The GPU cores are crunching numbers as fast as they can. We call this a Compute-Bound task. The speed limit here is how many floating-point operations the GPU can do per second.

Phase 2: The Decode (Memory-Bound)

This is where the model generates the answer.

LLMs are auto-regressive. They generate one token, append it to the sequence, and then feed the extended sequence back in to predict the next token, one token at a time.

This phase behaves very differently. To generate just one single token, the GPU has to read the entire model from its memory.

For a large model, this could be over 100 gigabytes of data. It moves all that data to the processor, does a tiny bit of math to find one word, and then has to do it all over again for the next word.

Because the system spends more time moving data than doing math, we call this a Memory-Bound task. The GPU compute cores often sit idle, waiting for data to arrive from memory.
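A back-of-envelope calculation makes this ceiling concrete: if every decode step must stream all weights from memory once, the per-request token rate is bounded by memory bandwidth divided by model size. The numbers below are illustrative assumptions, not measurements of any specific GPU or model:

```python
# Back-of-envelope: in the memory-bound decode phase, producing one token
# requires streaming all model weights from GPU memory once, so the speed
# ceiling is roughly bandwidth / model size. Figures are illustrative.

def decode_ceiling_tokens_per_s(model_bytes: float, bandwidth_bytes_per_s: float) -> float:
    return bandwidth_bytes_per_s / model_bytes

model_gb = 140        # e.g. a 70B-parameter model at 16-bit precision
bandwidth_gb_s = 2000  # rough high-end datacenter GPU memory bandwidth
tps = decode_ceiling_tokens_per_s(model_gb * 1e9, bandwidth_gb_s * 1e9)
print(f"~{tps:.1f} tokens/s ceiling for a single request")  # ~14.3 tokens/s
```

No amount of extra compute helps here; only batching (sharing the weight read across requests) or reducing data movement does.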

Optimization Strategy: Batching

The fact that the Decode phase leaves compute cores idle is a major inefficiency. We are paying for powerful processors that are doing nothing half the time.

The solution is Batching.

Instead of loading the model weights to process one user's request, we load the weights once and apply them to 10, 20, or 50 requests at the same time.

The cost of moving the data is paid once, but the benefit is shared across many users. This dramatically increases throughput.
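A simple cost model captures the amortization. Suppose each decode step pays a fixed weight-read cost plus a small per-request compute cost; both constants below are made-up illustrations, not benchmarks:

```python
# Toy cost model for batched decode: the weight read is paid once per step,
# the per-request compute is tiny, so cost per request shrinks with batch
# size. The two constants are illustrative assumptions.

def step_time_ms(batch_size: int, weight_read_ms: float = 70.0,
                 per_request_ms: float = 0.5) -> float:
    return weight_read_ms + per_request_ms * batch_size

for b in (1, 8, 32):
    t = step_time_ms(b)
    print(f"batch={b:2d}: {t:.1f} ms/step -> {t / b:.1f} ms per request-token")
```

The step gets slightly slower as the batch grows (latency rises a little), but the cost per request collapses, which is exactly the throughput/latency trade-off described above.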

The Problem with Naive Batching

In the early days of LLM serving, systems used static batching. The server would wait for 4 requests to arrive, bundle them, and send them to the GPU.

The problem is that text length varies wildly.

  • Request A is a short question: "What is the capital of France?" (Outputs 5 tokens).
  • Request B is a creative task: "Write a poem about rust." (Outputs 100 tokens).

If we batch these together, Request A finishes almost instantly. But it cannot leave the batch. The GPU slot assigned to Request A sits empty for the next 95 steps while the system waits for Request B to finish.

This is called the "straggler problem." It wastes valuable GPU space and creates unnecessary latency.

The Solution: Continuous Batching

Modern systems use Continuous Batching (sometimes called iteration-level scheduling).

In this approach, the scheduler does not group "requests." It groups "iterations."

  1. The system runs one generation step for all active requests.
  2. It checks if any request has finished.
  3. If Request A is done, it is removed immediately.
  4. The system grabs a new request (Request C) from the queue and inserts it into the empty slot.
  5. The next step runs immediately.

This ensures the GPU is always fully saturated with active work.

There are no bubbles of idle time. This is a critical feature in engines like vLLM and TGI.
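The five steps above can be sketched as a short simulation. This is a toy model of iteration-level scheduling, not the real vLLM or TGI scheduler; output lengths are known up front here only to keep the example small:

```python
# Toy continuous-batching loop. Each while-iteration is one generation step;
# finished requests free their slot immediately and queued requests fill it.
from collections import deque

def serve(requests, max_batch: int):
    """requests: list of (request_id, tokens_to_generate). Returns finish order."""
    queue = deque(requests)
    active = {}      # request_id -> tokens remaining
    finished = []
    while queue or active:
        # Fill empty slots from the queue (iteration-level scheduling).
        while queue and len(active) < max_batch:
            rid, n = queue.popleft()
            active[rid] = n
        # Run one generation step for every active request.
        for rid in list(active):
            active[rid] -= 1
            if active[rid] == 0:       # done: remove immediately, slot opens up
                del active[rid]
                finished.append(rid)
    return finished

print(serve([("A", 2), ("B", 5), ("C", 1)], max_batch=2))  # ['A', 'C', 'B']
```

Note that the short request C never waits for the long request B to finish: it slips into the slot A vacated, which is precisely what static batching cannot do.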

The Memory Bottleneck: KV Caching

Batching helps us use the compute cores, but we still have a memory problem.

Recall that to predict the next token, the model needs to "pay attention" to all previous tokens. Re-calculating the mathematical representations (called Keys and Values) for the entire history at every step is wasteful.

To save time, we calculate these vectors once and store them in the GPU memory. This is called the KV Cache.

The KV Cache is huge.


As the conversation gets longer, the cache grows. For long-context models, the cache can become larger than the model itself. This creates a hard limit on your throughput.

The GPU memory has to hold the model weights plus the KV Cache for every active user. When the memory is full, you cannot accept any new requests.
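You can estimate the cache size with simple arithmetic: per token, every layer stores one Key and one Value vector. The shapes below roughly match a 7B-parameter model at 16-bit precision and are meant as illustrative assumptions:

```python
# Back-of-envelope KV-cache size. Per token, each layer stores a Key and a
# Value vector. Default shapes roughly follow a 7B-class model (32 layers,
# 32 KV heads of dimension 128, fp16) and are illustrative assumptions.

def kv_cache_bytes(tokens: int, layers: int = 32, kv_heads: int = 32,
                   head_dim: int = 128, bytes_per_val: int = 2) -> int:
    return tokens * layers * kv_heads * head_dim * bytes_per_val * 2  # x2: K and V

gb = kv_cache_bytes(4096) / 1e9
print(f"4k-token context: ~{gb:.1f} GB of KV cache per sequence")  # ~2.1 GB
```

At roughly half a megabyte per token, a handful of long conversations can rival the weights themselves, which is why the cache, not compute, caps how many users fit on one GPU.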

Memory Fragmentation

In older systems, memory management was inefficient.

The system did not know how long a response would be, so it had to play it safe. It would reserve a large, contiguous block of memory (enough for the maximum possible length) for every user.

If the system reserved space for 2,000 tokens but the user only generated 50, the rest of that block was wasted. It was locked up and could not be used by anyone else.

This is called memory fragmentation. It meant that GPUs often reported "Out of Memory" even when they were actually 40% empty.
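A small calculation shows how bad the waste gets. Assume every request reserves a 2,000-token slot but most finish far earlier; the lengths below are made up for illustration:

```python
# Sketch: waste from fixed max-length reservations. Every request reserves
# space for `reserved_tokens`, regardless of how many tokens it generates.

def wasted_fraction(reserved_tokens: int, generated_lengths: list[int]) -> float:
    used = sum(generated_lengths)
    reserved = reserved_tokens * len(generated_lengths)
    return 1 - used / reserved

# Ten users, each reserved 2,000 tokens, most finishing far shorter:
waste = wasted_fraction(2000, [50, 120, 300, 80, 2000, 400, 60, 150, 90, 700])
print(f"{waste:.0%} of reserved KV memory sits idle")  # 80% of reserved KV memory sits idle
```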


Advanced Optimization: PagedAttention

This brings us to one of the most important innovations in recent years: PagedAttention.

This concept is borrowed from Operating System design. Your laptop does not give every program a single, giant block of physical RAM. It breaks memory into small "pages" and maps them to physical spots wherever there is room.

PagedAttention does the same thing for the KV Cache.

  1. It breaks the cache into small blocks (e.g., storage for 16 tokens).

  2. These blocks do not need to be next to each other in physical memory.

  3. The system maintains a "Block Table" that maps the logical flow of the conversation to the physical blocks on the GPU.


Why this matters:

  • Zero Waste: The system only allocates memory for the tokens that are actually generated.

  • Dynamic Growth: If a conversation keeps going, the system just grabs another free block from anywhere in memory.

  • Higher Throughput: Because there is no wasted space (fragmentation), you can fit more users into the same GPU.
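A toy version of the block table shows the idea. This sketch is inspired by the PagedAttention design described above, not vLLM's actual data structures; all names here are made up:

```python
# Toy block table in the spirit of PagedAttention: the logical sequence of
# KV blocks maps to arbitrary free physical blocks on the GPU. Illustrative
# sketch only, not vLLM's real API.

BLOCK_TOKENS = 16  # tokens stored per block

class BlockTable:
    def __init__(self, num_physical_blocks: int):
        self.free = list(range(num_physical_blocks))  # free-list of physical blocks
        self.tables = {}                              # seq_id -> [physical block ids]

    def append_token(self, seq_id: str, pos: int) -> int:
        """Allocate a new physical block only when a block boundary is crossed."""
        table = self.tables.setdefault(seq_id, [])
        if pos % BLOCK_TOKENS == 0:       # current blocks are all full
            table.append(self.free.pop())  # grab any free block, anywhere in memory
        return table[-1]                   # physical block holding this token

bt = BlockTable(num_physical_blocks=8)
for pos in range(40):                      # 40 tokens -> ceil(40/16) = 3 blocks
    bt.append_token("seq-0", pos)
print(bt.tables["seq-0"])                  # [7, 6, 5]
```

Because allocation happens one small block at a time, over-generation wastes at most one partially filled block per sequence instead of a whole max-length reservation.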

Designing the Architecture

Now that we understand the internal mechanics, let's look at how we build the actual platform. A production inference system usually has three layers.

1. The Gateway and Load Balancer

This is the front door. It handles authentication and traffic routing.

Standard load balancers use "Round Robin" (taking turns), but this is bad for LLMs because some requests take much longer than others.

Smart load balancers use metrics like "Least Outstanding Tokens" to send traffic to the server that has the most available capacity.
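The routing rule itself is tiny; the hard part is keeping the counts fresh. A minimal sketch, assuming each worker reports its in-flight token count (names are illustrative):

```python
# Sketch of "least outstanding tokens" routing: send the new request to the
# worker currently holding the fewest in-flight tokens. Illustrative names.

def pick_worker(outstanding_tokens: dict[str, int]) -> str:
    return min(outstanding_tokens, key=outstanding_tokens.get)

workers = {"gpu-0": 3200, "gpu-1": 450, "gpu-2": 1800}
print(pick_worker(workers))  # gpu-1
```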

2. The Orchestrator (Scheduler)

This is the brain of the operation. It sits on the inference server.

  • Queue Management: It holds incoming requests.

  • Batch Formation: It decides which requests get to enter the GPU for the next step.

  • Priority: It can let paid users cut the line. The Orchestrator is responsible for the continuous batching logic. It ensures the batch size is large enough to get high throughput but small enough to keep latency low.
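Priority-aware queue management can be sketched with a binary heap: a monotonic counter breaks ties so requests within the same tier stay first-in, first-out. This is an illustration of the scheduling idea, not any real engine's scheduler:

```python
# Toy priority queue for an orchestrator: lower priority number is served
# first; a counter keeps FIFO order within a tier. Illustrative sketch only.
import heapq
import itertools

class RequestQueue:
    def __init__(self):
        self._heap = []
        self._counter = itertools.count()  # tie-breaker preserves FIFO per tier

    def submit(self, request_id: str, priority: int):
        heapq.heappush(self._heap, (priority, next(self._counter), request_id))

    def next_for_batch(self) -> str:
        return heapq.heappop(self._heap)[2]

q = RequestQueue()
q.submit("free-1", priority=1)
q.submit("paid-1", priority=0)   # paid users cut the line
q.submit("free-2", priority=1)
print([q.next_for_batch() for _ in range(3)])  # ['paid-1', 'free-1', 'free-2']
```

In a real orchestrator, `next_for_batch` would be called repeatedly at each iteration to refill empty GPU slots, tying this queue directly into the continuous-batching loop.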

3. The Model Worker

This is the muscle. It runs the actual inference engine (like vLLM). It manages the memory allocation (using PagedAttention) and executes the matrix math on the GPU.

Metrics You Must Monitor

You cannot improve what you do not measure.

When you build this system, you need a dashboard that tracks specific inference metrics.

  • GPU Utilization: This measures how busy your GPU cores are. If this is low (below 50%), you are wasting money. You probably need to increase your batch size.

  • Queue Depth: This counts how many requests are waiting to start. If this number is consistently high, you need to add more GPU servers (scaling out).

  • Cache Utilization: This tracks how full your GPU memory is. With PagedAttention, you can safely run this close to 90%. If it hits 100%, the system has to pause requests or move data to the CPU, which causes massive lag.

  • Inter-Token Latency: Watch this closely. If you increase throughput too much, this number will go up, and users will complain that the bot feels slow.

Conclusion

Designing an LLM inference platform is a complex engineering challenge. It requires you to look beyond simple code correctness and understand the physical limitations of the hardware.

Here are the key takeaways for your design journey:

  • Throughput and Latency are Opposites: You must balance the need for system efficiency against the need for user speed.

  • The Bottleneck is Memory: Moving data is slower than doing math. Minimizing data movement is the key to performance.

  • Batching is Essential: Continuous batching solves the "straggler problem" and keeps GPUs saturated.

  • KV Caching is a Double-Edged Sword: It speeds up computation but eats up memory. Efficient management is required.

  • PagedAttention is the Industry Standard: By managing memory in blocks, we eliminate fragmentation and maximize the number of concurrent users.

By mastering these concepts, you can build AI systems that are fast, reliable, and cost-effective.
