Fundamentals of Deploying Large Language Model Inference

Hosting a large language model (LLM) can be a complex and challenging task. One of the main challenges is the large model size, which requires significant computational resources and capacity. Another challenge is model sharding, which involves splitting the model across multiple servers to distribute the computational load. Model serving and inference workflows also need to be carefully designed and optimized to handle the high volume of requests and data. Technical expertise is required to set up and maintain the infrastructure, including knowledge of distributed computing and data management. Additionally, the infrastructure setup itself can be complex and requires significant investment in hardware and software.


Some additional points to consider when it comes to the cost of hosting a large language model:

  1. Model compilation cost: Compiling a large language model requires significant computational resources and specialized expertise. This process can be time-consuming and expensive, and may require investment in hardware and software.
  2. Model hosting cost: Hosting a large language model requires significant infrastructure, including servers, storage, and networking equipment. These costs can vary depending on the size and complexity of the model, as well as the hosting provider used.
  3. Operational overhead cost: Operating a large language model also requires ongoing maintenance and support, including software updates, backups, and security measures. These costs can add up over time and require a dedicated team to manage.
  4. Number of models to deploy and manage: Depending on the use case, it may be necessary to deploy and manage multiple models, each with their own unique requirements and costs. This can add up quickly and require significant investment in resources and expertise.

Overall, the cost of hosting a large language model can be significant and require careful planning and budgeting. However, the benefits of using these models for natural language processing tasks can outweigh the costs in many cases.

When it comes to the performance of a large language model, you also need to think about the following terms:

  1. Model compilation: Compiling a large language model requires significant computational resources and specialized expertise. This process can be time-consuming and may impact the overall performance of the model.
  2. Model compression: Compressing a large language model can help reduce its size and improve its performance on certain tasks. However, compression can also impact the accuracy and quality of the model.
  3. Latency: Latency is the time it takes for the model to process a request and generate a response. Reducing latency is important for real-time applications and can be achieved through techniques such as model optimization and caching.
  4. Throughput: Throughput is the number of requests the model can process in a given time period. Improving throughput can help increase the efficiency of the model and reduce wait times for users.
  5. Availability: Availability refers to the percentage of time the model is able to function without interruption. Ensuring high availability requires investment in infrastructure and ongoing maintenance and support.
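
As a rough illustration of how latency, throughput, and availability can be computed from measurements, the sketch below summarizes a list of request latencies. The function names, the sample numbers, and the choice of a nearest-rank 95th percentile are assumptions made for this example and are not tied to any particular serving stack.

```python
import math
from typing import Dict, List


def summarize_requests(latencies_s: List[float], window_s: float) -> Dict[str, float]:
    """Summarize latency and throughput for requests completed in one time window."""
    ordered = sorted(latencies_s)
    # Nearest-rank percentile: smallest sample with at least 95% of samples at or below it.
    rank = max(1, math.ceil(0.95 * len(ordered)))
    return {
        "mean_latency_s": sum(ordered) / len(ordered),
        "p95_latency_s": ordered[rank - 1],
        "throughput_rps": len(ordered) / window_s,
    }


def availability_pct(total_s: float, downtime_s: float) -> float:
    """Percentage of the period during which the service was able to respond."""
    return 100.0 * (total_s - downtime_s) / total_s


# Hypothetical measurements: 10 requests completed within a 5-second window.
latencies = [0.20, 0.30, 0.25, 0.40, 0.35, 0.30, 0.90, 0.28, 0.31, 0.27]
stats = summarize_requests(latencies, window_s=5.0)
uptime = availability_pct(total_s=3600, downtime_s=36)

print(stats)
print(f"availability: {uptime:.1f}%")
```

A single slow request barely moves the mean but dominates the tail percentile, which is why latency targets are usually stated as percentiles rather than averages.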

Now let's take a quick look at the memory requirements to load a GPT-J model. The memory requirements depend on whether you are training or serving the model. Let's do some quick math on training GPT-J.

For FP32, you need 24GB to load the parameters, and the same again for the gradients. Training uses the Adam optimizer, which keeps two additional values per parameter: a running average of the gradients (24GB) and a running average of the squared gradients (another 24GB). So far we need 96GB just to hold a single model instance of GPT-J. On top of this, we also need to load the training batch and hold the activation memory footprint, which can easily push the requirement past 200GB. The memory requirements drop by almost half if you use an FP16 model.
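
The arithmetic above can be checked with a short script. This is a rough sketch that assumes GPT-J has about 6 billion parameters (the count implied by 24GB at 4 bytes per parameter) and counts only the weights, the gradients, and the two Adam state tensors. The function name is illustrative, and activations and batch data are deliberately left out.

```python
from typing import Dict


def training_memory_gb(num_params: float, bytes_per_param: int) -> Dict[str, float]:
    """Estimate static training memory in decimal GB, excluding activations."""
    per_copy = num_params * bytes_per_param / 1e9
    return {
        "parameters": per_copy,
        "gradients": per_copy,
        "adam_gradient_average": per_copy,
        "adam_squared_gradient_average": per_copy,
        "total": 4 * per_copy,
    }


GPT_J_PARAMS = 6e9  # assumption: roughly 6 billion parameters

fp32 = training_memory_gb(GPT_J_PARAMS, bytes_per_param=4)
fp16 = training_memory_gb(GPT_J_PARAMS, bytes_per_param=2)

print(f"FP32: {fp32['parameters']:.0f} GB per copy, {fp32['total']:.0f} GB total")
print(f"FP16: {fp16['parameters']:.0f} GB per copy, {fp16['total']:.0f} GB total")
```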


How many GPUs do I need to serve Llama 70B? To answer that, we first need to determine the amount of GPU memory required by the large language model (LLM). This calculation can be done using a straightforward formula:

M = (P * 4B) / (32 / Q) * 1.2

Symbol Description:

  • M: GPU memory expressed in gigabytes.
  • P: The number of parameters in the model. For instance, a 7B model has 7 billion parameters.
  • 4B: 4 bytes, indicating the bytes used for each parameter.
  • 32: There are 32 bits in 4 bytes.
  • Q: The number of bits used for loading the model. For example, 16 bits, 8 bits, or 4 bits.
  • 1.2: Represents a 20% overhead for loading additional elements in GPU memory.

Now, let's illustrate with some examples.

GPU Memory Required for Serving Llama 70B

Let's calculate the GPU memory required for serving Llama 70B, loading it in 16 bits. The model has 70 billion parameters.

(70 * 4 bytes) / (32 / 16) * 1.2 = 168 GB

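That's quite a lot of memory. A single A100 80GB wouldn't be enough. Two A100 80GB GPUs (160 GB in total) can hold the roughly 140 GB of 16-bit weights, but they leave less headroom than the 20% overhead assumed above, so serving the Llama 2 70B model in 16-bit mode on them is a tight fit.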
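
The same calculation can be written as a small helper to compare bit widths. This is a minimal sketch of the estimate described above. The function name is illustrative, and the 20% overhead is the rule-of-thumb figure from the symbol list, not a measured value.

```python
def serving_memory_gb(params_billions: float, bits: int, overhead: float = 1.2) -> float:
    """Estimate GPU memory in GB as (P * 4 bytes) / (32 / Q) * overhead."""
    return params_billions * 4 / (32 / bits) * overhead


# Llama 70B loaded at different precisions.
estimates = {bits: serving_memory_gb(70, bits) for bits in (16, 8, 4)}

for bits, gigabytes in estimates.items():
    print(f"{bits}-bit: about {gigabytes:.0f} GB")
```

Halving the number of bits halves the estimate, which is why quantized variants of a model can fit on far fewer GPUs.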

Now that we have talked about memory consumption, let's look at how model compression techniques can reduce it.

Below is the Python code to get the size of the model:

from accelerate.utils import calculate_maximum_sizes, convert_bytes
from accelerate.commands.estimate import check_has_model, create_empty_model
import torch
DTYPE_MODIFIER = {"float32": 1, "float16/bfloat16": 2, "int8": 4, "int4": 8}

def calculate_memory(model: torch.nn.Module, options: list):
    "Calculates the memory usage for a model init on `meta` device"
    total_size, largest_layer = calculate_maximum_sizes(model)

    data = []
    for dtype in options:
        dtype_total_size = total_size
        dtype_largest_layer = largest_layer[0]

        modifier = DTYPE_MODIFIER[dtype]
        dtype_total_size /= modifier
        dtype_largest_layer /= modifier

        # Rough training estimate: weights, gradients, and two Adam states (4x the model size).
        dtype_training_size = convert_bytes(dtype_total_size * 4)
        dtype_total_size = convert_bytes(dtype_total_size)
        dtype_largest_layer = convert_bytes(dtype_largest_layer)
        data.append(
            {
                "dtype": dtype,
                "Largest Layer or Residual Group": dtype_largest_layer,
                "Total Size": dtype_total_size,
                "Training using Adam": dtype_training_size,
            }
        )
    return data

model_name = 'microsoft/phi-2'
model = create_empty_model(model_name, library_name=None, trust_remote_code=True, access_token=None)
results = calculate_memory(model, ["float32"])
for result in results:
    print(f"Total size of the Model with dtype {result['dtype']} is {result['Total Size']}")

Pruning refers to removing redundant or less important parameters from a neural model to reduce its size and computational requirements. This is done by systematically setting low-magnitude weight parameters to zero. Structured pruning removes entire neurons/filters, while unstructured pruning zeros out individual weights. In favorable cases, pruning can reduce model size by over 90% with minimal loss in accuracy.
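
To make unstructured pruning concrete, the sketch below zeros out the smallest-magnitude weights in a flat list. It is a toy illustration under simple assumptions (a single global magnitude ranking, no retraining afterwards), and the function name is made up for this example.

```python
from typing import List


def magnitude_prune(weights: List[float], sparsity: float) -> List[float]:
    """Zero out the `sparsity` fraction of weights with the smallest magnitude."""
    if not 0.0 <= sparsity <= 1.0:
        raise ValueError("sparsity must be between 0 and 1")
    num_to_prune = int(len(weights) * sparsity)
    # Indices ordered by absolute value, smallest first.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    pruned_indices = set(order[:num_to_prune])
    return [0.0 if i in pruned_indices else w for i, w in enumerate(weights)]


weights = [0.8, -0.05, 0.3, 0.01, -0.6, 0.02, 0.4, -0.1]
pruned = magnitude_prune(weights, sparsity=0.5)
print(pruned)
```

The zeros only save memory or compute when the storage format or kernels exploit sparsity, which is one reason structured pruning is often easier to benefit from in practice.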


Knowledge distillation trains a smaller “student” model to mimic the outputs of a larger “teacher” model. The student is trained on soft targets (output distributions) from the teacher, capturing dark knowledge beyond just hard labels. This allows the student to efficiently learn the complex functions learned by the teacher. In favorable cases, distillation can reduce compute by over 90% with minimal loss in accuracy.
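
The soft-target idea can be sketched as a loss function. The example below uses a common formulation, the KL divergence between temperature-softened teacher and student distributions scaled by the squared temperature. The logits, the temperature value, and the function names are assumptions for illustration, and a real training setup would usually combine this term with a hard-label loss.

```python
import math
from typing import List


def softmax(logits: List[float], temperature: float = 1.0) -> List[float]:
    """Softmax over logits divided by a temperature (higher means softer)."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits: List[float], teacher_logits: List[float],
                      temperature: float = 2.0) -> float:
    """KL(teacher || student) on softened distributions, scaled by temperature squared."""
    teacher = softmax(teacher_logits, temperature)
    student = softmax(student_logits, temperature)
    kl = sum(p * math.log(p / q) for p, q in zip(teacher, student) if p > 0)
    return kl * temperature ** 2


teacher_logits = [4.0, 1.0, 0.5]
matching_student = [4.0, 1.0, 0.5]
mismatched_student = [1.0, 4.0, 0.5]

print(distillation_loss(matching_student, teacher_logits))
print(distillation_loss(mismatched_student, teacher_logits))
```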


Quantization reduces the precision of weights and activations from float32 to lower bit widths like int8 or int4. This shrinks model size and speeds up computation on integer-optimized hardware. Quantization applies techniques like clipping, rounding, and rescaling to discretize the continuous values while retaining model accuracy. Typical techniques are post-training quantization, quantization-aware training, and quantization-aware finetuning.

In summary, pruning, distillation, and quantization are three key techniques to optimize large models by reducing redundancy, transferring knowledge, and lowering precision respectively. Used together, they can provide massive reductions in model size and compute requirements with minimal impact on accuracy.
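
To illustrate the rescaling, rounding, and clipping steps of quantization, here is a minimal sketch of symmetric per-tensor int8 quantization. It assumes a single scale for the whole tensor and a [-127, 127] integer range. Production schemes often use per-channel or per-group scales, and the function names here are illustrative.

```python
from typing import List, Tuple


def quantize_int8(values: List[float]) -> Tuple[List[int], float]:
    """Map floats to integers in [-127, 127] using one shared scale."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    # Rescale, round to the nearest integer, then clip to the int8 range.
    quantized = [max(-127, min(127, round(v / scale))) for v in values]
    return quantized, scale


def dequantize(quantized: List[int], scale: float) -> List[float]:
    """Recover approximate float values from the integers."""
    return [q * scale for q in quantized]


values = [0.5, -1.27, 0.003, 1.0]
quantized, scale = quantize_int8(values)
restored = dequantize(quantized, scale)

print(quantized)
print(restored)
```

Each restored value differs from the original by at most half the scale, which is the precision traded away for the smaller representation.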


Tensor parallelism is a technique used in large neural models to distribute the computation of large layers across multiple devices. The key idea is to partition a layer's weight tensors into smaller slices and compute with each slice in parallel on a different device.

Some key points on tensor parallelism:

  • It allows partitioning the weights of a large layer across devices, with each device holding a slice of the weights tensor.
  • During training or inference, the input activations are replicated or partitioned to match, and each device performs computation on its slice of weights and its activation inputs in parallel.
  • The outputs from each device are then combined to reform the outputs of the layer, by concatenation or by summation depending on how the weights were split.
  • This differs from data parallelism where the entire model weights are on each device and different minibatches of data are fed to each device. Tensor parallelism splits the model itself.
  • It enables training very large models that don't fit on a single device. The model can be split across multiple GPUs or TPU cores.
  • Communication between devices is needed while gathering the partial outputs, so a high-speed interconnect between devices is important.
  • Finding the right way to partition the tensors across devices to minimize communication and load balance is an active area of research.
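
The sketch below simulates one common variant of this idea on a single machine: a column-wise split of a weight matrix, where every simulated device sees the full input and the partial outputs are concatenated. Plain Python lists stand in for tensors and devices, and all function names are made up for the example. A real implementation would use collective communication between GPUs.

```python
from typing import List

Matrix = List[List[int]]


def matmul(x: Matrix, w: Matrix) -> Matrix:
    """Multiply x (batch x in) by w (in x out)."""
    return [
        [sum(row[k] * w[k][j] for k in range(len(w))) for j in range(len(w[0]))]
        for row in x
    ]


def split_columns(w: Matrix, num_devices: int) -> List[Matrix]:
    """Give each device an equal slice of the weight matrix's output columns."""
    cols = len(w[0])
    if cols % num_devices != 0:
        raise ValueError("output dimension must divide evenly across devices")
    width = cols // num_devices
    return [[row[d * width:(d + 1) * width] for row in w] for d in range(num_devices)]


def column_parallel_matmul(x: Matrix, w: Matrix, num_devices: int) -> Matrix:
    shards = split_columns(w, num_devices)
    # Each simulated device computes with the full input and its own weight slice.
    partials = [matmul(x, shard) for shard in shards]
    # Gather step: concatenate the partial outputs along the output dimension.
    return [
        [value for partial in partials for value in partial[b]]
        for b in range(len(x))
    ]


x = [[1, 2, 3], [4, 5, 6]]
w = [[1, 0, 2, 1], [0, 1, 1, 3], [2, 1, 0, 1]]
parallel_result = column_parallel_matmul(x, w, num_devices=2)
print(parallel_result)
```

The sharded result matches the unsharded product, while each device only ever stores half of the weight matrix.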



Pipeline parallelism is a technique for distributed training of large neural network models across multiple devices or nodes. In pipeline parallelism, the model is split into partitions or stages, with each stage assigned to a different device. The input is fed through the pipeline in micro-batches, with each device performing computations on the micro-batch and then passing its outputs to the next device. This allows for parallelization across devices and overlap of computation and communication.

In contrast, tensor parallelism splits the model across devices by partitioning tensors, typically along the hidden dimension. For example, different slices of a large weight matrix may be assigned to different devices. The devices collectively compute the result for a layer, synchronizing gradients at each step.

The key differences between pipeline and tensor parallelism are:

  • Pipeline parallelism partitions the model into sequential stages, while tensor parallelism partitions internal tensor dimensions like hidden states.
  • In pipeline parallelism, devices operate on micro-batches independently and asynchronously. Tensor parallelism requires devices to synchronize gradients after each step before proceeding.
  • Pipeline parallelism can provide higher hardware utilization by overlapping computation and communication. Tensor parallelism can reduce activation memory usage but requires more synchronization.
  • Pipeline parallelism splits each batch of training examples into micro-batches, while tensor parallelism partitions the tensors themselves. This can lead to differences in convergence behavior.
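
The way micro-batches flow through pipeline stages can be visualized with a small simulation. This sketch assumes a forward-only pipeline in which every stage takes one time step per micro-batch. The function names are illustrative, and real schedules also interleave backward passes during training.

```python
from typing import List, Optional


def pipeline_schedule(num_stages: int, num_microbatches: int) -> List[List[Optional[int]]]:
    """Return, for each time step, the micro-batch index each stage works on (None if idle)."""
    total_steps = num_stages + num_microbatches - 1
    schedule = []
    for step in range(total_steps):
        row = []
        for stage in range(num_stages):
            microbatch = step - stage  # stage s starts micro-batch m at step m + s
            row.append(microbatch if 0 <= microbatch < num_microbatches else None)
        schedule.append(row)
    return schedule


def idle_fraction(schedule: List[List[Optional[int]]]) -> float:
    """Fraction of (step, stage) slots in which a stage has nothing to do."""
    slots = [slot for row in schedule for slot in row]
    return slots.count(None) / len(slots)


schedule = pipeline_schedule(num_stages=4, num_microbatches=8)
for step, row in enumerate(schedule):
    print(step, row)
print(f"idle fraction: {idle_fraction(schedule):.2f}")
```

The idle slots at the start and end of the schedule are the pipeline "bubble". Using more micro-batches per batch shrinks the bubble relative to useful work.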


Overall, generative inference of LLMs has three main challenges (according to Pope et al. 2022):

  • A large memory footprint due to massive model parameters and transient state during decoding. The parameters often exceed the memory of a single accelerator chip. Attention key-value caches also require substantial memory.
  • Low parallelizability increases latency, especially with the large memory footprint, requiring substantial data transfers to load parameters and caches into compute cores each step. This results in high total memory bandwidth needs to meet latency targets.
  • Quadratic scaling of attention mechanism compute relative to sequence length compounds the latency and computational challenges.


You can also use request batching to increase the number of requests served at the same time. With static batching, however, a batch is held until every sequence in it has finished generating, so slots freed by short sequences sit idle while the longest sequence completes.

The industry recognized the inefficiency and came up with a better approach. Orca: A Distributed Serving System for Transformer-Based Generative Models, a paper presented at OSDI '22, is the first to our knowledge to tackle this problem. Instead of waiting until every sequence in a batch has completed generation, Orca implements iteration-level scheduling where the batch size is determined per iteration. The result is that once a sequence in a batch has completed generation, a new sequence can be inserted in its place, yielding higher GPU utilization than static batching.
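
The benefit of iteration-level scheduling can be seen in a toy simulation that counts decoding iterations. This is a sketch under simplifying assumptions: every sequence produces one token per iteration, prompt processing is ignored, and the function names and output lengths are invented for the example. It is not Orca's actual scheduler.

```python
from collections import deque
from typing import List


def static_batching_steps(output_lengths: List[int], batch_size: int) -> int:
    """Each batch runs until its longest sequence finishes."""
    steps = 0
    for start in range(0, len(output_lengths), batch_size):
        steps += max(output_lengths[start:start + batch_size])
    return steps


def continuous_batching_steps(output_lengths: List[int], batch_size: int) -> int:
    """Free slots are refilled from the queue at every iteration."""
    queue = deque(output_lengths)
    active: List[int] = []  # tokens still to generate for each running sequence
    steps = 0
    while queue or active:
        while queue and len(active) < batch_size:
            active.append(queue.popleft())
        steps += 1  # one decoding iteration for every running sequence
        active = [left - 1 for left in active if left - 1 > 0]
    return steps


lengths = [2, 10, 3, 4, 8, 2, 5, 1]  # hypothetical output lengths in tokens
static_steps = static_batching_steps(lengths, batch_size=4)
continuous_steps = continuous_batching_steps(lengths, batch_size=4)
print(static_steps, continuous_steps)
```

With these lengths the static scheme needs 18 iterations and the iteration-level scheme needs 10, because finished sequences no longer hold their slots until the slowest one is done.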


In the autoregressive decoding process, all the input tokens to the LLM produce their attention key and value tensors, and these tensors are kept in GPU memory to generate the next tokens. These cached key and value tensors are often referred to as the KV cache. The KV cache is:

  • Large: Takes up to 1.7GB for a single sequence in LLaMA-13B.
  • Dynamic: Its size depends on the sequence length, which is highly variable and unpredictable.

As a result, efficiently managing the KV cache presents a significant challenge. The vLLM authors report that existing systems waste 60% to 80% of memory due to fragmentation and over-reservation.
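
The 1.7GB figure can be reproduced with a back-of-the-envelope calculation. The sketch below assumes LLaMA-13B's published configuration (40 layers, hidden size 5120), FP16 values, and a 2048-token sequence. None of these numbers appear in the text above, and the function name is illustrative.

```python
def kv_cache_bytes(num_layers: int, hidden_size: int, seq_len: int,
                   bytes_per_value: int = 2) -> int:
    """Bytes needed to cache keys and values for one sequence."""
    # Two tensors (key and value) per layer, each with hidden_size values per token.
    per_token = 2 * num_layers * hidden_size * bytes_per_value
    return per_token * seq_len


# Assumed LLaMA-13B configuration: 40 layers, hidden size 5120, FP16 cache.
per_token_bytes = kv_cache_bytes(num_layers=40, hidden_size=5120, seq_len=1)
full_sequence_bytes = kv_cache_bytes(num_layers=40, hidden_size=5120, seq_len=2048)

print(f"per token: {per_token_bytes / 1024:.0f} KiB")
print(f"2048 tokens: {full_sequence_bytes / 1e9:.2f} GB")
```

Because the cache grows linearly with sequence length, reserving space for the maximum length up front wastes most of the allocation for short sequences.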

PagedAttention is a new attention mechanism implemented in vLLM. It takes inspiration from traditional OS concepts such as paging and virtual memory. It allows the KV cache (first computed in the “prefill” phase, when the prompt is processed) to be non-contiguous by allocating memory in fixed-size “pages”, or blocks. The attention mechanism can then be rewritten to operate on block-aligned inputs, allowing attention to be performed on non-contiguous memory ranges.

This means that buffer allocation can happen just-in-time instead of ahead-of-time: when starting a new generation, the framework does not need to allocate a contiguous buffer of size maximum_context_length. Each iteration, the scheduler can decide if it needs more room for a particular generation, and allocate on the fly without any degradation to PagedAttention's performance. This doesn't guarantee perfect utilization of memory (their blog says the wastage is now limited to under 4%, only in the last block), but it significantly improves upon wastage from ahead-of-time allocation schemes used widely by the industry today.


Altogether, PagedAttention + vLLM enable massive memory savings as most sequences will not consume the entire context window. These memory savings translate directly into a higher batch size, which means higher throughput and cheaper serving.
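
The bookkeeping behind block-based allocation can be sketched with a small allocator that hands out fixed-size blocks on demand. This is a simplified illustration of the paging idea only. The class and method names are hypothetical, it tracks block ids without storing any tensors, and it is not vLLM's implementation.

```python
from typing import Dict, List


class PagedKVAllocator:
    """Tracks which physical blocks hold each sequence's KV cache entries."""

    def __init__(self, num_blocks: int, block_size: int) -> None:
        self.block_size = block_size
        self.free_blocks: List[int] = list(range(num_blocks))
        self.block_tables: Dict[str, List[int]] = {}  # sequence id -> physical block ids
        self.num_tokens: Dict[str, int] = {}          # sequence id -> tokens stored

    def append_token(self, seq_id: str) -> None:
        """Reserve room for one more token, taking a new block only when needed."""
        count = self.num_tokens.get(seq_id, 0)
        table = self.block_tables.setdefault(seq_id, [])
        if count == len(table) * self.block_size:  # no block yet, or last block is full
            if not self.free_blocks:
                raise MemoryError("no free KV cache blocks")
            table.append(self.free_blocks.pop())
        self.num_tokens[seq_id] = count + 1

    def wasted_slots(self, seq_id: str) -> int:
        """Unused token slots, which can only occur in the sequence's last block."""
        return len(self.block_tables[seq_id]) * self.block_size - self.num_tokens[seq_id]

    def free(self, seq_id: str) -> None:
        """Return all of a finished sequence's blocks to the pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.num_tokens.pop(seq_id, None)


allocator = PagedKVAllocator(num_blocks=8, block_size=16)
for _ in range(16):
    allocator.append_token("a")
for _ in range(5):
    allocator.append_token("b")
for _ in range(4):
    allocator.append_token("a")  # "a" grows again after "b" took a block

print(allocator.block_tables)
print(allocator.wasted_slots("a"), allocator.wasted_slots("b"))
```

Sequence "a" ends up with two blocks that are not adjacent, and the waste for any sequence is bounded by one block rather than by the maximum context length.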

Dynamic SplitFuse is a novel token composition strategy for prompt processing and token generation. DeepSpeed-FastGen utilizes Dynamic SplitFuse to run at a consistent forward size by leveraging the capability to take partial tokens from prompts and compose this with generation. In particular, Dynamic SplitFuse performs two key behaviors:

  1. Long prompts are decomposed into much smaller chunks and scheduled across multiple forward passes (iterations) with only the final pass performing any generation.
  2. Short prompts will be composed to exactly fill a target token budget. Even short prompts may be decomposed to ensure the budget is precisely met and the forward sizes are well-aligned.

Together, these two techniques provide concrete benefits on all user metrics:

  1. Better Responsiveness: Since long prompts no longer require extremely long forward passes to process, the model will provide lower client latency. More forward passes are performed within the same window of time.
  2. Higher Efficiency: Fusion of short prompts to larger token budgets enables the model to consistently operate in the high throughput regime.
  3. Lower variance and better consistency: Since forward passes are of consistent size and forward pass size is the primary determinant of performance, the latency of each forward pass is much more consistent than in competing systems, as is the perceived generation frequency. There is no pre-emption, and there are no long-running prompts, to increase the latency as in prior work.
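
The two behaviors can be illustrated with a toy scheduler that packs prompt tokens into forward passes of a fixed token budget. This is only a sketch of the splitting and fusing idea. The function name, the 512-token budget, and the prompt lengths are assumptions, tokens from ongoing generation are left out for brevity, and it is not DeepSpeed-FastGen's implementation.

```python
from typing import List, Tuple


def splitfuse_schedule(prompt_lengths: List[int], token_budget: int) -> List[List[Tuple[int, int]]]:
    """Return forward passes, each a list of (prompt_index, tokens_taken) chunks."""
    if token_budget <= 0:
        raise ValueError("token_budget must be positive")
    remaining = list(prompt_lengths)
    passes = []
    index = 0
    while index < len(remaining):
        budget = token_budget
        current = []
        # Keep adding chunks until the budget is exactly filled or prompts run out.
        while budget > 0 and index < len(remaining):
            take = min(budget, remaining[index])
            current.append((index, take))
            remaining[index] -= take
            budget -= take
            if remaining[index] == 0:
                index += 1  # prompt fully processed, move to the next one
        passes.append(current)
    return passes


prompts = [1000, 100, 50, 300]  # hypothetical prompt lengths in tokens
passes = splitfuse_schedule(prompts, token_budget=512)
for number, chunks in enumerate(passes):
    print(number, chunks, sum(tokens for _, tokens in chunks))
```

The long first prompt is split across two passes, the short prompts are fused to fill the leftover budget, and every pass except the last carries exactly 512 tokens.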

We hope that knowing these fundamental concepts helps you deploy and train large language models using Azure.


This article was originally published by Microsoft's AI - Machine Learning Blog. You can find the original article here.