In this two-part blog series, we explore how to perform optimized training and inference of large language models from Hugging Face, at scale, on Azure Databricks.

*In the first part* we focused on optimized model training, leveraging the distributed parallel infrastructure available on Azure Databricks to train deep learning-based models, and using DeepSpeed to optimize the training process.

In this second part, we focus on model inference, also leveraging the distributed parallel infrastructure available on Azure Databricks and using DeepSpeed to optimize the inference process.

The code for this blog series is available at *this GitHub repository*, as a series of Databricks notebooks.

**Distributed Parallel Processing on Databricks**

*Apache Spark* is the engine that implements distributed parallel processing on Databricks. At a high level, it allows parallel computations over data stored in distributed filesystems. There are different ways in which this parallel computation can be implemented on Spark.

One of the most common is to represent the data as a *Spark DataFrame* and execute parallel computations on top of it using its API.

When the use case is batch inference of deep learning-based models, another option is to use the same distributed parallel infrastructure used for model training, based on *PyTorch with Horovod*.

We explore these two methods here and we show, for both methods, how to use *DeepSpeed to optimize the model for inference*.

**Distributed Parallel Batch Inference with Pandas UDF**

Similar to data preparation, batch inference is also an *embarrassingly parallel task*, which we can perform on Spark through *Pandas UDFs* (user-defined functions) over Spark DataFrames. In our example, we use the Hugging Face *Transformers pipeline* abstraction to perform model inference.

By optimizing model inference with DeepSpeed, we observed a **speedup of about 1.35X** compared to the same inference without DeepSpeed.
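A minimal sketch of how a Transformers pipeline can be wrapped with DeepSpeed for inference follows. The model path, device choice, and kernel-injection settings are illustrative assumptions; this requires a GPU, and the exact arguments may differ across DeepSpeed versions:

```python
def build_optimized_pipeline(model_path: str, use_deepspeed: bool = True):
    """Load a fine-tuned text-classification model as a Transformers
    pipeline and optionally wrap its model with DeepSpeed inference
    kernels. `model_path` is an assumed location of the fine-tuned
    model files; device 0 is the worker's single GPU."""
    import torch
    import deepspeed
    from transformers import pipeline

    clf = pipeline("text-classification", model=model_path, device=0)
    if use_deepspeed:
        # Replace supported modules with DeepSpeed's optimized inference
        # kernels; fp16 further reduces latency on the GPU.
        clf.model = deepspeed.init_inference(
            clf.model,
            dtype=torch.float16,
            replace_with_kernel_inject=True,
        )
    return clf
```

The heavy imports are kept inside the function so the sketch can be defined without a GPU present.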

Figure 1 below shows a conceptual overview of the batch inference approach with Pandas UDF.

*Figure 1: conceptual overview of distributed batch inference with Pandas UDF*

The main steps during the batch inference with Pandas UDF are the following:

- Text data for model inference is read from Parquet files into a Spark DataFrame. The DataFrame is partitioned by the total number of GPUs in all worker nodes in the cluster. In our example, each worker has only one GPU, therefore each worker processes one data partition.
- The trained model files are also read into Spark and the model object is instantiated. We use the Transformers pipeline for text classification, and the model is wrapped with DeepSpeed for optimized inference.
- A Pandas UDF is defined to be run on each data partition from the Spark DataFrame containing the data to be classified by the model.
- The Pandas UDF is called on the Spark DataFrame column containing the input text for model inference. As we are using the Pipeline abstraction, text tokenization is performed on-the-fly. The resulting columns with model predictions and probabilities are then added to the Spark DataFrame.
- The DataFrame with model predictions and probabilities is then saved to the Data Lake as Parquet files.
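The steps above can be sketched as follows. The column names and classifier wiring are illustrative assumptions; the core is a helper that adapts any text classifier to a function over a pandas Series, which on Databricks would be registered as a Pandas UDF:

```python
import pandas as pd
from typing import Callable, List

def make_predict_fn(
    classify: Callable[[List[str]], List[dict]]
) -> Callable[[pd.Series], pd.DataFrame]:
    """Adapt a text classifier (e.g. a DeepSpeed-wrapped Transformers
    pipeline) to operate on a pandas Series of input texts, returning
    prediction and probability columns."""
    def predict(texts: pd.Series) -> pd.DataFrame:
        results = classify(texts.tolist())
        return pd.DataFrame({
            "prediction": [r["label"] for r in results],
            "probability": [r["score"] for r in results],
        })
    return predict

# On Databricks, this helper would be registered as a Pandas UDF and
# applied to the text column of the Spark DataFrame, e.g.:
#
#   from pyspark.sql.functions import pandas_udf
#   predict_udf = pandas_udf(make_predict_fn(clf),
#                            "prediction string, probability double")
#   scored = df.withColumn("scored", predict_udf("text"))
```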

Please refer to *model_inference_pudf_deepspeed.ipynb* for implementation details.

**Distributed Parallel Batch Inference with Horovod**

Here we leverage the same distributed deep learning infrastructure provided by *Horovod on Azure Databricks* that was used for fine-tuning the model.

We use a workflow very similar to the one used for model fine-tuning, except that here we only need to execute the model's forward pass to get model predictions and probabilities, without computing losses or updating the model weights.

By optimizing model inference with DeepSpeed in this case, **we also observed a speedup of about 1.35X** compared to the same inference workflow without DeepSpeed.

Figure 2 below shows a conceptual overview of the batch inference approach with Horovod.

*Figure 2: conceptual overview of distributed batch inference with Horovod*

The main steps during the batch inference with Horovod are the following:

- Prepared data for model inference is read from Parquet files into a Spark DataFrame. The DataFrame is partitioned by the total number of GPUs in all worker nodes in the cluster. In our example, each worker has only one GPU, therefore each worker processes one data partition.
- A Horovod MPI cluster is created using all worker nodes. Then the pre-trained model is initialized in all worker nodes and wrapped with DeepSpeed.
- Each worker reads one data partition into a PyTorch DataLoader. The interface between the data in the Spark DataFrame and the PyTorch DataLoader is provided by Petastorm.
- The fine-tuned model, optimized by DeepSpeed, is then used to compute model predictions and probabilities.
- Model predictions and probabilities from each partition are saved to the Data Lake as Parquet files.
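A sketch of the per-worker forward pass is shown below. The batch keys, the `.logits` attribute, the Petastorm wiring, and the `device` argument are illustrative assumptions based on a Transformers sequence-classification model; the logits-to-predictions step is plain PyTorch:

```python
import torch

def postprocess_logits(logits: torch.Tensor):
    """Turn raw model logits from the forward pass into predicted class
    ids and the probabilities of those predicted classes."""
    probs = torch.softmax(logits, dim=-1)
    probabilities, predictions = probs.max(dim=-1)
    return predictions, probabilities

def run_inference(model, dataloader, device="cuda"):
    """Per-worker inference loop: each Horovod rank iterates over its own
    data partition (exposed as a DataLoader via Petastorm) and collects
    predictions and probabilities, without computing losses or gradients."""
    model.eval()
    all_preds, all_probs = [], []
    with torch.no_grad():
        for batch in dataloader:
            logits = model(
                batch["input_ids"].to(device),
                attention_mask=batch["attention_mask"].to(device),
            ).logits
            preds, probs = postprocess_logits(logits)
            all_preds.append(preds.cpu())
            all_probs.append(probs.cpu())
    return torch.cat(all_preds), torch.cat(all_probs)
```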

Please refer to *model_inference_hvd_deepspeed.ipynb* for implementation details.

**In conclusion**

In this post we have shown two approaches to perform batch scoring of a large model from Hugging Face, both in an optimized and distributed way on Azure Databricks, by using well-established open-source technologies such as Spark, Petastorm, PyTorch, Horovod, and DeepSpeed.

With this infrastructure, you can efficiently use Azure Databricks to perform batch inference for very large deep learning models and very large datasets, even when the model would not fit on a single GPU. Even for smaller models, as in our example, we observed a 1.35X improvement in latency when performing model inference with DeepSpeed versus without it.