In this two-part blog series, we explore how to perform optimized training and inference of large language models from Hugging Face, at scale, on Azure Databricks.
In this first part we focus on optimized model training, leveraging the distributed parallel infrastructure available on Azure Databricks to train deep learning-based models, and using DeepSpeed to optimize the training process.
Part two focuses on model inference, also leveraging the distributed parallel infrastructure available on Azure Databricks and using DeepSpeed to optimize the inference process.
The code for this blog series is available at this GitHub repository, as a series of Databricks notebooks.
Databricks is a leading and very popular platform for data analytics and AI development, and an important component of Microsoft's Azure Data & AI Services. Hugging Face is also very popular among AI developers and is best known for its Transformers library for natural language processing applications.
Despite both technologies being so popular, we don't see much published material showing how they can be efficiently used together.
Here we try to fill this gap and share what we have learned about working efficiently, at scale, with large language models from Hugging Face on Azure Databricks, using well-established open-source technologies such as Spark, Petastorm, PyTorch, Horovod, and DeepSpeed.
The Use Case
As a typical use case, we consider the task of fine-tuning a pre-trained Transformer model for text classification and then using the fine-tuned model to perform batch inference on a large number of documents. We want to implement that task on Azure Databricks in an optimized manner, leveraging its distributed platform and parallel computation.
The dataset we use is the Rotten Tomatoes movie review dataset. It is a simple dataset with only two columns: text, which is the movie review, and label, which is either 1 (positive review) or 0 (negative review).
We take the pre-trained DeBERTa v3 model and fine-tune it for text classification using that dataset. In this way, we leverage the knowledge the pre-trained model already has about natural language and augment it with the specific knowledge of how to classify movie reviews as positive or negative. Once we have the fine-tuned model, we can use it to classify new reviews.
Before we start fine tuning the model, we need to extract the numeric features from the text, which are used as inputs to the model. For Hugging Face models this is facilitated by the Transformers library using its Tokenizer class.
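As a rough illustration of this step, the sketch below turns a batch of reviews into the fixed-length integer features the model consumes. The checkpoint name `microsoft/deberta-v3-base`, the `max_length` of 128, and the helper names `load_tokenizer` and `tokenize_reviews` are assumptions made for illustration; the notebook may use different settings.

```python
from typing import Callable, Dict, List

# Assumed checkpoint: the post says "DeBERTa v3" without naming a model size.
MODEL_NAME = "microsoft/deberta-v3-base"


def load_tokenizer(model_name: str = MODEL_NAME):
    """Load the tokenizer that matches the pre-trained checkpoint."""
    # Imported lazily so tokenize_reviews can be used without transformers installed.
    from transformers import AutoTokenizer

    return AutoTokenizer.from_pretrained(model_name)


def tokenize_reviews(tokenizer: Callable, texts: List[str],
                     max_length: int = 128) -> Dict[str, list]:
    """Turn raw review text into fixed-length numeric model inputs."""
    encoded = tokenizer(
        texts,
        padding="max_length",  # pad short reviews up to max_length
        truncation=True,       # cut long reviews down to max_length
        max_length=max_length,
    )
    # Keep only the two columns the classifier needs.
    return {
        "input_ids": list(encoded["input_ids"]),
        "attention_mask": list(encoded["attention_mask"]),
    }
```

A call such as `tokenize_reviews(load_tokenizer(), ["A moving, well-acted film."])` would return one row of token ids and one attention mask, each padded to `max_length`.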
Please refer to data_preparation.ipynb for implementation details.
Model Fine Tuning
We optimize the model training with DeepSpeed. DeepSpeed provides several benefits for model training: faster training, quicker and better convergence, and optimized GPU memory utilization, which allows us to work with larger models. In our sample code, the model converged better in half as many training epochs, and total training time dropped by a factor of about 4.5 compared to training without DeepSpeed (20 epochs and 1,147 seconds without DeepSpeed versus 10 epochs and 255 seconds with DeepSpeed).
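The overall figure follows directly from the two wall-clock times. The short check below reproduces the arithmetic using only the numbers quoted above, and also separates the per-epoch gain from the gain due to needing fewer epochs.

```python
# Figures quoted in the text above.
baseline_epochs, baseline_seconds = 20, 1147  # without DeepSpeed
ds_epochs, ds_seconds = 10, 255               # with DeepSpeed

total_speedup = baseline_seconds / ds_seconds
epoch_ratio = baseline_epochs / ds_epochs
# Per-epoch speedup isolates raw throughput from the faster convergence.
per_epoch_speedup = (baseline_seconds / baseline_epochs) / (ds_seconds / ds_epochs)

print(f"total speedup:     {total_speedup:.2f}x")
print(f"epochs needed:     {epoch_ratio:.0f}x fewer")
print(f"per-epoch speedup: {per_epoch_speedup:.2f}x")
```

In other words, roughly half of the gain comes from each epoch running about 2.25 times faster, and the other half from needing 10 epochs instead of 20.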
In our example, the model fits on a single GPU, so we are essentially running optimized distributed data-parallel training. If needed, we could also partition the model parameters with DeepSpeed in order to work with larger models that don't fit on a single GPU.
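To make that distinction concrete, here is a minimal sketch of a DeepSpeed configuration written as a Python dictionary. The keys are standard DeepSpeed configuration options, but the values (batch size, learning rate, fp16, default stage) and the helper name `build_ds_config` are illustrative assumptions, not the settings used in the notebook. ZeRO stages 1 and 2 partition optimizer state and gradients while every GPU keeps a full copy of the parameters; stage 3 partitions the parameters as well.

```python
def build_ds_config(zero_stage: int = 2, micro_batch_size: int = 16,
                    lr: float = 2e-5) -> dict:
    """Build a DeepSpeed config dict (illustrative values, not the notebook's)."""
    if zero_stage not in (0, 1, 2, 3):
        raise ValueError("ZeRO stage must be 0, 1, 2 or 3")
    return {
        "train_micro_batch_size_per_gpu": micro_batch_size,
        "gradient_accumulation_steps": 1,
        "optimizer": {"type": "AdamW", "params": {"lr": lr}},
        "fp16": {"enabled": True},  # mixed precision reduces GPU memory use
        "zero_optimization": {"stage": zero_stage},
    }


# Model fits on one GPU: data-parallel training with partitioned optimizer state.
data_parallel_config = build_ds_config(zero_stage=2)

# Model too large for one GPU: also partition the parameters across GPUs.
partitioned_config = build_ds_config(zero_stage=3)
```

Under these assumptions, moving from the first setup to the second is a one-line configuration change rather than a change to the training loop.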
Figure 1 below shows a conceptual overview of the fine-tuning procedure.
Figure 1: conceptual overview of distributed training for the model fine tuning
The main steps during the fine tuning are the following:
- Prepared data for model training is read from Parquet files into a Spark DataFrame. The DataFrame is repartitioned into as many partitions as there are GPUs across all worker nodes in the cluster. In our example, each worker has only one GPU, so each worker processes one data partition.
- A Horovod MPI cluster is created using all worker nodes. Then the pre-trained model is initialized in all worker nodes and wrapped with DeepSpeed. DeepSpeed is aware of the distributed infrastructure provided by Horovod and provides the APIs for PyTorch optimized distributed training.
- Each worker reads one training data partition into a PyTorch DataLoader. The interface between the data in the Spark DataFrame and the PyTorch DataLoader is provided by Petastorm.
- The Hugging Face pre-trained model is fine tuned in an optimized distributed manner, using DeepSpeed's API.
- The fine-tuned model files are saved to the Data Lake, to be used later for model inference.
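The steps above can be sketched roughly as follows. This is not the notebook's code: all function names, arguments, and paths are hypothetical, the use of `HorovodRunner` from the Databricks ML runtime is an assumption about how the Horovod cluster is started, and `ds_config` is assumed to define an optimizer and `train_micro_batch_size_per_gpu`. The sketch also assumes that Petastorm yields fixed-length integer tensors for the `input_ids`, `attention_mask`, and `label` columns; array columns may need a Petastorm `TransformSpec` in practice.

```python
def steps_per_epoch(num_rows: int, num_gpus: int, batch_size: int) -> int:
    """Batches each worker sees per epoch when rows are split evenly across GPUs."""
    return max(1, num_rows // (num_gpus * batch_size))


def make_converter(spark, parquet_path: str, num_gpus: int, cache_url: str):
    """Read the prepared Parquet data and hand it to Petastorm."""
    from petastorm.spark import SparkDatasetConverter, make_spark_converter

    # Petastorm materializes the DataFrame under this cache location.
    spark.conf.set(SparkDatasetConverter.PARENT_CACHE_DIR_URL_CONF, cache_url)
    df = spark.read.parquet(parquet_path).repartition(num_gpus)  # one partition per GPU
    return make_spark_converter(df)


def train_fn(converter, ds_config, model_name, output_dir, total_steps):
    """Runs once in every worker process started by Horovod."""
    import deepspeed
    import horovod.torch as hvd
    import torch
    from transformers import AutoModelForSequenceClassification

    hvd.init()
    device = torch.device("cuda", hvd.local_rank())
    torch.cuda.set_device(device)
    # Assumption: DeepSpeed discovers rank and world size from the MPI
    # environment that Horovod launched (this path needs mpi4py on the workers).
    deepspeed.init_distributed(dist_backend="nccl", auto_mpi_discovery=True)

    model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)
    engine, _, _, _ = deepspeed.initialize(
        model=model, model_parameters=model.parameters(), config=ds_config
    )

    # Each worker reads only its own shard of the data.
    with converter.make_torch_dataloader(
        batch_size=ds_config["train_micro_batch_size_per_gpu"],
        cur_shard=hvd.rank(),
        shard_count=hvd.size(),
    ) as loader:
        batches = iter(loader)  # loops over the data indefinitely by default
        for _ in range(total_steps):
            batch = next(batches)
            outputs = engine(
                input_ids=batch["input_ids"].long().to(device),
                attention_mask=batch["attention_mask"].long().to(device),
                labels=batch["label"].long().to(device),
            )
            engine.backward(outputs.loss)  # DeepSpeed handles scaling and reduction
            engine.step()

    # With ZeRO stages 0-2 every rank holds full parameters, so rank 0 can save them.
    if hvd.rank() == 0:
        engine.module.save_pretrained(output_dir)


def run_fine_tuning(spark, parquet_path, cache_url, ds_config, model_name,
                    output_dir, num_gpus, epochs):
    """Driver-side entry point that ties the steps together."""
    from sparkdl import HorovodRunner  # assumed available (Databricks ML runtime)

    converter = make_converter(spark, parquet_path, num_gpus, cache_url)
    batch_size = ds_config["train_micro_batch_size_per_gpu"]
    total_steps = epochs * steps_per_epoch(len(converter), num_gpus, batch_size)
    HorovodRunner(np=num_gpus).run(
        train_fn, converter=converter, ds_config=ds_config, model_name=model_name,
        output_dir=output_dir, total_steps=total_steps,
    )
    converter.delete()  # remove the cached Petastorm copy
```

Library imports are kept inside the functions so that the worker-side code only needs the GPU libraries on the workers. With ZeRO stage 3 the parameters are partitioned across GPUs, so the final save would need to gather them first.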
Please refer to model_training_hvd_deepspeed.ipynb for implementation details.
In this post we have shown how a large pre-trained model from Hugging Face can be fine-tuned in an optimized and distributed way on Azure Databricks, using well-established open-source technologies such as Spark, Petastorm, PyTorch, Horovod, and DeepSpeed.
With that infrastructure, you can efficiently train large deep learning models on Azure Databricks and, by using DeepSpeed's parameter partitioning, work with models that would not fit on a single GPU. In our example, the model converged faster and training was about 4.5X faster overall with DeepSpeed than without it.