Building a Document Intelligence Custom Classification Model with the Python SDK

Introduction:

In the world of document processing, one of the most frequent use cases is categorizing and organizing documents into predefined classes. For instance, an organization may have a process that ingests documents that then need to be classified into separate categories such as “invoices”, “contracts”, “reports”, etc. Azure Document Intelligence custom classification models can address these needs and offer a powerful way to bring order to document management.

Document Intelligence is a cloud-based Azure AI service that uses models to automate document processing in applications and workflows. New users and those unfamiliar with Document Intelligence's capabilities may be interested in starting their journey with Document Intelligence Studio, an online tool to visually explore, understand, train, and implement features from the Document Intelligence service without having to write a single line of code. However, more advanced use cases and integrations may necessitate interacting with the Document Intelligence service programmatically. This can be achieved using the Document Intelligence REST API or the SDKs available for .NET, Java, JavaScript, and Python. In this article we’ll focus specifically on building a custom classification model using Python, one of the more popular languages among data scientists and developers.

Those wanting to get a head start creating a custom classification model programmatically may look to the existing sample_build_classifier.py code sample from the azure-sdk-for-python repository. However, for this sample script to work, the classifier training data set must already include an ocr.json file for each document. Optical Character Recognition (OCR) is a critical step in converting scanned documents into editable and searchable data. While Azure AI Document Intelligence Studio automatically generates the OCR files behind the scenes when building a custom classification model using the visual interface, those using the Python SDK may find themselves at a crossroads due to the lack of this built-in functionality.

[Image: studio-create-classifier-project.png]

The Challenge:

The Document Intelligence Python SDK provides a powerful set of tools for extracting information from forms and documents. However, one key limitation is its lack of a method to easily generate ocr.json files from layout analysis results, a feature that is completely integrated and handled automatically in Document Intelligence Studio.

As described in the documentation here, the required ocr.json files can be created by analyzing each training document with Document Intelligence's pre-built layout model and saving the results in the proper API response format. There is a sample Python script, sample_analyze_layout.py, but since the SDK’s layout results object is structured differently than the API’s layout results object, there isn't a clear way to generate the required ocr.json files strictly using the Python SDK. This blog post delves into the custom solution we developed to manually code this process, addressing a common problem discussed in the Microsoft community.

 


Our Custom Solution:

To bridge this gap and create the ocr.json files in the correct format programmatically, we’ve implemented custom code that accesses the API layout results object through a little-known callback mechanism available within the Document Intelligence Python SDK. The resulting custom classifier code emulates the OCR file creation process that Document Intelligence Studio performs: it uses the Python SDK to analyze each document, captures the raw API response containing the extracted text and structural data, and saves it in the JSON structure required for OCR files.

Step-by-Step Guide to Building the Classifier:

The custom classifier code consists of several key components:
1. Preparation of Documents: Start by gathering the documents you wish to analyze. These could be in various formats, such as PDFs, Word documents, or images. You can reference the documentation here for the full list of supported file types. Make sure they are in a separate “training folder” that the code will reference, structured as shown:
[Image: Screenshot (22).png, training folder structure]
2. Document Analysis with Azure AI Document Intelligence Layout Model: Utilize the Azure Document Intelligence Layout model to analyze the documents. This is done by running analyze_layout.py, which will iterate through files in the specified directory (TRAINING_DOCUMENTS) and analyze each document using Azure AI Document Intelligence. It saves the results in a .ocr.json file alongside the original document. This format mirrors the OCR output of the Document Intelligence Studio, maintaining consistency and compatibility.
# Use begin_analyze_document to start the analysis, passing a callback (cls) in order to receive the raw response
with open(document_file_path, "rb") as f:
    poller = document_analysis_client.begin_analyze_document(
        "prebuilt-layout",
        document=f,
        cls=lambda raw_response, _, headers: create_ocr_json(ocr_json_file_path, raw_response),
    )
# ... other code ...
# Callback function to save the API raw response as a .ocr.json file
def create_ocr_json(ocr_json_file_path, raw_response):
    # Write the response body unchanged so the file keeps the API response format
    with open(ocr_json_file_path, "w", encoding="utf-8") as f:
        f.write(raw_response.http_response.body().decode("utf-8"))
        print(f"\tOutput saved to {ocr_json_file_path}")
3. Upload Documents with the labeled data to an Azure Blob container: This is done by running upload_documents.py, which uploads all the training documents along with their .ocr.json files and a .jsonl file that the classifier build uses to reference each of the documents. The .jsonl file allows us to process multiple documents in a batch, improving the efficiency of the training process.
4. Build Classifier: The build_classifier.py script initiates the process of building a custom document classifier using the document types and labeled data from the .jsonl files. It uses the DocumentModelAdministrationClient and BlobServiceClient classes to interface with the Document Intelligence and Azure Blob services and to retrieve and process the training data uploaded in the previous step. Once finished, it prints the results, including the classifier ID, API version, description, and document classes used for training.
5. Classify Documents: classify_document.py uses two requests together to classify a document using a trained document classifier. The first request sends the document for classification, and the second request retrieves the results of the classification process. This approach allows for asynchronous processing of document classification, where the analysis can take some time to complete, especially for large or complex documents.
    1. POST Request: The _post_to_classification_model function performs a POST request to the Azure AI classification model for prediction. It uses the specified Document Intelligence key and model specifications to post the document for classification. The request URL includes the classifier model ID and the API version. The function reads the document as binary data and sends it in the request body along with the necessary headers. If the POST request is successful, it returns the response.
      post_url = (
          ENDPOINT
          + f"/documentintelligence/{API_TYPE}/{MODEL_ID}:analyze?api-version={API_VERSION}"
      )
    2. GET Request: The _get_classification_results function retrieves the classification results from the Azure AI classification model. It takes the response from the POST request as input and extracts the operation-location URL from the response headers. It then makes GET requests to this URL in a loop, waiting for the analysis to complete. It retries the GET request multiple times until the analysis succeeds, fails, or reaches a maximum number of retries. Once the analysis is complete, it returns the classification results as a JSON object.
      get_url = post_response.headers["operation-location"]
      resp = get(
          url=get_url,
          headers={
              "Ocp-Apim-Subscription-Key": FORM_RECOGNIZER_KEY
          },
      )
      ...
      result = _get_classification_results(request)["analyzeResult"]
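
The retry loop described in the GET step can be sketched roughly as follows. This is a minimal sketch rather than the original script's code: it assumes the operation-location response is JSON with a `status` field whose terminal values are `succeeded` and `failed`, and the function name `poll_for_result`, the `fetch` parameter, and the retry settings are all hypothetical. The network call is passed in as a callable so the loop can be exercised with stubbed responses.

```python
import time
from typing import Callable, Dict


def poll_for_result(
    fetch: Callable[[], Dict],
    max_retries: int = 10,
    wait_seconds: float = 2.0,
) -> Dict:
    """Call fetch() until the operation reports a terminal status.

    fetch is any zero-argument callable that returns the parsed JSON body of
    the operation-location response (a hypothetical helper, not an SDK API).
    """
    for attempt in range(max_retries):
        body = fetch()
        status = body.get("status")
        if status == "succeeded":
            return body
        if status == "failed":
            raise RuntimeError(f"Analysis failed: {body.get('error')}")
        # Any other status is treated as "still in progress": wait, then retry.
        if attempt < max_retries - 1:
            time.sleep(wait_seconds)
    raise TimeoutError(f"Analysis did not finish after {max_retries} attempts")


# Stubbed usage: the third poll reports success, so no network call is needed.
_responses = iter([
    {"status": "notStarted"},
    {"status": "running"},
    {"status": "succeeded", "analyzeResult": {"documents": []}},
])
demo_result = poll_for_result(lambda: next(_responses), wait_seconds=0)
```

In a real run, `fetch` could wrap the GET request shown above and return its parsed JSON body; keeping the HTTP call outside the loop also makes the retry limit and wait time easy to tune separately.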

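The training folder and .jsonl file from steps 1 to 3 can be illustrated with a small local sketch. It rests on several assumptions that the original scripts may not share: one subfolder per document class, each OCR result saved next to its document as `<document name>.ocr.json`, and a file list in which every line is a JSON object with a single `file` key holding the path relative to the training folder. The function name `build_file_lists` and the sample folder names are hypothetical.

```python
import json
import tempfile
from pathlib import Path
from typing import Dict, List


def build_file_lists(training_dir: Path) -> Dict[str, List[str]]:
    """Return, for each class subfolder, the JSON Lines entries for its documents.

    Assumes one subfolder per class and an OCR file named
    "<document name>.ocr.json" beside each document (both are assumptions).
    Documents that have no OCR file yet are left out of the list.
    """
    file_lists: Dict[str, List[str]] = {}
    for class_dir in sorted(p for p in training_dir.iterdir() if p.is_dir()):
        lines: List[str] = []
        for doc in sorted(class_dir.iterdir()):
            if not doc.is_file() or doc.name.endswith(".ocr.json"):
                continue  # skip subfolders and the OCR outputs themselves
            ocr_path = doc.with_name(doc.name + ".ocr.json")
            if ocr_path.exists():
                relative = doc.relative_to(training_dir).as_posix()
                lines.append(json.dumps({"file": relative}))
        file_lists[class_dir.name] = lines
    return file_lists


# Demo on a throwaway folder: only a.pdf has an OCR file, so only it is listed.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "invoices").mkdir()
    (root / "invoices" / "a.pdf").write_bytes(b"%PDF")
    (root / "invoices" / "a.pdf.ocr.json").write_text("{}", encoding="utf-8")
    (root / "invoices" / "b.pdf").write_bytes(b"%PDF")  # not analyzed yet
    demo_lists = build_file_lists(root)
```

Writing each class's lines to its own .jsonl file, one entry per line, would give the classifier build a list of documents to reference; skipping documents without an OCR file reflects the requirement that every training document has one.
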
Conclusion:

While the Python SDK does not provide an out-of-the-box solution for OCR file generation, our custom classifier code offers a viable workaround. By understanding the limitations of the SDK, we were able to create a tool that not only solves the immediate problem but also enhances our overall document processing capabilities.

 

This article was originally published by Microsoft's Azure AI Services Blog. You can find the original article here.