A Heuristic Method of Merging Cross-Page Tables based on Document Intelligence Layout Model

Introduction

Tables contain valuable structured information that businesses use to manage, share, and analyze data, make informed decisions, and increase efficiency. Cross-page tables are common, especially in lengthy or dense documents. The Azure AI Document Intelligence Layout model extracts tables page by page, so parsing a cross-page table effectively may require reconstituting the extracted tables into a single table. This is a particular challenge when automating document processing with large language models (LLMs), where tasks demand high relevance and accuracy. This blog provides a heuristic approach to identifying and merging cross-page tables. The process accounts for several variations, including vertical or horizontal cross-page tables and tables with repeating headers or continuation cells. The output can then be fed into an LLM for improved context, resulting in more relevant and accurate responses. In future updates, the Layout model will support cross-page tables.

cross-page-tables.png

Step-by-Step Guide

The sample notebook consists of several key components:

1. Preparation of your document

Start by preparing the document with cross-page tables that you want to analyze. It can be in various formats, such as PDF, Word, HTML, or an image.

2. Return basic information of tables

The get_table_page_numbers function returns a list of page numbers where tables appear in your given document. The get_table_span_offsets function calculates the minimum and maximum offsets of a table's spans.
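The notebook's implementations of these two helpers are not reproduced here. As a minimal sketch, assuming each table object exposes bounding_regions (each with a page_number) and spans (each with an offset and a length), they might look like the following. The single-table signatures and the SimpleNamespace stand-in data are assumptions made for illustration, not the notebook's actual code.

```python
from types import SimpleNamespace


def get_table_page_numbers(table):
    """Return the page numbers on which a table appears, in order, without duplicates."""
    pages = []
    for region in table.bounding_regions:
        if region.page_number not in pages:
            pages.append(region.page_number)
    return pages


def get_table_span_offsets(table):
    """Return (min_offset, max_offset) covered by the table's spans, or (-1, -1) if it has none."""
    if not table.spans:
        return -1, -1
    min_offset = min(span.offset for span in table.spans)
    max_offset = max(span.offset + span.length for span in table.spans)
    return min_offset, max_offset


# Stand-in object that mimics only the attributes used above (not a real service response).
table = SimpleNamespace(
    bounding_regions=[SimpleNamespace(page_number=2)],
    spans=[SimpleNamespace(offset=120, length=80), SimpleNamespace(offset=200, length=40)],
)
print(get_table_page_numbers(table))  # [2]
print(get_table_span_offsets(table))  # (120, 240)
```

The offsets index into the full text content of the analysis result, which is what lets a later step ask what lies between the end of one table and the start of the next.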

3. Find the merge table candidates

The get_merge_table_candidates_and_table_integral_span function finds the merge table candidates and calculates the integral span of each table, based on the list of tables obtained in step 2.

The check_paragraph_presence function checks if there is a paragraph within a specified range that is not a page header, page footer, or page number. If this is the case, the table is not a merge table candidate.

def check_paragraph_presence(paragraphs, start, end):
    """
    Checks if there is a paragraph within the specified range that is not a page header, page footer, or page number. If this were the case, the table would not be a merge table candidate.

    Args:
        paragraphs (list): List of paragraphs to check.
        start (int): Start offset of the range.
        end (int): End offset of the range.

    Returns:
        bool: True if a paragraph is found within the range that meets the conditions, False otherwise.
    """
    for paragraph in paragraphs:
        for span in paragraph.spans:
            if span.offset > start and span.offset < end:
                # The logical role of a paragraph identifies whether it is a page header, page footer, page number, title, section heading, etc. Learn more: https://learn.microsoft.com/en-us/azure/ai-services/document-intelligence/concept-layout?view=doc-intel-4.0.0#document-layout-analysis
                if getattr(paragraph, 'role', None) not in ["pageHeader", "pageFooter", "pageNumber"]:
                    return True
    return False

4. Determine whether it is a cross-page table

Vertical table: If two or more tables appear on successive pages, with only page headers, page footers, or page numbers between them, and the tables have the same number of columns, they can be treated as one vertical table. This applies both to tables whose header repeats on each page and to tables with a header only on the first page.
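The vertical rule can be written as a small predicate. This is a minimal sketch, not the notebook's code: the function name is_vertical_continuation and its parameters are hypothetical, and has_body_text_between is assumed to be the result of a check such as check_paragraph_presence over the text between the two tables.

```python
def is_vertical_continuation(pages_a, pages_b, columns_a, columns_b, has_body_text_between):
    """Heuristic: True when table B looks like the vertical continuation of table A.

    pages_a / pages_b: page numbers each table appears on (hypothetical inputs).
    columns_a / columns_b: column counts of the two tables.
    has_body_text_between: True if any paragraph other than a page header,
        page footer, or page number lies between the two tables.
    """
    on_successive_pages = pages_b[0] - pages_a[-1] == 1
    same_column_count = columns_a == columns_b
    return on_successive_pages and same_column_count and not has_body_text_between


print(is_vertical_continuation([1], [2], 4, 4, False))  # True
print(is_vertical_continuation([1], [2], 4, 3, False))  # False: column counts differ
print(is_vertical_continuation([1], [3], 4, 4, False))  # False: pages are not successive
```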

Horizontal table: If two or more tables appear on consecutive pages, with the right side of the first table close to the right edge of its page and the left side of the following table close to the left edge of the next page, and the tables have the same number of rows, they can be treated as one horizontal table. The check_tables_are_horizontal_distribution function identifies whether the tables on two consecutive pages are horizontally distributed.
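One way to express the horizontal rule is to compare each table's bounding polygon with its page width. The sketch below is an assumption-laden illustration rather than the notebook's check_tables_are_horizontal_distribution: the function name, the flat [x1, y1, x2, y2, ...] polygon layout, and the margin_ratio threshold of 10% of the page width are all hypothetical choices that you would tune for your documents.

```python
def is_horizontally_distributed(polygon_a, page_width_a, polygon_b, page_width_b,
                                rows_a, rows_b, margin_ratio=0.1):
    """Heuristic: True when table A hugs the right edge of its page, table B hugs
    the left edge of the next page, and both tables have the same number of rows."""
    xs_a = polygon_a[0::2]  # x coordinates of table A's polygon
    xs_b = polygon_b[0::2]  # x coordinates of table B's polygon
    near_right_edge = page_width_a - max(xs_a) <= margin_ratio * page_width_a
    near_left_edge = min(xs_b) <= margin_ratio * page_width_b
    return near_right_edge and near_left_edge and rows_a == rows_b


# Illustrative coordinates in inches on an 8.5-inch-wide page.
left_part = [1.0, 1.0, 8.2, 1.0, 8.2, 5.0, 1.0, 5.0]
right_part = [0.4, 1.0, 6.0, 1.0, 6.0, 5.0, 0.4, 5.0]
print(is_horizontally_distributed(left_part, 8.5, right_part, 8.5, 5, 5))   # True
print(is_horizontally_distributed(right_part, 8.5, right_part, 8.5, 5, 5))  # False: A ends far from the right edge
```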

5. Merge cross-page tables

Vertical table: If a table is split vertically across two pages, the analysis result contains two separate tables in markdown format. To merge them into one table, the markdown table-header format string must be removed using the remove_header_from_markdown_table function. The merge_vertical_tables function then merges the two consecutive vertical markdown tables into one. When a cross-page table has headers on both pages, all header texts are output, but only the first is used as the header of the merged table.

BORDER_SYMBOL = "|"  # Markdown cell border (assumed value of the sample's module-level constant)

def merge_vertical_tables(md_table_1, md_table_2):
    """
    Merge two consecutive vertical markdown tables into one markdown table.

    Args:
        md_table_1: markdown table 1
        md_table_2: markdown table 2
    
    Returns:
        string: merged markdown table
    """
    table2_without_header = remove_header_from_markdown_table(md_table_2)
    rows1 = md_table_1.strip().splitlines()
    rows2 = table2_without_header.strip().splitlines()

    num_columns1 = len(rows1[0].split(BORDER_SYMBOL)) - 2
    num_columns2 = len(rows2[0].split(BORDER_SYMBOL)) - 2

    if num_columns1 != num_columns2:
        raise ValueError("Different count of columns")

    merged_rows = rows1 + rows2
    merged_table = '\n'.join(merged_rows)

    return merged_table

vertical_layout.png

Figure 1. Illustration of merging Vertical Layout
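The remove_header_from_markdown_table helper is not listed in this post. As a rough stand-in, the sketch below assumes that the "table-header format string" is the markdown separator row (for example "| - | - |") and drops only that row, which is consistent with the repeated header text remaining in the output. The name remove_separator_row and the regular expression are hypothetical.

```python
import re

# A separator row holds only pipes, dashes, optional alignment colons, and whitespace.
SEPARATOR_ROW = re.compile(r"^\s*\|(\s*:?-+:?\s*\|)+\s*$")


def remove_separator_row(md_table):
    """Return the markdown table without its header separator row(s)."""
    kept = [line for line in md_table.strip().splitlines() if not SEPARATOR_ROW.match(line)]
    return "\n".join(kept)


page_1 = "| Item | Qty |\n| - | - |\n| Apple | 3 |"
page_2 = "| Item | Qty |\n| - | - |\n| Pear | 5 |"

# Appending the second table without its separator yields one well-formed markdown table.
merged = page_1 + "\n" + remove_separator_row(page_2)
print(merged)
# | Item | Qty |
# | - | - |
# | Apple | 3 |
# | Item | Qty |
# | Pear | 5 |
```

Note that the second page's header text survives as an ordinary row, while the first page's header remains the header of the merged table.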

Horizontal table: If a table is split horizontally across two pages, the analysis result contains two separate tables in markdown format. The merge_horizontal_tables function merges the two consecutive horizontal markdown tables into one markdown table.

def merge_horizontal_tables(md_table_1, md_table_2):
    """
    Merge two consecutive horizontal markdown tables into one markdown table.

    Args:
        md_table_1: markdown table 1
        md_table_2: markdown table 2
    
    Returns:
        string: merged markdown table
    """
    rows1 = md_table_1.strip().splitlines()
    rows2 = md_table_2.strip().splitlines()

    merged_rows = []
    for row1, row2 in zip(rows1, rows2):
        merged_row = (
            (row1[:-1] if row1.endswith(BORDER_SYMBOL) else row1)
            + BORDER_SYMBOL
            + (row2[1:] if row2.startswith(BORDER_SYMBOL) else row2)
        )
        merged_rows.append(merged_row)

    merged_table = "\n".join(merged_rows)
    return merged_table

horizontal_layout.png

Figure 2. Illustration of merging Horizontal Layout
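Because zip stops at the shorter input, a row-by-row merge silently drops trailing rows when the two tables differ in length. The variant below is a hedged sketch, not the notebook's function: merge_horizontal_tables_checked is a hypothetical name, BORDER_SYMBOL is assumed to be "|", and the explicit row-count check is an added safeguard.

```python
BORDER_SYMBOL = "|"  # assumed markdown cell border


def merge_horizontal_tables_checked(md_table_1, md_table_2):
    """Join two markdown tables side by side, refusing tables with different row counts."""
    rows1 = md_table_1.strip().splitlines()
    rows2 = md_table_2.strip().splitlines()
    if len(rows1) != len(rows2):
        raise ValueError("Different count of rows")

    merged_rows = []
    for row1, row2 in zip(rows1, rows2):
        left = row1[:-1] if row1.endswith(BORDER_SYMBOL) else row1
        right = row2[1:] if row2.startswith(BORDER_SYMBOL) else row2
        merged_rows.append(left + BORDER_SYMBOL + right)
    return "\n".join(merged_rows)


left_half = "| Item | Q1 |\n| - | - |\n| Apple | 3 |"
right_half = "| Q2 | Q3 |\n| - | - |\n| 4 | 6 |"
print(merge_horizontal_tables_checked(left_half, right_half))
# | Item | Q1 | Q2 | Q3 |
# | - | - | - | - |
# | Apple | 3 | 4 | 6 |
```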

6. Merge multiple consecutive pages

The identify_and_merge_cross_page_tables function is the main function of the script. It takes an input file path as an argument and uses the Azure AI Document Intelligence service (with the Layout model) to analyze the document, then identifies and merges tables that span multiple pages. This solution can handle tables split over three or more pages. The function comprises four main steps:

Step 1: Create an instance of the DocumentIntelligenceClient, specify the file path, and then use the begin_analyze_document method to analyze the document.

Step 2: Get the merge table candidates and the list of table integral spans.

Step 3: Decide which candidates should be merged and merge them.

Step 4: Generate optimized content based on the merged table list.
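Handling a table split over three or more pages comes down to folding each newly found continuation into the table merged so far. The following is a minimal sketch of that idea under stated assumptions: merge_consecutive, the dictionary-based table records, and the can_merge and merge callables are hypothetical stand-ins for the notebook's candidate checks and markdown merge functions.

```python
def merge_consecutive(tables, can_merge, merge):
    """Fold each run of mergeable neighbouring tables into a single table."""
    merged = []
    for table in tables:
        if merged and can_merge(merged[-1], table):
            # Extend the table built so far, so a chain of any length collapses into one.
            merged[-1] = merge(merged[-1], table)
        else:
            merged.append(table)
    return merged


# Stand-in records: page numbers, column count, and body rows of each extracted table.
tables = [
    {"pages": [1], "columns": 2, "rows": ["a"]},
    {"pages": [2], "columns": 2, "rows": ["b"]},
    {"pages": [3], "columns": 2, "rows": ["c"]},
    {"pages": [5], "columns": 3, "rows": ["d"]},
]


def can_merge(a, b):
    return b["pages"][0] - a["pages"][-1] == 1 and a["columns"] == b["columns"]


def merge(a, b):
    return {"pages": a["pages"] + b["pages"], "columns": a["columns"], "rows": a["rows"] + b["rows"]}


result = merge_consecutive(tables, can_merge, merge)
print([t["pages"] for t in result])  # [[1, 2, 3], [5]]
```

Comparing each new table against the already merged one, rather than against the previous raw table, is what lets the third page attach to the combined result of pages one and two.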

Advantages

This solution has the following benefits, allowing users to effortlessly handle cross-page tables in a wide variety of scenarios:

  • Preserves table semantics: The solution uses each table's page number and span offsets to decide what to merge, so the merged tables keep the original semantics and structure of the data.
  • Enhances LLM table comprehension: Markdown provides a simple way to create and format tables, making it easier for LLMs to read and understand tabular data. This solution merges the tables in the markdown output, further enhancing an LLM's ability to handle tabular data.
  • Streamlines data processing: Whether you need to process many documents or have demanding data-processing requirements, this solution simplifies analyzing and working with the data.

Conclusion

This solution provides a flexible and customizable way to identify and merge cross-page tables. Users can tailor the rules to fit their specific scenario and requirements, making it a versatile tool for handling documents with multi-page tables.

Get started

  1. Merging tables: This sample notebook demonstrates how to use the output of the Layout model and some business rules to identify cross-page tables. You can also run the corresponding Python file. Once identified, the tables are merged in a way that keeps the semantics of the table.
  2. Applying to RAG: You can further integrate the merged markdown table with LangChain's MarkdownHeaderTextSplitter.

    Step 1: Store the markdown output of the merged table as a file in .md format.

    Step 2: Replace the 'Load a document and split it into semantic chunks' section with the following code snippet, and set file_path to the path of the file you stored in Step 1.

# Requires the langchain-text-splitters package (older LangChain releases expose the same class at langchain.text_splitter).
from langchain_text_splitters import MarkdownHeaderTextSplitter

# Load the merged markdown output from the file stored in Step 1.
file_path = ""
with open(file_path, 'r', encoding='utf-8') as file:
    markdown_text = file.read()

# Split the document into chunks based on markdown headers.
headers_to_split_on = [
    ("#", "Header 1"),
    ("##", "Header 2"),
    ("###", "Header 3"),
]
text_splitter = MarkdownHeaderTextSplitter(headers_to_split_on=headers_to_split_on)
 
docs_string = markdown_text
splits = text_splitter.split_text(docs_string)
 
print("Length of splits: " + str(len(splits)))

This enriched output, used in conjunction with the RAG sample notebook, allows for a more detailed understanding of the data.

This article was originally published by Microsoft's Azure AI Services Blog. You can find the original article here.