How LayoutLM stands out as an effective pre-training technique in various NLP tasks
Artificial Intelligence grew out of the need for technology that could function intelligently and bridge the gap between humans and machines. As the scope of AI widened, Natural Language Processing (NLP), a branch of AI, emerged to understand and interpret human language. Its main goal is to build sub-systems that can make sense of natural-language text, convert it into machine-readable form, and perform prescribed tasks.
In recent years, various pre-training techniques have proven successful across a variety of NLP tasks. Yet despite their widespread use in NLP applications, these pre-trained models focus solely on text-level manipulation while largely neglecting the layout and style information that is vital for document image understanding. We at Oodles, an AI Development Company, discuss practical NLP applications in AI that significantly enhance development output.
What is the LayoutLM model?
The LayoutLM model is a simple yet effective method for pre-training text and layout jointly for document image understanding and practical information extraction tasks. Document AI is a relatively new research topic that refers to techniques for automatically reading, understanding, and analyzing business documents.
Business documents are critical to a company's efficiency and productivity. Their exact format varies from document to document, so the information, usually expressed in natural language, can be organized in many ways, from simple text and multi-column layouts to numerous tables, forms, and figures.
The need for the LayoutLM model
Nowadays, some companies still extract data from business documents through manual effort that is inefficient, time-consuming, and expensive. Such processes also require extensive manual customization or configuration.
Each document's rules and workflows often need to be hard-coded and updated whenever the specific format changes or new layouts appear. Document AI models are designed to address these problems: their algorithms automatically extract, classify, and structure information from business documents, accelerating automated document processing workflows. The LayoutLM model is one of them.
With LayoutLM, for the first time, textual and layout information from scanned document images is pre-trained in a single framework. The model also leverages image features to achieve new and better state-of-the-art results.
The LayoutLM Model
This model aims to use the critical, visually rich information in document layouts and align it with the input text from the document. Two types of features substantially improve the language representation in a visually rich document:
1) Document-level visual features: the entire image indicates the document layout, a crucial component in document image classification.
2) Word-level visual features: styles such as bold and underline are significant hints for sequence labeling tasks.
The authors therefore believed that combining image features with text representations can bring better and richer semantic representations to the processed documents.
Figure: The LayoutLM model architecture. Source: https://www.programmersought.com/article/26975262864/
Model Architecture
To fine-tune existing pre-trained models and adapt them to document image understanding tasks, the authors used the BERT (Bidirectional Encoder Representations from Transformers) architecture as the backbone and added two new input embeddings:
1) A 2-D position embedding
2) An image embedding.
1) 2-D Position Embedding: The 2-D position embedding aims to model a token's relative spatial position in a document, unlike the standard position embedding that models a word's position in the sequence. The authors treat a document page as a coordinate system with its origin at the top left to represent the spatial position of elements in scanned document images. In this setting, a bounding box is precisely defined by (x0, y0, x1, y1), where (x0, y0) is the position of the upper-left corner of the bounding box and (x1, y1) is the position of the lower-right corner. Four position embedding layers are added with two shared embedding tables, where layers representing the same dimension share the same table. Hence, the embeddings of x0 and x1 are looked up in embedding table X, and those of y0 and y1 in table Y.
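As a rough illustration, here is a minimal PyTorch sketch of such a 2-D position embedding. The class name, coordinate range, and hidden size are assumptions made for illustration, not the paper's exact implementation:

```python
import torch
import torch.nn as nn

class TwoDPositionEmbedding(nn.Module):
    """Sketch of a LayoutLM-style 2-D position embedding.

    Assumes coordinates are normalized integers in [0, 1000],
    as in the LayoutLM paper; names and sizes are illustrative.
    """

    def __init__(self, hidden_size: int = 768, max_coord: int = 1024):
        super().__init__()
        # Two shared tables: X serves x0 and x1, Y serves y0 and y1.
        self.table_x = nn.Embedding(max_coord, hidden_size)
        self.table_y = nn.Embedding(max_coord, hidden_size)

    def forward(self, bbox: torch.Tensor) -> torch.Tensor:
        # bbox: (batch, seq_len, 4) holding (x0, y0, x1, y1) per token.
        x0 = self.table_x(bbox[..., 0])
        y0 = self.table_y(bbox[..., 1])
        x1 = self.table_x(bbox[..., 2])
        y1 = self.table_y(bbox[..., 3])
        # Sum the four lookups into a single 2-D position embedding.
        return x0 + y0 + x1 + y1
```

In the full model, this embedding is added to the usual token and 1-D position embeddings before the Transformer layers.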
2) Image Embedding: To utilize image features and align them with the text, an extra image embedding layer represents image features in the language representation. Using the bounding box of each word from the OCR results, the image is split into pieces that have a one-to-one correspondence with the words. These pieces are fed to a Faster R-CNN model to generate image region features, which serve as the token image embeddings. The same Faster R-CNN model also takes the complete scanned document image as the Region of Interest (ROI) to produce an embedding for the [CLS] token, serving downstream tasks that require a representation of the whole document.
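To make this concrete, below is a hedged sketch of the token image embeddings, substituting a plain ResNet trunk plus torchvision's RoI align for the full Faster R-CNN pipeline; all names and sizes are illustrative:

```python
import torch
import torch.nn as nn
import torchvision
from torchvision.ops import roi_align

# Illustrative stand-in for the Faster R-CNN feature extractor:
# a ResNet-18 trunk producing a (B, 512, H/32, W/32) feature map.
resnet = torchvision.models.resnet18(weights=None)
backbone = nn.Sequential(*list(resnet.children())[:-2])
project = nn.Linear(512, 768)  # map visual features to the hidden size

def token_image_embeddings(page: torch.Tensor, word_boxes: torch.Tensor) -> torch.Tensor:
    """page: (1, 3, H, W) scanned page; word_boxes: (N, 4) pixel boxes from OCR."""
    feature_map = backbone(page)
    # roi_align expects (N, 5) rows of (batch_index, x0, y0, x1, y1).
    rois = torch.cat([torch.zeros(len(word_boxes), 1), word_boxes.float()], dim=1)
    # Pool one feature vector per word region; spatial_scale matches the /32 stride.
    pooled = roi_align(feature_map, rois, output_size=(1, 1), spatial_scale=1 / 32)
    return project(pooled.flatten(1))  # (N, 768) token image embeddings
```

The [CLS] embedding can be produced the same way, by pooling over a single box that covers the entire page.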
Thus, text and layout information are jointly learned in a single framework for document-level pre-training.
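Conceptually, that single framework sums the modalities per token. A minimal, self-contained sketch with made-up sizes (token image embeddings, added the same way, are omitted for brevity):

```python
import torch
import torch.nn as nn

hidden, vocab, max_pos, max_coord, seq = 768, 30522, 512, 1024, 16
word_emb = nn.Embedding(vocab, hidden)      # WordPiece text embedding
pos_emb = nn.Embedding(max_pos, hidden)     # standard 1-D sequence position
x_emb = nn.Embedding(max_coord, hidden)     # shared by x0 and x1
y_emb = nn.Embedding(max_coord, hidden)     # shared by y0 and y1

input_ids = torch.randint(0, vocab, (1, seq))
positions = torch.arange(seq).unsqueeze(0)
bbox = torch.randint(0, 1000, (1, seq, 4))  # (x0, y0, x1, y1) per token

# Joint input representation fed to the Transformer encoder.
joint = (word_emb(input_ids) + pos_emb(positions)
         + x_emb(bbox[..., 0]) + y_emb(bbox[..., 1])
         + x_emb(bbox[..., 2]) + y_emb(bbox[..., 3]))
print(joint.shape)  # torch.Size([1, 16, 768])
```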
LayoutLM now achieves new state-of-the-art results in various downstream tasks such as form understanding, receipt understanding, and document image classification.
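For readers who want to try it, here is a usage sketch based on the Hugging Face transformers implementation of LayoutLM. The checkpoint name and the 0-1000 box normalization follow the library's documentation; the words and boxes below are made up:

```python
import torch
from transformers import LayoutLMTokenizer, LayoutLMForSequenceClassification

tokenizer = LayoutLMTokenizer.from_pretrained("microsoft/layoutlm-base-uncased")
model = LayoutLMForSequenceClassification.from_pretrained(
    "microsoft/layoutlm-base-uncased", num_labels=2
)

# Made-up words and boxes; coordinates must be normalized to 0-1000.
words = ["Invoice", "Total", "$120.00"]
boxes = [[110, 60, 230, 90], [80, 500, 160, 530], [400, 500, 520, 530]]

# Repeat each word's box for every sub-word token it produces.
tokens, token_boxes = [], []
for word, box in zip(words, boxes):
    word_tokens = tokenizer.tokenize(word)
    tokens.extend(word_tokens)
    token_boxes.extend([box] * len(word_tokens))

# Conventional boxes for the special tokens, as in the library docs.
input_ids = tokenizer.convert_tokens_to_ids(["[CLS]"] + tokens + ["[SEP]"])
token_boxes = [[0, 0, 0, 0]] + token_boxes + [[1000, 1000, 1000, 1000]]

outputs = model(
    input_ids=torch.tensor([input_ids]),
    bbox=torch.tensor([token_boxes]),
)
print(outputs.logits.shape)  # (1, 2): document-class scores
```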