This post is mainly based on the LayoutLM series of papers: LayoutLM, LayoutLMv2, LayoutXLM, and LayoutLMv3.

Why Is Visually-Rich Document Understanding Hard?

  • Relies not only on textual information, but also on visual and layout information
  • Text style and format also convey information
  • Hypothesis: textual, visual, and layout information should be jointly modeled and learned end-to-end in a single framework

Comparison of LayoutLMs

  • LayoutLM
    • 2-D absolute position embedding + Image embedding
    • Image encoder: only fine-tuned on downstream tasks, not pre-trained
    • Masked Visual-Language Model (MVLM) loss + Multi-label Document Classification (MDC) loss (similar to BERT)
  • LayoutLMv2
    • 2-D relative position embedding
    • Image encoder: ResNeXt101-FPN, pre-trained
    • New loss function: Text-Image Alignment Loss, Text-Image Matching Loss
  • LayoutXLM
    • Multilingual extension of LayoutLMv2
  • LayoutLMv3
    • Image encoder: Linear projection to visual embedding
    • New loss function: Masked Image Modeling, Word-Patch Alignment Loss

LayoutLM

  • 2-D position embedding
  • Image embedding
  • Masked Visual-Language Model (MVLM) loss
  • New SOTA at the time on form understanding (FUNSD), receipt extraction (SROIE), and document image classification (RVL-CDIP)

v1-arch

LayoutLM architecture. A pre-built OCR engine extracts text, text positions, and text ROIs. Text and text positions are encoded by the text encoder into LayoutLM embeddings. Text ROIs are encoded by Faster R-CNN and FC layers into image embeddings. The text encoder is pre-trained; the image encoder is only fine-tuned.

Architecture

  • Language model backbone: BERT
    • BERT-base: 12 layers, 768 hidden size, 12 attention heads, ~113M parameters
    • BERT-large: 24 layers, 1024 hidden size, 16 attention heads
  • Image embedding backbone: ResNet-101
    • From Faster R-CNN pre-trained on the Visual Genome dataset
  • Additional input embeddings (a construction sketch follows this list)
    • 2-D position embedding: relative position of a token within a document
    • Image embedding (fine-tuning only): scanned token images within a document
  • Loss
    • Masked Visual-Language Model (MVLM) loss
      • Similar to Masked Language Model loss
      • Mask the text embedding but keep positional embedding
      • Predict masked token
      • 15% of the input tokens are selected for prediction
    • Multi-label Document Classification (MDC) loss
  • Optimization
    • 8 NVIDIA Tesla V100 32GB GPUs
    • Batch size: 80
    • Adam, lr=5e-5, linear decay
    • BASE model: 80 hours to finish one epoch on 11M documents
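
To make the input construction concrete, here is a minimal PyTorch sketch of LayoutLM-style embeddings: token, 1-D position, and 2-D bounding-box coordinate embeddings are simply summed. The class name, table sizes, and hidden size are illustrative, not the released implementation, and the Faster R-CNN image embeddings added at fine-tuning time are omitted.

```python
import torch
import torch.nn as nn

class LayoutLMEmbeddings(nn.Module):
    """Minimal sketch of LayoutLM-style input embeddings (illustrative only)."""
    def __init__(self, vocab_size=30522, hidden=768, max_pos=512, max_coord=1024):
        super().__init__()
        self.tok = nn.Embedding(vocab_size, hidden)   # WordPiece token ids
        self.pos = nn.Embedding(max_pos, hidden)      # 1-D sequence position
        self.x = nn.Embedding(max_coord, hidden)      # x-coordinates of the token bounding box
        self.y = nn.Embedding(max_coord, hidden)      # y-coordinates of the token bounding box

    def forward(self, input_ids, bbox):
        # bbox: (batch, seq, 4) holding (x0, y0, x1, y1) normalized to [0, 1000]
        positions = torch.arange(input_ids.size(1), device=input_ids.device)
        return (self.tok(input_ids)
                + self.pos(positions)[None]
                + self.x(bbox[..., 0]) + self.y(bbox[..., 1])
                + self.x(bbox[..., 2]) + self.y(bbox[..., 3]))

ids = torch.randint(0, 30522, (1, 6))
bbox = torch.randint(0, 1001, (1, 6, 4))
print(LayoutLMEmbeddings()(ids, bbox).shape)  # torch.Size([1, 6, 768])
```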

Pre-training Dataset

  • IIT-CDIP Test Collection 1.0
    • [Building a Test Collection for Complex Document Information Processing, 2006]
    • 6 million scanned documents
  • Pre-processing
    • Tesseract: obtain the recognized text as well as the 2-D positions (a minimal example follows this list)
    • Store the OCR results in hOCR format (OCR + hierarchical representation)
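
As a reference for this step, below is a minimal sketch using pytesseract, a common Python wrapper for Tesseract; the file names are hypothetical and error handling is omitted.

```python
import pytesseract
from PIL import Image

page = Image.open("scan.png")  # hypothetical input scan

# Word-level text plus 2-D positions (pixel boxes)
data = pytesseract.image_to_data(page, output_type=pytesseract.Output.DICT)
words = [
    (text, (left, top, left + width, top + height))
    for text, left, top, width, height in zip(
        data["text"], data["left"], data["top"], data["width"], data["height"])
    if text.strip()
]

# hOCR output: OCR text plus a hierarchical layout representation
hocr = pytesseract.image_to_pdf_or_hocr(page, extension="hocr")
with open("scan.hocr", "wb") as f:
    f.write(hocr)
```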

Benchmark Dataset

  • FUNSD
    • Spatial layout analysis and form understanding
  • SROIE
    • Scanned Receipts Information Extraction
  • RVL-CDIP
    • Document image classification

Ablation

MVLM vs MVLM+MDC

  • MVLM+MDC performs only marginally better than MVLM alone

v1-arch-ablation

Model accuracy (Precision, Recall, F1) on the FUNSD dataset.

Different initialization of BERT

  • Initialization has a large impact

v1-init-ablation

Different initialization methods for BASE and LARGE (Text + Layout, MVLM).

LayoutLMv2

  • Spatial-Aware Self-Attention Mechanism: the attention score $\alpha$ is modified by learnable biases
  • Image encoder in pre-training + 2 new pre-training loss
  • Significantly outperforms SOTA / LayoutLM

v2-arch

LayoutLMv2 architecture. The visual encoder encodes image patches into visual embeddings. Visual token embeddings and text token embeddings have the same dimension. The Transformer layers, the text embedding, and the visual encoder are jointly pre-trained.

Architecture

  • Text Embedding
    • Token embedding $TokEmb(w_i)$
    • 1D positional embedding $PosEmb1D(i)$: token index
    • Segment embedding $SegEmb(s_i)$: distinguish different text segments
    • $t_i = TokEmb(w_i) + PosEmb1D(i) + SegEmb(s_i)$
  • Visual Embedding
    • The page image $I$ is resized to 224 x 224 and converted into a fixed-length sequence
    • Backbone CNN: ResNeXt-FPN
    • Output feature map is average-pooled to a fixed size W x H (W = H = 7 in pre-training), then flattened into $VisTokEmb(I)$
    • $VisTokEmb(I)$ is then projected into dimensionality of the text embeddings
    • Visual tokens also have a 1D positional embedding (indicating their order) and a segment embedding (all assigned to [C])
    • $v_i = Proj(VisTokEmb(I)_i)+PosEmb1D(i)+SegEmb([C])$
    • $0 \leq i < WH$
  • Layout Embedding
    • Spatial layout information represented by axis-aligned token bounding boxes from the OCR result
    • Normalize and discretize all coordinates to integers in the range [0, 1000]
    • $l_i = Concat(PosEmb2D_x(x_{min}, x_{max}, width), PosEmb2D_y(y_{min}, y_{max}, height))$
  • Combine
    • Concatenates visual embeddings and text embeddings
    • $X = \{v_0, \ldots, v_{WH-1}, t_0, \ldots, t_{L-1}\}$
  • Spatial-Aware Self-Attention
    • Let $\alpha_{ij}$ be the vanilla attention score between $x_i, x_j$
    • $\alpha'_{ij} = \alpha_{ij} + b_{j-i}^{(1D)} + b_{x_j-x_i}^{(2D_x)} + b_{y_j-y_i}^{(2D_y)}$
    • The bias terms $b^{(1D)}$, $b^{(2D_x)}$, $b^{(2D_y)}$ are learnable (a sketch follows this list)
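
The sketch below illustrates the spatial-aware attention bias, assuming relative distances are simply clamped; the actual implementation buckets relative positions, so treat this as an illustration of the idea rather than the paper's code.

```python
import torch
import torch.nn as nn

class SpatialAwareBias(nn.Module):
    """Per-head 1-D and 2-D relative-position biases added to attention scores."""
    def __init__(self, heads=12, max_rel=128):
        super().__init__()
        self.max_rel = max_rel
        self.b_1d = nn.Embedding(2 * max_rel + 1, heads)
        self.b_2dx = nn.Embedding(2 * max_rel + 1, heads)
        self.b_2dy = nn.Embedding(2 * max_rel + 1, heads)

    def _rel(self, v):
        # pairwise offsets j - i, clamped and shifted into [0, 2 * max_rel]
        d = v[None, :] - v[:, None]
        return d.clamp(-self.max_rel, self.max_rel) + self.max_rel

    def forward(self, pos_1d, x, y):
        # pos_1d, x, y: (seq,) integer 1-D positions and box coordinates
        bias = (self.b_1d(self._rel(pos_1d))
                + self.b_2dx(self._rel(x))
                + self.b_2dy(self._rel(y)))
        return bias.permute(2, 0, 1)  # (heads, seq, seq), added to the raw attention scores

# usage: alpha_prime = alpha + SpatialAwareBias()(pos_1d, x_coords, y_coords)
```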

Loss

  • Masked Visual-Language Model (MVLM) Loss
    • Same as LayoutLM v1, except that the image regions of tokens masked by MVLM are also masked before the page image is passed to the image encoder (to avoid visual clue leakage)
  • Text-Image Alignment (TIA) Loss
    • Covering: some token lines are randomly selected and their image regions are covered on the document image; the covering operation is performed at the line level (see the sketch after this list)
    • Pre-training: a classification layer is built above the encoder outputs; for each token, predict if it is covered ([Covered] or [Not Covered], BCE loss)
    • When MVLM and TIA are performed simultaneously, TIA losses of the tokens masked in MVLM are not taken into account (to avoid learning the trivial mapping [MASK] => [Covered])
  • Text-Image Matching (TIM) Loss
    • Feed the output representation at [CLS] into a classifier to predict whether the image and text are from the same document page (BCE loss)
    • Positive samples: matched text-image
    • Negative samples: image replaced by a page image from another document or dropped
    • Perform the same masking and covering operations on images in negative samples, so the model cannot cheat by detecting masking/covering artifacts
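
Below is a rough sketch of the TIA covering step and its labels, assuming a simple data layout (a list of line boxes with their token indices); it is illustrative, not the authors' code.

```python
import random
import torch

def tia_cover(image, lines, mvlm_masked, cover_ratio=0.15):
    """Cover randomly selected lines on the page image and build [Covered] labels.

    image: (C, H, W) tensor; lines: list of ((x0, y0, x1, y1), token_indices);
    mvlm_masked: per-token bools marking tokens already masked by MVLM."""
    covered = torch.zeros(len(mvlm_masked), dtype=torch.long)
    for (x0, y0, x1, y1), token_indices in lines:
        if random.random() < cover_ratio:
            image[:, y0:y1, x0:x1] = 0   # cover the whole line region on the image
            covered[token_indices] = 1   # tokens in this line get the [Covered] label
    # tokens masked by MVLM are excluded from the TIA loss
    loss_mask = ~torch.tensor(mvlm_masked, dtype=torch.bool)
    return image, covered, loss_mask
```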

Pre-training

  • IIT-CDIP Test Collection
    • Random sliding window of the text sequence if the sample is too long
    • Maximum sequence length L = 512 and assign all text tokens to the segment [A]
    • Output shape of the image encoder average pooling layer: W = H = 7 (49 tokens)
  • Text encoder
    • Architecture: BERT BASE or BERT LARGE
    • Init: UniLMv2
  • Image encoder
    • Architecture: ResNeXt101-FPN
    • Full models have 200M (BASE) and 426M (LARGE) parameters
    • Init: Mask R-CNN trained from PubLayNet
  • Loss
    • MVLM: 15% of tokens selected (80% [MASK], 10% random token, 10% unchanged); a masking sketch follows this list
    • TIA: 15% of the lines are covered
    • TIM: 15% images are replaced, and 5% are dropped
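
For completeness, here is a minimal sketch of the 80/10/10 MVLM masking scheme (standard BERT-style masking; special-token handling and the image-region masking are omitted).

```python
import torch

def mvlm_mask(input_ids, mask_id, vocab_size, p=0.15):
    """Select p of the tokens; 80% -> [MASK], 10% -> random token, 10% unchanged."""
    labels = input_ids.clone()
    selected = torch.rand(input_ids.shape) < p
    labels[~selected] = -100  # only selected positions contribute to the loss
    roll = torch.rand(input_ids.shape)
    masked = torch.where(selected & (roll < 0.8),
                         torch.full_like(input_ids, mask_id), input_ids)
    masked = torch.where(selected & (roll >= 0.8) & (roll < 0.9),
                         torch.randint_like(input_ids, vocab_size), masked)
    # the remaining selected tokens stay unchanged; 2-D position embeddings
    # are kept for every token, masked or not
    return masked, labels
```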

Results

  • Fine-tuning: task-dependent fine-tuning heads, but still limited by the BERT-style architecture
  • Baseline: text-only pre-trained models, LayoutLM
  • Backbone: BERT Base & Large
  • LayoutLMv2 outperforms baseline by a large margin / new SOTA on downstream tasks

Benchmark Dataset

  • Form understanding: FUNSD
  • Receipt understanding: CORD, SROIE
  • Long document with a complex layout: Kleister-NDA
  • Document image classification: RVL-CDIP
  • Visual question answering: DocVQA

v2-docvqa

ANLS score on the DocVQA dataset, “QG” denotes the data augmentation with the question generation dataset.

Ablation

v2-docvqa-ablation

Ablation study on the DocVQA dataset, where ANLS scores on the validation set are reported. “SASAM” means the spatial-aware self-attention mechanism. “MVLM”, “TIA” and “TIM” are the three pre-training tasks. All the models are trained using the whole pre-training dataset for one epoch with the BASE model size.

Comments: ANLS of line 6 model in Table 5 doesn’t match ANLS of line 7 model in Table 4, which I believe to be the same model. Not sure why.

LayoutLMv3

  • Pre-train multimodal Transformers with unified text and image masking
  • Word-patch alignment objective: for cross-modal alignment learning
  • SOTA on both text-centric and image-centric tasks

v3-arch-comp

LayoutLMv3 vs. DocFormer vs. SelfDoc. Image embedding: linear patch projections reduce computational complexity. Image pre-training objective: reconstructing masked image tokens rather than raw pixels or region features captures high-level structures.

Architecture

v3-arch

  • Text embedding: $Y = y_{1:L}$
    • OCR: text + bounding box
    • Position embeddings: 1D position + 2D layout position
    • Modification to 2D layout position embedding
      • LayoutLM and LayoutLMv2: word-level layout position
      • LayoutLMv3: segment-level layout positions (words in same segment usually express the same semantic meaning)
  • Image embedding: $X = x_{1:M}$
    • Linear image patches, inspired by ViT and ViLT (see the patch-embedding sketch after this list)
    • First multimodal pre-trained Document AI model that does not rely on a CNN backbone as the image encoder
    • Reduce heavy computation bottleneck and remove the need for region supervision
    • $I \in \mathbb{R}^{C \times H \times W}$ (channel size, height, and width)
    • Split the image into a sequence of uniform $P \times P$ patches
    • Linearly project the image patches to $D$ dimensions, then flatten
    • Length $M = HW/P^2$
    • Learnable 1D position embeddings
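
A minimal sketch of the ViT-style linear patch embedding described above, using a strided convolution as the shared linear projection (a standard equivalent formulation); dimensions follow the pre-training setup (224 x 224 image, $P = 16$, $M = 196$).

```python
import torch
import torch.nn as nn

# A strided Conv2d is equivalent to flattening each P x P patch and applying a
# shared linear projection to D dimensions.
C, H, W, P, D = 3, 224, 224, 16, 768
patch_embed = nn.Conv2d(C, D, kernel_size=P, stride=P)
pos_embed = nn.Parameter(torch.zeros(1, (H // P) * (W // P), D))  # learnable 1-D positions

image = torch.randn(1, C, H, W)
x = patch_embed(image)            # (1, 768, 14, 14)
x = x.flatten(2).transpose(1, 2)  # (1, 196, 768): M = HW / P^2 = 196 visual tokens
x = x + pos_embed
```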

Loss

  • Masked Language Modeling (MLM)
    • Mask 30% of text tokens with Span masking strategy
    • Span lengths drawn from a Poisson distribution $\lambda = 3$
    • The input includes both image tokens $X$ and text tokens $Y$
  • Masked Image Modeling (MIM)
    • MIM pre-training objective in BEiT
    • Mask 40% image tokens with the blockwise masking strategy
    • Cross-entropy loss to reconstruct the masked image tokens based on surrounding text and image tokens
  • Word-Patch Alignment (WPA)
    • Since each word corresponds to an image patch, the goal is to learn the cross-modal alignment
    • WPA objective
      • Predict whether the corresponding image patches of a text word are masked (see the label-construction sketch after this list)
      • Aligned: for an unmasked text token, corresponding image tokens are also unmasked
      • Unaligned: for an unmasked text token, corresponding image tokens are masked
      • Exclude the masked text tokens when calculating WPA loss
    • Classification head: two-layer MLP, binary cross-entropy loss
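
A small sketch of how the WPA labels could be constructed, assuming a precomputed word-to-patch mapping; the data layout is illustrative and not taken from the paper's code.

```python
import torch

def wpa_labels(text_masked, word_to_patches, patch_masked):
    """A text token is 'aligned' (1) when none of its image patches were masked
    by MIM; text tokens masked by MLM are excluded from the loss via `keep`."""
    labels, keep = [], []
    for i, patches in enumerate(word_to_patches):
        labels.append(int(not any(patch_masked[p] for p in patches)))
        keep.append(not text_masked[i])
    return torch.tensor(labels), torch.tensor(keep, dtype=torch.bool)

# toy example: two words; the second word's only patch was masked by MIM
labels, keep = wpa_labels([False, False], [[0], [1]], [False, True])
print(labels, keep)  # tensor([1, 0]) tensor([True, True])
```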

Pre-training

  • Tokenizer: BPE, with Max sequence length $L = 512$
  • Transformer backbone: BERT Base or Large size; text embeddings initialized from RoBERTa
  • Image preproc
    • $C \times H \times W = 3 \times 224 \times 224$, $P = 16$, $M = 196$
  • Image tokenizer (provides targets for MIM)
    • From DiT
    • Image token vocabulary size: 8192
  • Optimizer
    • Adam, batch size of 2048, 500000 steps
    • Lr: BERT Base = $1e-4$, BERT Large = $5e-5$
  • Distributed mixed-precision training + gradient accumulation + gradient checkpointing
  • Softmax computation modified following CogView to stabilize training
  • Dataset: IIT-CDIP Test Collection 1.0 (11 million docs, 42 million pages)

Results

Benchmark Dataset

  • Form understanding: FUNSD
  • Receipt understanding: CORD
  • Visual question answering: DocVQA
  • Document image classification: RVL-CDIP
  • Document layout analysis: PubLayNet

Layout Parsing

  • Use LayoutLMv3 as the feature backbone in a Cascade R-CNN detector with FPN, implemented with Detectron2
  • Extract single-scale features from layers 4, 6, 8, and 12 of the LayoutLMv3 base
  • Resolution-modifying modules: convert the single-scale features into multi-scale FPN features (a sketch follows this list)
  • PubLayNet: 335,703 training, 11,245 validation, and 11,405 test images
  • For the detailed fine-tuning config, see Section 3.4 of the paper
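
This summary does not spell out the resolution-modifying modules, so the following is a sketch of one common scheme used by ViT-based detectors (deconvolution/pooling from stride 16 to strides 4/8/16/32); the modules used in the paper may differ.

```python
import torch
import torch.nn as nn

D = 768  # hidden size of the BASE model
resample = nn.ModuleList([
    nn.Sequential(nn.ConvTranspose2d(D, D, 2, 2), nn.GELU(),
                  nn.ConvTranspose2d(D, D, 2, 2)),  # layer 4  features -> stride 4
    nn.ConvTranspose2d(D, D, 2, 2),                 # layer 6  features -> stride 8
    nn.Identity(),                                  # layer 8  features -> stride 16
    nn.MaxPool2d(kernel_size=2, stride=2),          # layer 12 features -> stride 32
])

# single-scale feature maps (stride 16) taken from layers 4, 6, 8, and 12
feats = [torch.randn(1, D, 14, 14) for _ in range(4)]
fpn_inputs = [m(f) for m, f in zip(resample, feats)]  # spatial sizes 56, 28, 14, 7
```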

v3-publaynet

Document layout analysis mAP @ IOU [0.50:0.95] on PubLayNet validation set. All models use only information from the vision modality. LayoutLMv3 outperforms the compared ResNets and vision Transformer backbones.

Ablation Study

  • MLM head
    • 39.2M params
    • Vocabulary size: 50265
  • The benefit of image embeddings on text-centric tasks is surprisingly low
    • However, the authors argue that
      • Computation overhead of image encoder is also low
      • Linear: 0.6M params; ResNet-101: 44M params
    • Note that $M = 196$ visual tokens increase the input sequence length of the Transformer, which leads to increased computation
  • MIM loss may help learn features for image-centric tasks
    • 6.9M params
  • WPA loss leads to marginal improvements across all tasks
    • 0.6M params

v3-ablation

Ablation study on image embeddings and pre-training objectives on typical text-centric tasks (form and receipt understanding on FUNSD and CORD) and image-centric tasks (document image classification on RVL-CDIP and document layout analysis on PubLayNet). All models were trained at BASE size on 1 million documents for 150,000 steps with a learning rate of 3e-4.