This post is mainly based on the LayoutLM series of papers: LayoutLM, LayoutLMv2, LayoutXLM, and LayoutLMv3.

Why Is Visually-Rich Document Understanding Hard?

  • Relies not only on textual information, but also on visual and layout information
  • Text style and format also convey information
  • Hypothesis: textual, visual, and layout information should be jointly modeled and learned end-to-end in a single framework

Comparison of LayoutLMs

  • LayoutLM
    • 2-D absolute position embedding + Image embedding
    • Image encoder: only fine-tuned on downstream tasks, not pre-trained
    • Masked Visual-Language Model (MVLM) loss + Multi-label Document Classification (MDC) loss (similar to BERT)
  • LayoutLMv2
    • 2-D relative position embedding
    • Image encoder: ResNeXt101-FPN, pre-trained
    • New loss function: Text-Image Alignment Loss, Text-Image Matching Loss
  • LayoutXLM
    • Multilingual extension of LayoutLMv2
  • LayoutLMv3
    • Image encoder: Linear projection to visual embedding
    • New loss function: Masked Image Modeling, Word-Patch Alignment Loss

LayoutLM

  • 2-D position embedding
  • Image embedding
  • Masked Visual-Language Model (MVLM) loss
  • New SOTA at the time on form understanding (FUNSD), receipt extraction (SROIE), and document image classification (RVL-CDIP)

v1-arch

LayoutLM architecture. A pre-built OCR engine extracts text, text positions, and text ROIs. Text and text positions are encoded by the text encoder into LayoutLM embeddings. Text ROIs are encoded by Faster R-CNN and FC layers into image embeddings. The text encoder is pre-trained; the image encoder is only fine-tuned.

Architecture

  • Language model backbone: BERT
    • BERT-base: 12 layers, 768 hidden size, 12 attention heads, ~113M parameters
    • BERT-large: 24 layers, 1024 hidden size, 16 attention heads
  • Image embedding backbone: ResNet-101
    • From Faster R-CNN pre-trained on the Visual Genome dataset
  • Additional input embeddings (a construction sketch follows this list)
    • 2-D position embedding: relative position of a token within a document
    • Image embedding (fine-tuning only): scanned token images within a document
  • Loss
    • Masked Visual-Language Model (MVLM) loss
      • Similar to Masked Language Model loss
      • Mask the text embedding but keep positional embedding
      • Predict masked token
      • 15% of the input tokens are selected for prediction
    • Multi-label Document Classification (MDC) loss
  • Optimization
    • 8 NVIDIA Tesla V100 32GB GPUs
    • Batch size: 80
    • Adam, lr=5e-5, linear decay
    • BASE model: 80 hours to finish one epoch on 11M documents
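
To make the input construction concrete, here is a minimal PyTorch sketch of LayoutLM-style embeddings: token, 1-D position, and 2-D bounding-box coordinate embeddings are simply summed. The class name, table sizes, and hidden size are illustrative, not the released implementation, and the Faster R-CNN image embeddings added at fine-tuning time are omitted.

```python
import torch
import torch.nn as nn

class LayoutLMEmbeddings(nn.Module):
    """Minimal sketch of LayoutLM-style input embeddings (illustrative only)."""
    def __init__(self, vocab_size=30522, hidden=768, max_pos=512, max_coord=1024):
        super().__init__()
        self.tok = nn.Embedding(vocab_size, hidden)   # WordPiece token ids
        self.pos = nn.Embedding(max_pos, hidden)      # 1-D sequence position
        self.x = nn.Embedding(max_coord, hidden)      # x-coordinates of the token bounding box
        self.y = nn.Embedding(max_coord, hidden)      # y-coordinates of the token bounding box

    def forward(self, input_ids, bbox):
        # bbox: (batch, seq, 4) holding (x0, y0, x1, y1) normalized to [0, 1000]
        positions = torch.arange(input_ids.size(1), device=input_ids.device)
        return (self.tok(input_ids)
                + self.pos(positions)[None]
                + self.x(bbox[..., 0]) + self.y(bbox[..., 1])
                + self.x(bbox[..., 2]) + self.y(bbox[..., 3]))

ids = torch.randint(0, 30522, (1, 6))
bbox = torch.randint(0, 1001, (1, 6, 4))
print(LayoutLMEmbeddings()(ids, bbox).shape)  # torch.Size([1, 6, 768])
```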

Pre-training Dataset

  • IIT-CDIP Test Collection 1.0
    • [Building a Test Collection for Complex Document Information Processing, 2006]
    • 6 million scanned documents
  • Pre-processing
    • Tesseract: obtain the recognized text as well as the 2-D positions (a minimal example follows this list)
    • Store the OCR results in hOCR format (OCR + hierarchical representation)
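
As a reference for this step, below is a minimal sketch using pytesseract, a common Python wrapper for Tesseract; the file names are hypothetical and error handling is omitted.

```python
import pytesseract
from PIL import Image

page = Image.open("scan.png")  # hypothetical input scan

# Word-level text plus 2-D positions (pixel boxes)
data = pytesseract.image_to_data(page, output_type=pytesseract.Output.DICT)
words = [
    (text, (left, top, left + width, top + height))
    for text, left, top, width, height in zip(
        data["text"], data["left"], data["top"], data["width"], data["height"])
    if text.strip()
]

# hOCR output: OCR text plus a hierarchical layout representation
hocr = pytesseract.image_to_pdf_or_hocr(page, extension="hocr")
with open("scan.hocr", "wb") as f:
    f.write(hocr)
```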

Benchmark Dataset

  • FUNSD
    • Spatial layout analysis and form understanding
  • SROIE
    • Scanned Receipts Information Extraction
  • RVL-CDIP
    • Document image classification

Ablation

MVLM vs MVLM+MDC

  • MVLM+MDC performs only marginally better than MVLM alone

v1-arch-ablation

Model accuracy (Precision, Recall, F1) on the FUNSD dataset.

Different initialization of BERT

  • Initialization has a large impact

v1-init-ablation

Different initialization methods for BASE and LARGE (Text + Layout, MVLM).

LayoutLMv2

  • Spatial-Aware Self-Attention Mechanism: the attention score $\alpha$ is modified by learnable biases
  • Image encoder in pre-training + 2 new pre-training loss
  • Significantly outperforms SOTA / LayoutLM

v2-arch

LayoutLMv2 architecture. The visual encoder encodes image patches into visual embeddings. Visual token embeddings and text token embeddings have the same dimension. The Transformer layers, the text embedding, and the visual encoder are jointly pre-trained.

Architecture

  • Text Embedding
    • Token embedding $TokEmb(w_i)$
    • 1D positional embedding $PosEmb1D(i)$: token index
    • Segment embedding $SegEmb(s_i)$: distinguish different text segments
    • $t_i = TokEmb(w_i) + PosEmb1D(i) + SegEmb(s_i)$
  • Visual Embedding
    • The page image $I$ is resized to 224 x 224 and converted into a fixed-length sequence
    • Backbone CNN: ResNeXt-FPN
    • Output feature map is average-pooled to a fixed size W x H (W = H = 7 in pre-training), then flattened into $VisTokEmb(I)$
    • $VisTokEmb(I)$ is then projected into dimensionality of the text embeddings
    • Visual tokens also have a 1D positional embedding (indicating their order) and a segment embedding (all assigned to [C])
    • $v_i = Proj(VisTokEmb(I)_i)+PosEmb1D(i)+SegEmb([C])$
    • $0 \leq i < WH$
  • Layout Embedding
    • Spatial layout information represented by axis-aligned token bounding boxes from the OCR result
    • Normalize and discretize all coordinates to integers in the range [0, 1000]
    • $l_i = Concat(PosEmb2D_x(x_{min}, x_{max}, width), PosEmb2D_y(y_{min}, y_{max}, height))$
  • Combine
    • Concatenates visual embeddings and text embeddings
    • $X = \{v_0, \ldots, v_{WH-1}, t_0, \ldots, t_{L-1}\}$
  • Spatial-Aware Self-Attention
    • Let $\alpha_{ij}$ be the vanilla attention score between $x_i, x_j$
    • $\alpha'_{ij} = \alpha_{ij} + b_{j-i}^{(1D)} + b_{x_j-x_i}^{(2D_x)} + b_{y_j-y_i}^{(2D_y)}$
    • The bias terms $b^{(1D)}$, $b^{(2D_x)}$, $b^{(2D_y)}$ are learnable (a sketch follows this list)
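
The sketch below illustrates the spatial-aware attention bias, assuming relative distances are simply clamped; the actual implementation buckets relative positions, so treat this as an illustration of the idea rather than the paper's code.

```python
import torch
import torch.nn as nn

class SpatialAwareBias(nn.Module):
    """Per-head 1-D and 2-D relative-position biases added to attention scores."""
    def __init__(self, heads=12, max_rel=128):
        super().__init__()
        self.max_rel = max_rel
        self.b_1d = nn.Embedding(2 * max_rel + 1, heads)
        self.b_2dx = nn.Embedding(2 * max_rel + 1, heads)
        self.b_2dy = nn.Embedding(2 * max_rel + 1, heads)

    def _rel(self, v):
        # pairwise offsets j - i, clamped and shifted into [0, 2 * max_rel]
        d = v[None, :] - v[:, None]
        return d.clamp(-self.max_rel, self.max_rel) + self.max_rel

    def forward(self, pos_1d, x, y):
        # pos_1d, x, y: (seq,) integer 1-D positions and box coordinates
        bias = (self.b_1d(self._rel(pos_1d))
                + self.b_2dx(self._rel(x))
                + self.b_2dy(self._rel(y)))
        return bias.permute(2, 0, 1)  # (heads, seq, seq), added to the raw attention scores

# usage: alpha_prime = alpha + SpatialAwareBias()(pos_1d, x_coords, y_coords)
```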

Loss

  • Masked Visual-Language Model (MVLM) Loss
    • Same as LayoutLM v1, except that the image regions of tokens masked by MVLM are also masked before the page image is passed to the image encoder (to avoid visual clue leakage)
  • Text-Image Alignment (TIA) Loss
    • Covering: some token lines are randomly selected and their image regions are covered on the document image; the covering operation is performed at the line level (see the sketch after this list)
    • Pre-training: a classification layer is built above the encoder outputs; for each token, predict if it is covered ([Covered] or [Not Covered], BCE loss)
    • When MVLM and TIA are performed simultaneously, TIA losses of the tokens masked in MVLM are not taken into account (to avoid learning the trivial mapping [MASK] => [Covered])
  • Text-Image Matching (TIM) Loss
    • Feed the output representation at [CLS] into a classifier to predict whether the image and text are from the same document page (BCE loss)
    • Positive samples: matched text-image
    • Negative samples: image replaced by a page image from another document or dropped
    • Perform the same masking and covering operations on images in negative samples, so the model cannot cheat by detecting masking/covering artifacts
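
Below is a rough sketch of the TIA covering step and its labels, assuming a simple data layout (a list of line boxes with their token indices); it is illustrative, not the authors' code.

```python
import random
import torch

def tia_cover(image, lines, mvlm_masked, cover_ratio=0.15):
    """Cover randomly selected lines on the page image and build [Covered] labels.

    image: (C, H, W) tensor; lines: list of ((x0, y0, x1, y1), token_indices);
    mvlm_masked: per-token bools marking tokens already masked by MVLM."""
    covered = torch.zeros(len(mvlm_masked), dtype=torch.long)
    for (x0, y0, x1, y1), token_indices in lines:
        if random.random() < cover_ratio:
            image[:, y0:y1, x0:x1] = 0   # cover the whole line region on the image
            covered[token_indices] = 1   # tokens in this line get the [Covered] label
    # tokens masked by MVLM are excluded from the TIA loss
    loss_mask = ~torch.tensor(mvlm_masked, dtype=torch.bool)
    return image, covered, loss_mask
```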

Pre-training

  • IIT-CDIP Test Collection
    • Random sliding window of the text sequence if the sample is too long
    • Maximum sequence length L = 512 and assign all text tokens to the segment [A]
    • Output shape of the image encoder average pooling layer: W = H = 7 (49 tokens)
  • Text encoder
    • Architecture: BERT BASE or BERT LARGE
    • Init: UniLMv2
  • Image encoder
    • Architecture: ResNeXt101-FPN
    • Full models have 200M (BASE) and 426M (LARGE) parameters
    • Init: Mask R-CNN trained from PubLayNet
  • Loss
    • MVLM: 15% of tokens selected (80% [MASK], 10% random token, 10% unchanged); a masking sketch follows this list
    • TIA: 15% of the lines are covered
    • TIM: 15% images are replaced, and 5% are dropped
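
For completeness, here is a minimal sketch of the 80/10/10 MVLM masking scheme (standard BERT-style masking; special-token handling and the image-region masking are omitted).

```python
import torch

def mvlm_mask(input_ids, mask_id, vocab_size, p=0.15):
    """Select p of the tokens; 80% -> [MASK], 10% -> random token, 10% unchanged."""
    labels = input_ids.clone()
    selected = torch.rand(input_ids.shape) < p
    labels[~selected] = -100  # only selected positions contribute to the loss
    roll = torch.rand(input_ids.shape)
    masked = torch.where(selected & (roll < 0.8),
                         torch.full_like(input_ids, mask_id), input_ids)
    masked = torch.where(selected & (roll >= 0.8) & (roll < 0.9),
                         torch.randint_like(input_ids, vocab_size), masked)
    # the remaining selected tokens stay unchanged; 2-D position embeddings
    # are kept for every token, masked or not
    return masked, labels
```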

Results

  • Fine-tuning: task-dependent fine-tuning heads, but still limited by the BERT-style architecture
  • Baseline: text-only pre-trained models, LayoutLM
  • Backbone: BERT Base & Large
  • LayoutLMv2 outperforms baseline by a large margin / new SOTA on downstream tasks

Benchmark Dataset

  • Form understanding: FUNSD
  • Receipt understanding: CORD, SROIE
  • Long document with a complex layout: Kleister-NDA
  • Document image classification: RVL-CDIP
  • Visual question answering: DocVQA

v2-docvqa

ANLS score on the DocVQA dataset, “QG” denotes the data augmentation with the question generation dataset.

Ablation

v2-docvqa-ablation

Ablation study on the DocVQA dataset, where ANLS scores on the validation set are reported. “SASAM” means the spatial-aware self-attention mechanism. “MVLM”, “TIA” and “TIM” are the three pre-training tasks. All the models are trained using the whole pre-training dataset for one epoch with the BASE model size.

Comments: ANLS of line 6 model in Table 5 doesn’t match ANLS of line 7 model in Table 4, which I believe to be the same model. Not sure why.

LayoutLMv3

  • Pre-train multimodal Transformers with unified text and image masking
  • Word-patch alignment objective: for cross-modal alignment learning
  • SOTA on both text-centric and image-centric tasks

v3-arch-comp

LayoutLMv3 vs. DocFormer vs. SelfDoc. Image embedding: linear patch projections reduce computational complexity. Image pre-training objective: reconstructing masked image tokens rather than raw pixels or region features captures high-level structures.

Architecture

v3-arch

  • Text embedding: $Y = y_{1:L}$
    • OCR: text + bounding box
    • Position embeddings: 1D position + 2D layout position
    • Modification to 2D layout position embedding
      • LayoutLM and LayoutLMv2: word-level layout position
      • LayoutLMv3: segment-level layout positions (words in same segment usually express the same semantic meaning)
  • Image embedding: $X = x_{1:M}$
    • Linear image patches, inspired by ViT and ViLT (see the patch-embedding sketch after this list)
    • First multimodal pre-trained Document AI model that does not rely on a CNN backbone as the image encoder
    • Reduce heavy computation bottleneck and remove the need for region supervision
    • $I \in \mathbb{R}^{C \times H \times W}$ (channel size, height, and width)
    • Split the image into a sequence of uniform $P \times P$ patches
    • Linearly project the image patches to $D$ dimensions, then flatten
    • Length $M = HW/P^2$
    • Learnable 1D position embeddings
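
A minimal sketch of the ViT-style linear patch embedding described above, using a strided convolution as the shared linear projection (a standard equivalent formulation); dimensions follow the pre-training setup (224 x 224 image, $P = 16$, $M = 196$).

```python
import torch
import torch.nn as nn

# A strided Conv2d is equivalent to flattening each P x P patch and applying a
# shared linear projection to D dimensions.
C, H, W, P, D = 3, 224, 224, 16, 768
patch_embed = nn.Conv2d(C, D, kernel_size=P, stride=P)
pos_embed = nn.Parameter(torch.zeros(1, (H // P) * (W // P), D))  # learnable 1-D positions

image = torch.randn(1, C, H, W)
x = patch_embed(image)            # (1, 768, 14, 14)
x = x.flatten(2).transpose(1, 2)  # (1, 196, 768): M = HW / P^2 = 196 visual tokens
x = x + pos_embed
```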

Loss

  • Masked Language Modeling (MLM)
    • Mask 30% of text tokens with Span masking strategy
    • Span lengths drawn from a Poisson distribution $\lambda = 3$
    • The input includes both image tokens $X$ and text tokens $Y$
  • Masked Image Modeling (MIM)
    • MIM pre-training objective in BEiT
    • Mask 40% image tokens with the blockwise masking strategy
    • Cross-entropy loss to reconstruct the masked image tokens based on surrounding text and image tokens
  • Word-Patch Alignment (WPA)
    • Since each word corresponds to an image patch, the goal is to learn the cross-modal alignment
    • WPA objective
      • Predict whether the corresponding image patches of a text word are masked (see the label-construction sketch after this list)
      • Aligned: for an unmasked text token, corresponding image tokens are also unmasked
      • Unaligned: for an unmasked text token, corresponding image tokens are masked
      • Exclude the masked text tokens when calculating WPA loss
    • Classification head: two-layer MLP, binary cross-entropy loss
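
A small sketch of how the WPA labels could be constructed, assuming a precomputed word-to-patch mapping; the data layout is illustrative and not taken from the paper's code.

```python
import torch

def wpa_labels(text_masked, word_to_patches, patch_masked):
    """A text token is 'aligned' (1) when none of its image patches were masked
    by MIM; text tokens masked by MLM are excluded from the loss via `keep`."""
    labels, keep = [], []
    for i, patches in enumerate(word_to_patches):
        labels.append(int(not any(patch_masked[p] for p in patches)))
        keep.append(not text_masked[i])
    return torch.tensor(labels), torch.tensor(keep, dtype=torch.bool)

# toy example: two words; the second word's only patch was masked by MIM
labels, keep = wpa_labels([False, False], [[0], [1]], [False, True])
print(labels, keep)  # tensor([1, 0]) tensor([True, True])
```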

Pre-training

  • Tokenizer: BPE, with Max sequence length $L = 512$
  • Transformer backbone: BERT Base or Large size; text embeddings initialized from RoBERTa
  • Image preproc
    • $C \times H \times W = 3 \times 224 \times 224$, $P = 16$, $M = 196$
  • Image tokenizer (provides targets for MIM)
    • From DiT
    • Image token vocabulary size: 8192
  • Optimizer
    • Adam, batch size of 2048, 500000 steps
    • Lr: BERT Base = $1e-4$, BERT Large = $5e-5$
  • Distributed mixed-precision training + gradient accumulation + gradient checkpointing
  • Softmax computation modified following CogView to stabilize training
  • Dataset: IIT-CDIP Test Collection 1.0 (11 million docs, 42 million pages)

Results

Benchmark Dataset

  • Form understanding: FUNSD
  • Receipt understanding: CORD
  • Visual question answering: DocVQA
  • Document image classification: RVL-CDIP
  • Document layout analysis: PubLayNet

Layout Parsing

  • Use LayoutLMv3 as the feature backbone in a Cascade R-CNN detector with FPN, implemented with Detectron2
  • Extract single-scale features from layers 4, 6, 8, and 12 of the LayoutLMv3 base
  • Resolution-modifying modules: convert the single-scale features into multi-scale FPN features (a sketch follows this list)
  • PubLayNet: 335,703 training, 11,245 validation, and 11,405 test images
  • For the detailed fine-tuning config, see Section 3.4 of the paper
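
This summary does not spell out the resolution-modifying modules, so the following is a sketch of one common scheme used by ViT-based detectors (deconvolution/pooling from stride 16 to strides 4/8/16/32); the modules used in the paper may differ.

```python
import torch
import torch.nn as nn

D = 768  # hidden size of the BASE model
resample = nn.ModuleList([
    nn.Sequential(nn.ConvTranspose2d(D, D, 2, 2), nn.GELU(),
                  nn.ConvTranspose2d(D, D, 2, 2)),  # layer 4  features -> stride 4
    nn.ConvTranspose2d(D, D, 2, 2),                 # layer 6  features -> stride 8
    nn.Identity(),                                  # layer 8  features -> stride 16
    nn.MaxPool2d(kernel_size=2, stride=2),          # layer 12 features -> stride 32
])

# single-scale feature maps (stride 16) taken from layers 4, 6, 8, and 12
feats = [torch.randn(1, D, 14, 14) for _ in range(4)]
fpn_inputs = [m(f) for m, f in zip(resample, feats)]  # spatial sizes 56, 28, 14, 7
```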

v3-publaynet

Document layout analysis mAP @ IOU [0.50:0.95] on PubLayNet validation set. All models use only information from the vision modality. LayoutLMv3 outperforms the compared ResNets and vision Transformer backbones.

Ablation Study

  • MLM head
    • 39.2M params
    • Vocabulary size: 50265
  • The benefit of image embeddings on text-centric tasks is surprisingly low
    • However, the authors argue that
      • Computation overhead of image encoder is also low
      • Linear: 0.6M params; ResNet-101: 44M params
    • Note that $M = 196$ visual tokens increase the input sequence length of the Transformer, which leads to increased computation
  • MIM loss may help learn features for image-centric tasks
    • 6.9M params
  • WPA loss leads to marginal improvements across all tasks
    • 0.6M params

v3-ablation

Ablation study on image embeddings and pre-training objectives on typical text-centric tasks (form and receipt understanding on FUNSD and CORD) and image-centric tasks (document image classification on RVL-CDIP and document layout analysis on PubLayNet). All models were trained at BASE size on 1 million documents for 150,000 steps with a learning rate of 3e-4.