it’s fast: it only requires a single forward pass. It’s a Transformer encoder, hence BERT-like, so all the optimizations you can apply to BERT, you can also apply to LayoutLM.
it’s highly performant when paired with a capable OCR engine, such as Microsoft’s Read API
Cons of LayoutLM:
it relies on an external OCR engine, which incurs an additional cost, and any mistakes the OCR engine makes are propagated into the model’s predictions. It’s a pipeline approach: first OCR, then a Transformer applied on top of the OCR output.
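To make the pipeline point concrete, here’s a minimal sketch (both stages are hypothetical toy stand-ins, not real OCR or LayoutLM calls) showing how an OCR mistake flows straight into the downstream prediction:

```python
# Toy sketch of the two-stage pipeline: OCR -> Transformer.
# Both stages below are hypothetical stand-ins, not real LayoutLM/OCR APIs.

def toy_ocr(image_id: str) -> list[str]:
    """Pretend OCR: misreads 'Invoice' as 'Inv0ice' for one image."""
    outputs = {"doc_good": ["Invoice", "Total:", "100"],
               "doc_bad": ["Inv0ice", "Total:", "100"]}
    return outputs[image_id]

def toy_classifier(tokens: list[str]) -> str:
    """Pretend document classifier keyed on the word 'Invoice'."""
    return "invoice" if "Invoice" in tokens else "unknown"

def pipeline(image_id: str) -> str:
    # Stage 1: OCR. Stage 2: model on top of the OCR output.
    # The downstream model never sees the pixels, so it cannot
    # recover from the OCR error.
    return toy_classifier(toy_ocr(image_id))

print(pipeline("doc_good"))  # invoice
print(pipeline("doc_bad"))   # unknown: the OCR error propagated
```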
the model typically has a maximum sequence length of 512 tokens, so it can’t be applied to PDFs containing a large amount of text, unless you use a sliding-window approach as is done with BERT
you need to deal with subword tokens, which is a bit painful
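The sliding-window workaround can be sketched in plain Python, independent of any tokenizer (the window and stride values are illustrative; Hugging Face tokenizers expose the same idea via `stride` and `return_overflowing_tokens`):

```python
def sliding_windows(token_ids, max_len=512, stride=128):
    """Split a long token sequence into overlapping windows.

    Consecutive windows overlap by `stride` tokens so that entities
    near a window boundary are fully visible in at least one window.
    """
    windows = []
    step = max_len - stride
    for start in range(0, len(token_ids), step):
        windows.append(token_ids[start:start + max_len])
        if start + max_len >= len(token_ids):
            break
    return windows

# 1000 tokens with max_len=512 and stride=128 -> 3 overlapping windows
chunks = sliding_windows(list(range(1000)), max_len=512, stride=128)
print(len(chunks))     # 3
print(len(chunks[0]))  # 512
```

Each window gets its own forward pass; predictions in the overlapping regions can then be merged (e.g. by majority vote).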
Pros of Pix2Struct:
it’s an end-to-end model: it just takes in an image and produces text as output. It doesn’t require an OCR engine and it’s not a pipeline approach, which makes it easier to deal with.
you don’t need to worry about a 512-token limit, as the model just takes an image as input.
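Why there’s no 512-token bottleneck: Pix2Struct rasterizes the page into fixed-size image patches (16×16 in the public checkpoints, if I recall correctly — treat the numbers below as assumptions), so the input budget depends on image size, not on how much text the page holds:

```python
import math

def num_patches(height, width, patch_size=16):
    """Number of fixed-size patches needed to cover an image.

    patch_size=16 is an assumed value matching the public checkpoints;
    in practice a max_patches preprocessing knob caps this count.
    """
    return math.ceil(height / patch_size) * math.ceil(width / patch_size)

# A dense A4 page scanned at ~150 DPI:
print(num_patches(1754, 1240))  # 8580 patches, however much text it holds
```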
Cons of Pix2Struct:
it’s slower, as it generates one token at a time (autoregressive generation), so inference will be slower than with the LayoutLM series. However, with all the optimizations currently being done for LLMs, those will eventually apply to Pix2Struct as well (blazingly fast tokens/sec generation). See frameworks like TGI, llama.cpp, vLLM, etc.
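The speed difference comes from the generation loop itself: one forward pass per output token, versus LayoutLM’s single pass. A toy greedy-decoding loop (with a hypothetical stand-in for the model, not real Pix2Struct code) makes the linear cost visible:

```python
def greedy_generate(prompt, next_token, eos=-1, max_new_tokens=20):
    """Greedy autoregressive decoding: one `next_token` call per token.

    `next_token` is a hypothetical stand-in for a full model forward
    pass, which is why generation cost scales with output length.
    """
    tokens = list(prompt)
    calls = 0
    for _ in range(max_new_tokens):
        tok = next_token(tokens)  # one full forward pass per token
        calls += 1
        tokens.append(tok)
        if tok == eos:
            break
    return tokens, calls

# Stand-in "model": counts up and emits EOS after reaching 5.
fake_model = lambda toks: toks[-1] + 1 if toks[-1] < 5 else -1
out, n_calls = greedy_generate([0], fake_model)
print(out, n_calls)  # [0, 1, 2, 3, 4, 5, -1] 6
```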
it’s quite a heavy model: it requires high-resolution images for training (typically > 1000 pixels per side) to produce accurate results, and the model itself is pretty big. Hence it requires quite a bit of memory for both training and inference (e.g. an A100 GPU).
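A quick back-of-the-envelope for the memory claim (the parameter counts are assumptions: the public pix2struct-base is roughly 282M parameters, pix2struct-large roughly 1.3B). Training with Adam in fp32 typically needs parameters + gradients + two optimizer moments, i.e. roughly 4x the raw parameter memory, before counting activations:

```python
def param_memory_gb(n_params, bytes_per_param=4):
    """Raw parameter memory in GB (fp32 by default)."""
    return n_params * bytes_per_param / 1e9

def adam_training_memory_gb(n_params, bytes_per_param=4):
    """Rough fp32 Adam footprint: params + grads + 2 optimizer moments.

    Ignores activations, which dominate at high image resolutions.
    """
    return 4 * param_memory_gb(n_params, bytes_per_param)

# Assumed parameter counts for the public checkpoints:
for name, n in [("pix2struct-base", 282_000_000),
                ("pix2struct-large", 1_300_000_000)]:
    print(name, round(param_memory_gb(n), 1), "GB weights,",
          round(adam_training_memory_gb(n), 1), "GB to train (before activations)")
```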