it’s fast: it only requires a single forward pass. It’s a Transformer encoder, hence BERT-like, so all the optimizations you can apply to BERT, you can also apply to LayoutLM.
it’s highly performant when paired with a capable OCR engine, such as Microsoft’s Read API
Cons of LayoutLM:
it relies on an external OCR engine, which incurs an additional cost, and any mistakes the OCR engine makes are propagated into the model’s predictions. It’s a pipeline approach: first OCR, then a Transformer applied on top of the OCR output.
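To make the pipeline point concrete, here’s a minimal sketch (both stages are hypothetical toy stand-ins, not real OCR or LayoutLM calls) showing how an OCR mistake flows straight into the downstream prediction:

```python
# Toy sketch of the two-stage pipeline: OCR -> Transformer.
# Both stages below are hypothetical stand-ins, not real LayoutLM/OCR APIs.

def toy_ocr(image_id: str) -> list[str]:
    """Pretend OCR: misreads 'Invoice' as 'Inv0ice' for one image."""
    outputs = {"doc_good": ["Invoice", "Total:", "100"],
               "doc_bad": ["Inv0ice", "Total:", "100"]}
    return outputs[image_id]

def toy_classifier(tokens: list[str]) -> str:
    """Pretend document classifier keyed on the word 'Invoice'."""
    return "invoice" if "Invoice" in tokens else "unknown"

def pipeline(image_id: str) -> str:
    # Stage 1: OCR. Stage 2: model on top of the OCR output.
    # The downstream model never sees the pixels, so it cannot
    # recover from the OCR error.
    return toy_classifier(toy_ocr(image_id))

print(pipeline("doc_good"))  # invoice
print(pipeline("doc_bad"))   # unknown: the OCR error propagated
```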
the model typically has a maximum sequence length of 512 tokens, so it can’t be applied to PDFs containing a large amount of text, unless you use a sliding-window approach as is done with BERT
you need to deal with subword tokens, which is a bit painful
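The sliding-window workaround can be sketched in plain Python, independent of any tokenizer (the window and stride values are illustrative; Hugging Face tokenizers expose the same idea via `stride` and `return_overflowing_tokens`):

```python
def sliding_windows(token_ids, max_len=512, stride=128):
    """Split a long token sequence into overlapping windows.

    Consecutive windows overlap by `stride` tokens so that entities
    near a window boundary are fully visible in at least one window.
    """
    windows = []
    step = max_len - stride
    for start in range(0, len(token_ids), step):
        windows.append(token_ids[start:start + max_len])
        if start + max_len >= len(token_ids):
            break
    return windows

# 1000 tokens with max_len=512 and stride=128 -> 3 overlapping windows
chunks = sliding_windows(list(range(1000)), max_len=512, stride=128)
print(len(chunks))     # 3
print(len(chunks[0]))  # 512
```

Each window gets its own forward pass; predictions in the overlapping regions can then be merged (e.g. by majority vote).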
Pros of Pix2Struct:
it’s an end-to-end model: it just takes in an image and produces text as output. It doesn’t require an OCR engine and it’s not a pipeline approach, which makes it easier to deal with.
you don’t need to worry about a 512-token limit, as the model just takes an image as input.
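Why there’s no 512-token bottleneck: Pix2Struct rasterizes the page into fixed-size image patches (16×16 in the public checkpoints, if I recall correctly — treat the numbers below as assumptions), so the input budget depends on image size, not on how much text the page holds:

```python
import math

def num_patches(height, width, patch_size=16):
    """Number of fixed-size patches needed to cover an image.

    patch_size=16 is an assumed value matching the public checkpoints;
    in practice a max_patches preprocessing knob caps this count.
    """
    return math.ceil(height / patch_size) * math.ceil(width / patch_size)

# A dense A4 page scanned at ~150 DPI:
print(num_patches(1754, 1240))  # 8580 patches, however much text it holds
```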
Cons of Pix2Struct:
it’s slower, as it generates one token at a time (autoregressive generation), so inference will be slower than with the LayoutLM series. However, with all the optimizations currently being done for LLMs, those will eventually apply to Pix2Struct as well (blazingly fast tokens/sec generation). See frameworks like TGI, llama.cpp, vLLM, etc.
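The speed difference comes from the generation loop itself: one forward pass per output token, versus LayoutLM’s single pass. A toy greedy-decoding loop (with a hypothetical stand-in for the model, not real Pix2Struct code) makes the linear cost visible:

```python
def greedy_generate(prompt, next_token, eos=-1, max_new_tokens=20):
    """Greedy autoregressive decoding: one `next_token` call per token.

    `next_token` is a hypothetical stand-in for a full model forward
    pass, which is why generation cost scales with output length.
    """
    tokens = list(prompt)
    calls = 0
    for _ in range(max_new_tokens):
        tok = next_token(tokens)  # one full forward pass per token
        calls += 1
        tokens.append(tok)
        if tok == eos:
            break
    return tokens, calls

# Stand-in "model": counts up and emits EOS after reaching 5.
fake_model = lambda toks: toks[-1] + 1 if toks[-1] < 5 else -1
out, n_calls = greedy_generate([0], fake_model)
print(out, n_calls)  # [0, 1, 2, 3, 4, 5, -1] 6
```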
it’s quite a heavy model: it requires high-resolution images for training (typically > 1000 pixels per side) to produce accurate results, and the model itself is pretty big. Hence it requires quite a bit of memory for both training and inference (e.g. an A100 GPU).
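A quick back-of-the-envelope for the memory claim (the parameter counts are assumptions: the public pix2struct-base is roughly 282M parameters, pix2struct-large roughly 1.3B). Training with Adam in fp32 typically needs parameters + gradients + two optimizer moments, i.e. roughly 4x the raw parameter memory, before counting activations:

```python
def param_memory_gb(n_params, bytes_per_param=4):
    """Raw parameter memory in GB (fp32 by default)."""
    return n_params * bytes_per_param / 1e9

def adam_training_memory_gb(n_params, bytes_per_param=4):
    """Rough fp32 Adam footprint: params + grads + 2 optimizer moments.

    Ignores activations, which dominate at high image resolutions.
    """
    return 4 * param_memory_gb(n_params, bytes_per_param)

# Assumed parameter counts for the public checkpoints:
for name, n in [("pix2struct-base", 282_000_000),
                ("pix2struct-large", 1_300_000_000)]:
    print(name, round(param_memory_gb(n), 1), "GB weights,",
          round(adam_training_memory_gb(n), 1), "GB to train (before activations)")
```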