Which is the correct bbox OCR level for LiLT: block level or word level?

This is regarding the LiLT model below.

In the above link, the author of LiLT mentions that the model is pretrained on “segment-level box”.


  1. Which kind of OCR does LiltModel assume: word-level boxes or “segment-level box”?
  2. How can I ensure that the same box level (“segment-level box” or word-level) is used for both fine-tuning and inference?
  3. Any pointers on implementing the correct OCR level using pytesseract?
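
For question 3, here is a rough sketch of what I have in mind, in case it helps frame an answer. It takes the dictionary returned by `pytesseract.image_to_data(..., output_type=Output.DICT)` and returns either word-level boxes or one shared box per text line. Treating a Tesseract line (`block_num`, `par_num`, `line_num`) as a "segment" is my own assumption, not something the LiLT author confirms, and the 0-1000 scaling follows the LayoutLM-style convention that I assume LiLT also expects. The helper names (`normalize_box`, `words_and_boxes`, `ocr_page`) are hypothetical.

```python
def normalize_box(box, width, height):
    """Scale a pixel box (x0, y0, x1, y1) to the 0-1000 range (assumed convention)."""
    x0, y0, x1, y1 = box
    return [
        int(1000 * x0 / width),
        int(1000 * y0 / height),
        int(1000 * x1 / width),
        int(1000 * y1 / height),
    ]


def words_and_boxes(data, width, height, segment_level=True):
    """Turn a pytesseract image_to_data dict into (words, normalized boxes).

    segment_level=True: every word gets the union box of its text line
    (assumption: one Tesseract line == one segment).
    segment_level=False: every word keeps its own box.
    """
    words, boxes, line_keys = [], [], []
    for i, text in enumerate(data["text"]):
        if not text.strip():
            continue  # skip block/paragraph/line rows and empty tokens
        x0, y0 = data["left"][i], data["top"][i]
        boxes.append((x0, y0, x0 + data["width"][i], y0 + data["height"][i]))
        words.append(text)
        line_keys.append(
            (data["block_num"][i], data["par_num"][i], data["line_num"][i])
        )

    if segment_level:
        merged = {}
        for key, (x0, y0, x1, y1) in zip(line_keys, boxes):
            if key in merged:
                m = merged[key]
                merged[key] = (
                    min(m[0], x0), min(m[1], y0), max(m[2], x1), max(m[3], y1)
                )
            else:
                merged[key] = (x0, y0, x1, y1)
        boxes = [merged[key] for key in line_keys]

    return words, [normalize_box(b, width, height) for b in boxes]


def ocr_page(image, segment_level=True):
    """Run Tesseract on a PIL image; use the same flag for training and inference."""
    import pytesseract  # imported lazily so the helpers above work without it

    data = pytesseract.image_to_data(image, output_type=pytesseract.Output.DICT)
    return words_and_boxes(data, image.width, image.height, segment_level)
```

My thinking is that if the same `ocr_page` call with the same `segment_level` value is used to build both the fine-tuning dataset and the inference inputs, the box level stays consistent, which is what question 2 is about. Whether that flag should be `True` or `False` for LiLT is exactly what I am unsure of.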