LayoutLMv3 for Token Classification

Hi @nielsr,

Thanks for implementing this model in the Hugging Face library!

I annotated several images using the Label Studio ML Backend for Tesseract: label-studio-ml-backend/label_studio_ml/examples/tesseract at master · heartexlabs/label-studio-ml-backend · GitHub


With this tool you draw the box with the selected label and it extracts the text for you. You can see this in the above gif.

After that, I exported the annotations and created a dataset using the bbox format expected by the model, which I saw here

Finally, I trained the model for Token Classification.

However, the model is not working well at inference time. At inference time I set the processor to apply OCR:

processor = AutoProcessor.from_pretrained("microsoft/layoutlmv3-base", apply_ocr=True)

And I just pass an image:

encoding = processor(image, truncation=True, return_tensors="pt")

The model doesn't classify the tokens well. However, if I pass the bboxes and text from my annotations, it works properly.

How is this model supposed to be used for inference? Do you need to pass the hand-drawn bboxes and text?

I want to use this model to extract information automatically and if I have to pass these annotations manually it makes no sense.

Maybe I did something wrong when labelling? Should I run the image through Tesseract and then label all the bboxes it returns instead of drawing them by hand?

I passed all images through EasyOCR and annotated all the boxes with Label Studio, following this tutorial: Label Studio Blog - Improve OCR quality for receipt processing with Tesseract and Label Studio

Then I trained the model, and at inference time I use the boxes and text from EasyOCR.


Thanks for your interest in LayoutLMv3. That labelling tool looks nice!

I'd say that you need to make sure the OCR settings are identical between training and inference; otherwise, the model will not work as expected at inference time.

For example, are you making sure that bounding boxes are provided in the appropriate format during training? I'd check things such as:

Each bounding box should be a normalized version in (x0, y0, x1, y1) format, where (x0, y0) corresponds to the position of the upper left corner in the bounding box, and (x1, y1) represents the position of the lower right corner.
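As a minimal sketch of the normalization described above: the Hugging Face documentation for the LayoutLM family describes scaling pixel coordinates to a 0-1000 range relative to the image size. The function name `normalize_box` and the clamping step are illustrative choices, not something taken from this thread.

```python
def normalize_box(box, width, height):
    """Scale a pixel-space (x0, y0, x1, y1) box to the 0-1000 range.

    (x0, y0) is the upper-left corner and (x1, y1) the lower-right corner.
    `width` and `height` are the image dimensions in pixels.
    """
    x0, y0, x1, y1 = box
    scaled = [
        int(1000 * x0 / width),
        int(1000 * y0 / height),
        int(1000 * x1 / width),
        int(1000 * y1 / height),
    ]
    # Clamp in case an OCR box slightly overshoots the image border.
    return [min(1000, max(0, v)) for v in scaled]


# Example: a box on a 1000x500 pixel image.
print(normalize_box((50, 100, 150, 200), width=1000, height=500))
```

The same function should be applied to the boxes at training time and at inference time, so that the model sees coordinates on the same scale in both cases.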

So make sure to provide the same settings for the bounding boxes during training vs inference, make sure you provide them in the right order (from top left in the document to bottom right), etc.
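One way to get a consistent top-left to bottom-right order is to group words into lines by their top coordinate and then sort each line from left to right. The sketch below assumes `(x0, y0, x1, y1)` boxes; the helper name `sort_reading_order` and the `line_tolerance` parameter are hypothetical, and real documents with multiple columns or skewed scans may need something more robust.

```python
def sort_reading_order(words, boxes, line_tolerance=10):
    """Return words and boxes sorted top-to-bottom, then left-to-right.

    Words whose top edge (y0) lies within `line_tolerance` of the first
    word of the current line are treated as part of that line.
    """
    # First pass: rough ordering by top edge, then by left edge.
    items = sorted(zip(words, boxes), key=lambda wb: (wb[1][1], wb[1][0]))

    # Second pass: group items into lines using the tolerance.
    lines = []
    for word, box in items:
        if lines and abs(box[1] - lines[-1][0][1][1]) <= line_tolerance:
            lines[-1].append((word, box))
        else:
            lines.append([(word, box)])

    # Third pass: sort each line from left to right and flatten.
    ordered = [wb for line in lines for wb in sorted(line, key=lambda wb: wb[1][0])]
    return [w for w, _ in ordered], [b for _, b in ordered]


words = ["world", "Total", "Hello"]
boxes = [[120, 12, 200, 30], [10, 50, 80, 70], [10, 10, 100, 30]]
sorted_words, sorted_boxes = sort_reading_order(words, boxes)
print(sorted_words)
```

Whatever ordering is chosen, the point from the reply above is that it should be the same one for the training data and for the inference input.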

Yes, the problem was that when I annotated the images I drew the boxes and put the text.

What I did was send all the images through EasyOCR and then annotate the resulting bounding boxes with Label Studio. The guide I followed was: Label Studio Blog - Improve OCR quality for receipt processing with Tesseract and Label Studio. The only thing I changed was replacing Tesseract with EasyOCR.

With these annotations my model worked fine at inference time; I just needed to send the image to EasyOCR first.
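A rough sketch of what that inference path could look like, assuming EasyOCR's `readtext` output (a list of `(corner_points, text, confidence)` tuples) and a processor created with `apply_ocr=False` so that it does not run its own OCR. The helper name `easyocr_to_words_and_boxes` and the file name in the comments are hypothetical, and this is not necessarily the exact code used above. Note that EasyOCR may return multi-word segments rather than single words, so the segmentation should match what was annotated for training.

```python
def easyocr_to_words_and_boxes(results, width, height):
    """Convert EasyOCR-style results to words plus 0-1000 normalized boxes.

    `results` is a list of (corner_points, text, confidence) tuples, where
    corner_points holds four [x, y] points in pixel coordinates.
    """
    words, boxes = [], []
    for quad, text, _confidence in results:
        xs = [point[0] for point in quad]
        ys = [point[1] for point in quad]
        # Collapse the four corners into an axis-aligned (x0, y0, x1, y1) box.
        x0, y0, x1, y1 = min(xs), min(ys), max(xs), max(ys)
        words.append(text)
        boxes.append([
            int(1000 * x0 / width),
            int(1000 * y0 / height),
            int(1000 * x1 / width),
            int(1000 * y1 / height),
        ])
    return words, boxes


# Intended use (not executed here; requires easyocr, Pillow and transformers):
#   reader = easyocr.Reader(["en"])
#   results = reader.readtext("receipt.png")  # hypothetical file name
#   image = Image.open("receipt.png").convert("RGB")
#   words, boxes = easyocr_to_words_and_boxes(results, *image.size)
#   processor = AutoProcessor.from_pretrained("microsoft/layoutlmv3-base", apply_ocr=False)
#   encoding = processor(image, words, boxes=boxes, truncation=True, return_tensors="pt")

sample = [([[10, 20], [110, 20], [110, 60], [10, 60]], "Total", 0.98)]
print(easyocr_to_words_and_boxes(sample, width=200, height=100))
```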

Hi @WaterKnight, for token classification, besides the labels we need to classify, don't we need one more label named "others"?

So when you used Label Studio, did you select only the text that belongs to the labels we need to classify, or did you also select the text that belongs to the "Others" label?

Just want to know. Thanks

I have the same question as @purnasai. When I train and test on a custom dataset, the model performs nicely on the test set; however, when I follow the inference guide using an image very similar to a test image, the model performs poorly. At inference, the output image has many "other" bounding boxes, and the predictions for the two label classes are wrong. I did not annotate "other" labels during labeling; I only used two classes of labels, which are inferred nicely on the test set.

When labeling training images, do I need to mark the words I'm not interested in as "other", along with the two classes that I am interested in?
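The thread does not settle this question. A common convention in token classification, though, is that every word in the input carries a label, with a catch-all such as "O" for words of no interest. Under that assumption, the "other" words do not have to be drawn by hand: if the OCR output for the whole page is kept, any word that does not fall inside an annotated region can be labeled automatically. The helper `label_words` and the "TOTAL" label below are hypothetical.

```python
def label_words(word_boxes, annotations, other_label="O"):
    """Assign a label to every OCR word box.

    `word_boxes` is a list of (x0, y0, x1, y1) boxes from the OCR engine.
    `annotations` is a list of ((x0, y0, x1, y1), label) regions drawn by hand.
    A word takes the label of the first region that contains its center;
    words outside every region get `other_label`.
    """
    labels = []
    for x0, y0, x1, y1 in word_boxes:
        cx, cy = (x0 + x1) / 2, (y0 + y1) / 2
        label = other_label
        for (ax0, ay0, ax1, ay1), name in annotations:
            if ax0 <= cx <= ax1 and ay0 <= cy <= ay1:
                label = name
                break
        labels.append(label)
    return labels


word_boxes = [[10, 10, 50, 30], [60, 10, 120, 30], [10, 100, 80, 120]]
annotations = [((55, 5, 130, 35), "TOTAL")]
print(label_words(word_boxes, annotations))
```

With this kind of labeling, the training examples contain the same mix of interesting and uninteresting words that the OCR engine will produce at inference time.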