Data augmentation FUNSD dataset & LayoutLMv3

aghenghiu · October 23, 2023, 7:20am

Hi all,
I just created my first LayoutLMv3 model for token classification on the FUNSD dataset. Now I would like to fine-tune it on my own version of the FUNSD dataset. But since I don't have enough documents, data augmentation comes to mind.

I need some guidance on this topic. Since text extraction from documents is a big part of this problem, I don’t think just any transformation of the original image is a valid way to obtain a new one (resizing, blurring, or changing background colors, to name a few, could hurt text extraction).

Is there any data augmentation technique that I could implement safely to get new valid data?

Greetings

aghenghiu · October 31, 2023, 12:31pm

@nielsr is your answer here applicable in this specific scenario?

nielsr · November 13, 2023, 10:44am

Hi,

No, since the text to be extracted will change if you use things like random cropping or flipping.

For document AI, one typically applies augmentations like the ones here: https://github.com/facebookresearch/nougat/blob/f5d2cd525979e24c01c72fe223feff2eda555a0c/nougat/transforms.py. These are things like erosion, dilation, and bitmap transformations, which preserve the content of the images.
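
To make the idea concrete, here is a minimal sketch of erosion and dilation as content-preserving augmentations. It is not the code from the linked file: it uses plain Python lists instead of an image library so it runs anywhere, and the names `erode`, `dilate`, `augment_page` and the sample dictionary layout are assumptions made for illustration. In practice you would more likely use something like OpenCV's `cv2.erode` and `cv2.dilate`.

```python
import random
from typing import Callable, Dict, List

# Grayscale page as rows of pixels: 0 = black ink, 255 = white paper.
Image = List[List[int]]


def _filter3x3(image: Image, pick: Callable[[List[int]], int]) -> Image:
    """Apply a 3x3 min or max filter; the output has the same size as the input."""
    height, width = len(image), len(image[0])
    out = []
    for y in range(height):
        row = []
        for x in range(width):
            # Collect the 3x3 neighbourhood, clipped at the image border.
            window = [
                image[ny][nx]
                for ny in range(max(0, y - 1), min(height, y + 2))
                for nx in range(max(0, x - 1), min(width, x + 2))
            ]
            row.append(pick(window))
        out.append(row)
    return out


def erode(image: Image) -> Image:
    """Min filter: dark strokes on a light page get thicker."""
    return _filter3x3(image, min)


def dilate(image: Image) -> Image:
    """Max filter: dark strokes on a light page get thinner."""
    return _filter3x3(image, max)


def augment_page(sample: Dict, rng: random.Random) -> Dict:
    """Randomly erode, dilate, or keep the image.

    Words, boxes, and labels are copied unchanged because the image size
    and the text layout stay the same (hypothetical sample layout).
    """
    op = rng.choice([erode, dilate, lambda img: img])
    return {**sample, "image": op(sample["image"])}


# Tiny 5x5 "page" with a single ink pixel in the middle.
page = [[255] * 5 for _ in range(5)]
page[2][2] = 0
sample = {
    "image": page,
    "words": ["Name:"],
    "boxes": [[1, 1, 4, 4]],
    "labels": ["QUESTION"],
}
augmented = augment_page(sample, random.Random(0))
```

Because these operations only change stroke thickness and never move or resize anything, the original word boxes and labels can be reused for the augmented image. A very strong dilation can erase thin strokes entirely, so the kernel size is worth keeping small.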
