Pre-Train LayoutLM

Hello,

We are using the pretrained LayoutLM model, which works well, but only for English. We have many forms and invoices in other languages.

How can I pre-train the LayoutLM model on my own corpus?

Thank you.

Hi sharathmk99,

The LayoutLM model is not currently available in the Hugging Face transformers library. If you want to add it yourself, that should be possible (though not simple). Alternatively, you could put in a suggestion and hope that someone else incorporates it.

If you decide instead to pre-train a LayoutLM model using native TensorFlow or native PyTorch, the first question is whether you have enough data. How large is your corpus?

If your corpus is not large enough, you might be better off using a different model that has been pre-trained for the language(s) you need.

Do you definitely want to pre-train (from randomly-initialized weights), or would fine-tuning work? I don't know what results people get when fine-tuning with a new language. I expect it would not work at all if the alphabet is different, but it might be at least partly effective if the languages are quite similar (e.g. English and French, which share almost the same alphabet and many of the same word-pieces).
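
One quick way to gauge that overlap is to run some target-language text through LayoutLM's tokenizer and see how much of it falls back to [UNK]. A rough sketch (the sample strings are just made-up invoice text):

```python
from transformers import LayoutLMTokenizer

# LayoutLM uses a WordPiece vocabulary (same style as BERT's uncased vocab).
tokenizer = LayoutLMTokenizer.from_pretrained("microsoft/layoutlm-base-uncased")

def unk_rate(text: str) -> float:
    """Fraction of tokens mapped to [UNK] -- a rough proxy for how well
    the English vocabulary covers another language."""
    ids = tokenizer(text, add_special_tokens=False)["input_ids"]
    return ids.count(tokenizer.unk_token_id) / max(len(ids), 1)

print(unk_rate("Facture numéro 42, montant total : 130,50 EUR"))  # French
print(unk_rate("請求書番号 42、合計金額 130,50"))  # Japanese
```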

Hi @rgwatwormhill,

LayoutLM is available in the Hugging Face transformers library, right?
Link: https://huggingface.co/transformers/model_doc/layoutlm.html

Hi @rgwatwormhill,

I’m planning to use the https://www.cs.cmu.edu/~aharley/rvl-cdip/ dataset, adding my own domain data, and to train a multilingual model covering multiple languages.

hi @sharathmk99,

Sorry, I must have been looking at an old version of the documentation. You are correct; LayoutLM is clearly present on the page you’ve linked.

(It doesn’t seem to be on the model summary page https://huggingface.co/transformers/model_summary.html , but there are two options on the pretrained-models page https://huggingface.co/transformers/pretrained_models.html ).

To train from scratch, you need to start by defining your model from the LayoutLM config. Have you read the Training and Fine-Tuning page https://huggingface.co/transformers/training.html ?
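
For what it's worth, a minimal sketch of that first step (default hyperparameters; a multilingual corpus would need its own tokenizer and a matching `vocab_size`):

```python
from transformers import LayoutLMConfig, LayoutLMForMaskedLM

# Default LayoutLM-base hyperparameters; for a multilingual corpus you would
# train your own tokenizer first, e.g. LayoutLMConfig(vocab_size=50000).
config = LayoutLMConfig()
model = LayoutLMForMaskedLM(config)  # randomly initialized, ready to pre-train
```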

The amount of data you have available will be important.

Hi @rgwatwormhill,

Thank you for your response.
The Training and Fine-Tuning page doesn’t show how to pre-train the model; it only shows how to fine-tune it.
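
As far as I can tell, pre-training would amount to masked-LM training with LayoutLM's 2D box inputs added. Is something like this sketch the right idea? (The `words`/`boxes` below are placeholder OCR output, and the masking is deliberately crude.)

```python
import torch
from transformers import LayoutLMConfig, LayoutLMForMaskedLM, LayoutLMTokenizer

tokenizer = LayoutLMTokenizer.from_pretrained("microsoft/layoutlm-base-uncased")
model = LayoutLMForMaskedLM(LayoutLMConfig())  # randomly initialized weights

# Placeholder OCR output: word strings plus boxes normalized to a 0-1000 grid.
words = ["Invoice", "No.", "42"]
boxes = [[60, 50, 160, 70], [170, 50, 210, 70], [220, 50, 250, 70]]

# The tokenizer is plain WordPiece, so each word's box must be repeated
# for every sub-token the word splits into.
token_ids, token_boxes = [], []
for word, box in zip(words, boxes):
    ids = tokenizer(word, add_special_tokens=False)["input_ids"]
    token_ids.extend(ids)
    token_boxes.extend([box] * len(ids))

input_ids = [tokenizer.cls_token_id] + token_ids + [tokenizer.sep_token_id]
token_boxes = [[0, 0, 0, 0]] + token_boxes + [[1000, 1000, 1000, 1000]]

input_ids = torch.tensor([input_ids])
bbox = torch.tensor([token_boxes])

# Crude 15% masking; the real objective also uses BERT's 80/10/10
# replacement scheme.
labels = input_ids.clone()
mask = torch.rand(input_ids.shape) < 0.15
input_ids = input_ids.masked_fill(mask, tokenizer.mask_token_id)
labels[~mask] = -100  # loss is only computed on masked positions

loss = model(input_ids=input_ids, bbox=bbox, labels=labels).loss
loss.backward()  # one step of what would be a full training loop
```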

Thank you!

Hi @sharathmk99 ,

By any chance have you tried fine-tuning LayoutLM? If so, did you use the transformers library or the official source?

Thanks!

@tuner007 we were able to fine-tune the model using the sample notebook code in Colab. We had some trouble getting it to run on other platforms, though.
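
In case it helps others, the core training step looks roughly like this (dummy tensors stand in for a real preprocessed batch; `num_labels=7` is just an example, B/I tags for three entity types plus O):

```python
import torch
from transformers import LayoutLMForTokenClassification

# Loads the pretrained encoder and adds a randomly initialized
# token-classification head on top.
model = LayoutLMForTokenClassification.from_pretrained(
    "microsoft/layoutlm-base-uncased", num_labels=7
)
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)

# Dummy batch: a fixed valid bounding box repeated for every token.
batch_size, seq_len = 2, 32
input_ids = torch.randint(0, model.config.vocab_size, (batch_size, seq_len))
bbox = torch.tensor([100, 100, 200, 120]).repeat(batch_size, seq_len, 1)
attention_mask = torch.ones(batch_size, seq_len, dtype=torch.long)
labels = torch.randint(0, 7, (batch_size, seq_len))

loss = model(input_ids=input_ids, bbox=bbox,
             attention_mask=attention_mask, labels=labels).loss
loss.backward()
optimizer.step()
```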