Pre-Train LayoutLM

Hello,

We are using the pretrained LayoutLM model, which works well, but only for English. We have many forms and invoices in other languages.

How can I pre-train the LayoutLM model on my own corpus?

Thank you.

Hi sharathmk99,

The LayoutLM model is not currently available in the Hugging Face transformers library. If you want to add it yourself, that should be possible (though not simple). Alternatively, you could put in a suggestion and hope that someone else incorporates it.

If you decide instead to pre-train a LayoutLM model using native TensorFlow or native PyTorch, the first question is whether you have enough data. How large is your corpus?

If your corpus is not large enough, you might be better off using a different model that has been pre-trained for the language(s) you need.

Do you definitely want to pre-train (from randomly-initialized weights), or would fine-tuning work? I don't know what results people get when fine-tuning with a new language. I expect it would not work at all if the alphabet is different, but it might be at least partly effective if the languages are quite similar (e.g. English and French, which share almost the same alphabet and many of the same word-pieces).
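
One quick way to gauge that overlap is to run some target-language text through LayoutLM's tokenizer and see how much of it falls back to [UNK]. A rough sketch (the sample strings are just made-up invoice text):

```python
from transformers import LayoutLMTokenizer

# LayoutLM uses a WordPiece vocabulary (same style as BERT's uncased vocab).
tokenizer = LayoutLMTokenizer.from_pretrained("microsoft/layoutlm-base-uncased")

def unk_rate(text: str) -> float:
    """Fraction of tokens mapped to [UNK] -- a rough proxy for how well
    the English vocabulary covers another language."""
    ids = tokenizer(text, add_special_tokens=False)["input_ids"]
    return ids.count(tokenizer.unk_token_id) / max(len(ids), 1)

print(unk_rate("Facture numéro 42, montant total : 130,50 EUR"))  # French
print(unk_rate("請求書番号 42、合計金額 130,50"))  # Japanese
```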

Hi @rgwatwormhill,

LayoutLM is available in the Hugging Face transformers library, right?
Link: https://huggingface.co/transformers/model_doc/layoutlm.html

Hi @rgwatwormhill,

I’m planning to use the https://www.cs.cmu.edu/~aharley/rvl-cdip/ dataset, adding my own domain data, and to train a multilingual model covering multiple languages.

hi @sharathmk99,

Sorry, I must have been looking at an old version of the documentation. You are correct; LayoutLM is clearly present on the page you’ve linked.

(It doesn’t seem to be on the model summary page https://huggingface.co/transformers/model_summary.html , but there are two options on the pretrained-models page https://huggingface.co/transformers/pretrained_models.html ).

To train from scratch, you need to start by defining your model from the LayoutLM config. Have you read the Training and Fine-Tuning page https://huggingface.co/transformers/training.html ?
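
For what it's worth, a minimal sketch of that first step (default hyperparameters; a multilingual corpus would need its own tokenizer and a matching `vocab_size`):

```python
from transformers import LayoutLMConfig, LayoutLMForMaskedLM

# Default LayoutLM-base hyperparameters; for a multilingual corpus you would
# train your own tokenizer first, e.g. LayoutLMConfig(vocab_size=50000).
config = LayoutLMConfig()
model = LayoutLMForMaskedLM(config)  # randomly initialized, ready to pre-train
```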

The amount of data you have available will be important.

Hi @rgwatwormhill,

Thank you for your response.
The Training and Fine-Tuning page doesn’t show how to pre-train the model; it only shows how to fine-tune it.
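
As far as I can tell, pre-training would amount to masked-LM training with LayoutLM's 2D box inputs added. Is something like this sketch the right idea? (The `words`/`boxes` below are placeholder OCR output, and the masking is deliberately crude.)

```python
import torch
from transformers import LayoutLMConfig, LayoutLMForMaskedLM, LayoutLMTokenizer

tokenizer = LayoutLMTokenizer.from_pretrained("microsoft/layoutlm-base-uncased")
model = LayoutLMForMaskedLM(LayoutLMConfig())  # randomly initialized weights

# Placeholder OCR output: word strings plus boxes normalized to a 0-1000 grid.
words = ["Invoice", "No.", "42"]
boxes = [[60, 50, 160, 70], [170, 50, 210, 70], [220, 50, 250, 70]]

# The tokenizer is plain WordPiece, so each word's box must be repeated
# for every sub-token the word splits into.
token_ids, token_boxes = [], []
for word, box in zip(words, boxes):
    ids = tokenizer(word, add_special_tokens=False)["input_ids"]
    token_ids.extend(ids)
    token_boxes.extend([box] * len(ids))

input_ids = [tokenizer.cls_token_id] + token_ids + [tokenizer.sep_token_id]
token_boxes = [[0, 0, 0, 0]] + token_boxes + [[1000, 1000, 1000, 1000]]

input_ids = torch.tensor([input_ids])
bbox = torch.tensor([token_boxes])

# Crude 15% masking; the real objective also uses BERT's 80/10/10
# replacement scheme.
labels = input_ids.clone()
mask = torch.rand(input_ids.shape) < 0.15
input_ids = input_ids.masked_fill(mask, tokenizer.mask_token_id)
labels[~mask] = -100  # loss is only computed on masked positions

loss = model(input_ids=input_ids, bbox=bbox, labels=labels).loss
loss.backward()  # one step of what would be a full training loop
```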

Thank you!

Hi @sharathmk99 ,

By any chance have you tried fine-tuning LayoutLM? If so, did you use the transformers library or the official source?

Thanks!

@tuner007 we were able to fine-tune the model using the sample notebook code in Colab. We had some trouble getting it to run on other platforms, though.
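
In case it helps others, the core training step looks roughly like this (dummy tensors stand in for a real preprocessed batch; `num_labels=7` is just an example, B/I tags for three entity types plus O):

```python
import torch
from transformers import LayoutLMForTokenClassification

# Loads the pretrained encoder and adds a randomly initialized
# token-classification head on top.
model = LayoutLMForTokenClassification.from_pretrained(
    "microsoft/layoutlm-base-uncased", num_labels=7
)
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)

# Dummy batch: a fixed valid bounding box repeated for every token.
batch_size, seq_len = 2, 32
input_ids = torch.randint(0, model.config.vocab_size, (batch_size, seq_len))
bbox = torch.tensor([100, 100, 200, 120]).repeat(batch_size, seq_len, 1)
attention_mask = torch.ones(batch_size, seq_len, dtype=torch.long)
labels = torch.randint(0, 7, (batch_size, seq_len))

loss = model(input_ids=input_ids, bbox=bbox,
             attention_mask=attention_mask, labels=labels).loss
loss.backward()
optimizer.step()
```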