How to change the text embedder (LayoutLMv2Tokenizer) in the LayoutLMv2 model?

yellowjs0304 · May 2, 2022, 7:09am

Hi, I’m a beginner with Hugging Face Transformers.
I’d like to know how to change the text embedding model in LayoutLMv2 (from the original to KoBERT).

I know there are processors for LayoutLMv2 and LayoutXLM.
So I think I need to change the text tokenizer used for data loading, and replace the text embedding weights in the original LayoutLMv2 model with KoBERT’s, as in the code below.

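from transformers import (
    BertModel,
    LayoutLMv2FeatureExtractor,
    LayoutLMv2ForTokenClassification,
    LayoutLMv2Processor,
)
# Assumption: KoBertTokenizer is imported from a local tokenization_kobert.py (the post does not show its import)
from tokenization_kobert import KoBertTokenizer

kobert_name = "monologg/kobert"
bert_model = BertModel.from_pretrained(kobert_name)
kobert_tokenizer = KoBertTokenizer.from_pretrained(kobert_name)

feature_extractor = LayoutLMv2FeatureExtractor(apply_ocr=False)
# The next line raises:
# ValueError: Received a KoBertTokenizer for argument tokenizer, but a ('LayoutLMv2Tokenizer', 'LayoutLMv2TokenizerFast') was expected.
processor = LayoutLMv2Processor(feature_extractor, kobert_tokenizer)

# `labels` is the list of label names (not shown here)
model = LayoutLMv2ForTokenClassification.from_pretrained("microsoft/layoutxlm-base", num_labels=len(labels))
# Still needed: replace the layoutlmv2.embeddings weights with KoBERT's parameters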

But I get an error at the line where the processor is defined. I think the original LayoutLMv2Processor only accepts its own tokenizers.
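
For readers following along, the error message suggests that the processor checks the class of the tokenizer it receives. Below is a minimal sketch of that kind of check, assuming a simple class-name comparison. `ToyProcessor`, `try_build`, and the two empty tokenizer classes are hypothetical stand-ins written for illustration; this is not the library's actual code.

```python
# Simplified stand-in for the type check behind the ValueError (illustration only).
class LayoutLMv2Tokenizer:  # hypothetical placeholder, not the real tokenizer
    pass


class KoBertTokenizer:  # hypothetical placeholder, not the real tokenizer
    pass


class ToyProcessor:
    # Tokenizer class names this toy processor accepts (taken from the error message).
    expected_tokenizers = ("LayoutLMv2Tokenizer", "LayoutLMv2TokenizerFast")

    def __init__(self, feature_extractor, tokenizer):
        name = type(tokenizer).__name__
        if name not in self.expected_tokenizers:
            raise ValueError(
                f"Received a {name} for argument tokenizer, "
                f"but a {self.expected_tokenizers} was expected."
            )
        self.feature_extractor = feature_extractor
        self.tokenizer = tokenizer


def try_build(tokenizer):
    """Return 'ok' if the toy processor accepts the tokenizer, else the error text."""
    try:
        ToyProcessor(feature_extractor=None, tokenizer=tokenizer)
        return "ok"
    except ValueError as err:
        return str(err)


print(try_build(LayoutLMv2Tokenizer()))
print(try_build(KoBertTokenizer()))
```

If the real check works along these lines, passing a different tokenizer class fails before any data is processed, which matches the error above.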

I’m using this code

What should I modify to change the text embedding?
Please share any tips for a beginner.

purnasai · September 29, 2022, 6:54am

Hi, were you able to find a solution? I think LayoutLMv2 only accepts its own tokenizer and feature extractor. I was also trying something similar and wanted to know whether you found a solution.

nielsr · September 29, 2022, 7:38am

I don’t think you can just swap the text embedding layer, unfortunately.
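
One likely reason, offered here as an assumption rather than something stated in this reply, is that an embedding table is tied to the vocabulary it was trained with, so token ids from another tokenizer point at the wrong rows. The toy sketch below uses made-up vocabularies and vectors (`layout_vocab`, `kobert_vocab`, `layout_embeddings`, and `embed` are all hypothetical) to show the mismatch.

```python
# Toy vocabularies: the same word gets a different id in each tokenizer (made-up values).
layout_vocab = {"[PAD]": 0, "invoice": 1, "total": 2}
kobert_vocab = {"[PAD]": 0, "total": 1, "invoice": 2, "합계": 3}

# Toy embedding table trained with layout_vocab: row i belongs to the token with id i.
layout_embeddings = [
    [0.0, 0.0],  # [PAD]
    [1.0, 0.0],  # invoice
    [0.0, 1.0],  # total
]


def embed(word, vocab, table):
    """Look up a word's vector using the given vocabulary and embedding table."""
    token_id = vocab[word]
    if token_id >= len(table):
        raise IndexError(f"id {token_id} for {word!r} is outside the embedding table")
    return table[token_id]


# Matching vocabulary and table: "invoice" gets its own vector.
matched = embed("invoice", layout_vocab, layout_embeddings)
# Mismatched vocabulary: "invoice" silently gets the vector trained for "total".
mismatched = embed("invoice", kobert_vocab, layout_embeddings)

print(matched)
print(mismatched)
```

A word that exists only in the second vocabulary, such as "합계" here, has no row at all in the original table. Swapping in the other model's embedding table would fix the lookup, but the remaining layers were trained against the original embeddings, so they would probably need further training as well.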

However, I’ll work on a new model that will make it easy to use any text encoder from the hub (hence making it possible to have a LayoutLM-like model for any language): GitHub - jpWang/LiLT: Official PyTorch implementation of LiLT: A Simple yet Effective Language-Independent Layout Transformer for Structured Document Understanding (ACL 2022)

purnasai · September 29, 2022, 4:03pm

Thank you. That would be helpful. Nice Repo.
