How to change the text embedder (LayoutLMv2Tokenizer) in the LayoutLMv2 model?

Hi, I’m a beginner with Hugging Face Transformers.
I’m wondering how to change the text embedding model in LayoutLMv2 (from the original to KoBERT).

I know there’s a processor for LayoutLMv2/LayoutXLM.
So I think I need to change the text tokenizer for data loading, and replace the text embedding weights (in the original LayoutLMv2 model) with KoBERT’s, as in the code below.

from transformers import (
    BertModel,
    LayoutLMv2FeatureExtractor,
    LayoutLMv2ForTokenClassification,
    LayoutLMv2Processor,
)
# KoBertTokenizer comes from the monologg/KoBERT-Transformers repo
from tokenization_kobert import KoBertTokenizer

kobert_name = "monologg/kobert"
bert_model = BertModel.from_pretrained(kobert_name)
kobert_tokenizer = KoBertTokenizer.from_pretrained(kobert_name)

feature_extractor = LayoutLMv2FeatureExtractor(apply_ocr=False)
processor = LayoutLMv2Processor(feature_extractor, kobert_tokenizer)
# Raises:
# ValueError: Received a KoBertTokenizer for argument tokenizer,
# but a ('LayoutLMv2Tokenizer', 'LayoutLMv2TokenizerFast') was expected.

model = LayoutLMv2ForTokenClassification.from_pretrained(
    "microsoft/layoutxlm-base", num_labels=len(labels)
)
# Need to replace layoutlmv2.embeddings with KoBERT's parameters (weights)
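If you still want to experiment with copying the weights over, the mechanics look roughly like this. This is a toy sketch using plain `nn.Embedding` stand-ins rather than the real checkpoints, and the vocabulary sizes are assumptions:

```python
import torch
from torch import nn

# Stand-ins for the two word-embedding tables (sizes are assumptions):
kobert_vocab, hidden = 8002, 768   # assumed KoBERT vocab size and hidden dim
layoutlmv2_vocab = 30522           # assumed original LayoutLMv2 vocab size

kobert_word_emb = nn.Embedding(kobert_vocab, hidden)    # stands in for bert_model.embeddings.word_embeddings
lmv2_word_emb = nn.Embedding(layoutlmv2_vocab, hidden)  # stands in for model.layoutlmv2.embeddings.word_embeddings

# 1) Resize the LayoutLMv2 word-embedding table to KoBERT's vocab size.
#    (With the real model you would call model.resize_token_embeddings(len(kobert_tokenizer)).)
lmv2_word_emb = nn.Embedding(kobert_vocab, hidden)

# 2) Copy KoBERT's pretrained weights in, without tracking gradients.
with torch.no_grad():
    lmv2_word_emb.weight.copy_(kobert_word_emb.weight)

assert torch.equal(lmv2_word_emb.weight, kobert_word_emb.weight)
```

Note that this only touches the word embeddings; LayoutLMv2's position and 2D layout embeddings (and every transformer layer above them) were pretrained against the original vocabulary, so copying the table alone is unlikely to give good results on its own.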

But I got an error at the line defining the processor… I think LayoutLMv2Processor only accepts the original LayoutLMv2 tokenizers.
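One possible workaround (an untested sketch, not an official API) is to skip LayoutLMv2Processor and do its work yourself: run the feature extractor on the image, tokenize with your own tokenizer, and expand each word-level bounding box to every subword token of that word. The box expansion is plain Python if the tokenizer can report which word each token came from (fast tokenizers expose this as `word_ids()`):

```python
def align_boxes(word_ids, word_boxes, pad_box=(0, 0, 0, 0)):
    """Give each subword token the bounding box of its source word.

    word_ids   -- one entry per token: the word index, or None for special tokens
    word_boxes -- one [x0, y0, x1, y1] box per original word
    """
    return [list(pad_box) if i is None else list(word_boxes[i]) for i in word_ids]

# Toy example: "[CLS] anno ##tated text [SEP]" over two words.
word_ids = [None, 0, 0, 1, None]
word_boxes = [[10, 10, 50, 20], [60, 10, 90, 20]]
print(align_boxes(word_ids, word_boxes))
# → [[0, 0, 0, 0], [10, 10, 50, 20], [10, 10, 50, 20], [60, 10, 90, 20], [0, 0, 0, 0]]
```

KoBertTokenizer is a slow (non-Rust) tokenizer, so it has no `word_ids()`; you would have to tokenize word by word and track the word-to-token mapping yourself.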

I’m using this code

What should I modify to change the text embedding?
Please share any tips for a beginner. :smile:

Hi, were you able to find a solution? I think LayoutLMv2 only accepts its own tokenizer and feature extractor. I was trying something similar. Just wanted to know whether you found a solution.

I don’t think you can just swap the text embedding layer, unfortunately.

However, I’ll be working on a new model that will make it easy to use any text encoder from the Hub (hence making it possible to have a LayoutLM-like model for any language): GitHub - jpWang/LiLT: Official PyTorch implementation of LiLT: A Simple yet Effective Language-Independent Layout Transformer for Structured Document Understanding (ACL 2022)

Thank you. That would be helpful. Nice repo.