LayoutLMv2 token classification on documents with more than 512 tokens

Hello everyone, I am trying to fine-tune/create a LayoutLMv2 model for documents with more than 512 tokens. I have tried the following, but it is not working:

Initializing the tokenizer and LayoutLMv2 from scratch:

That is how I am initializing the tokenizer and model. I am training on 50 data instances, but the training-loss-per-epoch curve clearly shows overfitting: the loss drops very steeply.
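For reference, a minimal sketch of what such a from-scratch initialization might look like (this is an assumption, not the poster's actual code; the 5-label task is illustrative):

```python
from transformers import LayoutLMv2Config

# Assumption: an illustrative 5-label token-classification task.
# A randomly initialized model for >512-token documents needs a config
# whose position-embedding table is large enough:
config = LayoutLMv2Config(
    max_position_embeddings=1024,  # default is 512
    num_hidden_layers=12,          # bumping these (e.g. to 24/16) quickly
    num_attention_heads=12,        # exhausts Colab GPU memory
    num_labels=5,
)
# model = LayoutLMv2ForTokenClassification(config)  # note: building the
# model also constructs its visual backbone, which requires detectron2
```

A model built from such a config starts with random weights, so with only 50 training documents it has no pretrained knowledge to fall back on, which is consistent with the overfitting described above.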

I wanted to change num_hidden_layers=24 and num_attention_heads=16, but on Google Colab that throws a CUDA out-of-memory error.

I want to know whether I am doing it right or missing something. Before I move to SageMaker to train the model with num_hidden_layers=24 and num_attention_heads=16 on a bigger GPU, I want to make sure I am on the right track. Looking forward to your helpful responses.

Hi @navdeep, were you able to find a solution? What workaround did you follow to solve this?

I have tried LayoutLMv3 and changed the model's position-embedding layer from 512 to 1024. I was able to train, but not able to load the model at inference time.
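The 512-to-1024 resize described above can be sketched in plain PyTorch (a hypothetical helper, not LayoutLMv3's actual API; the tiling initialization is one assumption among several possible ones). The inference-time load failure is typically because the saved state dict has a 1024-row position table while the default config rebuilds a 512-row model, so the enlarged config must be used again before loading the weights.

```python
import torch
import torch.nn as nn

def extend_position_embeddings(old_emb: nn.Embedding, new_size: int) -> nn.Embedding:
    """Hypothetical helper: copy a pretrained position-embedding table
    into a larger one so positions beyond the old limit have embeddings."""
    old_size, dim = old_emb.weight.shape
    new_emb = nn.Embedding(new_size, dim)
    with torch.no_grad():
        new_emb.weight[:old_size] = old_emb.weight
        # Assumption: tile the pretrained rows to initialize new positions.
        for i in range(old_size, new_size):
            new_emb.weight[i] = old_emb.weight[i % old_size]
    return new_emb
```

At inference time, the model must be rebuilt with the same enlarged `max_position_embeddings` (and the embedding swapped in the same way) before calling `load_state_dict`, otherwise the 1024-row checkpoint will not fit the 512-row default architecture.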

Hi @purnasai, I am still implementing a solution based on this issue: more than 512 tokens · Issue #23 · NielsRogge/Transformers-Tutorials (although that issue revolves around just the text, I am trying to extend it to images). I haven't tried LayoutLMv3, but I will try it and keep this thread posted. Thanks.

Hi @navdeep, I can think of 4 workarounds here.

  1. Replacing the existing tokenizer with another tokenizer that can handle 1024 or 2048 tokens.
  2. Using the stride option in the processor, with a collate function, to feed in data longer than 512 tokens.
  3. Replacing 512 with 1024 in the model architecture.
  4. Cropping the image so each piece contains at most 512 tokens (i.e. cropping the image into two parts).
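Workaround 2 relies on a sliding window: the document is split into overlapping chunks of at most 512 tokens, with `stride` tokens repeated between consecutive chunks so no entity is cut blindly at a boundary. A minimal pure-Python sketch of that windowing logic (in practice the processor's stride/overflow options do this for you; names here are illustrative):

```python
def chunk_with_stride(tokens, max_len=512, stride=128):
    """Split a long token list into overlapping windows of at most
    `max_len` tokens, where consecutive windows share `stride` tokens."""
    chunks = []
    step = max_len - stride  # how far the window advances each time
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + max_len])
        if start + max_len >= len(tokens):
            break  # the last window already reaches the end
    return chunks
```

Each chunk is then fed to the model as an independent 512-token example; predictions on the overlapping region can be deduplicated afterwards.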

I have workaround 1 described here: LayoutLMV3 Training with Morethan 512 tokens. · Issue #19190 · huggingface/transformers · GitHub

Sorry for the delayed reply, @purnasai. I actually tried using a custom tokenizer, a custom processor, and a new custom model built from a new configuration with max sequence length = 1024. The model was able to detect more than 1024 tokens, but with its internal architecture of 12 hidden layers and 12 attention heads it was giving me bad accuracy. As I had very little data to train on, I will not say that this solution won't work for others. If I try to change the model's internal architecture through the new config object (for example, attention heads to 24 and hidden layers to 16), PyTorch runs out of memory.

Current situation: I will try the same scenario with LayoutLMv3 and see if that works (90% chance it won't).

@purnasai, can you please tell me whether you are initializing the model from base-uncased for downstream training, or from scratch using a custom configuration object?

Hi @navdeep, using a custom tokenizer, processor, and custom model increases the complexity of the use case. On top of that, you are changing the attention heads and hidden layers, so the weights have to be learned from the beginning, which increases computation time and runs out of memory. Since, as you said, you do not have much data to train on, the above process is not a good approach, I would say.

I am using base-uncased, as I want to make use of the pretrained weights for the downstream task.

Thanks for the info, @purnasai. I have resolved it using the return_overflowing_tokens parameter in the processor and writing a custom __getitem__ function.
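For anyone landing here later, a sketch of one way that resolution can look (assumptions: `processor` is a LayoutLMv2Processor, each example is an (image, words, boxes, word_labels) tuple, and each document's overflow chunks are flattened into independent training items; the class and parameter choices are illustrative, not the poster's actual code):

```python
import torch
from torch.utils.data import Dataset

class ChunkedDocDataset(Dataset):
    """Sketch: expand each long document into overlapping 512-token
    chunks via return_overflowing_tokens, and serve each chunk as its
    own training example."""

    def __init__(self, examples, processor, max_length=512, stride=128):
        self.items = []
        for image, words, boxes, labels in examples:
            enc = processor(
                image, words, boxes=boxes, word_labels=labels,
                truncation=True, max_length=max_length, stride=stride,
                return_overflowing_tokens=True,
                padding="max_length", return_tensors="pt",
            )
            # One row per chunk; the processor duplicates the image per
            # chunk via overflow_to_sample_mapping, which we drop here.
            n_chunks = enc["input_ids"].shape[0]
            for i in range(n_chunks):
                self.items.append({
                    k: v[i] for k, v in enc.items()
                    if k != "overflow_to_sample_mapping"
                })

    def __len__(self):
        return len(self.items)

    def __getitem__(self, idx):
        return self.items[idx]
```

Each chunk then trains as a normal 512-token example; at inference time, per-chunk predictions for one document can be merged back using the same stride arithmetic.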