Out of Memory on very small custom transformer

Hi everyone,

I am building a custom transformer to classify protein sequences. The sequences range from 20 to 60,000 characters in length, in the form “ANTGGTANGT…”.

Once tokenized with a custom BPE tokenizer with a vocabulary size of 1000, the longest input_ids sequence is 12,000 tokens.

For the architecture I have tried RoBERTa and Longformer.

I run out of memory with both, even with a very small configuration and after splitting long sequences. For example:

Config:

config = LongformerConfig(
    vocab_size=1000,
    max_position_embeddings=5000,
    num_attention_heads=8,
    num_hidden_layers=3,
    type_vocab_size=1,
    hidden_size=64,
    intermediate_size=128
)

Model:

Net(
  (roberta): LongformerModel(
    (embeddings): RobertaEmbeddings(
      (word_embeddings): Embedding(1000, 64, padding_idx=1)
      (position_embeddings): Embedding(12000, 64, padding_idx=1)
      (token_type_embeddings): Embedding(1, 64)
      (LayerNorm): LayerNorm((64,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): LongformerEncoder(
      (layer): ModuleList(
        (0): LongformerLayer(
          (attention): LongformerAttention(
            (self): LongformerSelfAttention(
              (query): Linear(in_features=64, out_features=64, bias=True)
              (key): Linear(in_features=64, out_features=64, bias=True)
              (value): Linear(in_features=64, out_features=64, bias=True)
              (query_global): Linear(in_features=64, out_features=64, bias=True)
              (key_global): Linear(in_features=64, out_features=64, bias=True)
              (value_global): Linear(in_features=64, out_features=64, bias=True)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=64, out_features=64, bias=True)
              (LayerNorm): LayerNorm((64,), eps=1e-12, elementwise_affine=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
          )
          (intermediate): BertIntermediate(
            (dense): Linear(in_features=64, out_features=128, bias=True)
          )
          (output): BertOutput(
            (dense): Linear(in_features=128, out_features=64, bias=True)
            (LayerNorm): LayerNorm((64,), eps=1e-12, elementwise_affine=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
        )
        (1): LongformerLayer(
          (attention): LongformerAttention(
            (self): LongformerSelfAttention(
              (query): Linear(in_features=64, out_features=64, bias=True)
              (key): Linear(in_features=64, out_features=64, bias=True)
              (value): Linear(in_features=64, out_features=64, bias=True)
              (query_global): Linear(in_features=64, out_features=64, bias=True)
              (key_global): Linear(in_features=64, out_features=64, bias=True)
              (value_global): Linear(in_features=64, out_features=64, bias=True)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=64, out_features=64, bias=True)
              (LayerNorm): LayerNorm((64,), eps=1e-12, elementwise_affine=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
          )
          (intermediate): BertIntermediate(
            (dense): Linear(in_features=64, out_features=128, bias=True)
          )
          (output): BertOutput(
            (dense): Linear(in_features=128, out_features=64, bias=True)
            (LayerNorm): LayerNorm((64,), eps=1e-12, elementwise_affine=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
        )
        (2): LongformerLayer(
          (attention): LongformerAttention(
            (self): LongformerSelfAttention(
              (query): Linear(in_features=64, out_features=64, bias=True)
              (key): Linear(in_features=64, out_features=64, bias=True)
              (value): Linear(in_features=64, out_features=64, bias=True)
              (query_global): Linear(in_features=64, out_features=64, bias=True)
              (key_global): Linear(in_features=64, out_features=64, bias=True)
              (value_global): Linear(in_features=64, out_features=64, bias=True)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=64, out_features=64, bias=True)
              (LayerNorm): LayerNorm((64,), eps=1e-12, elementwise_affine=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
          )
          (intermediate): BertIntermediate(
            (dense): Linear(in_features=64, out_features=128, bias=True)
          )
          (output): BertOutput(
            (dense): Linear(in_features=128, out_features=64, bias=True)
            (LayerNorm): LayerNorm((64,), eps=1e-12, elementwise_affine=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
        )
      )
    )
    (pooler): BertPooler(
      (dense): Linear(in_features=64, out_features=64, bias=True)
      (activation): Tanh()
    )
  )
  (cls): Linear(in_features=64, out_features=1314, bias=True)
)

I do not get the out-of-memory error when instantiating the model and moving it to a CUDA device; it always happens at the start of training, even with a batch size of 1.

I have 8 GB of CUDA memory, and I have previously been able to train pretrained bert-base, roberta-base… and other models.

Is there something I am doing wrong? Do I need to change other parameters of the transformer architecture?

Thank you very much

A max_position_embeddings of 5000 is still too much for an 8GB GPU; you could try using fp16.
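In case it helps, here is a minimal sketch of fp16 (mixed-precision) training with PyTorch's `torch.cuda.amp`. The toy linear model, data, and hyperparameters are stand-ins for your real Longformer setup, not your actual code; `autocast` and `GradScaler` simply become no-ops when CUDA is unavailable.

```python
# Minimal mixed-precision training sketch (hypothetical toy model, NOT the
# poster's Longformer). On GPU, the forward pass runs in fp16, roughly
# halving activation memory; the GradScaler guards against fp16 underflow.
import torch
import torch.nn as nn

device = "cuda" if torch.cuda.is_available() else "cpu"
use_amp = device == "cuda"

model = nn.Linear(64, 1314).to(device)        # stand-in for the classifier head
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
scaler = torch.cuda.amp.GradScaler(enabled=use_amp)

x = torch.randn(3, 64, device=device)         # fake batch of pooled features
y = torch.randint(0, 1314, (3,), device=device)

optimizer.zero_grad()
with torch.cuda.amp.autocast(enabled=use_amp):  # fp16 forward pass on GPU
    loss = nn.functional.cross_entropy(model(x), y)
scaler.scale(loss).backward()                   # scaled backward pass
scaler.step(optimizer)
scaler.update()
print(f"loss: {loss.item():.3f}")
```

With the `Trainer` API you can get the same effect by just passing `fp16=True` in `TrainingArguments`.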

1 Like

@valhalla thank you very much for the advice, it worked.

I have been able to train the model with a batch size of 3. I still need to improve it, as one epoch will take 60 hours :sweat_smile: .

I am thinking about distributed training in the cloud, but I still need to iterate on the model first.

What do you think about these trade-offs?

  • A smaller vocabulary, or a bigger one, since a bigger vocabulary results in shorter tokenized sequences?
  • Bucketing sequences longer than 2000 into chunks and averaging their vector representations?
  • Fewer layers, or smaller hidden and intermediate sizes?
  • Reducing the number of attention heads?
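To make the second option concrete, here is a rough sketch of what I mean by bucketing: split the token ids into fixed-size chunks, encode each chunk separately, and mean-pool the chunk vectors into one sequence representation. `encode_chunk` is a placeholder for the real encoder's pooled output.

```python
# Hypothetical sketch of the "bucket long sequences" idea: split token ids
# into chunks of at most chunk_len, embed each chunk, mean-pool the results.
import torch

def chunk_ids(input_ids, chunk_len=2000):
    """Split a 1-D tensor of token ids into pieces of at most chunk_len."""
    return [input_ids[i:i + chunk_len] for i in range(0, len(input_ids), chunk_len)]

def encode_chunk(chunk, hidden_size=64):
    # Placeholder: a real model would return its pooled [CLS] vector here.
    return torch.randn(hidden_size)

def sequence_vector(input_ids, chunk_len=2000):
    vecs = torch.stack([encode_chunk(c) for c in chunk_ids(input_ids, chunk_len)])
    return vecs.mean(dim=0)  # average the per-chunk representations

ids = torch.arange(5000)           # a 5000-token sequence -> 3 chunks
print(sequence_vector(ids).shape)  # torch.Size([64])
```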

Which of these parameters has the highest memory impact? Is there a methodology for choosing them?

Thank you very much for the advice.

@sgugger might have better answer for this :slight_smile:

I think the two most important parameters to save memory are:

  • sequence length, as all your hidden states (and their gradients) have that dimension
  • vocabulary size, as it directly controls the size of the biggest weight matrix in your model (the embedding table)
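A back-of-envelope calculation (my own illustrative numbers, assuming fp32 at 4 bytes per value) shows why sequence length dominates: full self-attention, as in RoBERTa, materializes a `(batch, heads, seq, seq)` score matrix per layer, while the embedding table only scales with `vocab × hidden`. Longformer's windowed attention avoids the full seq² matrix, but activations at every layer still scale linearly with sequence length.

```python
# Back-of-envelope memory estimate, assumed fp32 (4 bytes per value).
def attention_scores_mb(batch, heads, seq_len, n_layers, bytes_per_val=4):
    """Raw attention-score memory for FULL self-attention (RoBERTa-style)."""
    return batch * heads * seq_len ** 2 * n_layers * bytes_per_val / 1e6

def embedding_table_mb(vocab, hidden, bytes_per_val=4):
    """Parameter memory of the word-embedding table."""
    return vocab * hidden * bytes_per_val / 1e6

# The config from this thread: 8 heads, 3 layers, seq 5000, vocab 1000, hidden 64
print(attention_scores_mb(1, 8, 5000, 3))  # 2400.0 MB just for raw scores
print(embedding_table_mb(1000, 64))        # 0.256 MB for the embedding table
```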
1 Like

Thank you very much for your help. @valhalla @sgugger

I have the same issue. I also want to know: do I need to create config = LongformerConfig() before the tokenizer? I am following the code from here for fine-tuning on a custom dataset. Any help?

Please create your own topic, and don’t hijack others’.