Out of Memory on very small custom transformer

Hi everyone,

I am building a custom transformer to classify protein sequences. The sequences range from 20 to 60,000 characters in length, in the form “ANTGGTANGT…”.

Once tokenized with a custom BPE tokenizer with a vocabulary size of 1000, the longest input_ids sequence is 12,000 tokens.

For the architecture I have tried RoBERTa and Longformer.

I run out of memory with both, even with a very small configuration and after splitting long sequences. For example:

Config:

config = LongformerConfig(
    vocab_size=1000,
    max_position_embeddings=5000,
    num_attention_heads=8,
    num_hidden_layers=3,
    type_vocab_size=1,
    hidden_size=64,
    intermediate_size=128
)

Model:

Net(
  (roberta): LongformerModel(
    (embeddings): RobertaEmbeddings(
      (word_embeddings): Embedding(1000, 64, padding_idx=1)
      (position_embeddings): Embedding(12000, 64, padding_idx=1)
      (token_type_embeddings): Embedding(1, 64)
      (LayerNorm): LayerNorm((64,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): LongformerEncoder(
      (layer): ModuleList(
        (0): LongformerLayer(
          (attention): LongformerAttention(
            (self): LongformerSelfAttention(
              (query): Linear(in_features=64, out_features=64, bias=True)
              (key): Linear(in_features=64, out_features=64, bias=True)
              (value): Linear(in_features=64, out_features=64, bias=True)
              (query_global): Linear(in_features=64, out_features=64, bias=True)
              (key_global): Linear(in_features=64, out_features=64, bias=True)
              (value_global): Linear(in_features=64, out_features=64, bias=True)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=64, out_features=64, bias=True)
              (LayerNorm): LayerNorm((64,), eps=1e-12, elementwise_affine=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
          )
          (intermediate): BertIntermediate(
            (dense): Linear(in_features=64, out_features=128, bias=True)
          )
          (output): BertOutput(
            (dense): Linear(in_features=128, out_features=64, bias=True)
            (LayerNorm): LayerNorm((64,), eps=1e-12, elementwise_affine=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
        )
        (1): LongformerLayer(
          (attention): LongformerAttention(
            (self): LongformerSelfAttention(
              (query): Linear(in_features=64, out_features=64, bias=True)
              (key): Linear(in_features=64, out_features=64, bias=True)
              (value): Linear(in_features=64, out_features=64, bias=True)
              (query_global): Linear(in_features=64, out_features=64, bias=True)
              (key_global): Linear(in_features=64, out_features=64, bias=True)
              (value_global): Linear(in_features=64, out_features=64, bias=True)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=64, out_features=64, bias=True)
              (LayerNorm): LayerNorm((64,), eps=1e-12, elementwise_affine=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
          )
          (intermediate): BertIntermediate(
            (dense): Linear(in_features=64, out_features=128, bias=True)
          )
          (output): BertOutput(
            (dense): Linear(in_features=128, out_features=64, bias=True)
            (LayerNorm): LayerNorm((64,), eps=1e-12, elementwise_affine=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
        )
        (2): LongformerLayer(
          (attention): LongformerAttention(
            (self): LongformerSelfAttention(
              (query): Linear(in_features=64, out_features=64, bias=True)
              (key): Linear(in_features=64, out_features=64, bias=True)
              (value): Linear(in_features=64, out_features=64, bias=True)
              (query_global): Linear(in_features=64, out_features=64, bias=True)
              (key_global): Linear(in_features=64, out_features=64, bias=True)
              (value_global): Linear(in_features=64, out_features=64, bias=True)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=64, out_features=64, bias=True)
              (LayerNorm): LayerNorm((64,), eps=1e-12, elementwise_affine=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
          )
          (intermediate): BertIntermediate(
            (dense): Linear(in_features=64, out_features=128, bias=True)
          )
          (output): BertOutput(
            (dense): Linear(in_features=128, out_features=64, bias=True)
            (LayerNorm): LayerNorm((64,), eps=1e-12, elementwise_affine=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
        )
      )
    )
    (pooler): BertPooler(
      (dense): Linear(in_features=64, out_features=64, bias=True)
      (activation): Tanh()
    )
  )
  (cls): Linear(in_features=64, out_features=1314, bias=True)
)

I do not get the out-of-memory error when instantiating the model and moving it to a CUDA device; it always happens at the start of training, even with a batch size of 1.

I have 8 GB of CUDA memory, and I have previously been able to train pretrained bert-base, roberta-base… and other models.

Is there something I am doing wrong? Do I need to change other parameters of the transformer architecture?

Thank you very much

A max_position_embeddings of 5000 is still too much for an 8GB GPU; you could try using fp16.
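In case it helps, here is a minimal sketch of fp16 (mixed-precision) training with PyTorch's `torch.cuda.amp`. The toy linear model, data, and hyperparameters are stand-ins for your real Longformer setup, not your actual code; `autocast` and `GradScaler` simply become no-ops when CUDA is unavailable.

```python
# Minimal mixed-precision training sketch (hypothetical toy model, NOT the
# poster's Longformer). On GPU, the forward pass runs in fp16, roughly
# halving activation memory; the GradScaler guards against fp16 underflow.
import torch
import torch.nn as nn

device = "cuda" if torch.cuda.is_available() else "cpu"
use_amp = device == "cuda"

model = nn.Linear(64, 1314).to(device)        # stand-in for the classifier head
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
scaler = torch.cuda.amp.GradScaler(enabled=use_amp)

x = torch.randn(3, 64, device=device)         # fake batch of pooled features
y = torch.randint(0, 1314, (3,), device=device)

optimizer.zero_grad()
with torch.cuda.amp.autocast(enabled=use_amp):  # fp16 forward pass on GPU
    loss = nn.functional.cross_entropy(model(x), y)
scaler.scale(loss).backward()                   # scaled backward pass
scaler.step(optimizer)
scaler.update()
print(f"loss: {loss.item():.3f}")
```

With the `Trainer` API you can get the same effect by just passing `fp16=True` in `TrainingArguments`.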

1 Like

@valhalla thank you very much for the advice, it worked.

I have been able to train the model with a batch size of 3. I still need to improve it, as one epoch will take 60 hours :sweat_smile: .

I am thinking about distributed training in the cloud, but I still need to iterate on the model first.

What do you think about these trade-offs?

  • A smaller vocabulary, or a bigger one, since a bigger vocabulary results in shorter tokenized sequences?
  • Bucketing sequences longer than 2000 into chunks and averaging their vector representations?
  • Fewer layers, or smaller hidden and intermediate sizes?
  • Reducing the number of attention heads?
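To make the second option concrete, here is a rough sketch of what I mean by bucketing: split the token ids into fixed-size chunks, encode each chunk separately, and mean-pool the chunk vectors into one sequence representation. `encode_chunk` is a placeholder for the real encoder's pooled output.

```python
# Hypothetical sketch of the "bucket long sequences" idea: split token ids
# into chunks of at most chunk_len, embed each chunk, mean-pool the results.
import torch

def chunk_ids(input_ids, chunk_len=2000):
    """Split a 1-D tensor of token ids into pieces of at most chunk_len."""
    return [input_ids[i:i + chunk_len] for i in range(0, len(input_ids), chunk_len)]

def encode_chunk(chunk, hidden_size=64):
    # Placeholder: a real model would return its pooled [CLS] vector here.
    return torch.randn(hidden_size)

def sequence_vector(input_ids, chunk_len=2000):
    vecs = torch.stack([encode_chunk(c) for c in chunk_ids(input_ids, chunk_len)])
    return vecs.mean(dim=0)  # average the per-chunk representations

ids = torch.arange(5000)           # a 5000-token sequence -> 3 chunks
print(sequence_vector(ids).shape)  # torch.Size([64])
```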

Which of these parameters has the highest memory impact? Is there a methodology for choosing them?

Thank you very much for the advice.

@sgugger might have better answer for this :slight_smile:

I think the two most important parameters to save memory are:

  • sequence length, as all your hidden states (and their gradients) have that dimension
  • vocabulary size, as it directly controls the size of the biggest weight matrix in your model (the embedding table)
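A back-of-envelope calculation (my own illustrative numbers, assuming fp32 at 4 bytes per value) shows why sequence length dominates: full self-attention, as in RoBERTa, materializes a `(batch, heads, seq, seq)` score matrix per layer, while the embedding table only scales with `vocab × hidden`. Longformer's windowed attention avoids the full seq² matrix, but activations at every layer still scale linearly with sequence length.

```python
# Back-of-envelope memory estimate, assumed fp32 (4 bytes per value).
def attention_scores_mb(batch, heads, seq_len, n_layers, bytes_per_val=4):
    """Raw attention-score memory for FULL self-attention (RoBERTa-style)."""
    return batch * heads * seq_len ** 2 * n_layers * bytes_per_val / 1e6

def embedding_table_mb(vocab, hidden, bytes_per_val=4):
    """Parameter memory of the word-embedding table."""
    return vocab * hidden * bytes_per_val / 1e6

# The config from this thread: 8 heads, 3 layers, seq 5000, vocab 1000, hidden 64
print(attention_scores_mb(1, 8, 5000, 3))  # 2400.0 MB just for raw scores
print(embedding_table_mb(1000, 64))        # 0.256 MB for the embedding table
```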
1 Like

Thank you very much for your help. @valhalla @sgugger

I have the same issue. I also want to know: do I need to create config = LongformerConfig() before the tokenizer? I am following the code from here for fine-tuning on a custom dataset. Any help?

Please create your own topic, and don’t hijack others’.