Hi everyone,
I am building a custom transformer to classify protein sequences. The raw sequences range from 20 to 60,000 characters in length and look like “ANTGGTANGT…”.
Once tokenized with a custom BPE tokenizer (vocabulary size 1000), the longest input_ids sequence is 12,000 tokens.
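For reference, the tokenization looks roughly like this (a simplified sketch; “tokenizer.json” and the example string are placeholders, not my exact files):

from tokenizers import Tokenizer

# Load the custom BPE tokenizer (vocab size 1000); the path is a placeholder.
tokenizer = Tokenizer.from_file("tokenizer.json")

# Raw sequences look like this and range from 20 to 60,000 characters.
sequence = "ANTGGTANGT"
encoding = tokenizer.encode(sequence)

# Across the dataset, the longest list of ids is about 12,000 tokens.
print(len(encoding.ids))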
For the transformer I have tried RoBERTa and Longformer.
I ran out of memory with both, even with very small model dimensions and after splitting the long sequences. For example:
Config:
from transformers import LongformerConfig

config = LongformerConfig(
    vocab_size=1000,
    max_position_embeddings=5000,
    num_attention_heads=8,
    num_hidden_layers=3,
    type_vocab_size=1,
    hidden_size=64,
    intermediate_size=128,
)
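The model is wrapped roughly like this (a sketch reconstructed from the printout below; the exact forward pass may differ, and 1314 is the number of target classes):

import torch.nn as nn
from transformers import LongformerModel

class Net(nn.Module):
    def __init__(self, config, num_classes=1314):
        super().__init__()
        # Longformer backbone built from the config above, trained from scratch.
        self.roberta = LongformerModel(config)
        # Linear classification head on the pooled output.
        self.cls = nn.Linear(config.hidden_size, num_classes)

    def forward(self, input_ids, attention_mask=None):
        outputs = self.roberta(input_ids=input_ids, attention_mask=attention_mask)
        return self.cls(outputs.pooler_output)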
Model:
Net(
  (roberta): LongformerModel(
    (embeddings): RobertaEmbeddings(
      (word_embeddings): Embedding(1000, 64, padding_idx=1)
      (position_embeddings): Embedding(12000, 64, padding_idx=1)
      (token_type_embeddings): Embedding(1, 64)
      (LayerNorm): LayerNorm((64,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): LongformerEncoder(
      (layer): ModuleList(
        (0): LongformerLayer(
          (attention): LongformerAttention(
            (self): LongformerSelfAttention(
              (query): Linear(in_features=64, out_features=64, bias=True)
              (key): Linear(in_features=64, out_features=64, bias=True)
              (value): Linear(in_features=64, out_features=64, bias=True)
              (query_global): Linear(in_features=64, out_features=64, bias=True)
              (key_global): Linear(in_features=64, out_features=64, bias=True)
              (value_global): Linear(in_features=64, out_features=64, bias=True)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=64, out_features=64, bias=True)
              (LayerNorm): LayerNorm((64,), eps=1e-12, elementwise_affine=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
          )
          (intermediate): BertIntermediate(
            (dense): Linear(in_features=64, out_features=128, bias=True)
          )
          (output): BertOutput(
            (dense): Linear(in_features=128, out_features=64, bias=True)
            (LayerNorm): LayerNorm((64,), eps=1e-12, elementwise_affine=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
        )
        (1): LongformerLayer(
          (attention): LongformerAttention(
            (self): LongformerSelfAttention(
              (query): Linear(in_features=64, out_features=64, bias=True)
              (key): Linear(in_features=64, out_features=64, bias=True)
              (value): Linear(in_features=64, out_features=64, bias=True)
              (query_global): Linear(in_features=64, out_features=64, bias=True)
              (key_global): Linear(in_features=64, out_features=64, bias=True)
              (value_global): Linear(in_features=64, out_features=64, bias=True)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=64, out_features=64, bias=True)
              (LayerNorm): LayerNorm((64,), eps=1e-12, elementwise_affine=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
          )
          (intermediate): BertIntermediate(
            (dense): Linear(in_features=64, out_features=128, bias=True)
          )
          (output): BertOutput(
            (dense): Linear(in_features=128, out_features=64, bias=True)
            (LayerNorm): LayerNorm((64,), eps=1e-12, elementwise_affine=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
        )
        (2): LongformerLayer(
          (attention): LongformerAttention(
            (self): LongformerSelfAttention(
              (query): Linear(in_features=64, out_features=64, bias=True)
              (key): Linear(in_features=64, out_features=64, bias=True)
              (value): Linear(in_features=64, out_features=64, bias=True)
              (query_global): Linear(in_features=64, out_features=64, bias=True)
              (key_global): Linear(in_features=64, out_features=64, bias=True)
              (value_global): Linear(in_features=64, out_features=64, bias=True)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=64, out_features=64, bias=True)
              (LayerNorm): LayerNorm((64,), eps=1e-12, elementwise_affine=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
          )
          (intermediate): BertIntermediate(
            (dense): Linear(in_features=64, out_features=128, bias=True)
          )
          (output): BertOutput(
            (dense): Linear(in_features=128, out_features=64, bias=True)
            (LayerNorm): LayerNorm((64,), eps=1e-12, elementwise_affine=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
        )
      )
    )
    (pooler): BertPooler(
      (dense): Linear(in_features=64, out_features=64, bias=True)
      (activation): Tanh()
    )
  )
  (cls): Linear(in_features=64, out_features=1314, bias=True)
)
I do not get the out-of-memory error when instantiating the model and moving it to a CUDA device; it always happens at the start of training, even with a batch size of 1.
I have 8 GB of GPU memory, and I have previously been able to train pretrained bert-base, roberta-base, and other models on this card.
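For context, the training step is essentially a plain loop like this (simplified sketch; train_loader, the optimizer, and the loss are placeholders for my actual setup):

import torch

model = Net(config).cuda()  # instantiation and .cuda() work without errors
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
criterion = torch.nn.CrossEntropyLoss()

model.train()
for input_ids, attention_mask, labels in train_loader:  # batch size 1
    optimizer.zero_grad()
    # CUDA out of memory is raised during this first forward/backward pass.
    logits = model(input_ids.cuda(), attention_mask.cuda())
    loss = criterion(logits, labels.cuda())
    loss.backward()
    optimizer.step()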
Is there something I am doing wrong? Do I need to change other parameters of the transformer architecture?
Thank you very much