Hi Guys,
This seems very obvious, but I can't seem to find an answer anywhere. I'm trying to build a very basic RoBERTa protein model, similar to ProTrans. It's just RoBERTa, but I need very long positional encodings of 40_000, because the protein sequences are around 40,000 amino acids long. Any time I change the max position embeddings to 40k, I get a `CUDA error: device-side assert triggered`.
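For reference, this is roughly the config change that triggers it. The vocab size comes from my tokenizer; the other architecture values are just placeholders for whatever I happen to run with:

```python
from transformers import RobertaConfig, RobertaForMaskedLM

# Training runs fine with a small max_position_embeddings;
# bumping it to 40k is when the device-side assert appears.
config = RobertaConfig(
    vocab_size=30,                   # amino-acid vocabulary
    max_position_embeddings=40_000,  # the change that triggers the CUDA assert
    num_hidden_layers=6,             # placeholder architecture values
    num_attention_heads=12,
    hidden_size=768,
    type_vocab_size=1,
)
model = RobertaForMaskedLM(config=config)
```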
(Additional question: I save the LineByLineTextDataset to a variable. From what I understand it splits the text into chunks, but how can I index into it to see each chunk?)
(Also, other than a lambda function, is there a more efficient way to preprocess and tokenise this dataset?)
Steps
- Preprocess the UniRef50 sequences into a single space-separated text document, treating each amino acid as a word and each protein as a sentence.
- Tokenise: I load the tokenizer from the ProTrans model, and it tokenises fine.
- Use LineByLineTextDataset to split the text into 40_000-token blocks.
- Set up the config; the important changes from the default RoBERTa config are vocab_size=30 and max_position_embeddings=40_000.
- Run it through the data collator, set up the model and the Trainer, start training, and get the error (minimal sketch below).
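In case it helps, here is a minimal sketch of that pipeline, reusing the config/model from the snippet above. The tokenizer checkpoint, file paths, and training arguments are placeholders rather than my exact values:

```python
from transformers import (
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    LineByLineTextDataset,
    Trainer,
    TrainingArguments,
)

# Tokenizer loaded from the ProTrans checkpoint (path is a placeholder).
tokenizer = AutoTokenizer.from_pretrained("path/to/protrans-tokenizer")

# UniRef50, preprocessed so each protein is one space-separated line.
dataset = LineByLineTextDataset(
    tokenizer=tokenizer,
    file_path="uniref50_spaced.txt",  # placeholder path
    block_size=40_000,
)

# Standard MLM collator.
data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15
)

training_args = TrainingArguments(
    output_dir="./roberta-protein",   # placeholder
    per_device_train_batch_size=1,
    num_train_epochs=1,
)

trainer = Trainer(
    model=model,  # RobertaForMaskedLM built from the config above
    args=training_args,
    data_collator=data_collator,
    train_dataset=dataset,
)

trainer.train()  # this is where the device-side assert is raised
```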