Positional Encoding error, Protein BERT Model

Hi Guys,

This seems very obvious, but I can’t seem to find an answer anywhere. I’m trying to build a very basic RoBERTa protein model similar to ProTrans. It’s just RoBERTa, but I need to use very long positional encodings of 40_000, because protein sequences are about 40,000 amino acids long. But any time I change the max positional embeddings to 40k I keep getting a `CUDA error: device-side assert triggered`.

(Additional question: I save the LineByLineTextDataset to a variable. From what I understand, LineByLineTextDataset splits the text into chunks, but how can I index into it to see each chunk?)
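
(For what it’s worth on the indexing part: LineByLineTextDataset supports plain integer indexing, and each item is a dict containing an `input_ids` tensor, so you can peek at the chunks roughly like the sketch below. The tokenizer and file path are placeholders, and note that, at least in the versions I’ve looked at, it creates one example per line of the file and truncates at `block_size` rather than splitting a long line into several chunks.)

```python
from transformers import AutoTokenizer, LineByLineTextDataset

# Placeholders: swap in the ProTrans tokenizer and the preprocessed UniRef50 file.
tokenizer = AutoTokenizer.from_pretrained("roberta-base")

dataset = LineByLineTextDataset(
    tokenizer=tokenizer,
    file_path="uniref50_spaced.txt",  # hypothetical path: one space-separated protein per line
    block_size=512,
)

print(len(dataset))                    # number of examples (one per non-empty line)
example = dataset[0]                   # integer indexing returns a dict
print(example["input_ids"])            # tensor of token ids for that line, truncated to block_size
print(tokenizer.decode(example["input_ids"]))  # decode back to amino-acid "words"
```
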
(Also, other than a lambda function, is there a more efficient way I can preprocess and tokenise this dataset?)
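
(On the preprocessing question: one common alternative to mapping a lambda over everything is the batched `map` in the `datasets` library, which hands whole batches of lines to the fast tokenizer and can run across several processes. A rough sketch with placeholder names, not a drop-in for this exact pipeline:)

```python
from datasets import load_dataset
from transformers import AutoTokenizer

# Placeholders again: your ProTrans tokenizer and the preprocessed UniRef50 text file.
tokenizer = AutoTokenizer.from_pretrained("roberta-base")
raw = load_dataset("text", data_files={"train": "uniref50_spaced.txt"})

def tokenize(batch):
    # Tokenise a whole batch of lines at once; truncation caps each example at 512 tokens here.
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = raw.map(
    tokenize,
    batched=True,              # pass lists of lines instead of one line at a time
    num_proc=4,                # hypothetical worker count
    remove_columns=["text"],
)

print(tokenized["train"][0]["input_ids"][:10])
```
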

Steps

  1. Preprocess the UniRef50 sequences into a single space-separated text document, treating each amino acid as a word and each protein as a sentence.
  2. Tokenise: I can load the tokenizer from the ProTrans model, and it tokenises fine.
  3. Use LineByLineTextDataset to split each line into 40_000-token blocks.
  4. Set up the config; the important changes to the RoBERTa config are vocab_size=30 and max_position_embeddings=40_000.
  5. Run through the data collator, model setup, and trainer setup, then get the error (rough sketch after this list).
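
For reference, a minimal sketch of what I mean by steps 4 and 5 (only vocab_size and max_position_embeddings match what I describe above; the layer/head counts are just placeholders):

```python
from transformers import RobertaConfig, RobertaForMaskedLM

block_size = 40_000  # block size from step 3

config = RobertaConfig(
    vocab_size=30,
    max_position_embeddings=block_size,  # the setting I change, as in step 4
    # Note (possible cause, if I'm reading the HF RoBERTa code right): position ids
    # are offset by padding_idx + 1, so max_position_embeddings may need to be
    # block_size + 2; if it equals block_size exactly, the position-embedding lookup
    # can index out of range, which shows up on GPU as a device-side assert.
    pad_token_id=1,          # RoBERTa's default padding index
    num_hidden_layers=6,     # placeholder
    num_attention_heads=12,  # placeholder
    type_vocab_size=1,
)

model = RobertaForMaskedLM(config=config)

# Step 5 then wraps the dataset with DataCollatorForLanguageModeling and a Trainer,
# which is where the error appears.
```
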

Hi @donal, I guess it’s near impossible to put 40,000 tokens into any transformer model that uses full attention. Maybe a better choice is Reformer or Longformer (I am not sure Longformer can handle 40,000 tokens either). The main problem with the BERT architecture is that its memory and computational complexity is O(L^2), where L is max_seq_len, so I’d guess about 2048 tokens is the practical upper limit.
Or maybe consider using something like a sliding window or some kind of memory bank :man_shrugging:
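
Just to make the O(L^2) point concrete, a back-of-the-envelope for the attention score matrices alone (assuming 12 heads and fp32 scores; real usage is higher once you count softmax copies, gradients, and so on):

```python
# Rough memory for one layer's full attention score matrices (batch size 1).
def attn_scores_gib(seq_len: int, num_heads: int = 12, bytes_per_float: int = 4) -> float:
    return seq_len * seq_len * num_heads * bytes_per_float / 1024**3

print(f"{attn_scores_gib(2_048):.2f} GiB per layer")    # ~0.19 GiB
print(f"{attn_scores_gib(40_000):.2f} GiB per layer")   # ~71.53 GiB
```

So even a single layer at 40,000 tokens is far beyond one GPU with full attention, which is why sparse-attention models or chunking the sequences look like the realistic route.
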

I figured it out, but now I keep getting freezes with a multi-GPU setup. Anyone else having this problem?