Positional Encoding error, Protein Bert Model

donal · October 25, 2020, 8:54am

Hi Guys,

This seems very obvious, but I can’t find an answer anywhere. I’m trying to build a very basic RoBERTa protein model similar to ProtTrans. It’s just RoBERTa, but I need very long positional encodings of 40_000, because protein sequences are about 40,000 amino acids long. But any time I change the max positional embeddings to 40k, I get a `CUDA error: device-side assert triggered` error.

(Additional question: I save the `LineByLineTextDataset` to a variable. From what I understand, it splits the text into chunks, but how can I index into it to see each chunk?)
(Also, other than a lambda function, is there a more efficient way to preprocess and tokenise this dataset?)
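For reference, the preprocessing I mean looks roughly like the sketch below. It assumes the input is FASTA-formatted text and that residues should be upper-cased; the function name `fasta_to_spaced_lines` and the example records are made up for illustration and are not from any library.

```python
import io


def fasta_to_spaced_lines(fasta_handle):
    """Yield one line per protein, with residues separated by single spaces.

    Hypothetical helper: assumes FASTA input where header lines start
    with '>' and a sequence may wrap over several lines.
    """
    residues = []
    for line in fasta_handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if residues:
                yield " ".join(residues)
                residues = []
        else:
            # Each character is one amino acid, i.e. one "word".
            residues.extend(line.upper())
    if residues:
        yield " ".join(residues)


# Made-up two-record example; a real run would pass an open file handle.
example = io.StringIO(">protein_a\nMKT\nAY\n>protein_b\nGG\n")
spaced_lines = list(fasta_to_spaced_lines(example))
print(spaced_lines[0])  # first protein as a space-separated "sentence"
```

Collecting the lines in a plain list like this also makes it easy to index in and look at a single protein before it goes to the tokenizer.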

Steps

1. Preprocess the UniRef50 sequences into a single space-separated text document, treating each amino acid as a word and each protein as a sentence.
2. Tokenise. I can load in the tokens from the ProtTrans model, and it tokenises fine.
3. Use the line-by-line text dataset to split each sequence into blocks of 40_000.
4. Set up the config. The important changes to the RoBERTa defaults are vocab size 30 and `max_position_embeddings=40_000`.
5. Run through the data collator, model setup and training setup, then get the error.
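One common source of a device-side assert in an embedding layer is an index that falls outside the embedding table, so it may be worth checking the ids against the config values before training. In the Hugging Face RoBERTa implementation, position ids for real tokens start at the padding index plus one, so a block of N tokens needs `max_position_embeddings` of at least N + `pad_token_id` + 1. The sketch below reproduces that numbering in plain Python. The function names are hypothetical, and `pad_token_id=1` is the RoBERTa default, which is an assumption here and may differ for the ProtTrans tokenizer.

```python
def roberta_position_ids(input_ids, pad_token_id):
    """Mimic RoBERTa-style position ids.

    Padding tokens keep position pad_token_id; real tokens count up
    from pad_token_id + 1.
    """
    position_ids = []
    count = 0
    for token_id in input_ids:
        if token_id == pad_token_id:
            position_ids.append(pad_token_id)
        else:
            count += 1
            position_ids.append(count + pad_token_id)
    return position_ids


def find_out_of_range(input_ids, vocab_size, max_position_embeddings, pad_token_id):
    """Return a list of human-readable problems (empty if none found)."""
    problems = []
    # Token ids index the word embedding table of size vocab_size.
    if any(not 0 <= t < vocab_size for t in input_ids):
        problems.append("token id outside [0, vocab_size)")
    # Position ids index the position embedding table.
    largest_position = max(roberta_position_ids(input_ids, pad_token_id), default=0)
    if largest_position >= max_position_embeddings:
        problems.append(
            f"position id {largest_position} needs max_position_embeddings "
            f">= {largest_position + 1}"
        )
    return problems


# A made-up block of 40_000 identical token ids, checked against the config above.
block = [5] * 40_000
print(find_out_of_range(block, vocab_size=30, max_position_embeddings=40_000, pad_token_id=1))
print(find_out_of_range(block, vocab_size=30, max_position_embeddings=40_002, pad_token_id=1))
```

Running the model once on the CPU, or with the environment variable `CUDA_LAUNCH_BLOCKING=1`, usually gives a more specific message than the asynchronous CUDA assert.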
