Hi Guys,
This seems very obvious, but I can't seem to find an answer anywhere. I'm trying to build a very basic RoBERTa protein model, similar to ProTrans. It's just RoBERTa, but I need very long positional encodings of 40_000, because the protein sequences are around 40,000 amino acids long. Any time I change the max position embeddings to 40k, I get a `CUDA error: device-side assert triggered`.
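For reference, this is roughly the config change that triggers it. The vocab size comes from my tokenizer; the other architecture values are just placeholders for whatever I happen to run with:

```python
from transformers import RobertaConfig, RobertaForMaskedLM

# Training runs fine with a small max_position_embeddings;
# bumping it to 40k is when the device-side assert appears.
config = RobertaConfig(
    vocab_size=30,                   # amino-acid vocabulary
    max_position_embeddings=40_000,  # the change that triggers the CUDA assert
    num_hidden_layers=6,             # placeholder architecture values
    num_attention_heads=12,
    hidden_size=768,
    type_vocab_size=1,
)
model = RobertaForMaskedLM(config=config)
```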
(Additional question: I save the LineByLineTextDataset to a variable. From what I understand it splits the text into chunks, but how can I index into it to see each chunk?)
(Also, other than a lambda function, is there a more efficient way to preprocess and tokenise this dataset?)
Steps
- Preprocess the UniRef50 sequences into a single space-separated text document, treating each amino acid as a word and each protein as a sentence.
- Tokenise: I load the tokenizer from the ProTrans model, and it tokenises fine.
- Use LineByLineTextDataset to split the text into 40_000-token blocks.
- Set up the config; the important changes from the default RoBERTa config are vocab_size=30 and max_position_embeddings=40_000.
- Run it through the data collator, set up the model and the Trainer, start training, and get the error (minimal sketch below).
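In case it helps, here is a minimal sketch of that pipeline, reusing the config/model from the snippet above. The tokenizer checkpoint, file paths, and training arguments are placeholders rather than my exact values:

```python
from transformers import (
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    LineByLineTextDataset,
    Trainer,
    TrainingArguments,
)

# Tokenizer loaded from the ProTrans checkpoint (path is a placeholder).
tokenizer = AutoTokenizer.from_pretrained("path/to/protrans-tokenizer")

# UniRef50, preprocessed so each protein is one space-separated line.
dataset = LineByLineTextDataset(
    tokenizer=tokenizer,
    file_path="uniref50_spaced.txt",  # placeholder path
    block_size=40_000,
)

# Standard MLM collator.
data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15
)

training_args = TrainingArguments(
    output_dir="./roberta-protein",   # placeholder
    per_device_train_batch_size=1,
    num_train_epochs=1,
)

trainer = Trainer(
    model=model,  # RobertaForMaskedLM built from the config above
    args=training_args,
    data_collator=data_collator,
    train_dataset=dataset,
)

trainer.train()  # this is where the device-side assert is raised
```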