I’m looking to train a RoBERTa model on protein sequences, which is in many ways similar to normal NLP training, but quite different in others.
In the language of proteins, I have 20 characters instead of the 26 characters normally used in English (it is 26, right? :D), so that part is rather similar. The big difference is that you don’t really combine the characters in proteins into actual words, but rather keep each character as a distinct token or class.
Hence my input to the transformer model could essentially just be a list of numbers ranging from 0 to 19. However, that would mean my input only has a single feature, and I’m not sure a transformer can work with that?
I’m thinking of just doing a one-hot encoding of these characters, which would give me 20 input features. However, this is of course still very low compared to how normal transformers are trained, where d_model is somewhere in the range of 128-512 if I understand correctly.
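For concreteness, this is roughly what I mean (just a toy sketch; the amino-acid alphabet and the example sequence are made up for illustration):

```python
import torch

# Character-level vocabulary: the 20 standard amino acids.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
char_to_id = {aa: i for i, aa in enumerate(AMINO_ACIDS)}

sequence = "MKTAYIAKQR"  # toy example sequence

# Option 1: a plain list of integer IDs (0-19), one per amino acid.
token_ids = torch.tensor([char_to_id[aa] for aa in sequence])

# Option 2: one-hot encoding, giving 20 input features per position.
one_hot = torch.nn.functional.one_hot(token_ids, num_classes=len(AMINO_ACIDS)).float()

print(token_ids.shape)  # (10,)
print(one_hot.shape)    # (10, 20)
```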
Does anyone have any experience with anything like this? Any good advice on how it is most likely to work?
Yes, it will work. It can give you results very close to MSA-based methods, sometimes even better. If you combine it with MSA, it will give you even better results than MSA methods alone.
We have trained Transformer-XL, XLNet, BERT, ALBERT, ELECTRA, and T5 on the UniRef100 and BFD datasets. I would recommend simply using one of these models, because it requires a tremendous amount of computing power to reach good results.
You can find them here:
You can find more details in our paper:
Facebook also trained a RoBERTa model on the UniRef50 dataset:
Unfortunately, we don’t have a notebook for training from scratch, but you can find more details on replicating our results here:
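As a rough sketch (the hub identifier below is just an example; check the links above for the actual model names), loading one of these pretrained models with the transformers library looks something like this:

```python
import torch
from transformers import BertForMaskedLM, BertTokenizer

# "Rostlab/prot_bert" is used here only as an example identifier;
# see the links above for the actual model names.
model_name = "Rostlab/prot_bert"
tokenizer = BertTokenizer.from_pretrained(model_name, do_lower_case=False)
model = BertForMaskedLM.from_pretrained(model_name)
model.eval()

# These protein language models typically expect space-separated amino acids.
sequence = "M K T A Y I A K Q R"
inputs = tokenizer(sequence, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

print(outputs.logits.shape)  # (1, sequence_length, vocab_size)
```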
Is there a way for me to use any of the models to return probability distributions?
More specifically, I would like to see what exactly the model has learned and test it out a bit. To that end, I would love to be able to feed it a protein sequence where I have masked out some of the amino acids, and have it return a probability distribution over the amino acids at each position of the returned protein.
I’m sure this is possible, since this is how the model was trained in the first place, but I’m just a bit overwhelmed by all the models and haven’t managed to figure out how to do it.
Hmm, that still doesn’t quite do it, unless I’m missing something.
This does allow masking a sequence, but you can only mask one amino acid at a time, and it doesn’t return the actual probability distribution, only the top-5 probabilities for that single masked amino acid.
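What I’m after is something more like the rough sketch below: calling the model directly and applying a softmax to the logits at every masked position. I’m assuming a BERT-style masked-LM model here, reusing the example identifier from above, and I haven’t verified this end-to-end:

```python
import torch
from transformers import BertForMaskedLM, BertTokenizer

# Same example model name as above; the real identifier may differ.
model_name = "Rostlab/prot_bert"
tokenizer = BertTokenizer.from_pretrained(model_name, do_lower_case=False)
model = BertForMaskedLM.from_pretrained(model_name)
model.eval()

# Mask several amino acids at once by putting the mask token directly in the input.
sequence = "M K T [MASK] Y I [MASK] K Q R"
inputs = tokenizer(sequence, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits  # (1, seq_len, vocab_size)

# Full probability distribution over the vocabulary at every position.
probs = torch.softmax(logits, dim=-1)

# Look only at the positions that were masked.
mask_positions = (inputs["input_ids"] == tokenizer.mask_token_id).nonzero(as_tuple=True)
for batch_idx, pos in zip(*mask_positions):
    dist = probs[batch_idx, pos]  # distribution over all tokens at this position
    top = torch.topk(dist, k=5)
    tokens = tokenizer.convert_ids_to_tokens(top.indices.tolist())
    print(pos.item(), list(zip(tokens, top.values.tolist())))
```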