Using transformers (BERT, RoBERTa) without embedding layer

I’m looking to train a RoBERTa model on protein sequences, which is in many ways similar to normal NLP training, but in others quite different.

In the language of proteins, I have 20 characters instead of the normal 26 characters used in English (it is 26 right? :D), so that is rather similar. The big difference is that you don’t really combine the characters in proteins into actual words, but rather just keep each character as a distinct token or class.

Hence, my input to the transformer model could essentially just be a list of numbers ranging from 0 to 19. However, that would mean my input has only 1 feature, and I’m not sure a transformer could work with that?

I’m thinking of just doing a one-hot encoding of these characters, which would give me 20 input features. However, this is of course still very low compared to how normal transformers are trained, where d_model is somewhere in the range of 128-512, if I understand correctly.
Does anyone have any experience with anything like this? Any good advice on how to make it most likely to work?
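
For reference, a minimal sketch of how integer token IDs relate to one-hot inputs. In BERT-style models the embedding layer is a learned lookup table with one row of size d_model per token, so an ID in 0-19 is an index into that table rather than a single numeric feature, and looking up a row gives the same result as multiplying a one-hot vector by the table. The names below (`AMINO_ACIDS`, `tokenize`, `make_embedding_table`, `embed`, `one_hot_times_table`) are hypothetical helpers for illustration, the table is random instead of learned, and d_model=128 is just an assumed value:

```python
import random

# Assumption: the 20 standard amino acids, one token per residue.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
TOKEN_TO_ID = {aa: i for i, aa in enumerate(AMINO_ACIDS)}


def tokenize(sequence):
    """Map each residue to an integer ID in 0..19."""
    return [TOKEN_TO_ID[aa] for aa in sequence]


def make_embedding_table(vocab_size, d_model, seed=0):
    """Random stand-in for a learned embedding matrix (vocab_size x d_model)."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 0.02) for _ in range(d_model)]
            for _ in range(vocab_size)]


def embed(token_ids, table):
    """Embedding lookup: pick one row of the table per token."""
    return [table[i] for i in token_ids]


def one_hot(token_id, vocab_size):
    return [1.0 if j == token_id else 0.0 for j in range(vocab_size)]


def one_hot_times_table(token_ids, table):
    """Same result as embed(), computed as one-hot vector times the table."""
    vocab_size, d_model = len(table), len(table[0])
    out = []
    for t in token_ids:
        v = one_hot(t, vocab_size)
        out.append([sum(v[j] * table[j][k] for j in range(vocab_size))
                    for k in range(d_model)])
    return out


ids = tokenize("MKTAYIAK")                        # 8 integer IDs
table = make_embedding_table(len(AMINO_ACIDS), d_model=128)
vectors = embed(ids, table)                       # 8 vectors of size 128
```

So a small vocabulary does not force a small d_model: the width of the table is chosen independently of the number of tokens.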


I’d recommend taking a look at this repo by @agemagician: This repo uses transformer models for protein sequences, if I understand it correctly.

Also, taking a look at those models:

might help. I’m not sure if there is a notebook on protein sequence language modeling; maybe @agemagician has a good pointer by chance :)


Hi @tueboesen,

Yes, it will work. It can give you results very close to multiple sequence alignment (MSA) methods, and sometimes even better. If you combine it with MSA, it will give you even better results than MSA methods alone.

We have trained Transformer-XL, XLNet, BERT, ALBERT, ELECTRA, and T5 on the UniRef100 and BFD datasets. I would recommend simply using one of these models, because it takes a tremendous amount of computing power to reach good results.

You can find them here:

You can find more details in our paper:

Facebook also trained RoBERTa on the UniRef50 dataset:

Unfortunately, we don’t have a notebook for training from scratch, but you can find more details to replicate our results here:

@patrickvonplaten:
You meant:

Not:


ProtTrans: Provides state-of-the-art pre-trained models for protein sequences.
CodeTrans: Provides state-of-the-art pre-trained models for computer source code.


Wow this is an amazing response, thank you so much for this. I will need some time to digest it all, but this is exactly what I need!

Is there a way for me to use any of the models to return probability distributions?

More specifically, I would like to see what exactly the model has learned and test it out a bit. To that end, I would love to be able to feed it a protein sequence where I have masked out some of the amino acids, and then have it return a probability distribution for the full returned protein.

I’m sure this is possible, after all this is how the model was trained in the first place, but I’m just a bit overwhelmed by all the models, so I haven’t managed to figure out how to do this.

You can find an answer to your question here:


Hmm that still doesn’t quite do it unless I’m missing something.
This does allow masking of a sequence, but you can only mask one amino acid in the sequence, and it doesn’t output the actual probability distribution, only the top 5 probabilities for that single masked amino acid.

You can pass the “top_k” parameter to the “fill-mask” pipeline to return more, or all, tokens.
Check here:

If it still doesn’t fit your use case, then you will have to implement it yourself.

Something like this could be a good starting point for you:
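
A minimal sketch of what implementing it yourself could involve, not necessarily the approach the reply had in mind. It assumes you already have one vector of raw scores (logits) per sequence position, restricted to the 20 amino acids, for example taken from the output of a masked language model's forward pass. The softmax turns the logits at each masked position into a full probability distribution, and any number of positions can be masked. The mask character `#`, the helper names, and the toy logits are all hypothetical placeholders for illustration:

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # assumed output vocabulary
MASK = "#"                            # hypothetical mask placeholder


def softmax(logits):
    """Numerically stable softmax over a list of scores."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def masked_distributions(masked_sequence, logits, vocab=AMINO_ACIDS, mask=MASK):
    """Return {position: {amino_acid: probability}} for every masked position.

    logits holds one list of len(vocab) scores per sequence position.
    """
    if len(logits) != len(masked_sequence):
        raise ValueError("need exactly one logit vector per position")
    result = {}
    for pos, (ch, row) in enumerate(zip(masked_sequence, logits)):
        if ch == mask:
            result[pos] = dict(zip(vocab, softmax(row)))
    return result


def top_k(distribution, k):
    """The k most probable amino acids, highest probability first."""
    ranked = sorted(distribution.items(), key=lambda kv: kv[1], reverse=True)
    return ranked[:k]


# Toy example: two masked positions and made-up logits (not model output).
sequence = "MK#AY#AK"
toy_logits = [[0.0] * len(AMINO_ACIDS) for _ in sequence]
toy_logits[2][AMINO_ACIDS.index("T")] = 3.0
toy_logits[5][AMINO_ACIDS.index("I")] = 2.0

dists = masked_distributions(sequence, toy_logits)
best_at_2 = top_k(dists[2], 3)
```

With a real model, the tokenizer's own mask token and vocabulary would replace `MASK` and `AMINO_ACIDS`, and the logits would need to be sliced to the amino-acid tokens before the softmax if you want the probabilities to sum to 1 over those 20 classes only.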