I’m looking to train a RoBERTa model on protein sequences, which is in many ways similar to normal NLP training, but quite different in others.
In the language of proteins, I have 20 characters instead of the 26 characters normally used in English (it is 26, right? :D), so that part is rather similar. The big difference is that you don’t really combine the characters in proteins into actual words, but rather keep each character as a distinct token or class.
Hence my input to the transformer model could essentially just be a list of numbers ranging from 0 to 19. However, that would mean my input only has a single feature, and I’m not sure a transformer can work with that?
I’m thinking of just doing a one-hot encoding of these characters, which would give me 20 input features. However, this is of course still very low compared to how normal transformers are trained, where d_model is somewhere in the range of 128-512 if I understand correctly.
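For concreteness, this is roughly what I mean (just a toy sketch; the amino-acid alphabet and the example sequence are made up for illustration):

```python
import torch

# Character-level vocabulary: the 20 standard amino acids.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
char_to_id = {aa: i for i, aa in enumerate(AMINO_ACIDS)}

sequence = "MKTAYIAKQR"  # toy example sequence

# Option 1: a plain list of integer IDs (0-19), one per amino acid.
token_ids = torch.tensor([char_to_id[aa] for aa in sequence])

# Option 2: one-hot encoding, giving 20 input features per position.
one_hot = torch.nn.functional.one_hot(token_ids, num_classes=len(AMINO_ACIDS)).float()

print(token_ids.shape)  # (10,)
print(one_hot.shape)    # (10, 20)
```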
Does anyone have any experience with anything like this? Any good advice on how it is most likely to work?
Yes, it will work. It can give you results very close to MSA-based methods, sometimes even better. If you combine it with MSA, it will give you even better results than MSA methods alone.
We have trained Transformer-XL, XLNet, BERT, ALBERT, ELECTRA, and T5 on the UniRef100 and BFD datasets. I would recommend simply using one of these models, because it requires a tremendous amount of computing power to reach good results.
You can find them here:
You can find more details in our paper:
Facebook also trained a RoBERTa model on the UniRef50 dataset:
Unfortunately, we don’t have a notebook for training from scratch, but you can find more details on replicating our results here:
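As a rough sketch (the hub identifier below is just an example; check the links above for the actual model names), loading one of these pretrained models with the transformers library looks something like this:

```python
import torch
from transformers import BertForMaskedLM, BertTokenizer

# "Rostlab/prot_bert" is used here only as an example identifier;
# see the links above for the actual model names.
model_name = "Rostlab/prot_bert"
tokenizer = BertTokenizer.from_pretrained(model_name, do_lower_case=False)
model = BertForMaskedLM.from_pretrained(model_name)
model.eval()

# These protein language models typically expect space-separated amino acids.
sequence = "M K T A Y I A K Q R"
inputs = tokenizer(sequence, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

print(outputs.logits.shape)  # (1, sequence_length, vocab_size)
```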
Is there a way for me to use any of the models to return probability distributions?
More specifically, I would like to see what exactly the model has learned and test it out a bit. To that end, I would love to be able to feed it a protein sequence where I have masked out some of the amino acids, and have it return a probability distribution over the amino acids at each position of the returned protein.
I’m sure this is possible, since this is how the model was trained in the first place, but I’m just a bit overwhelmed by all the models and haven’t managed to figure out how to do it.
Hmm, that still doesn’t quite do it, unless I’m missing something.
This does allow masking a sequence, but you can only mask one amino acid at a time, and it doesn’t return the actual probability distribution, only the top-5 probabilities for that single masked amino acid.
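What I’m after is something more like the rough sketch below: calling the model directly and applying a softmax to the logits at every masked position. I’m assuming a BERT-style masked-LM model here, reusing the example identifier from above, and I haven’t verified this end-to-end:

```python
import torch
from transformers import BertForMaskedLM, BertTokenizer

# Same example model name as above; the real identifier may differ.
model_name = "Rostlab/prot_bert"
tokenizer = BertTokenizer.from_pretrained(model_name, do_lower_case=False)
model = BertForMaskedLM.from_pretrained(model_name)
model.eval()

# Mask several amino acids at once by putting the mask token directly in the input.
sequence = "M K T [MASK] Y I [MASK] K Q R"
inputs = tokenizer(sequence, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits  # (1, seq_len, vocab_size)

# Full probability distribution over the vocabulary at every position.
probs = torch.softmax(logits, dim=-1)

# Look only at the positions that were masked.
mask_positions = (inputs["input_ids"] == tokenizer.mask_token_id).nonzero(as_tuple=True)
for batch_idx, pos in zip(*mask_positions):
    dist = probs[batch_idx, pos]  # distribution over all tokens at this position
    top = torch.topk(dist, k=5)
    tokens = tokenizer.convert_ids_to_tokens(top.indices.tolist())
    print(pos.item(), list(zip(tokens, top.values.tolist())))
```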