PreTrain ProteinBERT from scratch

ProteinBERT is a universal deep-learning model of protein sequence and function based on the BERT architecture. The goal of this project is to pretrain the ProteinBERT model in JAX/Flax for downstream finetuning tasks like predicting protein structure, post-translational modifications, and/or biophysical attributes. The model itself can be seen as an extension of the Protein Transformer models trained by the Rostlab.

2. Language

The model will be trained on protein sequences and gene ontology (GO) annotations.

3. Model

The classic Transformer/BERT architecture, with some additions, most notably a second, global input/output pathway for the GO annotations.
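For orientation, here is a minimal Flax sketch of that dual-input idea: amino-acid token embeddings combined with a global GO-annotation vector, a small Transformer encoder, and two output heads. Every name and size in it (e.g. ProteinBertSketch, num_go_terms) is an illustrative assumption, not the original architecture.

```python
# A minimal, illustrative Flax sketch of the dual-input idea behind
# ProteinBERT: per-token amino-acid embeddings plus a global GO vector.
# Module names, sizes, and the merge strategy are assumptions, not the
# original architecture.
import jax
import jax.numpy as jnp
import flax.linen as nn


class ProteinBertSketch(nn.Module):
    vocab_size: int = 26      # 20 amino acids + special tokens (assumed)
    num_go_terms: int = 512   # placeholder GO vocabulary size
    hidden: int = 128
    num_layers: int = 2
    num_heads: int = 4

    @nn.compact
    def __call__(self, seq_tokens, go_annotations):
        # seq_tokens: (batch, length) int32
        # go_annotations: (batch, num_go_terms) float32 multi-hot
        x = nn.Embed(self.vocab_size, self.hidden)(seq_tokens)
        # Broadcast a projection of the global GO vector onto every position.
        g = nn.Dense(self.hidden)(go_annotations)
        x = x + g[:, None, :]
        for _ in range(self.num_layers):
            x = x + nn.SelfAttention(num_heads=self.num_heads)(nn.LayerNorm()(x))
            x = x + nn.gelu(nn.Dense(self.hidden)(nn.LayerNorm()(x)))
        # Two heads: per-token amino-acid logits and global GO logits.
        token_logits = nn.Dense(self.vocab_size)(x)
        go_logits = nn.Dense(self.num_go_terms)(x.mean(axis=1))
        return token_logits, go_logits


# Quick shape check.
model = ProteinBertSketch()
tokens = jnp.zeros((2, 64), dtype=jnp.int32)
gos = jnp.zeros((2, model.num_go_terms), dtype=jnp.float32)
params = model.init(jax.random.PRNGKey(0), tokens, gos)
token_logits, go_logits = model.apply(params, tokens, gos)  # (2, 64, 26), (2, 512)
```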

4. Datasets

ProteinBERT is pretrained on a dataset derived from UniRef90, which consists of ~106M protein sequences. Protein Gene Ontology (GO) annotations are used as additional inputs, to help the model infer the function of the input protein and update its internal representations and outputs accordingly.
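To make the input format concrete, the sketch below shows one way a single record could be encoded: integer tokens for the sequence and a multi-hot vector for its GO annotations. The alphabet ordering, special-token IDs, and GO index mapping are assumptions for illustration, not the actual preprocessing pipeline.

```python
# Illustrative encoding of one UniRef90-style record: the amino-acid
# sequence becomes integer tokens, the GO annotations a multi-hot vector.
# Alphabet, special-token IDs, and the GO index mapping are assumptions.
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
TOKEN_TO_ID = {aa: i + 2 for i, aa in enumerate(AMINO_ACIDS)}  # 0=<pad>, 1=<unk>

def encode_sequence(seq: str, max_len: int = 512) -> np.ndarray:
    """Map a protein string to fixed-length int32 token IDs, padding with 0."""
    ids = [TOKEN_TO_ID.get(aa, 1) for aa in seq[:max_len]]
    return np.array(ids + [0] * (max_len - len(ids)), dtype=np.int32)

def encode_go_annotations(go_ids, go_index, num_go_terms: int) -> np.ndarray:
    """Turn a list of GO identifiers into a multi-hot float32 vector."""
    vec = np.zeros(num_go_terms, dtype=np.float32)
    for go_id in go_ids:
        if go_id in go_index:
            vec[go_index[go_id]] = 1.0
    return vec

# Hypothetical record with a tiny illustrative GO vocabulary.
go_index = {"GO:0005524": 0, "GO:0016301": 1}
tokens = encode_sequence("MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ")
go_vec = encode_go_annotations(["GO:0005524"], go_index, num_go_terms=2)
```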

Possible links to publicly available datasets include:

5. Training scripts

Data preprocessing scripts are available in the official GitHub repo. The training script can be adapted from run_mlm_flax.py.
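The main adaptation is the masked-language-modeling step itself. Below is a rough sketch of BERT-style token corruption for amino-acid batches, in the spirit of the data collator used by run_mlm_flax.py; the masking probability, the 80/10/10 corruption split, and the special-token IDs (which follow the toy vocabulary above) are assumptions, and the ProteinBERT paper's own corruption scheme may differ in detail.

```python
# Sketch of BERT-style input corruption for amino-acid token batches,
# roughly the piece of run_mlm_flax.py's data collator that would need
# adapting. Masking probability, the 80/10/10 split, and the token IDs
# are assumptions.
import numpy as np

PAD_ID, MASK_ID = 0, 22  # assumed special-token IDs; amino acids use 2..21

def mask_tokens(tokens, rng, mlm_prob=0.15):
    """Return (corrupted_inputs, labels); labels is -100 where no loss applies."""
    labels = tokens.copy()
    selected = (rng.random(tokens.shape) < mlm_prob) & (tokens != PAD_ID)
    labels[~selected] = -100                   # compute loss only on selected positions

    inputs = tokens.copy()
    roll = rng.random(tokens.shape)
    inputs[selected & (roll < 0.8)] = MASK_ID  # 80%: replace with <mask>
    rand_pos = selected & (roll >= 0.8) & (roll < 0.9)
    inputs[rand_pos] = rng.integers(2, 22, size=tokens.shape)[rand_pos]  # 10%: random amino acid
    # Remaining 10%: keep the original token unchanged.
    return inputs, labels

rng = np.random.default_rng(0)
batch = rng.integers(2, 22, size=(2, 16))      # fake batch of amino-acid IDs
inputs, labels = mask_tokens(batch, rng)
```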

6. Challenges

  • The protein sequence and GO annotation data require ~1 TB of scratch disk space.
  • The original paper reports a pretraining time of ~28 days on a single GPU (Nvidia Quadro RTX 5000).
  • Besides NLP skills, some bioinformatics skills may be required.

7. Desired project outcome

A pretrained model that can be further finetuned to predict protein structure, post-translational modifications, and/or biophysical attributes.

8. Reads

The following links can be useful to better understand the project and what has previously been done.

Thanks for the shoutout :sweat_smile:. I’m definitely interested. What’s the course of action, @timothyman?

I believe we have to wait till this project is officially accepted, so that we get access to those TPUs. :sweat_smile: But I will have a look and see whether I can preprocess the training data, since that seems to be taking some time too.

Yes, definitely a great idea! Finalizing this project :slight_smile:

Hey, join us on the Discord channel; we’re in the #proteinbert text channel.

Hi

Can this be used for any protein-based classification task? Also, can the input data be in FASTA file format, or what are the best practices for preparing input data for training and testing?

Thank you,
Zakia