PreTrain ProteinBERT from scratch

ProteinBERT is a universal deep-learning model of protein sequence and function based on the BERT architecture. The goal of this project is to pretrain the ProteinBERT model in JAX/Flax for downstream finetuning tasks like predicting protein structure, post-translational modifications, and/or biophysical attributes. The model itself can be seen as an extension of the Protein Transformer models trained by the Rostlab.

2. Language

The model will be trained on protein sequences and gene ontology (GO) annotations.

3. Model

The classic Transformer/BERT architecture, with some additions, most notably a second, global input/output pathway for the GO annotations.
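For orientation, here is a minimal Flax sketch of that dual-input idea: amino-acid token embeddings combined with a global GO-annotation vector, a small Transformer encoder, and two output heads. Every name and size in it (e.g. ProteinBertSketch, num_go_terms) is an illustrative assumption, not the original architecture.

```python
# A minimal, illustrative Flax sketch of the dual-input idea behind
# ProteinBERT: per-token amino-acid embeddings plus a global GO vector.
# Module names, sizes, and the merge strategy are assumptions, not the
# original architecture.
import jax
import jax.numpy as jnp
import flax.linen as nn


class ProteinBertSketch(nn.Module):
    vocab_size: int = 26      # 20 amino acids + special tokens (assumed)
    num_go_terms: int = 512   # placeholder GO vocabulary size
    hidden: int = 128
    num_layers: int = 2
    num_heads: int = 4

    @nn.compact
    def __call__(self, seq_tokens, go_annotations):
        # seq_tokens: (batch, length) int32
        # go_annotations: (batch, num_go_terms) float32 multi-hot
        x = nn.Embed(self.vocab_size, self.hidden)(seq_tokens)
        # Broadcast a projection of the global GO vector onto every position.
        g = nn.Dense(self.hidden)(go_annotations)
        x = x + g[:, None, :]
        for _ in range(self.num_layers):
            x = x + nn.SelfAttention(num_heads=self.num_heads)(nn.LayerNorm()(x))
            x = x + nn.gelu(nn.Dense(self.hidden)(nn.LayerNorm()(x)))
        # Two heads: per-token amino-acid logits and global GO logits.
        token_logits = nn.Dense(self.vocab_size)(x)
        go_logits = nn.Dense(self.num_go_terms)(x.mean(axis=1))
        return token_logits, go_logits


# Quick shape check.
model = ProteinBertSketch()
tokens = jnp.zeros((2, 64), dtype=jnp.int32)
gos = jnp.zeros((2, model.num_go_terms), dtype=jnp.float32)
params = model.init(jax.random.PRNGKey(0), tokens, gos)
token_logits, go_logits = model.apply(params, tokens, gos)  # (2, 64, 26), (2, 512)
```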

4. Datasets

ProteinBERT is pretrained on a dataset derived from UniRef90, which consists of ~106M protein sequences. Protein Gene Ontology (GO) annotations are used as additional inputs, to help the model infer the function of the input protein and update its internal representations and outputs accordingly.
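To make the input format concrete, the sketch below shows one way a single record could be encoded: integer tokens for the sequence and a multi-hot vector for its GO annotations. The alphabet ordering, special-token IDs, and GO index mapping are assumptions for illustration, not the actual preprocessing pipeline.

```python
# Illustrative encoding of one UniRef90-style record: the amino-acid
# sequence becomes integer tokens, the GO annotations a multi-hot vector.
# Alphabet, special-token IDs, and the GO index mapping are assumptions.
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
TOKEN_TO_ID = {aa: i + 2 for i, aa in enumerate(AMINO_ACIDS)}  # 0=<pad>, 1=<unk>

def encode_sequence(seq: str, max_len: int = 512) -> np.ndarray:
    """Map a protein string to fixed-length int32 token IDs, padding with 0."""
    ids = [TOKEN_TO_ID.get(aa, 1) for aa in seq[:max_len]]
    return np.array(ids + [0] * (max_len - len(ids)), dtype=np.int32)

def encode_go_annotations(go_ids, go_index, num_go_terms: int) -> np.ndarray:
    """Turn a list of GO identifiers into a multi-hot float32 vector."""
    vec = np.zeros(num_go_terms, dtype=np.float32)
    for go_id in go_ids:
        if go_id in go_index:
            vec[go_index[go_id]] = 1.0
    return vec

# Hypothetical record with a tiny illustrative GO vocabulary.
go_index = {"GO:0005524": 0, "GO:0016301": 1}
tokens = encode_sequence("MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ")
go_vec = encode_go_annotations(["GO:0005524"], go_index, num_go_terms=2)
```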

Possible links to publicly available datasets include:

5. Training scripts

Data preprocessing scripts are available in the official GitHub repo. The training script can be adapted from run_mlm_flax.py.
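The main adaptation is the masked-language-modeling step itself. Below is a rough sketch of BERT-style token corruption for amino-acid batches, in the spirit of the data collator used by run_mlm_flax.py; the masking probability, the 80/10/10 corruption split, and the special-token IDs (which follow the toy vocabulary above) are assumptions, and the ProteinBERT paper's own corruption scheme may differ in detail.

```python
# Sketch of BERT-style input corruption for amino-acid token batches,
# roughly the piece of run_mlm_flax.py's data collator that would need
# adapting. Masking probability, the 80/10/10 split, and the token IDs
# are assumptions.
import numpy as np

PAD_ID, MASK_ID = 0, 22  # assumed special-token IDs; amino acids use 2..21

def mask_tokens(tokens, rng, mlm_prob=0.15):
    """Return (corrupted_inputs, labels); labels is -100 where no loss applies."""
    labels = tokens.copy()
    selected = (rng.random(tokens.shape) < mlm_prob) & (tokens != PAD_ID)
    labels[~selected] = -100                   # compute loss only on selected positions

    inputs = tokens.copy()
    roll = rng.random(tokens.shape)
    inputs[selected & (roll < 0.8)] = MASK_ID  # 80%: replace with <mask>
    rand_pos = selected & (roll >= 0.8) & (roll < 0.9)
    inputs[rand_pos] = rng.integers(2, 22, size=tokens.shape)[rand_pos]  # 10%: random amino acid
    # Remaining 10%: keep the original token unchanged.
    return inputs, labels

rng = np.random.default_rng(0)
batch = rng.integers(2, 22, size=(2, 16))      # fake batch of amino-acid IDs
inputs, labels = mask_tokens(batch, rng)
```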

6. Challenges

  • The protein sequence and GO annotation data require ~1 TB of scratch disk space.
  • The original paper reports a pretraining time of ~28 days on a single GPU (Nvidia Quadro RTX 5000).
  • Besides NLP skills, some bioinformatics skills may be required.

7. Desired project outcome

A pretrained model that can be further finetuned to predict protein structure, post-translational modifications, and/or biophysical attributes.

8. Reads

The following links can be useful to better understand the project and what has previously been done.

Thanks for the shoutout :sweat_smile:. I’m definitely interested. What’s the course of action, @timothyman?

I believe we have to wait till this project is officially accepted, so that we get access to those TPUs. :sweat_smile: But I will have a look and see whether I can preprocess the training data, since that seems to be taking some time too.

Yes, definitely a great idea! Finalizing this project :slight_smile:

Hey, join us on the Discord channel; we’re in the #proteinbert text channel.

Hi

Can this be used for any protein-based classification task? Also, can the input data be in FASTA file format, or what are the best practices for preparing input data for training and testing?

Thank you,
Zakia