ProteinBERT is a universal deep-learning model of protein sequence and function based on the BERT architecture. The goal of this project is to pretrain the ProteinBERT model in JAX/Flax for downstream finetuning tasks such as predicting protein structure, post-translational modifications, and/or biophysical attributes. The model itself can be seen as an extension of the Protein Transformer models trained by the Rostlab.
The model will be trained on protein sequences and gene ontology (GO) annotations.
The model uses the classic Transformer/BERT architecture, with some additions.
ProteinBERT is pretrained on a dataset of ~106M protein sequences derived from UniRef90. Protein Gene Ontology (GO) annotations are used as additional inputs, helping the model infer the function of the input protein and update its internal representations and outputs accordingly.
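To make the two kinds of input concrete, here is a minimal sketch of how a protein sequence and its GO annotations might be turned into arrays for a JAX/Flax model. The vocabulary, the special tokens, the three-term GO list, and the function names `encode_sequence` and `encode_annotations` are all assumptions made for illustration; they are not taken from the ProteinBERT code base, and a real pipeline would build its GO vocabulary from the dataset.

```python
import jax.numpy as jnp

# Assumed vocabulary: the 20 standard amino acids plus hypothetical special tokens.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
SPECIAL_TOKENS = ["<pad>", "<start>", "<end>", "<unk>"]
TOKEN_TO_ID = {tok: i for i, tok in enumerate(SPECIAL_TOKENS + list(AMINO_ACIDS))}

# Hypothetical tiny GO vocabulary; a real one would be derived from the dataset.
GO_TERMS = ["GO:0003677", "GO:0005524", "GO:0016020"]
GO_TO_INDEX = {term: i for i, term in enumerate(GO_TERMS)}


def encode_sequence(seq, max_len):
    """Map an amino-acid string to a fixed-length array of token IDs."""
    ids = [TOKEN_TO_ID["<start>"]]
    # Truncate so that the start and end tokens always fit within max_len.
    ids += [TOKEN_TO_ID.get(aa, TOKEN_TO_ID["<unk>"]) for aa in seq[: max_len - 2]]
    ids.append(TOKEN_TO_ID["<end>"])
    # Right-pad to the fixed length.
    ids += [TOKEN_TO_ID["<pad>"]] * (max_len - len(ids))
    return jnp.array(ids, dtype=jnp.int32)


def encode_annotations(go_ids):
    """Map a list of GO identifiers to a multi-hot vector over GO_TERMS."""
    vec = [0.0] * len(GO_TERMS)
    for go_id in go_ids:
        if go_id in GO_TO_INDEX:  # identifiers outside the vocabulary are ignored
            vec[GO_TO_INDEX[go_id]] = 1.0
    return jnp.array(vec, dtype=jnp.float32)


tokens = encode_sequence("MKTAYIAK", max_len=12)
annotations = encode_annotations(["GO:0005524"])
```

In this sketch the sequence becomes a padded integer array and the annotations become a fixed-size multi-hot vector, so both can be batched and passed to the model side by side.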
Possible links to publicly available datasets include:
- The protein sequence and GO annotation data require ~1 TB of scratch disk space.
- The original paper states a pretraining time of ~28 days on a single GPU (Nvidia Quadro RTX 5000).
- Besides NLP skills, some bioinformatics skills may be required.
The result is a model that can be further finetuned to predict protein structure, post-translational modifications, and/or biophysical attributes.
The following links can be useful to better understand the project and
what has previously been done.