BigBirDNA

kees · June 24, 2021, 11:43pm

Pretraining BigBird on DNA sequences. This provides a base model for downstream DNA sequence analysis tasks.

2. Language

The model will be trained on DNA sequences.

3. Model

BigBird.

4. Datasets

All the available DNA sequences.

Possible links to publicly available datasets include:

www.ncbi.nlm.nih.gov/genbank/
Others can be found on www.insdc.org

5. Training scripts

There is a Flax BigBird implementation in huggingface/transformers on GitHub: transformers/modeling_flax_big_bird.py (master branch).
There is also transformers/run_mlm_flax.py (master branch, huggingface/transformers on GitHub) for word masking, which we may be able to tweak for BigBird.
There is jerryji1993/DNABERT on GitHub (DNABERT: pre-trained Bidirectional Encoder Representations from Transformers model for DNA-language in genome), which can be used to copy or tweak downstream tasks.
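
To make the masking step concrete, here is a minimal sketch of BERT-style token masking (select about 15% of positions, then replace 80% of those with a mask token, 10% with a random token, and leave 10% unchanged). It is not the code from run_mlm_flax.py; the function name `mask_tokens`, its parameters, and the example vocabulary size are assumptions for illustration.

```python
import random


def mask_tokens(token_ids, mask_id, vocab_size, mlm_probability=0.15, seed=0):
    """BERT-style masking sketch (hypothetical helper, not run_mlm_flax.py).

    Returns (inputs, labels); labels are -100 at positions that are not
    selected for prediction.
    """
    rng = random.Random(seed)
    inputs = list(token_ids)
    labels = [-100] * len(token_ids)
    for i, tok in enumerate(token_ids):
        if rng.random() >= mlm_probability:
            continue  # position not selected for prediction
        labels[i] = tok
        roll = rng.random()
        if roll < 0.8:
            inputs[i] = mask_id  # 80%: replace with the mask token
        elif roll < 0.9:
            inputs[i] = rng.randrange(vocab_size)  # 10%: random token
        # remaining 10%: keep the original token
    return inputs, labels


# Assumed vocabulary: 4**6 six-mers plus 5 special tokens, mask token id 4.
ids = list(range(5, 105))
inputs, labels = mask_tokens(ids, mask_id=4, vocab_size=4**6 + 5)
print(sum(label != -100 for label in labels), "positions selected")
```

Note that this sketch can draw a special token as the "random" replacement; a real pipeline would probably exclude those.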

6. (Optional) Challenges

There will be quite a bit of preprocessing involved
There will be quite a lot of data involved (about half a TB compressed)
Besides NLP skills there may be some bioinformatics skills required
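
As a rough illustration of what the preprocessing might look like, here is a minimal sketch that splits a sequence into overlapping k-mers, the tokenization style DNABERT describes. The proposal does not settle on a tokenizer, so the helper name `kmer_tokenize`, the default `k=6`, and the mapping of ambiguous bases to `[UNK]` are all assumptions.

```python
def kmer_tokenize(sequence, k=6, stride=1):
    """Split a DNA string into overlapping k-mers (hypothetical helper).

    Any k-mer containing a character other than A/C/G/T (for example the
    ambiguity code N) is replaced by "[UNK]".
    """
    seq = sequence.upper()
    tokens = []
    for start in range(0, len(seq) - k + 1, stride):
        kmer = seq[start:start + k]
        tokens.append(kmer if set(kmer) <= set("ACGT") else "[UNK]")
    return tokens


print(kmer_tokenize("ACGTNACGT", k=3))
# ['ACG', 'CGT', '[UNK]', '[UNK]', '[UNK]', 'ACG', 'CGT']
```

With `stride=1` the token count is roughly the sequence length, so the choice of tokenization directly affects how long the model inputs become.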

7. (Optional) Desired project outcome

Pretrain a transformer that can emulate or improve on the downstream tasks mentioned in the reads below.

8. (Optional) Reads

The following links can be useful to better understand the project and
what has previously been done.

SauravMaheshkar · June 25, 2021, 11:16am

Sounds interesting @kees, I was in fact working on something similar → a JAX/Flax implementation of ProteinBERT. Pretraining RoBERTa seems like a good strategy. I’m definitely interested.

agemagician · June 25, 2021, 11:38am

DNA sequences are very long, so it is better to use the “BigBird” model than “RoBERTa”.

patrickvonplaten · June 25, 2021, 5:28pm

Agree with @agemagician - this could be a very interesting project to use BigBird for!

We have FlaxBigBird merged (transformers/modeling_flax_big_bird.py on the master branch of huggingface/transformers on GitHub), and the official BigBird paper also did some experiments on genomics (DNA) sequence modeling: [2007.14062] Big Bird: Transformers for Longer Sequences

I think one should be able to slightly tweak https://github.com/huggingface/transformers/blob/master/examples/flax/language-modeling/run_mlm_flax.py for it to work well with BigBird. Also pinging our “BigBird” expert @vasudevgupta

vasudevgupta · June 25, 2021, 5:44pm

Thanks @patrickvonplaten for pinging me here. Training BigBird on DNA sequences sounds so interesting (always wanted to do that). I would also be happy to work on this one.

kees · June 25, 2021, 9:15pm

I had a look at the BigBird paper and it sounds like a great starting point - thanks for the suggestion!

Dimitre · June 27, 2021, 12:41am

Hi everyone, this project seems very interesting and I would love to join. If you want to know a little more about my background, check out my GitHub; I have worked on a somewhat similar problem.

valhalla · June 28, 2021, 5:11pm

Let’s officially define this project

Putting everybody in the official sheet here. More people can still join! Leave a comment here or on the sheet if you want to change something.

SauravMaheshkar · June 28, 2021, 5:28pm

Hey @valhalla, I’m still not sure if I want to be a part of this project or the ProteinBERT project. I’d rather be a part of ProteinBERT, in case it gets accepted.

patrickvonplaten · June 29, 2021, 2:13pm

ProteinBERT is accepted - I’ve added you there. Will remove you from BigBirDNA then.

patrickvonplaten · June 29, 2021, 2:13pm

Let me know if you want to take part in both projects @SauravMaheshkar

shpotes · June 29, 2021, 9:04pm

Do the attention patterns that Big Bird uses have any biological sense or could they deteriorate performance for some tasks?

I mean, BigBird introduces sparsity into the model using different attention patterns (see figure below) that have a certain linguistic sense. However, I don’t know if they can be easily extrapolated to relationships between k-mers.
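
For readers unfamiliar with those patterns, the BigBird paper combines three of them: a sliding window, a few global tokens, and a few random connections per query. Here is a minimal token-level sketch of such a mask. It is an illustration only: the actual implementation works on blocks of tokens rather than single tokens, and the function name `bigbird_style_mask` and its default values are assumptions.

```python
import random


def bigbird_style_mask(seq_len, window=3, num_global=2, num_random=2, seed=0):
    """Return a seq_len x seq_len 0/1 matrix (hypothetical helper).

    mask[i][j] == 1 means query position i may attend to key position j.
    """
    rng = random.Random(seed)
    mask = [[0] * seq_len for _ in range(seq_len)]
    half = window // 2
    for i in range(seq_len):
        # Sliding window: each position sees its local neighbourhood.
        for j in range(max(0, i - half), min(seq_len, i + half + 1)):
            mask[i][j] = 1
        # Random attention: a few arbitrary positions per query.
        for j in rng.sample(range(seq_len), min(num_random, seq_len)):
            mask[i][j] = 1
    # Global tokens: the first num_global positions attend everywhere
    # and are attended to by every position.
    for g in range(min(num_global, seq_len)):
        for j in range(seq_len):
            mask[g][j] = 1
            mask[j][g] = 1
    return mask


mask = bigbird_style_mask(16)
density = sum(map(sum, mask)) / 16 ** 2
print(f"fraction of attended pairs: {density:.2f}")
```

Whether local windows plus a handful of global and random links match the dependencies between k-mers is exactly the open question here; the sketch only shows what the pattern allows, not whether it is biologically meaningful.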

SauravMaheshkar · June 30, 2021, 2:23am

Thanks @patrickvonplaten, I didn’t know if one could participate in two projects. I’ve done some work on ProteinBERT already, so working on BigBirDNA would be a learning opportunity. Although I’m not sure how much time I’ll be able to give to two projects. For now I’ll stick to ProteinBERT.

patrickvonplaten · June 30, 2021, 12:28pm

Sounds good!

agemagician · July 1, 2021, 7:18pm

I haven’t trained and compared both of them on DNA sequences.
However, the main problem is that DNA sequences are extremely long, and the only option for training on them is to use an efficient transformer.
Currently, AFAIK, the only efficient transformer that was ported to Trax is BigBird.
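
To put rough numbers on the length argument, here is a small sketch that counts how many query-key scores one attention head computes under full attention versus a sparse pattern with a fixed number of keys per query. The helper name `attention_pairs` and the budget of 512 keys per query are assumptions chosen for illustration, not BigBird's actual settings.

```python
def attention_pairs(seq_len, keys_per_query=None):
    """Number of query-key scores computed for one attention head.

    keys_per_query=None means full attention (every pair is scored).
    """
    if keys_per_query is None:
        return seq_len * seq_len
    return seq_len * min(keys_per_query, seq_len)


for n in (512, 4096, 32768):
    full = attention_pairs(n)
    sparse = attention_pairs(n, keys_per_query=512)  # assumed budget
    print(f"n={n}: full={full}, sparse={sparse}, ratio={full // sparse}")
```

Full attention grows quadratically with sequence length while the sparse count grows linearly, which is why the gap widens as sequences get longer.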

SauravMaheshkar · July 7, 2021, 8:07pm

Hey @patrickvonplaten, I’m almost done with my set of contributions to the ProteinBERT project. My work was mostly focused on creating the model architecture, and I’ve almost completed it. I was wondering if I could join the BigBird team as well. I know it’s kinda late, but if it’s okay with you guys and the members of the group, I’d love to contribute to the BigBirDNA project as well.

kees · July 8, 2021, 5:09am

It would be awesome if you can jump in - I am a little bit in over my head with other tasks at the moment, so the project has not made a lot of progress just yet.

GrimSqueaker · February 13, 2022, 9:14am

Hi, one of the authors of ProteinBERT here - please let me or Nadav know if you have questions for the project.

jel4h · May 19, 2022, 7:45pm

Hi. I’d love to use this model when it’s ready. Do you have an ETA on its availability?

Thanks,
Julie

exnx · February 24, 2023, 9:14am

Curious what the status of this effort is.
