BigBirDNA - Pretraining BigBird on DNA sequences


Pretraining BigBird on DNA sequences. This provides a base model for downstream DNA sequence analysis tasks.

2. Language

The model will be trained on DNA sequences.

3. Model

BigBird

4. Datasets

All publicly available DNA sequences.

Possible links to publicly available datasets include:

5. Training scripts

6. (Optional) Challenges

  • There will be quite a bit of preprocessing involved
  • There will be quite a lot of data involved (about half a TB compressed)
  • Besides NLP skills, some bioinformatics skills may be required
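The preprocessing mentioned above usually means turning raw sequences into overlapping k-mers before tokenization. A minimal stdlib-only sketch, where the k-mer size, special tokens, and vocabulary layout are all illustrative assumptions, not part of the project spec:

```python
from itertools import product

def kmerize(seq: str, k: int = 3) -> list[str]:
    """Split a DNA sequence into overlapping k-mers (stride 1)."""
    seq = seq.upper()
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def build_vocab(k: int = 3) -> dict[str, int]:
    """Toy vocabulary: a few BERT-style special tokens plus all 4**k k-mers."""
    specials = ["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"]
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    return {tok: i for i, tok in enumerate(specials + kmers)}

def encode(seq: str, vocab: dict[str, int], k: int = 3) -> list[int]:
    """Map k-mers to ids; ambiguous bases (e.g. 'N') fall back to [UNK]."""
    unk = vocab["[UNK]"]
    return [vocab.get(km, unk) for km in kmerize(seq, k)]

vocab = build_vocab(k=3)
ids = encode("ACGTN", vocab)  # the k-mer "GTN" maps to [UNK]
```

In practice one would wrap something like this in a `tokenizers`-compatible tokenizer, but the k-mer windowing itself is the bioinformatics-flavored part.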

7. (Optional) Desired project outcome

  • Pretrain a transformer that can emulate or improve on the downstream tasks mentioned in the reads below

8. (Optional) Reads

The following links can be useful to better understand the project and
what has previously been done.


Sounds interesting @kees, I was in fact working on something similar → a jax/flax implementation of ProteinBERT. Pretraining RoBERTa seems like a good strategy. I’m definitely interested.


DNA sequences are very long, so it is better to use the “BigBird” model than “RoBERTa”.


Agree with @agemagician - this could be a very interesting project to use BigBird for!

We have FlaxBigBird merged (transformers/ at master · huggingface/transformers · GitHub), and the official BigBird paper also did some experiments on protein sequence modeling: [2007.14062] Big Bird: Transformers for Longer Sequences

I think one should be able to slightly tweak it to work well with BigBird 🙂 Also pinging our “BigBird” expert @vasudevgupta 😉


Thanks @patrickvonplaten for pinging me here. Training BigBird on DNA sequences sounds so interesting (always wanted to do that 🤩🤩). I would also be happy to work on this one.


I had a look at the BigBird paper and it sounds like a great starting point - thanks for the suggestion!


Hi everyone, this project seems very interesting and I would love to join. If you want to know a little more about my background, check out my GitHub; I have worked on a somewhat similar problem.

Let’s officially define this project 🙂

Putting everybody in the official sheet here. More people can still join! Leave a comment here or on the sheet if you want to change something.

Hey @valhalla, I’m still not sure if I want to be a part of this project or the ProteinBERT project. I’d rather be a part of ProteinBERT, in case it gets accepted.


ProteinBERT is accepted - I’ve added you there 🙂 Will remove you from BigBirDNA then 🙂

Let me know if you want to take part in both projects @SauravMaheshkar


Do the attention patterns that Big Bird uses have any biological sense or could they deteriorate performance for some tasks?

I mean, BigBird introduces sparsity into the model using different attention patterns (see figure below) that have a certain linguistic sense; however, I don’t know whether they can be easily extrapolated to relationships between k-mers.
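For intuition, the BigBird mask is the union of three components: a sliding window over neighbouring positions (which for k-mers means adjacent, overlapping subsequences), a few global tokens, and random long-range links. A small stdlib sketch of such a mask; the window size, number of global tokens, and random links per row are made-up illustrative values, not BigBird's actual block-sparse hyperparameters:

```python
import random

def bigbird_mask(n: int, window: int = 3, n_global: int = 2,
                 n_random: int = 2, seed: int = 0) -> list[list[bool]]:
    """Boolean attention mask: mask[i][j] is True iff i may attend to j."""
    rng = random.Random(seed)
    mask = [[False] * n for _ in range(n)]
    for i in range(n):
        # Sliding window: local neighbourhood, like adjacent k-mers.
        for j in range(max(0, i - window), min(n, i + window + 1)):
            mask[i][j] = True
        # Global tokens attend everywhere and are attended to by everyone.
        for g in range(n_global):
            mask[i][g] = mask[g][i] = True
        # Random links create short paths between distant sequence regions.
        for j in rng.sample(range(n), n_random):
            mask[i][j] = True
    return mask

mask = bigbird_mask(64)
density = sum(map(sum, mask)) / 64 ** 2  # far below 1.0 (full attention)
```

Whether the window component captures anything biologically meaningful (e.g. local motifs) while random links stand in for distal interactions is exactly the open question raised above.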

Thanks @patrickvonplaten, I didn’t know if one could participate in two projects. I’ve done some work on ProteinBERT already, so working on BigBirDNA would be a learning opportunity. Although I’m not sure how much time I’ll be able to give to two projects. For now I’ll stick to ProteinBERT.

Sounds good!

I didn’t train and compare both of them on DNA sequences.
However, the main problem is that DNA sequences are extremely long, and the only practical option for training on them is an efficient transformer.
Currently, AFAIK, the only efficient transformer that has been ported to Trax is BigBird.
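The length argument is easy to quantify: full self-attention scores n² position pairs per layer, while a block-sparse pattern like BigBird's touches a fixed number of blocks per block row, so its cost grows linearly in n. A back-of-the-envelope sketch; the block size and per-row block counts are illustrative assumptions, not the paper's exact configuration:

```python
def full_attention_pairs(n: int) -> int:
    """Pairs scored by dense self-attention: quadratic in sequence length."""
    return n * n

def bigbird_pairs(n: int, block: int = 64, window_blocks: int = 3,
                  global_blocks: int = 2, random_blocks: int = 3) -> int:
    """Rough pair count for a block-sparse pattern: each block row attends
    to a constant number of blocks, so the total is linear in n."""
    blocks_per_row = window_blocks + global_blocks + random_blocks
    n_block_rows = n // block
    return n_block_rows * blocks_per_row * block * block

for n in (512, 4096, 32768):
    ratio = full_attention_pairs(n) / bigbird_pairs(n)
    # The ratio grows linearly with n, so the sparse pattern is what
    # makes genome-scale sequence lengths feasible at all.
```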

Hey @patrickvonplaten, I’m almost done with my set of contributions to the ProteinBERT project. My work was mostly focused on creating the model architecture, and it’s nearly complete. I was wondering if I could join the BigBird team as well. I know it’s kind of late, but if it’s okay with you and the members of the group, I’d love to contribute to the BigBirDNA project as well.

It would be awesome if you could jump in - I am a little in over my head with other tasks at the moment, so the project has not made much progress just yet.