Pretraining BigBird on DNA sequences. This provides a base model for downstream DNA sequence analysis tasks.
The model will be trained on DNA sequences.
The training data will be all the available DNA sequences.
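Genomic sequence data is typically distributed as FASTA files, so preprocessing will likely start by streaming records out of them. A minimal stdlib-only sketch (the function name and file layout are illustrative assumptions, not part of the project spec):

```python
def read_fasta(path):
    """Yield (header, sequence) pairs from a FASTA file.

    Sequences are concatenated across wrapped lines and upper-cased,
    so downstream tokenization sees one contiguous string per record.
    """
    header, chunks = None, []
    with open(path) as fh:
        for line in fh:
            line = line.strip()
            if line.startswith(">"):
                # New record: emit the previous one, if any.
                if header is not None:
                    yield header, "".join(chunks)
                header, chunks = line[1:], []
            elif line:
                chunks.append(line.upper())
    # Emit the final record.
    if header is not None:
        yield header, "".join(chunks)
```

For the dataset sizes mentioned below (~0.5 TB compressed), a generator like this keeps memory bounded, since records are processed one at a time rather than loaded wholesale.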
Possible links to publicly available datasets and code include:
- There is a Flax BigBird implementation at `transformers/modeling_flax_big_bird.py` in the huggingface/transformers repository
- The Flax masked-language-modeling script `transformers/run_mlm_flax.py` in the same repository handles word masking, and we may be able to tweak it for BigBird
- DNABERT (github.com/jerryji1993/DNABERT), a pre-trained Bidirectional Encoder Representations from Transformers model for DNA language in genomes, can be used to copy/tweak downstream tasks
- Quite a bit of preprocessing will be involved
- A large amount of data is involved (about half a TB compressed)
- Besides NLP skills, some bioinformatics skills may be required
- The goal is to pretrain a transformer that can emulate or improve on the downstream tasks mentioned in the links below
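To make the preprocessing and masking steps above concrete, here is a small sketch assuming DNABERT-style overlapping k-mer tokenization and BERT-style random masking (the 15% rate, function names, and `[MASK]` token follow common MLM convention; the actual run_mlm_flax script handles this differently and more completely):

```python
import random

def kmer_tokenize(seq, k=6):
    """Split a DNA sequence into overlapping k-mers (DNABERT-style)."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def mask_tokens(tokens, mask_prob=0.15, mask_token="[MASK]", seed=0):
    """Replace a random fraction of tokens with [MASK].

    Returns the corrupted token list and a parallel label list: the
    original token at masked positions, None elsewhere (ignored in loss).
    """
    rng = random.Random(seed)
    masked, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            masked.append(mask_token)
            labels.append(tok)   # model must recover this token
        else:
            masked.append(tok)
            labels.append(None)  # not scored
    return masked, labels

tokens = kmer_tokenize("ACGTACGTAC", k=6)
# tokens == ['ACGTAC', 'CGTACG', 'GTACGT', 'TACGTA', 'ACGTAC']
```

Note that overlapping k-mers leak most of a masked token's content through its neighbors, which is one reason masking strategy (e.g. masking contiguous spans of k-mers) is a design decision worth revisiting for DNA.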
The following links can be useful to better understand the project and
what has previously been done.