GPT-2 on DNA data

Dear community,

I’m trying to build a GPT-2 transformer from scratch (without any pre-trained model) on DNA sequences, in order to generate longer DNA sequences from shorter ones. I’m a bit stuck: I couldn’t find any repo applying this kind of decoder-only transformer to DNA data, so I have no reference for the best tokenization and other technical choices…

Does anyone have references, or an opinion on whether this is a good idea?

Thank you in advance!


Normally a DNA sequence is segmented with the k-mer method.
For example, “ATCG” is segmented into ATC and TCG with 3-mers (overlapping windows of length k). k is commonly kept small, since there are 4^k possible k-mers and the vocabulary grows quickly; the DNABERT model uses this method with k between 3 and 6.
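To make the idea concrete, here is a minimal sketch of overlapping k-mer segmentation in plain Python (the function name and the stride of 1 are illustrative choices, not taken from any specific library):

```python
def kmer_tokenize(sequence: str, k: int = 3) -> list[str]:
    """Split a DNA string into overlapping k-mers with stride 1."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]

print(kmer_tokenize("ATCG", k=3))  # ['ATC', 'TCG']
```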

Other models use BPE tokenization instead, for example:
GENA-LM (AIRI-Institute/gena-lm-bert-base · Hugging Face) and
DNAGPT (dnagpt/human_gpt2-v1 · Hugging Face).
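If you would rather train your own BPE vocabulary on your corpus instead of reusing a pretrained one, the Hugging Face `tokenizers` library can train one directly on raw sequences. A sketch, where `dna_sequences.txt` (one sequence per line) and `vocab_size=1000` are placeholder choices:

```python
from tokenizers import Tokenizer, models, trainers

# BPE over the raw A/C/G/T alphabet; no pre-tokenizer, since DNA has no whitespace
tokenizer = Tokenizer(models.BPE(unk_token="[UNK]"))
trainer = trainers.BpeTrainer(vocab_size=1000, special_tokens=["[UNK]", "[PAD]"])
tokenizer.train(["dna_sequences.txt"], trainer)  # hypothetical training file

print(tokenizer.encode("GAGCACATTCGCC").tokens)
```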

A tokenization example:

```python
from transformers import AutoTokenizer

# Alternative checkpoints: AIRI-Institute/gena-lm-bert-base, zhihan1996/DNABERT-2-117M
tokenizer = AutoTokenizer.from_pretrained('dnagpt/human_gpt2-v1')
tokenizer.tokenize("GAGCACATTCGCCTGCGTGCGCACTCACACACACGTTCAAAAAGAGTCCATTCGATTCTGGCAGTAG")
# result: ['G', 'AGCAC', 'ATTCGCC', ...]
```
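And since the question is about training GPT-2 from scratch: once the tokenizer is fixed, a randomly initialized model only needs a config whose vocabulary size matches it. A minimal sketch, with illustrative (untuned) model sizes:

```python
from transformers import AutoTokenizer, GPT2Config, GPT2LMHeadModel

tokenizer = AutoTokenizer.from_pretrained('dnagpt/human_gpt2-v1')
config = GPT2Config(
    vocab_size=len(tokenizer),  # must match the DNA tokenizer
    n_positions=512,            # illustrative context length
    n_embd=256,                 # illustrative hidden size
    n_layer=6,
    n_head=8,
)
model = GPT2LMHeadModel(config)  # randomly initialized: no pre-trained weights
```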