I’m trying to build a GPT-2-style transformer from scratch (without any pre-trained model) on DNA sequences, in order to generate longer DNA sequences from shorter ones. I’m a bit stuck: I couldn’t find any repo applying this kind of decoder-only transformer to DNA, so I have few clues about the best tokenization and other technical choices…
Does anyone have references, or think this is a good idea?
DNA sequences are usually segmented with the k-mer method: a window of length k slides over the sequence with stride 1.
For example, “ATCG” is segmented into ATC and TCG with 3-mers. k is typically small, around 3 to 6 (DNABERT, for instance, trains separate models for k = 3, 4, 5, and 6).
The DNABERT model uses exactly this tokenization.
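In case a concrete sketch helps, overlapping k-mer tokenization with stride 1 can be written in a few lines of Python (the helper names `kmer_tokenize` and `build_kmer_vocab` are just illustrative, not from DNABERT’s code):

```python
from itertools import product

def kmer_tokenize(sequence: str, k: int = 3) -> list[str]:
    """Split a DNA sequence into overlapping k-mers (stride 1)."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]

def build_kmer_vocab(k: int = 3) -> dict[str, int]:
    """Map every possible k-mer over {A, C, G, T} to an integer id (4**k tokens)."""
    return {"".join(p): i for i, p in enumerate(product("ACGT", repeat=k))}

vocab = build_kmer_vocab(3)            # 64 tokens for k = 3
tokens = kmer_tokenize("ATCG", k=3)    # ['ATC', 'TCG']
ids = [vocab[t] for t in tokens]
print(tokens, ids)
```

For a GPT-2-style decoder you would then add your special tokens (e.g. BOS/EOS/PAD) on top of the 4**k k-mer vocabulary and set the model config’s vocab size accordingly.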