GPT-2 on DNA data

Dear community,

I’m trying to build a GPT-2 transformer from scratch (without any pre-trained model) on DNA sequences, in order to generate longer DNA sequences from shorter ones. I’m a bit stuck: I couldn’t find any repo applying this kind of decoder-only transformer to DNA data, so I have no reference for the best tokenization and other technical choices…

Does anyone have references, or an opinion on whether this is a good idea?

Thank you in advance!


Normally a DNA sequence is segmented with the k-mer method.
For example, “ATCG” is segmented into ATC and TCG with 3-mers (overlapping windows of length k). k is commonly kept small, since there are 4^k possible k-mers and the vocabulary grows quickly; the DNABERT model uses this method with k between 3 and 6.
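To make the idea concrete, here is a minimal sketch of overlapping k-mer segmentation in plain Python (the function name and the stride of 1 are illustrative choices, not taken from any specific library):

```python
def kmer_tokenize(sequence: str, k: int = 3) -> list[str]:
    """Split a DNA string into overlapping k-mers with stride 1."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]

print(kmer_tokenize("ATCG", k=3))  # ['ATC', 'TCG']
```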

Other models use BPE tokenization instead, for example:
GENA-LM (AIRI-Institute/gena-lm-bert-base · Hugging Face) and
DNAGPT (dnagpt/human_gpt2-v1 · Hugging Face).
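If you would rather train your own BPE vocabulary on your corpus instead of reusing a pretrained one, the Hugging Face `tokenizers` library can train one directly on raw sequences. A sketch, where `dna_sequences.txt` (one sequence per line) and `vocab_size=1000` are placeholder choices:

```python
from tokenizers import Tokenizer, models, trainers

# BPE over the raw A/C/G/T alphabet; no pre-tokenizer, since DNA has no whitespace
tokenizer = Tokenizer(models.BPE(unk_token="[UNK]"))
trainer = trainers.BpeTrainer(vocab_size=1000, special_tokens=["[UNK]", "[PAD]"])
tokenizer.train(["dna_sequences.txt"], trainer)  # hypothetical training file

print(tokenizer.encode("GAGCACATTCGCC").tokens)
```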

A tokenization example:

```python
from transformers import AutoTokenizer

# Alternative checkpoints: AIRI-Institute/gena-lm-bert-base, zhihan1996/DNABERT-2-117M
tokenizer = AutoTokenizer.from_pretrained('dnagpt/human_gpt2-v1')
tokenizer.tokenize("GAGCACATTCGCCTGCGTGCGCACTCACACACACGTTCAAAAAGAGTCCATTCGATTCTGGCAGTAG")
# result: ['G', 'AGCAC', 'ATTCGCC', ...]
```
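And since the question is about training GPT-2 from scratch: once the tokenizer is fixed, a randomly initialized model only needs a config whose vocabulary size matches it. A minimal sketch, with illustrative (untuned) model sizes:

```python
from transformers import AutoTokenizer, GPT2Config, GPT2LMHeadModel

tokenizer = AutoTokenizer.from_pretrained('dnagpt/human_gpt2-v1')
config = GPT2Config(
    vocab_size=len(tokenizer),  # must match the DNA tokenizer
    n_positions=512,            # illustrative context length
    n_embd=256,                 # illustrative hidden size
    n_layer=6,
    n_head=8,
)
model = GPT2LMHeadModel(config)  # randomly initialized: no pre-trained weights
```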