I am struggling with transformers on DNA data for a supervised binary classification problem. I have very long DNA sequences (mean length 6E7 characters) and, to be able to pass longer sequences as input to the neural network, I am trying to tokenize them with different algorithms so I can work with multi-character tokens rather than only the single characters (C, G, A, T).
At the moment I am trying to implement the BPE, WordPiece, and Unigram algorithms with HuggingFace. However, before training those models I have to apply a pre-tokenizer to my data. All of the available ones are based on "classic" language structures like Whitespace(), but in my case I only have a list of DNA sequences like this (small chunk):
My intention is to group those characters so I can work with tokens bigger than a single character. However, when I use, for example, Whitespace(), my model does not learn.
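Here is a simplified version of what I am doing now (the corpus below is a toy stand-in for my real sequences):

from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer
from tokenizers.pre_tokenizers import Whitespace

corpus = ["GAGCACATTCGCC", "TGCGTGCGCACTC"]  # toy stand-in for my real sequences

tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()  # this is the step I am unsure about
trainer = BpeTrainer(vocab_size=4096, special_tokens=["[UNK]", "[PAD]"])
tokenizer.train_from_iterator(corpus, trainer=trainer)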
Could you recommend a pre_tokenizer that lets me pass only raw characters as input to BPE, WordPiece, and Unigram?
Also, would you recommend padding the sequences before or after the tokenization process?
BPE simply merges short segments into longer ones, so pre-tokenization is not necessary.
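For instance, with the HuggingFace tokenizers library you can train BPE on the raw strings with no pre_tokenizer at all. A minimal sketch (the corpus and vocab_size here are placeholders):

from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer

# Toy corpus; use your own sequences. With no pre_tokenizer set, each
# string is treated as one long "word" and BPE merges the single
# characters (C, G, A, T) into progressively longer subsequences.
corpus = ["GAGCACATTCGCCTGCGTGCGCACTCACACACACGT",
          "TCAAAAAGAGTCCATTCGATTCTGGCAGTAG"]

tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
trainer = BpeTrainer(vocab_size=256, special_tokens=["[UNK]", "[PAD]"])
tokenizer.train_from_iterator(corpus, trainer=trainer)

print(tokenizer.encode("GAGCACATTCGCC").tokens)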
Alternatively, you could use the k-mer method.
For example, "ATCG" is segmented into ATC and TCG by the 3-mer method. Typical values of k are 6-13.
The DNABERT model uses exactly this method.
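A k-mer splitter is just a few lines of plain Python (this helper is illustrative, not DNABERT's own code):

def kmer_tokenize(seq, k=3, stride=1):
    # Overlapping k-mers; stride=1 reproduces the ATCG -> [ATC, TCG] example above.
    return [seq[i:i + k] for i in range(0, len(seq) - k + 1, stride)]

print(kmer_tokenize("ATCG", k=3))   # ['ATC', 'TCG']
print(kmer_tokenize("GAGCAC", k=6)) # ['GAGCAC']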
There are also some pretrained BPE tokenizers on HuggingFace, for example:
from transformers import AutoTokenizer

# Other pretrained DNA tokenizers: AIRI-Institute/gena-lm-bert-base, zhihan1996/DNABERT-2-117M
tokenizer = AutoTokenizer.from_pretrained('dnagpt/human_gpt2-v1')
tokenizer.tokenize("GAGCACATTCGCCTGCGTGCGCACTCACACACACGTTCAAAAAGAGTCCATTCGATTCTGGCAGTAG")
# result: ['G', 'AGCAC', 'ATTCGCC', ...]
For the classification problem, here are some examples (a minimal fine-tuning sketch follows the links):
https://huggingface.co/spaces/dnagpt/dnabert_pretrain_v1/blob/main/dnagpt/class6.ipynb
https://github.com/AIRI-Institute/GENA_LM/blob/main/notebooks/GENA_sequence_classification_example.ipynb
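The setup in those notebooks boils down to something like this (a sketch; it assumes the checkpoint exposes a sequence-classification head via trust_remote_code, so check each model card):

import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

name = "zhihan1996/DNABERT-2-117M"  # assumption: this checkpoint supports a classification head
tokenizer = AutoTokenizer.from_pretrained(name, trust_remote_code=True)
model = AutoModelForSequenceClassification.from_pretrained(
    name, num_labels=2, trust_remote_code=True)  # binary classification

inputs = tokenizer("GAGCACATTCGCCTGCGTGCGCACT",
                   truncation=True, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits  # shape (1, 2): one score per class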