DNA long sequence tokenization

Hello everyone!

I am struggling to apply transformers to DNA data for a supervised binary classification problem. I have very long DNA sequences (the mean length is 6E7 characters). To fit longer sequences into the neural network's input, I am trying different tokenization algorithms so that the model works with multi-character tokens rather than only the single characters C, G, A, and T.

At the moment I am using HuggingFace to implement the BPE, WordPiece, and Unigram algorithms. However, before training those models I have to apply a pre-tokenizer to my data. All of them are based on “classic” language structures, like Whitespace(), but in my case I only have a list of DNA sequences like this (small chunk):


My intention is to group those characters so that I work with tokens longer than a single character. However, when I use, for example, Whitespace(), my model does not learn…

Could you recommend a pre_tokenizer for BPE, WordPiece, and Unigram when the input consists only of characters?

Also, would you recommend padding the sequences before or after tokenization?

Thank you very much


Hello @mdelas,
I was wondering whether you found a solution to your problem.

I have a similar question: I am currently working with bacterial genome sequences, for which I need to pre-train a model from scratch.

Thanks in advance!

BPE simply merges short segments into longer ones, so pre-tokenization is not necessary.
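
To illustrate why no pre-tokenizer is needed, here is a minimal pure-Python sketch of the BPE merge loop, treating each raw DNA string as a single "word". The function names (learn_bpe_merges, apply_merge, encode) and the toy sequences are made up for illustration; this is not the HuggingFace implementation, which is far more efficient.

```python
from collections import Counter


def apply_merge(tokens, pair):
    """Replace every adjacent occurrence of `pair` with the joined token."""
    merged = []
    i = 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            merged.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged


def learn_bpe_merges(sequences, num_merges):
    """Learn up to `num_merges` merges, starting from single characters."""
    corpus = [list(seq) for seq in sequences]
    merges = []
    for _ in range(num_merges):
        # Count adjacent token pairs across the whole corpus.
        pair_counts = Counter()
        for tokens in corpus:
            pair_counts.update(zip(tokens, tokens[1:]))
        if not pair_counts:
            break
        # Merge the most frequent pair (ties go to the first pair seen).
        best_pair = max(pair_counts, key=pair_counts.get)
        merges.append(best_pair)
        corpus = [apply_merge(tokens, best_pair) for tokens in corpus]
    return merges


def encode(sequence, merges):
    """Tokenize a new sequence by replaying the learned merges in order."""
    tokens = list(sequence)
    for pair in merges:
        tokens = apply_merge(tokens, pair)
    return tokens


# Toy data, invented for this example.
toy_sequences = ["ATATATCG", "ATATGC"]
toy_merges = learn_bpe_merges(toy_sequences, num_merges=2)
toy_tokens = encode("ATATCG", toy_merges)
```

With real genomes of this length, it would probably be necessary to split each sequence into fixed-size chunks before training, since a single 6E7-character "word" is expensive to process; that chunk size is a choice the sketch does not address.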

If you want, you could use the k-mer method.
For example, “ATCG” is segmented into ATC and TCG by the 3-mer method. The value of k could be 6-13.
The DNABERT model uses this method.
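
The k-mer segmentation described above can be sketched in a few lines of plain Python. The function name kmer_tokenize and the stride parameter are assumptions added for illustration: stride=1 gives the overlapping k-mers from the “ATCG” example, while stride=k gives non-overlapping ones.

```python
def kmer_tokenize(sequence, k=3, stride=1):
    """Split a DNA string into k-mers taken every `stride` characters.

    A trailing fragment shorter than k is dropped.
    """
    if k <= 0 or stride <= 0:
        raise ValueError("k and stride must be positive integers")
    return [sequence[i:i + k] for i in range(0, len(sequence) - k + 1, stride)]


overlapping = kmer_tokenize("ATCG", k=3)                 # ['ATC', 'TCG']
non_overlapping = kmer_tokenize("ATCGAT", k=3, stride=3)  # ['ATC', 'GAT']
```

Overlapping k-mers keep the token count close to the sequence length, so they do not shorten the input; non-overlapping k-mers shorten it by roughly a factor of k.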

There are also some pretrained BPE tokenizers on HuggingFace, for example:

from transformers import AutoTokenizer, AutoModel
# Other pretrained DNA tokenizers: AIRI-Institute/gena-lm-bert-base, zhihan1996/DNABERT-2-117M
tokenizer = AutoTokenizer.from_pretrained('dnagpt/human_gpt2-v1')
# Example of the tokens produced for a DNA sequence: ['G', 'AGCAC', 'ATTCGCC', ....]

For classification problems, here are some examples:
