Tokenization of long DNA sequences

Hello everyone!

I am struggling to apply transformers to DNA data for a supervised binary classification problem. My DNA sequences are very long (the mean length is 6E7 characters). To fit longer sequences into the neural network's input, I am trying different tokenization algorithms so that the model works with multi-character tokens rather than only the single characters (C, G, A, T).

At the moment I am using HuggingFace to implement the BPE, WordPiece, and Unigram algorithms. However, before training those models I have to apply a pre-tokenizer to my data. All of them are based on “classic” natural-language structure, like Whitespace(), but in my case I only have a list of DNA sequences like (small chunk):


My intention is to group those characters so that I work with tokens longer than a single character. However, when I use, for example, Whitespace(), my model does not learn…
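
A minimal sketch of what may be going on, using only the standard library. It assumes that Whitespace() splits on the regex `\w+|[^\w\s]+`, as its documentation describes, so a DNA string with no spaces or punctuation comes back as one single piece. The `fixed_window_split` helper, the window size of 4, and the example sequence are all hypothetical and only there for comparison. Fixed windows are a different technique from BPE, WordPiece, or Unigram, not a recommended pre-tokenizer.

```python
import re
from typing import List

# Regex assumed to match the behaviour documented for the Whitespace() pre-tokenizer.
WHITESPACE_PATTERN = re.compile(r"\w+|[^\w\s]+")


def whitespace_like_split(sequence: str) -> List[str]:
    """Split a string the way a whitespace/punctuation pre-tokenizer would."""
    return WHITESPACE_PATTERN.findall(sequence)


def fixed_window_split(sequence: str, window: int) -> List[str]:
    """Hypothetical alternative: cut the sequence into non-overlapping windows."""
    return [sequence[i:i + window] for i in range(0, len(sequence), window)]


# Made-up example sequence, not real data.
dna = "ACGTTGCAAGCTTAGC"

pieces_ws = whitespace_like_split(dna)
pieces_win = fixed_window_split(dna, 4)

print(pieces_ws)   # the whole sequence as a single "word"
print(pieces_win)  # four 4-character pieces
```

With this input, the whitespace-style split returns one 16-character piece, so a trainer would see each full sequence as a single "word". The windowed split returns several short pieces instead.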

Could you recommend a pre_tokenizer for feeding character-only sequences to BPE, WordPiece, and Unigram?

Also, would you recommend padding the sequences before or after the tokenization process?

Thank you very much