I am struggling with transformers on DNA data for a supervised binary classification problem. I have very long DNA sequences (mean length 6E7 characters) and, to be able to pass longer sequences as input to the neural network, I am trying to tokenize them with different algorithms so I can work with multi-character tokens rather than only the single characters (C, G, A, T).
At the moment I am trying to implement the BPE, WordPiece, and Unigram algorithms with HuggingFace. However, before training those models I have to apply a pre-tokenizer to my data. All of the available ones are based on "classic" language structures like Whitespace(), but in my case I only have a list of DNA sequences like this (small chunk):
My intention is to group those characters so I can work with tokens bigger than a single character. However, when I use, for example, Whitespace(), my model does not learn.
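Here is a simplified version of what I am doing now (the corpus below is a toy stand-in for my real sequences):

from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer
from tokenizers.pre_tokenizers import Whitespace

corpus = ["GAGCACATTCGCC", "TGCGTGCGCACTC"]  # toy stand-in for my real sequences

tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()  # this is the step I am unsure about
trainer = BpeTrainer(vocab_size=4096, special_tokens=["[UNK]", "[PAD]"])
tokenizer.train_from_iterator(corpus, trainer=trainer)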
Could you recommend a pre_tokenizer that lets me pass only raw characters as input to BPE, WordPiece, and Unigram?
Also, would you recommend padding the sequences before or after the tokenization process?
BPE simply merges short segments into longer ones, so pre-tokenization is not necessary.
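For instance, with the HuggingFace tokenizers library you can train BPE on the raw strings with no pre_tokenizer at all. A minimal sketch (the corpus and vocab_size here are placeholders):

from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer

# Toy corpus; use your own sequences. With no pre_tokenizer set, each
# string is treated as one long "word" and BPE merges the single
# characters (C, G, A, T) into progressively longer subsequences.
corpus = ["GAGCACATTCGCCTGCGTGCGCACTCACACACACGT",
          "TCAAAAAGAGTCCATTCGATTCTGGCAGTAG"]

tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
trainer = BpeTrainer(vocab_size=256, special_tokens=["[UNK]", "[PAD]"])
tokenizer.train_from_iterator(corpus, trainer=trainer)

print(tokenizer.encode("GAGCACATTCGCC").tokens)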
Alternatively, you could use the k-mer method.
For example, "ATCG" is segmented into ATC and TCG by the 3-mer method. Typical values of k are 6-13.
The DNABERT model uses exactly this method.
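A k-mer splitter is just a few lines of plain Python (this helper is illustrative, not DNABERT's own code):

def kmer_tokenize(seq, k=3, stride=1):
    # Overlapping k-mers; stride=1 reproduces the ATCG -> [ATC, TCG] example above.
    return [seq[i:i + k] for i in range(0, len(seq) - k + 1, stride)]

print(kmer_tokenize("ATCG", k=3))   # ['ATC', 'TCG']
print(kmer_tokenize("GAGCAC", k=6)) # ['GAGCAC']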
There are also some pretrained BPE tokenizers on HuggingFace, for example:
from transformers import AutoTokenizer

# Other pretrained DNA tokenizers: AIRI-Institute/gena-lm-bert-base, zhihan1996/DNABERT-2-117M
tokenizer = AutoTokenizer.from_pretrained('dnagpt/human_gpt2-v1')
tokenizer.tokenize("GAGCACATTCGCCTGCGTGCGCACTCACACACACGTTCAAAAAGAGTCCATTCGATTCTGGCAGTAG")
# result: ['G', 'AGCAC', 'ATTCGCC', ...]
For the classification problem, here are some examples (a minimal fine-tuning sketch follows the links):
https://huggingface.co/spaces/dnagpt/dnabert_pretrain_v1/blob/main/dnagpt/class6.ipynb
https://github.com/AIRI-Institute/GENA_LM/blob/main/notebooks/GENA_sequence_classification_example.ipynb
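The setup in those notebooks boils down to something like this (a sketch; it assumes the checkpoint exposes a sequence-classification head via trust_remote_code, so check each model card):

import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

name = "zhihan1996/DNABERT-2-117M"  # assumption: this checkpoint supports a classification head
tokenizer = AutoTokenizer.from_pretrained(name, trust_remote_code=True)
model = AutoModelForSequenceClassification.from_pretrained(
    name, num_labels=2, trust_remote_code=True)  # binary classification

inputs = tokenizer("GAGCACATTCGCCTGCGTGCGCACT",
                   truncation=True, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits  # shape (1, 2): one score per class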