I have been building a byte-level BPE tokenizer for DNA data using the ByteLevelBPETokenizer class from HuggingFace's tokenizers library. I already obtained the vocab and the merges, but the vocabulary contains a lot of "garbage" entries you would expect from a human-language model (e.g. "!": 3, "\"": 4, "#": 5, "$": 6, "%": 7, "&": 8, "'": 9, "(": 10), whereas I know for sure that my data will only ever contain ACGT-based DNA sequences (e.g. "GGT": 270, "AGC": 271, "ACC": 272, "GGCC": 273, "TTC": 274, "AA": 275).
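For reference, the training step looked roughly like this (a minimal sketch; the file name and hyperparameters below are placeholders, not my actual setup). The punctuation entries appear because byte-level BPE seeds the vocabulary with all 256 byte symbols before learning any merges, regardless of what actually occurs in the training data:

```python
from tokenizers import ByteLevelBPETokenizer

# Train a byte-level BPE tokenizer on raw DNA sequence files.
# "dna_sequences.txt", vocab_size and min_frequency are placeholder values.
tokenizer = ByteLevelBPETokenizer()
tokenizer.train(
    files=["dna_sequences.txt"],
    vocab_size=5000,
    min_frequency=2,
)

# Writes vocab.json and merges.txt to the given directory.
tokenizer.save_model("dna_tokenizer")
```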
Does anyone know if there is a way to remove all these symbols and automatically generated tokens from my vocabulary?
I was already able to fix this problem by editing the train method of the ByteLevelBPETokenizer class: I changed the alphabet it takes from the base tokenizer so that, instead of the full 256-character byte-level alphabet, it uses my own choice, ["A", "C", "G", "T", "N"].
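For anyone hitting the same issue, here is a sketch of the same idea without patching the installed library: build the tokenizer from the lower-level components and pass a restricted initial_alphabet to the trainer. This assumes a reasonably recent version of tokenizers; the file path, vocab size, and special tokens are illustrative, not the exact edit I made:

```python
from tokenizers import Tokenizer, models, trainers, pre_tokenizers, decoders

# Assemble a BPE tokenizer by hand instead of using ByteLevelBPETokenizer,
# so the trainer's initial alphabet can be restricted to the DNA bases.
tokenizer = Tokenizer(models.BPE())
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=False)
tokenizer.decoder = decoders.ByteLevel()

trainer = trainers.BpeTrainer(
    vocab_size=5000,                             # placeholder value
    min_frequency=2,
    initial_alphabet=["A", "C", "G", "T", "N"],  # only DNA symbols, not all 256 bytes
    special_tokens=["<s>", "</s>", "<pad>", "<unk>", "<mask>"],  # adjust as needed
)

tokenizer.train(files=["dna_sequences.txt"], trainer=trainer)  # placeholder path
tokenizer.save("dna_bpe.json")
```

With this, the learned vocabulary starts from only the five DNA symbols plus the special tokens, so no punctuation or other byte-level characters end up in vocab.json.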