How to properly clean vocabulary from BBPE tokenizer

Hello everyone,

I have been building a byte-level BPE tokenizer from DNA data using the ByteLevelBPETokenizer class from Hugging Face's tokenizers library. I already obtained the vocab and the merges, but there is a lot of "garbage" vocabulary that you would only expect for a human-language model (e.g. "!": 3, "\"": 4, "#": 5, "$": 6, "%": 7, "&": 8, "'": 9, "(": 10), whereas I know for sure that my data will only ever contain ACGT-base DNA sequences (e.g. "GGT": 270, "AGC": 271, "ACC": 272, "GGCC": 273, "TTC": 274, "AA": 275).

Does anyone know of a way to clean all these symbols and automatically generated tokens out of my vocabulary?
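One way to do this after training is to filter the saved vocab and merges files directly. The sketch below is a minimal, hedged example (the special-token names and the assumption that DNA tokens are stored as plain A/C/G/T/N strings, without byte-level markers, are mine): it keeps only DNA-alphabet tokens, re-indexes the ids contiguously, and drops any merge rule that references a removed token.

```python
import re

# Tokens made purely of the DNA alphabet (plus N for unknown bases).
DNA_TOKEN = re.compile(r"^[ACGTN]+$")

def clean_vocab(vocab, special_tokens=("<s>", "</s>", "<unk>", "<pad>", "<mask>")):
    """Keep special tokens and pure-DNA tokens; re-index ids contiguously,
    preserving the original id order."""
    kept = [tok for tok, _ in sorted(vocab.items(), key=lambda kv: kv[1])
            if tok in special_tokens or DNA_TOKEN.match(tok)]
    return {tok: i for i, tok in enumerate(kept)}

def clean_merges(merges, vocab):
    """Drop merge rules whose parts or merged result are no longer in the vocab."""
    return [(a, b) for a, b in merges
            if a in vocab and b in vocab and (a + b) in vocab]
```

Note that this rewrites the token ids, so the cleaned vocab.json and merges.txt must be loaded as a fresh tokenizer; they are not compatible with a model already trained on the old ids.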

Thank you very much in advance!

I was able to fix this problem by editing the train method of the ByteLevelBPETokenizer class: I changed the .alphabet it takes from the BaseTokenizer from the default 256 byte characters to my own choice, ["A", "C", "G", "T", "N"].
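For anyone who would rather not patch library internals, a similar result can be had with the library's public API by training a plain (character-level) BPE model whose initial alphabet is restricted to the five DNA symbols. This is a sketch, not the poster's exact fix; the vocab_size and the tiny inline corpus are illustrative only.

```python
# Requires the `tokenizers` package (pip install tokenizers).
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer

tokenizer = Tokenizer(BPE(unk_token="<unk>"))
trainer = BpeTrainer(
    vocab_size=300,                               # illustrative
    initial_alphabet=["A", "C", "G", "T", "N"],   # instead of the 256-byte alphabet
    special_tokens=["<unk>"],
)
# Train from an in-memory iterator; replace with your DNA sequences/files.
tokenizer.train_from_iterator(["ACGTACGGTTCAACGT", "GGCCTTCAAAGC"], trainer=trainer)
```

Because the training data itself contains only ACGT characters, every learned merge is a concatenation of DNA symbols, so the resulting vocabulary contains no punctuation tokens at all.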

Hello, how long did the training of your tokenizer take?
Thank you

Sorry for this late answer,

Actually, it is very fast: for 1.4 MB of data it takes just a few seconds.
