I have been building a byte-level BPE tokenizer for DNA data using the ByteLevelBPETokenizer class from HuggingFace's tokenizers library. I already obtained the vocab and the merges, but the vocabulary contains a lot of "garbage" entries you would expect from a human-language model (e.g. "!": 3, "\"": 4, "#": 5, "$": 6, "%": 7, "&": 8, "'": 9, "(": 10), whereas I know for sure that my data will only ever contain ACGT-based DNA sequences (e.g. "GGT": 270, "AGC": 271, "ACC": 272, "GGCC": 273, "TTC": 274, "AA": 275).
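For reference, the training step looked roughly like this (a minimal sketch; the file name and hyperparameters below are placeholders, not my actual setup). The punctuation entries appear because byte-level BPE seeds the vocabulary with all 256 byte symbols before learning any merges, regardless of what actually occurs in the training data:

```python
from tokenizers import ByteLevelBPETokenizer

# Train a byte-level BPE tokenizer on raw DNA sequence files.
# "dna_sequences.txt", vocab_size and min_frequency are placeholder values.
tokenizer = ByteLevelBPETokenizer()
tokenizer.train(
    files=["dna_sequences.txt"],
    vocab_size=5000,
    min_frequency=2,
)

# Writes vocab.json and merges.txt to the given directory.
tokenizer.save_model("dna_tokenizer")
```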
Does anyone know if there is a way to remove all these symbols and automatically generated tokens from my vocabulary?
I was already able to fix this problem by editing the train method of the ByteLevelBPETokenizer class: I changed the alphabet it takes from the base tokenizer so that, instead of the full 256-character byte-level alphabet, it uses my own choice, ["A", "C", "G", "T", "N"].
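For anyone hitting the same issue, here is a sketch of the same idea without patching the installed library: build the tokenizer from the lower-level components and pass a restricted initial_alphabet to the trainer. This assumes a reasonably recent version of tokenizers; the file path, vocab size, and special tokens are illustrative, not the exact edit I made:

```python
from tokenizers import Tokenizer, models, trainers, pre_tokenizers, decoders

# Assemble a BPE tokenizer by hand instead of using ByteLevelBPETokenizer,
# so the trainer's initial alphabet can be restricted to the DNA bases.
tokenizer = Tokenizer(models.BPE())
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=False)
tokenizer.decoder = decoders.ByteLevel()

trainer = trainers.BpeTrainer(
    vocab_size=5000,                             # placeholder value
    min_frequency=2,
    initial_alphabet=["A", "C", "G", "T", "N"],  # only DNA symbols, not all 256 bytes
    special_tokens=["<s>", "</s>", "<pad>", "<unk>", "<mask>"],  # adjust as needed
)

tokenizer.train(files=["dna_sequences.txt"], trainer=trainer)  # placeholder path
tokenizer.save("dna_bpe.json")
```

With this, the learned vocabulary starts from only the five DNA symbols plus the special tokens, so no punctuation or other byte-level characters end up in vocab.json.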