I would like to use HuggingFace Tokenizers on a dataset that doesn’t need any special characters. The resulting vocabulary should therefore consist only of characters that appear in the input file(s). For example, if my file contains the sentence:
The vocabulary should consist only of words made up of the letters “A”, “B” and “C”.
I intend to train the following tokenizers: BPE, SentencePiece and WordPiece. I looked at the base code but couldn’t find a parameter that does this.
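To make the requirement concrete, here is a minimal sketch in plain Python of the check I would like the trained vocabulary to pass. It does not use the Tokenizers API at all. The function names, the sample texts and the sample vocabulary are made up for illustration, with entries such as `[UNK]` and `##B` standing in for the kind of special or prefixed tokens I want to avoid.

```python
def corpus_alphabet(texts):
    """Collect every non-whitespace character that occurs in the training texts."""
    return {ch for text in texts for ch in text if not ch.isspace()}


def tokens_outside_alphabet(vocab, alphabet):
    """Return the vocabulary tokens that contain a character not in the alphabet."""
    return sorted(tok for tok in vocab if not set(tok) <= alphabet)


# Hypothetical training data that only uses the letters A, B and C.
texts = ["ABCBA", "CAB"]
alphabet = corpus_alphabet(texts)

# Hypothetical vocabulary, as a trained tokenizer might produce it.
vocab = ["A", "B", "C", "AB", "CAB", "[UNK]", "##B"]

# Tokens that break the "input characters only" rule.
offenders = tokens_outside_alphabet(vocab, alphabet)
print(offenders)
```

Ideally this list would be empty for all three tokenizers, so that every vocabulary entry is built only from characters seen in the input.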