Using HuggingFace Tokenizers Without Special Characters


I would like to use HuggingFace Tokenizers on a unique dataset that doesn’t require any special characters. The resulting vocabulary should therefore consist only of characters from the input file(s). For example, if my file contains the sentence “AAABBBCCC”:
The vocabulary should consist only of words made from the letters “A”, “B” and “C”.
I intend to run the following tokenizers: BPE, SentencePiece and WordPiece. I looked at the code base but couldn’t find a parameter to do so.

Hey @dotan1111, have you solved this question yet?
HF Tokenizers provides several models and trainers that learn how to split a sentence into smaller pieces. Modern tokenizers typically split the sentence into subwords and let a model learn how to merge them into tokens. So your question is how to tokenize the sentence “AAABBBCCC” into the letters “A”, “B” and “C” only, without any merges. By default, the models in HF Tokenizers learn a vocabulary of letters plus merged patterns, as I believe you already know. Here are two ways to build a “char-level” tokenizer using BPE.

  1. BPE with a very high minimum merge frequency
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer

# Data
file = ['/path/to/AAABBBCCC.txt'] # AAABBBCCC only

# Use BPE model
tokenizer = Tokenizer(BPE(unk_token="[UNK]"))

# Setup BPE trainer
trainer = BpeTrainer(
    min_frequency=1_000_000_000  # Threshold no pair can reach, so no merges are learned
)
tokenizer.train(file, trainer)

# Show vocab list
print(tokenizer.get_vocab())
# > {'A': 0, 'B': 1, 'C': 2}
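To see why such a large min_frequency leaves only single letters in the vocabulary, here is a rough pure-Python sketch of the pair counting that BPE training is based on. It is not the library's implementation: `count_adjacent_pairs` and `mergeable_pairs` are hypothetical helper names, and treating the whole line as a single word is an assumption made for illustration.

```python
from collections import Counter


def count_adjacent_pairs(word):
    """Count how often each pair of neighbouring symbols occurs in `word`."""
    return Counter(zip(word, word[1:]))


def mergeable_pairs(word, min_frequency):
    """Keep only the pairs frequent enough to be merge candidates (simplified)."""
    counts = count_adjacent_pairs(word)
    return {pair: n for pair, n in counts.items() if n >= min_frequency}


pairs = count_adjacent_pairs("AAABBBCCC")

# With a small threshold, the repeated letters become merge candidates.
low = mergeable_pairs("AAABBBCCC", min_frequency=2)
print(low)   # {('A', 'A'): 2, ('B', 'B'): 2, ('C', 'C'): 2}

# With a threshold no pair can reach, nothing is merged,
# so only the single letters remain in the vocabulary.
high = mergeable_pairs("AAABBBCCC", min_frequency=1_000_000_000)
print(high)  # {}
```

No pair in this tiny corpus occurs more than twice, so any threshold above 2 would already be enough here; the very large value simply makes this hold for bigger files too.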
  2. Pythonic way
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer

# Context from Char.txt

# Get unique letters
letters = set()
with open('/path/to/Char.txt', 'r') as f:
    for line in f:
        for char in line.rstrip():
            letters.add(char)

# Build your own vocab.json
import json
vocab = dict()
with open('/path/to/char_vocab.json', 'w') as f:
    for idx, item in enumerate(letters):
        vocab[item] = idx
    json.dump(vocab, f, indent = 6)

# Build your own merges.txt: an empty file named `char_merges.txt` (no merges)
open('/path/to/char_merges.txt', 'w').close()

# Init a new tokenizer
new_tokenizer = Tokenizer(BPE.from_file('/path/to/char_vocab.json', '/path/to/char_merges.txt'))

# Show vocab list
print(new_tokenizer.get_vocab())
# > {'C': 7, 'F': 4, 'G': 3, 'E': 2, 'A': 0, 'B': 6, 'H': 5, 'D': 1}
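For comparison, what a char-level tokenizer does at encode time can be sketched in plain Python without the tokenizers library. This is only an illustration under my own assumptions: `build_char_vocab`, `encode` and `decode` are hypothetical names, and sorting the letters is my choice to make the IDs reproducible (iterating over a set gives no guaranteed order).

```python
def build_char_vocab(text):
    """Map each distinct non-whitespace character to an ID, in sorted order."""
    chars = sorted(set(text) - set(" \t\r\n"))
    return {ch: idx for idx, ch in enumerate(chars)}


def encode(text, vocab):
    """Turn a string into a list of IDs; raises KeyError on unseen characters."""
    return [vocab[ch] for ch in text]


def decode(ids, vocab):
    """Turn a list of IDs back into the original string."""
    inverse = {idx: ch for ch, idx in vocab.items()}
    return "".join(inverse[i] for i in ids)


vocab = build_char_vocab("AAABBBCCC\n")
print(vocab)  # {'A': 0, 'B': 1, 'C': 2}

ids = encode("CAB", vocab)
print(ids)    # [2, 0, 1]

restored = decode(ids, vocab)
print(restored)  # CAB
```

Every token is a single character from the input, which is the behaviour the question asks for.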

I wouldn’t say either method is the best way to implement a char-level tokenizer :stuck_out_tongue: Hope this solves your question. :hugs:


Thanks! I have used a similar approach
