Using HuggingFace Tokenizers Without Special Characters

Hey,

I would like to use HuggingFace Tokenizers for a unique dataset which doesn’t require any special characters. Thus, the resulting vocabulary should consist only of characters from the input file(s). For example, if my file contains the sentence:
“AAABBBCCC”
The vocabulary should consist only of words built from the letters “A”, “B” and “C”.
My intention is to run the following tokenizers: BPE, SentencePiece and WordPiece. I looked at the source code but couldn’t find a parameter that does this.

Hey @dotan1111, have you solved this question yet?
HF Tokenizers provides several models and trainers that learn how to split a sentence into smaller pieces. The modern approach is to split the sentence into subwords and let the model learn how to merge them into larger tokens. So what you are really asking is how to tokenize the sentence “AAABBBCCC” into the letters “A”, “B” and “C” only, without any merges. However, the models in HF Tokenizers normally learn a vocabulary that contains both single letters and merged patterns, as I believe you already know. Here are a couple of ways to build a “char-level” tokenizer using BPE.

  1. BPE with a very high minimum merge frequency
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer

# Data
file = ['/path/to/AAABBBCCC.txt'] # AAABBBCCC only

# Use BPE model
tokenizer = Tokenizer(BPE(unk_token="[UNK]"))

# Setup BPE trainer
trainer = BpeTrainer(
    min_frequency=1_000_000_000  # set min_frequency so high that no merges are ever learned
)
tokenizer.train(file, trainer)

# Show vocab list
tokenizer.get_vocab()
# > {'A': 0, 'B': 1, 'C': 2}
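
# Sanity check (a quick sketch, assuming the training file really contains
# only the letters A, B and C): since no merges were learned, encoding
# falls back to single characters
print(tokenizer.encode("AAABBBCCC").tokens)
# > ['A', 'A', 'A', 'B', 'B', 'B', 'C', 'C', 'C']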
  2. Pythonic way
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer

# Context from Char.txt
# AAAABBB
# CCCCDDDD
# EEEFF
# GGHHHHH

# Get unique letters
letters = set()
with open('/path/to/Char.txt', 'r') as f:
    for line in f.readlines():
        for char in line.rstrip():
            letters.add(char)
print(letters)

# Build your own vocab.json
import json
vocab = dict()
with open('/path/to/char_vocab.json', 'w') as f:
    for idx, item in enumerate(letters):
        vocab[item] = idx
    json.dump(vocab, f, indent=6)

# Build your own merges.txt: create an empty file named `char_merges.txt`, or create it from Python as sketched below
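with open('/path/to/char_merges.txt', 'w') as f:
    pass  # the file just needs to exist; a char-level model has no merge rules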

# Init a new tokenizer
new_tokenizer = Tokenizer(BPE.from_file('/path/to/char_vocab.json', '/path/to/char_merges.txt'))
new_tokenizer.get_vocab()
# > {'C': 7, 'F': 4, 'G': 3, 'E': 2, 'A': 0, 'B': 6, 'H': 5, 'D': 1}
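
# Quick usage sketch: with an empty merges file, the BPE model can only fall
# back to single characters (this assumes the input text only contains letters
# already present in char_vocab.json, since no unk_token was configured here)
print(new_tokenizer.encode("AAAABBB").tokens)
# > ['A', 'A', 'A', 'A', 'B', 'B', 'B']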

I wouldn’t say which method is the best way to implement a char-level tokenizer :stuck_out_tongue: Hope this solves your question. :hugs:


Thanks! I have used a similar approach.
