I see these two approaches for training a tokenizer in HuggingFace:
Ref: How to train a new language model from scratch using Transformers and Tokenizers
```python
from tokenizers.implementations import ByteLevelBPETokenizer

tokenizer = ByteLevelBPETokenizer()
paths = ['wikitext-2.txt']
tokenizer.train(files=paths)
tokenizer.save_model('../data/hf_bpe')

encoding = tokenizer.encode('hello world')
print(encoding.ids)
```
Ref: Building a tokenizer, block by block - Hugging Face Course
```python
from tokenizers import models, trainers, Tokenizer

tokenizer = Tokenizer(model=models.WordPiece(unk_token="[UNK]"))
special_tokens = ["[UNK]", "[PAD]", "[CLS]", "[SEP]", "[MASK]"]
trainer = trainers.WordPieceTrainer(vocab_size=25000, special_tokens=special_tokens)
tokenizer.train(["wikitext-2.txt"], trainer=trainer)

encoding = tokenizer.encode("Let's test this tokenizer...", "on a pair of sentences.")
print(encoding.ids)
```
My question is: what is the difference between the two approaches, and when should I use which?
If I understand correctly, in the latter approach the model represents the tokenization algorithm. In that case, what does the trainer do? Does it hold the vocabulary and add new tokens to it?
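To make my confusion concrete, here is my attempt to rebuild approach 1 block by block. The ByteLevel pre-tokenizer/decoder and the trainer settings are my guesses at what the convenience class does internally, and I use a tiny inline corpus so the snippet runs standalone:

```python
from tokenizers import Tokenizer, models, pre_tokenizers, decoders, trainers

# Tiny inline corpus just so this snippet is self-contained;
# approach 1 trains on files instead.
corpus = ["hello world", "hello tokenizer", "the quick brown fox"]

# My guess at the blocks ByteLevelBPETokenizer assembles for you:
tokenizer = Tokenizer(models.BPE())
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=False)
tokenizer.decoder = decoders.ByteLevel()

# The trainer holds the training hyperparameters (vocab size, alphabet, ...)
# and is what actually learns the merges/vocabulary for the model.
trainer = trainers.BpeTrainer(
    vocab_size=1000,
    initial_alphabet=pre_tokenizers.ByteLevel.alphabet(),
)
tokenizer.train_from_iterator(corpus, trainer=trainer)

encoding = tokenizer.encode("hello world")
print(encoding.ids)
```

Is this a fair mental model, i.e. the model is the algorithm plus its learned vocabulary, and the trainer is only the procedure/configuration used to fill that vocabulary in?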
Also, in approach 1, the tokenizer implicitly contains the model (the `.model` attribute), but its `train` method does not accept a `Trainer` argument. Why?
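For reference, this is the usage I would expect if the convenience class simply builds the trainer internally: it seems to expose the trainer's hyperparameters as plain keyword arguments instead of taking a `Trainer` object (this is my reading of the library; the values below are arbitrary):

```python
from tokenizers.implementations import ByteLevelBPETokenizer

tokenizer = ByteLevelBPETokenizer()

# My understanding (unconfirmed): the trainer's settings are passed
# directly as keyword arguments, and the Trainer is constructed inside
# train / train_from_iterator.
tokenizer.train_from_iterator(
    ["hello world", "hello tokenizer"],
    vocab_size=1000,
    min_frequency=1,
    special_tokens=["<s>", "</s>"],
)
print(tokenizer.encode("hello world").ids)
```

So is the only difference that approach 1 hides the trainer behind keyword arguments, while approach 2 makes it explicit?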