I see these two approaches for training a tokenizer in HuggingFace:
Ref: How to train a new language model from scratch using Transformers and Tokenizers
```python
from tokenizers.implementations import ByteLevelBPETokenizer

tokenizer = ByteLevelBPETokenizer()
paths = ['wikitext-2.txt']
tokenizer.train(files=paths)
tokenizer.save_model('../data/hf_bpe')

encoding = tokenizer.encode('hello world')
print(encoding.ids)
```
Ref: Building a tokenizer, block by block - Hugging Face Course
```python
from tokenizers import models, trainers, Tokenizer

tokenizer = Tokenizer(model=models.WordPiece(unk_token="[UNK]"))
special_tokens = ["[UNK]", "[PAD]", "[CLS]", "[SEP]", "[MASK]"]
trainer = trainers.WordPieceTrainer(vocab_size=25000, special_tokens=special_tokens)
tokenizer.train(["wikitext-2.txt"], trainer=trainer)

encoding = tokenizer.encode("Let's test this tokenizer...", "on a pair of sentences.")
print(encoding.ids)
```
My question is: what is the difference between the two approaches, and when should I use which?
If I understand correctly, in the latter approach the model represents the tokenization algorithm. In that case, what does the trainer do? Does it hold the vocabulary and add new tokens to it?
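To make my confusion concrete, here is my attempt to rebuild approach 1 block by block. The ByteLevel pre-tokenizer/decoder and the trainer settings are my guesses at what the convenience class does internally, and I use a tiny inline corpus so the snippet runs standalone:

```python
from tokenizers import Tokenizer, models, pre_tokenizers, decoders, trainers

# Tiny inline corpus just so this snippet is self-contained;
# approach 1 trains on files instead.
corpus = ["hello world", "hello tokenizer", "the quick brown fox"]

# My guess at the blocks ByteLevelBPETokenizer assembles for you:
tokenizer = Tokenizer(models.BPE())
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=False)
tokenizer.decoder = decoders.ByteLevel()

# The trainer holds the training hyperparameters (vocab size, alphabet, ...)
# and is what actually learns the merges/vocabulary for the model.
trainer = trainers.BpeTrainer(
    vocab_size=1000,
    initial_alphabet=pre_tokenizers.ByteLevel.alphabet(),
)
tokenizer.train_from_iterator(corpus, trainer=trainer)

encoding = tokenizer.encode("hello world")
print(encoding.ids)
```

Is this a fair mental model, i.e. the model is the algorithm plus its learned vocabulary, and the trainer is only the procedure/configuration used to fill that vocabulary in?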
Also, in approach 1, the tokenizer implicitly contains the model (the `.model` attribute), but its `train` method does not accept a `Trainer` argument. Why?
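For reference, this is the usage I would expect if the convenience class simply builds the trainer internally: it seems to expose the trainer's hyperparameters as plain keyword arguments instead of taking a `Trainer` object (this is my reading of the library; the values below are arbitrary):

```python
from tokenizers.implementations import ByteLevelBPETokenizer

tokenizer = ByteLevelBPETokenizer()

# My understanding (unconfirmed): the trainer's settings are passed
# directly as keyword arguments, and the Trainer is constructed inside
# train / train_from_iterator.
tokenizer.train_from_iterator(
    ["hello world", "hello tokenizer"],
    vocab_size=1000,
    min_frequency=1,
    special_tokens=["<s>", "</s>"],
)
print(tokenizer.encode("hello world").ids)
```

So is the only difference that approach 1 hides the trainer behind keyword arguments, while approach 2 makes it explicit?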