Issue with Transformer notebook's Getting Started Tokenizers

In the Documentation, I went to the Transformer Notebooks page and opened the Colab for "Getting Started with Tokenizers". I executed each cell, and when I got to this one:

from tokenizers.trainers import BpeTrainer
trainer = BpeTrainer(vocab_size=25000, show_progress=True, initial_alphabet=ByteLevel.alphabet())
tokenizer.train(trainer, ["big.txt"])

print("Trained vocab size: {}".format(tokenizer.get_vocab_size()))

I ran the cell and got this:

TypeError: Can't convert <tokenizers.trainers.BpeTrainer object at 0x7f8641325570> to Sequence

I assume these cells are supposed to work, so something must have changed in the software that wasn't updated in the notebook. I am trying to learn transformers on my own, so where can I go to learn if the Hugging Face documentation is not up to date? Any help would be appreciated.

Hi @krwin, this does indeed seem to be a bug in the notebook: the order of the arguments to tokenizer.train() in this cell

from tokenizers.trainers import BpeTrainer

# We initialize our trainer, giving it the details about the vocabulary we want to generate
trainer = BpeTrainer(vocab_size=25000, show_progress=True, initial_alphabet=ByteLevel.alphabet())
tokenizer.train(trainer, ["big.txt"])

print("Trained vocab size: {}".format(tokenizer.get_vocab_size()))

is back-to-front (see the docs). To fix the problem, you can just pass the arguments by keyword:

tokenizer.train(trainer=trainer, files=["big.txt"])
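As a side note, this is a general pattern worth adopting: keyword arguments keep a call working even when a library reorders its positional parameters. Here is a minimal, self-contained sketch with stand-in `train` functions (hypothetical, not the real tokenizers API) that mimics the old and new signatures:

```python
def train_old(trainer, files):
    # Hypothetical older signature: trainer first, then files.
    return "trained on {} with {}".format(files, trainer)

def train_new(files, trainer=None):
    # Hypothetical newer signature: files first, trainer optional.
    return "trained on {} with {}".format(files, trainer)

# A positional call written against the old order breaks under the new one,
# but the same keyword-argument call works with either signature:
result_old = train_old(trainer="bpe", files=["big.txt"])
result_new = train_new(trainer="bpe", files=["big.txt"])
assert result_old == result_new
```

So even when a notebook lags behind the library, keyword calls like `tokenizer.train(trainer=trainer, files=["big.txt"])` are more resilient to this kind of API drift.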

cc: @anthony


Thank you very much for helping me and being so prompt.
Hope you have a great day!
