Issue with Transformer notebook's Getting Started Tokenizers

In the Documentation, I went to the Transformer Notebooks page and opened the Colab for "Getting Started with Tokenizers". I executed each cell, and when I got to this one:

from tokenizers.trainers import BpeTrainer
trainer = BpeTrainer(vocab_size=25000, show_progress=True, initial_alphabet=ByteLevel.alphabet())
tokenizer.train(trainer, ["big.txt"])

print("Trained vocab size: {}".format(tokenizer.get_vocab_size()))

I ran the cell and got this:

TypeError: Can't convert <tokenizers.trainers.BpeTrainer object at 0x7f8641325570> to Sequence

I assume these cells are supposed to work, so something must have changed in the software that wasn't updated in the notebook. I am trying to learn transformers on my own, so where can I go to learn if the Hugging Face documentation is not up to date? Any help would be appreciated.

Hi @krwin, this does indeed seem to be a bug in the notebook: the order of the arguments to tokenizer.train() in this cell

from tokenizers.trainers import BpeTrainer

# We initialize our trainer, giving it the details about the vocabulary we want to generate
trainer = BpeTrainer(vocab_size=25000, show_progress=True, initial_alphabet=ByteLevel.alphabet())
tokenizer.train(trainer, ["big.txt"])

print("Trained vocab size: {}".format(tokenizer.get_vocab_size()))

is back-to-front (see the docs). To fix the problem, you can just pass the arguments by keyword:

tokenizer.train(trainer=trainer, files=["big.txt"])
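As a side note, this is a general pattern worth adopting: keyword arguments keep a call working even when a library reorders its positional parameters. Here is a minimal, self-contained sketch with stand-in `train` functions (hypothetical, not the real tokenizers API) that mimics the old and new signatures:

```python
def train_old(trainer, files):
    # Hypothetical older signature: trainer first, then files.
    return "trained on {} with {}".format(files, trainer)

def train_new(files, trainer=None):
    # Hypothetical newer signature: files first, trainer optional.
    return "trained on {} with {}".format(files, trainer)

# A positional call written against the old order breaks under the new one,
# but the same keyword-argument call works with either signature:
result_old = train_old(trainer="bpe", files=["big.txt"])
result_new = train_new(trainer="bpe", files=["big.txt"])
assert result_old == result_new
```

So even when a notebook lags behind the library, keyword calls like `tokenizer.train(trainer=trainer, files=["big.txt"])` are more resilient to this kind of API drift.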

cc: @anthony


Thank you very much for helping me and being so prompt.
Hope you have a great day!
