Below, you can find code for reproducing the problem. Most of it is from the tokenizers Quicktour, so you'll need to download the data files as per the instructions there (or modify the files list if you are using your own data). The rest is from the official transformers docs on how to load a tokenizer from tokenizers into transformers.
from tokenizers import Tokenizer
from tokenizers.trainers import BpeTrainer
from tokenizers.models import BPE
from tokenizers.pre_tokenizers import Whitespace
from transformers import PreTrainedTokenizerFast
files = [f"data/wikitext-103-raw/wiki.{split}.raw" for split in ["test", "train", "valid"]]
sentences = ["Hello, y'all!", "How are you 😁 ?"]
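# Build and train a BPE tokenizer with whitespace pre-tokenization, as in the Quicktour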
tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
trainer = BpeTrainer(special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"])
tokenizer.pre_tokenizer = Whitespace()
tokenizer.train(files, trainer)
# Enable padding
tokenizer.enable_padding(pad_id=3, pad_token="[PAD]")
# Now use this tokenizer to tokenize a couple of sentences.
output = tokenizer.encode_batch(sentences)
# The output is padded, as it should be:
print(output[0].tokens)
# ['Hello', ',', 'y', "'", 'all', '!']
print(output[1].tokens)
# ['How', 'are', 'you', '[UNK]', '?', '[PAD]']
# But now let's load the tokenizer into transformers, directly from the tokenizer object:
fast_tokenizer = PreTrainedTokenizerFast(tokenizer_object=tokenizer)
# Tokenize two strings of different token length with padding
fast_output = fast_tokenizer(sentences, padding=True)
This gives us the error:
Using pad_token, but it is not set yet.
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/apatil/anaconda3/envs/lm-training/lib/python3.10/site-packages/transformers/tokenization_utils_base.py", line 2548, in __call__
    encodings = self._call_one(text=text, text_pair=text_pair, **all_kwargs)
  File "/Users/apatil/anaconda3/envs/lm-training/lib/python3.10/site-packages/transformers/tokenization_utils_base.py", line 2634, in _call_one
    return self.batch_encode_plus(
  File "/Users/apatil/anaconda3/envs/lm-training/lib/python3.10/site-packages/transformers/tokenization_utils_base.py", line 2816, in batch_encode_plus
    padding_strategy, truncation_strategy, max_length, kwargs = self._get_padding_truncation_strategies(
  File "/Users/apatil/anaconda3/envs/lm-training/lib/python3.10/site-packages/transformers/tokenization_utils_base.py", line 2453, in _get_padding_truncation_strategies
    raise ValueError(
ValueError: Asking to pad but the tokenizer does not have a padding token. Please select a token to use as `pad_token` `(tokenizer.pad_token = tokenizer.eos_token e.g.)` or add a new pad token via `tokenizer.add_special_tokens({'pad_token': '[PAD]'})`.
We can resolve the issue by explicitly specifying the special tokens when initializing the PreTrainedTokenizerFast:
fast_tokenizer = PreTrainedTokenizerFast(tokenizer_object=tokenizer, pad_token="[PAD]", unk_token="[UNK]")
# Now padding works as expected
fast_output = fast_tokenizer(sentences, padding=True)
print(fast_output[0].tokens)
# ['Hello', ',', 'y', "'", 'all', '!']
print(fast_output[1].tokens)
# ['How', 'are', 'you', '[UNK]', '?', '[PAD]']
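(Alternatively, following the error message's own suggestion, you can set the pad token after construction instead of in the constructor. A minimal sketch, relying on the fact that "[PAD]" is already in the trained vocab here:)
fast_tokenizer = PreTrainedTokenizerFast(tokenizer_object=tokenizer)
# "[PAD]" already exists in the vocab, so this just registers it as the pad token
fast_tokenizer.add_special_tokens({"pad_token": "[PAD]"})
fast_output = fast_tokenizer(sentences, padding=True)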
The code above uses the tokenizer_object parameter to load the fast tokenizer as a PreTrainedTokenizerFast instance, but, as you can confirm for yourselves, the same behavior occurs if you first save the tokenizer to a file and then load it into PreTrainedTokenizerFast using the tokenizer_file parameter instead.
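For completeness, here is a sketch of that save-and-reload variant (the file name tokenizer.json is just an example):
# Save the trained tokenizer to disk, then load it into transformers from the file
tokenizer.save("tokenizer.json")
fast_tokenizer = PreTrainedTokenizerFast(tokenizer_file="tokenizer.json")
# Same error as before:
fast_output = fast_tokenizer(sentences, padding=True)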
Bottom line: I don't understand why, if the padding information is already in the tokenizer (or in the saved tokenizer config file), I should need to explicitly specify the padding token again when transferring the tokenizer. This introduces a lot of totally unnecessary friction into what should be a painless process. The tokenizer object/config should be self-contained. If that information is already encapsulated in the tokenizer object or its saved config file, I should not have to re-hardcode what the padding token is in the code that loads it into transformers, any more than I should need to specify the vocab file, the pre-tokenizer to use, etc. That's the whole point of the tokenizer object/config file: to uniquely determine the behavior of the tokenizer.
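For what it's worth, the padding configuration (including the pad token) really does appear to be part of the serialized tokenizer. A quick check, assuming to_str() produces the same JSON that save() writes to disk:
import json
# Inspect the tokenizer's own JSON serialization
config = json.loads(tokenizer.to_str())
print(config["padding"])
# Prints something like:
# {'strategy': 'BatchLongest', 'direction': 'Right', 'pad_to_multiple_of': None, 'pad_id': 3, 'pad_type_id': 0, 'pad_token': '[PAD]'}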
Am I doing something wrong, or is this just how this works?