Padding not transferring when loading a tokenizer trained via the tokenizers library into transformers

Reposting this here from the transformers forum because I got no answer there:

Hi,

I trained a simple WhitespaceSplit/WordLevel tokenizer using the tokenizers library. I added padding by calling enable_padding(pad_token="<pad>") on the Tokenizer instance. Then I saved it to a JSON file and loaded it into transformers following the instructions here:

fast_tokenizer = PreTrainedTokenizerFast(tokenizer_file="tokenizer.json")
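
For reference, the training side looks roughly like this (the corpus path is a placeholder, and I pass pad_id explicitly here just so the padded IDs line up with the vocab):

from tokenizers import Tokenizer
from tokenizers.models import WordLevel
from tokenizers.pre_tokenizers import WhitespaceSplit
from tokenizers.trainers import WordLevelTrainer

tokenizer = Tokenizer(WordLevel(unk_token="<unk>"))
tokenizer.pre_tokenizer = WhitespaceSplit()
tokenizer.train(["my_corpus.txt"], WordLevelTrainer(special_tokens=["<unk>", "<pad>"]))

# Enable padding on the Tokenizer itself, then save it
tokenizer.enable_padding(pad_id=tokenizer.token_to_id("<pad>"), pad_token="<pad>")
tokenizer.save("tokenizer.json")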

When using the tokenizers.Tokenizer object directly, encode correctly adds the padding tokens. However, when I tokenize with padding using the PreTrainedTokenizerFast instance, I get this exception:

ValueError: Asking to pad but the tokenizer does not have a padding token. Please select a token to use as `pad_token` `(tokenizer.pad_token = tokenizer.eos_token e.g.)` or add a new pad token via `tokenizer.add_special_tokens({'pad_token': '[PAD]'})`.

Sure enough, if I follow the instructions and add the pad token as a special token, it works. But I want the tokenizer to work out of the box, exactly as the equivalent tokenizers.Tokenizer instance does, including its padding behavior.
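
That is, this makes padding work, but only because I re-declare by hand a pad token the tokenizer already knows about:

fast_tokenizer = PreTrainedTokenizerFast(tokenizer_file="tokenizer.json")
fast_tokenizer.add_special_tokens({"pad_token": "<pad>"})  # now padding=True works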

Why is this not the case? Why do I have to enable padding on the tokenizers.Tokenizer instance, and then configure the pad token again on the PreTrainedTokenizerFast instance? Am I doing something wrong or missing something?

To reproduce the problem, you can use the code below. Most of it is from the tokenizers Quicktour, so you'll need to download the data files as per the instructions there (or point the files list at your own data). The rest is from the official transformers docs on how to load a tokenizer from tokenizers into transformers:

from tokenizers import Tokenizer
from tokenizers.trainers import BpeTrainer
from tokenizers.models import BPE
from tokenizers.pre_tokenizers import Whitespace
from transformers import PreTrainedTokenizerFast

files = [f"data/wikitext-103-raw/wiki.{split}.raw" for split in ["test", "train", "valid"]]
sentences = ["Hello, y'all!", "How are you 😁 ?"]

tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
trainer = BpeTrainer(special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"])
tokenizer.pre_tokenizer = Whitespace()

tokenizer.train(files, trainer)

# Enable padding. pad_id=3 is the ID the trainer assigned to "[PAD]"
# (special tokens get IDs 0-4, in the order they are listed above).
tokenizer.enable_padding(pad_id=3, pad_token="[PAD]")

# Now use this tokenizer to tokenize a couple of sentences.
output = tokenizer.encode_batch(sentences)

# The output is padded, as it should be:
print(output[0].tokens)
# ['Hello', ',', 'y', "'", 'all', '!']
print(output[1].tokens)
# ['How', 'are', 'you', '[UNK]', '?', '[PAD]']
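
# As far as I can tell, the padding configuration is stored on the Tokenizer itself
# (and gets serialized under a "padding" key if you save it to JSON):
print(tokenizer.padding)
# e.g. {'pad_id': 3, 'pad_token': '[PAD]', 'direction': 'right', ...}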

# But now let's say we load the tokenizer into transformers. Let's try loading it directly from the tokenizer object:

fast_tokenizer = PreTrainedTokenizerFast(tokenizer_object=tokenizer)

# Tokenize two strings of different token lengths, with padding
fast_output = fast_tokenizer(sentences, padding=True)

This gives us the error:

Using pad_token, but it is not set yet.
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/apatil/anaconda3/envs/lm-training/lib/python3.10/site-packages/transformers/tokenization_utils_base.py", line 2548, in __call__
    encodings = self._call_one(text=text, text_pair=text_pair, **all_kwargs)
  File "/Users/apatil/anaconda3/envs/lm-training/lib/python3.10/site-packages/transformers/tokenization_utils_base.py", line 2634, in _call_one
    return self.batch_encode_plus(
  File "/Users/apatil/anaconda3/envs/lm-training/lib/python3.10/site-packages/transformers/tokenization_utils_base.py", line 2816, in batch_encode_plus
    padding_strategy, truncation_strategy, max_length, kwargs = self._get_padding_truncation_strategies(
  File "/Users/apatil/anaconda3/envs/lm-training/lib/python3.10/site-packages/transformers/tokenization_utils_base.py", line 2453, in _get_padding_truncation_strategies
    raise ValueError(
ValueError: Asking to pad but the tokenizer does not have a padding token. Please select a token to use as `pad_token` `(tokenizer.pad_token = tokenizer.eos_token e.g.)` or add a new pad token via `tokenizer.add_special_tokens({'pad_token': '[PAD]'})`.

We can resolve the issue by explicitly specifying the special tokens when initializing the PreTrainedTokenizerFast:

fast_tokenizer = PreTrainedTokenizerFast(tokenizer_object=tokenizer, pad_token="[PAD]", unk_token="[UNK]")

# Now padding works as expected
fast_output = fast_tokenizer(sentences, padding=True)

print(fast_output[0].tokens)
# ['Hello', ',', 'y', "'", 'all', '!']
print(fast_output[1].tokens)
# ['How', 'are', 'you', '[UNK]', '?', '[PAD]']

The code above uses the tokenizer_object parameter to load the fast tokenizer as a PreTrainedTokenizerFast instance, but as you can confirm for yourselves, the same behavior occurs if you first save the tokenizer to file, then load it into PreTrainedTokenizerFast using the tokenizer_file parameter instead.
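
That is, the following (continuing from the script above) fails with exactly the same ValueError unless the pad token is again passed explicitly:

tokenizer.save("tokenizer.json")
fast_tokenizer = PreTrainedTokenizerFast(tokenizer_file="tokenizer.json")
fast_output = fast_tokenizer(sentences, padding=True)  # same "Asking to pad ..." ValueError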

Bottom line: if the padding information is already in the tokenizer (or in the saved tokenizer config file), I don't understand why I should need to explicitly specify the padding token again when transferring the tokenizer. This introduces a lot of completely unnecessary friction into what should be a painless process. The tokenizer object/config should be self-contained: I should not have to re-hardcode the padding token in the code that loads it into transformers when that information is already encapsulated in the tokenizer object or its saved config file, any more than I should need to re-specify the vocabulary or the pre-tokenizer. That's the whole point of the tokenizer object/config file: to uniquely determine the behavior of the tokenizer.
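
For what it's worth, the wrapper really does end up with no pad token at all, which is what the "Using pad_token, but it is not set yet." line above is complaining about:

# Rebuild the wrapper without passing pad_token, as in the failing case above
plain_fast_tokenizer = PreTrainedTokenizerFast(tokenizer_object=tokenizer)
print(plain_fast_tokenizer.pad_token)  # None (and logs "Using pad_token, but it is not set yet.")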

Am I doing something wrong, or is this just how this works?