Cannot create an identical PreTrainedTokenizerFast object from a Tokenizer created by the tokenizers library

I can’t seem to create a “PreTrainedTokenizerFast” object from my original tokenizers Tokenizer object that has the same properties. This is the code for a byte-pair tokenizer I have been experimenting with. The resulting fast tokenizer does not have a [PAD] token, and does not have any special tokens at all.

    from tokenizers import ByteLevelBPETokenizer, normalizers, pre_tokenizers, processors
    from transformers import PreTrainedTokenizerFast

    tokenizer = ByteLevelBPETokenizer()
    # set the pre-tokenizer and normalizer on the underlying Tokenizer object
    tokenizer._tokenizer.pre_tokenizer = pre_tokenizers.BertPreTokenizer()
    tokenizer._tokenizer.normalizer = normalizers.BertNormalizer()
    tokenizer.train_from_iterator(docs, vocab_size=16_000, min_frequency=15, special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"])
    tokenizer._tokenizer.post_processor = processors.BertProcessing(
        ("[SEP]", tokenizer.token_to_id("[SEP]")),
        ("[CLS]", tokenizer.token_to_id("[CLS]")),
    )
    tokenizer.enable_truncation(max_length=256)
    tokenizer.enable_padding(pad_id=3, pad_token="[PAD]")
    fast_tokenizer = PreTrainedTokenizerFast(tokenizer_object=tokenizer)

The result of printing the fast_tokenizer is:

    PreTrainedTokenizerFast(name_or_path='', vocab_size=16000, model_max_len=1000000000000000019884624838656, is_fast=True, padding_side='right', special_tokens={})

Both model_max_len and special_tokens are wrong here. Also, there is no pad_token or pad_token_id on the fast_tokenizer object (for example, the warning for pad_token: “Using pad_token, but it is not set yet.”). Have I done anything wrong, or is this not supposed to happen?

The versions of libraries I’m using:

tokenizers 0.10.3
transformers 4.10.0.dev0

I have also tested with these versions:

tokenizers 0.10.3
transformers 4.9.2

When using PreTrainedTokenizerFast directly, and not one of its subclasses, you have to manually set all the attributes specific to Transformers: the model_max_length as well as all the special tokens. The reason is that the Tokenizer has no concept of an associated model (so it doesn’t know the model max length), and even though it has a concept of special tokens, it doesn’t know the differences between them, so you have to indicate which one is the pad token, which one the mask token, etc.
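
For example, here is a minimal sketch that wraps the tokenizer trained above and passes those attributes explicitly (the max length and the special token strings simply mirror the ones used during training):

    fast_tokenizer = PreTrainedTokenizerFast(
        tokenizer_object=tokenizer,
        model_max_length=256,   # the backend Tokenizer has no notion of a model max length
        unk_token="[UNK]",
        cls_token="[CLS]",
        sep_token="[SEP]",
        pad_token="[PAD]",      # this is what makes pad_token / pad_token_id available
        mask_token="[MASK]",
    )

With these keyword arguments set, fast_tokenizer.pad_token_id should resolve to the id of "[PAD]" in the trained vocabulary, and the “Using pad_token, but it is not set yet” warning should go away.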
