I can’t seem to create a “PreTrainedTokenizerFast” object from my original tokenizers
tokenizer object that has the same properties. This is the code for a byte-pair tokenizer I have been experimenting with. The resulting fast tokenizer has no [PAD] token, and no special tokens at all.
from tokenizers import ByteLevelBPETokenizer, normalizers, pre_tokenizers, processors
from transformers import PreTrainedTokenizerFast

tokenizer = ByteLevelBPETokenizer()
# Set the pre-tokenizer and normalizer on the wrapped Tokenizer object
tokenizer._tokenizer.pre_tokenizer = pre_tokenizers.BertPreTokenizer()
tokenizer._tokenizer.normalizer = normalizers.BertNormalizer()
tokenizer.train_from_iterator(
    docs,
    vocab_size=16_000,
    min_frequency=15,
    special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"],
)
tokenizer._tokenizer.post_processor = processors.BertProcessing(
("[SEP]", tokenizer.token_to_id("[SEP]")),
("[CLS]", tokenizer.token_to_id("[CLS]")),
)
tokenizer.enable_truncation(max_length=256)
tokenizer.enable_padding(pad_id=3, pad_token="[PAD]")
fast_tokenizer = PreTrainedTokenizerFast(tokenizer_object=tokenizer)
The result of printing the fast_tokenizer is:
PreTrainedTokenizerFast(name_or_path='', vocab_size=16000, model_max_len=1000000000000000019884624838656, is_fast=True, padding_side='right', special_tokens={})
Both model_max_len and special_tokens are wrong here. There is also no pad_token or pad_token_id on the fast_tokenizer object (for example, the warning for pad_token: "Using pad_token, but it is not set yet.").
Have I done anything wrong, or is this not supposed to happen?
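The only workaround I can think of is to pass the special tokens (and the max length) to the wrapper again by hand, something like the sketch below (just my guess, not something I found in the docs), but I would have expected tokenizer_object=tokenizer to carry them over:

# Guessed workaround: re-declare the special tokens when wrapping
fast_tokenizer = PreTrainedTokenizerFast(
    tokenizer_object=tokenizer,
    model_max_length=256,
    unk_token="[UNK]",
    cls_token="[CLS]",
    sep_token="[SEP]",
    pad_token="[PAD]",
    mask_token="[MASK]",
)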
The versions of libraries I’m using:
'tokenizers 0.10.3',
'transformers 4.10.0.dev0'
I have also tested with these versions:
'tokenizers 0.10.3',
'transformers 4.9.2'