I can’t seem to create a “PreTrainedTokenizerFast” object from my original tokenizers
tokenizer object that has the same properties. This is the code for a byte-pair tokenizer I have been experimenting with. The resulting fast tokenizer has no [PAD] token, and no special tokens at all.
from tokenizers import ByteLevelBPETokenizer, normalizers, pre_tokenizers, processors
from transformers import PreTrainedTokenizerFast

tokenizer = ByteLevelBPETokenizer()
# Set the pre-tokenizer and normalizer on the wrapped Tokenizer object
tokenizer._tokenizer.pre_tokenizer = pre_tokenizers.BertPreTokenizer()
tokenizer._tokenizer.normalizer = normalizers.BertNormalizer()
tokenizer.train_from_iterator(
    docs,
    vocab_size=16_000,
    min_frequency=15,
    special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"],
)
tokenizer._tokenizer.post_processor = processors.BertProcessing(
("[SEP]", tokenizer.token_to_id("[SEP]")),
("[CLS]", tokenizer.token_to_id("[CLS]")),
)
tokenizer.enable_truncation(max_length=256)
tokenizer.enable_padding(pad_id=3, pad_token="[PAD]")
fast_tokenizer = PreTrainedTokenizerFast(tokenizer_object=tokenizer)
The result of printing the fast_tokenizer is:
PreTrainedTokenizerFast(name_or_path='', vocab_size=16000, model_max_len=1000000000000000019884624838656, is_fast=True, padding_side='right', special_tokens={})
Both model_max_len and special_tokens are wrong here. There is also no pad_token or pad_token_id on the fast_tokenizer object (for example, the warning for pad_token: "Using pad_token, but it is not set yet.").
Have I done anything wrong, or is this not supposed to happen?
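The only workaround I can think of is to pass the special tokens (and the max length) to the wrapper again by hand, something like the sketch below (just my guess, not something I found in the docs), but I would have expected tokenizer_object=tokenizer to carry them over:

# Guessed workaround: re-declare the special tokens when wrapping
fast_tokenizer = PreTrainedTokenizerFast(
    tokenizer_object=tokenizer,
    model_max_length=256,
    unk_token="[UNK]",
    cls_token="[CLS]",
    sep_token="[SEP]",
    pad_token="[PAD]",
    mask_token="[MASK]",
)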
The versions of libraries I’m using:
'tokenizers 0.10.3',
'transformers 4.10.0.dev0'
I have also tested with these versions:
'tokenizers 0.10.3',
'transformers 4.9.2'