Can't load pre-trained tokenizer with additional new tokens

rlian · August 4, 2021, 9:43pm

I first pretrained a masked language model after adding an additional list of words to the tokenizer. Then I saved the pretrained model and tokenizer.

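from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
model = AutoModelForMaskedLM.from_pretrained('bert-base-uncased')

# Add the new words, then resize the embedding matrix to the new vocabulary size
tokenizer.add_tokens(list_of_words)
model.resize_token_embeddings(len(tokenizer))

# (trainer setup omitted)
trainer.train()

# Unwrap the model if it is wrapped in a .module attribute before saving
model_to_save = model.module if hasattr(model, 'module') else model
model_to_save.save_pretrained(data_args.output_file)
tokenizer.save_pretrained(data_args.output_file)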

After that, I want to load the pre-trained tokenizer and model with:

tokenizer = BertTokenizer.from_pretrained(model_args.model_name_or_path)
encoder = BertModel.from_pretrained(model_args.model_name_or_path, num_labels=num_classes)
tokenizer.add_tokens(list_of_words)
encoder.resize_token_embeddings(len(tokenizer))

However, an error occurred as shown below and it seems that the pretrained tokenizer couldn’t be loaded correctly.

AssertionError: Non-consecutive added token 'IncelTears' found. Should have index 30525 but has index 30526 in saved vocabulary.

Does anyone have an idea on this? Thanks a lot.
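
One way to see where the index gap comes from is to inspect the saved files directly. The sketch below is not part of the transformers API. It is a stdlib-only helper (the name find_index_gaps is hypothetical) that assumes the output directory contains a vocab.txt with one base token per line and an added_tokens.json mapping each added token to its index, which is what a BERT tokenizer typically writes on save_pretrained. It only loosely mirrors the consecutive-index check behind the AssertionError.

```python
import json
from pathlib import Path


def find_index_gaps(save_dir):
    """Return (token, expected_index, saved_index) for each added token whose
    saved index breaks the consecutive run starting at the base vocab size."""
    save_dir = Path(save_dir)
    # Assumption: vocab.txt holds one base-vocabulary token per line.
    with open(save_dir / "vocab.txt", encoding="utf-8") as f:
        base_size = sum(1 for _ in f)
    # Assumption: added_tokens.json maps each added token to its index.
    added = json.loads((save_dir / "added_tokens.json").read_text(encoding="utf-8"))
    gaps = []
    ordered = sorted(added.items(), key=lambda item: item[1])
    for offset, (token, index) in enumerate(ordered):
        expected = base_size + offset
        if index != expected:
            gaps.append((token, expected, index))
    return gaps
```

Calling find_index_gaps(data_args.output_file) on the saved directory lists every added token whose index is off. Tokens after the first gap are reported too, so the first entry is usually the one worth looking at.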

sgugger · August 5, 2021, 6:48am

Which line threw the error? If it’s tokenizer.add_tokens(list_of_words), it’s because your tokenizer already has those words added from the first code sample, so you can’t re-add them.

rlian · August 5, 2021, 2:36pm

The error was thrown by the line that loads the pre-trained tokenizer. I wonder if I have to set additional_special_tokens while loading?

tokenizer = BertTokenizer.from_pretrained(model_args.model_name_or_path, additional_special_tokens=list_of_words)

kaankork · August 10, 2021, 8:19am

Hi @sgugger, I thought the original tokenizer just skips the words in list_of_words if the tokenizer already has these words in it.

Is it necessary to pre-process this list_of_words before using the add_tokens method so that list_of_words consists of only new words?
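
If pre-filtering does turn out to be wanted, a minimal sketch could look like the following. The helper name filter_new_tokens is hypothetical, and it assumes the vocabulary is available as a token-to-index dict, such as the one returned by tokenizer.get_vocab(). The comparison is on exact strings, so it does not account for any lowercasing an uncased tokenizer may apply.

```python
def filter_new_tokens(words, vocab):
    """Return the words that are not already in vocab, de-duplicated,
    in their original order."""
    seen = set()
    new_words = []
    for word in words:
        # Skip words the vocabulary already contains and repeats within the list.
        if word in vocab or word in seen:
            continue
        seen.add(word)
        new_words.append(word)
    return new_words


# Possible usage with a loaded tokenizer (not executed here):
# new_words = filter_new_tokens(list_of_words, tokenizer.get_vocab())
# num_added = tokenizer.add_tokens(new_words)  # add_tokens returns the number of tokens added
```

Comparing num_added with len(new_words) is a quick way to check whether the tokenizer skipped anything on its own.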
