Can't load pre-trained tokenizer with additional new tokens

I first pretrained a masked language model after adding an additional list of words to the tokenizer. Then I saved the pretrained model and tokenizer.

from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
model = AutoModelForMaskedLM.from_pretrained('bert-base-uncased')

# Extend the vocabulary with the new words and grow the embedding matrix to match.
tokenizer.add_tokens(list_of_words)
model.resize_token_embeddings(len(tokenizer))

# ... Trainer setup omitted ...
trainer.train()

# Unwrap DataParallel/DistributedDataParallel if needed, then save model and tokenizer.
model_to_save = model.module if hasattr(model, 'module') else model
model_to_save.save_pretrained(data_args.output_file)
tokenizer.save_pretrained(data_args.output_file)

After that, I want to load the pre-trained tokenizer and model with:

from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained(model_args.model_name_or_path)
encoder = BertModel.from_pretrained(model_args.model_name_or_path, num_labels=num_classes)

# Re-add the words and resize the embeddings again, as in the first script.
tokenizer.add_tokens(list_of_words)
encoder.resize_token_embeddings(len(tokenizer))

However, the error shown below occurred, and it seems the pretrained tokenizer couldn't be loaded correctly.

AssertionError: Non-consecutive added token 'IncelTears' found. Should have index 30525 but has index 30526 in saved vocabulary.

Does anyone have an idea on this? Thanks a lot.

Which line threw the error? If it's tokenizer.add_tokens(list_of_words), it's because your tokenizer already has those words added from the first snippet, so you can't re-add them.
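If that's where it fails, a quick sanity check (just a sketch, assuming your first run's save_pretrained worked and model_args.model_name_or_path points at that output directory) would be:

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained(model_args.model_name_or_path)
print(len(tokenizer))               # base BERT vocab size plus the words you added
print(tokenizer.get_added_vocab())  # your extra words should already appear here

If the added words already show up there, calling add_tokens with the same list again shouldn't be needed.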


This line threw the error while loading the pre-trained tokenizer. I wonder if I have to set additional_special_tokens while loading?

tokenizer = BertTokenizer.from_pretrained(model_args.model_name_or_path, additional_special_tokens=list_of_words)
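Or should I skip re-adding the tokens entirely when reloading? Something like this is what I have in mind (just a guess, assuming the saved checkpoint already contains the resized embeddings):

tokenizer = BertTokenizer.from_pretrained(model_args.model_name_or_path)
encoder = BertModel.from_pretrained(model_args.model_name_or_path)

# If the saved vocab already includes the new words, no add_tokens / resize should be
# needed here, and the two sizes should already match.
assert len(tokenizer) == encoder.get_input_embeddings().num_embeddings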

Hi @snugger, I thought the tokenizer would just skip the words in list_of_words that it already contains.

Is it necessary to pre-process list_of_words before calling add_tokens so that it contains only new words?
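For example, something like this (just a guess at what that pre-processing would look like, using get_vocab() to drop words the tokenizer already knows):

# Hypothetical pre-filtering: keep only words that are not already in the vocabulary.
vocab = tokenizer.get_vocab()
truly_new_words = [w for w in list_of_words if w not in vocab]
num_added = tokenizer.add_tokens(truly_new_words)  # returns the number actually added
encoder.resize_token_embeddings(len(tokenizer))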