I first pretrained a masked language model after adding an additional list of words to the tokenizer. Then I saved the pretrained model and tokenizer.
from transformers import AutoModelForMaskedLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
model = AutoModelForMaskedLM.from_pretrained('bert-base-uncased')

# Extend the vocabulary with the new words and resize the embedding matrix to match
tokenizer.add_tokens(list_of_words)
model.resize_token_embeddings(len(tokenizer))

trainer.train()

# Unwrap DataParallel/DistributedDataParallel if needed, then save both artifacts
model_to_save = model.module if hasattr(model, 'module') else model
model_to_save.save_pretrained(data_args.output_file)
tokenizer.save_pretrained(data_args.output_file)
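For reference, this is how I checked what was written to disk (a quick sketch assuming the output directory above; the added tokens are stored in added_tokens.json next to vocab.txt):

import json, os

# Inspect the saved added-token mapping (token -> assigned index)
with open(os.path.join(data_args.output_file, 'added_tokens.json')) as f:
    added_tokens = json.load(f)
print(sorted(added_tokens.items(), key=lambda kv: kv[1]))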
After that, I want to load the pretrained tokenizer and model with:
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained(model_args.model_name_or_path)
encoder = BertModel.from_pretrained(model_args.model_name_or_path, num_labels=num_classes)
# Re-add the words and resize, so the embeddings stay aligned with the vocabulary
tokenizer.add_tokens(list_of_words)
encoder.resize_token_embeddings(len(tokenizer))
However, the following error occurred, and it seems that the pretrained tokenizer couldn't be loaded correctly:
AssertionError: Non-consecutive added token 'IncelTears' found. Should have index 30525 but has index 30526 in saved vocabulary.
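In case it helps with diagnosis, this is my reading of the check that fails (a sketch of the invariant I believe is enforced, not the exact transformers source): each entry in the saved added_tokens.json should occupy a consecutive index right after the base vocabulary.

import json, os

# Sketch of the expected invariant (my assumption): added tokens must sit at
# consecutive indices starting at the base vocabulary size.
path = model_args.model_name_or_path
with open(os.path.join(path, 'vocab.txt')) as f:
    base_size = sum(1 for _ in f)  # 30522 for bert-base-uncased
with open(os.path.join(path, 'added_tokens.json')) as f:
    added_tokens = json.load(f)
for i, (token, index) in enumerate(sorted(added_tokens.items(), key=lambda kv: kv[1])):
    if index != base_size + i:
        print(f"{token!r} has index {index}, expected {base_size + i}")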
Does anyone have any ideas about this? Thanks a lot.