Hi! I have an annoying issue that I can't find answers to anywhere online…
Hope someone can help.
I am adding a few special tokens to a GPT-2 tokenizer using the following code:
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

special_tokens = {
    "eos_token": "<|endoftext|>",
    "bos_token": "<|startoftext|>",
    "additional_special_tokens": ["<|speaker1|>", "<|speaker2|>"]
}
tokenizer.pad_token = tokenizer.eos_token
tokenizer.add_special_tokens(special_tokens)

# Resize the embedding matrix to match the enlarged vocabulary
vocab = tokenizer.get_vocab()
model.resize_token_embeddings(len(vocab))
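In case it helps with diagnosing this, here's a quick sanity check I can run right after the snippet above (it just prints the IDs the tokenizer assigned to the new tokens):

# Print the IDs assigned to the newly added tokens
for tok in ["<|startoftext|>", "<|speaker1|>", "<|speaker2|>"]:
    print(tok, "->", tokenizer.convert_tokens_to_ids(tok))
print("tokenizer length:", len(tokenizer))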
I later save the tokenizer using:
model_save_name = "SARC_gpt2_prefinetune_2.0"
tokenizer.save_pretrained(f"/content/drive/MyDrive/Colab Notebooks/saved_models/{model_save_name}")
But when I load the tokenizer in a different script using:
tokenizer_1 = AutoTokenizer.from_pretrained("/content/drive/MyDrive/Colab Notebooks/saved_models/SARC_gpt2_prefinetune_2.0")
I get this error:
AssertionError: Non-consecutive added token '<|startoftext|>' found. Should have index 50260 but has index 50257 in saved vocabulary.
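In case it's relevant, this is how I inspect the token-to-index mapping that was written to disk (assuming the indices in the error come from the added_tokens.json that save_pretrained writes into the save folder):

import json

save_dir = "/content/drive/MyDrive/Colab Notebooks/saved_models/SARC_gpt2_prefinetune_2.0"
# Look at the token -> index mapping that save_pretrained wrote out
with open(f"{save_dir}/added_tokens.json") as f:
    print(json.load(f))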
Does anyone know what I'm doing wrong here?
Thanks in advance